Handbook of Molecular life sciences. An encyclopedic reference 9781461415312, 1461415314, 9781461415299, 9781461415305

The Encyclopedia examines biological phenomena at the molecular level and their interactions that govern life processes.


English Pages 1251 Year 2018


Table of contents :
Content: Intro
Section Editors
Contributors
A
Abasic Site Formation
Adducts on Tm, Effects of
Synonyms
Definition
Discussion
Cross-References
References
Allostery and Quaternary Structure
Synopsis
Introduction
Hemoglobin as an Exemplar of Allosteric Principles
Cross-References
References
Alternative Splicing
Amide Bond
Amino Acid Monomer
Apurinic Site Formation
Archaeplastida
Archaeplastidians
Artificial Chromosomes
Definition
Discussion
Cross-References
References
Artificial Endonucleases for Genome Editing
Definition
Discussion
Cross-References
References
Cross-References
References
Bacteriophage and Viral Cloning Vectors
Synopsis
Introduction
Main Text
Cloning Vectors Derived from Bacteriophages
Filamentous Phage (M13 and f1) and Phagemids
Bacteriophage Lambda
Vectors Based on Bacteriophage Lambda: Cosmids and Lambda ZAP
Bacteriophage P1
Vectors Derived from Eukaryotic Viruses
Vectors Based on Retroviruses and Lentiviruses
Adenovirus and Adeno-associated Viruses
Baculovirus Vectors
Cross-References
References
Base Intercalation in DNA
Definition
Discussion
Cross-References
References
Base Substitution Mutation
Bioactivation of Carcinogens
Synonyms
Definition
Discussion
Cross-References
References
Bioinorganic Chemistry
Synonyms
Synopsis
Introduction
Transport and Bonding in Bioinorganic Systems
Metals in Biological Systems Acting as Charge Carriers, Reaction Triggers, and Structure Facilitators
Ion Channels
Calcium Pumps
Potassium Ion Channels
Magnesium in Ribozymes
Zinc Fingers
Enzymes Transporting Dioxygen
Hemoglobin and Myoglobin
Hemocyanin
Metals in Biological Systems Facilitating Electron Transfer
Iron-Sulfur Clusters
Cytochromes
Cytochrome P450
Cytochrome b(6)f and the Photosynthetic Pathway
Cytochrome c Oxidase
Metals in Biological Systems Facilitating Enzyme Catalysis
Nitrogenase
Vitamin B12 (Cobalamin)
Ascorbate Oxidase
Hydrogenases
Superoxide Dismutase
Metal Toxicity
Metals in Medicine
Platinum-Containing Antitumor Drugs
Antirheumatic and Anticancer Gold Compounds
Bioorganometallic Drugs
Radiopharmaceuticals
Conclusions
Explanation of Terms
Cross-References
References
Biological Inorganic Chemistry
Blue/White Selection
Definition
Discussion
References
C
CAG Repeat Pathologies


Robert D. Wells · Judith S. Bond · Judith Klinman · Bettie Sue Siler Masters, Editors-in-Chief
Ellis Bell, Scientific Managing Editor
Laurie S. Kaguni, Volume Editor

Molecular Life Sciences

An Encyclopedic Reference


With 428 Figures and 50 Tables

Editors-in-Chief
Robert D. Wells, Texas A&M University, College Station, TX, USA
Judith S. Bond, Penn State College of Medicine, Hershey, PA, USA
Judith Klinman, University of California, Berkeley, CA, USA
Bettie Sue Siler Masters, University of Texas Health Science Center, San Antonio, TX, USA

Scientific Managing Editor
Ellis Bell, University of Richmond, Richmond, VA, USA

Volume Editor
Laurie S. Kaguni, Michigan State University, East Lansing, MI, USA

ISBN 978-1-4614-1529-9
ISBN 978-1-4614-1531-2 (eBook)
ISBN 978-1-4614-1530-5 (print and electronic bundle)
https://doi.org/10.1007/978-1-4614-1531-2
Library of Congress Control Number: 2017949888

© Springer Science+Business Media, LLC 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature. The registered company is Springer Science+Business Media, LLC. The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.

Section Editors

Lawrence I. Grossman, Wayne State University, Detroit, MI, USA
Douglas A. Julin, University of Maryland, College Park, MD, USA
Jon M. Kaguni, Michigan State University, East Lansing, MI, USA
I. Robert Lehman, Stanford University, Stanford, CA, USA


Contributors

Chaitanya Aggarwal Center for Pharmaceutical Biotechnology, Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, Chicago, IL, USA Katie J. Aldred Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University, School of Medicine, Nashville, TN, USA Patrizia Aracri Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy Robert Augustin Department of Cardiometabolic Diseases Research, Boehringer-Ingelheim Pharma GmbH & Co KG, Biberach an der Riss, Germany Andrea Becchetti Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy Mikkel Bentzon-Tilia Department of Systemsbiology, The Technical University of Denmark, Lyngby, Denmark Daniel Bogenhagen Department of Pharmacological Sciences, Stony Brook University, Stony Brook, New York, NY, USA Suzanne Bohlson Department of Microbiology and Immunology, Des Moines University, Des Moines, IA, USA Edward Bolt School of Life Sciences, University of Nottingham, Nottingham, UK Linda Bonen Department of Biology, University of Ottawa, Ottawa, Canada Gabriel S. Brandt Franklin & Marshall College, Lancaster, PA, USA William Broach Department of Biological Sciences, Ohio University, Athens, OH, USA Celeste J. Brown Department of Biological Sciences, Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, ID, USA


C. Robin Buell Department of Plant Biology, Michigan State University, East Lansing, MI, USA John Wesley Cain Department of Mathematics and Computer Science, University of Richmond, Richmond, VA, USA Charles Chalfant Department of Biochemistry and Molecular Biology, Virginia Commonwealth University, Richmond, VA, USA Research Career Scientist, Research and Development, Hunter Holmes McGuire VAMC, Richmond, VA, USA Kevin L. Childs Department of Plant Biology, Michigan State University, East Lansing, MI, USA Gilbert Chu Departments of Medicine and Biochemistry, Division of Oncology, CCSR 1145, Stanford University School of Medicine, Stanford, CA, USA Scott Cooper Department of Biology, University Wisconsin – La Crosse, La Crosse, WI, USA Gaofeng Cui Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA Fernando de la Cruz Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-CSIC-IDICAN, Santander, Cantabria, Spain Carolyn Dehner Massachusetts College of Liberal Arts, North Adams, MA, USA Gloria del Solar Centro de Investigaciones Biológicas, CSIC, Madrid, Spain Nora Engel Fels Institute for Cancer Research, Temple University School of Medicine, Philadelphia, PA, USA Manuel Espinosa Centro de Investigaciones Biológicas, CSIC, Madrid, Spain Hanna Fabczak Department of Cell Biology, Nencki Institute of Experimental Biology, Polish Academy of Sciences, Warsaw, Poland Stanisław Fabczak Department of Cell Biology, Nencki Institute of Experimental Biology, Polish Academy of Sciences, Warsaw, Poland Michael J. Federle Center for Pharmaceutical Biotechnology, Department of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, University of Illinois at Chicago, Chicago, IL, USA Cris Fernández-López Centro de Investigaciones Biológicas, CSIC, Madrid, Spain Deborah B. Foreman Warsaw, IN, USA


Sarah Friday Department of Biology, University of Richmond, Richmond, VA, USA Laura S. Frost Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada John Gallon Department of Biology, University of Richmond, Richmond, VA, USA Manuel Galvan Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA Department of Microbiology and Immunology, School of Medicine, Indiana University, South Bend, IN, USA Shabir Ahmad Ganai Centre for Nanotechnology and Advanced Biomaterials (CeNTAB), School of Chemical and Biotechnology, SASTRA University, Thanjavur, Tamilnadu, India Padmashree Institute of Management and Sciences, Bangalore, Karnataka, India M. Pilar Garcillán-Barcia Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-CSIC-IDICAN, Santander, Cantabria, Spain Michael W. Gray Department of Biochemistry and Molecular Biology, Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, NS, Canada Mallary Greenlee-Wacker Inflammation Program and Department of Internal Medicine, Roy J. and Lucille A. Carver College of Medicine, University of Iowa, Iowa City, IA, USA Lawrence I. Grossman Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, MI, USA Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA Samir M. Hamdan Division of Biological and Environmental Sciences and Engineering, Laboratory of DNA Replication and Recombination, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia Lars Hestbjerg Hansen Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark Department of Environmental Science, Aarhus University, Roskilde, Denmark April Hill Department of Biology, University of Richmond, Richmond, VA, USA Angela K. Hilliker Department of Biology, University of Richmond, Richmond, VA, USA


Candice N. Hirsch Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA Jon Hobman School of Biosciences, University of Nottingham, Sutton Bonington, UK Qi Hu Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA Jimeng Hua Department of Biology, Dalhousie University, Halifax, NS, Canada Maki Inada Biology Department, Ithaca College, Ithaca, NY, USA Jonathan P. Jacobs Division of Digestive Diseases, Department of Medicine, University of California, Los Angeles, Los Angeles, CA, USA Ning Jiang Department of Horticulture, Michigan State University, East Lansing, MI, USA Timothy J. Johnson Department of Veterinary and Biomedical Sciences, University of Minnesota, Saint Paul, MN, USA Stephanie Jones Medical College of Wisconsin, Milwaukee, WI, USA Douglas A. Julin Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA Jon M. Kaguni Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA Adam C. Ketron Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University School of Medicine, Nashville, TN, USA Scheherazade Khan Department of Biology, University of Richmond, Richmond, VA, USA Hannah Klein Department of Biochemistry, New York University School of Medicine, New York, NY, USA Andrew B. Kouse Department of Biological Sciences, Ohio University, Athens, OH, USA Jan-Ulrich Kreft School of Biosciences and Institute of Microbiology and Infection and Centre for Systems Biology, University of Birmingham, Edgbaston, Birmingham, UK Christopher E. Lane Department of Biological Sciences, University of Rhode Island, Kingston, RI, USA B. Franz Lang Centre Robert Cedergren, Département de Biochimie, Université de Montréal, Montréal, QC, Canada


Dennis V. Lavrov Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA Robert W. Lee Department of Biology, Dalhousie University, Halifax, NS, Canada Sang Eun Lee Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, TX, USA I. Robert Lehman Department of Biochemistry, Beckman Center, Stanford University School of Medicine, Stanford, CA, USA Guo-Min Li Graduate Center for Toxicology and Markey Cancer Center, University of Kentucky College of Medicine, Lexington, KY, USA Lei Li Department of Experimental Radiation Oncology, Department of Molecular Genetics, M. D. Anderson Cancer Center, Houston, TX, USA Lili Li Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark R. Hunter Lindsey Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University, School of Medicine, Nashville, TN, USA Ovidiu Lipan Department of Physics, University of Richmond, Richmond, VA, USA Bertrand Llorente Aix-Marseille Universite, Marseille, France Fabián Lorenzo-Díaz Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria and Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias, Centro de Investigaciones Biomédicas de Canarias, Universidad de La Laguna, Santa Cruz de Tenerife, Spain Martin Lowe Department of Systems Neuroscience, Tokyo Medical and Dental University, Tokyo, Japan Julius Lukeš Biology Centre, Institute of Parasitology, Czech Academy of Sciences, and Faculty of Science, University of South Bohemia, České Budĕjovice, Czech Republic Wenting Luo Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark Federica Marini Dipartimento di Bioscienze, Università di Milano, Milan, Italy James Marion Department of Chemistry and Biochemistry, University of California, San Diego, CA, USA Bonnie Marvin Biology Department, Ithaca College, Ithaca, NY, USA


Eric Mayoux Department of Cardiometabolic Diseases Research, Boehringer-Ingelheim Pharma GmbH & Co KG, Biberach an der Riss, Germany Alexander V. Mazin Department of Biochemistry and Molecular Biology, Drexel University College of Medicine, Philadelphia, PA, USA Olga M. Mazina Department of Biochemistry and Molecular Biology, Drexel University College of Medicine, Philadelphia, PA, USA Charles S. McHenry Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO, USA Rachel McMullan Department of Biology, University of Richmond, Richmond, VA, USA Chanaka Mendis Department of Chemistry, University of Wisconsin Platteville, Platteville, WI, USA Max Mergeay Unit of Microbiology, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium Dennis Miller Department of Molecular and Cell Biology, The University of Texas at Dallas, Richardson, TX, USA Erin Murphy Department of Biomedical Sciences, Life Sciences Building, Ohio University Heritage College of Osteopathic Medicine, Athens, OH, USA Jeffrey K. Myers Department of Chemistry, Davidson College, Davidson, NC, USA Walter R. P. Novak Department of Chemistry, Wabash College, Crawfordsville, IN, USA Megan A. O’Brien Department of Biological Sciences, University of Rhode Island, Kingston, RI, USA Muse Oke Division of Biological and Environmental Sciences and Engineering, Laboratory of DNA Replication and Recombination, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia Neil Osheroff Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University School of Medicine, Nashville, TN, USA Mary Ann Osley Molecular Genetics and Microbiology, University of New Mexico School of Medicine, Albuquerque, NM, USA Shallee T. Page Division of Environmental and Biological Sciences, University of Maine at Machias, Machias, ME, USA Mark Paget School of Life Sciences, University of Sussex, Falmer, Brighton, UK Margaret A. Park Department of Biochemistry and Molecular Biology, Virginia Commonwealth University, Richmond, VA, USA


Adam R. Parks Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA David A. D. Parry Institute of Fundamental Sciences and Riddet Institute, Massey University, Palmerston North, New Zealand Achille Pellicioli Dipartimento di Bioscienze, Università di Milano, Milan, Italy Dimitri Perrin Laboratory for Systems Biology, RIKEN Center for Developmental Biology, Kobe, Japan Joseph E. Peters Department of Microbiology, Cornell University, Ithaca, NY, USA Galina Petukhova Department of Biochemistry and Molecular Biology, Uniformed Services University of the Health Sciences, Bethesda, MD, USA Marina Ramirez-Alvarado Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA Wasantha K. Ranatunga Department of Physiology and Biomedical Engineering and Nephrology and Hypertension, Mayo Clinic College of Medicine, Rochester, MN, USA Ajna Rivera Biological Sciences, University of the Pacific, College of the Pacific, Stockton, CA, USA Rosette Roat-Malone Chemistry Department, Washington College, Chestertown, MD, USA Michael F. Romero Department of Physiology and Biomedical Engineering and Nephrology and Hypertension, Mayo Clinic College of Medicine, Rochester, MN, USA José Angel Ruiz-Masó Centro de Investigaciones Biológicas, CSIC, Madrid, Spain Laura Runyen-Janecky Department of Biology, University of Richmond, Richmond, VA, USA Heather J. Ruskin Centre for Scientific Computing and Complex Systems Modelling, Dublin City University, Dublin, Ireland Andrea Sajuthi Biological Sciences, University of the Pacific, College of the Pacific, Stockton, CA, USA Anton Sanderfoot Department of Biology, University Wisconsin – La Crosse, La Crosse, WI, USA Sergio Santa Maria Space Biosciences Division, NASA Ames Research Center, Mountain View, CA, USA Prabha Sarangi Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA


Keisuke Sato Faculty of Life Sciences, University of Manchester, Manchester, UK Malli K. Shashwath Centre for Nanotechnology and Advanced Biomaterials (CeNTAB), School of Chemical and Biotechnology, SASTRA University, Thanjavur, Tamilnadu, India Shin-Han Shiu Department of Plant Biology, Michigan State University, East Lansing, MI, USA Claudio Slamovits Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada Søren Johannes Sørensen Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark Panos Soultanas School of Chemistry, Centre for Biomolecular Sciences, University of Nottingham, Nottingham, UK Dov J. Stekel School of Biosciences, University of Nottingham, Loughborough, UK Jon R. Stoltzfus Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA Dan Su State Key Laboratory of Biotherapy, West China Hospital, Huaxi Campus, Sichuan University, Chengdu, Sichuan, People’s Republic of China Haruo Suzuki Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA Nicolas Tanguy Le Gac Institut de Pharmacologie et de Biologie Structurale, CNRS-Université Paul Sabatier Toulouse III, Toulouse, France Christopher M. Thomas Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, UK Eva M. Top Department of Biological Sciences, Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, ID, USA Rob Van Houdt Unit of Microbiology, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium Thomas L. Vandergon Biology Department, Pepperdine University, Malibu, CA, USA Mahadevan Vijayalakshmi Centre for Nanotechnology and Advanced Biomaterials (CeNTAB), School of Chemical and Biotechnology, SASTRA University, Thanjavur, Tamilnadu, India Giuseppe Villani Institut de Pharmacologie et de Biologie Structurale, CNRS-Université Paul Sabatier Toulouse III, Toulouse, France Thomas J. Wiese Department of Chemistry, Fort Hays State University, Hays, KS, USA


Nathan Winter Department of Chemistry and Biochemistry, St. Cloud State University, St. Cloud, MN, USA Manal S. Zaher Division of Biological and Environmental Sciences and Engineering, Laboratory of DNA Replication and Recombination, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia Xiaolan Zhao Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA

A

Abasic Site Formation

▶ Depurination

Adducts on Tm, Effects of

Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms

Changes in DNA melting curves

Definition

Thermal melting of DNA involves the separation of double-stranded DNA into single strands and is characterized by the “melting temperature” (Tm), the temperature at which one-half of the DNA is melted. Modification of DNA by chemical damage often changes the thermodynamics of the process, and the alteration is a part of the characterization of the damage.

Discussion

The presence of DNA adducts often disrupts normal pairing of DNA bases (see ▶ “DNA Base Pairing, Modes of”), and the melting temperature (Tm) of a double-stranded oligonucleotide is usually decreased. The procedure involves use of a double-stranded oligonucleotide, usually 12–18 bases in length. If the oligonucleotide is too short, it will not anneal, even at low temperature. If the oligonucleotide is too long, then the difference imposed by the adduct may not be prominent enough to observe. The presence of high salt increases the Tm, so experiments are often done in 0.5 M NaCl.

The usual procedure is to begin the experiment with the sample at low temperature (15–23 °C) and increase the temperature slowly (~1 °C/min). The most common procedure is to monitor ultraviolet (UV) absorbance at 260 nm (A260) (Fig. 1). The A260 increases slowly at first, more rapidly near the Tm, and then only slowly again. The Tm is the temperature at which the change in A260 is half maximal. Practically this is determined by converting the sigmoidal plot of A260 vs. T to the first derivative (using the spectrophotometer software). The maximum of the first derivative is the Tm. If the adduct is stable, the melting gradient can be reversed, i.e., the same system slowly cooled to obtain a similar profile. In principle, other spectroscopy can be used instead of UV, e.g., circular dichroism or NMR spectroscopy, although UV is more convenient.

Adducts on Tm, Effects of, Fig. 1 Idealized melting curve for a double-stranded oligonucleotide (lower panel). An idealized first-derivative curve is shown in the upper panel

Tm is only one parameter. If a more extensive thermodynamic analysis is desired, the measurements should be done at multiple oligonucleotide concentrations so that the data can be used for van’t Hoff plots (Fig. 2), yielding ΔG, ΔH, and ΔS. In order to obtain useful signals (A260) over a range of concentrations, cuvettes of varying path length can be used. An alternative is to use calorimetry, which generates all of the information in a single run.

Adducts on Tm, Effects of, Fig. 2 Van’t Hoff plot for oligonucleotide Tm data and relevant equations (Persmark and Guengerich 1994)

Not all DNA adducts lower the Tm; some raise it. This is the case with some intercalated adducts, e.g., aflatoxin B1 (Mao et al. 1998).
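The two analysis steps described in the Discussion can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular spectrophotometer software. The first function takes the Tm as the maximum of a finite-difference first derivative of A260 vs. T. The second fits a van’t Hoff plot, assuming the standard relation for a non-self-complementary duplex with equal strand concentrations, 1/Tm = (R/ΔH) ln(Ct/4) + ΔS/ΔH, where Ct is the total strand concentration. The equations of Fig. 2 are not reproduced in this text, so that form is an assumption, as are all function names and the synthetic melting curve.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def tm_from_first_derivative(temps_c, a260):
    """Return the temperature at which dA260/dT (central difference) is largest."""
    best_temp, best_slope = None, float("-inf")
    for i in range(1, len(temps_c) - 1):
        slope = (a260[i + 1] - a260[i - 1]) / (temps_c[i + 1] - temps_c[i - 1])
        if slope > best_slope:
            best_temp, best_slope = temps_c[i], slope
    return best_temp


def vant_hoff_fit(tm_kelvin, ct_molar):
    """Fit 1/Tm = (R/dH) * ln(Ct/4) + dS/dH by least squares; return (dH, dS).

    Assumed form: non-self-complementary duplex, equal strand concentrations.
    """
    x = [math.log(ct / 4.0) for ct in ct_molar]
    y = [1.0 / tm for tm in tm_kelvin]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    d_h = R / slope        # enthalpy of duplex formation, J/mol
    d_s = intercept * d_h  # entropy of duplex formation, J/(mol*K)
    return d_h, d_s


def delta_g(d_h, d_s, temp_kelvin=310.15):
    """Gibbs free energy at the given temperature: dG = dH - T*dS."""
    return d_h - temp_kelvin * d_s


# Synthetic, idealized melting curve (not measured data): a logistic centred at 45 C.
temps = [15.0 + 0.5 * i for i in range(121)]
curve = [0.50 + 0.10 / (1.0 + math.exp(-(t - 45.0) / 2.0)) for t in temps]
print("Tm from first derivative (C):", tm_from_first_derivative(temps, curve))
```

With real data, the Tm values measured at each strand concentration would be passed to `vant_hoff_fit` in kelvin. Comparing the fitted ΔG for the adducted and unmodified duplexes then gives a fuller picture than the shift in Tm alone.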


Cross-References

▶ Base Intercalation in DNA
▶ DNA Base Pairing, Modes of
▶ DNA Damage, Types of
▶ Spectroscopy of Damaged DNA

References

Mao H, Deng Z, Wang F et al (1998) An intercalated and thermally stable FAPY adduct of aflatoxin B1 in a DNA duplex: structural refinement from 1H NMR. Biochemistry 37:4374–4387
Persmark M, Guengerich FP (1994) Spectroscopic and thermodynamic characterization of the interaction of N7-guanyl thioether derivatives of d(TGCTG*CAAG) with potential complements. Biochemistry 33:8662–8672

Allostery and Quaternary Structure

Thomas J. Wiese
Department of Chemistry, Fort Hays State University, Hays, KS, USA

Synopsis

Quaternary proteins are multisubunit proteins whose subunits are held in relation to each other through non-covalent forces. Allosteric proteins have a second site, distinct from the active site, which binds an effector molecule. Ligand binding at the allosteric site influences the binding of substrate in a positive or negative (inhibitory) manner. The presence of sigmoidal (as opposed to hyperbolic) kinetics is a valuable clue in the identification of an allosteric protein. Much of our understanding of cooperativity comes from the study of hemoglobin. Ligand binding induces conformational changes in a protein, which historically have been investigated using X-ray crystallography. Two models (MWC and KNF) are used to relate conformational change and function. As additional allosteric proteins are studied and new techniques are applied, some long-standing assumptions are being challenged as absolutes, though they remain true in general.

Introduction

Quaternary proteins are proteins in which two or more polypeptides are held in relation to each other through non-covalent, intermolecular forces. They may be homomeric (identical subunits) or heteromeric (non-identical subunits), and the assembly may range from dimeric to multimeric. The individual polypeptides are referred to as protomers. Separation techniques and sequencing are used to determine whether a protein exhibits quaternary structure. Allostery can be deduced by structural determination or by kinetics.

There are a number of ways to determine if a protein acts as an oligomer. Pioneering centrifugation studies by Svedberg showed that a protein whose polypeptide has a mass of 1× might appear to have a mass of 2×, or larger (for a homomer). Suppose that by SDS-PAGE, protein masses of 25 and 50 kDa were observed. If, by native gel electrophoresis or size exclusion chromatography, a mass of 150 kDa was observed, it might mean that a subunit composition of A2B2 is present. The use of chemical cross-linking agents, for example, glutaraldehyde, would prevent dissociation. Use of Sanger’s reagent would demonstrate multiple N-terminal amino acids if different polypeptides were present in a pure protein sample.

“Allosteric” means “other shape” or “other site.” This other site allows a small molecule to bind, whether these molecules are additional substrate molecules or non-substrate molecules. X-ray crystallography is often used to show these other sites. Figure 1 shows one molecule of ADP bound to phosphofructokinase (PFK) adjacent to substrate (fructose-6-phosphate) in the upper right corner (active site). The lower left corner shows a molecule of ADP bound (allosteric site), responsible for inhibiting the PFK.

Allostery and Quaternary Structure, Fig. 1 Phosphofructokinase (4PFK, gray ribbons) bound to ADP and fructose-6-phosphate (space-filling molecules). A second molecule of ADP occupies the allosteric site in the lower left corner

The presence of sigmoidal kinetics (as opposed to hyperbolic kinetics) is a valuable clue in the identification of an allosteric protein, and an enzyme exhibiting sigmoidal kinetics is assumed to be an allosteric enzyme until proven otherwise. It is possible, under some circumstances, for non-allosteric proteins to exhibit sigmoidal kinetics, though this is not common. The focus of this chapter will be allosteric quaternary proteins, using hemoglobin and allosteric enzymes as exemplars. This choice emphasizes that non-enzyme proteins may exhibit the same attributes as enzymes.
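The mass-balance reasoning in the SDS-PAGE example of the Introduction can be made explicit with a short Python sketch. It enumerates every combination of subunit copy numbers whose summed mass matches the native mass. The function name, the 5% mass tolerance, and the upper limit of six copies per subunit are illustrative assumptions, not values from the text.

```python
from itertools import product


def candidate_stoichiometries(subunit_masses_kda, native_mass_kda,
                              max_copies=6, tolerance=0.05):
    """List copy-number tuples whose summed mass matches the native mass.

    Every subunit type seen on the denaturing gel is required at least once.
    The tolerance is a fraction of the native mass (an assumed error margin).
    """
    matches = []
    for copies in product(range(1, max_copies + 1), repeat=len(subunit_masses_kda)):
        total = sum(n * m for n, m in zip(copies, subunit_masses_kda))
        if abs(total - native_mass_kda) <= tolerance * native_mass_kda:
            matches.append(copies)
    return matches


# Values from the text: 25 and 50 kDa bands by SDS-PAGE, 150 kDa native mass.
for copies in candidate_stoichiometries([25.0, 50.0], 150.0):
    print("A%dB%d" % copies)
```

For these masses the enumeration returns A4B1 as well as A2B2, which is why the text says only that the result "might mean" an A2B2 composition. Additional evidence, such as cross-linking patterns, would be needed to choose between the candidates.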

Hemoglobin as an Exemplar of Allosteric Principles Hemoglobin is the prototypical model for the investigation of allosterism and cooperativity for numerous reasons. The concept of positive cooperativity was developed from observations

of Bohr in 1904, making this a long known example. Hemoglobin can be easily purified from readily accessible blood. A number of physiological and pathological changes are known, viz., hemoglobin fetal (HbF) with an affinity for oxygen higher than hemoglobin adult (HbA) and hemoglobin sickle (HbS) wherein deoxygenated blood polymerizes. Many mutations are known, Hb Kansas (Asn102bThr), for example. In fact, mutations are known for virtually all amino acid residues, and many of these affect oxygen binding. Binding of physiologically important ligands (H+, 2,3-bisphosphoglycerate, etc.) affects oxygen binding. Finally, cross-species comparison reveals important phenomena. Hemoglobin is a tetramer, composed of a2b2 subunits. Each subunit contains an ironcontaining porphyrin prosthetic group held in a cleft. Hemoglobin exists in two different

Allostery and Quaternary Structure Allostery and Quaternary Structure, Fig. 2 Molecular dynamics of hemoglobin upon binding of oxygen. (a), The binding of oxygen pulls the iron ion into the plane of the porphyrin ring (█▌ █▌). Ion-dipole interactions are denoted by dashed lines. To avoid steric interaction, histidine converts to an in-line conformation. (b), Molecular motions induced by oxygen binding cause a shuffling of hydrogen bonds to occur

5

a

A

b

conformational states: the T-(taut) state predominates in the deoxy form, which switches to predominantly R-(relaxed) form when oxygen binds. Ligand-induced conformational changes impart conformational changes on adjacent subunits. Just as a crowd of people will disperse in response to a noxious event, the encroachment of one atom on another will induce a repellent flux. Offsetting these repellent forces are attractive forces which maintain quaternary structure. The X-ray structure of Perutz et al. (1960) was interpreted to mean that both inter- and intramolecular salt bridges were responsible for the T structure (Perutz 1970). Symmetry is evident in these salt bridges. The amino terminus of each a-chain is electrostatically connected to the carboxy terminus of the other a-chain. Lysine C5 of the a1 and a2 subunits are bonded to this carboxy terminus of the b1 and b2 subunits, respectively. Both b-subunits have an intramolecular bond between aspartate at the corner of the F and G helices, to the histidine carboxy terminus. In the deoxy form, a steric repulsion between the porphyrin ring and a histidine residue keeps the iron ion out of the plane of the porphyrin ring (Fig. 2a, left). On binding of the oxygen molecule,

the iron ion is pulled into the plane of the porphyrin ring, shifting the histidine residue to an in-line position (Fig. 2a, right). The change in tertiary shape as a result of oxygen binding causes these electrostatic bonds to be weakened and then break. Other interactions form as different residues come in contact. Notably, a hydrogen bond between Tyr42 of the a1 subunit and Asp99 of the b1 subunit is broken as the subunits slide past one another. In turn, a new hydrogen bond between Asp94 and Asn102 forms (Fig. 2b). These alterations in intermolecular forces and conformational changes from T- to R-state are correlated with increased oxygen binding in hemoglobin and enzymatic activity in numerous examples (Barford and Johnson 1989, e.g.,). The significance of the sigmoidal shape is that it indicates that the binding of a substrate molecule by one subunit influences the likelihood that a subsequent subunit will bind the substrate molecule. In the case of hemoglobin, this is a positive influence: the binding of additional oxygen is increased when other binding sites are occupied. The Hill coefficient (Hill 1913) is used to quantify the change. A Hill coefficient n = 1 is when there is no influence between subunits. An n > 1 is

Allostery and Quaternary Structure, Fig. 3 Sigmoidal binding and the Hill coefficient. (a) The initial portion of the sigmoid curve is shown. A Hill coefficient less than one results in an initial higher rate of binding which falls off more rapidly as substrate concentration increases. With an n > 1, there is an initial lag followed by a rapid increase in affinity. (b) The entire sigmoid curve is shown. In the presence of an allosteric stimulator there is a leftward shift of the curve to encompass binding at lower concentrations. An allosteric inhibitor (dotted line) causes a shift to the right, meaning a higher substrate concentration is needed to achieve the same level of activity


positive cooperativity and n < 1 is negative cooperativity. The Hill coefficient for hemoglobin is 3. As shown in Fig. 3a, with a Hill coefficient < 1 the initial rate of binding is higher but falls off more rapidly as the concentration increases, whereas with a Hill coefficient > 1 the initial rate of increase is lower but becomes steeper as the concentration increases. Allosteric effectors can shift the curve left or right. Figure 3b shows oxygen binding at physiological, slightly acidic (shifted right), and slightly basic (shifted left) conditions. Aspartate transcarbamoylase is a K-type allosteric enzyme which displays a shift to the right in the presence of CTP (an inhibitor) and a shift to the left (stimulation) in the presence of ATP (Kantrowitz 2012). Binding of an activator can also lead to a complete separation of

catalytic and regulatory subunits, as in the case of protein kinase A. A tetramer of two regulatory and two catalytic subunits represents the inactive state. Binding of cAMP by the regulatory subunits causes the release of the two catalytic subunits and concomitant exposure of the active site to the intracellular milieu. Many quaternary proteins are allosteric with regulatory roles, as in the protein kinase A example. Enzymes regulating flux through metabolic pathways are almost invariably allosteric enzymes, and most are subject to covalent modification by groups such as phosphate or methyl. These covalent modifications increase allosteric sensitivity. Flux through a rate-determining step may be altered by allosteric control, covalent modification, substrate concentrations, gene expression, or some combination of these.
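The effect of the Hill coefficient on the shape of the binding curve, described above for n < 1, n = 1, and n > 1, can be illustrated numerically. The following is a minimal sketch assuming the standard Hill equation, Y = [S]^n / (K^n + [S]^n). The function name, the half-saturation constant, and the concentrations are illustrative assumptions and are not taken from the text.

```python
def hill_saturation(conc, k_half, n):
    """Fractional saturation from the Hill equation: Y = S^n / (K^n + S^n).

    conc   -- ligand or substrate concentration (arbitrary units, assumed)
    k_half -- concentration giving half saturation (assumed value)
    n      -- Hill coefficient (n > 1 positive, n < 1 negative cooperativity)
    """
    if conc < 0 or k_half <= 0 or n <= 0:
        raise ValueError("conc must be >= 0; k_half and n must be > 0")
    return conc**n / (k_half**n + conc**n)


if __name__ == "__main__":
    k_half = 1.0  # hypothetical half-saturation concentration
    concentrations = (0.1, 0.5, 1.0, 2.0, 10.0)  # illustrative values
    for n in (0.5, 1.0, 3.0):
        row = [round(hill_saturation(s, k_half, n), 3) for s in concentrations]
        print(f"n = {n}: {row}")
```

In this sketch every curve passes through half saturation at the same concentration, so only the steepness differs: the n = 3 curve lags at low concentration and then rises sharply, while the n = 0.5 curve starts higher and flattens.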


Allostery and Quaternary Structure, Fig. 4 Comparison of the MWC (panel a) and KNF (panel b) models of cooperativity. (a) In the MWC model, all subunits are either in the T-state (boxes) or R-state (circles) simultaneously (concerted). (b) The sequential model (KNF) allows any protomer to be either T or R regardless of the state of other subunits. Four of 25 possible combinations are shown

The concerted model of Monod, Wyman, and Changeux (the MWC model, Monod et al. 1965) explains the positive cooperativity displayed by hemoglobin. In the MWC model, all protomers are either in the T-state or all in the R-state (Fig. 4a). Feedback inhibition is seen in most metabolic pathways; this negative cooperativity is not predicted by the MWC model, and the kinetic data do not fit it. Because of this, in 1966, Koshland and coworkers published the sequential model, which has become commonly referred to as the KNF model (Koshland et al. 1966). The KNF model allows any subunit to be in either the R- or T-state, regardless of the state of the other subunits (Fig. 4b). For a tetrameric protein, 25 different combinations are possible; only four are shown in the figure.
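How a concerted T-to-R equilibrium produces positive cooperativity can be seen from the MWC saturation function. The sketch below assumes the standard textbook form of that function for n identical sites, with alpha = [S]/K_R, an allosteric constant L = [T0]/[R0], and c = K_R/K_T. The function name and all parameter values are illustrative assumptions; they are not fitted to hemoglobin and do not come from the text.

```python
def mwc_saturation(alpha, l_const, c, n=4):
    """Fractional saturation in the concerted (MWC) model.

    alpha   -- ligand concentration divided by the R-state dissociation constant
    l_const -- allosteric constant L = [T0]/[R0] with no ligand bound
    c       -- ratio K_R/K_T (c < 1 means the ligand prefers the R-state)
    n       -- number of identical binding sites (4 assumed for a tetramer)
    """
    if alpha < 0 or l_const < 0 or c < 0 or n < 1:
        raise ValueError("alpha, l_const, c must be >= 0 and n >= 1")
    r_bound = alpha * (1 + alpha) ** (n - 1)
    t_bound = l_const * c * alpha * (1 + c * alpha) ** (n - 1)
    all_states = (1 + alpha) ** n + l_const * (1 + c * alpha) ** n
    return (r_bound + t_bound) / all_states


if __name__ == "__main__":
    # Hypothetical parameters: T-state strongly favored without ligand.
    for alpha in (0.1, 1.0, 5.0, 20.0, 100.0):
        y = mwc_saturation(alpha, l_const=1000.0, c=0.01)
        print(f"alpha = {alpha:6.1f}  Y = {y:.4f}")
```

With L = 0 (no T-state) or c = 1 (no preference between states), the expression collapses to the simple hyperbola alpha / (1 + alpha), which is a useful sanity check: cooperativity in this model arises only from the shift of the T/R equilibrium.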

The twenty-first century finds workers in the field examining quaternary structure and allosteric regulation through the lens of systems biology (Huang et al. 2011). This allows the analysis of contacts between complexes of proteins, such as all enzymes of an entire pathway, or between an enzyme and microtubules. Many additional allosteric proteins are being examined, and a database has been established to collate information. Greater understanding of conformational changes is being gained by using new NMR techniques, and small-angle X-ray scattering (SAXS) in solution. This is particularly important as it must ever be kept in mind that although crystal X-ray structures provide the average atomic positions, atoms exhibit significant fluctuations from the average when in solution.


Allostery and Quaternary Structure, Fig. 5 The morpheein model of allosteric complexes. The protomer can be in either of two states, active when oligomerized (triangle) or inactive when oligomerized (square). When oligomerization occurs, the multimeric protein is then in either the active or the inactive form

Finally, some long-standing assumptions are being challenged by new concepts. An evolving concept in allostery is that of morpheeins. Morpheeins are proteins which form two different oligomeric states, depending on the conformation of the protomer (Fig. 5). A long-standing concept has been that of symmetry: subunits are held together by symmetric intermolecular forces and change shape in a likewise symmetric fashion. The morpheein model does not require the maintenance of symmetry (Selwood and Jaffe 2012). Although this model challenges the absolute requirement for symmetry, it must be pointed out that most examples from history do involve symmetry. Interestingly, it is reported by van Holde, et al. (2000) that cooperativity is observed only in proteins whose quaternary structure includes a small number of protomers, such as hemoglobin, and not those with a large number, particularly hemocyanin.

Cross-References
▶ Tertiary Structure, Forces Maintaining the Stability of

References
Barford D, Johnson LN (1989) The allosteric transition of glycogen phosphorylase. Nature 340:609–616
Hill AV (1913) The combinations of haemoglobin with oxygen and with carbon monoxide. I. Biochem J 7:471
Huang Z et al (2011) ASD: a comprehensive database of allosteric proteins and modulators. Nucleic Acids Res 39:D663–D669
Kantrowitz ER (2012) Allostery and cooperativity in Escherichia coli aspartate transcarbamoylase. Arch Biochem Biophys 519:81–90
Koshland DE Jr, Nemethy G, Filmer D (1966) Comparison of experimental binding data and theoretical models in proteins containing subunits. Biochemistry 5:365–385
Monod J, Wyman J, Changeux JP (1965) On the nature of allosteric transitions: a plausible model. J Mol Biol 12:88–118
Perutz MF, Rossmann MG, Cullis AF, Muirhead H, Will G, North AC (1960) Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5-Å resolution, obtained by X-ray analysis. Nature 185:416–422
Perutz MF (1970) Stereochemistry of cooperative effects in haemoglobin. Nature 228:726–739
Selwood T, Jaffe EK (2012) Dynamic dissociating homo-oligomers and the control of protein function. Arch Biochem Biophys 519:131–143
van Holde KE, Miller KI, van Olden E (2000) Allostery in very large molecular assemblies. Biophys Chem 86:165–172

Alternative Splicing ▶ Co-transcriptional mRNA Processing in Eukaryotes

Amide Bond ▶ Secondary Structure

Amino Acid Monomer ▶ Secondary Structure

Apurinic Site Formation ▶ Depurination


Archaeplastida ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Archaeplastidians ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Artificial Chromosomes
Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Definition
Artificial chromosomes are cloning vectors that can carry DNA inserts orders of magnitude larger than is possible with plasmids or lambda-phage-derived vectors (Table 1). Artificial chromosomes, like other cloning vectors, contain the DNA sequence elements that are necessary for replication and stability of the molecule in the host cell and for its faithful partitioning to daughter cells upon cell division. Yeast artificial chromosome vectors (YACs), able to carry DNA inserts as large as 2,000 kb (Monaco and Larin 1994), were the first artificial chromosomes (Burke et al. 1987). Bacterial artificial chromosomes (BACs) (Shizuya et al. 1992) and P1-derived artificial chromosomes (PACs) (Ioannou et al. 1994) were developed to overcome problems of insert chimerism and instability that were encountered with yeast artificial chromosomes. Mammalian artificial chromosomes (MACs) have been developed for mammalian cells, including human cells (HACs).

Artificial Chromosomes, Table 1 Characteristics of cloning vectors (Monaco and Larin 1994)

Vector     Host            Structure          Insert size (kb)
Cosmid     E. coli         Circular plasmid   35–45
P1 clone   E. coli         Circular plasmid   70–100
BAC        E. coli         Circular plasmid
PAC        E. coli         Circular plasmid
YAC        S. cerevisiae   Linear

Discussion

Bacterial DNA Replicases

150 kb (Mok and Marians 1987a, b). These multiple interactions that contribute to processivity appear redundant (Marians et al. 1998). During processive replication of long single-stranded templates, Pol III HE typically stops synthesis upon encountering a duplex (O’Donnell and Kornberg 1985) or after displacing a small number of nucleotides (Dohrmann et al. 2011). However, a strand displacement activity of the DNA Pol III HE has been observed under a variety of conditions (Canceill and Ehrlich 1996; Stephens and McMacken 1997; Xu and Marians 2003; Yao et al. 2000). Interaction of the leading strand polymerase with the lagging strand template, mediated by a Pol III-τ-ψ-χ-SSB bridge, is essential for efficient strand displacement (Yuan and McHenry 2009). A recently discovered ε-β interaction is also required for strand displacement (Jergic et al. 2013). Extrapolating these findings to natural replication forks suggested the leading strand polymerase might be stabilized by interactions with the lagging strand coated with SSB, mediated through a τ-ψ-χ link. Firmly establishing this notion is the recent observation that χ mutants that are defective in interaction with SSB exhibit a defect in the processivity of leading strand synthesis (Marceau et al. 2011), even though SSB is thought to be exclusively associated with the lagging strand template. τ serves an additional role of protecting β within elongating complexes from removal

Bacterial DNA Replicases, Fig. 4 Modular organization of Pol III α. The names and colors of the domains shown are from Bailey et al. (2006) except that their C-terminal domain was further divided into the OB fold and τ-binding domains. The residue numbers that define domain borders in E. coli α are shown above the bar in black. The position of antimutator mutations (marked below the dnaE gene in blue) and mutations selected to discriminate dideoxynucleotides (red above the bar) are indicated (Fijalkowska and Schaaper 1993; Hiratsuka and Reha-Krantz 2000; Oller and Schaaper 1994; Vandewiele et al. 2002). It is likely that these influence either the rate of polymerization or base selection and reside within the polymerase active site. Sde mutations (McHenry 2011) that likely interfere with initiation complex formation are shown in magenta above the bar. Mutator mutations (not shown) in dnaE (Maki et al. 1991; Strauss et al. 2000; Vandewiele et al. 2002) also map within the polymerase domain (palm, thumb, fingers) with the exception of two temperature-sensitive alleles (74 and 486) that exhibit a slight mutator phenotype at the permissive temperature (Vandewiele et al. 2002). dnaE74 maps to position 134 within the PHP domain and dnaE486 maps to position 885 between the β2 binding site and the HhH element within the β2 binding domain. A presumed template slippage mutant maps to residue 133 (Bierne et al. 1997)

catalyzed by exogenous γ complex (Kim et al. 1996b). The δδ′ subunits of the DnaXcx, best known for their roles in β2 loading, are also required for optimal processivity (Song et al. 2001). It is not understood whether their role in processivity is in protecting β2 from removal, in concert with τ, or in some other function. The α subunit of Pol III has been classified as a Class C polymerase, distinct from eukaryotic polymerases and the other polymerases found in E. coli. Functional and genetic experiments have demonstrated the modular nature of Pol III α, and recent structures have refined the definition of its domain boundaries and provided valuable insight into its function (Fig. 4). Three acidic side chains (E. coli (Eco) D401, D403, and D555) in the internal polymerase domain coordinate two Mg++ ions, facilitating catalysis of nucleotide insertion (Pritchard and McHenry 1999). Antimutator and nucleotide selection mutants, presumably associated with polymerase function, helped to further define the limits of the polymerase domain (Fig. 4). Like all polymerases, Pol III α contains palm, thumb, and finger domains, in the shape of a cupped right hand. Superposition of the α palm with that of mammalian Pol β aligns the three

identified catalytic residues of α (Bailey et al. 2006) with those of Pol β (Sawaya et al. 1997). An apoenzyme structure of the full-length Thermus aquaticus (Taq) α subunit showed that the palm domain has the basic fold of the X family of DNA polymerases that includes the slow, non-processive Pol βs, placing bacterial replicases as a special class within that family (Bailey et al. 2006). A structure of Eco α truncated within the β-binding domain also exhibited a Pol β-like fold with perturbations in the active site which are presumably corrected upon substrate binding (Lamers et al. 2006). A ternary complex of a dideoxy-terminated primed template, incoming dNTP, and full-length Taq α provided significant insight into the function of Class C polymerases (Wing et al. 2008). Among the primed template-induced conformational changes is the movement of the thumb domain toward the DNA bound by the palm, driven by interaction of two thumb α helices in parallel with the DNA to make contacts with the sugar-phosphate backbone in the minor groove. The fingers of α also move, and a portion that rotates ca. 15°, together with the palm and the 3′-terminus of the primer, forms a pocket that positions the incoming dNTP. The incoming dNTP is positioned above the three essential


catalytic aspartates. The polymerase contacts the template from its terminus to a position 12 nucleotides behind the primer terminus, in excellent agreement with photo-cross-linking experiments (Reems et al. 1995). The finger domain creates a wall at the end of the primer terminus that forces a sharp kink in the emerging template strand (Wing et al. 2008). The terminal domains of Pol III α confer special properties upon it, including the ability to bind to and communicate with other replication proteins. Analysis of α deletion mutants revealed that C-terminal domains are responsible for interactions with both τ and β (Kim and McHenry 1996a, b). An essential β2 interaction site (Eco 920–924) (Dalrymple et al. 2001) was verified by mutagenesis, coupled with functional, genetic, and biophysical experiments (Dohrmann and McHenry 2005). Deletion of residues from the C-terminus abolished τ binding, but N-terminal deletions extending into the fingers domain also diminished τ binding, suggesting either extensive τ interactions or structural perturbations (Kim and McHenry 1996a). More detailed mutagenesis studies (Dohrmann and McHenry 2005) have identified the C-terminus as critical for τ binding, but the binding site has not been firmly identified. The C-terminal region of α contains additional domains identified by similarity to elements found in other DNA-binding proteins. These include a helix-hairpin-helix motif (HhH) (Eco 836–854) (Bailey et al. 2006; Doherty et al. 1996) and an OB fold (Eco 964–1,078) (Bailey et al. 2006; Theobald et al. 2003; Fig. 4). A structure of Taq α revealed a well-organized β2-binding domain with dsDNA-binding capability. DNA binding occurs through an HhH motif and its flanking loops (Wing et al. 2008). The β2 binding consensus sequence is presented in a loop that is oriented adjacent to dsDNA as it exits the polymerase, in the correct position to bind β2 as it surrounds DNA. The β2 binding domain rotates 20° and swings down into position as the enzyme binds DNA (Wing et al. 2008), a reorientation that is apparently driven energetically by the HhH motif binding to DNA and likely coupled to conformational changes of the thumb, palm, OB fold, and PHP domains.


The structure of the ternary complex of Taq α with a primed template and incoming dNTP also revealed a striking conformational change that includes the OB fold moving to a position near the single-strand template distal to the primer (Wing et al. 2008). The path of the emerging template, which can be traced from electron density of the ribose-phosphate backbone, appears to come close to the OB fold. The element of the OB fold that comes closest to the ssDNA template, the β1-β2 loop, often contributes to ssDNA binding (Theobald et al. 2003). However, the β1-β2-β3 face that commonly interacts with ssDNA (Theobald et al. 2003) appears to “face away” from the emerging template and to face the τ binding domain. So, binding of the OB fold, if it takes place, occurs either in a nonstandard way or there are further rearrangements as the template strand becomes longer or when additional protein subunits are present. The second of the two C-terminal domains in the Taq α structure revealed a domain containing an incompletely conserved sequence that binds weakly to β2, but is not required for processive replication in vitro or function in vivo. This domain is loosely packed against the OB fold, with many polar residues in the interface (Bailey et al. 2006). Mutational studies support the importance of this subdomain in binding τ (Dohrmann and McHenry 2005). Further information regarding possible sites of interaction of this extreme C-terminal domain with τ was derived from a genetic screen for suppression of a dominant lethal phenotype of an extrachromosomally expressed dnaE that formed initiation complexes but was unable to elongate (McHenry 2011). Two sde mutations in the C-terminal domain (W1134C and L1157Q) appeared to severely diminish the interaction with τ. The availability of a structure of β2 on dsDNA (Georgescu et al. 2008), and a knowledge of the β2-binding site for polymerases (Bunting et al. 2003; Burnouf et al. 2004), permitted construction of a model of these proteins interacting on DNA (Wing et al. 2008). The model places β2 approximately 20 nucleotides behind the primer terminus (Fig. 5), in agreement with footprinting, FRET, and photo-cross-linking studies (Griep and McHenry 1992; Reems and McHenry 1994; Reems et al. 1995).


Bacterial DNA Replicases, Fig. 5 Models of β2 binding to Pol III α from reference (Wing et al. 2008) with a proposed position for binding of the γ subunit of the DnaXcx. (a) The β2 binding site in Pol III α is indicated, docked to one of two polymerase binding sites within the β2 clamp (indicated by arrow to blue-purple β2 structure). The remaining protein components represent Pol III α, colored as in Fig. 4. The structure in (a) was prepared from pymol files provided by R. Wing and rotated so that the PHP domain is facing away from the plane of the paper and the τ-binding domain is projecting toward the viewer. (b) Positions of primer contacts with Pol III HE subunits determined by photo-cross-linking (Reems et al. 1995). These results suggest the γ subunit of the DnaXcx fits into the open gap as indicated, contacting the primer strand (grey). Because domain V of the τ subunit is believed to contact the upper portion of the C-terminal domain of α (shown in dark red in (a)), γ could be drawn into position partially by that interaction. With a subunit sitting between the polymerase and the β2 clamp, the DnaXcx could be in a position to modulate polymerase switching, perhaps using its chaperoning activity, and to facilitate polymerase release and recycling during Okazaki fragment synthesis. This model would need to accommodate the newly discovered interaction between ε and β (Jergic et al. 2013; Ozawa et al. 2013; Toste-Rego et al. 2013)

A proposal was made that the two polymerase binding sites on β2 could be used as an entry point for polymerase exchange at the replication fork (Burnouf et al. 2004). The model accommodates such an interaction. The same photo-cross-linking experiments that correctly assigned the contacts of β2 and Pol III α with DNA also showed a clear cross-link of γ when photo-reactive probes were placed on nucleotide 18 of the primer (Reems et al. 1995). We note that the open cleft might accommodate γ, which could be sequestered in mixed τ/γ DnaXcxs by interaction of τ with the τ-binding C-terminal domain.

Termination Upon Completion of an Okazaki Fragment and Cycling to the Next Primer on the Lagging Strand

Simultaneous with the exceedingly highly processive leading strand Pol III HE, the lagging strand polymerase must, upon completion of an Okazaki fragment, rapidly dissociate and reassociate with a new primer for the next Okazaki fragment in under 0.1 s. Two primary models have been proposed for how this occurs. In the collision model, it was proposed that polymerase dissociation was triggered by collision of the polymerase with the preceding Okazaki fragment (Leu et al. 2003; Fig. 6a). In the signaling model, synthesis of a new primer for the next Okazaki fragment drives cycling, even if the preceding Okazaki fragment is not finished (Wu et al. 1992b; Fig. 6b). In both models, polymerase leaves β2 behind, and a new β2 is loaded on the next primer (Stukenberg et al. 1994). Kinetic tests of the collision model suggest it is inadequate, by itself, to support a physiologically relevant rate of polymerase release (Dohrmann et al. 2011). A study that exploited selective modulation of lagging strand synthesis on rolling circle templates with highly asymmetric nucleotide composition supported the signaling model and refuted the


Bacterial DNA Replicases, Fig. 6 Two models for lagging strand polymerase cycling. (a) In the collision model, it was proposed that all Okazaki fragments are synthesized to completion and collision of the elongating polymerase with the 5′ end of the preceding Okazaki fragment triggers release and recycling to the next primer at the replication fork (Georgescu et al. 2009). (b) In the signaling model, it was proposed that a signal that accompanies the synthesis of a new primer is transmitted to the elongating lagging strand polymerase and it dissociates, even if the gap between Okazaki fragments has not been filled (Wu et al. 1992b)

collision model for E. coli (Yuan and McHenry 2014). A modification was required in the signaling model specifying the availability of a new primer as the signal rather than the action of primase, per se. A proposal was made that the clamp loader was the sensor in E. coli (Yuan and McHenry 2014), in agreement with a similar proposal made for T4 cycling (Chen et al. 2013). In model systems provided by bacteriophages T4 and T7, which encode their own replication proteins, the signaling model appears dominant (Hamdan et al. 2009; Yang et al. 2004).

Proofreading Within the Bacterial Replicase

Structure, Function, and Interactions of ε: The Major Proofreading Subunit

During DNA replication, a high level of fidelity is attained by the action of a proofreading exonuclease that removes nucleotides misincorporated by an associated polymerase. The proofreading exonucleases of most eukaryotic, bacterial, and viral DNA replicases are homologous and contain acidic residues that chelate two Mg++ ions that participate directly in catalysis (Beese and Steitz 1991). In E. coli and other bacteria that use only one Pol III replicase, the proofreading exonuclease exists as a separate polypeptide chain, ε (Scheuermann et al. 1983), which binds to the Pol III α subunit (Fig. 1) through α’s N-terminal PHP domain (Jergic et al. 2013; Ozawa et al. 2013; Toste-Rego et al. 2013; Wieczorek and McHenry 2006). The structure of the catalytic domain of ε has been determined and is consistent with two Mg++ catalysis, though the protein coordinates two Mn++ ions derived from the crystallization buffer in the structure (Cisneros et al. 2009; Derose et al. 2002; Hamdan et al. 2002). Based on this structure, a mechanism for nucleotide removal by hydrolysis has been proposed (Hamdan et al. 2002). The two metal ions are held in place by interaction with three essential acidic active site residues. Asp 12, Glu 14, and Asp 167 all coordinate metal ion A. Ion B is bound solely by Asp 12, suggesting it may be bound more weakly and thus could dissociate with each catalytic turnover. The proposed mechanism begins with metal ion A interacting with the reactive phosphate and coordinating the attacking hydroxide ion with the assistance of general base catalysis from an active site His 162. Metal ion B coordinates a phosphate oxygen and presumably, with A, withdraws electrons and makes the reactive phosphate more susceptible to nucleophilic attack. Both ions may serve to shield the charge on the phosphate,


reducing charge repulsion of the attacking hydroxide anion. Metal ion B is proposed to coordinate the departing 3′-OH of the DNA chain, stabilizing the developing negative charge of the transition state. ε binds a nonessential (Slater et al. 1994) auxiliary subunit, θ, whose only apparent function is to stabilize ε (Taft-Benz and Schaaper 2004), and ε also binds α through its C-terminal domain (Ozawa et al. 2008). A direct interaction of the exonuclease catalytic domain and Pol III α has not been detected. However, in other polymerases, the relationship of the polymerase active site and proofreading exonuclease is more rigidly fixed, and a channel connects the two sites (Steitz 1999). It is possible that weaker or regulated interactions between the ε catalytic domain and Pol III α exist that permit a direct coordination of elongation with proofreading. Kinetic studies indicate that ε has a high catalytic capacity (280 nt removed/s) and, by itself, acts distributively (Miller and Perrino 1996). However, when part of the replicative complex, it can processively digest primers to a limit of 6 nt, perhaps determined by instability of a limited primer-template duplex (Reems et al. 1991). That exonuclease action within the Pol III HE is processive indicates that β2 and other processivity factors are making similar contributions to both proofreading and polymerization. The kinetics of nucleotide removal by the proofreading exonuclease appear to be slower within full Pol III HE replicative complexes, suggesting that the catalytic capacity of the exonuclease may not be the rate-limiting step (Griep et al. 1990).

The PHP Domain of Pol III α: A Second Proofreading Activity?

The PHP domain was first identified by its sequence similarity to histidinol phosphatase, and the proposal was made that it might have pyrophosphatase activity (Aravind and Koonin 1998). However, such an activity is not present (Lamers et al. 2006). The structure of YcdX, a protein more closely related to the Pol III PHP domain and whose function is unknown, revealed a Zn++ trinuclear center with characteristics similar to several phosphoesterases (Teplyakov et al. 2003).


This information prompted a search for intrinsic hydrolytic activity in α in the absence of ε, which led to the discovery of a second proofreading activity within Pol III (Stano et al. 2006). The second activity (the PHP exonuclease) follows the classical criteria for proofreading initially established for E. coli DNA polymerase I (Brutlag and Kornberg 1972). The PHP exonuclease exhibits higher activity on mispaired termini, and removal of a mispair precedes elongation by the associated polymerase (Stano et al. 2006). The activity was distinguished from the prototypical proofreading exonuclease by being dependent on an endogenous metal ion that is not Mg++, likely Zn++. Addition of a Zn++ chelator in the presence of excess Mg++ destroys activity (Stano et al. 2006). The PHP exonuclease activity in pure recombinant Thermus thermophilus (Tth) α was distinguished from mesophilic exonucleases by high thermal stability that decayed in parallel with polymerase activity (Stano et al. 2006). The structure of Taq α revealed that a cluster of nine residues in the PHP domain, including eight of the ligands predicted from informatics approaches (Wieczorek and McHenry 2006), chelates three metal ions (Bailey et al. 2006), as shown directly for the E. coli YcdX homolog (Teplyakov et al. 2003). Kuriyan and colleagues, from the structure of the Eco α, pointed out a channel between the polymerase active site and the proposed PHP active site (Lamers et al. 2006). The PHP domain contains a long loop (Eco 107–116) that interacts extensively with the thumb. There may also be contacts between the PHP domain and DNA (Wing et al. 2008). This would explain the dependence of polymerase activity on the integrity of the PHP domain. Deletion of 60 N-terminal PHP residues or a D43A point mutation within the proposed active site abolishes polymerase activity (Kim et al. 1997), and mutation of single acidic residues within PHP decreases polymerase activity (Pritchard and McHenry 1999). Cooperative unfolding of the PHP and polymerase domains has been observed, consistent with an overall structural role for PHP (Barros et al. 2013). Novel proofreading exonuclease activities have been observed in bacterial DNA polymerases that resemble eukaryotic Pol β


but contain an extra PHP domain on their C-terminus (Banos et al. 2008; Blasius et al. 2006). In the case of Pol X from B. subtilis, the activity has been linked to PHP by point mutation and deletion analysis (Banos et al. 2008). Such domains may function as independent proofreaders or may bind a separate ε proofreading subunit. Possible functions of the PHP domain of Pol III α have been speculated upon (McHenry 2011).

Organisms that Contain Multiple DNA Polymerase IIIs: Functions and Interactions

Both eukaryotic and prokaryotic replicases use a nearly structurally identical sliding clamp processivity factor and a five-protein ring-shaped AAA+ ATPase clamp loader (▶ Eukaryotic DNA Replicases). However, while in E. coli, the Pol III HE is the sole replicase, in eukaryotes, three replicases exist. Pol ε is the leading strand replicase, Pol δ is the lagging strand replicase, and Pol α is part of the priming apparatus and elongates nascent primers with dNTPs for a short distance before handing them off to Pol δ (Dua et al. 1999; Nethanel and Kaufmann 1990; Nick McElhinny et al. 2008; Pursell et al. 2007). Interestingly, Gram-positive bacteria also contain multiple Pol IIIs and, along with them, other features reminiscent of eukaryotic replication systems. The E. coli DnaXcx, through the C-terminal domain of its τ subunit, binds to Pol III very tightly, with a Kd of 70 pM (Kim and McHenry 1996a), and because τ is oligomeric, the leading and lagging strand polymerases are maintained in one coupled complex. In contrast, Gram-positive DnaX complexes bind their cognate polymerases extremely weakly (Bruck and O’Donnell 2000; Bruck et al. 2005), suggesting a transient interaction and/or the presence of additional factors that make the interaction more stable. In low-GC Gram-positive bacteria, two Pol IIIs exist, termed PolC and DnaE (Koonin and Bork 1996). They are homologous, but PolC has some of its domains rearranged and it contains an endogenous standard Mg++-dependent


proofreading activity. DnaE is more closely related to the E. coli Pol III. A suggestion was made, based on genetic/cell physiology studies, that PolC is the leading strand polymerase and DnaE is the lagging strand polymerase in B. subtilis (Dervyn et al. 2001). However, DnaE has low fidelity, at least in vitro (Bruck et al. 2003; Le Chatelier et al. 2004). Yet, its overproduction in vivo does not increase mutation rates (Le Chatelier et al. 2004). These observations argue against a major replicative role. A B. subtilis rolling circle replication system has been reconstituted using 13 purified B. subtilis replication proteins (Sanders et al. 2010) predicted to be required by previous genetic or biochemical investigations (Bruand et al. 1995, 2001; Bruck and O’Donnell 2000; Dervyn et al. 2001; Polard et al. 2002; Velten et al. 2003). This system appears to accurately mimic the reaction at the replication fork of a Gram-positive bacterium, in terms of both its correspondence with genetic requirements and the replication fork rate in vivo (500 nt/s at 30  C) (Wang et al. 2007). Leading strand replication requires 11 proteins, including the Pol III encoded by polC. The second Pol III encoded by dnaE will not substitute. In addition to these 11 proteins, lagging strand replication requires DnaE and primase (Sanders et al. 2010). This is consistent with proposals for a lagging strand role for DnaE (Dervyn et al. 2001). However, the elongation rate of DnaE is too slow (~25 nt/s) to keep up with the replication fork. In contrast, PolC supports a physiologically relevant elongation rate (~500 nt/s). PolC discriminates against RNA primers; DnaE uses RNA primers efficiently (Sanders et al. 2010). These characteristics suggest a role for B. subtilis DnaE, analogous to eukaryotic Pol a, in which it extends RNA primers initially and then hands them off to a replicase. 
Consistent with the eukaryotic Pol α role, model systems using RNA-primed ssDNA show inefficient use by PolC, with a marked stimulation by low levels of DnaE to a level of synthesis greater than that achieved by DnaE alone (Sanders et al. 2010). However, in the absence of PolC, DnaE can catalyze extensive synthesis. This makes it difficult to assess the exact position of the hand-off of an extended primer from DnaE to PolC, because DnaE appears to continue synthesis in the absence of another polymerase. This issue was addressed by using a specific PolC inhibitor (HB-EMAU) (Tarantino et al. 1999). This class of inhibitors likely acts as a dGTP analog, forming a ternary complex with the primed template and PolC and trapping the enzyme in a dead-end complex (Low et al. 1974). When HB-EMAU was included in cooperative RNA primer extension reactions containing DnaE and PolC, synthesis was drastically inhibited, indicating that the hand-off to PolC occurs early in the reaction (Sanders et al. 2010). The precise window for the hand-off awaits further experimentation.
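The contrast drawn earlier in this section between the tight E. coli interaction (Kd of 70 pM) and the much weaker Gram-positive interaction can be illustrated with a simple equilibrium calculation. This is a hedged sketch that assumes simple 1:1 binding with one partner in excess. The 1 µM "weak" Kd and the 1 nM partner concentration are hypothetical placeholders, since the text gives no value for the Gram-positive case, and the function name is an illustration only.

```python
def fraction_bound(free_partner_molar: float, kd_molar: float) -> float:
    """Equilibrium fraction bound for simple 1:1 binding.

    Assumes the binding partner is in excess, so its free concentration
    is approximately constant: fraction = [L] / (Kd + [L]).
    """
    if free_partner_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_partner_molar / (kd_molar + free_partner_molar)


KD_TIGHT = 70e-12               # 70 pM, the value quoted in the text for E. coli
KD_WEAK_HYPOTHETICAL = 1e-6     # ASSUMPTION: placeholder for "extremely weak" binding
PARTNER_CONCENTRATION = 1e-9    # ASSUMPTION: 1 nM free partner, for illustration

if __name__ == "__main__":
    for label, kd in (("tight (70 pM)", KD_TIGHT),
                      ("weak (hypothetical 1 uM)", KD_WEAK_HYPOTHETICAL)):
        f = fraction_bound(PARTNER_CONCENTRATION, kd)
        print(f"{label}: fraction bound = {f:.4f}")
```

With these illustrative numbers, the tight interaction is mostly occupied while the weak one is almost entirely dissociated, which is consistent with the suggestion of a transient interaction or of additional stabilizing factors.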

Cross-References
▶ DNA Polymerase III Structure
▶ Eukaryotic DNA Replicases
▶ Initiation Complex Formation, Mechanism of
▶ Replication Origin of E. coli and the Mechanism of Initiation

References
Aravind L, Koonin EV (1998) Phosphoesterase domains associated with DNA polymerases of diverse origins. Nucleic Acids Res 26:3746–3752
Ason B, Bertram JG, Hingorani MM, Beechem JM, O’Donnell ME, Goodman MF, Bloom LB (2000) A model for Escherichia coli DNA polymerase III holoenzyme assembly at primer/template ends. DNA triggers a change in binding specificity of the γ complex clamp loader. J Biol Chem 275:3006–3015
Ason B, Handayani R, Williams CR, Bertram JG, Hingorani MM, O’Donnell ME, Goodman MF, Bloom LB (2003) Mechanism of loading the Escherichia coli DNA polymerase III β sliding clamp on DNA. Bona fide primer/templates preferentially trigger the γ complex to hydrolyze ATP and load the clamp. J Biol Chem 278:10033–10040
Bailey S, Wing RA, Steitz TA (2006) The structure of T. aquaticus DNA polymerase III is distinct from eukaryotic replicative DNA polymerases. Cell 126:893–904
Banos B, Lazaro JM, Villar L, Salas M, De Vega M (2008) Editing of misaligned 3′-termini by an intrinsic 3′-5′ exonuclease activity residing in the PHP domain of a family X DNA polymerase. Nucleic Acids Res 36:5736–5749
Barros T, Guenther J, Kelch B, Anaya J, Prabhakar A, O’Donnell M, Kuriyan J, Lamers MH (2013) A structural role for the PHP domain in E. coli DNA polymerase III. BMC Struct Biol 13:8
Beese LS, Steitz TA (1991) Structural basis for the 3′-5′ exonuclease activity of Escherichia coli DNA polymerase I: a two metal ion mechanism. EMBO J 10:25–34
Bierne H, Vilette D, Ehrlich SD, Michel B (1997) Isolation of a dnaE mutation which enhances recA-independent homologous recombination in the Escherichia coli chromosome. Mol Microbiol 24:1225–1234
Binder JK, Douma LG, Ranjit S, Kanno DM, Chakraborty M, Bloom LB, Levitus M (2014) Intrinsic stability and oligomerization dynamics of DNA processivity clamps. Nucleic Acids Res 42:6476–6486
Blasius M, Shevelev I, Jolivet E, Sommer S, Hübscher U (2006) DNA polymerase X from Deinococcus radiodurans possesses a structure-modulated 3′→5′ exonuclease activity involved in radioresistance. Mol Microbiol 60:165–176
Blinkowa AL, Walker JR (1990) Programmed ribosomal frameshifting generates the Escherichia coli DNA polymerase III γ subunit from within the τ subunit reading frame. Nucleic Acids Res 18:1725–1729
Bruand C, Ehrlich SD, Janniere L (1995) Primosome assembly site in Bacillus subtilis. EMBO J 14:2642–2650
Bruand C, Farache M, McGovern S, Ehrlich SD, Polard P (2001) DnaB, DnaD and DnaI proteins are components of the Bacillus subtilis replication restart primosome. Mol Microbiol 42:245–256
Bruck I, O’Donnell ME (2000) The DNA replication machine of a gram-positive organism. J Biol Chem 275:28971–28983
Bruck I, Goodman MF, O’Donnell ME (2003) The essential C family DnaE polymerase is error-prone and efficient at lesion bypass. J Biol Chem 278:44361–44368
Bruck I, Georgescu RE, O’Donnell M (2005) Conserved interactions in the Staphylococcus aureus DNA PolC chromosome replication machine. J Biol Chem 280:18152–18162
Brutlag D, Kornberg A (1972) Enzymatic synthesis of DNA. XXXVI. A proof reading function of the 3′→5′ exonuclease activity in deoxyribonucleic acid polymerases. J Biol Chem 247:241–248
Bunting KA, Roe SM, Pearl LH (2003) Structural basis for recruitment of translesion DNA polymerase Pol IV/DinB to the β-clamp. EMBO J 22:5883–5892
Burnouf DY, Olieric V, Wagner J, Fujii S, Reinbolt J, Fuchs RP, Dumas P (2004) Structural and biochemical analysis of sliding clamp/ligand interactions suggest a competition between replicative and translesion DNA polymerases. J Mol Biol 335:1187–1197
Canceill D, Ehrlich SD (1996) Copy-choice recombination mediated by DNA polymerase III holoenzyme from Escherichia coli. Proc Natl Acad Sci U S A 93:6647–6652
Chen D, Yue H, Spiering MM, Benkovic SJ (2013) Insights into Okazaki fragment synthesis by the T4 replisome: the fate of lagging-strand holoenzyme components and their influence on Okazaki fragment size. J Biol Chem 288:20807–20816
Cho WK, Jergic S, Kim D, Dixon NE, Lee JB (2014) Loading dynamics of a sliding DNA clamp. Angew Chem Int Ed Engl 53:6768–6771
Cisneros GA, Perera L, Schaaper RM, Pedersen LC, London RE, Pedersen LG, Darden TA (2009) Reaction mechanism of the ε subunit of E. coli DNA polymerase III: insights into active site metal coordination and catalytically significant residues. J Am Chem Soc 131:1550–1556
Dalrymple BP, Kongsuwan K, Wijffels G, Dixon NE, Jennings PA (2001) A universal protein-protein interaction motif in the eubacterial DNA replication and repair systems. Proc Natl Acad Sci U S A 98:11627–11632
Derose EF, Li D, Darden T, Harvey S, Perrino FW, Schaaper RM, London RE (2002) Model for the catalytic domain of the proofreading ε subunit of Escherichia coli DNA polymerase III based on NMR structural data. Biochemistry 41:94–110
Dervyn E, Suski C, Daniel R, Bruand C, Chapuis J, Errington J, Janniere L, Ehrlich SD (2001) Two essential DNA polymerases at the bacterial replication fork. Science 294:1716–1719
Doherty AJ, Serpell LC, Ponting CP (1996) The helix-hairpin-helix DNA-binding motif: a structural basis for non-sequence-specific recognition of DNA. Nucleic Acids Res 24:2488–2497
Dohrmann PR, McHenry CS (2005) A bipartite polymerase-processivity factor interaction: only the internal β binding site of the α subunit is required for processive replication by the DNA polymerase III holoenzyme. J Mol Biol 350:228–239
Dohrmann PR, Manhart CM, Downey CD, McHenry CS (2011) The rate of polymerase release upon filling the gap between Okazaki fragments is inadequate to support cycling during lagging strand synthesis. J Mol Biol 414:15–27
Downey CD, McHenry CS (2010) Chaperoning of a replicative polymerase onto a newly-assembled DNA-bound sliding clamp by the clamp loader. Mol Cell 37:481–491
Downey CD, Crooke E, McHenry CS (2011) Polymerase chaperoning and multiple ATPase sites enable the E. coli DNA polymerase III holoenzyme to rapidly form initiation complexes. J Mol Biol 412:340–353

Dua R, Levy DL, Campbell JL (1999) Analysis of the essential functions of the C-terminal protein/protein interaction domain of Saccharomyces cerevisiae pol ε and its unexpected ability to support growth in the absence of the DNA polymerase domain. J Biol Chem 274:22283–22288
Fang J, Nevin P, Kairys V, Venclovas C, Engen JR, Beuning PJ (2014) Conformational analysis of processivity clamps in solution demonstrates that tertiary structure does not correlate with protein dynamics. Structure 22:572–581
Fay PJ, Johanson KO, McHenry CS, Bambara RA (1981) Size classes of products synthesized processively by DNA polymerase III and DNA polymerase III holoenzyme of Escherichia coli. J Biol Chem 256:976–983
Fay PJ, Johanson KO, McHenry CS, Bambara RA (1982) Size classes of products synthesized processively by two subassemblies of Escherichia coli DNA polymerase III holoenzyme. J Biol Chem 257:5692–5699
Fijalkowska IJ, Schaaper RM (1993) Antimutator mutations in the α subunit of Escherichia coli DNA polymerase III: identification of the responsible mutations and alignment with other DNA polymerases. Genetics 134:1039–1044
Fijalkowska IJ, Schaaper RM, Jonczyk P (2012) DNA replication fidelity in Escherichia coli: a multi-DNA polymerase affair. FEMS Microbiol Rev 36(6):1105–1121
Flower AM, McHenry CS (1990) The γ subunit of DNA polymerase III holoenzyme of Escherichia coli is produced by ribosomal frameshifting. Proc Natl Acad Sci U S A 87:3713–3717
Gao D, McHenry CS (2001a) τ binds and organizes Escherichia coli replication proteins through distinct domains. Domain III, shared by γ and τ, binds δδ′ and χψ. J Biol Chem 276:4447–4453
Gao D, McHenry CS (2001b) τ binds and organizes Escherichia coli replication proteins through distinct domains. Domain IV, located within the unique C terminus of τ, binds the replication fork helicase, DnaB. J Biol Chem 276:4441–4446
Gao D, McHenry CS (2001c) τ binds and organizes Escherichia coli replication proteins through distinct domains: partial proteolysis of terminally tagged τ to determine candidate domains and to assign domain V as the α binding domain. J Biol Chem 276:4433–4440
Georgescu RE, Kim SS, Yurieva O, Kuriyan J, Kong XP, O’Donnell M (2008) Structure of a sliding clamp on DNA. Cell 132:43–54
Georgescu RE, Kurth I, Yao NY, Stewart J, Yurieva O, O’Donnell M (2009) Mechanism of polymerase collision release from sliding clamps on the lagging strand. EMBO J 28:2981–2991
Glover BP, McHenry CS (1998) The χψ subunits of DNA polymerase III holoenzyme bind to single-stranded DNA-binding protein (SSB) and facilitate replication of a SSB-coated template. J Biol Chem 273:23476–23484
Glover BP, McHenry CS (2000) The DnaX-binding subunits δ′ and ψ are bound to γ and not τ in the DNA polymerase III holoenzyme. J Biol Chem 275:3017–3020
Glover BP, Pritchard AE, McHenry CS (2001) τ binds and organizes Escherichia coli replication proteins through distinct domains. Domain III, shared by γ and τ, oligomerizes DnaX. J Biol Chem 276:35842–35846
Goodman MF (2002) Error-prone repair DNA polymerases in prokaryotes and eukaryotes. Annu Rev Biochem 71:17–50
Griep MA, McHenry CS (1992) Fluorescence energy transfer between the primer and the β subunit of the DNA polymerase III holoenzyme. J Biol Chem 267:3052–3059
Griep M, Reems J, Franden M, McHenry C (1990) Reduction of the potent DNA polymerase III holoenzyme 3′→5′ exonuclease activity by template-primer analogs. Biochemistry 29:9006–9014
Hamdan S, Carr PD, Brown SE, Ollis DL, Dixon NE (2002) Structural basis for proofreading during replication of the Escherichia coli chromosome. Structure 10:535–546
Hamdan S, Loparo JJ, Takahashi M, Richardson CC, van Oijen AM (2009) Dynamics of DNA replication loops reveal temporal control of lagging-strand synthesis. Nature 457:336–339
Hayner JN, Bloom LB (2013) The beta sliding clamp closes around DNA prior to release by the Escherichia coli clamp loader gamma complex. J Biol Chem 288:1162–1170
Heller RC, Marians KJ (2005) The disposition of nascent strands at stalled replication forks dictates the pathway of replisome loading during restart. Mol Cell 17:733–743
Hiratsuka K, Reha-Krantz LJ (2000) Identification of Escherichia coli dnaE (polC) mutants with altered sensitivity to 2′,3′-dideoxyadenosine. J Bacteriol 182:3942–3947
Indiani C, O’Donnell ME (2003) Mechanism of the δ wrench in opening the β sliding clamp. J Biol Chem 278:40272–40281
Ivanov I, Chapados BR, McCammon JA, Tainer JA (2006) Proliferating cell nuclear antigen loaded onto double-stranded DNA: dynamics, minor groove interactions and functional implications. Nucleic Acids Res 34:6023–6033
Jergic S, Horan NP, Elshenawy MM, Mason CE, Urathamakul T, Ozawa K, Robinson A, Goudsmits JM, Wang Y, Pan X, Beck JL, van Oijen AM, Huber T, Hamdan SM, Dixon NE (2013) A direct proofreader-clamp interaction stabilizes the Pol III replicase in the polymerization mode. EMBO J 32:1322–1333

Jeruzalmi D, Yurieva O, Zhao Y, Young M, Stewart J, Hingorani M, O’Donnell ME, Kuriyan J (2001) Mechanism of processivity clamp opening by the δ subunit wrench of the clamp loader complex of E. coli DNA polymerase III. Cell 106:417–428
Kelch BA, Makino DL, O’Donnell M, Kuriyan J (2011) How a DNA polymerase clamp loader opens a sliding clamp. Science 334:1675–1680
Kim DR, McHenry CS (1996a) Biotin tagging deletion analysis of domain limits involved in protein-macromolecular interactions: mapping the τ binding domain of the DNA polymerase III α subunit. J Biol Chem 271:20690–20698
Kim DR, McHenry CS (1996b) Identification of the β-binding domain of the α subunit of Escherichia coli polymerase III holoenzyme. J Biol Chem 271:20699–20704
Kim DR, McHenry CS (1996c) In vivo assembly of overproduced DNA polymerase III: overproduction, purification, and characterization of the α, α-ε, and α-ε-θ subunits. J Biol Chem 271:20681–20689
Kim S, Dallmann HG, McHenry CS, Marians KJ (1996a) Coupling of a replicative polymerase and helicase: a τ-DnaB interaction mediates rapid replication fork movement. Cell 84:643–650
Kim S, Dallmann HG, McHenry CS, Marians KJ (1996b) τ protects β in the leading-strand polymerase complex at the replication fork. J Biol Chem 271:4315–4318
Kim DR, Pritchard AE, McHenry CS (1997) Localization of the active site of the α subunit of the Escherichia coli DNA polymerase III holoenzyme. J Bacteriol 179:6721–6728
Kong XP, Onrust R, O’Donnell ME, Kuriyan J (1992) Three-dimensional structure of the β subunit of E. coli DNA polymerase III holoenzyme: a sliding DNA clamp. Cell 69:425–437
Koonin EV, Bork P (1996) Ancient duplication of DNA polymerase inferred from analysis of complete bacterial genomes. Trends Biochem Sci 21:128–129
Kornberg A, Baker TA (1992) DNA replication. WH Freeman, New York
LaDuca RJ, Crute JJ, McHenry CS, Bambara RA (1986) The β subunit of the Escherichia coli DNA polymerase III holoenzyme interacts functionally with the catalytic core in the absence of other subunits. J Biol Chem 261:7550–7557
Lamers MH, Georgescu RE, Lee SG, O’Donnell M, Kuriyan J (2006) Crystal structure of the catalytic α subunit of E. coli replicative DNA polymerase III. Cell 126:881–892
Le Chatelier E, Becherel OJ, D’Alencon E, Canceill D, Ehrlich SD, Fuchs RP, Janniere L (2004) Involvement of DnaE, the second replicative DNA polymerase from Bacillus subtilis, in DNA mutagenesis. J Biol Chem 279:1757–1767
Lebowitz JH, McMacken R (1986) The Escherichia coli DnaB replication protein is a DNA helicase. J Biol Chem 261:4738–4748

Leu FP, Georgescu R, O’Donnell ME (2003) Mechanism of the E. coli τ processivity switch during lagging-strand synthesis. Mol Cell 11:315–327
Low RL, Rashbaum SA, Cozzarelli NR (1974) Mechanism of inhibition of Bacillus subtilis DNA polymerase III by the arylhydrazinopyrimidine antimicrobial agents. Proc Natl Acad Sci U S A 71:2973–2977
Maki H, Mo JY, Sekiguchi M (1991) A strong mutator effect caused by an amino acid change in the α subunit of DNA polymerase III of Escherichia coli. J Biol Chem 266:5055–5061
Manhart CM, McHenry CS (2013) The PriA replication restart protein blocks replicase access prior to helicase assembly and directs template specificity through its ATPase activity. J Biol Chem 288:3989–3999
Marceau AH, Bahng S, Massoni SC, George NP, Sandler SJ, Marians KJ, Keck JL (2011) Structure of the SSB-DNA polymerase III interface and its role in DNA replication. EMBO J 30:4236–4247
Marians KJ, Hiasa H, Kim DR, McHenry CS (1998) Role of the core DNA polymerase III subunits at the replication fork: α is the only subunit required for processive replication. J Biol Chem 273:2452–2457
McHenry CS (1982) Purification and characterization of DNA polymerase III′: identification of τ as a subunit of the DNA polymerase III holoenzyme. J Biol Chem 257:2657–2663
McHenry CS (1988) DNA polymerase III holoenzyme of Escherichia coli. Annu Rev Biochem 57:519–550
McHenry CS (2011) DNA replicases from a bacterial perspective. Annu Rev Biochem 80:403–436
Millar D, Trakselis MA, Benkovic SJ (2004) On the solution structure of the T4 sliding clamp (gp45). Biochemistry 43:12723–12727
Miller H, Perrino FW (1996) Kinetic mechanism of the 3′→5′ proofreading exonuclease of DNA polymerase III: analysis by steady state and pre-steady state methods. Biochemistry 35:12919–12925
Moarefi I, Jeruzalmi D, Turner J, O’Donnell ME, Kuriyan J (2000) Crystal structure of the DNA polymerase processivity factor of T4 bacteriophage. J Mol Biol 296:1215–1223
Mok M, Marians KJ (1987a) Formation of rolling-circle molecules during φX174 complementary strand DNA replication. J Biol Chem 262:2304–2309
Mok M, Marians KJ (1987b) The Escherichia coli preprimosome and DNA B helicase can form replication forks that move at the same rate. J Biol Chem 262:16644–16654
Nethanel T, Kaufmann G (1990) Two DNA polymerases may be required for synthesis of the lagging DNA strand of simian virus 40. J Virol 64:5912–5918
Nick McElhinny SA, Gordenin DA, Stith CM, Burgers PMJ, Kunkel TA (2008) Division of labor at the eukaryotic replication fork. Mol Cell 30:137–144

O’Donnell ME, Kornberg A (1985) Complete replication of templates by Escherichia coli DNA polymerase III holoenzyme. J Biol Chem 260:12884–12889
Oller AR, Schaaper R (1994) Spontaneous mutation in Escherichia coli containing the DnaE911 DNA polymerase antimutator allele. Genetics 138:263–270
Ozawa K, Jergic S, Park AY, Dixon NE, Otting G (2008) The proofreading exonuclease subunit ε of Escherichia coli DNA polymerase III is tethered to the polymerase subunit α via a flexible linker. Nucleic Acids Res 36:5074–5082
Ozawa K, Horan NP, Robinson A, Yagi H, Hill FR, Jergic S, Xu ZQ, Loscha KV, Li N, Tehei M, Oakley AJ, Otting G, Huber T, Dixon NE (2013) Proofreading exonuclease on a tether: the complex between the E. coli DNA polymerase III subunits alpha, epsilon, theta and beta reveals a highly flexible arrangement of the proofreading domain. Nucleic Acids Res 41:5354–5367
Paschall CO, Thompson JA, Marzahn MR, Chiraniya A, Hayner JN, O’Donnell M, Robbins AH, McKenna R, Bloom LB (2011) The Escherichia coli clamp loader can actively pry open the beta-sliding clamp. J Biol Chem 286:42704–42714
Polard P, Marsin S, McGovern S, Velten M, Wigley DB, Ehrlich SD, Bruand C (2002) Restart of DNA replication in gram-positive bacteria: functional characterisation of the Bacillus subtilis PriA initiator. Nucleic Acids Res 30:1593–1605
Pritchard AE, McHenry CS (1999) Identification of the acidic residues in the active site of DNA polymerase III. J Mol Biol 285:1067–1080
Pursell ZF, Isoz I, Lundstrom EB, Johansson E, Kunkel TA (2007) Yeast DNA polymerase ε participates in leading-strand DNA replication. Science 317:127–130
Reems JA, McHenry CS (1994) Escherichia coli DNA polymerase III holoenzyme footprints three helical turns of its primer. J Biol Chem 269:33091–33096
Reems JA, Griep MA, McHenry CS (1991) The proofreading activity of DNA polymerase III responds like the elongation activity to auxiliary subunits. J Biol Chem 266:4878–4882
Reems JA, Wood S, McHenry CS (1995) Escherichia coli DNA polymerase III holoenzyme subunits α, β and γ directly contact the primer template. J Biol Chem 270:5606–5613
Sanders GM, Dallmann HG, McHenry CS (2010) Reconstitution of the B. subtilis replisome with 13 proteins including two distinct replicases. Mol Cell 37:273–281
Sawaya MR, Prasad R, Wilson SH, Kraut J, Pelletier H (1997) Crystal structures of human DNA polymerase β complexed with gapped and nicked DNA: evidence for an induced fit mechanism. Biochemistry 36:11205–11215
Scheuermann R, Tam S, Burgers PMJ, Lu C, Echols H (1983) Identification of the ε-subunit of Escherichia coli DNA polymerase III holoenzyme as the dnaQ gene product: a fidelity subunit for DNA replication. Proc Natl Acad Sci U S A 80:7085–7089
Shamoo Y, Steitz TA (1999) Building a replisome from interacting pieces: sliding clamp complexed to a peptide from DNA polymerase and a polymerase editing complex. Cell 99:155–166
Simonetta KR, Kazmirski SL, Goedken ER, Cantor AJ, Kelch BA, McNally R, Seyedin SN, Makino DL, O’Donnell M, Kuriyan J (2009) The mechanism of ATP-dependent primer-template recognition by a clamp loader complex. Cell 137:659–671
Slater SC, Lifsics MR, O’Donnell ME, Maurer R (1994) holE, the gene coding for the θ subunit of DNA polymerase III of Escherichia coli: characterization of a holE mutant and comparison with a dnaQ (ε-subunit) mutant. J Bacteriol 176:815–821
Song MS, Pham PT, Olson M, Carter JR, Franden MA, Schaaper RM, McHenry CS (2001) The δ and δ′ subunits of the DNA polymerase III holoenzyme are essential for initiation complex formation and processive elongation. J Biol Chem 276:35165–35175
Stano NM, Chen J, McHenry CS (2006) A co-proofreading Zn(2+)-dependent exonuclease within a bacterial replicase. Nat Struct Mol Biol 13:458–459
Steitz TA (1999) DNA polymerases: structural diversity and common mechanisms. J Biol Chem 274:17395–17398
Stephens KM, McMacken R (1997) Functional properties of replication fork assemblies established by the bacteriophage λ O and P replication proteins. J Biol Chem 272:28800–28813
Stewart J, Hingorani MM, Kelman Z, O’Donnell ME (2001) Mechanism of β clamp opening by the δ subunit of Escherichia coli DNA polymerase III holoenzyme. J Biol Chem 276:19182–19189
Strauss BS, Roberts R, Francis L, Pouryazdanparast P (2000) Role of the dinB gene product in spontaneous mutation in Escherichia coli with an impaired replicative polymerase. J Bacteriol 182:6742–6750
Studwell PS, O’Donnell ME (1990) Processive replication is contingent on the exonuclease subunit of DNA polymerase III holoenzyme. J Biol Chem 265:1171–1178
Stukenberg PT, Turner J, O’Donnell ME (1994) An explanation for lagging strand replication: polymerase hopping among DNA sliding clamps. Cell 78:877–887
Taft-Benz SA, Schaaper RM (2004) The θ subunit of Escherichia coli DNA polymerase III: a role in stabilizing the ε proofreading subunit. J Bacteriol 186:2774–2780
Tainer JA, McCammon JA, Ivanov I (2010) Recognition of the ring-opened state of proliferating cell nuclear antigen by replication factor C promotes eukaryotic clamp-loading. J Am Chem Soc 132:7372–7378
Tarantino PM, Zhi C, Gambino JJ, Wright GE, Brown NC (1999) 6-Anilinouracil-based inhibitors of Bacillus subtilis DNA polymerase III: antipolymerase and antimicrobial structure-activity relationships based on substitution at uracil N3. J Med Chem 42:2035–2040

Teplyakov A, Obmolova G, Khil PP, Howard AJ, Camerini-Otero RD, Gilliland GL (2003) Crystal structure of the Escherichia coli YcdX protein reveals a trinuclear zinc active site. Proteins 51:315–318
Theobald DL, Mitton-Fry RM, Wuttke DS (2003) Nucleic acid recognition by OB-fold proteins. Annu Rev Biophys Biomol Struct 32:115–133
Thompson JA, Paschall CO, O’Donnell M, Bloom LB (2009) A slow ATP-induced conformational change limits the rate of DNA binding but not the rate of β-clamp binding by the Escherichia coli γ complex clamp loader. J Biol Chem 284:32147–32157
Toste-Rego A, Holding AN, Kent H, Lamers MH (2013) Architecture of the Pol III-clamp-exonuclease complex reveals key roles of the exonuclease subunit in processive DNA synthesis and repair. EMBO J 32:1334–1343
Tsuchihashi Z, Kornberg A (1990) Translational frameshifting generates the γ subunit of DNA polymerase III holoenzyme. Proc Natl Acad Sci U S A 87:2516–2520
Vandewiele D, Fernandez de Henestrosa AR, Timms AR, Bridges BA, Woodgate R (2002) Sequence analysis and phenotypes of five temperature sensitive mutator alleles of dnaE, encoding modified alpha-catalytic subunits of Escherichia coli DNA polymerase III holoenzyme. Mutat Res 499:85–95
Velten M, McGovern S, Marsin S, Ehrlich SD, Noirot P, Polard P (2003) A two-protein strategy for the functional loading of a cellular replicative DNA helicase. Mol Cell 11:1009–1020
Wang JD, Sanders GM, Grossman AD (2007) Nutritional control of elongation of DNA replication by (p)ppGpp. Cell 128:865–875
Wieczorek A, McHenry CS (2006) The NH(2)-terminal php domain of the α subunit of the E. coli replicase binds the ε proofreading subunit. J Biol Chem 281:12561–12567
Wing RA, Bailey S, Steitz TA (2008) Insights into the replisome from the structure of a ternary complex of the DNA polymerase III α-subunit. J Mol Biol 382:859–869
Wu CA, Zechner EL, Marians KJ (1992a) Coordinated leading- and lagging-strand synthesis at the Escherichia coli DNA replication fork: I. Multiple effectors act to modulate Okazaki fragment size. J Biol Chem 267:4030–4044
Wu CA, Zechner EL, Reems JA, McHenry CS, Marians KJ (1992b) Coordinated leading- and lagging-strand synthesis at the Escherichia coli DNA replication fork: V. Primase action regulates the cycle of Okazaki fragment synthesis. J Biol Chem 267:4074–4083
Xu L, Marians KJ (2003) PriA mediates DNA replication pathway choice at recombination intermediates. Mol Cell 11:817–826
Yang J, Zhuang Z, Roccasecca RM, Trakselis MA, Benkovic SJ (2004) The dynamic processivity of the T4 DNA polymerase during replication. Proc Natl Acad Sci U S A 101:8289–8294

Yao N, Hurwitz J, O’Donnell ME (2000) Dynamics of β and proliferating cell nuclear antigen sliding clamps in traversing DNA secondary structure. J Biol Chem 275:1421–1432
Yuan Q, McHenry CS (2009) Strand displacement by DNA polymerase III occurs through a τ-ψ-χ link to SSB coating the lagging strand template. J Biol Chem 284:31672–31679
Yuan Q, McHenry CS (2014) Cycling of the E. coli lagging strand polymerase is triggered exclusively by the availability of a new primer at the replication fork. Nucleic Acids Res 42:1747–1756

Bacteriophage and Viral Cloning Vectors
Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Synopsis
Viruses and bacteriophages (the viruses that infect bacteria) provided vectors for some of the first cloning experiments, and they continue to be used in a variety of applications. The most important bacteriophages have been the filamentous phages (especially M13), bacteriophage lambda, and bacteriophage P1. In addition, vectors such as phagemids and cosmids have been constructed that combine features of both plasmids and phage. A large variety of vectors can be used to generate infectious recombinant viruses for use in eukaryotic cells. These include viruses derived from human immunodeficiency virus type 1 (HIV-1), the causative agent of AIDS.

Introduction
Some of the earliest cloning experiments made use of bacteriophages or viruses to carry recombinant DNA molecules, introduce them into cells, and enable the production of new phage or viral particles containing copies of the recombinant DNA. Since then, many cloning vectors derived from natural bacteriophages and viruses have been developed for use in a wide variety of applications. Viruses and bacteriophages consist of, minimally, a nucleic acid genome (DNA or RNA) enclosed within a protein capsid. Some viruses are also enclosed within a lipid membrane envelope. The virus or bacteriophage enters an appropriate host cell, where the genome can be replicated and viral- or phage-specific proteins expressed. New virus or phage particles are produced, which leave the original cell to infect other cells. Bacteriophages, the viruses that infect bacteria, are thought to be the most abundant “organisms” on earth (Krupovic et al. 2011).

Main Text
Cloning Vectors Derived from Bacteriophages
Cloning vectors have been developed from filamentous bacteriophages (M13 and f1), bacteriophage lambda, and bacteriophage P1, taking advantage of the unique characteristics of each of these phages. Bacteriophages, commonly called phages, have several advantages over plasmids as cloning vectors. The phage infection process is an easy and highly efficient way to introduce recombinant molecules into host cells. Infected cells can be spread on an agar plate, on which the cells grow to produce a continuous lawn of bacteria except for the small region surrounding an infected cell. The phage produced by the initially infected cell will infect neighboring cells, either killing them or inhibiting their growth. This results in formation of a clear area in the bacterial lawn, called a plaque. Formation of the plaque readily indicates the presence of the phage vector in that region of the agar plate. High infection or transformation efficiencies of phage increase the chance of recovering a clone containing the desired recombinant DNA. Some phage vectors can hold larger inserts than most plasmids, allowing the cloning of large eukaryotic genes and their regulatory elements. The larger insert size also reduces the total number of clones needed for a DNA library to contain the entire genome from a species.
Filamentous Phage (M13 and f1) and Phagemids
Joachim Messing developed a cloning system based on the M13 phage in the late 1970s (Messing 1983, 1991). These phages are useful because their genome can exist and be isolated in either circular double-stranded (dsDNA) or single-stranded (ssDNA) forms. The circular single-stranded recombinant DNA can be used in some DNA sequencing and site-directed mutagenesis applications (although these approaches have been superseded by newer methods that use dsDNA). The phage particles can also be used to make strand-specific ssDNA probes for Southern blotting. These vectors are also

Bacteriophage and Viral Cloning Vectors, Fig. 1 Bacteriophage M13 life cycle. Phase 1: the single-stranded circular phage DNA (+) strand enters the bacterial cell upon infection. It is converted to a nicked double-stranded form after RNA primer synthesis by the host RNA polymerase and DNA polymerase III. The nicked form is converted to the closed-circular supercoiled replicative form I (RFI) by the host DNA polymerase I, ligase, and gyrase. The () strand in RFI is the template for phage mRNA synthesis by the host RNA polymerase. Phase 2: the gene 2 protein encoded by the phage (gp2)

Bacteriophage and Viral Cloning Vectors

important historically in the development of cloning technology (Messing 1991). M13 and other filamentous phage particles consist of a circular single-stranded genome encased in a capsid of about 2,700 monomers of the phage-encoded gene 8 protein (gp8) (Kornberg and Baker 1991). The phage particle is a narrow filament that is about 895 nm long but only 6 nm in diameter. The phage infects cells by binding to the F pilus formed by proteins encoded by the F plasmid (fertility factor) that must be present in the host cell. Once inside the cell, the 6,407 nt genome is converted to a circular double-stranded form called replicative form I (RFI) (Fig. 1). RFI serves as a template for production of circular single-stranded viral DNA molecules, which are themselves converted to more RFI molecules. RFI also provides the

nicks the RFI and initiates synthesis of new single-stranded circular (+) strands. These new single-stranded circles are used for synthesis of more RFI molecules early in infection. Phase 3: later in infection, the gp5 protein (red) binds the single-stranded (+) strand products and prevents them from being replicated to RFI. Instead, gp5 is displaced by gp8 (green) to form new phage particles that are extruded through the cell membrane. Phage proteins gp3, gp6, gp7, and gp9, that are present in the phage particle, are not shown (After Kornberg and Baker 1991)

Bacteriophage and Viral Cloning Vectors

template for phage mRNA production, allowing synthesis of phage proteins. As phage proteins accumulate in the infected cell, the new ssDNA circles are coated by the gene 5 protein (gp5), which prevents them from being converted to RFI and instead shunts them to packaging into phage particles. Packaging involves displacement of gp5 by gp8, the capsid protein. Gp8 resides in the bacterial inner membrane, so packaging is concomitant with extrusion of the gp8-DNA phage particle through the cell membrane and into the surrounding medium. The process continues until the ssDNA is completely coated by gp8 and extruded, regardless of the size of the DNA. Thus, recombinant phage can be formed with fairly large DNA inserts. The extrusion process does not kill the infected cell, although the infection slows cell growth. Thus, plaques formed by infected cells can be seen as areas of slower bacterial growth in the surrounding bacterial lawn.

Recombinant phage can be produced in large amounts by picking a plaque from the lawn and growing the phage in liquid culture containing host cells. The phage particles that are released into the growth medium can be isolated by precipitation and centrifugation and the circular ssDNA isolated by phenol extraction (Messing 1983). The double-stranded RFI can be isolated from infected cells by methods used for plasmid DNA isolation.

Messing introduced many novel features into the natural M13 phages, producing the M13mp series of vectors, of which the most important are M13mp18 (Fig. 2) and M13mp19. These features include a multicloning site (MCS) with several adjacent and unique restriction enzyme cleavage sites from the plasmid pUC18, and the E. coli lac promoter and a downstream lacZα gene fragment. Cloning into these vectors is done by isolating the double-stranded RFI form from infected cells and ligating insert DNA into appropriate restriction sites in the MCS. The recombinant circular DNA is transformed into host cells, where the dsDNA is replicated by the normal process to produce phage particles containing the circular ssDNA recombinant genome. The lacZα gene allows recombinant phage to be isolated by blue-white screening.
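The requirement that each cloning site in the MCS be unique can be checked computationally. The following Python sketch counts occurrences of restriction enzyme recognition sequences in a DNA string and reports the enzymes that cut exactly once. The polylinker sequence is the pUC18 MCS as it is commonly quoted; it is reproduced here as an assumption, is not taken from this entry, and should be verified against a database record before use. The function names `count_sites` and `unique_cutters` are illustrative only.

```python
# Recognition sequences (5'->3') of enzymes commonly listed for the pUC18 MCS.
# All are palindromic, so scanning one strand is sufficient.
ENZYMES = {
    "EcoRI": "GAATTC", "SacI": "GAGCTC", "KpnI": "GGTACC", "SmaI": "CCCGGG",
    "BamHI": "GGATCC", "XbaI": "TCTAGA", "SalI": "GTCGAC", "PstI": "CTGCAG",
    "SphI": "GCATGC", "HindIII": "AAGCTT",
}

# Assumed pUC18 polylinker sequence (verify against a sequence database).
PUC18_MCS = "GAATTCGAGCTCGGTACCCGGGGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTT"


def count_sites(sequence: str, site: str) -> int:
    """Count occurrences of a recognition site, allowing overlapping matches."""
    sequence, site = sequence.upper(), site.upper()
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def unique_cutters(sequence: str, enzymes: dict) -> list:
    """Return the names of enzymes whose site occurs exactly once."""
    return [name for name, site in enzymes.items()
            if count_sites(sequence, site) == 1]


if __name__ == "__main__":
    print(unique_cutters(PUC18_MCS, ENZYMES))
```

A site that is unique within the MCS is only useful for cloning if it is also absent from the rest of the vector, so in practice the complete vector sequence (for example, the GenBank records cited in Fig. 2) would be scanned in the same way.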

Bacteriophage and Viral Cloning Vectors, Fig. 2 Map of M13mp18. Thick line is M13 DNA. Thin line is DNA from the E. coli lac operon. The locations of the origin of replication region of M13, the lacZα gene (for blue-white screening), and the multicloning site (MCS) are shown. Based on sequences in GenBank (Accession numbers M77815 and J02465)

A novel extension of the use of filamentous phage vectors was the development of phagemid vectors. Phagemids combine the ssDNA production of filamentous phage vectors with the simplicity of plasmids. A phagemid contains a plasmid origin of replication and the phage sequences from M13 or f1 that are required to initiate circular ssDNA production (Mead et al. 1986). The recombinant phagemid can be maintained and propagated in a cell as a plasmid. Phagemid ssDNA can be produced by coinfecting cells with the phagemid and a “helper” phage or a plasmid that carries the genes encoding phage proteins required for ssDNA production. The phagemid is then replicated and packaged as a phage, which can be isolated and processed as above.

Bacteriophage Lambda

Peter Lobban, while a graduate student at Stanford University, proposed in 1969 to use bacteriophage lambda as a cloning vector by replacing a nonessential portion of the phage DNA with insert DNA from another source (Berg and Mertz 2010). One of the first applications of this approach was by Murray and Murray, who created

Bacteriophage and Viral Cloning Vectors, Fig. 3 The cos site of bacteriophage lambda. The double-stranded cos sequence is in red. Cleavage of the site in linear phage lambda DNA concatemers by the terminase enzyme produces the 12-nt cohesive ends that are found in the linear phage genome. The host DNA ligase catalyzes joining of the cohesive ends in the monomer-length lambda genome to produce the circular form in the infected cell (see Fig. 4)

a lambda phage with a portion of the nonessential region of its genome deleted and only one EcoRI restriction site in the entire vector. Using this vector, they were able to clone a DNA fragment containing part of the E. coli trp operon (Murray and Murray 1974). At about the same time, Ron Davis and co-workers produced a recombinant lambda phage carrying DNA from Drosophila melanogaster (Thomas et al. 1974).

Bacteriophage lambda has an icosahedral proteinaceous head that contains the DNA genome and a tail that enables the phage to attach to the E. coli host cell (Kornberg and Baker 1991). The phage infects the cell by binding to the LamB protein, a maltose-binding receptor protein in the E. coli outer membrane, and then injecting its DNA genome into the cell. For this reason, host cells are grown on medium containing maltose, which induces expression of LamB. The lambda genome is a linear double-stranded DNA molecule of 48 kb with 12-nucleotide single-stranded overhangs on the 5′-ends. The two overhangs, called cohesive ends, are complementary, which enables the DNA to circularize by base pairing inside the infected cell (Fig. 3). The DNA circle is sealed by the host DNA ligase, and DNA gyrase introduces negative supercoils (Fig. 4). The circularized DNA molecule serves as the template for DNA replication and for transcription, leading to expression of phage proteins.

A lambda phage infection can take either a lysogenic or a lytic course. In the lysogenic course, the phage DNA is inserted into the host cell genome by the action of the Int protein, a phage-encoded site-specific recombinase. The lytic course involves production of several hundred progeny phages, which lyse and kill the host cell and are thereby released and able to infect other cells. The choice between lysogenic and lytic growth depends on three transcription regulatory proteins: the cI repressor (lambda repressor), the cII activator, and the Cro repressor.

In lytic growth (Fig. 4), phage-encoded proteins are produced and the phage DNA is replicated. Replication depends on host enzymes, except for the phage-encoded O and P proteins that act in replication initiation to direct the host replication machinery to the viral genome. Replication occurs initially by a theta-type mechanism, producing closed-circular product DNA. About 15 min after initiation, replication switches to a rolling circle mechanism that produces linear double-stranded DNA concatemers in which monomer-length genomes are joined in a head-to-tail fashion. The concatemeric genome is necessary for packaging of monomer-length genomes

Bacteriophage and Viral Cloning Vectors, Fig. 4 Bacteriophage lambda lytic life cycle. The double-stranded lambda genome is injected into a host cell (top, left). The cos cohesive ends enable the linear genome to circularize (see Fig. 3). The host DNA ligase and gyrase produce a supercoiled genome. The lambda O protein binds to the replication origin and the P protein recruits host replication enzymes, to initiate replication via a theta structure intermediate. This process produces more supercoiled genomes. Later in infection, the replication switches to a rolling circle mechanism that produces linear concatemers of the phage genome. The linear DNA can enter phage proheads as they assemble in the cell. When the head is completely filled with one genome length of DNA, terminase cleaves two cos sites, to release a monomer-length genome in the phage head. Phage particles are completed by addition of the tail and released to the surroundings after cell lysis. The thick red line represents double-stranded DNA, in which individual strands are not shown (After Kornberg and Baker 1991)

into the phage proheads that assemble in the infected cell. Packaging requires cleavage of monomer genomes from the concatemer by the phage-encoded terminase endonuclease. Terminase cleaves the DNA at the cos site to regenerate the 12-nt cohesive ends of the original phage DNA.

Lytic infection requires genes on both ends of the lambda phage’s linear genome. However, the DNA between these two “arms,” about one-third of the genome, is not essential for phage propagation and can be replaced with insert DNA (Fig. 5). Cloning involves replacing the central nonessential “stuffer fragment” by ligating insert DNA between the two arms, packaging the recombinant DNA into phage particles in vitro, and infecting E. coli cells with the recombinant phage particles. Phage replication in the host cell produces dsDNA molecules consisting of repeated recombinant phage DNA joined at cos sites. Terminase processing cleaves the DNA at the cos sites, allowing packaging of the

Bacteriophage and Viral Cloning Vectors, Fig. 5 Bacteriophage lambda genome map. The locations of the genes mentioned in the text are indicated, including the genes that encode the small and large subunits of terminase (Nu1 and A genes), the O and P genes encoding phage-specific DNA replication proteins, and the cI, cII, and cro genes involved in the choice of lysogenic or lytic growth. The DNA between the J and N genes is dispensable and can be replaced by a DNA insert when the phage is used as a cloning vector. The replaceable region includes genes encoding enzymes for integration and excision of the prophage (int and xis) and for homologous recombination (exo, bet, and gam), among others. See Sambrook et al. (1989) for a more complete map. Map coordinates are from the sequence in GenBank, accession # NC_001416

recombinant DNA into phage particles that eventually lyse the cell, releasing the phage.

The amount of DNA that enters the prohead must be enough to fill the limited volume of the head and thus must be between 75% and 105% of the normal length of the phage genome (Kornberg and Baker 1991; Chauthaiwale et al. 1992; Weigel and Seitz 2006). This limits both the minimum and maximum sizes of the DNA insert. Lambda insertion vectors allow cloning of small inserts into the nonessential region, while lambda replacement vectors remove the entire nonessential region and allow inserts of up to around 23 kb in length (Chauthaiwale et al. 1992). Lambda phage vectors can be purchased as individual arms produced by restriction endonuclease cleavage, ready to be ligated to insert DNA that was cut with the same enzyme. E. coli cell extracts containing the enzymes required for packaging are also available, so that the recombinant phage DNA can be packaged in vitro into viral particles.

A number of improved lambda phage vectors have been developed (Chauthaiwale et al. 1992) by adding multicloning sites for cloning insert DNA, promoters for gene expression, and the lacZα gene fragment for blue-white screening. For example, the λgt11 vector (Young and Davis 1983) contains the lac promoter and the lacZ gene. DNA up to 7.2 kb in size can be inserted into a unique EcoRI site within the lacZ coding sequence. Recombinant plaques can be screened for production of the desired protein as a LacZ fusion protein by using an antibody directed against the target protein.

Vectors Based on Bacteriophage Lambda: Cosmids and Lambda ZAP

Cosmid vectors were developed in the late 1970s (Collins and Hohn 1978). These vectors are small plasmids (~5 kb) that contain (minimally) a plasmid origin of replication, an antibiotic resistance gene, and the cos site from bacteriophage lambda. The cos site allows the recombinant plasmid DNA to be cleaved by lambda terminase and packaged into phage particles, which can be introduced very efficiently into host cells by the phage infection process (Collins and Hohn 1978). Cosmids can accommodate DNA inserts of 35–45 kb, larger than are easily handled using ordinary plasmid vectors. Cosmids have been particularly useful for generating DNA libraries.

DNA can be inserted into a cosmid vector by cleavage with an appropriate restriction enzyme, followed by ligation under conditions where the ligation products are linear concatemers rather than circles. The concatemers contain multiple copies of the linear cosmid vector ligated to and separating random insert DNA molecules. The cos sequences in these concatemers enable the recombinant molecules to be packaged into phage particles. The ligation mixture is mixed with a cell extract that contains phage proheads,

tails, and packaging enzymes, including the terminase enzyme. Terminase cleaves the linear DNA concatemers at the cos sites and the resulting linear products are packaged as phage. The phages are then used to infect E. coli cells. The cohesive ends from the cleaved cos sites in the recombinant linear DNA enable the DNA to be circularized and converted to a plasmid in the cell (Fig. 3), where it is subsequently replicated using the plasmid replicon. Phages are not produced in these cells because the cosmid lacks all lambda genes that are needed for producing phage particles and for cell lysis. Vectors with a small DNA insert or none are too small to be packaged into phage heads, and so these molecules are selected against.

The lambda ZAP vector combines features of phage lambda, filamentous phage f1, and the Bluescript plasmid, giving a vector that provides high efficiency of transformation, simplicity of storage, ease of screening a lambda phage library, and easy excision of the insert DNA into the pBluescript plasmid vector for further analysis (Short et al. 1988). The latter feature greatly facilitates subcloning of DNA from the recombinant phage to the plasmid, where the DNA is in many respects easier to manipulate than in the phage vector. The lambda ZAP vector consists of the arms from phage lambda, between which the Bluescript phagemid has been inserted. Bluescript has both a plasmid (pUC) replicon and that from the f1 filamentous phage. Foreign DNA is ligated into the Bluescript portion of the lambda ZAP vector, and the resulting recombinant molecules are packaged into lambda phage and used to infect an E. coli host. The Bluescript plasmid and cloned DNA insert can be excised in circular double-stranded form by introducing the recombinant lambda ZAP phage into cells that express f1 phage proteins required for replication of the f1 replicon contained within pBluescript. The recombinant DNA is maintained thereafter as a circular plasmid. The f1 replicon can also be used to generate the circular single-stranded form of the recombinant Bluescript plasmid. The pBluescript insert in lambda ZAP has

a multicloning site for convenient insertion of foreign DNA, and the lac promoter and lacZα, allowing for blue-white screening using X-gal.

Bacteriophage P1

Vectors based on bacteriophage P1 were developed in the early 1990s (Sternberg 1990). These vectors can accommodate very large DNA inserts (ca. 100 kb) without the instability problems that were encountered with yeast artificial chromosome (YAC) vectors. Recombinant DNA in P1 vectors can be introduced into cells with high transformation efficiency by the phage infection process.

The P1 bacteriophage particle has a linear 94.8 kb double-stranded DNA genome (Lobocka et al. 2004). The phage can pursue both lytic and lysogenic phases in an infected cell. The phage exists as a circular plasmid when in the lysogenic phase. The plasmid form is replicated as the cell undergoes replication and division, and it is maintained at one to two copies per cell by a phage-encoded partition system. Replication of the circular prophage requires an origin of replication (oriR) and the phage-encoded repA origin-binding protein, along with several host enzymes (Kornberg and Baker 1991). Circularization of the linear phage genome after entry into the host cell depends on the Cre-lox site-specific recombination system. The phage-encoded Cre protein can bind, cleave, and rejoin two loxP sequences in double-stranded DNA. The linear phage genome has a terminal redundancy in which about 9–12% of the genome sequence is repeated at the ends of the genome (Kornberg and Baker 1991). If the repeated region happens to include the loxP site, then cleavage and joining of the two sites by Cre produces the circular plasmid form of the phage.

P1 cloning vectors (Sternberg 1990) contain the pBR322 replicon that allows them to be maintained as multi-copy plasmids prior to insertion of foreign DNA. The vectors also contain the P1 packaging site (pac) required for packaging of recombinant DNA into the P1 phage head in vitro. Packaging of DNA continues until the head is filled, at which point the DNA is cleaved by the

phage-encoded pacase enzyme present in the in vitro packaging extract. The amount of DNA that is packaged (110–115 kb) is determined by the amount of DNA required to fill the icosahedral phage head. The recombinant P1 phages are then used to infect an E. coli cell that expresses the Cre recombinase. The recombinant DNA is circularized and maintained in the cell as a single copy by the P1 plasmid replicon in the vector, since the pBR322 replicon is removed during circularization. The P1 lytic replicon in the recombinant molecule is under transcriptional control of the lac operon. Expression of the lytic genes, which increases the copy number of the plasmid 25-fold, can be induced by adding IPTG to relieve repression by the Lac repressor. The increase in plasmid copy number aids isolation of the recombinant DNA.

Vectors Derived from Eukaryotic Viruses

Many vectors have been developed by modifying natural viruses for convenient use in mammalian cells, much as was done with the bacteriophage vectors described above. Vectors have been based on retroviruses, lentiviruses, adenoviruses, adeno-associated viruses, and others (Blesch 2004; Chailertvanitkul and Pouton 1992; Deyle and Russell 2009; McConnell and Imperiale 2004; Quinonez and Sutton 2002). A useful vector includes viral sequences that are essential for stable replication and selection in a mammalian host cell and for packaging into a virion particle. The vector also has sequences that allow it to be maintained as a circular plasmid in a bacterial host. DNA is cloned into the vector by standard cloning techniques using E. coli as the host. The recombinant vector is then introduced into an appropriate eukaryotic host cell that has been engineered to produce the viral proteins required for packaging the viral DNA (or RNA) genome into virion particles. Finally, the virus particles that now contain the recombinant vector with DNA insert are introduced into the host cell of interest, where the inserted DNA can be expressed.

These vectors can be used in several applications, including to introduce genes into

mammalian cells for study of the biological effect of expression of that gene or for gene therapy. Induced pluripotent stem cells have been generated from somatic cells by introducing genes encoding key regulatory proteins carried on virus vectors (Patel and Yang 2010). Baculovirus-based vectors are used to achieve high-level expression of active proteins from eukaryotic sources, in eukaryotic (insect) host cells. Plasmid (episome) vectors capable of stable replication in mammalian cells have been developed from natural viruses (see the Short Essay ▶ “Plasmid Cloning Vectors”).

Vectors Based on Retroviruses and Lentiviruses

Retroviruses and lentiviruses have single-stranded RNA genomes. The RNA serves as the mRNA for viral protein synthesis, and it is converted to dsDNA by reverse transcription in the infected cell. The dsDNA can then be integrated into the host cell chromosome, giving a cell line that is stably transfected by the virus. The Moloney murine leukemia virus was one of the first important retroviral vectors (Quinonez and Sutton 2002). A number of lentiviruses have been used more recently, including human immunodeficiency virus (HIV), the causative agent of AIDS. Lentivirus-based vectors are advantageous because the virus can infect both dividing and nondividing cells (Lois et al. 2002).

Retroviral and lentiviral vectors contain sequences required for maintenance as a dsDNA plasmid in E. coli and (minimally) the viral 5′ and 3′ long terminal repeat sequences (LTRs) that are required for viral RNA reverse transcription in a human host cell and the viral Psi (Ψ) packaging signal sequence (Fig. 6). Insert DNA is ligated into restriction sites placed between the LTRs, under the control of a promoter such as that from cytomegalovirus (CMV). The recombinant plasmid version of the vector is transfected into a packaging cell line. These cells contain plasmids with the genes encoding proteins necessary for viral RNA production and packaging into viral particles with a protein capsid and membrane envelope. The virion particles contain the viral

Bacteriophage and Viral Cloning Vectors, Fig. 6 Examples of vectors derived from a lentivirus (pFUGW-H1) and retrovirus (pCX4hyg). The pFUGW-H1 vector contains the following elements from HIV-1: 5′ and 3′ long terminal repeats (LTR) (the 3′ LTR has a deletion (ΔU3) that inactivates transcription of the integrated provirus from the 5′ LTR), Psi (Ψ) region for packaging into virion particles, and the Rev-response element (RRE) for efficient pre-mRNA splicing. Other features are the cytomegalovirus (CMV) promoter to increase expression of the viral RNA genome during transient transfection, the enhanced green fluorescent protein (EGFP) transgene controlled by the human ubiquitin-C (hUbC) promoter, and the woodchuck hepatitis virus posttranscriptional response element (WPRE) for transgene mRNA stabilization. The vector also has the replication origin and ampicillin resistance gene (Ampr) from pBR322, enabling the plasmid to be maintained in E. coli (From Lois et al. 2002; Fasano et al. 2007; and http://www.addgene.org/25870/). The pCX4hyg vector contains the 5′ and 3′ LTRs, Psi packaging signal region, and splice acceptor site elements from the Moloney murine leukemia virus (MLV). It also contains an internal ribosome entry site (IRES) from encephalomyocarditis virus (EMCV), a gene encoding resistance to hygromycin (Hyg), the CMV immediate early enhancer, and the ampicillin resistance gene and origin of replication from pBR322 (From GenBank (accession # AB086387) and Akagi et al. 2003)

RNA and the reverse transcriptase and integrase enzymes that act when the recombinant viruses infect the final target host cell. However, the recombinant viruses lack the genes encoding these proteins, as well as all other genes necessary for producing new infectious virus in the target cells. The HIV Env protein in the membrane envelope of the virus normally binds to the CD4 protein on the surface of T cells (Quinonez and Sutton 2002). However, packaging cell lines are available that introduce envelope proteins from other viruses, enabling the recombinant virus to target a wide variety of cell types. Once introduced into the target cell, the viral RNA is reverse transcribed and transported into the nucleus. There, the viral integrase enzyme catalyzes irreversible integration of the dsDNA genome into the host chromosome. The insert gene is then expressed from the promoter carried in the vector. HIV-based viruses can be used for introducing genes for functional studies and in gene therapy applications. A potential drawback with these integrating viruses is that they might disrupt an important gene in the target host cell, with potential adverse consequences for the cell.

Adenovirus and Adeno-associated Viruses

Adenoviruses have linear dsDNA genomes of about 36 kb. They are taken up by endothelial cells via clathrin-mediated endocytosis and transported to the cell nucleus, where viral DNA replication is catalyzed by a virus-encoded DNA polymerase. Adenoviruses have been used for gene therapy, although they suffer from a number of drawbacks, including provoking an intense inflammatory response and being immunogenic (Chailertvanitkul and Pouton 1992). Thus, vectors derived from adenovirus have most viral genes deleted, in order to minimize the immune response in a human host (McConnell and Imperiale 2004). This also allows the vector to hold a relatively large DNA insert. As described above, the vector is first used to infect cells that produce proteins necessary for viral packaging. The resulting virus particles are then used to infect a target cell or individual.

Adeno-associated virus (AAV) has a small (~4.7 kb) linear ssDNA genome (Deyle and Russell 2009). AAV is naturally dependent on other viruses for replication and packaging proteins. AAV vectors typically have only the inverted terminal repeat sequences, between which the foreign DNA is inserted. Recombinant viruses are produced by infecting helper cells and then used to infect a target host cell. The viral DNA enters the nucleus of the infected cell, where it can exist

as a circular episome or integrate into the cell chromosome. Integration is rare, occurring with only about 0.1% of the viruses that enter cells.

Baculovirus Vectors

Baculovirus vectors that allow for high-level expression of cloned genes in eukaryotic cells (insect cells) were developed in the early 1980s. They are useful for the expression of proteins from mammalian and other eukaryotic sources, since the proteins undergo correct posttranslational modification, such as glycosylation, in the insect cell host (Miller 1989; Smith et al. 1983). These modifications do not occur when the protein is expressed in bacterial cells.

Baculoviruses have large (80–180 kb) dsDNA genomes (Miller 1989). They infect lepidopteran insects, such as Autographa californica (alfalfa looper), Bombyx mori (silkworm), and Spodoptera frugiperda (fall armyworm). The viral life cycle involves two stages. New viruses are formed and bud from the membrane of an infected cell, ca. 10–24 h postinfection. These viruses travel to and infect new cells in the infected organism. Later (ca. 18–70 h postinfection), membrane-enveloped virus particles accumulate in the nucleus of the cell in the form of viral occlusions. The occlusions include large amounts of a highly expressed viral protein called polyhedrin. Polyhedrin and a second viral protein called P10 are expressed at very high levels late in infection by the baculovirus.

Neither the polyhedrin nor the P10 protein is required for viral infection of a host cell or for production of infectious progeny viruses. Therefore, the DNA encoding either of these proteins can be replaced by foreign insert DNA. The insert DNA is then downstream of the strong polh or p10 promoter, and the foreign gene is expressed at high levels. This is the basis of using baculoviruses for expressing foreign proteins. Baculoviruses are rod shaped and normally contain 80–200 kb of dsDNA. The viral rod can expand to accommodate recombinant viruses containing large inserts, similar to filamentous bacteriophages.

Bacteriophage and Viral Cloning Vectors, Fig. 7 Construction of a recombinant baculovirus by homologous recombination. The gene to be expressed (yellow) is first cloned into a transfer plasmid by standard techniques in bacterial cells. The gene is downstream of the strong p10 promoter from baculovirus and flanked by the lef2 and orf1629 genes from the virus. The transfer plasmid is co-transfected into insect cells along with a linearized baculovirus genome with a truncated form of orf1629. Recombination between the homologous viral sequences in the two DNA molecules generates a complete recombinant baculovirus, with the DNA insert and p10 promoter. The recombinant baculovirus also gains a complete copy of orf1629, an essential viral gene, which facilitates selection of cells containing recombinant virus. Other variations are available (see van Oers 2011). Figure is based on van Oers (2011). Transfer plasmid and baculovirus are not drawn to the same scale

The size of the viral genome makes it impractical to insert DNA by simple ligation of DNA fragments. Instead, the insert DNA is first cloned into a transfer plasmid that is propagated in and isolated from E. coli cells. The transfer plasmid contains the promoter from either the polyhedrin gene (polh) or the P10 gene (p10). The promoter is flanked on both sides by viral DNA. The foreign DNA is inserted downstream of the promoter and within the flanking DNA by standard ligation methods. The recombinant transfer plasmid and baculovirus DNA are co-transfected into insect cells (Sf9 or Sf21 cells). Homologous recombination between the plasmid and the virus is catalyzed by the cellular recombination machinery (Fig. 7). The flanking DNA in the transfer plasmid replaces the corresponding viral DNA after crossovers on both sides of the DNA insert.

The first application of baculovirus vectors was the expression of human beta interferon (Smith et al. 1983). The beta interferon gene replaced the polyhedrin gene in the recombinant virus. The interferon protein was produced in large amounts, secreted from the cells as expected, and glycosylated.

Several modifications to the baculovirus system have made the cloning and screening procedures more convenient. The bacmid system was developed to simplify and streamline the process of obtaining recombinant virus containing the desired DNA insert (Luckow et al. 1993). The DNA insert is first ligated into a transfer plasmid, as described above. The transfer plasmid contains the polyhedrin gene promoter flanked by the left and right ends of the Tn7 transposon rather than by baculovirus DNA. The plasmid vector also contains the lacZα gene downstream of the polyhedrin promoter and a gentamicin resistance gene. The baculovirus is in the form of a shuttle vector, called a bacmid, that contains a mini-F replicon that can be replicated in E. coli and viral sequences required for virus production in insect cells. The bacmid also contains the attachment site for the Tn7 transposon (attTn7) and a kanamycin resistance gene. The transfer plasmid and bacmid are introduced into E. coli cells, along with a helper plasmid that encodes the Tn7 transposase (tnsABCD genes). The transposase catalyzes transposition of the mini-Tn7 (containing the insert DNA downstream of the polyhedrin promoter) into the bacmid. The recombinant bacmid can then be introduced into insect cells, where it replicates and expresses protein encoded by the inserted

gene. The lacZα gene allows for blue-white screening for recombinant bacmids.

The efficiency of recombinant virus isolation was improved by adding the sacB gene from Bacillus subtilis to the transfer plasmid used in the bacmid system, to make the pBVboost vector (Airenne et al. 2003). The sacB gene encodes the enzyme levansucrase, which produces a toxic product in E. coli when the cells are grown in the presence of sucrose. E. coli cells that still contain intact transfer plasmid after the Tn7-mediated transposition step, presumably because the transposition failed in that cell, are killed by growth on sucrose. This negative selection step makes it easier to isolate the desired recombinant baculoviruses (bacmids).
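The insert-size limits quoted for the phage-based vectors in this entry can be collected into a small worked example. The Python sketch below applies the 75–105% lambda packaging window to the 48 kb genome length given in the text and lists which vectors could, in principle, hold an insert of a given size. The capacity table uses only the figures stated in this entry; lower bounds of 0 kb are placeholders (an assumption) wherever the text gives only an upper limit, and the function names are illustrative.

```python
# Insert capacities in kb, taken from the figures quoted in this entry.
# A lower bound of 0.0 is an assumption where only an upper limit is stated.
VECTOR_CAPACITY_KB = {
    "lambda gt11 (insertion vector)": (0.0, 7.2),
    "lambda replacement vector": (0.0, 23.0),
    "cosmid": (35.0, 45.0),
    "P1 vector": (0.0, 100.0),
}


def lambda_packaging_window(genome_kb=48.0, low=0.75, high=1.05):
    """Return (min, max) DNA length in kb that can fill a lambda head."""
    return genome_kb * low, genome_kb * high


def insert_window(vector_kb, genome_kb=48.0):
    """Insert sizes (kb) keeping vector + insert inside the packaging window."""
    min_total, max_total = lambda_packaging_window(genome_kb)
    return max(0.0, min_total - vector_kb), max_total - vector_kb


def candidate_vectors(insert_kb):
    """List vectors whose quoted capacity range includes the insert size."""
    return [name for name, (low, high) in VECTOR_CAPACITY_KB.items()
            if low <= insert_kb <= high]


if __name__ == "__main__":
    print(lambda_packaging_window())   # total packageable length, kb
    print(insert_window(5.0))          # window for a ~5 kb cosmid backbone
    print(candidate_vectors(40.0))     # vectors suited to a 40 kb insert
```

For a cosmid backbone of about 5 kb, `insert_window(5.0)` gives roughly 31 to 45.4 kb, which is in the same range as the 35–45 kb quoted above for cosmid inserts. The calculation is only a first approximation, since real vectors differ in arm length and in how much of the nonessential region is retained.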

Cross-References

▶ Artificial Chromosomes
▶ Blue/White Selection
▶ Plasmid Cloning Vectors
▶ Recombineering

References

Airenne KJ, Peltomaa E, Hytonen VP, Laitinen OH, Yla-Herttuala S (2003) Improved generation of recombinant baculovirus genomes in Escherichia coli. Nucleic Acids Res 31:e101
Akagi T, Sasai K, Hanafusa H (2003) Refractory nature of normal human diploid fibroblasts with respect to oncogene-mediated transformation. Proc Natl Acad Sci USA 100:13567–13572
Berg P, Mertz JE (2010) Personal reflections on the origins and emergence of recombinant DNA technology. Genetics 184:9–17
Blesch A (2004) Lentiviral and MLV based retroviral vectors for ex vivo and in vivo gene transfer. Methods 33:164–172
Chailertvanitkul VA, Pouton CW (1992) Adenovirus: a blueprint for non-viral gene delivery. Curr Opin Biotechnol 21:627–632
Chauthaiwale VM, Therwath A, Deshpande VV (1992) Bacteriophage lambda as a cloning vector. Microbiol Rev 56:577–591
Collins J, Hohn B (1978) Cosmids: a type of plasmid gene-cloning vector that is packageable in vitro in bacteriophage lambda heads. Proc Natl Acad Sci USA 75:4242–4246
Deyle DR, Russell DW (2009) Adeno-associated virus vector integration. Curr Opin Mol Ther 11:442–447
Fasano CA, Dimos JT, Ivanova NB, Lowry N, Lemischka IR, Temple S (2007) shRNA knockdown of Bmi-1 reveals a critical role for p21-Rb pathway in NSC self-renewal during development. Cell Stem Cell 1:87–99
Kornberg A, Baker TA (1991) DNA replication, 2nd edn. W. H. Freeman, New York
Krupovic M, Prangishvili D, Hendrix RW, Bamford DH (2011) Genomics of bacterial and archaeal viruses: dynamics within the prokaryotic virosphere. Microbiol Mol Biol Rev 75:610–635
Lobocka MB, Rose DJ, Plunkett G 3rd, Rusin M, Samojedny A, Lehnherr H, Yarmolinsky MB, Blattner FR (2004) Genome of bacteriophage P1. J Bacteriol 186:7032–7068
Lois C, Hong EJ, Pease S, Brown EJ, Baltimore D (2002) Germline transmission and tissue-specific expression of transgenes delivered by lentiviral vectors. Science 295:868–872
Luckow VA, Lee SC, Barry GF, Olins PO (1993) Efficient generation of infectious recombinant baculoviruses by site-specific transposon-mediated insertion of foreign genes into a baculovirus genome propagated in Escherichia coli. J Virol 67:4566–4579
McConnell MJ, Imperiale MJ (2004) Biology of adenovirus and its use as a vector for gene therapy. Hum Gene Ther 15:1022–1033
Mead DA, Szczesna-Skorupa E, Kemper B (1986) Single-stranded DNA ‘blue’ T7 promoter plasmids: a versatile tandem promoter system for cloning and protein engineering. Protein Eng 1:67–74
Messing J (1983) New M13 vectors for cloning. Methods Enzymol 101:20–78
Messing J (1991) Cloning in M13 phage or how to use biology at its best. Gene 100:3–12
Miller LK (1989) Insect baculoviruses: powerful gene expression vectors. Bioessays 11:91–95
Murray NE, Murray K (1974) Manipulation of restriction targets in phage lambda to form receptor chromosomes for DNA fragments. Nature 251:476–481
Patel M, Yang S (2010) Advances in reprogramming somatic cells to induced pluripotent stem cells. Stem Cell Rev 6:367–380
Quinonez R, Sutton RE (2002) Lentiviral vectors for gene delivery into cells. DNA Cell Biol 21:937–951
Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor
Short JM, Fernandez JM, Sorge JA, Huse WD (1988) Lambda ZAP: a bacteriophage lambda expression vector with in vivo excision properties. Nucleic Acids Res 16:7583–7600
Smith GE, Summers MD, Fraser MJ (1983) Production of human beta interferon in insect cells infected with a baculovirus expression vector. Mol Cell Biol 3:2156–2165
Sternberg N (1990) Bacteriophage P1 cloning system for the isolation, amplification, and recovery of DNA fragments as large as 100 kilobase pairs. Proc Natl Acad Sci USA 87:103–107
Thomas M, Cameron JR, Davis RW (1974) Viable molecular hybrids of bacteriophage lambda and eukaryotic DNA. Proc Natl Acad Sci USA 71:4579–4583
van Oers MM (2011) Opportunities and challenges for the baculovirus expression system. J Invertebr Pathol 107(Suppl):S3–S15
Weigel C, Seitz H (2006) Bacteriophage replication modules. FEMS Microbiol Rev 30:321–381
Young RA, Davis RW (1983) Efficient isolation of genes by using antibody probes. Proc Natl Acad Sci USA 80:1194–1198

Base Intercalation in DNA
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition
Intercalation is the stacking of a molecule between two bases in DNA (within one strand). This is a common process with a number of aromatic molecules and is driven by π–π, hydrophobic, steric, and other interactions.

Base Intercalation in DNA, Fig. 1 Intercalation of aflatoxin B1 8,9-epoxides into DNA, stacking with guanine (Iyer et al. 1994)


Discussion
One important mechanism in the interaction of carcinogens with DNA is base intercalation. This phenomenon is common with polycyclic aromatic structures, e.g., polycyclic aromatic hydrocarbons and aflatoxin B1. The process involves “stacking” of one or more aromatic rings between two bases, particularly purines. The physical chemistry involves π interactions between the rings, allowing the carcinogen to slip in between two bases in a sandwich mode. Intercalators can themselves be mutagenic in that they can induce frameshift mutations without binding covalently, e.g., ethidium bromide. With reactive molecules such as aflatoxin B1 8,9-epoxide and benzo[a]pyrene 7,8-dihydrodiol 9,10-epoxide, intercalation is important in determining the sequence selectivity of reactions because the initial event is intercalation, which appears to happen very rapidly. With aflatoxin B1, intercalation occurs with both the exo- and endo-isomers of the epoxide. In the case of the exo-isomer, the C8 atom of the epoxide becomes positioned for a very efficient SN2 attack by the N7 atom of a guanine (Iyer et al. 1994; Fig. 1). With the endo-isomer, intercalation also occurs, but the juxtaposition of the guanine N7 atom (and any others) is such that no reaction occurs, and the epoxide is instead inactivated by hydrolysis. Intercalation also determines the sequence-specific course of binding of aflatoxin B1 epoxides (Gopalakrishnan et al. 1990).


Several methods can be used to test for the occurrence of intercalation. One is the displacement of known intercalators such as ethidium bromide. Another is an upfield shift in the 1H-NMR spectrum of the ligand in the presence of a double-stranded oligonucleotide (Gopalakrishnan et al. 1990).

Cross-References ▶ Adducts on Tm, Effects of ▶ DNA Base Pairing, Modes of ▶ Spectroscopy of Damaged DNA

References
Gopalakrishnan S, Harris TM, Stone MP (1990) Intercalation of aflatoxin B1 in two oligodeoxynucleotide adducts: comparative 1H NMR analysis of d(ATCAFBGAT).d(ATCGAT) and d(ATAFBGCAT)2. Biochemistry 29:10438–10448
Iyer R, Coles B, Raney KD et al (1994) DNA adduction by the potent carcinogen aflatoxin B1: mechanistic studies. J Am Chem Soc 116:1603–1609

Base Substitution Mutation
▶ Mismatch Repair

Bioactivation of Carcinogens
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms
Metabolic activation

Definition
Most chemicals that cause cancer do not react directly with DNA. They are first converted (usually by oxidation, reduction, or conjugation) to reactive electrophilic products that form covalent adducts with DNA.

Discussion
Many carcinogens and mutagens are not inherently reactive themselves but are converted to reactive species by enzymes in the body. Such compounds are thus termed “pro-mutagens” or “procarcinogens.” The concept of bioactivation of these compounds was largely first established by James and Elizabeth Miller of the University of Wisconsin (Miller and Miller 1947). A common mode of bioactivation involves oxidation, particularly by cytochrome P450 enzymes, but conjugation and radical pathways are also well established. In some cases, multiple steps are involved in the activation of a pro-mutagen to an entity that reacts with DNA. Enzymatic reactions are also involved in the detoxication of pro-mutagens and partially activated products. Several examples of bioactivation and detoxication reactions are shown in Fig. 1. For more extensive compilations, see Rendic and Guengerich (2012).

Bioactivation of Carcinogens, Fig. 1 Examples of enzymatic oxidation of chemicals to forms that react with DNA

Cross-References ▶ Damage DNA, Natural Products that ▶ DNA Damage by Endogenous Chemicals ▶ DNA Damage, Types of ▶ Electrophiles, Types of ▶ Selectivity of Chemicals for DNA Damage


References
Miller EC, Miller JA (1947) The presence and significance of bound amino azodyes in the livers of rats fed p-dimethylaminoazobenzene. Cancer Res 7:468–480
Rendic S, Guengerich FP (2012) Contributions of human enzymes in carcinogen metabolism. Chem Res Toxicol 25:1316–1383

Bioinorganic Chemistry
Rosette Roat-Malone
Chemistry Department, Washington College, Chestertown, MD, USA

Synonyms
Biological inorganic chemistry; Inorganic chemistry of biological compounds

Synopsis
This entry will discuss:
• The way in which sodium, potassium, and calcium ions move across membranes and within cells to facilitate important cellular processes
• The role of magnesium in catalytic RNA
• The numerous proteins and enzymes that contain porphyrin cofactors
• Iron–sulfur clusters, important in biological redox chemistry
• The role of some metalloenzymes that facilitate or moderate the reactions of dioxygen (O2), peroxide (O22−), and superoxide (O2−)
• Efforts by bioinorganic chemists to synthesize small molecules that model the biological activity of proteins and enzymes
• The toxicity of certain inorganic species in biological systems
• Metals used in medical applications
See Crichton (2008), Bertini et al. (2007), and Roat-Malone (2007).

Introduction
Bioinorganic chemistry studies the role of inorganic species in biological compounds. In this regard, proteins, enzymes, DNA, and RNA are the most important biological entities that incorporate inorganic species. In many cases, the important inorganic species are metal cations in coordination spheres that help determine the shape and reactivity of proteins and enzymes. Inorganic anions such as phosphate, sulfate, chloride, iodide, fluoride, and oxide play important roles in maintaining electronic neutrality. Peroxide ions have important roles in oxidation and reduction reactions. Reactive superoxide ions must be dealt with to prevent unwanted and dangerous biological side reactions. Neutral inorganic species have less well-known biological activity, with the exception of nitric oxide (NO). This important molecule participates in the regulation and mediation of many nervous and immune system functions and takes part in many cardiovascular processes.


Analytical methods used by bioinorganic chemists provide important structural and functional information about the systems being studied. X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy visualize biological molecules with the goal of understanding how the structure relates to biological function. NMR spectroscopy studies biological molecules in an aqueous environment, more closely related to actual physiological conditions. However, NMR spectroscopy has been limited to the study of small proteins or parts of more complex bioinorganic systems. X-ray crystallography can determine the structure of larger biological molecules but requires the production of a suitable single crystal. The solid-state crystal studied in any X-ray crystallographic determination is, by its nature, a snapshot of the protein in one of many states available to it. Biological systems that contain metal ions with unpaired electrons in “d” shells can be studied by electron paramagnetic resonance (EPR). This technique is especially useful for determining the oxidation state of metal ions. Extended X-ray absorption fine structure (EXAFS) can confirm the length of metal ion–ligand bonds. Bioinorganic chemists synthesize metal-containing molecules that are models of bioinorganic systems and thereby learn more about how metal ions function in these biological systems. They study the toxicology of metals in biological systems. Serendipitously and by design, chemists have developed metal-containing compounds that are used to treat diseases such as cancer, rheumatoid arthritis, and diabetes. Technetium compounds are used widely in diagnostic medicine.

Transport and Bonding in Bioinorganic Systems
Metals in biological systems almost exclusively exist as ions in positive oxidation states. Metal ions may be transported to protein and enzyme active sites as aqueous solvated ions, always with counterions (Cl−, I−, SO42−, NO3−, PO43−) to maintain charge neutrality.
The concentration of neutral nonmetal species such as dioxygen (O2) and carbon dioxide (CO2) in mammalian species is controlled by inhalation and exhalation through the lungs and by the movement of blood: O2 is carried by complexation with the heme group of myoglobin and hemoglobin, and CO2 largely through reaction with water to form blood-soluble bicarbonate (HCO3−) ions. Intake of carbon monoxide (CO) is toxic because it binds to hemoglobin’s heme group, displacing O2 and causing asphyxiation. Nitric oxide (NO), an important vasodilator and controller of blood flow, is synthesized endogenously by nitric oxide synthase (NOS) enzymes from the amino acid arginine, O2, and the cofactor nicotinamide adenine dinucleotide phosphate. Studies of homeostasis, the maintenance of proper concentrations of biological molecules and ions at sites where they are required, constitute an important research topic for bioinorganic chemists. Once metal ions reach their biological target protein or enzyme, they must be incorporated and held in place for purposes of the protein’s structural integrity or for participation in enzymatic activity upon a substrate molecule or ion. Chaperone proteins guide metal ions to their positions in a protein or enzyme and help assemble its active conformation (the protein’s fold). Usually the metal ion forms coordinate covalent bonds with ligands present in the side chain groups of amino acids in the protein’s primary structure or with ligands provided by prosthetic groups incorporated into the protein. See Fig. 1. Metals may also bind to groups along the protein’s main chain, principally at carbonyl (C=O) oxygens. The coordination spheres of metal ions vary significantly depending upon the nature of the metal itself, i.e., main group or transition, and are often distorted from ideal geometry with long and short bond lengths and empty coordination sites. The tables included herein illustrate metal ion coordination spheres and preferred ligands for a number of metal ions.
All tables include the accession numbers for structures determined by X-ray crystallography or NMR as found in the Protein Data Bank, PDB, at the website www.pdb.org.


Bioinorganic Chemistry, Fig. 1 Common metal ion bonding modes to amino acid side chains (Reprinted from Fig. 2.5 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc.)

Metals in Biological Systems Acting as Charge Carriers, Reaction Triggers, and Structure Facilitators
See Table 1.

Ion Channels
Homeostasis can be defined as maintenance of the steady-state concentration of ions, e.g., H+, Na+, K+, Ca2+, Mg2+, Zn2+, PO43−, and Cl−, together with control of charge neutrality and energy flow inside and outside of cells. The concentration of potassium ion, for instance, is maintained at approximately 5 mM extracellularly but at approximately 150 mM intracellularly. Passive diffusion of ions through cell membranes, based on ion concentration, can occur. More commonly, ions move by facilitated diffusion and active transport through ion channels and ion pumps, proteins that span cell membranes and regulate the flow of ions in and out of cells to maintain homeostasis. They exhibit complex protein structures and exist in many different varieties. Voltage-gated channels open and close based on the electrical potential across the membrane. Ligand-gated channels open and close based on the binding of specific molecules to the extracellular portion of the ion channel proteins. Chemical and physical stimulation may also control the activity of ion channels. Proper ionic balance inside and outside of cells, maintained through ion channels, is very important for all muscle and nerve function. Ion channels are highly selective for potassium (K+), sodium (Na+), and calcium (Ca2+) ions and for protons (H+). Ion selection takes place based on size and

repulsion of like charges. Ions must pass through the ion channel and not become attached to it. P-type ATPases (adenosine triphosphatases), known as proton or cation pumps, are one example of proteins that maintain homeostasis through electrochemical gradients and active transport of ions across cell boundaries. During this type of active transport, a phosphorylated aspartic acid residue is formed within the protein and energy is released through the ATP → ADP reaction. P-type ATPases include Na+, K+- and Ca2+-ATPases. One such pump, Na+, K+-ATPase (Na+, K+-adenosine triphosphatase), or the sodium pump, is known to hydrolyze one-quarter of all cytoplasmic ATP in resting humans. In nerve cells, approximately 70% of ATP is consumed to fuel this enzyme. In Na+, K+-ATPase, sodium ions are pumped out of the cytoplasm, while potassium ions are pumped in from outside the cell, according to the equation:

ATP + H2O + Na+(in) + K+(out) → ADP + phosphate (PO43−) + Na+(out) + K+(in)

Bioinorganic Chemistry, Table 1 Metals in biological systems: charge carriers, triggers, structure facilitators

| Metal | Coordination number, geometry | Preferred ligands (amino acids, phosphate groups, sugars, nucleic acids) | Functions and examples (PDB accession number, see www.pdb.org) |
| Sodium, Na+ | 6, octahedral | O-Ether, hydroxyl, carboxylate (aspartic and glutamic acids, glutamine, serine, threonine) | Charge carrier, osmotic balance, nerve impulses. Sodium pump, Na+/K+-ATPase (PDB 2ZXE, 1Q3I) |
| Potassium, K+ | 6–8, flexible | O-Ether, hydroxyl, carboxylate (serine, threonine, aspartic acid) | Charge carrier, osmotic balance, nerve impulses. Na+/K+-ATPase (PDB 2ZXE); potassium ion channels (PDB 1BL8) |
| Magnesium, Mg2+ | 6, octahedral | O-Carboxylate (aspartic and glutamic acids, phosphate and sugar oxygen atoms) (ring nitrogens of nucleic acids) | Structure in hydrolases, isomerases, phosphate transfer, trigger reactions. Pyruvate kinase (PDB 1A49); group II intron ribozymes (PDB 3BWP); hammerhead ribozyme (PDB 2QUS, 2QUW) |
| Calcium, Ca2+ | 6–8, flexible | O-Carboxylate, carbonyl (C=O) (phosphate, side chain O groups of aspartic and glutamic acids, threonine, main chain C=O groups of alanine, asparagine, isoleucine, and valine) | Structure, charge carrier, phosphate transfer, trigger reactions. Ca2+-ATPase (PDB 3AR2, 2ZBD, 1WPG) |
| Zinc, Zn2+ | 4, tetrahedral | O-Carboxylate, carbonyl, S-thiolate, N-imidazole. Sγ of cysteines and Nε2 of histidines in zinc fingers; bidentate Oδ of aspartic acid, Nδ1 atoms of histidines; Sγ of cysteines, Nε2 atoms of histidine, Oδ of aspartic acid | Structure in zinc fingers, superoxide dismutase, gene regulation, anhydrases, dehydrogenases. Zinc fingers (PDB 1P47); superoxide dismutase (PDB 2SOD); carbonic anhydrase (PDB 1DDZ) |
| Manganese, Mn2+ | 6, octahedral | O-Carboxylate, phosphate, N-imidazole. Bidentate Oδ of aspartic and glutamic acids, Nε2 of histidines | Structure in oxidases, photosynthesis. Photosystem II (PDB 1W5C) |

Bioinorganic Chemistry, Fig. 2 The Ca2+-ATPase reaction cycle (Reprinted with permission from Fig. 6.28 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc. Adapted from Fig. 1 Lancaster CRD (2004) Nature 432:286–287. Copyright (2004) Nature Publishing Group)

One study of the Na+, K+-ATPase (113 kDa, kDa = 1,000 g/mol) enzyme from pig has been carried out by the K. O. Hakansson group, and its X-ray crystallographic structure has been deposited in the Protein Data Bank (PDB) with the accession number 1Q3I. The numbering system in the following description is taken from the PDB 1Q3I structure. The catalytic alpha subunit contains 10 polypeptide segments that traverse the cell membrane (transmembrane segments) and three cytoplasmic segments: (1) the actuator (A) domain at the protein’s N-terminal side, (2) the N-domain where the ATP nucleotide binds (arg378–arg589), and (3) the P-domain that contains the aspartic acid residue to be phosphorylated (leu354–arg377 and ala590–leu773; asp369 is phosphorylated). The transmembrane segments are involved with the active transport of three Na+ ions out of and two K+ ions into the cell. The events can be summarized as follows:
• The ion pump, with bound ATP, binds three intracellular Na+ ions.
• ATP is hydrolyzed and asp369 is phosphorylated.
• Conformational changes take place, allowing the Na+ ions to escape.
• The pump binds two extracellular K+ ions (perhaps with concomitant dephosphorylation of the alpha subunit).
• ATP binds and the pump changes conformation again to release K+ ions into the cytoplasm (cell interior), completing the cycle.
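The potassium concentrations quoted earlier in this section (approximately 5 mM outside the cell, 150 mM inside) and the pump stoichiometry of three Na+ out for two K+ in can be turned into numbers with a short calculation. The sketch below is illustrative only: it applies the standard Nernst equation, and the body temperature of 310.15 K, the function names, and the use of the Nernst equation itself are assumptions added here, not part of the original entry.

```python
import math

R = 8.314462618   # molar gas constant, J/(mol*K)
F = 96485.33212   # Faraday constant, C/mol


def nernst_potential_mV(conc_out_mM, conc_in_mM, charge=1, temp_K=310.15):
    """Equilibrium (Nernst) potential of an ion, inside relative to outside, in mV.

    temp_K defaults to an assumed body temperature of 37 degrees C.
    """
    return 1000.0 * (R * temp_K) / (charge * F) * math.log(conc_out_mM / conc_in_mM)


def net_charge_exported_per_cycle(na_out=3, k_in=2):
    """Net positive charges leaving the cell per Na+, K+-ATPase cycle."""
    return na_out - k_in


# K+ concentrations taken from the text: ~5 mM outside, ~150 mM inside.
e_k = nernst_potential_mV(conc_out_mM=5.0, conc_in_mM=150.0)
net_charge = net_charge_exported_per_cycle()

print(f"K+ equilibrium potential: {e_k:.1f} mV")
print(f"Net positive charge exported per pump cycle: {net_charge}")
```

Under these assumptions the K+ equilibrium potential comes out near −91 mV, and each pump cycle exports one net positive charge, which is why the sodium pump is described as electrogenic.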


Calcium Pumps
In the extensively studied calcium ion pump (Ca2+-ATPase), much of the Na+, K+-ATPase tertiary structure is preserved and many amino acid residues perform identical tasks. This large protein containing many subunits links muscle excitation with contraction and is involved in neuronal excitation as well. The SERCA (sarco(endo)plasmic reticulum Ca2+-ATPase) pumps transport calcium into the lumen of the sarcoplasmic reticulum (SR) and participate in muscle contraction–relaxation cycles. Calcium ions stored in the vesicles and tubules of the SR, the specialized endoplasmic reticulum (ER) of muscle cells, are released into the sarcoplasm, the cytoplasm of the muscle fiber, to initiate muscle contraction. Calcium pumps, Ca2+-ATPases, then move the calcium ions from the cytoplasm back into the SR lumen as one step of the muscle relaxation process, completing the cycle. See PDB 2ZBD and Fig. 2. Figure 2 indicates the Protein Data Bank (PDB, www.pdb.org) accession numbers for the known X-ray crystallographic structures corresponding to the various cycle intermediates. The reaction cycle proceeds from the upper left in Fig. 2 as the high-affinity form of the enzyme, E1, receives 2 Ca2+ ions from the cytoplasm and releases 2–3 H+ ions to the cytoplasm. ATP complexed with Mg2+ ions enters and binds at a specific enzyme site. The enzyme is then phosphorylated at a specific aspartic acid residue, as energy is imparted to the system through ATP → ADP hydrolysis. ADP is then released as the cycle continues: 2 Ca2+ ions are released to the lumen as 2–3 protons enter from the lumen. Phosphate ion is released along with magnesium ions to form the low-affinity E2 state, which is transformed to the E1 state to restart the cycle.

Bioinorganic Chemistry, Fig. 3 (a) Top-down view of the K+ ion channel, PDB: 1BL8. (b) Side view of the K+ ion channel, PDB: 1BL8 (Reprinted from Fig. 5.5 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc.)

Potassium Ion Channels
Potassium ion channels occur in great variety; the superfamily is described at the “Nomenclature of the potassium channel superfamily” website at https://www.ipmc.cnrs.fr/~duprat/ipmc/nomenclature.htm. For instance, the voltage-gated potassium ion channel (Kv) family contains almost 40 members divided into 12 subfamilies. Kv channels repolarize the cell membrane following action potentials due to neuron firing. Taking neuron firing as an example of voltage-gated ion channel activity, K+ (positive charge) builds up inside the cell, creating an electric voltage across the membrane. This causes the potassium ion channel to open, spilling out K+ ions, after which the membrane returns to the resting state. These particular ion channels generate nerve impulses, the basis of all movement and sensation. Malfunctions in these ion channels can lead to disease states such as diabetes, epilepsy, and irregular heartbeat (arrhythmia). Researchers now know that many potassium ion channels are similar in construction and consist of a tetramer of four subunits having four-fold symmetry about a central pore through which

potassium ions pass on their way into the cell. The four principal α-subunits are each composed of four domains (I–IV). Each domain consists of transmembrane α-helices, with the four domains surrounding the central pore. Potassium ions enter through laterally oriented ports in each domain. Through site-directed mutagenesis (changing amino acid residues along a protein chain through gene mutation at a specific DNA site), several amino acid residues have been identified as essential to K+ channel function. Further studies pinpointed a K+ channel signature sequence of threonine, X (where X is a hydrophobic amino acid such as valine), glycine, tyrosine, glycine, aspartic acid (TX(V)GYGDX). Potassium ions that have shed their aqueous hydration shell enter the pore and are coordinated by oxygen atoms from backbone carbonyl groups of the signature sequence. The sequence has become known as the selectivity filter in potassium and other ion channels. The MacKinnon group published the first X-ray crystallographic visualization of a potassium ion channel in 1998, and Roderick MacKinnon received the Nobel Prize in 2003 for his work on ion channels. See PDB 1BL8 and Fig. 3. An “Overview of Molecular Relationships in the Voltage-Gated Ion Channel Superfamily” has been published. See Yu et al. (2005).

Magnesium in Ribozymes
In 1982, T. R. Cech and colleagues reported an RNA construct that conducts cleavage or ligation reactions as part of the posttranscriptional modification processes leading to mature RNA (maturation). The construct acquired the name ribozyme to emphasize that it is a catalytic RNA: a nucleic acid polymer catalyzing biochemical reactions in a manner similar to enzymes, which are amino acid polymers. The fact that RNA (genetic material) can also be a biological catalyst (like protein enzymes) contributes to the “RNA world” hypothesis and suggests RNA was important in the evolution of prebiotic self-replicating systems. The first ribozymes characterized were called group I intron ribozymes. Introns are defined as sections of DNA within a gene that do not encode protein and are removed by group I intron ribozymes during maturation. Ribozyme splicing connects exons. Exons are DNA sections not removed from transcribed RNA, most but not all of which encode protein. Subsequent research has led to a new field of catalytic bioinorganic chemistry encompassing not just group I and group II intron ribozymes but also another large group of ribozymes, called hammerhead ribozymes because of their claw shape, principally found in plant viruses. Group I intron ribozymes may weigh up to 85 kDa, while the generally larger group II intron ribozymes may weigh 135 kDa or more. The smaller hammerhead ribozymes vary greatly in size, weighing between 15 and 45 kDa. Most ribozymes incorporate magnesium ions (Mg2+) bonded to phosphate (PO43−) ion oxygen or to nitrogen and oxygen atoms of nucleic acids, principally guanines. Magnesium ions have been proven to have a catalytic role in the activity of group I and group II intron ribozymes, facilitating reactive transition states for the cleavage and ligation reactions. In group II introns, the active site involves two divalent metal ions (usually Mg2+), and a two-metal ion mechanism for catalysis has been developed. Uncertainty still exists as to whether the magnesium ion plays only a structural role in shaping hammerhead ribozymes or plays a catalytic role as well. See PDB 3BWP for the group II ribozyme and PDB 2QUS and 2QUW for the hammerhead ribozyme.

Zinc Fingers
Zinc finger proteins participate in protein–DNA interactions that protect double-stranded DNA from nuclease digestion. Although zinc ions (Zn2+) do not directly interact with DNA strands, they are required for correct folding of the zinc finger protein into its tertiary structure and for specific DNA interactions. One important reason that zinc ions serve this purpose is that their stable, filled d10 shell configuration precludes oxidation–reduction reactions that could damage the DNA to which the zinc finger binds. Researchers have found the zinc finger motif recognizable in many different biological species, with common criteria given by:
• Multiple repeats of about 30 amino acids per zinc finger domain
• Zn2+ ligands standardized as two histidine (Nε2 atom) and two cysteine (Sγ atom) residues in tetrahedral geometry, with their separation conserved across many species; this is named the cys2his2 type zinc finger
• Two aromatic amino acid (aa) residues, usually phenylalanine (phe, F) and tyrosine (tyr, Y), plus the hydrophobic aa leucine (leu, L)
• A three-dimensional structure that resembles a finger
The commonly found zinc finger sequence, repeating multiple times in a given protein, is often (phe, tyr)-X-cys-(X)2–5-cys-(X)3-phe-(X)5-leu (or other hydrophobic residue)-(X)2-his-(X)3–5-his-(X)5, where X is any amino acid. The two cys and two his residues are the Zn2+-binding sites. A single zinc finger protein subunit weighs approximately 30 kDa, but reported structures can contain multiple protein subunits plus the DNA strand to which the finger binds. See PDB 1P47.

Enzymes Transporting Dioxygen
See Table 2.

Bioinorganic Chemistry, Table 2 Metals in biological systems: dioxygen (O2) transport

| Metal | Coordination number, geometry | Preferred ligands (amino acids, phosphate groups, sugars, nucleic acids) | Functions and examples (PDB accession number) |
| Iron, Fe2+, Fe3+ | 6, octahedral | N-Imidazole, porphyrin (Nε2 atoms of histidines, N atoms of protoporphyrin IX) | Dioxygen transport in myoglobin (Mb) and hemoglobin (Hb). Deoxy Mb (PDB 1MBN); oxy Mb (PDB 1MBO); deoxy Hb (PDB 2HHB, 3HHB, 4HHB); oxy Hb (PDB 1HHO) |
| Copper, Cu+, Cu2+ | 3, trigonal planar to trigonal pyramid | N-Imidazole (Nε2 atoms of histidines) | Dioxygen transport in hemocyanin (Hc). Deoxy Hc (PDB 1HC1); oxy Hc (PDB 1OXY) |

Bioinorganic Chemistry, Fig. 4 Heme protoporphyrin IX (heme b) as found in Hb, Mb, and some cytochromes (Reprinted from Fig. 7.1 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc.)

Hemoglobin and Myoglobin
Chemists have studied hemoglobin (Hb), a universally known bioinorganic molecule, since the 1800s. The protein reversibly binds dioxygen (O2) and transports this vital molecule throughout the bloodstream. Hemoglobin’s quaternary structure consists of four globular protein chains, each of which contains a prosthetic heme group described below. See Fig. 4. The four subunits exist as two identical alpha chains of 141 amino acids and two identical beta chains of 146 amino acids. The heme prosthetic group is required for hemoglobin’s dioxygen carrying capability: hemoglobin’s apoprotein state (without the heme group) fails to bind or transport O2. Hemoglobin binds O2 cooperatively, that is, the uptake of an O2 molecule by one subunit facilitates the uptake of O2 by subsequent subunits. Myoglobin (Mb), the dioxygen carrier to the muscle tissue, is a single-chain protein containing 153 or 154 amino acid residues (18 kDa with heme group) with a tertiary structure of eight alpha helices wrapped around a hydrophobic core. Myoglobin contains the same heme prosthetic group as hemoglobin. John Kendrew, a structural biology pioneer, carried out the first X-ray crystallographic study of deoxymyoglobin in 1958. See PDB 1MBN. The structure with dioxygen attached (PDB 1MBO) was first deposited in the Protein Data Bank in 1982. The structure shows the characteristic bonding of dioxygen: attachment at the vacant sixth coordination position of Fe, bent with respect to the heme plane. The heme prosthetic group of hemoglobin and myoglobin has been exhaustively studied. It consists of a protoporphyrin IX (also called heme b) molecule that donates four ligand N atoms to the central Fe ion. The Fe ion is additionally coordinated by the Nε2 atom of the “proximal” histidine, leaving one octahedral coordination site vacant. The vacant site serves to coordinate a dioxygen

molecule that is in turn hydrogen bonded to a hydrogen atom attached to the Nδ1 atom of a “distal” histidine. Hemoglobin was studied by X-ray crystallography beginning in the 1970s by M. F. Perutz and W. Bolton. See PDB 2DHB. This structure, of horse hemoglobin, shows the five-coordinate Fe2+, Fe(II), center bonded in an axial position to the Nε2 position of His87 in addition to the four-coordinate ligation to the protoporphyrin IX pyrrole nitrogen atoms. In the PDB 2HHB, 3HHB, and 4HHB deoxyhemoglobin X-ray structures published by the Perutz group in 1984, it is also noted that the Fe ion in deoxyhemoglobin bulges out of the porphyrin plane toward the proximal coordinating histidine by almost 0.50 Å. The X-ray structure of oxygenated hemoglobin was solved in 1983 by B. Shaanan and deposited as PDB 1HHO. The coordination of the iron ion is similar to that in the deoxyhemoglobin structure except that the iron ion now resides in the plane of the porphyrin ligand system and the sixth coordination position is occupied by a dioxygen molecule, bent with respect to the porphyrin plane as was found for myoglobin. See Fig. 5.

Bioinorganic Chemistry, Fig. 5 Quaternary structure of deoxyhemoglobin tetramer (PDB 4HHB) (Reprinted from Fig. 7.2 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc.)

Raman and infrared spectroscopy indicate that the bond length for the dioxygen in oxyhemoglobin is that of the superoxide ion (O2−), leading to the theory that the iron ion takes on the Fe3+, Fe(III), oxidation state in oxyhemoglobin. This theory has been confirmed by extensive analysis of deoxy- and oxyhemoglobin by many instrumental methods including:
• X-ray crystallography of the metalloprotein from other species and of systems modified through genetic manipulation to tease out structure–function relationships. In 2012, there were almost 600 X-ray crystallographic studies of hemoglobin and related structures listed in the Protein Data Bank (PDB, www.pdb.org).
• EXAFS (extended X-ray absorption fine structure), which can confirm the length of metal–ligand bonds.
• EPR (electron paramagnetic resonance), which can analyze the spin states of metal ion “d” shell electrons and thus determine oxidation states of metal ions in biological systems.
• NMR (nuclear magnetic resonance), which can study bioinorganic systems in solution.
Deoxy- and oxyhemoglobin have been studied extensively through the synthesis of model compounds by bioinorganic chemists. The study of model compounds for any bioinorganic system follows a strategy. First the metalloprotein is isolated, purified, crystallized, and studied by the various methods outlined above. Secondly the metal cofactor of interest in the metalloprotein is further characterized and compounds that may mimic the cofactor’s behavior are synthesized

and characterized. Finally, the model compound is compared to its biological counterpart to further study the structure–function relationships in the metalloprotein. A long-term goal in designing model compounds for hemoglobin has been to develop a blood substitute that could be used in trauma situations or in any situation where the blood supply is questionable. While some hemoglobin substitutes have undergone or are undergoing Food and Drug Administration (FDA) testing, this goal has not been achieved to date. The first model compounds, all incorporating planar porphyrin ring systems as Fe2+ ligands, were unsuccessful because of the formation of the iron μ-oxo dimer. Lacking the protective protein wrapping, the compounds formed Fe–O–O–Fe intermediates that quickly degenerated to the Fe3+–O–Fe3+ μ-oxo dimer, a thermodynamic dead-end product. Chemists then synthesized porphyrin ring systems with built-in steric hindrance, blocking μ-oxo dimer formation. Many such ligand systems have successfully bound and transported dioxygen through several oxy–deoxy cycles. The topic is reviewed in Momenteau and Reed (1994). However, none of these model compounds has been able to function over long periods at physiological conditions (aqueous solution, pH 6–8, temperature between 20 °C and 40 °C, 1 atm pressure). In a totally different approach, blood substitute research has concentrated on so-called polyhemoglobins (polyHb) and other variations on hemoglobin itself. The polyHb approach provides long-term storage capability for the blood substitutes and removes the danger of disease transmitted by whole red blood cells. In one system, recombinant and cell-free hemoglobin molecules are polymerized using the reaction of glutaraldehyde with side chain and terminal amino groups of hemoglobin. The polyHb has excellent oxygenation ability and does not suffer from one severe side reaction of cell-free hemoglobin itself: loss of oxygenation capability and combination with nitric oxide, NO. The attachment of NO to the heme of cell-free hemoglobin affects NO’s equilibrium concentration, and this in turn causes a dangerous blood pressure rise in the patient receiving the blood substitute. See Chang (2007).

Hemocyanin

Hemocyanin (Hc), a large multimeric protein found in arthropods and mollusks, binds dioxygen at a dinuclear type III copper site. See the ascorbate oxidase section for a description of type I, II, and III copper ions. In the deoxy state, the two copper ions are found in the colorless Cu+, Cu(I), oxidation state. An intermediate state having Cu(I) and Cu(II) centers is known. The blue oxy state, with two Cu2+, Cu(II), ions, leads to the bluish coloration of many species that use Hc as a dioxygen carrier. In arthropod hemocyanin, six subunits, each containing one dinuclear copper O2-coordinating center, aggregate to form a hexamer. Eight hexamers combine to form the final quaternary structure with 48 dinuclear copper dioxygen-binding sites. The molecular weight of these structures can reach as high as 3.5 × 10⁶ Da. Binding of O2 to some subunits causes changes in quaternary structure, an allosteric effect, that in turn enhance O2 uptake by the remaining subunits. In deoxy Hc, the two Cu(I) ions, varying in distance from each other depending on the species, are coordinated by conserved histidine Nε2 atoms. Three histidines coordinate to each copper ion, in a distorted trigonal planar to trigonal pyramidal environment. The X-ray crystallographic structure of deoxy Hc is deposited as PDB 1HC1. The X-ray crystallographic structure of the oxygenated form of hemocyanin has the PDB accession number 1OXY. In the Hc oxy form, the dioxygen is coordinated as an O₂²⁻ (peroxo) dianion bonded to both Cu(II) ions and equidistant from both, the μ-η2:η2 bridging (side-on) peroxo mode. See Fig. 6. Karlin and coworkers have studied copper model compounds for hemocyanin and found some models exhibiting μ-η2:η2-(side-on)peroxo dicopper(II), [CuII2(O₂²⁻)]2+, adduct geometry (like hemocyanin) while others exhibit bis(μ-oxo) dicopper(III), [CuIII2(O²⁻)2]2+, adduct geometry in reactions with O2. See Fig. 6.
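The distinction drawn above between dioxygen, superoxide, peroxide, and the fully cleaved bis(μ-oxo) form can be illustrated with a simple molecular orbital electron count. The following is only a minimal sketch, assuming the textbook filling order for the valence orbitals of an O–O unit; the function name `oo_bond_order` and the orbital table are illustrative and are not taken from the cited studies.

```python
# Valence molecular orbitals of an O-O unit in filling order:
# (label, electron capacity, True if bonding / False if antibonding).
# Assumption: simple textbook MO picture, no metal-ligand mixing.
O2_VALENCE_MOS = [
    ("sigma(2s)", 2, True),
    ("sigma*(2s)", 2, False),
    ("sigma(2p)", 2, True),
    ("pi(2p)", 4, True),
    ("pi*(2p)", 4, False),
    ("sigma*(2p)", 2, False),
]


def oo_bond_order(charge):
    """Formal O-O bond order of an O2 unit carrying the given net charge."""
    electrons = 12 - charge  # neutral O2 has 12 valence electrons
    bonding = antibonding = 0
    for _label, capacity, is_bonding in O2_VALENCE_MOS:
        filled = min(capacity, electrons)
        electrons -= filled
        if is_bonding:
            bonding += filled
        else:
            antibonding += filled
    return (bonding - antibonding) / 2


for name, charge in [("dioxygen", 0), ("superoxide", -1),
                     ("peroxide", -2), ("two oxides", -4)]:
    print(f"{name:>10}: formal O-O bond order {oo_bond_order(charge)}")
```

In this picture the formal bond order falls from 2 (dioxygen) to 1.5 (superoxide) to 1 (peroxide) and reaches 0 when four electrons have been added, which corresponds to two separate oxide ligands with no O–O bond, as in the bis(μ-oxo) core.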
Bioinorganic Chemistry, Fig. 6 Equilibrium between side-on peroxo and bis(μ-oxo) dicopper cores. Side-on peroxo [CuII2(O₂²⁻)]2+: LMCT ~360 nm (intense) and ~520 nm (weak); ν(O–O) ~750 cm−1; ν(Cu–Cu) ~280 cm−1; Cu···Cu ~3.6 Å; O–O ~1.4 Å. Bis(μ-oxo) [CuIII2(O²⁻)2]2+: LMCT ~300 nm (intense) and ~400 nm (intense); ν(O–O) not applicable; ν(Cu–O) ~600 cm−1; Cu···Cu ~2.8 Å; O···O ~2.3 Å (Reprinted with permission from Fig. 1 of Hatcher et al. (2006). Copyright (2006) American Chemical Society)

The basic starting materials for the reactions are the model compounds [CuI(R-PYAN)(MeCN)n]B(C6F5)4, where R-PYAN = N-[2-(4-R-pyridin-2-yl)ethyl]-N,N′,N′-trimethylpropane-1,3-diamine and R = NMe2, OMe, H, or Cl. If R = Cl or H, reaction with O2 at −78 °C in dichloromethane (CH2Cl2) results in a product whose ultraviolet–visible and resonance Raman spectra are consistent with a hemocyanin-like μ-η2:η2-(side-on)peroxo dicopper(II) adduct. If R = OMe, reaction products show equilibrium mixtures of side-on peroxo-Cu2(II) and bis(μ-oxo)-Cu2(III) species. If R = NMe2, the reaction product is exclusively the bis(μ-oxo)-Cu2(III) species. Changing the solvent to acetone or tetrahydrofuran increases the concentration of the bis(μ-oxo)-Cu2(III) species in all cases. The authors conclude that more electron-donating ligands favor formation of bis(μ-oxo)-Cu2(III) species. The researchers also studied oxidation reactions of the product species with multiple substrates, finding differential oxidative activity dependent on the substrate as well as on the oxidizing model complex. See Hatcher et al. (2006).

Metals in Biological Systems Facilitating Electron Transfer

See Table 3.

Iron–Sulfur Clusters

Iron–sulfur clusters, composed of iron ions in ferrous (Fe2+) and ferric (Fe3+) oxidation states and sulfur ions nominally in their S²⁻ oxidation state, occur in many proteins that are involved in electron transfer. Iron–sulfur clusters exist in a number of conformations including [2Fe:2S], [3Fe:4S], and [4Fe:4S] forms and are usually held within the protein by bonding to Sγ atoms of cysteine residues. As an example, the enzyme aconitase reversibly catalyzes the addition or

elimination of H2O in the second step of the citric acid (Krebs) cycle, transforming citrate to isocitrate through a cis-aconitate intermediate. Activation of the aconitate moiety requires conversion of a [3Fe:4S] cluster within the enzyme into a [4Fe:4S] cluster. See PDB 7ACN. Iron–sulfur clusters will be further described in discussions of cytochrome b(6)f and hydrogenase and nitrogenase enzymes.

Cytochromes

Cytochromes, widely found in plants and animals, are proteins intimately involved with electron transfer. They carry a variety of heme cofactors, cycling between Fe2+ and Fe3+ states, that carry the electrons being transferred or serve as catalytic sites for reduction of a substrate. Cytochrome superfamilies include those of cytochromes a, b, c, f, and P450 as listed in the Jena Library of Biological Molecules at http://www.fli-leibniz.de/IMAGE.html. For example, starting on the website home page, click on Search and then Main Search Page. In the “Search for a PDB/NDB entry via SWISS-PROT or SMART or Pfam” box, type “cytochrome c” and then limit the search by choosing only “SWISS-PROT entry description.” The result will list more than 1,300 entries of structures containing cytochrome c in part or whole in the Protein Data Bank, PDB, www.pdb.org. Cytochrome c oxidases (CcO) are included in an enzyme superfamily coupling oxidation of Fe(II) cytochrome c to the 4 e−/4 H+ reduction of O2 to H2O. The enzyme features two different heme a centers, a bimetallic copper site (CuA) and a

monometallic CuB center coupled to one of the hemes. A similar search on cytochrome c oxidase in the Jena Library of Biological Molecules results in over 450 entries.

Bioinorganic Chemistry, Table 3 Metals in biological systems: electron transfer

Metal | Coordination number, geometry | Preferred ligands (amino acids, phosphate groups, sugars, nucleic acids) | Functions and examples (PDB accession number)
Iron, Fe2+, Fe3+ | 4, tetrahedral | S-Thiolate: (Sγ of cysteines); (Sγ of cysteines) | Electron transfer, nitrogen fixation in nitrogenases: iron–sulfur clusters in nitrogenase (PDB 1NIP, 2NIP, 2MIN, 3MIN, 1N2C, 2AFH, 2AFI, 2AFK, 3U7Q); iron–sulfur clusters in aconitase (PDB 2B3X, 2B3Y)
Iron, Fe2+, Fe3+, Fe4+ | 6, octahedral | O-Carboxylate, oxide, alkoxide, phenolate, N-imidazole (Nε2 atoms of histidines, N atoms of porphyrin) | Electron transfer in oxidases, cytochromes: heme a and heme a3 in cytochrome c oxidase (PDB 2OCC)
Copper, Cu+, Cu2+ | 3, trigonal planar; 4, square planar | N-Imidazole, S-thiolate, thioether: (Nε2 and Nδ1 atoms of histidines); (Nε2 and Nδ1 atoms of histidines); (Nε2 and Nδ1 atoms of histidines, Sγ of cysteines, Sδ of methionines) | Electron transfer in heme-copper oxidases: type II CuB in cytochrome c oxidase (PDB 1OCC, 2OCC); laccase (PDB 2XU9); ascorbate oxidase (PDB 1ASO, 1ASP, 1ASQ)
Copper, Cu+, Cu2+ | 4, tetrahedral | S-Thiolate, thioether, N-imidazole: (Nδ1 atoms of histidines, Sγ atoms of cysteines); (Sγ atoms of cysteines, Sδ atoms of methionine, Nε2 and Nδ1 atoms of histidines); (Nε2 and Nδ1 atoms of histidines in superoxide dismutase); (Sγ atoms of cysteines, Sδ atoms of methionine, Nε2 and Nδ1 atoms of histidines) | Electron transfer in heme-copper oxidases, ascorbate oxidase, superoxide dismutase: type III CuA in cytochrome c oxidase (PDB 1OCC, 2OCC); ascorbate oxidase (PDB 1ASO, 1ASP, 1ASQ); distorted tetrahedral to square planar geometry in superoxide dismutase (PDB 2SOD, 3SOD, 1L3N); types I, II, and III copper sites in laccase (PDB 2XU9)

Cytochrome P450

Cytochrome P450 is one of a large group of iron-containing oxygenases that use O2 from air to oxidize organic molecules, for instance from an alkyl or aryl molecule to an alcohol according to the reaction:

RH + O2 + 2H+ + 2e− → ROH + H2O

Cytochrome P450 is known to catalyze the first step in the metabolism of most pharmacological compounds according to the above reaction. The

enzyme uses cofactors such as flavins or NAD(P)H to provide the necessary electrons for O2 reduction. Cytochrome P450’s iron-containing heme cofactor, similar to that of hemoglobin, coordinates and activates the dioxygen molecule. Instead of coordination to the proximal histidine as in myoglobin and hemoglobin, the heme center is held at the active site by the Sγ atom of a cysteine amino acid residue. A well-studied substrate, camphor, has been used to work out the catalytic cycle involving Fe(II), ferrous; Fe(III), ferric; and Fe(IV), ferryl, protoporphyrin IX heme centers. See PDB 1DZ4, 1DZ6, 1DZ8, and 1DZ9. For the catalytic cycle, see also Denisov et al. (2005). Some of the cytochrome P450 intermediates that have been identified and studied for model compound formation (biomimetic chemistry), with the goal of discovering new methods for mild-condition oxidations of organic starting materials, are:
• Ferric–peroxo complexes, [Fe(III)–O₂²⁻]+ (the resting iron oxidation state)
• Oxoiron(IV) porphyrin π-cation radicals (commonly called “compound I” in peroxidase and catalase catalytic cycles), [Fe(IV)=O]+•
• Oxidant–iron(III) porphyrin adducts
• Oxoiron(V) porphyrins (isoelectronic with “compound I” intermediates)
• Oxoiron(IV) porphyrins (also known as “compound II” in peroxidase and catalase catalytic cycles)
Researchers have identified oxoiron(IV) porphyrin π-cation radicals, “compound I” [Fe(IV)=O]+•, as the active oxygenating species in cytochrome P450. The John T. Groves group identified one model compound that has been shown to afford the key π-cation radical and to hydroxylate xanthene to 9-xanthydrol. The reaction of Fe(III)-4-TMPyP (4-TMPyP is the model porphyrin ligand tetrakis(N-methyl-4-pyridinium)porphyrin) with a peroxybenzoic acid generates the compound I intermediate [O=Fe(IV)-4-TMPyP]+•, which then hydroxylates xanthene at extremely fast rates. See Groves and Bell (2009).
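The overall monooxygenase stoichiometry given at the start of this section (RH + O2 + 2H+ + 2e− → ROH + H2O) can be checked for atom and charge balance with a few lines of code. This is a minimal sketch only: the substrate is assumed to be methane (R = CH3) purely for illustration, and the helper names `totals` and `is_balanced` are hypothetical.

```python
from collections import Counter


def totals(side):
    """Sum atoms and charge over (coefficient, formula, charge) terms."""
    atoms, charge = Counter(), 0
    for coeff, formula, q in side:
        for element, n in formula.items():
            atoms[element] += coeff * n
        charge += coeff * q
    return atoms, charge


def is_balanced(reactants, products):
    """True when both sides carry the same atoms and the same net charge."""
    return totals(reactants) == totals(products)


# Assumption for illustration: RH is methane, so ROH is methanol.
reactants = [
    (1, {"C": 1, "H": 4}, 0),   # RH
    (1, {"O": 2}, 0),           # O2
    (2, {"H": 1}, +1),          # 2 H+
    (2, {}, -1),                # 2 e-
]
products = [
    (1, {"C": 1, "H": 4, "O": 1}, 0),  # ROH
    (1, {"H": 2, "O": 1}, 0),          # H2O
]

print("P450 hydroxylation balanced:", is_balanced(reactants, products))
```

Dropping the two electrons from the reactant list leaves the atoms balanced but the charge unbalanced, which is one way to see why an external reductant such as NAD(P)H is needed: one oxygen atom of O2 is inserted into the substrate and the other is reduced to water.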

Cytochrome b(6)f and the Photosynthetic Pathway

Cytochrome b(6)f, found in green plant membranes, mediates electron transport from photosystem II into photosystem I while also generating a transmembrane proton gradient that can support ATP synthesis. In another electron-transferring capability called the Q cycle, cytochrome b(6)f electrons are transferred between plastoquinol, plastoquinone, and plastocyanin. Plastocyanin is the Cu(II)/Cu(I)-containing enzyme receiving electrons from cytochrome b(6)f and transferring them to the primary electron donor of photosystem I, P700. Plastocyanins are small enzymes containing between 97 and 105 amino acid residues. The copper coordination sphere is distorted tetrahedral with two Nδ1 ligands of two histidines, one cysteine Sγ ligand, and one methionine Sδ ligand. The cytochrome b(6)f complex, weighing approximately 217 kDa, is a dimer, each monomer containing eight subunits. Various subunits contain four heme cofactors: (1) two hemes in the cytochrome b family (bp near the lumen side and bn near the stromal side), (2) one heme in the cytochrome f family, and (3) a heme x. Heme b, protoporphyrin IX, is familiar from myoglobin and hemoglobin: four N atoms from the porphyrin ring ligate the Fe(II)/Fe(III) ion, along with axial histidine ligands. Heme f features, in addition to the planar porphyrin with four coordinating N ligand atoms, a covalent cross-link to the surrounding protein through two cysteine Sγ ligands. Fe(II)/Fe(III) ions in heme f have two axial ligands: the Nε2 atom of a histidine and a backbone amide N atom from a tyrosine residue. Heme x, also known as heme c1, features one covalent connection from the iron ion to the surrounding protein through a cysteine Sγ atom, as well as axial H2O ligands. Cytochrome b(6)f also contains a [2Fe:2S] iron–sulfur cluster called a Rieske iron–sulfur protein (RISP) after its discoverer, John S. Rieske. The RISP cluster has histidine Fe–Nδ1 as well as cysteine Fe–Sγ bonding to the protein.
The sequence of electron transfer events goes as follows:
• An electron is transferred from the doubly reduced dihydroplastoquinone (PQH2) to a high-potential electron chain composed of the RISP and the protein containing heme f. These two centers reside on the lumen, or p, side of the plant membrane, exterior to the membrane.
• A second electron is transferred sequentially through the system from PQH2 to heme bp, heme bn, and thus to heme x. All of these hemes reside in the cytochrome b subunit and within the membrane.
• Concurrent with the electron transfer, protons are taken up from the stromal, or n, side of the membrane.
See PDB 1VF5.

Cytochrome c Oxidase

Cytochrome c oxidase (CcO) catalyzes the reduction reaction:

4e− + 4H+ + O2 → 2H2O

during the final stage of respiration. The reaction must take place without generation of partially reduced dioxygen species such as superoxide (O₂⁻) or peroxide (O₂²⁻). The complete catalytic mechanism is poorly understood. During catalytic turnover, CuB, a copper center held in the protein through bonding to three histidine nitrogens, and Tyr244 (numbering system from PDB 1OCC, 2OCC) each deliver one electron to O2 at the hydrophobic active site. Two more electrons arrive from the Fe ion in heme a3. After transferring its four electrons to bound dioxygen, the thus oxidized active site must be slowly recharged by Fe2+ cytochrome c. To prevent formation of partially reduced oxygen species, known as PROS, O2 must bind only when both CuB and heme a3 have been fully reduced. In the schematic diagram of Fig. 7, the distances are taken from the PDB 2OCC X-ray crystallographic structure. Entry and exit channels for dioxygen, protons, and water are schematic only. See PDB 1OCC and 2OCC.

Bioinorganic Chemistry, Fig. 7 Schematic diagram for cytochrome c oxidase (Reprinted from Fig. 7.39 of Roat-Malone (2007). This material is reproduced with permission of John Wiley & Sons, Inc. Adapted from Fig. 1 of Kim et al. (2004). Copyright (2004) American Chemical Society)

Metals in Biological Systems Facilitating Enzyme Catalysis

See Table 4.

Nitrogenase

Nitrogenase enzymes reduce the “inert” molecule N2 to ammonia (NH3) under atmospheric pressure and normal temperatures (20–30 °C) according to the reaction:

N2 + 3H2 → 2NH3

This nitrogen fixation reaction, carried out by microbes called diazotrophs, provides reduced nitrogen for incorporation in nucleic and amino

acids. Up to 40% of a bacterium’s ATP production is used to produce ammonia when the organism is actively fixing nitrogen according to the reaction:

N2 + 8H+ + 8e− (from Fdred) + 16MgATP → 2NH3 + H2 + 16MgADP + 16Pi (PO₄³⁻)

Fdred in the equation refers to either ferredoxin or flavodoxin, one of a large group of iron–sulfur cluster-containing proteins that supply electrons for many different biological processes.

Bioinorganic Chemistry, Table 4 Metals in biological systems: enzyme catalysis

Metal | Coordination number, geometry | Preferred ligands (amino acids, phosphate groups, sugars, nucleic acids) | Functions and examples (PDB accession number)
Cobalt, Co+, Co2+, Co3+ | 6, octahedral | O-Carboxylate, N-imidazole: Nε2 of histidine | Alkyl group transfer in enzymes with vitamin B12 (cyanocobalamin) cofactor: methionine synthase (PDB 1BMT, 3IV9)
Nickel, Ni2+ | 6, octahedral | S-Thiolate, thioether, N-imidazole, polypyrrole: Sγ of cysteines, Sγ and Oδ of S-hydroxycysteines; O-carboxylate, N-imidazole Nδ1, Nε2 atoms of histidine, phosphate oxygens | Hydrogenases, hydrolases: NiFe hydrogenase (PDB 1WUI, 1WUJ, 1WUK, 1WUL); urease (PDB 3LA4)
Molybdenum, Mo4+, Mo5+, Mo6+ | 6, octahedral | O-Oxide, carboxylate, phenolate, S-sulfide, thiolate: Nδ1 atoms of histidine, hydroxy and carboxy ligands of 3-hydroxy-3-carboxy adipic acid, S atoms of Fe7MoS9 cluster | Nitrogen fixation in nitrogenases, oxo transfer in oxidases: nitrogen fixation in nitrogenases (PDB 2MIN, 3MIN, 1N2C, 2AFH, 2AFI, 2AFK, 3U7Q)

Industrially, the Haber–Bosch process, using a heterogeneous iron catalyst and operating at very high temperatures and pressures, produces ammonia from dinitrogen and dihydrogen. Millions of tons of ammonia are produced annually for incorporation into agricultural fertilizers and other important products. Bioinorganic chemists have sought to understand nitrogenase structure and function with the goal of designing a molecular catalyst for ammonia production that would operate at more normal conditions of temperature and pressure. Nitrogenase is a large complex protein that features two separable components: (1) the Fe protein (also called dinitrogenase reductase)

containing four iron ions and having a molecular weight of approximately 68 kDa and (2) the MoFe protein (also called dinitrogenase) containing 30 iron ions and two molybdenum (Mo) ions and having a molecular weight of approximately 240 kDa. Molybdenum is the only second-row transition metal found in biological species. The iron and molybdenum ions are found in metal–sulfur clusters within the protein that serve to transfer electrons to the active site and to coordinate N2 for reduction to NH3. Conserved amino acid residues in nitrogenases from different species often serve as ligands holding the metal–sulfur clusters in place. Normally the sulfur atom of a cysteine amino acid side chain coordinates iron ions in metal–sulfur clusters, one cysteine ligand per two Fe ions. However, the molybdenum ion’s ligands are the Nδ1 atom of a histidine and two oxygen ligands from a homocitrate anion. See Fig. 8.

Bioinorganic Chemistry, Fig. 8 Nitrogenase metal–sulfur clusters (Reprinted with permission from Fig. 1 of Howard and Rees (2006). Copyright (2006) National Academy of Sciences, U.S.A.)

The smaller Fe protein in nitrogenase contains two subunits. Together the subunits bind a single [4Fe:4S] iron–sulfur cluster (Fig. 8a), and each subunit contains a MgATP binding site. Fe protein, in conjunction with MgATP hydrolysis to MgADP, delivers electrons to the larger MoFe protein (dinitrogenase). This protein contains a so-called P-cluster, an [8Fe:7S] iron–sulfur cluster (Fig. 8c), that also serves as an electron transfer intermediate between Fe protein and MoFe protein. The active-site metal cluster, called FeMo cofactor, [7Fe-9S-Mo-homocitrate] (Fig. 8b), is the most complex and contains a vacant site, most probably where substrate N2 coordinates, but identified as X in most of the literature. After electrons are transferred to the substrate, oxidized Fe protein dissociates and subsequently is reduced once more by a ferredoxin or flavodoxin as shown in the equation above. The cycle continues until eight electrons are delivered to reduce one dinitrogen to two ammonia molecules (N2 to 2NH3) and two protons (2H+) to H2. Because iron ions have two easily accessible oxidation states, Fe2+ and Fe3+, iron–sulfur clusters are a common motif in proteins that engage in oxidation–reduction reactions; see, for instance, the discussion of hydrogenases. In nitrogenase’s Fe protein, the [4Fe:4S] iron–sulfur cluster is somewhat unusual in that it cycles through three overall oxidation states (rather than the more usual two) as in the following:

[2Fe2+:2Fe3+:4S²⁻]2+ ⇌ [3Fe2+:1Fe3+:4S²⁻]1+ ⇌ [4Fe2+:4S²⁻]0
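The overall charges in the three-state scheme above follow from simple bookkeeping of the formal ion charges. The sketch below only illustrates that arithmetic; the function name `cluster_charge` is hypothetical, and the cysteine thiolate ligands that hold the cluster in the protein are deliberately left out, as they are in the cluster formulas themselves.

```python
def cluster_charge(n_fe2, n_fe3, n_sulfide):
    """Net charge of an iron-sulfur core from formal ion charges."""
    return 2 * n_fe2 + 3 * n_fe3 - 2 * n_sulfide


# (Fe2+ count, Fe3+ count, S2- count) for the three Fe protein states
states = [(2, 2, 4), (3, 1, 4), (4, 0, 4)]
charges = [cluster_charge(*state) for state in states]

for (fe2, fe3, s), q in zip(states, charges):
    print(f"[{fe2}Fe2+:{fe3}Fe3+:{s}S2-] core charge: {q:+d}")

# Each step to the right converts one Fe3+ to Fe2+, i.e. adds one electron.
electrons_per_step = [a - b for a, b in zip(charges, charges[1:])]
print("electrons gained per step:", electrons_per_step)
```

Each successive state differs from its neighbor by one electron, so passing from the 2+ state to the all-ferrous 0 state corresponds to a two-electron reduction of the cluster.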

At this time, researchers still question whether the Fe protein [4Fe:4S] cluster delivers one or two

electrons to the MoFe protein per reaction cycle. For an excellent overview of nitrogenase’s mechanism of activity, see Hatcher et al. (2006). MoFe protein contains two copies of two different subunits (called α and β), leading to an overall heterotetrameric protein. Each subunit coordinates two unique metalloclusters, the [8Fe:7S] P-cluster (Fig. 8c) and the [7Fe:9S:Mo:homocitrate] FeMo cofactor (Fig. 8b). The P-cluster receives electrons from Fe protein’s [4Fe:4S] cluster and passes them along to the FeMo cofactor. FeMo cofactor is unique in several ways: (1) it can be extracted from partially denatured protein and inserted into a cofactor-deficient MoFe protein; (2) it has only two direct protein side chain ligands per eight cluster metals instead of the normal one protein side chain ligand per cluster metal; (3) it features an unusual metal cluster ligand, homocitrate; and (4) it contains an empty coordination site with interstitial ligand X, found to be a small atom and possibly identified as carbon, nitrogen, or oxygen. Most recently, evidence has been presented that ligand X is interstitial carbon. In its most reduced state, FeMo cofactor has been identified as [1Mo4+:4Fe2+:3Fe3+:9S²⁻]3+. Rees and coworkers characterized nitrogenase’s Fe protein and MoFe protein in the 1990s. All of the structures referenced here were determined using protein from the bacterium Azotobacter vinelandii. First published was Fe protein’s X-ray crystallographic structure (PDB 1NIP

followed by PDB 2NIP in 1998). In 1997, MoFe protein’s structure was published. See PDB 2MIN and 3MIN. Also in 1997, Rees and coworkers published the first X-ray crystallographic study of the Fe protein/MoFe protein complex. See PDB 1N2C. Further studies observed nitrogenase’s Fe protein/MoFe protein complex with no nucleotide (PDB 2AFH), with MgADP in the nucleotide binding site (PDB 2AFI), and with MgAMPPCP, a non-hydrolyzable MgATP analogue, in the nucleotide site (PDB 2AFK). Data from all of these structures, combined with data concerning metal oxidation states from electron paramagnetic resonance (EPR), electron nuclear double resonance (ENDOR), electron spin echo envelope modulation (ESEEM), and Mössbauer isomer shift studies, have led researchers to the following conclusions regarding the nitrogenase protein:
• All components of the nitrogenase protein must be assembled before enzymatic activity to produce NH3 is possible; no one has found a way to simplify the system.
• As various metal clusters change oxidation state or transfer electrons, the shape of nitrogenase’s component parts changes as their contact surfaces adjust to facilitate electron transfer.
• The exact manner of N2 attachment at FeMo cofactor is unknown at this time.
• The source of protons necessary for ammonia and H2 production is unknown at this time, as is the exact manner in which protons would be added to the N2 triple bond.
One example of a functional model (not attempting to reproduce nitrogenase’s active site), described by Richard Schrock and coworkers, uses a molybdenum (Mo) metal center surrounded by a complex tetradentate ligand system with N atom ligation to the Mo center. Addition of a proton source in heptane as solvent at room temperature and pressure reduced dinitrogen (N2) to ammonia as Mo cycled from Mo(VI) through Mo(III) oxidation states. See Schrock and Yandulov (2003).
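The electron and ATP budget implied by the overall nitrogenase stoichiometry quoted earlier in this section (N2 + 8H+ + 8e− + 16MgATP → 2NH3 + H2 + 16MgADP + 16Pi) can be worked through explicitly. This is a minimal sketch of the arithmetic only, assuming that stoichiometry holds exactly; the function name `nitrogenase_budget` and the dictionary keys are hypothetical.

```python
def nitrogenase_budget(n2_molecules=1):
    """Electron and MgATP budget per N2, using the stoichiometry in the text."""
    electrons_total = 8 * n2_molecules
    mgatp_total = 16 * n2_molecules
    # Each N atom goes from oxidation state 0 (in N2) to -3 (in NH3).
    electrons_to_ammonia = 2 * 3 * n2_molecules
    # Two protons are reduced to one H2 for every N2 fixed.
    electrons_to_h2 = 2 * n2_molecules
    assert electrons_to_ammonia + electrons_to_h2 == electrons_total
    return {
        "electrons_to_ammonia": electrons_to_ammonia,
        "electrons_to_h2": electrons_to_h2,
        "mgatp_per_electron": mgatp_total / electrons_total,
        "fraction_of_electrons_to_h2": electrons_to_h2 / electrons_total,
    }


budget = nitrogenase_budget()
for key, value in budget.items():
    print(f"{key}: {value}")
```

On this accounting, six of the eight electrons end up in ammonia and two in H2, and two MgATP are hydrolyzed per electron transferred, which is consistent with the large share of cellular ATP that the text attributes to nitrogen fixation.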

Vitamin B12 (Cobalamin)

Vitamin B12 (cobalamin, Cbl) cofactor functions to add alkyl (usually methyl, CH3) groups to, and remove them from, biological substrates according to the following reactions:

CH3-X + Co(I) → CH3-Co(III) + X
CH3-Co(III) + Y → CH3-Y + Co(I)

where CH3-X is the methyl group donor and Y is the methyl group acceptor. Vitamin B12 is an important biomolecule that contains a stable metal–carbon bond, i.e., it is a bioorganometallic molecule. Cobalt metal ions are held within a corrin ring (similar to the porphyrin ring of Fig. 4) in the cofactor embedded within several enzymes that require alkyl transfers. Depending on the cobalt ion oxidation state in the catalytic cycle, metal ion coordination may vary among 4-, 5-, and 6-coordinate geometries. All geometries are important in the catalytic cycle. An example of an enzyme requiring vitamin B12 cofactor is cobalamin-dependent methionine synthase. This enzyme catalyzes the transfer of a methyl group from methyltetrahydrofolate to homocysteine (HS(CH2)2CH(NH2)(COOH), HCY) to produce methionine (CH3S(CH2)2CH(NH2)(COOH), M) and tetrahydrofolate (H4folate) according to the reactions:

CH3Co(III)Cbl + HCY ⇌ Co(I)Cbl + methionine
Co(I)Cbl + CH3-H4folate ⇌ CH3Co(III)Cbl + H4folate

Forms of cobalamin include methylcobalamin, as described above, with the methyl group coordinated in a fifth axial position (in addition to the four pyrrole nitrogen ligands of the corrin ring). Other forms include cyanocobalamin, with a CN group in the fifth position; adenosylcobalamin, with adenosine in the fifth position; and hydroxocobalamin, with an OH group in the fifth position. Vitamin B12 supplements often contain cyanocobalamin.
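The two methionine synthase half-reactions above form a closed cycle in which the cobalamin cofactor is regenerated while methyl groups are shuttled from methyltetrahydrofolate to homocysteine. The toy model below is a sketch of that bookkeeping only: it assumes each reversible step runs in the forward direction to completion, ignores kinetics and cofactor inactivation, and uses a hypothetical function name, `methionine_synthase_cycle`.

```python
def methionine_synthase_cycle(turnovers):
    """Run the two half-reactions `turnovers` times and track all species."""
    cobalt = "CH3-Co(III)Cbl"  # cofactor starts in the methylated form
    pools = {
        "HCY": turnovers,
        "CH3-H4folate": turnovers,
        "methionine": 0,
        "H4folate": 0,
    }
    for _ in range(turnovers):
        # Step 1: CH3-Co(III)Cbl + HCY -> Co(I)Cbl + methionine
        assert cobalt == "CH3-Co(III)Cbl"
        pools["HCY"] -= 1
        pools["methionine"] += 1
        cobalt = "Co(I)Cbl"
        # Step 2: Co(I)Cbl + CH3-H4folate -> CH3-Co(III)Cbl + H4folate
        pools["CH3-H4folate"] -= 1
        pools["H4folate"] += 1
        cobalt = "CH3-Co(III)Cbl"
    return cobalt, pools


final_cobalt, final_pools = methionine_synthase_cycle(3)
print("cofactor after 3 turnovers:", final_cobalt)
print("species counts:", final_pools)
```

Summing the two steps cancels the cobalt species, leaving the net reaction CH3-H4folate + HCY → H4folate + methionine, so the cofactor acts catalytically as a methyl carrier that cycles between Co(I) and CH3-Co(III).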

Ascorbate Oxidase

Ascorbate oxidase is one of a family of multicopper oxidases that includes laccases (benzenediol oxygen oxidoreductase), ascorbate oxidase (oxidizes ascorbic acid, vitamin C, to dehydroascorbic acid), and ceruloplasmin (named for its blue color, the major carrier of copper in the bloodstream, with a role in iron metabolism). Usually these oxidoreductases contain four copper ions: one mononuclear center and a trinuclear center at a distance of 12–13 Å. Three different types of Cu(II) ions are found in the multicopper oxidases:
• Type I Cu(II) ions have three strong ligands, usually a cysteine Sγ atom and two histidine N atoms, but may also carry a methionine sulfur ligand or oxygen. Their characteristic blue color arises from a strong visible absorption band around 600 nm. Electron paramagnetic resonance (EPR) absorption indicates unpaired electron density.
• Type II Cu(II) ions are normally three- or four-coordinate with histidine N atom ligands plus water or hydroxyl ligands. They exhibit no detectable absorption in the UV/Vis region. Their EPR spectra are similar to those of low-molecular-weight copper coordination complexes.
• Type III Cu(II) ions (in pairs as a binuclear center) usually feature three histidine ligands for each copper ion and a bridging ligand such as an oxygen (as oxide) or hydroxyl (hydroxide). They have a strong UV absorption at 330 nm but no EPR signal since the pair of copper ions is antiferromagnetically coupled.
When reduced to Cu(I), all types lose their UV–visible absorptions and their EPR signals (all electrons now paired in Cu(I)’s d10 filled shell). See PDB 1ASO, 1ASP, and 1ASQ for ascorbate oxidase and PDB 2XU9 for laccase.

Hydrogenases

Nickel/iron, [NiFe]-hydrogenase; iron/iron, [FeFe]-hydrogenase; and Fe hydrogenase reversibly catalyze the reaction:

2H+(aq) + 2e− ⇌ H2(g)

according to a complex catalytic cycle involving various metal oxidation states and multiple metal centers. In addition to the NiFe or FeFe active site, [NiFe]-hydrogenases and [FeFe]-hydrogenases contain additional metal centers including multiple iron–sulfur clusters and a Mg2+ ion. Several pathways and channels are involved:
• An electron transfer pathway
• A proton transfer pathway
• A gas access channel
In [FeFe]-hydrogenases, one of the iron ions in the active-site cluster (called the H-cluster) is redox inactive, probably the iron ion closest to a [4Fe:4S] cluster in this enzyme. The two iron ions in the H-cluster feature carbonyl (CO) and cyanide (CN−) ligands as well as a bridging ligand identified as a thiomethyl ether in a recent X-ray crystallographic study. At least one iron–sulfur cluster is required for activity although multiple iron–sulfur clusters are present in the enzyme. See PDB 3C8Y. In [NiFe]-hydrogenases, the iron ion at the active site is not redox active. At the NiFe active site in [NiFe]-hydrogenases, the iron ion is coordinated by one carbon monoxide (CO) and two cyanide (CN−) ligands, both unusual metalloenzyme ligands, as well as thiolate Sγ atoms of two cysteine amino acid residues. The nickel ion appears to be coordinated tetrahedrally to four cysteine thiolate Sγ atoms, at least two of which bridge to the active-site iron ion. At least one [4Fe:4S] iron–sulfur cluster is required for the enzyme’s activity although multiple iron–sulfur clusters are present in the enzyme. See PDB 3UQY, 3USC, and 3USE. Fe hydrogenase also reversibly forms and consumes molecular dihydrogen. This hydrogenase, studied by X-ray crystallography (PDB 3DAF, 3DAG), shows a mononuclear iron ion coordinated by a cysteine, two carbon monoxide (CO) molecules, and a nitrogen atom of a 2-pyridinol compound with bonding properties similar to a cyanide ion (CN−). This hydrogenase

does not appear to require an iron–sulfur cluster for activity. See PDB 3DAF and 3DAG. Bioinorganic chemists are interested in these enzymes and their models with the goal of synthesizing dihydrogen for use in fuel cells and other future energy technologies such as H2-powered automobiles. Currently, an expensive platinum catalyst is used in industrial dihydrogen production. The three hydrogenases also provide evidence for convergent evolution, the evolution of similar traits in otherwise unrelated species or lineages. Two August 2005 issues of Coordination Chemistry Reviews are devoted to a discussion of hydrogenases, their structure and function, and model compound synthesis and analysis. See Pickett and Best (2005).

Superoxide Dismutase

Superoxide dismutase (SOD) metalloenzymes function to disproportionate the biologically harmful superoxide ion radical O₂•⁻ according to the following equation:

2O₂•⁻ + 2H+ → H2O2 + O2

In disproportionation reactions, the same reactant, O₂•⁻ in this case, is both oxidized and reduced. The reaction kinetics are controlled by diffusion of O₂•⁻ into the enzyme’s active site, an extremely fast reaction rate. Copper ions shuttle between Cu(I) and Cu(II) oxidation states during catalysis, and zinc ions are believed to have a structural role in maintaining the correct stereochemistry at the active site. The product hydrogen peroxide, H2O2, also a dangerous biological species, is removed by catalase (see PDB 3RE8, 3RGP, 3RGS), a porphyrin heme-containing enzyme, according to the reaction:

2H2O2 → 2H2O + O2

Superoxide dismutase enzymes (SODs), containing both copper and zinc ions, are functional dimers of approximately 32 kDa, each subunit containing 153 amino acids. The enzyme dimers, found in eukaryotic and some bacterial

species, are held together by hydrophobic interactions between amino acid side chains of the two subunits. In addition to CuZnSODs, SODs may contain redox-active manganese (Mn), nickel (Ni), or iron (Fe) centers. In mammals, CuZnSOD is primarily found in the liver but may also be found in blood cells and brain tissue. Dimeric CuZnSOD found in mammalian species contains one copper and one zinc ion per subunit. By disproportionating the superoxide ion radical, SOD acts as an antioxidant to inhibit aging and carcinogenesis. Mutations in the gene encoding human superoxide dismutase are known to cause amyotrophic lateral sclerosis (ALS) – Lou Gehrig’s disease. Over 100 separate single amino acid residue mutations have been identified for CuZnSODs. Researchers have linked many of these mutations to degenerative neuron diseases such as ALS. A complex system involving many biological components and toxicities has evolved.

The first CuZnSOD X-ray crystallographic studies in the early 1980s revealed an unusual binding mode – a histidinate anion (a deprotonated histidine side chain) bridging the copper and zinc ions of the enzyme’s subunits. This bridging histidine was found to occur when copper ions were in the Cu(II) state. See PDB 2SOD and PDB 3SOD. The Protein Data Bank recently listed almost 300 superoxide dismutase structures containing copper, zinc, manganese, iron, and most recently, nickel. Many different species, including over 100 Homo sapiens structures, are represented. Several solution NMR studies of CuZnSOD enzymes have been carried out. In a fully reduced form of the enzyme, Bertini and coworkers showed that the Nε2 atom of the bridging histidine is protonated, agreeing with the mechanism outlined below. See PDB 1L3N.

The tertiary structure of CuZnSOD subunits consists of eight antiparallel β-strands that assemble into a so-called β-barrel Greek key motif.
This motif, occurring in SODs as well as in other proteins such as plastocyanin and cytochrome c, not only stabilizes tertiary structure but also supports guidance of electrons through a protein and determines active-site chemistry involving metals. Hydrophobic amino acid side chain residues

form the core of the β-barrel. Other specific amino acid residues connect the two SOD dimer subunits into the quaternary structure. Two loops form a channel guiding superoxide ion to the enzyme’s active site, and a disulfide bond between two cysteine residues stabilizes both the dimer interface and the protein’s overall geometry – the protein’s fold.

Many studies have led to an outline of the mechanistic steps leading to the disproportionation reaction of the superoxide ion. The steps include:
• Superoxide ion (O₂•⁻) is guided to the active site.
• O₂•⁻ binds to the Cu(II) ion that is connected to the zinc ion by the histidinate bridge.
• Cu(II) ion is reduced to Cu(I) and O₂•⁻ is oxidized to molecular O₂.
• The histidinate bridge is broken and the histidine Nε2 atom is protonated.
• An electron from Cu(I) is donated to a second O₂•⁻ to form peroxide ion O₂²⁻ and reform Cu(II).
• The hydrogen on the protonated Nε2 atom of the bridging histidine combines with O₂²⁻ to form HO₂⁻.
• Cu(II)–histidine–Zn(II) bridge reforms.

The disproportionation reactions can be written as:

$$\mathrm{O_2^{\bullet-}} + \mathrm{Cu(II)ZnSOD} \rightarrow \mathrm{Cu(I)ZnSOD} + \mathrm{O_2}$$

$$\mathrm{O_2^{\bullet-}} + \mathrm{Cu(I)ZnSOD} + 2\,\mathrm{H^{+}} \rightarrow \mathrm{Cu(II)ZnSOD} + \mathrm{H_2O_2}$$

Researchers believe that familial amyotrophic lateral sclerosis (FALS), an inherited form of ALS, is caused by many different mutations in the gene encoding human CuZnSOD. ALS symptoms include motor neuron loss in the brain and spinal cord and fibrous material buildup in muscles. Researchers now believe that the mutations destabilize CuZnSOD, giving partially folded or misfolded subunits, and that this leads to aberrant interactions with other proteins and the formation of aggregates and fibrous material. See Perry et al. (2010). No cure exists for FALS

although several treatment options are undergoing phase II and III trials at this time. See Pratt et al. (2012). Researchers have discovered a link between FALS and the activation of superoxide dismutase. The link involves the copper chaperone for SOD, CCS. Metal chaperones directly insert a specific metal into the target metalloprotein. Copper chaperones have been well described but other metals incorporated into proteins must have chaperones as well to maintain metal homeostasis and avoid metal toxicities. Concomitant with the insertion of copper ions into SOD by CCS, formation of an intrasubunit disulfide bond takes place. Not all SOD enzymes require a copper chaperone as they have other mechanisms for activation. Eukaryotic Cu,Zn-superoxide dismutases, called SOD1s, generally require the copper chaperone CCS and O₂ to form the required disulfide bond for SOD1 activation. However, several metazoan species have been found to acquire SOD1 activity independent of CCS or of O₂. See Leitch et al. (2009).

Metal Toxicity

Many elements and compounds are required at some level for an organism’s survival. Harmful effects arise for these same elements and compounds if their concentration in the organism is too low (deficiency) or too high (toxicity). For instance, if the concentration of iron in all its forms in the human body is too low, anemia results; if too high, a disease state named hemochromatosis, also called iron overload, ensues. Two metallic elements that appear to be toxic in humans at very low levels are lead (Pb) and mercury (Hg). Both these elements, considered “soft” – capable of easily forming compounds with organic species – accumulate low in the food chain with toxic results. Methyl mercury (monomethylmercury(II), CH₃Hg⁺) accumulating in fish caused a terrible wasting disease in Japan before the cause of the toxicity was established. Methyl mercury can be produced in vivo by the action of vitamin B12-related enzymes on ingested mercury.
Elemental Hg and CH₃Hg⁺ both react with cysteine or methionine sulfur atoms in proteins and enzymes leading to

mercury transport throughout the body including across the blood–brain barrier or the placenta.

Lead poisoning causes many deleterious effects in the human body and is particularly dangerous for children through nervous system damage leading to mental deficiency. Tetraethyl lead, (C₂H₅)₄Pb, formerly used as an automobile gasoline additive and still used in some aviation fuels and in some developing countries for all transportation, is easily absorbed through the skin. Elemental lead droplets and lead compounds are readily ingested or inhaled if present in the atmosphere. Once in the body, lead and lead compounds have a long half-life; bone accumulations of Pb can be present for years once established. Reactions similar to those of mercury with sulfur ligands in proteins and enzymes cause toxic effects. One known effect is interference in the synthesis of the heme groups that carry dioxygen throughout the bloodstream in hemoglobin, causing anemia in the patient.

Radioactive elements emitting alpha particles, beta particles, and especially energetic gamma rays can be toxic because they cause genetic mutations known to lead to diseases such as leukemia and other cancers. The most toxic radioactive elements have long half-lives. Radioactive elements with short half-lives have been very useful in medical diagnostic and treatment applications, as will be discussed for technetium (Tc) radiopharmaceuticals in the Metals in Medicine section.

Aluminum as Al³⁺ has been implicated in Alzheimer’s disease as it may interact with phosphates and cause protein cross-linking. Cadmium as Cd²⁺, also considered a “soft” metal, has many of the same effects as lead or mercury, blocking sulfhydryl groups in proteins and enzymes and competing with zinc, for instance. Cadmium is also known to interfere with Cu(II) and Zn(II) metabolism as Cd²⁺ features size and shape characteristics similar to the copper and zinc ions.
Thallium (Tl⁺) ions, a nervous system poison, are similar to K⁺ ions in size and shape and can enter cells through K⁺ ion channels. Chromium (Cr) is an essential element for normal carbohydrate and lipid metabolism. Its

deficiency can lead to adult onset diabetes. However, Cr(VI) uptake via the chromate ion, CrO₄²⁻, through anion channel transport causes several toxic effects. After intracellular reduction to Cr(III), adduct formation involving the phosphate backbone of DNA, the N7 atom of guanine, cysteines, glutathione, and larger peptides and proteins results in toxicity similar to that described for mercury and lead. The toxicity of arsenic (As) arises from bonding to critical biological thiols – cysteine in proteins being a prime example. All metals, both essential and toxic, can cause the formation of free radicals such as superoxide O₂•⁻, peroxide O₂²⁻, hydroperoxide HO₂⁻, hydroxyl radical (HO•), and other reactive oxygen and nitrogen species (ROS and RNS) that cause mutations in genetic material and other deleterious biological side reactions. For a recent comprehensive review of the topic, see Hengstler and Bolt (2008).

Metals in Medicine

Platinum-Containing Antitumor Drugs

In the mid-1960s, Barnett Rosenberg and coworkers serendipitously discovered the ability of platinum coordination compounds to cause filamentous growth instead of cell division in bacteria. The researchers reasoned that the ability to prohibit cell division implied that the platinum compounds could be antitumor agents. After a long period of research, animal testing, and clinical trials in humans, the compound cis-diamminedichloroplatinum(II), cis-[Pt(NH₃)₂Cl₂] (cisplatin, Platinol), was approved by the US Food and Drug Administration (FDA) for anticancer treatment in 1979. Since then, cisplatin has become one of the most used anticancer drugs and is known as a virtual cure for testicular cancers. Further research has identified other platinum coordination compounds that have anticancer activity including:
• [1,1-Cyclobutanedicarboxylato²⁻]-O,O′-diammineplatinum(II), carboplatin, found to cause fewer side effects than cisplatin

Bioinorganic Chemistry, Fig. 9 Platinum compounds used or being tested as anticancer agents. Figures in this entry are rendered using ChemBioOffice software. See www.cambridgesoft.com

• [(1R,2R)-cyclohexane-1,2-diamine](ethanedioato-O,O′)platinum(II), oxaliplatin, featuring a bidentate 1,2-diaminocyclohexane ligand in place of the two ammine ligands
• [bis-(acetato)-ammine dichloro-(cyclohexylamine) platinum(IV)], JM216, satraplatin, a platinum(IV) compound administered orally
• Multinuclear, trans-platinum(II) ammine compounds synthesized in Nicholas Farrell’s laboratories, BBR3464 (Fig. 9)

Of these compounds, and in addition to cisplatin, only carboplatin and oxaliplatin have received final FDA approval in the United States. Platinum(II) coordination compounds, all having d⁸ electron configurations, universally adopt square planar geometry. The platinum(IV) compound satraplatin, with d⁶ electron configuration, adopts octahedral geometry. In vivo, it is believed that the two acetato ligands are lost and

the compound is reduced to the platinum(II) oxidation state before anticancer activity ensues.

Platinum(II) compounds’ mechanisms of activity in vitro and in vivo have been extensively studied. Early experiments showed that the intravenously administered drug would pass through the bloodstream intact and be hydrolyzed (with two chloride ligands replaced by H₂O) after entering the cell where chloride ion concentration is much lower. Concurrently, it was established that the drug’s in vivo target is DNA. More specifically, cisplatin’s chloride ligands are hydrolyzed inside the target cell followed by intrastrand cross-linking at specific positions on guanine residues along the DNA chain. Kinks in the DNA chain caused by these cross-links prevent DNA replication eventually leading to cell death (apoptosis). Because normal cells as well as tumor cells are attacked during chemotherapy, all platinum-containing anticancer drugs have serious side

effects that are manifested principally as nephrotoxicity and nausea. In addition to toxic side effects of antitumor platinum coordination compounds, clinicians find that cancer cells become resistant to the drugs and further treatment is not efficacious. This fact has led to continued research with the goal of finding other effective metal-containing compounds for cancer chemotherapy. See Wang and Guo (2013).

Antirheumatic and Anticancer Gold Compounds

A number of coordination compounds containing gold ions have been used to treat rheumatoid arthritis. Most common among these are gold sodium thiomalate and auranofin, both of which contain Au(I) ions. Because gold(III) compounds have d⁸ electron configurations and are thus square planar in geometry, like the platinum(II) drugs discussed above, many Au(III) coordination complexes have been evaluated as anticancer agents. Many Au(III) compounds have been found to be antitumor active; however, none have reached clinical trials, principally because of toxicity issues. See Berners-Price and Filipovska (2011).

Bioorganometallic Drugs

Titanocene, [bis-cyclopentadienyltitanium(IV) dichloride], is an organometallic compound that has shown promise as an antitumor agent in both in vitro and in vivo experiments. However, it failed in phase I and II clinical trials. Researchers continue to study this and other titanium organometallic compounds to determine their mechanism of activity with the goal of developing more effective and less toxic agents.

Radiopharmaceuticals

Technetium-99m (⁹⁹ᵐTc) compounds have been used as radiopharmaceuticals for a number of years. This radionuclide emits a gamma ray of optimum energy (140 keV), has a short half-life (6 h), and is readily available from stable ⁹⁹Mo–⁹⁹ᵐTc generator systems. The ⁹⁹ᵐTc radionuclide has been introduced into many complex organic ligand systems that are capable of visualizing different medical targets within the body. One organometallic system under study uses

cyclopentadienyl (Cp) and carbonyl (CO) ligands to form stable compounds such as [Cp⁹⁹ᵐTc(I)(CO)₃]. The cyclopentadienyl ring can be modified to produce compounds capable of bonding with specific in vivo targets for diagnostic analysis. See Arano (2002).

Conclusions

Metal ions are found in every biological species known to man, some more commonly than others. Group 1 and group 2 metal ions such as Na⁺, K⁺, and Ca²⁺ serve to control muscle and nerve function. Iron ions in Fe(II), Fe(III), and Fe(IV) oxidation states take part in oxidation–reduction reactions, for example, in the heme cofactors of cytochromes and in iron–sulfur clusters. Fe(II) and Fe(III) ions in heme cofactors serve to carry dioxygen in all mammalian species. Copper ions carry dioxygen in arthropods and serve as important centers for control of reactive oxygen species such as superoxide ion in the disproportionating enzyme superoxide dismutase. Bioinorganic chemists study complex bioinorganic systems such as nitrogenase, striving to find model complexes to split the stable dinitrogen molecule and provide new methods for the production of reduced nitrogen-containing molecules such as ammonia, NH₃. Study of the many genetic mutations in the superoxide dismutase enzyme that cause diseases such as ALS may lead to treatment options for this currently incurable malady. Bioinorganic researchers study metals used in medical applications with the goal of finding better treatments with fewer damaging side effects. Other metal ions such as Ni(II), Ni(IV), Mo(III), Mo(IV), and Mo(VI) are found less commonly in biological species but constitute essential cofactors in enzymes such as dehydrogenases and ureases (nickel) and nitrogenase (molybdenum). This entry has attempted to address the more commonly found metals in some detail and mention the less commonly found as well.

Explanation of Terms

Proteins: polypeptides formed by polymerization of amino acid building blocks. The chain of amino

acid residues forms a protein’s primary structure. Numbering of the protein chain begins at the N-terminal (free NH₃⁺) end and terminates at the C-terminal (free COO⁻) end. Secondary structures in proteins form through interactions of amino acid main and side chains. Two important secondary structures are α-helices and β-pleated sheets formed through van der Waals and hydrogen bonding interactions. Protein tertiary structure forms as the protein folds into its active configuration. The manner in which a protein folds is a major factor in its activity as an incorrectly folded enzyme will have greatly diminished or no catalytic activity. Most of the proteins discussed here are globular in nature; that is, they assume a more or less spherical form and are at least partially soluble in aqueous solution. Fibrous proteins – found in connective tissue, tendons, and muscle fiber – are more linear in structure and mostly insoluble in aqueous solution. Quaternary structures encompass several tertiary structures bonded together to work cooperatively. Hemoglobin’s quaternary structure includes four tertiary globular proteins that cooperatively bind and carry dioxygen (O₂) throughout the bloodstream.

Enzymes: proteins that catalyze biochemical reactions.

Metalloenzymes: enzymes that include metal ions or metal ion-containing cofactors.

Cofactor: an organic or inorganic species bound to a protein that is required for its activity. Closely related terms are coenzyme and prosthetic group.

Substrate: molecular target that is transformed during enzymatic activity.

Protein Data Bank (PDB): the depository for protein structural data. In early 2013, the PDB contained X-ray crystallographic, nuclear magnetic resonance (NMR), and cryo-electron microscopy data on more than 88,000 compounds.
The PDB contains a wealth of other information about biological structures plus references to the primary literature and other biological classification systems such as the Structural Classification of Proteins (SCOP at http://scop.mrc-lmb.cam.ac.uk/scop/), CATH (http://www.cathdb.info), and Pfam (http://pfam.xfam.org) databases. Placing a PDB accession number of the form 4HHB (for hemoglobin) in the search box leads to all the information for that molecule at the PDB website: www.pdb.org.

Deoxyribonucleic acid (DNA): double helical structure formed by polymerization of nucleotides (three-component molecules containing a purine or pyrimidine nitrogenous base, a cyclic sugar moiety, and a phosphate group). In DNA, the “deoxy” sugar moiety features a hydrogen atom at the 2′ position of the cyclic sugar ring. Biological DNA carries the genetic code that ultimately determines which proteins are produced. The code is first transcribed into ribonucleic acids (RNA), and the RNAs are then translated into the proteins and enzymes necessary for biological life.

Adenosine triphosphate (ATP): the molecule responsible for intracellular energy transport in biological systems – a complex molecule containing adenine, a cyclic sugar moiety, and three phosphate groups. Adenosine diphosphate results from the hydrolysis of one phosphate group from ATP with concomitant release of energy, often concurrent with phosphorylation of an amino acid residue of the bound protein or enzyme.

Essential elements: the chemical elements essential to living systems. The bulk elements, constituting between 2% and 50% of human body weight (HBW), are the following: hydrogen (H, H⁺), carbon (C), nitrogen (N), oxygen (O, oxide O²⁻, superoxide O₂•⁻, peroxide O₂²⁻), phosphorus (P), and sulfur (S, sulfide S²⁻). Macrominerals and ions, constituting 0.1% or less of HBW, are the following: sodium (Na, Na⁺), potassium (K, K⁺), magnesium (Mg, Mg²⁺), calcium (Ca, Ca²⁺), chloride (Cl⁻), phosphate (PO₄³⁻), and sulfate (SO₄²⁻).
Trace elements and ions, less than 0.005% of HBW, are iron [Fe, ferrous Fe²⁺, also written as Fe(II); ferric Fe³⁺, Fe(III); ferryl Fe⁴⁺, Fe(IV)], zinc [Zn, Zn²⁺, Zn(II)], and copper [Cu, cuprous Cu⁺, Cu(I); cupric Cu²⁺, Cu(II); Cu³⁺, Cu(III)]. Ultratrace nonmetal elements and ions, less than 0.00001% of HBW, are fluorine (F, F⁻), iodine (I, I⁻), selenium (Se, Se²⁻), silicon (Si, Si(IV)), arsenic (As),

and boron (B). The latter three elements are metalloids (see definition below). Ultratrace metallic elements, less than 0.00001% of HBW, are manganese [Mn, Mn²⁺, Mn(II); Mn³⁺, Mn(III); Mn⁴⁺, Mn(IV)], molybdenum [Mo, Mo⁴⁺, Mo(IV); Mo⁵⁺, Mo(V); Mo⁶⁺, Mo(VI)], cobalt [Co, Co⁺, Co(I); Co²⁺, Co(II); Co³⁺, Co(III)], chromium [Cr, Cr³⁺, Cr(III); Cr⁶⁺, Cr(VI)], vanadium [V, V³⁺, V(III); V⁴⁺, V(IV); V⁵⁺, V(V)], nickel [Ni, Ni⁺, Ni(I); Ni²⁺, Ni(II); Ni³⁺, Ni(III)], cadmium [Cd, Cd²⁺, Cd(II)], tin [Sn, Sn²⁺, Sn(II); Sn⁴⁺, Sn(IV)], lead (Pb, Pb²⁺), and lithium (Li, Li⁺).

Main group metals: chemical elements that occur in the first two groups at the left side of the periodic table. The principal main group metals found in biological species are lithium, sodium, and potassium (group 1 metals) and magnesium and calcium (group 2 metals).

Transition elements: metals that occur in the center of the periodic table. Transition elements found in biological species (listed in decreasing concentration) are iron, copper, zinc, manganese, molybdenum, cobalt, chromium, vanadium, nickel, cadmium, tin, and lead. Transition elements contain filled or partially filled “d” electron shells. These electrons are sequentially lost to form positive ions of varying oxidation states – excepting zinc and lead that have filled “d” shells with 10 electrons. Because of their ability to add and subtract electrons, transition metals are often involved in biological oxidation–reduction reaction systems. Transition metal ions gain stabilization energy when bonded to ligands in specific geometric arrangements – planar, pyramidal, tetrahedral, and octahedral are most common in bioinorganic species. Often the geometries adopted by biological systems are distorted from the ideal.

Metalloids: chemical elements that exhibit properties of both metals and nonmetals. Metalloids found in biological species are arsenic, boron, and silicon.
Nonmetals are chemical elements that occur in the groups at the right-hand side of the periodic table. Nonmetals found in biological species are carbon (group 14); nitrogen and phosphorus (group 15); oxygen, sulfur, and selenium (group 16); and fluorine, chlorine, and iodine (group 17).
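The d-electron configurations quoted in this entry, such as d⁸ for Pt(II) and Au(III) and d⁶ for Pt(IV), follow from the loss of electrons described under Transition elements above. A minimal sketch of the usual counting rule (d-electron count = periodic group number minus oxidation state) is given below. The function name, the lookup table, and the choice of metals are illustrative assumptions, and the rule is only meant for transition metal ions in non-negative oxidation states.

```python
# Periodic-table group numbers for some metals named in this entry
# (an illustrative subset, not an exhaustive table).
GROUP_NUMBER = {
    "Ti": 4, "V": 5, "Cr": 6, "Mo": 6, "Mn": 7, "Tc": 7, "Fe": 8,
    "Co": 9, "Ni": 10, "Pt": 10, "Cu": 11, "Au": 11, "Zn": 12,
}


def d_electron_count(element: str, oxidation_state: int) -> int:
    """Return n in d^n for a transition metal ion.

    Uses the standard rule n = group number - oxidation state.
    """
    if oxidation_state < 0:
        raise ValueError("rule is only applied here to non-negative oxidation states")
    count = GROUP_NUMBER[element] - oxidation_state
    if not 0 <= count <= 10:
        raise ValueError(f"{element}({oxidation_state}) gives an impossible d count: {count}")
    return count


if __name__ == "__main__":
    for element, state in [("Pt", 2), ("Pt", 4), ("Au", 3), ("Zn", 2), ("Fe", 3)]:
        print(f"{element}({state}): d{d_electron_count(element, state)}")
```

Under this rule Zn(II) comes out as d¹⁰, which matches the statement above that zinc ions retain a filled “d” shell.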

Cross-References
▶ DNA Damage, Types of
▶ DNA Methylation and Cancer
▶ Ion Channels and Transporters
▶ NMR Approaches to Determine Protein Structure
▶ Non-allosteric Proteins. Why Do Proteins Have Quaternary Structure?
▶ Obtaining Crystals
▶ Secondary Structure
▶ Secondary Structure by Circular Dichroism, Experimental Assessment of
▶ Tertiary Structure Domains, Folds and Motifs

References
Arano Y (2002) Recent advances in 99mTc radiopharmaceuticals. Ann Nucl Med 16(2):79–93
Berners-Price SJ, Filipovska A (2011) Gold compounds as therapeutic agents for human diseases. Metallomics 3(9):863–873
Bertini I, Gray HB, Stiefel EI, Valentine JS (2007) Biological inorganic chemistry. University Science, California
CATH. http://www.cathdb.info. Accessed 27 May 2014; and Pfam. http://pfam.xfam.org. Accessed 27 May 2014
Chang TMS (2007) Artificial cells: biotechnology, nanomedicine, regenerative medicine, blood substitutes, bioencapsulation, and cell/stem cell therapy. World Scientific, Singapore/London/Hackensack
Crichton R (2008) Biological inorganic chemistry: an introduction. Elsevier BV, Amsterdam
Denisov IG, Makris TM, Sligar SG, Schlichting I (2005) Structure and chemistry of cytochrome P450. Chem Rev 105:2253–2278
Groves JT, Bell SR (2009) A highly reactive P450 model compound. J Am Chem Soc 131(28):9640–9641
Hatcher LQ, Vance MA, Narducci Sargeant AA, Solomon EI, Karlin KD (2006) Copper-dioxygen adducts. Inorg Chem 45:3004–3013
Hengstler JG, Bolt HM (2008) A special issue on metal toxicity. Arch Toxicol 82(8):489–571
Howard JB, Rees DC (2006) How many metals does it take to fix N2? Proc Natl Acad Sci 103:17088–17093
Jena Library of Biological Molecules. http://www.flileibniz.de/IMAGE.html. Accessed 27 May 2014
Kim E, Chufan EE, Kallappan K, Karlin KK (2004) Synthetic Models for Heme-Copper Oxidases. Chem Rev 104:1077–1133
Lancaster CRD (2004) Structural biology: ion pump in the movies. Nature 432:286–287
Leitch JM, Yick PJ, Culotta VC (2009) The right to choose: multiple pathways for activating copper, zinc superoxide dismutase. J Biol Chem 284(37):24679–24683

Momenteau M, Reed CA (1994) Synthetic heme dioxygen complexes. Chem Rev 94:659–698
Nomenclature of the potassium channel superfamily. https://www.ipmc.cnrs.fr/~duprat/ipmc/nomenclature.html. Accessed 27 May 2014
Perry JJP, Shin DS, Getzoff ED, Tainer JA (2010) The structural biochemistry of the superoxide dismutases. Biochim Biophys Acta 1804(2):245–262
Pickett C, Best S (2005) Hydrogenases. Coord Chem Rev 249(15–16):1517–1690
Pratt AJ, Getzoff ED, Perry JJP (2012) Amyotrophic lateral sclerosis: update and new developments. Degener Neurol Neuromuscul Dis 2012(2):1–14
Protein Data Bank. www.pdb.org. Accessed 27 May 2014
Roat-Malone RM (2007) Bioinorganic chemistry. Wiley, New Jersey
Schrock RR, Yandulov DV (2003) Catalytic reduction of dinitrogen to ammonia at a single molybdenum center. Science 301:76–78
Structural Classification of Proteins (SCOP). http://scop.mrc-lmb.cam.ac.uk/scop/. Accessed 27 May 2014
Wang X, Guo Z (2013) Targeting and delivery of platinum-based anticancer drugs. Chem Soc Rev 42:202–224
Yu FH, Yarov-Yarovoy V, Gutman GA, Catterall WA (2005) Overview of molecular relationships in the voltage-gated ion channel superfamily. Pharmacol Rev 57:387–395

Biological Inorganic Chemistry ▶ Bioinorganic Chemistry

Blue/White Selection Douglas A. Julin Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Definition

Blue-white screening provides a convenient and powerful way to distinguish bacterial colonies or phage plaques that contain a cloning vector with a DNA insert from those containing empty vectors with no insert DNA. The method is based on the blue pigment that forms when beta-galactosidase catalyzes hydrolysis of the synthetic

substrate X-gal. Hydrolysis of X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactoside) (Horwitz et al. 1964) produces galactose and 5-bromo-4-chloro-3-hydroxyindole (Fig. 1). The latter product then undergoes spontaneous dimerization and oxidation to form a blue-colored indigo pigment (Burstone, 1962). E. coli cells that express beta-galactosidase activity thus form blue colonies when spread on agar plates that contain X-gal, while cells that do not produce active enzyme form white colonies.

Discussion

E. coli beta-galactosidase is a large protein (1,024 amino acid residues, 116 kDa), encoded by the lacZ gene. Cloning vectors for blue-white screening take advantage of a phenomenon called alpha-complementation of lacZ mutations. The beta-galactosidase encoded by the lacZΔM15 gene lacks 31 residues near its N-terminus and is catalytically inactive (Langley et al. 1975; Prentki, 1992). N-terminal peptides from beta-galactosidase can complement the lacZΔM15 mutation and restore beta-galactosidase activity to those cells (Ullmann et al. 1967). Cloning vectors for blue-white screening contain the 5′-end of the lacZ gene, referred to as lacZα. The lacZα gene encodes a 146 amino acid peptide from beta-galactosidase (Sambrook et al. 1989) that complements the lacZΔM15 mutation. The lacZα gene is under the control of the lac promoter and operator, so that its expression is repressed by the Lac repressor. A polylinker cloning site is inserted near the 5′-end of the lacZα coding sequence. Expression of the lacZα gene from the empty vector produces the complementing peptide. A DNA insert that is cloned into the polylinker disrupts the lacZα coding region, and so a recombinant plasmid will not produce the complementing peptide.

Blue-white screening can increase the efficiency of cloning procedures that involve mixing DNA fragments of interest with linearized vector DNA and joining the two by ligation. The ligation mixture may contain recombinant plasmids (vector + insert) as well as plasmid vectors with

Blue/White Selection, Fig. 1 Structure of X-gal and its hydrolysis products

no DNA insert that were recircularized by ligation (empty vectors). Both plasmids will give rise to antibiotic-resistant colonies after transformation into E. coli cells, and the investigator may waste time searching for colonies that contain the recombinant plasmid of interest among a large number of colonies with empty vectors.

For blue-white screening, the ligation mixture is transformed into an E. coli strain that has the lacZΔM15 gene. The cells and/or the vector also encode the Lac repressor protein, which binds to the lac operator in the vector and represses transcription of the lacZα fragment in the plasmid. The transformed cells are spread on plates containing appropriate antibiotic (to select for the presence of plasmid, whether recombinant or empty vector), IPTG (isopropyl β-D-thiogalactoside), and X-gal. IPTG induces transcription from the lac promoter in the plasmid. Cells that acquired an empty vector in the transformation procedure produce the alpha-complementing peptide and thus have active beta-galactosidase. As a result, they form blue colonies on the X-gal plates. However, DNA ligated into the polylinker prevents alpha-peptide production, so cells containing recombinant plasmids form white colonies. Recombinant plasmids can be isolated from white colonies and the DNA insert analyzed by restriction digestion,

PCR, or other methods. The lac operon fragment (promoter, operator, polylinker, and lacZα gene) has been incorporated into many plasmid, phagemid, and bacteriophage cloning vectors, enabling blue-white screening on colonies or phage plaques to be applied in many different cloning applications.
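The screening logic described in this Discussion can be summarized as a small decision function. The sketch below is a deliberately idealized model: the function and parameter names are hypothetical, the host is assumed to carry only the lacZΔM15 allele, and effects such as leaky expression without IPTG or small in-frame inserts that leave the alpha peptide functional are ignored.

```python
def colony_phenotype(has_plasmid: bool, has_insert: bool,
                     antibiotic: bool = True, iptg: bool = True,
                     xgal: bool = True) -> str:
    """Predict the colony outcome for a lacZ-deltaM15 host on a selection plate.

    Returns "no growth", "blue", or "white". Idealized model for illustration.
    """
    # The antibiotic selects for cells carrying any plasmid (empty or recombinant).
    if antibiotic and not has_plasmid:
        return "no growth"
    # The alpha peptide is made only from an intact, induced lacZ-alpha gene.
    makes_alpha_peptide = has_plasmid and not has_insert and iptg
    # Alpha-complementation restores beta-galactosidase activity,
    # and active enzyme turns X-gal into the blue indigo pigment.
    if makes_alpha_peptide and xgal:
        return "blue"
    return "white"


if __name__ == "__main__":
    print("empty vector:      ", colony_phenotype(True, False))
    print("recombinant vector:", colony_phenotype(True, True))
    print("no plasmid:        ", colony_phenotype(False, False))
```

In this model the colonies worth picking are the ones that return "white" on a plate containing antibiotic, IPTG, and X-gal, since those are the ones expected to carry an insert.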

References
Burstone MS (1962) Enzyme histochemistry and its application in the study of neoplasms. Academic Press, New York
Horwitz JP, Chua J, Curby RJ, Tomson AJ, Darooge MA, Fisher BE, Mauricio J, Klundt I (1964) Substrates for cytochemical demonstration of enzyme activity. I. Some substituted 3-Indolyl-Beta-D-Glycopyranosides. J Med Chem 7:574–575
Langley KE, Villarejo MR, Fowler AV, Zamenhof PJ, Zabin I (1975) Molecular basis of beta-galactosidase alpha-complementation. Proc Natl Acad Sci U S A 72:1254–1257
Prentki P (1992) Nucleotide sequence of the classical lacZ deletion delta M15. Gene 122:231–232
Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor
Ullmann A, Jacob F, Monod J (1967) Characterization by in vitro complementation of a peptide corresponding to an operator-proximal segment of the beta-galactosidase structural gene of Escherichia coli. J Mol Biol 24:339–343

C

CAG Repeat Pathologies ▶ Polyglutamine Folding Diseases

Canonical vs Noncanonical DNA Pairs ▶ DNA Base Pairing, Modes of

Changes in DNA Melting Curves ▶ Adducts on Tm, Effects of

Charophytes ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Charophytes Plus Chlorophytes ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Chemical Denaturation Jeffrey K. Myers Department of Chemistry, Davidson College, Davidson, NC, USA

Synonyms Chemical unfolding; Solvent denaturation

Synopsis

Chemical denaturation is a means of rendering proteins nonfunctional via addition of denaturing agents (denaturants) to the solvent. The polypeptide remains chemically intact; denaturation occurs through unfolding of the precise, ordered three-dimensional shape that is typically required for biological function. Chemical denaturation is commonly used for measuring the conformational stability of globular proteins. While chemical denaturation has been in use for decades, the precise mechanism by which denaturants unfold proteins is still under investigation. Denaturants are also useful for solubilizing otherwise insoluble forms of proteins and can be used to dissolve aggregates. Chemically denatured proteins are often used as starting points in kinetic studies of protein folding. Although chemical denaturation is considered the strongest means of denaturing proteins, chemically denatured

76

Chemical Denaturation

proteins may still contain significant levels of residual structure.

Introduction

The term denaturation, as applied to proteins, typically refers to a loss of function. In order to function, most proteins are folded into specific conformations (shapes) that are thermodynamically stable (Anfinsen 1973). Chemical denaturation is a process by which this structure is destroyed through chemical means, via addition of denaturing agents (denaturants) to the solvent. The denatured protein is chemically intact but has lost most of the ordered noncovalent structure that is found in most native proteins. Chemical denaturation is a common method for experimentally determining a protein’s conformational stability, which refers to the thermodynamic stability of the native, functional conformation relative to nonfunctional conformations. Practically speaking, the cooperativity of protein folding allows consideration of groups of conformations (ensembles) that behave as distinct thermodynamic states. Often, an equilibrium involving only two such states is required and conformational stability can thus be defined as the difference in free energy between the native (folded, functional) state and the denatured (unfolded, nonfunctional) state:

$$\text{Native} \rightleftharpoons \text{Denatured}$$

$$\Delta G_u = -RT \ln\left(\frac{[\text{denatured}]}{[\text{native}]}\right) \tag{1}$$

where ΔGu is the free energy difference between native and denatured states, R is the gas constant, and T is the temperature in kelvin. This two-state unfolding model is applicable to many small, single-domain proteins. However, if intermediate states are observable, then a multistate equilibrium may be considered:

$$\text{Native} \rightleftharpoons I_1 \rightleftharpoons I_2 \cdots \rightleftharpoons \text{Denatured}$$

and the following can be used to define conformational stability:

$$\Delta G_u = -RT \ln\left(\frac{\sum_{i=1}^{n} [\text{nonnative}]_i}{[\text{native}]}\right) \tag{2}$$

where the sum of [nonnative]i represents all other thermodynamic states of the protein besides the native state. Equation 2 holds for any number of observed conformational states and may serve as a general definition of conformational stability. The conformational stability of most globular proteins is marginal at best (20–40 kJ/mol under typical physiological conditions). A number of factors contribute to the stability, both stabilizing and destabilizing. The factors mostly cancel out, leading to a marginal stability of the native state. In addition to stability measurements, there are other uses for chemical denaturants including protein solubilization and generating a “starting point” for kinetic experiments of folding.
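To give a feel for what a stability of 20–40 kJ/mol means in terms of populations, the following minimal sketch applies Eq. 1 to a two-state protein. The temperature of 298.15 K and the function names are assumptions made here for illustration; they do not come from the text.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)

def unfolding_constant(dg_u, temperature=298.15):
    """K = [denatured]/[native] from Eq. 1, with dg_u in kJ/mol."""
    return math.exp(-dg_u / (R * temperature))

def fraction_native(dg_u, temperature=298.15):
    """Fraction of molecules in the native state for a two-state protein."""
    k = unfolding_constant(dg_u, temperature)
    return 1.0 / (1.0 + k)

# Bounds of the stability range quoted in the text (kJ/mol).
for dg in (20.0, 40.0):
    print(dg, unfolding_constant(dg), fraction_native(dg))
```

Even at the lower end of the range, the denatured state is populated by well under one molecule in a thousand under these assumed conditions, which is why a denaturant is needed to bring the unfolding equilibrium into a measurable range.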

Measuring Protein Conformational Stability via Chemical Denaturation

A common use for chemical denaturation is to determine the conformational stability of a globular protein under a given set of conditions. The chemical denaturant is used to perturb the protein’s unfolding equilibrium while monitoring the structural state of the protein. Common structural probes are optical spectroscopies such as UV absorbance, fluorescence or circular dichroism, more sophisticated techniques such as NMR, or simply monitoring biological activity. In chemical denaturation, additives are titrated into the sample while monitoring the folded state of the protein using some structural probe. Guanidine salts (usually guanidine hydrochloride or guanidine thiocyanate) are commonly used denaturants. They are very potent and are soluble in the molar concentrations necessary to complete denaturation. A nonionic alternative to guanidine is urea, which is less effective than guanidine hydrochloride but is even more soluble (up to about 10 M at room temperature). It can also be mixed with thiourea up to the solubility limits of both compounds to make an extremely potent denaturant (Arora et al. 2004). Unfortunately, thiourea is not UV transparent, hindering optical spectroscopy, but thiourea and thiourea/urea mixtures are well suited for NMR experiments if maximum denaturing power without high ionic strength is required.

Sample chemical denaturation data are given in Fig. 1, where the B-domain of protein A (a small three-helix bundle protein) has been denatured with guanidine hydrochloride. Since conformational stability is an equilibrium thermodynamic parameter, it is essential that the system be at equilibrium during the experiments and that the conformational transition be reversible. Usually, chemical denaturation is reversible, although this should be checked experimentally. Practical protocols and suggestions for chemical denaturation can be found in a variety of sources (Pace 1986; Walters et al. 2009). The analysis described below is appropriate for monomeric proteins. For oligomeric proteins, a more complex analysis is required since the conformational stability will depend on protein concentration (Walters et al. 2009).

Chemical Denaturation, Fig. 1 Chemical denaturation of BdpA using guanidine hydrochloride as a denaturant. Circular dichroism at 222 nm, which is sensitive to helical structure, is used as a structural probe. Solid red line is fit to Eq. 2; dotted lines are independently fit baselines

Upon completion of a chemical denaturation experiment, a curve like the one shown in Fig. 1 may be analyzed to generate two key pieces of data: the conformational stability and the m-value. The two baselines can be fitted independently, and the two-state model will give the equilibrium constant for unfolding (and thus ΔGu) at each denaturant concentration:

$$K = \frac{y_f - y}{y - y_u} \tag{3}$$

where y is the observed signal, yf and yu are the baseline values, and K is the unfolding equilibrium constant. Note that only data in the transition region, where both states are significantly populated, are useful for calculating K and ΔGu. Plotting the observed ΔGu values versus denaturant concentration usually gives a linear relationship (see Fig. 2). This empirical observation has led to the linear extrapolation method (LEM; Pace 1986) for determining the stability, where the y-intercept of the best-fit line is taken as ΔGu(H2O) (the conformational stability of the protein in the absence of denaturant) and the m-value (the linear dependence of ΔGu on denaturant concentration) is the slope:

$$\Delta G_u = \Delta G_u^{\mathrm{H_2O}} - m[D] \tag{4}$$
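The linear extrapolation procedure of Eqs. 3 and 4 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are synthetic and noise-free, generated from hypothetical parameters (a stability of 20 kJ/mol, an m-value of 8 kJ/(mol M), flat baselines, 298.15 K) rather than taken from Fig. 1, and the helper names are invented for this sketch.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)
T = 298.15          # assumed temperature in K

def unfolding_free_energy(y, y_f, y_u):
    """Eq. 3 followed by dG = -RT ln K, for one point in the transition region."""
    k = (y_f - y) / (y - y_u)
    return -R * T * math.log(k)

def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, noise-free data from hypothetical parameters (not from Fig. 1).
true_dg_h2o, true_m = 20.0, 8.0   # kJ/mol and kJ/(mol M)
y_f, y_u = -15.0, -2.0            # flat folded and unfolded baselines
conc = [2.0 + 0.1 * i for i in range(11)]   # 2.0 to 3.0 M, spanning the transition
signal = []
for d in conc:
    k = math.exp(-(true_dg_h2o - true_m * d) / (R * T))
    signal.append((y_f + y_u * k) / (1.0 + k))

# Convert each signal to a free energy, then extrapolate linearly to 0 M.
dg = [unfolding_free_energy(y, y_f, y_u) for y in signal]
slope, intercept = linear_fit(conc, dg)
m_value, dg_h2o = -slope, intercept   # Eq. 4: dG = dG_H2O - m[D]
print(round(dg_h2o, 3), round(m_value, 3))
```

With real data, points outside the transition region would have to be excluded first, because Eq. 3 becomes numerically unstable when y approaches either baseline.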

Chemical Denaturation, Fig. 2 Dependence of ΔGu on guanidine hydrochloride obtained from the data in Fig. 1, using the two-state model. Slope of the best-fit line is the m-value, and the intercept at 0 M denaturant is the conformational stability, ΔGu(H2O)

The two-state model and the LEM can be combined to give a single equation that represents the denaturation curve and can be fit to the data to give stability parameters:

$$y = \frac{\left(y_f + a_f[D]\right) + \left(y_u + a_u[D]\right)\exp\left(m\left([D] - [D]_{1/2}\right)/RT\right)}{1 + \exp\left(m\left([D] - [D]_{1/2}\right)/RT\right)} \tag{5}$$

where y is the observed signal, yf and yu are the intercepts of the pre- and post-transition baselines, af and au are the slopes of the baselines, [D] is the denaturant concentration, and [D]1/2 is the midpoint of the denaturation. Here, the conformational stability, ΔGu(H2O), can be determined as the product of [D]1/2 and the m-value.

The m-value is a measure of cooperativity, related to the steepness of the unfolding transition and also strongly correlated with the amount of solvent accessible surface area exposed in the unfolding process (Myers et al. 1995). Recently, Bolen and coworkers have been able to use transfer data on small model compounds, coupled with a thermodynamic model first promoted by Tanford (1970), to successfully predict m-values for urea and several stabilizing osmolytes (Auton et al. 2011). Their results suggest that both stabilizing osmolytes and destabilizing denaturants exert their thermodynamic influence mainly on the polypeptide backbone. Urea and other denaturants are preferentially accumulated near the backbone; thus, conformations of the protein that expose more backbone to solvent are stabilized. However, the results of Record and coworkers suggest that the situation is more complicated and can only be understood by parsing out the interaction of denaturant molecules with different functional groups on the protein (Guinn et al. 2011). The precise molecular mechanism of protein denaturants and stabilizers is still under active investigation.

A linear relationship between ΔGu and denaturant concentration is not always obeyed, and these systems will require models more complicated than the LEM to analyze (Johnson and Fersht 1995). Nonlinearity is usually most noticeable at low denaturant concentrations and may explain discrepancies between ΔGu(H2O) obtained from chemical denaturation and those obtained by other methods (such as thermal denaturation), although major discrepancies
seem rare. If a nonlinear relationship between ΔGu and [denaturant] is observed, more complicated extrapolation procedures can be applied to obtain ΔGu(H2O) (Tanford 1970; Johnson and Fersht 1995).

A more complete picture of unfolding thermodynamics can be obtained by performing denaturation experiments at different temperatures. This information can be utilized via a Gibbs-Helmholtz relation to generate a stability curve, that is, the unfolding free energy as a function of temperature:

$$\Delta G_u(T) = \Delta H_m\left(1 - T/T_m\right) + \Delta C_p\left(T - T_m - T\ln\left(T/T_m\right)\right) \tag{6}$$

where Tm is the midpoint temperature (at which ΔGu = 0), ΔHm is the enthalpy change at the midpoint temperature, and ΔCp is the heat capacity change. Another way of fitting Eq. 5 is to combine thermal and chemical denaturation data. Shown in Fig. 3 is a series of thermal denaturation experiments performed at different concentrations of guanidine hydrochloride (Dimitriadis et al. 2004). In principle, the same results can be obtained with chemical denaturation curves performed at different temperatures. The entire surface can then be globally fitted to determine thermodynamic stability parameters as a function of temperature and denaturant.

Chemical Denaturation, Fig. 3 Denaturation surface of BdpA, from thermal denaturation experiments in varying concentrations of guanidine hydrochloride; axes are CD signal (a.u.), [GuHCl] (M), and temperature (°C) (Reprinted from Dimitriadis et al. 2004, copyright the National Academy of Sciences)

A simple two-state unfolding model can be successfully applied to many proteins, particularly small, single-domain proteins. However,
proteins may unfold while populating one or more intermediate states. If this is the case, more complex fitting equations can be used to analyze multistate denaturation curves (Walters et al. 2009). Equation 2 can still be used to define conformational stability regardless of whether intermediate states are observed since it takes into account the presence of any number of observed thermodynamic states. To experimentally test the appropriateness of the two-state model, the denaturation experiment can be repeated using different probes of structure to follow unfolding. If multiple probes give coincident denaturation curves, then the two-state model may be suitable. In practice, a simple sigmoidal shape (such as that in Fig. 1) is not enough evidence; multistate denaturation curves may take on a simple appearance depending on the free energies and m-values of the intermediates.

Chemical denaturation of membrane proteins is possible, by using either traditional denaturants or denaturing detergents. For example, the denaturing detergent sodium dodecyl sulfate (SDS) has been used to denature membrane proteins solubilized in detergent micelles (Hong et al. 2009). The resulting denatured protein is not fully unfolded, but rather the helical transmembrane regions remain helical in SDS micelles. Traditional protein denaturants may also be tried. For example, urea has been shown to denature the beta barrel E. coli outer membrane protein OmpA (Hong et al. 2009). The unfolded protein is soluble in high urea concentrations and leaves the membrane environment as a result. Thus, the measured free energy includes both the removal from the membrane environment and the unfolding of the protein. Since the denatured states differ greatly for these two experiments, the free energy changes and m-values must be interpreted circumspectly.
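The two fitting expressions used in this section, the combined two-state/LEM curve (Eq. 5) and the stability curve (Eq. 6), can be written as plain functions. This is a minimal sketch: every parameter value below is hypothetical and chosen only to exercise the functions, the default temperature of 298.15 K is an assumption, and energies are taken to be in kJ/mol.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)

def two_state_signal(d, y_f, a_f, y_u, a_u, m, d_half, temperature=298.15):
    """Eq. 5: observed signal at denaturant concentration d (M)."""
    k = math.exp(m * (d - d_half) / (R * temperature))
    folded = y_f + a_f * d      # pre-transition baseline
    unfolded = y_u + a_u * d    # post-transition baseline
    return (folded + unfolded * k) / (1.0 + k)

def stability_curve(temperature, dh_m, t_m, dcp):
    """Eq. 6: unfolding free energy as a function of temperature (K)."""
    return (dh_m * (1.0 - temperature / t_m)
            + dcp * (temperature - t_m - temperature * math.log(temperature / t_m)))

# Hypothetical parameters, for illustration only.
params = dict(y_f=-15.0, a_f=0.2, y_u=-2.0, a_u=0.1, m=8.0, d_half=2.5)
midpoint_signal = two_state_signal(2.5, **params)   # halfway between the baselines
dg_h2o = params["m"] * params["d_half"]             # stability = m * [D]1/2
dg_at_tm = stability_curve(330.0, dh_m=200.0, t_m=330.0, dcp=5.0)
print(midpoint_signal, dg_h2o, dg_at_tm)
```

In practice these functions would be handed to a nonlinear least-squares routine; the fitting step is left out here so that the sketch depends only on the standard library.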

Other Uses for Chemical Denaturation

Chemical denaturants are strong solubilizing agents for proteins, since the polypeptide backbone is much more soluble in denaturant solutions than in pure water (Auton et al. 2011). Chemical denaturation may therefore be only a by-product of an effort to solubilize protein that has been trapped in a solid aggregate. Denaturant solubilization steps are often parts of purification protocols, as overexpression of a protein in a model organism often results in insoluble inclusion bodies (a type of intracellular aggregate). After solubilization, the denaturant can be diluted out, and in ideal cases, the protein refolds into its native, functional conformation.

Chemically denatured proteins may also be useful as starting points for studies of folding kinetics. Rapid dilution of the denaturant is a typical way of starting the folding process. Rate constants for protein folding and unfolding are typically a function of denaturant, and measurement of these “kinetic m-values” can be useful for characterizing the transition state for the folding and unfolding process (see Walters et al. 2009 for further details). Adding mutational analysis provides a very powerful tool for characterizing folding pathways.

In any use of chemical denaturation, it must be kept in mind that even very high concentrations of denaturants may not fully unfold proteins. Evidence has accumulated over the years that residual structure may be present in these denatured states (Baldwin 2002). It should not be assumed that a chemically denatured protein is devoid of structure.

Cross-References
▶ NMR Approaches to Determine Protein Structure

References
Anfinsen C (1973) Principles that govern the folding of protein chains. Science 181:223–230
Arora P, Oas T, Myers J (2004) Fast and faster: a designed variant of the B-domain of protein A folds in 3 microseconds. Protein Sci 13:847–853
Auton M, Rosgen J, Sinev M et al (2011) Osmolyte effects on protein stability and solubility: a balancing act between backbone and side-chains. Biophys Chem 159:90–99
Baldwin R (2002) A new perspective on unfolded proteins. Adv Protein Chem 62:361–367
Dimitriadis G, Drysdale A, Myers J et al (2004) Microsecond folding dynamics of the F13W/G29A variant of the B-domain of protein A by laser-induced temperature jump. Proc Natl Acad Sci U S A 101:3809–3816
Guinn E, Pegram L, Capp M et al (2011) Quantifying why urea is a protein denaturant whereas glycine betaine is a protein stabilizer. Proc Natl Acad Sci U S A 108:16932–16937
Hong H, Joh N, Bowie J et al (2009) Methods for measuring the thermodynamic stability of membrane proteins. Methods Enzymol 455:213–236
Johnson C, Fersht A (1995) Protein stability as a function of denaturant concentration: the thermal stability of barnase in the presence of urea. Biochemistry 34:6795–6804
Myers J, Scholtz J, Pace C (1995) Denaturant m values and heat capacity changes: relation to changes in accessible surface areas of unfolding. Protein Sci 4:2138–2148
Pace C (1986) Determination and analysis of urea and guanidine hydrochloride denaturation curves. Methods Enzymol 131:266–280
Tanford C (1970) Protein denaturation: part C. Theoretical models for the mechanism of denaturation. Adv Protein Chem 24:1–95
Walters J, Milam S, Clark A (2009) Practical approaches to protein folding and assembly: spectroscopic strategies in thermodynamics and kinetics. Methods Enzymol 455:1–39

Chemical Kinetics ▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks

Chemical Reaction Kinetics: Mathematical Underpinnings

John Wesley Cain Department of Mathematics and Computer Science, University of Richmond, Richmond, VA, USA

Synopsis

Mathematical modeling and simulation of biochemical reaction networks facilitates our understanding of metabolic and signaling processes. For closed, well-mixed reaction systems, it is straightforward to derive kinetic equations that govern the concentrations of the reactants and products. The usual way of deriving kinetic equations involves application of the principle of conservation of mass in conjunction with the law of mass action. Here, examples of kinetic models for several basic processes are discussed.

Introduction

A century has passed since Michaelis and Menten described a mechanism for enzyme-mediated conversion of a substrate into a product. The kinetics of such biochemical reaction processes can be analyzed mathematically and simulated on computers, usually by appealing to the law of mass action. Kinetic models are typically presented as systems of differential equations (DEs) or continuous time Markov chains (▶ “Mathematical Models in the Sciences”). There are advantages to both types of models, but in what follows, the former are developed because the equations are (i) easy to write down by applying the law of mass action and (ii) deterministic in the sense that the kinetic parameters and initial state of the reaction system completely determine the future states.

Conservation of Mass

At the most basic level, models of chemical reaction kinetics boil down to invoking the principle of conservation of mass. Suppose that the mass of a particular chemical species varies over the course of a reaction, and let M(t) denote the mass of the species at time t. A short time Δt later, the mass will be

$$M(t + \Delta t) = M(t) + \text{mass influx} - \text{mass efflux},$$

where the influx and efflux are measured over the time interval from t to t + Δt. Let fin(t) and fout(t) denote the instantaneous mass influx and efflux (in mass per unit time) at time t. Then because Δt is assumed to be small, the total mass influx and efflux over the aforementioned time interval are approximately fin(t)Δt and fout(t)Δt, respectively. Inserting these approximations into the above conservation equation yields

$$\frac{M(t + \Delta t) - M(t)}{\Delta t} \approx f_{in}(t) - f_{out}(t),$$

which in the limit as Δt → 0 gives an exact description for the rate of change of the mass:

$$\frac{dM}{dt} = f_{in} - f_{out}. \tag{1}$$

In words, this DE states that the rate of change of the mass is equal to the difference between the mass influx and the mass efflux.

Closed, Well-Mixed Systems

Experimental data from chemical reactions tend to involve concentrations (mass per unit volume) as opposed to mass alone, and therefore, it is helpful to scale the above DE by dividing by volume. Consider, for example, a simple, reversible system A ⇌ B of constant volume V. Assume that the system is closed in that there is no flux of A or B into or out of the system and that the system is well mixed in the sense that the concentrations of A and B do not vary spatially within the reaction vessel. Letting [A] and [B] denote concentrations, the total masses of the two species are [A]V and [B]V, respectively. If fAB and fBA denote the mass fluxes associated with transitions from A to B and from B to A, respectively, then the conservation of mass equation implies that

$$V\frac{d[A]}{dt} = f_{BA} - f_{AB} \quad\text{and}\quad V\frac{d[B]}{dt} = f_{AB} - f_{BA}. \tag{2}$$

The quantities on both sides of the DEs in Eq. 2 have units of mass/time. Notice that adding the two DEs in Eq. 2 yields a single DE that describes the total mass of A and B:

$$V\frac{d\left([A] + [B]\right)}{dt} = 0,$$

which implies that the total mass V([A] + [B]) is constant and reaffirms that mass is conserved. Dividing Eq. 2 by the volume V leads to DEs that govern the concentrations of the two species:

$$\frac{d[A]}{dt} = J_{BA} - J_{AB} \quad\text{and}\quad \frac{d[B]}{dt} = J_{AB} - J_{BA}, \tag{3}$$

where the chemical fluxes (or concentration fluxes) are defined by dividing the mass fluxes by the total volume; i.e., JAB = fAB/V and JBA = fBA/V. The quantities on both sides of the DEs in Eq. 3 have units of concentration/time.

The primary challenge in the quantitative study of chemical kinetics is to mathematically model fluxes such as JAB and JBA in the above example. The most common approach towards doing so is to invoke the law of mass action, which is now developed. One must be mindful that the use of the word law is overpromising, but it is reassuring that the law of mass action serves as an excellent approximation for the dynamics of certain elementary reaction processes.
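The structure of Eq. 3 guarantees that [A] + [B] stays constant whatever form the fluxes take, because every term that leaves one equation enters the other. The sketch below checks this numerically with a forward-Euler integrator. The two flux laws, the initial concentrations, and the step size are placeholders assumed here purely for illustration; the text has not yet specified how the fluxes should be modeled.

```python
def simulate_closed_system(a0, b0, flux_ab, flux_ba, dt, steps):
    """Forward-Euler integration of Eq. 3 for a closed A <-> B system."""
    a, b = a0, b0
    history = [(0.0, a, b)]
    for i in range(1, steps + 1):
        j_ab = flux_ab(a, b)   # chemical flux from A to B
        j_ba = flux_ba(a, b)   # chemical flux from B to A
        a += dt * (j_ba - j_ab)
        b += dt * (j_ab - j_ba)
        history.append((i * dt, a, b))
    return history

# Placeholder flux laws, assumed only for illustration.
history = simulate_closed_system(
    a0=8.0, b0=2.0,
    flux_ab=lambda a, b: 1.0 * a,
    flux_ba=lambda a, b: 0.1 * b,
    dt=0.001, steps=5000,
)
totals = [a + b for _, a, b in history]
print(min(totals), max(totals))
```

Swapping in any other pair of flux functions should leave the printed minimum and maximum equal up to rounding, which is the numerical counterpart of the statement that the total mass is conserved.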

Mass Action Kinetics Through Examples

Rather than stating the law of mass action in full generality, here it is explored via a collection of progressively more substantive examples.

Example 1 Recall the simple, reversible reaction A ⇌ B described above. Assume that during the conversion of A to B, molecules of A do not interact with one another chemically and vice versa for the reverse reaction B → A. In this case, the law of mass action states that the chemical fluxes JAB and JBA are proportional to the concentrations [A] and [B], respectively. That is, there exist constants k+ and k− such that

$$J_{AB} = k_+[A] \quad\text{and}\quad J_{BA} = k_-[B].$$

These relationships are reasonably intuitive in this case: given a small time window, each molecule of A has some probability of converting to a molecule of B, suggesting that doubling the concentration [A] ought to double the flux JAB (and similarly for the conversion of B to A). The constants k+ and k− are called kinetic constants or rate constants for this reaction, and in this example, they have units of (time)⁻¹. The larger the value of a rate constant, the faster the associated reaction proceeds. There is a standard notational convention of writing rate constants above or below the arrows associated with each process in the reaction diagram; e.g.,

$$A \underset{k_-}{\overset{k_+}{\rightleftharpoons}} B$$

for the present example. Armed with the mass action assumptions above, the DEs Eq. 3 take the form

$$\frac{d[A]}{dt} = k_-[B] - k_+[A] \quad\text{and}\quad \frac{d[B]}{dt} = k_+[A] - k_-[B]. \tag{4}$$

If the rate constants k+ and k− are measured from experimental data (see also the conclusion of Example 2 below) and the initial concentrations of A and B are known, then it is possible to solve the DEs on a computer to obtain plots of [A] and [B] versus time. As an illustration, Fig. 1 shows the solution of Eq. 4 assuming that the initial
concentrations of A and B are 8 mmol/L and 2 mmol/L, respectively, and that the forward and reverse rate constants are k+ = 1.0 s⁻¹ and k− = 0.1 s⁻¹. Because of the simplicity of this chemical process and the fact that there are no complicated interactions involved, the system of DEs in Eq. 4 can actually be solved exactly, meaning that it is possible to derive precise mathematical formulas for [A] and [B] as functions of time without resorting to computer approximations and/or simulations. For the particular parameter choices and initial conditions listed above, concentrations as a function of time are given by

$$[A] = \frac{2}{11}\left(39e^{-1.1t} + 5\right) \quad\text{and}\quad [B] = \frac{2}{11}\left(50 - 39e^{-1.1t}\right),$$

both of which are graphed in Fig. 1. The luxury of exact solutions is not to be expected in practice, as illustrated in the third example.

Chemical Reaction Kinetics: Mathematical Underpinnings, Fig. 1 Concentrations [A] and [B] as functions of time for the simple, reversible process in Example 1, using the initial conditions and rate constants mentioned in the text

Example 2 For elementary reactions of the form A + B → C in which two reactants must interact to form a product, the law of mass action states that the rate of change of the product concentration [C] is proportional to the (mathematical) product of the individual reactant concentrations. Mathematically, the rate of change of [C] is equal to k[A][B] where k is some rate constant. (Schematically, one writes this as $A + B \xrightarrow{k} C$.) The basis for this more advanced application of the law of mass action relies upon rather sophisticated statistical mechanics, but here is the intuition. Suppose that molecules of A and B both move randomly in a closed, well-mixed system of constant volume. Then doubling either [A] or [B] should roughly double the probability that a molecule of A collides with a molecule of B, combining to form the product C.

With the preceding paragraph in mind, consider an autocatalytic process in which a species promotes its own production; schematically,

$$A + B \underset{k_-}{\overset{k_+}{\rightleftharpoons}} B + B.$$
Further suppose that [A] is held constant throughout this process; e.g., a huge abundance of A makes its depletion negligible during the reaction. Applying the law of mass action, the DE governing [B] is

$$\frac{d[B]}{dt} = k_+[A][B] - k_-[B]^2 \tag{5}$$

where, for emphasis, [A] is regarded as constant by assumption. It is important to pause and take inventory of the physical units of the quantities in Eq. 5. The left-hand side of the equation has units of concentration/time. Thus, in order for the DE to be dimensionally consistent, both k+ and k− must have units of (concentration)⁻¹(time)⁻¹. Dropping the brackets and capital letters for notational convenience, the DE db/dt = k+ab − k−b² happens to be exactly solvable. It is an example of a separable DE and, for readers familiar with the technique of separation of variables, one may manipulate the DE algebraically to obtain

$$\frac{1}{k_+ab - k_-b^2}\,\frac{db}{dt} = 1$$

and subsequently integrate both sides with respect to t, yielding

$$\int \frac{1}{k_+ab - k_-b^2}\,db = \int 1\,dt.$$
Provided that the denominator in the integral on the left-hand side is nonzero (see also the section on Equilibria below), that integral can be evaluated using a partial fraction decomposition of the integrand, ultimately allowing one to solve for b as a function of t. The solution of Eq. 5 is

$$b(t) = \frac{Cak_+e^{ak_+t}}{1 + Ck_-e^{ak_+t}}, \tag{6}$$

where

$$C = \frac{b_0}{ak_+ - b_0k_-}$$

is a constant whose value depends on the initial concentration b0, the rate constants, and the concentration of A. If the derivation of formula Eq. 6 seems mysterious, fear not. Remember that it is rarely possible or necessary to find an exact mathematical formula for the solution of a DE, so this example is atypical in that regard. For practical purposes, computer simulations can provide approximate solutions of DEs, often with a level of precision that the user is able to specify. Still, the rare fortune of having an exact formula Eq. 6 is worth exploiting, as it suggests what type of functions might be suited for fitting experimental data. Figure 2 shows a least squares regression fit (▶ “Mathematics of Fitting Scientific Data”) of a function of the form Eq. 6 to experimental data from an autocatalytic process similar to the one described in this example. Importantly, performing a regression fit produces estimates for the kinetic constants k+ and k−.

Chemical Reaction Kinetics: Mathematical Underpinnings, Fig. 2 Fitting a sigmoidal function of the form Eq. 6 to an actual data set from an autocatalytic process similar to the one described in Example 2

Example 3 Here is an example that goes beyond the two previous ones: using the law of mass action to develop the well-known Michaelis-Menten model for enzyme (E)-mediated conversion of a substrate (S) into a product (P) via an intermediate substrate-enzyme complex (C). The reaction diagram for that process is given by

$$S + E \underset{k_-}{\overset{k_+}{\rightleftharpoons}} C \xrightarrow{k_2} P + E.$$
The free substrate S and enzyme E are converted to a substrate-enzyme complex C with rate constant k+. Molecules of the complex C may either revert back to S and E (with rate constant k−) or produce P and E (with rate constant k2). There are four different concentrations to track (S, E, C, and P), each of which will be treated as a dependent variable and each of which will contribute a DE to the kinetic model. The full set of Michaelis-Menten equations reads

$$\begin{aligned}
\frac{dS}{dt} &= k_-C - k_+SE \\
\frac{dE}{dt} &= k_-C - k_+SE + k_2C \\
\frac{dC}{dt} &= k_+SE - k_-C - k_2C \\
\frac{dP}{dt} &= k_2C,
\end{aligned} \tag{7}$$

where, once again, brackets have been suppressed when writing concentrations. The system Eq. 7 may seem a bit daunting in comparison with the single DE in the autocatalysis example. The four DEs in Eq. 7 are coupled in the sense that changing one concentration may influence multiple equations, which is somewhat intuitive through examining the reaction diagram. After all, if the concentration of the substrate-enzyme complex C were suddenly doubled, then the reaction diagram suggests that the rates of change of all four concentrations would be affected. That same conclusion could be drawn by observing that the
variable C appears on the right-hand side of all four DEs in Eq. 7.

Despite the interdependencies among these four variables, the system Eq. 7 is not nearly as bad as it may first appear. Notice that the right-hand sides of the DEs for dE/dt and dC/dt sum to 0, implying that E + C must remain constant during this reaction. From a chemistry standpoint, this makes perfect sense: the total amount of free enzyme and bound enzyme remains constant for this closed system. If Etot, a constant, represents the total amount of enzyme, then the variable E can be eliminated from all of the equations above since E = Etot − C. This effectively reduces the model to three different variables: S, C, and P. In fact, one further reduction is possible: The equations for dS/dt and dC/dt are not influenced by P. Therefore, if that subsystem of just two DEs were solved yielding formulas for S and C, then a formula for P can be obtained immediately. Specifically, once C is determined as a function of time t, that function can be substituted into the right-hand side of the dP/dt equation which can then be integrated to find P.

The reduction of a system of four DEs to a somewhat less menacing set of two DEs represents a victory, albeit a partial one. The reduced system is still too complicated to solve by hand, and ultimately one must resort to computer simulations to plot approximate solutions. Figure 3 shows the solution of Eq. 7 for the specific parameter choices k+ = 10.0 mM⁻¹ s⁻¹, k− = 1.0 s⁻¹, and k2 = 1.0 s⁻¹ and with initial conditions S(0) = 1.0 mM, E(0) = 0.1 mM, C(0) = 0.0 mM, and P(0) = 0.0 mM.

Chemical Reaction Kinetics: Mathematical Underpinnings, Fig. 3 Solutions of the Michaelis-Menten Eq. 7 for the particular parameter set given in the text
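A computer simulation of Eq. 7 with the parameter set quoted above might look like the following sketch. The rate constants and initial conditions are those given in the text; the classical fourth-order Runge-Kutta integrator, the step size of 0.001 s, and the end time of 50 s are choices made here for illustration and are not taken from the source.

```python
def michaelis_menten_rhs(state, k_plus=10.0, k_minus=1.0, k2=1.0):
    """Right-hand sides of Eq. 7 for state = (S, E, C, P), in mM and s."""
    s, e, c, p = state
    binding = k_plus * s * e
    return (
        k_minus * c - binding,
        k_minus * c - binding + k2 * c,
        binding - k_minus * c - k2 * c,
        k2 * c,
    )

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    s1 = f(state)
    s2 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, s1)))
    s3 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, s2)))
    s4 = f(tuple(x + dt * k for x, k in zip(state, s3)))
    return tuple(
        x + dt * (a + 2 * b + 2 * c + d) / 6.0
        for x, a, b, c, d in zip(state, s1, s2, s3, s4)
    )

state = (1.0, 0.1, 0.0, 0.0)   # S(0), E(0), C(0), P(0) in mM
dt, t_end = 0.001, 50.0        # assumed step size and end time (s)
trajectory = [state]
for _ in range(int(round(t_end / dt))):
    state = rk4_step(michaelis_menten_rhs, state, dt)
    trajectory.append(state)

s, e, c, p = state
print(s, e, c, p)
```

Two conserved quantities provide a convenient sanity check on any such simulation: E + C should remain at the total enzyme concentration, as argued above, and S + C + P should remain at the initial substrate concentration because the right-hand sides of the corresponding DEs sum to zero.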

Equilibria and Qualitative Analysis

The Michaelis-Menten Eq. 7 highlighted the primary limitation of pen-and-paper mathematical analyses of chemical kinetics: when a model contains complex, nonlinear interactions between dependent variables which are interdependent on one another, finding exact formulas for all of those variables is an intractable problem. In order for those models to provide useful quantitative predictions, one would need to use a computer to approximate the behaviors of all of the variables. On the other hand, it may still be possible to extract powerful qualitative information from model equations without actually attempting to solve the equations. For example, it is often possible to determine how/whether a particular reaction will achieve chemical equilibrium, a steady state in which each concentration approaches some constant in the long run (see ▶ “Equilibria and Bifurcations in the Molecular Biosciences” for details).

Recall the DE model Eq. 5 for an autocatalytic process: db/dt = k+ab − k−b², where k+ and k− are (positive) kinetic constants and a represents a (positive constant) concentration of an abundant reactant. Identifying the equilibrium states of this process amounts to finding specific values of the dependent variable b for which db/dt = 0; i.e., so that the concentration b will not change. The trivial possibility b = 0 represents an equilibrium that is not terribly interesting from a chemistry standpoint. By algebra, there is another possibility:

$$b = b_{eq} = a\left(\frac{k_+}{k_-}\right) \tag{8}$$

is a nontrivial equilibrium value for this process. The appearance of the ratio of forward and reverse rate constants is certainly plausible here: if the forward reaction is much faster than the reverse reaction, then it makes sense that beq should be larger. In some sense, Eq. 8 gives the more “natural” or “chemically relevant” equilibrium of Eq. 5. To clarify this remark, it helps to factor the right-hand side of the autocatalysis DE, yielding

$$\frac{db}{dt} = b\left(k_+a - k_-b\right).$$

Bearing in mind that a, k+, and k− are positive constants, suppose that the initial concentration of b is near but not equal to the trivial equilibrium b = 0. More exactly, suppose that b is positive but
smaller than $b_{\mathrm{eq}}$, and examine the two factors of the right-hand side of the DE. Both of the factors are positive based upon our assumption that $0 < b < b_{\mathrm{eq}}$, which means that $db/dt > 0$. Because the rate of change of $b$ is positive, $b$ would have to increase away from the trivial equilibrium 0 and towards the nontrivial one $b_{\mathrm{eq}}$. Likewise, if the concentration $b$ was ever larger than $b_{\mathrm{eq}}$, then the two factors on the right-hand side of the DE have opposite signs, implying that $db/dt < 0$. Hence, $b$ would then decrease and gradually relax back to $b_{\mathrm{eq}}$.

In this example, $b = 0$ is a (mathematically) unstable equilibrium. If the initial condition $b(0)$ is near, but not at, zero concentration of $b$, then one expects the system to evolve to a state far from the $b = 0$ equilibrium. By contrast, the equilibrium $b = b_{\mathrm{eq}}$ is an example of a stable, attracting equilibrium: starting from any initial concentration that is sufficiently close to $b_{\mathrm{eq}}$, one expects $b$ to approach $b_{\mathrm{eq}}$ in the long run (steady state).

Notice that in the previous paragraph, there was no attempt to solve the autocatalysis DE, and yet valuable qualitative information was extracted: a characterization of the long-term steady-state dynamics of the system for every possible initial condition. Such qualitative analysis can be extended to “higher-dimensional” systems (i.e., more dependent variables) like the Michaelis-Menten Eq. 7; for details, see Chapter 6 of Strogatz (1994).

Mathematically, what should it mean to have an equilibrium of Eq. 7? In order for that system to be in chemical equilibrium, none of the four concentrations $S$, $E$, $C$, and $P$ can change over time, meaning that all four of the derivatives in Eq. 7 must simultaneously be zero. To seek equilibria, begin by focusing on the last of the four DEs in Eq. 7: $dP/dt = k_2 C$. Since $k_2$ is a positive constant, it must be the case that $C = 0$ in order to achieve $dP/dt = 0$. Substituting $C = 0$ into the right-hand sides of the other three DEs, the only remaining terms all contain the (mathematical) product $SE$. Apparently, the system would be in equilibrium if one of the concentrations $S$ or $E$ was zero, the other being completely arbitrary. This would seem to suggest that there are infinitely many equilibria, but there is a lesson here: when working with mathematical models in chemistry or any other field, do not lose sight of reality. Would it make sense for this system to have infinitely many equilibrium states to choose from? Recall that $E_{\mathrm{tot}}$, the total amount of the enzyme in the free or bound state, is conserved, and $E + C = E_{\mathrm{tot}}$. Since $C$ must be 0 when equilibrium is achieved, it follows that $E = E_{\mathrm{tot}}$ at steady state. Because the product $SE$ must vanish and $E_{\mathrm{tot}} > 0$, this in turn forces $S = 0$.

Finally, what will be the steady-state (a stable equilibrium in this case) value of the product concentration $P$? As a step towards answering this, observe that there is another conserved quantity lurking in the original system of four DEs: the expressions for $dS/dt$, $dC/dt$, and $dP/dt$ sum to zero. This implies that the quantity $S + C + P$ remains constant (call it t) during this enzymatic process. Because both $S$ and $C$ approach zero at steady state, it follows that $P$ must approach t. The above was a purely heuristic argument that the closed, well-mixed Michaelis-Menten system should approach the stable, attracting equilibrium $S = 0$, $E = E_{\mathrm{tot}}$, $C = 0$, and $P$ = t in the long run. There are standard mathematical techniques (Strogatz 1994) for rigorously analyzing the stability of equilibria of systems of DEs such as this one.
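The qualitative conclusions of this section can be checked numerically. The following is a minimal sketch, not part of the original entry: it integrates the autocatalysis model Eq. 5 from initial concentrations below and above the nontrivial equilibrium, and then integrates a Michaelis-Menten system to confirm the two conserved quantities and the long-run state. All parameter values are hypothetical, the names `K_ON`, `K_OFF`, and `K_CAT` are introduced here for illustration (`K_CAT` plays the role of k2), and the Michaelis-Menten right-hand side assumes that Eq. 7 has the standard mass-action form for S + E ⇌ C → P + E.

```python
"""Numerical check of the qualitative equilibrium arguments (illustrative values)."""


def rk4(rhs, y, t_end, dt):
    """Integrate dy/dt = rhs(y) from t = 0 to t_end with fixed-step classical RK4."""
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(y)
        k2 = rhs([v + 0.5 * dt * k for v, k in zip(y, k1)])
        k3 = rhs([v + 0.5 * dt * k for v, k in zip(y, k2)])
        k4 = rhs([v + dt * k for v, k in zip(y, k3)])
        y = [v + dt * (a + 2 * b + 2 * c + d) / 6.0
             for v, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y


# Hypothetical parameter values, chosen only for illustration.
K_PLUS, K_MINUS, A = 2.0, 0.5, 1.0
B_EQ = K_PLUS * A / K_MINUS  # nontrivial equilibrium from Eq. 8


def autocatalysis(y):
    """Eq. 5 in factored form: db/dt = b (k_plus a - k_minus b)."""
    b = y[0]
    return [b * (K_PLUS * A - K_MINUS * b)]


# Assumed mass-action form of Eq. 7; the constants are hypothetical.
K_ON, K_OFF, K_CAT = 10.0, 1.0, 1.0


def michaelis_menten(y):
    """Right-hand sides for the state [S, E, C, P]."""
    s, e, c, p = y
    bind, release, convert = K_ON * s * e, K_OFF * c, K_CAT * c
    return [-bind + release,
            -bind + release + convert,
            bind - release - convert,
            convert]


def long_run_states():
    """Integrate both models long enough to reach their steady states."""
    b_from_below = rk4(autocatalysis, [0.01], t_end=20.0, dt=0.01)[0]
    b_from_above = rk4(autocatalysis, [10.0], t_end=20.0, dt=0.01)[0]
    # Initial state: S = 1, E = E_tot = 0.1, C = 0, P = 0.
    mm_final = rk4(michaelis_menten, [1.0, 0.1, 0.0, 0.0], t_end=100.0, dt=0.01)
    return b_from_below, b_from_above, mm_final


if __name__ == "__main__":
    low, high, (s, e, c, p) = long_run_states()
    print("autocatalysis end states:", low, high, "predicted:", B_EQ)
    print("Michaelis-Menten end state (S, E, C, P):", s, e, c, p)
```

With these choices, both autocatalysis runs should settle at the value predicted by Eq. 8, and the Michaelis-Menten run should end with S and C near zero, E near its initial total, and P near the initial value of S + C + P. A fixed-step Runge-Kutta integrator is used only to keep the sketch free of dependencies; an adaptive or stiff solver would be a more natural choice when the rate constants differ by many orders of magnitude.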

Discussion and Further Reading

The law of mass action offers a very systematic process for writing down DE-based chemical kinetic models, but the last example is a step towards understanding the challenges and roadblocks of the modeling process. Solving a system of DEs on a computer usually requires the user to provide (i) initial conditions for each dependent variable (concentration) and (ii) values for each parameter (kinetic constant). Biochemical reaction networks can easily have hundreds of parameters, most of which would be difficult, costly, or impossible to measure experimentally. It helps that, in many cases, values of individual rate constants may be less important than ratios of rate constants in terms of influencing dynamical behavior. Also, when rate constants have different orders of magnitude, it may happen that some processes are much faster than others. The plots of [E] and [C] in Fig. 3 illustrate this concept: after the rapid adjustment near t = 0, both concentrations evolve over a much slower time scale. This phenomenon is a consequence of the order-of-magnitude difference between $k_+$ and the other two rate constants, and the complexity of the model can be reduced by approximating the rapid initial transient as being instantaneous.

Readers interested in more advanced mathematical modeling techniques may wish to read about asymptotic methods, principal components analysis, sensitivity analysis, and scaling and non-dimensionalization, as a means for reducing the number of parameters in a model. For a mathematical reference on chemical kinetics see, for example, Beard and Qian (2008), and for general references on mathematical models in biology, biochemistry, and biomedicine, see Keener and Sneyd (2009), Murray (2002/2003), or Plonsey and Barr (2000).
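As a brief illustration of how scaling removes parameters, consider again the autocatalysis model Eq. 5. The rescaled variables $u$ and $\tau$ are notation introduced here and do not appear in the original entry. The second part of the sketch assumes that Eq. 7 has the standard mass-action form with rate constants $k_+$, $k_-$, and $k_2$, and it uses the usual quasi-steady-state approximation as one way of treating the fast transient as instantaneous.

```latex
% Scaling Eq. 5: measure b in units of b_eq and time in units of 1/(k_+ a).
\begin{align*}
\frac{db}{dt} &= b\,(k_+ a - k_- b), \qquad b_{\mathrm{eq}} = \frac{k_+}{k_-}\,a, \\
u &= \frac{b}{b_{\mathrm{eq}}}, \qquad \tau = k_+ a\, t
\quad\Longrightarrow\quad
\frac{du}{d\tau} = u\,(1 - u).
\end{align*}

% Quasi-steady state for the complex (assumed form of Eq. 7, with E = E_tot - C):
\begin{align*}
0 \approx \frac{dC}{dt} &= k_+ S\,(E_{\mathrm{tot}} - C) - (k_- + k_2)\,C
\quad\Longrightarrow\quad
C \approx \frac{E_{\mathrm{tot}}\, S}{K_M + S}, \qquad K_M = \frac{k_- + k_2}{k_+}, \\
\frac{dP}{dt} &= k_2 C \approx \frac{k_2 E_{\mathrm{tot}}\, S}{K_M + S}.
\end{align*}
```

In the rescaled autocatalysis equation no parameters remain: the three constants only set the units of concentration and time, and the equilibria sit at $u = 0$ (unstable) and $u = 1$ (stable). In the reduced Michaelis-Menten rate law the three rate constants enter only through the two combinations $K_M$ and $k_2 E_{\mathrm{tot}}$, which is consistent with the remark that ratios of rate constants often matter more than their individual values.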

Cross-References

▶ Equilibria and Bifurcations in the Molecular Biosciences
▶ Mathematical Models in the Sciences
▶ Mathematics of Fitting Scientific Data

References

Beard DA, Qian H (2008) Chemical biophysics: quantitative analysis of cellular systems. Cambridge University Press, Cambridge
Keener JP, Sneyd J (2009) Mathematical physiology, 2nd edn, vols 1 & 2. Springer, New York
Murray JD (2002/2003) Mathematical biology, 3rd edn, vols 1 & 2. Springer, Berlin
Plonsey R, Barr RC (2000) Bioelectricity: a quantitative approach, 2nd edn. Kluwer, New York
Strogatz SH (1994) Nonlinear dynamics and chaos. Perseus, Cambridge

Chemical Unfolding ▶ Chemical Denaturation


Chloroplastida ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Chloroplastidians ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Chromatin ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

Chromatin Modification ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

Chromatin Remodeling ▶ Chromatin Remodeling During Homologous Recombination Repair in Saccharomyces cerevisiae

Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

Scheherazade Khan and Angela K. Hilliker
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms

Chromatin; Chromatin modification; DNA methylation; Epigenetics; Histone modification; Imprinting; Transcriptional control

Synopsis

Control of gene expression begins with rendering DNA more or less accessible to the transcription machinery. DNA is packaged around histones and other proteins to form chromatin. Histones condense DNA but also limit its accessibility to the transcription machinery. DNA that is packaged tightly and inaccessible to transcription machinery is referred to as heterochromatin, while loosely packaged DNA that can be transcribed is termed euchromatin. Since transcription depends upon accessible DNA, chromatin modification provides a powerful means for the cell to regulate gene expression. Indeed, chromatin state is dynamic, changing over time or due to alterations in the cell’s environment. Chromatin state varies between cell types and is one way that cells in different tissues show unique patterns of gene expression. Importantly, chromatin state is altered in many diseases, including cancer. Proper regulation of chromatin state is important for maintaining healthy gene expression.

Chromatin accessibility can be altered by modifying either the DNA itself or the proteins that package it. Such modifications are highly dynamic and reversible. Some modifications correlate with loose chromatin and active transcription, while other modifications correlate with tight chromatin and inhibited transcription. However, these correlations are not absolute and the causal relationships between the modifications and transcription efficiency are not clear. These modifications are numerous and potentially interconnected, forming an intricate network of transcriptional control. Some modifications persist into the next generation, affecting the transcription of the organism’s offspring.

Introduction

If the DNA contained in one human cell were stretched out, it would be about 2 m long. Yet, all of this DNA must fit into a nucleus that is only 6.5 μm in diameter ($6.5 \times 10^{-6}$ m). To accomplish this feat, eukaryotic DNA associates with proteins that condense it into a highly ordered structure called chromatin. The fundamental unit of chromatin is the nucleosome, which includes 147 base pairs of DNA wrapped around an octamer of histone proteins, like a piece of thread around a spool. Each histone octamer includes two copies each of histones H2A, H2B, H3, and H4 (Fig. 1). These nucleosomes and their intervening DNA are arranged like beads on a string. With help from other proteins, the nucleosomes interact with one another and form higher-order structures to condense DNA into linear chromosomes. However, packaging DNA into chromatin not only condenses it but also reduces its accessibility to transcriptional machinery. For this reason, chromatin modification plays an essential part in the regulation of gene expression.

Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of, Fig. 1 A nucleosome. A nucleosome includes 147 base pairs of DNA (gray) wrapped around a histone octamer composed of two subunits each of histone H2A, H2B, H3, and H4. Each histone subunit has a highly positively charged N-terminal tail that is thought to help stabilize the histone on the negatively charged DNA (top). These tails are accessible to enzymes that add posttranslational modifications, such as acetylation (gray circles). Acetylated tails will have less overall positive charge and will cause looser association of the histones with the DNA

Eukaryotic gene expression can be controlled at every stage, including the birth of mRNA (transcription), the processing and nuclear export of mRNA, and the regulation of mRNA stability and translation. However, control of gene expression begins with the state of the chromatin structure. Indeed, the pattern of chromatin condensation varies among different types of cells. Chromatin that is highly condensed (heterochromatin) is inaccessible to the transcription machinery, and genes within that heterochromatin cannot be activated. If the chromatin condensation is loose (euchromatin), then the DNA is accessible and the gene can be transcribed. Thus, changes in chromatin structure will lead to changes in transcription.

Chromatin structure can be altered in numerous ways. First, individual histones can be modified so as to increase or decrease their affinity for DNA, changing the accessibility of the DNA to the transcription machinery. Second, nucleosome remodeling complexes cause larger-scale alterations to the chromatin to increase the accessibility of the DNA. Once the DNA is accessible, transcriptional activators can bind the DNA and recruit the general transcription machinery. Third, the DNA itself can be modified via methylation. Methylated DNA can recruit factors that change the chromatin structure and the accessibility of nearby DNA.

Chromatin modifications can persist for a long time. Some alterations are programmed within germ cells and passed on to the next generation. Thus, alteration of chromatin structure can cause the inheritance of certain patterns of gene expression that are not determined by the sequence of one’s DNA, but by how one’s DNA is packaged.
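The packing problem described at the start of this Introduction can be made concrete with a little arithmetic. The sketch below is an illustration added here, not part of the original entry. It uses the three figures quoted in the text (2 m of DNA, a 6.5 μm nucleus, 147 base pairs per nucleosome) together with one outside assumption, the commonly cited rise of about 0.34 nm per base pair for B-form DNA. The function name and dictionary keys are hypothetical.

```python
# Figures quoted in the text.
DNA_LENGTH_M = 2.0            # stretched-out DNA of one human cell
NUCLEUS_DIAMETER_M = 6.5e-6   # diameter of the nucleus
CORE_DNA_BP = 147             # base pairs wrapped around one histone octamer

# Assumption not taken from the text: axial rise of B-form DNA per base pair.
RISE_PER_BP_M = 0.34e-9


def packing_summary():
    """Return rough packing figures implied by the numbers above."""
    core_dna_length_m = CORE_DNA_BP * RISE_PER_BP_M
    return {
        # How many times longer the DNA is than the nucleus is wide.
        "length_to_diameter_ratio": DNA_LENGTH_M / NUCLEUS_DIAMETER_M,
        # Number of base pairs corresponding to 2 m of DNA.
        "implied_base_pairs": DNA_LENGTH_M / RISE_PER_BP_M,
        # Length of DNA wound around a single histone octamer, in nanometers.
        "core_dna_length_nm": core_dna_length_m * 1e9,
        # Upper bound only: ignores the linker DNA between nucleosomes.
        "max_nucleosomes": DNA_LENGTH_M / core_dna_length_m,
    }


if __name__ == "__main__":
    for name, value in packing_summary().items():
        print(f"{name}: {value:.3g}")
```

Under these assumptions the DNA is roughly 300,000 times longer than the nucleus is wide, and each nucleosome core accounts for about 50 nm of DNA. The nucleosome count is only an upper bound, because the intervening linker DNA between the "beads" is not included.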

Histone Modification

While histones condense the DNA, they also limit access of the transcription machinery. Thus, DNA packaging influences whether and to what extent


Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of, Fig. 2 Common modifications on N-terminal histone tails. Each histone tail varies in the locations and types of modifications that are known to occur and affect transcription. These examples show phosphorylation (triangles), methylation (circles), and acetylation (squares). Note that some amino acid residues have more than one potential modification (as shown on histone H3), but only one modification can occur on the residue at any given time (Adapted from Watson et al. (2008), Fig. 7–39)

a gene is transcribed. A cell can alter histones by adding or removing posttranslational modifications. Posttranslational modifications are enzymatic, covalent additions of small chemical groups (methyl, phosphoryl, acetyl, etc.) or proteins (SUMO, ubiquitin) to other proteins. The addition of posttranslational modifications can alter the activity of the target protein in a number of ways, including changing enzymatic activity, altering binding partners, or causing degradation. Adding or removing posttranslational modifications on histones influences transcription by altering the accessibility of the associated DNA and/or recruiting other factors that influence transcription. Each subunit of a histone octamer has a highly positively charged N-terminal tail (Fig. 1). These tails are accessible to enzymes that confer posttranslational modifications on specific amino acids within the tail (Fig. 2). Histone tails are almost completely conserved among eukaryotes, suggesting that patterns of tail modification are important. Histone tails are subject to many modifications, including acetylation, methylation, phosphorylation, sumoylation, and ubiquitination. Several of these modifications are

reversible (Table 1). The modifications can alter the accessibility of the DNA to transcriptional machinery in one of three ways. First, modifications of histone tails may strengthen or weaken the interactions between the affected histones and DNA or neighboring histones. Weaker interactions would promote transcription. Second, such modifications facilitate the recruitment of other proteins (such as “Nucleosome Remodeling Complexes”; see below) that alter chromatin structure more dramatically. Third, these modifications can directly recruit transcription factors or the transcriptional machinery. While some modifications are euchromatic (they promote transcription), others are heterochromatic (they impede transcription). The best understood of these modifications is acetylation, which can affect transcription in both ways. Acetylation occurs when a lysine acetyltransferase enzyme (KAT, where K stands for lysine) adds an acetyl group to lysine residues on the tails of core histones H3 and H4, displacing two hydrogen atoms (Figs. 2 and 3). These enzymes have traditionally been called HATs (histone acetyltransferases), as


Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of, Table 1 Chromatin modifications that affect transcription

Enzyme class | Chromatin modification | Predominant consequence on transcription | Example
Lysine acetyltransferase | Acetylation of lysine residues on histones | Activation^a | Gcn5
Lysine deacetylase | Removal of acetyl group from lysines on histones | Repression^a | HDAC6
Histone methyltransferase | Methylation of arginines or lysines on histones | Activation and repression | Suv39H1
Histone demethylase | Removal of methyl group on histones | Activation | Lsd1^b
Kinase | Phosphorylation of histidine, threonine, tyrosine, or serine on histones | Activation^a | Rsk-2^c
Phosphatase | Removal of phosphate group from histones | Repression^a | PP1^d
Ubiquitin ligase | Addition of ubiquitin to lysines on histones | Activation and repression | Rad6
Sumoylating enzymes (E1, E2, E3) | Addition of SUMO (small ubiquitin-related modifier) to lysines on histones | Repression^a | UBC9^e
DNA methyltransferase | Methylation of DNA | Repression | DNMT3A
DNA demethylase | Chemically alter or remove methyl group | Proposed to activate transcription | None have been identified
Hydroxymethylase | Converts methyl C to hydroxymethyl C, which cannot be bound by methyl-binding proteins | Unknown | TET family of proteins^f
Nucleosome remodeling complex | Catalyze movement of DNA or histones to promote accessibility of DNA | Activation | ISWI (ATP-dependent remodeling protein in a NRC)

^a These modifications tend to correlate with this effect on transcription, but there are examples that show the opposite effect as well
^b Reviewed in Mosammaparast and Shi (2010)
^c Reviewed in Latchman (2005)
^d See Koshibu et al. (2009)
^e See Shiio and Eisenman (2003)
^f Reviewed in Williams et al. (2011)

histones are their most well-known targets. However, these enzymes can target nonhistone proteins as well. Acetylation decreases the net positive charge on a histone tail (Fig. 3) and consequently decreases the histone’s affinity for the negatively charged backbone of DNA. This change in affinity can inhibit the ability of nucleosomes to fold into higher-order structures that hinder transcription. Thus, acetylation of histone tails often correlates with increased transcription, by increasing the accessibility to the DNA. Moreover, acetylation is correlated with increased turnover of histones, which would lead to a temporary increase in accessibility of DNA.

Acetylation can also affect transcription indirectly either by recruiting factors that remodel chromatin further (see “Nucleosome Remodeling Complexes,” below) or by recruiting the general transcription factors that bind the transcription machinery. The acetylated lysines can bind effector proteins that contain bromodomains, as bromodomains specifically recognize acetylated lysines. Both nucleosome remodeling complexes and many important general eukaryotic transcription factors contain bromodomains (Kurdistani and Grunstein 2003). Once the transcription of a given gene is no longer needed, the chromatin may return to its


Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of, Fig. 3 Chemical structures of methylated cytosine in DNA and acetylated lysine in proteins. Structure of canonical cytosine (a) and 5-methylcytosine (b). (c) The amino acid lysine can be converted to acetyllysine by a KAT enzyme. The addition of the acetyl group, using acetyl-CoA as a cofactor, neutralizes the positive charge. This modification can be reversed by a KDAC enzyme

repressed, tightly packed state. Accordingly, another family of enzymes is required to remove euchromatic modifications from histone tails (Table 1). Lysine deacetylases (KDACs, a.k.a. HDACs for histone deacetylases) remove the acetyl groups from lysines, restoring their positive charge and stabilizing the interaction between histones and the negatively charged DNA backbone. Deacetylation also reverses the recruitment of bromodomain-containing factors. While KATs are typically associated with promoting transcription and KDACs with inhibiting transcription, there are emerging exceptions to this rule. In yeast, the KDACs Hos2 and Rpd3 are essential for the activation, not inactivation, of DNA damage-inducible genes RNR3 and HUG1 (Sharma et al. 2007). Although the specific mechanism is not clear, this example highlights the complexity and variability of effects that histone modifications can have on gene expression. Both KATs and KDACs, via their associated accessory proteins, can be targeted to specific genes. For instance, the KDAC Rpd3-containing

complex is recruited to a specific sequence in the promoter of INO1, a gene involved in inositol biosynthesis. Rpd3 deacetylates most acetylated residues on the adjacent histones, selectively repressing the transcription of INO1 (Kurdistani and Grunstein 2003). The KAT Gcn5 can form two distinct histone acetyltransferase complexes called SAGA and ATAC. SAGA and ATAC target different genes and therefore direct Gcn5 to different subgroups of genes (Krebs et al. 2011). Acetylation is but one type of posttranslational modification conferred upon histone tails. Other histone modifications include methylation, ubiquitination, sumoylation, and phosphorylation (Table 1). In certain ways, these modifications resemble acetylation. First, each modification can affect the rate of transcription, but not only in one direction. For example, the ubiquitination of histone H2A correlates with repression of transcription, whereas ubiquitination of histone H2B stimulates transcription. And while acetylation tends to promote transcription and methylation tends to repress transcription, there are exceptions


to these trends. Second, these modifications can recruit effector proteins that further remodel the chromatin. For example, sumoylated histones recruit KDACs to tighten the chromatin, while phosphorylation of histone H3 (at serine 10) promotes KAT activity, which loosens chromatin (Latchman 2005). Third, each modification is potentially reversible by other enzymes; each modification class except sumoylation has been shown to be reversible on histones (Table 1).

In some cases, one histone modification promotes or prevents another modification. Studies in yeast have revealed that de-ubiquitination of histone H2B (at lysine 123) is a necessary step for the methylation of histone H3 (at lysine 36) and correlates with expression of the GAL1 gene (Margueron et al. 2005). Additionally, different modifications can occur on the same site (Fig. 2), indicating that some modifications are mutually exclusive.

The number of possible patterns of modifications is immense, and these modifications may work together to form a pattern that is generally recognized as promoting or inhibiting transcription, raising the suggestion of a “histone code.” The histone code theory predicts that histone modifications form regular patterns that are read by the effector proteins that bind them and, in turn, regulate chromatin structure and transcription. This theory implies that histone modifications cause changes in transcription. While experiments can clearly demonstrate that certain histone modifications correlate with changes in transcription, it is difficult to determine if the relationship is causal. Does a certain combination of histone modifications cause transcription or does transcription create a certain pattern of histone modifications? Early evidence suggested strong correlations between certain types of modifications and their effect on transcription, such as acetylation promoting transcription.
However, opponents of the histone code theory point out that every “rule” has exceptions; no modification class to date has an absolute correlation with either activating or repressing transcription. A contrary theory posits that transcribed DNA is always somewhat accessible to transcriptional


activators or repressors and that their binding causes the change in chromatin state (open for activators, closed for repressors; reviewed in Henikoff and Shilatifard 2011). It remains unclear whether histone modifications strictly control DNA accessibility and therefore transcription or if transcription causes histone modifications that reinforce the desired DNA accessibility. We may find that these hypotheses are not mutually exclusive in that some portions of DNA recruit factors that alter histones while, at the same time, some histone modifications may be more likely to commit DNA to an open or closed state.

As conditions change outside a cell (e.g., environmental stressors, shifts in resource availability, signals from nearby cells), patterns in transcription are altered. Not surprisingly, histone modifications are also altered and these changes correlate with alterations in the transcription of associated genes. For example, after a cell undergoes heat shock, phosphorylated forms of histone H3 are concentrated at heat shock genes that are being transcribed and are absent from genes that are silenced (Latchman 2005). Histone modifications may be one mechanism to alter gene expression in response to a change in the environment and may be a genetic regulatory mechanism underpinning phenotypic plasticity.

While chromatin structure is often dynamic, there are portions of our chromosomes that are permanently silenced. For example, the centromere and telomeres of each chromosome are maintained in a heterochromatic state that is dependent on the maintenance of histone modifications that are consistent with silencing (see brief definition on ▶ “Long-Term Genetic Silencing at Centromere and Telomeres”).

Most examples of how histone modification affects transcription pertain to transcription initiation, that is, whether the transcription machinery can engage the DNA and start transcription.
However, emerging evidence suggests that histone modifications change throughout the transcription of their associated genes. As transcription proceeds, the RNA polymerase complex recruits enzymes that modify chromatin. These changes in modification can recruit new proteins to help


transcription or other downstream processes, such as mRNA processing (Latchman 2005).

Nucleosome Remodeling Complexes

Large complexes of proteins called nucleosome remodeling complexes can remove or significantly alter entire nucleosomes, altering the accessibility of DNA to transcription factors. These protein complexes can shift nucleosomes away from transcription factor binding sites, which promotes transcription, or they can shift nucleosomes into transcription factor binding sites, which prevents transcription. There are five families of nucleosome remodeling complexes, which, though similar in function, are targeted to different locations along a chromosome (reviewed in Smith and Peterson 2005). Each complex has a similar ATPase domain at its core that uses the energy of ATP hydrolysis to remodel the nucleosomes. Each family associates with unique proteins that confer specific functions to the complex.

The SWI/SNF complex is the best studied of the nucleosome remodeling complexes. The SWI/SNF complex is recruited to the appropriate DNA by a transcriptional activator bound to the DNA. The SWI/SNF complex is stabilized by Swi2, its catalytic protein, which contains a DNA binding domain and a bromodomain to bind acetylated histones (Hassan et al. 2001). While the precise mechanism is unknown, the complex uses the energy of ATP hydrolysis to move the DNA along the surface of a histone octamer to expose cis regulatory sequences in the DNA, so that the transcription machinery can bind the DNA and activate the gene.

Several proteins are critical for chromatin modification. Some modify the histones directly, through posttranslational modifications, while others form nucleosome remodeling complexes that can move whole nucleosomes. However, it has become clear in the last few decades that RNAs are also dynamic players in chromatin remodeling and transcriptional control. See brief definition on ▶ “RNA-Induced Chromatin Remodeling” for more information.

DNA Methylation

Transcription can also be affected through modification of DNA itself, specifically through the methylation of cytosines (Fig. 3). Ninety percent of 5-methylcytosines are found in CG dinucleotides. Methylation of DNA is correlated with formation of heterochromatin. If a gene’s promoter or enhancer cis elements are converted to heterochromatin, then transcription decreases. The methylation or demethylation patterns within or around some genes are fixed, but those around other genes are dynamic. For example, the DNA methylation state of some genes is tissue specific. In a tissue where the positive regulatory elements of a gene are methylated, the gene is not transcribed, while in a tissue where the regulatory elements are not methylated, the gene is transcribed (Latchman 2005).

DNA methylation can repress transcription either directly or indirectly. For instance, certain crucial transcription factors, such as Sp1, cannot interact with methylated DNA sequences to initiate transcription. Alternatively, methylated DNA can recruit specific proteins that contain methyl-DNA-binding domains (MDBs). Many of these MDB-containing proteins bind methylated CG sequences and recruit other proteins, including KDACs and nucleosome remodeling complexes, to facilitate the creation of a more closed chromatin structure.

DNA methylation can permanently silence or conditionally regulate the transcription of certain sequences. In healthy cells, the promoters of transposons and retrotransposons, deleterious mobile DNA elements that must be consistently silenced, are highly methylated (Walsh et al. 1998). In cancer cells, the methylation states of genes are often altered (see brief definition on ▶ “DNA Methylation and Cancer”). For example, “tumor suppressor genes,” which normally help prevent cells from acquiring characteristics of cancer, are hypermethylated and silenced in numerous cancers (Salozhin et al. 2005).
Interestingly, some DNA methylation patterns can be inherited and affect the gene expression of the offspring (see brief definition on ▶ “Genomic Imprinting”). This is an example of an epigenetic phenomenon. Epigenetics is the study of heritable changes in phenotype that are not caused by changes in the DNA sequence.

Interactions with the Environment

Our traits are not exclusively determined by our DNA, but are influenced by our environment. Interestingly, environmental factors can influence our chromatin state and therefore the transcription of our genes. Under some environmental conditions, canonical histones can be replaced with “variant” or specialized histones. In some plants, a special histone H2 variant is selectively incorporated in nucleosomes around genes that are activated when the plants are grown at lower temperatures. This histone H2 variant correlates with open chromatin and increased transcription. Patterns of histone modifications also change in response to environmental signals. Mice fed a methyl-balanced diet are less likely to have cancer; this effect requires enzymes that methylate histones, suggesting that diet might influence the modification of histones (reviewed in Feil and Fraga 2012).

Many types of environmental exposures have been shown to alter DNA methylation patterns in animals, which correlate with changes in the animal’s phenotype; some of these changes are heritable and are epigenetic. For example, if Bisphenol A (BPA), an additive to many plastics, is added to a female mouse’s diet, her offspring show decreased methylation of the agouti viable yellow allele (Avy) and are more likely to have yellow coat color (rather than brown), obesity, and diabetes. However, feeding the mother methyl-rich foods containing additives like vitamin B12 and folic acid can mitigate the effect of BPA in mice. This example shows that the environment (in this case diet) of the mother can affect her offspring, likely through changes in DNA methylation. While the effect of BPA on humans is unresolved, many companies have removed BPA from their products due to the public’s reaction to these studies (reviewed in Feil and Fraga 2012).


Our current understanding of the connections between our environment and DNA/chromatin modification is primitive. Some correlations between the environment and changes to DNA/chromatin state are known, but the mechanism or causal relationship is unclear. With a greater understanding, we may be able to predict how our diet, drugs, and environment shape our phenotypes, and perhaps the phenotypes of our children and even grandchildren, through DNA methylation, chromatin modification, and their effect on the transcription of our genes.

Cross-References ▶ Epigenetics ▶ Transcription Factor Classes

References
Feil R, Fraga MF (2012) Epigenetics and the environment: emerging patterns and implications. Nat Rev Genet 13:97–109
Hassan AH, Neely KE, Workman JL (2001) Histone acetyltransferase complexes stabilize swi/snf binding to promoter nucleosomes. Cell 104:817–827
Henikoff S, Shilatifard A (2011) Histone modification: cause or cog? Trends Genet 27:389–396
Koshibu K, Gräff J, Beullens M, Heitz FD, Berchtold D, Russig H, Farinelli M, Bollen M, Mansuy IM (2009) Protein phosphatase 1 regulates the histone code for long-term memory. J Neurosci 29:13079–13089
Krebs AR, Karmodiya K, Lindahl-Allen M, Struhl K, Tora L (2011) SAGA and ATAC histone acetyl transferase complexes regulate distinct sets of genes and ATAC defines a class of p300-independent enhancers. Mol Cell 44:410–423
Kurdistani SK, Grunstein M (2003) Histone acetylation and deacetylation in yeast. Nat Rev Mol Cell Biol 4:276–284
Latchman D (2005) Gene regulation. Taylor & Francis Group, New York
Margueron R, Trojer P, Reinberg D (2005) The key to development: interpreting the histone code? Curr Opin Genet Dev 15:163–176
Mosammaparast N, Shi Y (2010) Reversal of histone methylation: biochemical and molecular mechanisms of histone demethylases. Annu Rev Biochem 79:155–179
Salozhin SV, Prokhorchuk EB, Georgiev GP (2005) Methylation of DNA–one of the major epigenetic markers. Biochemistry 70:525–532
Sharma VM, Tomar RS, Dempsey AE, Reese JC (2007) Histone deacetylases RPD3 and HOS2 regulate the transcriptional activation of DNA damage-inducible genes. Mol Cell Biol 27:3199–3210
Shiio Y, Eisenman RN (2003) Histone sumoylation is associated with transcriptional repression. Proc Natl Acad Sci USA 100:13225–13230
Smith CL, Peterson CL (2005) ATP-dependent chromatin remodeling. Curr Top Dev Biol 65:115–148
Walsh CP, Chaillet JR, Bestor TH (1998) Transcription of IAP endogenous retroviruses is constrained by cytosine methylation. Nat Genet 20:116–117
Watson JD et al (2008) Molecular biology of the gene, 6th edn. Pearson/Benjamin Cummings, San Francisco
Williams K, Christensen J, Helin K (2011) DNA methylation: TET proteins-guardians of CpG islands? EMBO Rep 13(1):28–35

Chromatin Remodeling During Homologous Recombination Repair in Saccharomyces cerevisiae Mary Ann Osley Molecular Genetics and Microbiology, University of New Mexico School of Medicine, Albuquerque, NM, USA

Synonyms ATP-dependent nucleosome remodeling; Chromatin remodeling; Histone modification; Homologous recombination; Yeast MAT locus

Synopsis Homologous recombination (HR) is the preferred pathway to repair DNA double-strand breaks (DSBs) during the S and G2/M phases of the cell cycle and represents the major route for DSB repair in budding yeast, Saccharomyces cerevisiae. HR takes place in a chromatin context, and both ATP-dependent nucleosome disruption and histone modifications contribute to chromatin remodeling during HR. These chromatin-remodeling events occur at different points during repair by HR and influence the execution of discrete steps in the HR pathway.

Introduction The nucleosomal organization of chromatin presents a formidable barrier to the access of regulatory factors to all DNA-mediated processes. To surmount this barrier, chromatin can be remodeled to expose key factor binding sites. Chromatin remodeling is classified into two broad categories and is mediated by multi-subunit complexes. The first involves ATP-dependent nucleosome remodeling, in which the energy of ATP hydrolysis is used to disrupt interactions between histones and DNA, leading to nucleosome sliding in cis or nucleosome displacement in trans. The second involves the modification of specific histones by acetylation, methylation, phosphorylation, or ubiquitylation. These modifications can directly affect chromatin structure or serve as binding platforms for nonhistone regulatory proteins. While these chromatin-remodeling mechanisms were first described during transcription, they play a prominent role at multiple steps during the repair of DSBs by HR. Underscoring the importance of chromatin remodeling to HR, mutations in these factors lead to phenotypes associated with defective HR repair.

Analysis of the contribution of chromatin remodeling to HR repair in budding yeast has been facilitated by the existence of a system in which a defined DSB can be created with high efficiency in mitotically growing cells (Fig. 1). This system employs the yeast mating type (MAT) locus, which is cleaved by the HO endonuclease during a switch from mating type a or α to the opposite mating type. The DSB created at MAT is repaired by HR using information encoded in one of two silent cassettes, HMLα or HMRa, and involves a classic gene conversion process in which the DNA present at the silent cassettes is retained. A modified version of this system was constructed by James Haber, in which the endogenous HO gene was replaced with GAL-HO. Transferring cells into galactose-containing medium turns on the HO gene, and a DSB can be created at MAT with almost 100% efficiency at any point in the cell cycle. This strain has been widely used to examine the factors and strand interactions that lead to HR repair of the broken MAT locus, and it has been adapted to identify the chromatin-remodeling events that accompany HR.

Two versions of the GAL-HO system have been used to examine chromatin remodeling upon creation of a DSB at MAT. In the first, the two HM loci have been deleted. In this version the break cannot be repaired by HR, but strand resection and the recruitment of a number of factors required for HR still occur, allowing for analysis of chromatin remodeling at the DSB itself and during the initial steps of HR. The second version has retained the HM loci, and thus the association of chromatin remodeling with the entire spectrum of HR-associated events can be examined.

Chromatin Remodeling During Homologous Recombination Repair in Saccharomyces cerevisiae, Fig. 1 The yeast mating type system. The structures of the yeast MATα locus and two silent mating type loci, HMLα and HMRa, are shown. The shaded boxes represent regions of homology among the three loci. In the system shown, the endogenous HO endonuclease gene has been replaced with a GAL-regulated HO gene, which can be expressed at any point in the cell cycle by shifting cells into a medium that contains galactose. After HO creates a DSB at the MAT locus, HR repairs the break using information present at the HM loci. When the DSB occurs at MATα, as shown here, ssDNA formed by strand resection will search for homology at HMRa, leading to synapsis, strand invasion, and strand extension to copy Ya information. This information replaces Yα information at MATα, leading to a switch in mating type to MATa. A second version of this system has been created in which the two HM loci have been deleted. Upon formation of a DSB at MAT, the break can only be repaired by nonhomologous end joining. However, strand resection will eventually occur, promoting the recruitment of HR factors.

Main Text Chromatin Remodeling at the Unrepairable MAT DSB The repair of DSBs by HR, like all DNA repair processes that occur in a chromatin context, follows the rules: chromatin signaling of the break for repair, chromatin opening at the break for access of repair factors, and chromatin restoration at the repaired break. The first two processes have been extensively studied using the unrepairable MAT DSB system.

Chromatin signaling. Chromatin signaling is represented by the phosphorylation of the C terminus of histone H2A on serine 129, a histone modification first identified on vertebrate H2A.X and hereafter referred to as γ-H2A. This is the first detectable chromatin-remodeling event to occur at the MAT DSB, occurring within minutes after the DSB is formed, and it is a key event in the activation of the cell cycle checkpoint and recruitment of DNA repair factors (Shroff et al. 2004). The phosphorylation of H2A quickly spreads from the DSB site and encompasses a region of 40–50 kb around the break, thus amplifying the signal. Two protein kinases with well-established roles in checkpoint signaling of DSBs, Mec1 and Tel1, are responsible for γ-H2A formation, and they are recruited to the MAT DSB shortly after the break is formed (Shroff et al. 2004). A second chromatin-remodeling event related to γ-H2A signaling occurs through the activity of the RSC complex. RSC is an ATP-dependent nucleosome-remodeling complex that is recruited to the MAT DSB almost immediately after the break is formed through the association of RSC with the MRX (Mre11-Rad50-Xrs2) complex, which binds the ends of DSBs immediately after breaks are formed (Chai et al. 2005; Shim et al. 2005). RSC moves several nucleosomes away from the DSB, thus exposing DNA adjacent to this region (Shim et al. 2007). This activity contributes to the recruitment of Mec1 and Tel1 to the DSB and thus leads to the amplification of the γ-H2A signal (Liang et al. 2007).

Chromatin opening. Chromatin around the DSB must be opened to allow for the accessibility of factors that mediate HR. Two modes of chromatin opening have been observed in the vicinity of the MAT DSB. The first involves ATP-dependent nucleosome remodeling by the INO80 complex, which leads to the disruption and eventual eviction of nucleosomes in a restricted region around the DSB (Morrison et al. 2004; Tsukuda et al. 2005). A second ATP-dependent nucleosome-remodeling complex, SWI/SNF, is also recruited to the MAT DSB around the same time as INO80, but its role in nucleosome dynamics has not been defined (Chai et al. 2005). Fun30, a third factor in this class, also comes to the MAT DSB, but does not alter nucleosomes. Instead, Fun30 removes the checkpoint adaptor protein Rad9 from chromatin (Chen et al. 2012; Costelloe et al. 2012; Eapen et al. 2012). The second mode of chromatin opening involves the local and transient acetylation of histones H2A, H3, and H4 by the NuA4 and Gcn5 histone acetyltransferase (HAT) complexes, which promote chromatin relaxation (Downs et al. 2004; Tamburini and Tyler 2005). Both INO80 and NuA4 are recruited to the MAT DSB just after the phosphorylation of H2A, and their initial recruitment was reported to be dependent on γ-H2A, thus representing a key function of γ-H2A in DSB signaling (Morrison et al. 2004; Downs et al. 2004). This conclusion has been challenged by a report that found chromatin regulators are recruited to the MAT DSB in G2/M phase, a cell cycle phase in which phosphorylated H2A levels are low (Bennett et al. 2013).

The chromatin opening promoted by these two classes of remodeling factors has important implications for early steps in HR. Nucleosome eviction has been correlated with 5′ to 3′ strand resection and the efficient recruitment of Rad51 to ssDNA, which promotes strand pairing during homology search (Tsukuda et al. 2005). In addition, removal of Rad9 from chromatin promotes recruitment of enzymes that mediate long-range DNA resection. Histone H4 acetylation leads to the recruitment of the Rvb1 helicase-containing complexes, INO80 and SWR1 (Downs et al. 2004). Thus, histone acetylation acts as an intermediary in γ-H2A signaling by regulating the recruitment of other factors that remodel chromatin at the MAT DSB.

An unresolved issue in the regulation of nucleosome displacement at the MAT DSB is whether the INO80 nucleosome-remodeling complex is the preeminent factor acting in this pathway. Several lines of evidence indicate that it cannot be the sole factor contributing to nucleosome eviction. First, mutations in the MRX complex, which acts upstream of INO80 in the DSB repair pathway, lead to a more severe eviction defect than mutations in the INO80 complex itself (Tsukuda et al. 2005). Second, nucleosome displacement is severely delayed but not abolished in the absence of INO80 activity (Tsukuda et al. 2005). Of the three other remodeling factors recruited to the MAT DSB (RSC, SWI/SNF, and Fun30), it is likely that SWI/SNF also contributes to nucleosome eviction along with INO80 (unpublished observations).

Chromatin Remodeling During HR Repair of the MAT DSB When donor DNA sequences are present, the entire repertoire of HR factors is employed to repair the MAT DSB (Fig. 2). These factors are involved in homology search by the invading ssDNA strand, synapsis of Rad51 filaments with homologous donor sequences, strand invasion, and strand extension. The HM donor loci used in HR repair are assembled into a heterochromatin-like structure by the SIR silent-information regulators, Sir2, Sir3, and Sir4. This structure must be disrupted to allow HR to proceed.
The major disruption identified to date involves the localized displacement of nucleosomes at the junction of the donor region where synapsis and initial strand invasion occur, with more modest levels of displacement occurring distal to the junction (Hicks et al. 2011). Three ATP-dependent nucleosome-remodeling factors (SWI/SNF, INO80, and Rad54) are connected to this disruption, and in their absence strand invasion or strand extension is impaired. All are present at the donor locus during HR repair, and their presence correlates with maximal synapsis between the invading Rad51 presynaptic filament and donor dsDNA. In vitro evidence shows that the presence of SWI/SNF at the donor locus displaces Sir3 from heterochromatin, thereby promoting joint formation by the Rad51 presynaptic filament (Sinha et al. 2009). In vivo data show that INO80 is required for efficient eviction of nucleosomes at the donor locus, although, surprisingly, it is dispensable for this event at the MAT DSB in HR-proficient strains (Tsukuda et al. 2009). Finally, Rad54, a nucleosome-remodeling protein required for DNA synthesis during HR, also promotes nucleosome displacement at the donor locus. However, its effects are exerted not on the junction nucleosomes but rather on nucleosomes further from the invasion site (Hicks et al. 2011).

These results lead to a potential model to account for the activity of these remodeling factors at the donor locus during HR. First, SWI/SNF disrupts heterochromatin by removing the SIR complex so the Rad51 presynaptic filament can locate and form an initial joint. Next, INO80 displaces nucleosomes to allow for base-pairing interactions between the invading ssDNA and donor dsDNA and subsequent strand invasion. Finally, Rad54 helps to disrupt nucleosomes further from the region of strand invasion to promote new DNA synthesis during repair. Interestingly, RSC acts at the last step in HR to control the ligation of repaired strands, but its activity on donor nucleosomes has not been assessed (Chai et al. 2005).

In addition to ATP-dependent nucleosome remodeling, the Gcn5 and NuA4 HATs mediate HR-dependent acetylation of specific residues in the H3 and H4 N tails at the donor locus during HR repair (Tamburini and Tyler 2005). Acetylation could potentially open chromatin throughout the donor locus to assist all steps in HR. However, studies to date have only measured the levels of acetylated histones adjacent to the region of joint formation, where transient acetylation has been observed. The presence of acetylated histones at this region could potentially play a role with the INO80 nucleosome remodeler to promote Rad51 filament synapsis and subsequent strand invasion, perhaps by recruiting or stabilizing INO80 at the donor locus.

Chromatin Remodeling During Homologous Recombination Repair in Saccharomyces cerevisiae, Fig. 2 Factors that remodel chromatin during HR repair at MAT. The sequence of events in HR repair of a DSB at MAT is shown along with key repair factors that act at each of the steps. The chromatin-remodeling factors that promote chromatin signaling of the DSB, chromatin opening at the break, and chromatin restoration of repaired DNA are listed next to the steps in HR repair that they influence. Not shown is the Fun30 ATP-dependent remodeler, which acts after RSC and promotes long-range DNA resection by removing the Rad9 checkpoint adaptor from chromatin.

Restoration of Chromatin at the Completion of HR Once HR has been completed, chromatin must be restored to its original state. In particular, DSB signaling through γ-H2A must be reversed at the recipient locus to alleviate the block to cell cycle progression and stop the recruitment of chromatin remodeling and repair factors. A number of mechanisms have been suggested for the removal of γ-H2A from chromatin. The displacement of nucleosomes at the MAT DSB provides one route to remove the H2A modification. However, this process only removes γ-H2A-containing nucleosomes in a restricted region around the break (Tsukuda et al. 2005). The SWR1-dependent replacement of γ-H2A with the histone variant H2A.Z also functions to remove the modified histone, and as a consequence of this replacement, the cell cycle checkpoint is relieved (Papamichos-Chronakis et al. 2006). But again, this replacement only occurs in nucleosomes immediately adjacent to the MAT DSB. The protein phosphatase, Pph3, has been shown to dephosphorylate γ-H2A and deactivate cell cycle checkpoint signaling, although Pph3 is reported to act on γ-H2A only after it has been removed from chromatin (Keogh et al. 2006). Thus, the issue of how γ-H2A is removed from nucleosomes distal to the DSB remains an open question.

Nucleosomes that were disassembled during HR repair must also be restored to prevent gaps in chromatin that could lead to increased levels of DNA lesions and genomic instability. Chromatin assembly at the MAT DSB closely follows the repair of the break and is dependent on HR (Chen et al. 2008). A key player in repair-coupled chromatin assembly is the histone chaperone, Asf1, which binds histones H3 and H4. In the absence of Asf1, chromatin is not assembled at the MAT DSB, although DNA repair still occurs (Chen et al. 2008). As a consequence of the failure to reassemble nucleosomes on repaired DNA, the cell cycle checkpoint is only slowly deactivated. Asf1 indirectly contributes to chromatin reassembly during HR repair by promoting the acetylation of free histone H3 on lysine 56 by the HAT, Rtt109. All newly synthesized H3 is acetylated on lysine 56, which is recognized by two additional chaperones, CAF-I and Rtt106 (Li et al. 2008). These chaperones deposit H3K56ac onto DNA during replication, with the HDACs Hst3/Hst4 rapidly removing the acetyl mark once the histone is deposited (Maas et al. 2006). The acetyl mark persists in newly repaired DNA, however, because of the decreased expression of the Hst3 HDAC (Chen et al. 2008; Masumoto et al. 2005). This has led to the model that the presence of H3K56ac in repaired DNA acts as a local signal to deactivate the cell cycle checkpoint. Thus, different histone modifications set up a special chromatin environment at both the beginning (γ-H2A) and end (H3K56ac) of DSB repair to signal either the activation or deactivation of the cell cycle checkpoint.

In summary, the efficient repair of DSBs by HR in budding yeast depends on the concerted activities of a large number of ATP-dependent nucleosome remodeling and histone modification factors. These factors are recruited to the break in response to multiple signals, many of which remain to be identified, and their activity in remodeling chromatin promotes the recruitment or activation of factors that directly participate in HR repair or regulate cell cycle progression. The interactions between chromatin remodelers and the HR repair and checkpoint regulation machinery are just beginning to be understood, and it is likely that these interactions will be considerably more complex than initial models suggest.

References
Bennett G, Papamichos-Chronakis M, Peterson C (2013) DNA repair choice defines a common pathway for recruitment of chromatin regulators. Nat Commun 4
Chai B, Huang J, Cairns BR, Laurent BC (2005) Distinct roles for the RSC and Swi/Snf ATP-dependent chromatin remodelers in DNA double-strand break repair. Genes Dev 19:1656–1661
Chen CC, Carson JJ, Feser J, Tamburini B, Zabaronick S, Linger J, Tyler JK (2008) Acetylated lysine 56 on histone H3 drives chromatin assembly after repair and signals for the completion of repair. Cell 134:231–243
Chen X, Cui D, Zhang X, Chu C-D, Tang J, Chen K, Pan X, Ira G (2012) The Fun30 nucleosome remodeler promotes resection of DNA double-strand break ends. Nature 489:576–580
Costelloe T, Louge R, Tomimatsu N, Mukherjee B, Martini E, Khadaroo B, Dubois K, Wiegant W, Thierry A, Burma S, van Attikum H, Llorente B (2012) The yeast Fun30 and human SMARCAD1 chromatin remodelers promote DNA end resection. Nature 489:581–584
Downs JA, Allard S, Jobin-Robitaille O, Javaheri A, Auger A, Bouchard N, Kron SJ, Jackson SP, Cote J (2004) Binding of chromatin-modifying activities to phosphorylated histone H2A at DNA damage sites. Mol Cell 16:979–990
Eapen V, Sugawara N, Tsabar M, Wu W-H, Haber J (2012) The Saccharomyces cerevisiae chromatin remodeler Fun30 regulates DNA end resection and checkpoint deactivation. Mol Cell Biol 32:4727–4740
Hicks WM, Yamaguchi M, Haber JE (2011) Real-time analysis of double-strand DNA break repair by homologous recombination. Proc Natl Acad Sci U S A 108:3108–3115
Keogh MC, Kim JA, Downey M, Fillingham J, Chowdhury D, Harrison JC, Onishi M, Datta N, Galicia S, Emili A et al (2006) A phosphatase complex that dephosphorylates gammaH2AX regulates DNA damage checkpoint recovery. Nature 439:497–501
Li Q, Zhou H, Wurtele H, Davies B, Horazdovsky B, Verreault A, Zhang Z (2008) Acetylation of histone H3 lysine 56 regulates replication-coupled nucleosome assembly. Cell 134:244–255
Liang B, Qiu J, Ratnakumar K, Laurent BC (2007) RSC functions as an early double-strand-break sensor in the cell’s response to DNA damage. Curr Biol 17:1432–1437
Maas NL, Miller KM, DeFazio LG, Toczyski DP (2006) Cell cycle and checkpoint regulation of histone H3 K56 acetylation by Hst3 and Hst4. Mol Cell 23:109–119
Masumoto H, Hawke D, Kobayashi R, Verreault A (2005) A role for cell-cycle-regulated histone H3 lysine 56 acetylation in the DNA damage response. Nature 436:294–298
Morrison AJ, Highland J, Krogan NJ, Arbel-Eden A, Greenblatt JF, Haber JE, Shen X (2004) INO80 and gamma-H2AX interaction links ATP-dependent chromatin remodeling to DNA damage repair. Cell 119:767–775
Papamichos-Chronakis M, Krebs JE, Peterson CL (2006) Interplay between Ino80 and Swr1 chromatin remodeling enzymes regulates cell cycle checkpoint adaptation in response to DNA damage. Genes Dev 20:2437–2449
Shim EY, Ma JL, Oum JH, Yanez Y, Lee SE (2005) The yeast chromatin remodeler RSC complex facilitates end joining repair of DNA double-strand breaks. Mol Cell Biol 25:3934–3944
Shim EY, Hong SJ, Oum JH, Yanez Y, Zhang Y, Lee SE (2007) RSC mobilizes nucleosomes to improve accessibility of repair machinery to the damaged chromatin. Mol Cell Biol 27:1602–1613
Shroff R, Arbel-Eden A, Pilch D, Ira G, Bonner WM, Petrini JH, Haber JE, Lichten M (2004) Distribution and dynamics of chromatin modification induced by a defined DNA double-strand break. Curr Biol 14:1703–1711
Sinha M, Watanabe S, Johnson A, Moazed D, Peterson CL (2009) Recombinational repair within heterochromatin requires ATP-dependent chromatin remodeling. Cell 138:1109–1121
Tamburini BA, Tyler JK (2005) Localized histone acetylation and deacetylation triggered by the homologous recombination pathway of double-strand DNA repair. Mol Cell Biol 25:4903–4913
Tsukuda T, Fleming AB, Nickoloff JA, Osley MA (2005) Chromatin remodeling at a DNA double-strand break site in Saccharomyces cerevisiae. Nature 438:379–383
Tsukuda T, Lo YC, Krishna S, Sterk R, Osley MA, Nickoloff JA (2009) INO80-dependent chromatin remodeling regulates early and late stages of mitotic homologous recombination. DNA Repair (Amst) 8:360–369

Chromatin Structure ▶ Long-Term Genetic Silencing at Centromere and Telomeres

Chromid ▶ Plasmids as Secondary Chromosomes


Cis-Regulation of Eukaryotic Transcription Rachel McMullan and April Hill Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms Cis-regulatory elements; Enhancers; Transcriptional regulation

Synopsis The principal control of eukaryotic transcription involves cis-regulatory DNA sequences that are modulated through the binding of cell-specific transcription factor proteins. These regulatory sequences control development and cell function through the modification of patterns of gene expression. Cis-elements serve as binding sites, which, through DNA bending and looping, can bring transcription factors into close proximity with each other and with promoter regions of the genes they regulate. Binding of specific combinations of transcription factors to their appropriate cis-regulatory regions is the major determinant of when and where genes will be expressed through the transcription of mRNA. Other than promoters, enhancers are the most discussed and studied type of cis-regulatory element. Multiple enhancers exist for most genes and act to mediate the control of gene expression in particular cell types at particular developmental times. Enhancers can be located far from the genes that they regulate and are the prime modulators of spatial and temporal gene expression. Cis-regulatory DNA can serve several functions, including acting to enhance or suppress expression of target genes. Other cis-regulatory sequences can play roles in modification of chromatin but ultimately lead to increasing accessibility of regulatory DNA to transcription factors, leading to the control of transcription.

Introduction The number and composition of genes that are transcribed vary drastically among the various cell types during the life cycle of eukaryotic organisms in response to changing environmental and physiological conditions. It is the precise control of gene expression that leads to the differentiation of distinct cell types, body plans, and physiologies of eukaryotic organisms. Thus, a fundamental objective in research to understand cell differentiation and development is to determine how each cell of an organism can possess all the same genes while only certain genes are activated at particular times, in particular cells, and under particular circumstances. A major control point of gene regulation is at the birth of the mRNA (i.e., transcription). In eukaryotes, transcriptional regulation is achieved through the binding of cell-specific transcription factor (TF) proteins to particular DNA sequences that can be located distantly from the target genes they regulate. These sequences are referred to as cis-regulatory DNA. In most cases, a variety of TFs are required to bind to multiple cis-regulatory sequence motifs located in regions of the genes they regulate. The combinations of cell-specific TFs binding to cis-regulatory DNA at appropriate times and in proper space lead to turning genes on or keeping them off in contexts that act to specify the development and function of the organism. Consequently, cis-regulation by trans-acting factors accounts for much of the difference in gene expression found in various eukaryotic cell types. Indeed, it is now commonly understood that changes (i.e., mutations) to the sequences of cis-regulatory DNA are the major driver of morphological and phenotypic divergence in eukaryotic organisms over evolutionary time (for further reading, see Rivera and Sajuthi, this volume, and Wray et al. 2003). Thus, changes in cis-regulatory sequences can lead to changes in gene expression patterns, which can have both large and small effects on organism development and function.
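The idea that several TFs recognize short sequence motifs scattered through a stretch of regulatory DNA can be made concrete with a small sketch. The Python example below scans both strands of a sequence for consensus motifs written with IUPAC ambiguity codes. It is only an illustration under stated assumptions: the function names, the factor labels (`TBP_like`, `GATA_like`), and the toy sequence are hypothetical, and exact consensus matching is a simplification of real TF-DNA recognition, which is usually modeled with position weight matrices.

```python
import re

# Subset of IUPAC nucleotide codes used by the illustrative motifs below.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def motif_to_regex(motif):
    """Translate an IUPAC consensus motif into a regular expression."""
    return "".join(IUPAC[base] for base in motif)


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(seq, motifs):
    """Return sorted (start, strand, factor) tuples for every motif match.

    Start positions are 0-based and refer to the forward strand.
    A palindromic motif is reported once per strand.
    """
    hits = []
    rc = reverse_complement(seq)
    for factor, motif in motifs.items():
        # The lookahead lets overlapping matches be reported.
        pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
        for m in pattern.finditer(seq):
            hits.append((m.start(), "+", factor))
        for m in pattern.finditer(rc):
            # Convert the reverse-strand coordinate to a forward-strand start.
            hits.append((len(seq) - m.start() - len(motif), "-", factor))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical motifs and toy sequence, for illustration only.
    motifs = {"TBP_like": "TATAWA", "GATA_like": "WGATAR"}
    for start, strand, factor in find_sites("CCTATAAAGGAGATAAGC", motifs):
        print(start, strand, factor)
```

Counting the hits returned for a window of regulatory DNA gives a rough picture of how densely candidate binding sites are packed, although a sequence match alone does not show that a factor binds or that the site is functional in vivo.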

Cis-Regulatory DNA There are multiple types of cis-regulatory elements, including core promoters and promoter-proximal elements found close to the regulated gene sequence, along with elements found at greater distances from the transcriptional start sites (TSSs), such as enhancers, silencers, insulators, and tethering elements (Spitz and Furlong 2012). A commonality among these various cis-regulatory elements is that they are most often short sequences of DNA located on the same chromosome as the gene(s) they regulate. Most of these regulatory DNAs function by serving as templates to bring TFs into close proximity so they can switch genes on and off in time and space in a highly synergistic manner (Levine 2010).

The promoter, a sequence of bases found prior to (upstream of) the gene coding sequence, controls the rate of transcriptional initiation of its gene. Promoters of housekeeping genes, genes found expressed in almost all cell types, are often constitutively active, while other promoters are turned off (repressed) and are only turned on by certain cues. The cues that control the activation of promoters are the protein TFs that bind to a specific DNA sequence close to the regulated gene. Though there are no universal sequence motifs found in eukaryotic promoter regions, there are two functional features found in all promoters: the basal promoter sequence and the sequences for transcription factor binding sites. The basal promoter, also known as the core promoter, is a site within the promoter region where proteins involved in transcription bind. The actual sequence of the basal promoter varies among genes, but most contain a TATA box, which is roughly 25–30 base pairs (bp) upstream (5′) of the transcription start site and is a vital binding site for proteins involved in transcription initiation. In addition to the basal promoter, the transcription factor binding sites also vary in sequence for each gene. Which transcription factors bind to the promoter of a gene is arbitrated by the nucleotide sequence found in the transcription factor binding sites. Thus, the transcription factor binding sites of a particular gene define the gene’s expression profile (a more extensive discussion of promoters can be found in Volume X).

Enhancers are clusters of DNA regulatory sequences, usually a few hundred base pairs in length, that can be found located upstream, downstream, or within introns of genes and can sometimes be found thousands of base pairs away from the promoter region of the regulated gene (see Wittkopp 2006). The operational elements of the enhancer are short, specific DNA sequences (often called DNA binding motifs) that are recognized by TFs, resulting in protein-DNA binding. Enhancers are often capable of operating over long stretches of DNA to bring transcription factors close to the promoter region of the target gene for interactions with the mediator complex or transcription factor II D. In most cases, this will result in the recruitment of RNA polymerase II to the promoter, leading to transcription of the target gene. A special class of enhancers is referred to as silencers. Specifically, these are cis-regulatory DNA sequences involved in the silencing of gene expression due to the binding of TFs that act as repressors and interact with the basal promoter region and machinery to keep transcription of a particular gene turned off. It is generally believed that the default state of chromatin is “off” and that most genes are silenced through binding of specific factors and through the regulation of chromatin.

Most genes may have more than one enhancer element involved in regulating the gene expression profile. One well-studied example is the expression of the Ultrabithorax (Ubx) gene in Drosophila.
This gene is controlled by multiple cis-regulatory elements, each of which containing binding sites for multiple activators and repressors. Ubx elements are responsible for the expression of Ubx in parasegment parts, appendage fields, and other germ layers during different

C

104

development stages of the Drosophila embryo. One such cis-regulatory element for the Ubx gene (named BRE) contains 17 TF binding sites in a 500 bp region; all of which are responsible for the expression of Ubx in four distinct parasegments of the Drosophila embryo. The unique combination of TF activator and TF repressor binding in this element leads to the distinct expression pattern of Ubx in developmental time and space (see this and other examples in Carroll et al. 2005). Thus, enhancers can contain many unique binding sites that are recognized by TFs that act as transcriptional activators or repressors, and these enhancers determine when a gene will be expressed and where the gene will be expressed (for review and many examples of specific transcriptional enhancers in animal development, see Levine 2010). However, the example above only includes a part of the story for the eventual expression of most genes. Generation of complex and robust patterns of gene expression often requires additional DNA regulatory sequences involved in the spread of chromatin modifications. The types of enhancers involved here act to increase the accessibility of DNA in regulatory regions to DNA binding proteins. These enhancers bind TFs that will interact with chromatin remodeling complexes and/or histone-modifying enzymes to lead to decompaction of chromatin (for further information, see Khan and Hilliker, this volume, and Ong and Corces 2011). The removal and/or remodeling of nucleosome structure results in nucleosome-free regions that can facilitate availability of TFs to cis-regulatory elements. The dynamics of nucleosomes is likely a major determinant of eukaryotic gene regulation. Some cis-regulatory DNA is involved in defining the specificity of enhancer-promoter interactions. 
Insulators, also known as barrier or boundary elements, are cis-regulatory DNA sequences that can be found flanking particular sequence elements (i.e., promoters, enhancers) and act to prevent TFs from binding to enhancers or silencers. Insulator elements disrupt communication between enhancers or silencers and the promoter of a gene and are thus a type of regulatory region that creates a boundary in the chromatin.

One function of insulators is to prevent enhancers from promiscuously activating transcription from nontarget promoters. Two types of insulators have been described: enhancer-blocking insulators and barrier insulators. Enhancer-blocking insulators can be found between an enhancer and a promoter and can prevent activation by enhancers or repression by silencers, whereas barrier insulators prevent heterochromatin spreading. There are several different models for the mechanism by which insulators block enhancers, including supporting evidence that insulators form chromatin loops by binding to other insulators (reviewed in Bushey et al. 2008). A final group of cis-regulatory sequences involved in regulating enhancer-promoter interactions are tethering elements, which assist in bringing distant enhancers close to a specific gene to encourage activity. They are often found close to the promoter-proximal region of the targeted gene (Spitz and Furlong 2012).
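The idea that a short sequence element such as the TATA box sits at a characteristic distance upstream of the TSS can be illustrated with a small motif scan. The sketch below is only an illustration: the function name `find_motif`, the consensus string `TATAWAW` (a commonly quoted TATA-box consensus written in IUPAC codes), and the convention that the input sequence ends immediately before the TSS are all assumptions, not details taken from this entry. Only the given strand is scanned.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(upstream_seq, consensus="TATAWAW"):
    """Return how many bp upstream of the TSS each consensus match starts.

    upstream_seq is written 5'->3' and is assumed (hypothetically) to end
    at the base immediately before the transcription start site.
    """
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    seq = upstream_seq.upper()
    hits = []
    # A zero-width lookahead reports overlapping matches as well.
    for match in re.finditer(f"(?=({pattern}))", seq):
        hits.append(len(seq) - match.start())
    return hits


if __name__ == "__main__":
    # Made-up 40 bp upstream region with a TATA-like element.
    region = "G" * 10 + "TATAAAA" + "G" * 23
    print(find_motif(region))
```

Real promoters are degenerate, so position weight matrices are normally preferred over an exact consensus match; the exact-match scan is used here only because it keeps the example short.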

Transcription Factor Binding The binding of enhancers by their associated TFs is a main factor in the initiation of gene expression. The typical DNA binding motif recognized by a specific TF is a 6–12 bp sequence in the promoter and/or enhancer region of a gene; TF binding at this motif positively or negatively influences the transcription of the gene through conformational changes. Normally, TFs bind to enhancer regions that contain multiple TF binding sites (i.e., each enhancer region may contain multiple sites for the same TF and/or sites for different TFs). The conditions within each cell type at any particular time in development will dictate which TFs are available for binding the cis-regulatory regions that they recognize. This timing of DNA occupancy by TFs determines the gene regulatory networks (GRNs), which control developmental processes. Combinatorial TF occupancy, in which multiple TFs interact with each other and bind to an enhancer, results in many types of output depending on whether and how the TFs interact. Additive binding can occur when the activation of enhancers correlates with the concentration of a particular TF present, whereas
cooperative binding is the interaction between two TFs occupying adjacent cis-regulatory binding sites. Cooperative binding produces a switchlike effect by either turning on or off gene activity and can increase the affinity of the TFs towards their motifs. In addition to direct protein-protein interactions between TFs in cooperative binding, there can be indirect forms of cooperativity between TFs (e.g., perhaps mediated by other small molecules). Furthermore, occupancy of a binding site by TFs may result in bending of DNA (see next section on DNA looping) leading to the binding of nearby TFs to an enhancer sequence. The binding of a TF to an enhancer is also determined by nucleosome positioning, and enhancer regions are most often found in nucleosome-depleted chromatin. The histone tail modifications in nucleosomes are detected in active and inactive promoter regions and influence the interaction between TFs and nucleosomes (see Spitz and Furlong 2012). Motif composition (i.e., the actual DNA sequence of the cis-regulatory binding site) and motif positioning (i.e., where the binding sites are located on the chromosome with respect to each other and to the gene(s) they regulate) are two characteristics that determine how each enhancer functions. The motif composition includes the binding motifs in an enhancer unique to particular TFs, while motif positioning is the order or orientation of TF motifs found in the enhancer. Motifs may be positioned such that protein-protein interactions for cooperative binding between TFs are facilitated or so that recruitment of other proteins including cofactors is possible. In some cases TF binding leads to recruitment of the assembly of transcription machinery at an enhancer site or alterations in nucleosome interactions that result in binding of other factors that lead to gene expression responses (Spitz and Furlong 2012). 
The way in which TFs interact with associated cis-regulatory DNA, with each other, and with other proteins and the polymerase complex attached to the basal promoter will ultimately lead to increased or decreased rates of transcription. For example, it is possible for a TF to prevent the binding of another TF that normally binds to a nearby site through steric hindrance.

Furthermore, a TF bound to its binding site on the DNA can alter the chromatin structure, creating condensed chromatin. By recruiting the SWI/SNF complex or enzymes capable of acetylation, deacetylation, methylation, and demethylation, TFs can change the state of the chromatin. Finally, TFs bound to DNA can create stabilized DNA loops (see below). For example, a TF acting as a transcriptional activator may, through cooperative binding, interact at the same time with DNA and with RNA polymerase in order to recruit the enzyme complex to the promoter for transcription initiation.
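The contrast drawn in this section between a graded, concentration-dependent response and the switchlike response of cooperative binding can be made concrete with the Hill equation. This is a standard phenomenological model that is not used in this entry, so the function name `occupancy`, the half-saturation constant `k_half`, and the Hill coefficient `hill_n` are assumptions introduced purely for illustration.

```python
def occupancy(tf_conc, k_half=1.0, hill_n=1.0):
    """Fractional occupancy of a binding site under the Hill equation.

    hill_n = 1 gives a graded (non-cooperative) response; hill_n > 1 is a
    rough stand-in for cooperative binding and gives a steeper, more
    switchlike response around tf_conc == k_half.
    """
    if tf_conc < 0:
        raise ValueError("concentration must be non-negative")
    x = tf_conc ** hill_n
    return x / (k_half ** hill_n + x)


if __name__ == "__main__":
    print(" conc  graded  cooperative")
    for conc in (0.25, 0.5, 1.0, 2.0, 4.0):
        graded = occupancy(conc)
        cooperative = occupancy(conc, hill_n=4)
        print(f"{conc:5.2f}  {graded:6.2f}  {cooperative:11.2f}")
```

Both curves pass through half occupancy at the same concentration, but with the higher Hill coefficient the occupancy is close to zero just below that concentration and close to one just above it, which is the on/off behavior attributed to cooperative binding in the text.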

DNA Looping Some TFs are known to interact with each other when they are bound to adjacent cis-regulatory sites, whereas other TFs interact with each other even though their binding sites are many base pairs apart on the DNA sequence. This type of protein-protein interaction occurs through a process called DNA looping, which does not require the TF binding sites to be close to each other on the chromosome. For genes that are regulated by distant enhancers, these chromatin loops bring the enhancer into closer physical proximity to the gene it regulates. In many cases, TF binding to an enhancer leads to DNA looping, which allows the TF/enhancer complex to interact with the promoter region of the gene(s) being expressed. There are several determinants of DNA loop formation, including the presence of specific protein binding sites, the distance between binding sites, the orientation of the bound proteins, and the structural characteristics of the DNA. Moreover, the conformational change of the DNA structure is determined by the free energy of DNA looping. There are two general categories of DNA loops: short (energetic) and long (entropic) loops. Both are determined by the physical forces of their formation. Short DNA loops are usually less than 200 bp and are dictated by DNA elasticity, that is, the ability of the DNA to bend while maintaining the loop structure. On the other hand, long DNA loops are characterized by the
entropy lost when the DNA strands become bound together. The molecular characteristics of the proteins bound to DNA binding sites determine the formation of the DNA loops. In addition, the location and orientation of the DNA binding sites impact the formation of the loops. DNA looping allows functional complexes of more than one protein to form on DNA as multiple binding sites across DNA sequence regulate the same gene region. Therefore, many proteins, enhancers, and silencers can impact the function of RNA polymerase and the polymerase’s binding to the promoter region and transcription of a gene (for further review, see Saiz and Vilar 2006).
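The distinction between short (energetic) and long (entropic) loops can be illustrated with two textbook polymer-physics estimates. Neither formula appears in this entry, so everything below is an assumption made for illustration: DNA is treated as a worm-like chain with a persistence length of about 150 bp, a short loop is approximated as a uniformly bent circle, and the entropic cost of closing a long loop is taken from the ideal-chain scaling of closure probability with length to the power -3/2. Twist, sequence effects, and protein flexibility are ignored.

```python
import math

# Assumed persistence length of double-stranded DNA (~50 nm, about 150 bp).
PERSISTENCE_LENGTH_BP = 150.0


def bending_energy_kT(loop_bp, persistence_bp=PERSISTENCE_LENGTH_BP):
    """Elastic energy (in units of kT) of bending a loop into a circle.

    For a worm-like chain bent uniformly into a circle of contour length L,
    E / kT = 2 * pi**2 * P / L, where P is the persistence length.
    """
    return 2.0 * math.pi ** 2 * persistence_bp / loop_bp


def entropic_cost_kT(loop_bp, reference_bp=1000.0):
    """Entropic cost (in kT) of loop closure relative to a reference length.

    For an ideal chain the closure probability scales as L**(-3/2), so the
    free-energy cost grows as 1.5 * ln(L) plus a constant.
    """
    return 1.5 * math.log(loop_bp / reference_bp)


if __name__ == "__main__":
    print("loop (bp)  bending (kT)  entropy vs 1 kb (kT)")
    for loop_bp in (100, 200, 1000, 5000, 20000):
        print(f"{loop_bp:9d}  {bending_energy_kT(loop_bp):12.1f}  "
              f"{entropic_cost_kT(loop_bp):20.1f}")
```

Under these assumptions the bending term dominates for loops below a few hundred base pairs and falls off as 1/L, while the entropic term grows only logarithmically, which matches the qualitative division into short and long loops described above.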

Clade ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Cleavage and Polyadenylation ▶ Co-transcriptional mRNA Processing in Eukaryotes

Cross-References ▶ Genes and Genomes: Structure

References
Bushey AM, Dorman ER, Corces VG (2008) Chromatin insulators: regulatory mechanisms and epigenetic inheritance. Mol Cell 32:1–9
Carroll SB, Grenier JK, Weatherbee SD (2005) From DNA to diversity: molecular genetics and the evolution of animal design. Blackwell Publishing, Malden
Levine M (2010) Transcriptional enhancers in animal development and evolution. Curr Biol 20:R755–R763
Ong C, Corces VG (2011) Enhancer function: new insights into the regulation of tissue-specific gene expression. Nat Rev Genet 12:283–293
Saiz L, Vilar JMG (2006) DNA looping: the consequences and its control. Curr Opin Struct Biol 16:344–350
Spitz F, Furlong EEM (2012) Transcription factors: from enhancer binding to developmental control. Nat Rev Genet 13:613–626
Wittkopp PJ (2006) Evolution of cis-regulatory sequence and function in Diptera. Heredity 97:139–147
Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA (2003) The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 20:1377–1419

Cis-Regulatory Elements ▶ Cis-Regulation of Eukaryotic Transcription

Cloning Vector Compatibility
Douglas A. Julin Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Synonyms Incompatibility group

Definition Plasmid incompatibility refers to the inability of two plasmids that contain the same replicon to exist stably in an individual cell (del Solar et al. 1998; Novick 1987). A replicon is the specific DNA sequence required for replication of a particular plasmid. The replicon can include cis-acting sequences (e.g., protein-binding sites) and genes that encode diffusible molecules such as RNA or proteins that are necessary for plasmid replication. Two plasmids that contain the same replicon but that are otherwise different (i.e., they may contain different DNA inserts or different antibiotic resistance genes) will not be maintained together in a single bacterial cell in the absence of selective pressure for both plasmids. Instead, one or the other of the plasmids will be lost from the progeny of individual cells, so that the population of cells will after some time consist of cells that contain only one of the original plasmids.

Discussion Plasmid incompatibility is a consequence of the mechanisms for regulation of plasmid replication and copy number (del Solar et al. 1998; Novick 1987). Individual plasmid molecules in a cell are replicated at random, and plasmid molecules are partitioned randomly to daughter cells. Replication initiation is regulated so that the plasmid copy number is maintained at the level typical of the replicon. If the plasmid copy number happens to rise above the normal level, the replication initiation frequency decreases so that the copy number declines back to the normal level. The opposite occurs if the copy number falls below the normal level. Suppose a cell initially contains two plasmids with the same replicon, in equal numbers. The random choice of molecules for replication and segregation means that some daughter cells are likely to contain an unequal number of the two plasmids. A daughter cell that happens to receive only one of the two plasmids at cell division will produce progeny that contain only that plasmid in all future generations. In a daughter cell that has a majority of one of the two plasmids, the more abundant plasmid is more likely to be chosen for replication. As the overall replicon copy number rises, the less abundant plasmid loses the opportunity to replicate due to the negative feedback regulation of initiation. These cells are ever more likely to give progeny with only one of the plasmids. Thus, over time, the number of cells that have lost one of the plasmids increases, at the expense of cells that still have some of each plasmid (Novick 1987). Two important regulatory mechanisms for plasmid replication are antisense RNA and protein binding to iterons. The antisense mechanism is used by plasmids including ColE1 and its derivatives, R1, and pT181 (del Solar et al. 1998). Two RNA molecules called RNAI and RNAII are necessary for regulation. 
RNAII is 550 nucleotides (nt) in length and provides the 3′-OH primer terminus necessary for initiation of plasmid DNA synthesis. RNAI is 108 nt and inhibits replication through base-pairing to RNAII. The amount of RNAI in the cell is proportional to the amount of plasmid present in the cell, so its inhibitory effect is
greater when the plasmid copy number in the cells is high. RNAI has a short half-life in the cell (t1/2 = 2 min) because it is degraded by ribonuclease E (Lin-Chao and Cohen 1991). The rapid inactivation of RNAI allows for greater initiation of replication if the plasmid number, and therefore synthesis of RNAI, becomes low in a cell. The iteron mechanism applies to F, P1, pSC101, R6K, and others (del Solar et al. 1998; Chattoraj 2000). These plasmids encode a replication protein called Rep that binds to a specific site within the plasmid origin to initiate replication. The binding sites, referred to as “iterons” because they are present in multiple repeats in the origin, are necessary for both initiation and inhibition of replication. For initiation, Rep binds to an iteron as a monomer. Two Rep monomers, bound to iterons in different plasmid molecules, can bind to each other. The resulting multimeric protein-DNA complexes are inactive for DNA replication, a mechanism called “handcuffing.” Handcuffing controls plasmid copy number because the probability of forming the inhibitory handcuffed complex increases as the number of plasmid molecules, and therefore the number of iterons, increases in a cell. Plasmid incompatibility can be an issue in situations where an investigator wishes to express two different proteins together in a bacterial cell. The simple strategy of cloning the two genes separately into the same cloning vector (say, pUC18) and selecting for growth in the presence of an antibiotic would be unsuccessful in this case. The cells could initially be induced to take up both recombinant plasmids. However, as the cells grow in culture, the population will be made up of cells that contain only one of the two recombinant plasmids. The simplest way to avoid this problem is to clone the genes into two different plasmids that have different replicons and thus are in different incompatibility groups. The Duet series of plasmids from Novagen (EMD Millipore Corp.) can be used to express up to eight different proteins (two per plasmid) in a single cell (Table 1).

Cloning Vector Compatibility, Table 1 The Duet plasmids (Novagen) that allow for expression of up to eight different proteins in a single E. coli cell

Plasmid        Replicon   Antibiotic resistance (gene)
pACYCDuet-1    P15A       Chloramphenicol (cat)
pETDuet-1      ColE1      Ampicillin (bla)
pCDFDuet-1     CloDF13    Streptomycin (aadA)
pRSFDuet-1     RSF1030    Kanamycin (kan)
pCOLADuet-1    ColA1      Kanamycin (kan)
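The argument in the Discussion, that random replication and random partitioning alone are enough to separate two plasmids sharing a replicon, can be checked with a toy simulation. The sketch below is a deliberately simplified model, and all names (`next_generation`, `fraction_mixed`) and parameter values are hypothetical: each cell is assumed to hold a fixed total copy number, replication doubles the pool by repeatedly picking an existing molecule at random, one daughter receiving exactly the original copy number is followed, and there is no antibiotic selection.

```python
import random


def next_generation(a, b, rng):
    """Advance one cell cycle for a cell with a copies of A and b of B."""
    n = a + b
    # Replication: existing molecules are chosen at random, so the more
    # abundant plasmid is more likely to be copied, until the pool doubles.
    while a + b < 2 * n:
        if rng.random() < a / (a + b):
            a += 1
        else:
            b += 1
    # Partitioning: the daughter we follow receives n molecules at random.
    pool = ["A"] * a + ["B"] * b
    daughter = rng.sample(pool, n)
    a_daughter = daughter.count("A")
    return a_daughter, n - a_daughter


def fraction_mixed(copy_number=20, cells=500, generations=100, seed=0):
    """Return, per generation, the fraction of lineages holding both plasmids."""
    rng = random.Random(seed)
    half = copy_number // 2
    population = [(half, copy_number - half)] * cells
    history = []
    for _ in range(generations):
        # A cell that has lost one plasmid can never regain it.
        population = [
            next_generation(a, b, rng) if a and b else (a, b)
            for a, b in population
        ]
        history.append(sum(1 for a, b in population if a and b) / cells)
    return history


if __name__ == "__main__":
    history = fraction_mixed()
    for generation in (1, 10, 25, 50, 100):
        print(generation, round(history[generation - 1], 3))
```

Because a lineage that has lost one plasmid stays that way, the fraction of mixed cells can only fall over time, and it falls faster for low-copy-number replicons. The model does not represent the antisense RNA or iteron mechanisms themselves, only their net effect of holding the total copy number constant.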

References
Chattoraj DK (2000) Control of plasmid DNA replication by iterons: no longer paradoxical. Mol Microbiol 37:467–476
del Solar G, Giraldo R, Ruiz-Echevarria MJ, Espinosa M, Diaz-Orejas R (1998) Replication and control of circular bacterial plasmids. Microbiol Mol Biol Rev 62:434–464
Lin-Chao S, Cohen SN (1991) The rate of processing and degradation of antisense RNAI regulates the replication of ColE1-type plasmids in vivo. Cell 65:1233–1242
Novick RP (1987) Plasmid incompatibility. Microbiol Rev 51:381–395

Coherence Transfer ▶ NMR Basis for Biomolecular Structure (Theory)

Complement System
Manuel Galvan
Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN, USA
Department of Microbiology and Immunology, School of Medicine, Indiana University, South Bend, IN, USA

Submission date, 2012

Synopsis The complement system is a central component of innate immunity and plays an important role in pathogen recognition and elimination. The
complement system is composed of more than 35 soluble and membrane-bound proteins. Complement proteins are mainly synthesized by liver hepatocytes and circulate throughout the bloodstream. However, both membrane and soluble complement proteins can also be synthesized by peripheral blood leukocytes, such as neutrophils, macrophages, and dendritic cells (DCs). The complement system is an enzymatic cascade of serine proteases, which upon activation form a proteolytic cascade. Complement can be activated via the classical, lectin, and alternative pathway. The classical complement pathway can be activated by both antigen-antibody complexes and nonimmune molecules, such as beta-amyloid, prion protein, and DNA. The lectin pathway is activated by pathogen-specific sugars, such as mannose, fucose, and/or N-acetylglucosamine upon binding to mannose-binding lectin (MBL) or one of the ficolins, ficolin-1, ficolin-2, or ficolin-3. Although the end result of complement activation is MAC assembly, the proteolytic fragments generated throughout the cascade can mediate a wide range of biological functions. C1q and MBL are members of a family of pattern recognition proteins (PRPs) called defense collagens. PRPs bind pathogen-associated molecular patterns (PAMPs) as well as damage-associated molecular patterns (DAMPs). Under normal physiological conditions, complement plays a critical role in the clearance of apoptotic cells and in the removal of deleterious substances originating from necrotic cells, as well as in the clearance of circulating immune complexes. During the course of an infection, complement serves as an extremely effective mechanism for recognition and elimination of foreign pathogens. However, uncontrolled complement activation can be deleterious to host cells. To protect self, organisms have evolved mechanisms to protect host cells from complement-mediated activity. 
Complement fragments are highly labile such that they undergo spontaneous inactivation if they are not stabilized by other reactions. This allows for a regulated localized region of complement activity. In addition, to prevent damage to self-tissue, the complement system is highly regulated by a number of fluid-phase and membrane-bound regulators.

Importantly, there are complement regulators at various steps of the cascade. The importance of complement in immunity is best exemplified in individuals with complement deficiency. Individuals with a deficiency in classical complement pathway components exhibit increased risk for infection by encapsulated bacteria such as S. pneumoniae. In addition, deficiency in C1q, C4, or C2 is linked to the development of autoimmune disorders such as systemic lupus erythematosus (SLE), and this is thought to result from a failure to clear apoptotic cells, which are a source of self-antigens.

Introduction The complement system is a central component of innate immunity. Complement plays an important role in pathogen recognition and elimination. Complement was first described more than 100 years ago as the heat-labile serum constituent that augmented the opsonization of bacteria by antibodies and facilitated antibody-dependent killing of bacteria. Ehrlich coined the term “complement” to describe the activity of serum that complemented the antibacterial activity of antibody. It is now well appreciated that the complement system orchestrates a variety of innate and adaptive immune responses that transcend the initial descriptive function of the system in enhancing bacterial killing. The complement system is composed of more than 35 soluble and membrane-bound proteins. Complement proteins are mainly synthesized by liver hepatocytes and circulate throughout the bloodstream. However, both membrane and soluble complement proteins can also be synthesized by peripheral blood leukocytes, such as neutrophils (Müller et al. 1978; Okuda 1991; Botto et al. 1992; Høgåsen et al. 1995; Nguyen et al. 2008), macrophages (Müller et al. 1978; Colten et al. 1986; Johnson and Hetland 1988), and dendritic cells (DCs) (Schwaeble et al. 1995; Cao et al. 2003; Castellano et al. 2004; Li et al. 2011). In addition, central nervous system (CNS) cells, including microglia, the CNS macrophage-like cell, have also been shown to synthesize the majority of complement proteins (Barnum 1995; Gasque et al. 1995; Korotzer et al. 1995; Hosokawa et al. 2003). The complement system is an enzymatic cascade of serine proteases, which upon activation form a proteolytic cascade. Complement proteins are designated both numerically and alphabetically (e.g., C4 or factor B). Upon complement activation, a number of complement proteins are cleaved into biologically active fragments, a small fragment and a large fragment. The small fragment is usually denoted with the letter “a” (e.g., C4a and C3a), whereas the large fragment is denoted with the letter “b” (e.g., C4b, C3b).

Complement Activation Pathways Complement can be activated via the classical, lectin, and alternative pathways (Walport 2001a, b). The classical complement pathway can be activated by both antigen-antibody complexes (Porter and Reid 1979; Norsworthy et al. 1999) and nonimmune molecules, such as beta-amyloid (Rogers et al. 1992; Velazquez et al. 1997), prion protein (Sim et al. 2007), and DNA (Van Schravendijk and Dwek 1982; Jiang et al. 1992). These molecules/complexes are recognized by the classical complement initiator complex C1, a 720-kDa, Ca2+-dependent complex. The C1 complex comprises the recognition component C1q together with C1r2 and C1s2. C1q is a 460-kDa macromolecule with N-terminal collagenous domains and C-terminal globular domains that is composed of six heterotrimers, each formed from an A-chain, a B-chain, and a C-chain (Reid and Porter 1976). Activation of the classical pathway via C1q binding results in a conformational change in the C1 complex that leads to activation of C1r and C1s. Activated C1s cleaves C4 into C4b and C4a. C4b binds covalently to amine and carbohydrate groups on the activating surface via a thioester group that becomes exposed upon cleavage of C4. Surface-bound C4b binds C2, which is cleaved into C2a and C2b, thereby generating C4bC2b, the C3 convertase. The cascade proceeds in the following order: C1, C4, C2, C3, C5, C6, C7, C8, and C9 (see Fig. 1).

Complement System, Fig. 1 The complement system

Because of variances in the chronology of the discovery of the complement proteins C4 and C2, the cascade does not proceed in exact numerical order. The lectin pathway is activated by pathogen-specific sugars, such as mannose, fucose, and/or N-acetylglucosamine, upon binding to mannose-binding lectin (MBL) or one of the ficolins, ficolin-1, ficolin-2, or ficolin-3. MBL has been shown to bind a number of pathogens, including bacteria, fungi, and protozoa. MBL, the recognition component of the lectin pathway, is a 230-kDa oligomeric C-type lectin that is assembled from six homotrimeric peptide chains. Structurally, MBL is very similar to C1q and forms macromolecular complexes with the MBL-associated serine proteases (MASPs) MASP-1, MASP-2, and MASP-3 and the non-protease molecule MAp19 (Holmskov et al. 2003). This complex is analogous to C1. Similarly, surface binding of MBL results in a conformational change in the MBL complex that leads to MASP activation. Activated MASP-2 cleaves C4, resulting in C4b being covalently deposited on the cell surface. MASP-2 also cleaves surface-bound C2, resulting in C4b2b, the same C3 convertase generated by the classical complement pathway.

Activation of the complement cascade through either the classical or lectin pathway results in the assembly of the same C3 convertase (C4b2b) (Nagasawa et al. 1985). As mentioned above, the C3 convertase is assembled from C2 and C4 upon cleavage of C4 by either C1s or MASP-2. In contrast, an alternative C3 convertase is generated as a result of the spontaneous hydrolysis of C3, leading to the formation of C3b(H2O). As such, complement component C3 is the convergence point of the complement cascade. C3 is a 190-kDa protein which is relatively inert in its native form. However, a small number of C3 molecules are constantly hydrolyzed by water to generate active C3, which is the activator of the alternative pathway. C3b(H2O) forms a complex with factor B (i.e., C3b(H2O)B), upon which factor B is cleaved by factor D. The end result of these enzymatic reactions is the C3b(H2O)Bb complex, the C3 convertase of the alternative pathway. Importantly, the alternative pathway also serves as a potent amplification mechanism of the complement cascade upon activation through either the classical or lectin pathway. The C3 convertases generated through the classical, lectin, or alternative pathway cleave C3 into C3a and C3b. This reaction generates the C5 convertases (C4b2b3b or C3bBb3b), which can cleave C5, a 190-kDa disulfide-linked heterodimer, into C5a and C5b. Deposition of C5b on a surface, such as bacteria, initiates assembly of C5b-9n, the lytic membrane attack complex (MAC), which is inserted into cell surface membranes. MAC is a multimeric protein complex composed of C5b, C6, C7, C8, and multiple C9 molecules that polymerize into transmembrane channels that are approximately 5–10 nm in diameter (Podack 1984; Morgan 1999). Insertion of MAC in the lipid bilayer may disrupt ionic homeostasis, causing the cell to swell and burst.
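The way the three activation pathways converge on C3 and then share the terminal steps can be summarized as a small lookup table. The sketch below is only a bookkeeping aid built from the names used in this section (with the alternative-pathway C5 convertase written as C3bBb3b); the dictionary layout, the field names, and the helper functions `cleave` and `trace` are hypothetical, and the model ignores regulation, amplification, and kinetics.

```python
# Summary of each activation pathway, using the nomenclature of this entry.
PATHWAYS = {
    "classical": {
        "trigger": "C1q binding (C1 complex)",
        "protease": "C1s",
        "c3_convertase": "C4b2b",
        "c5_convertase": "C4b2b3b",
    },
    "lectin": {
        "trigger": "MBL or ficolin binding",
        "protease": "MASP-2",
        "c3_convertase": "C4b2b",
        "c5_convertase": "C4b2b3b",
    },
    "alternative": {
        "trigger": "spontaneous hydrolysis of C3",
        "protease": "factor D",
        "c3_convertase": "C3b(H2O)Bb",
        "c5_convertase": "C3bBb3b",
    },
}

# Components of the membrane attack complex, in order of assembly.
TERMINAL_COMPONENTS = ["C5b", "C6", "C7", "C8", "C9"]


def cleave(component):
    """Name the small ('a') and large ('b') fragments of a component."""
    return component + "a", component + "b"


def trace(pathway):
    """List the main steps from pathway trigger to MAC assembly."""
    info = PATHWAYS[pathway]
    c3a, c3b = cleave("C3")
    c5a, c5b = cleave("C5")
    return [
        f"trigger: {info['trigger']} (key protease: {info['protease']})",
        f"C3 convertase {info['c3_convertase']} cleaves C3 into {c3a} and {c3b}",
        f"C5 convertase {info['c5_convertase']} cleaves C5 into {c5a} and {c5b}",
        "MAC assembles from " + ", ".join(TERMINAL_COMPONENTS),
    ]


if __name__ == "__main__":
    for name in PATHWAYS:
        print(name)
        for step in trace(name):
            print("  " + step)
```

Printing the three traces side by side makes the convergence explicit: the classical and lectin pathways share a convertase, and all three pathways share every step from C3 cleavage onward.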

Complement Effector Functions Although the end result of complement activation is MAC assembly, the proteolytic fragments generated throughout the cascade can mediate a wide range of biological functions. For example, macrophages and neutrophils can recognize and engulf bacteria that have been opsonized by either C4b or C3b via complement receptor type 1 (CR1). In addition, CR2, CR3, and CR4 bind the iC3b fragment to induce phagocytosis (Daha 2010). The C4a, C3a, and C5a complement fragments generated as a result of complement activation are called anaphylatoxins. Anaphylatoxins are potent mediators of inflammatory responses, including vascular permeability, phagocyte recruitment, and cell activation (Markiewski and Lambris 2009). Anaphylatoxins exert their effects through their corresponding G-protein-coupled receptors, C3a receptor (C3aR) and C5a receptor (C5aR), also known as CD88. Notably, there is some evidence to suggest that anaphylatoxins can also be generated by alternative mechanisms. For example, recent studies have shown that thrombin can cleave C5 into C5a and C5b irrespective of the presence of C3, suggesting an alternative mechanism for generating the C5a anaphylatoxin and for MAC assembly (Huber-Lang et al. 2006).

C1q and MBL Are Defense Collagens While it is well recognized that C1q is the recognition component of the classical complement pathway, its role in nonclassical complement-mediated events is underappreciated. C1q and MBL are members of a family of pattern recognition proteins (PRPs) called defense collagens. PRPs bind pathogen-associated molecular patterns (PAMPs) as well as damage-associated molecular patterns (DAMPs). For example, MBL has been shown to enhance phagocytosis of Staphylococcus aureus (Neth et al. 2000), Neisseria meningitidis (Jack et al. 2005), and Streptococcus pneumoniae (Roy et al. 2002). Importantly, both C1q and MBL are known to bind to apoptotic cells and trigger macrophage apoptotic cell clearance in the absence of inflammation (Korb and Ahearn 1997; Nauta et al. 2002; Fraser et al. 2009, 2010). Although several putative receptors have been described for C1q, the receptor by which C1q enhances phagocytosis has remained elusive. CD93 and CD91/calreticulin complexes were proposed as C1q receptors that lead to enhancement of phagocytosis. However, murine macrophages genetically deficient in either CD93 or CD91 still responded to C1q with enhanced phagocytosis (Norsworthy et al. 2004; Lillis et al. 2008). gC1qR, a receptor that binds the globular domain of C1q, has been proposed to mediate C1q-dependent chemotaxis (Vegh et al. 2006). In addition, gC1qR has been shown to bind directly to a number of pathogens and stimulate phagocytosis (Peerschke and Ghebrehiwet 2007).

Complement Regulatory Proteins Under normal physiological conditions, complement plays a critical role in the clearance of apoptotic cells and in the removal of deleterious substances originating from necrotic cells, as well as in the clearance of circulating immune complexes. During the course of an infection, complement serves as an extremely effective mechanism for recognition and elimination of foreign pathogens. However, uncontrolled complement activation can be deleterious to host cells. To protect self, organisms have evolved mechanisms to protect host cells from complement-mediated activity. Complement fragments are highly labile such that they undergo spontaneous inactivation if they are not stabilized by other reactions. This allows for a regulated localized region of complement activity. In addition, to prevent damage to self-tissue, the complement system is highly regulated by a number of fluid-phase and membrane-bound regulators. Importantly, there are complement regulators at various steps of the cascade. For example, the glycoprotein C1 inhibitor (C1 Inh) can inhibit formation of C1 by binding C1r and C1s, which prevents the subsequent activation of C4 and C2. The opsonic fragments C3b and C4b undergo proteolytic cleavage by the serine protease factor I in association with the CD46 (MCP, membrane cofactor protein) and complement receptor 1 (CR1) complex or in association with the factor H and C4BP (C4-binding protein) complex (Miwa and Song 2001). The C3 and C5 convertases are quickly disassembled and inactivated by complement regulatory molecules that have decay-accelerating activity, such as CD55 (DAF, decay-accelerating factor), CR1, factor H, and C4BP (Miwa and Song 2001). The biological activities of anaphylatoxins are regulated by carboxypeptidases, which circulate in plasma or are present in the tissue (Matthews et al. 2004). For example, carboxypeptidase-N hydrolyzes the C-terminal peptide bond of anaphylatoxins, releasing the C-terminal arginine and thereby reducing their activity. The derivatives are known as C3a-desArg and C5a-desArg. MAC assembly is regulated by the membrane-bound protein CD59 as well as clusterin and vitronectin (Davies et al. 1989; Milis et al. 1993; McDonald and Nelsestuen 1997).

MAC Functions Beyond Cell Lysis Although historically MAC has been exclusively linked to cell death, in vitro studies have shown that deposition of C5b-9 can have diverse biological functions, including cell proliferation and cell survival (Niculescu and Rus 2001). C5b-9 has been shown to activate mitogen-activated protein kinase (MAPK) pathways (Niculescu et al. 1997). MAPK pathways regulate diverse physiological processes including cell growth, differentiation, and apoptotic cell death. For example, studies have shown that C5b-9 activates the ERK pathway (Niculescu et al. 1997). The ERK pathway is a member of the MAPK pathways that is important in cell proliferation and cell survival (Rubinfeld and Seger 2005). Moreover, the terminal complexes formed leading up to C5b-9, such as C5b-7 and C5b-8, have also been shown to affect signaling molecules. For example, generation of cAMP and lipid-derived messengers such as diacylglycerol and ceramide can be induced by C5b-7 (Niculescu et al. 1993). C5b-8 has been shown to influence cytosolic Ca2+ and protein kinase activity (Wiedmer et al. 1987; Carney et al. 1990). The ability of C5b-9 to mediate diverse physiological functions is consistent with its ability to activate multiple signaling pathways.


Complement Deficiencies The importance of complement in immunity is best exemplified in individuals with complement deficiency. Individuals with deficiency in classical complement pathway components exhibit increased risk for infection by encapsulated bacteria such as S. pneumoniae (Skattum et al. 2011). In addition, deficiency in either C1q, C4, or C2 is linked to the development of autoimmune disorders such as systemic lupus erythematosus (SLE), and this is thought to result from a failure to clear apoptotic cells, which are a source of self-antigens. Although very rare, C1q deficiency is the strongest susceptibility factor for autoimmunity associated with lupus (Pickering et al. 2000). Lupus patients often show decreased circulating C1q levels and increased anti-C1q antibodies, which can result in acquired C1q deficiency during active disease states. For example, 60% of patients with clinical symptoms of lupus nephritis tested positive for the presence of anti-C1q antibodies compared to healthy age-matched volunteers (Jones et al. 1999; Akhter et al. 2011; Smykał-Jankowiak et al. 2011). Moreover, lupus patients also suffer from recurring infections from encapsulated bacteria such as S. pneumoniae and N. meningitidis. Despite the structural similarity between C1 and MBL and a described role for MBL in phagocytosis of apoptotic cells, there is no significant association between MBL deficiency and SLE (Lee et al. 2005). Mannose binding and thus lectin pathway complement activation have been demonstrated for a number of pathogens that colonize the lung, including Haemophilus influenzae (Neth et al. 2000), Mycoplasma pneumoniae (Hamvas et al. 2005), Mycobacterium avium (Polotsky et al. 1997), and Legionella pneumophila (Kuipers et al. 2003). In line with these observations, polymorphisms in MBL are associated with increased susceptibility to bacterial infections (Eisen 2010).
The most common MBL polymorphism is found in the structural and promoter sequence of the MBL2 gene. This affects assembly of the molecule and therefore affects its ability to activate the lectin pathway. MASP-2 deficiency has also been shown to affect lectin pathway activation. Clinical


studies have shown that people deficient in MBL have severe respiratory tract infections. Indeed, MBL is found in bronchoalveolar lavage (BAL) fluid of patients with pneumonia but not in non-inflamed lungs (Gomi et al. 2004; Fidler et al. 2009). Moreover, there is a strong association between MBL deficiency and childhood infection, specifically respiratory tract infection (Fidler et al. 2009). Individuals with complement deficiencies of the alternative and terminal pathway exhibit increased susceptibility to bacterial infections. For example, deficiency in factor D or inherited deficiency of each of the terminal complement components (C5–C9) is associated with increased susceptibility to neisserial infection and meningococcal infection (Skattum et al. 2011).

References Akhter E, Burlingame R, Seaman A et al (2011) Anti-C1q antibodies have higher correlation with flares of lupus nephritis than other serum markers. Lupus 20:1267–1274 Barnum SR (1995) Complement biosynthesis in the central nervous system. Crit Rev Oral Biol Med 6:132–146 Botto M, Lissandrini D, Sorio C, Walport MJ (1992) Biosynthesis and secretion of complement component (C3) by activated human polymorphonuclear leukocytes. J Immunol 149:1348–1355 Cao W, Bobryshev YV, Lord RSA et al (2003) Dendritic cells in the arterial wall express C1q: potential significance in atherogenesis. Cardiovasc Res 60:175–186 Carney DF, Lang TJ, Shin ML (1990) Multiple signal messengers generated by terminal complement complexes and their role in terminal complement complex elimination. J Immunol 145:623–629 Castellano G, Woltman AM, Schena FP et al (2004) Dendritic cells and complement: at the cross road of innate and adaptive immunity. Mol Immunol 41:133–140 Colten HR, Strunk RC, Perlmutter DH, Cole FS (1986) Regulation of complement protein biosynthesis in mononuclear phagocytes. Ciba Found Symp 118:141–154 Daha MR (2010) Role of complement in innate immunity and infections. Crit Rev Immunol 30:47–52 Davies A, Simmons DL, Hale G et al (1989) CD59, an LY-6-like protein expressed in human lymphoid cells, regulates the action of the complement membrane attack complex on homologous cells. J Exp Med 170:637–654 Eisen DP (2010) Mannose-binding lectin deficiency and respiratory tract infection. J Innate Immun 2:114–122 Fidler KJ, Hilliard TN, Bush A et al (2009) Mannose-binding lectin is present in the infected airway: a


possible pulmonary defence mechanism. Thorax 64:150–155 Fraser DA, Laust AK, Nelson EL, Tenner AJ (2009) C1q differentially modulates phagocytosis and cytokine responses during ingestion of apoptotic cells by human monocytes, macrophages, and dendritic cells. J Immunol 183:6175–6185 Fraser DA, Pisalyaput K, Tenner AJ (2010) C1q enhances microglial clearance of apoptotic neurons and neuronal blebs, and modulates subsequent inflammatory cytokine production. J Neurochem 112:733–743 Gasque P, Fontaine M, Morgan BP (1995) Complement expression in human brain. Biosynthesis of terminal pathway components and regulators in human glial cells and cell lines. J Immunol 154:4726–4733 Gomi K, Tokue Y, Kobayashi T et al (2004) Mannose-binding lectin gene polymorphism is a modulating factor in repeated respiratory infections. Chest 126:95–99 Hamvas RMJ, Johnson M, Vlieger AM et al (2005) Role for mannose binding lectin in the prevention of mycoplasma infection. Infect Immun 73:5238–5240 Høgåsen AK, Würzner R, Abrahamsen TG, Dierich MP (1995) Human polymorphonuclear leukocytes store large amounts of terminal complement components C7 and C6, which may be released on stimulation. J Immunol 154:4734–4740 Holmskov U, Thiel S, Jensenius JC (2003) Collectins and ficolins: humoral lectins of the innate immune defense. Annu Rev Immunol 21:547–578 Hosokawa M, Klegeris A, Maguire J, McGeer PL (2003) Expression of complement messenger RNAs and proteins by human oligodendroglial cells. Glia 42:417–423 Huber-Lang M, Sarma JV, Zetoune FS et al (2006) Generation of C5a in the absence of C3: a new complement activation pathway. Nat Med 12:682–687 Jack DL, Lee ME, Turner MW et al (2005) Mannose-binding lectin enhances phagocytosis and killing of Neisseria meningitidis by human macrophages. J Leukoc Biol 77:328–336 Jiang H, Cooper B, Robey FA, Gewurz H (1992) DNA binds and activates complement via residues 14–26 of the human C1q A chain.
J Biol Chem 267: 25597–25601 Johnson E, Hetland G (1988) Mononuclear phagocytes have the potential to synthesize the complete functional complement system. Scand J Immunol 27:489–493 Jones JL, Hanson DL, Dworkin MS et al (1999) Surveillance for AIDS-defining opportunistic illnesses, 1992–1997. MMWR CDC Surveill Summ 48:1–22 Korb LC, Ahearn JM (1997) C1q binds directly and specifically to surface blebs of apoptotic human keratinocytes: complement deficiency and systemic lupus erythematosus revisited. J Immunol 158: 4525–4528 Korotzer AR, Watt J, Cribbs D et al (1995) Cultured rat microglia express C1q and receptor for C1q: implications for amyloid effects on microglia. Exp Neurol 134:214–221

Kuipers S, Aerts PC, Van Dijk H (2003) Differential microorganism-induced mannose-binding lectin activation. FEMS Immunol Med Microbiol 36:33–39 Lee YH, Witte T, Momot T et al (2005) The mannose-binding lectin gene polymorphisms and systemic lupus erythematosus: two case–control studies and a meta-analysis. Arthritis Rheum 52:3966–3974 Li K, Fazekasova H, Wang N et al (2011) Expression of complement components, receptors and regulators by human dendritic cells. Mol Immunol 48:1121–1127 Lillis AP, Greenlee MC, Mikhailenko I et al (2008) Murine low-density lipoprotein receptor-related protein 1 (LRP) is required for phagocytosis of targets bearing LRP ligands but is not required for C1q-triggered enhancement of phagocytosis. J Immunol 181:364–373 Markiewski MM, Lambris JD (2009) Unwelcome complement. Cancer Res 69:6367–6370 Matthews KW, Drouin SM, Liu C et al (2004) Expression of the third complement component (C3) and carboxypeptidase N small subunit (CPN1) during mouse embryonic development. Dev Comp Immunol 28:647–655 McDonald JF, Nelsestuen GL (1997) Potent inhibition of terminal complement assembly by clusterin: characterization of its impact on C9 polymerization. Biochemistry 36:7464–7473 Milis L, Morris CA, Sheehan MC et al (1993) Vitronectin-mediated inhibition of complement: evidence for different binding sites for C5b-7 and C9. Clin Exp Immunol 92:114–119 Miwa T, Song WC (2001) Membrane complement regulatory proteins: insight from animal studies and relevance to human diseases. Int Immunopharmacol 1:445–459 Morgan BP (1999) Regulation of the complement membrane attack pathway. Crit Rev Immunol 19:173–198 Müller W, Hanauske-Abel H, Loos M (1978) Biosynthesis of the first component of complement by human and guinea pig peritoneal macrophages: evidence for an independent production of the C1 subunits.
J Immunol 121:1578–1584 Nagasawa S, Kobayashi C, Maki-Suzuki T et al (1985) Purification and characterization of the C3 convertase of the classical pathway of human complement system by size exclusion high-performance liquid chromatography. J Biochem 97:493–499 Nauta AJ, Trouw LA, Daha MR et al (2002) Direct binding of C1q to apoptotic cells and cell blebs induces complement activation. Eur J Immunol 32:1726–1736 Neth O, Jack DL, Dodds AW et al (2000) Mannose-binding lectin binds to a range of clinically relevant microorganisms and promotes complement deposition. Infect Immun 68:688–693 Nguyen HX, Galvan MD, Anderson AJ (2008) Characterization of early and terminal complement proteins associated with polymorphonuclear leukocytes in vitro and in vivo after spinal cord injury. J Neuroinflammation 5:26 Niculescu F, Rus H (2001) Mechanisms of signal transduction activated by sublytic assembly of terminal

complement complexes on nucleated cells. Immunol Res 24:191–199 Niculescu F, Rus H, Shin S et al (1993) Generation of diacylglycerol and ceramide during homologous complement activation. J Immunol 150:214–224 Niculescu F, Rus H, van Biesen T, Shin ML (1997) Activation of Ras and mitogen-activated protein kinase pathway by terminal complement complexes is G protein dependent. J Immunol 158:4405–4412 Norsworthy P, Theodoridis E, Botto M et al (1999) Overrepresentation of the Fcgamma receptor type IIA R131/R131 genotype in caucasoid systemic lupus erythematosus patients with autoantibodies to C1q and glomerulonephritis. Arthritis Rheum 42:1828–1832 Norsworthy PJ, Fossati-Jimack L, Cortes-Hernandez J et al (2004) Murine CD93 (C1qRp) contributes to the removal of apoptotic cells in vivo but is not required for C1q-mediated enhancement of phagocytosis. J Immunol 172:3406–3414 Okuda T (1991) Murine polymorphonuclear leukocytes synthesize and secrete the third component and factor B of complement. Int Immunol 3:293–296 Peerschke EI, Ghebrehiwet B (2007) The contribution of gC1qR/p33 in infection and inflammation. Immunobiology 212:333–342 Pickering MC, Botto M, Taylor PR et al (2000) Systemic lupus erythematosus, complement deficiency, and apoptosis. Adv Immunol 76:227–324 Podack ER (1984) Molecular composition of the tubular structure of the membrane attack complex of complement. J Biol Chem 259:8641–8647 Polotsky VY, Belisle JT, Mikusova K et al (1997) Interaction of human mannose-binding protein with Mycobacterium avium. J Infect Dis 175:1159–1168 Porter RR, Reid KB (1979) Activation of the complement system by antibody-antigen complexes: the classical pathway. Adv Protein Chem 33:1–71 Reid KB, Porter RR (1976) Subunit composition and structure of subcomponent C1q of the first component of human complement.
Biochem J 155:19–23 Rogers J, Cooper NR, Webster S et al (1992) Complement activation by beta-amyloid in Alzheimer disease. Proc Natl Acad Sci U S A 89:10016–10020 Roy S, Knox K, Segal S et al (2002) MBL genotype and risk of invasive pneumococcal disease: a case–control study. Lancet 359:1569–1573. https://doi.org/10.1016/S0140-6736(02)08516-1 Rubinfeld H, Seger R (2005) The ERK cascade: a prototype of MAPK signaling. Mol Biotechnol 31:151–174 Schwaeble W, Schafer MK, Petry F et al (1995) Follicular dendritic cells, interdigitating cells, and cells of the monocyte-macrophage lineage are the C1q-producing sources in the spleen. Identification of specific cell types by in situ hybridization and immunohistochemical analysis. J Immunol 155:4971–4978 Sim RB, Kishore U, Villiers CL et al (2007) C1q binding and complement activation by prions and amyloids. Immunobiology 212:355–362


Skattum L, Van Deuren M, Van Der Poll T, Truedsson L (2011) Complement deficiency states and associated infections. Mol Immunol 48:1643–1655 Smykał-Jankowiak K, Niemir ZI, Polcyn-Adamczak M (2011) Do circulating antibodies against C1q reflect the activity of lupus nephritis? Pol Arch Med Wewn 121:287–294 Van Schravendijk MR, Dwek RA (1982) Interaction of C1q with DNA. Mol Immunol 19:1179–1187 Vegh Z, Kew RR, Gruber BL, Ghebrehiwet B (2006) Chemotaxis of human monocyte-derived dendritic cells to complement component C1q is mediated by the receptors gC1qR and cC1qR. Mol Immunol 43:1402–1407 Velazquez P, Cribbs DH, Poulos TL, Tenner AJ (1997) Aspartate residue 7 in amyloid beta-protein is critical for classical complement pathway activation: implications for Alzheimer’s disease pathogenesis. Nat Med 3:77–79 Walport MJ (2001a) Complement. Second of two parts. N Engl J Med 344:1140–1144 Walport MJ (2001b) Complement. First of two parts. N Engl J Med 344:1058–1066 Wiedmer T, Ando B, Sims PJ (1987) Complement C5b-9-stimulated platelet secretion is associated with a Ca2+-initiated activation of cellular protein kinases. J Biol Chem 262:13674–13681

Computational Epigenetics ▶ Epigenetic Research, Computational Methods in

Conjugative Transfer Systems and Classifying Plasmid Genomes Fernando de la Cruz and M. Pilar Garcillán-Barcia Instituto de Biomedicina y Biotecnología de Cantabria (IBBTEC), Universidad de Cantabria-CSIC-IDICAN, Santander, Cantabria, Spain

Synopsis The ability to transfer DNA between bacteria by conjugation is a core function of plasmids. The best-studied types of conjugative systems involve functions that create a mating bridge between bacteria, functions that promote DNA-processing events required for transfer, and a coupling


function that allows DNA processing to happen at the pore and thus drive transfer. The genes encoding these functions provide a valuable resource for classifying plasmids and predicting their transfer characteristics. In combination with the classification of replicons and addiction systems, these transfer functions are a key tool for analyzing plasmid genomes.

Introduction Horizontal gene transfer (HGT) is the mechanism by which bacteria acquire a significant part of their genetic information, which is used for adaptation and speciation (de la Cruz and Davies 2000). Among the mechanisms that allow HGT, conjugation seems the most active (Halary et al. 2010). Conjugation has certainly been responsible for the spread of multiple antibiotic resistance genes in Enterobacteriaceae, where plasmids have been most thoroughly analyzed. Conjugative transfer systems take one of two genetic architectures: they can be contained in autonomously replicating plasmids or be integrated into the bacterial chromosomes, where they are normally part of mobile elements called “integrative and conjugative elements” (ICEs) and IMEs (integrative and mobilizable elements). Although these two types of mobile genetic elements are considered different entities, comparative genomics analysis indicates that they are two presentations of the same type of element (Guglielmini et al. 2011). Because they tend to slow down the growth of their hosts due to metabolic or phenotypic burdens, in order to survive as autonomous replicative elements, plasmids must either give their hosts a counterbalancing advantage (such as antibiotic resistance) or replicate faster than their hosts. If not, they are bound for extinction (Slater et al. 2008). Plasmids transmissible by conjugation are called conjugative if they contain all the genetic information required for self-transfer, or mobilizable, if they require the additional transfer functions of a conjugative plasmid present in the same strain. Usually, mobilizable plasmids code for an origin of transfer (oriT), a relaxase that recognizes it, and a related

accessory protein, also involved in oriT recognition and processing. By being mobilizable rather than encoding the whole of the transfer apparatus, a plasmid trades some loss of autonomy for a way of reducing its burden to the host cell.

A New System for Classifying Plasmid Genomes Since conjugative transfer is a key property as well as one of the core (or backbone) plasmid attributes, having ways to determine which transfer functions are present and their relation to known systems will greatly aid in the understanding and classification of plasmids in different systems. In order to have an overview of the landscape of conjugative systems, it was necessary to develop a robust classification system. The only gene in common to all transmissible elements (conjugative and mobilizable, including ICEs and IMEs) is the relaxase. Relaxases provide good phylogenetic markers, since they belong to a dominant protein family (characterized by the 3H sequence signature) (Garcillan-Barcia et al. 2009). The diversity of transmissible elements is encompassed by eight relaxase MOB families and eight mating pair formation (MPF) families (Guglielmini et al. 2011, 2013, 2014) (see the comprehensive databases CONJscan (http://bit.ly/CONJscan) and CONJdb (http://conjdb.web.pasteur.fr)). MPF proteins include the type 4 secretion system (T4SS) that forms the mating channel and a pilus that attaches to the recipient cell. In proteobacteria, conjugative or mobilizable plasmids could be classified into one of six MOB families (Garcillan-Barcia et al. 2009, 2011) and four MPF families (Smillie et al. 2010). Interestingly, the relaxase phylogeny was in general agreement with the coupling protein (T4CP) phylogeny and, to a lesser extent, with the T4SS protein phylogenies, which could in turn be classified into four families (MPFT, MPFF, MPFI, and MPFG) (Smillie et al. 2010).
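The distinction drawn above between conjugative and mobilizable plasmids can be sketched as a small decision rule over gene content. This is a simplified illustration, not the published pipeline: the `PlasmidAnnotation` class, its fields, the example plasmid names, and the "non-transmissible" label are hypothetical names introduced here, and the rule assumes that a relaxase, a T4CP, and a T4SS have already been detected or ruled out by some upstream annotation step.

```python
from dataclasses import dataclass


@dataclass
class PlasmidAnnotation:
    """Hypothetical summary of transfer genes found on one plasmid."""
    name: str
    has_relaxase: bool
    has_t4cp: bool
    has_t4ss: bool


def classify_mobility(plasmid: PlasmidAnnotation) -> str:
    """Classify a plasmid from its transfer gene content (simplified rule)."""
    # The relaxase is the only gene common to all transmissible elements.
    if not plasmid.has_relaxase:
        return "non-transmissible"
    # Self-transfer additionally needs the coupling protein and the T4SS.
    if plasmid.has_t4cp and plasmid.has_t4ss:
        return "conjugative"
    # A relaxase without the full apparatus relies on a helper plasmid.
    return "mobilizable"


examples = [
    PlasmidAnnotation("example_A", True, True, True),
    PlasmidAnnotation("example_B", True, False, False),
    PlasmidAnnotation("example_C", False, False, False),
]
for p in examples:
    print(p.name, classify_mobility(p))
```

Keeping the rule in one function makes it easy to swap in a stricter definition, for example one that requires a specific set of T4SS components.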
In general, conjugative genetic determinants, once assembled, remain as a phylogenetic unit over relatively long evolutionary periods. For instance, within the MOBF1 line, which is very ancient, there are three branches: MOBF11, which contains a MPFT; MOBF12, which contains a MPFF; and MOBF13,

Conjugative Transfer Systems and Classifying Plasmid Genomes, Table 1 Main MOB and REP plasmid types found in gamma-proteobacteria

MOB type(a)   Inc group(b)   REP type(c)                  Plasmid prototype(d)
MOBF11        IncW           W, WoriV                     R388
MOBF11        IncN           N, NoriV                     R46
MOBF11        IncP-9         P9rep                        pWWO
MOBF12        IncFI          FIA, FIB, FIC, FrepB, FII    F
MOBF12        IncFII         FrepB                        R100
MOBF12        IncFIII/FIV    FrepB                        pSU316
MOBF12        IncFV          FrepB                        pED208
MOBF12        –              FIIS, FIBS                   pSLT
MOBF12        –              FIIK                         pKPN4
MOBF12        –              FIIY                         pMT
MOBP11        IncP-1α        P, PtrfA1, trfA              RP4
MOBP11        IncP-1β        PtrfA2, trfA                 R751
MOBP11        IncP-1γ        trfA_γ                       pQKH54
MOBP11        IncP-1δ        trfA_δ                       pEST4011
MOBP11        IncP-1ε        trfA                         pKJK5
MOBP11        IncP-1ζ                                     pMCBF1
MOBP12        IncI-1α        I1-Iγ                        R64
MOBP12        IncI-1γ        I1-Iγ                        R621a
MOBP12        IncK           K/B                          R387
MOBP12        IncB/O         K/B                          TP113
MOBP13        IncL/M                                      pCTX-M3
MOBP14        IncQ-2α        –                            pTF-FC2
MOBP14        IncQ-2β        –                            pTC-F14
MOBP14        IncG/IncP-6    –                            Rms149
MOBP3         IncX2          X                            R6K
MOBP3         IncX1          –                            pOLA52
MOBP4         IncU           U                            pFBAOT6
MOBP51        –              –                            ColE1
MOBP51        –              ColE, ColETp                 pTPqnrS-1a
MOBP52        –              –                            p9555
MOBP53        –              –                            pAsal1
–             IncI-2         –                            R721
MOBH11        IncHI1         HI1                          R27
MOBH11        IncHI2         HI2                          R478
MOBH11        IncP-7         –                            pCAR1
MOBH12        IncJ           –                            R391(e)
MOBH12        IncA/C         A/C                          pSN254
–             IncT           T                            Rts1
MOBH2         –              –                            pKLC102
MOBC11        –              –                            CloDF13
MOBC12        –              –                            pYptb32953
MOBQ11        IncQ-1         QoriV                        RSF1010
MOBQ12        –              –                            p11745
MOBQu         –              –                            pIGWZ12

(a) MOB types according to references Alvarado et al. (2012) and Garcillan-Barcia et al. (2011). A dash indicates that no primers were developed because of the limited number of relaxase members
(b) When incompatibility was experimentally tested, the Inc group is provided
(c) REP types according to Suppl Table 1 in Alvarado et al. (2012) and references therein. Sometimes, more than one primer pair was developed for some of the REP groups. Their names are annotated, while more details are to be found in the given references. A dash indicates the absence of primers for detection of the corresponding plasmid
(d) Plasmid representatives of each gamma-proteobacterial MOB subfamily are listed. More than one reference plasmid per MOB subfamily was included when they belong to different Inc or REP types or when more than one MOB primer pair was used for detection of the subfamily
(e) R391 is an ICE, formerly considered an IncJ plasmid

which contains a MPFC. No further variations in the associations between MOB and MPF seem to have occurred in the evolutionary history of these elements. In addition, large phylogenetic groups of T4CP sequences cluster into coherent taxonomical clusters. Because of the apparent modular organization of plasmid genomes, which is known to allow shuffling of genomic segments, it is surprising that there has not been more mixing of these building blocks. Genomic analysis showed that among 263 transmissible plasmids (out of 503 gamma-proteobacterial plasmids recorded in DNA sequence databases), approximately 54% were conjugative while the other 46% were mobilizable. The average size of a conjugative plasmid was 100 kb, with considerable variance (from 21 to 279 kb). Mobilizable plasmids showed at least three abundance peaks of average sizes, 5 kb, 175 kb, and 1,500 kb, possibly reflecting different mobility strategies, as discussed by Smillie et al. (2010). Finally, the same approach could be used to identify and classify ICEs in sequenced bacterial genomes (Guglielmini et al. 2011). The most surprising discovery was that ICEs outnumber plasmids in


sequenced genomes by two to one. In addition, plasmids and ICEs are intermingled in phylogenetic trees for relaxases, indicating that plasmids and ICEs frequently interconvert, thus blurring the distinction between the two types of elements.

Creating a Package of Classification Tools The MOB classification approach has already been put to use for plasmid classification. Experimental tools for applying it in the field of clinical plasmid epidemiology have been developed (see, e.g., Coelho et al. 2012; Mata et al. 2010, 2012; Valverde et al. 2009). For this purpose, a set of 19 degenerate oligonucleotide primer pairs that allow plasmid typing of gamma-proteobacterial plasmids was selected (Alvarado et al. 2012). By using this tool, called degenerate primer MOB typing, or DPMT, most clinically relevant plasmids could be classified. DPMT nicely complemented the highly successful PCR-based replicon typing (PBRT) of the Carattoli group (Carattoli et al. 2005). Table 1 shows the main plasmid families that can be analyzed by both methods. MOB typing not only opens the possibility of analyzing plasmids and ICEs as a continuum and following the dynamics of their interconversion but also offers the opportunity for interrogating metagenomes and analyzing the structure and dynamics of conjugative transfer systems.

Acknowledgments Work in the FdlC laboratory was supported by the Spanish Ministry of Economy and Competitiveness (BFU2011–26608) and the European Seventh Framework Program (612146/FP7-ICT-2013-10 and 282004/FP7–HEALTH-2011-2.3.1–2). MPGB received a JAE-Doc_2009 postdoctoral contract from Consejo Superior de Investigaciones Científicas, which was cofinanced by the European Science Foundation.

References Alvarado A, Garcillan-Barcia MP, de la Cruz F (2012) A Degenerate Primer MOB Typing (DPMT) method to classify gamma-proteobacterial plasmids in clinical and environmental settings. PLoS One 7:e40438 Carattoli A, Bertini A, Villa L, Falbo V, Hopkins KL, Threlfall EJ (2005) Identification of plasmids by PCR-based replicon typing. J Microbiol Methods 63:219–228

Coelho A, Piedra-Carrasco N, Bartolome R, Quintero-Zarate JN, Larrosa N, Cornejo-Sanchez T, Prats G, Garcillan-Barcia MP, de la Cruz F, Gonzalez-Lopez JJ (2012) Role of IncHI2 plasmids harbouring blaVIM-1, blaCTX-M-9, aac(6′)-Ib and qnrA genes in the spread of multiresistant Enterobacter cloacae and Klebsiella pneumoniae strains in different units at Hospital Vall d’Hebron, Barcelona, Spain. Int J Antimicrob Agents 39:514–517 de la Cruz F, Davies J (2000) Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol 8:128–133 Garcillan-Barcia MP, Francia MV, de la Cruz F (2009) The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol Rev 33:657–687 Garcillan-Barcia MP, Alvarado A, de la Cruz F (2011) Identification of bacterial plasmids based on mobility and plasmid population biology. FEMS Microbiol Rev 35:936–956 Guglielmini J, Quintais L, Garcillan-Barcia MP, de la Cruz F, Rocha EP (2011) The repertoire of ICE in prokaryotes underscores the unity, diversity, and ubiquity of conjugation. PLoS Genet 7:e1002222 Guglielmini J, de la Cruz F, Rocha EP (2013) Evolution of conjugation and type IV secretion systems. Mol Biol Evol 30:315–331 Guglielmini J, Neron B, Abby SS, Garcillan-Barcia MP, de la Cruz F, Rocha EPC (2014) Key components of the eight classes of type IV secretion systems involved in bacterial conjugation or protein secretion. Nucleic Acids Res. https://doi.org/10.1093/nar/gku194 Halary S, Leigh JW, Cheaib B, Lopez P, Bapteste E (2010) Network analyses structure genetic diversity in independent genetic worlds. Proc Natl Acad Sci U S A 107:127–132 Mata C, Miro E, Mirelis B, Garcillan-Barcia MP, de la Cruz F, Coll P, Navarro F (2010) In vivo transmission of a plasmid coharbouring bla and qnrB genes between Escherichia coli and Serratia marcescens.
FEMS Microbiol Lett 308:24–28 Mata C, Miro E, Alvarado A, Garcillan-Barcia MP, Toleman M, Walsh TR, de la Cruz F, Navarro F (2012) Plasmid typing and genetic context of AmpC beta-lactamases in Enterobacteriaceae lacking inducible chromosomal ampC genes: findings from a Spanish hospital 1999–2007. J Antimicrob Chemother 67:115–122 Slater FR, Bailey MJ, Tett AJ, Turner SL (2008) Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol 66:3–13 Smillie C, Garcillan-Barcia MP, Francia MV, Rocha EP, de la Cruz F (2010) Mobility of plasmids. Microbiol Mol Biol Rev 74:434–452 Valverde A, Canton R, Garcillan-Barcia MP, Novais A, Galan JC, Alvarado A, de la Cruz F, Baquero F, Coque TM (2009) Spread of bla(CTX-M-14) is driven mainly by IncK plasmids disseminated among Escherichia coli phylogroups A, B1, and D in Spain. Antimicrob Agents Chemother 53:5204–5212


Conservative Site-Specific Recombination
Adam R. Parks (1) and Joseph E. Peters (2)
(1) Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
(2) Department of Microbiology, Cornell University, Ithaca, NY, USA

Synonyms Site-specific recombination

Synopsis Conservative site-specific recombination is a process that enables genetic recombination between DNA molecules that contain short DNA sequences, which are bound by specific recombinase proteins. These recombinase proteins belong to one of two families of proteins, the tyrosine recombinases and the serine recombinases. Both recombinase families form covalent bonds between the enzyme and the DNA backbone. Following isomerization, the DNA backbone is rejoined in a process that does not require ATP or metal cofactors. The outcomes of site-specific recombination can include DNA integration, deletion, or inversion. Site-specific recombinases perform a variety of biological functions including integration of bacteriophage DNA into a host genome, resolution of dimer DNA molecules, recombination of antibiotic resistance gene cassettes, and alteration of gene expression. Site-specific recombinases are also actively used as tools in the biotech industry.

Introduction Conservative site-specific recombination is a process that enables genetic recombination between DNA molecules that both contain short (~20–30 bp) DNA sequences that are recognized by recombinase proteins. The process is considered “conservative” because there is no net gain or loss of sequence information during the recombination process (i.e., no target-site duplication as in DDE-type transposition). This mechanism also does not require any high-energy cofactors, such as ATP (Grindley et al. 2006). The proteins that conduct this activity belong to one of two large and unrelated families of proteins, the tyrosine recombinases and the serine recombinases. These recombinase families are named after the conserved amino acid residue within the protein that forms a covalent bond with the DNA backbone of the element or the DNA target. Site-specific recombinases can perform three functions:

• Integration of one DNA molecule into another (Fig. 1)
• Resolution, or separation, of two or more conjoined DNA molecules (Fig. 2a)
• Inversion of a particular DNA sequence (Fig. 2b)

These activities depend on the orientation and arrangement of the recombinase binding sites, the configurations of recombinase proteins, topology of the DNA molecules involved, and sometimes the activities of accessory proteins that can be either encoded by the host organism or the mobile genetic element (Craig 2002). In its simplest form, the process of site-specific recombination follows a general overall path (Grindley et al. 2006):

1. A dimer of recombinase proteins binds to specific DNA sequences on each DNA molecule.
2. The dimers bound to each molecule pair, forming a tetramer, and bring the two DNA recombination sites together, forming a synaptic complex.
3. DNA strands are cleaved.
4. DNA strands are repaired with partner DNAs.
5. The synaptic complex dissociates, leaving behind the newly recombined DNA.


attP

attB Int IHF Xis attL

Int IHF attR

Conservative Site-Specific Recombination, Fig. 1 Integration and excision of l genome involve site-specific recombination. A closed circular double-stranded DNA copy of the l bacteriophage genome (red) containing the attP sequence is bound by Int protein. Int also binds to the attB locus within the bacterial chromosome (black). IHF mediates DNA bending necessary for synapsis between the two DNA molecules. Recombination between the attP and

attB sites, which mediates recombination between the donor and target DNA molecules, results in the integration of the l genome within the bacterial chromosome. A reversal of this process enables the removal of the l genome in a process that also requires the Xis protein to ensure synapsis and drive the reaction in the direction of excision

Conservative Site-Specific Recombination, Fig. 2 Orientation of site-specific recombinase binding sites dictates inversion versus deletion. (a) Recombinase binding sites oriented in the same direction mediate the excision of the intervening DNA sequence. (b) Recombinase binding sites that are oriented in opposing directions mediate the inversion of the intervening DNA sequence

Site-Specific Recombinase Proteins

One of the distinctive characteristics of site-specific recombination is the formation of a covalent bond between DNA and protein. The two families of site-specific recombinases, the tyrosine and serine recombinases, appear to have evolved independently of one another, since they do not share sequence or structural similarity (Table 1; Yang 2010). Both do, however, use similar biochemical strategies to achieve recombination between identical, or nearly identical, DNA sites. Both serine and tyrosine recombinases use the OH group of the catalytic amino acid residue as a nucleophile to attack the sugar-phosphate backbone (Fig. 3). Both surround the scissile phosphate with positively charged amino acid residues and use general acid/base catalysis mechanisms. A notable difference is the location of the protein-DNA bond (Grindley et al. 2006). Tyrosine recombinases displace the 5′ bridging O, forming a 3′ phosphotyrosyl bond. Serine recombinases displace the 3′ bridging O, forming a 5′ phosphoserine bond. These mechanistic differences are also reflected in their domesticated eukaryotic counterparts. Tyrosine recombinases are related to type IB topoisomerases and telomere resolvases, whereas serine recombinases are related to type II topoisomerases (Yang 2010). Topoisomerases, however, do not require specific DNA binding sites. In both cases, no metal cofactors are required for recombinase activity; however, γδ requires Mg2+ for strand joining.

The recombination sites on the DNA are composed of two binding sites for the recombinase proteins, one for each strand of DNA (Grindley et al. 2006). The recombinases work as tetramers, with one monomer acting on each strand of donor DNA or target DNA. Serine and tyrosine recombinases differ in the timing of nucleophilic attack on the DNA backbone. For serine recombinases, all four monomers are active at once, creating a double-strand DNA break intermediate (Fig. 4). For tyrosine recombinases, only two monomers are active at once, creating a Holliday junction intermediate that is resolved by the activity of the other two tyrosine recombinase monomers. In both cases, conformational change of the protein-DNA complexes places the DNA-protein bond from the donor DNA molecule opposite the sequence in the target DNA molecule, and a phosphodiester bond is formed between donor DNA and target DNA.

Conservative Site-Specific Recombination, Table 1 Examples of conservative site-specific recombinases
Tyrosine recombinase/DNA substrate: Phage λ: Int/att; Phage P1: Cre/lox; Escherichia coli: XerCD/diff; Yeast 2 μm plasmid: Flp/frt
Serine recombinase/DNA substrate: Tn3: TnpR/res; Salmonella typhimurium: Hin/hix; Plasmid RP4: ParA/MRS; Phage Mu: Gin/gix

Conservative Site-Specific Recombination, Fig. 3 Tyrosine and serine recombinases share similar mechanisms but evolved independently. Amino acid residues participate in the reaction as general acids and bases. Serine recombinases leave a free 3′ OH at the break site. Tyrosine recombinases leave a free 5′ OH. The reverse reaction restores the phosphodiester backbone of the DNA. (Grindley et al. 2006)

Conservative Site-Specific Recombination, Fig. 4 Comparison of serine and tyrosine recombinase mechanisms: (a) both serine and tyrosine recombinases bind to specific half sites on the DNA. They both bind initially as dimers at each recombination site on a given DNA substrate, and those dimers further dimerize to make tetramers. (b) The recombinases are not activated until all subunits of the tetramer are present and bound to DNA substrates. Serine recombinases are all activated at once, forming a double-stranded DNA break that is held together by the protein-protein interactions of the recombinases. Serine recombinases leave a free 3′ OH at the break site. Tyrosine recombinases activate only a pair of recombinases at a time, nicking both DNA substrates and leaving a free 5′ OH. (c) Serine recombinases undergo a large conformational change, rotating dimer pairs 180° with respect to one another, aligning the new DNA ends in preparation for joining. Tyrosine recombinases form a Holliday junction intermediate. Once the first set of recombinases has rejoined the new DNA ends, the second set of recombinases is activated. (d) In both cases, DNA is recombined without any net addition or deletion of DNA sequence

Tyrosine Recombinases

The tyrosine recombinase family is sometimes referred to as the λ integrase family because much of what is known about this family was initially observed in the bacteriophage λ integrase protein, which mediates the integration of the λ genome into a specific site of the E. coli chromosome (Fig. 1) (Grindley et al. 2006). The tyrosine recombinases rely on a conserved amino acid motif, the RHR triad (arginine, histidine-XX-arginine). The arginines coordinate the oxygen atoms of the scissile phosphate in the DNA backbone to stabilize the transition state of the transesterification reaction. The histidine is involved in general acid/base chemistry, promoting the use of the tyrosine residue in nucleophilic attack on the phosphate during protein-DNA bond formation and supplying a proton to the tyrosine to promote its role as a leaving group when a DNA-DNA bond is restored (Fig. 3). Energy is stored in the protein-DNA bond; therefore, the overall reaction does not require the addition of any high-energy cofactors, such as ATP.

Serine Recombinases

The serine recombinase family (sometimes called the invertase/resolvase family) includes the γδ recombinase and the Tn3 resolvase. The Sin resolvase has also been extensively studied. There is a great diversity of serine-type recombinases, and they vary widely in size, from 100 amino acids to over 800 amino acids (Van Duyne and Rutherford 2013). The diversity of these recombinases also extends to the complexity of their recognition sites and the intricacy of the protein-DNA complexes that carry out recombination reactions. Although the two families are not evolutionarily related, the biochemical steps in recombination via serine recombinases share many similarities with those of tyrosine recombinases. Details regarding which amino acid residues activate the serine within the active center remain unclear, but recent crystal structures have begun to shed some light on these details (Yang 2010). In the case of Sin resolvase, two arginine residues near the catalytically active serine residue appear to participate as a general base in protein-DNA bond formation and as a general acid in restoration of DNA-DNA bonds (Fig. 3).

In serine recombination systems, recombinase dimers bind to recognition sites on the DNA and are brought together by topological features of the DNA (Grindley et al. 2006). This process is referred to as synapsis. For serine resolvases, such as Tn3 resolvase and γδ resolvase, the binding sites in the DNA where recombination is carried out are referred to as res sites. These res sites contain a minimum of two binding sites for the recombinase proteins, oriented head to head, but they can contain more accessory sites that are often used for regulatory purposes (Craig 2002). For serine resolvases, both synapsis of the recombination sites and activation of the recombination complex depend on the appropriate supercoiling of DNA substrates. Once strand cleavage is activated, the active serine forms a covalent bond with the DNA strand, in cis, with the same DNA to which the protein is bound. It is important that all components of the recombination reaction are present because activation of the recombinases cleaves all four DNA strands involved at once (Boocock and Rice 2013).

Strand exchange in serine recombinases depends on a large movement of protein and DNA. Topological features play a large role in driving the strand transfer reaction once the DNA strands have been broken. The recombinase subunits undergo a 180° rotation with respect to one another, articulating about an extended flat protein-protein interface. This rotation positions the free 3′ OH end generated by one recombinase subunit opposite the 5′ phosphoserine bond of its partner and allows strand exchange to occur. Watson-Crick base pairing between a 2 bp 3′ extension and the partner sequence is also important for completion of the reaction and serves as a checkpoint to ensure that the correct recombination outcome is achieved. Positive and negative patches within this protein interface are postulated to allow subunit rotations to occur in 180° increments, allowing DNA ends to return to their original position to be rejoined with the original DNA molecule if conditions that promote strand exchange are not favorable (Grindley et al. 2006).
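The consequence of rotating in 180° increments can be illustrated with a small bookkeeping sketch. It is not a structural model: it only assumes that each 180° step swaps which partner the cleaved ends face, and the function name is hypothetical.

```python
def ends_after_rotation(steps: int) -> str:
    """Report which partner the cleaved DNA ends face after `steps`
    subunit rotations of 180 degrees each (toy bookkeeping only)."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    angle = (steps * 180) % 360
    # 0 degrees: ends face their original partners, so rejoining
    # restores the parental molecules.
    # 180 degrees: ends face the exchanged partners, so rejoining
    # yields the recombinant molecules.
    return "parental" if angle == 0 else "recombinant"


for n in range(4):
    print(n, ends_after_rotation(n))
```

Under this assumption an odd number of steps gives the recombinant configuration and an even number returns the ends to the parental configuration, which is the escape route described above for unfavorable conditions.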
DNA topology is often the key determinant in dictating the outcome of recombination reactions, ensuring that the correct recombination reaction (i.e., integration, excision, or inversion) is achieved, a feature that is sometimes referred to as a topological filter. The bacteriophage ϕC31 uses a large serine recombinase to mediate phage genome insertion into and excision out of the host genome (Van Duyne and Rutherford 2013). Integration and excision are controlled by conformational changes in the DNA-bound integrase proteins. On its own, the integrase of this bacteriophage only forms the stable synapse necessary for integration, mediating recombination between the phage attP site and the bacterial attB site. Integrase also binds well to attL and attR sites; however, it adopts a different conformation that is not competent for recombination. The Xis protein of these phages is thought to bind to attL- or attR-bound integrase, switching integrase to a recombination-competent conformation. Conversely, Xis binding to attP- or attB-bound integrase prevents recombination. This mechanism is in contrast with that of the tyrosine recombinase system of phage lambda, which regulates excision and insertion at the level of DNA conformation rather than protein conformation.
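The directionality rules described above for the ϕC31 integrase amount to a small truth table. The sketch below encodes only what the text states; the function and constant names are hypothetical, and treating all other site pairings as inactive is an assumption made here for completeness.

```python
INTEGRATION_PAIR = frozenset({"attP", "attB"})
EXCISION_PAIR = frozenset({"attL", "attR"})


def is_recombination_competent(site_a: str, site_b: str, xis_bound: bool) -> bool:
    """Return True if the integrase-bound site pair is expected to recombine."""
    pair = frozenset({site_a, site_b})
    if pair == INTEGRATION_PAIR:
        # Integrase alone supports attP x attB; Xis blocks it.
        return not xis_bound
    if pair == EXCISION_PAIR:
        # attL x attR requires Xis to switch integrase conformation.
        return xis_bound
    # Assumption: pairings not discussed in the text are treated as inactive.
    return False


for xis in (False, True):
    print("Xis bound:", xis,
          "| attP x attB:", is_recombination_competent("attP", "attB", xis),
          "| attL x attR:", is_recombination_competent("attL", "attR", xis))
```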

Recombinase Binding Sites

The sequences that are used for recombination events must be conserved sites, and recombination between these sites does not result in the addition or deletion of any nucleotides, in contrast to DDE-type transposases, for example, where target-site duplications occur. In many cases, within bacteria, tRNA genes may be used as recombination sites (Hacker and Kaper 2000). This feature enables some bacterial plasmids and some bacteriophage genomes to exploit the presence of a conserved gene within the host genome as a site of integration. Since tRNA genes are commonly used by horizontally transferred DNA as sites of integration, a convenient way to look for mobile DNA within a bacterial chromosome is to look for regions of DNA that are flanked on either side by duplications of the same tRNA gene.

The orientation and location of DNA binding sequences have important implications for the outcome of recombination. When binding sites are on two different DNA molecules, the recombination process can lead to the integration of one DNA molecule into the other. When binding sites are on the same DNA molecule, it is the orientation of the binding sites that dictates the outcome of recombination; directly repeated recombination sites lead to excision of intervening DNA, and inverted recombination sites lead to the inversion of DNA. For biotechnological applications, the orientation of recombination sites has long been used to direct activities within cells; for example, the Cre/lox system of bacteriophage P1 can be used to delete genes by placing directly repeated recombination sites (lox sites) on either side of the gene or to turn them off or on based on the inversion of DNA sequence between two inverted lox sites (Ghosh et al. 2005).

The integration system of bacteriophage lambda has been studied extensively and has served as a model system for understanding site-specific recombination mediated by tyrosine recombinases. This system enables integration of the covalently closed circular double-stranded DNA genome into the E. coli chromosome. The sites that mediate recombination are composed of inverted repeats separated by a 6–8 bp spacer (Grindley et al. 2006). The inverted sequences each bind a recombinase protein, bringing two together to make a dimer. This process occurs both on the bacteriophage genome, at a site called attP, and on the E. coli chromosome, at a site called attB. The dimers bound to attP bind to the dimers bound to attB, forming a tetrameric complex, and bring the two DNA molecules together. Typically strands are not cleaved until a synaptic complex is formed; however, some conditions may allow strand cleavage earlier. These events are typically reversed quickly, preventing DNA damage (Ma et al. 2009). Cleavage happens at the 5′ boundary of the spacer between the inverted repeats. Unlike serine recombinases, only two alternating subunits of the synaptic complex cleave and rejoin at a time (Tropp 2012). The result is the formation of a Holliday junction structure that is resolved by a repeat of the same process by the activity of the remaining pair of recombinase subunits. Isomerization of the complex repositions the inactive monomers to an active conformation. This stepwise recombination mechanism prevents the formation of double-strand DNA breaks that can be mutagenic and also prevents other more exotic by-products, such as DNA hairpins and three-way junctions.
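The tRNA-flanking heuristic mentioned earlier in this section can be sketched as a scan over annotated tRNA genes. This is a minimal illustration, assuming the annotations are already available as (name, start, end) records; the `Feature` type, the function name, the coordinates, and the `max_span` cutoff are all hypothetical. Because bacterial genomes can legitimately carry several copies of a tRNA gene, hits are only candidates for closer inspection.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    name: str   # e.g. "tRNA-Leu" (hypothetical annotation label)
    start: int
    end: int


def candidate_islands(trna_features, max_span=200_000):
    """Return (tRNA name, region start, region end) for regions bounded
    by two copies of the same tRNA gene no more than max_span bp apart."""
    by_name = defaultdict(list)
    for feature in trna_features:
        by_name[feature.name].append(feature)
    hits = []
    for name, copies in by_name.items():
        copies.sort(key=lambda f: f.start)
        # Compare each copy with the next copy of the same gene.
        for left, right in zip(copies, copies[1:]):
            span = right.start - left.end
            if 0 < span <= max_span:
                hits.append((name, left.end, right.start))
    return sorted(hits, key=lambda hit: hit[1])


# Made-up coordinates, for illustration only.
annotations = [
    Feature("tRNA-Leu", 1_000, 1_085),
    Feature("tRNA-Leu", 41_000, 41_045),
    Feature("tRNA-Gly", 90_000, 90_075),
]
print(candidate_islands(annotations))
```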

The sequence, orientation, and location of spacer regions between the recombinase binding sites can influence the outcome of recombination, promoting inversion, deletion, or insertion. Spacer sequence can affect the flexibility of the DNA and impose a bias in the side of the spacer that is nicked. This attribute of the spacer region can also be harnessed by accessory proteins, such as the Xis protein of λ, that shift the outcome of recombination reactions in the direction of excision, rather than integration.
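The orientation rules in this section (direct repeats excise, inverted repeats invert) can be demonstrated on a toy sequence. The sketch below uses a made-up 8 bp asymmetric site, `TOY_SITE`, rather than a real lox or att sequence, and treats recombination as a simple string operation on one strand; all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
TOY_SITE = "GATAGATC"  # hypothetical asymmetric site, not a real lox site


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq: str, site: str):
    """Return sorted (position, strand) pairs for every occurrence of site."""
    hits = []
    for strand, motif in (("+", site), ("-", revcomp(site))):
        pos = seq.find(motif)
        while pos != -1:
            hits.append((pos, strand))
            pos = seq.find(motif, pos + 1)
    return sorted(hits)


def recombine(seq: str, site: str = TOY_SITE):
    """Recombine two sites on one molecule.

    Returns (outcome, product, excised_circle_or_None)."""
    hits = find_sites(seq, site)
    if len(hits) != 2:
        raise ValueError("this sketch handles exactly two sites")
    (p1, strand1), (p2, strand2) = hits
    if strand1 == strand2:
        # Direct repeats: the intervening DNA leaves as a circle that
        # carries one site; one site remains in the product.
        return "excision", seq[:p1] + seq[p2:], seq[p1:p2]
    # Inverted repeats: the DNA between the two sites is inverted.
    inner_start = p1 + len(site)
    product = seq[:inner_start] + revcomp(seq[inner_start:p2]) + seq[p2:]
    return "inversion", product, None


direct = "TTT" + TOY_SITE + "CCCCC" + TOY_SITE + "AAA"
inverted = "TTT" + TOY_SITE + "CCAAA" + revcomp(TOY_SITE) + "GGG"
print(recombine(direct))
print(recombine(inverted))
```

Applying `recombine` a second time to the inversion product restores the starting sequence, which mirrors the reversible on/off switching described for inverted lox sites.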

Accessory Proteins

For the chemical transactions of recombination to take place, it is important that the DNA binding sites, bound by the recombinase proteins, come into contact with one another. This requirement is typically accomplished by the activity of host-encoded proteins that shape the secondary structure and topology of double-stranded DNA molecules. The mobile element itself may encode additional accessory proteins that fill this role. An example is the Xis protein of bacteriophage λ, which is involved in forming DNA structures required for excision of a λ genome that has been integrated into an E. coli chromosome. The recombinase proteins work in concert with host proteins that bend DNA at certain sites, bridge longer stretches of DNA together, or twist intervening DNA such that the recombination sites that must interact will be brought closer together. Host proteins that accomplish these tasks can include H-NS, IHF, Fis, and HU (Grindley et al. 2006). The requirement for contact between DNA sites for efficient recombination to occur has been used to map regions of chromosomes into higher orders of organization that are isolated from one another (macrodomains in bacteria), regardless of the fact that they reside on the same DNA molecule.

Integron Cassette Systems

Integrons are a versatile and flexible recombination system that mediates acquisition of a wide variety of antibiotic resistance genes, mainly in Proteobacteria (gram-negative bacteria) but also in some Firmicutes (gram-positive bacteria) (Cambray et al. 2010). The integrase protein (IntI) that mediates recombination is a tyrosine recombinase, most closely related to the Xer protein that resolves chromosome dimers (Boyd et al. 2009). Integron systems are composed of an integrase-encoding open reading frame (intI), which is adjacent to one or more functional genes, each bounded by attachment sites (attC) that are recognized by the integrase protein (Fig. 5). IntI binds to the attachment sites and carries out the excision and integration steps required for mobilization of gene cassettes. Some integron systems encode many more than a single gene cassette. The so-called super integrons can comprise as much as 3% of an entire genome, containing over 100 gene cassettes. attC sites that flank the gene cassettes are highly recombinogenic with a special attachment site, attI, that is found between the gene cassettes and the integrase gene. Typically, newly acquired cassettes are inserted into the attI site, enabling the expansion of the collection of gene cassettes. The integrase gene is generally transcribed by a dedicated promoter, Pint, that is located within the attI site and transcribes away from the gene cassettes. The gene cassettes are transcribed by a separate promoter, Pc, that is typically found within the coding sequence of the integrase gene. The gene cassettes nearest to the promoter are transcribed more efficiently. In very large gene cassette arrays, site-specific recombination events that move later genes closer to Pc can enhance production of those gene products; therefore, recombination of gene cassettes enables tunable regulation. In some instances the IntI coding sequence is interrupted by a highly conserved stop codon, presumably as a control mechanism to restrict the frequency of recombination.
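The relationship between cassette position and expression can be made concrete with a toy model. The sketch below assumes, purely for illustration, that expression falls by a fixed factor with each position away from Pc; the `decay` value, the function names, and the cassette name `orfX` are hypothetical, while SpecR and KanR are the cassette labels used in Fig. 5.

```python
def relative_expression(array, decay=0.5):
    """Map each cassette to a relative expression level, assuming each
    step away from the Pc promoter multiplies expression by `decay`."""
    return {name: decay ** index for index, name in enumerate(array)}


def reinsert_at_attI(array, cassette):
    """Excise `cassette` and reinsert it at attI (the first position)."""
    if cassette not in array:
        raise ValueError(f"{cassette!r} is not in the array")
    remaining = [name for name in array if name != cassette]
    return [cassette] + remaining


# Cassettes listed in order of increasing distance from attI/Pc.
array = ["KanR", "SpecR", "orfX"]
print(relative_expression(array))

shuffled = reinsert_at_attI(array, "orfX")
print(shuffled, relative_expression(shuffled))
```

In this toy array, moving `orfX` from the third position to attI raises its relative expression from 0.25 to 1.0 while lowering that of the cassettes it displaces.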
Conservative Site-Specific Recombination, Fig. 5 Integron cassette systems are site-specific recombination systems composed of a gene (blue arrow) encoding a tyrosine recombinase (blue circles) and gene “cassettes” (red arrows) that are flanked on either side by recombinase binding sites (light blue boxes). (a) The attI recombination site is close to the 5′ end of the integrase gene, whereas the attC sites flank each cassette. The integrase gene is expressed from the Pint promoter. The gene cassettes are typically expressed from a single promoter, Pc, which is located either within the integrase gene or within the attI site. (b) Antibiotic resistance cassettes (SpecR and KanR) are examples of gene cassettes that may be part of an integron cassette system. The recombinase proteins may excise a gene cassette from a donor DNA molecule, forming a circular single-stranded DNA intermediate, which is then recombined into the recipient integron cassette system on a recipient DNA molecule at a site where there is a binding site for the integrase. (c) DNA replication restores the original copy of the element and synthesizes the second strand within the recipient DNA molecule. The final result is the addition of the new gene cassette to the integron cassette system on the recipient DNA molecule. Systems such as this can have many gene cassettes (>100 within an individual integron cassette system reported in some Vibrio cholerae strains)

The intI gene is controlled by an SOS-inducible promoter, which stimulates recombination activity under conditions that cause cellular stress. Some antibiotics have been shown to induce the SOS response, and integron gene cassettes often carry antibiotic resistance genes (Cambray et al. 2011). The SOS-inducible feature of integrons supports an adaptive response that promotes recombination that could lead to increased production of antibiotic resistance determinants when they are most needed. Interestingly, one of the key triggers for the SOS response is the presence of single-stranded DNA intermediates.

Single-stranded DNA plays an important role in the mobilization process of integron cassettes (Cambray et al. 2010). Recognition of recombination sites depends to a large extent on the secondary structure of the attC recombination sites. These secondary structures are most prevalent when gene cassettes are transiently single stranded, such as during conjugation of a plasmid, during DNA replication, or following DNA damage. Single-stranded DNA intermediates are far more (1,000-fold) recombinogenic than double-stranded DNA; however, there is some evidence that double-stranded att sites can extrude into cruciform structures that can be recognized by the integrase enzyme. The recombination sequences (attC and attI) are poorly conserved with respect to sequence; however, the secondary structures that they form are well conserved. This feature seems to enable a certain degree of cross talk between distantly related integron cassette systems. The secondary DNA structures form a hairpin double-strand helix that is interrupted by an unconserved spacer sequence that causes a bulge in the secondary structure, along with three extrahelical bases (EHB) that form specific interactions with the integrase. The integrase binds to this secondary structure (mimicking dsDNA) and mediates recombination events that join the single-stranded loop to a single strand of the target DNA molecule. This mobilization mechanism requires DNA replication to resolve the recombination intermediate.

Integrons carried within transposable elements or on plasmids, which combine the mobility of these DNA elements with the versatility of the gene cassette system, are termed mobile integrons (MI). Stationary integrons are termed chromosomal integrons (CI). Integrons possess the programmed ability to access a large collection of selectable functional genes that are flanked by attC sequences, and mobilization of the integrons enables them to “sample” a variety of genetic sources for these cassettes.

In some cases, the gene cassettes encode addiction modules, such as toxin/antitoxin pairs that prevent their loss from the genome. These systems typically encode a very stable toxin protein, which hinders growth of the host cell when not accompanied by an antidote protein (antitoxin). Antitoxin proteins are usually very unstable and require constant production to prevent the activity of the toxin. This feature ensures that the antitoxin gene cannot be lost without cost to the host cell. In contrast with other gene cassettes, toxin/antitoxin modules often carry their own promoters and do not rely on the Pc promoter for production.
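The logic of an addiction module can be illustrated with a simple decay calculation. The sketch below assumes that synthesis of both proteins stops when the module is lost, that each protein then decays exponentially, and that antitoxin neutralizes toxin one-to-one; the half-lives, starting amounts, and function names are hypothetical values chosen only for illustration.

```python
import math


def remaining(initial: float, half_life_min: float, t_min: float) -> float:
    """Amount of protein left after t_min minutes of exponential decay."""
    return initial * 0.5 ** (t_min / half_life_min)


def time_until_free_toxin(toxin0: float, antitoxin0: float,
                          toxin_half_life: float,
                          antitoxin_half_life: float) -> float:
    """Minutes after loss of the module until toxin exceeds antitoxin."""
    if antitoxin_half_life >= toxin_half_life:
        raise ValueError("sketch assumes the antitoxin is the less stable protein")
    if antitoxin0 <= toxin0:
        return 0.0
    # Solve toxin0 * 2**(-t/hT) == antitoxin0 * 2**(-t/hA) for t.
    rate_gap = 1.0 / antitoxin_half_life - 1.0 / toxin_half_life
    return math.log2(antitoxin0 / toxin0) / rate_gap


# Hypothetical numbers: twofold antitoxin excess, 15 min vs 120 min half-lives.
t = time_until_free_toxin(1.0, 2.0, toxin_half_life=120.0, antitoxin_half_life=15.0)
print(f"free toxin appears after about {t:.1f} min")
```

With these made-up parameters the antitoxin excess is exhausted in well under one toxin half-life, which is why loss of the module is costly to the cell.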

Cross-References

▶ DNA Recombination, Mechanisms of

References

Boocock MR, Rice PA (2013) A proposed mechanism for IS607-family serine transposases. Mob DNA 4:24
Boyd EF, Almagro-Moreno S, Parent MA (2009) Genomic islands are dynamic, ancient integrative elements in bacterial evolution. Trends Microbiol 17:47–53
Cambray G, Guerout AM, Mazel D (2010) Integrons. Annu Rev Genet 44:141–166
Cambray G, Sanchez-Alberola N, Campoy S, Guerin E, Da Re S, Gonzalez-Zorn B, Ploy MC, Barbe J, Mazel D, Erill I (2011) Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons. Mob DNA 2:6
Craig NL (2002) Mobile DNA II. ASM Press, Washington, DC
Ghosh K, Lau CK, Gupta K, Van Duyne GD (2005) Preferential synapsis of loxP sites drives ordered strand exchange in Cre-loxP site-specific recombination. Nat Chem Biol 1:275–282
Grindley ND, Whiteson KL, Rice PA (2006) Mechanisms of site-specific recombination. Annu Rev Biochem 75:567–605
Hacker J, Kaper JB (2000) Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol 54:641–679
Ma CH, Rowley PA, Macieszak A, Guga P, Jayaram M (2009) Active site electrostatics protect genome integrity by blocking abortive hydrolysis during DNA recombination. EMBO J 28:1745–1756
Tropp BE (2012) Molecular biology: genes to proteins, 4th edn. Jones & Bartlett Learning, Sudbury
Van Duyne GD, Rutherford K (2013) Large serine recombinase domain structure and attachment site binding. Crit Rev Biochem Mol Biol 48:476–491
Yang W (2010) Topoisomerases and site-specific recombinases: similarities in structure and mechanism. Crit Rev Biochem Mol Biol 45:520–534

Control of Initiation in E. coli

Jon M. Kaguni
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

Synopsis

Studies have revealed several independent molecular mechanisms that regulate the frequency of replication initiation in E. coli (Skarstad and Katayama 2013; Nielsen and Lobner-Olesen 2008). By comparison with the biochemistry of DNA replication in eukaryotic cells, evidence suggests that all domains of life utilize comparable strategies to regulate the frequency of initiation. In E. coli, one pathway relies on the sequestration of the replication origin to occlude DnaA and other proteins from assembling at this site at an inappropriate time in the cell cycle. Sequestration occurs during the time interval that immediately follows initiation. Other biochemical pathways act by modulating either the availability of DnaA or its activity.

Introduction

DNA replication and the cell cycle. DNA replication in free-living organisms occurs only once, and at a specific time, in the cell cycle. This process can be formally divided into the stages of initiation of DNA replication, elongation of nascent DNA, and termination. Following termination, the completed chromosomes separate and then partition into daughter cells upon cell division. The step that commits a cell to duplicating its genome occurs at the initiation stage and is highly regulated. For E. coli cells growing under various standard laboratory conditions, the time required to duplicate the E. coli chromosome is relatively constant and corresponds to about 40 min (Cooper and Helmstetter 1968). In rapidly growing cells having generation times shorter than 40 min, DNA replication initiates before duplication of the chromosome is complete. In these circumstances, multiple origins are present due to overlapping rounds of replication in which initiations in an individual cell occur synchronously (Skarstad et al. 1986). The synchrony of initiation in a single cell is the result of several regulatory mechanisms that are summarized below.

In E. coli, DNA replication is characterized by initiation at a particular cell mass (the initiation mass) (Boye et al. 1996; Donachie 1968; Hill et al. 2012; Lobner-Olesen et al. 1989). As a mechanistic explanation for this phenomenon, the prevailing notion is that the initiation mass is determined by the ratio of replication origins to DnaA protein complexed to ATP, whose expression is feedback regulated by the binding of DnaA to DnaA boxes in the dnaA promoter region. The second feature of DNA replication in E. coli is that a new cycle of initiation is followed by an eclipse period when premature initiation is inhibited. A protein named SeqA acts during this eclipse period by binding to oriC to block the assembly of replication proteins that would otherwise support another round of initiation. Other separate mechanisms have been discovered that regulate the frequency of initiation. These pathways either affect the cellular availability of DnaA or control its activity by influencing the nucleotide-bound state of DnaA.
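The arithmetic behind overlapping rounds of replication can be sketched as follows. The calculation assumes one initiation event per generation time and a fixed 40 min replication time, as stated above, and it ignores the interval between termination and cell division, which the text does not specify. The function names are hypothetical, and the origin count is per replicating chromosome rather than per cell.

```python
def unfinished_rounds_at_initiation(generation_time_min: int,
                                    replication_time_min: int = 40) -> int:
    """Count earlier replication rounds still in progress when a new
    round starts, assuming one initiation per generation time."""
    if generation_time_min <= 0 or replication_time_min <= 0:
        raise ValueError("times must be positive")
    # Earlier rounds began tau, 2*tau, ... minutes ago; a round is still
    # unfinished if it began less than replication_time_min minutes ago.
    return (replication_time_min - 1) // generation_time_min


def origins_initiating_together(generation_time_min: int,
                                replication_time_min: int = 40) -> int:
    """Origins on one replicating chromosome that fire synchronously.
    Each unfinished round has already doubled the number of origins."""
    rounds = unfinished_rounds_at_initiation(generation_time_min,
                                             replication_time_min)
    return 2 ** rounds


for tau in (60, 40, 30, 20, 15):
    print(tau, unfinished_rounds_at_initiation(tau),
          origins_initiating_together(tau))
```

Under these assumptions a 60 min generation time gives a single origin at initiation, whereas generation times shorter than 40 min give two or more origins that must initiate together.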
DnaA Autoregulates Expression of the dnaA-dnaN-recF Operon

Autoregulation of dnaA. The dnaA gene is the first in an operon that also contains dnaN, encoding the β subunit of DNA polymerase III holoenzyme, and recF, whose product is part of the RecFOR complex involved in repairing single-strand gaps in DNA (Courcelle 2005; Cox 2007; Handa et al. 2009; Hiom 2009). The gyrB gene, encoding one of the subunits of DNA gyrase, is downstream from recF (see ▶ “DNA Topology and Topoisomerases” by Ketron and Osheroff in this volume). DnaA complexed to ATP (DnaA-ATP) autoregulates its own expression by acting as a repressor for the two promoters (p1 and p2) of the dnaA gene (Atlung et al. 1985; Braun et al. 1985; Kucherer et al. 1986; Wang and Kaguni 1987). In vivo, overproduction of DnaA represses dnaA transcription, whereas expression is elevated in a dnaA null strain (Atlung et al. 1985; Braun et al. 1985; Smith et al. 1997). Repression is dependent on binding of DnaA to the two DnaA boxes located between the promoters where DnaA self-oligomerizes to occlude RNA polymerase (Atlung et al. 1985; Braun et al. 1985; Speck et al. 1999; Lee et al. 1997). One of these DnaA boxes carries a mismatch from the DnaA box sequence (TTATCCACA) of R1 and R4 found within oriC. Consistent with these observations, several studies support the conclusion that dnaA expression as the cell grows leads to a critical DnaA level that induces initiation at the proper time in the cell cycle (Hansen et al. 1987, 1991). Indeed, the increased expression of dnaA controlled by a regulated promoter induces initiation at an earlier time in the cell cycle (Atlung et al. 1987; Lobner-Olesen et al. 1989). Thus, the concentration of DnaA should be invariant with respect to growth rate and the ratio of DnaA to oriC should not vary appreciably. In support, experiments show that the ratio of DnaA to chromosomal origins per cell is relatively constant with respect to growth rate (Hansen et al. 1991; Herrick et al. 1996).
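The stabilizing effect of the autorepression described above can be illustrated with a generic negative-feedback model. The sketch below is not a model of the dnaA promoters: it assumes a standard Hill-type repression term and first-order loss, and every parameter value and name is hypothetical.

```python
def steady_state_level(beta: float, K: float = 1.0, n: int = 2,
                       delta: float = 0.05, autoregulated: bool = True,
                       dt: float = 0.1, steps: int = 5000) -> float:
    """Integrate d[A]/dt = beta * f(A) - delta * A with Euler steps and
    return the final level. f(A) = 1 / (1 + (A/K)**n) when the protein
    represses its own promoter, and f(A) = 1 otherwise."""
    level = 0.0
    for _ in range(steps):
        repression = 1.0 / (1.0 + (level / K) ** n) if autoregulated else 1.0
        level += dt * (beta * repression - delta * level)
    return level


for regulated in (True, False):
    low = steady_state_level(0.1, autoregulated=regulated)
    high = steady_state_level(0.2, autoregulated=regulated)
    print("autoregulated" if regulated else "unregulated",
          f"level at beta=0.1: {low:.3f}",
          f"fold change when beta doubles: {high / low:.2f}")
```

In this toy model, doubling the promoter strength doubles the unregulated steady-state level but raises the autoregulated level by less than 1.5-fold, showing how feedback can hold a protein concentration close to a set point.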
To summarize, the autoregulation of dnaA expression suggests that the activity of DnaA or its abundance is normally limiting and that the growth-dependent accumulation of DnaA leads to a new cycle of initiation.

Proteins that influence dnaA expression. Other proteins have been identified that affect expression of the dnaA gene. One is ArgP (IciA), a transcriptional regulator that belongs to the LysR-type group of transcription factors. It was discovered by its inhibitory effect on in vitro DNA replication of plasmids carrying oriC, in which ArgP antagonizes the unwinding of oriC by DnaA by binding to a region near the left boundary of oriC (Hwang and Kornberg 1990, 1992). Bound to the dnaA promoter region, ArgP enhances the binding of RNA polymerase to the p1 promoter (Lee et al. 1997). Another is Fis, a protein that was originally identified by its involvement in site-specific DNA inversion in Salmonella typhimurium and phage Mu (Finkel and Johnson 1992; Travers et al. 2001). Fis binds to the −35 region of the dnaA p2 promoter and flanking DNA (Froelich et al. 1996). The observation of an elevated DnaA level in a fis null mutant suggests that Fis normally represses dnaA expression.

The third protein is SeqA. E. coli and related bacteria normally contain methylated adenine residues at GATC sequences in their DNA. The enzyme responsible for this modification is DNA adenine methylase, which plays an indirect role in methyl-directed mismatch repair. Summarized briefly here, DNA replication occasionally results in the misincorporation of a nucleotide. The repair machinery relies on the unmethylated state of the newly made progeny strand to distinguish it from the complementary parental DNA strand that is methylated. Repair enzymes then remove the mispaired nucleotide contained in the unmethylated progeny strand and replace it with the properly base-paired nucleotide by a gap-filling mechanism before the progeny strand is methylated. The dnaA promoter region contains GATC sequences that become hemi-methylated. SeqA specifically recognizes and binds to hemi-methylated GATC sites, such as those in the dnaA promoter region when this region of the chromosome is duplicated, and has been visualized at replication forks in vivo (Brendler et al. 2000; Hiraga et al. 1998; Waldminghaus et al. 2012; Yamazoe et al. 2005).
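How a GATC site passes through the hemi-methylated state can be shown with a minimal state model. The sketch tracks only whether each strand of a site carries a methyl group; the class and function names are hypothetical, and the timing of remethylation is not modeled.

```python
from typing import NamedTuple


class GatcSite(NamedTuple):
    top_methylated: bool
    bottom_methylated: bool

    @property
    def state(self) -> str:
        count = int(self.top_methylated) + int(self.bottom_methylated)
        return ("unmethylated", "hemi-methylated", "fully methylated")[count]


def replicate(site: GatcSite):
    """Each daughter duplex keeps one parental strand and gains one
    newly synthesized strand, which starts out unmethylated."""
    return (GatcSite(site.top_methylated, False),
            GatcSite(False, site.bottom_methylated))


def dam_methylate(site: GatcSite) -> GatcSite:
    """DNA adenine methylase action: both strands end up methylated."""
    return GatcSite(True, True)


parent = GatcSite(True, True)
daughters = replicate(parent)
print(parent.state, "->", [d.state for d in daughters])
print("after methylation:", [dam_methylate(d).state for d in daughters])
```

The window between `replicate` and `dam_methylate` corresponds to the period in which a site is hemi-methylated and can be recognized by SeqA.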
Experiments with synchronized cells show that decreased transcription from the dnaA promoters correlates with the hemi-methylated state of this DNA that is presumably bound by SeqA (Campbell and Kleckner 1990). In agreement, a twofold increase
in DnaA or its transcript was measured in seqA mutants compared with a wild-type strain (Bogan and Helmstetter 1997; Charbon et al. 2011; von Freiesleben et al. 1994). Together, these observations support the conclusion that SeqA negatively regulates dnaA expression in the period after the dnaA promoter region has been duplicated and is hemi-methylated.

Hda also affects expression of the dnaA gene. As described below in “Regulatory Inactivation of DnaA (RIDA): Hda and the β Clamp of DNA Polymerase III Holoenzyme,” Hda together with the β clamp bound to DNA stimulates the hydrolysis of ATP bound to DnaA. Hence, mutations in hda that abolish its ability to stimulate ATP hydrolysis lead to an elevated ratio of DnaA-ATP to total nucleotide-bound DnaA (the sum of DnaA-ATP and DnaA-ADP). On the basis that DnaA complexed to ATP represses transcription of the dnaA gene, the result is a decreased cellular abundance of DnaA (Riber et al. 2006).

SeqA Occludes oriC During the Period Following Initiation

oriC contains an overabundance of GATC sequences, which become hemi-methylated when this site is duplicated. Hence, as explained above, this site is specifically recognized by SeqA, which temporarily sequesters oriC from DnaA and other proteins that would otherwise assemble here (Brendler et al. 1995; Brendler and Austin 1999; Campbell and Kleckner 1990; Lu et al. 1994; Slater et al. 1995; von Freiesleben et al. 1994). Were they to assemble, another cycle of initiation would occur too soon in the cell cycle. For the optimal binding of SeqA, pairs of hemimethylated GATC sequences within three helical turns of DNA and on the same side of the DNA helix are needed (Brendler and Austin 1999; Brendler et al. 2000; Han et al. 2003, 2004; Kang et al. 2005). Such pairs of GATC sequences are in the left half of oriC, which includes the region containing the 13-mers. Hence, SeqA bound to this hemi-methylated DNA inhibits the DnaA-dependent unwinding of oriC.
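The spacing requirement for SeqA binding described above can be illustrated with a short sketch. The Python code below is not from the cited studies: it assumes about 10.5 bp per helical turn of B-form DNA and an arbitrary 2 bp tolerance for "same side of the helix", and the example sequence is invented rather than taken from oriC. It considers only the geometry of GATC pairs and does not model methylation state.

```python
import re

BP_PER_TURN = 10.5          # assumed helical repeat of B-form DNA
MAX_TURNS = 3               # "within three helical turns" (from the text)
PHASE_TOLERANCE_BP = 2.0    # assumed tolerance for "same side of the helix"


def find_gatc_sites(seq):
    """Return 0-based start positions of every GATC in seq."""
    return [m.start() for m in re.finditer(r"(?=GATC)", seq.upper())]


def same_face(spacing_bp):
    """True if two sites spaced spacing_bp apart lie near the same helical face."""
    offset = spacing_bp % BP_PER_TURN
    return min(offset, BP_PER_TURN - offset) <= PHASE_TOLERANCE_BP


def candidate_seqa_pairs(seq):
    """List (pos1, pos2, spacing) for GATC pairs satisfying both spacing rules."""
    sites = find_gatc_sites(seq)
    max_spacing = MAX_TURNS * BP_PER_TURN
    pairs = []
    for i, first in enumerate(sites):
        for second in sites[i + 1:]:
            spacing = second - first
            if spacing > max_spacing:
                break  # sites are sorted, so later ones are even farther away
            if same_face(spacing):
                pairs.append((first, second, spacing))
    return pairs


# Invented sequence with GATC sites at positions 0, 10, and 26.
example = "GATC" + "A" * 6 + "GATC" + "A" * 12 + "GATC"
print(candidate_seqa_pairs(example))
```

In this invented example only the pair spaced 10 bp apart passes both tests; the sites 16 and 26 bp apart fall within three turns but on different faces of the helix under the assumed tolerance.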
SeqA also occludes DnaA from the R5 DnaA box and the I2 and I3 sites in oriC (Nievera et al. 2006). As hemimethylated oriC interacts with the outer
membrane, which contributes to the latent period when oriC is not available for another cycle of initiation (Campbell and Kleckner 1988; Ogden et al. 1988), it is possible that SeqA localized to the outer membrane is responsible. However, studies support the conclusion that SeqA is not at the outer membrane but is bound to the hemimethylated DNA at replication forks in log phase cells (Brendler et al. 2000; Hiraga et al. 1998; Onogi et al. 1999; Waldminghaus et al. 2012; Yamazoe et al. 2005).

As described above, SeqA also binds to the dnaA promoter region in the period after this region has been duplicated and is temporarily hemi-methylated. As the dnaA gene (83.6 min) is about 1 map minute away from oriC (84.6 min), dnaA expression should decrease shortly after each round of initiation. In confirmation of this expectation, dnaA transcription was inhibited after initiation of replication in synchronized cells (Theisen et al. 1993). Because newly synthesized DnaA should bind ATP, which is more abundant than ADP in vivo, repression of dnaA expression by SeqA should lead to a transient increase in the relative cellular abundance of DnaA-ADP. Because DnaA-ADP is relatively inactive in initiation, this process is an additional means to reduce the frequency of initiation.

In summary, SeqA inhibits reinitiation in the latency period after initiation by sequestering oriC and by repressing dnaA expression. Although the eclipse period after initiation lasts for about one-third of the generation time (Campbell and Kleckner 1990), SeqA alone is insufficient to account for it. Other factors contribute to latency to prevent reinitiation.

The datA Locus

A separate pathway relies on the datA locus, which contains four or five DnaA boxes that titrate excess DnaA (Kitagawa et al. 1998). Indirect measurements suggest that as many as 300 molecules of DnaA bind to this site.
Because deletion of datA leads to extra initiations, this locus apparently inhibits unscheduled initiations by reducing the availability of DnaA when its abundance surpasses the level that is sufficient for a new cycle of initiation. In support, extra copies of datA via
a multicopy plasmid cause the loss of minichromosomes whose cellular maintenance requires initiation from oriC, presumably by limiting the availability of DnaA, or neutralize the effect of elevated levels of DnaA that would otherwise induce extra initiations (Felczak and Kaguni 2009; Morigen et al. 2001). This effect of datA raises the question of whether other chromosomal loci act similarly to titrate excess DnaA. To address this issue, 308 sites in the E. coli chromosome have been identified that match the DnaA box consensus TT(A/T)TNCACA (where N is any nucleotide) (Roth and Messer 1998). Using domain four of DnaA bound to magnetic beads, DNAs recovered from a restriction digest of E. coli DNA were determined to carry datA, oriC, appY, narU, mutH, and gtrS, which are encoded by prophage CPS-53/KpLE1. Like datA and oriC, these DNAs should titrate excess DnaA, but they have not been tested. Among the other chromosomal sites, some reside in promoters (dnaA, rpoH, mioC, uvrB, drpA, nrdAB, guaBA, and polA) that are known to be bound by DnaA (Atlung et al. 1985; Augustin et al. 1994; Braun et al. 1985; Gon et al. 2006; Olliver et al. 2010; van den Berg et al. 1985; Wang and Kaguni 1989; Zhou and Syvanen 1990; Messer and Weigel 2003). In addition to the DNAs already described, these sites presumably also contribute to titrating excess DnaA.

DnaA-Reactivating Sequences (DARS)

Plasmids that rely on E. coli for their maintenance also carry one or more DnaA boxes in or near their replication origins (Konieczny 2003). For plasmid ColE1 and its derivatives, the replication origin contains one DnaA box that is identical to the DnaA box consensus sequence (TT(A/T)TNCACA) and two similar sequences nearby. These sites are required for DNA replication of pBR322 (a derivative of ColE1) in a pathway that depends on DnaA, but not in another that requires PriA, PriB, and PriC (Ma and Campbell 1988; Parada and Marians 1991; Seufert and Messer 1987).
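A search for exact matches to the DnaA box consensus quoted above can be sketched in a few lines of Python. The sketch reads the consensus as TT, then A or T, then TNCACA, which is an interpretation of the printed notation, and it scans both strands by also matching the reverse complement. The function name and the test sequence are invented for illustration; this is not the procedure used by Roth and Messer (1998).

```python
import re

# Consensus read as TT(A/T)TNCACA, where N is any nucleotide (assumed reading).
FORWARD = re.compile(r"(?=(TT[AT]T[ACGT]CACA))")
# Reverse complement of the consensus, for boxes on the opposite strand.
REVERSE = re.compile(r"(?=(TGTG[ACGT]A[AT]AA))")


def find_dnaa_boxes(seq):
    """Return sorted (position, strand, 9-mer) tuples for exact consensus matches."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-", m.group(1)) for m in REVERSE.finditer(seq)]
    return sorted(hits)


# Invented test sequence carrying one box on each strand.
example = "GGC" + "TTATCCACA" + "GGCC" + "TGTGGATAA" + "GG"
for position, strand, box in find_dnaa_boxes(example):
    print(position, strand, box)
```

Because the patterns require an exact match, imperfect sites such as the "similar sequences" near the ColE1 origin would not be reported; finding those would need a mismatch-tolerant search.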
Of interest, this region of DNA was discovered to stimulate the dissociation of ADP bound to DnaA. If ATP is
present in vitro, its binding to DnaA rejuvenates DnaA so that it can initiate DNA replication with an oriC-containing plasmid. Hence, this DNA is named DARS for DnaA-reactivating sequence (Fujimitsu and Katayama 2004). Its stimulation of the exchange of the nucleotide bound to DnaA resembles that of anionic phospholipids (Boeneman and Crooke 2005; Sekimizu and Kornberg 1988; Zheng et al. 2001), but pBR322 apparently does not affect the cellular level of DnaA-ATP in vivo (Kurokawa et al. 1999). Two sites in the E. coli chromosome named DARS1 and DARS2 were also identified that have a similar effect. Both contain three DnaA boxes in an orientation like that of DARS in pBR322 (Fujimitsu et al. 2009; Katayama et al. 2010). However, the activity of DARS2 but not DARS1 is stimulated by a soluble factor whose identity has not been determined (Fujimitsu et al. 2009). The biochemical mechanism whereby a DARS sequence leads to the dissociation of ADP or ATP bound to DnaA has not been determined.

Interestingly, a recent study showed that a complex of datA and IHF promotes the hydrolysis of ATP bound to DnaA in vitro (Kasho and Katayama 2013). In vivo results suggest that IHF binds to datA at the time of initiation and thus that this DNA site both sequesters excess DnaA and promotes the hydrolysis of DnaA-bound ATP to inhibit initiation at improper times in the cell cycle.

Regulatory Inactivation of DnaA (RIDA): Hda and the β Clamp of DNA Polymerase III Holoenzyme

The Hda-β clamp complex. The observations that DnaA complexed to ATP but not ADP is active in initiation and that DnaA has intrinsic ATPase activity suggest an autogenous mechanism to control the frequency of initiation (Sekimizu et al. 1987). However, the rate of ATP hydrolysis (50% in 15 min) is slow, suggesting that the ATPase activity of DnaA is inadequate to regulate initiation.
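To put the figure of 50% hydrolysis in 15 min on a quantitative footing, the short calculation below converts it to a rate constant. It assumes that intrinsic hydrolysis of DnaA-bound ATP follows simple first-order kinetics, which the text does not state; the 30 min time point is chosen only for illustration.

```latex
% Assumption: intrinsic hydrolysis of DnaA-bound ATP is first order.
\begin{align*}
  f(t) &= \frac{[\text{DnaA-ATP}]_t}{[\text{DnaA-ATP}]_0} = e^{-kt},
  \qquad f(15\ \text{min}) = 0.5 \\
  k &= \frac{\ln 2}{15\ \text{min}} \approx 0.046\ \text{min}^{-1} \\
  f(30\ \text{min}) &= e^{-0.046 \times 30} \approx 0.25
\end{align*}
```

Under this assumption about a quarter of the DnaA-ATP would still be intact after 30 min, which is in line with the conclusion that intrinsic hydrolysis alone is too slow to regulate initiation.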
In addition to the effect of DARS sequences and acidic phospholipids, a pathway named the regulatory inactivation of DnaA (RIDA) has been discovered that controls the nucleotide-bound state of DnaA (Kawakami and
Katayama 2010; Skarstad and Katayama 2013). In this process, a complex formed by a protein named Hda with the β clamp of DNA polymerase III holoenzyme bound to DNA stimulates the hydrolysis of ATP bound to DnaA (Katayama et al. 1998; Kato and Katayama 2001). The regions of each protein involved in formation of the Hda-β clamp complex have been identified. In Hda, this region is near its N-terminus (Kurz et al. 2004; Su’etsugu et al. 2005; Xu et al. 2009) and is called the β-binding motif (QLXLF where X is any amino acid) on the basis that other proteins that interact with the β dimer also carry this amino acid sequence (Dalrymple et al. 2001). In the β dimer, the interacting region resides in a hydrophobic patch near the C-terminus of each monomer (Dalrymple et al. 2001). Gel filtration and cross-linking experiments indicate that a single β clamp forms a complex with one or two Hda dimers (Su’etsugu et al. 2005). In addition, substitution of amino acids of Hda in either the β-binding motif, its “arginine finger” residue implicated in the hydrolysis of ATP bound to DnaA (see below), or in other residues required to interact with DnaA inactivates Hda in RIDA activity in vitro (Camara et al. 2005; Kato and Katayama 2001; Kawakami et al. 2006; Nakamura and Katayama 2010; Riber et al. 2006; Su’etsugu et al. 2005). Together, these results support the conclusion that Hda forms a complex with the β clamp and that this complex stimulates the hydrolysis of ATP bound to DnaA.

An essential role of the β dimer is to act as a sliding clamp to confer high processivity to DNA polymerase III holoenzyme as it synthesizes DNA (Langston et al. 2009; McHenry 2011) (see ▶ “DNA Polymerase III Structure” by McHenry in this volume). If one protomer of the β dimer interacts with the α subunit of DNA polymerase III holoenzyme, the second protomer of the β dimer can interact with an Hda dimer. If correct, this places Hda at the replication fork.

Properties of hda mutants.
Deletion of the chromosomal hda gene leads to an increased ratio of DnaA-ATP to DnaA-ADP and more frequent initiations in vivo, suggesting that Hda is physiologically important (Kato and Katayama
2001). Of interest, null hda mutants are sick, but spontaneously arising suppressor mutations lead to improved viability (Charbon et al. 2011; Fujimitsu et al. 2008; Riber et al. 2006). Some suppressor mutations mapped to the dnaA gene and cause less frequent initiation (Charbon et al. 2011). Another suppressor mapped to the seqA promoter, which causes an increased level of SeqA that correlated with prolonged sequestration of oriC, a reduced cellular abundance of DnaA, and a lower ratio of DnaA-ATP to DnaA-ADP. In a separate study of a cold-sensitive hda mutant (Fujimitsu et al. 2008), disruption of the diaA gene suppressed its cold-sensitive phenotype, suggesting that the absence of DiaA, which otherwise would stabilize the DnaA oligomer at oriC (see below), results in less frequent initiation. As DnaA affects the transcription of genes in the DnaA regulon, inactivation of hda indirectly affects their expression by altering the relative abundance of DnaA-ATP and its ability to bind to the respective promoters (Charbon et al. 2011; Riber et al. 2006).

The β-binding motif. The β-binding motif is also found in other proteins that interact with the β dimer (Dalrymple et al. 2001; see ▶ “DNA Polymerase III Structure” by McHenry in this volume). These include the α and δ subunits of DNA polymerase III holoenzyme (Jeruzalmi et al. 2001; Kurz et al. 2004; Maki and Kornberg 1988). Their interaction with a hydrophobic domain near the C-terminus of each protomer of the β dimer leads to highly processive DNA synthesis by this DNA polymerase (Langston et al. 2009; McHenry 2011). Other DNA polymerases that are involved in translesion DNA synthesis, or the replacement of the primer for Okazaki fragment synthesis with DNA, also interact with the β dimer via its C-terminal hydrophobic domain as do MutS, MutL, and DNA ligase (Lopez de Saro and O’Donnell 2001; Lopez de Saro et al. 2006; Sutton et al. 1999).
For the latter enzymes that act in the repair of misincorporated nucleotides, their interaction with the β dimer is a means to localize each to newly replicated DNA containing mismatched nucleotides.

Mechanism of ATP hydrolysis by the Hda-β clamp complex. DnaA is a member of the
AAA+ superfamily of ATPases. Domain III is highly conserved among bacterial DnaAs and carries the Walker A and B box motifs that function in ATP binding and in the coordination of magnesium ion chelated to the phosphates of the bound nucleotide, respectively. Domain III also bears the Sensor I, Sensor II (Box VIII), and Box VII motifs characteristic of AAA+ ATPases (Erzberger et al. 2002; Koonin 1993). A homology model of E. coli DnaA has been derived from the X-ray crystallographic structure of domains III and IV of Aquifex aeolicus DnaA bound to ADP and AMP-PCP (Erzberger et al. 2002, 2006). As found in other AAA+ proteins, the ATP-binding pocket of DnaA is a bipartite binding site formed between adjacent protomers (Neuwald et al. 1999). Like the conserved arginine finger residue of other AAA+ proteins (Davey et al. 2002; Ogura et al. 2004), two conserved arginines in Box VII of E. coli DnaA are implicated in multimerization of DnaA (Felczak and Kaguni 2004) and interaction with ATP (Kawakami et al. 2005; see ▶ “DnaA, DnaB, DnaC” by Kaguni in this volume).

Hda is also an AAA+ protein but prefers to bind ADP over ATP (Su’etsugu et al. 2008). The arginine finger residue of Hda (arginine 168) is essential for the hydrolysis of ATP bound to DnaA (Su’etsugu et al. 2005). The X-ray crystallographic structure of Shewanella amazonensis SB2B Hda as a dimer complexed to CDP has been determined (Xu et al. 2009). Because of the similarity in amino acid sequence and structure of the AAA+ subdomains of Hda and DnaA, a model has been constructed in which the bipartite ATP-binding pocket is formed by a heterodimer of DnaA and Hda. In this model, the arginine finger of Hda is close to the γ-phosphate of ATP. By analogy with the mechanism of ATP hydrolysis by other AAA+ ATPases, a catalytic aspartate residue in Hda is proposed to interact with water that acts as a nucleophile during the hydrolysis of the γ-phosphate of ATP bound to DnaA.
The role of the Hda-β clamp complex bound to DNA in stimulating the hydrolysis of ATP raises the question of where this occurs on the chromosome. One model is that it occurs at or near oriC
after the β clamp has been loaded onto DNA, stimulating the concerted hydrolysis of ATP bound to each DnaA molecule in the DnaA filament (Clarey et al. 2006; Mott and Berger 2007). An alternative model is that Hda binds specifically to the β clamp that is used to synthesize Okazaki fragments (Su’etsugu et al. 2004). The β clamp is a homodimer and carries a hydrophobic pocket located near the interface joining each subunit. As described above, this domain of the β clamp interacts with Hda, the α and δ subunits of DNA polymerase III holoenzyme, and other proteins (Dalrymple et al. 2001; Jeruzalmi et al. 2001). Because each β clamp has two interacting domains, Hda and DNA polymerase III holoenzyme can separately interact with a molecule of the β clamp as an Okazaki fragment is synthesized. Hence, the hydrolysis of ATP bound to DnaA may occur during Okazaki fragment synthesis. A separate possibility is that ATP hydrolysis occurs after the DNA polymerase has completed the synthesis of an Okazaki fragment and has dissociated from the β clamp.

Ribonucleotide Reductase

E. coli ribonucleotide reductase 1, which synthesizes the deoxyribonucleotides needed for DNA replication, is composed of two subunits encoded by the nrdAB-yfaE operon, whose expression is influenced by DnaA (Augustin et al. 1994; Herrick and Sclavi 2007; Sun and Fuchs 1992; Sun et al. 1994). In vitro evidence suggests that DnaA-ATP represses nrdAB expression, whereas DnaA-ADP is less effective (Gon et al. 2006). It has been suggested that the conversion of DnaA-ATP to DnaA-ADP by the Hda-β clamp complex coordinates the resulting increase in synthesis of ribonucleotide reductase, and hence of deoxynucleotides, with the elongation stage of DNA replication when deoxynucleotides are needed. As DnaA-ATP is active in initiation, this mechanism may synchronize the transition from the stage of initiation to the propagation of replication forks. A prediction is that oversupply of DnaA should repress nrd expression. However, the study by Gon et al. compared the level of NrdB in a strain carrying a dnaA expression plasmid to the control of this strain harboring the empty vector under conditions expected to overproduce DnaA (Gon et al. 2006). Relative to an internal control (thioredoxin), the abundance of NrdB detected by immunoblotting was essentially unaffected, so overproduction of DnaA does not alter the cellular abundance of ribonucleotide reductase (Gon et al. 2006; Olliver et al. 2010). As an explanation for this apparent discrepancy, it is possible that ribonucleotide reductase, estimated at 1,500–3,000 molecules per cell (Eriksson et al. 1977), is so abundant that the repressive effect of DnaA on nrdAB expression is obscured.

Acknowledgments

I thank the members of my lab for their support while I was writing. This work is supported by Grant GM090063 from the National Institutes of Health and by the Michigan Agricultural Experiment Station.

Cross-References

▶ DNA Polymerase III Structure
▶ DNA Replication
▶ DNA Topology and Topoisomerases
▶ DnaA, DnaB, DnaC
▶ Replication Origin of E. coli and the Mechanism of Initiation

References

Atlung T, Clausen ES, Hansen FG (1985) Autoregulation of the dnaA gene of Escherichia coli K12. Mol Gen Genet 200(3):442–450
Atlung T, Lobner-Olesen A, Hansen FG (1987) Overproduction of DnaA protein stimulates initiation of chromosome and minichromosome replication in Escherichia coli. Mol Gen Genet 206(1):51–59
Augustin LB, Jacobson BA, Fuchs JA (1994) Escherichia coli Fis and DnaA proteins bind specifically to the nrd promoter region and affect expression of an nrd-lac fusion. J Bacteriol 176(2):378–387
Boeneman K, Crooke E (2005) Chromosomal replication and the cell membrane. Curr Opin Microbiol 8(2):143–148
Bogan JA, Helmstetter CE (1997) DNA sequestration and transcription in the oriC region of Escherichia coli. Mol Microbiol 26(5):889–896
Boye E et al (1996) Coordinating DNA replication initiation with cell growth: differential roles for DnaA and SeqA proteins. Proc Natl Acad Sci U S A 93(22):12206–12211
Braun RE, O’Day K, Wright A (1985) Autoregulation of the DNA replication gene dnaA in E. coli K-12. Cell 40(1):159–169
Brendler T, Austin S (1999) Binding of SeqA protein to DNA requires interaction between two or more complexes bound to separate hemimethylated GATC sequences. EMBO J 18(8):2304–2310
Brendler T, Abeles A, Austin S (1995) A protein that binds to the P1 origin core and the oriC 13mer region in a methylation-specific fashion is the product of the host seqA gene. EMBO J 14(16):4083–4089
Brendler T et al (2000) A case for sliding SeqA tracts at anchored replication forks during Escherichia coli chromosome replication and segregation. EMBO J 19(22):6249–6258
Camara JE et al (2005) Hda inactivation of DnaA is the predominant mechanism preventing hyperinitiation of Escherichia coli DNA replication. EMBO Rep 6(8):736–741
Campbell JL, Kleckner N (1988) The rate of Dam-mediated DNA adenine methylation in Escherichia coli. Gene 74(1):189–190
Campbell JL, Kleckner N (1990) E. coli oriC and the dnaA gene promoter are sequestered from dam methyltransferase following the passage of the chromosomal replication fork. Cell 62(5):967–979
Charbon G et al (2011) Suppressors of DnaA(ATP) imposed overinitiation in Escherichia coli. Mol Microbiol 79(4):914–928
Clarey MG et al (2006) Nucleotide-dependent conformational changes in the DnaA-like core of the origin recognition complex. Nat Struct Mol Biol 13(8):684–690
Cooper S, Helmstetter CE (1968) Chromosome replication and the division cycle of Escherichia coli B/r. J Mol Biol 31(3):519–540
Courcelle J (2005) Recs preventing wrecks. Mutat Res 577(1–2):217–227
Cox MM (2007) Regulation of bacterial RecA protein function. Crit Rev Biochem Mol Biol 42(1):41–63
Dalrymple BP et al (2001) A universal protein-protein interaction motif in the eubacterial DNA replication and repair systems. Proc Natl Acad Sci U S A 98(20):11627–11632
Davey MJ et al (2002) Motors and switches: AAA+ machines within the replisome. Nat Rev Mol Cell Biol 3(11):826–835
Donachie WD (1968) Relationship between cell size and time of initiation of DNA replication. Nature 219(158):1077–1079
Eriksson S, Sjoberg BM, Hahne S (1977) Ribonucleoside diphosphate reductase from Escherichia coli. An immunological assay and a novel purification from an overproducing strain lysogenic for phage lambdadnrd. J Biol Chem 252(17):6132–6138
Erzberger JP, Pirruccello MM, Berger JM (2002) The structure of bacterial DnaA: implications for general mechanisms underlying DNA replication initiation. EMBO J 21(18):4763–4773
Erzberger JP, Mott ML, Berger JM (2006) Structural basis for ATP-dependent DnaA assembly and replication-origin remodeling. Nat Struct Mol Biol 13(8):676–683
Felczak MM, Kaguni JM (2004) The box VII motif of Escherichia coli DnaA protein is required for DnaA oligomerization at the E. coli replication origin. J Biol Chem 279(49):51156–51162
Felczak MM, Kaguni JM (2009) DnaAcos hyperinitiates by circumventing regulatory pathways that control the frequency of initiation in Escherichia coli. Mol Microbiol 72:1348–1363
Finkel SE, Johnson RC (1992) The Fis protein: it’s not just for DNA inversion anymore. Mol Microbiol 6(22):3257–3265. [Published erratum appears in Mol Microbiol 1993;7(2):1023]
Froelich JM, Phuong TK, Zyskind JW (1996) Fis binding in the dnaA operon promoter region. J Bacteriol 178(20):6006–6012
Fujimitsu K, Katayama T (2004) Reactivation of DnaA by DNA sequence-specific nucleotide exchange in vitro. Biochem Biophys Res Commun 322(2):411–419
Fujimitsu K et al (2008) Modes of overinitiation, dnaA gene expression, and inhibition of cell division in a novel cold-sensitive hda mutant of Escherichia coli. J Bacteriol 190(15):5368–5381
Fujimitsu K, Senriuchi T, Katayama T (2009) Specific genomic sequences of E. coli promote replicational initiation by directly reactivating ADP-DnaA. Genes Dev 23(10):1221–1233
Gon S et al (2006) A novel regulatory mechanism couples deoxyribonucleotide synthesis and DNA replication in Escherichia coli. EMBO J 25(5):1137–1147
Han JS et al (2003) Sequential binding of SeqA to paired hemi-methylated GATC sequences mediates formation of higher order complexes. J Biol Chem 278(37):34983–34989
Han JS et al (2004) Binding of SeqA protein to hemimethylated GATC sequences enhances their interaction and aggregation properties. J Biol Chem 279(29):30236–30243
Handa N et al (2009) Reconstitution of initial steps of dsDNA break repair by the RecF pathway of E. coli. Genes Dev 23(10):1234–1245
Hansen FG et al (1987) Titration of DnaA protein by oriC DnaA-boxes increases dnaA gene expression in Escherichia coli. EMBO J 6(1):255–258
Hansen FG et al (1991) Initiator (DnaA) protein concentration as a function of growth rate in Escherichia coli and Salmonella typhimurium. J Bacteriol 173(16):5194–5199
Herrick J, Sclavi B (2007) Ribonucleotide reductase and the regulation of DNA replication: an old story and an ancient heritage. Mol Microbiol 63(1):22–34
Herrick J et al (1996) The initiation mess? Mol Microbiol 19(4):659–666
Hill NS et al (2012) Cell size and the initiation of DNA replication in bacteria. PLoS Genet 8(3):e1002549
Hiom K (2009) DNA repair: common approaches to fixing double-strand breaks. Curr Biol 19(13):R523–R525
Hiraga S et al (1998) Cell cycle-dependent duplication and bidirectional migration of seqA-associated DNA-protein complexes in E. coli. Mol Cell 1(3):381–387
Hwang DS, Kornberg A (1990) A novel protein binds a key origin sequence to block replication of an E. coli minichromosome. Cell 63(2):325–331
Hwang DS, Kornberg A (1992) Opposed actions of regulatory proteins, DnaA and IciA, in opening the replication origin of Escherichia coli. J Biol Chem 267(32):23087–23091
Jeruzalmi D et al (2001) Mechanism of processivity clamp opening by the delta subunit wrench of the clamp loader complex of E. coli DNA polymerase III. Cell 106(4):417–428
Kang S et al (2005) Dimeric configuration of SeqA protein bound to a pair of hemi-methylated GATC sequences. Nucleic Acids Res 33(5):1524–1531
Kasho K, Katayama T (2013) DnaA binding locus datA promotes DnaA-ATP hydrolysis to enable cell cycle-coordinated replication initiation. Proc Natl Acad Sci U S A 110(3):936–941
Katayama T et al (1998) The initiator function of DnaA protein is negatively regulated by the sliding clamp of the E. coli chromosomal replicase. Cell 94(1):61–71
Katayama T et al (2010) Regulation of the replication cycle: conserved and diverse regulatory systems for DnaA and oriC. Nat Rev Microbiol 8(3):163–170
Kato J, Katayama T (2001) Hda, a novel DnaA-related protein, regulates the replication cycle in Escherichia coli. EMBO J 20(15):4253–4262
Kawakami H, Katayama T (2010) DnaA, ORC, and Cdc6: similarity beyond the domains of life and diversity. Biochem Cell Biol 88(1):49–62
Kawakami H, Keyamura K, Katayama T (2005) Formation of an ATP-DnaA-specific initiation complex requires DnaA arginine 285, a conserved motif in the AAA+ protein family. J Biol Chem 280(29):27420–27430
Kawakami H, Su’etsugu M, Katayama T (2006) An isolated Hda-clamp complex is functional in the regulatory inactivation of DnaA and DNA replication. J Struct Biol 156(1):220–229
Kitagawa R et al (1998) Negative control of replication initiation by a novel chromosomal locus exhibiting exceptional affinity for Escherichia coli DnaA protein. Genes Dev 12(19):3032–3043
Konieczny I (2003) Strategies for helicase recruitment and loading in bacteria. EMBO Rep 4(1):37–41
Koonin EV (1993) A common set of conserved motifs in a vast variety of putative nucleic acid-dependent ATPases including MCM proteins involved in the initiation of eukaryotic DNA replication. Nucleic Acids Res 21(11):2541–2547
Kucherer C et al (1986) Regulation of transcription of the chromosomal dnaA gene of Escherichia coli. Mol Gen Genet 205(1):115–121
Kurokawa K et al (1999) Replication cycle-coordinated change of the adenine nucleotide-bound forms of DnaA protein in Escherichia coli. EMBO J 18(23):6642–6652
Kurz M et al (2004) Interaction of the sliding clamp beta-subunit and Hda, a DnaA-related protein. J Bacteriol 186(11):3508–3515
Langston LD, Indiani C, O’Donnell M (2009) Whither the replisome: emerging perspectives on the dynamic nature of the DNA replication machinery. Cell Cycle 8(17):2686–2691
Lee Y et al (1997) The binding of two dimers of IciA protein to the dnaA promoter 1P element enhances the binding of RNA polymerase to the dnaA promoter 1P. Nucleic Acids Res 25(17):3486–3489
Lobner-Olesen A et al (1989) The DnaA protein determines the initiation mass of Escherichia coli K-12. Cell 57(5):881–889
Lopez de Saro FJ, O’Donnell M (2001) Interaction of the beta sliding clamp with MutS, ligase, and DNA polymerase I. Proc Natl Acad Sci U S A 98(15):8376–8380
Lopez de Saro FJ et al (2006) The beta sliding clamp binds to multiple sites within MutL and MutS. J Biol Chem 281(20):14340–14349
Lu M et al (1994) SeqA: a negative modulator of replication initiation in E. coli. Cell 77(3):413–426
Ma D, Campbell JL (1988) The effect of dnaA protein and n’ sites on the replication of plasmid ColE1. J Biol Chem 263(29):15008–15015
Maki S, Kornberg A (1988) DNA polymerase III holoenzyme of Escherichia coli. III. Distinctive processive polymerases reconstituted from purified subunits. J Biol Chem 263(14):6561–6569
McHenry CS (2011) Bacterial replicases and related polymerases. Curr Opin Chem Biol 15(5):587–594
Messer W, Weigel C (2003) DnaA as a transcription regulator. Methods Enzymol 370:338–349
Morigen et al (2001) Regulation of chromosomal replication by DnaA protein availability in Escherichia coli: effects of the datA region. Biochim Biophys Acta 1521(1–3):73–80
Mott ML, Berger JM (2007) DNA replication initiation: mechanisms and regulation in bacteria. Nat Rev Microbiol 5(5):343–354
Nakamura K, Katayama T (2010) Novel essential residues of Hda for interaction with DnaA in the regulatory inactivation of DnaA: unique roles for Hda AAA Box VI and VII motifs. Mol Microbiol 76(2):302–317
Neuwald AF et al (1999) AAA+: A class of chaperone-like ATPases associated with the assembly, operation, and disassembly of protein complexes. Genome Res 9(1):27–43
Nielsen O, Lobner-Olesen A (2008) Once in a lifetime: strategies for preventing re-replication in prokaryotic and eukaryotic cells. EMBO Rep 9(2):151–156
Nievera C et al (2006) SeqA blocking of DnaA-oriC interactions ensures staged assembly of the E. coli pre-RC. Mol Cell 24(4):581–592
Ogden GB, Pratt MJ, Schaechter M (1988) The replicative origin of the E. coli chromosome binds to cell membranes only when hemimethylated. Cell 54(1):127–135
Ogura T, Whiteheart SW, Wilkinson AJ (2004) Conserved arginine residues implicated in ATP hydrolysis, nucleotide-sensing, and inter-subunit interactions in AAA and AAA+ ATPases. J Struct Biol 146(1–2):106–112
Olliver A et al (2010) DnaA-ATP acts as a molecular switch to control levels of ribonucleotide reductase expression in Escherichia coli. Mol Microbiol 76(6):1555–1571
Onogi T et al (1999) The assembly and migration of SeqA-Gfp fusion in living cells of Escherichia coli. Mol Microbiol 31(6):1775–1782
Parada CA, Marians KJ (1991) Mechanism of DNA A protein-dependent pBR322 DNA replication. DNA A protein-mediated trans-strand loading of the DNA B protein at the origin of pBR322 DNA. J Biol Chem 266(28):18895–18906
Riber L et al (2006) Hda-mediated inactivation of the DnaA protein and dnaA gene autoregulation act in concert to ensure homeostatic maintenance of the Escherichia coli chromosome. Genes Dev 20(15):2121–2134
Roth A, Messer W (1998) High-affinity binding sites for the initiator protein DnaA on the chromosome of Escherichia coli. Mol Microbiol 28(2):395–401
Sekimizu K, Kornberg A (1988) Cardiolipin activation of dnaA protein, the initiation protein of replication in Escherichia coli. J Biol Chem 263(15):7131–7135
Sekimizu K, Bramhill D, Kornberg A (1987) ATP activates dnaA protein in initiating replication of plasmids bearing the origin of the E. coli chromosome. Cell 50(2):259–265
Seufert W, Messer W (1987) DnaA protein binding to the plasmid origin region can substitute for primosome assembly during replication of pBR322 in vitro. Cell 48(1):73–78
Skarstad K, Katayama T (2013) Regulating DNA replication in bacteria. Cold Spring Harb Perspect Biol 5(4)
Skarstad K, Boye E, Steen HB (1986) Timing of initiation of chromosome replication in individual Escherichia coli cells. EMBO J 5(7):1711–1717. [Published erratum appears in EMBO J 1986 Nov;5(11):3074]
Slater S et al (1995) E. coli SeqA protein binds oriC in two different methyl-modulated reactions appropriate to its roles in DNA replication initiation and origin sequestration. Cell 82(6):927–936
Smith RW, McAteer S, Masters M (1997) Autoregulation of the Escherichia coli replication initiator protein, DnaA, is indirect. Mol Microbiol 23(6):1303–1315
Speck C, Weigel C, Messer W (1999) ATP- and ADP-dnaA protein, a molecular switch in gene regulation. EMBO J 18(21):6169–6176
Su’etsugu M et al (2004) Molecular mechanism of DNA replication-coupled inactivation of the initiator protein in Escherichia coli: interaction of DnaA with the sliding clamp-loaded DNA and the sliding clamp-Hda complex. Genes Cells 9(6):509–522
Su’etsugu M et al (2005) Protein associations in DnaA-ATP hydrolysis mediated by the Hda-replicase clamp complex. J Biol Chem 280(8):6528–6536
Su’etsugu M et al (2008) Hda monomerization by ADP binding promotes replicase clamp-mediated DnaA-ATP hydrolysis. J Biol Chem 283(52):36118–36131
Sun L, Fuchs JA (1992) Escherichia coli ribonucleotide reductase expression is cell cycle regulated. Mol Biol Cell 3(10):1095–1105
Sun L et al (1994) Cell cycle regulation of the Escherichia coli nrd operon: requirement for a cis-acting upstream AT-rich sequence. J Bacteriol 176(8):2415–2426
Sutton MD, Opperman T, Walker GC (1999) The Escherichia coli SOS mutagenesis proteins UmuD and UmuD’ interact physically with the replicative DNA polymerase. Proc Natl Acad Sci U S A 96(22):12373–12378
Theisen PW et al (1993) Correlation of gene transcription with the time of initiation of chromosome replication in Escherichia coli. Mol Microbiol 10(3):575–584
Travers A, Schneider R, Muskhelishvili G (2001) DNA supercoiling and transcription in Escherichia coli: The FIS connection. Biochimie 83(2):213–217
van den Berg EA et al (1985) Analysis of regulatory sequences upstream of the E. coli uvrB gene; involvement of the DnaA protein. Nucleic Acids Res 13(6):1829–1840
von Freiesleben U, Rasmussen KV, Schaechter M (1994) SeqA limits DnaA activity in replication from oriC in Escherichia coli. Mol Microbiol 14(4):763–772
Waldminghaus T, Weigel C, Skarstad K (2012) Replication fork movement and methylation govern SeqA binding to the Escherichia coli chromosome. Nucleic Acids Res 40:5465–5476
Wang QP, Kaguni JM (1987) Transcriptional repression of the dnaA gene of Escherichia coli by dnaA protein. Mol Gen Genet 209(3):518–525
Wang QP, Kaguni JM (1989) dnaA protein regulates transcriptions of the rpoH gene of Escherichia coli. J Biol Chem 264(13):7338–7344
Xu Q et al (2009) A structural basis for the regulatory inactivation of DnaA. J Mol Biol 385(2):368–380
Yamazoe M et al (2005) Sequential binding of SeqA protein to nascent DNA segments at replication forks in synchronized cultures of Escherichia coli. Mol Microbiol 55(1):289–298
Zheng W et al (2001) Mutations in DnaA protein suppress the growth arrest of acidic phospholipid-deficient Escherichia coli cells. EMBO J 20(5):1164–1172
Zhou Z, Syvanen M (1990) Identification and sequence of the drpA gene from Escherichia coli. J Bacteriol 172(1):281–286

Co-transcriptional mRNA Processing in Eukaryotes

Bonnie Marvin and Maki Inada
Biology Department, Ithaca College, Ithaca, NY, USA

Synonyms
3′ end processing; 5′ capping; Alternative splicing; Cleavage and polyadenylation; Pre-mRNA processing; Pre-mRNA splicing

Synopsis
Gene expression describes the flow of genetically encoded information from DNA to its intermediary form, mRNA, and then to its functional form, protein. In eukaryotes, a precursor mRNA (pre-mRNA) is extensively processed co-transcriptionally to yield a mature mRNA. To form a mature mRNA, the 5′ end of the pre-mRNA is capped, its coding regions are joined together in a process called pre-mRNA splicing, and its 3′ end is cleaved and extended with a poly(A) tail. These modifications give the cell multiple opportunities to regulate the diversity and levels of an mRNA before it is translated into protein.

Introduction
For all organisms, the expression of genes from DNA to functional protein first requires the transfer of information to a transient messenger RNA (mRNA) molecule. In prokaryotes, mRNAs are typically direct copies of their respective genes and are translated into protein immediately, or even while they are still being transcribed. In eukaryotes, however, mRNAs undergo extensive modification during their synthesis, from a precursor mRNA (pre-mRNA) to a mature message. These processing steps are (1) addition of a cap to the 5′ end of an mRNA; (2) pre-mRNA splicing, whereby noncoding regions are removed and the remaining protein-coding regions are joined together; and (3) cleavage of the 3′ end of an mRNA and addition of a poly(A) tail. Each of these steps is highly regulated and can modulate the levels of protein produced during translation. Recent advances in high-throughput sequencing of RNAs are increasingly revealing how pervasively the content and levels of the mRNAs generated by these processes differ among cell types, developmental stages, and disease states. In sum, these modifications give the cell intricate control over gene expression by affecting the stability of the mRNA, its movement within the cell, the rate at which it is translated, and ultimately the information it encodes.

Cross Talk Between mRNA Processing Events and Transcription
Initially, the processes and proteins involved in transcription, 5′ end capping, pre-mRNA splicing, and 3′ end maturation were unraveled separately, but it was soon realized that these events greatly influence one another. Synthesis and processing of an mRNA are now regarded as simultaneous and highly interlinked, and the interplay between the processing machinery and the transcription machinery, namely, RNA polymerase II (RNAP II), has been shown to play a key role in gene regulation. RNAP II’s ability to regulate mRNA processing lies in its carboxyl terminal domain (CTD), which consists of multiple heptad repeats of the amino acid sequence Y1S2P3T4S5P6S7 (Tyr1-Ser2-Pro3-Thr4-Ser5-Pro6-Ser7). The CTD acts as a landing platform, recruiting the various protein factors necessary for both transcribing and processing the mRNA. Within each heptad repeat, the tyrosine, serine, and threonine residues can be reversibly phosphorylated and the prolines can undergo cis-trans isomerization, changing the capacity of the CTD to bind different factors (Buratowski 2009; Egloff and Murphy 2008).
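The heptad code described above can be illustrated with a short Python sketch. It locates consensus YSPTSPS repeats in a protein string and labels which heptad positions can be phosphorylated or isomerized, following the residue classes named in the text. The function names (`find_heptads`, `modifiable_positions`) and the toy sequence are hypothetical and chosen only for illustration; real CTDs also contain non-consensus repeats, which this sketch ignores.

```python
# Consensus heptad of the RNAP II CTD as given in the text: Y1 S2 P3 T4 S5 P6 S7
HEPTAD = "YSPTSPS"

# Residue classes named in the text: Tyr, Ser and Thr can be phosphorylated;
# Pro can undergo cis-trans isomerization.
PHOSPHORYLATABLE = {"Y", "S", "T"}
ISOMERIZABLE = {"P"}


def find_heptads(protein):
    """Return 0-based start positions of non-overlapping consensus heptads."""
    starts = []
    i = 0
    while i <= len(protein) - len(HEPTAD):
        if protein[i:i + len(HEPTAD)] == HEPTAD:
            starts.append(i)
            i += len(HEPTAD)  # jump past the repeat just found
        else:
            i += 1
    return starts


def modifiable_positions():
    """Label each heptad position (e.g. 'S5') by the modification it can carry."""
    labels = {"phosphorylation": [], "isomerization": []}
    for pos, residue in enumerate(HEPTAD, start=1):
        name = f"{residue}{pos}"
        if residue in PHOSPHORYLATABLE:
            labels["phosphorylation"].append(name)
        elif residue in ISOMERIZABLE:
            labels["isomerization"].append(name)
    return labels


# Toy sequence (hypothetical): a short linker followed by three consensus repeats.
toy_ctd = "MAG" + HEPTAD * 3
heptad_starts = find_heptads(toy_ctd)
sites = modifiable_positions()
print(heptad_starts)
print(sites)
```

Ser2 and Ser5, the two positions whose phosphorylation is discussed later in this entry, both appear in the phosphorylation list.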
Tethering transcription to mRNA processing shapes processing in three main ways. First, it localizes and positions the protein factors required for a processing event near the growing mRNA strand, thus increasing the likelihood that the event occurs at the right time and place. Second, as an mRNA is synthesized, it begins to fold and progressively forms a three-dimensional structure. This structure may hide or expose various parts of the mRNA, including signals for subsequent splicing or 3′ end processing. The rate of transcription elongation is a critical factor in the folding of the mRNA and hence in the signals available for proteins to bind. This may radically alter the formation of RNA–protein complexes and, therefore, the processing of an mRNA. Third, mRNA processing factors bound to the CTD of RNAP II during elongation can allosterically activate or inhibit other mRNA processing factors, resulting in variations in how an mRNA is processed (Bentley 2005). Overall, the CTD of RNAP II has a critical influence on the co-transcriptional modifications of an mRNA and, thus, on gene expression; these influences are discussed throughout this entry.

5′ End Capping
As nascent transcripts emerge from RNAP II, they undergo the first processing event, 5′ end capping. This takes place after 20–30 nucleotides have been synthesized. Note that the sequence at the 5′ end of the mRNA, and consequently the location of the 5′ cap, can vary. This is dictated by the regulation of transcription initiation, which is described elsewhere (see ▶ “Cis-Regulation of Eukaryotic Transcription”). Capping involves a three-step enzymatic reaction that ultimately attaches an inverted methylated guanine nucleotide to the 5′ end of an mRNA (Fig. 1); the cap serves primarily to protect the end of the mRNA and to modulate gene expression. The three capping enzymes are an RNA 5′-triphosphatase, a guanylyl transferase, and a methyltransferase. The RNA 5′-triphosphatase initiates 5′ capping by hydrolyzing the triphosphate group on the first nucleotide of the growing mRNA strand, removing one phosphate.
Next, guanylyl transferase uses a molecule of guanosine 5′-triphosphate (GTP) to transfer a guanosine and a single phosphate onto the first nucleotide of the mRNA. Finally, methyltransferase methylates the nitrogen at position 7 of the transferred guanosine (Proudfoot et al. 2002). The capping enzymes are recruited to the CTD upon phosphorylation of Ser5 during transcription initiation. Because the CTD is located adjacent to the site where the mRNA strand emerges from RNAP II, this recruitment promotes rapid addition of the 5′ cap. The enzymatic activity of guanylyl transferase is also increased by its interaction with the CTD, specifically with phosphorylated Ser5, which increases the efficiency of the capping reaction (Proudfoot et al. 2002; Buratowski 2009; Bentley 2005).

The addition of the 5′ cap to an mRNA serves several functions that enhance gene expression. First, the cap helps to stabilize an mRNA by acting as a barrier to ribonucleases, preventing the mRNA from being quickly degraded. Second, it aids in the export of the mRNA out of the nucleus. To exit the nucleus, the mRNA must pass through the nuclear pore complex (NPC), a large protein complex that spans the nuclear membrane. The NPC recognizes a capped mRNA via associated proteins known as the cap-binding complex (CBC) and promotes the efficient export of that mRNA. Third, the cap marks the beginning of the mRNA for translation by promoting the initial interaction with the ribosome. When the mRNA enters the cytoplasm, the CBC is exchanged for the translation initiation factor eIF-4E. This interaction between the cap and eIF-4E, along with the addition of other translation initiation factors, initiates the binding of the ribosomal subunits to the beginning of the mRNA. Finally, the cap enhances the rate of translation through the formation of a protein–protein bridge between the cap-binding factors and the poly(A)-binding protein (PABP) that coats the poly(A) tail, creating a circularized mRNA/protein complex. When the ribosome reaches the end of the mRNA and terminates translation, it is therefore situated adjacent to the beginning of the mRNA. This drastically increases the likelihood that it binds the mRNA again and initiates another round of translation (Proudfoot et al. 2002). Overall, the 5′ cap plays a vital role in an mRNA’s life span, its proper movement within the cell, and its translatability.
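The three enzymatic steps of capping can be summarized as a small state-transition sketch in Python. The 5′ end is written as a string in which `p` is a phosphate and `N` is the first transcribed nucleotide; this notation and all function names are illustrative assumptions, not a chemical model.

```python
def triphosphatase(five_prime_end):
    """RNA 5'-triphosphatase step: remove one phosphate (pppN -> ppN)."""
    if not five_prime_end.startswith("ppp"):
        raise ValueError("expected a 5' triphosphate end")
    return five_prime_end[1:]


def guanylyl_transferase(five_prime_end):
    """Guanylyl transferase step: add GMP taken from GTP (ppN -> GpppN).

    The other two phosphates of GTP leave as pyrophosphate and are not tracked.
    """
    if not five_prime_end.startswith("pp") or five_prime_end.startswith("ppp"):
        raise ValueError("expected a 5' diphosphate end")
    return "Gp" + five_prime_end


def methyltransferase(five_prime_end):
    """Methyltransferase step: methylate N7 of the added guanosine."""
    if not five_prime_end.startswith("Gppp"):
        raise ValueError("expected an unmethylated guanosine cap")
    return "m7" + five_prime_end


def cap(five_prime_end):
    """Apply the three steps in order and return every intermediate state."""
    states = [five_prime_end]
    for enzyme in (triphosphatase, guanylyl_transferase, methyltransferase):
        states.append(enzyme(states[-1]))
    return states


capping_states = cap("pppN")
print(" -> ".join(capping_states))
```

Each function rejects a substrate in the wrong state, which mirrors the fixed order of the reaction: applying `methyltransferase` to an uncapped end raises an error.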


Co-transcriptional mRNA Processing in Eukaryotes, Fig. 1 5′ End capping during transcription. 5′ End capping consists of three enzymatic reactions that add an inverted methylated guanosine cap at the 5′ end of an mRNA (light blue). Yellow circles denote phosphates. During transcription initiation, Ser5 on the carboxy terminal domain (CTD) of RNA polymerase II (RNAP II) (green) undergoes phosphorylation, resulting in the recruitment of capping enzymes to the CTD of RNA polymerase. These capping enzymes include an RNA 5′-triphosphatase (RTP), a guanylyl transferase (GT), and a methyltransferase (MT). First, RTP hydrolyzes the triphosphate group on the first nucleotide of the nascent mRNA strand, removing a phosphate. Second, GT catalyzes the addition of guanosine 5′-monophosphate (GMP) to the first nucleotide of the mRNA using a molecule of guanosine 5′-triphosphate (GTP). The two remaining phosphates from the GTP molecule leave as pyrophosphate. Finally, MT catalyzes the addition of a methyl group to the nitrogen at position 7 of the transferred guanosine (Proudfoot et al. 2002)

Pre-mRNA Splicing
A defining characteristic of eukaryotic genes is the presence of introns, intervening sequences that do not encode protein. These intronic sequences are excised from the transcribed pre-mRNA, and the sequences that encode protein, or exons, are ligated together in a process called pre-mRNA splicing. Splicing requires the coordinated activity of signal sequences within the mRNA (cis elements) and of the splicing machinery (trans-acting factors) that interacts with these cis elements. Core signals include a 5′ splice site (SS), a branch point sequence (BP) with a highly conserved adenosine residue, and a 3′ SS (Fig. 2a). The splicing machinery itself is a large ribonucleoprotein (RNP) complex called the spliceosome, which consists of more than 100 proteins and five small nuclear RNAs (snRNAs), namely, U1, U2, U4, U5, and U6. The spliceosome is a dynamic machine whose accuracy in locating splice signals is critical for proper gene expression (Wang and Burge 2008; Black 2003; Wahl et al. 2009).

The splicing reaction itself consists of two sequential trans-esterification reactions at the 5′ and 3′ SS (Fig. 2a). Splicing begins with the formation of a catalytically active spliceosome, which requires multiple steps to ensure that the short splicing signals are recognized and that splicing happens at the proper sites (Fig. 2b). To begin assembly of the spliceosome on the pre-mRNA, U1 snRNA first interacts with the 5′ SS. U2 snRNA then base pairs with the BP, causing the highly conserved adenosine residue at the BP to bulge. This bulge exposes the 2′ OH of the conserved adenosine, positioning it for nucleophilic attack on the phosphodiester bond at the 5′ SS. Next, the U4/U5/U6 tri-snRNP is recruited to the spliceosome, and U1 is displaced by U6 at the 5′ SS. The combined interaction of U2 and U6 with the pre-mRNA brings the 5′ SS and BP into close proximity and thus positions the pre-mRNA for the first catalytic step. This reaction releases the 5′ exon and produces an unusual lariat intermediate. After the first reaction, several rearrangements of the spliceosome occur so that the second reaction can proceed. This reorganization includes a conformational change in U2, which disrupts the interaction between U2 and the BP, and the binding of U5 to the 5′ and 3′ SS. Ultimately, the 3′ OH of the newly released exon attacks the last nucleotide of the intron at the 3′ SS, joining the two exons and releasing the intron lariat. This highly dynamic process of assembling a new spliceosome on each intron requires the activity of many spliceosomal factors, such as helicase proteins, to modulate the multiple RNA/RNA, RNA/protein, and protein/protein interactions that must take place for accurate splicing (Wahl et al. 2009; Black 2003). If the spliceosome acts upon the first available splice sites as they emerge and every exon is spliced in order on the growing mRNA transcript, the process is referred to as constitutive splicing.

Co-transcriptional mRNA Processing in Eukaryotes, Fig. 2 (a) Pre-mRNA splicing reaction. (1) Schematic of pre-mRNA highlighting the cis-acting sequences required for splicing. (2) The two trans-esterification steps of the splicing reaction. The arrows indicate the two nucleophilic attacks at phosphate groups (yellow). In the first step, the 2′ OH of the branched adenosine attacks the 5′ splice site, resulting in a free exon 1 and the intron lariat–exon 2 intermediate. In the second step, the 3′ OH of exon 1 attacks the 3′ splice site, resulting in ligated exons and an excised intron lariat. (b) Spliceosome assembly. Formation of a catalytically active spliceosome requires multiple steps. First, U1 snRNA interacts with the 5′ SS on the pre-mRNA. Afterwards, U2 snRNA base pairs with the BP, causing the highly conserved adenosine residue at the BP to bulge. Next, the U4/U5/U6 tri-snRNP is recruited to the spliceosome, resulting in a catalytically active spliceosome. Afterwards, the U6 snRNA displaces U1 at the 5′ SS. Owing to the interactions of the U2 and U6 snRNAs, the 5′ SS and BP are now close to one another, which positions the pre-mRNA for the first catalytic step. This reaction forms an unusual lariat intermediate and releases the 5′ exon. Subsequently, a series of rearrangements occurs that ultimately allows the 3′ OH of the newly released exon to attack the last nucleotide of the intron at the 3′ SS, joining the two exons and releasing the intron lariat (Wahl et al. 2009)

However, the splice sites used can vary widely among pre-mRNAs. Different exonic sequences can be spliced either in or out, generating different coding sequences and hence different translated proteins. This process is known as alternative splicing (AS) (Fig. 3). AS can consist of the following: (1) exon inclusion or exon skipping, in which one or several exons are either kept in the mature mRNA or removed from it along with their flanking introns; (2) alternative 5′ SS or 3′ SS usage, which extends or shortens an exon; and (3) intron retention, in which an intron is maintained in the mature mRNA. While AS is less prevalent in lower eukaryotes, high-throughput sequencing of RNA populations has shown that more than 90% of human genes undergo AS (Licatalosi and Darnell 2010). One example of the difference AS can make to a final protein product is the Fas receptor. Via AS, a Fas receptor can either localize in the cytoplasm or be membrane bound, and this subtle difference plays a vital role in cell death (Wang and Burge 2008). As such, AS can afford the cell an enormous amount of regulatable proteomic diversity from a small number of genes.

Co-transcriptional mRNA Processing in Eukaryotes, Fig. 3 Alternative splicing. A gene can encode many mRNA transcripts due to alternative processing or alternative splicing (AS) of its pre-mRNAs. Pre-mRNAs with differing exonic sequences spliced in or out to produce mRNAs with differing coding regions are shown. (1) Shows the spliceosome acting upon the first available splice sites, leading to the sequential splicing of every exon, or constitutive splicing. (2) Depicts two examples of exon skipping and exclusion. (3) Indicates an example of intron retention, where an intron fails to be spliced out, remains in the mature mRNA transcript, and may be encoded in the final protein product (Black 2003; Licatalosi and Darnell 2010)

Determinants of AS consist of regulatory proteins (splicing factors) that bind to auxiliary cis-regulatory sites (enhancers and inhibitors) within the mRNA and can either increase or diminish the usage of different splice locations. Key splicing factors are Ser–Arg (SR) proteins and heterogeneous RNPs (hnRNPs), which load onto mRNAs near splice sites and recruit spliceosomal factors that can either enhance or inhibit the splicing of an exon. For example, hnRNP H bound to a G-rich sequence downstream of the 5′ SS promotes splicing by helping to assemble a spliceosomal complex; when hnRNP H is bound to a G-rich sequence within an exon, however, it inhibits splicing by sterically blocking access of the spliceosome (Chen and Manley 2009). The levels of the different splicing factors present in different tissues and cell types also greatly influence alternative splicing decisions. Recent high-throughput studies have estimated that more than 50% of alternatively spliced transcripts in human tissues are expressed at varying levels among different tissues (Chen and Manley 2009). Along those lines, AS is highly prevalent among neuronal and immunological genes, where diversity is known to be critical for function and survival (Licatalosi and Darnell 2010).
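To make concrete how AS generates diversity from a single gene, the following Python sketch enumerates the mRNA isoforms that exon inclusion or skipping alone could produce. It is a deliberately simplified illustration: the gene, its exon names, and the `enumerate_isoforms` function are hypothetical, and alternative splice site usage and intron retention are not modeled.

```python
from itertools import product


def enumerate_isoforms(exons):
    """List every exon combination reachable by cassette-exon skipping.

    exons: list of (name, is_cassette) pairs in 5'-to-3' order.
    Constitutive exons are always included; each cassette exon is
    either included or skipped.
    """
    choices = [(True, False) if is_cassette else (True,)
               for _, is_cassette in exons]
    isoforms = []
    for pattern in product(*choices):
        isoforms.append(tuple(name for (name, _), keep in zip(exons, pattern)
                              if keep))
    return isoforms


# Hypothetical gene: four exons, of which E2 and E3 are cassette exons.
toy_gene = [("E1", False), ("E2", True), ("E3", True), ("E4", False)]
isoforms = enumerate_isoforms(toy_gene)
for isoform in isoforms:
    print("-".join(isoform))
```

With n independent cassette exons the upper bound is 2^n isoforms, so two cassette exons give four mRNAs here. In a cell, splicing factors and elongation rate determine which of these combinations are actually produced and in what proportions.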


Control of AS is intimately linked to transcription. The components of the spliceosome and splicing factors such as the U1 snRNP and SR proteins are recruited to the CTD of RNAP II during transcription, influencing the commitment of assembly of the spliceosome at different splice sites. Transcription elongation rates, which can be modulated by promoters, can change the 3-D structure of an mRNA, therefore shifting the splice sites available for interactions with the splicing machinery. For example, slow elongation provides the spliceosome a longer window of time to form on weaker splicing signals and can lead to the inclusion of what is known as an alternative exon (Moore and Proudfoot 2009). More recently, structures in chromatin via nucleosome positioning and histone modifications have also been seen to influence splice site choice by their effects on transcription rate and splice site accessibility (Moore and Proudfoot 2009; Luco et al. 2011; see ▶ “Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of”). The downstream consequences of splicing within a cell extend beyond the diversity of sequences of the proteins that are synthesized and play a central role in altering the fate of an mRNA by changing the noncoding information included in the mRNA. For example, cis-sequences for mRNA-binding proteins that enhance mRNA export or direct mRNA localization can be included or excluded from the final mRNA by AS and thus determine how quickly and efficiently an mRNA is exported into the cytoplasm and where it localizes in a cell (see ▶ “mRNA Localization and Localized Translation”). Another effect of splicing is on the translatability of an mRNA. Spliced mRNAs more effectively recruit the translation machinery during the pioneer round of translation due to the presence of proteins, known as the exon junction complex, deposited during splicing (Moore and Proudfoot 2009). 
Splicing can also alter the decay of an mRNA through the inclusion of different 3′ untranslated regions (UTRs). mRNAs with shorter 3′ UTRs are generally more stable. The inclusion of a 3′ UTR that is a target for small regulatory RNAs, such as miRNAs or siRNAs, can also decrease the stability of an mRNA (Di Giammartino et al. 2011; see ▶ “RNA interference”). Inclusion or exclusion by AS of sequences that introduce stop codons or alter the location of the normal stop codon can also determine whether an mRNA is targeted by nonsense-mediated decay, an mRNA degradation pathway, and ultimately affect its expression (Moore and Proudfoot 2009). AS can also influence the site of 3′ end processing of an mRNA, which can affect its expression in the ways described above (and see below). Our view of the diversity that splicing regulation can provide has been greatly expanded by the recent explosion of high-throughput sequencing studies (Licatalosi and Darnell 2010). Because these alterations strongly influence the stability, location, and information expressed from mRNAs, it is becoming increasingly clear that AS is a critical juncture for the regulation of gene expression.

3′ End Processing
A defining characteristic of almost all eukaryotic mRNAs is the presence of 200–300 adenine nucleotides, or a poly(A) tail, at their 3′ end. This is the final product of what is called 3′ end processing. This almost universal modification of eukaryotic mRNAs (histone mRNAs are the exception) sets them apart from bacterial mRNAs, whose 3′ ends are formed by transcriptional termination. As with 5′ capping, the major functions of the poly(A) tail are to stabilize the mRNA and to modulate gene expression. The maturation of an mRNA’s 3′ end consists of two steps: cleavage and polyadenylation. During cleavage, the 3′ end of an mRNA is cut; subsequently, during polyadenylation, adenine nucleotides are added sequentially without the use of a DNA template. Like pre-mRNA splicing, processing of the 3′ end of an mRNA requires trans-acting factors that are guided by cis-sequences. In mammals, a large, complex protein machinery comprising more than fourteen different proteins is required for 3′ end processing.
The sequences within the mRNA that direct the 3′ end machinery are found towards the very end of the mRNA, within the 3′ UTR (Fig. 4). Three main signals indicate where along an mRNA cleavage and polyadenylation occur. These signals include a cleavage site delineated by a CA dinucleotide and two signals for polyadenylation, the polyadenylation signal (PAS) and the downstream element (DSE). The PAS is located 10–30 nucleotides upstream of the cleavage site and consists of a highly conserved AAUAAA hexamer. In some cases, multiple PAS sequences or weaker variants such as AUUAAA appear, allowing for the use of alternative polyadenylation sites. The DSE is found about 30 nucleotides downstream of the cleavage site. Beyond these primary signals, two auxiliary sequences, positioned upstream of the PAS and downstream of the DSE, enhance 3′ end processing by interacting with the processing machinery and regulatory factors. The 3′ end machinery itself is divided into several sub-complexes, each containing many different proteins. These sub-complexes include cleavage and polyadenylation specificity factor (CPSF), cleavage stimulation factor (CstF), cleavage factor I (CFIm), cleavage factor II (CFIIm), poly(A) polymerase (PAP), poly(A)-binding proteins (PABPs), and many others (Millevoi and Vagner 2010; Mandel et al. 2008; Fig. 4).

Co-transcriptional mRNA Processing in Eukaryotes, Fig. 4 3′ End processing signals and machinery. Sequences within the 3′ UTR of an mRNA direct the 3′ end machinery. These signals include a cleavage site delineated by a CA dinucleotide and two signals for polyadenylation, the polyadenylation signal (PAS) and the downstream element (DSE). Additionally, two auxiliary sequences, positioned upstream of the PAS and downstream of the DSE, enhance 3′ end processing by interacting with the processing machinery and regulatory factors. Arrows indicate the approximate nucleotide distance between signals. The 3′ end machinery itself is divided into several sub-complexes, each containing many different proteins: cleavage and polyadenylation specificity factor (CPSF), cleavage stimulation factor (CstF), cleavage factor I (CFIm), cleavage factor II (CFIIm), poly(A) polymerase (PAP), and the poly(A)-binding proteins (PABPs) (not shown here). Each sub-complex is depicted above the cis signal to which it binds (Mandel et al. 2008; Millevoi and Vagner 2010)
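The layout of the PAS and the cleavage site described in the text can be turned into a simple sequence scan. The Python sketch below looks for an AAUAAA or AUUAAA hexamer and then for a CA dinucleotide 10–30 nucleotides further downstream. Several points are assumptions made for illustration only: the distance is counted from the first nucleotide after the hexamer to the C of the CA, the DSE and auxiliary elements are not checked, and the function name and toy sequence are hypothetical. Actual poly(A) site prediction relies on considerably richer models.

```python
# Canonical hexamer and the weaker variant mentioned in the text.
PAS_HEXAMERS = ("AAUAAA", "AUUAAA")


def find_cleavage_candidates(rna, min_gap=10, max_gap=30):
    """Return (pas_start, hexamer, ca_start) triples using 0-based indices.

    A candidate is a PAS hexamer followed, min_gap to max_gap nucleotides
    after its 3' end, by a CA dinucleotide.
    """
    hits = []
    for i in range(len(rna) - 5):
        hexamer = rna[i:i + 6]
        if hexamer not in PAS_HEXAMERS:
            continue
        pas_end = i + 6  # index of the first nucleotide after the hexamer
        last_start = min(pas_end + max_gap, len(rna) - 2)
        for j in range(pas_end + min_gap, last_start + 1):
            if rna[j:j + 2] == "CA":
                hits.append((i, hexamer, j))
    return hits


# Toy 3' UTR fragment (hypothetical): PAS, a 15-nt spacer, then a CA site.
toy_utr = "GGG" + "AAUAAA" + "U" * 15 + "CA" + "GUGUGU"
cleavage_candidates = find_cleavage_candidates(toy_utr)
print(cleavage_candidates)
```

A transcript with more than one hexamer can return several candidates, which corresponds to the alternative polyadenylation sites mentioned above.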

Processing of the 3′ ends of mRNAs is again tightly coupled with transcription and depends upon the CTD of RNAP II to recruit 3′ end processing factors. For example, CPSF associates with the CTD early in the transcription cycle and then transfers onto the mRNA, binding to the PAS (Proudfoot et al. 2002). As transcription elongation progresses, the level of phosphorylated Ser5 on the CTD, which interacts with the capping machinery, decreases, and the level of phosphorylated Ser2 increases, making way for new interactions. The polyadenylation cleavage factor Pcf11 preferentially associates with phosphorylated Ser2 and helps to strengthen the association of the poly(A) machinery with the 3′ end of the mRNA, increasing the ability of the machinery to scan the mRNA and bind the cis elements for more efficient and accurate 3′ end processing (Buratowski 2009). Next, CstF binds to the DSE, followed by the two cleavage factors (CFIm and CFIIm), PAP, and other associated factors. Once this complex forms, CPSF cleaves the mRNA, and PAP then begins to add about 10 untemplated adenine nucleotides to the 3′ end of the newly cleaved mRNA. At this point, PABP binds to the short poly(A) tail, increasing the rate of polyadenylation. PABP also controls the ultimate length of the poly(A) tail by regulating the interaction between CPSF and PAP (Millevoi and Vagner 2010).

Following the same functional themes as 5′ capping and splicing, processing of the 3′ end has an immense influence on an mRNA and its expression. The processed 3′ end increases the stability of the mRNA because the poly(A) tail shields the 3′ end from ribonucleases in the cytoplasm. The processed 3′ end also promotes export of the mRNA to the cytoplasm and, as mentioned previously, increases the rate of translation by forming a circularized mRNA/protein complex that facilitates rapid reassociation of the ribosome with the mRNA (Mandel et al. 2008; Proudfoot et al. 2002). Interestingly, the presence of more than one polyadenylation signal, or of weaker signals, provides an opportunity for additional regulation via alternative polyadenylation. It has been suggested that more than 50% of human genes produce transcripts with alternative 3′ ends. In mRNAs with multiple polyadenylation signals, variation in where the 3′ end is processed can alter the protein-coding information, the information included in the 3′ UTR, or the length of the 3′ UTR. These alterations ultimately affect the protein formed as well as the properties of the mRNA, including its stability, localization, transport, and translation (Millevoi and Vagner 2010). For example, a genome-wide examination of human 3′ UTRs produced through alternative polyadenylation found that 52% of microRNA target sequences were located downstream of the first potential polyadenylation site. This suggests that inclusion or exclusion of small regulatory RNA-binding sites by alternative 3′ end processing can alter the expression of an mRNA via posttranscriptional mechanisms (see ▶ “RNA interference”). It has also been noted that organisms that are developing and growing rapidly have a greater proportion of mRNAs with short 3′ UTRs. As mentioned previously, mRNAs with shorter 3′ UTRs are generally more stable, which leads to a higher rate of translation and more rapid production of the proteins needed by the developing organism (Di Giammartino et al. 2011).
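The consequence of alternative polyadenylation for microRNA regulation can be sketched in a few lines of Python. Given the coordinates of microRNA target sites within a 3′ UTR, the code reports which sites remain in the mRNA when cleavage occurs at a proximal versus a distal poly(A) site. All site names, coordinates, and the `retained_sites` function are hypothetical values chosen for illustration; none of them come from a real transcript.

```python
def retained_sites(mirna_sites, cleavage_position):
    """Return names of target sites lying wholly upstream of the cleavage site.

    mirna_sites: dict mapping a site name to (start, end), given as 0-based,
    half-open coordinates within the 3' UTR.
    cleavage_position: coordinate at which the transcript is cleaved.
    """
    return sorted(name for name, (start, end) in mirna_sites.items()
                  if end <= cleavage_position)


# Hypothetical 3' UTR with three microRNA target sites and two poly(A) sites.
toy_sites = {"siteA": (40, 47), "siteB": (210, 217), "siteC": (480, 487)}
poly_a_sites = {"proximal": 120, "distal": 600}

retained = {label: retained_sites(toy_sites, position)
            for label, position in poly_a_sites.items()}
for label, names in retained.items():
    print(label, names)
```

In this toy case, the short isoform produced from the proximal site keeps only one of the three target sites, so it would escape regulation by microRNAs that bind the other two.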


Conclusion
Co-transcriptional RNA processing, including 5′ end capping, splicing, and 3′ end maturation, provides the cell with an enormous amount of genomic complexity, diversity, and regulation. These modifications strongly influence the stability, location, and information expressed from each mRNA. High-throughput sequencing studies of cellular RNAs have given us an unprecedented view of this capacity and place RNA at the center of gene expression regulation. It is, therefore, not surprising that deregulation of RNA processing has been observed in many diseases, such as cancers. These studies will continue to aid our understanding of the regulatory mechanisms of RNA processing and hence of disease.

Cross-References
▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of
▶ Cis-Regulation of Eukaryotic Transcription
▶ Gene Regulation
▶ mRNA Localization and Localized Translation
▶ RNA Interference

References Bentley DL (2005) Rules of engagement: co-transcriptional recruitment of pre-mRNA processing factors. Curr Opin Cell Biol 17(3):251–256 Black DL (2003) Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem 72:291–336 Buratowski S (2009) Progression through the RNA polymerase II CTD cycle. Mol Cell 36(4):541–546 Chen M, Manley JL (2009) Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 10(11):741–754 Di Giammartino DC, Nishida K, Manley JL (2011) Mechanisms and consequences of alternative polyadenylation. Mol Cell 43(6):853–866 Egloff S, Murphy S (2008) Cracking the RNA polymerase II CTD code. Trends Genet 24(6):280–288 Licatalosi DD, Darnell RB (2010) RNA processing and its regulation: global insights into biological networks. Nat Rev Genet 11(1):75–87


Luco RF et al (2011) Epigenetics in alternative pre-mRNA splicing. Cell 144(1):16–26 Mandel CR, Bai Y, Tong L (2008) Protein factors in pre-mRNA 3′-end processing. Cell Mol Life Sci CMLS 65(7–8):1099–1122 Millevoi S, Vagner S (2010) Molecular mechanisms of eukaryotic pre-mRNA 3′ end processing regulation. Nucleic Acids Res 38(9):2757–2774 Moore MJ, Proudfoot NJ (2009) Pre-mRNA processing reaches back to transcription and ahead to translation. Cell 136(4):688–700 Proudfoot NJ, Furger A, Dye MJ (2002) Integrating mRNA processing with transcription. Cell 108(4):501–512 Wahl MC, Will CL, Lührmann R (2009) The spliceosome: design principles of a dynamic RNP machine. Cell 136(4):701–718 Wang Z, Burge CB (2008) Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA 14(5):802–813

Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis Charles S. McHenry Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO, USA

Synopsis The E. coli replicase has enormous processivity, sufficient to synthesize over 100,000 bases without dissociation. Yet, on the lagging strand of the replication fork, Okazaki fragments of 1,000–2,000 bases are made every 1–2 s. This requires a mechanism to trigger the lagging strand polymerase to cycle in far less than one second to new primers synthesized at the replication fork. Several mechanisms have been proposed to explain how this might occur. The most prominent have been the collision model, where a direct collision with the 5′-end of the preceding Okazaki fragment triggers cycling, and the signaling model, where synthesis of a new primer at the fork triggers polymerase release and rebinding to the new primer. Kinetic studies indicate that release after collision is far too slow to support the rate of lagging strand synthesis. Experimental support for the signaling model has been obtained on synthetic mini-circle templates that contain high asymmetry in GC composition between the two strands, allowing the rate of lagging strand synthesis to be selectively perturbed. Furthermore, cycling can be induced by addition of exogenous primers, indicating that it is the presence of a new primer, not the action of primase, that provides the signal.

Introduction The Pol III HE has the processivity required to replicate >150 kb (Mok and Marians 1987a, b), and perhaps the entire E. coli chromosome, without dissociation, yet upon completion of each Okazaki fragment it must cycle efficiently to the next primer synthesized at the replication fork, at a rate faster than Okazaki fragment production. The rate of replication fork progression in E. coli is about 600 nt/s at 30 °C (Breier et al. 2005), approximately the rate of replicase progression on single-stranded templates (Johanson and McHenry 1982). Thus, most of the time in an Okazaki fragment cycle is spent on elongation and little time (0.1 s) remains for the holoenzyme to release, bind the next primer, and begin synthesis. A processivity switch must be present to increase the off-rate of the lagging strand polymerase by several orders of magnitude. There are two competing but nonexclusive models for the signal that throws the processivity switch. The first (signaling model) proposes that a signal is provided by synthesis of a new primer at the replication fork that induces the lagging strand polymerase to dissociate, even if the Okazaki fragment has not been completed (Wu et al. 1992b). The second (collision model) was originally proposed for T4 (Alberts et al. 1983) and then extended to the E. coli system (Leu et al. 2003). The collision model posits that the lagging strand polymerase replicates to the last nucleotide (Leu et al. 2003) or until the Okazaki fragment is nearly complete (Georgescu et al. 2009). A communication circuit that proceeds through the τ subunit has been proposed to sense the conversion of a gap to a nick, signaling release.
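The size of the required processivity switch can be estimated from the figures quoted in this Introduction. The Python sketch below is only an order-of-magnitude illustration: it assumes simple first-order dissociation (off-rate equal to the reciprocal of the mean bound lifetime), and the function names are hypothetical helpers, not part of any published analysis.

```python
import math

# Figures taken from the text above.
PROCESSIVE_RUN_NT = 150_000    # nucleotides replicated without dissociation (>150 kb)
FORK_RATE_NT_PER_S = 600.0     # fork progression in E. coli at 30 °C
RELEASE_BUDGET_S = 0.1         # time left to release, rebind, and restart


def off_rate_from_lifetime(mean_lifetime_s: float) -> float:
    """Assumption: single-step, first-order dissociation, so k_off = 1 / lifetime."""
    return 1.0 / mean_lifetime_s


def processivity_switch_fold() -> float:
    """Fold increase in off-rate needed to go from processive mode to cycling mode."""
    # Lower bound on bound lifetime in processive mode: run length / elongation rate.
    residence_s = PROCESSIVE_RUN_NT / FORK_RATE_NT_PER_S
    k_off_processive = off_rate_from_lifetime(residence_s)
    k_off_cycling = off_rate_from_lifetime(RELEASE_BUDGET_S)
    return k_off_cycling / k_off_processive


if __name__ == "__main__":
    fold = processivity_switch_fold()
    print(f"required off-rate increase: {fold:.0f}-fold "
          f"({math.log10(fold):.1f} orders of magnitude)")
```

Under these assumptions the off-rate must rise about 2,500-fold, a little over three orders of magnitude, which is consistent with the "several orders of magnitude" stated above. Because >150 kb is a lower bound on processivity, the true factor is likely larger.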


A third model for cycling has been proposed that attributes dissociation of the lagging strand polymerase to the inability of the dimeric polymerase to rotate around the template strand once per helical turn, resulting in highly supercoiled product (Kurth et al. 2013). Shortly after the discovery that τ dimerizes the leading and lagging strand polymerases, it was recognized that negative superhelical torque could be an issue for the leading strand, but that rotation within single-stranded DNA would rapidly dissipate superhelical tension in the lagging strand (McHenry et al. 1988). However, Marians observed that coupled leading and lagging strand rolling circle replication could generate leading strand products of approximately 150,000 bases without the addition of topoisomerases (Mok and Marians 1987a, b). This suggested that if this torque was generated, it could be relieved by mechanisms that did not involve topoisomerase action. Cozzarelli and colleagues considered the issue in more detail (Ullsperger et al. 1995) and discounted the notion of extensive precatenane interwinding of the leading and lagging strand product created by rotation of the lagging strand polymerase about the leading strand to remove the torque. As one plausible solution, they proposed that the 3′-end of the growing leading strand could occasionally be released from the active site of the polymerase to allow dissipation of superhelicity (Ullsperger et al. 1995). It would appear that the ring-like structure of the β sliding clamp might allow rotation of the helix through its central pore without dissociation, maintaining processive leading strand synthesis. Another possible mechanism for torque release would be dissipation of leading strand torque by release of the lagging strand polymerase during cycling to the primer for synthesis of the next Okazaki fragment. This mechanism would appear cumbersome, but not impossible, because of the large mass and radius of gyration of the Pol III HE in the viscous milieu of the cell. However, if cycling involves DnaX as the sensor that initiates cycling (section Cycling of the Lagging Strand Polymerase in E. coli is Directed Exclusively by a Modified Signaling Model), it would be attached to the priming apparatus, precluding this mechanism.

Evidence that a modification of the signaling model is exclusively used to drive cycling in E. coli is presented in the last section of this essay.

Cycling Models Provided by Bacteriophages T4 and T7 In the more fully characterized systems provided by the replication apparatus of bacteriophages T4 and T7, signaling through synthesis or the availability of a new primer appears to play an important role, with the collision pathway playing a backup role (Hamdan et al. 2009). A handoff of the nascent tetraribonucleotide primer to the T7 polymerase is effected by direct primase-polymerase interaction (Kato et al. 2004). It takes time for synthesis of a new primer, release of the lagging strand polymerase, and initiation of synthesis on a new primer. Differing views have been presented on how T7 overcomes this delay on the lagging strand to allow leading and lagging strand replication to remain coordinated (Lee et al. 2006; Pandey et al. 2009). In one model, it is proposed that the helicase, and thus leading strand synthesis, is halted during slow primer synthesis (Lee et al. 2006). In the other model, it is proposed that the lagging strand primer is synthesized before it is needed and held in a priming loop, close to the replication fork, facilitating handoff to the DNA polymerase (Pandey et al. 2009). This model, if generally applicable, could explain the double loops sometimes observed at replication forks by electron microscopy (Chastain et al. 2003). In the latter model, it is proposed that the reactions remain coordinated because the lagging strand polymerase elongates faster than the leading strand polymerase (Pandey et al. 2009). In T4, primer synthesis does not halt progression of the leading strand polymerase. Primer synthesis occurs by two mechanisms: (1) dissociative, with primase releasing from its helicase association during primer synthesis, or (2) processive, whereby primase remains associated and a second loop is formed on the lagging strand DNA (Manosas et al. 2009; Yang et al. 2006). The sliding clamp and clamp loader increase use of the processive/looping mechanism, and it was suggested that gp32 (T4 SSB) might further increase use of the processive looping mechanism in the natural system by facilitating handoff of the nascent pentaribonucleotide primer (Manosas et al. 2009). A proposal was made that the clamp loader/clamp interaction with a new primer might be the key signal required for release of the lagging strand polymerase (Yang et al. 2004). More recent work supports that proposal (Chen et al. 2013). The size distribution of short Okazaki fragments that derive from signaling-induced cycling is affected by the concentration of the clamp loader but is independent of polymerase concentration, as expected for a processive, recycled lagging strand polymerase.

Does the OB Fold Provide the Processivity Switch for Cycling? In the structure of Pol III α complexed to DNA, an OB fold is located close to the primer terminus (Wing et al. 2008). Because OB folds commonly bind to ssDNA, a proposal was made that it could be part of the sensing network (Bailey et al. 2006; Wing et al. 2008). Consistent with this hypothesis, the ssDNA binding portion of Pol III was localized to a C-terminal region of α that contains the OB fold element (McCauley et al. 2008). A test of the importance of the OB fold motif was made using a mutant in which three basic residues located in the β1-β2 loop were changed to serine (Georgescu et al. 2009). No ssDNA binding was observed in the mutant, indicating a diminution in affinity. However, even the wild-type polymerase bound ssDNA extremely weakly, near the limit of detection in the assays used (Kd  8 mM). The processivity of the mutant polymerase was decreased by the β1-β2 loop mutations, an effect that was rescued by the presence of the τ-complex (Georgescu et al. 2009). The latter observation would seem to suggest that although the OB fold contributes to ssDNA affinity and processivity, it is not the processivity sensor, or at least that the residues mutated are not the key interactors. The OB fold might bind to the nick generated by completion of an Okazaki fragment, as seen in human ligase 1 (Pascal et al. 2004), inducing a non-processive conformation. Alternatively, the OB fold might act in concert with other binding changes as part of a more complex signaling network. It is possible that the entire polymerase active site is the processivity switch. Steitz and colleagues (Wing et al. 2008) have elegantly demonstrated a conformational change in α induced by substrate binding in which the movement of several elements places the β2 binding domain in a position where it can productively interact with the β2 clamp on DNA. Follow-on studies with a Gram-positive polymerase suggest this observation is general (Evans et al. 2008; Wing 2010). The geometry and spatial constraints around the active site when the exiting template is double-stranded might make insertion of the last nucleotide energetically unfavorable. Upon insertion, the product might lose affinity for the active site, triggering a reversal of the conformational changes that occurred upon primer-template and dNTP binding, causing the β2 binding domain to be pulled away, and switching the polymerase to a low processivity mode. The presence of an unliganded polymerase domain serves to decrease the affinity of the C-terminus of Pol III α for β2 (Kim and McHenry 1996), consistent with this view. The genetic screen for mutants that led to a loss of the dominant-negative phenotype of D403E α (▶ DNA Polymerase III Structure, C-terminal Domains, τ-binding Domain) revealed a mutant at a conserved alanine in the interior of the β2 binding domain (A887E) that no longer binds β2 tightly. Future studies should use this mutant to address whether the communication circuit that modulates β2 affinity flows through this region. Perhaps a bulky residue changes the presentation of the β2 binding loop or β2 binding surface, decreasing affinity and, if relevant to cycling, the dissociation rate. This type of cycling, induced by collision with a downstream 5′-end, would be relevant for the release of Pol III HE from repaired DNA, but not at the replication fork during Okazaki fragment synthesis, because the release is too slow (section The Rate of Polymerase Dissociation Upon Collision with the Preceding Primer Is Too Slow to Support Okazaki Fragment Synthesis).


What Is τ’s Role in Cycling? It has been proposed that τ acts as a sensor for conversion of a gap to a nick upon completion of Okazaki fragment synthesis (Leu et al. 2003; Lopez de Saro et al. 2003a, b). When bound to ssDNA, τ was proposed to lose contact with the C-terminus of α, leaving the C-terminus free to contact β. When the τ-ssDNA contact is lost, τ was proposed to bind the C-terminus of α, displacing β and allowing the polymerase to cycle to the next Okazaki fragment (Leu et al. 2003; Lopez de Saro et al. 2003a, b). This model did not consider earlier data showing that τ and β2 did not compete significantly for polymerase binding and that the critical binding site for β2 was internal and not at the extreme C-terminus of α (Dalrymple et al. 2001; Kim and McHenry 1996; Wijffels et al. 2004). Follow-up work rigorously confirmed that the internal β2 site is the one required for processive replication (Dohrmann and McHenry 2005). Interestingly, replacement of the internal β2 binding site with the consensus sequence identified by informatics (Dalrymple et al. 2001) increased the affinity 120-fold, while the same change at the C-terminus had no effect on β2 binding but caused a 2,700-fold decrease in τ binding (Dohrmann and McHenry 2005). This suggests that either the internal site provides the β-interaction sequence in a unique conformation or additional local contacts dictate the specificity of binding. While it is possible that the C-terminus of α could interact with β2 for some undiscovered ancillary purpose, all current mutational effects can be attributed to defects in τ interaction or minor structural perturbations. Recent functional and structural studies reveal only participation of the internal Pol III α β-binding site and occlusion of the C-terminal site (Jergic et al. 2013; Liu et al. 2013; Ozawa et al. 2013; Toste et al. 2013). Furthermore, a consensus β2 binding motif is not found at the C-terminus of many replicative DnaEs, whereas the internal sequence is conserved (Kurth et al. 2013). The function of τ in enabling rapid cycling remains unclear. The earliest observation pertaining to a possible role of τ in cycling was made by Marians and colleagues (Wu et al. 1992a). They saw a τ-dependent acceleration that was codependent upon primase and, along with an observed lessening of the proportion of short Okazaki fragments, proposed that τ accelerated transit from an old Okazaki fragment to a new one (Wu et al. 1992a). Proposals for τ being the sensor for gap to nick conversion were derived from equilibrium measurements in which τ decreased the affinity for a nick relative to a gap (Leu et al. 2003). Photo-cross-linking experiments were performed using diazirine side chains at a large number of positions in the template ahead of the polymerase and in positions within a duplex with which the elongating polymerase collides; they detected only α cross-links and none to τ (Dohrmann et al. 2011). Because irradiation of diazirines generates carbenes that will insert into any amino acid, the absence of τ cross-links can be interpreted with confidence, eliminating the possibility that τ directly senses gap to nick conversion.

The Rate of Polymerase Dissociation Upon Collision with the Preceding Primer Is Too Slow to Support Okazaki Fragment Synthesis Most of the measurements assessing pathways to trigger polymerase release and cycling have been performed using equilibrium measurements. While a decreased affinity might be consistent with a role in accelerating release, the real issue is whether the lagging strand polymerase can be triggered to release in 0.1 s or less upon collision with the preceding Okazaki fragment’s 5′-end. This issue was pursued using a surface plasmon resonance assay in which immobilized primers are placed a set distance from an oligonucleotide that models the 5′-end of the preceding Okazaki fragment. It was found that filling in a gap to the final nucleotide accelerates release, but the rate is far too slow to support the physiological rate of Okazaki fragment synthesis (Dohrmann et al. 2011), approximately 2–3 min rather than

100 DNA repair enzymes that survey for damage. In some settings with experimental animals administered chemical carcinogens, adduct levels in excess of 1/10⁵ bases have been reported (Kim and Guengerich 1990) and are probably related to the tumors. However, even with modern technology it is difficult to estimate cancer or other risks based on levels of DNA adduct formation. As mentioned earlier, it is important to evaluate any levels of formation of DNA adducts in the context of the background level in healthy individuals.

Cross-References ▶ Bioactivation of Carcinogens ▶ Damage DNA, Natural Products that ▶ Damaged DNA, Analysis of ▶ Depurination ▶ DNA Damage as a Therapeutic Strategy ▶ DNA Damage by Endogenous Chemicals ▶ DNA Damage, Types of ▶ DNA Repair ▶ Oxidative DNA Damage

References Billson HA, Harrison KL, Lees NP et al (2009) Dietary variables associated with DNA N7-methylguanine levels and O⁶-alkylguanine DNA-alkyltransferase activity in human colorectal mucosa. Carcinogenesis 30:615–620 Chastain PD 2nd, Nakamura J, Rao S et al (2010) Abasic sites preferentially form at regions undergoing DNA replication. FASEB J 24:3674–3680 Kim D-H, Guengerich FP (1990) Formation of the DNA adduct S-[2-(N7-guanyl)ethyl]glutathione from ethylene dibromide: effects of modulation of glutathione and glutathione S-transferase levels and the lack of a role for sulfation. Carcinogenesis 11:419–424

DNA Damage, Practical Screening for Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms Genotoxicity assays; Mutation assays

Definition Biological assays are used to measure the potential genotoxicity (including mutagenicity) of chemicals in the regulatory arena, particularly in the areas of pharmaceuticals and industrial and agricultural chemicals. One of the main applications is predicting whether a chemical is likely to cause cancer in humans.

Discussion Genotoxicity screening is necessary in many situations, including the safety assessment of new industrial chemicals and drug candidates. With regard to the latter, it is very difficult to advance a genotoxic entity, and the process requires rapid, low-cost analysis of many compounds. In the absence of knowledge about the specific modified bases that could be obtained, the liquid chromatography-mass spectrometry and other assays mentioned under “Analysis of damaged DNA” are not practical.


The most commonly used approach is a biological one, mutagenesis in the Salmonella typhimurium base-pair and frameshift tester strains developed by Prof. Bruce Ames (Ames et al. 1973) (Fig. 1a). These are reversion assays involving the screening of S. typhimurium auxotrophs for mutation (reversion) to prototrophy, in terms of histidine synthesis. This assay was developed almost 40 years ago and is the standard for initial screening for the Food and Drug Administration and other regulatory agencies. The assay is traditionally done with plates and counting of colonies, but newer, commercially available variations can be done in solution, with colorimetric responses in microtiter plate formats. Another biological assay involves the bacterial SOS response to DNA damage and is called the “umu test” or “chromotest,” depending upon the bacterium and the genes utilized in the system (Shimada et al. 1994). DNA damage invokes the SOS response: the RecA protein binds to single-stranded DNA that results from blockage of DNA polymerase in bacteria, and the activated RecA promotes self-cleavage of the LexA repressor (Fig. 1b). Inactivation of LexA starts a cascade that involves increased transcription of >30 genes, including umuC and umuD (which code for a DNA polymerase that has the ability to bypass bulky damage). In these assays, a plasmid (pSK1002) carrying the umuC/D regulatory region fused to a lacZ reporter is introduced into the bacteria. Bacteria (S. typhimurium) containing the pSK1002 plasmid are treated with a DNA-damaging agent. As in the Ames test, a key feature is the inclusion of an enzyme system (e.g., liver extract) capable of activating test chemicals. More recently, individual cytochrome P450 enzymes, glutathione transferases, and sulfotransferases have been expressed in the umu and Ames test systems and used to help define the roles of individual enzymes in bioactivation and detoxification.

Another approach has been used in several settings over the past 30 years, the ³²P-postlabeling method developed by Prof. Kurt Randerath (Randerath et al. 1981) (Fig. 2). In this method, DNA is isolated, digested to nucleotides, and then phosphorylated using ³²P, a high-energy beta emitter. The modified nucleotides are separated from the normal ones by multidimensional thin-layer chromatography (or high-performance liquid chromatography in some cases). The advantages of this system are that it is highly sensitive and that some quantitation can be done in the absence of known adducts (with known adducts, it can be quantitative). Disadvantages include the need to work with high levels of radiation, the inability of the system to define new adducts, and the difficulty in quantitation of unknown adducts due to the extreme variability of kinase activity in the labeling of adducted nucleosides/nucleotides.

DNA Damage, Practical Screening for, Fig. 1 Bacterial assays. (a) The Ames test for bacterial mutagens (Ames et al. 1973). (b) The umu test for genotoxicity (Shimada et al. 1994)

DNA Damage, Practical Screening for, Fig. 2 ³²P-Postlabeling assay for DNA damage (Randerath et al. 1981)
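For the plate-based reversion assay described above, colony counts are typically compared with the spontaneous revertant background on solvent-control plates. The Python sketch below illustrates one way to do this. The colony counts are invented for illustration, the function names are hypothetical, and the twofold threshold is a commonly used rule of thumb assumed here rather than a criterion stated in this entry.

```python
from statistics import mean


def fold_over_background(treated_counts, control_counts):
    """Ratio of mean revertant colonies on treated plates to solvent-control plates."""
    background = mean(control_counts)
    if background <= 0:
        raise ValueError("control plates must show spontaneous revertants")
    return mean(treated_counts) / background


def is_positive(treated_counts, control_counts, threshold=2.0):
    """Assumed rule of thumb: call a dose positive at >= `threshold`-fold background."""
    return fold_over_background(treated_counts, control_counts) >= threshold


if __name__ == "__main__":
    # Hypothetical triplicate plate counts (revertant colonies per plate).
    control = [22, 25, 19]
    doses = {
        0.1: [24, 27, 21],
        1.0: [48, 52, 44],
        10.0: [130, 121, 139],
    }
    for dose, counts in doses.items():
        fold = fold_over_background(counts, control)
        verdict = "positive" if is_positive(counts, control) else "negative"
        print(f"dose {dose}: {fold:.1f}-fold over background ({verdict})")
```

With these made-up counts the lowest dose stays near background (about 1.1-fold), while the two higher doses score positive (about 2.2-fold and 5.9-fold), the kind of dose-dependent increase a reversion assay is designed to reveal. A real evaluation would also consider replicate variability, cytotoxicity, and results with and without the activating enzyme system.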

Cross-References ▶ Bioactivation of Carcinogens ▶ Damaged DNA, Analysis of ▶ DNA Damage, Frequency of ▶ DNA Damage, Relevance to Cancer ▶ DNA Damage, Types of ▶ DNA Repair Polymerases ▶ DNA Replication, Chemical Biology of ▶ Nucleotide Excision Repair ▶ Selectivity of Chemicals for DNA Damage

References Ames BN, Durston WE, Yamasaki E et al (1973) Carcinogens are mutagens: a simple test system combining liver homogenates for activation and bacteria for detection. Proc Natl Acad Sci U S A 70:2281–2285 Randerath K, Reddy MV, Gupta RC (1981) ³²P-labeling test for DNA damage. Proc Natl Acad Sci U S A 78:6126–6129 Shimada T, Oda Y, Yamazaki H et al (1994) SOS function tests for studies of chemical carcinogenesis in Salmonella typhimurium TA 1535/pSK1002, NM2009, and NM3009. In: Adolph KW (ed) Methods in molecular genetics, vol 5, Gene and chromosome analysis. Academic, Orlando, pp 342–355


DNA Damage, Relevance to Cancer Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms Mutation and cancer

Definition One of the reasons why the study of DNA damage is of high interest is in relation to cancer. Many of the accepted human carcinogens are believed to use mutagenesis as a mode of action.

Discussion Although it is not possible to establish exactly what fraction of human cancer is due to DNA damage and somatic mutation, there are several major reasons to believe that DNA damage is a cause of human cancer. The reasons are not necessarily given in order of importance.

Individuals with established deficiencies in DNA repair are very prone to cancer. Some examples include xeroderma pigmentosum, Cockayne’s syndrome, and trichothiodystrophy.

There is strong evidence that UV light damage causes skin cancer in humans. This is generally accepted to be the result of various DNA lesions produced by cyclization and other reactions. Further, the deficiencies in DNA repair (see above) are linked with this damage.

Animal studies on chemical carcinogens show strong associations between the extent of DNA adduct formation and the incidence of tumors (Bechtel 1989).

Some of the classic epidemiological studies of chemicals causing human cancer are best understood in the context of DNA adducts that are known to be formed from these agents. Examples include arylamines (Rehn 1895), aflatoxin B1 (Busby and Wogan 1984), and vinyl chloride (Creech and Johnson 1974). Strongly suggestive associations can also be made for tobacco-induced cancers (Hecht 2008).

The ability of DNA adducts to cause mutations can be demonstrated experimentally. It is possible to prepare oligonucleotides with many defined DNA lesions (see ▶ “Synthesis of Modified Oligonucleotides”). Replication can be done with DNA polymerases, and the patterns of misincorporation can be established by several methods. An even more powerful approach involves site-specific mutagenesis (see ▶ “Site-Specific Mutagenesis”), in which the modified oligonucleotide is introduced into a cell and errors in replication can be detected.

Collectively, these arguments provide strong evidence that DNA damage is an issue in human cancer. This somatic mutation theory of cancer was proposed by Bauer in 1928 (Bauer 1928).

The assignment of chemicals to a list of human carcinogens is not a trivial process, and these lists can be provocative, due to economic considerations or medical needs for a drug. The list of “Known Human Carcinogens” for the US National Toxicology Program includes (in alphabetical order): aflatoxins, alcoholic beverage consumption, 4-aminobiphenyl, analgesic mixtures containing phenacetin, inorganic arsenic compounds, asbestos, azathioprine, benzene, benzidine, beryllium and beryllium compounds, 1,3-butadiene, 1,4-butanediol dimethanesulfonate (Myleran®), cadmium and cadmium compounds, chlorambucil, 1-(2-chloroethyl)-3-(4-methylcyclohexyl)-1-nitrosourea (MeCCNU), bis(chloromethyl) ether and technical-grade chloromethyl methyl ether, chromium hexavalent compounds, coal tar pitches, coal tars, coke oven emissions, cyclophosphamide, cyclosporin A, diethylstilbestrol, dyes metabolized to benzidine, environmental (second-hand) tobacco smoke, erionite, estrogens (steroidal), ethylene oxide, hepatitis B virus, hepatitis C virus, human papilloma viruses (some genital-mucosal types), melphalan, methoxsalen with ultraviolet A therapy (PUVA), mineral oils (untreated and mildly treated), mustard gas, 2-naphthylamine, neutrons (ionizing radiation), nickel compounds, radon, crystalline silica (respirable size), smokeless tobacco, solar radiation, soots, strong inorganic acid mists containing sulfuric acid, exposure to sunlamps or sunbeds, tamoxifen, 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD, “dioxin”), thiotepa, thorium dioxide, tobacco smoking, vinyl chloride, ultraviolet radiation (broad spectrum UV radiation), wood dust, and X-radiation and gamma-radiation.

Usually the assignment is based on either strong epidemiological evidence (in humans) or extensive experience with multiple experimental models. In another classification system (e.g., International Agency for Research on Cancer), there are five categories: group 1 “Carcinogenic to humans” (107 compounds), group 2A “Probably carcinogenic to humans” (58 compounds), group 2B “Possibly carcinogenic to humans” (249 compounds), group 3 “Not classifiable as to its carcinogenicity to humans” (512 compounds), and group 4 “Probably not carcinogenic to humans” (1 compound).

Cross-References ▶ Bioactivation of Carcinogens ▶ Damage DNA, Natural Products that ▶ DNA Damage as a Therapeutic Strategy ▶ DNA Damage by Endogenous Chemicals ▶ DNA Damage, Frequency of ▶ DNA Damage, Practical Screening for ▶ DNA Damage, Types of ▶ Site-Specific Mutagenesis

References Bauer KH (1928) Mutationstheorie der Geschwulst-Entstehung. Springer, Berlin Bechtel DH (1989) Molecular dosimetry of hepatic aflatoxin B1-DNA adducts: linear correlation with hepatic cancer risk. Regul Toxicol Pharmacol 10:74–81 Busby WF, Wogan GN (1984) Aflatoxins. In: Searle CE (ed) Chemical carcinogens, vol 2, 2nd edn. American Chemical Society, Washington, DC, pp 945–1136 Creech JL Jr, Johnson MN (1974) Angiosarcoma of liver in the manufacture of polyvinyl chloride. J Occup Med 16:150–151 Hecht SS (2008) Progress and challenges in selected areas of tobacco carcinogenesis. Chem Res Toxicol 21:160–171 Rehn L (1895) Über Blasentumoren bei Fuchsinarbeitern. Arch Klin Chir 50:588–600

DNA Damage, Types of Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synopsis DNA is not a pristine collection of the canonical four bases in Watson-Crick geometry. Modification of the bases and sugars in DNA is an ongoing process, with endogenous as well as exogenous damage (Lawley 1984). Oxygen and light are major causes of DNA damage. Although the frequency of total modifications can be considered low (i.e., one in 10⁵ to 10⁶ bases), the number is more than one per average gene. The modifications not only block DNA polymerases but also cause miscoding and the introduction of mutations into the genome. Paradoxically, drugs that modify DNA are used to treat cancers, although there is a finite risk of introducing DNA damage that could lead to a new cancer. More than 100 human genes are involved in DNA repair, indicating the importance of the damage, and deficiencies in these genes have been shown to have major phenotypic consequences.

Introduction DNA is the genetic material of organisms, and its integrity is essential for the maintenance of life. However, DNA is under constant attack from chemical and physical agents from inside and outside cells. Thus, DNA is not a pristine mixture of the four nucleobases (six including 5-methylcytosine and 5-hydroxymethylcytosine) but contains traces of a myriad of damaged products.

Chemistry and Biology of DNA Damage The sources of DNA damage are many, including both exogenous and endogenous stressors. The exogenous stressors include physical agents such as ultraviolet (UV) light and other radiation, plus a plethora of chemicals such as pollutants, carcinogens in food, and even chemotherapeutic agents. Although one can make conscious lifestyle decisions that may reduce exposure to exogenous chemicals and physical agents that damage DNA, one can never completely avoid these agents, e.g., UV light. Aside from the exogenous stressors, there are many chemical modifiers of DNA within cells, and an individual has less control over these processes. These endogenous factors include alkylating agents (e.g., S-adenosylmethionine), oxidants (reactive oxygen and nitrogen species), and electrophilic products generated from oxidative and other intracellular reactions (e.g., Michael acceptors derived from lipids). Before discussing the types of modification in more detail, a review of the significance of DNA modification is in order. The mutation theory of cancer goes back at least to Bauer in 1928 (Bauer 1928), predating the recognition of DNA as the carrier of the genetic code. One of the first connections between DNA modifications and mutations was mustard gas, which was shown to modify nucleic acid bases and also to be mutagenic. The question of how modifications cause miscoding was already considered in the classic paper of Watson and Crick in 1953. The relationship between carcinogens and mutagens was also unclear for many years, but in vitro activation systems were very helpful in this regard (Ames et al. 1973) (although the caveat should be included that not all carcinogens are genotoxins). The study of mutagenesis is of relevance not only regarding cancer but also teratology (birth defects), atherosclerosis, and a number of other diseases (Ramos and Moorthy 2005).


The relationship between DNA modification and mutation was unambiguously established using an approach termed site-specific mutagenesis by Essigmann and his colleagues (Basu and Essigmann 1988). The approach involves the preparation of a vector containing a modified DNA base (in the first example, O⁶-methylguanine (Green et al. 1984)), its introduction into cells, and the analysis of the resulting mutations. Further, many molecular details of how modified DNA bases interact with DNA polymerases have now been established through combinations of structural and kinetic studies. It is now generally agreed that DNA modifications can lead to a variety of potentially detrimental genotoxic effects, including base pair and frameshift mutations, strand breaks, and also complex events, e.g., deletions and recombination events. The incidence of DNA damage is inherently low, but a single problem in a gene could produce a dramatic biological effect. The problem is not so much that the process would lead to an inactive protein, because the other allele is not damaged in the same way; a mistake leading to a protein with abnormal function or regulation is more serious. An example is the oncogenes, in which modification at particular sites leads to loss of control and ultimately aberrant signaling in a cell. The tumor suppressor gene p53 has a number of functions, and loss of one allele can be quite detrimental (heterozygous mice have a high tumor incidence). Thus, one can ask how many adducts are a problem. Indeed, this question is very relevant to practical toxicology and the setting of exposure limits for potential carcinogens, which has considerable economic as well as health implications. The answer is that in principle even a single DNA adduct could conceivably result in cancer. However, the frequency of detrimental biological events associated with each DNA modification is low and, in addition, >100 genes are present to repair damage.
In settings with experimental animals, a level of DNA modification of 1/10⁵ bases is considered high and 1/10⁴ would be dramatic. Many individual adducts are present at levels of 1/10⁷ bases, some in apparently healthy individuals. What has been shown in experimental animal models is that there is often a correlation of the levels of DNA adducts with tumor incidence, e.g., with aflatoxin B1. Moreover, in people a higher incidence of DNA adducts (with aflatoxin bound) is seen in areas where people have exposure to aflatoxin and there is a high incidence of liver cancer (Wogan 1992). A number of drugs that have been used to treat cancer act by modification of DNA, i.e., alkylating agents. The concept is that these drugs modify DNA and block replication; DNA is rather quiescent in most tissues but is replicated rapidly in tumors. Unfortunately, the alkylation of DNA is not totally specific for the tumors, and DNA in other tissues is modified. Related to this phenomenon is an increased risk of future tumors associated with cancer chemotherapy (Kaldor et al. 1990). The general mechanism of DNA modification follows several routes. Some chemicals are inherently reactive, e.g., certain direct alkylating agents used as drugs. Many other chemicals are rather inert but are activated by oxidation, reduction, or other processes to electrophilic agents or radicals that modify DNA. The list of chemicals thus activated includes oxygen. These activated chemicals vary in their stability, but in general they have half-lives on the order of at least seconds and can migrate to the nucleus to react with DNA. Some of the bulkier chemical entities have affinity for DNA due to intercalation and other physical forces, e.g., epoxides derived from aflatoxins and polycyclic aromatic hydrocarbons (Johnson and Guengerich 1997). Reaction then occurs with DNA, yielding a covalent product. The chemistry of these modifications generally fits into two categories: radical reactions (involving unpaired electrons) and the reactions (2-electron reactions) of electrophiles with the nucleophilic sites of DNA. Some of the DNA adducts thus formed are inherently stable and rather persistent, if they are not removed through DNA repair.
In other cases the chemistry is unstable, and hydrolytic or rearrangement processes convert the initial adduct into another, with a possible increase or decrease in the potential for miscoding.
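To put the adduct frequencies quoted earlier in this entry (1/10⁴, 1/10⁵, and 1/10⁷ bases) on a per-cell scale, the short calculation below converts each frequency into an approximate number of adducts per cell. It is only an illustrative sketch: the genome size is an assumption that does not come from the text (a diploid human cell with roughly 1.2 × 10¹⁰ bases), and the function name is hypothetical.

```python
# Assumption (not stated in the text): a diploid human cell carries roughly
# 6e9 base pairs, i.e. about 1.2e10 individual bases.
ASSUMED_BASES_PER_CELL = 12_000_000_000


def adducts_per_cell(bases_per_adduct, total_bases=ASSUMED_BASES_PER_CELL):
    """Expected number of adducts per cell when one base in
    `bases_per_adduct` carries a modification."""
    if bases_per_adduct <= 0:
        raise ValueError("bases_per_adduct must be positive")
    return total_bases / bases_per_adduct


# Adduct levels quoted in the text, expressed as "one adduct per N bases".
LEVELS = {
    "dramatic (1/10^4)": 10**4,
    "high (1/10^5)": 10**5,
    "seen in apparently healthy individuals (1/10^7)": 10**7,
}

for label, n in LEVELS.items():
    print(f"{label}: about {adducts_per_cell(n):,.0f} adducts per cell")
```

Under this assumed genome size, even the lowest level quoted corresponds to on the order of a thousand adducts of a single type per cell, which is consistent with the point that the frequency of detrimental events per adduct must be low.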


A plethora of genes and enzymes for handling DNA damage exist, not only to repair the damage but also to signal cells to arrest and quiesce. Information about these processes is contained in other parts of this book. One of the most striking pieces of evidence that DNA damage is important in humans is the strong predisposition to cancer and neurological problems of individuals with defects in some of these genes, e.g., in xeroderma pigmentosum.

Cross-References ▶ Bioactivation of Carcinogens ▶ Damage DNA, Natural Products that ▶ Damaged DNA, Analysis of ▶ DNA Damage as a Therapeutic Strategy ▶ DNA Damage by Endogenous Chemicals ▶ DNA Damage, Frequency of ▶ DNA Damage, Practical Screening for ▶ DNA Damage, Relevance to Cancer ▶ DNA Repair ▶ DNA Replication, Chemical Biology of ▶ DNA Replication ▶ Electrophiles, Types of ▶ Selectivity of Chemicals for DNA Damage ▶ Site-Specific Mutagenesis ▶ Ultraviolet Light DNA Damage

References Ames BN, Durston WE, Yamasaki E et al (1973) Carcinogens are mutagens: a simple test system combining liver homogenates for activation and bacteria for detection. Proc Natl Acad Sci U S A 70:2281–2285 Basu AK, Essigmann JM (1988) Site-specifically modified oligodeoxynucleotides as probes for the structural and biological effects of DNA-damaging agents. Chem Res Toxicol 1:1–18 Bauer KH (1928) Mutationstheorie der Geschwulstenstehung. Springer, Berlin Green CL, Loechler EL, Fowler KW et al (1984) Construction and characterization of extrachromosomal probes for mutagenesis by carcinogens: site-specific incorporation of O⁶-methylguanine into viral and plasmid genomes. Proc Natl Acad Sci U S A 81:13–17 Johnson WW, Guengerich FP (1997) Reaction of aflatoxin B1 exo-8,9-epoxide with DNA: kinetic analysis of covalent binding and DNA-induced hydrolysis. Proc Natl Acad Sci U S A 94:6121–6125 Kaldor JM, Day NE, Pettersson F et al (1990) Leukemia following chemotherapy for ovarian cancer. N Engl J Med 322:1–6 Lawley PD (1984) Carcinogenesis by alkylating agents. In: Searle CE (ed) Chemical carcinogens, vol 1, 2nd edn. American Chemical Society, Washington, DC, pp 325–484 Ramos KS, Moorthy B (2005) Bioactivation of polycyclic aromatic hydrocarbon carcinogens within the vascular wall: implications for human atherogenesis. Drug Metab Rev 37:595–610 Wogan GN (1992) Aflatoxins as risk factors for hepatocellular carcinoma in humans. Cancer Res Suppl 52:2114–2118

DNA Heteroduplex ▶ Mismatch Repair

DNA Methylation ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

DNA Methylation and Cancer Scheherazade Khan and Angela K. Hilliker Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms Gene silencing; Oncogene and proto-oncogene regulation; Transcriptional silencing

Definition Since DNA methylation influences whether or not a gene is transcribed, misregulation of DNA methylation can lead to abnormal cell function and disease. In fact, aberrant changes in DNA methylation patterns are tightly linked to several cancers. There are several known examples where changes in methylation patterns correspond to changes in gene expression that lead to the development of cancer. In some areas of a cancerous cell’s genome, there is too little DNA methylation, while in other areas, there is too much. Both types of changes in methylation patterns can alter the transcription of genes that contribute to the formation of cancer.

Discussion Cancer usually arises from several mutations in DNA. Certain genes help protect cells from cancer (tumor suppressor genes), while other genes control cell division and other cellular characteristics. These genes, called proto-oncogenes, are also carefully controlled in cells. If proto-oncogenes are expressed too highly, then they can contribute to cancer and are then referred to as oncogenes. Many mutations that lead to cancer affect the regulation of these types of genes. For example, increasing the expression of a transcription factor that in turn activates genes needed for the cell to grow could contribute to cancer. The transcription factor, which is important for normal cells, is usually kept in check; however, cellular changes that result in the transcription factor’s increased transcription might lead cells to grow uncontrollably, as they do in cancer. DNA methylation is another powerful way to affect transcription, as DNA methylation patterns can be inherited from progenitor cell to progeny cell. The proper methylation pattern must be faithfully preserved to allow a cell to continue to function like its progenitor. It is not surprising, then, that disruptions in normal patterns of DNA methylation have been identified as hallmarks of cancerous cells (Kisseljova and Kisseljov 2005). Cancerous cells exhibit general, global demethylation (hypomethylation) of DNA, which will promote transcription of genes (including proto-oncogenes) that can contribute to cancer, as described below. Alternatively, cancerous cells can also show increased methylation (hypermethylation) of promoters of tumor suppressor genes, which will inhibit transcription of


genes that normally help prevent cancerous characteristics (Das and Singal 2004). Altering the balance of DNA methylation in either direction can have dire consequences on the cell. DNA methylation often correlates with gene silencing (see ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of). Sometimes, this silencing protects the genome from unwanted rearrangements. For example, DNA methylation helps suppress repetitive DNA sequences or mobile DNA elements (transposons). These elements could lead to rearrangements in the DNA that can be detrimental to the stability of the genome or to the health of the cell. These DNA elements are often methylated and transcriptionally silent in normal cells. However, many cancer cells show broad demethylation and transcriptional activation of mobile DNA elements, leading to an increase in chromosomal rearrangements, mutations, and general genomic instability that is characteristic of cancerous cells. DNA methylation is also correlated with imprinting patterns (see ▶ Genomic Imprinting). Thus, loss of methylation can also lead to the activation of genes that are normally silenced through imprinting. In some cases, activation of both copies rather than one copy of an imprinted gene can have deleterious effects. Disruption of the uniparental expression of imprinted genes, due to a general loss of DNA methylation, has been correlated with the generation of cancerous tumors. Another deleterious result of global demethylation is the activation of tissue-specific genes in a nonspecific manner. Some of the tissue-specific genes that are demethylated in cancerous tumors in humans have been found to be determinants of metastatic potential; that is, they contribute to the spread of cancer to other locations in the body. An example is HOX11, a gene that is normally only expressed during embryogenesis to aid in the formation of the skeleton and the spleen. 
In cells affected by T-cell acute lymphoblastic leukemia, however, the promoter of this gene is demethylated, and HOX11 is abnormally expressed in bone marrow (Watt et al. 2000). HOX11 then turns on many other genes and its particular activity in


the bone marrow is strongly correlated with development of this form of leukemia. In addition to global hypomethylation, cancerous cells can exhibit increased methylation (hypermethylation) of specific DNA sequences. This hypermethylation often leads to the inactivation of tumor-suppressing genes. For instance, several different types of renal tumors exhibit the hypermethylation of the tumor suppressor gene von Hippel-Lindau (VHL) (Baylin and Jones 2011). Understanding the role of DNA methylation in promoting cancer is important for advancing both detection and treatment of cancer. Testing for hypermethylation of key promoters has the potential to be an important diagnostic tool (Roobol et al. 2011). Current research is aimed at developing a panel of common promoter methylation states that are characteristic of many different cancers. These changes could be screened from small tissue samples inexpensively via PCR-based assays. An understanding of the connections between DNA methylation and cancer may lead to new therapeutic tools. DNA methylation can be blocked or perhaps reversed and may be easier to correct than mutations in or rearrangements of the DNA. Potential therapies to alter DNA methylation are being tested, including the targeting of methyltransferases to silence certain parts of the genome or, conversely, using DNA methyltransferase inhibitors (DNMTi) to prevent aberrant patterns of hypermethylation in the DNA of cancerous cells.
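The idea of screening a panel of promoters for characteristic methylation states can be sketched in a few lines. The example below is only an illustration under stated assumptions: the signal counts are invented, the 0.2 and 0.8 cutoffs are hypothetical rather than clinical thresholds, and the function names are not taken from any assay software. The two genes are the examples named in the text (VHL and HOX11).

```python
def methylation_fraction(methylated_signal, unmethylated_signal):
    """Fraction of the assay signal that comes from methylated template."""
    total = methylated_signal + unmethylated_signal
    if total == 0:
        raise ValueError("no signal recorded for this promoter")
    return methylated_signal / total


def classify_promoter(fraction, low=0.2, high=0.8):
    """Label a promoter using hypothetical cutoffs (assumptions, not
    validated diagnostic thresholds)."""
    if fraction >= high:
        return "hypermethylated"
    if fraction <= low:
        return "hypomethylated"
    return "intermediate"


# Invented example values: (methylated signal, unmethylated signal).
panel = {
    "VHL": (90, 10),    # tumor suppressor promoter, heavily methylated here
    "HOX11": (5, 95),   # promoter that has lost most of its methylation here
}

for gene, (meth, unmeth) in panel.items():
    frac = methylation_fraction(meth, unmeth)
    print(f"{gene}: {frac:.2f} methylated -> {classify_promoter(frac)}")
```

A real panel would need cutoffs calibrated per promoter against matched normal tissue; the fixed thresholds here only show the shape of the comparison.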

References Baylin SB, Jones PA (2011) A decade of exploring the cancer epigenome – biological and translational implications. Nat Rev Cancer 11:726–734 Das PM, Singal R (2004) DNA methylation and cancer. J Clin Oncol 22:4632–4642 Kisseljova NP, Kisseljov FL (2005) DNA demethylation and carcinogenesis. Biochemistry 70:743–752 Roobol MH, Haese A, Bjartell A (2011) Tumour markers in prostate cancer III: biomarkers in urine. Acta Oncol 50(Suppl 1):85–89 Watt PM, Kumar R, Kees UR (2000) Promoter demethylation accompanies reactivation of the HOX11 protooncogene in leukemia. Genes Chromosomes Cancer 29:371–377


DNA Mismatch ▶ Mismatch Repair

DNA Mispair ▶ Mismatch Repair

DNA Polymerase III Structure

Charles S. McHenry Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO, USA

Synopsis By itself, the polymerase catalytic subunit of the DNA polymerase III holoenzyme (Pol III HE), α, exhibits no special properties that hint of the Pol III HE’s high catalytic efficiency, accuracy, and enormous processivity. These properties are gained by association with other proteins through a series of distinct protein interaction domains. A PHP domain at the N-terminus of Pol III α binds the proofreading subunit, ε. A typical Mg++-dependent polymerase catalytic domain has a fold similar to the DNA polymerase β (Pol X family). Adjacent to the polymerase domain is the β-binding domain. Interaction of this domain with the β2 sliding clamp processivity factor, together with an ε-β2 interaction, provides the primary determinants of the enzyme’s processivity. The C-terminus contains two domains, one an OB fold that may bind single-stranded DNA and a τ-binding domain that binds the τ-subunit of the DnaX complex. X-ray crystal structures of Pol III α in the apoenzyme form, bound to DNA, and, separately, ε and τ have provided significant insight into the function of this prototypical replicase.

Introduction The α-subunit of Pol III has been classified as a Class C polymerase, distinct from eukaryotic polymerases and the other polymerases found in E. coli. Functional and genetic experiments have demonstrated the modular nature of Pol III α, and recent structures have refined the definition of its domain boundaries and provided valuable insight into its function (Fig. 1). The extra domains, appended to the polymerase, confer special properties that include the ability to bind to and communicate with other replication proteins.

Polymerase Domains Regions of α with distinct biochemical activities initially helped to delineate the domain organization. Systematic mutagenesis of conserved acidic residues permitted the identification of the three acidic side chains (E. coli (Eco) D401, D403, and D555) that coordinate two Mg++ ions, facilitating catalysis (Pritchard and McHenry 1999). Antimutator and nucleotide selection mutants, presumably associated with polymerase function, helped to further define the limits of the polymerase domains. The apoenzyme structures of the full-length Thermus aquaticus (Taq) α-subunit provided significant insight (Bailey et al. 2006). A big surprise that emerged from this study was that the palm domain has the basic fold of the X family of DNA polymerases that includes the slow, non-processive Pol βs, placing bacterial replicases as a special class within that family. A structure of a version of Eco α truncated within the β-binding domain also exhibited a Pol β-like fold with perturbations in the active site which are presumably corrected upon substrate binding (Lamers et al. 2006). Like all polymerases, Pol III α contains palm, thumb, and finger domains, in the shape of a cupped right hand. Superposition of the α-palm with that of mammalian Pol β aligns the three identified catalytic residues of α (Bailey et al. 2006) with those of Pol β (Sawaya et al. 1997). The palm also contains a universally conserved lysine (Eco K552) that forms a salt


DNA Polymerase III Structure, Fig. 1 Modular organization of Pol III α. The names and colors of the domains shown are from Bailey et al. (2006) except that their C-terminal domain was further divided into the OB fold and τ-binding domains. The residue numbers that define domain borders in E. coli α are shown above the bar in black. The position of antimutator mutations (marked below the dnaE gene in blue) and mutations selected to discriminate dideoxynucleotides (red above the bar) are indicated (Fijalkowska and Schaaper 1993; Hiratsuka and Reha-Krantz 2000; Oller and Schaaper 1994; Vandewiele et al. 2002). It is likely that these influence either the rate of polymerization or base selection and reside within the polymerase active site. Sde mutations (McHenry 2011) that likely interfere with initiation complex formation are shown in magenta above the bar. Mutator mutations (not shown) in dnaE (Maki et al. 1991; Strauss et al. 2000; Vandewiele et al. 2002) also map within the polymerase domain (palm, thumb, fingers) with the exception of two temperature-sensitive alleles (74 and 486) that exhibit a slight mutator phenotype at the permissive temperature (Vandewiele et al. 2002). dnaE74 maps to position 134 within the PHP domain and dnaE486 maps to position 885 between the β2-binding site and the HhH element within the β2-binding domain. A presumed template slippage mutant maps to residue 133 (Bierne et al. 1997)

bridge with the last phosphate of the primer. The fingers contain most of their conserved residues at the interface with the palm domain, including four arginines that appear to form a preinsertion nucleotide binding site that binds the incoming dNTP before transfer to the actual catalytic binding site (Bailey et al. 2006). A ternary complex of a dideoxy-terminated primer-template, incoming dNTP, and full-length Taq α provided broader insight into the function of Class C polymerases (Wing et al. 2008). Among the template-primer-induced conformational changes is movement of the thumb domain toward the DNA bound by the palm, which is driven by interaction of two thumb α-helices in parallel with the DNA to make contacts with the sugar-phosphate backbone in the minor groove. The fingers also move, and a portion that rotates ca. 15°, together with the palm and the 3′-terminus of the primer, forms a pocket that positions the incoming dNTP above the three essential catalytic aspartates. The γ-phosphate contacts the Gly-Ser motif (Eco 363–364) found in all polymerases and an additional arginine (Wing et al. 2008). The polymerase contacts the template from its terminus to a position 12 nucleotides behind the primer terminus, in excellent agreement with photo-cross-linking experiments (Reems et al. 1995). The finger domain creates a wall at the end of the primer terminus that forces a sharp kink in the emerging template strand. Finally, a 30° bend is induced in two nucleotides behind the primer terminus by loops that connect the palm and thumb domains (Wing et al. 2008). In the ternary complex structure of a Gram-positive Pol III, two novel elements, not found in Eco or Taq α, were identified (Evans et al. 2008). The four-Cys Zn++ binding motif, discovered by Neal Brown and colleagues and shown to be required for activity (Barnes et al. 1998), serves an apparent structural function and is not part of the catalytic site (Evans et al. 2008). DNA binding through the thumb domain comes primarily from two β-strands that interact with the minor groove. Packing appears tighter around the primed template in this enzyme than in other polymerases (Evans et al. 2008). Proposals were made that this packing made unique contributions to preserving fidelity (Evans et al. 2008). However, no support was provided for that hypothesis. The discrimination against RNA primers made uniquely by PolC Gram-positive polymerases is a more likely consequence of the tight packing observed (Sanders et al. 2010). This should be


explored experimentally. The wider diameter of the A-form RNA-DNA duplex appears not to fit well into the DNA-binding channel. Thus, an RNA-DNA duplex might not bind strongly, or conformational changes that are coupled to template-primer binding might not occur completely, leading to improper formation of the catalytic site. The presentation of the Gram-positive PolC structure concluded that the DNA template-primer was bound in a significantly different orientation relative to that observed in Taq α (Evans et al. 2008). However, as pointed out by Wing (2010), there is no significant difference if one aligns the structures using the invariant catalytic acidic residues as a reference.

PHP Domain Koonin and colleagues observed homology between the N-terminal region of bacterial Pol IIIs (called the PHP domain) and a subclass of phosphoesterases (Aravind and Koonin 1998). This domain is found in a wide variety of bacterial polymerases, including bacterial Pol βs. Initially, it was proposed that this region might be involved in pyrophosphate hydrolysis (Aravind and Koonin 1998), but such an activity has not been found (Lamers et al. 2006). Recently, this domain has been ascribed a second proofreading activity that is Zn++-dependent (Stano et al. 2006) and also identified as the domain that binds the classical Mg++-based proofreading subunit, ε (Wieczorek and McHenry 2006). Deletion experiments initially restricted the domain to residues 1-(255–320), and recent crystal structures provided further precision (Bailey et al. 2006; Lamers et al. 2006; Wieczorek and McHenry 2006). The structure of Taq α revealed a cluster of nine residues in the PHP domain that included eight of the ligands predicted from informatics approaches (Wieczorek and McHenry 2006) to chelate three metal ions (Bailey et al. 2006), as shown directly for the E. coli YcdX homologue (Teplyakov et al. 2003). In a structure of a Gram-positive PolC, the PHP domain binds three metals, using the same nine ligands expected from the homologous Taq PHP structure.


Kuriyan and colleagues, from the structure of the Eco α, pointed out a channel between the polymerase active site and the proposed PHP active site (Lamers et al. 2006). The PHP domain contains a long loop (Eco 107–116) that interacts extensively with the thumb. There may also be contacts between the PHP domain and DNA (Wing et al. 2008). This would explain the dependence of polymerase activity on the integrity of the PHP domain. Deletion of 60 N-terminal PHP residues or a D43A point mutation within the proposed active site abolishes polymerase activity (Kim et al. 1997). In bacterial Pol βs that contain associated active PHP domains (Banos et al. 2008; Blasius et al. 2006), deletion of the polymerase domain abolishes PHP activity, an indication of the reciprocal nature of the interaction (Nakane et al. 2009). Lamers and colleagues have underscored a structural role for the PHP domain of E. coli Pol III α, demonstrating cooperative unfolding and a destabilizing effect of mutations in the site that is homologous to the metal-binding site of the PHP domain of Tth Pol III α (Barros et al. 2013).

An Interaction Between the ε-Subunit of Pol III and β Contributes to Processive Replication An exciting recent discovery revealed an interaction between a β-binding motif within the ε proofreading subunit and the second protein interaction cleft within dimeric β2 (Jergic et al. 2013; Ozawa et al. 2013; Toste et al. 2013). In the work by Dixon and collaborators, the interaction was discovered by an assay that challenges the processivity of Pol III HE, strand displacement synthesis (Jergic et al. 2013). It was found that ε was required for this reaction to take place and that exonuclease-inactive mutants would substitute. Although minor contributions of ε in polymerization had previously been observed, this provided the first robust assay where the dependency was sufficient to permit careful study of contributing factors. This led to the discovery that the α-binding C-terminal portion of ε, lacking the exonuclease domain altogether, could suffice. Alignments revealed a weak β-binding motif in the C-terminal tail of ε, immediately downstream of the exonuclease domain.


Construction of mutants that would be expected to increase the affinity of the motif for β increased ε function in strand displacement, and those expected to decrease affinity abolished it. Physical measurements of the binding interaction correlated. Replicative complexes reconstituted with heterodimers of β that contained only one protein binding cleft failed to benefit from the synergy of interaction with both β and ε. Thus, both protein binding sites of β2 are engaged during elongation, making them unavailable for interaction with other proteins. In an independent parallel study by Lamers and colleagues, the initial observation that led to the same discovery was that the presence of ε led to enhanced affinity of Pol III α for β2 (Toste et al. 2013). Elegant and technically challenging chemical cross-linking experiments revealed multiple interactions between Pol III α, ε, and β2. Among them were interactions between the β-binding site of Pol III α and β and between a similar β-binding motif within ε and β. Mutation of the putative β-binding site of ε abolished its interaction with β. Single-molecule experiments on immobilized λ DNA templates indicated that the rates and processivity of the replicase are increased as the affinity of ε for α increased by directed favorable amino acid changes in the ε- and β-binding motif. This suggested that the interactions stabilized the polymerization mode (Jergic et al. 2013). Elongation assays that permitted primer extension and subsequent degradation upon nucleotide depletion suggested that the ε-β interaction also enhanced exonuclease activity (Toste et al. 2013). However, this latter assay was complex in that two competing events were being observed. An assay that measures proofreading activity directly is needed. In any case, these studies point out interesting issues of the roles of α-β interaction and α-ε interaction on the balance between polymerization and proofreading.
Mutants enhanced or weakened in these interactions should be analyzed for their direct effects on the fidelity of DNA replication and may reveal if one of the interactions needs to be broken to permit proofreading to occur. Significant structural advances have also been made that suggest the positioning within the


replicative complex (Ozawa et al. 2013; Toste et al. 2013). A crystal structure showed that most C-terminal regions of ε bind the PHP domain of Pol III α on one face, and photo-cross-linking suggests that the unstructured tether leading to the exonuclease domain wraps around to the opposing side of the PHP domain. Chemical cross-linking studies were consistent with this view, revealing interactions of the C-terminal domain of ε on one face of the PHP domain and interaction of the exonuclease domain of ε with the PHP domain on the opposing face (Toste et al. 2013). Combining all cross-linking data with available structures of ε, β2 on DNA, and Pol III α complexed with DNA led to models where the ε proofreading domain is sandwiched between the PHP domain of Pol III α and β (Ozawa et al. 2013; Toste et al. 2013).

C-Terminal Domains Analysis of α deletion mutants revealed that C-terminal domains were responsible for interactions with both τ and β (Kim and McHenry 1996a, b). An essential β2 interaction site (Eco 920–924) (Dalrymple et al. 2001) was verified by mutagenesis, coupled with functional, genetic, and biophysical experiments (Dohrmann and McHenry 2005). Deletion of residues from the C-terminus abolished τ-binding, but N-terminal deletions extending into the fingers domain also diminished τ-binding, suggesting either extensive τ interactions or structural perturbations (Kim and McHenry 1996a). More detailed mutagenesis studies (Dohrmann and McHenry 2005) have identified the C-terminus as critical for τ-binding, but the binding site has not been firmly identified. The C-terminal region of α contains additional domains identified by similarity to elements found in other DNA-binding proteins. These include a helix-hairpin-helix motif (HhH) (Eco 836–854) (Bailey et al. 2006; Doherty et al. 1996) and an OB fold (Eco 964–1078) (Bailey et al. 2006; Theobald et al. 2003). In yet another variation, Synechocystis encodes the N-terminal two-thirds and the C-terminal one-third of α as two distinct


proteins that are spliced posttranslationally by an intein-mediated reaction (Evans et al. 2000).

β2-Binding Domain A structure of Taq α revealed a well-organized β2-binding domain with dsDNA binding capability. DNA binding occurs through a HhH motif and its flanking loops (Wing et al. 2008). The β2-binding consensus sequence is presented in a loop that is oriented adjacent to dsDNA as it exits the polymerase in the correct position to bind β2 as it surrounds DNA. The β2-binding domain rotates 20° and swings down into position as the enzyme binds DNA (Wing et al. 2008), a reorientation that is apparently driven energetically by the HhH motif binding to DNA and likely coupled to conformational changes of the thumb, palm, OB fold, and PHP domains. A later structure of a Gram-positive Pol III showed a domain similar to the β2-binding domain containing HhH motifs that bind dsDNA and a β2-binding consensus sequence (Evans et al. 2008). The domains C-terminal to the β2-binding domain in E. coli and Taq are either absent in the Gram-positive PolC or moved forward in front of the polymerase domain.

OB Fold Domain The structure of the ternary complex of Taq α with primer-template and incoming dNTP reveals a striking conformational change that includes the OB fold moving to a position near the single-stranded template distal to the primer (Wing et al. 2008). The path of the emerging template, which can be traced from electron density of the ribose-phosphate backbone, appears to come close to the OB fold. The element of the OB fold that comes closest to the ssDNA template, the β1-β2 loop, often contributes to ssDNA binding (Theobald et al. 2003). However, the β1-β2-β3 face that commonly interacts with ssDNA (Theobald et al. 2003) appears to “face away” from the emerging template and to face the τ-binding domain. So ssDNA binding by the OB fold, if it occurs, either occurs in a nonstandard way or requires further rearrangements as the template strand becomes longer or when additional protein subunits are present. A ternary complex structure of a Gram-positive Pol III includes an OB fold,


but it shows no interactions with the 5-nucleotide single-stranded portion of the template. Whether it interacts with a longer template awaits experimental verification. Binding (Leu et al. 2003) and kinetic data (Dohrmann et al. 2011) suggest that the “sensor” for completion of an Okazaki fragment can distinguish a single-nucleotide gap from a nick. It is not clear if the OB fold can bind a single-nucleotide gap strongly. However, there is at least one example of an OB fold binding to a nick (Pascal et al. 2004). The role of the OB fold is further discussed in another essay (▶ Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis, Does the OB Fold Provide the Processivity Switch for Cycling?).

τ-Binding Domain The second half of the C-terminus in the Taq α structure revealed a domain containing an incompletely conserved sequence that binds weakly to β2 but is not required for processive replication in vitro or function in vivo. This domain is loosely packed against the OB fold, with many polar residues in the interface (Bailey et al. 2006). Mutational studies support the importance of this subdomain in binding τ (Dohrmann and McHenry 2005). Further information regarding possible sites of interaction of this extreme C-terminal domain with τ was derived from a genetic screen for suppression of a dominant-lethal phenotype of an extrachromosomally expressed dnaE that formed initiation complexes but was unable to elongate because of a mutation in a critical catalytic aspartate (D403E) (Lindow and McHenry, unpublished). Suppression could result from any defect that decreased the ability of dnaE D403E to form initiation complexes and compete with wild-type Pol III. Among the full-length, properly folded proteins obtained from this screen (sde mutants, Fig. 2), it would be expected that most mutants would interfere with interaction with partners required for initiation complex formation or slow the relative rate of initiation complex formation. Mutants were analyzed for specific interaction defects. Two mutations (W1134C and L1157Q) appeared to severely diminish the interaction with τ, consistent with a role of the C-terminal domain in an interaction with τ (Fig. 2).


DNA Polymerase III Structure, Fig. 2 Sde mutants mapped onto the Pol III α structure. Mutations that suppress the dominant-negative phenotype of Eco dnaE (D403E) are mapped onto the position of the corresponding residues in the Taq α structure (Bailey et al. 2006). Labels specify the E. coli position number and residue. A deletion mutation that included the β2-binding loop is shown in cartoon form (helix and loop not in space fill). Point mutations are shown in contrasting colors. Domains are colored using the convention of Fig. 1: PHP (yellow), palm (purple), thumb (green), fingers (blue), β-binding (orange), OB fold (bright red), and τ-binding (dark red)

A crystal structure has recently been determined of Thermus aquaticus Pol III α complexed with the C-terminal domain of Taq τ (τc) (Liu et al. 2013). Taq τc does not appear to be homologous to Eco τc, so direct comparison of the τ portion of the structure is not possible. Taq Pol III α interacts with τc through its C-terminal helix and the preceding loop (Liu et al. 2013). This interaction is consistent with mutagenesis data from E. coli (Dohrmann and McHenry 2005) and the sde50 mutant (Fig. 2).

Structure of Proteins Associated with Pol III
Structures of most of the subunits associated with Pol III within the holoenzyme are presented elsewhere in this entry (▶ Bacterial DNA Replicases), and only those not described elsewhere are presented briefly here. Structures of DnaXcx contained a truncated DnaX subunit which excluded domains IV and V of τ. Subsequent structures of these domains (Jergic et al. 2007; Su et al. 2007) show that domain V has a fold that is unique to τ-subunits (Su et al. 2007).

Available information suggests that domain V binds to α through a largely unstructured C-terminus (Jergic et al. 2007; Su et al. 2007). Deletion of seven amino acids from the C-terminus of τ abolishes its ability to bind to α (Jergic et al. 2007). Structures for χ bound to ψ are also available (Gulbis et al. 2004), and the structure of the ψ-peptide bound to γ3δδ′χ permitted models to be formulated regarding its placement (Simonetta et al. 2009), but additional structural information is required to reveal whether other protein-protein interactions orient χψ. A recent structure revealed the binding site for the C-terminal tail of SSB on χ (Marceau et al. 2011). The site is characteristic of SSB sites on other proteins (Shereda et al. 2008) and is dominated by the interaction of the C-terminal Phe side chain with a hydrophobic pocket and an ionic interaction between the terminal carboxyl group and R128 of χ. The C-terminus of SSB also contains acidic residues, which form ionic bonds with peripheral basic residues within the SSB sites of other proteins. The residues in χ that form ionic bonds to these acidic residues are predicted to be K132 and R135 (Naue et al. 2010; Simonetta et al. 2009). An R128A mutation abolished SSB binding (Simonetta et al. 2009).


Cross-References
▶ Bacterial DNA Replicases
▶ Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis

References
Aravind L, Koonin EV (1998) Phosphoesterase domains associated with DNA polymerases of diverse origins. Nucleic Acids Res 26:3746–3752
Bailey S, Wing RA, Steitz TA (2006) The structure of T. aquaticus DNA polymerase III is distinct from eukaryotic replicative DNA polymerases. Cell 126:893–904
Banos B, Lazaro JM, Villar L, Salas M, De Vega M (2008) Editing of misaligned 3′-termini by an intrinsic 3′-5′ exonuclease activity residing in the PHP domain of a family X DNA polymerase. Nucleic Acids Res 36:5736–5749
Barnes MH, Leo CJ, Brown NC (1998) DNA polymerase III of Gram-positive eubacteria is a zinc metalloprotein conserving an essential finger-like domain. Biochemistry 37:15254–15260
Barros T, Guenther J, Kelch B, Anaya J, Prabhakar A, O’Donnell M, Kuriyan J, Lamers MH (2013) A structural role for the PHP domain in E. coli DNA polymerase III. BMC Struct Biol 13:8
Bierne H, Vilette D, Ehrlich SD, Michel B (1997) Isolation of a dnaE mutation which enhances recA independent homologous recombination in the Escherichia coli chromosome. Mol Microbiol 24:1225–1234
Blasius M, Shevelev I, Jolivet E, Sommer S, Hübscher U (2006) DNA polymerase X from Deinococcus radiodurans possesses a structure-modulated 3′→5′ exonuclease activity involved in radioresistance. Mol Microbiol 60:165–176
Dalrymple BP, Kongsuwan K, Wijffels G, Dixon NE, Jennings PA (2001) A universal protein-protein interaction motif in the eubacterial DNA replication and repair systems. Proc Natl Acad Sci U S A 98:11627–11632
Doherty AJ, Serpell LC, Ponting CP (1996) The helix-hairpin-helix DNA-binding motif: a structural basis for non-sequence-specific recognition of DNA. Nucleic Acids Res 24:2488–2497
Dohrmann PR, McHenry CS (2005) A bipartite polymerase-processivity factor interaction: only the internal β binding site of the α subunit is required for processive replication by the DNA polymerase III holoenzyme. J Mol Biol 350:228–239
Dohrmann PR, Manhart CM, Downey CD, McHenry CS (2011) The rate of polymerase release upon filling the gap between Okazaki fragments is inadequate to support cycling during lagging strand synthesis. J Mol Biol 414:15–27
Evans TC Jr, Martin D, Kolly R, Panne D, Sun L, Ghosh I, Chen L, Benner J, Liu XQ, Xu MQ (2000) Protein trans-splicing and cyclization by a naturally split intein from the dnaE gene of Synechocystis species PCC6803. J Biol Chem 275:9091–9094
Evans RJ, Davies DR, Bullard JM, Christensen J, Green LS, Guiles JW, Pata JD, Ribble WK, Janjic N, Jarvis TC (2008) Structure of polC reveals unique DNA binding and fidelity determinants. Proc Natl Acad Sci U S A 105:20695–20700
Fijalkowska IJ, Schaaper RM (1993) Antimutator mutations in the α subunit of Escherichia coli DNA polymerase III: identification of the responsible mutations and alignment with other DNA polymerases. Genetics 134:1039–1044
Gulbis JM, Kazmirski SL, Finkelstein J, Kelman Z, O’Donnell ME, Kuriyan J (2004) Crystal structure of the chi:psi subassembly of the Escherichia coli DNA polymerase clamp-loader complex. Eur J Biochem 271:439–449
Hiratsuka K, Reha-Krantz LJ (2000) Identification of Escherichia coli dnaE (polC) mutants with altered sensitivity to 2′,3′-dideoxyadenosine. J Bacteriol 182:3942–3947
Jergic S, Ozawa K, Williams NK, Su XC, Scott DD, Hamdan SM, Crowther JA, Otting G, Dixon NE (2007) The unstructured C-terminus of the τ subunit of Escherichia coli DNA polymerase III holoenzyme is the site of interaction with the α subunit. Nucleic Acids Res 35:2813–2824
Jergic S, Horan NP, Elshenawy MM, Mason CE, Urathamakul T, Ozawa K, Robinson A, Goudsmits JM, Wang Y, Pan X, Beck JL, van Oijen AM, Huber T, Hamdan SM, Dixon NE (2013) A direct proofreader-clamp interaction stabilizes the Pol III replicase in the polymerization mode. EMBO J 32:1322–1333
Kim DR, McHenry CS (1996a) Biotin tagging deletion analysis of domain limits involved in protein-macromolecular interactions: mapping the τ binding domain of the DNA polymerase III α subunit. J Biol Chem 271:20690–20698
Kim DR, McHenry CS (1996b) Identification of the β-binding domain of the α subunit of Escherichia coli polymerase III holoenzyme. J Biol Chem 271:20699–20704
Kim DR, Pritchard AE, McHenry CS (1997) Localization of the active site of the α subunit of the Escherichia coli DNA polymerase III holoenzyme. J Bacteriol 179:6721–6728
Lamers MH, Georgescu RE, Lee SG, O’Donnell M, Kuriyan J (2006) Crystal structure of the catalytic α subunit of E. coli replicative DNA polymerase III. Cell 126:881–892
Leu FP, Georgescu R, O’Donnell ME (2003) Mechanism of the E. coli τ processivity switch during lagging-strand synthesis. Mol Cell 11:315–327
Liu B, Lin J, Steitz TA (2013) Structure of the Pol III α-τ(c)-DNA complex suggests an atomic model of the replisome. Structure 21:658–664
Maki H, Mo JY, Sekiguchi M (1991) A strong mutator effect caused by an amino acid change in the α subunit of DNA polymerase III of Escherichia coli. J Biol Chem 266:5055–5061
Marceau AH, Bahng S, Massoni SC, George NP, Sandler SJ, Marians KJ, Keck JL (2011) Structure of the SSB-DNA polymerase III interface and its role in DNA replication. EMBO J 30:4236–4247
McHenry CS (2011) DNA replicases from a bacterial perspective. Annu Rev Biochem 80:403–436
Nakane S, Nakagawa N, Kuramitsu S, Masui R (2009) Characterization of DNA polymerase X from Thermus thermophilus HB8 reveals the POLXc and PHP domains are both required for 3′–5′ exonuclease activity. Nucleic Acids Res 37:2037–2052
Naue N, Fedorov R, Pich A, Manstein DJ, Curth U (2010) Site-directed mutagenesis of the χ subunit of DNA polymerase III and single-stranded DNA-binding protein of E. coli reveals key residues for their interaction. Nucleic Acids Res 39:1398–1407
Oller AR, Schaaper R (1994) Spontaneous mutation in Escherichia coli containing the DnaE911 DNA polymerase antimutator allele. Genetics 138:263–270
Ozawa K, Horan NP, Robinson A, Yagi H, Hill FR, Jergic S, Xu ZQ, Loscha KV, Li N, Tehei M, Oakley AJ, Otting G, Huber T, Dixon NE (2013) Proofreading exonuclease on a tether: the complex between the E. coli DNA polymerase III subunits alpha, epsilon, theta and beta reveals a highly flexible arrangement of the proofreading domain. Nucleic Acids Res 41:5354–5367
Pascal JM, O’Brien PJ, Tomkinson AE, Ellenberger T (2004) Human DNA ligase I completely encircles and partially unwinds nicked DNA. Nature 432:473–478
Pritchard AE, McHenry CS (1999) Identification of the acidic residues in the active site of DNA polymerase III. J Mol Biol 285:1067–1080
Reems JA, Wood S, McHenry CS (1995) Escherichia coli DNA polymerase III holoenzyme subunits α, β and γ directly contact the primer template. J Biol Chem 270:5606–5613
Sanders GM, Dallmann HG, McHenry CS (2010) Reconstitution of the B. subtilis replisome with 13 proteins including two distinct replicases. Mol Cell 37:273–281
Sawaya MR, Prasad R, Wilson SH, Kraut J, Pelletier H (1997) Crystal structures of human DNA polymerase β complexed with gapped and nicked DNA: evidence for an induced fit mechanism. Biochemistry 36:11205–11215
Shereda RD, Kozlov AG, Lohman TM, Cox MM, Keck JL (2008) SSB as an organizer/mobilizer of genome maintenance complexes. Crit Rev Biochem Mol Biol 43:289–318
Simonetta KR, Kazmirski SL, Goedken ER, Cantor AJ, Kelch BA, McNally R, Seyedin SN, Makino DL, O’Donnell M, Kuriyan J (2009) The mechanism of ATP-dependent primer-template recognition by a clamp loader complex. Cell 137:659–671
Stano NM, Chen J, McHenry CS (2006) A coproofreading Zn(2+)-dependent exonuclease within a bacterial replicase. Nat Struct Mol Biol 13:458–459
Strauss BS, Roberts R, Francis L, Pouryazdanparast P (2000) Role of the dinB gene product in spontaneous mutation in Escherichia coli with an impaired replicative polymerase. J Bacteriol 182:6742–6750
Su XC, Jergic S, Keniry MA, Dixon NE, Otting G (2007) Solution structure of domains IVa and V of the τ subunit of Escherichia coli DNA polymerase III and interaction with the α subunit. Nucleic Acids Res 35:2825–2832
Teplyakov A, Obmolova G, Khil PP, Howard AJ, Camerini-Otero RD, Gilliland GL (2003) Crystal structure of the Escherichia coli YcdX protein reveals a trinuclear zinc active site. Proteins 51:315–318
Theobald DL, Mitton-Fry RM, Wuttke DS (2003) Nucleic acid recognition by OB-fold proteins. Annu Rev Biophys Biomol Struct 32:115–133
Toste RA, Holding AN, Kent H, Lamers MH (2013) Architecture of the Pol III-clamp-exonuclease complex reveals key roles of the exonuclease subunit in processive DNA synthesis and repair. EMBO J 32:1334–1343
Vandewiele D, Fernandez de Henestrosa AR, Timms AR, Bridges BA, Woodgate R (2002) Sequence analysis and phenotypes of five temperature sensitive mutator alleles of dnaE, encoding modified alpha-catalytic subunits of Escherichia coli DNA polymerase III holoenzyme. Mutat Res 499:85–95
Wieczorek A, McHenry CS (2006) The NH(2)-terminal php domain of the α subunit of the E. coli replicase binds the ε proofreading subunit. J Biol Chem 281:12561–12567
Wing RA (2010) Structural studies of the prokaryotic replisome. Thesis/Dissertation, Yale University, p 170
Wing RA, Bailey S, Steitz TA (2008) Insights into the replisome from the structure of a ternary complex of the DNA polymerase III α-subunit. J Mol Biol 382:859–869

DNA Recombination, Mechanisms of
Sergio Santa Maria¹ and Bertrand Llorente²
¹Space Biosciences Division, NASA Ames Research Center, Mountain View, CA, USA
²Aix-Marseille Universite, Marseille, France

Synopsis
DNA double-strand breaks are one of the most deleterious forms of DNA damage and may arise from exposure to environmental agents or as intermediates during normal cellular processes. If breaks are left unrepaired, chromosome rearrangements and loss occur, which can uncover recessive mutations (loss of heterozygosity) and cause cell growth arrest, ultimately leading to cell death. Breaks can be repaired by nonhomologous end joining (NHEJ) or through homologous recombination (HR) pathways. NHEJ uses either no homology or microhomology of only a few nucleotides to join the broken ends directly and thus is often inaccurate and can generate mutations. HR, on the other hand, uses homologous sequences as templates to exchange information and copy the region containing the break and thus is thought to be primarily error-free. HR is initiated by processing of the broken ends to form single-stranded DNA, which is bound by proteins known as recombinases. The recombinases then initiate a homology search within the template molecule, followed by strand exchange and DNA synthesis. Processing of the recombination intermediates may then lead to noncrossover products or, if flanking DNA sequences are exchanged, to crossover products. Different pathways for HR exist depending on whether only one or both ends of the broken molecule interact with the template. Under some circumstances, broken chromosomes may present only one end for repair, as in collapsed replication forks or at uncapped telomere ends. This mode of recombination repair is known as break-induced replication (BIR) and can result in nonreciprocal translocations.

Introduction
Living organisms are engaged in a constant battle to protect their genetic material. Fortunately, evolution has provided organisms with a plethora of strategies to cope with DNA damage. Which mechanisms come into play depends mainly on the type of damage, when it occurs, and where the damage is located. One of the most deleterious types of damage is the DNA double-strand break (DSB), as it affects both strands of the double helix and thus does not leave an intact strand for gap repair of the damage. Double-strand breaks arise through exposure to environmental agents and endogenously by errors in DNA replication or as obligatory intermediates during normal cellular processes such as meiosis and V(D)J recombination. DNA recombination is crucial for the repair of double-strand breaks and is also required for the recovery of stalled or collapsed replication forks, telomere maintenance, and chromosome segregation during meiotic recombination. The importance of DNA recombination is highlighted by the fact that defects in DNA recombination and DSB repair have been correlated with genome instability and high cancer predisposition. DNA recombination involves the active participation of several protein factors, including helicases, nucleases, translocases, polymerases, and ligases. Many of the genes involved in homologous DNA recombination were first identified in the budding yeast Saccharomyces cerevisiae by the isolation of mutants that are sensitive to DNA-damaging agents that generate double-strand breaks (e.g., ionizing radiation) and that fail to produce viable meiotic products. Although several models currently exist for the repair of double-strand breaks by HR, all of them share several key features (Heyer et al. 2010; San Filippo et al. 2008). The different mechanisms by which these proteins come together and how they are regulated are discussed in this entry. First, the mechanisms of double-strand break repair are presented, leading into a brief introduction to the components and steps involved in the various recombination processes, followed by a detailed description of each mechanism supported by genetic and biochemical data. Recombination refers to the exchange of information between DNA molecules and is found in all living organisms. The different mechanisms of DNA recombination are broadly divided into two categories, depending on whether or not repair uses homologous DNA regions.
Homologous recombination (HR) has a vital role in the error-free repair of double-strand breaks and other DNA lesions encountered during DNA replication, such as DNA interstrand cross-links. As its name states, HR involves the exchange of information between homologous sequences. DNA recombinases are essential for HR.

Recombinases are highly conserved enzymes that are able to bind single-stranded and double-stranded DNA and promote strand invasion for the exchange of information between DNA sequences. Whereas prokaryotes rely on the RecA recombinase, eukaryotes possess two recombinases, Rad51 and the meiosis-specific recombinase Dmc1. In meiosis, HR is essential to establish a physical connection or linkage between homologous chromosomes to ensure their correct separation during meiosis I and to ensure genetic diversity by generating new DNA arrangements. DNA recombination and DSB repair can occur by several mechanisms. We will describe the components of each reaction and when each pathway is used for DSB repair. In addition to HR, a second mechanism involving sequence homology has been proposed for the repair of double-strand breaks in the absence of protein recombinases. This mechanism is commonly referred to as single-strand annealing (SSA). The SSA pathway is specialized for the repair of double-strand breaks occurring between two repeated sequences with the same orientation (Fig. 1). In the absence of homology, repair of double-strand breaks is accomplished by nonhomologous end joining (NHEJ). NHEJ is sometimes referred to as illegitimate recombination because no homology or very limited microhomology is used. This pathway is used primarily in the G1 phase of the cell cycle, when there are no closely associated homologous sequences, and involves the direct joining or ligation of the broken ends. Both SSA and NHEJ are considered to be error-prone pathways, as DNA sequences are lost to generate the appropriate substrates for repair. The spatiotemporal use of each of these mechanisms and their variations will be discussed in detail later in this entry.

Double-Strand Break Repair Mechanisms
Formation of double-strand breaks initiates recombination repair. Breaks can occur spontaneously, as a consequence of exogenous DNA damage, or from programmed cellular events such as those observed in meiosis through the action of the Spo11 endonuclease. Double-strand breaks are usually lethal events if left unrepaired.

Therefore, cells have several pathways in place for their repair. Some of these repair pathways are error-free, but others can be error-prone and lead to mutations or DNA rearrangements. Another consequence of double-strand break repair is the potential to lose heterozygosity at a DNA region. Loss of heterozygosity (LOH) can uncover recessive mutations that may be deleterious for the cell. The most common HR pathway in mitotic cells is synthesis-dependent strand annealing (SDSA), which prevents LOH by promoting a noncrossover-only type of recombination (Fig. 1; Nassif et al. 1994). In SDSA, the 3′ end in the D-loop is extended by repair synthesis, and the newly synthesized DNA strand then dissociates to anneal to the second end to complete the reaction. The SDSA pathway does not produce crossovers, which avoids possible genome rearrangements. In contrast, because meiotic recombination products are haploid and the concept of LOH does not apply, crossovers occur in meiosis to ensure correct alignment of chromosome pairs for the meiosis I division and to generate additional diversity in the genotypes of the meiotic products. This type of recombination involves the formation of a double Holliday junction that requires specific enzymes to be resolved (Fig. 1; Szostak et al. 1983). The double Holliday junction intermediate is then resolved by endonucleases and helicases into crossover or noncrossover DNA products. While crossover formation plays a vital role in facilitating chromosome segregation during meiotic recombination, crossovers occurring during mitotic recombination can potentially generate a series of deleterious rearrangements, including deletions, inversions, and translocations. The dissolution of double Holliday junctions thus suppresses possible genome instability and is performed by a complex mechanism involving RecQ-family DNA helicases (yeast Sgs1 or human BLM), DNA topoisomerase III, and other protein cofactors. A third type of HR process is single-strand annealing (SSA).
This reaction occurs when a break is formed between repeated sequences. Resection of the broken ends reveals DNA sequence homology and allows for annealing of the single-stranded regions. The complementary single-stranded DNA strands formed at the repeat then anneal, and the remaining single-stranded tails formed after the annealing reaction are removed by nucleases. The resulting gaps are then filled by DNA synthesis and ligation. This process results in loss of one of the repeats and the DNA sequence positioned in between the repeats. Compared to homologous recombination, SSA is more mutagenic because it involves loss of chromosomal sequences. Finally, in the case where only one double-strand break end is formed or is available, BIR uses homology at the broken end to find a homologous sequence, and after establishing and stabilizing a recombination intermediate, it transitions to a replication fork mode that can copy an entire chromosome arm (Fig. 1; Morrow et al. 1997). Genetically this appears as either a very long gene conversion or a nonreciprocal translocation, resulting in a long region of LOH. As an alternative to HR, the broken ends may be “repaired” through an end joining process (i.e., NHEJ). The determination between NHEJ and HR appears to be made early, at the step of double-strand break end processing. It is partially controlled by the availability of end processing enzymes and limits HR to the S and G2 phases of the cell cycle when a sister chromatid is available for error-free recombination repair.

DNA Recombination, Mechanisms of, Fig. 1 Mechanisms of double-strand break (DSB) repair. Double-strand breaks can be repaired by end joining or using homologous sequences. After a break is formed, the ends can be directed into the nonhomologous end joining (NHEJ) reaction to catalyze the direct religation of the broken ends without homology. If the break has occurred in a region flanked by direct repeat sequences, it can be repaired by the single-strand annealing (SSA) pathway. The product of SSA is a deletion of one repeat and all of the DNA between the repeats. If the resected end becomes a platform for assembly of the Rad51 nucleoprotein filament, the broken molecule is directed into one of several homologous recombination (HR) repair pathways. These all begin with the synapsis step of HR, strand invasion into a homologous sequence, and D-loop formation. If the break has only one end, as occurs with uncapped telomeres and collapsed replication forks, the one-ended HR pathway of break-induced replication (BIR) is initiated. BIR proceeds through the formation of a new replication fork and produces nonreciprocal translocations or half-crossover products. These products give imbalanced genomes with loss of heterozygosity (LOH). If the DNA break has two ends, as is usually the case from induced DNA damage, there are two options for HR-mediated repair of the break. In mitosis the primary mode of repair is through the synthesis-dependent strand annealing (SDSA) pathway. In this pathway the D-loop is extended by DNA synthesis, and the newly generated DNA strand dissociates to anneal to the second end to complete the reaction. SDSA does not generate crossover products. The third HR pathway involves capture of the second end by the D-loop and formation of a double Holliday junction (dHJ) intermediate. This structure is processed either by DNA topoisomerases and helicases to produce noncrossover products (dHJ dissolution) or by resolvases, specialized nucleases that recognize the Holliday junction, to form noncrossover or crossover products

Double-Strand Break End Processing
The HR mechanisms all start with the processing of a double-strand break end, as the substrate required for strand invasion is a 3′ single-stranded DNA tail of several hundred nucleotides (Fig. 1). These single-stranded DNA tails become substrates for Rad51 recombinase protein binding (i.e., presynaptic filament or Rad51 nucleoprotein filament formation). Once the ends become resected, they cannot be used in a straight end joining reaction and thus the end processing step inhibits NHEJ. In vivo, DNA ends do not exist unprotected for long.
At the ends of chromosomes, for example, they are bound by a special set of proteins to form the telomeres. When naked ends occur elsewhere in the genome, they are immediately bound by a protein complex called MRX (Mre11-Rad50-Xrs2) in the budding yeast S. cerevisiae or MRN (Mre11-Rad50-Nbs1) in higher eukaryotes, including humans. The MRX/MRN complex also associates with a protein called Sae2 (CtIP in mammalian cells) to initiate end processing and to clean up what are sometimes called dirty ends (Mimitou and Symington 2008; Huertas et al. 2008). Dirty ends can arise from chemical, enzymatic, or radiation breakage of the double helix and are ends that do not have a free 5′-phosphate end available for resection. Once MRX-Sae2/MRN-CtIP removes a short segment of DNA, resection continues through the combined action of a DNA helicase and a nuclease. Although not all of the end resection components may have been defined, in yeast resection uses the helicases Sgs1 and Dna2 and the nucleases Exo1 and Dna2 (Zhu et al. 2008; Fig. 2). In mammalian cells the BLM DNA helicase, related to yeast Sgs1, is used for end resection with EXO1. Recently, this end resection reaction has been reconstituted in vitro (Niu et al. 2010; Cejka et al. 2010; Nicolette et al. 2010). These experiments show critical roles for the Sgs1 helicase and the Dna2 and Exo1 nucleases in end resection.

DNA Recombination, Mechanisms of, Fig. 2 Double-strand break end processing. After double-strand break formation, specialized protein complexes (MRX and Ku) bind to the ends to initiate end processing. Dirty ends produced by DNA damage (e.g., ionizing radiation) are bound by the MRX-Sae2 complex to ensure that the ends are ready or clean for end resection. End resection then continues by the combined activity of the Sgs1 helicase with its partners Top3 and Rmi1 (the STR complex) and the Exo1 and Dna2 nucleases. When the breaks are produced by nucleolytic cleavage, and depending on the cell cycle stage, they do not require further cleaning and are bound by the Ku complex to initiate the end joining reactions. The presence of Ku at the ends inhibits binding of Exo1, Dna2, and Sgs1 and promotes end joining

Nonhomologous End Joining (NHEJ) Repair
NHEJ is a mechanism that repairs DNA breaks when there is little or no homology at the broken ends, or when HR is restricted, for example, in the G1 phase of the cell cycle. As end processing is inhibitory to NHEJ, the factors that bind to the broken ends to promote end joining, primarily the Ku complex, inhibit the binding and access of Exo1, Sgs1, and Dna2 to double-strand breaks (Fig. 2). The different end joining mechanisms will be discussed in much more detail in a following essay. Briefly, NHEJ is initiated by the binding of the Ku70/Ku80 heterodimeric complex to the DNA ends. The Ku complex bound at each broken end is then thought to form a bridge between the two ends. The final joining is completed by specialized DNA polymerases, which may be needed to fill in a few nucleotides missing at the junction of the ends, together with a specialized DNA ligase called ligase IV. Ligase IV has a partner called XRCC4 in mammalian cells and Lif1 in yeast. Additional proteins aid in the end-rejoining reaction. Interestingly, the end joining proteins are also called on to function during V(D)J or somatic recombination in cells of the immune system.
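The pathway choices summarized in Fig. 1 and in the preceding sections can be restated as a small decision sketch. The following Python fragment is only an illustrative simplification: the `Break` record, its field names, and the `choose_pathway` function are hypothetical names introduced here, and the strict ordering of the tests (resection first, then Rad51 loading, then the number of ends) is an assumption that ignores cell cycle control and the many regulators discussed in this entry.

```python
from dataclasses import dataclass


@dataclass
class Break:
    """Hypothetical description of a double-strand break (illustration only)."""
    ends: int                          # 1 = one-ended (collapsed fork, uncapped telomere), 2 = two-ended
    resected: bool                     # 3' single-stranded tails have been generated
    rad51_filament: bool               # Rad51 nucleoprotein filament has assembled on the tail
    between_direct_repeats: bool       # break lies between repeats in the same orientation
    second_end_captured: bool = False  # the D-loop has captured the second end


def choose_pathway(b: Break) -> str:
    """Return the repair pathway suggested by Fig. 1 for a simplified break.

    Assumption: the decision is modeled as a strict cascade, which real cells
    do not follow so rigidly.
    """
    if not b.resected:
        # Unresected ends remain substrates for Ku-dependent end joining.
        return "NHEJ"
    if b.rad51_filament:
        if b.ends == 1:
            # One-ended breaks are channeled into break-induced replication.
            return "BIR"
        # Two-ended breaks: second-end capture gives a double Holliday junction,
        # otherwise the extended strand dissociates and anneals (SDSA).
        return "dHJ" if b.second_end_captured else "SDSA"
    if b.between_direct_repeats:
        # Recombinase-independent annealing of exposed repeats.
        return "SSA"
    return "unresolved"


if __name__ == "__main__":
    print(choose_pathway(Break(ends=2, resected=True, rad51_filament=True,
                               between_direct_repeats=False)))
```

In this sketch a resected two-ended break carrying a Rad51 filament is routed to SDSA unless the second end has been captured, mirroring the statement that SDSA is the primary mitotic mode of repair.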

Noncrossover Recombination: Synthesis-Dependent Strand Annealing (SDSA)
Although the original recombination models linked noncrossover and crossover recombination (Szostak et al. 1983), sometimes called nonreciprocal and reciprocal recombination, respectively, it has become increasingly clear that for the most part they form through different pathways involving distinct intermediates (Fig. 1). This conclusion is based on several observations. First, most mitotic recombination events end in noncrossover products. Second, crossover and noncrossover recombination can be separated both temporally and mechanistically during meiosis. As HR is initiated by a double-strand break, the model that has evolved for noncrossover recombination has come from the double-strand break repair model (Szostak et al. 1983). As mentioned before, the Rad51 nucleoprotein filament promotes homology search and strand invasion after end resection. Following DNA end processing, the single-stranded DNA is bound by the single-stranded DNA-binding protein RPA (replication protein A). RPA binding eliminates possible secondary structures in single-stranded DNA and protects it from degradation. RPA-bound DNA also inhibits nucleoprotein filament formation (the coating of single-stranded DNA with Rad51 recombinase) and works as a platform for the recruitment of checkpoint and HR proteins. Removal of RPA for subsequent DNA recombinase binding and filament formation requires the function of mediator proteins. Recombination mediators are proteins that aid in the assembly of the presynaptic filament through displacement of RPA from single-stranded DNA. Genetic and biochemical studies have identified different classes of mediators (Heyer et al. 2010). Proteins related to Rad51, known as Rad51 paralogs, constitute the first group, which comprises the yeast proteins Rad55, Rad57, Shu1, and Psy3 and the mammalian RAD51B, RAD51C, RAD51D, XRCC2, and XRCC3 proteins. They all share some resemblance to the bacterial recombinase RecA; however, they cannot form extensive filaments on DNA and are unable to perform the range of DNA pairing and exchange reactions catalyzed by the Rad51 recombinase. A second class of mediators is typified by the yeast Rad52 protein, which aids in Rad51 loading onto DNA and in strand annealing of RPA-bound single-stranded DNA. The third group of mediators seems to be absent in S. cerevisiae and is exemplified by BRCA2, the human breast and ovarian cancer tumor suppressor protein. It has been suggested that BRCA2 recruits RAD51 to the double-stranded DNA junction at the resected end, that is, the end of a double-strand break that has been prepared for HR by resection to form a 3′ single-strand tail. When mediator proteins do not function correctly or are absent because the corresponding genes are mutated, massive chromosome rearrangements and damage are observed. Mutations in BRCA2, for example, are associated with familial breast cancer and Fanconi anemia, a genetic disorder characterized by sensitivity to DNA damage. The homologous duplex DNA and the invading single-stranded DNA filament form a hybrid or heteroduplex DNA. Base mismatches may be present in the heteroduplex when the two strands do not come from sister chromatids. Subsequent repair of these mismatches by the mismatch repair machinery converts the information on one strand to that of the other, which ultimately can lead to a so-called gene conversion event.
As the formation of heteroduplex DNA characterizes all HR reactions, all of them are potentially associated with a gene conversion event. Genetic identification of noncrossover recombination relies entirely on the existence of
an associated gene conversion. For this reason, noncrossover recombination has often been called gene conversion, which is misleading since gene conversion characterizes both noncrossover and crossover recombination. Gene conversion events rarely span more than a few kilobases and therefore encompass a limited number of possible genetic markers. Noncrossover recombination is therefore more difficult to identify than crossover recombination, which can be identified independently of any associated gene conversion since it is characterized by reciprocal exchange of chromosome arms.

The product of the strand invasion is the recombination intermediate called a displacement loop or D-loop (Fig. 1). The strand invasion reaction also involves specialized motor proteins known as chromatin remodelers. These proteins can push nucleosomes along the DNA and help to open a region for pairing of the DNA strands in the chromatin. The recombination chromatin remodeling proteins are called Rad54 and Rdh54 in budding yeast and RAD54 and RAD54B in mammalian cells. These proteins belong to the Swi2/Snf2 family of protein translocases, as they move along single- or double-stranded DNA using ATP as an energy source. Rad54 first stabilizes the Rad51 filament and enhances D-loop formation by Rad51, and then promotes the transition from DNA strand invasion to DNA synthesis by dissociating Rad51 from heteroduplex DNA. Additional chromatin remodeling proteins are needed in DSB repair. These are discussed in the short essay on chromatin remodeling during homologous recombination.

Once a D-loop is formed, it is extended by the combined action of leading strand DNA replication from the invading 3′-OH strand and opening up of the double-strand duplex DNA, a process called heteroduplex extension. The copying of DNA information from the invaded DNA strand allows repair of simple double-strand breaks but also double-strand gaps.

DNA Recombination, Mechanisms of, Table 1 S. cerevisiae proteins involved in SDSA and BIR

5′ end resection (SDSA and BIR): Mre11-Rad50-Xrs2, Sae2, Exo1, Dna2, Sgs1-Top3-Rmi1
Strand invasion and D-loop formation (SDSA and BIR): RPA, Rad51, Rad52, Rad55-Rad57, Rad54, Rdh54
DNA synthesis step (SDSA): DNA processivity clamp PCNA; leading strand DNA machinery only
DNA synthesis step (BIR): DNA processivity clamp PCNA; leading and lagging strand DNA machineries; replicative helicase
Strand displacement: Srs2, Sgs1, Mph1
dHJ resolution: Yen1, Slx1-Slx4, Mus81-Mms4
dHJ dissolution: Sgs1-Top3-Rmi1

In the latter case, the copying process transfers the DNA sequence information from the intact DNA duplex to the broken DNA strand. In so doing, if the information that originally existed at the broken DNA strand was slightly different from that contained
in the intact strand, there will be a nonreciprocal transfer of information, or a gene conversion, independently of any mismatch repair.

Up to this step, the scenario seems to be identical for crossover and noncrossover recombination. However, at this point the pathways diverge (Fig. 1). For noncrossover recombination, the D-loop is now undone by displacement of the invading single strand. Unwinding DNA strands requires the action of a DNA helicase. For this reaction there are several candidate DNA helicases: Srs2, Mph1, and Sgs1 in S. cerevisiae and RTEL1, FANCM, and BLM in human cells (Table 1). The D-loop can be displaced when the invading 3′-OH strand has been extended enough that it can base pair, or anneal, with the other DNA double-strand break end, which also has a free single-strand region (see Fig. 1). Hence, this model is called synthesis-dependent strand annealing or SDSA. Once the strands have annealed, the break is sealed by a DNA ligase. There is no crossover intermediate
or product, but the DNA break has been repaired using information from another DNA duplex. Thus, this is a recombination event. If the sister chromatid is used for repair, the recombination will be silent as the two sister chromatids are identical. If a homologous chromosome is used, there may be some DNA sequence variation, resulting in a gene conversion event, which is detected as a small region of recombination.

Crossover Recombination: Double Holliday Junction Formation and Resolution

The crossover recombination pathway is initiated in the same manner as the SDSA pathway described above. It involves DNA end processing, formation of the Rad51 nucleoprotein filament, homology search, and strand invasion to form a D-loop. The 3′-OH terminus at the invading end can then be extended by DNA synthesis and unwinding ahead of this end (see Fig. 1). This enlarges the single-strand region of the D-loop. However, instead of displacement of the invading strand, the single-strand region of the D-loop can pair with the single-stranded region at the other double-strand break end, essentially the other end of the original DNA break. This event is known as second end capture. In capturing the second end, the recombination intermediate becomes fixed with one full strand exchange structure, or Holliday junction, and one half-junction at the captured end (Fig. 1). Eventually the free 3′-OH can be ligated to the 5′-phosphate of the other end, now captured, to form a double Holliday junction intermediate (Fig. 1). This intermediate has potential regions of heteroduplex DNA at each Holliday junction, while the central region, in the case of a double-stranded DNA gap, has been filled in by replication on both strands to form homoduplex DNAs.

The final step is the resolution of the double Holliday junction structure. There are two major pathways. The first pathway is a true resolution step and involves structure-specific DNA endonucleases and DNA ligase.
Depending on how the Holliday junctions are cut, the products may be crossover or noncrossover (Fig. 1). In E. coli there is one major resolvase called RuvC and a minor resolvase
called RusA. In eukaryotes there are at least three resolvase complexes, Mus81-Mms4/EME1, Slx1-Slx4/BTBD12, and Yen1/GEN1 (Table 1). The uppercase names are the enzymes used in mammalian cells, while the lowercase names reflect the budding yeast enzymes. Interestingly, while resolution of double Holliday junctions can lead to both noncrossover and crossover recombination products, resolution of these intermediates in budding yeast during meiotic recombination gives rise primarily to crossover products.

The second resolution pathway is called dissolution. It uses the combined action of a DNA helicase complex and a DNA topoisomerase to push the two Holliday junctions toward each other so that the topoisomerase can then decatenate, or unlink, the crossed strands. This dissolution process always results in noncrossover products. The key players in this reaction are the helicase Sgs1 in yeast with its partners Top3 and Rmi1 (STR; Table 1) and, in humans, the BLM helicase with its partners TOPOIIIα, RMI1, and RMI2. Defects in any of these enzymes result in an increase in crossover products due to shunting of the double Holliday junction intermediate into a resolution pathway, as well as in additional chromosomal rearrangements and genomic instability (e.g., Bloom's syndrome). Noncrossover recombination resulting from the SDSA pathway can be distinguished experimentally from that resulting from the dissolution or resolution of a double Holliday junction intermediate by determining the structure of the heteroduplex DNA, using either poorly repairable genetic markers or mismatch repair-deficient cell lines.

Repair of One-Ended Double-Strand Breaks: Break-Induced Replication (BIR)

One-ended breaks can occur at uncapped telomeres. When telomerase is absent, such telomeres can be extended by a one-ended recombination process called alternative lengthening of telomeres or ALT.
The ALT pathway uses HR with either a sister chromatid, a homologous chromosome, or a nonhomologous chromosome to re-form the sequences at the telomere end through
a DNA synthesis-based event. One-ended breaks can also occur at collapsed replication forks. These breaks usually use the sister chromatid as a template for repair, but on occasion other homologous sequences can be used, leading to duplications of large chromosome regions, also known as segmental duplications or nonreciprocal translocations.

The principles of BIR are similar to those of the HR mechanisms described previously. Following end resection, the single-stranded DNA tail becomes coated with RPA and then Rad51 to form the nucleoprotein filament, which engages in a search for homology and strand invasion (Fig. 1). BIR is therefore an HR pathway, which relies on the same proteins as SDSA up to the strand invasion step. Later on, SDSA requires only 3′ extension of DNA ends by the leading strand DNA machinery, while BIR requires the establishment of a complete replication fork with both leading and lagging strand DNA machinery and the activity of a replicative DNA helicase. Remarkably, if both ends of a break are available, BIR is rare and is outcompeted by a gene conversion reaction, which generates loss of heterozygosity (LOH) over distances much shorter than BIR does. Interestingly, while end resection is essential for any HR reaction, extensive 5′ to 3′ DNA end resection mediated by Exo1 and Sgs1 is inhibitory to BIR. One interpretation is that BIR is a slow process during which extensive 5′ to 3′ resection could affect either the stability of the 3′ end or the establishment of a Rad51 nucleoprotein filament at the 3′ end, which is essential to initiate BIR.

By duplicating a large segment of a template chromosome onto a broken one, BIR preserves the integrity of the template molecule. This is fundamentally different from the repair of the broken chromosome by crossover recombination, which results in a reciprocal translocation. In this latter case, the structure of both the recipient and the template molecule is affected.
However, considering only the recipient chromosome, the BIR product is identical to a reciprocal translocation event between two chromosomes. Only the analysis of all the chromosomes present when the recombination event occurred can distinguish between these different recombination pathways.
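
The distinction drawn above can be illustrated with a toy model. The following Python sketch is a minimal illustration, not a biological simulation: chromosomes are represented as lists of labeled segments, and the function names, segment labels, and break position are all hypothetical choices made for this example. It shows that the recipient chromosome looks the same after BIR and after a reciprocal crossover, and that only the template chromosome tells the two apart.

```python
def reciprocal_crossover(recipient, template, break_index):
    """Exchange the arms distal to the break between both chromosomes."""
    new_recipient = recipient[:break_index] + template[break_index:]
    new_template = template[:break_index] + recipient[break_index:]
    return new_recipient, new_template


def break_induced_replication(recipient, template, break_index):
    """Copy the template arm onto the broken recipient.

    The template is returned unchanged, reflecting the statement that
    BIR preserves the integrity of the template molecule.
    """
    new_recipient = recipient[:break_index] + template[break_index:]
    return new_recipient, list(template)


# Hypothetical chromosomes, each made of four labeled segments.
recipient = ["A1", "A2", "A3", "A4"]
template = ["B1", "B2", "B3", "B4"]

co_recipient, co_template = reciprocal_crossover(recipient, template, 2)
bir_recipient, bir_template = break_induced_replication(recipient, template, 2)

# The recipient alone cannot distinguish the two pathways.
print(co_recipient == bir_recipient)
# The template differs: rearranged after crossover, intact after BIR.
print(co_template, bir_template)
```

Comparing both returned chromosomes, rather than the recipient alone, mirrors the point that all chromosomes present at the time of the event must be analyzed.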


DNA Recombination, Mechanisms of, Fig. 3 Complex rearrangements through BIR and SDSA. When a second round of strand invasion occurs after strand invasion and displacement, complex rearrangements can occur. If the second invasion occurs at the same molecule (red), SDSA will result in a noncrossover product, while BIR can result in a half-crossover product, where only one of the two molecules engaged in the recombination is a crossover product. If the second invasion occurs at a repeated DNA sequence on a different molecule or at a sequence at a different location on the same molecule (green), then SDSA will result in a complex noncrossover product that contains information from different chromosome regions (tri-parental recombination), while BIR will result in a complex translocation that involves three parental molecules

Interestingly, in the absence of essential BIR factors such as Pol32, a nonessential subunit of DNA polymerase delta, or Rad51, it is possible to select for rare recombination events that can repair one-ended breaks. Such repair events are generally half-crossovers between the broken chromosome and a template chromosome that leave a novel one-ended break at the end of the process. Such events can occur in diploid cells and yield chromosome loss. They can also occur during the G2 phase of the cell cycle in haploid cells. In this latter case, the broken chromosome can segregate into a different daughter cell than the repaired chromosome, leading to one unviable cell and one viable cell, respectively. Only a pedigree analysis or a molecular analysis of the chromosomes can distinguish between BIR and half-crossovers.

The first strand invasion intermediate common to SDSA and BIR is intrinsically unstable. In the case of SDSA, if strand displacement occurs after sufficient extension of the 3′ end, pairing with the
second end (or second end capture) is possible and repair of the break can be completed. When 3′ end extension is not sufficient to allow second end capture, such as during a gap repair event, a second round of strand invasion is necessary to complete the repair. This second round of strand invasion can occur at a site different from the initial one when repeated sequences are present (Fig. 3). This template switch leads to tri-parental recombination events. Once again, tri-parental events have much less dramatic consequences in the case of SDSA than in the case of BIR, where they may lead to complex nonreciprocal chromosomal translocations. Given the abundance of repetitive sequences in eukaryotic genomes, such events may be common.

Repair Pathway Choice

The choice of pathway for the repair of double-strand breaks is regulated by several factors, including the nature of the lesion and the phase
of the cell cycle in which the lesion occurs. For example, programmed double-strand breaks generated during V(D)J recombination, a diversity mechanism to produce antibodies in vertebrates, are repaired by NHEJ, whereas those generated during meiosis by the Spo11 endonuclease are repaired by HR. Whether HR or NHEJ is used for the repair of double-strand breaks is largely determined by the phase of the cell cycle. Homologous recombination operates in S/G2, when the sister chromatid is readily available, whereas NHEJ operates throughout the cell cycle, predominantly in G1, when the cells are growing but the genome has not been replicated. In addition, cell cycle phase plays an active role in regulating HR in that end resection is promoted by cyclin-dependent protein kinases (CDKs). As an example, the Sae2 protein, a key player in end resection, is phosphorylated by CDK to start HR in S/G2. This is further discussed in the essay on checkpoints and DSB repair.

There is significant evidence suggesting that HR and NHEJ collaborate (and compete) in the maintenance of genome stability and that both pose risks of genome rearrangements. Both pathways are essential for DNA damage repair in mice, as single knockouts for different factors of either pathway result in early embryonic lethality. Further studies have shown that living cells enhance the activity of one pathway by suppressing or inactivating the other. This competition is more apparent in mutants that affect either end resection or the ability of a resected end to be channeled into HR. Thus, NHEJ mutants that have enhanced end resection have increased HR, while mutants with decreased end resection have increased NHEJ. Interestingly, it is now known that several proteins can actively participate in both pathways, highlighting the importance of pathway regulation and pathway choice.
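
As a rough summary of the rules described above, the following Python sketch encodes pathway choice as a simple lookup. It is a deliberate simplification under stated assumptions: the function name, the string labels, and the strict either/or outcome are hypothetical, whereas in cells the two pathways compete and the choice is not absolute.

```python
def choose_dsb_repair_pathway(phase, break_source=None):
    """Return the repair pathway favored for a double-strand break.

    phase: "G1", "S", or "G2" (other phases are outside this sketch).
    break_source: optional label for programmed breaks,
        "V(D)J" or "Spo11"; any other value is ignored.
    """
    # Programmed breaks are tied to one pathway in the text.
    if break_source == "V(D)J":
        return "NHEJ"
    if break_source == "Spo11":
        return "HR"
    # Otherwise the cell cycle phase is the main determinant.
    if phase in ("S", "G2"):
        return "HR"    # a sister chromatid is available
    if phase == "G1":
        return "NHEJ"  # the genome has not yet been replicated
    raise ValueError(f"phase not covered by this sketch: {phase!r}")


for phase in ("G1", "S", "G2"):
    print(phase, choose_dsb_repair_pathway(phase))
```

A fuller model would return relative pathway usage rather than a single label, since NHEJ operates throughout the cell cycle and mutants can shift the balance in either direction.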

Double-Strand Break Formation

Repair of double-strand breaks can occur through end joining mechanisms that use no homology or only microhomology at the broken ends. NHEJ is cell cycle regulated and usually occurs in the G1 phase of the cell cycle, when a sister chromatid is not available. Repair of DNA breaks by HR can occur
through several pathways and is usually limited to the S and G2 phases. If breaks are induced by exogenous DNA damage such as ionizing radiation, they can occur at any phase of the cell cycle. Recent studies suggest that a double-strand break induced in G1 does not inhibit entry of the cell into S phase (Doksani et al. 2009). However, when the replication fork reaches the break, it stalls and requires mechanisms to repair the damage and complete replication. Breaks may also arise as a consequence of DNA damage, either exogenous or endogenous, that modifies bases, which can create blocks for protein complexes that move along the duplex DNA or unwind the DNA, as occurs during replication and transcription. Stalled replication or transcription complexes can elicit responses that involve nicking of the DNA backbone at the site of damage. This may form a double-strand break where none was initially present. Other types of damage that generate breaks during the repair process include DNA cross-links, which can occur within a single DNA strand or between DNA strands of a duplex molecule (i.e., intrastrand and interstrand cross-links, respectively). Cross-links create strong roadblocks to any motor protein that moves along the DNA, and removal of them is a multistep process that involves successive nicking of the DNA backbone. Another source of double-strand breaks comes from incomplete topoisomerase action, particularly that of topoisomerase II, which makes scissions in both strands of the DNA duplex. Lastly, DNA breaks can occur as part of a developmental program to generate genetic diversity. The most prevalent is that which occurs during meiosis, through the action of the Spo11 protein endonuclease. Meiotic recombination is more fully discussed in another essay of this section. Briefly, Spo11 cleavage of DNA occurs during meiotic prophase to make over 200 breaks per meiotic cell in all eukaryotic organisms studied that undergo conventional meiotic chromosome divisions. 
The breaks are repaired through HR with the homologous chromosome to generate gene conversion and crossover events. The formation of crossovers is an essential feature of meiosis for correct chromosome segregation at meiosis I.


Double-Strand Break Formation by Ionizing Radiation

Ionizing radiation can cause direct breakage of the DNA backbone at any point of the cell cycle, resulting in double-strand breaks. Although not as deleterious, indirect damage can also take place when water surrounding the DNA absorbs the radiation, producing reactive chemical species that can then generate additional DNA lesions (e.g., oxidative DNA damage). Recent studies in budding yeast have provided experimental evidence showing that broken ends generated by ionizing radiation are different from those that occur by enzymatic DNA cleavage (Barlow et al. 2008). Radiation-induced ends are sometimes called dirty or frayed ends, as they do not possess a chemically defined end with terminal 5′-phosphate and 3′-hydroxyl (3′-OH) groups and are not an immediate substrate for DNA enzymes that would repair a break by ligation or end extension. Radiation-induced breaks always require further processing and thus become subject to end resection. In contrast, clean ends induced by enzymatic cleavage are proper substrates for repair and become bound by different protein complexes, depending on the cell cycle phase. Ends induced enzymatically in G1 become bound by a protein complex called Ku70/Ku80, which prevents the ends from further processing or resection and promotes end joining reactions. Ku proteins are evolutionarily conserved from bacteria to humans and form a basket-shaped structure that can slide onto DNA and work as a scaffold for the recruitment of other NHEJ proteins. It is not known why dirty ends do not bind Ku70/Ku80 when the breaks are generated in G1, but it has been hypothesized that dirty or frayed ends do not neatly fit into the ring structure formed by the Ku70/Ku80 heterodimeric complex (Wyman et al. 2008).
Instead, it has been proposed that dirty ends are preferentially bound by the MRX-Sae2 complex, which then promotes the processing of the radiation-induced ends for subsequent repair by HR when the homologous chromatid is readily available in S/G2 (Barlow et al. 2008).
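
The end-binding behavior described above can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the function name and labels are hypothetical, and the "dirty" branch encodes a proposal from the cited studies rather than an established rule. Cases that the text does not specify return no prediction.

```python
def predicted_end_binding(end_type, phase):
    """Return (bound complex, favored outcome) for a broken DNA end.

    end_type: "clean" (enzymatic cleavage) or "dirty" (ionizing radiation).
    phase: cell cycle phase label such as "G1", "S", or "G2".
    """
    if end_type == "dirty":
        # Proposed: frayed ends are bound by MRX-Sae2 and processed,
        # with repair by HR once a homologous chromatid is available.
        return ("MRX-Sae2", "end processing, then HR in S/G2")
    if end_type == "clean" and phase == "G1":
        # Ku70/Ku80 blocks resection and promotes end joining.
        return ("Ku70/Ku80", "end joining")
    # Not specified in the text (for example, clean ends in S/G2).
    return (None, None)


print(predicted_end_binding("clean", "G1"))
print(predicted_end_binding("dirty", "G1"))
```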

Programmed DNA Breaks

The three most common forms of programmed DNA breaks are the meiosis-specific DNA breaks initiated by the Spo11 endonuclease, breaks in the immunoglobulin gene precursors made by the RAG-1 and RAG-2 enzymes to generate mature antibody genes against a wide range of epitopes, and mating type switching in fungi, promoted by the HO endonuclease. In addition, site-specific transposition and recombination systems that use specific DNA sequences and associated recombinases make DNA breaks at the specified target sequences. Meiosis-specific double-strand breaks and the mode of action of Spo11 are more fully explored in a separate essay in this section.

In vertebrates, the immune system generates a very large number of T-cell receptors and antibodies against diverse pathogens and antigens. This process is known as V(D)J or somatic recombination and involves programmed site-specific recombination of the immunoglobulin gene precursors. The immunoglobulin genes are organized as a long array of up to 500 V (variable) gene regions, followed by an array of 12 D (diversity) regions, which is followed by a tract of 4 J (joining) regions. Programmed somatic rearrangement in B cells first joins one D region to one J region, which is then joined to one V region, to form the mature V(D)J gene for heavy immunoglobulin chains. The light chains of mature immunoglobulins, on the other hand, are formed by joining of a V region to a J region. As there are multiple copies of V, D, and J regions, the number of combinations can generate a diverse repertoire of up to 10^11 antibodies. Joining of the V, D, and J regions occurs during B-cell development in a programmed and regulated fashion. Each V, D, and J region is flanked by a special sequence called the recombination signal sequence (RSS) (Fig. 4), which ensures that joining occurs at the RSS sequence.
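
The segment counts quoted above allow a quick worked calculation. The Python sketch below is illustrative arithmetic only; the variable names are hypothetical, and the counts are the upper bounds given in the text. It multiplies the numbers of V, D, and J regions to obtain the number of heavy-chain segment combinations and compares that with the quoted repertoire size.

```python
from math import prod

# Upper-bound segment counts for the heavy chain, as quoted in the text.
heavy_chain_segments = {"V": 500, "D": 12, "J": 4}

# One V, one D, and one J are chosen, so the counts multiply.
heavy_chain_combinations = prod(heavy_chain_segments.values())

# Repertoire size quoted in the text.
quoted_repertoire = 10**11

# Factor by which the quoted repertoire exceeds segment choice alone.
fold_gap = quoted_repertoire / heavy_chain_combinations

print(heavy_chain_combinations)
print(f"{fold_gap:.2e}")
```

Segment choice on the heavy chain alone gives 500 × 12 × 4 = 24,000 combinations, so most of the quoted diversity must come from additional sources that this passage does not itemize.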

DNA Recombination, Mechanisms of, Fig. 4 Mechanism of V(D)J recombination. Each immunoglobulin gene precursor is flanked by the recombination signal sequence (RSS). The RSS signal contains conserved heptamer and nonamer sequences separated by a spacer segment of 12 or 23 base pairs. During V(D)J recombination rearrangements, the RAG complex first binds to an RSS sequence and makes a single nick at the junction between the coding sequence and the RSS. The RSS-RAG complex then binds to a second RSS-RAG complex (synapsis) and through a transesterification reaction forms a double-strand break. The ends of the coding sequences become sealed by a hairpin structure, which is then cleaved by the Artemis nuclease for subsequent ligation by an end joining reaction

The RSS signals that flank each gene segment consist of conserved heptamer and nonamer elements separated by a less well-conserved spacer region of either 12 or 23 nucleotides. Cleavage at the RSS sequence is initiated by the products of the recombination-activating genes (RAG), RAG-1 and RAG-2. As
these genes are expressed only during lymphocyte development, this restricts V(D)J breaks and recombination to this stage. The breakage process is complex. First, the RAG complex binds to an RSS sequence and makes a single-strand nick at the junction between the RSS and the coding sequence, forming a free 3′-OH group. The RSS-RAG complex then recognizes and aligns with a second RSS-RAG complex, and the free 3′-OH ends attack the opposite strand through a transesterification reaction to form a double-strand break at the RSS/coding sequence junction (Fig. 4). During double-strand break formation, the ends of the coding sequences for the V, D, or J segments become sealed by a hairpin structure. To prevent self-joining or joining of two V sequences, two D sequences, or two J sequences together, the V(D)J recombination process uses the 12/23 rule. The 12/23 rule dictates that a gene segment flanked by an RSS with a 12 base pair spacer can be joined only to a segment flanked by an RSS with a 23 base pair spacer. This mechanism helps to ensure fidelity and accuracy during the V(D)J joining reaction. Next, the hairpin end is cleaved by a DNA nuclease called Artemis, and, together with NHEJ factors, the different immunoglobulin gene segments are ligated together. Occasionally, this process goes wrong, and RSS sequences become joined to oncogenes in a random genomic rearrangement, as occurs in some human lymphomas. This places the oncogenes under the control of a strong transcription regulatory element that, under the wrong circumstances, can result in uncontrolled cell growth and tumorigenesis.

Mating type switching is a mechanism that allows yeast cells to quickly change cell type, which is determined by the information contained at a defined DNA sequence known as the MAT locus. In yeast, mating type switching is promoted by the action of the HO endonuclease. Cleavage at the MAT locus is regulated by cell type and occurs only in haploid cells that are of a single mating type.

DNA Recombination, Mechanisms of, Fig. 5 Mating type switching in budding yeast. In S. cerevisiae, mating type switching is promoted by the HO endonuclease. HO recognizes and cleaves at three specific sites in the yeast genome, the MAT locus and the silent cassettes of mating type information, HML and HMR, all located on chromosome III. In addition to the gene coding regions (Ya and Yα), MAT, HML, and HMR share two regions flanking the Y sequences, termed X and Z1. HML and MAT also share two sequences termed W and Z2. Double-strand break formation by HO triggers a gene conversion event between MAT and one of the silent cassettes (between MATa and HMLα in this particular example, to gene convert MATa to MATα). This process employs the HR machinery to switch information via synthesis-dependent strand annealing (SDSA) and without crossover products

The recognition sequence of the HO endonuclease is a nonpalindromic 24 base pair sequence that occurs at three sites on chromosome III of the yeast genome, the MAT locus and the silent cassettes of mating type information, HML
and HMR (Fig. 5). The HML (hidden MAT left) locus is located on the left arm of chromosome III and typically carries a silenced copy of the MATα allele, while the HMR (hidden MAT right) locus is located on the right arm of chromosome III and carries a silenced copy of the MATa allele. Formation of a double-strand break at the MAT locus by the HO endonuclease triggers a gene conversion event between MAT and one of the silent cassettes (Fig. 5). Although cleavage at the silent cassettes is repressed by the chromatin structure of the HML and HMR loci, these loci are able to participate in the repair reaction of the HO-induced break at MAT. During the repair process, the HO-formed ends are treated as normal DNA breaks and are processed by the HR repair processing enzymes. The broken ends pair with homologous DNA sequences at the HML or HMR loci and undergo an HR repair reaction via synthesis-dependent strand annealing (SDSA). The repair reaction copies the information contained in the silent cassette into the broken
MAT locus and in so doing changes, or gene converts, the mating type information at the MAT locus. As the mating type switching reaction uses the same enzymes to repair the HO-induced break through HR as other break-induced HR reactions do, it is often used as a model for homologous recombination. Interestingly, the HO endonuclease system has become a widely used approach in laboratories to generate breaks at specific locations in the genome for the study of HR processes in normal and mutant cells.

Site-Specific Recombination

Site-specific recombination entails the exchange, insertion, or deletion of DNA segments at defined DNA sequence sites. It often occurs during movement of transposable elements, which entails a dedicated enzyme cleaving specific sequences, often very few in number, that flank the element. Many site-specific recombination systems have been modified for use as genetic tools in gene targeting.

DNA Recombination, Mechanisms of, Fig. 6 Model experiment using Cre-lox site-specific recombination. loxP sites are introduced by DNA transfection into the genome of a mouse to flank one exon of a target gene of interest (loxP mouse). A second mouse carries a cre construct in its genome, where cre expression is controlled by a specific promoter (cre mouse). When the two mice are crossed, progeny that carry the loxP construct (surrounding the target gene exon) and the cre gene are produced (right panel). When the Cre recombinase is expressed in cells that activate the cre promoter, the protein catalyzes site-specific recombination between the loxP sites. The result is a deletion of an exon within the gene of interest and disruption of the gene function or gene knockout

One of these, the Cre/lox system,
which originates from bacteriophage P1, has been modified and co-opted for use in eukaryotes to induce breaks and recombination at specific sequences, most commonly for gene targeting in mouse models. The Cre (for causes recombination) enzyme recognizes and binds to loxP sites. loxP (for locus of crossing-over (X) P1) is a 34 base pair (34 bp) region that consists of an asymmetric octamer sequence flanked by two palindromic 13 bp sequences. After
binding, Cre promotes cleavage and catalyzes rejoining between loxP sites and thus can be used to conditionally turn off or turn on a gene of interest. This is often used to generate targeted gene knockouts through controlled expression of Cre in the mouse (see Fig. 6). The resulting recombination outcome, either a deletion or an inversion of the intervening DNA, depends on the relative orientation of the loxP sites. This system can also be used to express a gene in tissues where it is not normally expressed or at a time when it is not normally expressed (i.e., ectopic gene expression).

Another site-specific cleavage enzyme is FLP (for flippase recombinase), which derives from the endogenous two-micron plasmid of the yeast S. cerevisiae. During its replication, this plasmid undergoes an amplification event that is promoted by FLP cleavage at inverted repeat sequences called FRT (for FLP Recognition Target) sites. The mechanism of action of the FLP recombinase is analogous to that of the Cre/lox recombination system, as it recognizes and cleaves specifically at an FRT site, which also consists of a 34 bp DNA segment with an 8 bp asymmetric core flanked by two palindromic 13 bp sequences. The FLP/FRT system has been exported to and exploited in different model organisms such as Drosophila to promote specific double-strand breaks at introduced FRT sites relatively close to a gene of interest. Induction of double-strand breaks by regulated FLP expression during somatic growth induces mitotic crossing over via HR. Depending on the segregation of the crossover chromosomes during cell division, some daughter cells will now be homozygous for large regions of DNA containing the gene of interest (or a specific mutant allele). This can result in loss of heterozygosity for the chromosome in question and is a method to uncover recessive mutations in a controlled fashion. The method is particularly useful if the mutations act at more than one stage during development, since later effects can then be observed without being masked by earlier roles of the genes, or if loss of the gene leads to early lethality.

Transcription-Associated DNA Breaks

Studies have shown that transcription can stimulate mitotic recombination and regulate the frequency of recombination events (Aguilera 2002; Gottipati and Helleday 2009). This type of recombination is called transcription-associated recombination or TAR. TAR has been reported to occur in both prokaryotes and eukaryotes and to play roles in both maintaining and compromising genome stability. In eukaryotes, for example, TAR can be promoted by RNA polymerase I, which is involved in transcription of the ribosomal DNA sequences, or RNA polymerase II, which transcribes the coding sequences and long noncoding DNA sequences into RNA. Interestingly, V(D)J recombination is also stimulated by transcription (Aguilera 2002). Two basic and nonexclusive mechanisms have been proposed to explain TAR, the accessibility and the collision models (Gottipati and Helleday 2009; Fig. 7).

DNA Recombination, Mechanisms of, Fig. 7 Transcription-associated recombination (TAR) mechanisms. Stalled replication forks may form when the transcription and replication machineries collide with each other (collision model). Opening of the duplex DNA by the RNA polymerase complex can generate regions of single-stranded DNA that are susceptible to nuclease and DNA-damaging agents (accessibility model). In addition, transcription results in negative supercoiling on the duplex DNA behind the RNA polymerase, which can facilitate the formation of DNA/RNA hybrids (R-loops) by the nascent mRNA. R-loops represent strong roadblocks to the replication machinery and thus may require rescue by HR and other repair pathways (Adapted from Gottipati and Helleday 2009)

The first model implies that the open DNA region left as a consequence of transcriptional elongation provides better accessibility for
binding of proteins like DNA nucleases or an easier target for DNA-damaging agents that ultimately can stimulate recombination by acting as an initiation site for recombination repair. The second type of mechanism is based on a collision of two DNA machineries, the transcription and replication complexes. In this model, the two machineries either collide head-on or one bumps into the other from behind. Regardless, the collision can cause an accumulation of supercoiling around the collision and an extended RNA/DNA hybrid duplex behind the stalled RNA polymerase complex. This blockage or obstruction can lead to DNA nicks and breaks to release the topological tension and to allow one machinery to pass the other (Fig. 7). Some of the possible DNA structures that can arise as a consequence of replication fork stalling and transcription machinery stalling can resemble recombination intermediates and thus trigger recombination repair processes. A stalled replication fork produced either by DNA blockages or by collisions between machineries is susceptible to breakage at the single-stranded region. One mechanism to rescue a stalled fork involves its reversal through specialized enzymes for subsequent template switching-dependent DNA synthesis. Template switching involves the use of the nascent DNA strand on the undamaged chromatid as a template for synthesis. The mechanisms and protein factors involved in the fork reversal and template switching reactions are discussed in a separate essay later in this section. Importantly, fork reversal generates two recombinogenic structures: a free duplex DNA end, which is essentially a one-ended double-strand break at the reversed paired nascent DNA strands, and a four-way junction similar to a Holliday junction, which is susceptible to cleavage by structure-specific nucleases. 
Suffice it to say here that HR is essential to complete a successful replication, as the DNA replication machinery encounters blocks that can lead to double-strand breaks in almost every replication cycle.

Replication-Induced DNA Breaks

In addition to the roadblocks to unhindered replication fork movement along the DNA caused by the transcription machinery, DNA base adducts, single-strand nicks, secondary DNA structures or hairpins, protein-DNA complexes, and imbalances in the dNTP pools can also cause replication forks to stall or collapse. These situations leave the DNA in a compromised state, with single-stranded regions exposed to DNA nucleases or to additional damage from endogenous or exogenous sources. When the replication machinery reaches a nick on the DNA template, a double-strand break is formed. Under some circumstances, the nick becomes a one-ended double-strand break, which can ultimately lead to a collapsed replication fork. One-ended breaks can be rescued by an HR-dependent mechanism known as break-induced replication or BIR (Fig. 1). A bulky base adduct or some other type of blockage can also cause the fork to stall and collapse. The stalled fork may be bound by DNA nucleases, resulting in double-strand breaks, or the fork may reverse to promote template switching, forming a four-way junction or what is sometimes referred to as a “chicken-foot” intermediate. As explained above, this structure has two recombination-type features that can promote a recombination response, and resumption of DNA replication thus depends on HR proteins.

Chromosome fragile sites are defined cytologically as chromosome regions that are prone to breakage during replication stress, caused either by inhibition of the replicative DNA polymerases or by an alteration in the dNTP pools. They occur at defined chromosomal regions and are often characterized by inverted repeat sequences that can form DNA hairpin structures (Fig. 8). Fragile sites are overrepresented in tumor cells as sites of genomic DNA rearrangements. Studies in yeast using reduced levels of the replicative DNA polymerase alpha have provided possible explanations for how fragile sites can lead to DNA breaks (Lemoine et al. 2005). This scenario results in a significant increase in chromosome breakage, chromosome loss, and genomic translocations. Most of the resulting rearrangements involve the yeast transposable elements, which are organized as dispersed repeats throughout the genome. One explanation is that low levels of DNA polymerase result in a delay in Okazaki fragment synthesis on the lagging strand, which leads to larger single-stranded DNA regions that can form hairpin structures involving the transposable element sequences (Fig. 8). Processing of the hairpin by nucleases can then lead to double-strand breaks. A second, nonexclusive explanation is that these inverted repeats become extruded as cruciform structures, which are basically paired hairpins (Fig. 8). This cruciform structure resembles a Holliday junction and is thus susceptible to nucleolytic cleavage by the recombination machinery. Alternatively, it can be replicated without the removal of the single-stranded loop and produce a palindromic chromosome and further rearrangements after each replication cycle.

DNA Recombination, Mechanisms of, Fig. 8 Mechanisms of double-strand break formation at fragile sites. When DNA replication is slowed down by inhibition of replicative DNA polymerases or by an alteration in the dNTP pools, large regions of single-stranded DNA can form on the lagging strand. If inverted repeat sequences, shown by the arrows, are located in these regions, they can form large hairpin structures that are prone to nucleolytic cleavage (red triangles) by DNA nucleases. Alternatively, these inverted repeats can be extruded as cruciform intermediates that resemble recombination structures and thus are prone to nucleolytic attack (red arrows). The processing of the hairpin and cruciform structures ultimately leads to double-strand breaks and repair by the HR machinery

Interestingly, other studies have shown that insertion of mammalian short repeated sequences (Alu elements) into the yeast genome increases HR events nearly 1,000-fold, leading to the formation of double-strand breaks with terminal hairpin structures, which if left unrepaired can lead to inverted chromosome duplications (Lobachev et al. 2002). Importantly, Alu element insertion has been implicated in several inherited human diseases and in carcinogenesis. Clearly, such repeat sequence organization is selected against by recombination repair and other repair pathways or is highly controlled to prevent hairpin and cruciform structure formation and thus genome instability.
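Both the hairpin and the cruciform mechanisms depend on inverted repeats, that is, a sequence followed at some distance by its reverse complement on the same strand. As a rough illustration only, the following Python sketch scans a sequence for such repeats. The function names, the fixed arm length, and the maximum loop size are assumptions chosen for this example and are not taken from the studies cited; a realistic analysis would also allow mismatches and consider folding energetics.

```python
# Minimal sketch: locate perfect inverted repeats that could, in principle,
# fold into a hairpin when the strand is single-stranded.
# All names and default parameters here are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_inverted_repeats(seq: str, arm: int = 6, max_loop: int = 20):
    """Return (left_arm_start, loop_length) for every perfect inverted repeat.

    A hit means that seq[start:start + arm] is followed, after `loop_length`
    unpaired nucleotides, by its exact reverse complement.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm + 1):
        target = reverse_complement(seq[start:start + arm])
        for loop in range(max_loop + 1):
            right = start + arm + loop
            if right + arm > len(seq):
                break  # the right arm would run past the end of the sequence
            if seq[right:right + arm] == target:
                hits.append((start, loop))
    return hits
```

For the toy sequence CCGAAGTTTTTTAACTTCGG, the scan reports 6-nt arms starting at positions 0, 1, and 2 (with loops of 8, 6, and 4 nt), which together describe a single 8-bp stem closed by a 4-nt loop.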

References

Aguilera A (2002) The connection between transcription and genomic instability. EMBO J 21:195–201
Barlow JH, Lisby M, Rothstein R (2008) Differential regulation of the cellular response to DNA double-strand breaks in G1. Mol Cell 30:73–85
Cejka P, Cannavo E, Polaczek P, Masuda-Sasa T, Pokharel S, Campbell JL, Kowalczykowski SC (2010) DNA end resection by Dna2-Sgs1-RPA and its stimulation by Top3-Rmi1 and Mre11-Rad50-Xrs2. Nature 467:112–116
Doksani Y, Bermejo R, Fiorani S, Haber JE, Foiani M (2009) Replicon dynamics, dormant origin firing, and terminal fork integrity after double-strand break formation. Cell 137:247–258
Gottipati P, Helleday T (2009) Transcription-associated recombination in eukaryotes: link between transcription, replication and recombination. Mutagenesis 24:203–210
Heyer WD, Ehmsen KT, Liu J (2010) Regulation of homologous recombination in eukaryotes. Annu Rev Genet 44:113–139
Huertas P, Cortes-Ledesma F, Sartori AA, Aguilera A, Jackson SP (2008) CDK targets Sae2 to control DNA-end resection and homologous recombination. Nature 455:689–692
Lemoine FJ, Degtyareva NP, Lobachev K, Petes TD (2005) Chromosomal translocations in yeast induced by low levels of DNA polymerase: a model for chromosome fragile sites. Cell 120:587–598
Lobachev KS, Gordenin DA, Resnick MA (2002) The Mre11 complex is required for repair of hairpin-capped double-strand breaks and prevention of chromosome rearrangements. Cell 108:183–193
Mimitou EP, Symington LS (2008) Sae2, Exo1 and Sgs1 collaborate in DNA double-strand break processing. Nature 455:770–774
Morrow DM, Connelly C, Hieter P (1997) ‘Break copy’ duplication: a model for chromosome fragment formation in Saccharomyces cerevisiae. Genetics 147:371–382
Nassif N, Penney J, Pal S, Engels WR, Gloor GB (1994) Efficient copying of nonhomologous sequences from ectopic sites via P-element-induced gap repair. Mol Cell Biol 14:1613–1625
Nicolette ML, Lee K, Guo Z, Rani M, Chow JM, Lee SE, Paull TT (2010) Mre11-Rad50-Xrs2 and Sae2 promote 5′ strand resection of DNA double-strand breaks. Nat Struct Mol Biol 12:1478–1485
Niu H, Chung WH, Zhu Z, Kwon Y, Zhao W, Chi P, Prakash R, Seong C, Liu D, Lu L, Ira G, Sung P (2010) Mechanism of the ATP-dependent DNA end-resection machinery from Saccharomyces cerevisiae. Nature 467:108–111
San Filippo J, Sung P, Klein H (2008) Mechanism of eukaryotic homologous recombination. Annu Rev Biochem 77:229–257
Szostak JW, Orr-Weaver TL, Rothstein RJ, Stahl FW (1983) The double-strand-break repair model for recombination. Cell 33:25–35
Wyman C, Warmerdam DO, Kanaar R (2008) From DNA end chemistry to cell-cycle response: the importance of structure, even when it’s broken. Mol Cell 30:5–6
Zhu Z, Chung WH, Shim EY, Lee SE, Ira G (2008) Sgs1 helicase and two nucleases Dna2 and Exo1 resect DNA double-strand break ends. Cell 134:981–994


DNA Repair

I. Robert Lehman
Department of Biochemistry, Beckman Center, Stanford University School of Medicine, Stanford, CA, USA

Synopsis

DNA damage threatens the integrity of the genome, which is paramount to the inheritance of genetic material. There are many types of DNA damage and, correspondingly, numerous DNA repair mechanisms. Historically, UV damage and its repair were the first to be discovered. Environmental exposure to UV and to ionizing radiation remain important contributors to the most frequent and the most severe types of damage in humans, respectively, which if unrepaired can result in chromosome loss. This entry examines the mechanisms and regulation of the DNA repair processes that are crucial to all organisms.

Introduction

Our current understanding of the mechanisms of the repair of damage to DNA had its origin nearly 60 years ago with the investigation of a rather esoteric subject, the radiation (UV) sensitivity of bacteria. Initially the field was dominated by bacterial geneticists whose work led to the isolation of large numbers of radiation-sensitive mutants and ultimately the identification of bacterial genes whose products were essential for the repair of the radiation-induced lesions. An important advance in the 1960s was made with the demonstration that patients with a human inherited disease, xeroderma pigmentosum (XP), were extremely sensitive to ultraviolet light and were defective in their ability to repair the DNA damage resulting from UV exposure. The fact that these patients were also cancer prone prompted a much increased interest in DNA repair and, in particular, DNA repair in humans. It is now clear that DNA can be damaged in a variety of ways in addition to ultraviolet radiation.


These include errors occurring during DNA replication, exposure to products of normal cellular metabolism (hydroxyl radicals) and to various chemicals (particularly alkylating agents), and the relative lability of certain chemical bonds in DNA (the glycosylic bond in purine nucleotides and the exocyclic amino groups of cytosine and adenine). Beginning with the identification of the genetic defect in patients with xeroderma pigmentosum, geneticists identified a large number of genes in addition to the XP gene in humans whose products are essential for DNA repair. An important aspect of DNA repair, in both bacteria and eukaryotes, including humans, has been the regulation of the repair process. The first regulatory pathway to be defined was the SOS response in E. coli. In the SOS response, which is observed particularly at the high levels of DNA damage that bring DNA replication to a halt, a group of proteins is induced. Some of these proteins function in DNA repair, permitting replication bypass of the lesions produced by the damaging agents and enhancing the recombinational mode of repair (see below). In mammalian cells, regulation of DNA repair is under the control primarily of the tumor suppressor gene, p53, and the ATM protein kinase (defective in the human disease ataxia-telangiectasia). ATM acts to halt DNA replication in the S phase of the cell cycle until the repair process is complete, thereby permitting normal error-free replication to proceed.

The Biochemistry of DNA Repair

Essentially six types of DNA repair have been identified and their mechanisms described to various extents of refinement: nucleotide excision repair, base excision repair, mismatch repair, direct repair, recombinational repair, and double-strand break repair.

Nucleotide Excision Repair

DNA lesions that cause large distortions of the helical structure of DNA are repaired by this system, which recognizes such bulky lesions as the cyclobutane pyrimidine dimers and 6-4 photoproducts generated by exposure of DNA to ultraviolet light. In E. coli, where the mechanism of nucleotide excision repair has been described in most detail, the repair enzyme is made up of three subunits, products of the uvrA, uvrB, and uvrC genes (the ABC excinuclease). The nucleolytic activity of the heterotrimeric enzyme is novel in the sense that two cuts are made at the site of the pyrimidine dimer, one on the 5′ side and the other on the 3′ side of the lesion. A helicase, the product of the uvrD gene, then removes the resulting 12–13 nucleotide oligonucleotide that spans the pyrimidine dimer. The resulting gap in the DNA duplex is filled in by a DNA polymerase using the complementary undamaged strand as a template and sealed by a DNA ligase, thus removing the pyrimidine dimer lesion and restoring the normal nucleotide sequence.

As in E. coli, excision of pyrimidine dimers and other bulky lesions in mammalian cells is mediated by the action of a general damage-recognition-specific bimodal incision activity formally equivalent to the bacterial ABC excinuclease, followed by the action of a DNA polymerase and ligase to restore genomic integrity. Since, unlike that of E. coli and other prokaryotes, DNA in mammalian cells is complexed with histones in the form of chromatin, accessibility of the damaged DNA to the excision repair enzyme presents a significant problem, which is currently under investigation but has not yet been solved. The process of nucleotide excision repair in mammalian cells is influenced by active transcription of the genome by RNA polymerase II. A large literature has accumulated on the differences in repair in transcribed and untranscribed DNA strands. Of particular interest in this regard has been the finding that nucleotide excision repair is directly associated with transcription and, indeed, that the human basal transcription factor associated with RNA polymerase II, TFIIH, includes known excision repair proteins.
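The E. coli pathway described above (dual incision, removal of the oligonucleotide, templated gap filling) can be pictured with a small string-based model. This is a sketch under stated assumptions rather than a description of the enzymology: the function names are hypothetical, the dimer is marked as two lowercase letters, and the flank sizes of 7 nt on the 5′ side and 3 nt on the 3′ side are illustrative values chosen only so that the excised fragment is 12 nt long, within the 12–13 nt range given in the text.

```python
# Toy model of nucleotide excision repair on a single damaged strand.
# Names, the lowercase damage marking, and the flank sizes are assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order,
    so that position i of the result pairs with position i of the input."""
    return "".join(COMPLEMENT[base] for base in strand)


def excision_repair(damaged: str, template: str, dimer_start: int,
                    flank5: int = 7, flank3: int = 3):
    """Excise the stretch around a 2-nt dimer and refill it from the template.

    damaged     -- damaged strand, 5'->3', dimer written in lowercase
    template    -- undamaged partner strand, aligned position by position
    dimer_start -- index of the first nucleotide of the dimer
    Returns (excised_oligonucleotide, repaired_strand).
    """
    cut5 = dimer_start - flank5        # incision on the 5' side of the lesion
    cut3 = dimer_start + 2 + flank3    # incision on the 3' side of the lesion
    if cut5 < 0 or cut3 > len(damaged):
        raise ValueError("lesion too close to the end of the strand")
    excised = damaged[cut5:cut3]                      # fragment removed
    patch = complement_strand(template[cut5:cut3])    # templated gap filling
    repaired = damaged[:cut5] + patch + damaged[cut3:]  # ligation
    return excised, repaired
```

With an intact strand GACCGATGCATTGCAGTCAGGCTA, a dimer at positions 10 and 11, and the complementary strand as template, the model excises the 12-nt fragment CGATGCAttGCA and returns the original, undamaged sequence.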

Base Excision Repair

Base excision repair is catalyzed by a class of enzymes collectively known as DNA glycosylases, which recognize particularly common lesions such as the products of cytosine and adenine deamination and remove the affected base by cleaving the N-glycosylic bond to create an apyrimidinic or apurinic site in the DNA (referred to as AP sites). An important member of this group of enzymes is the uracil N-glycosylase, which removes from DNA the uracil that results from the spontaneous deamination of cytosine, a particularly facile reaction. Under the conditions found in a typical cell, spontaneous deamination of cytosine in DNA to form uracil will occur in approximately one in every 10⁷ cytosines in a 24 h period, corresponding to about 100 spontaneous mutations in a day per cell. (The deamination of adenine to form hypoxanthine is approximately 1,000-fold slower.) The uracil formed by the deamination of cytosine will, of course, pair with adenine in the opposing strand rather than with guanine. The problem posed by cytosine deamination suggests a plausible reason for the presence of thymine rather than uracil in DNA. Cytosine deamination would gradually lead to a decrease in G-C base pairs and an increase in A-U base pairs in the DNA of all cells. Over thousands of years, spontaneous deamination of cytosine would eliminate the G-C pairs and the genetic code that depends upon them.

Other DNA glycosylases recognize and remove hypoxanthine and xanthine arising from the deamination of adenine and guanine, as well as alkylated (and mutagenic) bases such as 3-methyladenine, 7-methylguanine, and O6-methylguanine formed by exposure to various alkylating agents. Finally, pyrimidine dimers can also be removed by the action of a pyrimidine dimer-specific DNA glycosylase. Removal of any of these bases by the appropriate DNA glycosylase results in the generation of an AP site (see above).

Once an AP site is generated, it must be repaired. The repair does not occur simply by insertion of a new base (claims were made at one point for the existence of “base insertases”; however, those were ultimately disproved). The repair of an AP site generated by cleavage of the bond between the base and the deoxyribose in the backbone of the DNA duplex occurs by the action of an AP endonuclease, which cleaves the phosphodiester backbone near the AP site. The deoxyribose-5′-phosphate formed at the site of cleavage is then removed by the action of a 5′ → 3′ exonuclease. A DNA polymerase (DNA polymerase I in E. coli) then fills in the resulting gap, and the resulting nick is sealed by the action of a DNA ligase. The repair process is now complete and the undamaged nucleotide sequence is restored. DNA glycosylases and AP endonucleases have been identified in yeast and mammalian cells. However, their mechanisms have not yet been described in detail.
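The deamination figures quoted above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: the rate of one event per 10⁷ cytosines per day and the 1,000-fold slower adenine rate come from the text, whereas the figure of roughly 10⁹ cytosines per cell is an inference (it is the number implied by 100 events per day at that rate), and using the same count for adenines is a further simplifying assumption. The function name is hypothetical.

```python
# Back-of-the-envelope check of the spontaneous deamination numbers.
import math

CYTOSINE_RATE_PER_DAY = 1e-7      # from the text: 1 in 10^7 cytosines per 24 h
ADENINE_SLOWDOWN = 1000           # from the text: about 1,000-fold slower
CYTOSINES_PER_CELL = 1e9          # assumption implied by ~100 events per day


def expected_events(n_bases: float, rate_per_base_per_day: float,
                    days: float = 1.0) -> float:
    """Expected number of deamination events among n_bases over `days`."""
    return n_bases * rate_per_base_per_day * days


cytosine_events = expected_events(CYTOSINES_PER_CELL, CYTOSINE_RATE_PER_DAY)

# Assumption: roughly as many adenines as cytosines, for comparison only.
adenine_events = expected_events(
    CYTOSINES_PER_CELL, CYTOSINE_RATE_PER_DAY / ADENINE_SLOWDOWN
)
```

Under these assumptions the calculation gives about 100 cytosine deaminations and about 0.1 adenine deaminations per cell per day, which illustrates why uracil removal is such a heavily used repair activity.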

Mismatch Repair

The mismatch repair system identifies and repairs the rare mismatches that are left behind during the DNA replication process. As was the case with nucleotide excision repair, bacterial mutants were identified (mut mutants) that showed a hypermutation phenotype, that is, they showed a spontaneous mutation frequency that substantially exceeded that of the wild-type strain. Among these mutants are those in which an abnormally high frequency of mismatched base pairs arises by errors during DNA replication, as a consequence of a defect in the products of the mut genes, which scan the genome during replication and correct any errors (mismatches) that arise. The identification of the mut genes led to the discovery of the methyl-directed mismatch repair system. In this system, the mismatches are nearly always corrected to correspond to the information on the template strand. This discrimination is accomplished by tagging the old (template) DNA strand with methyl groups to distinguish it from the newly synthesized DNA strands. The actual strand discrimination is based on the action of the Dam methylase, which methylates DNA at the N6 position of all adenines that occur within (5′)GATC sequences. Immediately after a DNA segment is replicated, there is a short lag during which the newly synthesized strand remains unmethylated. It is this transient undermethylation of the GATC sequence in the newly synthesized strand that permits strand discrimination. Replication mismatches in the vicinity of a GATC sequence are then repaired according to the information in the methylated parent (template) strand.

The MutS, MutH, and MutL proteins, the products of the E. coli mutS, mutH, and mutL genes, are the key elements of the system. The process proceeds as follows. The MutS protein binds to a wide range of mismatched base pairs. The MutH protein first binds to the GATC sequences; MutL appears to be an interface protein linking the MutS and MutH proteins in a complex consisting of the three proteins. If only one of the two strands is methylated at the GATC sequence and a mismatched base pair exists within approximately 1,000 base pairs, the MutH protein acts as a site-specific endonuclease, cleaving the unmethylated strand on the 5′ side of the G in the GATC sequence, thereby marking the mismatched strand for repair. Additional steps in the repair sequence depend upon where the mismatch is located relative to the cleavage site. When the mismatch is on the 5′ side of the cleavage site, the unmethylated strand is unwound by a specific helicase, degraded in the 3′ → 5′ direction from the cleavage site by an exonuclease, and replaced by the action of a DNA polymerase (the replicative DNA polymerase III) and a single-strand DNA-binding protein. The DNA ligase then completes the process. The repair of mismatches on the 3′ side of the cleavage site is similar, except that a different exonuclease, which degrades single-stranded DNA in both the 5′ → 3′ and 3′ → 5′ directions, is used.

Energetically, mismatch repair is a particularly expensive process. The mismatch may be a kilobase or more away from the GATC sequence. The degradation and replacement of a DNA strand of this magnitude represents an enormous investment in activated deoxynucleotide precursors to repair a single mismatch, providing a graphic illustration of the importance of DNA repair to the organism. It is noteworthy that all mismatches are recognized and replaced by this mismatch repair system, but not equally well. Specifically, G-T mismatches, which are the most common, are replaced most efficiently, whereas C-C mismatches, which occur rather infrequently, are repaired poorly.
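The strand-discrimination rule of the methyl-directed system can be summarized as a small decision procedure. The following Python sketch is an illustration under stated assumptions, not a model of the actual protein complex: the class and function names are hypothetical, the 1,000 bp window is the approximate figure given in the text, and picking the hemimethylated GATC site nearest to the mismatch is a simplification introduced here for the example.

```python
# Sketch of methyl-directed strand discrimination at GATC sites.
# Names and the "nearest site" rule are assumptions for illustration.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GatcSite:
    position: int            # index of the G of GATC on the top strand
    top_methylated: bool     # adenine methylated on the top strand
    bottom_methylated: bool  # adenine methylated on the bottom strand


def find_gatc(seq: str) -> List[int]:
    """Return the start positions of every GATC in the sequence.
    GATC is its own reverse complement, so each hit is a site on both strands."""
    return [i for i in range(len(seq) - 3) if seq[i:i + 4] == "GATC"]


def strand_to_nick(site: GatcSite) -> Optional[str]:
    """The unmethylated (newly synthesized) strand of a hemimethylated site
    is the one cleaved; other sites give no discrimination in this sketch."""
    if site.top_methylated and not site.bottom_methylated:
        return "bottom"
    if site.bottom_methylated and not site.top_methylated:
        return "top"
    return None


def choose_incision(sites: List[GatcSite], mismatch_pos: int,
                    max_distance: int = 1000) -> Optional[Tuple[int, str]]:
    """Pick the nearest hemimethylated GATC within max_distance of the mismatch
    and report (site position, strand to be cleaved), or None if there is none."""
    usable = [s for s in sites
              if strand_to_nick(s) is not None
              and abs(s.position - mismatch_pos) <= max_distance]
    if not usable:
        return None
    nearest = min(usable, key=lambda s: abs(s.position - mismatch_pos))
    return nearest.position, strand_to_nick(nearest)
```

In this model a fully methylated site is ignored even when it lies closest to the mismatch, which mirrors the requirement that only one of the two strands be methylated for cleavage to occur.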

The enzymology of mismatch repair in humans is currently being investigated by a number of laboratories. Although there are differences from the pathway in E. coli, in particular the lack of a methylated sequence to mark a strand for repair, the process seems to have been largely conserved. In a particularly striking demonstration of how basic, untargeted research can lead to breakthroughs in health-related or more applied areas, it has been shown that defects in hMSH2 or hMLH1, the human analogues of the E. coli MutS and MutL proteins, are associated with hereditary nonpolyposis colorectal cancer (HNPCC). This is one of the most common genetic diseases in humans and affects as many as 1 in 200 individuals. Despite the name, affected individuals also develop tumors of the endometrium, ovary, and other organs; HNPCC accounts for 4–13% of all colorectal cancers in the developed world.

Direct Repair

As noted above, nucleotide excision, base excision, and mismatch repair all depend upon the double-stranded structure of DNA, with the undamaged strand providing a template for repair of the damaged strand after removal of the lesion by the repair process. There are, however, types of damage that are repaired without complete removal of the affected nucleotide or base. The best characterized of these are the direct photorepair of pyrimidine dimers, a reaction promoted by the enzyme photolyase, and the repair of O6-methylguanine, which forms by exposure to alkylating agents. The latter is a common and highly mutagenic (and lethal) lesion.

As noted previously, pyrimidine dimers are bulky lesions that result from a UV-light-induced reaction. Photolyases use the energy derived from absorbed (visible) light to reverse this damage. The photolyases contain two cofactors which serve as light-absorbing chromophores. One of the chromophores is the riboflavin derivative, FADH2, and the other is in most cases a folic acid derivative (a pterin). The photochemical mechanism is as follows. Absorption of a photon by the pterin chromophore, MTFH, yields the excited state of the chromophore, *MTFH, which transfers the excitation energy to the catalytic cofactor, FADH2. The excited state of the flavin (*FADH2) initiates monomerization of the pyrimidine dimer by transferring an electron to it. Back electron transfer from the pyrimidine dimer radical formed in the process completes monomerization and regenerates the catalytically active flavin.

Direct repair of O6-methylguanine is carried out by the O6-methylguanine DNA methyltransferase, which promotes the transfer of the methyl group of the O6-methylguanine to a specific cysteine residue of the same protein. The methyltransferase is not strictly an enzyme because a single methyl transfer event inactivates the protein (the protein commits suicide during the repair process). The consumption of an entire protein molecule to correct a single damaged base (albeit a particularly nasty lesion) is another vivid example of the central importance of maintaining the integrity of the genome.

Recombinational DNA Repair

Although genetic recombination serves many cellular functions, one of its most important functions is in DNA repair. As described above, most forms of DNA repair are predicated on the fact that a DNA lesion on one strand can be accurately repaired because the genetic information is present in an undamaged complementary strand. In certain types of lesions in DNA, such as double-strand breaks and double-strand cross-links, the complementary strand itself is damaged or absent. When this occurs, the information required for accurate DNA repair must come from a separate homologous chromosome, and the repair involves homologous recombination or recombinational repair. A current and plausible pathway for recombinational repair in E. coli involves the product of the recA gene, the RecA protein (recA mutants are completely incapable of promoting genetic recombination). As noted above, a lesion in an unpaired DNA strand cannot be excised by the base or nucleotide excision pathways since this would leave breaks in both DNA strands, an outcome that would be lethal to the cell. To prevent chromosomal breakage and allow for repair, the lesion must acquire a complementary strand. The recombinational repair pathway makes use of the intact homologous DNA molecule. The RecA protein, which exists as a multi-subunit filament, promotes a strand exchange reaction between the two duplex DNA molecules (one, the mutant containing the lesion, and the other, wild type). The strand exchange, which requires ATP hydrolysis, leads to the generation of a transient four-stranded structure known as a Holliday intermediate. The Holliday intermediate is then processed by specific nucleases to form a heteroduplex DNA molecule containing the lesion, which is now a substrate for nucleotide excision repair.

Following the discovery of the strand exchange activity of the RecA protein and its role in homologous recombination in E. coli, the search for enzymes analogous to the RecA protein in eukaryotes began and led to the discovery, first in yeast and then in human cells, of the Rad 51 protein, so named because it is the product of the rad 51 gene, mutations in which lead to severe radiosensitivity. (The rad 51 gene is a member of the rad 51 epistasis group, which includes rad 51, rad 52, and rad 54; mutants in any of these genes are highly radiation sensitive.) The action of the human Rad 51 protein in recombinational repair, including formation of a Holliday intermediate, closely resembles that described for the E. coli RecA protein. The products of the BRCA1 and BRCA2 genes, mutations in which result in breast and ovarian cancer susceptibility, are known to interact with the human Rad 51 and Rad 52 proteins and are thought to direct the Rad 51 and Rad 52 proteins to sites of DNA damage so as to initiate the recombinational repair process.

Double-Strand Break Repair

Double-strand breaks result from the exposure of DNA to a number of agents, most notably ionizing radiation. Two mechanisms for the repair of double-strand breaks have been described in humans. The first, homologous end joining, involves the Rad 51, Rad 52 pathway. The second, nonhomologous end joining (NHEJ), involves a group of proteins: Ku, a DNA protein kinase (DNA-PKcs), a DNA ligase (DNA ligase 4-XRCC4), a DNA polymerase, and a protein of as yet unidentified function known as Cernunnos. The current picture of the reactions involved in NHEJ is as follows. The broken ends are juxtaposed by the action of Ku and DNA-PKcs and joined by a DNA polymerase, DNA ligase, and Cernunnos, generally with the loss of one or more nucleotides at the site of joining. Consequently, DNA repair by NHEJ is inherently mutagenic. Of particular interest, NHEJ participates in the programmed recombination pathway that occurs during the formation of immunoglobulin genes from gene segments that are separated in the genome.

Cross-References

▶ DNA Damage, Types of
▶ DNA Repair Polymerases
▶ Double-Strand Break Repair
▶ Homologous Recombination in Lesion Bypass
▶ Mismatch Repair
▶ Nucleotide Excision Repair
▶ Oxidative DNA Damage
▶ Ultraviolet Light DNA Damage

DNA Repair Polymerases

Giuseppe Villani and Nicolas Tanguy Le Gac
Institut de Pharmacologie et de Biologie Structurale, CNRS-Université Paul Sabatier Toulouse III, Toulouse, France

Synopsis

To cope with DNA damage, cellular organisms possess evolutionarily conserved mechanisms to remove DNA lesions and restore the original genetic information. Most of these pathways require a resynthesis step during which the intact strand serves as a template in repairing the damaged one. Specialized enzymes called DNA polymerases catalyze this DNA synthesis. This review focuses on the role of the numerous prokaryotic and eukaryotic DNA polymerases identified to date in the major DNA repair pathways: base excision repair (BER), nucleotide excision repair (NER), double-strand break repair (DSBR), cross-link repair (CLR), and mismatch repair (MMR).

Introduction

The structure of DNA is constantly subjected to alteration by physical agents such as UV light and ionizing radiation or by chemicals found in the environment. In addition to the action of these exogenous agents, DNA is also damaged by a plethora of endogenous processes and cellular metabolites, such as hydrolysis, reactive oxygen species, and small reactive intracellular molecules. The number of lesions introduced daily into DNA can be very high; for example, it has been calculated that approximately 10⁴ spontaneous depurination events per day occur in the DNA of a human cell, and this estimate can reach 10⁵ when all possible modifications are taken into consideration (Lindahl and Wood 1999). Such alterations can affect genetic stability and result in cellular death, degenerative changes, or aging. To answer these threats, all cellular organisms, as well as some viruses, have developed evolutionarily conserved enzymatic systems capable of reducing the genotoxic consequences of DNA damage. Therefore, different DNA repair systems exist that are capable of removing DNA lesions and restoring the original information. Generally speaking, DNA repair processes are initiated by the recognition and removal of damage by specialized proteins. Subsequently, with the exception of the direct damage reversal (DDR) repair system, a step of DNA resynthesis, in which one strand serves as a template in repairing the other, is absolutely required. This DNA synthesis is accomplished by specialized DNA-directed enzymes capable of catalyzing DNA chain growth, called DNA polymerases (Kornberg and Baker 2005; Hübscher et al. 2010). At least 20 prokaryotic and eukaryotic DNA polymerases have been identified to date and, based on sequence homology and structural similarities, they have been grouped into different families and ascribed to different cellular functions (Hübscher et al. 2010). This subsection summarizes published data indicating a role for a number of prokaryotic or eukaryotic DNA polymerases in the resynthesis step of the major DNA repair processes, including base excision repair (BER), nucleotide excision repair (NER), double-strand break repair (DSBR), cross-link repair (CLR), and mismatch repair (MMR).

Brief Description of the Repair Mechanisms Examined

Base excision repair (BER) is considered the predominant defense system for eliminating DNA lesions generated by alkylating agents, reactive oxygen species, and spontaneous base loss or strand breakage in mammalian cells. This process generates small single-stranded DNA gaps, which can be filled by different DNA polymerases. BER proceeds by at least two sub-pathways, single-nucleotide BER (SN-BER) and long-patch BER (LP-BER), distinguished by the size of the excision repair patch. SN-BER is thought to be the major BER pathway in both bacterial and mammalian cells and is coordinated by multiple protein-protein interactions.

Nucleotide excision repair (NER) senses the presence of a lesion through the distortion it causes to the DNA structure. The damaged strand is recognized and a short oligonucleotide around the lesion is excised, leaving a gap that is filled by one or more DNA polymerases. NER comprises two sub-pathways: global genome repair (GGR), which repairs lesions in transcriptionally silent regions of the genome, and transcription-coupled repair (TCR), which repairs lesions in transcriptionally active regions. However, the major difference between these two pathways lies in the mechanism by which the lesion is recognized rather than in the resynthesis step.


Double-strand break repair (DSBR) repairs the otherwise lethal DNA breaks that arise mainly when the replication fork encounters a nick. There are two major pathways for the repair of double-strand DNA breaks. The first proceeds through homologous recombination (HR) and occurs during the late S or G2 phases of the cell cycle, while the second, named nonhomologous end joining (NHEJ), involves rejoining of what remains of the two DNA ends and can tolerate nucleotide loss or addition at the rejoining sites.

Cross-link repair (CLR) comprises the repair of two types of lesions. The first type acts on lesions generated by bifunctional agents that cross-link the two strands of DNA and is defined as interstrand cross-link repair (ICL). The second type repairs DNA-protein cross-links that are formed in cells as by-products of DNA metabolism.

Mismatch repair (MMR) is a process that corrects mismatches generated during DNA replication that have escaped proofreading and ensures fidelity in genetic recombination. The resynthesis step of MMR requires the synthesis of long tracts of DNA.

The prokaryotic and eukaryotic DNA polymerases thought to be involved in the DNA repair processes mentioned above are listed in Table 1. The role of each polymerase within each process is discussed in detail in the following parts of the subsection.

Base Excision Repair

Short Overview of Prokaryotic DNA Polymerases Participating in Base Excision Repair

The E. coli base excision repair pathway has been reconstituted in vitro using five purified enzymes: uracil-DNA glycosylase, endonuclease IV, RecJ exonuclease, DNA polymerase I (pol I), and DNA ligase (Dianov and Lindahl 1994). The absence of pol I prevented rejoining of the damaged strand since DNA ligase showed little activity at an incised abasic site. In addition to DNA polymerase activity, pol I contains 5′→3′ (nick-translation) and 3′→5′ (proofreading) exonuclease activities.

DNA Repair Polymerases, Table 1 DNA repair polymerases

DNA repair process | Prokaryotes | Eukaryotes
BER | DNA polymerase I | DNA polymerases β, λ, γ, θ, δ, ε, ι
NER | DNA polymerases I, II(a), III(a) | DNA polymerases δ, ε, κ
DSBR, nonhomologous end joining | LigD | DNA polymerases λ, μ; Sc DNA polymerases ε, δ, α; TdT
DSBR, recombinational repair | DNA polymerases III, IV, II, I | DNA polymerases η, δ, ε, α
CLR, interstrand cross-link | DNA polymerases I, IV | DNA polymerases ε, κ, η, ι, ν; Rev 1
CLR, DNA-protein cross-link | DNA polymerase IV | DNA polymerases κ, ν
MMR | DNA polymerase III | DNA polymerases δ, ε(a)

Sc, Saccharomyces cerevisiae
(a) Only a putative role has been proposed for these DNA polymerases

The 5′→3′ exonuclease subdomain of pol I is essential for removal of the RNA primer of Okazaki fragments before the polymerase fills in the newly formed gaps. During base excision repair, the 5′→3′ exonuclease function of pol I is known to be strongly but not completely inhibited by the deoxyribose-phosphate (dRP) residue at the 5′ terminus. However, bifunctional DNA N-glycosylases/AP lyases such as Fpg and Nei, or the RecJ exonuclease, can remove 5′-dRP residues before gap filling by pol I during short-patch BER. Moreover, E. coli Fpg and Nei are able to carry out a δ-elimination reaction that removes the deoxyribose residue and generates 3′-phosphate (3′-P) termini, which are then processed by the 3′-phosphatase activities of the E. coli AP endonucleases (exonuclease III and endonuclease IV). In long-patch BER, pol I can displace the dRP-containing strand while filling the gap. Cleavage of the displaced strand by the 5′→3′ exonuclease activity of pol I removes the 5′-dRP along with 2–8 nucleotides. Moreover, pol I is also required during very short-patch repair, a pathway that reduces the mutagenesis of 5-methylcytosine and is dependent on a DNA mismatch endonuclease, the vsr gene product.

Eukaryotic DNA Polymerases Participating in Base Excision Repair

Currently, at least six mammalian DNA polymerases (pols β, δ, λ, ε, ι, and θ) and the mitochondrial DNA polymerase γ are implicated in BER. Their exact cellular roles remain uncertain because of their functional redundancy; in vitro biochemical characterization has therefore been instructive in understanding how these enzymes may function in BER.

Pols β and λ belong to the X family of DNA polymerases in vertebrate cells and possess a similar 8 kDa domain that carries an active dRP lyase activity (see Yamtich and Sweasy (2010) for review). Pol β consists of a single 39 kDa subunit that comprises catalytic and dRP lyase activities. In vitro studies have suggested a role of pol β in BER. While pol β is essential for survival in mice, pol β null embryonic cells survive in cell culture and have normal growth characteristics but are hypersensitive to alkylating agents and are defective in uracil-initiated base excision repair, suggesting that pol β plays a central role in BER in vivo (Yamtich and Sweasy 2010). Pol λ is a 67–70 kDa polypeptide that contains a BRCT domain, a proline/serine-rich domain, the 8 kDa domain, and a C-terminal polymerase domain and shows 32% identity to the corresponding region of pol β. Pol λ can function in a backup


BER pathway because cells that are pol λ deficient and immunodepleted for pol β cannot perform BER (Braithwaite et al. 2005).

Pol γ, the mitochondrial replicase, and pol θ are part of the A family of DNA polymerases in vertebrate cells. Pol γ is a heterotrimer composed of a 115–143 kDa large subunit and a 55 kDa dimeric subunit. Pol γ possesses dRP lyase activity, which is consistent with its essential role in mammalian mitochondrial BER pathways (see Liu and Demple (2010) for review). Pol θ is a heterotrimer composed of three subunits of 80, 90, and 100 kDa. It has been shown that the in vitro BER activity of pol θ-deficient cell extracts was significantly reduced compared to that of wild-type cellular extract. Pol θ also possesses the 5′-dRP lyase and single-nucleotide gap-filling activities that are usually associated with BER. These data suggest that pol θ plays a role in the BER pathway and backs up the BER functions of pol β.

LP-BER was identified as a PCNA-dependent BER pathway and involves the replacement of a short oligonucleotide containing an abasic site and two or more nucleotides 3′ to it. It was further demonstrated that this pathway partially involves aphidicolin-sensitive DNA polymerases. These findings pointed to a role for the multi-subunit replicative pols δ and ε, which belong to the B family of DNA polymerases. This assumption was confirmed by the reconstitution of LP-BER with purified human proteins, which required the presence of pol δ or ε.

Pol ι, belonging to the Y family of DNA polymerases, is an 80 kDa polypeptide that has an intrinsic dRP lyase activity and efficiently fills the 1- to 3-base gaps that are typical BER intermediates. It was therefore suggested that pol ι is implicated in specialized forms of base excision repair.

Biological Roles of BER DNA Polymerases

As stated previously, pols β, λ, γ, ι, and θ possess a dRP lyase activity that allows a DNA ligase to reconstitute the original template, while the other DNA polymerases require additional components, such as the 5′-exo-/endonuclease activity provided by FEN1. The repair of oxidized deoxyribose fragments at the 5′ terminus after a strand break would also require the action of FEN1,


resulting in LP-BER. Interestingly, single-turnover reaction kinetics indicated that the dRP lyase activity of pol β was at least 20-fold faster than its DNA synthesis activity (Prasad et al. 2010). This differed from earlier results obtained under steady-state reaction conditions, which indicated that the product-release step of the dRP lyase reaction was rate limiting.

Both SN-BER and LP-BER involve the processing of a single-nucleotide gap by DNA polymerases that catalyze template-guided gap-filling synthesis. The mechanism that controls the selection of either BER pathway is still not well understood. In vitro experiments showed that pols β, δ/ε, λ, ι, or θ can add complementary bases at the gap site. These data suggest that there is competition among different pols for the 3′-OH at the nick site. However, pol β and pol λ prefer a single-nucleotide-gapped DNA substrate with a 5′-phosphate on the downstream strand over a non-gapped primer-template DNA substrate. It was suggested that, in wild-type cell extracts, the initial nucleotide insertion involves mainly pol β, likely because of its high affinity for the 5′-dRP. A recent study raised the question of whether individual steps in BER are sequentially ordered, as it showed that intermediates of SN-BER could be channeled from APE1 to pol β to DNA ligase. Coordination between the different stages of the BER process is also orchestrated by XRCC1, a scaffold protein that interacts with most components of the BER short-patch pathway. However, during LP-BER, pol β filled the initial 1-nt gap but this DNA product was not channeled to FEN1 (Prasad et al. 2010). These observations support the model suggesting that lesion processing governs the selection of the BER synthesis pathway. It was found that FEN1 stimulated pol β-mediated DNA synthesis on an LP-BER substrate and, conversely, that pol β stimulated FEN1 cleavage on an LP-BER flap substrate. Therefore, pol β and FEN1 could cooperate with each other to form a longer repair patch by a nick-translation mechanism. Because of the efficient gap-filling activity of pol β, this particular mechanism performed by pol β and FEN1 may be more efficient than PCNA-dependent LP-BER. However, PCNA-dependent LP-BER may be involved in the repair of some common oxidative lesions that have been shown to inactivate pol β.
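The division of labor among the BER polymerases described above can be restated compactly. The following Python sketch is only an illustration under stated assumptions: the dictionary layout and the function names (`needs_flap_nuclease`, `polymerases_without_drp_lyase`) are hypothetical, and the family and dRP lyase assignments simply repeat what this entry states for pols β, λ, γ, θ, ι, δ, and ε.

```python
# Illustrative sketch only: the structure and names are assumptions, not a standard data model.
# Family and dRP lyase assignments repeat the statements made in this entry.
BER_POLYMERASES = {
    "beta":    {"family": "X", "drp_lyase": True},
    "lambda":  {"family": "X", "drp_lyase": True},
    "gamma":   {"family": "A", "drp_lyase": True},
    "theta":   {"family": "A", "drp_lyase": True},
    "iota":    {"family": "Y", "drp_lyase": True},
    "delta":   {"family": "B", "drp_lyase": False},
    "epsilon": {"family": "B", "drp_lyase": False},
}


def needs_flap_nuclease(pol):
    """Return True if the polymerase lacks an intrinsic dRP lyase activity,
    so that an additional component such as FEN1 would be needed to remove the 5'-dRP."""
    return not BER_POLYMERASES[pol]["drp_lyase"]


def polymerases_without_drp_lyase():
    """List, in alphabetical order, the polymerases that lack dRP lyase activity."""
    return sorted(p for p in BER_POLYMERASES if needs_flap_nuclease(p))


if __name__ == "__main__":
    for name in polymerases_without_drp_lyase():
        print(f"pol {name}: needs FEN1 or a similar activity")
```

A table like this captures only the qualitative statement in the text; it says nothing about relative rates or about which polymerase is actually selected in a cell.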

Nucleotide Excision Repair

Short Overview of Prokaryotic DNA Polymerases Participating in Nucleotide Excision Repair

Nucleotide excision repair (NER) is a versatile DNA repair mechanism that recognizes and removes a wide variety of structurally unrelated lesions. In the gram-negative bacterium E. coli, the uvrABC system recognizes distortion in the DNA structure, leading to the start of the repair process (for review see Hanawalt et al. (1979)). Both a short-patch and a long-patch NER exist in E. coli. Among the E. coli DNA polymerases, DNA polymerase I is uniquely suited to perform resynthesis in both patches because it can bind at the nick generated by dimer-specific endonucleases and is more abundant than the other polymerases. However, the participation of DNA polymerases II and III in resynthesis is also possible. In the gram-positive bacterium B. subtilis, mutants in DNA polymerase I are sensitive to ultraviolet radiation, suggesting an involvement of this polymerase in NER.

Eukaryotic DNA Polymerases Participating in Nucleotide Excision Repair

Eukaryotic NER requires resynthesis of about 30 nucleotides to fill the gap generated by excision of the lesion from the DNA (Friedberg et al. 2005). NER in cells and in vitro is sensitive to aphidicolin, which inhibits DNA polymerases α, ε, and δ. However, strong evidence from many sources indicates that DNA pol α is not an NER polymerase. Initial work showed that DNA pol ε played a role in NER-dependent DNA synthesis. This finding was later confirmed by a study in which mammalian NER was reconstituted with purified proteins, using calf thymus DNA pol ε as the gap-filling polymerase together with the auxiliary replicative proteins: the single-stranded DNA binding protein RPA, the processivity factor PCNA, and the PCNA loading factor RFC. However, a series of experiments have also implicated DNA pol δ in NER, and both DNA polymerases ε and δ have proven to be functional in an in vitro system that can reconstitute DNA repair with recombinant incision factors and human replication proteins. Determining which DNA polymerase normally participates in NER in cells is difficult, and it is likely that either DNA pol ε or δ can serve in this function. In yeast, studies of NER with DNA polymerase mutant strains led to the conclusion that either DNA pol ε or δ can participate in the gap-filling step when the other is absent.

Interestingly, the question recently arose whether these two polymerases are the only eukaryotic polymerases that participate in the resynthesis step of NER. This question was initially prompted by the finding that mouse cells deficient in DNA polymerase κ, a member of the Y family of DNA polymerases, show substantially reduced levels of NER. DNA polymerases belonging to the Y family are low-fidelity DNA pols capable of performing translesion synthesis (TLS) across DNA lesions (Waters et al. 2009). A recent study has confirmed a role for DNA pol κ in NER. It reports that, in human fibroblasts, approximately half of NER resynthesis requires both DNA pols κ and δ, and both DNA polymerases can be recovered in the same repair complex. DNA pol κ is recruited to the repair site by ubiquitinated PCNA and the repair scaffolding protein XRCC1, while DNA pol δ is recruited by RFC, the DNA pol accessory factor p66, and unmodified PCNA. The remaining NER resynthesis is dependent on DNA pol ε, the recruitment of which is dependent on the alternative clamp loader CTF18-RFC. RPA is required in both cases.

At first sight it may seem strange that cells use an error-prone DNA polymerase, i.e., DNA pol κ, to carry out NER repair synthesis after removal of the lesion. However, the low Km of DNA pol κ may make it particularly suitable for use under conditions of low nucleotide concentration. Alternatively, it is possible that the removal of the damaged 30-mer oligonucleotide during NER uncovers lesions in the DNA template to be copied that require the action of a translesional DNA pol.
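The two modes of NER resynthesis reported for human fibroblasts, and the factors said to recruit each polymerase, can be summarized as a small lookup. This is a minimal sketch, assuming a hypothetical dictionary layout and a hypothetical helper `factors_for`; the factor names are taken directly from the paragraph above and no quantitative model is implied.

```python
# Illustrative sketch only: the layout and the helper name are assumptions.
# Recruitment factors repeat those named in this entry for human fibroblasts.
NER_RESYNTHESIS_MODES = {
    "kappa_delta": {
        "polymerases": {
            "kappa": ["ubiquitinated PCNA", "XRCC1"],
            "delta": ["RFC", "p66", "unmodified PCNA"],
        },
        "common_factors": ["RPA"],
    },
    "epsilon": {
        "polymerases": {"epsilon": ["CTF18-RFC"]},
        "common_factors": ["RPA"],
    },
}


def factors_for(pol):
    """Return the set of factors associated with a polymerase in the text,
    including RPA, which is required in both resynthesis modes."""
    for mode in NER_RESYNTHESIS_MODES.values():
        if pol in mode["polymerases"]:
            return set(mode["polymerases"][pol]) | set(mode["common_factors"])
    raise KeyError(f"no NER recruitment data recorded for pol {pol}")


if __name__ == "__main__":
    for name in ("kappa", "delta", "epsilon"):
        print(name, sorted(factors_for(name)))
```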

Double-Strand Break Repair

DNA damage from endogenous and exogenous agents may cause DNA double-strand breaks (DSBs) or replication fork collapse, which pose a serious threat to genomic integrity. Cells have evolved different mechanisms to rescue arrested forks or repair DNA double-strand breaks. In mammalian cells, nonhomologous end joining and recombination are the two primary mechanisms of DSB repair.

Nonhomologous End-Joining Repair

Overview of Prokaryotic Nonhomologous End Joining

The nonhomologous end-joining (NHEJ) repair pathway was initially identified in mammalian cells and was considered to be restricted to eukaryotes. However, the identification of putative Ku gene/DNA ligase D operons in a variety of prokaryotic genomes indicates the existence of a functional NHEJ pathway in bacteria. DNA ligase D (LigD) is a member of the ATP-dependent DNA ligase family. In addition to the core ligase domain, most LigD proteins possess a polymerase domain with homology to members of the archaeo-eukaryotic primase (AEP) superfamily. The polymerase domain of LigD displays template-independent terminal transferase activity on single-stranded (ss) DNA and blunt-ended double-stranded (ds) DNA substrates and can also extend dsDNA (with 5′ overhangs) and fill in gapped dsDNA substrates in a template-dependent manner (for review, see Pitcher et al. (2007)). The major determinant for specific binding of the polymerase domain of LigD to DNA is the presence of a 5′ phosphate, which is located at the single-/double-stranded junction on overhanging or gapped DNA molecules. This binding specificity allows a large degree of rotational freedom and could promote the formation of end-bridging complexes while allowing the termini to search for complementary sequences on the opposing break.

Eukaryotic DNA Polymerases Participating in NHEJ

In S. cerevisiae, pol IV, which belongs to the X family of DNA polymerases and is homologous to human pol λ, has been demonstrated to contribute significantly to NHEJ. Pol IV has a primary role in resynthesizing gaps, as it is specifically required for adding bases during imprecise end joining. However, it was shown that pol II, which belongs to the B family of DNA polymerases and is homologous to human pol ε, contributes to deleting the flaps at imprecise pairing sites at a DSB. Moreover, pol III, which is homologous to human pol δ, is required during complex chromosomal rearrangements, and both pol I (the human pol α homologue) and pol II play roles in both imprecise end joining and chromosomal rearrangements.

The family X human DNA polymerases pol λ, pol μ, and TdT each possess an N-terminal BRCT domain, which allows all three polymerases to interact with the NHEJ factors Ku, XRCC4, and DNA ligase IV. Deletion of the BRCT domain of pol λ and pol μ abolishes the ability of these polymerases to interact with a Ku-DNA complex. Pol μ was shown to facilitate end joining in vitro. It was also observed that pol λ, pol μ, and TdT, but not pol β, are able to participate in in vitro NHEJ reactions.

DNA polymerase activity during NHEJ has to cope with the absence of an unbroken template strand for template-dependent resynthesis of excised material. "Alignment-based gap fill-in" activity implies that the polymerase is active within the context of ends aligned by core NHEJ factors, presumably through the ability of the polymerase to form a complex with these factors. Alignment-based gap fill-in also requires that the polymerase retain activity with minimal base pairing between the primer and template. The ability to perform alignment-based gap fill-in during NHEJ is thus a defining characteristic of polymerases that have a role in this repair pathway. This characteristic is shared by pol λ, pol μ, and TdT, and each enzyme has distinct but overlapping specificities during NHEJ gap-filling synthesis. While either pol can efficiently fill in one-base gaps in aligned DSB ends, only pol λ can fill longer gaps and only pol μ or TdT can fill gaps with no base pairing (see Fig. 1). As shown, pol λ, pol μ, and yeast pol IV are active on DNA junctions presenting one or two terminal complementary nucleotides (left part of the figure). This activity is dependent on the BRCT domain of these proteins and is therefore thought to rely on end-bridging reactions performed by NHEJ factors. In the absence of pairing between primer and template, pol λ is largely inactive, while pol μ and TdT can use unpaired primer termini (central and right parts of the figure). A region of the palm subdomain of X family DNA polymerases was identified as providing the structural basis for the template independence of pol μ and TdT. Therefore, the capacity of a polymerase to operate on an NHEJ intermediate depends not only on its ability to interact with other NHEJ factors but also on the structure of the end and on intrinsic functional characteristics of the enzyme.

DNA Repair Polymerases, Fig. 1 Template specificity of DNA polymerases in NHEJ. Panels, left to right: partly complementary 3′ overhangs (Pol λ, Pol μ, Pol IV); noncomplementary 3′ overhangs (Pol μ); template independence (Pol μ, TdT)
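The substrate preferences summarized in Fig. 1 lend themselves to a simple lookup keyed by the structure of the DNA end. The sketch below is a hedged illustration: the `EndStructure` enum, the `FIG1_SPECIFICITY` table, and both helper functions are hypothetical names introduced here, and the groupings follow the three panel labels of Fig. 1 as they survive in this entry rather than any independent data set.

```python
from enum import Enum


class EndStructure(Enum):
    """The three end configurations labeled in Fig. 1 (left, center, right)."""
    PARTLY_COMPLEMENTARY = "partly complementary 3' overhangs"
    NONCOMPLEMENTARY = "noncomplementary 3' overhangs"
    TEMPLATE_INDEPENDENT = "template independence"


# Illustrative sketch only: polymerase groupings follow the Fig. 1 panel labels.
FIG1_SPECIFICITY = {
    EndStructure.PARTLY_COMPLEMENTARY: ("pol lambda", "pol mu", "pol IV"),
    EndStructure.NONCOMPLEMENTARY: ("pol mu",),
    EndStructure.TEMPLATE_INDEPENDENT: ("pol mu", "TdT"),
}


def candidate_polymerases(end):
    """Return the polymerases that Fig. 1 associates with a given end structure."""
    return FIG1_SPECIFICITY[end]


def substrates_for(pol):
    """Return the end structures on which Fig. 1 shows a polymerase to be active."""
    return [end for end, pols in FIG1_SPECIFICITY.items() if pol in pols]


if __name__ == "__main__":
    for end in EndStructure:
        print(f"{end.value}: {', '.join(candidate_polymerases(end))}")
```

Keying the table by end structure rather than by enzyme mirrors the point made in the text: which polymerase can act depends on the structure of the break as much as on the enzyme itself.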

Recombinational Repair of DSB

Recombination-dependent DSB repair is conserved in evolution and depends on many functional homologues. Most models of DNA recombination propose that the ends of a DSB are resected to expose 3′ single-stranded DNA overhangs that are bound by a recombinase. This nucleoprotein filament allows the search for homologous DNA sequences and the invasion of duplex DNA to generate a D loop. This structure can serve as a template for DNA synthesis, which may involve diverse DNA polymerases.

Prokaryotic DNA Polymerases Participating in Homologous Recombination

In E. coli, initial evidence for the involvement of a DNA replication mechanism during recombination came from the observation that λ phage recombination in the absence of the known Holliday junction processing systems requires the major replicative polymerase, DNA pol III (Motamedi et al. 1999). Several studies have revealed the existence of a competition between E. coli pols I, II, and III and pol IV during stress-induced mutagenesis. The best-defined mechanism of stress-induced mutagenesis is mutagenesis associated with DSB repair in carbon-starved Escherichia coli cells. The switch from high-fidelity to error-prone DSB repair under stress is controlled by the SOS DNA damage response and the RpoS stress response. RpoS is the main controller of the general stress response, which is induced when bacteria enter stationary phase or experience nutrient limitation. It has been suggested that all five DNA polymerases are available during stress-induced mutagenesis, but only DNA polymerases I, II, and III compete with pol IV at the primer terminus. In this model, the SOS response upregulates pol IV about tenfold, while RpoS could license the use of all DNA polymerases, except pol III, in DSB repair.

Eukaryotic DNA Polymerases Participating in Homologous Recombination

In vitro and in vivo experiments provided evidence that pol η has an important function in mitotic gene conversion events. It has been shown that an activity from HeLa whole-cell extracts that can efficiently extend artificial D-loop substrates co-purifies with pol η. Furthermore, it was shown that pol η and RAD51 interact


in cells subjected to DNA-damaging agents and that RAD51 protein stimulates the DNA extension activity of pol η on recombination intermediate structures. The direct participation of pol η in homologous recombination in vivo was also demonstrated in DT40 cells (Kawamoto et al. 2005). DT40 cell lines lacking pol η show a decrease in the gene conversion events that occur during the generation of antibody diversity of Ig genes. Moreover, in the absence of pol η, the length of conversion tracts was significantly increased, while the accuracy of gene conversion was not compromised, suggesting that other DNA polymerases can function in the gene conversion process. It was also recently shown that strand displacement, mediated by the targeted actions of pol η at a D loop, leads to RAD52/RPA-catalyzed capture of the second end of a DSB. By contrast, pol δ or pol ι, which exhibit only a weak D-loop extension activity, failed to yield significant amounts of fully annealed second-end capture products.

In yeast, HR efficiently repairs DSBs, but single-stranded gaps at stalled replication forks might also initiate recombination. DSBs arising from replication fork collapse have only one free end and require replication to the end of the chromosome by a process known as break-induced replication (BIR) (McEachern and Haber 2006). In budding yeast, it was shown that Rad51-dependent BIR induced by HO endonuclease requires the lagging strand DNA pol α-primase complex as well as pol δ to initiate new DNA synthesis. Pol ε was required for processive DNA synthesis to the end of the BIR template but not for the initial primer extension step. The DNA pol δ subunit Pol32 was also shown to be essential for BIR while having little effect on gene conversion events. These results suggest that Pol32 plays a crucial role in the recombination-dependent establishment of a repair replication fork during BIR and is not needed for all types of recombination-associated repair DNA synthesis.
A model system for BIR in Saccharomyces cerevisiae using a modified chromosome fragmentation assay revealed that template switching was occurring for a limited distance downstream of the site of strand invasion, suggesting that after several rounds of strand invasion, the D loop is stabilized and processive replication ensues. It was subsequently shown that BIR is associated with a high level of frameshift mutagenesis and that this hypermutability is independent of pol η while modestly dependent on pol ζ at some chromosomal positions. These data suggest that the BIR fork contains multiple deficiencies, including decreased pol δ replication fidelity.

Cross-Link Repair

Short Overview of Prokaryotic DNA Polymerases Participating in Cross-Link Repair

Current models of interstrand cross-link repair (ICL) involve endonucleolytic incisions near the cross-link on one of the two strands, generating the so-called unhooked intermediates (for review see Ho and Schärer (2010)). It is important to realize that in this case, in contrast to lesions affecting only one strand of DNA such as those repaired by NER, a fragment of an oligonucleotide remains attached to the template strand via a cross-link. This intermediate can be processed by DNA recombination or directly replicated by translesion polymerases. In the latter instance, even if this step is potentially mutagenic, one of the two strands is restored to an intact form and can then be used as a template for subsequent DNA recombination or NER. Recently, two studies in E. coli have implicated bacterial DNA polymerases in ICL replication. In the first, it was shown that DNA pol IV, a Y family DNA polymerase implicated in TLS in bacteria, could bypass an N2-N2 cross-link but required shortening of the duplex around the ICL to minimize strand displacement synthesis by the polymerase. The DNA pol IV-dependent resynthesis was error-free. In the second study, it was shown that E. coli DNA pol I was able to bypass a psoralen ICL, albeit inefficiently, by preferentially incorporating dAMP opposite the adducted thymine.

DNA-protein cross-links are formed in cells as by-products of DNA metabolism or as a consequence of exposure to chemical toxicants and metals (for review see Reardon et al. (2006)). These adducts serve as substrates for proteolytic degradation, leading to DNA-peptide lesions that can be repaired by NER. Alternatively, DNA-peptide cross-links can be subjected to replication bypass. Using DNA-peptide cross-links as model lesions, it has been shown that, among the bacterial DNA polymerases studied, E. coli DNA pol IV was able to bypass these lesions by incorporating the correct nucleotide. E. coli DNA pols III, II, and V were incapable of replicating past DNA-peptide cross-links.

Eukaryotic DNA Polymerases Participating in Cross-Link Repair

In yeast, there is evidence implicating NER and double-strand break (DSB) repair in the processing of ICL intermediates. In addition, the existence of a repair pathway dependent on TLS has also been demonstrated. The main TLS DNA polymerase in yeast is DNA pol ζ, a heterodimer of the catalytic subunit Rev 3 and the associated Rev 7 protein. DNA pol ζ belongs to the B family of DNA polymerases but lacks the 3′→5′ proofreading exonuclease activity usually associated with this family of DNA polymerases. DNA pol ζ activity is often coupled with that of another DNA pol, Rev 1, a Y family DNA polymerase that is epistatic to Rev 3 with respect to ICL repair. Compelling genetic evidence indicates that DNA pol ζ and Rev 1 are recruited to ICLs by ubiquitination of PCNA to perform ICL repair, and the activity of DNA pol ζ was shown to be necessary for the replication of defined ICLs in Xenopus laevis egg extracts. In a recent biochemical study, the role of DNA pol ζ and Rev 1 in TLS of ICLs was investigated. It was found that Rev 1 could efficiently insert the correct C opposite the cross-linked guanine, in agreement with its property as a dCMP insertase; however, DNA pol ζ could not extend beyond the ICL. It is likely that in vitro reconstitution of TLS of an ICL by DNA pol ζ and Rev 1 requires additional factors and/or ubiquitinated PCNA, which were not present in this study.

Compared to bacteria and yeast, vertebrates have an increased number of DNA polymerases


capable of TLS. Genetic and biochemical evidence indicates that DNA pol ζ and Rev 1 play an important role in ICL repair in vertebrate cells that could be independent of PCNA monoubiquitination. However, no in vitro studies of TLS of ICLs with purified mammalian DNA pol ζ and Rev 1 have been reported.

DNA pol κ is a Y family DNA polymerase identified in a homology search for eukaryotic orthologs of E. coli dinB. The capacity of this DNA polymerase to replicate N2-N2-guanine ICLs was investigated, and this study showed that DNA pol κ not only catalyzed accurate incorporation opposite the cross-linked guanine but also replicated beyond the lesion. The efficiency of TLS was greatly enhanced by truncation of both the 5′ and 3′ ends of the non-templating strand. Consistent with this finding, both cell survival and chromosomal stability were adversely affected in pol κ-depleted cells following exposure to the DNA cross-linking agent mitomycin C (MMC).

DNA polymerase ν (POLN) is a newly discovered A family polymerase, and a function of DNA pol ν in ICL repair in mammalian cells has been proposed. A recent study has demonstrated that DNA pol ν could bypass in vitro DNA interstrand cross-links formed between N6-dA residues in complementary strands, which are located in the major groove of DNA. However, the chemically identical DNA interstrand cross-link completely blocked DNA pol ν when located in the minor groove via an N2-dG linkage. This study suggests that the location of the lesion within DNA is one important structural determinant that influences the ability of DNA pol ν to bypass an ICL.

In a very recent manuscript, Schärer's group generated a series of major groove ICLs that induced various degrees of distortion in the DNA helix. With such substrates they systematically investigated how structural variations affect TLS by the Y family DNA polymerases κ, ι, η, and Rev 1 and by the B family DNA polymerase ζ. They found that all of these DNA polymerases can bypass the ICLs studied in vitro, with an efficiency that is modulated by the length of the DNA flanking the ICLs and the length of the cross-link bridging the two bases. This study thus shows that major groove ICLs can be bypassed by a number of TLS DNA polymerases with an efficiency dramatically affected by the structural features of the ICL.

Not much is known about eukaryotic DNA polymerases potentially involved in TLS of DNA-protein cross-links. Using single-stranded, site-specifically modified vectors, it was shown that tetrapeptides linked to N2-dG or N6-dA could be replicated in COS-7 cells. However, a tenfold higher mutation frequency was observed when the tetrapeptide was conjugated to the N2-dG minor groove site than when it was conjugated to the N6-dA major groove site. The same group has reported that DNA pol κ could efficiently and faithfully replicate DNA-peptide cross-links in vitro at the N2-dG position. Very recently, in addition to DNA pol κ, DNA pol ν was also shown to be capable of replicating DNA-peptide cross-links, but only when they are placed in the major groove of DNA, as shown for ICLs.

Mismatch Repair

Short Overview of Prokaryotic DNA Polymerases Participating in Mismatch Repair

DNA mismatch repair (MMR) is an enzyme system capable of recognizing and correcting base pair errors within the DNA helix. In E. coli, MMR has been well characterized and was reconstituted from recombinant proteins in a seminal work by Paul Modrich’s group. In that work, the authors assembled a purified system consisting of a set of proteins that can process seven of the eight base-base mismatches in a strand-specific reaction directed by the unmethylated state of a single d(GATC) sequence located 1 kilobase from the mispair. The resynthesis step of this in vitro MMR process requires the action of DNA polymerase III holoenzyme, acting in conjunction with DNA helicase II on a template covered with single-stranded DNA binding protein (SSB). Since DNA polymerase III holoenzyme is highly processive, its involvement is consistent with large repair tracts. Genetic evidence supports this conclusion and indicates that another E. coli polymerase, DNA polymerase I, does not contribute in a major way to MMR, since extracts from a strain deficient in this polymerase exhibit a normal level of MMR activity. The E. coli β processivity clamp, one of the components of the DNA polymerase III holoenzyme, is also necessary for MMR, and its interaction with the MMR recognition proteins MutS and MutL has been demonstrated, raising the possibility that this component of the replication apparatus could be involved in early steps of MMR. However, a recent publication, although confirming this interaction, found that the β sliding clamp has little effect on the initiation of the excision steps of MMR, suggesting instead that it plays an essential role during the resynthesis stage of MMR.

Eukaryotic DNA Polymerases Participating in Mismatch Repair

In eukaryotes the signal for strand discrimination is uncertain but may be the DNA ends associated with replication forks (Larrea et al. 2010). Chromosomal DNA replication in eukaryotic cells requires three DNA polymerases: pol α, pol δ, and pol ε. Pol α is the only polymerase that has an associated activity for synthesis of RNA primers and is able to extend from such primers by synthesizing short stretches of DNA. Subsequently, processive DNA synthesis is continued by pol δ and/or pol ε. While in E. coli DNA polymerase III holoenzyme replicates both DNA strands, recent work in yeast supports a model wherein, during normal DNA replication, pol ε is primarily responsible for copying the leading strand and pol δ is primarily responsible for copying the lagging strand (Kunkel and Burgers 2008). Direct evidence for the involvement of pol δ in human MMR was provided by a study in which HeLa cell nuclear extract was resolved into a depleted fraction incapable of supporting MMR in vitro, and repair activity was restored upon addition of a purified fraction isolated from HeLa cells by a complementation assay. The purest fraction contained pol δ but was free of detectable pol α and pol ε. The processivity factor PCNA was also implicated in the human MMR reaction.

Later on, based on the observation that the recognition proteins MutSα and MutLα, together with the exonuclease EXO1, PCNA, RPA, and the PCNA loading factor RFC, were sufficient to support bidirectional mismatch-provoked excision, a study was undertaken to reconstitute in vitro bidirectional MMR directed by a strand break located 3′ or 5′ to the mispair. This was successfully accomplished using seven human activities: MutSα, MutLα, EXO1, PCNA, RPA, RFC, and pol δ. Because previous studies had suggested potential involvement of the editing functions of pols δ and ε in mismatch-provoked excision, the possible participation of pol δ in the excision step of MMR was evaluated. The contribution of the polymerase to mismatch-provoked excision was found to be very limited, while excision repair in the purified system containing pol δ was reduced tenfold upon omission of the exonuclease EXO1. Nonetheless, in addition to the proven role of pol δ in MMR, a possible role for pol ε should not be excluded. For instance, the possible involvement of pols δ and ε in the 3′ excision step of MMR, as suggested by genetic evidence, may require the presence of a replication fork, which has not yet been tested in in vitro assays. Interestingly, several human colorectal carcinomas carrying mutations in the proofreading domains of pols δ and ε were all found to be defective in MMR.

Cross-References

▶ Depurination
▶ DNA Damage, Types of
▶ DNA Recombination, Mechanisms of
▶ DNA Repair
▶ Double-Strand Break Repair
▶ Exocyclic Adducts
▶ Homologous Recombination in Lesion Bypass
▶ Mismatch Repair
▶ Nucleotide Excision Repair
▶ Recombination: Mechanisms, Pathways, and Applications
▶ Ultraviolet Light DNA Damage

Acknowledgments

The authors wish to thank Prof Paul Boehmer for critical reading of the manuscript. The authors apologize to the many colleagues whose work they were unable to cite because of space restrictions.

DNA Replication

References

Braithwaite EK, Prasad RP, Shock DD et al (2005) DNA polymerase λ mediates a back-up base excision repair activity in extracts of mouse embryonic fibroblasts. J Biol Chem 280:18469–18475. https://doi.org/10.1074/jbc.M411864200
Dianov GL, Lindahl T (1994) Reconstitution of the DNA base excision-repair pathway. Curr Biol 4:1069–1076
Friedberg EC, Walker GC, Siede W et al (2005) DNA repair and mutagenesis, 2nd edn. ASM Press, Washington, DC
Hanawalt PC, Cooper PK, Ganesan AK, Smith CA (1979) DNA repair in bacteria and mammalian cells. Annu Rev Biochem 48:783–836. https://doi.org/10.1146/annurev.bi.48.070179.004031
Ho TV, Schärer OD (2010) Translesion DNA synthesis polymerases in DNA interstrand crosslink repair. Environ Mol Mutagen 51:552–566. https://doi.org/10.1002/em.20573
Hübscher U, Spadari S, Villani G, Maga G (2010) DNA polymerases: discovery, characterization and functions in cellular DNA transactions, 1st edn. World Scientific Publishing, Hackensack, NJ
Kawamoto T, Araki K, Sonoda E et al (2005) Dual roles for DNA polymerase η in homologous DNA recombination and translesion DNA synthesis. Mol Cell 20:793–799. https://doi.org/10.1016/j.molcel.2005.10.016
Kornberg A, Baker TA (2005) DNA replication, 2nd edn. University Science, Sausalito, CA
Kunkel TA, Burgers PM (2008) Dividing the workload at a eukaryotic replication fork. Trends Cell Biol 18:521–527. https://doi.org/10.1016/j.tcb.2008.08.005
Larrea AA, Lujan SA, Kunkel TA (2010) SnapShot: DNA mismatch repair. Cell 141:730.e1. https://doi.org/10.1016/j.cell.2010.05.002
Lindahl T, Wood RD (1999) Quality control by DNA repair. Science 286:1897–1905
Liu P, Demple B (2010) DNA repair in mammalian mitochondria: much more than we thought? Environ Mol Mutagen 51:417–426. https://doi.org/10.1002/em.20576
McEachern MJ, Haber JE (2006) Break-induced replication and recombinational telomere elongation in yeast. Annu Rev Biochem 75:111–135. https://doi.org/10.1146/annurev.biochem.74.082803.133234
Motamedi MR, Szigety SK, Rosenberg SM (1999) Double-strand-break repair recombination in Escherichia coli: physical evidence for a DNA replication mechanism in vivo. Genes Dev 13:2889–2903
Pitcher RS, Brissett NC, Doherty AJ (2007) Nonhomologous end-joining in Bacteria: a microbial perspective. Annu Rev Microbiol 61:259–282. https://doi.org/10.1146/annurev.micro.61.080706.093354
Prasad R, Shock DD, Beard WA, Wilson SH (2010) Substrate channeling in mammalian base excision repair pathways: passing the baton. J Biol Chem 285:40479–40488. https://doi.org/10.1074/jbc.M110.155267
Reardon JT, Cheng Y, Sancar A (2006) Repair of DNA-protein cross-links in mammalian cells. Cell Cycle 5:1366–1370
Waters LS, Minesinger BK, Wiltrout ME et al (2009) Eukaryotic translesion polymerases and their roles and regulation in DNA damage tolerance. Microbiol Mol Biol Rev 73:134–154. https://doi.org/10.1128/MMBR.00034-08
Yamtich J, Sweasy JB (2010) DNA polymerase family X: function, structure, and cellular roles. BBA Protein Proteomics 1804:1136–1150. https://doi.org/10.1016/j.bbapap.2009.07.008

DNA Replication

Jon M. Kaguni
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

Synopsis

Free-living organisms must duplicate their chromosomes before each cell division. To ensure that this process occurs at the correct time in the cell cycle, chromosomal DNA replication is highly regulated. Despite specific mechanistic differences among organisms, all domains of life use similar strategies to duplicate their chromosomes and to ensure that DNA replication occurs at the proper time. As described in this series of entries, enzymes that act as molecular machines function during each of the stages of DNA replication: initiation, elongation, and termination. Examination of these enzymes reveals that they are fascinating and complex, sharing the common feature of coupling nucleotide binding and/or hydrolysis to a conformational change. In addition to conformational changes during catalysis, altered conformations in these molecular machines are frequently induced by their interaction with ligands and other macromolecules, which then results in mechanical work. Examples of mechanical work are the opening of a replication origin during the initiation of DNA replication and the unwinding of the parental duplex DNA by a replicative DNA helicase, which is required in order for the cellular replicase (a DNA polymerase) to copy each parental DNA strand.

Macromolecular dynamics also come into play when the DNA polymerase synthesizes DNA. For DNA synthesis, an oligonucleotide annealed to the parental single-stranded DNA is needed. This oligonucleotide provides a 3′-end that the enzyme extends via the polymerization of deoxynucleoside monophosphates, using deoxynucleoside triphosphates as the substrate. Molecular motion occurs when the replicase is placed at the 3′-end of the oligonucleotide by a multisubunit complex named the clamp loader, via a process that depends on ATP binding and hydrolysis. In the subsequent step of DNA synthesis, a separate type of molecular movement is driven by nucleotide hydrolysis as the DNA polymerase translocates along the parental DNA while it synthesizes DNA. In the last stage of DNA replication, the termination of DNA replication is coordinated with the segregation of the two daughter chromosomes to daughter cells. Mechanisms exist to terminate replication forks. Details about the molecular mechanisms that operate at each of these stages of DNA replication are described in individual entries in this volume.

Development of the Field

With rare exception, a cell cannot live without a full complement of its genes. Hence, chromosomal DNA replication is essential. Because cells require genes and their functions for viability, organisms have evolved to ensure that chromosomal duplication is tightly coordinated with cell growth and cell division, regulating its occurrence at a specific time in the cell cycle. Of interest, mechanisms control the frequency of chromosomal DNA replication so that chromosomes are duplicated only once per cell cycle.

The study of DNA replication has its origins in humanity’s interest in the basis of heredity. Believing that proteins were more complex than DNA, early biologists thought that proteins were the genetic material, which would explain the complexity of organisms. A series of classic experiments disproved this idea. Griffith showed in 1928 that mice injected with pneumococcal bacteria that form colonies with either a rough or a smooth morphology lived or died, respectively (Griffith 1928). Heat treatment of the bacteria with a smooth colony morphology neutralized their toxicity, but not when the heat-treated material was combined with live bacteria having a rough colony morphology. Griffith described the substance responsible as the “transforming principle.” Later, Avery, MacLeod, and McCarty were able to show that this substance is DNA and not protein (Avery et al. 1944). Hershey and Chase in 1952 used a bacterial virus (bacteriophage) in which the DNA and proteins of the viral particle were radioactively labeled (Hershey and Chase 1952). Following the fate of the radioactively labeled DNA and protein, they were able to show that the labeled DNA, but not the protein component, was inherited in the progeny bacteriophage produced from infected cells. This experiment showed that DNA and not protein is the hereditary material.

A molecular understanding of heredity originates from the structure of duplex DNA, determined by Watson and Crick (Watson and Crick 1953). The rules of base pairing and the complementarity of one DNA strand with its partner in double-stranded DNA led to the prediction that each parental DNA strand of a duplex DNA molecule determines the sequence of the complementary DNA when the parental duplex DNA is replicated. The experiments by Meselson and Stahl proved that DNA replication is semiconservative, in that each parental DNA strand is used as a template to produce its complementary DNA sequence (Meselson and Stahl 1958). At this point in history, enzymes were recognized to act as catalysts, but the enzymology of DNA replication was unknown. The discovery of DNA polymerase I of E. coli was reported in 1958 by Kornberg and coworkers (Bessman et al. 1958; Lehman et al. 1958). It was the first enzyme directly involved in DNA replication to be isolated and characterized. Today, counterparts of E. coli DNA polymerase I derived from various thermophilic bacteria are used in many research laboratories for the manipulation and amplification of DNA. In the years that have passed since the discovery of DNA polymerase I, other enzymes have been identified and studied that act in the initiation, elongation, and termination stages. These enzymes and their mechanisms are the topics of the entries in this volume.

The Structure of DNA and DNA Topology

DNA replication is a process that underlies cell growth, leading to the production of daughter cells that each contain the full complement of the mother cell’s chromosomes. Before considering advanced topics on the enzymology of DNA replication, a brief review of the structure of DNA is useful. DNA is a polymer of deoxyribonucleotides. The chemical structure of a deoxyribonucleotide has three parts: a purine or pyrimidine base; 2-deoxyribose, which is linked to the purine or pyrimidine by an N-glycosidic bond; and a phosphate attached to carbon 5′ of the sugar by an ester bond. To distinguish the carbon and nitrogen atoms of the purine and pyrimidine bases from those in 2-deoxyribose, the carbons in this cyclic sugar are given a “prime” designation. In a single strand of DNA, a phosphate group connects the 5′ carbon of one deoxynucleotide with the 3′ carbon of its neighbor, forming a phosphodiester linkage. A single-stranded DNA has a polarity, as denoted by the 5′ carbon of the deoxynucleotide at one end and the 3′ carbon of the deoxynucleotide at the other. The nitrogenous bases in DNA are adenine (A), guanine (G), thymine (T), and cytosine (C). In duplex DNA, an A base in one strand of DNA hydrogen bonds with T in the other DNA strand. Similarly, a G hydrogen bonds with C. In the Watson-Crick form of duplex DNA, one DNA strand is in an antiparallel orientation relative to its complementary single-stranded DNA, forming a right-handed helix with about 10 base pairs per turn. A change in the winding of duplex DNA relative to this relaxed form is called supercoiling and is assigned a negative or positive sense when the two strands are wound around each other fewer or more times, respectively, than in relaxed DNA. Many biochemistry textbooks describe the primary, secondary, and tertiary structure of DNA in greater detail.

DNA topology profoundly influences how enzymes that are involved in duplicating an organism’s chromosomes interact with DNA. Hence, a central issue that underlies DNA replication is the superhelical state of DNA. Indeed, negative superhelicity in duplex DNA is thought to aid in the unwinding of a replication origin during the initiation stage of DNA replication (Collin et al. 2011; Hardy et al. 2004; Vos et al. 2011; Yang 2010). In the elongation stage, after replication forks have been established, positive supercoils accumulate ahead of moving replication forks and impede fork movement unless this torsional stress is removed. The enzymes that relieve this torsional stress are called topoisomerases. They are grouped into two classes: type 1 topoisomerases break and rejoin one strand of duplex DNA, whereas type 2 topoisomerases introduce a double-strand break in duplex DNA and then reseal it. The consequence of their activity is that topoisomerases alter the topology of duplex DNA by increasing or decreasing the number of turns that one strand of DNA passes around the other compared with Watson-Crick DNA, which contains about ten base pairs per turn. Of interest, drugs that inhibit this category of enzymes have been developed for the treatment of cancer and bacterial infection. The entries on topoisomerases summarize our current knowledge of these enzymes and also the mechanism of inhibition by anticancer and antibacterial compounds (see ▶ “Gyrase and Topoisomerase IV as Targets for Antibacterial Drugs” and ▶ “Topoisomerases and Cancer” by Ketron and Osheroff in this volume).
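
The sense of supercoiling described in this section can be made concrete with a small calculation. The following is a minimal sketch, not taken from the entry itself: it assumes the commonly quoted value of 10.5 base pairs per turn for relaxed duplex DNA (the text rounds this to about 10), and the function names and the 4,200 base pair example molecule are hypothetical. It computes the relaxed linking number and the superhelical density, the fractional difference between the actual and relaxed linking numbers.

```python
def relaxed_linking_number(n_bp: int, bp_per_turn: float = 10.5) -> float:
    """Turns of one strand around the other in relaxed duplex DNA.

    bp_per_turn = 10.5 is an assumed textbook value, not a figure from the entry.
    """
    return n_bp / bp_per_turn


def superhelical_density(linking_number: float, n_bp: int,
                         bp_per_turn: float = 10.5) -> float:
    """Fractional excess (+) or deficit (-) of turns relative to relaxed DNA."""
    lk0 = relaxed_linking_number(n_bp, bp_per_turn)
    return (linking_number - lk0) / lk0


def supercoil_sense(sigma: float, tol: float = 1e-9) -> str:
    """Classify the sense of supercoiling from the superhelical density."""
    if sigma < -tol:
        return "negative"
    if sigma > tol:
        return "positive"
    return "relaxed"


if __name__ == "__main__":
    n_bp = 4200  # hypothetical circular DNA, chosen so the relaxed value is a whole number
    for lk in (376, 400, 412):
        sigma = superhelical_density(lk, n_bp)
        print(lk, round(sigma, 3), supercoil_sense(sigma))
```

In this sketch an underwound molecule (fewer turns than relaxed DNA) comes out negative and an overwound one positive, which matches the direction in which topoisomerases are described as changing the number of turns of one strand around the other.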

DNA Replicases

At first glance, one may expect the process of DNA replication to be simple, as it involves making a copy of duplex DNA following the rules of Watson-Crick base pairing. One of the molecular machines involved in DNA replication is the DNA polymerase, more aptly called a DNA replicase because of its major role in duplicating chromosomes (see ▶ “DNA Polymerase III Structure” by McHenry in this volume). A basic property of all DNA polymerases is that they require a template strand of DNA that directs the synthesis of the complementary DNA strand. Thus, the DNA replicase synthesizes progeny DNA strands using each parental DNA strand of the duplex chromosome as a template. DNA polymerases also share the feature that they synthesize DNA in the 5′-to-3′ direction. The third general characteristic of DNA polymerases is that they require a short DNA or RNA oligonucleotide annealed to the DNA strand to be copied. This oligonucleotide serves as a primer by providing the 3′-hydroxyl of the sugar in its 3′-terminal nucleotide, which is extended during DNA synthesis.

Of interest, E. coli has five separate DNA polymerases (see ▶ “DNA Polymerase III Structure” by McHenry in this volume). The enzyme responsible for duplicating the E. coli chromosome is called DNA polymerase III holoenzyme. This enzyme and those that perform similar functions in other organisms are distinct from the DNA polymerases that are needed for DNA repair. As described in entries by McHenry, DNA polymerase III holoenzyme is composed of three subassemblies. In a process that depends on ATP and its hydrolysis, the subassembly named the clamp loader complex places the subassembly called the sliding clamp at the 3′-end of a primer annealed to the parental single-stranded DNA. The core subassembly then interacts with the sliding clamp, followed by highly processive DNA synthesis by extension of the primer end. Other bacterial replicases described in entries by McHenry are compared to this prototypical enzyme (see ▶ “DnaX Complex Composition and Assembly within Cells,” and ▶ “DNA Polymerase III Structure” by McHenry in this volume).

By comparison, eukaryotic cells have as many as 17 unique DNA polymerases. The replicative DNA polymerases that copy DNA in the nucleus of the cell are named DNA polymerase α (Pol α), DNA polymerase ε (Pol ε), and DNA polymerase δ (Pol δ). The entries by Hamdan describe how these nuclear DNA polymerases coordinate their functions to duplicate the nuclear genome (see ▶ “Eukaryotic DNA Replicases” by Hamdan in this volume).
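
The three general properties of DNA polymerases listed in this section (a template is required, synthesis proceeds 5′-to-3′, and a primer with a free 3′ end is required) can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: sequences are plain strings written 5′-to-3′, the primer is assumed to pair perfectly with the template, and the function names and example sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq_5to3: str) -> str:
    """Complementary strand, also written 5'->3' (strands are antiparallel)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq_5to3.upper()))


def extend_primer(template_5to3: str, primer_5to3: str) -> str:
    """Extend a primer annealed to a template and return the product 5'->3'.

    The primer's 3' end is extended one nucleotide at a time while the
    template is read in the 3'->5' direction, so the product grows 5'->3'.
    """
    template = template_5to3.upper()
    primer = primer_5to3.upper()
    # Region of the template (read 5'->3') that base pairs with the primer.
    site = reverse_complement(primer)
    pos = template.find(site)
    if pos == -1:
        raise ValueError("primer does not anneal to the template: no 3' end to extend")
    product = primer
    # Template bases on the 5' side of the annealing site remain to be copied;
    # walking them in reverse order reads the template 3'->5'.
    for base in reversed(template[:pos]):
        product += COMPLEMENT[base]
    return product


if __name__ == "__main__":
    template = "ATGCCGTA"   # hypothetical template, 5'->3'
    primer = "TAC"          # pairs with the 3'-terminal GTA of the template
    print(extend_primer(template, primer))
```

Without an annealed primer the function refuses to synthesize anything, which mirrors the requirement for a 3′-hydroxyl to extend; the sketch ignores proofreading, processivity, and the sliding clamp.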

DNA Helicases

The DNA replicases summarized above are remarkably proficient in their ability to synthesize DNA. However, as the DNA replicase translocates on the template DNA strand while synthesizing the complementary DNA strand, it is unable to unwind the duplex DNA that lies ahead. Unwinding of the parental duplex DNA is therefore a prerequisite. The molecular machine that separates the strands of duplex DNA is called a replicative DNA helicase (see the entries by Soultanas and Bolt in this volume). This enzyme is a hexamer that in bacteria is composed of identical subunits. The E. coli enzyme is named DnaB, whereas the protein in Bacillus subtilis is called DnaC. In eukaryotic cells, the DNA helicase is named Mcm2-7 and is composed of six different polypeptides. A general feature of this class of DNA helicases is that they have a toroid structure. The entries by Soultanas and Bolt describe the structures of these DNA helicases in detail and their biochemical mechanism of DNA unwinding.

Although replicative DNA helicases are effective in unwinding duplex DNA, they need a preexisting region of single-stranded DNA before they can start the unwinding process. For example, to measure DNA unwinding in vitro, bacterial replicative DNA helicases such as E. coli DnaB require a forked duplex DNA composed of a portion of duplex DNA joined to single-stranded DNAs that form the fork. The single-stranded DNAs are designated the 5′- and 3′-“tails.” This DNA helicase binds to the single-stranded DNA that forms the 5′-tail and then translocates on this DNA in the 5′-to-3′ direction. During unwinding, the strand of parental DNA bound to the enzyme passes through the central cavity of the DnaB toroid while the other strand is excluded, as ATP hydrolysis drives the translocation of the enzyme on the DNA.

Initiation of DNA Replication Occurs at Origins of Chromosomal DNA Replication

Initiation of DNA replication starts at a replication origin. Because the rate of replication fork movement in mammals is 2–3 kilobase pairs per minute, it would take almost 2 weeks to duplicate the smallest human chromosome (chromosome 21, 47 × 10⁶ base pairs) if it contained only a single replication origin. Eukaryotic organisms have solved this problem by initiating DNA replication from multiple origins carried in each chromosome. Studies of yeast (Saccharomyces cerevisiae) as a model organism have revealed that its replication origins are located at specific sites in its chromosomes (Bell and Stillman 1992; Diffley 2011; Ding and MacAlpine 2011; Errico and Costanzo 2010; Sclafani and Holzen 2007). These sites carry a DNA sequence element of 11 base pairs named the ARS (autonomously replicating sequence) consensus sequence. This AT-rich DNA element is essential but not sufficient alone; other DNA sequence motifs that contribute to origin function often include a nearby region that is helically unstable, named the DNA unwinding element (DUE). Another feature of S. cerevisiae replication origins is that they are frequently, but not always, found between genes and in nucleosome-free regions that may result from the binding of a transcription factor that causes nucleosomes to be excluded (Eaton et al. 2010; Nieduszynski et al. 2006). In comparison, replication origins in higher eukaryotes do not appear to share a conserved DNA sequence motif (Remus et al. 2004; Vashee et al. 2003). Hence, these replication origins are not found at discrete sites, yet they are localized to regions of DNA that vary in length from 6 to 100 kilobase pairs.
As an additional layer of complexity, mammalian chromosomes contain 30,000–50,000 replication origins, some of which initiate early in S phase whereas others are used later (Huberman and Riggs 1966). In addition, only a subset of eukaryotic replication origins is activated per cell cycle.
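
The estimate of almost 2 weeks for chromosome 21 can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope illustration: it uses the chromosome length and fork rates quoted above, assumes the quoted figure corresponds to a single fork traversing the whole chromosome, and introduces a hypothetical 8 hour S phase (not a value from this entry) to show why many origins are needed. Function names are illustrative.

```python
import math


def replication_time_days(length_bp: float, fork_rate_bp_per_min: float,
                          n_forks: int = 1) -> float:
    """Days needed for n_forks forks to copy length_bp at a constant rate."""
    minutes = length_bp / (fork_rate_bp_per_min * n_forks)
    return minutes / (60 * 24)


def origins_needed(length_bp: float, fork_rate_bp_per_min: float,
                   s_phase_hours: float, forks_per_origin: int = 2) -> int:
    """Minimum number of evenly spaced origins, all firing at the start of
    S phase, required to finish within s_phase_hours (an idealized model)."""
    bp_per_origin = fork_rate_bp_per_min * forks_per_origin * s_phase_hours * 60
    return math.ceil(length_bp / bp_per_origin)


if __name__ == "__main__":
    chr21_bp = 47e6  # length quoted in the text
    for rate in (2000, 2500, 3000):  # 2-3 kilobase pairs per minute
        print(rate, round(replication_time_days(chr21_bp, rate), 1), "days")
    # Hypothetical 8 h S phase, bidirectional forks at 2.5 kb/min.
    print(origins_needed(chr21_bp, 2500, s_phase_hours=8))
```

At the midpoint rate of 2.5 kilobase pairs per minute a single fork needs about 13 days, consistent with the figure in the text; two forks leaving one origin in opposite directions would halve that, which is still far longer than a cell cycle.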

In acting as sites where DNA synthesis begins, replication origins are recognized by proteins whose role is to assemble the enzymatic machinery destined for each replication fork. In S. cerevisiae, a complex composed of six different polypeptides named the origin recognition complex (ORC) recognizes the ARS consensus sequence and a DNA sequence motif named B1 (reviewed in DePamphilis et al. 2006). An orthologous ORC complex in metazoans performs an analogous role, but it does not bind in a sequence-specific manner (Remus et al. 2004; Vashee et al. 2003). Several studies show that ORC interacts directly with a protein named Cdc6, and that this interaction is required to recruit Cdt1, which is in a complex with Mcm2-7 (reviewed in Bochman and Schwacha 2009; Remus and Diffley 2009). Helicase recruitment results in a pair of Mcm2-7 complexes bound to a replication origin, which is followed by localized DNA unwinding mediated by helicase loading factors and stable binding of the helicase to the unwound DNA. In vivo studies support the conclusion that helicase loading occurs during the G1 phase of the cell cycle. The helicase must then be activated in S phase before it is able to unwind DNA. Following activation and the unwinding of the replication origin, DNA polymerase α synthesizes oligonucleotides that are used as primers for DNA synthesis by DNA polymerases δ and ε (see ▶ “Eukaryotic DNA Replicases” by Hamdan in this volume).

In contrast, E. coli DNA replication starts at a single replication origin (oriC) contained in its circular chromosome. In bacteria such as E. coli, a single replication origin suffices because the rate of replication fork movement is faster than in eukaryotic cells, and bacterial chromosomes are smaller. DnaA protein recognizes this site via its interaction with specific DNA sequences contained in it (reviewed in Kaguni 2011; Kawakami and Katayama 2010; Leonard and Grimwade 2011). Other entries in this volume describe these DNA sequence elements and the biochemistry of DnaA in detail (see ▶ “DnaA, DnaB, DnaC,” ▶ “Control of Initiation in E. coli,” and ▶ “Replication Origin of E. coli and the Mechanism of Initiation” by Kaguni in this volume). As described therein, DnaA bound to ATP is the active form for initiation and controls the frequency of DNA replication. After DnaA has bound to oriC, it unwinds an AT-rich region near the left edge of oriC and then loads DnaB helicase onto the unwound DNA. Helicase loading requires that DnaB be complexed with DnaC, a protein that is essential for the initiation of DNA replication and that negatively regulates the activity of DnaB. After DnaB has been activated, which requires the release of DnaC from DnaB, the DNA helicase enlarges the unwound region of oriC. The interaction of primase (DnaG) with DnaB leads to the synthesis of primers that are then extended by the DNA replicase, DNA polymerase III holoenzyme, followed by duplication of the bacterial chromosome.

Termination of DNA Replication

In the last stage of DNA replication, the termination of DNA synthesis is coordinated with the segregation of the two daughter chromosomes to daughter cells. In bacteria possessing a single circular chromosome, replication forks assembled at the replication origin move bidirectionally from this site (reviewed in Kaplan and Bastia 2009; Neylon et al. 2005). Termination of DNA replication normally occurs when the two replication forks meet in the terminus region, located about 180° away from the origin on the circular chromosome. In E. coli, when replication forks fail to terminate there, DNA sites in the terminus region named ter that are bound by Tus protein (RTP in B. subtilis) act as replication fork barriers, preventing replication forks from escaping beyond the terminus region. In S. cerevisiae, by contrast, termination of DNA replication occurs when replication forks converge at specific regions in its linear chromosomes. A DNA topoisomerase named Top2 is specifically associated with these regions. The molecular mechanisms that operate at this stage of DNA replication in bacteria and eukaryotic cells have been reviewed elsewhere (Kaplan and Bastia 2009; Neylon et al. 2005).

Mechanisms That Regulate the Frequency of Replication Initiation in E. coli

DNA replication occurs only once per cell cycle in bacteria and in eukaryotic cells growing mitotically. Studies that are summarized in other chapters of this compendium describe molecular mechanisms that regulate the frequency of E. coli DNA replication at the stage of initiation (reviewed in Kaguni 2011; Kawakami and Katayama 2010; Leonard and Grimwade 2011) (see ▶ “Control of Initiation in E. coli” by Kaguni in this volume). One mechanism involves the sequestration of oriC by a protein named SeqA. In E. coli and related bacteria, the adenine bases at GATC sequences in DNA are methylated by DNA adenine methylase. Before DNA replication, both strands of the parental DNA exist in the methylated state. Hemi-methylated DNA arises by the synthesis of the progeny DNA strand, which is not methylated for a period of time following its production. SeqA specifically recognizes and binds to hemi-methylated DNA. When bound to oriC in the hemi-methylated state, SeqA temporarily blocks the binding of DnaA and other proteins that would otherwise initiate another cycle of DNA replication. However, after a latent period, dissociation of SeqA from oriC followed by its methylation by DNA adenine methylase readies oriC for another initiation cycle.

Other pathways modulate the frequency of initiation by controlling either the availability of DnaA or its activity. A mechanism that affects the cellular abundance of DnaA is based on the ability of DnaA to autoregulate its expression. Thus, when the cellular level of DnaA is low, more DnaA is made. When the abundance of DnaA is elevated, expression of the dnaA gene is repressed. As the cell grows, the accumulation of DnaA to a sufficient level leads to a new cycle of DNA replication. In another pathway, a chromosomal site named datA is thought to titrate excess DnaA so that initiation does not occur at inappropriate times in the cell cycle. In a third pathway, a complex containing a protein named Hda and the β sliding clamp interacts with DnaA to stimulate the hydrolysis of ATP bound to DnaA. The product, DnaA complexed to ADP, is relatively inert in initiation. Thus, this interaction is thought to regulate the activity of DnaA to control the frequency of initiation. In a fourth pathway, chromosomal sites named DARS stimulate the dissociation of ADP bound to DnaA. Because the level of ATP is higher than that of ADP in vivo, DnaA originally complexed to ADP can then bind ATP after interacting with DARS sites, rendering DnaA able to initiate another round of DNA replication. The ability of DARS sites to promote the dissociation of an adenine nucleotide bound to DnaA is like that of acidic phospholipids. Finally, proteins have been identified that interact with DnaA and either stabilize or disrupt its assembly at oriC. These pathways appear to act independently to ensure that initiation occurs only once per cell cycle (see ▶ “Control of Initiation in E. coli” by Kaguni in this volume).
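
The SeqA sequestration mechanism described at the start of this section depends on the methylation state of GATC sites before and after replication. The following is a minimal sketch of that bookkeeping, assuming a simplified model in which each GATC site carries one methylation flag per strand; the class and function names and the short example sequence are hypothetical, and the sequence is not the real oriC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatcSite:
    position: int
    parental_methylated: bool
    nascent_methylated: bool

    @property
    def state(self) -> str:
        marks = int(self.parental_methylated) + int(self.nascent_methylated)
        return {0: "unmethylated", 1: "hemimethylated", 2: "fully methylated"}[marks]


def find_gatc_sites(seq: str) -> list[int]:
    """0-based positions of GATC; the site is palindromic, so one scan covers both strands."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 3) if seq[i:i + 4] == "GATC"]


def replicate(positions: list[int]) -> list[GatcSite]:
    """Just after replication: parental strand methylated, new strand not yet."""
    return [GatcSite(p, parental_methylated=True, nascent_methylated=False)
            for p in positions]


def seqa_can_bind(site: GatcSite) -> bool:
    """In this toy model SeqA recognizes only hemimethylated sites."""
    return site.state == "hemimethylated"


def dam_methylate(site: GatcSite) -> GatcSite:
    """DNA adenine methylase restores the fully methylated state."""
    return GatcSite(site.position, True, True)


if __name__ == "__main__":
    toy_origin = "TTGATCAAGATCGG"  # hypothetical sequence, not oriC
    sites = replicate(find_gatc_sites(toy_origin))
    print([(s.position, s.state, seqa_can_bind(s)) for s in sites])
    print([(s.position, s.state, seqa_can_bind(s)) for s in map(dam_methylate, sites)])
```

The sketch captures only the order of states (fully methylated, hemimethylated after synthesis, fully methylated again after the latent period); it says nothing about timing or about the other regulatory pathways involving DnaA.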

Future Outlook

The study of the enzymes involved in maintaining the topological state of DNA and those that function in DNA replication continues to be a fascinating area of ongoing research. Compared with where this field started, with the structure of duplex DNA by Watson and Crick, we now know a great deal about the individual enzymes, how they act as molecular machines, and how their activities are coordinated. Several major areas of investigation that address poorly understood aspects of biochemistry are the focus of various laboratories. A limited list of topics is presented here.

• Quinolones are antibiotic agents that specifically inhibit DNA gyrase and topoisomerase IV of E. coli. The rational development of quinolone derivatives with improved clinical efficacy may lead to better treatment of diseases caused by bacteria that resist the present collection of antibiotics.
• In the DnaA-dependent loading of the DnaB-DnaC complex at E. coli oriC, each complex is loaded at a specific site in the unwound region. A major unanswered question is what determines the location where each DnaB-DnaC complex is loaded.
• Mcm2-7 is the eukaryotic replicative DNA helicase. Future studies should provide mechanistic details of how this helicase is loaded onto duplex DNA at a eukaryotic replication origin. Other questions to be addressed are how the replication origin is unwound to permit the firm binding of the helicase to each parental single-stranded DNA and how the gap between the Mcm2 and Mcm5 subunits closes after loading of the Mcm2-7 complex onto duplex DNA or single-stranded DNA.
• For both bacterial and eukaryotic replicative DNA helicases, it is assumed that they act processively to unwind DNA. At the completion of DNA replication, what mechanism leads to their dissociation, or do the enzymes remain bound to DNA even in daughter cells after cell division?
• In eukaryotic cells, DNA polymerase α synthesizes the primer that is handed off to DNA polymerase δ, which extends the primer to synthesize Okazaki fragments. The leading strand is replicated by DNA polymerase ε. In some Gram-positive bacteria such as B. subtilis, current evidence suggests a similar handoff from a DNA polymerase containing DnaE to a DNA polymerase containing PolC, which is orthologous to a subunit of E. coli DNA polymerase III holoenzyme. PolC synthesizes the leading strand. Details about the handoff mechanism and whether it is an organized process will hopefully be addressed in future studies.
• At a replication fork, the primer synthesized by primase for the next Okazaki fragment is transferred to DNA polymerase III holoenzyme. The mechanism of transfer is poorly understood.


• In the process of replication fork movement, DNA polymerase III holoenzyme, which is attached to the DNA template strand by its interaction with the β clamp, stalls when it encounters a DNA lesion. Current evidence suggests that DNA polymerases IV and V of E. coli, which function in DNA repair, can synthesize DNA through the lesion because they are able to incorporate nucleotides that are not base paired in the Watson-Crick sense. These error-prone DNA polymerases are thought to exchange with DNA polymerase III by interacting with the β clamp already bound to DNA polymerase III. Because the β clamp is a dimer, one interacting domain is attached to DNA polymerase III, so the other domain is available to interact with DNA polymerase IV or V. Whether the exchange of the DNA polymerase at the 3′-end of the nascent DNA is random or organized awaits further studies.
• During the termination of replication in E. coli and B. subtilis and their close relatives, the first replication forks to arrive at termination (Ter) sites are blocked in a polar manner by specific protein-DNA complexes. Remaining questions concern whether similar mechanisms exist in other bacterial species and in higher organisms, the basis for the relatively low efficiency of fork arrest, whether the replisome remains stably bound at the stalled fork, and how the arrival of the second fork leads to the replication of Ter DNA and subsequent chromosome resolution.

Cross-References

▶ Control of Initiation in E. coli
▶ Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis
▶ Division of Labor
▶ DNA Polymerase III Structure
▶ DNA Replication, Chemical Biology of
▶ DNA Topology and Topoisomerases
▶ DnaA, DnaB, DnaC
▶ DnaX Complex Composition and Assembly within Cells
▶ Eukaryotic DNA Replicases
▶ Gyrase and Topoisomerase IV as Targets for Antibacterial Drugs
▶ Helicase and Primase Interactions with Replisome Components and Accessory Factors
▶ Helicase Mechanism of Action
▶ Initiation Complex Formation, Mechanism of
▶ Many Bacteria use a Special Mutagenic Pol III in Place of Pol V
▶ PCNA Loading by RFC, Mechanism of
▶ PCNA Structure and Interactions with Partner Proteins
▶ Replicative DNA Helicases and Primases
▶ Replication Origin of E. coli and the Mechanism of Initiation
▶ Topoisomerases and Cancer

Acknowledgments

I thank members of my lab for their support while I wrote. This work is supported by Grant GM090063 from the National Institutes of Health, and by the Michigan Agricultural Experiment Station.

References

Avery OT, Macleod CM, McCarty M (1944) Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a deoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med 79(2):137–158
Bell SP, Stillman B (1992) ATP-dependent recognition of eukaryotic origins of DNA replication by a multiprotein complex. Nature 357(6374):128–134
Bessman MJ et al (1958) Enzymatic synthesis of deoxyribonucleic acid. II. General properties of the reaction. J Biol Chem 233(1):171–177
Bochman ML, Schwacha A (2009) The Mcm complex: unwinding the mechanism of a replicative helicase. Microbiol Mol Biol Rev 73(4):652–683
Collin F, Karkare S, Maxwell A (2011) Exploiting bacterial DNA gyrase as a drug target: current state and perspectives. Appl Microbiol Biotechnol 92(3):479–497
DePamphilis ML et al (2006) Regulating the licensing of DNA replication origins in metazoa. Curr Opin Cell Biol 18(3):231–239
Diffley JF (2011) Quality control in the initiation of eukaryotic DNA replication. Philos Trans R Soc Lond B Biol Sci 366(1584):3545–3553
Ding Q, MacAlpine DM (2011) Defining the replication program through the chromatin landscape. Crit Rev Biochem Mol Biol 46(2):165–179
Eaton ML et al (2010) Conserved nucleosome positioning defines replication origins. Genes Dev 24(8):748–753
Errico A, Costanzo V (2010) Differences in the DNA replication of unicellular eukaryotes and metazoans: known unknowns. EMBO Rep 11(4):270–278
Griffith F (1928) The significance of pneumococcal types. J Hyg (Lond) 27(2):113–159
Hardy CD et al (2004) Disentangling DNA during replication: a tale of two strands. Philos Trans R Soc Lond B Biol Sci 359(1441):39–47
Hershey AD, Chase M (1952) Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol 36(1):39–56
Huberman JA, Riggs AD (1966) Autoradiography of chromosomal DNA fibers from Chinese hamster cells. Proc Natl Acad Sci U S A 55(3):599–606
Kaguni JM (2011) Replication initiation at the Escherichia coli chromosomal origin. Curr Opin Chem Biol 15(5):606–613
Kaplan DL, Bastia D (2009) Mechanisms of polar arrest of a replication fork. Mol Microbiol 72(2):279–285
Kawakami H, Katayama T (2010) DnaA, ORC, and Cdc6: similarity beyond the domains of life and diversity. Biochem Cell Biol 88(1):49–62
Lehman IR et al (1958) Enzymatic synthesis of deoxyribonucleic acid. I. Preparation of substrates and partial purification of an enzyme from Escherichia coli. J Biol Chem 233(1):163–170
Leonard AC, Grimwade JE (2011) Regulation of DnaA assembly and activity: taking directions from the genome. Annu Rev Microbiol 65:19–35
Meselson M, Stahl FW (1958) The Replication of DNA in Escherichia coli. Proc Natl Acad Sci U S A 44(7):671–682
Neylon C et al (2005) Replication termination in Escherichia coli: structure and antihelicase activity of the Tus-Ter complex. Microbiol Mol Biol Rev 69(3):501–526
Nieduszynski CA, Knox Y, Donaldson AD (2006) Genome-wide identification of replication origins in yeast by comparative genomics. Genes Dev 20(14):1874–1879
Remus D, Diffley JF (2009) Eukaryotic DNA replication control: lock and load, then fire. Curr Opin Cell Biol 21(6):771–777
Remus D, Beall EL, Botchan MR (2004) DNA topology, not DNA sequence, is a critical determinant for Drosophila ORC-DNA binding. EMBO J 23(4):897–907
Sclafani RA, Holzen TM (2007) Cell cycle regulation of DNA replication. Annu Rev Genet 41:237–280
Vashee S et al (2003) Sequence-independent DNA binding and replication initiation by the human origin recognition complex. Genes Dev 17(15):1894–1908
Vos SM et al (2011) All tangled up: how cells direct, manage and exploit topoisomerase function. Nat Rev Mol Cell Biol 12(12):827–841
Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171(4356):737–738
Yang W (2010) Topoisomerases and site-specific recombinases: similarities in structure and mechanism. Crit Rev Biochem Mol Biol 45(6):520–534


DNA Replication, Chemical Biology of

Charles S. McHenry
Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO, USA

Synopsis

DNA replication is an essential process required for the propagation of all pathogens. It therefore provides a useful target for the development of antimicrobial chemotherapeutic agents. The value of this target is increased by observations that blocking replication after it has initiated often leads to chromosomal degradation and bactericidal outcomes. This entry reviews recent advances in using purified replication systems to screen for antibacterial agents and in using naturally occurring antireplication proteins encoded by phages to identify new targets.

Introduction

DNA replication is an essential process for the proliferation of all pathogens and offers a largely unexplored target for the development of novel antibacterials. Therapeutically useful inhibitors have been developed against processes upstream (nucleotide precursor biosynthesis) (Hawser et al. 2006) and downstream (DNA gyrase) (Mitscher 2005) of DNA replication. Most of the subunits of the bacterial DNA replication apparatus are essential, suggesting that their inhibition should lead to blockage of cell proliferation or death (Kornberg and Baker 1992). This has been validated by a class of compounds, the 6-anilinouracils, targeted to the polymerase subunit of the Gram-positive replicase, PolC. These compounds are not only potent biochemical inhibitors but also specifically block DNA replication in Gram-positive bacteria (Daly et al. 2000). While screens targeting individual replicase subunits have been described (Butler and Wright 2008; Georgescu et al. 2008; Shapiro et al. 2005; Yang et al. 2002), complete bacterial replicases have only recently been explored by chemical genetics approaches (Dallmann et al. 2010).

Screens Exploiting Replicase Reconstituted from Purified Proteins

The ten subunits of E. coli DNA Pol III holoenzyme interact to form a remarkably complex protein machine (Glover and McHenry 2001; Jeruzalmi et al. 2001; Kim and McHenry 1996; Kim et al. 1996; McHenry 2011; Williams et al. 2003). Protein interactions change at the various steps of the replicative reaction. Counting all of the individual protein components and their interactions with other subunits and substrates, it has been estimated that upwards of 100 essential targets are potentially available for the development of antibacterial agents (Dallmann et al. 2010). Given the impracticality of running 100 specific screening assays, a biochemical high-throughput screen was developed in which inhibition of any of the essential targets could be detected through a common endpoint. In a trial screen of a small (20,000-compound) library against full replication systems derived from model Gram-negative and Gram-positive organisms in parallel, it was possible to distinguish compounds that inhibited the replicase of a single species from those that exhibited broad-spectrum potential. Counterscreens against non-orthologous enzymes with related activities revealed the compounds that are most likely to be target specific.
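The triage logic of such a parallel screen can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published analysis: the field names, the category labels, and the 0.5 fractional-inhibition cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScreenResult:
    """Fractional inhibition (0.0 to 1.0) of one compound in three assays."""
    compound: str
    gram_negative: float   # full Gram-negative replication system
    gram_positive: float   # full Gram-positive replication system
    counterscreen: float   # non-orthologous enzyme with a related activity


def classify(result: ScreenResult, cutoff: float = 0.5) -> str:
    """Sort a compound into a triage category. The cutoff is an assumed value."""
    hits_negative = result.gram_negative >= cutoff
    hits_positive = result.gram_positive >= cutoff
    if not (hits_negative or hits_positive):
        return "inactive"
    # Inhibiting an unrelated enzyme suggests the compound is not target specific.
    if result.counterscreen >= cutoff:
        return "likely nonspecific"
    if hits_negative and hits_positive:
        return "broad-spectrum candidate"
    return "Gram-negative specific" if hits_negative else "Gram-positive specific"


# Usage with made-up numbers:
for r in [ScreenResult("cpd-1", 0.9, 0.8, 0.1),
          ScreenResult("cpd-2", 0.9, 0.1, 0.0),
          ScreenResult("cpd-3", 0.9, 0.9, 0.9)]:
    print(r.compound, classify(r))
```

Because every essential target feeds into the same replication endpoint, a single pair of readouts per compound is enough for this first-pass sorting; identifying which of the many targets is actually hit requires follow-up assays.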

Exploiting Anti-replication Proteins Encoded by Phages to Identify Useful Targets

Bacteriophages that express peptides directed toward shutting down cellular processes, including DNA replication, are another evolving source of in vitro inhibitors for mechanistic studies, of tools for in vivo studies, and of leads to potential therapeutic targets. For example, some Staphylococcus aureus phages produce peptides that bind to and inhibit the β2 sliding clamp and the DnaI helicase loader (Belley et al. 2006; Liu et al. 2004). In addition, coliphage N4 produces a peptide inhibitor of E. coli DnaXcx that functions through interaction with the δ subunit (Yano and Rothman-Denes 2011). Crystal structures of complexes of these peptides and their targets should provide data that could support library design and the development of small-molecule inhibitors.

Expected Future Developments

Further development of this relatively new area promises to provide compounds that will facilitate biochemical investigation. Stage-specific inhibitors will likely be useful in arresting reactions so that normally transient intermediates can be isolated and studied, analogous to examples provided by inhibitors of nucleotide biosynthesis (Santi and McHenry 1972; Sintchak et al. 1996) and of enzymes involved in other aspects of nucleic acid metabolism (Classen et al. 2003; Ho et al. 2009). Such compounds should also be useful in vivo in combination with classical genetics, for example, in identifying resistant mutants for target identification or verification. A chemical genetics approach has multiple benefits. Addition of a compound gives an immediate response, permitting the kinetic effects of a block to be determined in real time. The effect of a compound can also be reversed readily. Broad-spectrum compounds can be used to modulate responses in many cell types, including genetically intractable ones.

References

Belley A, Callejo M, Arhin F, Dehbi M, Fadhil I, Liu J, McKay G, Srikumar R, Bauda P, Ha N, DuBow M, Gros P, Pelletier J, Moeck G (2006) Competition of bacteriophage polypeptides with native replicase proteins for binding to the DNA sliding clamp reveals a novel mechanism for DNA replication arrest in Staphylococcus aureus. Mol Microbiol 62:1132–1143
Butler MM, Wright GE (2008) A method to assay inhibitors of DNA polymerase IIIC activity. Methods Mol Med 142:25–36
Classen S, Olland S, Berger JM (2003) Structure of the topoisomerase II ATPase region and its mechanism of inhibition by the chemotherapeutic agent ICRF-187. Proc Natl Acad Sci U S A 100:10629–10634
Dallmann HG, Fackelmayer OJ, Tomer G, Chen J, Wiktor-Becker A, Ferrara T, Pope C, Oliveira MT, Burgers PM, Kaguni LS, McHenry CS (2010) Parallel multiplicative target screening against divergent bacterial replicases: identification of specific inhibitors with broad spectrum potential. Biochemistry 49:2551–2562
Daly JS, Giehl TJ, Brown NC, Zhi C, Wright GE, Ellison RT III (2000) In vitro antimicrobial activities of novel anilinouracils which selectively inhibit DNA polymerase III of Gram-positive bacteria. Antimicrob Agents Chemother 44:2217–2221
Georgescu RE, Yurieva O, Kim SS, Kuriyan J, Kong XP, O’Donnell M (2008) Structure of a small-molecule inhibitor of a DNA polymerase sliding clamp. Proc Natl Acad Sci U S A 105:11116–11121
Glover BP, McHenry CS (2001) The DNA polymerase III holoenzyme: an asymmetric dimeric replicative complex with leading and lagging strand polymerases. Cell 105:925–934
Hawser S, Lociuro S, Islam K (2006) Dihydrofolate reductase inhibitors as antibacterial agents. Biochem Pharmacol 71:941–948
Ho MX, Hudson BP, Das K, Arnold E, Ebright RH (2009) Structures of RNA polymerase-antibiotic complexes. Curr Opin Struct Biol 19:715–723
Jeruzalmi D, O’Donnell ME, Kuriyan J (2001) Crystal structure of the processivity clamp loader gamma complex of E. coli DNA polymerase III. Cell 106:429–441
Kim DR, McHenry CS (1996) Identification of the β-binding domain of the α subunit of Escherichia coli polymerase III holoenzyme. J Biol Chem 271:20699–20704
Kim S, Dallmann HG, McHenry CS, Marians KJ (1996) Coupling of a replicative polymerase and helicase: a τ-DnaB interaction mediates rapid replication fork movement. Cell 84:643–650
Kornberg A, Baker TA (1992) DNA replication. WH Freeman, New York
Liu J, Dehbi M, Moeck G, Arhin F, Bauda P, Bergeron D, Callejo M, Ferretti V, Ha N, Kwan T, McCarty J, Srikumar R, Williams D, Wu JJ, Gros P, Pelletier J, DuBow M (2004) Antimicrobial drug discovery through bacteriophage genomics. Nat Biotechnol 22:185–191
McHenry CS (2011) DNA replicases from a bacterial perspective. Annu Rev Biochem 80:403–436
Mitscher LA (2005) Bacterial topoisomerase inhibitors: quinolone and pyridone antibacterial agents. Chem Rev 105:559–592
Santi DV, McHenry CS (1972) 5-Fluoro-2′-deoxyuridylate: covalent complex with thymidylate synthetase. Proc Natl Acad Sci U S A 69:1855–1857
Shapiro A, Rivin O, Gao N, Hajec L (2005) A homogeneous, high-throughput fluorescence resonance energy transfer-based DNA polymerase assay. Anal Biochem 347:254–261
Sintchak MD, Fleming MA, Futer O, Raybuck SA, Chambers SP, Caron PR, Murcko MA, Wilson KP (1996) Structure and mechanism of inosine monophosphate dehydrogenase in complex with the immunosuppressant mycophenolic acid. Cell 85:921–930
Williams CR, Snyder AK, Kuzmic P, O’Donnell ME, Bloom LB (2003) Mechanism of loading the Escherichia coli DNA polymerase III sliding clamp I: two distinct activities for individual ATP sites in the γ complex. J Biol Chem 279:4376–4385
Yang F, Dicker IB, Kurilla MG, Pompliano DL (2002) PolC-type polymerase III of Streptococcus pyogenes and its use in screening for chemical inhibitors. Anal Biochem 304:110–116
Yano ST, Rothman-Denes LB (2011) A phage-encoded inhibitor of Escherichia coli DNA replication targets the DNA polymerase clamp loader. Mol Microbiol 79:1325–1338

DNA Synthesis ▶ Synthesis of Modified Oligonucleotides

DNA Topology and Topoisomerases

Adam C. Ketron and Neil Osheroff
Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synopsis

Topological relationships within the double helix (i.e., DNA supercoiling, tangling, and knotting) significantly influence the processes by which the genetic information is passed from generation to generation, expressed, and recombined in all living systems. In vivo, the topological structure of DNA is regulated by ubiquitous enzymes called topoisomerases. These enzymes act by generating transient breaks in the backbone of the genetic material. Topoisomerases can be separated into two major classes based on the number of DNA strands that they cleave: type I enzymes cut one strand of the double helix, while type II enzymes cut both. To date, approximately a dozen different topoisomerases have been described in prokaryotic, eukaryotic, and viral species. Among their many physiological functions, these enzymes help to set global levels of DNA supercoiling, alleviate the torsional stress that accumulates in front of replication forks and transcription complexes, unlink daughter chromosomes that are generated during replication, and remove DNA knots that form during recombination events. This entry will discuss biological and mathematical aspects of DNA topology and compare and contrast the reactions catalyzed by individual classes and subclasses of topoisomerases. Finally, it will describe how the reaction mechanisms of the different topoisomerases dictate their cellular functions.

Introduction

The genetic information of an organism is encoded in a linear array of DNA bases that is stored in the form of a double helix (Bates and Maxwell 2005; Deweese et al. 2008; Liu et al. 2009). Two critical features punctuate this elegant structure: base pairing and the intertwining of the two DNA strands. Both contribute to the physical integrity of the genome and provide the redundancy that is the underlying basis for DNA replication, recombination, and repair. In addition to the above, however, the interwound nature of the double helix imposes a number of topological constraints on the genetic material that affect all of its physiological functions (Bates and Maxwell 2005; Deweese et al. 2008; Liu et al. 2009).

DNA Topology

Topology is a field of mathematics that is concerned with “relationships that are not altered by elastic deformation” (Bates and Maxwell 2005; Deweese et al. 2008). How is this subject applied to DNA? As long as the ends of DNA are fixed in space and the double helix does not have free rotation, it can be considered to be a topologically closed system. Under these circumstances, the topological properties of DNA are defined as those that cannot be altered without breaking one or both strands of the double helix (Bates and Maxwell 2005; Deweese et al. 2008). In virtually every living system, chromosomes are comprised of extremely long DNA molecules that are circular or linear and are attached to membrane or protein supports. Thus, in general, this definition can be applied to all chromosomal DNA. Topological relationships in DNA can be divided into two categories: relationships between the two strands of the double helix (i.e., supercoiling) and relationships between different segments of duplex DNA (i.e., tangling and knotting) (Bates and Maxwell 2005; Deweese et al. 2008). Both affect DNA function in profound, but different, ways and are discussed below.

DNA Supercoiling

Double-stranded DNA that is free from torsional stress (i.e., the classical Watson-Crick structure with ~10.4 base pairs per turn) is defined as “relaxed” (Fig. 1; note that the DNA molecules in the figure are depicted as circular ribbon diagrams for simplicity. Similar topological structures exist in linear DNA molecules, as long as the ends of the molecule are fixed in space.) If torsional stress is applied by either under- or overwinding the DNA, molecules writhe about themselves to form superhelical twists (Fig. 1; Bates and Maxwell 2005; Deweese et al. 2008). Hence, DNA that is under torsional stress is called “supercoiled” (SC). Underwound DNA molecules are defined as negatively supercoiled [(−)SC], and overwound molecules are defined as positively supercoiled [(+)SC]. Globally, chromosomal (and extrachromosomal) DNA in bacteria and eukaryotes is underwound ~6% (Deweese et al. 2008). Because the two strands of the double helix must be separated in order for the genetic information to be replicated or expressed, under- and overwinding have important implications for DNA function (Bates and Maxwell 2005; Deweese et al. 2008). Negative supercoiling introduces energy into the
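The supercoiling figures quoted in this entry (~10.4 base pairs per turn for relaxed DNA and ~6% global underwinding) can be turned into a short worked calculation. The sketch below relies on the standard definitions of the relaxed linking number, Lk0 = N / 10.4, and the superhelical density, σ = ΔLk / Lk0, which are assumed here rather than taken from the text; the 5,200 bp molecule and the function names are hypothetical.

```python
BP_PER_TURN = 10.4  # helical repeat of relaxed duplex DNA, as quoted in the text


def relaxed_linking_number(length_bp: float) -> float:
    """Lk0: number of helical turns in a relaxed, topologically closed duplex."""
    return length_bp / BP_PER_TURN


def linking_difference(length_bp: float, sigma: float) -> float:
    """Delta Lk implied by a superhelical density sigma (sigma = dLk / Lk0)."""
    return sigma * relaxed_linking_number(length_bp)


def describe(sigma: float) -> str:
    """Name the topological state from the sign of sigma."""
    if sigma < 0:
        return "negatively supercoiled (underwound)"
    if sigma > 0:
        return "positively supercoiled (overwound)"
    return "relaxed"


# Usage: a hypothetical 5,200 bp circular molecule underwound by ~6%.
length_bp = 5200
sigma = -0.06
print(round(relaxed_linking_number(length_bp), 1))     # about 500 turns
print(round(linking_difference(length_bp, sigma), 1))  # about -30 turns
print(describe(sigma))
```

For a molecule of this size, ~6% underwinding corresponds to a deficit of roughly 30 helical turns relative to the relaxed state, and that deficit cannot change unless a strand is broken.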

DNA Topology and Topoisomerases, Fig. 1 Topological relationships within DNA. DNA molecules are shown as circular ribbons for simplicity. Top: DNA with no torsional stress is referred to as “relaxed.” Underwinding or overwinding DNA results in negative supercoils [(−)SC] or positive supercoils [(+)SC], respectively. The directionality of the DNA is shown by internal arrowheads in the (−)SC molecule. Supercoils are shown as writhes (DNA crossovers or nodes) for visual ease, but it should be noted that supercoils can be interconverted from writhes to twists. By convention, each writhe (denoted by the crossing of one DNA segment over another segment) is given an integral value of −1 or +1. Middle: Tracing the direction of the DNA, the sign of the node is assigned based on the direction of movement required to align the front segment of DNA with the back segment using a rotation of

99%), making studies that focus on disrupting a tetrameric DnaX irrelevant. A new model has been proposed for mixed γ/τ DnaXcx formation, in which assembly occurs at low, physiological concentrations of DnaX in the presence of protein partners that steer the reaction (McHenry 2011; Fig. 2). The assembly reaction takes advantage of the “transformer” properties of DnaX, which can assume different stable stoichiometries depending on its protein partners. Unassembled DnaX in the cell would start out as a monomer. Upon association with Pol III, the reaction will be driven to Pol III-τ-τ-Pol III formation (Kd = 70 pM) (Kim and McHenry 1996a). This assembly can interact with γ monomers, as shown by the time-dependent entry of γ into complexes with τ in the presence of Pol III (Pritchard and McHenry 2001). A structure of the ψ amino-terminus complexed with γ3δδ′χψ shows that ψ interacts asymmetrically with the three DnaX protomers (Simonetta et al. 2009). The contact shown in Fig. 2 designates the cross-link observed with the extreme amino-terminus of ψ (▶ Mechanism of Initiation Complex Formation, Mechanism of β2 Loading on Primed DNA). Previous studies have shown that δ′ by itself, but not δ by itself, can interact with DnaX assemblies (Song and McHenry 2001). The association reaction is ~100-fold slower than the association of χψ with DnaX (Gao and McHenry 2001a), ensuring ordered assembly. δ will bind in a highly cooperative reaction in the presence of δ′, leading to closure of the pentameric DnaXcx ring (Gao and McHenry 2001a; Song et al. 2001a). The factors that cause ψ to interact uniquely in the orientation shown with three DnaX protomers, and δ′ uniquely with DnaX subunit D (γ) (Fig. 1), are not known, but these interactions have been shown to occur in vivo through cross-linking studies of authentic Pol III HE (Glover and McHenry 2000).
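To give a feel for what a 70 pM Kd means at low concentrations, the sketch below evaluates a simple 1:1 binding isotherm, fraction bound = [L] / ([L] + Kd). This is a deliberate simplification and an assumption on our part: it ignores the multi-step, cooperative assembly described above and treats the free partner concentration as fixed. The concentrations in the usage lines are hypothetical, chosen only to bracket the Kd.

```python
KD_PM = 70.0  # dissociation constant quoted in the text, in picomolar


def fraction_bound(free_partner_pm: float, kd_pm: float = KD_PM) -> float:
    """Fraction of sites occupied for simple 1:1 binding at a given
    free partner concentration (both values in picomolar)."""
    if free_partner_pm < 0 or kd_pm <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_partner_pm / (free_partner_pm + kd_pm)


# Usage: hypothetical free concentrations spanning four orders of magnitude.
for conc_pm in (7.0, 70.0, 700.0, 7000.0):
    print(f"{conc_pm:>7.0f} pM -> {fraction_bound(conc_pm):.3f} bound")
```

At a free concentration equal to the Kd, half of the sites are occupied; at a 100-fold excess, occupancy exceeds 99%. Under this simplified picture, an interaction this tight is effectively saturated even at sub-nanomolar concentrations.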

Cross-References

▶ Initiation Complex Formation, Mechanism of

References

Becherel OJ, Fuchs RPP, Wagner J (2002) Pivotal role of the β-clamp in translesion DNA synthesis and mutagenesis in E. coli cells. DNA Repair (Amst) 1:703–708
Blinkova A, Hervas C, Stukenberg PT, Onrust R, O’Donnell ME, Walker JR (1993) The Escherichia coli DNA polymerase III holoenzyme contains both products of the dnaX gene, τ and γ, but only τ is essential. J Bacteriol 175:6018–6027
Blinkowa AL, Walker JR (1990) Programmed ribosomal frameshifting generates the Escherichia coli DNA polymerase III γ subunit from within the τ subunit reading frame. Nucleic Acids Res 18:1725–1729
Burnouf DY, Olieric V, Wagner J, Fujii S, Reinbolt J, Fuchs RP, Dumas P (2004) Structural and biochemical analysis of sliding clamp/ligand interactions suggest a competition between replicative and translesion DNA polymerases. J Mol Biol 335:1187–1197
Cull MG, McHenry CS (1995) Purification of Escherichia coli DNA polymerase III holoenzyme. Methods Enzymol 262:22–35
Dallmann HG, McHenry CS (1995) DnaX complex of Escherichia coli DNA polymerase III holoenzyme: physical characterization of the DnaX subunits and complexes. J Biol Chem 270:29563–29569
Dohrmann PR, McHenry CS (2005) A bipartite polymerase-processivity factor interaction: only the internal β binding site of the α subunit is required for processive replication by the DNA polymerase III holoenzyme. J Mol Biol 350:228–239
Downey CD, McHenry CS (2010) Chaperoning of a replicative polymerase onto a newly-assembled DNA-bound sliding clamp by the clamp loader. Mol Cell 37:481–491
Flower AM, McHenry CS (1990) The γ subunit of DNA polymerase III holoenzyme of Escherichia coli is produced by ribosomal frameshifting. Proc Natl Acad Sci U S A 87:3713–3717
Foster PL (2005) Stress responses and genetic variation in bacteria. Mutat Res 569:3–11
Gao D, McHenry CS (2001a) τ binds and organizes Escherichia coli replication proteins through distinct domains. Domain III, shared by γ and τ, binds δδ′ and χψ. J Biol Chem 276:4447–4453
Gao D, McHenry CS (2001b) τ binds and organizes Escherichia coli replication proteins through distinct domains. Domain IV, located within the unique C terminus of τ, binds the replication fork helicase, DnaB. J Biol Chem 276:4441–4446
Gao D, McHenry C (2001c) τ binds and organizes E. coli replication proteins through distinct domains: partial proteolysis of terminally tagged τ to determine candidate domains and to assign domain V as the α binding domain. J Biol Chem 276:4433–4440
Glover BP, McHenry CS (1998) The χψ subunits of DNA polymerase III holoenzyme bind to single-stranded DNA-binding protein (SSB) and facilitate replication of a SSB-coated template. J Biol Chem 273:23476–23484
Glover BP, McHenry CS (2000) The DnaX-binding subunits δ′ and ψ are bound to γ and not τ in the DNA polymerase III holoenzyme. J Biol Chem 275:3017–3020
Hersh MN, Ponder RG, Hastings PJ, Rosenberg SM (2004) Adaptive mutation and amplification in Escherichia coli: two pathways of genome adaptation under stress. Res Microbiol 155:352–359
Jarosz DF, Beuning PJ, Cohen SE, Walker GC (2007) Y-family DNA polymerases in Escherichia coli. Trends Microbiol 15:70–77
Jeruzalmi D, O’Donnell ME, Kuriyan J (2001) Crystal structure of the processivity clamp loader gamma complex of E. coli DNA polymerase III. Cell 106:429–441
Kim DR, McHenry CS (1996a) Biotin tagging deletion analysis of domain limits involved in protein-macromolecular interactions: mapping the τ binding domain of the DNA polymerase III α subunit. J Biol Chem 271:20690–20698
Kim DR, McHenry CS (1996b) Identification of the β-binding domain of the α subunit of Escherichia coli polymerase III holoenzyme. J Biol Chem 271:20699–20704
Kim S, Dallmann HG, McHenry CS, Marians KJ (1996a) Coupling of a replicative polymerase and helicase: a τ-DnaB interaction mediates rapid replication fork movement. Cell 84:643–650
Kim S, Dallmann HG, McHenry CS, Marians KJ (1996b) τ protects β in the leading-strand polymerase complex at the replication fork. J Biol Chem 271:4315–4318
Larsen B, Wills NM, Nelson C, Atkins JF, Gesteland RF (2000) Nonlinearity in genetic decoding: homologous DNA replicase genes use alternatives of transcriptional slippage or translational frame shifting. Proc Natl Acad Sci U S A 97:1683–1688
McHenry CS (1982) Purification and characterization of DNA polymerase III′: identification of τ as a subunit of the DNA polymerase III holoenzyme. J Biol Chem 257:2657–2663
McHenry CS (2011) DNA replicases from a bacterial perspective. Annu Rev Biochem 80:403–436
McHenry CS, Kornberg A (1977) DNA polymerase III holoenzyme of Escherichia coli purification and resolution into subunits. J Biol Chem 252:6478–6484 (and erratum 253:645)
McHenry CS, Oberfelder R, Johanson K, Tomasiewicz H, Franden MA (1987) Structure and mechanism of the DNA polymerase III holoenzyme. In: Kelly TJ, McMacken R (eds) DNA replication and recombination. Alan R Liss, New York, pp 47–62
McInerney P, Johnson A, Katz F, O’Donnell M (2007) Characterization of a triple DNA polymerase replisome. Mol Cell 27:527–538
Modrich P (1989) Methyl-directed DNA mismatch correction. J Biol Chem 264:6597–6600
Olson MW, Dallmann HG, McHenry CS (1995) DnaX complex of Escherichia coli DNA polymerase III holoenzyme: the χψ complex functions by increasing the affinity of τ and γ for δδ′ to a physiologically relevant range. J Biol Chem 270:29570–29577
Onrust R, Finkelstein J, Naktinis V, Turner J, Fang L, O’Donnell ME (1995) Assembly of a chromosomal replication machine: two DNA polymerases, a clamp loader, and sliding clamps in one holoenzyme particle. I. Organization of the clamp loader. J Biol Chem 270:13348–13357
Pages V, Fuchs RPP (2002) How DNA lesions are turned into mutations within cells? Oncogene 21:8957–8966
Pritchard AE, McHenry CS (2001) Assembly of DNA polymerase III holoenzyme: co-assembly of γ and τ is inhibited by DnaX complex accessory proteins but stimulated by DNA polymerase III core. J Biol Chem 276:35217–35222
Pritchard AE, Dallmann HG, McHenry CS (1996) In vivo assembly of the τ-complex of the DNA polymerase III holoenzyme expressed from a five-gene artificial operon: cleavage of the τ-complex to form a mixed γτ-complex by the OmpT protease. J Biol Chem 271:10291–10298
Pritchard AE, Dallmann HG, Glover BP, McHenry CS (2000) A novel assembly mechanism for the DNA polymerase III holoenzyme DnaX complex: association of δδ′ with DnaX(4) forms DnaX(3)δδ′. EMBO J 19:6536–6545
Reyes-Lamothe R, Sherratt DJ, Leake MC (2010) Stoichiometry and architecture of active DNA replication machinery in Escherichia coli. Science 328:498–501
Simonetta KR, Kazmirski SL, Goedken ER, Cantor AJ, Kelch BA, McNally R, Seyedin SN, Makino DL, O’Donnell M, Kuriyan J (2009) The mechanism of ATP-dependent primer-template recognition by a clamp loader complex. Cell 137:659–671
Song MS, McHenry CS (2001) Carboxyl-terminal domain III of the δ′ subunit of DNA polymerase III holoenzyme binds DnaX and supports cooperative DnaX complex assembly. J Biol Chem 276:48709–48715
Song MS, Dallmann HG, McHenry CS (2001a) Carboxyl-terminal domain III of the δ′ subunit of the DNA polymerase III holoenzyme binds δ. J Biol Chem 276:40668–40679
Song MS, Pham PT, Olson M, Carter JR, Franden MA, Schaaper RM, McHenry CS (2001b) The δ and δ′ subunits of the DNA polymerase III holoenzyme are essential for initiation complex formation and processive elongation. J Biol Chem 276:35165–35175
Tippin B, Pham P, Goodman MF (2004) Error-prone replication for better or worse. Trends Microbiol 12:288–295
Tsuchihashi Z, Kornberg A (1990) Translational frameshifting generates the γ subunit of DNA polymerase III holoenzyme. Proc Natl Acad Sci U S A 87:2516–2520
Vass RH, Chien P (2013) Critical clamp loader processing by an essential AAA+ protease in Caulobacter crescentus. Proc Natl Acad Sci U S A 110:18138–18143
Yeiser B, Pepper ED, Goodman MF, Finkel SE (2002) SOS-induced DNA polymerases enhance long-term survival and evolutionary fitness. Proc Natl Acad Sci U S A 99:8737–8741
Yuan Q, McHenry CS (2009) Strand displacement by DNA polymerase III occurs through a τ-ψ-χ link to SSB coating the lagging strand template. J Biol Chem 284:31672–31679

Double-Strand Break Repair

Gilbert Chu
Departments of Medicine and Biochemistry, Division of Oncology, CCSR 1145, Stanford University School of Medicine, Stanford, CA, USA

Synopsis

DNA double-strand breaks (DSBs) are the most dangerous form of DNA damage and can lead to cell death, mutation, or malignant transformation. Mammalian cells use three major pathways to repair DSBs: homologous recombination (HR), classical nonhomologous end joining (C-NHEJ), and alternative nonhomologous end joining (A-NHEJ). The choice among the pathways is governed by their interactions with CtIP and 53BP1. HR is restricted to S and G2 phases of the cell cycle, utilizing homologous newly replicated DNA for error-free repair. Inherited mutations in the HR genes BRCA1, BRCA2, and PALB2 cause susceptibility to breast, ovarian, and pancreatic cancer. C-NHEJ operates during all phases of the cell cycle and during V(D)J recombination of the antibody and T-cell receptor genes. It joins DNA ends while minimizing nucleotide loss and addition, although repair remains error-prone. Inherited mutations in the C-NHEJ genes DNA-PKcs, Artemis, and XLF cause the syndrome of radiation sensitivity with severe combined immunodeficiency. A-NHEJ is a backup pathway for C-NHEJ, joining DNA ends after resection back to regions of microhomology for error-prone repair. A-NHEJ is responsible for creating the chromosomal translocations seen in cancer. Thus, understanding DSB repair pathways will lead to insights for immunology and cancer.

Introduction

DNA double-strand breaks (DSBs) pose grave problems for the mammalian cell. Unrepaired breaks lead to cell death. Inaccurate repair generates mutations or chromosome translocations, which can lead to cancer.

DSBs arise from many sources. Exogenous sources include ionizing radiation and topoisomerase II inhibitors. Ionizing radiation produces ends with non-ligatable nucleotides containing ruptured ribose rings or aberrant chemical groups, such as 5′-hydroxyl, 3′-phosphate, or 3′-phosphoglycolate groups. The damaged nucleotide must be either repaired or removed prior to ligation. Topoisomerase II inhibitors include the anticancer drugs etoposide and doxorubicin. Topoisomerase II untangles newly replicated DNA by cutting both strands of one DNA double helix to allow a second DNA double helix to pass through. Topoisomerase II inhibitors trap the enzyme as a protein-bridged DSB intermediate.


Endogenous sources of DSBs include V(D)J recombination and class switch recombination. V(D)J recombination generates DSBs with blunt ends and unusual hairpin ends in order to create a diverse repertoire of B-cell antibodies and T-cell receptors that recognize foreign antigens. The brief definition entitled ▶ “V(D)J Recombination” provides additional information. Class switch recombination switches the constant region of an antibody while preserving its variable region. After class switching, the antibody recognizes the same antigen but interacts with a different effector molecule to activate a specific immunological response. Befitting the danger of DSBs, eukaryotic cells have evolved three repair pathways (Fig. 1). Homologous recombination (HR) occurs during S and G2 phases of the cell cycle, utilizing homologous DNA (usually from the newly replicated sister chromosome) to repair DSBs. HR is a conservative repair pathway because it restores the DNA sequence even if the source of the DSB disrupts nucleotides near the ends. Classical nonhomologous end joining (C-NHEJ) joins compatible DNA ends precisely and noncompatible DNA ends with limited nucleotide deletion or addition. C-NHEJ is nonconservative but optimizes the preservation of DNA sequence. Alternative nonhomologous end joining (A-NHEJ) repairs DSBs by deleting nucleotides back to regions of microhomology, creating junctions with deletions generally larger than those from C-NHEJ. Thus, A-NHEJ is the least conservative pathway. Prokaryotes and eukaryotes differ significantly in how they perform DSB repair. However, the eukaryotic DSB repair pathways are conserved from yeast to mammals. This review emphasizes the mammalian pathways and highlights their relevance for human disease.
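The A-NHEJ outcome described above (deletion back to a region of microhomology, with one copy of the microhomology retained at the junction) can be illustrated with a minimal Python sketch. The function name, the representation of each broken end as a top-strand string, and the fixed microhomology length `k` are assumptions made purely for illustration; the sketch does not model the enzymes involved.

```python
def join_by_microhomology(left, right, k=3):
    """Toy model of a microhomology-mediated junction (hypothetical helper).

    left  -- top-strand sequence ending at the break
    right -- top-strand sequence starting at the break
    k     -- assumed microhomology length (illustrative choice)

    Returns (junction, microhomology), or None if no shared k-mer exists.
    """
    best = None  # (nucleotides deleted, junction, microhomology)
    for i in range(len(left) - k + 1):
        kmer = left[i:i + k]
        j = right.find(kmer)  # copy of the k-mer nearest the break on the right
        if j == -1:
            continue
        # Nucleotides removed from both ends, not counting the lost k-mer copy.
        deleted = (len(left) - (i + k)) + j
        if best is None or deleted < best[0]:
            # Keep one copy of the microhomology at the junction.
            best = (deleted, left[:i + k] + right[j + k:], kmer)
    if best is None:
        return None
    return best[1], best[2]


print(join_by_microhomology("ATGCCGTACGA", "TTCGTAGGCT"))
# ('ATGCCGTAGGCT', 'CGT')
```

In this toy example, nine of the original 21 nucleotides are lost, which reflects why junctions formed this way carry larger deletions than those expected from precise end joining.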

Double-Strand Break Repair, Fig. 1 Pathways for DSB repair. Homologous recombination (HR) utilizes homologous DNA (red) to conserve DNA sequence. Classical nonhomologous end joining (C-NHEJ) minimizes loss of DNA sequence (black). Alternative NHEJ (A-NHEJ) deletes nucleotides back to regions of microhomology, leaving one copy of the microhomology region (black box)

Double-Strand Break Repair, Fig. 2 Choice of pathway for DSB repair. MRN tethers the DNA ends to each other and activates the ATM kinase. ATM phosphorylates its target proteins to coordinate the cellular response to the DSB. Binding of 53BP1 blocks end resection and channels repair toward C-NHEJ, which is initiated by binding of Ku to the DNA ends. Binding of CtIP to MRN leads to A-NHEJ or HR. In S and G2 phases of the cell cycle, cyclin-dependent kinase CDK2 phosphorylates CtIP, committing repair to HR

Choice of the Pathway for Repairing DSBs

Choice of the repair pathway can depend on the source and timing of the DSB. V(D)J recombination generates two DSBs and funnels their joining toward C-NHEJ (Dudley et al. 2005). Class switch recombination generates two DSBs that are joined by both C-NHEJ and A-NHEJ (Yan et al. 2007). In S and G2 phases of the cell cycle, cyclin-dependent kinase CDK2 directs DSBs toward repair by HR.

MRN (Mre11-Rad50-Nbs1) accumulates rapidly at sites of DSBs, holding the two DNA ends together (Fig. 2) (Stracker and Petrini 2011). Consistent with its role in tethering DNA ends, MRN has an overall stoichiometry of Mre11₂-Rad50₂-Nbs1₂. The proteins in the MRN complex have distinct biochemical activities. Rad50 contains a globular domain and an extended coiled-coil ending in a hook domain that facilitates homodimerization of two Rad50 molecules. Since Rad50 interacts with Mre11 and Nbs1, Rad50 homodimerization generates the full MRN complex. Mre11 can tether two DNA ends to each other and has endonuclease and 3′–5′ exonuclease activities in vitro. The exonuclease activity appears paradoxical, since the search for homologous sequence during HR requires DNA with a 3′ overhang. Indeed, other nucleases catalyze bulk 5′–3′ resection, as described below.

After binding to DNA ends at the DSB, MRN recruits ATM (ataxia telangiectasia mutated), a serine-threonine protein kinase (Stracker and Petrini 2011). ATM kinase phosphorylates a number of target proteins, leading to induction of cell cycle checkpoints, chromatin remodeling, and activation of DNA repair.

Mutations in the genes encoding the MRN and ATM proteins cause a group of related human diseases, consistent with their interaction with each other. ATM mutations cause ataxia telangiectasia (Savitsky et al. 1995), an autosomal recessive disorder characterized by poor coordination (ataxia), dilation of small blood vessels (telangiectasia), cellular sensitivity to ionizing radiation, immunodeficiency, and high risk for lymphoma and leukemia. Hypomorphic (partially inactivating) recessive mutations in Mre11 cause ataxia telangiectasia-like disorder (ATLD) (Stewart et al. 1999), manifesting ataxia and radiation sensitivity, but distinguished from ataxia telangiectasia by normal immunity, normal cancer risk, slower disease progression, and absence of telangiectasias. Hypomorphic Nbs1 mutations cause Nijmegen breakage syndrome (NBS) (Carney et al. 1998; Varon et al. 1998), which is characterized by microcephaly, facial abnormalities, short stature, immunodeficiency, radiation sensitivity, and high risk for lymphoma.

CtIP (CtBP [carboxy-terminal-binding protein]-interacting protein) was originally identified as a cofactor for the transcriptional repressor CtBP (You and Bailis 2010). Activation of the ATM kinase permits CtIP recruitment to DSBs, where it interacts with the Nbs1 component of MRN. The interaction between CtIP and MRN facilitates repair by either A-NHEJ or HR (Fig. 2). CtIP undergoes phosphorylation by CDK2 during S and G2 phases of the cell cycle, channeling the DSB toward repair by HR.
The p53-binding protein 1 (53BP1) also affects the choice of repair pathway (Panier and Boulton 2014). DSBs activate the ATM signaling cascade, which leads to the phosphorylation of the histone H2A variant H2AX together with the recruitment of 53BP1 to the damaged chromatin. Binding of 53BP1 to damaged chromatin blocks end resection at DSBs, preserving the DNA ends for binding by Ku and thus channeling repair toward C-NHEJ. Table 1 summarizes the proteins involved in choosing the DSB repair pathway.

Double-Strand Break Repair, Table 1 Proteins involved in choosing the pathway for double-strand break repair

Protein | Biochemical activity | Human disease
Ku | DNA end binding | -
Mre11 | Nuclease component of MRN complex (Mre11-Rad50-Nbs1) | Ataxia telangiectasia-like disorder (Stewart et al. 1999)
Rad50 | Weak ATPase, component of MRN complex | Nijmegen breakage syndrome-like disorder (Waltes et al. 2009)
Nbs1 | Nuclear localization, component of MRN complex | Nijmegen breakage syndrome (Carney et al. 1998; Varon et al. 1998)
ATM | Protein kinase, signaling of the presence of DSBs | Ataxia telangiectasia (Savitsky et al. 1995)
CtIP | DNA binding, promotion of MRN nuclease activity | -
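The pathway-choice logic summarized in this section and in Fig. 2 can be written as a small decision function. This is a deliberately simplified sketch: the names `Pathway` and `choose_pathway` are hypothetical, and treating CtIP binding as a single yes/no input (with 53BP1 blocking resection whenever CtIP is not engaged) is an assumption that ignores the quantitative competition between factors in a real cell.

```python
from enum import Enum


class Pathway(Enum):
    HR = "homologous recombination"
    C_NHEJ = "classical nonhomologous end joining"
    A_NHEJ = "alternative nonhomologous end joining"


def choose_pathway(phase, ctip_bound):
    """Toy reading of Fig. 2 (illustrative, not a mechanistic model).

    phase      -- cell cycle phase: "G1", "S", "G2", or "M"
    ctip_bound -- True if CtIP has bound MRN at the break; False is taken
                  to mean 53BP1 blocks resection and Ku binds the ends
    """
    if not ctip_bound:
        return Pathway.C_NHEJ
    if phase in ("S", "G2"):
        # CDK2 phosphorylates CtIP in S and G2, committing repair to HR.
        return Pathway.HR
    return Pathway.A_NHEJ


for phase, ctip in [("G1", False), ("G1", True), ("G2", True)]:
    print(phase, ctip, choose_pathway(phase, ctip).name)
```

The binary inputs are chosen only to mirror the branches drawn in Fig. 2.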

Homologous Recombination (HR)

Table 2 summarizes the HR proteins. The MRN/CtIP complex resects DNA ends, including ends with covalent modifications or bulky adducts (Buis et al. 2008; Sartori et al. 2007). MRN/CtIP resects no more than 50–110 nucleotides in the 3′–5′ direction, and this nuclease activity is dispensable for unblocked DNA ends. However, ATM-mediated phosphorylation of CtIP together with BRCA1 (breast cancer susceptibility type 1)-mediated ubiquitination of CtIP leads to the extended 5′–3′ resection of 100–200 nucleotides to form the 3′ overhang required for homology searching during HR (San Filippo et al. 2008). Extended resection requires MRN, RPA (replication protein A), the nucleases Exo1 and DNA2, and the helicase BLM (Bloom’s syndrome protein) (Nimonkar et al. 2011).

Double-Strand Break Repair, Table 2 Proteins involved in homologous recombination

Protein | Biochemical activity | Human disease
Mre11-Rad50-Nbs1 | See Table 1 | See Table 1
RPA | Single-stranded DNA binding | -
Exo1 | Exonuclease | Hereditary nonpolyposis colon cancer (Wu et al. 2001)
DNA2 | Helicase, endonuclease | -
BLM | Helicase, partner of EXO1 and DNA2 | Bloom’s syndrome (Ellis et al. 1995)
BRCA1 | Ubiquitin ligase | Hereditary breast and ovarian cancer (Hall et al. 1990)
BARD1 | Partner of BRCA1 | -
BRCA2 | DNA binding, loading of Rad51 onto ssDNA | Hereditary breast and ovarian cancer (Wooster et al. 1994)
DSS1 | Partner of BRCA2 | Split-hand/split-foot syndrome (Crackower et al. 1996)
PALB2 | Partner and nuclear localizer of BRCA2, binding to BRCA1 | Familial breast or pancreatic cancers (Jones et al. 2009; Rahman et al. 2007)
Rad51 | DNA binding, ATPase, formation of nucleoprotein filament | -

Bloom’s syndrome is an autosomal recessive disorder characterized by short stature, rash upon sun exposure, and a high risk for cancer, including leukemia, lymphoma, and carcinoma. Bloom’s syndrome cells show a strikingly elevated frequency of exchange events between sister chromatids (sister chromatid exchanges).

Neither Exo1 nor DNA2 can generate the extensive 3′ overhang required for HR when acting as a purified protein in isolation. Exo1 has a relatively weak 5′–3′ exonuclease activity. DNA2 has endonuclease activity that will degrade either a 5′ or 3′ end. However, each nuclease forms a complex with partner proteins to form an independent resection machine: BLM-DNA2-RPA-MRN or Exo1-BLM-RPA-MRN (Nimonkar et al. 2011). In the DNA2-dependent resection machine, DNA2 and BLM act together, utilizing the helicase activity of BLM to enhance the endonuclease activity of DNA2 (Fig. 3). RPA, which binds to ssDNA, enforces the 5′–3′ polarity of resection. In the Exo1-dependent resection machine, BLM and MRN increase the affinity of Exo1 for DNA ends. MRN and RPA increase the processivity of Exo1 resection. In both resection machines, RPA prevents the ssDNA from folding over and hybridizing to itself to form secondary structures. The intrinsic Mre11 nuclease activity of MRN is not required for either machine. Rather, MRN provides a scaffolding function to recruit and stimulate the nuclease activities of Exo1 and DNA2. Both resection machines create DNA ends with 3′ ssDNA overhangs of 100–200 nucleotides coated with RPA.

It is not clear why HR utilizes two different resection machines, particularly because both machines include BLM, MRN, and RPA. However, the different activities of Exo1 and DNA2 as isolated nucleases raise the possibility that they have distinct functions other than simple end resection. For example, the endonuclease activity of DNA2 may allow removal of blocked 5′ ends.

Following resection, several proteins cooperate to create a nucleoprotein filament that allows the ssDNA to invade homologous DNA. BRCA1 and BRCA2 (breast cancer susceptibility type 2) play key roles in loading Rad51 onto the ssDNA (Fig. 3). CDK2-mediated phosphorylation of Ser327 in CtIP facilitates its binding to BRCA1, forming a ternary complex of CtIP, MRN, and BRCA1 (Chen et al. 2008). BRCA1 recruits BRCA2 to the DSB by binding to PALB2 (partner and nuclear localizer of BRCA2), which in turn binds to BRCA2 (Sy et al. 2009; Zhang et al. 2009). Together with DSS1 (deleted in split-hand/split-foot syndrome), BRCA2 binds to DNA via two DNA-binding domains (Yang et al. 2002). One domain binds to ssDNA and the other binds to double-stranded DNA, positioning BRCA2 at the transition point from double-stranded DNA to the 3′ ssDNA overhang. Inherited mutations in BRCA1, BRCA2, and PALB2 confer susceptibility to breast, ovarian, and prostate cancer (see the brief definition ▶ “Hereditary Breast and Ovarian Cancer and Poly(ADP-Ribose) Polymerase Inhibition”).

BRCA2 and PALB2 load Rad51 onto the ssDNA, displacing RPA (Fig. 3). Rad51 exists as a heptamer in solution but upon binding to ssDNA forms a nucleoprotein filament with six Rad51 monomers per helical turn (San Filippo et al. 2008). Rad51 has an ATPase activity that permits turnover via dissociation from the DNA.

Rad51 is the recombinase that mediates strand exchange for HR. The nucleoprotein filament of Rad51-coated ssDNA captures a double-stranded DNA molecule and searches for homologous sequence (Fig. 3). BRCA2 and PALB2 act synergistically to stimulate Rad51-mediated strand exchange and D-loop formation (Buisson et al. 2010; Dray et al. 2010). The invading 3′ end of the ssDNA primes DNA synthesis from the homologous DNA molecule, replacing DNA sequence lost from the broken DNA molecule.

The HR pathway can also lead to crossing-over between homologous chromosomes. Thus, HR-directed repair of a DSB during mitosis can lead to a detrimental loss of heterozygosity. In addition to its role in resection, BLM forms a complex with topoisomerase IIIα to suppress crossing-over during homologous recombination (Wu and Hickson 2003). This role may explain why Bloom’s syndrome cells have a high rate of sister chromatid exchange.

Double-Strand Break Repair, Fig. 3 Model for HR. During S and G2, CDK2 phosphorylates CtIP. BRCA1 then binds and ubiquitinates CtIP. The DNA ends undergo extended 5′–3′ resection catalyzed by BLM-DNA2-RPA-MRN or Exo1-BLM-RPA-MRN. Resection generates 3′ ssDNA tails of 100–200 nucleotides, which are protected by RPA binding. BRCA1 recruits PALB2 and BRCA2, which load Rad51 onto the ssDNA, displacing RPA to form a Rad51 nucleoprotein filament. Rad51 mediates strand exchange, forming a D-loop (named for the shape of the DNA structure). Strand exchange creates a primer for DNA synthesis on the homologous DNA template. The homology-guided reaction restores the broken DNA to its original sequence

Double-Strand Break Repair, Table 3 Proteins involved in classical nonhomologous end joining

Protein | Biochemical activity | Human disease
Ku | DNA end binding | -
DNA-PKcs | DNA-dependent protein kinase, synapsis of DNA ends | Radiation sensitivity with severe combined immunodeficiency (van der Burg et al. 2009)
XRCC4 | DNA binding, partner of ligase IV | -
Ligase IV | DNA ligase | Nijmegen breakage syndrome-like disorder (O’Driscoll et al. 2001)
XLF (Cernunnos) | DNA binding, promotion of mismatched end ligation | Radiation sensitivity with severe combined immunodeficiency (Ahnesorg et al. 2006; Buck et al. 2006)
PNKP | Polynucleotide kinase/phosphatase | Microcephaly and seizures (Shen et al. 2010)
Artemis | Endonuclease, exonuclease | Radiation sensitivity with severe combined immunodeficiency (Moshous et al. 2001)
Pol mu | DNA polymerase | -
Pol lambda | DNA polymerase | -

Classical Nonhomologous End Joining (C-NHEJ)

C-NHEJ repairs DSBs in all phases of the cell cycle by joining the broken ends directly without using homologous DNA. DSBs may create ends with damaged nucleotides, which must be removed before ligation can occur. Since C-NHEJ lacks a mechanism for replacing the damaged nucleotides, the reaction is nonconservative. Table 3 summarizes the proteins involved in C-NHEJ.

Initiation of C-NHEJ occurs when Ku (a heterodimer of Ku70 and Ku80) binds to the DNA ends (Smider et al. 1994; Taccioli et al. 1994). Ku recruits DNA-PKcs (DNA-dependent protein kinase catalytic subunit) to the DNA end and, in doing so, slides to an inward position on the DNA (Yoo and Dynan 1999). DNA-PKcs is an unusual serine-threonine protein kinase that undergoes activation on binding to DNA (Jackson et al. 1990). Activation requires free DNA ends and may involve threading of unpaired single strands into enclosed cavities in the DNA-PKcs structure (Hammarsten et al. 2000; Leuther et al. 1999). It occurs upon formation of a synaptic complex containing two DNA ends and two DNA-PKcs molecules (DeFazio et al. 2002). Subsequent DNA-PKcs autophosphorylation generates conformational changes that dissociate DNA-PKcs from the DNA ends (Chan and Lees-Miller 1996; Meek et al. 2008), which allows other C-NHEJ proteins to bind and process the DNA ends.


Three proteins participate in the ligation reaction: XRCC4 (X-ray cross-complementing protein 4), ligase IV, and XLF (XRCC4-like factor). XRCC4 and ligase IV form a complex with DNA ligase activity (Grawunder et al. 1997). XLF (also known as Cernunnos) interacts with XRCC4 (Ahnesorg et al. 2006; Buck et al. 2006). XRCC4 and XLF are homodimers that interact with each other via their head domains, and they are capable of forming a structure of alternating XRCC4 and XLF molecules (Andres and Junop 2011; Hammel et al. 2011; Ropars et al. 2011; Wu et al. 2011).

Ku, XRCC4/ligase IV, and XLF have a mismatched end (MEnd) ligase activity that ligates any pair of DNA ends, even ends with mismatched overhangs (Tsai et al. 2007). The proteins may form a megadalton protein-DNA complex containing Ku plus a filament of alternating molecules of XRCC4/ligase IV and XLF (Tsai and Chu 2013). The cooperative assembly of the filament may facilitate juxtaposition of the mismatched ends for ligation.

MEnd ligase activity is particularly robust for 3′ overhangs, which is relevant for V(D)J recombination, the pathway that generates immunological diversity (see the brief definition ▶ “V(D)J Recombination”). Recombination occurs by a pair of cleavages adjacent to two recombination signal sequences. The cleavages are catalyzed by the endonuclease RAG1/RAG2 (recombination-activating gene products 1 and 2) (McBlane et al. 1995). Each cleavage generates a blunt end adjacent to the signal sequence and a hairpin end adjacent to the protein-coding sequence. C-NHEJ then joins the blunt ends to each other in a precise reaction and the hairpin ends to each other in an error-prone reaction. In a complex with DNA-PKcs, the endonuclease Artemis opens the hairpin ends, either with a nick at the tip of the hairpin or with an asymmetric nick that creates a palindromic 3′ single-stranded overhang (Ma et al. 2002). Terminal deoxynucleotidyl transferase may also add random nucleotides to the 3′ end. The MEnd ligase complex can ligate a mismatched 3′ overhang to preserve the nucleotide sequences from the palindromic 3′ overhang (P addition) or from the addition of random nucleotides (N addition). Mutations in C-NHEJ genes cause several human diseases (Table 3), including radiation sensitivity with severe combined immunodeficiency (RS-SCID), where SCID is due to the importance of DNA-PKcs, Artemis, and XLF in V(D)J recombination.

In addition to hairpin ends, C-NHEJ must join DNA ends with disrupted termini that require enzymatic processing prior to ligation. Artemis has an intrinsic 5′–3′ exonuclease activity. Upon forming a complex with DNA-PKcs, Artemis acquires an endonuclease activity that cleaves 5′ and 3′ ssDNA overhangs (Ma et al. 2002). Polynucleotide kinase/phosphatase (PNKP) converts 5′-hydroxyl and 3′-phosphate DNA termini (such as those created by ionizing radiation) into ligatable 5′-phosphate and 3′-hydroxyl ends. PNKP requires XRCC4, thus linking its processing activity to ligation by ligase IV (Chappell et al. 2002; Koch et al. 2004; Mani et al. 2010). DNA polymerases mu and lambda have been implicated in C-NHEJ by their interactions with Ku and XRCC4. Both enzymes are recruited to fill gaps when ends have partially complementary overhangs (Nick McElhinny et al. 2005).

C-NHEJ optimizes the preservation of DNA sequence by suppressing processing so that compatible DNA ends are joined precisely. For example, the blunt ends adjacent to V(D)J recombination signal sequences are joined without nucleotide addition or deletion virtually 100% of the time. In human cell extracts that recapitulate C-NHEJ in intact cells, blunt and cohesive ends are joined precisely, even though processing readily targets noncompatible ends to generate both nucleotide addition and deletion (Budman and Chu 2005). These observations suggest that C-NHEJ controls access of the processing enzymes to the DNA ends.

How does C-NHEJ prevent unnecessary processing of the ends? All polymerase activity and most nuclease activity require the presence of XRCC4/ligase IV (Budman et al. 2007). The processing enzymes polymerase mu, polymerase lambda, and PNKP interact with XRCC4, which appears to grant access to the DNA ends. This mechanism ensures that XRCC4/ligase IV binds to the ends first and catalyzes ligation if processing is not required. The small fraction of nuclease activity in the absence of XRCC4/ligase IV could occur via recruitment and activation of Artemis by DNA-PKcs. Thus, C-NHEJ regulates processing via an ordered series of steps: binding by Ku and then DNA-PKcs, synapsis of the ends and DNA-PKcs autophosphorylation, release of the ends for binding by XRCC4/ligase IV and XLF, and finally XRCC4-dependent recruitment of PNKP and the polymerases (Fig. 4).

Additional questions remain unanswered. Although Artemis participates in C-NHEJ, it is not known if other nucleases also contribute to processing of the ends. The MRN complex tethers DNA ends and contributes to end resection in A-NHEJ and HR. It is not known if MRN contributes to end processing during C-NHEJ. Perhaps the complex of MRN and CtIP resects DNA ends before Ku binds to the ends, accounting for some of the deletions observed in C-NHEJ.
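The end-chemistry requirement for ligation described in this section, and the conversion carried out by PNKP, can be illustrated with a minimal Python sketch. The `DnaEnd` class, the string labels for terminal groups, and the function names are hypothetical choices made for this illustration. Describing a single end by both of its termini is a simplification, since ligation actually joins the 5′-phosphate of one end to the 3′-hydroxyl of the other.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DnaEnd:
    """Terminal chemistry of one DNA end (illustrative labels only)."""
    five_prime: str   # "phosphate" or "hydroxyl"
    three_prime: str  # "hydroxyl", "phosphate", or "phosphoglycolate"


def is_ligatable(end):
    # Ligatable termini are a 5'-phosphate and a 3'-hydroxyl.
    return end.five_prime == "phosphate" and end.three_prime == "hydroxyl"


def pnkp_process(end):
    """Apply the two PNKP conversions named in the text.

    5'-hydroxyl -> 5'-phosphate (kinase activity)
    3'-phosphate -> 3'-hydroxyl (phosphatase activity)
    Other groups, such as 3'-phosphoglycolate, are left unchanged here.
    """
    five = "phosphate" if end.five_prime == "hydroxyl" else end.five_prime
    three = "hydroxyl" if end.three_prime == "phosphate" else end.three_prime
    return replace(end, five_prime=five, three_prime=three)


damaged = DnaEnd(five_prime="hydroxyl", three_prime="phosphate")
print(is_ligatable(damaged))                # False
print(is_ligatable(pnkp_process(damaged)))  # True
```

An end carrying a 3′-phosphoglycolate stays non-ligatable in this sketch, reflecting the statement in the Introduction that such damaged nucleotides must be repaired or removed before ligation.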

Double-Strand Break Repair, Fig. 4 Model for C-NHEJ. Ku binds to the DNA ends and slides inward upon recruiting DNA-PKcs to the end. DNA-PKcs brings the ends together in a synaptic complex, which activates the DNA-PKcs kinase. DNA-PKcs undergoes autophosphorylation and releases the ends for binding by XRCC4/ligase IV (XL). (The red dot indicates where ligase IV interacts with XRCC4.) If the ends are ligatable, with blunt ends or complementary overhangs, XL ligates the ends directly. If 3′-phosphate or 5′-hydroxyl groups have damaged the ends, XRCC4 recruits PNKP to repair the damage. XRCC4 and XLF assemble cooperatively into a protein-DNA filament to align the DNA ends, facilitating the ligation of DNA ends even if they have noncomplementary overhangs. DNA polymerases mu and lambda and nuclease activity process the ends

Alternative End Joining (A-NHEJ)

Evidence has accumulated for an alternative nonhomologous end-joining pathway (A-NHEJ) (Deriano and Roth 2013). Deletion of XRCC4 or ligase IV reveals robust alternative end joining during class switch recombination (Yan et al. 2007). Removal of portions of the RAG1 and RAG2 proteins reveals an alternative joining pathway during V(D)J recombination in NHEJ-deficient cells (Corneo et al. 2007). In the absence of XRCC4/ligase IV, chromosomal translocations occur with much greater frequency (Simsek and Jasin 2010).

A-NHEJ appears to be a distinct pathway, rather than a pathway cobbled together by substituting another repair protein for a missing C-NHEJ protein (Deriano and Roth 2013). The junctions created by A-NHEJ are not affected by whether Ku or XRCC4/ligase IV is absent. In the RAG mutants cited above, A-NHEJ is robust even in the presence of intact C-NHEJ. Finally, A-NHEJ may have evolved before C-NHEJ, since E. coli catalyzes end joining by an A-NHEJ pathway but lacks C-NHEJ proteins, such as Ku. Table 4 summarizes the proteins believed to be involved in A-NHEJ.

Double-Strand Break Repair, Table 4 Proteins involved in alternative nonhomologous end joining

Protein | Biochemical activity | Human disease
Mre11-Rad50-Nbs1 | See Table 1 | See Table 1
CtIP | See Table 1 | -
PARP-1 | Poly(ADP-ribose) polymerase | -
XRCC1 | DNA binding, partner of ligase III | -
Ligase III | DNA ligase | -

The junctions created by A-NHEJ contain deletions, usually back to regions of microhomology. These deletions are significantly longer than those seen in C-NHEJ. Thus, A-NHEJ is the least conservative of the three repair pathways, serving as a backup system for HR and C-NHEJ. Apparently, DSBs pose such a threat that cells evolved A-NHEJ despite its error-prone nature. On the other hand, it appears that A-NHEJ is primarily responsible for the chromosomal translocations seen in cancer.

Some elements of the A-NHEJ pathway are now known (Fig. 5). Recruitment of CtIP to the MRN complex activates Mre11 nuclease to resect the DNA ends (Lee-Theilen et al. 2011). Ligase III promotes A-NHEJ during chromosomal translocation (Simsek et al. 2011). XRCC1 forms a tight complex with ligase III (Caldecott et al. 1994), indicating that XRCC1 is also involved in A-NHEJ. Poly(ADP-ribose) polymerase 1 (PARP-1) facilitates A-NHEJ during class switch recombination (Robert et al. 2009). PARP-1, XRCC1, and DNA ligase III act together to join DNA ends. PARP-1 binds to ends, albeit with a lower affinity than Ku (Wang et al. 2006), consistent with a hierarchy that favors C-NHEJ over A-NHEJ.

Features of A-NHEJ remain unclear. What regulates the choice between A-NHEJ and the other pathways, HR and C-NHEJ? Upon binding to DNA, PARP-1 catalyzes transfer of 50–200 molecules of ADP-ribose from NAD+ to itself. Poly(ADP-ribosyl)ation of PARP-1 results in its dissociation from the DNA end. Does PARP-1 self-modification recruit other A-NHEJ proteins?

Double-Strand Break Repair, Fig. 5 Model for A-NHEJ. After limited resection by MRN/CtIP, the two ends are bound and held in a synaptic complex by PARP-1. Nuclease activity removes the protruding 3′ ends back to regions of microhomology (black boxes). PARP-1 recruits XRCC1/ligase III to ligate the DNA ends

Conclusions

To mitigate the danger posed by DSBs, cells have evolved three pathways for DSB repair. HR is a conservative pathway that preserves DNA sequence, operating only in S and G2 phases of the cell cycle when an undamaged sister chromosome is available for recombination. C-NHEJ joins ends precisely if possible, but otherwise optimizes the preservation of DNA sequence. C-NHEJ plays key roles in V(D)J recombination and class switch recombination. A-NHEJ joins broken ends not repaired by HR or C-NHEJ. Although this pathway may prevent death of the cell from an unrepaired chromosome, it generates chromosome translocations that lead to cancer.

Cross-References

▶ DNA Damage, Types of
▶ DNA Recombination, Mechanisms of
▶ DNA Repair
▶ DNA Repair Polymerases
▶ DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of
▶ End Joining, Classical and Alternative
▶ Hereditary Breast and Ovarian Cancer and Poly(ADP-Ribose) Polymerase Inhibition
▶ Homologous Recombination in Lesion Bypass
▶ Mismatch Repair
▶ Nucleotide Excision Repair
▶ Recombineering
▶ Single-Strand Annealing
▶ V(D)J Recombination

Acknowledgments The author thanks Chun Tsai, Jian Fung, and Alex Chu for their helpful comments. This work was supported by NIH grant 1R01GM086579.

References

Ahnesorg P, Smith P, Jackson SP (2006) XLF interacts with the XRCC4-DNA ligase IV complex to promote DNA nonhomologous end-joining. Cell 124:301–313
Andres SN, Junop MS (2011) Crystallization and preliminary X-ray diffraction analysis of the human XRCC4-XLF complex. Acta Crystallogr Sect F Struct Biol Cryst Commun 67:1399–1402
Buck D, Malivert L, de Chasseval R, Barraud A, Fondaneche MC, Sanal O, Plebani A, Stephan JL, Hufnagel M, le Deist F et al (2006) Cernunnos, a novel nonhomologous end-joining factor, is mutated in human immunodeficiency with microcephaly. Cell 124:287–299
Budman J, Chu G (2005) Processing of DNA for nonhomologous end-joining by cell-free extract. EMBO J 24:849–860
Budman J, Kim SA, Chu G (2007) Processing of DNA for nonhomologous end-joining is controlled by kinase activity and XRCC4/ligase IV. J Biol Chem 282:11950–11959
Buis J, Wu Y, Deng Y, Leddon J, Westfield G, Eckersdorff M, Sekiguchi JM, Chang S, Ferguson DO (2008) Mre11 nuclease activity has essential roles in DNA repair and genomic stability distinct from ATM activation. Cell 135:85–96
Buisson R, Dion-Cote AM, Coulombe Y, Launay H, Cai H, Stasiak AZ, Stasiak A, Xia B, Masson JY (2010) Cooperation of breast cancer proteins PALB2 and piccolo BRCA2 in stimulating homologous recombination. Nat Struct Mol Biol 17:1247–1254
Caldecott KW, McKeown CK, Tucker JD, Ljungquist S, Thompson LH (1994) An interaction between the mammalian DNA repair protein XRCC1 and DNA ligase III. Mol Cell Biol 14:68–76
Carney JP, Maser RS, Olivares H, Davis EM, Le Beau M, Yates JR 3rd, Hays L, Morgan WF, Petrini JH (1998) The hMre11/hRad50 protein complex and Nijmegen breakage syndrome: linkage of double-strand break repair to the cellular DNA damage response. Cell 93:477–486
Chan DW, Lees-Miller SP (1996) The DNA-dependent protein kinase is inactivated by autophosphorylation of the catalytic subunit. J Biol Chem 271:8936–8941
Chappell C, Hanakahi LA, Karimi-Busheri F, Weinfeld M, West SC (2002) Involvement of human polynucleotide kinase in double-strand break repair by non-homologous end joining. EMBO J 21:2827–2832
Chen L, Nievera CJ, Lee AY, Wu X (2008) Cell cycle-dependent complex formation of BRCA1.CtIP.MRN is important for DNA double-strand break repair. J Biol Chem 283:7713–7720
Corneo B, Wendland RL, Deriano L, Cui X, Klein IA, Wong SY, Arnal S, Holub AJ, Weller GR, Pancake BA et al (2007) Rag mutations reveal robust alternative end joining. Nature 449:483–486
Crackower MA, Scherer SW, Rommens JM, Hui CC, Poorkaj P, Soder S, Cobben JM, Hudgins L, Evans JP, Tsui LC (1996) Characterization of the split hand/split foot malformation locus SHFM1 at 7q21.3-q22.1 and analysis of a candidate gene for its expression during limb development. Hum Mol Genet 5:571–579
DeFazio L, Stansel R, Griffith J, Chu G (2002) Synapsis of DNA ends by the DNA-dependent protein kinase. EMBO J 21:3192–3200
Deriano L, Roth DB (2013) Modernizing the nonhomologous end-joining repertoire: alternative and classical NHEJ share the stage. Annu Rev Genet 47:433–455
Dray E, Etchin J, Wiese C, Saro D, Williams GJ, Hammel M, Yu X, Galkin VE, Liu D, Tsai MS et al (2010) Enhancement of RAD51 recombinase activity by the tumor suppressor PALB2. Nat Struct Mol Biol 17:1255–1259
Dudley DD, Chaudhuri J, Bassing CH, Alt FW (2005) Mechanism and control of V(D)J recombination versus class switch recombination: similarities and differences. Adv Immunol 86:43–112
Ellis NA, Groden J, Ye TZ, Straughen J, Lennon DJ, Ciocci S, Proytcheva M, German J (1995) The Bloom’s syndrome gene product is homologous to RecQ helicases. Cell 83:655–666
Grawunder U, Wilm M, Xiantuo W, Kulezla P, Wilson TE, Mann M, Lieber MR (1997) Activity of DNA ligase IV stimulated by complex formation with XRCC4 protein in mammalian cells. Nature 388:492–494
Hall JM, Lee MK, Newman B, Morrow JE, Anderson LA, Huey B, King MC (1990) Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250:1684–1689
Hammarsten O, DeFazio L, Chu G (2000) Activation of DNA-dependent protein kinase by single-stranded DNA ends. J Biol Chem 275:1541–1550
Hammel M, Rey M, Yu Y, Mani RS, Classen S, Liu M, Pique ME, Fang S, Mahaney BL, Weinfeld M et al (2011) XRCC4 protein interactions with XRCC4-like factor (XLF) create an extended grooved scaffold for DNA ligation and double strand break repair. J Biol Chem 286:32638–32650
Jackson SP, MacDonald JJ, Lees-Miller S, Tjian R (1990) GC box binding induces phosphorylation of Sp1 by a DNA-dependent protein kinase. Cell 63:155–165
Jones S, Hruban RH, Kamiyama M, Borges M, Zhang X, Parsons DW, Lin JC, Palmisano E, Brune K, Jaffee EM et al (2009) Exomic sequencing identifies PALB2 as a pancreatic cancer susceptibility gene. Science 324:217
Koch CA, Agyei R, Galicia S, Metalnikov P, O’Donnell P, Starostine A, Weinfeld M, Durocher D (2004) Xrcc4 physically links DNA end processing by polynucleotide kinase to DNA ligation by DNA ligase IV. EMBO J 23:3874–3885
Lee-Theilen M, Matthews AJ, Kelly D, Zheng S, Chaudhuri J (2011) CtIP promotes microhomology-mediated alternative end joining during class-switch recombination.
Nat Struct Mol Biol 18:75–79 Leuther KK, Hammarsten O, Kornberg RD, Chu G (1999) Structure of DNA-dependent protein kinase: implications for its regulation by DNA. EMBO J 18:1114–1123 Ma Y, Pannicke U, Schwarz K, Lieber MR (2002) Hairpin opening and overhang processing by an Artemis/DNAdependent protein kinase complex in nonhomologous end joining and V(D)J recombination. Cell 108:781–794 Mani RS, Yu Y, Fang S, Lu M, Fanta M, Zolner AE, Tahbaz N, Ramsden DA, Litchfield DW, Lees-Miller SP et al (2010) Dual modes of interaction between XRCC4 and polynucleotide kinase/phosphatase: implications for nonhomologous end joining. J Biol Chem 285:37619–37629


McBlane F, van Gent D, Ramsden D, Romeo C, Cuomo C, Gellert M, Oettinger M (1995) Cleavage at a V(D)J recombination signal requires only RAG1 and RAG2 proteins and occurs in two steps. Cell 83:387–395
Meek K, Dang V, Lees-Miller SP (2008) DNA-PK: the means to justify the ends? Adv Immunol 99:33–58
Moshous D, Callebaut I, de Chasseval R, Corneo B, Cavazzana-Calvo M, Le Deist F, Tezcan I, Sanal O, Bertrand Y, Philippe N et al (2001) Artemis, a novel DNA double-strand break repair/V(D)J recombination protein, is mutated in human severe combined immune deficiency. Cell 105:177–186
Nick McElhinny SA, Havener JM, Garcia-Diaz M, Juarez R, Bebenek K, Kee BL, Blanco L, Kunkel TA, Ramsden DA (2005) A gradient of template dependence defines distinct biological roles for family X polymerases in nonhomologous end joining. Mol Cell 19:357–366
Nimonkar AV, Genschel J, Kinoshita E, Polaczek P, Campbell JL, Wyman C, Modrich P, Kowalczykowski SC (2011) BLM-DNA2-RPA-MRN and EXO1-BLM-RPA-MRN constitute two DNA end resection machineries for human DNA break repair. Genes Dev 25:350–362
O’Driscoll M, Cerosaletti KM, Girard PM, Dai Y, Stumm M, Kysela B, Hirsch B, Gennery A, Palmer SE, Seidel J et al (2001) DNA ligase IV mutations identified in patients exhibiting developmental delay and immunodeficiency. Mol Cell 8:1175–1185
Panier S, Boulton SJ (2014) Double-strand break repair: 53BP1 comes into focus. Nat Rev Mol Cell Biol 15:7–18
Rahman N, Seal S, Thompson D, Kelly P, Renwick A, Elliott A, Reid S, Spanova K, Barfoot R, Chagtai T et al (2007) PALB2, which encodes a BRCA2-interacting protein, is a breast cancer susceptibility gene. Nat Genet 39:165–167
Robert I, Dantzer F, Reina-San-Martin B (2009) Parp1 facilitates alternative NHEJ, whereas Parp2 suppresses IgH/c-myc translocations during immunoglobulin class switch recombination. J Exp Med 206:1047–1056
Ropars V, Drevet P, Legrand P, Baconnais S, Amram J, Faure G, Marquez JA, Pietrement O, Guerois R, Callebaut I et al (2011) Structural characterization of filaments formed by human Xrcc4-Cernunnos/XLF complex involved in nonhomologous DNA end-joining. Proc Natl Acad Sci U S A 108:12663–12668
San Filippo J, Sung P, Klein H (2008) Mechanism of eukaryotic homologous recombination. Annu Rev Biochem 77:229–257
Sartori AA, Lukas C, Coates J, Mistrik M, Fu S, Bartek J, Baer R, Lukas J, Jackson SP (2007) Human CtIP promotes DNA end resection. Nature 450:509–514
Savitsky K, Sfez S, Tagle D, Ziv Y, Sartiel A, Collins F, Shiloh Y, Rotman G (1995) The complete sequence of the coding region of the ATM gene reveals similarity to cell cycle regulators in different species. Hum Mol Genet 4:2025–2032
Shen J, Gilmore EC, Marshall CA, Haddadin M, Reynolds JJ, Eyaid W, Bodell A, Barry B, Gleason D, Allen K et al (2010) Mutations in PNKP cause microcephaly, seizures and defects in DNA repair. Nat Genet 42:245–249
Simsek D, Jasin M (2010) Alternative end-joining is suppressed by the canonical NHEJ component Xrcc4-ligase IV during chromosomal translocation formation. Nat Struct Mol Biol 17:410–416
Simsek D, Brunet E, Wong SY, Katyal S, Gao Y, McKinnon PJ, Lou J, Zhang L, Li J, Rebar EJ et al (2011) DNA ligase III promotes alternative nonhomologous end-joining during chromosomal translocation formation. PLoS Genet 7:e1002080
Smider V, Rathmell WK, Lieber M, Chu G (1994) Restoration of X-ray resistance and V(D)J recombination in mutant cells by Ku cDNA. Science 266:288–291
Stewart GS, Maser RS, Stankovic T, Bressan DA, Kaplan MI, Jaspers NG, Raams A, Byrd PJ, Petrini JH, Taylor AM (1999) The DNA double-strand break repair gene hMRE11 is mutated in individuals with an ataxia-telangiectasia-like disorder. Cell 99:577–587
Stracker TH, Petrini JH (2011) The MRE11 complex: starting from the ends. Nat Rev Mol Cell Biol 12:90–103
Sy SM, Huen MS, Chen J (2009) PALB2 is an integral component of the BRCA complex required for homologous recombination repair. Proc Natl Acad Sci U S A 106:7155–7160
Taccioli GE, Gottlieb TM, Blunt T, Priestly A, Demengeot J, Mizuta R, Lehmann AR, Alt FW, Jackson SP, Jeggo PA (1994) Ku80: product of the XRCC5 gene and its role in DNA repair and V(D)J recombination. Science 265:1442–1445
Tsai CJ, Chu G (2013) Cooperative assembly of a megadalton protein-DNA complex for nonhomologous end joining. J Biol Chem 288:18110–18120
Tsai CJ, Kim SA, Chu G (2007) Cernunnos/XLF promotes the ligation of mismatched and noncohesive DNA ends. Proc Natl Acad Sci U S A 104:7851–7856
van der Burg M, Ijspeert H, Verkaik NS, Turul T, Wiegant WW, Morotomi-Yano K, Mari PO, Tezcan I, Chen DJ, Zdzienicka MZ et al (2009) A DNA-PKcs mutation in a radiosensitive T-B- SCID patient inhibits Artemis activation and nonhomologous end-joining. J Clin Invest 119:91–98
Varon R, Vissinga C, Platzer M, Cerosaletti KM, Chrzanowska KH, Saar K, Beckmann G, Seemanova E, Cooper PR, Nowak NJ et al (1998) Nibrin, a novel DNA double-strand break repair protein, is mutated in Nijmegen breakage syndrome. Cell 93:467–476
Waltes R, Kalb R, Gatei M, Kijas AW, Stumm M, Sobeck A, Wieland B, Varon R, Lerenthal Y, Lavin MF et al (2009) Human RAD50 deficiency in a Nijmegen breakage syndrome-like disorder. Am J Hum Genet 84:605–616
Wang M, Wu W, Rosidi B, Zhang L, Wang H, Iliakis G (2006) PARP-1 and Ku compete for repair of DNA double strand breaks by distinct NHEJ pathways. Nucleic Acids Res 34:6170–6182

Wooster R, Neuhausen SL, Mangion J, Quirk Y, Ford D, Collins N, Nguyen K, Seal S, Tran T, Averill D et al (1994) Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science 265:2088–2090
Wu L, Hickson ID (2003) The Bloom’s syndrome helicase suppresses crossing over during homologous recombination. Nature 426:870–874
Wu Y, Berends MJ, Post JG, Mensink RG, Verlind E, Van Der Sluis T, Kempinga C, Sijmons RH, van der Zee AG, Hollema H et al (2001) Germline mutations of EXO1 gene in patients with hereditary nonpolyposis colorectal cancer (HNPCC) and atypical HNPCC forms. Gastroenterology 120:1580–1587
Wu Q, Ochi T, Matak-Vinkovic D, Robinson CV, Chirgadze DY, Blundell TL (2011) Non-homologous end-joining partners in a helical dance: structural studies of XLF-XRCC4 interactions. Biochem Soc Trans 39:1387–1392, suppl 1382 p following 1392
Yan CT, Boboila C, Souza EK, Franco S, Hickernell TR, Murphy M, Gumaste S, Geyer M, Zarrin AA, Manis JP et al (2007) IgH class switching and translocations use a robust non-classical end-joining pathway. Nature 449:478–482
Yang H, Jeffrey PD, Miller J, Kinnucan E, Sun Y, Thoma NH, Zheng N, Chen PL, Lee WH, Pavletich NP (2002) BRCA2 function in DNA binding and recombination from a BRCA2-DSS1-ssDNA structure. Science 297:1837–1848
Yoo S, Dynan WS (1999) Geometry of a complex formed by double strand break repair proteins at a single DNA end: recruitment of DNA-PKcs induces inward translocation of Ku protein. Nucleic Acids Res 27:4679–4686
You Z, Bailis JM (2010) DNA damage and decisions: CtIP coordinates DNA repair and cell cycle checkpoints. Trends Cell Biol 20:402–409
Zhang F, Ma J, Wu J, Ye L, Cai H, Xia B, Yu X (2009) PALB2 links BRCA1 and BRCA2 in the DNA-damage response. Curr Biol 19:524–529

DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of

Federica Marini and Achille Pellicioli
Dipartimento di Bioscienze, Università di Milano, Milan, Italy

Synopsis

Of the many types of DNA lesions, DNA double-strand breaks (DSBs) are considered the most harmful, because one unrepaired DSB is sufficient to trigger permanent growth arrest and cell death. In addition, DSBs are potent inducers of gross chromosomal rearrangements such as deletions, translocations, and amplifications. DSB signaling and repair through different pathways are crucial to preserve genomic integrity and maintain cellular homeostasis. It is therefore not surprising that the cell finely regulates DSB repair pathways in the different cell cycle phases and following activation of the DNA damage checkpoint. In this short entry we illustrate some known aspects of the regulation of DSB repair in the mitotic cell cycle. In particular, we focus on the balance between the two main DSB repair pathways, nonhomologous end joining (NHEJ) and homologous (directed) recombination (H(D)R), as well as on the regulation of the resolution of joint molecules that arise during H(D)R.

Introduction

DNA double-strand breaks (DSBs) are efficiently repaired by two alternative processes: (i) joining of the DNA ends (NHEJ) and (ii) recombination with a homologous DNA sequence (H(D)R). NHEJ and H(D)R are achieved through several sub-pathways, which are schematically illustrated in Fig. 1. Experimental evidence indicates that channeling of a DSB into one of the various repair pathways is a finely regulated process, most likely aimed at limiting the risk of unwanted genome deletions and rearrangements. Indeed, the existence of accurate regulatory networks to monitor DSB repair events is not surprising, as DSBs are among the most dangerous types of chromosomal lesions and, if not accurately repaired, may lead to chromosome loss, genetic instability, and eventually malignant transformation. In fact, defects in DSB repair and mutations in genes involved in these processes lead to cancer predisposition or other syndromes associated with genome instability. Moreover, accumulation of persistent DSBs is linked to aging and cellular senescence (Aparicio et al. 2014). The central focus of DSB repair regulation is on the end processing mechanism, which affects the balance between NHEJ and H(D)R (Heyer et al. 2010; Panier and Boulton 2014). Then, when a DSB is channeled into H(D)R, the regulation of DSB repair is achieved at different steps of the end processing: (i) nucleolytic processing of DSB ends, (ii) formation of a nucleoprotein filament during the presynaptic step, (iii) formation of the D-loop intermediate during the synaptic step, and (iv) formation and elimination of joint DNA molecules during the postsynaptic step. This short entry will focus on some aspects of the regulation of DSB repair in the mitotic cell cycle.

DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Fig. 1 Pathways of double-strand break repair. Broken lines indicate new DNA synthesis. Pathways and factors labeled in the diagram: NHEJ (short processing/ligation): KU70/80; DNA-PK; ARTEMIS; XRCC4; Pol4; Ligase IV; Lif1, Nej1. 5′-to-3′ resection: Mre11-Rad50-Xrs2 [Nbs1]; Sae2 [CtIP]; Exo1; Sgs1 [BLM]; Dna2; RPA. SSA (5′-to-3′ resection/annealing): Rad52; Saw1-Rad1-Rad10 [ERCC1-XPF]; Msh2-3; Slx4. Strand invasion and D-loop formation: Rad51 [RAD51B, RAD51C, RAD51D, XRCC2, XRCC3]; Rad55-57; [BRCA2-DSS1]; Shu1-Shu2-Psy3-Csm2; Rad54; Rdh54 [RAD54B]. SDSA (noncrossover): Sgs1-Top3-Rmi1 [BLM]; Mph1, Fml1 [FANCM]; Srs2. dHJ subpathways (dissolution, noncrossover; resolution, noncrossover or crossover): Sgs1-Top3-Rmi1 [BLM]; Mph1, Fml1 [FANCM]; Slx1-4; Yen1 [GEN1]; Mus81-MMS4 [EME1]. BIR (half-crossover). Abbreviations: DSB DNA double-strand break, NHEJ nonhomologous end joining, SSA single-strand annealing, SDSA synthesis-dependent strand annealing, BIR break-induced replication, dHJ double Holliday junction

DSBs and the Checkpoint Response

Formation of DSB lesions frequently leads to the activation of a surveillance mechanism, called the DNA damage checkpoint response (DDR), to block cell cycle progression and facilitate DNA repair (Lazzaro et al. 2009). Budding yeast is an ideal model system in which to induce the formation of a single DSB lesion and study DDR and DNA repair. Indeed, many results and models were first obtained in yeast and further explored in other eukaryotes. Protein kinases play fundamental roles in the checkpoint signaling pathway. In particular, the DSB ends are sensed by the ATM and DNA-PKcs kinases, whereas resected DSBs are sensed by the ATR kinase, which is recruited to a 3′ ssDNA filament covered by the single-strand DNA-binding protein RPA (Lazzaro et al. 2009). Therefore, if a DSB is extensively resected and the 3′ ssDNA tail is not immediately engaged in strand invasion and repair mechanisms, the DDR pathway is activated and cell cycle progression is blocked. In budding yeast, even a single irreparable DSB is enough to activate a robust ATR checkpoint and arrest the cell cycle at the G2/M transition (Pellicioli et al. 2001). Interestingly, if a DSB is channeled into the single-strand annealing (SSA) repair pathway, which requires resection of the DSB but not strand invasion (see Fig. 1), the intensity of the DDR and the cell cycle block are directly linked to the distance between the cut site and the homologous sequence involved in the repair. In contrast, if a DSB lesion is minimally resected, such as in the G1 phase of the cell cycle, and is repaired faster through NHEJ or H(D)R, ATR signaling is not activated and the cell cycle is not delayed (Pellicioli et al. 2001).

DSB Repair and Cell Cycle Signaling

The changing balance between NHEJ and H(D)R events throughout the cell cycle clearly indicates that DSB repair is under the control of a cell cycle regulatory network. During replication and post-replication phases (S and G2/M), a homologous copy of a cut sequence is available on the sister chromatid and may serve as a perfect template (called the donor sequence) to repair the lesion through H(D)R, also reducing unwanted nonallelic recombination and loss of heterozygosity. When a homologous sequence is not available and a sister chromatid is not present (G1 phase), DSBs can be repaired through NHEJ (Heyer et al. 2010). Further supporting this notion, a number of DSB repair factors undergo specific post-translational modifications (see Table 1), which are also linked with cell cycle progression. Interestingly, some of these regulatory events are mediated by DDR, indicating that DSB repair is influenced by the checkpoint response throughout the cell cycle. The functional roles of these modifications are largely unknown; however, specific domains for the interaction with phosphorylated, ubiquitylated, and sumoylated peptides are present in several factors involved in DDR and DNA repair. This strongly suggests that some post-translational modifications may affect protein interactions and the formation of novel protein complexes in the presence of DSBs. Indeed, in several cases it has been shown that post-translational modifications affect the localization of proteins to DSB sites, where large and heterogeneous protein complexes are known to form. Furthermore, specific post-translational modifications may also influence the stability and/or enzymatic activity of proteins, raising the possibility that these represent direct mechanisms to switch on/off important steps of DSB repair.

DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Table 1 Post-translational modifications and their effects on proteins involved in DSB repair

Protein Ku70-Ku80

Function Binds DNA ends, NHEJ, involved in checkpoint adaptation

Organism X. laevis

Nej1/ XLF

NHEJ

S. cerevisiae

Mre11

Endo-exonuclease, processes DNA ends

Mammals

Mammals S. cerevisiae Mammals

Xrs2/ Nbs1

Sae2/ CtIP

Mre11-associated subunit

Mre11-associated factor, processes DNA ends, involved in checkpoint adaptation

X. laevis S. cerevisiae

Post-translational modifications –poly-UBI (mediates Ku80 removal from DNA) –poly-UBI (affects Ku70-80 levels) –PO4 (DNA-PKcs phosphorylates Ku70 at S6 and Ku80 at S577, S580, and T715; CDK2 phosphorylates Ku70) –SUMO (sumoylation at Ku70K556 enhances protein stability) –PO4 (Dun1-mediated PO4 at S297 enhances binding to Srs2, favoring NHEJ/SSA) –PO4 –PO4 (mediated by Tel1) –PO4 (mediated by ATM and ATR) –PO4 (mediated by ATR) –PO4 (mediated by Tel1 and CDK1)

Mammals

–PO4 (PO4 at S278 and S343 required for checkpoint activation)

S. cerevisiae

–PO4 (Mec1/Tel1-mediated PO4 at S/T-Q motifs important for co-localization with Mre11, DSB resection in meiosis and DSB repair; CDK1-mediated PO4 at S267 required for DSB resection; Cdc5-mediated hyper-PO4 enhances loading to DSB)
–Ac (protein degradation by autophagy)
–PO4 (CKII-mediated PO4 at SXT repeats mediates interaction with Nbs1 and recruitment to DSB)
–PO4 (CDK1-mediated PO4 at S847 required for DSB resection)
–poly-UBI (BRCA1-dependent poly-ubiquitylation enhances chromatin binding)
–Ac (SIRT6-dependent deacetylation at K432, K526, and K604 important for DSB resection)

S. pombe

Mammals

References (in short) Postow et al. (2008) Chan et al. (1999), Gama et al. (2006), Muller-Tidow et al. (2004), Yurchenko et al. (2008)

Ahnesorg and Jackson (2007) Yu et al. (2008) D’Amours and Jackson (2001)

D’Amours and Jackson (2001), Ubersax et al. (2003) Gatei et al. (2000b), Lim et al. (2000), Wu et al. (2000), Zhao et al. (2000) Baroni et al. (2004), Cartagena-Lirola et al. (2006), Donnianni et al. (2010), Robert et al. (2011), Sartori et al. (2007), Terasawa et al. (2008)

Lloyd et al. (2009), Williams et al. (2009)

Huertas and Jackson (2009), Kaidi et al. (2010), Yu et al. (2006)


DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Table 1 (continued) Protein RPA

Function ssDNA-binding factor; multiple roles in replication, recombination, repair, and checkpoint

Organism S. cerevisiae

Mammals

Artemis

NHEJ

Mammals

XRCC4

NHEJ

Mammals

DNA ligase IV Exo1

Dna2

Rad51

Rad52

NHEJ

Mammals

5′–3′ exonuclease, processes DNA ends during recombination and repair

S. cerevisiae

ATP-dependent nuclease and helicase with multiple roles in replication and recombination; processes DNA ends in recombination and repair

ssDNA-binding protein; mediates homology search and strand invasion

S. cerevisiae

Rad51 filament formation; SSA

S. cerevisiae

Mammals

Mammals

–Ac (p300-mediated acetylation promotes Dna2 activities and DNA binding)

Mammals

–PO4 (CHK1-mediated PO4 at T309 mediates localization to DSB; cAbl-mediated PO4 at Y315 affects Rad51 filament)
–SUMO (controls protein localization)
–SUMO (Ubc9-mediated sumoylation at K10, K11, K220 affects recombination, protein stability and localization; inhibits DNA-binding and annealing activities)
–PO4 (important for recruitment to repair centers)
–PO4 (cAbl-mediated PO4 at Y104 important for recombination)
–PO4 (Mek1-dependent PO4 at T132 reduces Rad51 binding in meiosis)

Mammals

Rad54

ATP-dependent DNA translocase, mediates strand exchange and removes Rad51 from dsDNA

Post-translational modifications –PO4 (mediated by CDK1 and Mec1) –SUMO (affected by Slx5/8, disfavors SSA and Rad51independent BIR) –SUMO (sumoylation at RPA70 K449 and K577 enhances its binding to Rad51, facilitating recruitment on to DSBs) –PO4 (mediated by ATM and DNA-PK) –PO4 (DNA-PKcs-mediated PO4 at S260, S318) –SUMO (sumoylation at K210 controls XRCC4 nuclear localization) –PO4 (DNA-PKcs-mediated PO4 at T650) –PO4 (Rad53-mediated PO4 at S372, 567, 692 inhibits Exo1 activity) –PO4 (ATR-mediated PO4 leads to degradation) –PO4 (mediated by CDK1)

S. cerevisiae

References (in short) Anantha et al. (2007), Kim and Brill (2003), Riballo et al. (2004)

Bruderer et al. (2011), Dou et al. (2010)

Ma et al. (2005), Riballo et al. (2004) Lee et al. (2004) Yurchenko et al. (2006)

Wang et al. (2004) Morin et al. (2008)

El-Shemerly et al. (2008) Chen et al. (2011), Ubersax et al. (2003) Balakrishnan et al. (2010)

Conilleau et al. (2004), Ouyang et al. (2009), Sorensen et al. (2005)

Altmannova et al. (2010), Antunez de Mayolo et al. (2006), Sacher et al. (2006)

Kitao and Yuan (2002)

Niu et al. (2009)


DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Table 1 (continued) Protein Rdh54

Rad55

Srs2

Mus81

Function ATP-dependent DNA translocase, mediates strand exchange and removes Rad51 from dsDNA, involved in checkpoint adaptation Rad51 filament assembly/stability

Organism S. pombe

Post-translational modifications –PO4 (Mek1-dependent PO4 at T6, T673 inhibits Rdh54 in meiosis, Mec1- and Rad53- PO4 dependent after DNA damage in mitotic cells)

References (in short) Ferrari et al. (2013), Tougan et al. (2010)

S. cerevisiae

Herzberg et al. (2006)

ATP-dependent DNA translocase/helicase, removes Rad51 from DNA, involved in checkpoint adaptation Together with MMS4 cleaves branched DNA, resolution of HJs

S. cerevisiae

–PO4 (PO4 at S2, S8, and S14 required for Rad51 filament assembly/stability)
–PO4 SUMO (CDK1-mediated PO4 prevents sumoylation, promoting SDSA)
–PO4 (Cds1-mediated PO4; Mek1-dependent PO4 at T275 inhibits Mus81 in meiosis)
–PO4 (Cdk1- and Cdc5-mediated PO4 activates Mus81/Mms4 nuclease activity in G2)
–PO4 (Cdk1- and ATR-mediated PO4 activates Mus81/Eme1 nuclease activity)
–PO4 (Cdk1-mediated PO4 inhibits Yen1 catalytic activity and promotes its nuclear exclusion. De-PO4 by Cdc14 in mitosis)
–PO4 (Mec1-mediated PO4 at T113 required for cleavage of nonhomologous DNA tails; Cdk1-mediated PO4 at S486 promotes Slx4-Dpb11 interaction and checkpoint dampening in MMS)
–PO4 (ATM- and ATR-mediated PO4 at S1387, S1423, and S1524 important for checkpoint response; CHK2-mediated PO4 at S988 promotes HDR)
–PO4 (CDK1-mediated PO4 at S3291 blocks interaction with Rad51; Chk1- and Chk2-mediated phosphorylations regulate interaction with RAD51)

Kai et al. (2005), Tougan et al. (2010)

S. pombe

S. cerevisiae

Mms4/ Eme1

S. pombe

Yen1

Cleaves branched DNA, resolution of HJs

S. cerevisiae

Slx4

Together with Slx1 cleaves branched DNA, cleaves 3′ nonhomologous DNA tails; resolution of HJs; involved in checkpoint downregulation

E3 ligase involved in HDR and NHEJ

S. cerevisiae

BRCA2

Rad51 filament formation

Mammals

RecQ helicases

DNA helicase; multiple roles in replication, recombination, repair, and checkpoint

BRCA1

Mammals

Saponaro et al. (2010)

Gallo-Fernandez et al. (2012) Dehe et al. (2013)

Blanco et al. (2014), Eissler et al. (2014)

Ohouo et al. (2010), Ohouo et al. (2013), Toh et al. (2010)

Cortez et al. (1999), Gatei et al. (2000a), Lee et al. (2000), Tibbetts et al. (2000) Bahassi et al. (2008), Esashi et al. (2005)


DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Table 1 (continued) Protein Sgs1 BLM

WRN

Function

(Helicase and nuclease)

Organism S. cerevisiae Mammals

Mammals

Post-translational modifications
–SUMO (Ubc9 mediated)
–SUMO (sumoylation at K317, K331, K334, and K347 controls protein localization)
–Ac
–PO4 (ATM-mediated phosphorylations at T99, T122 important for recruitment to damage sites and stalled fork recovery; cdc2-mediated phosphorylations at S714, T766)
–SUMO (controls protein localization)
–Ac (CBP- and p300-mediated acetylation enhances protein stability and regulates enzymatic activities)
–PO4 (ATR-mediated phosphorylations at S991, T1152, S1256; ATM-mediated phosphorylations at S1141, S1058, S1292)

References (in short) Branzei et al. (2006) Bayart et al. (2006), Choudhary et al. (2009), Davies et al. (2004), Eladad et al. (2005)


Ammazzalorso et al. (2010), Blander et al. (2002), Kawabe et al. (2000), Li et al. (2010), Muftuoglu et al. (2008), Woods et al. (2004)

Abbreviations: PO4 phosphorylation, UBI ubiquitylation, SUMO sumoylation, Ac acetylation

DSB End Protection and Joining

NHEJ is a faster, although often less accurate, process than H(D)R. Indeed, the KU70-KU80 and Mre11-Rad50-Nbs1 complexes quickly bind to the DSB and may have a physical role in stabilizing the DNA ends (Chapman et al. 2012). The KU complex mediates recruitment and activation of DNA-PKcs, which in turn phosphorylates and likely regulates many factors (including RPA, WRN, Artemis, XRCC4, and Ligase IV; see Table 1) during the NHEJ process in higher eukaryotes. Phosphorylation and ubiquitylation of the KU complex seem to mediate its turnover at the DSB lesion, which appears to be a key point for understanding how the balance between NHEJ and H(D)R is achieved. Furthermore, the cyclin A1-CDK2 kinase phosphorylates the KU70 subunit, supporting the notion that the NHEJ pathway is regulated during the cell cycle (Waters et al. 2014).

DSB Resection

Although short processing of DSBs may be required to achieve NHEJ, extensive DNA end resection is the main event needed to channel a DSB into H(D)R while preventing NHEJ. Initiation of H(D)R requires the coordinated actions of the Mre11 complex and the Exo1 and Dna2 nucleases to resect the 5′ strand (Fig. 2). Once resection is initiated, NHEJ is prevented. The “point of no return” seems to be the endonucleolytic nick made close to the DSB end by Mre11, activated by CtIP (Sae2 in budding yeast), which triggers 3′–5′ nucleolytic processing toward the DNA end, removal of the Mre11 and KU factors, and recruitment of Exo1 and Dna2-Sgs1. Several findings suggest that DSB resection, if not finely regulated, is a double-edged sword: on one hand, it is needed for faithful H(D)R, but on the other, it may lead to extensive DNA deletions associated with genome rearrangements, mainly through highly error-prone alternative end joining, micro-homology end joining, and SSA events.

Experimental evidence indicates that DSB resection is slower in G1 than in G2 phase, where it is stimulated by CDK1-Clb activity (Ira et al. 2004). CDK1 might influence DSB resection and repair through the phosphorylation of multiple factors. One crucial CDK1 substrate is CtIP/Sae2, whose phosphorylation at one conserved serine stimulates resection. In human cells, CDK1 phosphorylates CtIP at one additional site, promoting an interaction with BRCA1 and the recruitment of CtIP to repair centers. However, the CtIP/Sae2 protein is also a target of other kinases (ATR/Mec1, ATM/Tel1, and the Polo-like kinase Cdc5) and of acetyltransferases, suggesting that its functional role in DSB resection and repair might be finely regulated in several ways (see Table 1). CtIP/Sae2 participates together with the Mre11 complex in the first step of DSB resection. At this step, the Mre11 complex and CtIP/Sae2 process a few hundred nucleotides of the 5′ strand of a DSB end, likely removing critical anomalous structures (such as hairpin-like structures) and/or proteins that might have been previously loaded on to the DNA lesion (such as the KU complex), as explained above (Chapman et al. 2012). DSB resection is then carried out, redundantly, by the more processive Exo1 exonuclease and Dna2/Sgs1 exonuclease/helicase, generating long 3′ ssDNA tails (Mimitou and Symington 2009) (see also Fig. 2). Interestingly, as indicated in Table 1, Exo1, Dna2, and Sgs1 are post-translationally modified; however, the functional role of these modifications in DSB resection and repair still requires further study. It is worth noting that factors involved in DSB end resection (such as CtIP/Sae2, Mre11, Exo1) are also phosphorylated by checkpoint kinases (see Table 1), suggesting that DDR may directly regulate the process that is responsible for its own activation (e.g., the accumulation of ssDNA by DSB end resection), also affecting a crucial step of DSB repair.

DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of, Fig. 2 Schematic representation of the DSB resection process. Positive (+) and negative (−) regulations are indicated. Elements labeled in the diagram: DSB; Ku; MRX; Sae2; CDK1; Exo1; Dna2-Sgs1; Rad9; Ku and MRX-Sae2 release; RPA; Rad51; Rad52. Outcomes in G0/G1: NHEJ, Alt-NHEJ, MMEJ. Outcomes in intra-S/G2: SSA, SDSA, BIR, dHJ subpathways (DDR). Sources cited in the diagram: Ira et al. Nature 2004, 431:1011–17; Lazzaro et al. EMBO J 2008, 27:1502–12


Recent evidence indicates that the RPA complex is also important in stimulating processive DSB resection (Chen et al. 2013), probably through its role in binding and stabilizing the nascent ssDNA strand. During the cell cycle, the heterotrimeric RPA complex is a target of multiple phosphorylation events, mediated by different kinases (including ATR, ATM, and CDK1), supporting the idea that the role of RPA in DSB resection and repair must be finely regulated. Both in yeast and mammalian cells, DSB resection seems to be physically antagonized through the checkpoint mediator protein 53BP1/ Rad9, whose recruitment onto chromatin around the DNA lesion requires the direct recognition of a DSB-specific histone code. There, through binding to other factors such as RIF1 and PTIP, it acts as a barrier to DSB resection promoting NHEJ and preventing H(D)R (Lazzaro et al. 2009). Interestingly, tumor predisposition and HR defect of Brca1-deficient mice can be rescued by deleting 53BP1, suggesting an antagonistic relationship between 53BP1 and BRCA1 in promoting either NHEJ or HR in mammalian cells. Indeed, BRCA1 participates in HR by functioning as a platform to recruit key enzymatic factors and promoting DNA end resection. 53BP1/Rad9 is a target of ATR/Mec1, ATM/Tel1, Rad53, and CDK1 kinases (Callen et al. 2013; Panier and Boulton 2014). Interestingly, it was recently discovered that 53BP1 dephosphorylation enables its recruitment to DSB ends (Lee et al. 2014).

Rad51-ssDNA Filament and Strand Exchange

During early stages of H(D)R, the Rad51-ssDNA filament, which is the central player in homology searching and strand exchange reactions (Fig. 1), is a finely regulated intermediate. Although evidence in human cells indicates that the Rad51 protein itself is directly regulated through a DDR-dependent phosphorylation that influences Rad51-ssDNA filament formation, it is known that assembly/disassembly of the Rad51-ssDNA filament is mediated by several factors (Heyer et al. 2010), which are themselves targets of positive and negative regulation (see Table 1). Among these factors, a main antirecombinogenic role is played by specific DNA translocases/helicases that, moving along ssDNA, promote disassembly of Rad51. Indeed, disruption of the Rad51-ssDNA filament stops unwanted recombination events, reducing the risk of deleterious genome rearrangements. Interestingly, in yeast cells the Srs2 helicase finely regulates the Rad51 nucleoprotein filament and participates in different repair mechanisms depending on its phosphorylated and sumoylated state, further explaining its pro- and antirecombinogenic functions. A number of DNA translocases/helicases might also disrupt D-loop intermediates or, eventually, limit their extension. As a consequence, these DNA translocases/helicases can inhibit recombination or reduce crossover outcomes. It will be important to explore whether posttranslational modifications of these factors (see Table 1) might regulate their activity on D-loops as well.

Joint DNA Molecules Formation and Dissolution/Resolution

During H(D)R, joint DNA molecules (named Holliday junctions, HJs) can be formed (Fig. 1). Different mechanisms and factors, which are cell cycle regulated, are responsible for their elimination and result in crossover or noncrossover outcomes (Schwartz and Heyer 2011). In S. cerevisiae there are two major pathways for the processing of late HR intermediates. The first, mediated by the Sgs1-Top3-Rmi1 (STR) complex, specializes in the elimination of double HJs (dHJs) during S phase. A related mechanism of double HJ “dissolution” occurs in human cells, driven by BLM-TopoIIIα-RMI1-RMI2 (BTR complex), and is important for the avoidance of sister chromatid exchanges and loss of heterozygosity. Indeed, STR/BTR plays an important role in limiting crossover (CO) formation by directing the products of recombination to noncrossovers (NCOs). The second pathway for the processing of HJs involves two structure-specific endonucleases: in yeast, these are the XPF family endonuclease Mus81-Mms4 and the XPG family member Yen1, homologous to human MUS81-EME1 and GEN1, respectively, both important for promoting HJ “resolution.” In S. cerevisiae, recent evidence suggests that the STR complex, Mus81-Mms4, and Yen1 are differentially regulated during the cell cycle. Cdk-/Cdc5-mediated phosphorylation of Mms4 drives the hyperactivation of Mus81-Mms4 at the G2/M transition, whereas S phase phosphorylation of Yen1 holds this protein outside the nucleus and in an inactive state until it is activated by Cdc14 dephosphorylation at anaphase. As a result, HJs are normally dissolved in S phase by the noncrossover-promoting STR complex, which prevents genomic rearrangements and loss of heterozygosity. Only in later phases of the cell cycle are first Mus81-Mms4 and finally Yen1 activated, ensuring that cells do not enter telophase with unresolved HJs and that DNA is correctly segregated (Blanco et al. 2014; Eissler et al. 2014). Slx4 is a key factor that regulates the recruitment of certain nucleases/resolvases onto HJs. Interestingly, in yeast, DDR-dependent phosphorylation of Slx4 promotes a Rad1-dependent cleavage of nonhomologous 3′ DNA tails during H(D)R processes (Toh et al. 2010), further supporting the notion that the nucleases/resolvases are finely regulated enzymes and, depending upon their posttranslational modifications and protein interactions, might participate in different DSB repair processes.
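The temporal ordering of HJ processing in S. cerevisiae described above can be written down as a small lookup table. The Python sketch below is only a schematic summary of the text, not a quantitative model: the names `PHASES`, `HJProcessor`, and `processors_available` are hypothetical, and the rule that a factor stays available in every phase after its activation is a simplifying assumption.

```python
from typing import List, NamedTuple

# Cell-cycle stages named in the text, in temporal order.
PHASES = ("S", "G2/M", "anaphase")


class HJProcessor(NamedTuple):
    name: str          # complex or enzyme
    mode: str          # "dissolution" or "resolution"
    active_from: str   # first phase in which the text describes it as active
    trigger: str       # regulatory event named in the text


HJ_PROCESSORS = (
    HJProcessor("Sgs1-Top3-Rmi1 (STR)", "dissolution", "S",
                "not specified in the text"),
    HJProcessor("Mus81-Mms4", "resolution", "G2/M",
                "Cdk-/Cdc5-mediated phosphorylation of Mms4"),
    HJProcessor("Yen1", "resolution", "anaphase",
                "Cdc14 dephosphorylation"),
)


def processors_available(phase: str) -> List[str]:
    """Names of HJ-processing activities switched on by the given phase.

    Assumption (not stated in the text): once activated, a factor is
    treated as available in all later phases.
    """
    idx = PHASES.index(phase)  # raises ValueError for an unknown phase
    return [p.name for p in HJ_PROCESSORS
            if PHASES.index(p.active_from) <= idx]


for phase in PHASES:
    print(phase, "->", ", ".join(processors_available(phase)))
```

Keeping the regulatory trigger next to each entry makes it easy to see that the two resolvases are switched on by opposite modifications (phosphorylation for Mus81-Mms4, dephosphorylation for Yen1).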

References

Ahnesorg P, Jackson SP (2007) The non-homologous end-joining protein Nej1p is a target of the DNA damage checkpoint. DNA Repair (Amst) 6:190–201
Altmannova V, Eckert-Boulet N, Arneric M, Kolesar P, Chaloupkova R, Damborsky J, Sung P, Zhao X, Lisby M, Krejci L (2010) Rad52 SUMOylation affects the efficiency of the DNA repair. Nucleic Acids Res 38:4708–4721
Ammazzalorso F, Pirzio LM, Bignami M, Franchitto A, Pichierri P (2010) ATR and ATM differently regulate WRN to prevent DSBs at stalled replication forks and promote replication fork recovery. EMBO J 29:3156–3169
Anantha RW, Vassin VM, Borowiec JA (2007) Sequential and synergistic modification of human RPA stimulates chromosomal DNA repair. J Biol Chem 282:35910–35923
Antunez de Mayolo A, Lisby M, Erdeniz N, Thybo T, Mortensen UH, Rothstein R (2006) Multiple start codons and phosphorylation result in discrete Rad52 protein species. Nucleic Acids Res 34:2587–2597
Aparicio T, Baer R, Gautier J (2014) DNA double-strand break repair pathway choice and cancer. DNA Repair (Amst) 19:169–175
Bahassi EM, Ovesen JL, Riesenberg AL, Bernstein WZ, Hasty PE, Stambrook PJ (2008) The checkpoint kinases Chk1 and Chk2 regulate the functional associations between hBRCA2 and Rad51 in response to DNA damage. Oncogene 27:3977–3985
Balakrishnan L, Stewart J, Polaczek P, Campbell JL, Bambara RA (2010) Acetylation of Dna2 endonuclease/helicase and flap endonuclease 1 by p300 promotes DNA stability by creating long flap intermediates. J Biol Chem 285:4398–4404
Baroni E, Viscardi V, Cartagena-Lirola H, Lucchini G, Longhese MP (2004) The functions of budding yeast Sae2 in the DNA damage response require Mec1- and Tel1-dependent phosphorylation. Mol Cell Biol 24:4151–4165
Bayart E, Dutertre S, Jaulin C, Guo RB, Xi XG, Amor-Gueret M (2006) The Bloom syndrome helicase is a substrate of the mitotic Cdc2 kinase. Cell Cycle 5:1681–1686
Blanco MG, Matos J, West SC (2014) Dual control of yen1 nuclease activity and cellular localization by cdk and cdc14 prevents genome instability. Mol Cell 54:94–106
Blander G, Zalle N, Daniely Y, Taplick J, Gray MD, Oren M (2002) DNA damage-induced translocation of the Werner helicase is regulated by acetylation. J Biol Chem 277:50934–50940
Branzei D, Sollier J, Liberi G, Zhao X, Maeda D, Seki M, Enomoto T, Ohta K, Foiani M (2006) Ubc9- and mms21-mediated sumoylation counteracts recombinogenic events at damaged replication forks. Cell 127:509–522
Bruderer R, Tatham MH, Plechanovova A, Matic I, Garg AK, Hay RT (2011) Purification and identification of endogenous polySUMO conjugates. EMBO Rep 12:142–148
Callen E, Di Virgilio M, Kruhlak MJ, Nieto-Soler M, Wong N, Chen HT, Faryabi RB, Polato F, Santos M, Starnes LM et al (2013) 53BP1 mediates productive and mutagenic DNA repair through distinct phosphoprotein interactions. Cell 153:1266–1280
Cartagena-Lirola H, Guerini I, Viscardi V, Lucchini G, Longhese MP (2006) Budding yeast Sae2 is an in vivo target of the Mec1 and Tel1 checkpoint kinases during meiosis. Cell Cycle 5:1549–1559
Chan DW, Ye R, Veillette CJ, Lees-Miller SP (1999) DNA-dependent protein kinase phosphorylation sites in Ku 70/80 heterodimer. Biochemistry 38:1819–1828
Chen X, Niu H, Chung WH, Zhu Z, Papusha A, Shim EY, Lee SE, Sung P, Ira G (2011) Cell cycle regulation of DNA double-strand break end resection by Cdk1-dependent Dna2 phosphorylation. Nat Struct Mol Biol 18:1015–1019

Chapman JR, Taylor MR, Boulton SJ (2012) Playing the end game: DNA double-strand break repair pathway choice. Mol Cell 47:497–510
Chen H, Lisby M, Symington LS (2013) RPA coordinates DNA end resection and prevents formation of DNA hairpins. Mol Cell 50:589–600
Choudhary C, Kumar C, Gnad F, Nielsen ML, Rehman M, Walther TC, Olsen JV, Mann M (2009) Lysine acetylation targets protein complexes and co-regulates major cellular functions. Science 325:834–840
Conilleau S, Takizawa Y, Tachiwana H, Fleury F, Kurumizaka H, Takahashi M (2004) Location of tyrosine 315, a target for phosphorylation by c-Abl tyrosine kinase, at the edge of the subunit-subunit interface of the human Rad51 filament. J Mol Biol 339:797–804
Cortez D, Wang Y, Qin J, Elledge SJ (1999) Requirement of ATM-dependent phosphorylation of brca1 in the DNA damage response to double-strand breaks. Science 286:1162–1166
D’Amours D, Jackson SP (2001) The yeast Xrs2 complex functions in S phase checkpoint regulation. Genes Dev 15:2238–2249
Davies SL, North PS, Dart A, Lakin ND, Hickson ID (2004) Phosphorylation of the Bloom's syndrome helicase and its role in recovery from S-phase arrest. Mol Cell Biol 24:1279–1291
Donnianni RA, Ferrari M, Lazzaro F, Clerici M, Tamilselvan Nachimuthu B, Plevani P, Muzi-Falconi M, Pellicioli A (2010) Elevated levels of the polo kinase Cdc5 override the Mec1/ATR checkpoint in budding yeast by acting at different steps of the signaling pathway. PLoS Genet 6:e1000763
Dou H, Huang C, Singh M, Carpenter PB, Yeh ET (2010) Regulation of DNA repair through deSUMOylation and SUMOylation of replication protein A complex. Mol Cell 39:333–345
Eissler CL, Mazon G, Powers BL, Savinov SN, Symington LS, Hall MC (2014) The cdk/cdc14 module controls activation of the yen1 holliday junction resolvase to promote genome stability. Mol Cell 54:80–93
El-Shemerly M, Hess D, Pyakurel AK, Moselhy S, Ferrari S (2008) ATR-dependent pathways control hEXO1 stability in response to stalled forks. Nucleic Acids Res 36:511–519
Eladad S, Ye TZ, Hu P, Leversha M, Beresten S, Matunis MJ, Ellis NA (2005) Intra-nuclear trafficking of the BLM helicase to DNA damage-induced foci is regulated by SUMO modification. Hum Mol Genet 14:1351–1365
Esashi F, Christ N, Gannon J, Liu Y, Hunt T, Jasin M, West SC (2005) CDK-dependent phosphorylation of BRCA2 as a regulatory mechanism for recombinational repair. Nature 434:598–604
Ferrari M, Nachimuthu BT, Donnianni RA, Klein H, Pellicioli A (2013) Tid1/Rdh54 translocase is phosphorylated through a Mec1- and Rad53-dependent manner in the presence of DSB lesions in budding yeast. DNA Repair (Amst) 12:347–355


Gama V, Yoshida T, Gomez JA, Basile DP, Mayo LD, Haas AL, Matsuyama S (2006) Involvement of the ubiquitin pathway in decreasing Ku70 levels in response to drug-induced apoptosis. Exp Cell Res 312:488–499
Gatei M, Scott SP, Filippovitch I, Soronika N, Lavin MF, Weber B, Khanna KK (2000a) Role for ATM in DNA damage-induced phosphorylation of BRCA1. Cancer Res 60:3299–3304
Gatei M, Young D, Cerosaletti KM, Desai-Mehta A, Spring K, Kozlov S, Lavin MF, Gatti RA, Concannon P, Khanna K (2000b) ATM-dependent phosphorylation of nibrin in response to radiation exposure. Nat Genet 25:115–119
Herzberg K, Bashkirov VI, Rolfsmeier M, Haghnazari E, McDonald WH, Anderson S, Bashkirova EV, Yates JR 3rd, Heyer WD (2006) Phosphorylation of Rad55 on serines 2, 8, and 14 is required for efficient homologous recombination in the recovery of stalled replication forks. Mol Cell Biol 26:8396–8409
Heyer WD, Ehmsen KT, Liu J (2010) Regulation of homologous recombination in eukaryotes. Annu Rev Genet 44:113–139
Huertas P, Jackson SP (2009) Human CtIP mediates cell cycle control of DNA end resection and double strand break repair. J Biol Chem 284:9558–9565
Ira G, Pellicioli A, Balijja A, Wang X, Fiorani S, Carotenuto W, Liberi G, Bressan D, Wan L, Hollingsworth NM et al (2004) DNA end resection, homologous recombination and DNA damage checkpoint activation require CDK1. Nature 431:1011–1017
Kai M, Boddy MN, Russell P, Wang TS (2005) Replication checkpoint kinase Cds1 regulates Mus81 to preserve genome integrity during replication stress. Genes Dev 19:919–932
Kaidi A, Weinert BT, Choudhary C, Jackson SP (2010) Human SIRT6 promotes DNA end resection through CtIP deacetylation. Science 329:1348–1353
Kawabe Y, Seki M, Seki T, Wang WS, Imamura O, Furuichi Y, Saitoh H, Enomoto T (2000) Covalent modification of the Werner's syndrome gene product with the ubiquitin-related protein, SUMO-1. J Biol Chem 275:20963–20966
Kim HS, Brill SJ (2003) MEC1-dependent phosphorylation of yeast RPA1 in vitro. DNA Repair (Amst) 2:1321–1335
Kitao H, Yuan ZM (2002) Regulation of ionizing radiation-induced Rad52 nuclear foci formation by c-Abl-mediated phosphorylation. J Biol Chem 277:48944–48948
Lazzaro F, Giannattasio M, Puddu F, Granata M, Pellicioli A, Plevani P, Muzi-Falconi M (2009) Checkpoint mechanisms at the intersection between DNA damage and repair. DNA Repair (Amst) 8:1055–1067
Lee JS, Collins KM, Brown AL, Lee CH, Chung JH (2000) hCds1-mediated phosphorylation of BRCA1 regulates the DNA damage response. Nature 404:201–204
Lee KJ, Jovanovic M, Udayakumar D, Bladen CL, Dynan WS (2004) Identification of DNA-PKcs phosphorylation sites in XRCC4 and effects of mutations at these sites on DNA end joining in a cell-free system. DNA Repair (Amst) 3:267–276
Lee DH, Acharya SS, Kwon M, Drane P, Guan Y, Adelmant G, Kalev P, Shah J, Pellman D, Marto JA et al (2014) Dephosphorylation enables the recruitment of 53BP1 to double-strand DNA breaks. Mol Cell 54:512–525
Li K, Wang R, Lozada E, Fan W, Orren DK, Luo J (2010) Acetylation of WRN protein regulates its stability by inhibiting ubiquitination. PLoS One 5:e10341
Lim DS, Kim ST, Xu B, Maser RS, Lin J, Petrini JH, Kastan MB (2000) ATM phosphorylates p95/nbs1 in an S-phase checkpoint pathway. Nature 404:613–617
Lloyd J, Chapman JR, Clapperton JA, Haire LF, Hartsuiker E, Li J, Carr AM, Jackson SP, Smerdon SJ (2009) A supramodular FHA/BRCT-repeat architecture mediates Nbs1 adaptor function in response to DNA damage. Cell 139:100–111
Ma Y, Pannicke U, Lu H, Niewolik D, Schwarz K, Lieber MR (2005) The DNA-dependent protein kinase catalytic subunit phosphorylation sites in human Artemis. J Biol Chem 280:33839–33846
Mimitou EP, Symington LS (2009) DNA end resection: many nucleases make light work. DNA Repair (Amst) 8:983–995
Morin I, Ngo HP, Greenall A, Zubko MK, Morrice N, Lydall D (2008) Checkpoint-dependent phosphorylation of Exo1 modulates the DNA damage response. EMBO J 27:2400–2410
Muftuoglu M, Kusumoto R, Speina E, Beck G, Cheng WH, Bohr VA (2008) Acetylation regulates WRN catalytic activities and affects base excision DNA repair. PLoS One 3:e1918
Muller-Tidow C, Ji P, Diederichs S, Potratz J, Baumer N, Kohler G, Cauvet T, Choudary C, van der Meer T, Chan WY et al (2004) The cyclin A1-CDK2 complex regulates DNA double-strand break repair. Mol Cell Biol 24:8917–8928
Niu H, Wan L, Busygina V, Kwon Y, Allen JA, Li X, Kunz RC, Kubota K, Wang B, Sung P et al (2009) Regulation of meiotic recombination via Mek1-mediated Rad54 phosphorylation. Mol Cell 36:393–404
Ohouo PY, Bastos de Oliveira FM, Almeida BS, Smolka MB (2010) DNA damage signaling recruits the Rtt107-Slx4 scaffolds via Dpb11 to mediate replication stress response. Mol Cell 39:300–306
Ohouo PY, Bastos de Oliveira FM, Liu Y, Ma CJ, Smolka MB (2013) DNA-repair scaffolds dampen checkpoint signalling by counteracting the adaptor Rad9. Nature 493:120–124
Ouyang KJ, Woo LL, Zhu J, Huo D, Matunis MJ, Ellis NA (2009) SUMO modification regulates BLM and RAD51 interaction at damaged replication forks. PLoS Biol 7:e1000252
Panier S, Boulton SJ (2014) Double-strand break repair: 53BP1 comes into focus. Nat Rev Mol Cell Biol 15:7–18
Pellicioli A, Lee SE, Lucca C, Foiani M, Haber JE (2001) Regulation of Saccharomyces Rad53 checkpoint kinase during adaptation from DNA damage-induced G2/M arrest. Mol Cell 7:293–300
Postow L, Ghenoiu C, Woo EM, Krutchinsky AN, Chait BT, Funabiki H (2008) Ku80 removal from DNA through double strand break-induced ubiquitylation. J Cell Biol 182:467–479
Riballo E, Kuhne M, Rief N, Doherty A, Smith GC, Recio MJ, Reis C, Dahm K, Fricke A, Krempler A et al (2004) A pathway of double-strand break rejoining dependent upon ATM, Artemis, and proteins locating to gammaH2AX foci. Mol Cell 16:715–724
Robert T, Vanoli F, Chiolo I, Shubassi G, Bernstein KA, Rothstein R, Botrugno OA, Parazzoli D, Oldani A, Minucci S et al (2011) HDACs link the DNA damage response, processing of double-strand breaks and autophagy. Nature 471:74–79
Sacher M, Pfander B, Hoege C, Jentsch S (2006) Control of Rad52 recombination activity by double-strand break-induced SUMO modification. Nat Cell Biol 8:1284–1290
Saponaro M, Callahan D, Zheng X, Krejci L, Haber JE, Klein HL, Liberi G (2010) Cdk1 targets Srs2 to complete synthesis-dependent strand annealing and to promote recombinational repair. PLoS Genet 6:e1000858
Sartori AA, Lukas C, Coates J, Mistrik M, Fu S, Bartek J, Baer R, Lukas J, Jackson SP (2007) Human CtIP promotes DNA end resection. Nature 450:509–514
Schwartz EK, Heyer WD (2011) Processing of joint molecule intermediates by structure-selective endonucleases during homologous recombination in eukaryotes. Chromosoma 120:109–127
Sorensen CS, Hansen LT, Dziegielewski J, Syljuasen RG, Lundin C, Bartek J, Helleday T (2005) The cell-cycle checkpoint kinase Chk1 is required for mammalian homologous recombination repair. Nat Cell Biol 7:195–201
Terasawa M, Ogawa T, Tsukamoto Y, Ogawa H (2008) Sae2p phosphorylation is crucial for cooperation with Mre11p for resection of DNA double-strand break ends during meiotic recombination in Saccharomyces cerevisiae. Genes Genet Syst 83:209–217
Tibbetts RS, Cortez D, Brumbaugh KM, Scully R, Livingston D, Elledge SJ, Abraham RT (2000) Functional interactions between BRCA1 and the checkpoint kinase ATR during genotoxic stress. Genes Dev 14:2989–3002
Toh GW, Sugawara N, Dong J, Toth R, Lee SE, Haber JE, Rouse J (2010) Mec1/Tel1-dependent phosphorylation of Slx4 stimulates Rad1-Rad10-dependent cleavage of non-homologous DNA tails. DNA Repair (Amst) 9:718–726
Tougan T, Kasama T, Ohtaka A, Okuzaki D, Saito TT, Russell P, Nojima H (2010) The Mek1 phosphorylation cascade plays a role in meiotic recombination of Schizosaccharomyces pombe. Cell Cycle 9:4688–4702
Ubersax JA, Woodbury EL, Quang PN, Paraz M, Blethrow JD, Shah K, Shokat KM, Morgan DO (2003) Targets of the cyclin-dependent kinase Cdk1. Nature 425:859–864

Wang YG, Nnakwe C, Lane WS, Modesti M, Frank KM (2004) Phosphorylation and regulation of DNA ligase IV stability by DNA-dependent protein kinase. J Biol Chem 279:37282–37290
Waters CA, Strande NT, Wyatt DW, Pryor JM, Ramsden DA (2014) Nonhomologous end joining: a good solution for bad ends. DNA Repair (Amst) 17:39–51
Williams RS, Dodson GE, Limbo O, Yamada Y, Williams JS, Guenther G, Classen S, Glover JN, Iwasaki H, Russell P et al (2009) Nbs1 flexibly tethers Ctp1 and Mre11-Rad50 to coordinate DNA double-strand break processing and repair. Cell 139:87–99
Woods YL, Xirodimas DP, Prescott AR, Sparks A, Lane DP, Saville MK (2004) p14 Arf promotes small ubiquitin-like modifier conjugation of Werners helicase. J Biol Chem 279:50157–50166
Wu X, Ranganathan V, Weisman DS, Heine WF, Ciccone DN, O'Neill TB, Crick KE, Pierce KA, Lane WS, Rathbun G et al (2000) ATM phosphorylation of Nijmegen breakage syndrome protein is required in a DNA damage response. Nature 405:477–482
Yu X, Fu S, Lai M, Baer R, Chen J (2006) BRCA1 ubiquitinates its phosphorylation-dependent binding partner CtIP. Genes Dev 20:1721–1726
Yu Y, Mahaney BL, Yano K, Ye R, Fang S, Douglas P, Chen DJ, Lees-Miller SP (2008) DNA-PK and ATM phosphorylation sites in XLF/Cernunnos are not required for repair of DNA double strand breaks. DNA Repair (Amst) 7:1680–1692
Yurchenko V, Xue Z, Gama V, Matsuyama S, Sadofsky MJ (2008) Ku70 is stabilized by increased cellular SUMO. Biochem Biophys Res Commun 366:263–268
Yurchenko V, Xue Z, Sadofsky MJ (2006) SUMO modification of human XRCC4 regulates its localization and function in DNA double-strand break repair. Mol Cell Biol 26:1786–1794
Zhao S, Weng YC, Yuan SS, Lin YT, Hsu HC, Lin SC, Gerbino E, Song MH, Zdzienicka MZ, Gatti RA et al (2000) Functional link between ataxia-telangiectasia and Nijmegen breakage syndrome gene products. Nature 405:473–477


E

Electrophiles, Types of

Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms

Modification of DNA

Definition

Different types of electrophiles (molecules with partial or full positive charges) react differently with DNA and specific atoms of the bases within DNA.

Electrophiles, Types of, Fig. 1 Examples of classes of electrophiles that react with DNA

© Springer Science+Business Media, LLC 2018 R.D. Wells et al. (eds.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Discussion

By definition, electrophiles are molecules that “seek electrons” because they bear a full or partial positive charge at a particular atom. These include (a) Michael acceptors, (b) compounds with good leaving groups, and (c) strained ring systems (Fig. 1). Michael acceptors react via the unsaturated part of the molecule. In the cases of (a) and (b), attack is at the positive center indicated (Fig. 1). In some cases, the reactions may be more complex, particularly if the reactive compound is a bis-electrophile (i.e., has two reactive centers). Both electrophilic centers may react with a single DNA base, producing a cyclic product, or DNA cross-linking (or DNA-protein cross-linking) can result (Fig. 2).


Electrophiles, Types of, Fig. 2 Bifunctional electrophiles that react with DNA

Cross-References

▶ Bioactivation of Carcinogens
▶ Damage DNA, Natural Products that
▶ DNA Damage by Endogenous Chemicals
▶ DNA Damage, Frequency of
▶ Exocyclic Adducts
▶ Hydrolytic, Deamination, and Rearrangement Reactions of DNA Adducts
▶ Kinetics of DNA Damage
▶ Selectivity of Chemicals for DNA Damage
▶ Site-Specific Mutagenesis
▶ Synthesis of Modified Oligonucleotides
▶ Ultraviolet Light DNA Damage

Embryophytes

▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

End Joining, Classical and Alternative

Sang Eun Lee
Department of Molecular Medicine, University of Texas Health Science Center, San Antonio, TX, USA

Synopsis

Nonhomologous end joining (NHEJ), or classical end joining, mends broken chromosomes by juxtaposing and rejoining DNA ends in the absence of homologous donor sequences. Alternative end joining (A-EJ), or microhomology-mediated end joining (MMEJ), can substitute for NHEJ, and it repairs DNA double-strand breaks (DSBs) by annealing microhomologous sequences flanking a DSB. Here I describe the basic framework and genetics of NHEJ and A-EJ/MMEJ and how multiple DSB repair pathways interact for efficient DSB repair in several model systems.


Introduction

DNA double-strand breaks (DSBs) are critical lesions; unrepaired or mis-repaired DNA breaks cause gross chromosomal rearrangements and cancer-causing mutations. Efficient detection and proper repair of DNA breaks restore genomic integrity and are thus fundamental for the viability of every living cell. Multiple pathways exist to eliminate DSBs in cells. The two best-studied repair mechanisms are homologous recombination (HR) and nonhomologous end joining (NHEJ). HR repairs DSBs by searching for and copying from a homologous donor sequence to restore any missing genetic information across a DNA break (Heyer et al. 2010), whereas NHEJ simply ligates broken ends back together regardless of the presence of a homologous donor (Lieber 2010; Lieber and Wilson 2010). Accordingly, NHEJ is best suited for healing DNA breaks during G1, when cells have not yet produced sister chromatids, the ideal homologous templates. NHEJ also contributes to the formation of B and T cell receptors, viral integration, and telomere fusions, all of which entail DNA breakage and rejoining steps.

Classical or Nonhomologous End Joining

Biochemical Steps of NHEJ

NHEJ commences with the binding of Ku proteins to the broken DNA ends (Fig. 1). Ku is a heterodimeric complex comprising 70 and 80 kDa proteins that form a ring-shaped molecule. Ku threads onto a broken DNA end, encircling the duplex DNA with high affinity, regardless of the sequence context. Upon binding to duplex DNA, Ku protects the DNA ends from nucleolytic cleavage and suppresses HR. Ku also recruits the DNA-dependent protein kinase (DNA-PK) catalytic subunit to the DNA ends in higher eukaryotes, initiating DNA damage signaling and coordinating the actions of multiple repair enzymes. Ku has also been implicated in removing apurinic/apyrimidinic (AP) sites near DSBs and thereby processing DNA ends prior to ligation. Almost concurrently with Ku, the Mre11/Rad50/Nbs1 (MRN) complex (Mre11/Rad50/Xrs2 (MRX) in yeast) also binds to the DNA ends and likely aligns/synapses the broken chromosomal ends for religation. The filament-like structure of Rad50 is uniquely suited for juxtaposition of two ends at a distance. However, the role of the Mre11 complex in NHEJ has been hotly debated for some time. In budding yeast, deletion of any of the MRX components causes severe NHEJ deficiency. In contrast, deletion of MRN in fission yeast does not confer an NHEJ defect, challenging its role in NHEJ in this organism. In mammals, the analysis of MRN function has been hampered by the lethality caused by deletion of any of the MRN genes. Most recently, however, a nuclease-deficient mre11 mutant was successfully engineered in mouse cells, and the subsequent analysis confirmed the role of this complex in both classical and alternative end joining.

End Joining, Classical and Alternative, Fig. 1 Model for NHEJ and putative roles of each NHEJ protein. Steps and factors shown: end alignment (Ku70-Ku80-DNA-PKcs, Mre11-Rad50-Nbs1); end processing (Rad27/Artemis/Tdp1, Pol λ, Pol μ); ligation (DNA ligase IV-Xrcc4-Xlf)


Following end alignment/synapsis, DNA ends are further processed by a series of nucleases/polymerases, making the ends compatible for ligation. This “end-processing” step is extremely versatile and is carried out redundantly by multiple enzymes that possess either similar or related activities and can thus substitute for each other. Accordingly, deletion of an end-processing gene often impacts the formation of specific types of NHEJ products without affecting the overall NHEJ frequency. Despite this experimental challenge, the available evidence indicates that Rad27/Fen1 trims DNA ends as a 5′–3′ nuclease, activated Artemis and Tdp1 process DNA ends as 3′–5′ nucleases, and DNA polymerases μ and λ (DNA Pol IV in budding yeast) catalyze fill-in gap synthesis. Interestingly, DNA Pol μ and λ allow for template-independent DNA synthesis across missing or folded nucleotides, forming the basis of NHEJ-mediated mutagenesis.

The final step of NHEJ is DNA ligation. An evolutionarily conserved ligase (DNA ligase IV in mammals, Dnl4 in yeast) that is exclusively dedicated to NHEJ performs this reaction in conjunction with two accessory proteins, Xrcc4 (Lif1 in yeast) and Xlf1 (Nej1 in yeast). DNA ligase IV physically interacts with both of these accessory proteins, and the interaction with Xrcc4 promotes its enrichment at DNA breaks. Catalytically, DNA ligase IV can ligate DNA ends across gaps and between incompatible ends, enabling flexible NHEJ reactions along with DNA polymerases μ and λ.

Regulation of NHEJ

Unlike HR, wherein the types of repair products are dictated by the obligatory reliance on an available homologous template, NHEJ is far more flexible, yielding a wider range of breakpoint junctions, many of which gain or lose anywhere from a few nucleotides to kilobases of sequence at the junction. This flexibility led to the notion that NHEJ is mutagenic. Most of these junctions also feature 0–2 bp of overlapping sequence at one or both ends that can mediate base pairing and assist end alignment/synapsis. The high plasticity of the NHEJ reaction is fundamental in dealing with a wide range of DNA lesions induced by toxic chemicals and metabolic damage. Nevertheless, such a feature also poses the challenge of keeping repair-associated mutations to a minimum so that chromosome integrity is not sacrificed. Indeed, accumulated evidence indicates that NHEJ follows discrete governing rules that minimize the extent and the types of mutations acquired during the NHEJ reaction. For instance, factors and activities less likely to alter junctional sequences may gain earlier access to DNA ends, ahead of those with higher mutation potential. Furthermore, when cells suffer multiple DSBs, mechanisms are in place to limit interchromosomal joining; otherwise, chromosomal translocations may ensue. Evidence has emerged that ATM-dependent end tethering suppresses interchromosomal end joining, thereby suppressing break-induced chromosomal translocations. Identification and characterization of these rules will be an important future goal of NHEJ research.

Alternative End Joining/Microhomology-Mediated End Joining

Distinct from NHEJ and HR, another mechanism termed alternative end joining (A-EJ) or backup end joining operates to deal with DSBs when NHEJ and/or HR is not available (McVey and Lee 2008). The extensive use of microhomology at the junction and the larger deletions among A-EJ products earned it the name “microhomology-mediated end joining (MMEJ),” although not all A-EJ products feature microhomology at the junctions. A-EJ has been noted for some time, but its role in DSB repair was largely dismissed as insignificant on the grounds that it serves purely as a backup for NHEJ and thus has no physiological relevance. Recently, however, evidence has accumulated that A-EJ is surprisingly robust even in NHEJ-competent cells and is likely responsible for the formation of chromosomal rearrangements in multiple genetic diseases, igniting research efforts on A-EJ in multiple model systems.


A-EJ Mechanisms

Mechanistically, A-EJ repairs DSBs by annealing flanking microhomology (2–20 bp) that is longer than that used in classical NHEJ (0–2 bp) but shorter than that used in single-strand annealing (SSA) (>30 bp), and it produces repair products lacking the inter-repeat sequence (Fig. 2). A-EJ thus operates as a hybrid repair pathway between classical NHEJ and SSA. Furthermore, the genetics of A-EJ resembles an amalgam of NHEJ and SSA, relying on a subset of NHEJ genes such as Mre11 and Rad50 and on SSA genes such as XPF and ERCC1. The precise steps and enzymology of A-EJ are still emerging; however, a basic framework and the required gene products have been elucidated in yeast and a few other experimental models. According to these studies, A-EJ begins with end resection, which degrades the 5′ strand and yields 3′ ssDNA overhanging ends. The Mre11 complex and Sae2/CtIP have been implicated in the initial stage of end resection and are thus required for A-EJ in yeast and mice. Resection unmasks flanking microhomologies, and the exposed microhomologies anneal to each other in a manner independent of Rad52, the primary annealing factor in HR. The single-strand DNA-binding replication protein A (RPA) complex limits spontaneous annealing of microhomology and thereby antagonizes MMEJ.
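The homology-length ranges quoted above lend themselves to a small illustration. The following Python sketch is a toy model, not an established tool: it scans the two flanks of a break for the longest shared sequence, builds the product that lacks the inter-repeat sequence, and labels the homology length using the ranges given in the text. The function names, the example sequences, and the tie-breaking rule (prefer the smallest deletion) are assumptions made for illustration. The text does not assign the 21–30 bp interval to any pathway, and 2 bp falls in both the NHEJ and A-EJ ranges, so the sketch reports those cases as such.

```python
from typing import NamedTuple, Optional


class Microhomology(NamedTuple):
    length: int        # bp of shared sequence
    left_start: int    # index of the repeat in the left flank
    right_start: int   # index of the repeat in the right flank
    deleted: int       # bp lost in the repair product


def find_microhomology(left: str, right: str,
                       min_len: int = 2,
                       max_len: int = 20) -> Optional[Microhomology]:
    """Longest sequence shared by both flanks of a break (top strand, 5'->3').

    The break lies between the end of `left` and the start of `right`.
    Among equally long candidates, the one causing the smallest deletion
    is returned (an illustrative assumption, not a biological rule).
    """
    for k in range(min(max_len, len(left), len(right)), min_len - 1, -1):
        best = None
        for i in range(len(left) - k + 1):
            j = right.find(left[i:i + k])  # occurrence closest to the break
            if j == -1:
                continue
            deleted = (len(left) - i) + j  # one repeat copy + inter-repeat DNA
            if best is None or deleted < best.deleted:
                best = Microhomology(k, i, j, deleted)
        if best is not None:
            return best
    return None


def repair_product(left: str, right: str, mh: Microhomology) -> str:
    """Join the flanks at the annealed repeat, keeping a single repeat copy."""
    return left[:mh.left_start + mh.length] + right[mh.right_start + mh.length:]


def classify_homology_length(bp: int) -> str:
    """Label a homology length with the ranges quoted in the text."""
    if bp < 2:
        return "classical NHEJ"
    if bp == 2:
        return "classical NHEJ or A-EJ/MMEJ (ranges overlap)"
    if bp <= 20:
        return "A-EJ/MMEJ"
    if bp <= 30:
        return "not assigned in the text (21-30 bp)"
    return "SSA"


# Hypothetical flanks sharing the 6-bp repeat GGCTAA.
left_flank = "ACGTTGCAGGCTAAGT"
right_flank = "CCTTGGCTAATTGACC"
hit = find_microhomology(left_flank, right_flank)
if hit is not None:
    print(hit, repair_product(left_flank, right_flank, hit),
          classify_homology_length(hit.length))
```

Raising `max_len` above 30 lets the same search report SSA-length repeats; the default of 20 simply mirrors the upper bound quoted for A-EJ.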


Following annealing, 3′ flaps are removed by the Xpf-ERCC1 nuclease, and the ssDNA gaps formed after end resection are filled in by DNA polymerases. In budding yeast, Pol4 and several error-prone polymerases, as well as Pol32, the nonessential subunit of DNA polymerase δ, have been implicated in MMEJ, likely at the repair synthesis step, but the role of the homologous enzymes in higher eukaryotes is not yet defined. A-EJ is then completed by ligation of the nicks by one or more DNA ligases. In budding yeast, both Dnl4 and Cdc9 catalyze the ligation step, whereas in mammals, ligase III and its binding partner, XRCC1, are implicated in the A-EJ ligation reaction. Poly(ADP-ribose) polymerase (PARP) is also needed for A-EJ, but its precise role has not yet been elucidated.

Does A-EJ Represent Multiple Pathways of DSB Repair?

Recently, a few additional DSB repair mechanisms (Fig. 3) featuring substantial microhomology at the junctions have been brought to light, but their mechanistic and genetic parameters and their relationship to each other and to A-EJ have yet to be resolved. For instance, junctions of interchromosomal template switch events between Kluyveromyces URA3 located at HML and ura3-52 in budding yeast contain

End Joining, Classical and Alternative, Fig. 2 Model for A-EJ/MMEJ and its similarity to single-strand annealing and NHEJ


End Joining, Classical and Alternative, Fig. 3 Model for other microhomology-mediated repair mechanisms

2–17 bp microhomology and are accomplished by a synthesis-dependent strand annealing-like mechanism (Hicks et al. 2010). Breakpoint junctions of spontaneous segmental duplications in yeast and copy-number variations in human cancers also possess 2–20 bp microhomology and are proposed to form by microhomology-mediated break-induced replication (BIR) (Hastings et al. 2009). In Drosophila, a short inverted or direct repeat flanking an I-SceI-induced DSB produces templated sequence insertions and microhomology at the repair junctions (Yu and McVey 2010). This so-called synthesis-dependent microhomology-mediated end joining (SDMMEJ) is deduced to occur by priming one or multiple rounds of error-prone DNA synthesis using flanking repeats, followed by the dissociation/reannealing of partially synthesized strands, yielding de novo microhomology at the repair

junctions. The event is Ku, LigIV, and Rad51 independent but dependent on POLQ, reiterating the substantial flexibility of the end joining reaction. Collectively, these results suggest that A-EJ may constitute multiple pathways entailing different genetic and mechanistic requirements and operate flexibly according to the end configuration and the available flanking microhomology.

Do NHEJ/A-EJ Compete with HR?

The presence of multiple repair pathways (i.e., HR, NHEJ, and A-EJ) that likely compete for the same DSBs raises the intriguing question of how a cell implements one pathway over the others. Several parameters likely contribute to this repair decision according to the temporal and spatial distribution of a DSB and the type of


cell suffering the DNA damage. Some of these emerging principles are described below.

Effect of Cell Cycle

The reliance on a homologous template explains why HR is the preferred mechanism after G1. Sister chromatids provide not only an additional homologous template besides the allelic sequence on the homologous chromosome but also the least mutagenic one. Conversely, NHEJ becomes the primary means to eliminate DSBs in G1. At the molecular level, cell cycle-dependent regulation of repair pathway selection impinges on the end resection step, which occurs primarily in the S/G2 phase of the cell cycle (Symington and Gautier 2011). The resection of broken DNA ends is a prerequisite for HR, and thus withholding end resection effectively discourages the onset of HR. In contrast, end resection depletes the preferred NHEJ substrate, duplex DNA ends, thereby strongly inhibiting NHEJ. Mutants deficient in end resection also tip the balance toward NHEJ. Importantly, end resection is tightly regulated by Cdk1, which phosphorylates both Sae2/CtIP, an enzyme needed for the initial stage of end resection, and Dna2, one of the two nucleases involved in extensive end resection. Cdk1-dependent phosphorylation activates these enzymes and thereby promotes the initiation and progression of end resection. Cdk1 likely affects other steps of HR or NHEJ as well, as both Rad51-dependent strand annealing and Ku binding to the DNA ends are compromised by the inhibition of Cdk1 activity during G2 in yeast. Ku also interferes with the access of Exo1 to the DNA ends and thereby suppresses end resection.

Effect of Intranuclear Distribution of DSBs

Evidence suggests that the genomic position of a DSB affects repair pathway selection. In yeast, DSBs at ribosomal DNA loci are primarily repaired by HR but need relocation to the outer nucleolar space. In higher eukaryotic cells, a DSB formed in heterochromatin is either relocated to the euchromatin territory to engage in HR or is subjected to NHEJ on-site in an ATM- and KAP1-dependent manner. Both the nucleolus and heterochromatin may pose physical barriers to repair


proteins from gaining access to DNA lesions, which affects the use of distinct repair pathways. When repair is protracted and leads to persistent DSBs, the DNA breaks are relocated to the nuclear periphery, where they can engage in more error-prone repair mechanisms.

Effect of Cell Types and Culture Conditions

Besides temporal and spatial regulation, repair pathway selection depends on the cell type and the growth conditions under which cells face DNA damage. In yeast, NHEJ is greatly suppressed in diploid cells due to the repression of the NEJ1 gene, making them almost exclusively reliant on HR for DSB repair. During meiosis, expression of Ku is markedly reduced, and therefore, repair is primarily dependent on HR. Conversely, immunoglobulin receptor gene rearrangement operates almost exclusively by NHEJ because the RAG1/RAG2 induction of DNA breaks at the recombination signal sequence shuttles DNA lesions to NHEJ and concurrently suppresses HR and A-EJ. Specific RAG mutants that unleash HR or A-EJ contributions during V(D)J recombination destabilize the post-cleavage complex. Interestingly, NHEJ predominates in diauxic stage cells or cells under low-nutrient conditions. The precise mechanism underlying this regulation is not yet known.

The accumulating evidence suggests that NHEJ, A-EJ, and HR do not simply compete with one another; their relationship is far more complex (Shrivastav et al. 2008). Deletion of NHEJ factors, particularly Ku, elevates HR and A-EJ, but the converse does not hold: deletions that inactivate the HR or A-EJ pathways do not increase NHEJ frequency. Further studies are necessary to elucidate the genetics and biochemistry of pathway selection upon DNA lesion induction.

Cross-References ▶ Double-Strand Break Repair ▶ DSB Repair by Cell-Cycle Signaling and the DNA Damage Response, Regulation of ▶ V(D)J Recombination


References

Hastings PJ, Lupski JR, Rosenberg SM, Ira G (2009) Mechanisms of change in gene copy number. Nat Rev Genet 10:551–564
Heyer WD, Ehmsen KT, Liu J (2010) Regulation of homologous recombination in eukaryotes. Annu Rev Genet 44:113–139
Hicks WM, Kim M, Haber JE (2010) Increased mutagenesis and unique mutation signature associated with mitotic gene conversion. Science 329:82–85
Lieber MR (2010) The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annu Rev Biochem 79:181–211
Lieber MR, Wilson TE (2010) SnapShot: nonhomologous DNA end joining (NHEJ). Cell 142:496–496 e491
McVey M, Lee SE (2008) MMEJ repair of double-strand breaks (director’s cut): deleted sequences and alternative endings. Trends Genet 24:529–538
Shrivastav M, De Haro LP, Nickoloff JA (2008) Regulation of DNA double-strand break repair pathway choice. Cell Res 18:134–147
Symington LS, Gautier J (2011) Double-strand break end resection and repair pathway choice. Annu Rev Genet 45:247–271
Yu AM, McVey M (2010) Synthesis-dependent microhomology-mediated end joining accounts for multiple types of repair junctions. Nucleic Acids Res 38:5706–5717

Enhancers

▶ Cis-Regulation of Eukaryotic Transcription

Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of

Jon R. Stoltzfus
Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

Synonyms

Gene cloning; Molecular cloning; rDNA

Synopsis

Molecular cloning is a suite of techniques that allow researchers to manipulate DNA and produce DNA molecules with novel combinations of genetic information from different organisms. Molecular cloning, typically referred to simply as cloning, is made possible by co-opting cellular enzymes normally involved in DNA replication, recombination, and repair and using these both in vitro and in vivo to manipulate and combine DNA molecules from different sources. The products of molecular cloning are known as recombinant DNA. Successful cloning requires segments of DNA that allow the recombinant DNA to replicate in a host organism. These segments of DNA are known as cloning vectors and have been engineered with a range of features that facilitate both the cloning process and further analysis of the recombinant DNA. Knowledge gained from molecular cloning is at the foundation of our current understanding of biology and has resulted in significant advances in biotechnology.

Introduction

Cloning DNA allows researchers to preserve a specific fragment of DNA, analyze it, and in some cases express proteins from it. Cloning consists of combining a vector DNA with an insert DNA and propagating this recombinant DNA in living host cells. The process typically involves the following steps: (1) isolating DNA from an organism of interest, (2) combining that DNA (the insert DNA) with a cloning vector to produce a recombinant DNA molecule, (3) introducing the recombinant DNA into a suitable host cell, and (4) selecting host cells that contain a recombinant DNA molecule with a particular insert sequence. See Fig. 1. The many variations of this core technology allow researchers to form new combinations of DNA that can be analyzed in the lab or incorporated into living cells to modify their metabolism. Researchers can then study the function of the insert DNA or the proteins it encodes and manipulate the processes underlying life.


Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of, Fig. 1 Overview of DNA cloning. Cloning begins with the isolation of DNA from the organism of interest. This DNA is combined with vector DNA. The vector DNA allows the recombinant DNA to replicate in the host cell and allows selection of cells harboring the recombinant DNA. The recombinant DNA is then introduced into host cells via transformation, and the cells are grown on culture media that select for cells containing the recombinant DNA. Finally, host cells that harbor the insert DNA of interest are identified. Common methods used to identify host cells harboring the DNA of interest include using labeled probes, restriction digests, and DNA sequencing

Early experiments using recombinant DNA focused on determining the DNA sequence and the function of coding sequences. Since then, the cloning of DNA has become an integral part of many fields of scientific inquiry and has revolutionized the way scientists study living organisms. It has resulted in numerous advances in agriculture, medicine, biotechnology, and other fields. Today recombinant DNA is commonly used to determine how gene expression is controlled, find interactions between proteins and other molecules, produce important medicines, create proteins with novel functions, and engineer organisms to meet societal needs. In the future, recombinant DNA may be used to cure genetic disorders and cancers that are currently difficult or impossible to treat.


A Brief History of DNA Cloning

After James Watson and Francis Crick’s elucidation of the DNA double helix in 1953, our understanding of the structure and function of DNA expanded rapidly. Classical genetics and microbiological experiments had demonstrated that segments of DNA could recombine, creating novel combinations of traits. At the same time, biochemists isolated and characterized enzymes capable of modifying DNA and learned to use these enzymes to manipulate DNA in vitro. By 1970 the stage was set for the production of recombinant DNA in vitro and transformation of living cells with this recombinant DNA. Several labs independently developed and implemented strategies accomplishing this goal (Berg and Mertz 2010).


The first recombinant DNA made in vitro was produced in the early 1970s when David Jackson, Robert Symons, and Paul Berg combined the SV40 virus genome with a bacteriophage λ derivative containing the gal operon from Escherichia coli. SV40 was used with the goal of transforming mammalian cells with the recombinant DNA. The λ-phage DNA containing the gal operon was included to allow replication and selection of the recombinant DNA in E. coli. However, this recombinant DNA was not transformed into E. coli due to concerns over the oncogenic potential of the SV40 genome (Berg and Mertz 2010).

Cloning experiments that did not involve SV40 soon followed and successfully introduced recombinant DNA into bacterial cells. Some of these early cloning experiments used plasmids (small circular DNA molecules that are maintained independent of the host cell’s chromosomal DNA) as vectors to clone various DNA fragments and analyze their sequence or function. In the early 1970s Stanley Cohen, Annie Chang, and Herbert Boyer successfully used DNA from two different species to create recombinant DNA in vitro and used this recombinant DNA to transform living cells. They used the restriction endonuclease EcoRI to cleave both the E. coli-derived plasmid vector pSC101 and DNA isolated from Staphylococcus aureus. DNA ligase was then used to join fragments of the S. aureus DNA with the plasmid DNA, creating a new recombinant plasmid that contained the original pSC101 vector plus S. aureus insert DNA encoding resistance to penicillin. The recombinant DNA was used to transform living E. coli cells. Bacteria transformed with the recombinant plasmid DNA acquired both tetracycline resistance encoded by the vector and penicillin resistance encoded by the S. aureus DNA. This basic protocol of digestion, ligation, and transformation with selection became the template for many additional cloning experiments, with the cloning of eukaryotic DNA into a plasmid quickly following (Berg and Mertz 2010). See Fig. 1.

Other early cloning experiments used bacteriophage λ as the vector. λ-phage is a virus that infects E. coli, injecting its DNA into the cell

where the DNA is replicated. In 1969 Peter Lobban, while a graduate student at Stanford, proposed using λ-phage as a vector by replacing a nonessential portion of the viral DNA with insert DNA from another source, but he did not immediately carry out these experiments (Berg and Mertz 2010). Noreen and Kenneth Murray produced a recombinant λ-phage containing as an insert the E. coli genes necessary for tryptophan biosynthesis (trp operon genes) (Murray and Murray 1974). During the same time period, Marjorie Thomas, John Cameron, and Ron Davis cloned DNA from Drosophila melanogaster into a λ-phage vector (Thomas et al. 1974). They also suggested that this vector could be used to create and propagate clones containing essentially all the DNA from a eukaryotic cell, in other words creating a genomic library. See Fig. 2.

These early experiments launched the age of cloning. DNA from any source could be joined to a vector in vitro and maintained as a clone in vivo. The insert DNA could then be isolated and analyzed by a variety of methods. See Fig. 1. These experiments also demonstrated that cloning a specific insert can change the characteristics of the host in a specific way, for example, by allowing the host to grow on medium containing an antibiotic that would have killed the untransformed host. This opened the door to exploration of many fundamental scientific questions as well as development of the biotechnology industry and key medical advances.

However, the initial cloning vectors needed improvement for cloning to truly come of age. The plasmid vectors used in early cloning experiments contained genetic markers that allowed for selection of transformed host cells. For example, pSC101 has the genes needed to confer resistance to the antibiotic tetracycline on the host so that only cells that were transformed with the plasmid can grow on media containing tetracycline.
However, these vectors did not allow for selection for the presence of insert DNA unless that insert DNA produced a novel detectable trait in the host, for example, resistance to an additional antibiotic or the ability to grow on media lacking an essential nutrient. This made the cloning process inefficient because it typically resulted


Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of, Fig. 2 Cloning using bacteriophage λ. This figure illustrates the use of bacteriophage λ as a vector to clone genomic DNA. Insert DNA can be fragmented mechanically or partially digested with restriction enzymes. The DNA is then selected for inserts in the proper size range. When DNA is fragmented mechanically, it can be methylated to prevent digestion, linkers containing a restriction site can be added, and these can be digested to produce the ends needed for introduction into the vector. Linear DNA from bacteriophage λ is circularized by joining the cos sites. This is then digested with the proper restriction enzyme, ligated to the insert DNA, and packaged into bacteriophage λ heads. The bacteriophage then introduces the recombinant DNA into host cells


in many vectors without inserts, and identifying clones containing insert DNA was a time-consuming, labor-intensive process. By the late 1970s, a second generation of plasmid cloning vectors with additional restriction sites and selection options was developed (Bolivar et al. 1977). The plasmid vector pBR322 was engineered to contain the genes for both ampicillin resistance and tetracycline resistance with one or more unique restriction sites in each resistance gene. Insertional inactivation takes place when the insert DNA is cloned into a restriction site inside a gene, disrupting that gene and destroying its function. See Fig. 3. The presence of two antibiotic resistance genes allows selection for transformation of the host using one antibiotic resistance gene and selection for recombinant plasmids by insertional inactivation of the second antibiotic resistance gene.

In the late 1970s, Joachim Messing developed a cloning system based on M13 phage, another virus that infects E. coli (Messing 1991). Messing improved his original M13 phage vectors by adding a polylinker, a short DNA sequence with several restriction sites in close proximity that increased cloning flexibility by allowing a choice of the restriction enzyme used to digest the insert DNA. The improved vectors also carried a fragment of the lacZ gene encoding a peptide fragment of beta-galactosidase. The lacZ gene fragment in the vectors enables certain strains of E. coli to produce a blue color when exposed to X-Gal (5-bromo-4-chloro-3-indolyl-beta-D-galactopyranoside). Plaques formed by phage with no DNA insert turn blue, while phage with an insertional inactivation of the lacZ gene produce white plaques. This blue-white selection protocol greatly increased the efficiency with which DNA could be cloned. The vector also contained a sequence complementary to a universal DNA sequencing primer that allowed easy sequencing of any DNA inserted into the polylinker.
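The two screening schemes just described reduce to simple truth tables, which the following Python sketch writes out. It is only an illustration of the logic in the text: the function names, the `'amp'`/`'tet'` labels, and the returned dictionary keys are hypothetical, and real plates show intermediate phenotypes that a Boolean model ignores.

```python
from typing import Dict, Optional


def pbr322_phenotype(has_plasmid: bool,
                     insert_in: Optional[str] = None) -> Dict[str, bool]:
    """Predict growth of a host cell on each antibiotic.

    insert_in is None (no insert), 'amp', or 'tet', naming the resistance
    gene that the insert disrupts by insertional inactivation.
    """
    return {
        # A resistance gene works only if the plasmid is present and intact there.
        "grows_on_ampicillin": has_plasmid and insert_in != "amp",
        "grows_on_tetracycline": has_plasmid and insert_in != "tet",
    }


def lacz_plaque_color(has_insert: bool) -> str:
    """Blue-white readout on X-Gal: an insert in the lacZ fragment gives white."""
    return "white" if has_insert else "blue"


if __name__ == "__main__":
    # Recombinant with the insert in the tetracycline gene:
    # selected on ampicillin, then identified by tetracycline sensitivity.
    print(pbr322_phenotype(True, insert_in="tet"))
    print(lacz_plaque_color(has_insert=True))
```

In the pBR322 case, the recombinant is the cell that grows on one antibiotic but not the other, which is why the second resistance gene acts as a reporter for the insert.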
Polylinkers, blue-white selection, and universal primers were soon incorporated into other vector systems as well.

In early experiments requiring expression of the cloned gene, expression relied on promoter elements associated with the gene from the original organism and present on the insert DNA (see references 7–16 in Helfman 1983). In the late 1970s, Patrick Charnay, Michel Perricaudet, Francis Galibert, and Pierre Tiollais created both plasmid and λ-phage vectors that allowed cloning and expression of cDNA fragments using the lac promoter from E. coli. See Fig. 4. Improvements of this system included vectors that allowed cloning of DNA in all three reading frames downstream of the lac promoter (Charnay et al. 1978) and addition of a Shine-Dalgarno sequence to increase translation of the mRNA (Backman and Ptashne 1978). In the mid-1980s, a series of vectors known as the pET vectors were developed that allowed strong, specific expression of cloned genes using the RNA polymerase from T7 bacteriophage (Rosenberg et al. 1987).

Some vectors were developed that combined qualities found in plasmids with those found in viral vectors. In the late 1970s, John Collins and Barbara Hohn developed cosmid vectors (Collins and Hohn 1978). These vectors are small plasmids containing the cos site from λ-phage that allow larger inserts than most plasmid vectors and efficient introduction of the recombinant plasmids into the host using phage infection. In the early 1980s, Jeffrey Vieira and Joachim Messing introduced key attributes of M13 phage vectors into a plasmid, producing the pUC series of plasmids (Vieira and Messing 1982). This allowed flexible cloning using different restriction enzymes and blue-white selection for plasmids containing insert DNA. Some pUC derivatives are phagemid vectors that carry the intergenic region of a filamentous phage, in this case the ori from M13. With the use of a helper plasmid or helper phage, these phagemid vectors produce large quantities of single-stranded DNA by rolling-circle DNA replication. Baculovirus systems that allowed expression of cloned genes in insect cells were developed in the early 1980s.
The baculovirus system allowed processing of mRNA and proteins in eukaryotic cells, resulting in high-level production of cloned eukaryotic proteins (Smith et al. 1983). The early 1980s also saw the development of yeast artificial chromosomes (YACs) (Murray and Szostak 1983), which allowed cloning of DNA inserts several orders of magnitude larger than is possible


Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of, Fig. 3 Selection using insertional inactivation. This figure illustrates the use of insertional inactivation of the lacZ gene to select for host cells harboring recombinant DNA containing both vector and insert. The vector contains both an antibiotic resistance gene and a lacZ gene with a polylinker near the beginning of the

gene. Insertion of DNA into the polylinker site inactivates the lacZ gene. Following transformation and plating on media containing the proper antibiotic and X-Gal, host cells without the plasmid are unable to grow, cells harboring plasmids without insert grow and turn blue from the activity of the lacZ gene, and cells harboring plasmids with insert grow but do not turn blue as they do not have a functional lacZ gene


Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of, Fig. 4 Expression vectors. Expression vectors have a strong promoter followed by a polylinker and a terminator. Directional cloning of cDNA insert followed by transformation and growth of an appropriate host results in the production of large quantities of protein. This protein can be isolated and analyzed

with plasmids or λ-phage-derived vectors. In the early 1990s, P1 vectors (Sternberg 1990), bacterial artificial chromosomes (BACs) (Shizuya et al. 1992), and P1-derived artificial chromosomes (PACs) (Ioannou et al. 1994) were developed. These vectors allowed cloning of DNA inserts larger than those accepted by λ-phage-derived vectors but not as

large as yeast artificial chromosomes. These vectors overcame many of the insert instability problems associated with yeast artificial chromosome vectors. The development of new vectors based on these classic vectors as well as specialized vectors allowing propagation of cloned DNA in a variety of hosts continues to expand the ease, flexibility, and power of DNA cloning technologies.

Two other major advances have greatly facilitated the cloning of virtually any DNA fragment and circumvented the need for laborious procedures to identify cells transformed with the desired insert DNA. First was the development of the polymerase chain reaction (PCR), allowing for the production of virtually any DNA segment in vitro. Second was the advent of genome sequencing, which provided complete sequences of all genes in a great many organisms, from bacteria to humans. It thus became possible to identify the DNA sequence of interest in a genome and use PCR to prepare a double-stranded DNA fragment containing that sequence. The PCR product can then serve as a DNA insert for ligation into a vector of choice (Guo and Bi 2002).

Other advances in addition to new cloning vectors also increased the ease and utility of cloning. Successful application of the original cloning techniques required the segment of DNA to be cloned, a vector into which the segment of DNA is inserted, enzymes that allowed in vitro insertion of the segment of DNA into the vector, and transformation of the vector with the inserted DNA into a host that replicated the vector along with the insert DNA. During the late 1990s, several techniques allowing in vivo manipulations of DNA were developed that simplified recombinant DNA production, decreased the time required to produce recombinant DNA molecules, and allowed greater flexibility in storing, manipulating, and analyzing the recombinant DNA (Copeland et al. 2001; Sawitzke et al. 2007; Narayanan and Chen 2011).
These systems rely on homologous recombination in vivo, eliminating the often troublesome requirement for proper restriction sites inherent in traditional cloning methods. These recombination-based techniques are especially


useful when working with very large inserts and have revolutionized the way cloning is done in certain areas of research. Altogether these developments have made cloning DNA a routine procedure in laboratories spanning the globe and allow researchers to ask and answer complex questions regarding life at the molecular level.


In all cases, cloning protocols require both a cloning vector and insert DNA. There are a variety of vectors available, each tailored to different desired outcomes of the cloning process. Insert DNA is commonly generated directly from genomic DNA, by converting mRNAs into cDNAs, or by the use of the PCR. Selection of the appropriate vector and determining how to generate insert DNA are key aspects of the cloning process.

General Strategies for DNA Cloning

The production of recombinant DNA requires the covalent joining of two or more linear molecules of DNA to form a single recombinant molecule. One molecule is the vector that allows a host cell to replicate the DNA so it can be preserved through time and so quantities of DNA large enough to be analyzed can be purified. The vector may also contain sequences that facilitate the cloning process or analysis of the cloned product. The other molecule or molecules of DNA are called the insert DNA and contain the DNA the researcher is interested in preserving or analyzing. See Fig. 1.

Many cloning protocols involve the digestion of both the vector and insert with a restriction enzyme or enzymes followed by ligation and transformation. Restriction enzymes cut DNA at specific recognition sequences. They can produce either blunt ends or short single-stranded overhangs called sticky ends. Ligation joins the ends of DNA molecules together via covalent bonds. Transformation introduces the DNA into the host cells and generally requires selection to identify host cells containing the recombinant DNA.

Some cloning techniques rely on recombination to produce the desired mixing of DNA from different sources. The recombination may be site-specific recombination or it may be homologous recombination. Restriction enzymes and ligation are not required. Recombineering techniques use homologous recombination in vivo and therefore do not require transformation of the final recombinant product into host cells, which facilitates manipulation of clones with large inserts. Some type of selection is required to identify host cells harboring the desired recombinant DNA.
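To make the digestion step concrete, here is a minimal Python sketch of a restriction digest on a linear top-strand sequence. The `Enzyme` class, the `digest` and `overhang` helpers, and the test sequence are hypothetical names introduced for illustration; the recognition sites and cut positions for EcoRI (G^AATTC, leaving a 5′ AATT overhang) and SmaI (CCC^GGG, blunt) are the standard ones. The sketch assumes palindromic sites and ignores the bottom strand, methylation, and circular templates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Enzyme:
    name: str
    site: str        # recognition sequence on the top strand, 5'->3'
    cut_offset: int  # top-strand cut position, counted from the start of the site


ECORI = Enzyme("EcoRI", "GAATTC", 1)  # G^AATTC, sticky ends
SMAI = Enzyme("SmaI", "CCCGGG", 3)    # CCC^GGG, blunt ends


def digest(seq: str, enzyme: Enzyme) -> List[str]:
    """Cut a linear top-strand sequence at every recognition site."""
    fragments, start = [], 0
    pos = seq.find(enzyme.site)
    while pos != -1:
        cut = pos + enzyme.cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(enzyme.site, pos + 1)
    fragments.append(seq[start:])
    return fragments


def overhang(enzyme: Enzyme) -> str:
    """Single-stranded overhang left by a palindromic site ('' means blunt)."""
    # On a palindromic site the bottom strand is cut at the mirror position.
    mirror = len(enzyme.site) - enzyme.cut_offset
    lo, hi = sorted((enzyme.cut_offset, mirror))
    return enzyme.site[lo:hi]


if __name__ == "__main__":
    seq = "AAGAATTCGGGAATTCTT"  # hypothetical sequence with two EcoRI sites
    print(digest(seq, ECORI), overhang(ECORI), repr(overhang(SMAI)))
```

Two ends carrying the same overhang can base-pair before ligase seals them, which is why vector and insert are usually cut with the same enzyme or with enzymes that leave matching ends.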

Common Characteristics of Cloning Vectors

Cloning vectors contain all the information needed to maintain DNA within a host cell plus additional sequences that facilitate the cloning process. All vectors have sequences to initiate replication in the host cells allowing new copies of the DNA to be synthesized. Many vectors also have additional sequences that are required for propagation in their specific host. A marker that allows selection for host cells that contain the vector is highly desirable and found in virtually all cloning vectors. In addition, most modern cloning vectors contain sequences that allow selection for vectors with insert DNA and facilitate insertion of DNA fragments. Some vectors also have sequences that allow the production of proteins from the inserted DNA, identification of regulatory elements, transfer of inserted DNA between different vectors, and tagging of proteins to facilitate isolation or cellular detection.

Sequences Required for Replication and Maintenance in the Host Cell

At its core, cloning a particular segment of DNA is done to maintain that segment of DNA through time and to produce large quantities of DNA for further analysis or use. This requires a vector that replicates within host cells. Any vector must contain an origin of DNA replication that allows it to be propagated in the host cells. Bacteriophage and viral vectors generally require additional sequences that direct processing of the viral DNA and packaging into viral particles. Artificial chromosomes usually contain sequences that


ensure their stability in the cell and correct segregation at host cell division. In some cases, cells that contain the recombinant DNA must also contain a second DNA construct that encodes proteins required for propagation of the vector DNA. These additional constructs are referred to as “helper” plasmids or phages.

Selectable Markers

The processes required for successful cloning of DNA happen at a relatively low frequency, and identifying these events in an efficient manner requires a method for selecting the desired events from undesired events. Ideally this selection process will either prevent the growth of host cells that do not contain the desired vectors with inserted DNA or allow easy visualization of the cells that do contain the desired recombinant DNA.

Selection for Transformation

The number of host cells in a transformation mixture that have successfully acquired the vector is relatively small compared to the number of host cells that have not. Detection of these rare events involves eliminating cells that have not acquired the vector. An essential component of cloning vectors is a gene that facilitates this process either by providing the cell with resistance to an antibiotic or allowing the cell to produce an essential nutrient missing from the growth medium. Antibiotic resistance genes produce proteins that allow cells containing these genes to grow on a toxic substance. Following transformation, only the cells that have taken up a vector encoding antibiotic resistance will be able to grow on media supplemented with that antibiotic. Antibiotics such as ampicillin, tetracycline, kanamycin, and streptomycin are commonly used to select for transformed cells and to demonstrate the construction of recombinant DNA plasmids. Host cells with a mutation in a gene required to produce an essential nutrient, for example, an amino acid, can grow on media containing that nutrient but fail to grow on media without that nutrient. For example, yeast artificial chromosomes (YACs) commonly contain the TRP1 and URA3 genes for selection to ensure that both arms of the chromosome are present (Burke

et al. 1987). The TRP1 gene enables yeast cells containing the YAC to grow on media lacking the amino acid tryptophan, while the URA3 gene is required for growth on media lacking uracil.

Sites for DNA Insertion

Traditional cloning vectors contain unique restriction enzyme recognition sites that allow insertion of DNA using specific restriction enzymes. Most modern vectors contain a series of restriction sites in close proximity known as a polylinker or a multiple cloning site (MCS). Polylinkers facilitate cloning by increasing the number of restriction enzymes that can be used to digest the insert DNA. In addition, they allow directional cloning, in which digestion of the vector and insert with two different enzymes results in insertion of the cloned DNA into the vector in only one orientation. Today some protocols use site-specific recombination between the vector and insert DNA to assemble the recombinant DNA, eliminating the need for specific restriction sites on the vector. One popular system, marketed by Invitrogen as Gateway® vectors, utilizes modified att recombination sites and specific recombinase enzymes to mediate insertion of insert DNA into the vector (Walhout et al. 2000). Site-specific recombination systems using the Cre-lox and Flp-FRT systems have also been developed. Plasmids in these systems contain the short sequences needed for site-specific recombination rather than polylinkers. The insert DNA is often generated by PCR using primers containing the site-specific recombination sequences. Alternatively, short linkers containing these site-specific recombination sequences can be ligated to the ends of the insert. The insert is then introduced into a donor vector in vitro via site-specific recombination. From this donor vector the insert can easily be recombined into a variety of specialized vectors containing the site-specific sequence linked to sequences that allow further analysis, such as promoters or tags (Walhout et al. 2000).
Selection for Insert

When creating recombinant DNA molecules, it is common for the transformation mixture to contain


vectors that do not contain insert DNA. This can happen because some vector molecules escaped digestion with the restriction endonuclease or because the vector’s own ends were ligated together without incorporating insert DNA. In some cases the frequency of recovering host cells with recombinant DNA can be as low as 10^6 to 10^7/ml/µg of DNA (Bolivar et al. 1977). Vectors that make it easy to identify events in which insert DNA was incorporated into the vector provide a clear advantage over vectors that do not. Common approaches involve insertional inactivation of a marker such as an antibiotic resistance gene or a gene that produces a visual marker like color. See Fig. 3. Vectors that contain two independent antibiotic resistance genes, each with its own cloning sites, allow for both selection of transformed host cells and identification of specific clones containing insert DNA. In recombinant DNA molecules, the insert interrupts one of these genes, making the host sensitive to the corresponding antibiotic; this is called insertional inactivation. For example, the cloning vector pBR322 contains genes that confer resistance to both ampicillin and tetracycline (Bolivar et al. 1977). Host cells transformed with pBR322 carrying insert DNA cloned into a restriction site in the tetracycline resistance gene are resistant to ampicillin but are killed by tetracycline. Cells containing the plasmid with no insert are able to grow in the presence of both antibiotics. Colorimetric selection is a more efficient and more commonly used procedure for identifying clones with recombinant DNA. Blue-white selection uses plasmid vectors containing a portion of the lacZ gene to produce an active β-galactosidase enzyme in host cells. The activity of this enzyme produces a blue product from the colorless substrate X-Gal. Introduction of insert DNA into the cloning site in the lacZ gene inactivates the enzyme, producing white colonies. See Fig. 3.
This system is found in some M13 vectors, in pUC vectors, and in pBluescript vectors. A second example of colorimetric selection comes from yeast. When insert DNA interrupts the SUP4 gene on pYAC2, this interruption causes the yeast colonies to turn red (Burke et al. 1987).
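The two screening schemes described above reduce to simple decision rules. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the insert was cloned into the tetracycline resistance gene of pBR322 and that each colony has already been scored on the relevant plates.

```python
def classify_pbr322_colony(grows_on_amp: bool, grows_on_tet: bool) -> str:
    """Interpret plating results for DNA cloned into the tet gene of pBR322."""
    if grows_on_amp and grows_on_tet:
        # Both resistance genes are intact, so nothing interrupted tet.
        return "vector without insert"
    if grows_on_amp:
        # Ampicillin resistant but tetracycline sensitive: insertional inactivation.
        return "recombinant"
    if grows_on_tet:
        # Not predicted by the scheme in the text; flag for follow-up.
        return "unexpected phenotype"
    return "no plasmid"


def classify_blue_white(colony_color: str) -> str:
    """Interpret a blue-white screen on plates containing X-Gal."""
    color = colony_color.strip().lower()
    if color == "blue":
        # Active beta-galactosidase: the lacZ fragment is uninterrupted.
        return "vector without insert"
    if color == "white":
        # Insert DNA has disrupted the lacZ fragment.
        return "recombinant"
    raise ValueError(f"unrecognized colony color: {colony_color!r}")


if __name__ == "__main__":
    print(classify_pbr322_colony(grows_on_amp=True, grows_on_tet=False))
    print(classify_blue_white("white"))
```

In both schemes a putative recombinant would still normally be confirmed, for example by restriction digestion and gel electrophoresis of the isolated plasmid.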


Cloning of PCR products can be particularly efficient. Blunt vectors available from Invitrogen Corp. are provided as linear DNA fragments with blunt ends that can be joined to a blunt-ended PCR product insert. A vector that re-circularizes without incorporation of an insert generates an active ccdB gene from the F plasmid. The CcdB protein kills the host cell by inhibiting DNA gyrase which stops DNA replication. Thus, only those host cells transformed with vector containing insert can grow. Transformed cells containing the desired recombinant plasmid can be identified easily by isolating plasmids from the transformed cells, releasing the insert DNA by digestion with appropriate restriction endonucleases, and analyzing the insert DNA size by gel electrophoresis (Bernard et al. 1994). Recombineering protocols use homologous recombination rather than restriction enzymes and DNA ligase to generate recombinant DNA. Some of these protocols use a two-step selection/counterselection technique to identify cells with the desired recombinant DNA (Copeland et al. 2001; Sawitzke et al. 2007). In the first step, the insert DNA contains a cassette including both an antibiotic resistance marker and a conditionally lethal gene. For example, the cassette might contain the neo antibiotic resistance gene and the sacB gene which kills cells in the presence of sucrose. This insert is introduced to the site at which the DNA is being manipulated using homologous recombination. Culture medium containing neomycin is used to select cells in which the cassette has integrated into the target site. In the second step, homologous recombination is used to insert DNA that adds to, removes, or changes a few bases of the original DNA at the target site. This removes the cassette introduced in the first step and alters the DNA in the desired manner. The conditionally lethal gene is used for counterselection in this second step. 
Growing cells in the proper conditions kills those cells in which DNA modification and cassette removal did not take place, allowing only cells with the desired insert to grow. In this example the sacB gene is used to kill cells in which the cassette was not replaced by the desired insert sequences by growing cells on medium containing sucrose.
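The two-step logic of selection followed by counterselection can be written out explicitly. This is a minimal sketch under simplifying assumptions: each cell is described only by whether the neo-sacB cassette is present at the target site, survival is treated as all-or-nothing, and escape mutants (for example, cells with an inactivated sacB gene) are ignored. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Cell:
    label: str
    has_cassette: bool  # True if the neo + sacB cassette sits at the target site


def survives(cell: Cell, neomycin: bool = False, sucrose: bool = False) -> bool:
    """Return True if the cell can grow on the given medium."""
    if neomycin and not cell.has_cassette:
        return False  # no neo gene, so the antibiotic kills the cell
    if sucrose and cell.has_cassette:
        return False  # sacB is lethal in the presence of sucrose
    return True


def select(cells: Iterable[Cell], neomycin: bool = False,
           sucrose: bool = False) -> List[Cell]:
    """Keep only the cells that survive on the given medium."""
    return [c for c in cells if survives(c, neomycin=neomycin, sucrose=sucrose)]


if __name__ == "__main__":
    # Step 1: select for integration of the cassette.
    step1 = select(
        [Cell("unmodified", False), Cell("cassette integrated", True)],
        neomycin=True,
    )
    print([c.label for c in step1])

    # Step 2: counterselect against cells that kept the cassette.
    step2 = select(
        [Cell("cassette retained", True), Cell("cassette replaced", False)],
        sucrose=True,
    )
    print([c.label for c in step2])
```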


Cloning Vectors

The type of vector selected is a key factor in the cloning process. The vector determines what host or hosts are capable of replicating the cloned DNA. The upper size of insert DNA also varies with vector from around 20 kb for plasmids to over 2,000 kb for YACs (Ramsay 1994). Specialized features of the vector also determine the types of analyses available following successful cloning.

Plasmids

Plasmids are the workhorse cloning vector in recombinant DNA production. Plasmids are small circular DNA molecules that replicate autonomously in their host organism. While species from all three domains of life can host plasmids (del Solar et al. 1998), their major use in cloning is as vectors in bacteria or yeast. Plasmids are easy to work with in vitro and the small size of plasmids makes it easy to separate the cloned DNA from the host’s genomic DNA. A major limitation of plasmids is the maximum size of insert DNA, with most inserts limited to less than 20 kb because cloning efficiency and plasmid stability decrease with plasmid size. Plasmid vectors contain the genes and DNA sequences needed to control and initiate plasmid replication independent of the host’s chromosomal replication cycle. The actual replication of the DNA is then carried out by the host cell’s replication machinery. In bacteria, the sequences that initiate replication are typically identified as the ori sequences. The ori sequence of a plasmid determines the number of times the DNA replicates per cell cycle. This determines how many copies of the plasmid are present in each cell and is an important determinant of yield when isolating plasmid DNA. Because the plasmid replicates autonomously, each host cell can contain many copies of the plasmid. The ori sequence also determines compatibility with other plasmids. This determines which plasmids can coexist in the same cell and is important in experiments requiring host cells that contain different plasmids. The colE1-like ori system, used in pBR322, pUC, and pBluescript vectors, is one of the most common systems found in plasmid vectors that

use E. coli as a host. This ori initiates theta-type replication producing double-stranded copies of the plasmid (Camps 2010). The number of plasmids per cell produced by this system ranges from low copy (about 20 plasmids per cell) with pBR322 to high copy (around 500 plasmids per cell) with pBluescript. Because both pBR322 and pBluescript share similar origins of replication, they are incompatible and cannot be simultaneously maintained in the same cell. A variety of other ori systems are also found in plasmid cloning vectors (del Solar et al. 1998).

Bacteriophage Vectors

Bacteriophages, commonly called phages, are viruses that infect bacteria. Phages transform bacteria efficiently, can carry larger inserts than plasmids, and replicate insert DNA along with the viral vector DNA. High transformation efficiencies increase the chance of recovering a clone containing the desired recombinant DNA. Larger inserts allow cloning of large eukaryotic genes and their regulatory elements, more rapid gene mapping, and quicker identification of a gene via positional cloning using chromosome walking or jumping. The larger insert size also reduces the total number of clones needed to ensure that a DNA library contains the entire genomic DNA from a species. Replication of the viral vector and its insert facilitates the isolation of large quantities of DNA needed for additional analysis of the insert. Bacteriophage vectors also allow rapid production of large quantities of DNA by making use of the phage replicative processes. Cloning vectors have been developed based on a number of different phages, including M13, λ, and P1.

Bacteriophage M13-Derived Vectors

Many common cloning vectors are based on the M13 phage. M13 is a filamentous phage, which allows it to package extra DNA in its viral particle.
Its life cycle includes a circular double-stranded DNA (dsDNA) replicative form found in host cells during replication and a circular single-stranded DNA (ssDNA) infective form found in the viral particle. These vectors require both an sso site (single-strand origin) to initiate synthesis of dsDNA from the original single-stranded phage genome and a dso site (double-strand origin) to initiate rolling-circle replication for the production of new single-stranded genomes. They also require genes encoding replicative and coat proteins (Weigel and Seitz 2006). Initial cloning is accomplished by producing circular dsDNA containing the viral genes needed for viral replication and the insert DNA. This recombinant DNA is transformed into host cells where the viral replication system then causes the DNA to begin replicating as a virus. During this process rolling-circle replication of the double-stranded phage DNA produces ssDNA that is circularized and packaged into viral particles. Because it does not lyse the host cells, M13 can be used to produce large quantities of recombinant viral DNA. This ssDNA provides an excellent template for DNA sequencing, site-directed mutagenesis, or production of strand-specific DNA probes. Alternatively, the viral particles can be stored or used to infect new cells. Some vectors combine the replicative sequences of plasmids and phage. Phagemids contain ori sequences to maintain the clone as a plasmid and dso sequences to initiate rolling-circle replication. Bacterial hosts containing a phagemid and infected with a helper phage that encodes phage replication proteins produce a large quantity of ssDNA by rolling-circle replication of the phagemid (Chauthaiwale et al. 1992).

Bacteriophage λ-Derived Vectors

Vectors based on bacteriophage λ are also common choices for cloning. λ-Phage is a double-stranded DNA bacteriophage that infects E. coli. The oriλ sequence is required to initiate replication, and a number of genes encoding replicative proteins and coat proteins are also required (Weigel and Seitz 2006). The phage DNA is replicated by a combination of theta and rolling-circle replication that results in linear dsDNA (Chauthaiwale et al. 1992). The cos sequence in the linear replication product allows circularization of the linear genome following infection and packaging of DNA into phage heads.
Infection by λ-phage requires genes on either end of the phage’s linear genome. However, the DNA between these two “arms” is not essential for this process and can be replaced with insert


DNA. Cloning involves ligating insert DNA between the two arms, packaging the recombinant DNA into phage particles in vitro, and infecting E. coli. See Fig. 2. Phage replication then produces long double-stranded recombinant DNA repeats joined together at cos sites. Processing cleaves the repeated DNA at the cos sites and packages the recombinant DNA into phage particles that eventually lyse the cell, releasing the phage. The phage particles can be stored, used to isolate recombinant DNA, or used to infect new host cells. Because the λ-phage head is icosahedral, it packages a fixed amount of DNA, limiting both the minimum and maximum insert DNA size. λ-Insertion vectors allow cloning of small inserts into the nonessential region, while λ-replacement vectors remove the entire nonessential region and allow inserts of around 23 kb in length (Chauthaiwale et al. 1992). Cosmid vectors combine the efficiency of λ-phage transformation with the benefits of plasmid replication and DNA isolation. Cosmids contain ori sequences to maintain the clone as a plasmid and the cos sequence of λ-phage, which allows packaging of recombinant DNA into phage particles for infection of E. coli host cells. Once inside the cells, the cos sites cause the DNA to circularize and form a new recombinant plasmid (Collins and Hohn 1978). Cosmids allow larger DNA inserts than λ-phage vectors, with the maximum size around 48 kb (Sternberg 1990). λ-Zap vectors combine features of λ-vectors with phagemids. Recombinant DNA in a λ-Zap vector is packaged into a λ-phage particle and transformed into the host cells. Conversion of the linear λ vector with the insert DNA to a circular plasmid lacking the λ vector sequences is accomplished in the host cell using elements of the phage f1 replication system. The λ vector contains insert DNA cloned into the polylinker of the Bluescript vector. A helper phage containing f1 replicative genes is used to infect the host and provide phage f1 replicative proteins.
These proteins use the f1 origin of replication sites to produce circular ssDNA that contains the pBluescript vector sequences as well as any inserted DNA. The intact Bluescript origin of replication can then be used to produce dsDNA


plasmids. The f1 origin of replication in the intact plasmid also makes it a phagemid, so initiation of rolling-circle replication can produce ssDNA for further analysis (Short et al. 1988).

Bacteriophage P1-Derived Vectors

The P1 cloning system is based on P1 phage and allows cloning of DNA inserts up to 100 kb. It has higher transformation efficiencies and fewer instability problems than YAC vectors. Prior to cloning, this vector is maintained as a plasmid using the colE1-like ori system from pBR322. Following production of recombinant DNA, a pac site in the vector initiates packaging of the DNA into phage particles. Packaging continues until the head is filled, at which point the DNA is cleaved. The maximum size of the DNA packaged is limited by the size of DNA accommodated by the icosahedral phage head. The P1 phage then injects the recombinant DNA into the host cell where it is circularized and maintained as a single copy by the P1 plasmid replicon (Sternberg 1990).

Baculovirus Vectors

Baculovirus systems enable efficient expression of eukaryotic genes in higher eukaryotic cells. This allows for gene expression, protein folding, and posttranslational modification to take place in a eukaryotic cellular environment, which often leads to the production of higher levels of active protein than can be obtained when eukaryotic genes are expressed in prokaryotic hosts. The host cells for these vectors are insect cells from Lepidoptera species. Baculovirus vectors are bacterial plasmids containing a portion of the baculovirus DNA genome and a polylinker sequence following a strong baculovirus promoter such as the polyhedrin promoter. Following insertion of DNA into the polylinker, the recombinant DNA is cotransformed into insect cells along with an infective baculovirus. The cells are grown in culture where homologous recombination produces viral particles containing the baculovirus genome with the insert DNA.
Recombinant viral particles can be selected based on viral plaque phenotype or by other screening methods. Infection of cultured cells with recombinant viral particles results in expression of the cloned gene. Because the

protein is only expressed during the late stages of infection, large quantities of protein are produced. The rodlike structure of the baculovirus particle allows expansion of the viral particle to accommodate large insert DNA, perhaps up to 100 kb (Miller 1989).

Artificial Chromosomes

Artificial chromosome vectors dramatically increase the possible size of insert DNA compared to plasmid and phage vectors. Yeast artificial chromosomes (YACs) introduce the possibility of cloning inserts of around 2,000 kb. Most YAC vectors contain sequences that enable them to be maintained as plasmids in E. coli cells, as well as the sequences required for replication and maintenance of a chromosome in yeast. The YAC vector DNA can be isolated from E. coli and digested with restriction enzymes to produce left and right chromosomal arms. Ligation of insert DNA between the two arms forms a contiguous linear chromosome which is transformed into yeast. The YAC also contains a selectable marker for each chromosomal arm as well as a marker that allows for the selection of clones containing insert DNA (Burke et al. 1987). YACs include three elements that enable them to be replicated and maintained in yeast: an origin of replication, a centromere, and telomeres (Murray and Szostak 1983; Monaco and Larin 1994). The origin of replication in yeast, also called the autonomously replicating sequence (ARS), is required for the initiation of replication (Hsiao and Carbon 1979; Stinchcomb et al. 1979). The centromere (CEN) contains a series of AT-rich repeats and additional chromatin features required for spindle attachment and segregation of the chromosomes during mitosis or meiosis and ensures that the associated DNA is not lost during these processes (Clarke and Carbon 1980; Dalal and Bui 2010). Telomeres (TEL) are repeating sequences found at the ends of linear chromosomes that protect and maintain these ends (Szostak and Blackburn 1982; Chan and Chang 2010).
Because of their ability to handle large inserts, YACs are the vector of choice for many genome projects. The large insert size is highly beneficial


for genomic mapping projects and allows cloning of large genes along with their regulatory elements for functional studies in organisms such as mice (Monaco and Larin 1994). However, problems with YACs include low transformation efficiency, issues with chimeric inserts containing DNA from two different genomic locations, instability resulting in loss of inserted DNA, and difficulties separating the cloned DNA from the host’s endogenous genetic material, making it difficult to isolate large quantities of insert DNA (Sternberg 1990; Monaco and Larin 1994). Other artificial chromosomes were developed to overcome the pitfalls associated with YAC clones. Bacterial artificial chromosomes (BACs) use an F-factor-based system to maintain a single copy of the recombinant DNA in each host cell. They have better insert stability than YACs but often have low DNA yields (Monaco and Larin 1994). PACs combine features of P1 vectors and BACs. Rather than requiring packaging into P1 heads for transformation, these vectors are transformed by electroporation, eliminating the size limit imposed by packaging and allowing inserts of up to 300 kb (Monaco and Larin 1994). The advent of in vivo manipulation of DNA sequences found on BACs, P1 clones, or PACs using recombineering has made these vectors valuable for preparing large insert DNA for the production of transgenic animals to study gene expression (Narayanan and Chen 2011). While mammalian artificial chromosomes (MACs) have been produced, they are not commonly used as cloning vectors. Work to develop MACs for use in human gene therapy continues (Lipps 2003).
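The effect of insert size on library size can be made concrete with the standard estimate N = ln(1 − P) / ln(1 − f), where f is the fraction of the genome carried by one insert and P is the desired probability that any given sequence is represented. The formula is not given in the text above; it is brought in here as a widely used approximation that assumes randomly distributed, full-length inserts. The genome size below is likewise an assumption (roughly a human-sized haploid genome), and the insert sizes are the approximate upper limits quoted in this entry.

```python
import math


def clones_needed(insert_kb: float, genome_kb: float,
                  probability: float = 0.99) -> float:
    """Estimate the number of clones N = ln(1 - P) / ln(1 - f)."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    if not 0.0 < insert_kb < genome_kb:
        raise ValueError("insert must be smaller than the genome")
    fraction = insert_kb / genome_kb
    # log1p(-x) computes ln(1 - x) accurately even when x is very small.
    return math.log1p(-probability) / math.log1p(-fraction)


# Approximate maximum insert sizes (kb) as quoted in this entry.
INSERT_LIMITS_KB = {
    "plasmid": 20,
    "lambda replacement vector": 23,
    "cosmid": 48,
    "P1": 100,
    "PAC": 300,
    "YAC": 2000,
}

# Assumption for illustration: a genome of about 3 x 10^9 bp.
GENOME_KB = 3_000_000

if __name__ == "__main__":
    for vector, insert_kb in INSERT_LIMITS_KB.items():
        n = math.ceil(clones_needed(insert_kb, GENOME_KB))
        print(f"{vector:>26}: about {n:,} clones for 99% coverage")
```

Real libraries rarely reach the maximum insert size for every clone, so estimates of this kind are best read as lower bounds.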

Additional Specialized Features of Cloning Vectors

Expression Systems

One common goal of cloning is to express the cloned gene, allowing screening or purification and analysis of the protein. This requires a vector containing both promoter and terminator sequences. The gene of interest is inserted between these sequences in the proper orientation, a process greatly facilitated by the presence of


a polylinker that allows directional cloning in a flexible manner. The promoter is recognized by the host’s transcriptional machinery producing RNA using the insert as a template. The terminator sequence indicates where the production of RNA should stop. The RNA may also be isolated and used for further analysis. The resulting mRNA may also be translated in vivo to produce protein. See Fig. 4. In some cases the vector includes sequences that become part of the mRNA and contain information needed for the initiation of protein production, for example, the Shine-Dalgarno sequence for translation in bacteria. Efficient translation of eukaryotic genes in bacterial hosts often requires bacteria modified to produce greater amounts of tRNAs to translate codons not commonly found in bacterial genes. Use of bacteria as the host for expression is common, and a range of vectors using several common inducible promoters are available. The use of inducible promoters allows growth of the host cells to high density prior to the initiation of protein production and aids in producing large quantities of protein. Many of the inducible promoters used in E. coli contain operator sequences from the lac operon. The Lac repressor protein binds to these sequences, greatly reducing expression until an inducer such as IPTG is added. Another common expression system in E. coli uses bacteriophage T7 promoter sequences and an inducible phage T7 RNA polymerase to allow robust expression under tight control (Terpe 2006). In some cases expression systems using hosts other than bacteria offer advantages. Baculovirus vectors allowing protein production in insect cell culture are quite popular. Because the protein is expressed in a eukaryotic host, it is more likely to undergo correct posttranslational modification, a process that does not happen in bacterial hosts. Vectors for expressing genes in yeast, filamentous fungi, and mammalian cell cultures are also available (Schmidt 2004). 
Yeast two-hybrid systems and phage display systems extend cloning and expression technology to allow the study of the interactions between proteins and other molecules (Fields 2009; Li and Caberoy 2010).
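One practical check before expressing a eukaryotic gene in a bacterial host is to scan the coding sequence for codons that the host uses infrequently. The sketch below is only an illustration: the set of rare codons is an assumption chosen for the example rather than an authoritative list for any particular strain, and the function name is hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumption: an illustrative set of codons often described as rare in E. coli.
# Replace it with a codon-usage table appropriate for the actual host strain.
RARE_CODONS_ASSUMED: FrozenSet[str] = frozenset(
    {"AGA", "AGG", "ATA", "CTA", "CCC", "CGA"}
)


def find_rare_codons(
    cds: str, rare: FrozenSet[str] = RARE_CODONS_ASSUMED
) -> List[Tuple[int, str]]:
    """Return (codon number, codon) pairs for rare codons in an in-frame CDS."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA notation
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    hits = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        if codon in rare:
            hits.append((start // 3 + 1, codon))  # codon numbers start at 1
    return hits


if __name__ == "__main__":
    print(find_rare_codons("ATGAGGCTGAGATAA"))
```

A sequence with many hits, especially clustered ones, might call for a host strain supplying the corresponding tRNAs or for a recoded synthetic gene.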


Promoter-Probe and Terminator-Probe Systems

Promoter-probe and terminator-probe vectors allow the identification of DNA sequences that initiate or terminate transcription. Promoter-probe vectors include a polylinker upstream of a reporter gene that lacks the regulatory sequences needed to initiate transcription. Cloning DNA into the polylinker of these vectors results in the expression of the reporter gene if the cloned DNA contains promoter elements. The reporter gene produces a protein that is easily identified, often changing the color of the cell or producing a fluorescent protein. Terminator-probe vectors include a polylinker between the promoter and a reporter gene. Cloning DNA into the polylinker will reduce or eliminate expression of the reporter gene if the cloned DNA contains transcription terminator elements (Ruiz-Cruz et al. 2010).

Tagging Systems

Some vectors contain the sequence for additional amino acid residues or complete proteins which become part of the expressed protein. These tags can increase the solubility of the expressed protein and aid in its purification and often include sequences that allow removal of the tag following purification (Malhotra 2009). Examples of tagging sequences include His tags and GST tags. Vectors encoding fluorescent protein tags allow in vivo study of many aspects of protein structure and function including protein expression, movement, localization, turnover, and interactions (Chudakov et al. 2010). A variety of vectors with fluorescent tags based on green fluorescent protein (GFP) are available.

Shuttle Vectors

Vectors that include origins of replication from two different hosts are able to replicate in both hosts and to “shuttle” back and forth between the two hosts. This allows the use of well-established and relatively easy procedures for transformation, selection, and amplification of the plasmid in a simple host such as E. coli.
The plasmid can then be transformed into the less amenable experimental host of interest for additional experimentation or manipulation. If needed, the plasmid can

be isolated from the second host and again transformed into the original host for further analysis or manipulation.

Key Enzymes Used in Cloning

Restriction Endonucleases

Restriction endonucleases, also called restriction enzymes, are enzymes that recognize specific sequences in double-stranded DNA and cleave the DNA by hydrolyzing phosphodiester bonds in both strands in or near that recognition sequence. The major in vivo function of these enzymes, along with partner DNA methyltransferases, is to protect bacteria and archaea from foreign DNA. The methyltransferase modifies the recognition sites in the cell’s DNA, protecting it from cleavage by the restriction enzyme. The restriction enzyme then cleaves any unmethylated foreign DNA, for example, from an invading bacteriophage. This helps the cell maintain the integrity of its genomic information. Traditional cloning uses restriction enzymes to cut two or more molecules of DNA at specific sites so they can be joined to form a recombinant molecule. Most restriction enzymes used in cloning are called type II restriction enzymes (Pingoud et al. 2005). Restriction enzyme-catalyzed cleavage of the phosphodiester backbone of DNA produces two ends, one with a 3′ hydroxyl group and one with a 5′ phosphate group. The type II enzymes typically recognize palindromic sequences four to eight base pairs in length and can produce either blunt ends or sticky ends. Blunt ends are produced when the enzyme cleavage site is in the middle of the recognition palindrome, resulting in cleavage of both strands at the same location. Sticky ends are produced when the enzyme cleavage sites are offset in the recognition palindrome, resulting in offset cleavage and short, single-stranded overhangs. A sticky end can have either a 5′ overhang or a 3′ overhang. The overhang varies in length but typically has a length of five or fewer bases. Sticky ends with identical overhangs can hydrogen bond with each other to form base pairs and are called compatible ends. Compatible ends facilitate cloning because the hydrogen


bonding between the overhanging bases produced when DNA from two sources is cleaved with the same restriction enzyme increases ligation efficiency.

Ligases

DNA ligases catalyze the formation of a phosphodiester bond between a 3′ end and a 5′ end of DNA (Tomkinson et al. 2006). In cells, DNA ligases serve a variety of functions in DNA replication, repair, and recombination. For example, a DNA ligase joins lagging strand Okazaki fragments during DNA replication. Traditional cloning uses DNA ligase to join the ends of DNA molecules from two sources, producing a recombinant DNA molecule. Some protocols to create cDNA libraries from small noncoding RNA make use of RNA ligase to add short RNA adaptors to the ends of RNA molecules. The use of T4 RNA ligases allows specific addition of one linker to the 3′ end of an ssRNA molecule followed by the addition of a different linker to the 5′ end of the ssRNA molecule. This allows the creation of a cDNA from the RNA using primers for these linkers and reverse transcription protocols (Harbers 2008).

Recombinases

Some modern cloning protocols and vectors have eliminated the need for manipulation of DNA using restriction enzymes and ligases. Instead, these methods rely on recombination to produce the desired DNA molecules. This requires the use of enzymes that catalyze the desired recombination between the vector and insert DNA. Site-specific recombination can be used in vitro to add insert DNA to a vector or to move insert DNA between vectors. The Gateway® system uses modified att sites for recombination. Recombination into the donor vector is carried out by two proteins, the λ-phage integrase (Int) and the bacterial integration host factor (IHF). Recombination from the donor vector into the destination vectors requires these same two proteins plus the λ-phage excisionase (Xis). In other systems the Cre recombinase is used to recombine lox sites, or Flp recombinase is used to recombine FRT sites. These site-specific recombinase


enzymes are members of the tyrosine recombinase family of enzymes. They use active site tyrosine residues to catalyze transesterification reactions resulting in recombination at the specific sites in the DNA (Grindley et al. 2006). Hybrid zinc-finger recombinase enzymes engineered to recognize and recombine novel sequences are also being developed (Prorocic et al. 2011). Recombineering is based on homologous recombination between relatively short sequences with about 50 bases of homology to produce novel recombinant DNA (Narayanan and Chen 2011). Insert DNA is located between two of these short sequences, which direct recombination into the host genome. This is carried out in vivo, which facilitates cloning large molecules of DNA that are typically difficult to clone in vitro. While a variety of recombination-based cloning systems have been developed, two predominate. The λ-Red cloning system is based on λ-phage Red recombination and uses three λ-phage homologous recombination proteins. The Gam protein inhibits E. coli proteins that normally degrade linear DNA, allowing linear target sequences to be introduced into the host without immediate degradation. The Exo protein prepares the ends of the linear DNA by producing the 3′ overhangs needed for homologous recombination. The beta protein then binds to the single-stranded overhang produced by Exo and promotes its annealing to homologous sequences. This annealing presumably takes place with ssDNA generated as part of the replication process. Completion of replication finishes the recombination process. Since the production of recombinant DNA takes place in vivo, the gam, exo, and bet genes are expressed in the host cell either via a plasmid or a prophage. Inducing expression of these genes and introducing linear DNA homologous to sequences on a vector already present in the host result in recombination between the linear DNA and the vector, producing the desired recombinant DNA (Sawitzke et al. 2007).
ET cloning is similar to λ-Red cloning but is based on Rac prophage RecE recombination. It uses Rac prophage recombination proteins and the Gam protein from λ-phage. The prophage proteins RecE and RecT have functions analogous to the Exo and Beta proteins in the λ-phage recombination system (Copeland et al. 2001).

Polymerases DNA polymerases synthesize DNA in a 5′→3′ direction using a single-stranded DNA template. In cells they play a variety of roles in DNA replication, recombination, and repair. In addition to synthesizing new DNA, some polymerases harbor exonuclease activity allowing them to degrade DNA in either the 5′→3′ direction, the 3′→5′ direction, or both. A variety of DNA polymerases are used in cloning and play a number of different roles in cloning protocols (Guo and Bi 2002). Thermostable DNA polymerases are found in species that live in high-temperature environments. These polymerases are suitable for use in PCR because the repeated rounds of heating and cooling do not destroy the enzyme’s activity. The presence or absence of 3′→5′ exonuclease activity influences the type of end generated by thermostable polymerases. Use of a thermostable polymerase with 3′→5′ exonuclease activity, like Vent polymerase, produces blunt-ended PCR products suitable for blunt-ended cloning. On the other hand, Taq polymerase lacks the 3′→5′ exonuclease activity and includes a terminal transferase activity that results in template-independent addition of an A to the 3′ end of the dsDNA. This activity results in a single unpaired A at the 3′ end of the insert DNA. Preparation of a vector with an overhanging 3′ T allows efficient cloning of the PCR product into the vector. This procedure is called TA cloning. The primers used to generate PCR product often have additional nucleotides at their 5′ end that include a restriction site. This allows the generation of sticky ends by digesting the PCR product with the appropriate restriction enzyme or enzymes. If each member of the primer pair has a different restriction site, the PCR product can be cloned directionally.
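The use of 5′ primer tails for directional cloning can be sketched as follows. This is an illustrative example rather than a primer-design tool: the function names, the 18-nucleotide annealing length, and the two-base "GC" clamp placed upstream of each site are assumptions made here, and the EcoRI and BamHI recognition sequences are used only as familiar examples.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Recognition sequences of two common type II enzymes, written 5'->3'.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def tailed_primers(template: str, anneal_len: int = 18,
                   fwd_enzyme: str = "EcoRI", rev_enzyme: str = "BamHI",
                   clamp: str = "GC") -> tuple:
    """Return (forward, reverse) primers, each written 5'->3'.

    Each primer is: clamp (assumed extra bases) + restriction site +
    the region that anneals to the template.
    """
    fwd = clamp + SITES[fwd_enzyme] + template[:anneal_len]
    rev = clamp + SITES[rev_enzyme] + reverse_complement(template[-anneal_len:])
    return fwd, rev


def pcr_product(template: str, fwd: str, rev: str, anneal_len: int = 18) -> str:
    """Top strand of the idealized amplicon, including both primer tails."""
    fwd_tail = fwd[:-anneal_len]
    rev_tail = rev[:-anneal_len]
    return fwd_tail + template + reverse_complement(rev_tail)
```

Because the two ends of the amplicon carry different sites, digestion with both enzymes would leave non-identical sticky ends, which is what allows the insert to enter a correspondingly cut vector in only one orientation. The sketch does not check the template for internal copies of either site, which a real design would need to do.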

Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of, Fig. 5 Producing and cloning a cDNA using reverse transcriptase. Cloning of cDNA begins with isolation of mRNA from the organism of interest. An oligo-dT primer is used in conjunction with reverse transcriptase to synthesize the first strand of DNA using the mRNA as a template. Many protocols use RNase H to nick the backbone of the mRNA and DNA polymerase I to remove the mRNA and synthesize the second strand of DNA using the first strand of DNA as a template. The resulting cDNA is then cloned into a vector, transformed, and propagated in a host.

The Klenow fragment is a modified form of DNA polymerase I from E. coli. It has both polymerase and 3′→5′ exonuclease activities, but the endogenous 5′→3′ exonuclease activity is removed. T4 and Pfu DNA polymerases also have these same activities. Their polymerase activity is used to fill 5′ overhangs to produce blunt ends for blunt-ended cloning. Their 3′→5′ exonuclease activity is used to remove 3′ overhangs. This is called polishing and is often used to remove the 3′ A at the ends of PCR products generated by Taq polymerase to allow blunt-end ligations with these products. Reverse transcriptases are DNA polymerases capable of producing dsDNA using an ssRNA template. During retroviral infection these enzymes first synthesize a strand of DNA using the ssRNA viral genome as a template. They then concurrently degrade the RNA from the DNA/RNA hybrid and use the DNA as a template to produce a second strand of DNA. The resulting DNA duplex is integrated into the host’s genome. In vitro this enzyme can produce dsDNA from purified cellular RNA. The dsDNA is then cloned to produce cDNA libraries. While a single enzyme is capable of catalyzing all three steps in this process, the efficiency can be increased by using modified reverse transcriptase that only carries out the first step and enzymes like RNase H and DNA polymerase I from E. coli to complete the procedure (Harbers 2008). See Fig. 5. Terminal deoxyribonucleotidyl transferases (TdTs) synthesize nucleic acids by adding nucleotides to the ends of a DNA molecule in a template-independent manner. TdT is used in some cloning protocols to modify the ends of DNA in preparation for cloning. Providing TdT with only one type of nucleotide results in the addition of one or more of that specific residue to the 3′ end of the substrate DNA. This can be used to produce complementary cohesive ends that facilitate cloning.
Early cloning experiments used calf-thymus TdT to produce stretches of poly T on the ends of one segment of DNA and stretches of poly A on the other segment of DNA, producing cohesive ends that facilitated cloning (Berg and Mertz 2010). TdT is still used to clone some PCR products by TA cloning using ddTTP so that only one T residue is added to each 3′ end of the DNA (Zhou and Gomez-Sanchez 2000).

Phosphatases and Kinases Certain manipulations during cloning require the presence or absence of phosphoryl groups at specific ends of the DNA molecules involved in the cloning process. Bacteriophage T4 polynucleotide kinase is used to add a phosphate to the 5′ end of either ssDNA or dsDNA using the γ-phosphate from an NTP donor (Eastberg et al. 2004). Since a 5′ phosphate is required by DNA ligase, DNA fragments lacking a 5′ phosphate must be treated with a polynucleotide kinase prior to ligation reactions. Conversely, removal of the 5′ phosphate prevents ligation. In some instances it is beneficial to remove the 5′ phosphates using alkaline phosphatase. For example, treating either the vector or a large insert with alkaline phosphatase will remove the 5′ phosphate and prevent self-ligation of the molecule and increase the frequency of ligation between the vector and the insert (Ullrich et al. 1977). Typically either calf intestine alkaline phosphatase (CIP) or bacterial alkaline phosphatase (BAP) is used.
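The phosphate bookkeeping described above can be written down as a small rule-based sketch. It is a deliberate simplification: the `Fragment` class and the function names are hypothetical, both 5′ ends of a fragment are assumed to share one phosphorylation state, and end compatibility is taken for granted. The `can_join` rule adds one assumption beyond the text, namely that a junction can be sealed on at least one strand whenever either partner supplies a 5′ phosphate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    five_prime_phosphate: bool  # True if both 5' ends carry a phosphate


def dephosphorylate(frag: Fragment) -> Fragment:
    """Model alkaline phosphatase (e.g., CIP or BAP) treatment."""
    return Fragment(frag.name, five_prime_phosphate=False)


def phosphorylate(frag: Fragment) -> Fragment:
    """Model T4 polynucleotide kinase treatment."""
    return Fragment(frag.name, five_prime_phosphate=True)


def can_self_ligate(frag: Fragment) -> bool:
    # Closing a molecule on itself needs its own 5' phosphate at the junction.
    return frag.five_prime_phosphate


def can_join(a: Fragment, b: Fragment) -> bool:
    # Assumed rule: one strand of each a-b junction can be sealed as long as
    # at least one of the two partners provides a 5' phosphate.
    return a.five_prime_phosphate or b.five_prime_phosphate
```

Under these rules a phosphatase-treated vector cannot recircularize but can still accept a phosphorylated insert, whereas an insert lacking 5′ phosphates must be kinase-treated first.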

Cross-References ▶ Artificial Chromosomes ▶ Bacteriophage and Viral Cloning Vectors ▶ Blue/White Selection ▶ DNA Recombination, Mechanisms of ▶ Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of ▶ Key Enzymes Used in Cloning, Some ▶ Plasmid Cloning Vectors ▶ Plasmid Incompatibility ▶ Plasmids, Naming and Annotation of ▶ Polymerase Chain Reaction ▶ Recombineering ▶ Restriction Endonucleases ▶ Rolling Circle Replicating Plasmids ▶ Selection with Antibiotics ▶ Theta-Replicating Plasmids, Large


References Backman K, Ptashne M (1978) Maximizing gene expression on a plasmid using recombination in vitro. Cell 13(1):65–71 Berg P, Mertz JE (2010) Personal reflections on the origins and emergence of recombinant DNA technology. Genetics 184(1):9–17 Bernard P, Gabant P et al (1994) Positive-selection vectors using the F plasmid ccdB killer gene. Gene 148(1):71–74 Bolivar F, Rodriguez RL et al (1977a) Construction and characterization of new cloning vehicles. I. Ampicillinresistant derivatives of the plasmid pMB9. Gene 2(2):75–93 Bolivar F, Rodriguez RL et al (1977b) Construction and characterization of new cloning vehicles. II. A multipurpose cloning system. Gene 2(2):95–113 Burke DT, Carle GF et al (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236(4803):806–812 Camps M (2010) Modulation of ColE1-like plasmid replication for recombinant gene expression. Recent Pat DNA Gene Seq 4(1):58–73 Chan SS, Chang S (2010) Defending the end zone: studying the players involved in protecting chromosome ends. Febs Lett 584(17):3773–3778 Charnay P, Perricaudet M et al (1978) Bacteriophage lambda and plasmid vectors, allowing fusion of cloned genes in each of the three translational phases. Nucleic Acids Res 5(12):4479–4494 Chauthaiwale VM, Therwath A et al (1992) Bacteriophage lambda as a cloning vector. Microbiol Rev 56(4): 577–591 Chudakov DM, Matz MV et al (2010) Fluorescent proteins and their applications in imaging living cells and tissues. Physiol Rev 90(3):1103–1163 Clarke L, Carbon J (1980) Isolation of a yeast centromere and construction of functional small circular chromosomes. Nature 287(5782):504–509 Collins J, Hohn B (1978) Cosmids: a type of plasmid genecloning vector that is packageable in vitro in bacteriophage lambda heads. Proc Natl Acad Sci U S A 75(9):4242–4246 Copeland NG, Jenkins NA et al (2001) Recombineering: a powerful new tool for mouse functional genomics. 
Nat Rev Genet 2(10):769–779 Dalal Y, Bui M (2010) Down the rabbit hole of centromere assembly and dynamics. Curr Opin Cell Biol 22(3):392–402 del Solar G, Giraldo R et al (1998) Replication and control of circular bacterial plasmids. Microbiol Mol Biol Rev 62(2):434–464 Eastberg JH, Pelletier J et al (2004) Recognition of DNA substrates by T4 bacteriophage polynucleotide kinase. Nucleic Acids Res 32(2):653–660 Fields S (2009) Interactive learning: lessons from two hybrids over two decades. Proteomics 9(23): 5209–5213

Grindley ND, Whiteson KL et al (2006) Mechanisms of site-specific recombination. Annu Rev Biochem 75:567–605 Guo B, Bi Y (2002) Cloning PCR products. An overview. Methods Mol Biol 192:111–119 Harbers M (2008) The current status of cDNA cloning. Genomics 91(3):232–242 Helfman DM, Feramisco JR et al (1983) Identification of clones that encode chicken tropomyosin by direct immunological screening of a cDNA expression library. Proc Natl Acad Sci U S A 80(1):31–35 Hsiao CL, Carbon J (1979) High-frequency transformation of yeast by plasmids containing the cloned yeast Arg4 gene. Proc Natl Acad Sci U S A 76(8):3829–3833 Ioannou PA, Amemiya CT et al (1994) A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nat Genet 6(1):84–89 Li W, Caberoy NB (2010) New perspective for phage display as an efficient and versatile technology of functional proteomics. Appl Microbiol Biotechnol 85(4):909–919 Lipps H (2003) Chromosome-based vectors for gene therapy. Gene 304:23–33 Malhotra A (2009) Tagging for protein expression. Methods Enzymol 463:239–258 Messing J (1991) Cloning in M13 phage or how to use biology at its best. Gene 100:3–12 Miller LK (1989) Insect baculoviruses: powerful gene expression vectors. Bioessays 11(4):91–95 Monaco AP, Larin Z (1994) YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends Biotechnol 12(7):280–286 Murray NE, Murray K (1974) Manipulation of restriction targets in phage lambda to form receptor chromosomes for DNA fragments. Nature 251(5475):476–481 Murray AW, Szostak JW (1983) Construction of artificial chromosomes in yeast. Nature 305(5931):189–193 Narayanan K, Chen Q (2011) Bacterial artificial chromosome mutagenesis using recombineering. J Biomed Biotechnol 2011:971296 Pingoud A, Fuxreiter M et al (2005) Type II restriction endonucleases: structure and mechanism. Cell Mol Life Sci 62(6):685–707 Prorocic MM, Wenlong D et al (2011) Zinc-finger recombinase activities in vitro. 
Nucleic Acids Res 39:9316–9328 Ramsay M (1994) Yeast artificial chromosome cloning. Mol Biotechnol 1(2):181–201 Rosenberg AH, Lade BN et al (1987) Vectors for selective expression of cloned DNAs by T7 RNA polymerase. Gene 56(1):125–135 Ruiz-Cruz S, Solano-Collado V et al (2010) Novel plasmid-based genetic tools for the study of promoters and terminators in Streptococcus pneumoniae and Enterococcus faecalis. J Microbiol Methods 83(2):156–163 Sawitzke JA, Thomason LC et al (2007) Recombineering: in vivo genetic engineering in E. coli, S. enterica, and beyond. Methods Enzymol 421:171–199

Schmidt FR (2004) Recombinant expression systems in the pharmaceutical industry. Appl Microbiol Biotechnol 65(4):363–372 Shizuya H, Birren B et al (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci U S A 89(18):8794–8797 Short JM, Fernandez JM et al (1988) Lambda ZAP: a bacteriophage lambda expression vector with in vivo excision properties. Nucleic Acids Res 16(15):7583–7600 Smith GE, Summers MD et al (1983) Production of human beta interferon in insect cells infected with a baculovirus expression vector. Mol Cell Biol 3(12):2156–2165 Sternberg N (1990) Bacteriophage-P1 cloning system for the isolation, amplification, and recovery of DNA fragments as large as 100 kilobase pairs. Proc Natl Acad Sci U S A 87(1):103–107 Stinchcomb DT, Struhl K et al (1979) Isolation and characterization of a yeast chromosomal replicator. Nature 282(5734):39–43 Szostak JW, Blackburn EH (1982) Cloning yeast telomeres on linear plasmid vectors. Cell 29(1):245–255 Terpe K (2006) Overview of bacterial expression systems for heterologous protein production: from molecular and biochemical fundamentals to commercial systems. Appl Microbiol Biotechnol 72(2):211–222 Thomas M, Cameron JR et al (1974) Viable molecular hybrids of bacteriophage lambda and eukaryotic DNA. Proc Natl Acad Sci U S A 71(11):4579–4583 Tomkinson AE, Vijayakumar S et al (2006) DNA ligases: structure, reaction mechanism, and function. Chem Rev 106(2):687–699 Ullrich A, Shine J et al (1977) Rat insulin genes: construction of plasmids containing the coding sequences. Science 196(4296):1313–1319 Vieira J, Messing J (1982) The pUC plasmids, an M13mp7-derived system for insertion mutagenesis and sequencing with synthetic universal primers.
Gene 19(3):259–268 Walhout AJ, Temple GF et al (2000) GATEWAY recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes. Methods Enzymol 328:575–592 Weigel C, Seitz H (2006) Bacteriophage replication modules. Fems Microbiol Rev 30(3):321–381 Zhou MY, Gomez-Sanchez CE (2000) Universal TA cloning. Curr Issues Mol Biol 2(1):1–7

Epigenetic DNA Methylation ▶ Genomic Imprinting

Epigenetic Research, Computational Methods in
Heather J. Ruskin (1) and Dimitri Perrin (2)
(1) Centre for Scientific Computing and Complex Systems Modelling, Dublin City University, Dublin, Ireland
(2) Laboratory for Systems Biology, RIKEN Center for Developmental Biology, Kobe, Japan

Synonyms Computational epigenetics

Synopsis Biomedical systems are characterized by their emergent behavior, with overall dynamics of the system arising from the multiplicity of their interactions. Even though these interactions can be directly investigated, it is often difficult to link them directly to the phenomena observed at the system level. Epigenetic regulation provides a typical example of emergent system behavior. A large number of mechanisms are involved, e.g., DNA methylation and histone modification among others, and all contribute to controlling whether genes are expressed. Research has resulted to date in a better description of each mechanism, but a quantitative description of the interactions and their contribution to overall system evolution remains a challenge. In this context, in vitro and in vivo studies need to be complemented by a third branch of research, namely, in silico investigation, which can contribute in several ways. It enables a better analysis of the limited data available and facilitates integration of separate data sources. It can also complement wet lab experiments for both hypothesis testing and investigation of the system dynamics and offers potential to explore the nature of the emergent system behavior through appropriate models. Historically, this approach has proved valuable for other biomedical systems and is now being

applied to epigenetic research. This chapter reports on some specific studies.

Introduction Recent biomedical research has shown that the phenotype of living organisms depends on complex interdependent mechanisms, deriving not only from genotype and environment, but also from a large set of interactions modifying gene expression without alteration of the DNA sequence, a phenomenon called epigenetics. Epigenetic mechanisms involve heritable alterations in chromatin structure (e.g., DNA methylation and histone acetylation among other “signatures”). In turn, these regulate transcriptional activation of protein- and RNA-encoding genes (gene expression). Epigenetic signals arise during development and cell proliferation and persist through cell division. Hence, while information within the genetic material is not changed, instructions for its assembly and interpretation may be. The role of such changes is modification specific (different changes on the same amino acid may have opposite effects) as well as molecule specific (e.g., trimethylation of H3K9 and trimethylation of H3K4 have opposite consequences for gene expression). Complex interactions exist between DNA methylation and histone modifications, while the respective epigenetic changes have different dynamics and stability. For instance, histone deacetylation is very rapid, while histone methylation is slow and DNA methylation comparatively stable. Alterations in DNA methylation, imprinting, and chromatin structure are common in cancer, and links to epigenetic changes have been established in several cases, e.g., in Wilms’ tumor and colon cancer. Epigenetic mechanisms are also studied in other medical fields because of their association with obesity, abnormal neural development, mood disorders such as stress vulnerability and bipolar disorder, and risk of heart failure. Quantitative information remains sparse. Laboratory research on epigenetic mechanisms is costly and specific, while computational techniques can be wide ranging, addressing both data analysis and modeling of the intricate interactions involved.

Analogy with Other Complex Biomedical Systems Details from ongoing research are regularly published on specific epigenetic changes. Constraints include ethical issues as well as the limited nature of individual experiments; so, while specific phenomena are now better described, an explanation of the system-wide complex interactions, which are a crucial feature of epigenetic control, is still lacking. Given the inherent technical constraints, most research groups focus on a single epigenetic change in a specific context. Such a situation is not, of course, uncommon in biomedical research. Analogies exist, e.g., with early research on HIV and immune response to that infection. A number of key features were rapidly identified, such as HIV targeting of CD4 lymphocytes and the use of these as hosts to replicate. Also known at an early stage was the fact that this results in a gradual depletion of the immune system, increasing susceptibility to opportunistic diseases and ultimately leading to the death of the patient. The main challenges included explaining how interactions at the microscopic level were contributing to overall disease progression and determining the influence of observed individual variations on progression rate. Only limited data were available, and the computational modeling approach was used to formulate hypotheses for testing, which became more refined as data were gradually augmented. A similar approach is currently being applied to epigenetic research.

Data Analysis The data themselves present new issues. Bioinformatics draws inspiration from diverse disciplines, including physics and engineering, statistics, and computer science, to establish a quantitative interface with biology, which ranges from cell to system level. The ultimate aim is to discover how life functions and, to this end, statistical and pattern recognition tools are core. Nevertheless, biology deals with specifics, as well as possibilities, and much detail is required to root computational analyses in the real-world context. This requirement is not new to biological investigation and has always presented formidable problems in terms of classification and interpretation. Nevertheless, while development of computer capability has paralleled that of new laboratory technologies over recent years, the extensive and variable data now available require sophisticated management tools, analysis, and, ultimately, interpretation. From a bioinformatics viewpoint, major tasks include integration of different data sources and types, to facilitate queries by prespecified criteria (e.g., by genes, genomic regions, phenotypic characteristics, methylation states, miRNA binding, etc.). Combining query criteria also enables highlighting of multilevel control or expression. Additionally, several tools, such as BALM (Lan et al. 2011) and MEDIPS (Clark et al. 2012), have been developed to facilitate analysis of epigenetic data obtained through next-generation sequencing. Large-scale epigenome mapping efforts have been established and are already proving successful, in terms of both improvements of experimental methods required to generate such data and of complementary data analyses. A summary of these projects is provided by Bock and Lengauer (2008). The authors also discuss drawing inferences on epigenetic states from DNA sequences, with results particularly encouraging for DNA methylation and nucleosome positioning. Such predictions, particularly where experimental data are still lacking, represent a first step toward quantitative modeling of epigenetic mechanisms. Information from independent studies also requires collation, so that combined analyses can be performed with niche databases and corresponding links providing templates and appropriate statistical tools. StatEpigen is an example of such a resource (Barat and Ruskin 2010).
Designed as a repository of curated and synthesized research information on epigenetic factors responsible for colorectal cancer (in the first instance), it makes data available for public use through a web interface, with additional inputs accommodated to support development of computational and gene-network models (discussed in what follows). The database provides advanced query options, as shown in Fig. 1. Information is currently available on more than 700 genes and their corresponding single and conditional epigenetic changes, and is augmented on a regular basis. The StatEpigen database is populated through expert manual curation and annotation of epigenetic literature; links to complementary sources are highlighted, and semiautomation of additional data is in hand. The project also targets mining and analysis of colorectal cancer data for insights on associated events (genetic and epigenetic interactions) and progression through the various stages of the disease. Initial analysis has demonstrated synergy between epigenetic changes, implying increased risk of cancer initiation. Such efforts provide increasingly refined information, enhancing both ongoing wet lab research and parameterization of the computational models being developed in support. In parallel, the suitability of the murine reference model for human colorectal cancer has been explored (Morgan et al. 2012). The evolutionary analysis investigated whether human colorectal cancer genes and their mouse orthologs are still functionally similar after 130 million years of independent evolution. Findings suggest that some genes are not, and this is likely to have an impact on future in vivo studies.

Modeling Integration of results, such as those mentioned, improves understanding of the overall biological system, and computer-based modeling can provide a framework for this. Again, given system complexity, early models have generally focused on a single scale, but spanning system layers is the ultimate goal. Here, two examples are given, dealing respectively with aberrant epigenetic evolution at the organ level and with epigenetic interactions at the chromatin level.
Epigenetic Research, Computational Methods in, Fig. 1 Types of query options available in StatEpigen. Shown here by: cancer type and different aspects of CRC, namely, individual cell lines and advanced search. Panel labels: Filtering according to cancer type (suitable for comparison of the occurrence of a genetic event in different types of cancers); Cell Lines (information on more than 100 colon cancer cell lines); Filtering according to different aspects of colon cancer (suitable for comparison of the occurrence of genetic events for different stages/types of colon cancer); Advanced Search (combined option, includes various filtering possibilities); Genes and molecular events; Most frequent histologies and subhistologies; Clinicopathological factors.

Organ-Level Modeling An important computational modeling role, as noted previously, is hypothesis testing to complement in vivo or in vitro experiments. In illustration,

a study focused on gastric cancers has shown that gene inactivation is more frequently due to aberrant promoter methylation than to defects at the genetic level (Ushijima and Sasako 2004). In addition, methylation levels in healthy individuals were found to be significantly higher in H. pylori-positive as opposed to H. pylori-negative individuals (Maekita et al. 2006). This is significant, since the infection is known to be a major risk factor for gastric cancers. Importantly, this infection-induced hypermethylation is partly temporary and decreases after eradication of infection, owing to cell turnover rather than to active demethylation. These dynamics were hypothesized to result from the structure of the gastric crypt (one stem cell, multiple progenitor cells, and many differentiated cells), which gives rise to a two-part methylation (Fig. 2). The components are:
– Permanent, due to methylation of stem cells. Methylation is conserved during cell division, so progenitor and differentiated cells obtained from an aberrantly methylated stem cell exhibit identical abnormal patterns.
– Temporary, due to methylation in progenitor and differentiated cells. These cells have a finite life span and are gradually replaced. If a crypt stem cell is not methylated, new cells are also nonmethylated and contribute to erasure of this temporary component.

Epigenetic Research, Computational Methods in, Fig. 2 Methylation dynamics in gastric crypts. Color represents cell type (black for stem cells, gray for progenitor cells, white for differentiated cells). Aberrant methylation is shown with a cross inside the cell. Left: During H. pylori infection, cells can be methylated, with different probabilities, depending on their type. Methylation status is conserved through cell division (from stem to progenitor cell and from progenitor to differentiated cell). Right: After eradication, only crypts where the stem cell was methylated conserve aberrant methylation (which is propagated through the whole crypt during cell division) (Adapted from Perrin et al. 2010)

Gaining improved insight into overall methylation dynamics is crucial, and to complement murine experiments, a computational model has been developed (Perrin et al. 2010). This model reproduces cell renewal dynamics, as well as the structure of the gastric crypt. Aberrant methylation then occurs as a probabilistic event, based on cell type and the infection status. This structure permits investigation of complex infection dynamics, as well as sensitivity analysis on methylation susceptibilities for all cell types. Confirmation of the formulated hypothesis, together with additional information not accessible through physical experiments (quantitative values for methylation susceptibilities), thus provides a strong argument in favor of in silico epigenetic modeling. Complementing in vivo experiments, information is provided on the dynamics of infection-induced aberrant methylation (important for understanding gastric cancer initiation), while potential exists for generalization to other cancer types. Aberrant DNA methylation in noncancerous tissues was also identified, e.g., in the colon, the liver, and the stomach. In silico models can thus facilitate quantitative methylation analysis and investigation of the influence of potential induction factors.
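The two-component hypothesis for crypt methylation lends itself to a small stochastic illustration. The sketch below is not the model of Perrin et al. (2010); it is a toy under assumed parameters. The crypt size (1 stem, 4 progenitor, 12 differentiated cells), the per-step methylation probabilities, the conveyor-belt turnover rule, and all names are hypothetical choices made only to show how a permanent and a temporary component can emerge.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Crypt:
    stem: bool = False  # True means aberrantly methylated
    progenitors: list = field(default_factory=lambda: [False] * 4)
    differentiated: list = field(default_factory=lambda: [False] * 12)

    def step(self, infected: bool, p_methyl: dict, rng: random.Random) -> None:
        if infected:
            # Each unmethylated cell may become methylated, by cell type.
            self.stem = self.stem or rng.random() < p_methyl["stem"]
            self.progenitors = [m or rng.random() < p_methyl["progenitor"]
                                for m in self.progenitors]
            self.differentiated = [m or rng.random() < p_methyl["differentiated"]
                                   for m in self.differentiated]
        # Turnover: the oldest differentiated cell is shed, the oldest
        # progenitor differentiates, and the stem cell yields a new
        # progenitor that inherits the stem cell's methylation status.
        self.differentiated.pop(0)
        self.differentiated.append(self.progenitors.pop(0))
        self.progenitors.append(self.stem)

    def methylated_fraction(self) -> float:
        cells = [self.stem] + self.progenitors + self.differentiated
        return sum(cells) / len(cells)


def simulate(n_crypts=200, infected_steps=50, cleared_steps=50, seed=0):
    rng = random.Random(seed)
    # Hypothetical susceptibilities; real values would need to be fitted.
    p = {"stem": 0.002, "progenitor": 0.02, "differentiated": 0.02}
    crypts = [Crypt() for _ in range(n_crypts)]
    history = []
    for t in range(infected_steps + cleared_steps):
        for crypt in crypts:
            crypt.step(t < infected_steps, p, rng)
        history.append(sum(c.methylated_fraction() for c in crypts) / n_crypts)
    return crypts, history
```

Once the simulated infection has been cleared for longer than one full turnover of the crypt, every cell descends from the stem cell, so each crypt ends up either fully methylated or fully unmethylated. Whatever methylation was lost after clearance corresponds to the temporary component in this toy setting.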

Microscopic Modeling A prototype theoretical micromodel (EpiGMP), based on the Markov Chain Monte Carlo (MCMC) class of algorithms, is also described (Raghavan et al. 2010). This models the interdependence of histone modifications and DNA methylation, based on random sampling of states, assuming defined distributional forms. Development of this epigenetic tool to predict molecular interactions involves four main stages:
(a) Information on the nature of DNA methylation and the number and types of known histone modifications (collated from literature).
(b) DNA methylation levels initially fixed [0.1, 0.9], with different types of histone modifications represented as numerical strings.
(c) Interdependencies incorporated between the epigenetic variables or key elements and based on distribution values selected.
(d) Iteration, permitting random selection/sampling of specific histone modifications for a fixed value of DNA methylation.
The final output, Transcription (T), is given as a function of the levels of histone modifications and DNA methylation.

The number of “states,” or numerical strings, varies for each type of histone. The Markov Chain Monte Carlo (MCMC) algorithm is applied in the model to allow a restricted and slow shift from a current amino acid modification to a new histone modification. Each move to a new histone state has an associated transition probability. Extensive data structures are used to store the list of modifications (randomly chosen over several iterations). The model is thus computationally efficient and can also simultaneously decide on the mono/di/tri modification levels. Potentially, a generic tool of this kind can be applied to predict molecular events that affect the state of expression of either a single gene or a group of genes. Current applications include prediction of molecular events and expression levels of related genes in colorectal cancer (extracted from the StatEpigen database, described above). This predictive framework will also be applied to an analysis of the influence of histone changes in controlling the physical chromosomal rearrangements during disease onset. Thus, additional information is anticipated on the way in which low-level molecular changes determine physical traits for normal and disease conditions in an organism.
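The general idea of sampling histone states by restricted moves at a fixed DNA methylation level can be illustrated with a generic Metropolis sampler. This is emphatically not the EpiGMP implementation: the four-level state space for a single repressive mark, the exponential weight function, the coupling constant, and the form of the transcription score are all toy assumptions introduced here for illustration.

```python
import math
import random

# Hypothetical states for one repressive mark: un/mono/di/trimethylated.
LEVELS = ("un", "me1", "me2", "me3")


def weight(level: int, dna_meth: float, coupling: float = 2.0) -> float:
    """Toy unnormalized target: high DNA methylation favors high levels."""
    return math.exp(coupling * (dna_meth - 0.5) * level)


def sample_levels(dna_meth: float, n_iter: int = 5000, seed: int = 0) -> list:
    """Metropolis sampling with moves restricted to neighboring levels."""
    rng = random.Random(seed)
    level, visited = 0, []
    for _ in range(n_iter):
        proposal = level + rng.choice((-1, 1))  # one-step move only
        if 0 <= proposal < len(LEVELS):
            accept = min(1.0, weight(proposal, dna_meth) / weight(level, dna_meth))
            if rng.random() < accept:
                level = proposal
        # Out-of-range proposals are rejected and the current level is kept.
        visited.append(level)
    return visited


def transcription_score(dna_meth: float, visited: list) -> float:
    """Toy output T in [0, 1], decreasing with both inputs."""
    mean_level = sum(visited) / len(visited)
    return (1.0 - dna_meth) * (1.0 - mean_level / (len(LEVELS) - 1))
```

Running the sampler at a low and at a high DNA methylation value (for example 0.1 and 0.9) gives a higher toy transcription score in the former case, which is the qualitative behavior this illustration was built to show rather than a prediction of the published model.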


Epigenetic Research, Computational Methods in, Fig. 3 Overall model structure, with epigenetic interactions and external risk factors

Model Integration As indicated previously, the long-term objective is model integration across multiple scales. In the first instance, the focus is the cancer initiation context, which is of considerable interest to clinical researchers, since progression of aberrant changes in cancer cells can be controlled and even treated or reversed, if detected at an early stage (already utilized in some cancer therapies). For colorectal cancer, the immediate effort seeks to span scales to develop a model for genetic and epigenetic signals in different pathology phenotype levels. The goal is to predict cancer initiation and progression, following a Bayesian network approach for describing the causality between genes in tumor pathways. Structured in three main layers (micromolecular, gene interaction, and cancer stage), an integrated model is partly developed, drawing on data from the StatEpigen database and incorporating hypermethylation, hypomethylation, gene expression level, and mutations for different genes implicated in the disease. Histone modification and corresponding DNA methylation updates are also based on information generated by the Raghavan et al. model outlined earlier. Future extensions include heredity as an initiation factor: the assumption of multiple successive “hits” (Knudson 2001) suggests that fewer mutations are required to produce malignancies in families in which cancer has already appeared, since some changes are already inherited. Aging, another major risk factor, will also be included, given its known influence on genetic and epigenetic events in cancer development (Fraga et al. 2007). Based on knowledge of commonly mutated genes in cancer (such as P53), the colorectal model may also be extended to other types of cancer (breast, lung, ovarian, stomach), with statistical evidence on environment and lifestyle features and their implicit impact on cancer development taken into account (Fig. 3). This integrated approach offers the possibility of gradual refinement to the multiscaled model and its outputs, so that capturing and quantifying the overall system behavior becomes a realistic target.

Cross-References ▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks ▶ DNA Methylation and Cancer ▶ Epigenetics ▶ Genomic Imprinting ▶ Genomic Imprinting in Mammals: Memories of Generations Past ▶ Mathematical Models in the Sciences ▶ Plasmid Regulatory Systems, Modeling

References Barat A, Ruskin HJ (2010) A manually curated novel knowledge management system for genetic and epigenetic molecular determinants of colon cancer. Open Colorectal Cancer J 3:36–46 Bock C, Lengauer T (2008) Computational epigenetics. Bioinformatics 24(1):1–10 Clark C, Palta P, Joyce CJ, Scott C, Grundberg E, Deloukas P, Palotie A, Coffe AJ (2012) A comparison of the whole genome approach of MeDIP-Seq to the targeted approach of the Infinium HumanMethylation450 BeadChip® for methylome profiling. PLoS One 7(11):e50233 Fraga MF, Agrelo R, Esteller M (2007) Cross-talk between aging and cancer. Ann N Y Acad Sci 1100(1):60–74 Knudson A (2001) Two genetic hits (more or less) to cancer. Nat Rev Cancer 1(2):157–162 Lan X, Adams C, Landers M, Dudas M, Krissinger D, Marnellos G, Bonneville R, Xu M, Wang J, Huang TH-M, Meredith G, Jin VX (2011) High resolution detection and analysis of CpG dinucleotides methylation using MBD-Seq technology. PLoS One 6(7):e22226 Maekita T, Nakazawa K, Mihara M, Nakajima T, Yanaoka K, Iguchi M, Arii K, Kaneda A, Tsukamoto T, Tatematsu M, Tamura G, Saito D, Sugimura T, Ichinose M, Ushijima T (2006) High levels of aberrant DNA methylation in Helicobacter pylori-infected gastric mucosae and its possible association with gastric cancer risk. Clin Cancer Res 12:989–995 Morgan CC, Shakya K, Webb A, Walsh TA, Lynch M, Loscher CE, Ruskin HJ, O’Connell MJ (2012) Colon cancer associated genes exhibit signatures of positive selection at functionally significant positions. BMC Evol Biol 12:114 Perrin D, Ruskin HJ, Niwa T (2010) Cell type-dependent, infection-induced, aberrant DNA methylation in gastric cancer. J Theor Biol 264(2):570–577 Raghavan K, Ruskin HJ, Perrin D, Burns J, Goasmat F (2010) Computational micro model for epigenetic mechanisms. PLoS One 5(11):e14031 Ushijima T, Sasako M (2004) Focus on gastric cancer. Cancer Cell 5:121–125

Epigenetics ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

Equilibria and Bifurcations in the Molecular Biosciences John Wesley Cain Department of Mathematics and Computer Science, University of Richmond, Richmond, VA, USA

Synopsis The steady-state behavior of a biochemical system depends upon a variety of parameters, such as the kinetic constants of the reaction processes. Varying such parameters may induce sudden and dramatic changes in the observed behavior of the system. For instance, loss of stability of an equilibrium state may cause (i) a rapid evolution to a very different equilibrium state, (ii) emergence of sustained oscillatory behavior, or even (iii) chaotic behavior. Each of those events may be regarded as a bifurcation: a dramatic change in the expected qualitative behavior of a system in response to changing a parameter. While this nontechnical description of bifurcation is an oversimplification of the rigorous mathematical definition, it certainly suffices in the context of this essay. Here, four examples of bifurcation phenomena in simple biochemical systems are analyzed and discussed.

Introduction Given a chemical, biological, or physical system, what are its equilibria or steady states? How do those equilibria “move” if experimental parameters (such as the ambient room temperature during a chemical reaction) are varied? Could varying parameters cause sudden dramatic (or even catastrophic) changes in the long-term behavior of a system? There are mathematical tools for addressing precisely these sorts of questions – tools that are surveyed in this entry.

Roughly speaking, a system is in equilibrium if its state does not change over time. For example, in a simple, closed chemical process A + B → C, the reactants A and B are gradually consumed and converted into a product C. Over time, this system approaches an equilibrium state in which C saturates to some steady concentration $C_{\mathrm{eq}}$, while A and B decay to their own steady-state values. When the concentration of either A or B approaches zero (complete consumption of one of the reactants), the system approaches equilibrium.

The concept of stability of an equilibrium is important in the context of understanding long-term behavior. The above is an example of a stable, attracting equilibrium – an equilibrium that the system “wants” to be in. More descriptively, a stable, attracting equilibrium is one with the following property: whenever the system is in a state that is sufficiently “close” to the equilibrium, the system will eventually approach the equilibrium. Equilibria can also be unstable, but those are virtually impossible to observe in nature. Consider, for example, a simple pendulum. There are two equilibrium states: the downward vertical orientation (stable and attracting) and the upward vertical orientation (unstable). The latter is certainly a precarious equilibrium that one would never actually observe, because even a minuscule deviation from the upward vertical causes the system to rapidly deviate from that equilibrium. The former is stable and attracting due to friction: if the pendulum starts away from the downward vertical, then it will swing back and forth for a while before settling down.

The use of the word “close” in the definition of stable and attracting is important – the above definition is what mathematicians would call local attraction as opposed to global attraction. If a system is kicked too far away from a local attractor, it might never return, e.g., if the system is “pulled in” by some other locally attracting equilibrium. A globally attracting equilibrium corresponds to a state that the system is destined to approach regardless of the initial state of the system.

Changing experimental parameters can sometimes cause major changes in steady-state behavior; e.g., equilibria can be created or destroyed or may suddenly change stability. Those events are examples of bifurcations: dramatic changes in the qualitative behavior of a dynamical system in response to a change in a parameter. (Caution: The mathematical usage of the word bifurcation differs substantially from the standard English usage.) For example, suppose that a torque of constant magnitude (call it τ) is applied at the top of a pendulum arm. If τ is small, the pendulum will have a stable equilibrium state in which the pendulum forms a small angle with the downward vertical. If the parameter τ is gradually increased, the stable equilibrium orientation of the pendulum forms a larger and larger angle relative to the downward vertical. Once τ reaches some critical value $\tau_{\mathrm{crit}}$, the pendulum begins to spin around and around forever, and no equilibria exist. The destruction of equilibria when the parameter τ reaches $\tau_{\mathrm{crit}}$ is an example of a bifurcation.
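The torque example for the pendulum can be made quantitative under an idealization that the text does not spell out. As a sketch, assume a point mass $m$ on a rigid, massless arm of length $L$, gravitational acceleration $g$, an applied torque written $\tau$, and an angle $\theta$ measured from the downward vertical; all of these symbols are assumptions introduced here for illustration. An equilibrium requires the applied torque to balance the gravitational torque:

$$\tau = m g L \sin\theta^{*} \quad\Longrightarrow\quad \theta^{*}_{\mathrm{stable}} = \arcsin\!\left(\frac{\tau}{m g L}\right), \qquad \theta^{*}_{\mathrm{unstable}} = \pi - \arcsin\!\left(\frac{\tau}{m g L}\right).$$

Both solutions exist only while $\tau \le m g L$. As $\tau$ increases toward $m g L$, the two equilibria approach each other at $\theta = \pi/2$, merge, and then disappear, so in this idealized setting the critical torque is $\tau_{\mathrm{crit}} = m g L$.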

Equilibria and Bifurcations in Biochemistry Below are four examples of equilibria, stability, and bifurcations in [bio]chemical contexts. Differential equation (DE) models are used in each case, and the reader is referred to ▶ “Chemical Reaction Kinetics: Mathematical Underpinnings” for details regarding where such equations come from.

Example 1 Consider an autocatalytic process such as zymogen activation

$$A + B \underset{k_-}{\overset{k_+}{\rightleftharpoons}} B + B,$$

where B aids in its own production. Assume that the system is closed and well mixed and that A is highly abundant (or continuously replenished) so that its concentration can be regarded as essentially constant. In this case, the law of mass action states that the concentration b = [B] is governed approximately by the DE

$$\frac{db}{dt} = k_+ a b - k_- b^2,$$

where a = [A] (a positive constant) and $k_+$ and $k_-$ are (positive) kinetic constants for the forward and reverse reactions. By definition, an equilibrium state is one in which the system cannot change, and to find them, one simply sets the derivative db/dt = 0. Doing so yields the algebraic equation

$$0 = k_+ a b - k_- b^2 = b\,(a k_+ - k_- b).$$

Evidently, there are two equilibria: b = 0 and $b = a k_+/k_-$. The former is not terribly interesting from a chemistry standpoint (since the reaction would not take place in the absence of B), but is useful in illustrating the concept of stability. Regarding the stability of these two equilibria, observe that db/dt changes sign when b is equal to one of the equilibrium values. For $0 < b < a k_+/k_-$, the sign of db/dt is positive, implying that b increases. By contrast, when $b > a k_+/k_-$, the sign of db/dt is negative, implying that b decreases. One may conclude immediately that b = 0 is an unstable equilibrium, because if b is slightly larger than 0, it will increase away from 0. By contrast, if b is near (but not equal to) $a k_+/k_-$, then b will be attracted to that equilibrium over time.

There is an analytical test for whether an equilibrium is (locally) stable and attracting. If $x^*$ is an equilibrium of a DE of the form dx/dt = f(x), then the equilibrium is stable and attracting if $f'(x^*)$ is negative and is unstable if $f'(x^*)$ is positive. Applied to the autocatalysis example, the DE has the form db/dt = f(b), where $f(b) = k_+ a b - k_- b^2$. The derivative of f(b) is $f'(b) = k_+ a - 2 k_- b$, and there are two equilibria to test. For the trivial equilibrium b = 0, the fact that $f'(0) = k_+ a > 0$ implies that b = 0 is an unstable equilibrium. The instability of this equilibrium makes sense, because even if a tiny amount of b is present, the reaction “wants” to proceed. As for the other equilibrium, one calculates that $f'(a k_+/k_-) = -k_+ a < 0$, confirming that $a k_+/k_-$ is a (locally) stable, attracting equilibrium.
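The sign test on f′ for the autocatalysis model of Example 1 can be checked numerically. The following is a minimal sketch, not part of the original entry: the parameter values and the function names (`f`, `f_prime`, `classify`, `simulate`) are made up for illustration, and the forward-Euler integrator is only a crude stand-in for a proper ODE solver.

```python
def f(b, a, k_plus, k_minus):
    """Right-hand side of db/dt = k+ a b - k- b^2."""
    return k_plus * a * b - k_minus * b * b


def f_prime(b, a, k_plus, k_minus):
    """Derivative of f with respect to b: k+ a - 2 k- b."""
    return k_plus * a - 2.0 * k_minus * b


def classify(b_eq, a, k_plus, k_minus):
    """Apply the sign test on f'(b_eq) described in the text."""
    slope = f_prime(b_eq, a, k_plus, k_minus)
    if slope < 0:
        return "stable"
    if slope > 0:
        return "unstable"
    return "inconclusive"


def simulate(b0, a, k_plus, k_minus, dt=1e-3, steps=20000):
    """Forward-Euler integration of the DE starting from b0."""
    b = b0
    for _ in range(steps):
        b += dt * f(b, a, k_plus, k_minus)
    return b


# Illustrative (made-up) parameter values, not taken from the entry.
a, k_plus, k_minus = 2.0, 3.0, 1.5
equilibria = [0.0, a * k_plus / k_minus]      # b = 0 and b = a k+/k-
labels = [classify(b, a, k_plus, k_minus) for b in equilibria]
b_final = simulate(0.01, a, k_plus, k_minus)  # start just above b = 0
```

With these values the nontrivial equilibrium is b = 4, the sign test labels b = 0 as unstable and b = 4 as stable, and a trajectory started slightly above zero moves away from 0 and settles at the nontrivial equilibrium.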

Example 2 During glycolysis, phosphorylation of fructose-6-phosphate is catalyzed by phosphofructokinase in a process that is known to generate self-sustaining oscillations. Sel’kov (1968) was the first to propose a mathematical model of those oscillations and, after some minor simplifying assumptions (Strogatz 1994, pp. 205–206), his equations take the form

$$\frac{dx}{dt} = -x + a y + x^2 y, \qquad \frac{dy}{dt} = b - a y - x^2 y.$$

The variables x and y are scaled representations of concentrations of adenosine diphosphate and fructose-6-phosphate, and the parameters a and b are (positive) kinetic constants. Because there are two dependent variables and two parameters, seeking equilibria and classifying their stability are far more challenging than in the previous example. To seek equilibria, set the rates of change (derivatives) of both variables equal to zero, i.e., dx/dt = dy/dt = 0. The result is a pair of algebraic equations, $-x + ay + x^2 y = 0$ and $b - ay - x^2 y = 0$. In general, it is difficult or impossible to solve nonlinear systems like this one algebraically, but this one is not so bad: adding the two equations gives x = b, and substituting back gives the equilibrium x = b, $y = b/(a + b^2)$.

Now that the equilibrium has been identified, how might one characterize the effects of the parameters a and b on its stability? There is a standard mathematical technique for doing so (linear stability analysis), and readers interested in the technical details are encouraged to read Chapters 5 and 6 of Strogatz (1994). Linear stability analysis reveals that the equilibrium is (locally) stable and attracting if the quantity $b^4 + (2a - 1)b^2 + a + a^2$ is positive and is unstable if that quantity is negative. This information can be rendered graphically (see Fig. 1) in the b-versus-a plane to portray the expected behavior of this system. More exactly, the equation $b^4 + (2a - 1)b^2 + a + a^2 = 0$ defines a curve in the b-versus-a plane, a curve that subdivides the plane into two regions and serves as a boundary between stability and instability of the equilibrium. If the kinetic parameters a and b are varied in such a way that the stability boundary is crossed, then a bifurcation occurs, raising the question: how does the dynamical behavior of the system change? In this example, it happens that when the equilibrium loses stability, the system locks into a steady pattern of oscillatory behavior, in which both x and y exhibit a periodic response. The result of such a bifurcation is illustrated in Fig. 2: the left panel illustrates the behavior when a and b are chosen such that the equilibrium is stable and attracting, and the right panel shows oscillations that occur otherwise.
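The stability criterion for the Sel’kov model in Example 2 can be explored numerically. The sketch below is an illustration rather than the author's code: it evaluates the stability quantity at the two parameter pairs shown in the Fig. 2 panels (a = 0.15 and a = 0.06, both with b = 0.6) and integrates the equations with a hand-written fourth-order Runge-Kutta step. The initial condition (1, 1), the step size, the simulation length, and all function names are assumptions chosen here.

```python
def stability_quantity(a, b):
    """b^4 + (2a - 1) b^2 + a + a^2; positive means stable, negative unstable."""
    return b**4 + (2.0 * a - 1.0) * b**2 + a + a**2


def selkov_rhs(x, y, a, b):
    """Right-hand sides of the Sel'kov equations."""
    dx = -x + a * y + x * x * y
    dy = b - a * y - x * x * y
    return dx, dy


def rk4_step(x, y, a, b, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1x, k1y = selkov_rhs(x, y, a, b)
    k2x, k2y = selkov_rhs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, a, b)
    k3x, k3y = selkov_rhs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, a, b)
    k4x, k4y = selkov_rhs(x + dt * k3x, y + dt * k3y, a, b)
    x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
    y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
    return x, y


def late_time_swing(a, b, x0=1.0, y0=1.0, dt=0.01, t_end=400.0):
    """Max minus min of x over the final quarter of the simulation."""
    steps = int(t_end / dt)
    x, y = x0, y0
    late = []
    for i in range(steps):
        x, y = rk4_step(x, y, a, b, dt)
        if i >= 3 * steps // 4:
            late.append(x)
    return max(late) - min(late)


q_left = stability_quantity(0.15, 0.6)    # parameters of the left panel
q_right = stability_quantity(0.06, 0.6)   # parameters of the right panel
swing_left = late_time_swing(0.15, 0.6)
swing_right = late_time_swing(0.06, 0.6)
```

The quantity is positive at a = 0.15 and negative at a = 0.06, so the two parameter pairs lie on opposite sides of the stability boundary. The late-time swing in x is a crude way to tell a trajectory that has settled onto the equilibrium (swing near zero) from one that keeps oscillating (swing of order one).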

Equilibria and Bifurcations in the Molecular Biosciences, Fig. 1 The curve $b^4 + (2a - 1)b^2 + a + a^2 = 0$ forms a boundary between two very different types of dynamical behavior (a stable, attracting equilibrium and sustained oscillations) in the b-versus-a plane, plotted for a from 0.0 to 0.16 and b from 0.0 to 1.0. The bifurcation associated with crossing that curve can either create or destroy sustained oscillations

Example 3 Roughly speaking, the repressilator (Elowitz and Leibler 2000) is a gene regulatory network containing three species x, y, and z with the property that y inhibits production of x, x inhibits production of z, and z inhibits production of y. This sort of mutual repression in a biochemical reaction network can give rise to oscillations, as can be seen through simulation of the following model of the repressilator:

$$\frac{dx}{dt} = \frac{\mu}{1 + y^4} - x, \qquad \frac{dy}{dt} = \frac{\mu}{1 + z^4} - y, \qquad \frac{dz}{dt} = \frac{\mu}{1 + x^4} - z,$$

where the positive parameter μ represents a production rate. The “symmetry” adopted in this crude model (i.e., same production rate μ in each equation and same degree of repression in each denominator) is not necessary, but is helpful in illustrating a bifurcation that the repressilator exhibits as μ is varied. Setting dx/dt = dy/dt = dz/dt = 0 in order to seek equilibria, one finds that there is a steady state in which all three species are equal: $x = y = z = x_{\mathrm{eq}}$, where $x_{\mathrm{eq}}$ satisfies the algebraic equation

$$x_{\mathrm{eq}}\left(1 + x_{\mathrm{eq}}^4\right) = \mu.$$

This equation has precisely one solution regardless of the choice of μ. As in the previous example, linear stability analysis (Strogatz 1994, Section 6.3) can be used to show that the equilibrium state is stable if μ < 2 and unstable if μ > 2. The loss of stability when μ = 2 is an example of a bifurcation, and it is illuminating to explore the dynamical behavior of the system for values of μ on either side of the bifurcation point (see Fig. 3). For μ = 1.8, each variable experiences transient oscillations before gradually settling towards the stable, attracting equilibrium $x_{\mathrm{eq}}$. For μ = 2.2, the system settles into a pattern of persistent periodic oscillations. This is an example of a Hopf bifurcation and is a common mechanism for systems to transition from steady, equilibrium behavior towards steady oscillatory behavior.

Equilibria and Bifurcations in the Molecular Biosciences, Fig. 2 Left panel: Transient oscillations leading to a stable, attracting equilibrium state in the Sel’kov model equations when a = 0.15 and b = 0.6. Right panel: Reducing the parameter a to 0.06 (with b = 0.6) causes the equilibrium to lose stability and leads to sustained oscillations. Both panels plot x against time from 0 to 80

Equilibria and Bifurcations in the Molecular Biosciences, Fig. 3 Left panel: For μ = 1.8 in the repressilator equations, there is a gradual approach to equilibrium $x_{\mathrm{eq}}$ after transient oscillations. Right panel: For μ = 2.2, the equilibrium $x_{\mathrm{eq}}$ is unstable and there are sustained oscillations. Both panels plot x against t from 0 to 30

Example 4 As an illustration of a very different type of bifurcation, consider the following simple model of an activator-inhibitor system:

$$\frac{dx}{dt} = \frac{\alpha x^2}{(1 + y)\left(1 + (x/\kappa)^2\right)} - x, \qquad \frac{dy}{dt} = \gamma\left[\frac{x^2}{1 + (x/\kappa)^2} - y\right].$$

The variables x and y represent concentrations of two chemical species, an activator and an inhibitor (respectively), and the parameters α, κ, and γ are positive. Here, x promotes its own production and that of y, because the factor $x^2/(1 + (x/\kappa)^2)$ increases as x increases. By contrast, y inhibits the production of x through the factor $(1 + y)^{-1}$. The number of equilibria of this system depends upon the choices of the parameters α and κ. To be specific, seek equilibria by setting the rates of change of both concentrations to zero: dx/dt = dy/dt = 0. One way to satisfy dx/dt = 0 is if x = 0, which forces the choice y = 0 in the equation dy/dt = 0 in order to make both derivatives zero simultaneously. In other words, x = y = 0 is an equilibrium, albeit an uninteresting one. There is, however, another possibility: assuming x ≠ 0, set dx/dt = 0, divide through by x, and algebraically solve for y to find that

$$y = \frac{\alpha x}{1 + (x/\kappa)^2} - 1.$$

Now set dy/dt = 0 to obtain another algebraic relationship between x and y, namely,

$$y = \frac{x^2}{1 + (x/\kappa)^2}.$$

Combining these two algebraic equations and eliminating y, one finds that there may be other equilibria if x satisfies the equation

$$\left(1 + \frac{1}{\kappa^2}\right)x^2 - \alpha x + 1 = 0.$$

It happens that this equation does have two solutions if the parameter α is sufficiently large, namely, if $\alpha > \alpha_{\mathrm{bif}} = 2\sqrt{1 + (1/\kappa)^2}$. Let $x_+$ and $x_-$ denote the larger and smaller of these two solutions. Then by linear stability analysis, it is possible to prove the following: (i) The state x = 0 corresponds to a locally stable, attracting equilibrium regardless of the value of α. (ii) For $\alpha > \alpha_{\mathrm{bif}}$, the equilibrium $x_-$ is unstable and the equilibrium $x_+$ is locally stable and attracting. These facts are summarized in Fig. 4, which is an example of a bifurcation diagram. The figure is a graphical depiction of the various equilibria and their stabilities as a function of the parameter α and illustrates the important concept of bistability: coexistence of two distinct equilibria that are locally stable and attracting. Namely, if $\alpha > \alpha_{\mathrm{bif}}$, the system may either approach x = 0 or $x = x_+$ depending upon the initial state of the system. If the system equilibrates to the state $x = x_+$ and the parameter α is slowly reduced, observe what happens when α is reduced below $\alpha_{\mathrm{bif}}$: the equilibria $x_+$ and $x_-$ suddenly disappear, and the system has no choice but to suddenly “jump” to the only remaining equilibrium state, x = 0. A sudden jump from one equilibrium state to another can be very dramatic (or even catastrophic) in an experimental preparation.

Equilibria and Bifurcations in the Molecular Biosciences, Fig. 4 Equilibria and their stabilities for the activator-inhibitor example, plotted as x against α, with the bifurcation at $\alpha = 2\sqrt{1 + \kappa^{-2}}$. For α sufficiently large, there are two stable, attracting equilibrium states that the system might tend towards, depending upon the initial state of the system
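The appearance and disappearance of the nonzero equilibria in Example 4 follow directly from the quadratic $(1 + 1/\kappa^2)x^2 - \alpha x + 1 = 0$. As a small sketch (the function names and the choice κ = 1 are illustrative assumptions, not from the entry), the threshold and the roots can be computed as follows:

```python
import math


def alpha_bif(kappa):
    """Critical value of alpha: 2 * sqrt(1 + 1/kappa^2)."""
    return 2.0 * math.sqrt(1.0 + 1.0 / kappa**2)


def nonzero_equilibria(alpha, kappa):
    """Real roots of (1 + 1/kappa^2) x^2 - alpha x + 1 = 0, smallest first.

    Returns an empty list when the discriminant is negative, i.e. when
    only the equilibrium x = 0 exists.
    """
    c2 = 1.0 + 1.0 / kappa**2
    disc = alpha**2 - 4.0 * c2
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return [(alpha - root) / (2.0 * c2), (alpha + root) / (2.0 * c2)]


kappa = 1.0                               # illustrative value
threshold = alpha_bif(kappa)              # 2 * sqrt(2) for kappa = 1
below = nonzero_equilibria(2.5, kappa)    # alpha below the threshold
above = nonzero_equilibria(3.0, kappa)    # alpha above the threshold
```

For κ = 1 the threshold is 2√2 ≈ 2.83. At α = 2.5 the list is empty, while at α = 3 the quadratic becomes 2x² − 3x + 1 = 0 with roots 0.5 and 1, which play the roles of the smaller and larger nonzero equilibria.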

Discussion and Further Reading Compared to the systems one typically encounters in biochemistry, the examples in this survey article are both lower dimensional (i.e., fewer dependent variables) and highly phenomenological. Still, it should be emphasized that they are useful caricatures of phenomena observed in more complex systems and provide valuable intuition regarding dynamical behavior. The primary message is this: Given a mathematical model of a biochemical process, there are standard mathematical techniques for predicting (i) whether equilibrium states are stable and attracting, (ii) how changing experimental parameters affects the equilibrium states, (iii) parameter values at which bifurcations occur, and (iv) how a bifurcation might affect the observed dynamical behavior of the process. Bifurcations cause dramatic changes in the observed dynamical behavior, such as creation or destruction of equilibrium states, changes in stability of equilibrium states, creation or destruction of oscillations, or even the emergence of chaotic behavior. Readers interested in learning more about equilibria, stability, oscillations, bifurcation, and chaos are encouraged to consult the texts of Strogatz (1994), Murray (2002/2003), and Keener and Sneyd (2009).
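As one concrete instance of techniques (i) and (iii), the stability threshold of the repressilator model in Example 3 can be checked with a few lines of code. The sketch below is not the author's derivation: it relies on the observation, made here as an assumption-labeled shortcut, that the Jacobian at the symmetric equilibrium has the cyclic form −I + gP, where P is a cyclic permutation matrix and g is the slope of μ/(1 + x⁴) at the equilibrium, so that its eigenvalues are −1 + gω for the three cube roots of unity ω. All function names are hypothetical.

```python
import cmath
import math


def x_equilibrium(mu, tol=1e-12):
    """Solve x * (1 + x**4) = mu for x > 0 by bisection."""
    lo, hi = 0.0, max(1.0, mu)   # left-hand side is increasing and >= mu at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * (1.0 + mid**4) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jacobian_eigenvalues(mu):
    """Eigenvalues of the cyclic Jacobian -I + g*P at x = y = z = x_eq."""
    x = x_equilibrium(mu)
    g = -4.0 * mu * x**3 / (1.0 + x**4) ** 2   # slope of mu/(1 + x^4) at x_eq
    omega = cmath.exp(2j * math.pi / 3.0)      # primitive cube root of unity
    return [-1.0 + g * omega**k for k in range(3)]


def is_stable(mu):
    """Stable when every eigenvalue has negative real part."""
    return max(ev.real for ev in jacobian_eigenvalues(mu)) < 0.0


stable_low = is_stable(1.8)    # parameter value of the left panel of Fig. 3
stable_high = is_stable(2.2)   # parameter value of the right panel of Fig. 3
x_at_two = x_equilibrium(2.0)  # equilibrium at the bifurcation point
```

At μ = 2 the equilibrium is x = 1 and the complex pair of eigenvalues sits on the imaginary axis; the equilibrium is stable at μ = 1.8 and unstable at μ = 2.2, in line with the threshold quoted in Example 3.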

Cross-References ▶ Chemical Reaction Kinetics: Mathematical Underpinnings

References Elowitz M, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403:335–338 Keener JP, Sneyd J (2009) Mathematical physiology, 2nd edn, vols 1 and 2. Springer, New York Murray JD (2002/2003) Mathematical biology, 3rd edn, vols 1 and 2. Springer, Berlin Sel’kov EE (1968) Self-oscillations in glycolysis. Eur J Biochem 4:79–86 Strogatz SH (1994) Nonlinear dynamics and chaos. Perseus, Cambridge

ET Recombination ▶ Recombineering

Eukaryotic DNA Replicases

Manal S. Zaher, Muse Oke and Samir M. Hamdan Division of Biological and Environmental Sciences and Engineering, Laboratory of DNA Replication and Recombination, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia

Synopsis The current model of the eukaryotic DNA replication fork includes three replicative DNA polymerases, polymerase α/primase complex (Pol α), polymerase δ (Pol δ), and polymerase ε (Pol ε). The primase synthesizes 8–12 nucleotide RNA primers that are extended by the DNA polymerization activity of Pol α into 30–35 nucleotide RNA-DNA primers. Replication factor C (RFC) opens the polymerase clamp-like processivity factor, proliferating cell nuclear antigen (PCNA), and loads it onto the primer-template. Pol δ utilizes PCNA to mediate highly processive DNA synthesis, while Pol ε has intrinsic high processivity that is modestly stimulated by PCNA. Pol ε replicates the leading strand and Pol δ replicates the lagging strand in a division of labor that is not strict. The three polymerases are composed of multiple subunits and share unifying features in their large catalytic and B subunits. The remaining subunits are evolutionarily not related and perform diverse functions. The catalytic subunits are members of family B, which are distinguished by their larger sizes due to inserts in their N- and C-terminal regions. The sizes of these inserts vary among the three polymerases, and their functions remain largely unknown. Strikingly, the quaternary structures of Pol α, Pol δ, and Pol ε are arranged similarly. The catalytic subunits adopt a globular structure that is linked via its conserved C-terminal region to the B subunit. The remaining subunits are linked to the catalytic and B subunits in a highly flexible manner.

Introduction The tight coupling of eukaryotic DNA replication with the cell cycle has made it difficult to identify and reconstitute eukaryotic replication reactions in vitro. The first insight into the mechanism of eukaryotic DNA replication was possible after the in vitro reconstitution of the simian virus 40 (SV40) replication system (Fig. 1a). SV40 encodes its own replicating helicase, the large T-antigen, which recognizes the viral origin of replication and melts it locally to recruit the remainder of the host replication machinery. DNA synthesis is initiated at the origin of replication by the recruitment of the single-stranded DNA (ssDNA) binding protein, replication protein A (RPA), and Pol α. The primase subunit of Pol α synthesizes an 8–12 nucleotide RNA primer, which is then extended by the DNA polymerase subunit of Pol α into a 30–35 nucleotide RNA-DNA primer. The RNA-DNA primer attached to the ssDNA template provides a structure that can be recognized by the replication factor C (RFC)/proliferating cell nuclear antigen (PCNA) complex. The interaction of RFC/PCNA with the primer-template loads PCNA onto the double-stranded DNA (dsDNA) and hands off the primer-template to the replicative Pol δ. Two copies of Pol δ in complex with PCNA replicate the exposed ssDNA on the leading and lagging strands. In contrast to the continuous leading strand synthesis, the lagging strand requires re-priming of the template strand by Pol α about once every 200 base pairs, the approximate length of an Okazaki fragment, and switching of the primer-template to Pol δ/PCNA for extension. The maturation of the Okazaki fragments to form the contiguous lagging strand is initiated by the strand invasion of the previously synthesized Okazaki fragment by the lagging strand Pol δ, thereby displacing the RNA-DNA primer. The displaced primer is then removed in a reaction that is primarily mediated by flap endonuclease 1 (FEN-1) to generate a nick that is sealed by DNA ligase I.

Extrapolating the architecture of the replication fork from the SV40 model system to the actual eukaryotic cell has been fundamentally challenged by the discovery of a third DNA polymerase in Saccharomyces cerevisiae (Pol ε) (division of labor). Pol ε is a highly processive polymerase, even in the absence of PCNA, and has a 3′–5′ exonuclease activity. It is therefore as qualified a replicase as Pol δ to mediate genomic DNA synthesis. The inability to reconstitute the eukaryotic DNA replication system in vitro has made it difficult to assign the direct role of Pol ε and Pol δ at the replication fork. Genetic experiments in yeast, which utilized proofreading-deficient forms of Pol ε or Pol δ, deletion of Pol ε or Pol δ, and active site mutants of Pol ε or Pol δ with specific mutation signatures on a reporter gene, have demonstrated the involvement of Pol ε at the replication fork and proposed a division of labor in which Pol ε replicates the leading strand and Pol δ replicates the lagging strand (Fig. 1b). However, this division may not be strict because Pol δ can replace Pol ε to result in a replication fork that is fully copied by Pol δ in a similar fashion to that in the SV40 model system. The importance and the timing of this unequal division of labor during the cell cycle remain debatable.

Another critical feature that differs from the SV40 replication fork is the helicase complex, which consists of Cdc45/MCM/GINS (CMG complex) (Fig. 1b). The loading and the activation of this helicase complex remain under intensive investigation and are yet to be fully characterized. The CMG complex interacts with partner proteins via the Ctf4 trimeric protein. Ctf4 contains three binding sites through which it interacts with GINS and simultaneously bridges the CMG complex with up to two more partner proteins including two molecules of Pol α (Fig. 1b).

Eukaryotic DNA Replicases, Fig. 1 (a) A model of the SV40 replication fork. The large T-antigen unwinds the parental DNA strands. The leading strand is copied continuously by DNA polymerase δ (Pol δ) in complex with PCNA. Loop formation on the lagging strand reverses its orientation and aligns it with the leading strand. Another copy of Pol δ/PCNA initiates synthesis of a new Okazaki fragment by utilizing an RNA-DNA primer synthesized by DNA polymerase α (Pol α). The ssDNA binding protein, replication protein A (RPA), coats transiently exposed ssDNA on the lagging strand. Ligation of Okazaki fragments is mediated by flap endonuclease-1 (FEN-1), which cleaves the 5′ RNA-DNA flap, a structure generated by the Pol δ strand displacement synthesis, to create a nick that is sealed by DNA ligase I (ligase). PCNA recruits and coordinates the activities of FEN-1 and ligase. (b) A model of the cellular replication fork. In contrast to the SV40 replication fork, the helicase consists of Cdc45/MCM/GINS (CMG complex) and the leading and lagging strands are replicated by Pol ε and Pol δ, respectively. Furthermore, the Ctf4 trimeric protein interacts with GINS and simultaneously bridges the CMG complex with up to two partner proteins including the ability to bind to two molecules of Pol α

Replicative DNA Polymerases: Accurate and Processive Enzymes

Primary sequence homology and structural analysis of the catalytic subunits of DNA polymerases have established seven DNA polymerase families, termed A, B, C, D, X, Y, and RT. Replicative DNA polymerases belong primarily to families B and C, with a few exceptions belonging to family A. Nonetheless, they all operate according to a set of general mechanistic principles (Bebenek and Kunkel 2004; Garg and Burgers 2005; Johnson and O’Donnell 2005; Hamdan and Richardson 2009; Hamdan and van Oijen 2010; Johansson and Macneill 2010; McHenry 2011; Hogg and Johansson 2012; Pellegrini 2012; Tahirov 2012) that are discussed broadly in this section and explained in detail in ▶ “Bacterial DNA Replicases,” ▶ “DNA Repair Polymerases,” and ▶ “DNA Polymerase III Structure”.

Replicative DNA polymerases, like other DNA polymerases, catalyze a nucleophilic attack by the 3′-hydroxyl group of the primer on the α-phosphate of a dNTP aligned to the template strand. The polymerase verifies the correct base pairing of the incoming dNTP with the template strand before nucleotide incorporation, achieving an accuracy of one mistake for every 10³–10⁵ incorporated nucleotides. In the event that the polymerase incorporates an incorrect nucleotide, it temporarily transfers the primer-template to the associated proofreading exonuclease active site, which removes the incorrectly base-paired nucleotide. This proofreading activity enhances the overall fidelity of the polymerase by more than 100-fold. In polymerases from families A and B, the proofreading and polymerization activities reside in the same polypeptide chain. However, the proofreading domain is oriented next to the polymerization domain in family A and across from it in family B. In polymerases from family C, the proofreading activity resides in a separate polypeptide, and its orientation relative to the polymerization domain is yet to be determined.

During polymerization, the stability of the polymerase/primer-template complex is enhanced by the recognition of the correct nucleotide and its polymerization. However, the movement of the polymerase to the next position in the template provides an opportunity for the polymerase to dissociate from the primer-template. To maintain high affinity for the primer-template and achieve processive DNA synthesis, the polymerase utilizes a processivity factor that is topologically linked to the DNA by interacting simultaneously with the polymerase and the DNA. In polymerases from families B and C, the processivity factor is a ring-shaped clamp that encircles the primer-template and tethers the polymerase to the DNA. The assembly of the processivity clamp requires a multiprotein clamp loader complex that actively opens the clamp and loads it onto the primer-template.

The crystal structures of DNA polymerases from different families and domains of life, in the presence or absence of DNA and with the DNA trapped in the polymerase or exonuclease active sites, have provided in-depth knowledge of their molecular mechanisms. The polymerase adopts a partially closed right-hand framework in which the thumb, palm, and finger subdomains form a DNA-binding groove that directs the primer-template to the polymerase active site (Fig. 2). The palm subdomain contains both the DNA polymerization and exonuclease proofreading activities. The finger subdomain contacts the incoming dNTP and the template base with which it pairs. The thumb subdomain binds to the primer-template and directs it into the polymerase or exonuclease active sites. The DNA polymerization active site utilizes two metal ions that align the 3′-hydroxyl group of the primer and the α-phosphate of the incoming dNTP to mediate an in-line nucleophilic substitution reaction.

The polymerase verifies the correct Watson-Crick base pairing prior to catalysis by maintaining extensive contacts with the primer-template and the incoming dNTP. This is controlled by a rate-limiting conformational change in the finger subdomain between open and closed conformations. If the finger subdomain binds the correctly base-paired nucleotide, it adopts a closed conformation that leads to proper assembly of the polymerization active site and to nucleotide incorporation. The open conformation of the finger subdomain enables the active site to sample nucleotides until the correct one is selected. In the event of incorporation of an incorrect nucleotide, the polymerase transfers the primer-template a distance of 30–45 Å to the exonuclease active site for removal of the mismatched nucleotide. The exonuclease active site utilizes two metal ions that align a water molecule and the scissile phosphate of the incorrect nucleotide for an in-line nucleophilic substitution reaction. The crystal structures of several polymerases with DNA bound at either the polymerase or exonuclease active sites reveal that the tip of the thumb subdomain rotates to direct the primer-template to the exonuclease active site. Furthermore, the primer-template is melted by several nucleotides to expose the 3′-terminal nucleotide for hydrolysis by the exonuclease domain.
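
The fidelity figures quoted in this section can be combined in a short calculation. The following is a minimal sketch added for illustration, not a model taken from the cited literature: it assumes that the base-selection error rate and the proofreading enhancement combine multiplicatively, and it uses 100 as the proofreading factor because the text gives only a lower bound ("more than 100-fold"). The function and variable names are hypothetical.

```python
def overall_error_rate(selection_error_rate: float, proofreading_fold: float) -> float:
    """Errors per incorporated nucleotide after proofreading.

    Illustrative assumption: proofreading divides the base-selection
    error rate by a constant factor.
    """
    if selection_error_rate <= 0:
        raise ValueError("selection_error_rate must be positive")
    if proofreading_fold < 1:
        raise ValueError("proofreading_fold must be at least 1")
    return selection_error_rate / proofreading_fold


# Base selection: one mistake per 10^3 to 10^5 incorporated nucleotides (from the text).
SELECTION_ERROR_RATES = (1e-3, 1e-5)
# Proofreading improves fidelity by more than 100-fold; 100 is the lower bound.
PROOFREADING_FOLD = 100.0

if __name__ == "__main__":
    for rate in SELECTION_ERROR_RATES:
        combined = overall_error_rate(rate, PROOFREADING_FOLD)
        print(f"selection error {rate:.0e} -> at most about {combined:.0e} errors per nucleotide")
```

Under these assumptions, a polymerase with proofreading would make at most roughly one error per 10⁵–10⁷ incorporated nucleotides.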

Overview of Eukaryotic Replicative DNA Polymerases

Five of the seven families of DNA polymerases are represented in eukaryotic cells, which possess at least 15 DNA polymerases with diverse DNA replication and repair functions (Bebenek and Kunkel 2004). Despite this complexity, the current model of the eukaryotic DNA replication fork involves only three DNA polymerases (Pol α, Pol δ, and Pol ε) (Garg and Burgers 2005; Johnson and O’Donnell 2005; Johansson and Macneill 2010; Hogg and Johansson 2012; Pellegrini 2012; Tahirov 2012). These DNA polymerases each consist of a catalytic subunit that is connected to the remaining subunits primarily via the B subunit. The catalytic and B subunits are the most highly conserved subunits, and they form a tightly associated heterodimer that is indispensable to the function of the polymerase. The remaining subunits in Pol α, Pol δ, and Pol ε are evolutionarily unrelated and perform diverse functions.

The large catalytic subunit in Pol α, Pol δ, and Pol ε contains the DNA polymerase and the proofreading exonuclease domains. In Pol α, the exonuclease domain is inactive, which explains its overall lower fidelity in comparison with Pol δ and Pol ε. The catalytic subunits belong to the B-family of polymerases. However, they are larger and more complex than the prokaryotic members of this family, owing to inserts at their N- and C-terminal regions whose functions remain largely unknown. The size of these inserts varies among Pol α, Pol δ, and Pol ε, with Pol ε having a significantly larger C-terminal region that contains an extra inactive polymerase/exonuclease module (Fig. 3a). Indeed, phylogenetic analyses suggest that eukaryotic replicases evolved from two distantly related archaeal B-family polymerases; one led to Pol ε and the other led to Pol α and Pol δ. An important conserved feature of the C-terminal region is the presence of two cysteine-rich metal-binding motifs (CysA and CysB) that interact primarily with the B subunit (Fig. 3a, b). The N-terminal region contains a conserved uracil-recognizing domain, yet this domain does not sense uracil and its function remains unknown (Fig. 3a).

In the quaternary structures of Pol α, Pol δ, and Pol ε, the catalytic subunits adopt globular structures that are linked via a flexible structure to the B subunits and the remaining accessory subunits, as observed by electron microscopy (EM) structural analysis of Pol α (Klinge et al. 2009; Nunez-Ramirez et al. 2011) (Fig. 4a–c) and Pol ε (Asturias et al. 2006) (Fig. 6a–c) and by small-angle X-ray scattering (SAXS) analysis of Pol δ (Jain et al. 2009) (Fig. 5a–c). The B subunits, at least in the case of Pol α and Pol ε, seem to be part of the primer-template-binding crevice and therefore could contribute to the interaction of the catalytic subunits with DNA (Asturias et al. 2006; Klinge et al. 2009) (Figs. 4e and 6d). The flexible linkage of the catalytic subunits to the remaining subunits might coordinate the interaction of the polymerase with the DNA and grant the remaining subunits independence to interact with other replication proteins in a highly dynamic manner.

Pol α contains two extra subunits that provide the primase functionality (Fig. 3b). Consequently, Pol α initiates DNA synthesis on the leading and lagging strands at the origin of replication and at Okazaki fragments on the lagging strand during elongation. The smaller subunit carries the primase activity, and the larger one plays a versatile role during primer synthesis. Consistent with its role as an initiator, Pol α has been found to interact with a wide range of initiation and replication fork proteins (Garg and Burgers 2005). The two subunits have elongated structures that bind the B subunit and might extend to the C-terminal or even the N-terminal regions of the catalytic subunit (Fig. 4a, b) (Nunez-Ramirez et al. 2011).

Pol δ contains one extra subunit in yeast and an additional, smaller fourth subunit in humans (Fig. 3b). The third subunit of Pol δ is exceptionally elongated in shape and interacts with the B subunit. It also contains a consensus PCNA-interacting protein box (PIP box) (QxxLxxFF) that contributes, as a secondary site, to the functional stimulation of the processivity of DNA synthesis by PCNA (Netz et al. 2012). The primary site is the CysA motif in the C-terminal region of the catalytic subunit (Netz et al. 2012).

Pol ε contains two extra subunits (Fig. 3b), which form, along with the B subunit, a central channel that could potentially interact with the primer-template (Fig. 6d). This would explain the intrinsically high processivity of Pol ε in the absence of PCNA. Nonetheless, PCNA can still enhance the processivity of Pol ε, but only modestly, through a mechanism that is as yet uncharacterized. The catalytic subunit of Pol ε contains a PIP box motif that is buried in the middle of the catalytic subunit due to gene duplication. It is therefore unable to support the interaction with PCNA. However, phenotypic characterization of point mutations or deletion of this motif in yeast has indicated that it might be important for damage-induced DNA repair rather than for DNA replication.

The availability of low-resolution quaternary structures of Pol α, Pol δ, and Pol ε and high-resolution structures of some of their domains, subunits, and subassemblies has significantly enhanced our knowledge of their mechanisms at the molecular level. This review focuses on the molecular mechanisms of Pol α, Pol δ, and Pol ε based on their available structures, primarily from yeast and humans.

Eukaryotic DNA Replicases, Fig. 2 Panels: (a) Pol δ, (b) RB69 Pol, (c) Pol ε, (d) Pol α. Residue ranges are given in parentheses after the protein name. (a) The crystal structure of yeast Pol3(68–985) bound to the primer-template and incoming nucleotide. (b) The crystal structure of RB69 Pol bound to the primer-template and incoming nucleotide. (c) The crystal structure of yeast Pol2(1–1228) bound to the primer-template and incoming nucleotide (left). On the right, a 180° rotation shows the same structure from a different orientation, revealing the P-domain. (d) The crystal structure of yeast Pol1(349–1258) bound to the primer-template and incoming nucleotide. For comparison, the four DNA polymerases are shown in similar orientations and with similar subdomain coloring. The DNA polymerase adopts a right-hand framework with the fingers (blue) contacting the incoming nucleotide (magenta) and the template strand, the palm containing the DNA polymerase (wheat) and the 3′–5′ exonuclease (cyan) domains, and the thumb (pale green) binding and directing the primer-template (orange) to the polymerase active sites. The N-terminal region is colored yellow and the P-domain is shown in chocolate. In the Pol1 structure, the RNA is shown in gray. The arrows indicate the polymerase and exonuclease active sites and the β-hairpin in the exonuclease domain. The PDB code for RB69 Pol is 1IG9 (Franklin et al. 2001), for Pol3(68–985) is 3IAY (Swan et al. 2009), for Pol2(1–1228) is 4M8O (Hogg et al. 2014), and for Pol1(349–1258) is 4B08 (Perera et al. 2013)

Eukaryotic DNA Replicases, Fig. 3 (a) A schematic diagram of the conserved regions in Pol1, Pol3, and Pol2 in yeast, which are the catalytic subunits of the replicative polymerases Pol α, Pol δ, and Pol ε, respectively, in comparison with bacteriophage RB69 Pol (this drawing is roughly to scale and is modified from Pavlov and Shcherbakova (2010)). Legend: 3′–5′ proofreading exonuclease; 5′–3′ DNA polymerase; cysteine-rich metal-binding motif CysA (binds zinc); cysteine-rich metal-binding motif CysB (binds iron); uracil recognition; x, inactive domain; CTD, C-terminal domain. (b) A schematic of the subunit composition of Pol α, Pol δ, and Pol ε from Saccharomyces cerevisiae (left side) and humans (right side)

The Large Catalytic Subunit of Pol α, Pol δ, and Pol ε

The catalytic subunits of three human polymerases, Pol α (p180; 165.9 kilodalton [kDa]), Pol δ (p125; 123.6 kDa), and Pol ε (p261; 261.5 kDa), and their respective Saccharomyces cerevisiae counterparts, Pol1 (166.8 kDa), Pol3 (124.6 kDa), and Pol2 (255.7 kDa) (Fig. 3b), are members of the B-family of DNA polymerases. A unique feature of the eukaryotic catalytic subunits is the presence of large inserts at both their N- and C-terminal regions (Fig. 3a). The crystal structure of the truncated catalytic region of yeast Pol δ (residues 68–985), which contains the DNA polymerization and exonuclease proofreading domains as well as most of the N-terminal domain, provided the first molecular-level information on the catalytic subunit of eukaryotic replicases (Swan et al. 2009) (Fig. 2a). Comparison of this structure with RB69 Pol (Franklin et al. 2001) (Fig. 2b), a replicase whose structure has been used as an archetype for B-family DNA polymerases, highlights the unifying and unique features of eukaryotic replicases in relation to the rest of the B-family DNA polymerases.

The palm subdomain that contains the DNA polymerization active site is arranged in a manner highly similar to that observed in RB69 Pol. However, one striking difference in the active site of Pol δ is the presence of three metal ions. As observed in other structurally characterized DNA polymerases, two metal ions are positioned such that they could activate the 3′-OH group of the primer to perform a nucleophilic attack on the α-phosphate of the incoming dNTP and stabilize the pentacovalent transition state during the reaction. The additional metal ion in Pol δ could stabilize the β,γ-pyrophosphate leaving group and affect the incorporation efficiency of correct and incorrect nucleotides.

The finger subdomain that binds the incoming dNTP is much simpler than that of RB69 Pol. Nonetheless, it still binds the incoming dNTP in a manner similar to other DNA polymerases, where the closed conformation assembles the polymerase active site and enables it to verify the correct Watson-Crick base pairing prior to catalysis. This sensing mechanism is mediated by extensive contacts with the minor groove of the nascent primer-template. However, Pol δ is unique in that it contacts and verifies the DNA structure up to four or five base pairs away from the site of nucleotide incorporation, while RB69 Pol and A-family DNA polymerases contact only one or two base pairs. This enables Pol δ to detect a mismatch directly by reading the Watson-Crick base pairing further away from the site of incorporation. The tip of the thumb subdomain is shifted upward towards the 5′ end of the template relative to that in RB69 Pol.

The exonuclease domain has an arrangement similar to that in RB69 Pol except in the conformation of a β-hairpin. This secondary structure extends from the exonuclease domain and is known to play a central role in DNA partitioning between the polymerase and exonuclease active sites. In contrast to RB69 Pol, the β-hairpin in Pol δ is firmly rooted in the primer-template major groove and interacts extensively with the unpaired portion of the template strand. Consequently, the two regions that are critical for partitioning the primer-template between the polymerase and exonuclease active sites, the tip of the thumb subdomain and the β-hairpin, show more flexibility in Pol δ.

The N-terminal domain of Pol δ, which appears to interact with the unpaired portion of the template, adopts a more complicated structure than that of RB69 Pol and resembles the N-terminal domain of the DNA polymerase I of herpes simplex virus, another member of the B-family DNA polymerases. Although the function of this domain remains largely unknown, structural homology of two of its three motifs indicates that it might be involved in RNA and/or DNA binding. Motif I resembles the oligonucleotide/oligosaccharide-binding (OB) fold seen in ssDNA-binding proteins; in DNA polymerase III, the Escherichia coli replicase, this fold is proposed to guide the template strand to the polymerase active site. Motif II resembles the RNA-binding fold in ribonucleoproteins and is therefore suspected to bind RNA during strand displacement synthesis by Pol δ, when it encounters the RNA-DNA primer of the previous Okazaki fragment. This activity is integral to ligating Okazaki fragments, since it creates bifurcated structures at their junctions that can be recognized and cleaved by the structure-specific nuclease FEN1 to generate a ligatable nick.

Eukaryotic DNA Replicases, Fig. 4 (a) Three-dimensional EM reconstruction of the four-subunit yeast Pol α. This structure represents one conformation obtained by clustering a relatively homogeneous set of single particles. The Pol α complex consists of Pol1(349–1468)/Pol12(246–705)/Pri1/Pri2(49–513), with residue ranges given in parentheses. The polymerase is arranged in a bilobal structure connected by a linker region (This figure is adapted with permission from Fig. 1E in Nunez-Ramirez et al. (2011)). (b) The subunit composition drawn in accordance with the subunit assignment in (a), showing the catalytic subunit occupying the large lobe and the remaining subunits occupying the smaller lobe. (c) Two-dimensional reference-free averages of the four-subunit Pol α demonstrating the flexible arrangement between Pol1 and Pol12/Pri1/Pri2. Each raw image shows a set of averages where the Pol1 lobe is oriented according to the same view (This figure is adapted with permission from Fig. 9.4 in Pellegrini (2012)). (d) The EM structure of Pol1(349–1468)/Pol12(246–705), into which the crystal structures of the Pol1(1263–1468)/Pol12(246–705) complex (Fig. 7a) and of the polymerase from the archaeon Thermococcus gorgonarius, a B-family replicative DNA polymerase that resembles the Pol1 catalytic region, were fitted. Pol1(1263–1468) fits well in the linker region and Pol12(246–705) occupies the small lobe (This figure is adapted with permission from Fig. 5B in Klinge et al. (2009)). (e) Modeled DNA in the EM structure of Pol1(349–1468)/Pol12(246–705). An extended B-form dsDNA from the ternary complex of T. gorgonarius fits well in the cavity between Pol1 and Pol12 (This figure is adapted with permission from Fig. 7C in Klinge et al. (2009))

Eukaryotic DNA Replicases, Fig. 5 (a) A model of the average solution structure of the yeast Pol3/Pol31/Pol32(1–103) complex analyzed by small-angle X-ray scattering (SAXS) (This structure is adapted with permission from Fig. 2A in Swan et al. (2009)). (b) The subunit composition drawn in accordance with the subunit assignment in (a) and as proposed in Swan et al. (2009), showing Pol3 occupying the large lobe and Pol31/Pol32(1–103) the extended tail. (c) Representative simulated models that best fit the SAXS data, showing variability in the orientation between Pol3 and Pol31/Pol32(1–103) (This structure is adapted with permission from Fig. 2A in Swan et al. (2009))

Eukaryotic DNA Replicases, Fig. 6 (a) A three-dimensional cryo-EM reconstruction of the four-subunit yeast Pol ε (This figure is adapted with permission from Fig. 3 in Asturias et al. (2006)). (b) The subunit composition drawn in accordance with the subunit assignment in (a), showing Pol2 occupying the large lobe and Dpb2/Dpb3/Dpb4 the extended tail. (c) Two-dimensional reference-free averages from different class averages showing the motional range and flexibility between Pol2 and Dpb2/Dpb3/Dpb4 (This figure is adapted with permission from Fig. 8B in Asturias et al. (2006)). (d) A model for the interaction of Pol ε with DNA. The dsDNA was modeled along the central channel of the tail structure to form a close fit with a length that corresponded to the dsDNA footprint required for stimulating the processivity of DNA synthesis by Pol ε (This figure is adapted with permission from Fig. 9 in Asturias et al. (2006))

Recently, the structural knowledge of the catalytic subunits of eukaryotic B-family DNA polymerases has been expanded to include the crystal structures of the catalytic subunits of both yeast Pol ε and Pol α, Pol2 and Pol1, respectively. The structure of the catalytic core of Pol ε, Pol ε(1–1228) (Hogg et al. 2014) (Fig. 2c), reveals a “right-hand” fold architecture similar to that of Pol δ (Swan et al. 2009) (Fig. 2a). It also shows an N-terminal domain (NTD) that is partially unstructured but overall has a secondary structural organization similar to that observed in Pol δ. However, Pol ε has several structural differences in comparison with Pol δ, including insertions totaling 260–300 residues at different locations in the catalytic core. The extended β-hairpin loop structure in the exonuclease domain is shorter in Pol ε (Fig. 2c cf. a) and has lost its contact with DNA, suggesting that it would probably fail to facilitate the same type of communication between the exonuclease and polymerase domains as seen in Pol δ. The fingers subdomain shares the same two-α-helix structure as in Pol δ but is longer by about two turns (Jain et al. 2014). The thumb subdomain is in close contact with the dsDNA and exhibits a structure comparable to that in Pol δ, with the exception of additional α-helices (Hogg et al. 2014). The significance of this insertion to the processivity and fidelity of DNA synthesis is as yet undefined. The active site coordinates two metal ions, which is similar to the two-divalent-ion coordination mechanism seen in most DNA polymerases but different from the three metal ions seen in Pol δ (Jain et al. 2014).

The palm subdomain contains several insertions, including those that span residues 533–555 and 682–760, which together form a novel large domain termed the P-domain that plays a key role in enhancing the intrinsic processivity of Pol ε (Fig. 2c) (Hogg et al. 2014). The P-domain extends outward from the palm subdomain to the dsDNA and adopts an elongated structure that allows Pol ε to encircle the dsDNA, while its tip is potentially flexible enough to swing towards the thumb and possibly come into closer contact with the DNA (Hogg et al. 2014). The P-domain is suggested not only to affect the processivity of Pol ε but also to explain its high fidelity. It extends the interactions of Pol ε with dsDNA to include the first 9 nucleotides of the nascent primer strand and the first 10 nucleotides of the template strand, thus sensing replication errors as far as 30–45 Å away from the active site (Hogg et al. 2014). The P-domain has a distorted region (residues 663–677) that seems to have a metal-binding site with four cysteines (Hogg et al. 2014). Spectroscopic measurements show that this metal-binding motif is most likely an Fe-S cluster, which gives the catalytic core of Pol ε its yellowish color.
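
The reach of the P-domain contacts can be placed on a rough length scale. The sketch below is an illustration added for orientation, not a calculation from Hogg et al. (2014): it assumes an idealized B-form duplex with an axial rise of about 3.4 Å per base pair (a textbook value that is not given in this entry), and the function and constant names are hypothetical.

```python
# Assumed axial rise per base pair for idealized B-form DNA, in angstroms.
B_DNA_RISE_ANGSTROM = 3.4


def duplex_length_angstrom(base_pairs: int, rise: float = B_DNA_RISE_ANGSTROM) -> float:
    """Approximate length along the helix axis of a duplex of `base_pairs` base pairs."""
    if base_pairs < 0:
        raise ValueError("base_pairs must be non-negative")
    return base_pairs * rise


if __name__ == "__main__":
    # 9 primer-strand and 10 template-strand nucleotides are contacted (from the text).
    for n in (9, 10):
        print(f"{n} bp is roughly {duplex_length_angstrom(n):.1f} angstroms")
```

Under this assumption, contacts spanning 9 to 10 nucleotides correspond to roughly 31–34 Å, which falls within the lower part of the 30–45 Å range quoted above.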

The catalytic core of Pol α, Pol α(349–1258) (Fig. 2d) (Perera et al. 2013), shares the main structural organization of the catalytic core of Pol δ (Fig. 2a): it consists of an N-terminal domain (NTD), an exonuclease domain, and the universal “right-hand” DNA polymerase fold. While the NTD is similar in structure to that of Pol δ, the exonuclease domain of Pol α is catalytically inactive. The fingers subdomain consists of two antiparallel α-helices that are similar in structure and length to the fingers subdomain of Pol δ (Perera et al. 2013). Pol α contacts the primer/template duplex mainly within a region of seven base pairs from the 3′-terminus, which is significantly smaller than what is reported for other B-family DNA polymerases (Kuchta et al. 1990; Perera et al. 2013). Yet this smaller footprint fits well with the minimal primer size that is used by Pol α.

The structure of Pol α shows a unique contact with the substrate RNA primer/DNA template that has not been observed in other B-family DNA polymerases (Perera et al. 2013). In most DNA polymerases, including Pol δ, the palm subdomain generally interacts with the primer-template to bring the 3′-terminus of the primer into an optimal arrangement for catalysis, while the thumb subdomain holds onto the dsDNA (Franklin et al. 2001; Swan et al. 2009). In the case of Pol α, the palm subdomain makes basic interactions with the first three base pairs of the primer/template, but the thumb subdomain interacts almost exclusively with the RNA primer strand (Perera et al. 2013). The interaction with the ribose and phosphate moieties of the RNA backbone is mediated by two motifs that include residues 1074–1077 in the long segment of the thumb and residues 1130–1150 in the tip of the thumb (Perera et al. 2013). These interactions explain the preference of Pol α for an RNA primer/DNA template over a DNA primer/DNA template. Furthermore, the primer/template duplex bound to Pol α takes a general A-form conformation, as opposed to the B-form conformation observed when it is bound to other polymerases such as Pol δ (Perera et al. 2013). This suggests that the conformational change of the duplex from A- to B-form might play a role in the mechanism that terminates primer extension by Pol α and promotes its hand-off to other replicative polymerases such as Pol δ and Pol ε.

The sequence conservation in the catalytic subunits of Pol α, Pol δ, and Pol ε extends beyond the catalytic core to a C-terminal domain containing two cysteine-rich metal-binding motifs (CysA and CysB) (Fig. 3a) that are critical for the quaternary structure assembly of the polymerase (Netz et al. 2012). In the crystal structure of the C-terminal domain of yeast Pol1 (residues 1263–1468) in complex with the C-terminal region of its B subunit, Pol12 (residues 246–705), both CysA and CysB bind zinc ions and contribute to the interaction with the B subunit (Klinge et al. 2009) (Fig. 7a). However, further in vitro and in vivo analyses demonstrate that CysA binds zinc and CysB binds iron (Netz et al. 2012). The zinc-bound CysA is critical for the functional interaction of Pol δ with PCNA (Netz et al. 2012) and is likely to play a similar role in Pol ε. However, its potential role in Pol α cannot be assessed, since a functional interaction between Pol α and PCNA has not been demonstrated. The iron-bound CysB motif is critical for the quaternary structure assembly of Pol α, Pol δ, and Pol ε (Netz et al. 2012), presumably via its interaction with the B subunit (Klinge et al. 2009) (Fig. 7a). It should be noted that in Pol ε, the CysA and CysB motifs are separated from the polymerase and exonuclease domains by an insert consisting of an inactive polymerase/exonuclease module that is distantly related to the first one (Fig. 3a). The function of this module is unknown, but it is proposed to play a structural role.

The B Subunit of Pol α, Pol δ, and Pol ε

The B subunits in the human polymerases, Pol α (p68; 66 kDa), Pol δ (p50; 51.3 kDa), and Pol ε (p59; 59.5 kDa), and their respective S. cerevisiae counterparts, Pol12 (78.8 kDa), Pol31 (55.3 kDa), and Dpb2 (78.3 kDa) (Fig. 3b), are indispensable to polymerase function. They play a central role in the assembly of the polymerase by bridging the catalytic subunit to the remaining subunits. Furthermore, the B subunit might modulate the interaction of the catalytic subunit with DNA and participate in the interactions of the polymerase with other proteins (Garg and Burgers 2005; Hogg and Johansson 2012; Pellegrini 2012; Tahirov 2012).

The C-terminal region in the B subunits is highly conserved and consists of an inactive phosphodiesterase (PDE)-like domain that is connected via a short proline-rich segment to an amino-proximal OB fold domain (Fig. 7a) (Klinge et al. 2009). The extreme N-terminal region is poorly conserved and is absent in Pol δ. Interestingly, the N-terminal region of the B subunits of Pol α and Pol ε appears to harbor a winged helix-turn-helix (wHTH) motif that is present in the third subunit of Pol δ and mediates its interaction with the B subunit (Fig. 7b) (Baranovskiy et al. 2008). Furthermore, in Pol ε-like polymerases, the extreme N-terminal region contains a conserved domain of unknown function that resembles the ATPases associated with various cellular activities (AAA+ ATPases) (Nuutinen et al. 2008).

The crystal structure of Pol12(246–705) in complex with the CysA/CysB-containing C-terminal region of Pol1, Pol1(1263–1468), provides information on the interaction between the B and catalytic subunits (Klinge et al. 2009) (Fig. 7a). The CysA and CysB motifs are bound to zinc and folded into two different lobes, with CysA being larger than CysB. An extended three-helix bundle connects both lobes and forms an asymmetric saddle-like structure. The PDE and OB domains adopt a compact structure and make extensive contacts with the C-terminal region of Pol1, through which PDE and OB bind CysA and CysB, respectively. Docking this crystal structure into the EM structure of the larger Pol1 fragment, Pol1(349–1468), which includes the polymerase and exonuclease domains, in complex with Pol12(246–705) places the C-terminal domain of Pol1 in the linker region that connects the large lobe of Pol1 with the smaller lobe of Pol12 in the bilobal structure (Fig. 4d) (Klinge et al. 2007, 2009; Nunez-Ramirez et al. 2011). The linker region in the four-subunit EM structure of Pol α is highly flexible, as is evident from the multiple orientations of the catalytic subunit relative to the remaining subunits (Fig. 4c) (Nunez-Ramirez et al. 2011). The catalytic subunits in the quaternary structures of Pol δ and Pol ε also adopt a globular structure that is connected to the remaining subunits via a highly flexible linker (Asturias et al. 2006; Jain et al. 2009) (Figs. 5c and 6c). Given the high degree of conservation in the CysA, CysB, PDE, and OB domains among the three polymerases and the similarity in their quaternary structures, the assignment of the C-terminal domain of Pol1 to the linker region and its interaction with the PDE and OB domains in Pol12 are strongly suggested to extend to Pol δ and Pol ε.

The B subunit can also contribute to the interaction with DNA. In Pol α, an extended channel is observed between Pol1 and Pol12, into which a modeled primer-template could form a close fit (Fig. 4e). In Pol ε, the B subunit (Dpb2) and the two additional subunits, Dpb3 and Dpb4, also form an extended channel that can fit a modeled primer-template (Asturias et al. 2006) (Fig. 6d).

The crystal structure of the B subunit (p50) in complex with the N-terminal domain of the third subunit, p66(1–144), from human Pol δ provides information on the interaction between the B and third subunits (Fig. 7b). The PDE domain of the p50 subunit interacts with the wHTH domain of the p66(1–144) fragment (Baranovskiy et al. 2008). The wHTH domain is significantly similar to sequences in the N-terminal regions of the B subunits of Pol α and Pol ε. However, no similar regions exist in the B subunit of Pol δ. This suggests that the interaction of the wHTH domain with PDE in Pol δ might also apply to Pol α and Pol ε and that the wHTH domain of p66 and the p50 subunit in Pol δ should together be considered counterparts of the B subunits in Pol α and Pol ε (Baranovskiy et al. 2008). The wHTH domain represents a fold that binds dsDNA and, together with the OB fold from the p50 subunit, might play a role in regulating the interaction of Pol δ with DNA.

In addition to its role in the assembly of the polymerase and the potential interaction with DNA, the B subunit has been shown to mediate other activities. In Pol δ, the p50 subunit contains a PIP box motif, whose functional interaction with PCNA is uncharacterized. The p50 subunit also interacts with a number of proteins, including the p21 cyclin-dependent kinase inhibitor, the polymerase δ-interacting proteins (PDIPs) PDIP1, PDIP38, and PDIP46, and the WRN helicase. In Pol α, post-translational modification of the B subunit might play a role in regulating its function as an initiator of DNA synthesis. Phosphorylation of the C-terminal region of the B subunit during the S and G2 phases is speculated to initiate Okazaki fragment synthesis during the elongation phase, while the hypophosphorylated species initiates DNA replication at the origin.

Eukaryotic DNA Replicases, Fig. 7 (a) Crystal structure of the Pol1(1263–1468)/Pol12(246–705) complex from yeast Pol α, with residue ranges given in parentheses. The phosphodiesterase (PDE) and oligonucleotide/oligosaccharide-binding (OB) domains (yellow) of Pol12 adopt a compact structure and make extensive contacts with the C-terminal region of Pol1 (cyan), whereby PDE and OB bind the cysteine-rich metal-binding motifs CysA and CysB, respectively. (b) The crystal structure of p66(1–144)/p50 of human Pol δ. The PDE domain (green) is bound to the OB domain (purple) on one side and to the wHTH domain (cyan) on the other side. The PDB code for the Pol1(1263–1468)/Pol12(246–705) structure is 3FLO (Klinge et al. 2009) and for p66(1–144)/p50, it is 3E0J (Baranovskiy et al. 2008)

The Primase Accessory Subunits of Pol a Primases are DNA-dependent RNA polymerases that carry out de novo synthesis of short RNA primers needed to initiate the DNA polymerase activity (Frick and Richardson 2001). In prokaryotes, the primase is encoded by a single polypeptide chain, whereas in eukaryotes and archaea, the primase is encoded by a heterodimer complex that consists of a small catalytic subunit termed p48 (49.9 kDa) in humans and Pri1 (47.7 kDa) in S. cerevisiae and a large regulatory subunit termed p58 (58.8 kDa) in humans and Pri2 (62.3 kDa) in S. cerevisiae (Fig. 3b). The de novo primer synthesis by eukaryotic primases generates a unit length of 8–12 nucleotides, which is then extended by multiple rounds of unit length extension. The presence of the DNA polymerization activity of Pol a suppresses the multimer RNA primer synthesis and attenuates the RNA-DNA primer length. Although the exact role of the primase regulatory subunit is not well elucidated, it is believed to be involved in stabilizing the primase catalytic subunit, initiating primer synthesis, determining primer length, releasing the primer-template from the primase, and handing off the primer-template to the polymerase subunit (Klinge et al. 2007; Vaithiyalingam et al. 2010). Crystal structures of the catalytic subunit PriS from Pyrococcus furiosus (Augustin et al. 2001), the PriS/UTP complex from Pyrococcus horikoshii (Ito et al. 2003), and the PriS/Nterminal region regulatory subunit PriL1–212 complex (Lao-Sirieix et al. 2005) (Fig. 8a) serve as model systems for understanding the molecular mechanism of the evolutionarily related eukaryotic primases. The structure of the missing C-terminal region of the large regulatory subunit, which is highly conserved in archaea and eukaryotes and is critical for primer synthesis, has also

Eukaryotic DNA Replicases, Fig. 8 (a) Crystal structure of the archaeal PriS/PriL1–212 complex. PriS is depicted in green and PriL1–212 in cyan. The PDB code for the PriS/PriL1–212 structure is 1ZT2 (Lao-Sirieix et al. 2005). (b) A model for the interaction of PriS/PriL1–212 with an RNA primer-DNA template. The protein is shown as a molecular surface, the RNA primer in cyan, and the DNA template in orange. PriL is proposed to contact the RNA primer when PriS extends it to reach the unit length (This figure is adapted with permission from Fig. 6D in Lao-Sirieix et al. (2005)). (c) Crystal structure of the human primase PriS/PriL1–253 complex (PDB ID 4BPW) (Kilkenny et al. 2013). PriS is shown in green and PriL in cyan. (d) Superposition of archaeal PriS structures from Sulfolobus solfataricus (PDB ID 1ZT2) and Pyrococcus furiosus (PDB ID 1G71) with human PriS. The polypeptide chains are drawn as thin tubes and are colored according to local average RMSD value, from red (lower values) to white (higher values). The position of the active site is indicated by a black arrow (The figure is taken with permission from Fig. S2 in Kilkenny et al. (2013))

been determined in yeast (Sauguet et al. 2010) and humans (Vaithiyalingam et al. 2010; Agarkar et al. 2011). The C-terminal region of the large regulatory subunit contains an iron-sulfur cluster and binds to ssDNA, dsDNA, and primed DNA, with a strong preference for primed DNA (Vaithiyalingam et al. 2010). PriL1–212 consists of a large domain that presumably binds to Pol α and a smaller domain that binds to PriS (Fig. 8a). The catalytic active site in PriS that contains the conserved metal ion and nucleotide binding sites is spatially

distant from the N-terminal region of PriL, suggesting that the N-terminal region might not play a direct role in catalysis, i.e., nucleotide binding and polymerization (Lao-Sirieix et al. 2005). However, PriL stimulates primer synthesis and the hand-off of the primer to the polymerase subunit once it reaches the unit length. Basic residues in PriL that reside at the heterodimer interface with PriS bind to the primer-template and provide an extra DNA contact that could lead to stimulation of primer synthesis (Lao-Sirieix et al. 2005). A model of an RNA primer-DNA template bound to the PriS/PriL1–212 complex

suggests that PriL would contact the RNA primer when it reaches its unit length (Fig. 8b). Consequently, PriL would be able to count the length of the primer and facilitate its hand-off to the polymerase subunit (Lao-Sirieix et al. 2005). The C-terminal region of PriL is important for the initiation step of primer synthesis but not for primer extension (Klinge et al. 2007). Therefore, PriL could still contribute to catalysis by engaging the active site via its C-terminal region. In this model, the C-terminal region is proposed to bind to the template strand and the diribonucleotides at the active site (Pellegrini 2012) in a manner analogous to that observed in prokaryotic primases (Frick and Richardson 2001). Interestingly, two bona fide conformations were observed for the C-terminal region of the large regulatory subunit in humans (Agarkar et al. 2011). This conformational change might also facilitate the dissociation of the primer-template from the primase when the primer reaches its unit length. Finally, PriS contains a zinc-binding motif as a highly conserved feature (Fig. 8a). However, neither the sequence and structure of this motif nor its insertion position in PriS is conserved. Although the function of this motif is unknown, it has been proposed to enhance the processivity of primer synthesis by providing a binding site for the template ssDNA (Fig. 8b) (Lao-Sirieix et al. 2005; Pellegrini 2012).

Recently, the first crystal structure of a eukaryotic primase, the human enzyme, was solved. The structure provides information on the full-length PriS and the C-terminally truncated PriL (Kilkenny et al. 2013). It shows that human primase adopts an overall domain architecture similar to that of its archaeal counterpart (Fig. 8c cf. a). Human PriS takes the shape of a flat, slab-like particle and superimposes well on the archaeal PriS for 180 residues (Fig. 8d). Human PriS shares with the archaeal primase the "prim fold" unique to archaeo-eukaryotic primases and the metal-coordinating triad of aspartates (Pellegrini 2012; Kilkenny et al. 2013) (Fig. 8c cf. a). Moreover, the structure confirms the presence of a zinc-binding motif in PriS. However, the sequences, structures, and positions of the archaeal and human zinc-binding motifs do not match (Fig. 8c cf. a).

Another key difference is the presence of a unique all-helical domain that does not share sequence or structural homology with the archaeal primase and seems to be species specific (Kilkenny et al. 2013) (Fig. 8c cf. a). The functions of this domain and of the zinc-binding motif remain unknown. The structure of human PriS was solved separately (Vaithiyalingam et al. 2010) and conforms to the general structure observed in the complex with PriL (Kilkenny et al. 2013). The C-terminally truncated human PriL, similar to its archaeal counterpart, consists of a large all-helical domain and a smaller domain with mixed alpha helices and beta-sheets (Fig. 8a–c) (Kilkenny et al. 2013). The smaller domain in both the human and archaeal proteins interacts with the narrow edge of PriS (Fig. 8a–c). This positions the rest of PriL at a flexible, tilted angle that allows it to reach over to the active site of PriS. Hence, PriL presumably acts as a structural arm that positions the C-terminal domain containing the Fe-S cluster in the vicinity of the active site (Kilkenny et al. 2013).

The mechanism of the internal transfer of the unit-length primer from the primase to the polymerase subunit is unknown. The primase occupies a distal site relative to the polymerase catalytic subunit, suggesting that a significant conformational change would be necessary to transfer the primer-template from the primase to the polymerase active site (Fig. 4a). The highly flexible linker region between the two activities allows the primase to adopt multiple orientations relative to the polymerase catalytic subunit (Fig. 4c) (Nunez-Ramirez et al. 2011). Consequently, primer hand-off could be mediated by conformational change(s) triggered when the primer reaches the unit length. The large regulatory subunit is very elongated in shape and might contact the C-terminal region of Pol1 and extend even beyond it to the N-terminal region of Pol1 (Nunez-Ramirez et al. 2011). As discussed above, the C-terminal domain of PriL could play a major role in the conformational changes underlying the counting and primer hand-off mechanisms (Klinge et al. 2007; Agarkar et al. 2011). Furthermore, the structures of both archaeal and human primases confirm the flexibility between PriS and

PriL (Lao-Sirieix et al. 2005; Kilkenny et al. 2013). Therefore, PriL might be well suited to coordinate the conformational change(s) during primer synthesis and hand-off to the polymerase subunit for extension. It is worth noting that a structure of the human primase with a primase-binding motif of human Pol α fused through a long linker to the N-terminus of human PriL was used to map the interactions between Pol α and the helical domain of PriL (Kilkenny et al. 2013). The structure shows that PriL is tethered to Pol α with specificity and high affinity without compromising the presumed large-scale conformational changes that could occur within the complex during primer hand-off (Kilkenny et al. 2013).

The Accessory Subunits in Pol δ

In addition to the catalytic and B subunits, most Pol δ polymerases contain a third subunit, termed p66 (51.4 kDa) in humans and Pol32 (40.3 kDa) in S. cerevisiae (Fig. 3b). In some eukaryotes, Pol δ also contains a fourth subunit that has limited sequence conservation. This subunit is referred to as Cdm1 (22 kDa) in Schizosaccharomyces pombe and p12 (12.4 kDa) in humans. Pol32 is not essential for growth, suggesting that the catalytic and B subunits are sufficient to support DNA replication. However, Pol32 plays a critical role in mediating the switch from DNA replication to DNA polymerase ζ (Pol ζ)-dependent translesion DNA synthesis, in break-induced recombination, and in telomerase-independent telomere maintenance (Garg and Burgers 2005; Acharya et al. 2009; Tahirov 2012). Pol32 contains a globular N-terminal domain, wHTH, which interacts with the B subunit, and an extended C-terminal region, which interacts with both PCNA and the Pol ζ-associated protein, Rev1. The wHTH domain in the N-terminal fragment of p66 (residues 1–144) interacts with the PDE domain of the p50 subunit (Fig. 7b) (Baranovskiy et al. 2008). The wHTH domain provides a fold that binds to dsDNA, which is proposed to work together with the ssDNA binding of the OB domain in the p50 subunit to enhance binding of the catalytic subunit to DNA. Pol32 contains

a PIP box motif in its extreme C-terminal region, which provides a secondary site for the functional interaction of Pol δ with PCNA. The primary site is the CysA motif in the C-terminal region of the catalytic subunit (Netz et al. 2012). A PIP box motif has also been identified in the human p66 subunit and has been shown to contribute to the interaction with PCNA.

The role of Pol32 in translesion bypass synthesis by Pol ζ has begun to be clarified. Pol ζ consists of a catalytic subunit, Rev3, and an accessory subunit, Rev7. Rev1, another translesion bypass polymerase, interacts with Rev3 of Pol ζ. This interaction plays a structural role in the function of Pol ζ in translesion synthesis. Pol32 interacts with Rev1 alone, but its interaction with Pol ζ occurs only in the presence of Rev1. Consequently, it is proposed that Rev1 promotes the association of Pol ζ with Pol32, providing a mechanism that could target Pol ζ to a replication fork stalled at a DNA lesion site. The flexible linking of Pol32 to the catalytic and B subunits might enable it to interact freely with Rev1/Pol ζ and recruit it to the lesion site.

The fourth subunit in S. pombe (Cdm1) is nonessential; in humans, on the other hand, p12 plays multiple roles. The p12 subunit stabilizes the interaction between the p125 and p50 subunits because it can interact directly with both subunits. It also contains a PIP box motif that contributes to the stimulation of DNA synthesis by PCNA. During the DNA damage response, p12 is degraded and Pol δ is converted into a three-subunit polymerase. This polymerase form has higher fidelity in nucleotide incorporation and enhanced exonuclease proofreading activity relative to the polymerization activity.

The Accessory Subunits in Pol ε

Pol ε contains two small accessory subunits, termed p17 (17 kDa) and p12 (12.3 kDa) in humans and Dpb3 (22.7 kDa) and Dpb4 (22 kDa) in S. cerevisiae. Although Dpb3 and Dpb4 are not essential for cell viability in S. cerevisiae, they are still required for normal

chromosomal replication. Dpb3/Dpb4 interacts directly with dsDNA and increases the affinity of Pol2/Dpb2 for dsDNA. A histone fold motif is present in Dpb3 and Dpb4 and is proposed to mediate their interaction with DNA. EM structural analyses of Pol ε, Pol2/Dpb2, and Pol2 explain how Dpb3 and Dpb4, in addition to Dpb2, could enhance the processivity of Pol ε. Pol2 forms a globular structure, while Dpb2, Dpb3, and Dpb4 form an extended tail; together they create a cavity that could fit a modeled primer-template and position it in the polymerase active site (Fig. 6d) (Asturias et al. 2006). The length of the modeled DNA (40 base pairs) agrees well with the optimal footprint required to enhance the processivity of Pol ε (Asturias et al. 2006). Consequently, one could envision Dpb3 and Dpb4, along with Dpb2, acting like a processivity factor, which would explain the unusually high intrinsic processivity of Pol ε in the absence of PCNA. Dpb3/Dpb4 also plays a structural role in Pol ε by stabilizing the conformation of Dpb2 (Asturias et al. 2006).

Proliferating Cell Nuclear Antigen

PCNA consists of three identical monomers (28.7 kDa in humans and 28.9 kDa in S. cerevisiae) assembled into a ring-shaped structure (Krishna et al. 1994) (Fig. 9a). Each monomer consists of two domains firmly connected by an extended β-sheet across the interdomain boundary. In addition, an interdomain-connecting loop (IDCL), which is a long and flexible linker, connects both domains (Fig. 9a). Each PCNA monomer interacts with its neighbor by β-sheet extension, mediated by a β-dimer interface that comprises an extensive hydrogen-bonding network, hydrophobic contacts, and a salt bridge. Three such interfaces form a closed planar ring with a negatively charged β-structure on the outer surface and a positively charged inner ring surface composed entirely of α-helices (Fig. 9a, b). Valuable insights into the function of PCNA at the replication fork have been revealed by the structures of PCNA bound to DNA and protein/peptide partners (PCNA Structure and Interactions with Partner Proteins). The ring encloses a

Eukaryotic DNA Replicases, Fig. 9 (a) The crystal structure of the PCNA trimer. Monomers are colored in cyan, yellow, and green. Each monomer consists of two domains (domains A and B) that are connected by an extended β-sheet across the interdomain boundary. In addition, a long and flexible linker referred to as the interdomain-connecting loop (IDCL) (red) connects both domains. (b) A model for the interaction of PCNA with DNA. The electrostatic potential surface of PCNA shows the distribution of positive (blue) and negative (red) patches. The 10-bp DNA molecule is included in the structure in stick representation. The PDB code for the PCNA structure is 1PLQ (Krishna et al. 1994)

34 Å central hole, suggesting that PCNA can encircle either an 18 Å B-form DNA or a 21 Å A-form DNA. In a reported crystal structure of S. cerevisiae PCNA bound to a primer-template (10-bp DNA as the template and a 4-base 5′ overhang as the primer), only a few bases could be modeled confidently into the electron density in the center of the PCNA ring, presumably because of extensive disorder in the DNA molecule or the transient nature of the PCNA interaction with DNA. Consequently, a 10-bp B-form DNA molecule was modeled inside the PCNA ring using the existing weak electron density as a guide. The DNA molecule in this model was skewed towards one side of the ring at an angle of 40° relative to the PCNA axis (Fig. 9b). However, extension of the DNA model beyond 10 bp resulted in steric clashes with PCNA, suggesting that, at least for the purposes of replication, this model was unlikely to be physiologically relevant. A more representative clamp/primer-template interaction is that observed in the E. coli β-clamp/DNA structure, in which the DNA is tilted at an angle of 22°. Nonetheless, the tendency of the DNA to occupy the center of PCNA and the β-clamp in a tilted conformation, presumably to optimize the contact between the positively charged center and the negatively charged DNA backbone, is a genuine functional feature. Indeed, the DNA in the EM structure of the P. furiosus ligase in complex with PCNA occupies the center of the PCNA ring in a tilted conformation, with an angle that is smaller than the 40° predicted from the PCNA/DNA model structure. In the E. coli structure, nine Lys and Arg residues were observed to make contacts with the DNA sugar-phosphate backbone. Structure-based sequence alignment shows that these cationic residues are conserved in similar positions in clamps from all domains of life. Indeed, mutational analyses reveal that all nine residues are required for PCNA loading onto primer-templates and for movement along dsDNA.
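As a rough check on the geometry quoted above, the following sketch computes how much room a centered, untilted duplex would leave inside the PCNA ring. It uses only the 34 Å, 18 Å, and 21 Å diameters given in the text; the helper name `radial_clearance` and the treatment of DNA as a rigid cylinder are illustrative assumptions, not part of the cited structural work.

```python
# Radial clearance between a DNA duplex and the PCNA central hole.
# All diameters (in angstroms) are the values quoted in the text.
PCNA_HOLE_DIAMETER = 34.0
DNA_DIAMETERS = {"B-form": 18.0, "A-form": 21.0}


def radial_clearance(hole_diameter: float, duplex_diameter: float) -> float:
    """Gap on each side of a duplex centered and untilted in the hole.

    Hypothetical helper: DNA is idealized as a rigid cylinder.
    """
    if duplex_diameter > hole_diameter:
        raise ValueError("duplex does not fit through the hole")
    return (hole_diameter - duplex_diameter) / 2.0


# Clearance per side for each DNA form.
clearances = {
    form: radial_clearance(PCNA_HOLE_DIAMETER, diameter)
    for form, diameter in DNA_DIAMETERS.items()
}

if __name__ == "__main__":
    for form, gap in clearances.items():
        print(f"{form}: {gap:.1f} angstroms of clearance on each side")
```

Under these assumptions the gap is 8 Å per side for B-form DNA and 6.5 Å per side for A-form DNA, which is at least compatible with the duplex sitting off-center or tilted within the ring, as described above.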
PCNA functions as a platform for the assembly of a wide variety of DNA processing proteins on DNA (PCNA Structure and Interactions with Partner Proteins). It prevents dissociation of polymerases from the primer-template junction and

coordinates the exchange of polymerases and other processing proteins on DNA. A hydrophobic groove created by the IDCL, a center loop, and the C-terminal tail serves as the primary docking site for many PCNA protein partners. The IDCL is located on the front side of PCNA, with the backside containing several loops that protrude into the solvent. Many partner proteins utilize a consensus PIP box motif for PCNA interaction. The PIP box motif forms a 3₁₀ helix that acts as a hydrophobic plug, inserting into the PCNA hydrophobic groove (PCNA Structure and Interactions with Partner Proteins). However, an increasing body of evidence suggests that PCNA interactions go beyond the PIP box motif (PCNA Structure and Interactions with Partner Proteins). With three IDCLs, PCNA may be able to bind up to three different protein partners at the same time. Support for this proposition is provided by the FEN1-PCNA structure, which shows that three FEN1 molecules could bind one PCNA clamp (PCNA Structure and Interactions with Partner Proteins).

Replication Factor C

PCNA exists mostly as a closed ring in solution and is therefore unable to load directly onto the primer-template. The RFC clamp loader protein complex facilitates PCNA ring opening and loading onto the primer-template (mechanism of PCNA loading by RFC). RFC is a five-subunit ATP-dependent protein complex that is classified as a member of the AAA+ ATPase superfamily. These subunits in humans are p140 (128.2 kDa), p37 (39.2 kDa), p36 (40.6 kDa), p40 (39.7 kDa), and p38 (38.5 kDa). In S. cerevisiae, they are RFC-A (94.9 kDa), RFC-B (39.7 kDa), RFC-D (38.2 kDa), RFC-E (36.1 kDa), and RFC-F (39.9 kDa). All subunits consist of three homologous domains. Domains I and II comprise the AAA+ ATPase module, with all subunits capable of binding ATP. Domain III is the oligomerization domain and has been observed only in clamp loaders (Bowman et al. 2004). RFC-A has insertions at both the N- and C-termini that account for its larger size. Of particular interest is the insert at

Eukaryotic DNA Replicases, Fig. 10 The crystal structure of the S. cerevisiae RFC in a complex with PCNA and nonhydrolyzable adenosine 5′-O-(3-thio)triphosphate (ATPγS). RFC subunits are colored green (RFC-A), cyan (RFC-B), purple (RFC-C), yellow (RFC-D), and brown (RFC-E). PCNA is colored gray, and the ATPγS molecules bound to the subunits are shown as spheres. All subunits consist of three homologous domains; domains I and II comprise the AAA+ ATPase module, and domain III is the oligomerization domain (collar). RFC-A has a fourth domain (A′) that is truncated in this structure. RFC forms a right-handed spiral structure with RFC-A, B, and C interacting with PCNA. The PDB code used for this structure is 1SXJ (Bowman et al. 2004)

the C-terminal region that contains the A′ domain, which interacts with the AAA+ ATPase modules of RFC-E (Bowman et al. 2004). This domain is proposed to participate in the binding of RFC to ssDNA and/or dsDNA. The overall pentameric assembly differs from the hexameric complexes observed in other AAA+ ATPases in that it adopts an open structure. The crystal structure of S. cerevisiae RFC in complex with PCNA and a nonhydrolyzable adenosine 5′-O-(3-thio)triphosphate (ATPγS) analogue (Fig. 10) (Bowman et al. 2004) shows that RFC forms a right-handed spiral heteropentameric complex, with the C-terminal domains organized into a circular collar at the top and the N-terminal AAA+ ATPase modules forming the shape of a claw at the bottom (Fig. 10). The AAA+ module is connected to domain III via a flexible linker sequence. PCNA binds to the claw,

with three of the five RFC subunits (A, B, and C) interacting with PCNA, primarily through the insertion of RFC hydrophobic residues into the hydrophobic binding grooves on PCNA. The interface between each RFC subunit and its adjacent subunit has a large buried surface area formed by an extensive network of hydrogen bonds and salt bridges. ATPγS occupies the ATP binding site situated at each interface (Fig. 10), a location that appears to favor coupling of ATP binding and hydrolysis to conformational changes in RFC and, by association, PCNA. A proximal arginine residue (the "arginine finger") required for ATP hydrolysis is provided by the adjacent subunit. RFC-mediated loading of PCNA onto the primer-template involves an assembly stage and a disassembly stage (mechanism of PCNA loading by RFC). In the assembly stage, binding of ATP results in conformational changes such that RFC

can form a complex with PCNA. This event leads to conformational changes in which PCNA is cracked open at one subunit interface, and the positively charged inner surfaces of both RFC and PCNA are twisted into an extended spiral that complements the geometry of the primer-template. The primer-template would gain entry through PCNA at the bottom and, via interactions with positively charged residues in both PCNA and RFC, travel upwards into the RFC chamber (mechanism of PCNA loading by RFC). The disassembly stage involves ATP hydrolysis, triggered by DNA binding, which results in the closure of PCNA around primer-template DNA, dissociation of RFC from the PCNA/primer-template complex, and release of ADP and Pi (mechanism of PCNA loading by RFC). By utilizing the structurally related E. coli clamp loader/DNA complex, a DNA molecule has been modeled into the chamber of RFC/PCNA. The DNA makes reasonable contacts with positive charges along the chamber and enters the center of PCNA closer to one side, in a position that would allow it to interact with all positive residues in the center. Since PCNA is in the closed conformation, this model most likely represents an intermediate step after completion of ATP loading and binding to PCNA but prior to PCNA opening. To understand how the open conformation of PCNA would interact with RFC prior to PCNA loading, a simulated model of an open PCNA structure was created and aligned in the RFC/PCNA complex via the subunit that interacts with RFC-A. The model shows that the PCNA subunit that makes the most contact with RFC maintains its interaction with DNA regardless of whether PCNA is in the open or closed conformation, while the remaining subunits point away from the DNA.

Cross-References

▶ Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis
▶ Division of Labor
▶ DNA Polymerase III Structure
▶ DNA Repair Polymerases
▶ DNA Replication
▶ DNA Replication, Chemical Biology of
▶ DnaX Complex Composition and Assembly within Cells
▶ PCNA Loading by RFC, Mechanism of
▶ PCNA Structure and Interactions with Partner Proteins

References

Acharya N, Johnson RE, Pages V, Prakash L, Prakash S (2009) Yeast Rev1 protein promotes complex formation of DNA polymerase zeta with Pol32 subunit of DNA polymerase delta. Proc Natl Acad Sci U S A 106(24):9631–9636
Agarkar VB, Babayeva ND, Pavlov YI, Tahirov TH (2011) Crystal structure of the C-terminal domain of human DNA primase large subunit: implications for the mechanism of the primase-polymerase alpha switch. Cell Cycle 10(6):926–931
Asturias FJ, Cheung IK, Sabouri N, Chilkova O, Wepplo D, Johansson E (2006) Structure of Saccharomyces cerevisiae DNA polymerase epsilon by cryo-electron microscopy. Nat Struct Mol Biol 13(1):35–43
Augustin MA, Huber R, Kaiser JT (2001) Crystal structure of a DNA-dependent RNA polymerase (DNA primase). Nat Struct Biol 8(1):57–61
Baranovskiy AG, Babayeva ND, Liston VG, Rogozin IB, Koonin EV, Pavlov YI, Vassylyev DG, Tahirov TH (2008) X-ray structure of the complex of regulatory subunits of human DNA polymerase delta. Cell Cycle 7(19):3026–3036
Bebenek K, Kunkel TA (2004) Functions of DNA polymerases. Adv Protein Chem 69:137–165
Bowman GD, O’Donnell M, Kuriyan J (2004) Structural analysis of a eukaryotic sliding DNA clamp-clamp loader complex. Nature 429(6993):724–730
Franklin MC, Wang J, Steitz TA (2001) Structure of the replicating complex of a pol alpha family DNA polymerase. Cell 105(5):657–667
Frick DN, Richardson CC (2001) DNA primases. Annu Rev Biochem 70:39–80
Garg P, Burgers PM (2005) DNA polymerases that propagate the eukaryotic DNA replication fork. Crit Rev Biochem Mol Biol 40(2):115–128
Hamdan SM, Richardson CC (2009) Motors, switches, and contacts in the replisome. Annu Rev Biochem 78:205–243
Hamdan SM, van Oijen AM (2010) Timing, coordination, and rhythm: acrobatics at the DNA replication fork. J Biol Chem 285(25):18979–18983
Hogg M, Johansson E (2012) DNA polymerase epsilon. Subcell Biochem 62:237–257
Hogg M, Osterman P, Bylund GO, Ganai RA, Lundstrom EB, Sauer-Eriksson AE, Johansson E (2014) Structural basis for processive DNA synthesis by yeast DNA polymerase epsilon. Nat Struct Mol Biol 21(1):49–55
Ito N, Nureki O, Shirouzu M, Yokoyama S, Hanaoka F (2003) Crystal structure of the Pyrococcus horikoshii DNA primase-UTP complex: implications for the mechanism of primer synthesis. Genes Cells 8(12):913–923
Jain R, Hammel M, Johnson RE, Prakash L, Prakash S, Aggarwal AK (2009) Structural insights into yeast DNA polymerase delta by small angle X-ray scattering. J Mol Biol 394(3):377–382
Jain R, Rajashankar KR, Buku A, Johnson RE, Prakash L, Prakash S, Aggarwal AK (2014) Crystal structure of yeast DNA polymerase epsilon catalytic domain. PLoS One 9(4):e94835
Johansson E, Macneill SA (2010) The eukaryotic replicative DNA polymerases take shape. Trends Biochem Sci 35(6):339–347
Johnson A, O’Donnell M (2005) Cellular DNA replicases: components and dynamics at the replication fork. Annu Rev Biochem 74:283–315
Kilkenny ML, Longo MA, Perera RL, Pellegrini L (2013) Structures of human primase reveal design of nucleotide elongation site and mode of Pol alpha tethering. Proc Natl Acad Sci U S A 110(40):15961–15966
Klinge S, Hirst J, Maman JD, Krude T, Pellegrini L (2007) An iron-sulfur domain of the eukaryotic primase is essential for RNA primer synthesis. Nat Struct Mol Biol 14(9):875–877
Klinge S, Nunez-Ramirez R, Llorca O, Pellegrini L (2009) 3D architecture of DNA Pol alpha reveals the functional core of multi-subunit replicative polymerases. EMBO J 28(13):1978–1987
Krishna TS, Kong XP, Gary S, Burgers PM, Kuriyan J (1994) Crystal structure of the eukaryotic DNA polymerase processivity factor PCNA. Cell 79(7):1233–1243
Kuchta RD, Reid B, Chang LM (1990) DNA primase. Processivity and the primase to polymerase alpha activity switch. J Biol Chem 265(27):16158–16165
Lao-Sirieix SH, Nookala RK, Roversi P, Bell SD, Pellegrini L (2005) Structure of the heterodimeric core primase. Nat Struct Mol Biol 12(12):1137–1144
McHenry CS (2011) DNA replicases from a bacterial perspective. Annu Rev Biochem 80:403–436
Netz DJ, Stith CM, Stumpfig M, Kopf G, Vogel D, Genau HM, Stodola JL, Lill R, Burgers PM, Pierik AJ (2012) Eukaryotic DNA polymerases require an iron-sulfur cluster for the formation of active complexes. Nat Chem Biol 8(1):125–132
Nunez-Ramirez R, Klinge S, Sauguet L, Melero R, Recuero-Checa MA, Kilkenny M, Perera RL, Garcia-Alvarez B, Hall RJ, Nogales E, Pellegrini L, Llorca O (2011) Flexible tethering of primase and DNA Pol alpha in the eukaryotic primosome. Nucleic Acids Res 39(18):8187–8199
Nuutinen T, Tossavainen H, Fredriksson K, Pirila P, Permi P, Pospiech H, Syvaoja JE (2008) The solution structure of the amino-terminal domain of human DNA polymerase epsilon subunit B is homologous to C-domains of AAA+ proteins. Nucleic Acids Res 36(15):5102–5110
Pavlov YI, Shcherbakova PV (2010) DNA polymerases at the eukaryotic fork-20 years later. Mutat Res 685(1–2):45–53
Pellegrini L (2012) The Pol alpha-primase complex. Subcell Biochem 62:157–169
Perera RL, Torella R, Klinge S, Kilkenny ML, Maman JD, Pellegrini L (2013) Mechanism for priming DNA synthesis by yeast DNA polymerase alpha. Elife 2:e00482
Sauguet L, Klinge S, Perera RL, Maman JD, Pellegrini L (2010) Shared active site architecture between the large subunit of eukaryotic primase and DNA photolyase. PLoS One 5(4):e10083
Swan MK, Johnson RE, Prakash L, Prakash S, Aggarwal AK (2009) Structural basis of high-fidelity DNA synthesis by yeast DNA polymerase delta. Nat Struct Mol Biol 16(9):979–986
Tahirov TH (2012) Structure and function of eukaryotic DNA polymerase delta. Subcell Biochem 62:217–236
Vaithiyalingam S, Warren EM, Eichman BF, Chazin WJ (2010) Insights into eukaryotic DNA priming from the structure and functional interactions of the 4Fe-4S cluster domain of human DNA primase. Proc Natl Acad Sci U S A 107(31):13684–13689

Exocyclic Adducts

Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synopsis

Exocyclic DNA adducts are those in which an extra ring or rings have been added to the DNA base, either a purine or pyrimidine. These are often problematic in that they disrupt the normal DNA coding with the opposite strand. Some of the most common exocyclic adducts are the so-called etheno (ε) adducts, which (as the name would imply) contain an extra two carbons. These etheno adducts were first studied because of their relevance to the carcinogenicity of vinyl chloride, but subsequently they were found to also be derived from peroxidation of unsaturated fatty acids and to be present in humans who have never been exposed to vinyl chloride or other vinyl monomers. Most of these exocyclic DNA adducts have been found to be miscoding, and there is considerable interest in their role in the etiology of human cancer.

Exocyclic Adducts, Fig. 1 Some etheno (ε) DNA adducts

Introduction

The exocyclic adducts are an eclectic group of DNA adducts having only the commonality of containing extra rings. These adducts arise from the reaction of DNA with bifunctional electrophiles (as do DNA cross-links). In many of these adducts, the atoms normally involved in hydrogen bonding with incoming nucleoside triphosphates (dNTPs) are not available, and information about coding is limited or possibly absent. Some of the exocyclic adducts are found in DNA of normal humans and have been implicated as miscoding in biological experiments; thus they are potentially involved in the etiology of human cancer.

Nature and Consequences of Exocyclic DNA Adducts

The etheno (ε) adducts (Fig. 1) are so called because an added ethylene group (two carbons) forms an extra ring. Some were known from tRNA chemistry, but four of these (1,N6-ε-adenine, 3,N4-ε-cytosine, 1,N2-ε-guanine, and N2,3-ε-guanine; Fig. 1) were developed by Leonard and his associates by reacting the bases or nucleosides with chloroacetaldehyde (Leonard 1984). Three of these bases are fluorescent, and these derivatives found various uses as ATP analogs in biophysical studies. Subsequently these etheno adducts were identified in DNA treated with 2-chlorooxirane, the epoxidation product of the human carcinogen vinyl chloride (Barbin et al. 1975). 2-Chlorooxirane decomposes to 2-chloroacetaldehyde, but the epoxide has been shown to be the agent that forms the etheno adducts. The etheno adducts can be formed not only from 2-chlorooxirane and 2-haloacetaldehydes but also from epoxides derived from other olefins, if appropriate leaving groups are present, e.g., acrylonitrile, vinyl acetate, and vinyl carbamate. A surprising result was that rats that had never been exposed to any of these chemicals also contained etheno adducts (Fedtke et al. 1990). Subsequently it has been shown that these adducts are derived from products of the oxidation of unsaturated fatty acids (Blair 2008). Some of the etheno adducts (especially 1,N2-ε-guanine) retain the fatty chain (Fig. 1), which can also be lost in a retrograde reaction to yield 1,N2-ε-guanine. Chemical mechanisms for the formation of 1,N6-ε-adenine, 3,N4-ε-cytosine, 1,N2-ε-guanine, and N2,3-ε-guanine from 2-chlorooxirane and chloroacetaldehyde have been developed based on isotopic labeling (Guengerich et al. 1993). An interesting feature of the ε-guanine adducts is that


labeling studies show that the etheno rings slowly open and close, even at neutral pH for 1,N2-ε-guanine (Guengerich et al. 1993). Treatment of DNA with 2-chlorooxirane produces DNA adducts in the order N7-(2-oxoethyl)guanine > 1,N6-ε-adenine > 7-hydroxy-1,N2-ethanoguanine > N2,3-ε-guanine > 3,N4-ε-cytosine > 1,N2-ε-guanine (Müller et al. 1997). However, because of differences in rates of DNA repair, the pattern of DNA adducts that accumulates in vivo is different, i.e., N2,3-ε-guanine accumulates because it is repaired slowly (Fedtke et al. 1990). The major etheno adducts have been examined with DNA polymerases and in site-specific mutagenesis experiments; all appear to have the potential to miscode (Langouët et al. 1997). Limited studies have been done with N2,3-ε-guanine because of the lability of the glycosidic bond, although earlier reports that this adduct pairs with thymine (Cheng et al. 1991) appear to be solid, as confirmed in recent studies in this laboratory with a stable isosteric analog. Some miscoding studies have been done with 1,N2-ε-guanine adducts containing residual fatty acid chains. Major differences in adduct levels have been found in humans with different diets (and even major gender differences) (Nair et al. 1995). Although a role for etheno adducts in cancer etiology has not been rigorously established (in liver hemangiosarcoma cases related to vinyl chloride), the presence of high levels of the etheno adducts in humans (a total of ~1/10^7 bases for some) and their documented biological effects in experimental systems suggest potential roles in human cancer. A related group of adducts derives from treatment of DNA with malondialdehyde, a product of lipid peroxidation. Although N6-adenine and other minor adducts are known, the major product is 3-(2′-deoxy-β-D-erythro-pentofuranosyl)pyrimido[1,2-a]purin-10(3H)-one, often abbreviated M1G (Fig. 2). This adduct is formed from the addition of malondialdehyde to DNA.
Alternatively, the adduct can be formed by a base propenal derived from the sugar ring of DNA next to a guanine base. Several biological and other properties of the M1G adduct have been investigated in some

Exocyclic Adducts, Fig. 2 M1G

detail. M1G miscodes in site-specific mutagenesis experiments in both prokaryotic and mammalian cell systems. The adduct causes both base-pair and frameshift mutations. The atoms of M1G normally involved in hydrogen bonding are obscured. Therefore, the bulky exocyclic group can stack with other bases to facilitate frameshift mutations. Another interesting feature of the M1G exocyclic ring is that it readily opens (and closes), so that a DNA polymerase (or RNA polymerase) might encounter either the open or closed form. The open form is an N2-substituted guanine. In double-stranded oligonucleotides in which M1G is placed opposite cytosine, the open form of M1G is favored, presumably due to the favorable thermodynamic equilibrium, in which pairing to cytosine through three hydrogen bonds can occur. If a thymine is positioned opposite M1G, the ring-closed form is favored. The timescale for opening and closing has been analyzed and may be relevant for replication, especially if the progress of a processive DNA polymerase is blocked. The products of bypass of etheno and M1G adducts are rather variable, depending upon the particular DNA polymerase involved (Langouët et al. 1997). X-ray crystal structures have been established for Sulfolobus solfataricus DNA polymerase Dpo4, a Y-family enzyme, with 1,N2-ε-guanine and with M1G (Zang et al. 2005). A number of crystals have been obtained for each, but with both adducts there is a “type II” structure that can explain the −1 frameshifts. In this case the DNA polymerase copies “around” the adduct, as if it were not present. This process apparently does not extensively strain either the DNA backbone or the protein structure, due in part to the large size of the active site. Some drugs have been reported to link the O6 and N1 atoms of guanine (Brent et al. 1987). This


Exocyclic Adducts, Fig. 3 Exocyclic DNA adducts generated by reaction with crotonaldehyde (from guanine)

Exocyclic Adducts, Fig. 4 An exocyclic DNA adduct generated by reaction with benzoquinone (from guanine)

Exocyclic Adducts, Fig. 5 An exocyclic DNA adduct generated from the estrogenic drug component equilenin

linkage is not completely stable, in that O6-alkylguanine-DNA alkyltransferase (a DNA repair protein) can react with the methylene group and become covalently linked. Bifunctional electrophiles can be generated from certain nitrosamines that form a ring linked to the N7 and C8 atoms of guanine (Fig. 3). The biological consequences of this adduct are not known. Quinone compounds have been shown to yield some rather complex exocyclic adducts. For instance, benzoquinone, an oxidation product of benzene, has been shown to form

a complex ring system with guanine (Fig. 4). Other complex exocyclic DNA adducts can be formed from estrogen-derived quinones and quinone methides (Fig. 5). Some studies of the effects of these latter (estrogen-derived) adducts on DNA replication have been done. The significance of the benzoquinone adducts to benzene toxicity is unknown, in that benzene is not generally considered a genotoxic carcinogen and DNA adducts have not been identified in animals treated with benzene. The relevance of the estrogen-derived exocyclic DNA adducts to cancers remains to be established.
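To put the etheno adduct level quoted earlier (roughly one adduct per 10^7 bases) on a per-cell scale, a back-of-envelope calculation can help. The sketch below is illustrative only: the diploid genome size is an assumed round number that does not come from this entry, and the function name is hypothetical.

```python
def adducts_per_cell(adducts_per_base: float, base_pairs_per_cell: float) -> float:
    """Expected number of adducts in the nuclear DNA of one cell."""
    bases_per_cell = 2 * base_pairs_per_cell  # each base pair contributes two bases
    return adducts_per_base * bases_per_cell


# Adduct level quoted in the text: about 1 adduct per 10^7 bases.
ETHENO_LEVEL = 1e-7

# ASSUMPTION (not from the text): a diploid human genome of ~6e9 base pairs.
ASSUMED_DIPLOID_BASE_PAIRS = 6.0e9

estimate = adducts_per_cell(ETHENO_LEVEL, ASSUMED_DIPLOID_BASE_PAIRS)
print(f"about {estimate:.0f} adducts per cell under these assumptions")
```

Under these assumptions the estimate is on the order of 10^3 adducts per cell; the figure scales linearly with whichever genome size is assumed.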


Cross-References
▶ Bioactivation of Carcinogens
▶ Damage DNA, Natural Products that
▶ Damaged DNA, Analysis of
▶ Direct Enzymatic Reversal of DNA Damage
▶ DNA Base Pairing, Modes of
▶ DNA Damage, Frequency of
▶ DNA Damage, Types of
▶ Electrophiles, Types of
▶ Kinetics of DNA Damage
▶ Selectivity of Chemicals for DNA Damage
▶ Site-Specific Mutagenesis
▶ Synthesis of Modified Oligonucleotides

References
Barbin A, Brésil H, Croisy A et al (1975) Liver-microsome-mediated formation of alkylating agents from vinyl bromide and vinyl chloride. Biochem Biophys Res Commun 67:596–603
Blair IA (2008) DNA adducts with lipid peroxidation products. J Biol Chem 283:15545–15549
Brent TP, Lestrud SO, Smith DG et al (1987) Formation of DNA interstrand cross-links by the novel chloroethylating agent 2-chloroethyl(methylsulfonyl)methanesulfonate: suppression by O6-alkylguanine-DNA alkyltransferase purified from human leukemic lymphoblasts. Cancer Res 47:3384–3387
Cheng KC, Preston BD, Cahill DS et al (1991) The vinyl chloride DNA derivative, N2,3-ethenoguanine, produces G→A transitions in Escherichia coli. Proc Natl Acad Sci U S A 88:9974–9978
Fedtke N, Boucheron JA, Walker VE et al (1990) Vinyl chloride-induced DNA adducts. II. Formation and persistence of 7-(2′-oxoethyl)guanine and N2,3-ethenoguanine in rat tissue DNA. Carcinogenesis 11:1287–1292
Guengerich FP, Persmark M, Humphreys WG (1993) Formation of 1,N2- and N2,3-ethenoguanine from 2-halooxiranes: isotopic labeling studies and isolation of a hemiaminal derivative of N2-(2-oxoethyl)guanine. Chem Res Toxicol 6:635–648
Langouët S, Müller M, Guengerich FP (1997) Misincorporation of dNTPs opposite 1,N2-ethenoguanine and 5,6,7,9-tetrahydro-7-hydroxy-9-oxoimidazo[1,2-a]purine in oligonucleotides by Escherichia coli polymerases I exo- and II exo-, T7 polymerase exo-, human immunodeficiency virus-1 reverse transcriptase, and rat polymerase β. Biochemistry 36:6069–6079
Leonard NJ (1984) Etheno-substituted nucleotides and coenzymes: fluorescence and biological activity. CRC Crit Rev Biochem 15:125–199
Müller M, Belas FJ, Blair IA, Guengerich FP (1997) Analysis of 1,N2-ethenoguanine and 5,6,7,9-tetrahydro-7-hydroxy-9-oxoimidazo[1,2-a]purine in DNA treated with 2-chlorooxirane by high performance liquid chromatography/mass spectrometry and comparison of amounts with other adducts. Chem Res Toxicol 10:242–247
Nair J, Barbin A, Guichard Y et al (1995) 1,N6-Ethenodeoxyadenosine and 3,N4-ethenodeoxycytidine in liver DNA from humans and untreated rodents detected by immunoaffinity/32P-postlabelling. Carcinogenesis 16:613–617
Zang H, Goodenough AK, Choi JY et al (2005) DNA adduct bypass polymerization by Sulfolobus solfataricus DNA polymerase Dpo4: analysis and crystal structures of multiple base pair substitution and frameshift products with the adduct 1,N2-ethenoguanine. J Biol Chem 280:29750–29764

Expression Assessment C. Robin Buell Department of Plant Biology, Michigan State University, East Lansing, MI, USA

Definition Genes are transcribed into mRNAs, which are then translated into proteins. Measurements of transcript or gene expression levels provide an indirect assessment of protein levels. Examination of transcript structure is also a powerful method to determine gene structure and the presence of alternative isoforms. Single transcripts can be quantitated, or entire transcriptomes, i.e., all the transcripts in the cell, can be measured. Methods for measuring transcript levels have evolved greatly in the last two decades, and next-generation sequencing platforms have been applied to transcriptome measurements, enabling quantification of not only the level but also the structure of all transcripts in a cell.

Discussion Measurements of gene expression levels have evolved in the last 30 years from simple


hybridization-based northern blots to sequencing of RNA, which provides single-base resolution and highly accurate measurements of transcript structure and levels. Initial assessments of transcript levels involved modification of the Southern blot technique, in which single-stranded DNA immobilized on a solid surface is hybridized to radioactive DNA probes. Modification of this technique for RNA, termed “northern” blots, utilized electrophoretic separation of RNA in an agarose gel, transfer of the RNA to a solid surface such as a nylon or nitrocellulose membrane, hybridization of the immobilized transcripts to radioactive or fluorescently labeled DNA probes, and quantification of transcript levels through autoradiography or luminescence, respectively. While effective, northern blots were low throughput, required high amounts of RNA per sample, and were limited in resolution. In the 1990s, two technologies emerged to assess expression that were paradigm changing in terms of throughput and resolution: expressed sequence tags (ESTs) and microarrays. ESTs were first shown to be an effective method for gene discovery in humans (Adams et al. 1993). These single-pass sequences from anonymous cDNA clones can rapidly generate sequence data on expressed genes and bypass the laborious methods of directed cloning or positional cloning of genes. The majority of EST collections were generated using automated Sanger sequencing methods, which, while accurate, still contained errors due to their single-pass nature. Another issue with ESTs is that the transcript, and as a consequence the cDNA clone, is typically longer than the single EST read. A third issue with EST-derived sequences is that the underlying cDNA clone is not always full length; thus, the EST may not represent the true 5′ and 3′ ends of the transcript. To address these three limitations, ESTs can be clustered and assembled to generate a consensus sequence representative of the original transcript. As a consequence, these transcript assemblies are of greater length and higher accuracy than any one of the component EST sequences. Quantitation of expression using


ESTs is possible. These “electronic northern blots” involve random sequencing of cDNA libraries and bioinformatic quantitation of the frequency of transcripts in the underlying tissue by counting the EST reads for each transcript. Microarrays emerged in the 1990s as a high-throughput method to assess expression levels. In this technology, DNA (either purified cDNA clones, open reading frames, or oligonucleotides) is spotted onto a solid surface, hybridized with fluorescently labeled cDNA populations, and detected through excitation with a laser (Fig. 1). A switch in the use of the term “probe” was necessary with the emergence of microarrays. In northern blots, the “known” sequence, or probe, was radioactively labeled and hybridized to total mRNA immobilized on a solid surface. However, in microarrays, the DNA spotted on the microarray is what is “known” while the labeled cDNA (i.e., mRNA) is unknown. Thus, in microarrays, the probes are the DNA elements on the array and the labeled nucleic acid (or query) is the total mRNA population. Early microarrays were simple. cDNA clones from EST projects were amplified using PCR, spotted onto glass microscope slides using robotics, and hybridized with mRNA labeled with fluorescent dyes. Nonspecific hybridization was removed and the slide was scanned with a laser, exciting the fluorophore attached to the labeled cDNA and capturing the emission. Two mRNA samples could be imaged simultaneously due to the availability of dyes with nonoverlapping excitation and emission spectra. Thus, two mRNA expression profiles could be obtained from a single slide, enabling treatment versus control types of comparisons. Microarrays evolved from simple spotting of the DNA onto glass slides to fabrication in which the oligonucleotides are synthesized directly on the slide surface using an array of technologies.
The current commercial microarray platforms utilize different technologies to synthesize oligonucleotides directly on a solid surface including photolithography, micromirrors, and inkjets. The density at which the oligonucleotides are deposited on the array has increased and currently millions of features, i.e., the probes, can be placed on these


Expression Assessment, Fig. 1 Microarrays. Early microarrays were fabricated by robotically spotting oligonucleotides or cloned cDNAs on modified coated glass microscope slides. mRNA from two different samples, typically a control and a treatment, are converted to cDNA and labeled with two fluorescent nucleotides that have differential excitation and emission spectra. These are mixed and hybridized to the microarray. Nonspecific hybridization is removed through a series of temperature- and salt-modified solutions, and the hybridized microarray is scanned with a laser to excite and capture the fluorescent molecules. In this example, the microarray was hybridized with Cy3- (green) and Cy5- (red) labeled cDNA from the control and treatment samples, respectively. The yellow elements reflect oligonucleotides in which mRNA levels are similar for the gene represented by the probe in the control and treatment mRNA. The red elements reflect high expression of the gene represented by the probe in the treatment but not the control mRNA. The green elements reflect high expression of the gene represented by the probe in the control but not the treatment mRNA

commercial arrays, enabling interrogation of a large number of genes. An extension of the microarrays for quantification of expression of known genes was the development of tiling arrays. Here, rather than using known, predicted, or annotated genes to interrogate on the array, the entire genome was placed on the array, thereby permitting the RNA hybridization data to inform the scientist as to what portions of the genome encoded transcripts. These tiling arrays have shown that additional regions of the genome, i.e., regions outside of curated genes, encode transcripts and that the structural annotation of genes should be amended, as additional exons and/or alternative splice forms could be detected on these tiling arrays.
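The two-color readout described for spotted microarrays (Cy3 for control, Cy5 for treatment, with yellow, red, and green spots) can be expressed as a log ratio of the two channel intensities. The following is a minimal sketch rather than any platform's actual analysis pipeline: the twofold cutoff, the pseudocount, and the function names are assumptions chosen for illustration.

```python
import math


def log2_ratio(cy5_treatment: float, cy3_control: float, pseudocount: float = 1.0) -> float:
    """log2(treatment / control); the pseudocount (an assumption) avoids division by zero."""
    return math.log2((cy5_treatment + pseudocount) / (cy3_control + pseudocount))


def spot_color(cy5_treatment: float, cy3_control: float, threshold: float = 1.0) -> str:
    """Classify a spot as it would appear in a merged two-color image.

    threshold is in log2 units; 1.0 (a twofold change) is an assumed cutoff.
    """
    ratio = log2_ratio(cy5_treatment, cy3_control)
    if ratio >= threshold:
        return "red"      # higher in treatment (Cy5)
    if ratio <= -threshold:
        return "green"    # higher in control (Cy3)
    return "yellow"       # similar in both samples


# Hypothetical background-corrected intensities: (Cy5 treatment, Cy3 control)
spots = {"geneA": (4000.0, 1000.0), "geneB": (1000.0, 4000.0), "geneC": (1500.0, 1400.0)}
for gene, (cy5, cy3) in spots.items():
    print(gene, round(log2_ratio(cy5, cy3), 2), spot_color(cy5, cy3))
```

Real analyses typically also correct for background and dye bias before computing ratios; those steps are omitted here.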

With the advent of next-generation sequencing methods, sequencing of RNA, termed RNA-seq, has supplanted all previous methods. In RNA-seq, RNA is converted into cDNA and sequenced to a high depth using next-generation sequencing platforms (Wang et al. 2009). RNA-seq is essentially EST technology but at a higher level in terms of numbers of reads, depth of coverage of the transcriptome, and breadth of coverage of individual transcripts, as the entire transcript and not just the 5′ and 3′ ends of the cDNA are sequenced. The RNA-seq reads are then mapped to a reference genome using a short read alignment program to improve annotation and/or estimate transcript abundances and isoforms (Fig. 2) (Wang et al. 2008). Access to

[Figure 2 schematic: polyadenylated mRNA → cleave and convert to cDNA → add adaptors → amplify → sequence on next-generation sequencing platform → align reads to reference genome (intergenic/promoter, exon 1, intron 1, exon 2, intron 2, exon 3, intergenic)]

Expression Assessment, Fig. 2 RNA-seq. RNA sequencing, RNA-seq, involves sequencing of cDNA to high depth using next-generation sequencing methods. The reads are then aligned to a reference genome to annotate gene model structure and determine expression abundances. Solid short black lines reflect full alignment of the sequenced reads. In the reference genome, genes are modeled (intergenic region: orange; exon: green; intron: gray). The sequence reads with a dashed line are spliced across the introns

whole transcriptome sequence data has permitted improvements in gene structure predictions and in identification of alternative isoforms in a genome. In the absence of a reference genome, the sequences can be de novo assembled to generate a reference transcriptome, which then can be used to estimate transcript abundances in individual cDNA libraries. The evolution of expression measurement methods to RNA-seq provides unprecedented resolution of the structure and composition of the transcriptome.
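Estimating transcript abundance from mapped reads comes down to counting the reads assigned to each transcript and normalizing for transcript length and sequencing depth. The sketch below uses transcripts per million (TPM), a widely used normalization that this entry does not itself name, so it should be read as one possible choice rather than the method of the cited studies. The transcript names, counts, and lengths are hypothetical.

```python
from typing import Dict


def tpm(read_counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Convert per-transcript read counts to transcripts per million (TPM)."""
    # Reads per base: corrects for longer transcripts yielding more reads.
    rates = {t: read_counts[t] / lengths_bp[t] for t in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in rates}
    # Scale so that values across all transcripts sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical counts of aligned reads and transcript lengths.
counts = {"transcript_A": 100, "transcript_B": 200, "transcript_C": 0}
lengths = {"transcript_A": 1000, "transcript_B": 4000, "transcript_C": 2000}

abundances = tpm(counts, lengths)
for name, value in abundances.items():
    print(f"{name}: {value:.1f} TPM")
```

In this toy input, transcript_B has twice the reads of transcript_A but is four times longer, so its length-normalized abundance comes out at half that of transcript_A. The same counting logic applies to the earlier “electronic northern blot” approach with EST reads.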

References
Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373–380
Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

F

Fatty Acid Metabolism Margaret A. Park1 and Charles Chalfant1,2 1 Department of Biochemistry and Molecular Biology, Virginia Commonwealth University, Richmond, VA, USA 2 Research Career Scientist, Research and Development, Hunter Holmes McGuire VAMC, Richmond, VA, USA

Synopsis Fatty acids are extremely efficient at storing energy and are used by many organisms for metabolic reactions. In this review of fatty acid metabolism, fat molecules will be followed from dietary intake and digestion, through storage and mobilization, oxidation, and finally ketone body formation. The regulation of each of these processes will also be discussed. Finally, genetic abnormalities and disorders of fatty acid oxidation will be examined. Most of the concepts in this entry will focus on the human metabolism of fatty acids. However, other species will be discussed when of relevance.

© Springer Science+Business Media, LLC 2018 R.D. Wells et al., Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Introduction Living cells require a constant input of energy for metabolic functions and to maintain homeostasis. For most non-photosynthetic organisms, this

energy is acquired from the oxidation of various food sources such as proteins, carbohydrates, and fats. Energy is derived from the breakdown of individual subunits of these substances into acetyl-CoA and the subsequent cycling of this molecule through the citric acid cycle and the mitochondrial oxidative phosphorylation pathway. ATP is thereby produced providing chemical energy for cells, thus allowing for the maintenance of homeostasis and continued normal function. Fatty acids are especially important in the production of ATP from foods and this entry will serve as an overview of these vital pathways. Fatty acids (FAs) are a diverse subgroup of lipids including straight-chain FAs, substituted FAs (fatty acids with substituents other than methyl groups), branched-chain FAs, and ring-containing fatty acids. This group of molecules has a similarly diverse range of functions including energy storage, protection from water saturation (such as the FAs found in the preening glands of birds), membrane structure, protein modification (lipoproteins), and scavenging of reactive oxygen species (tocopherols). Fatty acids can be used as subunits to build mono-, di-, and triacylglycerol molecules (see Fig. 1), the body’s main source of stored energy. Fatty acids as a group are negatively charged linear hydrocarbon chains of various lengths. The negative charge is located at a carboxyl end group that is completely deprotonated at physiological pH values. Fatty acids are especially efficient stores of energy, as these molecules can provide more than twice as


Fatty Acid Metabolism, Fig. 1 Structure of a typical triglyceride, the most common form of dietary fatty acid. Carbon atoms are shown in green, oxygen in red, and hydrogen in white. The triglyceride depicted is the 16-C chain length, palmitic acid

much energy per gram as carbohydrates. Moreover, fatty acid molecules are easily stored in the body, and many mammals store the majority of their energy in this form in adipose tissue (reviewed in sections “Development of Adipose Tissue” and “Fatty Acid Synthesis (Lipogenesis)”). Hibernating mammals, for example, store fats as energy for use over long periods of time, thus avoiding the need for re-fueling (Kalish et al. 2012).
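The storage advantage of fat can be illustrated with simple arithmetic on energy densities, using the commonly cited values of roughly 9 kcal per gram for fat and 4 kcal per gram for carbohydrate or protein. The sketch below is illustrative only; the size of the energy reserve and the function name are hypothetical, and the calculation ignores the water that accompanies stored carbohydrate.

```python
# Approximate energy densities in kcal per gram.
KCAL_PER_GRAM = {"fat": 9.0, "carbohydrate": 4.0, "protein": 4.0}


def grams_to_store(energy_kcal: float, fuel: str) -> float:
    """Mass of a given fuel needed to hold energy_kcal of stored energy."""
    return energy_kcal / KCAL_PER_GRAM[fuel]


# Hypothetical reserve size chosen only to make the comparison concrete.
reserve_kcal = 90_000.0

fat_kg = grams_to_store(reserve_kcal, "fat") / 1000.0
carb_kg = grams_to_store(reserve_kcal, "carbohydrate") / 1000.0
print(f"fat: {fat_kg:.1f} kg, carbohydrate: {carb_kg:.1f} kg")
print(f"carbohydrate mass / fat mass = {carb_kg / fat_kg:.2f}")
```

With these values, the same reserve requires 2.25 times as much mass when held as carbohydrate, in line with the statement that fat provides more than twice as much energy per gram.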

Fatty Acid Digestion: Triglycerides Fatty acid intake usually occurs in the form of triglycerides, examples of which include vegetable oils and animal fats. Triglyceride (or triacylglyceride, TAG) molecules are comprised of a 3-carbon glycerol molecule attached via ester bonds to three fatty acid hydrocarbon chains (see Fig. 1). Chain lengths vary in the composition of naturally occurring triglycerides, but most contain 16, 18, or 20 carbon atoms.

Natural fatty acids found in plants and animals are typically composed of only even numbers of carbon atoms, a reflection of the pathway for their biosynthesis from the two-carbon building-block acetyl coenzyme A (acetyl-CoA). However, bacteria possess the ability to synthesize odd-chain lengths in addition to branched-chain fatty acids. As a result, ruminant animal fat (such as that found in cattle and sheep) contains some odd-numbered fatty acid chains due to the action of bacteria in the gut of these species (Kalish et al. 2012; Ramírez et al. 2001). Although they are unparalleled in efficiency for energy storage, the hydrophobicity of triglycerides presents a problem for their eventual use as a source of energy, as these molecules cannot be directly absorbed in the aqueous environment of the intestines. Therefore, triacylglycerols must be digested at lipid-water interfaces. Digestion occurs in two main stages:

1. Large aggregates of triglycerides, which have been consumed in the diet and which are virtually insoluble in an aqueous environment, must be broken down physically and held in suspension – a process called emulsification.
2. TAG molecules must be enzymatically digested to yield a molecule of glycerol and three fatty acids. These compounds can either efficiently diffuse or be transported into the enterocytes found in the intestinal wall.

Fatty Acid Digestion: Emulsification Emulsification of triacylglycerol occurs mainly via the action of bile acids. These are ionized sodium salts derived from steroids, which are produced by the oxidation of cholesterol in the liver. Cholesterol, which is ingested as part of the diet or synthesized by the liver, is converted into both the cholic and chenodeoxycholic forms of bile acids. These compounds are then conjugated to an amino acid (either glycine or taurine) to yield the acid-conjugated form, which is secreted into the canaliculi of the liver. Bile acids are sometimes referred to as facial amphipathic compounds, meaning that one outer surface of the molecule is hydrophobic and the other is hydrophilic. Amphipathic compounds are thus well suited to fat emulsification, as they can form lipid-containing micelles. Adult humans produce anywhere from 400 to 800 ml of bile daily. Secretion of bile occurs in two main stages:

1. Hepatocytes secrete bile into the liver canaliculi, from which it flows into bile ducts. At this point, hepatic bile contains large quantities of bile acids, cholesterol, and other organic molecules.
2. As bile flows through the bile ducts, ductal epithelial cells secrete a bicarbonate-rich product into the bile ducts.

Bile is stored and concentrated in the gall bladder during the fasting state and then secreted into the intestines after a meal in order to break down any fatty acids that have been ingested. Micelles formed by the bile acids in the intestines have a detergent effect on fats by


surrounding the fatty acids with more polar groups. The hydrophobic sterol side of the salt faces inward toward the fat, and the carboxyl (polar) side faces outward. In the presence of bile, 97% of ingested triglycerides are absorbed by the intestines, whereas in the absence of bile, only 50–60% of fat molecules present are absorbed. Hence, diseases that impair bile secretion such as biliary obstruction lead to severe fat malabsorption, although the lack of absorption is less acute for medium-chain triglycerides. Thus, while the action of bile does not by itself affect digestion, it greatly increases the surface area of the fats, allowing for a much greater rate of breakdown by lipases, a process termed lipolysis (Mu et al. 2005).
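The absorption figures quoted above (97% with bile, 50–60% without) translate directly into amounts of fat absorbed from a meal. The following is a small worked example; the meal size is hypothetical and the function name is an illustrative assumption.

```python
def absorbed_fat_g(ingested_g: float, fraction_absorbed: float) -> float:
    """Grams of triglyceride absorbed for a given absorption fraction (0 to 1)."""
    if not 0.0 <= fraction_absorbed <= 1.0:
        raise ValueError("fraction_absorbed must be between 0 and 1")
    return ingested_g * fraction_absorbed


MEAL_FAT_G = 30.0  # hypothetical fat content of one meal, in grams

with_bile = absorbed_fat_g(MEAL_FAT_G, 0.97)           # 97% absorbed with bile
without_bile_low = absorbed_fat_g(MEAL_FAT_G, 0.50)    # lower bound without bile
without_bile_high = absorbed_fat_g(MEAL_FAT_G, 0.60)   # upper bound without bile

print(f"with bile: {with_bile:.1f} g")
print(f"without bile: {without_bile_low:.1f} to {without_bile_high:.1f} g")
```

For this hypothetical meal, roughly 11 to 14 g of fat that would otherwise be absorbed is lost when bile is absent.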

Fatty Acid Digestion: Lipolysis As discussed in section “Fatty Acid Digestion: Emulsification,” the rate of fat digestion is affected by the surface area of the interface between the fatty acid and the aqueous environment, which is greatly increased by both peristalsis and the action of bile acids. Fats are then cleaved via the action of pancreatic lipases such as phospholipase A2, forming free fatty acids and monoglycerides. The mechanism of cleavage requires the protein cofactor colipase, which, in binding to the lipase enzyme, opens a hydrophobic channel. The hydrophobicity of this channel allows the lipase to interact directly with the target lipid molecules. Free fatty acids and glycerol molecules produced during this reaction are then absorbed by the enterocytes of the intestinal wall. Fatty acids and 2-monoglycerides may enter the enterocyte via simple diffusion, but a significant number also enter via the FAT (fatty acid transporter). In general, fatty acids which have a chain length of fewer than 14 carbons enter directly into the portal vein system and are transported to the liver. Fatty acids with 14 or more carbons are re-esterified within the enterocyte and enter the circulation via the lymphatic route as chylomicrons (see section “Chylomicrons”). Fat-soluble vitamins (vitamins A, D, E, and K) and cholesterol


Fatty Acid Metabolism, Fig. 2 Chylomicron structure. The outer phospholipid layer shields the inner triacylglycerols from the aqueous environment. Apolipoproteins include apolipoproteins A and B-48 in nascent chylomicrons, and apolipoproteins C-II and E in mature chylomicrons. Chylomicrons deliver their cargo to both muscle tissue (for energy) and adipose tissue (for storage) before delivering a payload of mostly cholesterol to the liver as a chylomicron remnant

are delivered directly to the liver as a part of chylomicron remnants. There are several classes of diseases that influence the secretion of lipase enzymes from the pancreas, such as cystic fibrosis, which also result in the malabsorption of fatty acids (Glatz et al. 2010; Horowitz 2003; Horowitz 2000; Mansbach and Siddiqi 2010).
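The chain-length rule stated in this section (fewer than 14 carbons to the portal vein, 14 or more to the lymphatic route as chylomicrons) can be written as a simple decision function. This is a sketch of the general rule only; the function name and return strings are illustrative, and the text itself notes that the rule holds “in general.”

```python
def absorption_route(chain_length: int) -> str:
    """Return the general post-absorption route for a fatty acid of given carbon count."""
    if chain_length < 1:
        raise ValueError("chain_length must be a positive number of carbons")
    if chain_length < 14:
        # Shorter chains enter the portal vein system directly.
        return "portal vein to liver"
    # Chains of 14 or more carbons are re-esterified in the enterocyte.
    return "re-esterified, lymphatic route as chylomicrons"


# Example fatty acids with their carbon counts.
for name, carbons in [("lauric acid", 12), ("myristic acid", 14), ("palmitic acid", 16)]:
    print(f"{name} (C{carbons}): {absorption_route(carbons)}")
```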

Chylomicrons After absorption of free fatty acids in the intestines, free fatty acids must be re-formed as triacylglycerides and transported to adipose and/or muscle tissues. Triacylglycerols are thus packaged into large lipoprotein complexes (chylomicrons, see Fig. 2), which transport these molecules from the intestinal lymph vessels to the bloodstream and deliver dietary TAGs to adipose and muscle tissues. Chylomicrons are less chemically active than free fatty acids and also serve to deliver dietary cholesterol to the liver. Chylomicrons are made up mainly of triglycerides, but also contain small amounts of phospholipids, cholesterol, and proteins. Synthesis of chylomicrons occurs in two main stages. Chylomicrons in the first stage of synthesis (nascent chylomicrons) contain only apolipoproteins A and B-48, which are exclusive to these structures. Nascent chylomicrons are secreted by intestinal cells and transported to the blood, where they acquire apolipoprotein C-II and Apo E, becoming mature chylomicrons. ApoC-II then activates lipoprotein lipase, which catalyzes the hydrolysis of triacylglycerols contained within the mature chylomicrons. Chylomicrons release triacylglycerol, cholesterol, and lipoproteins during their travels through the circulatory system. Fatty acids and glycerol released from the chylomicrons are taken up by various cell types according to need including adipose and muscle tissues. Hydrolysis and release of fatty acids allow chylomicrons to shrink in size and become chylomicron remnants. These remnants contain a cholesterol-rich core and are taken up in the liver via receptor-mediated endocytosis (Mansbach and Siddiqi 2010).

Fatty Acid Storage In humans and other mammals, adipose tissue is distributed all over the body, but accumulates in particular under the skin, around deep blood vessels, and in the abdominal cavity. The primary function of adipocytes is to act as a store for excess energy consumed in the diet, although secondary functions include protection of vital organs and insulation from environmental conditions. Dietary intake of fat or carbohydrate in excess of the energy requirements of the body leads to the accumulation of fatty acids stored as triglycerides in fat cells. There are two conflicting hypotheses


as to the mechanism of transport across the adipocyte membrane:

1. Since FAs are hydrophobic and acidic molecules, it would be logical to argue that they are able to diffuse across the membrane freely.
2. Others have hypothesized a transport protein-mediated process.

Several lipid transporters have been identified, including adipocyte LDL receptor-related protein 1, which modulates lipid uptake in murine adipocytes; this lends credence to the transport-mediated hypothesis. Before entering the adipocyte, triacylglyceride molecules are hydrolyzed to free fatty acids and glycerol through the action of the enzyme lipoprotein lipase (LPL), which is bound to the external surface of adipose cells. Apoprotein C-II activates this enzyme, as does insulin, which is released into the bloodstream after a meal. The liberated free fatty acids are then taken up by the adipose cells and resynthesized into triglycerides, which accumulate in a fat droplet in each cell (Lillis et al. 2008; Ehehalt et al. 2006).

Adipose Tissue

Adipose tissue is considered to be a specialized form of connective tissue which acts as a store for excess dietary fat. Adipose tissue is found in mammals mainly in two different forms (white and brown adipose tissue, or WAT and BAT, respectively), but white adipose tissue is by far the most common form. Adipose tissue serves multiple purposes: (1) adipose tissue found directly below the skin insulates the organism from cold environmental conditions, (2) adipose tissue surrounding the internal organs acts as a physical cushion against jarring or falling, and (3) adipose tissue stores act as an extremely efficient source of energy. As the major form of energy storage, fat provides a buffer for energy imbalances during times of starvation or fasting. The efficiency of fats as energy storage is derived from their hydrophobicity. Since fats are stored with very little water, more energy can be derived per gram of fat (9 kcal) than per gram of carbohydrate (4 kcal) or protein (4 kcal). Indeed, approximately 60–85% of the dry weight of white adipose tissue is composed of triglyceride molecules. Average body fat content of an adult human hovers around 20%, allowing an individual to survive for about 1 month if necessary (provided he/she has an input of essential vitamins and water).

While most animals are able to convert carbohydrates to lipids, the opposite is rarely true; hence, organisms cannot use fatty acids exclusively as fuel. Tissues that function mostly anaerobically, such as erythrocytes, must rely on carbohydrate intake for energy. Additionally, under normal conditions, the cells of the brain are dependent upon glucose as their sole source of energy, and the brain utilizes about 25% of the carbohydrates absorbed from the diet. Under abnormal (very low carbohydrate) metabolic conditions, the brain does have the ability to use ketone bodies when they are present in high quantities, as demonstrated by the popular “Atkins Diet.”

Brown adipose tissue derives its color from a greater amount of vascularization and a high concentration of mitochondria. BAT is found in various locations throughout the human body, including the upper torso and neck areas. BAT is metabolically activated mainly in response to cold conditions, whereas WAT is active mainly during fasting. In order to protect the body from the cold, the mitochondria of brown adipose tissue contain a specific ion carrier called uncoupling protein that transfers protons from the intermembrane space back into the matrix while bypassing the production of ATP. Thus, the energy of the lipid molecules oxidized in brown adipose tissue is largely released directly as heat instead of being captured as ATP for use by the organism. Indeed, murine studies have shown a link between increases in BAT and weight loss.

The adipocytes in adipose tissue can be classified as unilocular or multilocular depending on the placement and number of lipid droplets contained within them. Unilocular cells contain a single large lipid droplet which displaces the nucleus and other organelles, pushing them up against the plasma membrane. Unilocular cells are normally found in white adipose tissue. Multilocular cells, which are typical of BAT, contain smaller and more numerous lipid droplets and do not contain displaced organelles (Horowitz 2003; Horowitz 2000; Beranger et al. 2013; Richard and Picard 2011).
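The energy-density argument made in this section can be illustrated with a short calculation. The sketch below is not part of the original entry: it uses only the per-gram values quoted in the text (9 kcal for fat, 4 kcal for carbohydrate or protein), while the function names and the 14 kg example (about 20% of an assumed 70 kg adult) are hypothetical choices made for illustration.

```python
# Energy densities quoted in the text (kcal per gram).
KCAL_PER_GRAM = {"fat": 9.0, "carbohydrate": 4.0, "protein": 4.0}


def stored_energy_kcal(mass_g: float, fuel: str) -> float:
    """Return the energy (kcal) held in mass_g grams of the given fuel."""
    return mass_g * KCAL_PER_GRAM[fuel]


def equivalent_mass_g(mass_g: float, source: str, target: str) -> float:
    """Mass of `target` fuel that stores the same energy as mass_g of `source`."""
    return stored_energy_kcal(mass_g, source) / KCAL_PER_GRAM[target]


# Hypothetical example: 14 kg of fat (about 20% of an assumed 70 kg adult).
fat_mass_g = 14_000
print(stored_energy_kcal(fat_mass_g, "fat"))                 # kcal stored as fat
print(equivalent_mass_g(fat_mass_g, "fat", "carbohydrate"))  # grams of carbohydrate
```

Under these assumptions, storing the same energy as carbohydrate would take 2.25 times the dry mass, before accounting for the water that accompanies stored carbohydrate.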

Development of Adipose Tissue

Fat deposits in organisms can increase or decrease either due to an increase/decrease in the size of the adipocytes already present (hypertrophic/hypotrophic growth) or due to an increase/decrease in the total number of adipocytes in a certain area (hyperplastic/hypoplastic growth). In adult humans, adipocytes demonstrate mostly hypertrophic/hypotrophic growth and not hyperplastic/hypoplastic growth. Just six types of fatty acids make up the majority of the total fatty acid content of adipocytes: myristic (n-tetradecanoic acid, C-14), palmitic (n-hexadecanoic acid, C-16), palmitoleic ((9Z)-hexadecenoic acid, C-16), stearic (octadecanoic acid, C-18), oleic ((9Z)-octadecenoic acid, C-18), and linoleic ((9Z,12Z)-octadecadienoic acid, C-18) acids. Dietary variation has been shown to induce some variation in the fatty acid makeup of an individual’s adipocytes. Individuals who eat a linoleic acid-rich diet (such as a Mediterranean diet), for example, have increased linoleic acid content in the blood and other tissues (Beranger et al. 2013).

Fatty Acid Synthesis (Lipogenesis)

The multifunctional fatty acid synthase I (FASI) protein is a very large (273 kDa) multi-domain protein localized in the cytosol. All de novo fatty acid synthesis reactions take place in the cytosol as well. FASI is folded into several globular domains, each of which catalyzes a different reaction in the chain of fatty acid synthesis. Hence, FASI is a single enzyme which can engage in multiple functions, in which the products of one domain act as the substrate for the next domain in a chain-like set of reactions. Mammalian FASI is a homodimer, each polypeptide of which contains seven catalytic domains.

The N-terminal section of the polypeptide contains three catalytic activities:

1. Ketoacyl synthase
2. Malonyl/acetyltransferase
3. Dehydratase

The C-terminal region contains four catalytic activities:

4. Enoyl reductase
5. Ketoacyl reductase
6. Acyl carrier activity
7. Thioesterase

Malonyl-CoA is required as a substrate for FASI. The generation of malonyl-CoA is catalyzed separately from the FASI-mediated reactions by an enzyme called acetyl-CoA carboxylase. The major FASI product, palmitic acid, is synthesized by a series of 37 sequential reactions. The enzymatic steps of FAS involve decarboxylative condensation, reduction, dehydration, and another reduction, and result in a saturated acyl moiety with two additional methylene groups at the end of each cycle. NADPH is required as an electron donor in the reductive reactions (see Fig. 3). FASI synthesizes mainly palmitic acid (C16) and is unable to synthesize fatty acids shorter than C12. Both lauric acid (C12) and myristic acid (C14) are synthesized by FASI and are used as substrates for the acylation of proteins and in membrane assembly more than for the production of energy. Chain length seems to be tissue specific and controlled by an uncharacterized activity separate from the FASI complex. Humans are unable to synthesize two essential fatty acids: the polyunsaturated linoleic (C18:2) and alpha-linolenic (C18:3) acids. FASI is able to use acetyl-CoA derived from any metabolite for lipogenesis, but carbohydrates are the origin of the bulk of the acetyl-CoA used in this process (Leibundgut et al. 2008).


Fatty Acid Metabolism, Fig. 3 The lipogenesis pathway. Lipogenesis converts acetyl-CoA, an intermediate in glucose metabolism, into fats. Fatty acids are synthesized two carbons at a time and then are esterified with glycerol to form triglycerides. Insulin, high levels of which are associated with the fed state, is a positive regulator of lipogenesis

Enzymatic Steps of Fatty Acid Synthesis

Lipogenesis uses an entirely separate pathway from beta-oxidation to generate fatty acids from acetyl-CoA. There are three major processes involved in the reductive synthesis of fatty acids:

1. Acetyl-CoA must be transported from the matrix of the mitochondria into the cytoplasm.
2. The true substrate for FAS is malonyl-CoA, a 3-carbon compound which is formed by the carboxylation of acetyl-CoA.
3. The chain is elongated for further lipid biosynthesis.

Step 1: Acetyl-CoA Transport to the Cytoplasm. All fatty acid synthesis occurs in the cytoplasm. Acetyl-CoA is exported from the mitochondrial matrix as a citrate molecule. Then, in the cytoplasm, the citrate is split back into acetyl-CoA and oxaloacetate. The oxaloacetate moiety either is transported back into the mitochondrial matrix to take part in further turns of the citric acid cycle or is used for gluconeogenesis. Once in the cytoplasm, acetyl-CoA is carboxylated by the enzyme acetyl-CoA carboxylase to malonyl-CoA.

Step 2: Formation of Malonyl-CoA. Acetyl-CoA carboxylase carries out this first committed step of the fatty acid synthesis pathway:

HCO3− + ATP + acetyl-CoA → ADP + Pi + malonyl-CoA

In mammals, acetyl-CoA carboxylase is negatively regulated by palmitoyl-CoA, a product of lipogenesis, and positively regulated by citrate, the form in which acetyl-CoA is exported to the cytoplasm. The enzyme is also regulated via phosphorylation. When active, acetyl-CoA carboxylase self-assembles into long filamentous multimers. When these multimers dissociate into monomers, the carboxylase transitions back to its inactive state.


Step 3: Chain Elongation. Chain elongation itself occurs in multiple steps as the malonyl and acyl substrates are cycled through the different domains of FASI. In the first two steps, malonyl-CoA and acetyl-CoA are both transferred to the acyl carrier protein (ACP) via the malonyl/acetyl-CoA ACP transacylase activity of FASI. Once covalently linked to the ACP subunit, the activated substrate is cycled by a pantetheine arm from catalytic site to catalytic site through the FASI multi-enzyme as described below.

1. One acetyl-ACP molecule and one malonyl-ACP molecule are used to form the C4 acyl unit acetoacetyl-ACP by the enzyme activity β-ketoacyl synthase. This condensation reaction yields a carbon dioxide molecule.
2. The ketone group in the beta position is then reduced to an alcohol by β-ketoacyl-ACP reductase. This reaction uses a molecule of NADPH.
3. A dehydration step (catalyzed by β-hydroxyacyl-ACP dehydratase) introduces a trans carbon double bond. This reaction produces a water molecule.
4. The trans double bond formed in step 3 is reduced by a second NADPH to the fully saturated hydrocarbon unit butyryl-ACP (4 carbons) by enoyl-ACP reductase.

This cycle is repeated with the condensation of a second malonyl-ACP molecule (from a second acetyl-CoA). When the fatty acid is 16 carbon atoms long, the thioesterase domain of fatty acid synthase catalyzes hydrolysis of the thioester linking the fatty acid to phosphopantetheine. These reactions are shown in Fig. 3. The net reaction and energy consumption of forming one molecule of palmitoyl-CoA from acetyl-CoA units is

8 acetyl-CoA + 7 ATP + 14 NADPH + 14 H+ → palmitoyl-CoA + 7 ADP + 7 Pi + 7 CoA + 14 NADP+ + 7 H2O

For further reading, please see (Asturias et al. 2005; Jakobsson et al. 2006).
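The coefficients in the net reaction above follow from counting elongation cycles. The sketch below is an illustrative tally, not part of the original entry: the function name and dictionary keys are hypothetical, and it assumes one acetyl primer, one malonyl unit and one CO2 per cycle, one ATP per malonyl-CoA formed, and two NADPH per cycle, as described in the text.

```python
def synthesis_requirements(chain_length: int) -> dict:
    """Tally substrates for building a saturated, even-chain fatty acyl group.

    Assumes one acetyl primer, then one malonyl unit per elongation cycle,
    one ATP per malonyl-CoA formed, and two NADPH per cycle.
    """
    if chain_length < 4 or chain_length % 2 != 0:
        raise ValueError("chain_length must be an even number >= 4")
    cycles = (chain_length - 2) // 2  # each cycle adds two carbons to the primer
    return {
        "elongation_cycles": cycles,
        "acetyl_CoA": cycles + 1,     # one primer plus one per malonyl-CoA
        "ATP": cycles,                # consumed by acetyl-CoA carboxylase
        "NADPH": 2 * cycles,          # ketoacyl reduction and enoyl reduction
        "CO2_released": cycles,       # from each decarboxylative condensation
    }


print(synthesis_requirements(16))
```

For a 16-carbon chain this gives seven cycles, 8 acetyl-CoA, 7 ATP, and 14 NADPH, matching the coefficients of the net reaction.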


Mobilization of Stored Fatty Acids

Energy from fatty acids is most important to organisms during fasting. Low glucose levels during fasting conditions lead to the production of glucagon by the pancreas, which in turn activates hormone-sensitive lipases in adipose tissue. Epinephrine and β-corticotropin may also play a role in fatty acid mobilization. Lipids are stored in adipose tissue in the form of lipid droplets surrounded by a phospholipid monolayer, reminiscent of chylomicrons in the circulation (see section “Chylomicrons”). These phospholipid monolayers are coated with proteins called perilipins (of which perilipin A is the most abundant). Perilipin A is phosphorylated by protein kinase A in response to receptor activation, leading to changes in its conformation and activity. Hyperphosphorylated perilipin is no longer able to protect the triacylglycerides within the core of the lipid droplets, allowing two proteins involved in fatty acid mobilization, desnutrin and hormone-sensitive lipase (HSL), access to these molecules.

Desnutrin (also known as adipose triglyceride lipase (ATGL)) is a 54 kDa protein which contains an N-terminal patatin-like domain. Desnutrin first breaks down the triacylglycerols in the core of the lipid droplet to provide diacylglycerols for lipolysis by hormone-sensitive lipase. Hence, this protein greatly enhances the release of fatty acids by HSL. Adipose tissue HSL is a cytoplasmic protein responsible for the hydrolysis of triacylglycerols and diacylglycerols, as well as cholesteryl and retinyl esters. HSL may hydrolyze fatty acids from either the 1-carbon or the 3-carbon of the triacylglyceride. Hydrolysis mediated by HSL has long been considered to be the rate-limiting step in lipolysis. However, it was recently discovered that residual hydrolysis of TAG occurs in murine fat cells lacking HSL. These findings, together with the fact that desnutrin is selective for the TAG substrate, suggest that desnutrin is responsible mainly for TAG hydrolysis and HSL is mainly responsible for DAG hydrolysis.

Hormone-sensitive lipase is regulated by protein kinase A via alternative phosphorylation on four phospho-sites, which are all serines. Serine-563, known as the regulatory phosphorylation site, is phosphorylated upon elevation of cyclic AMP (cAMP). Serine-565, the basal phosphorylation site, is phosphorylated by several enzymes in adipocytes. These kinases include AMP-activated protein kinase, calmodulin-dependent protein kinase, and glycogen synthase kinase. However, protein kinase A does not phosphorylate the basal site of HSL (serine-565). Serine-565 remains phosphorylated in unstimulated adipocytes. There is some evidence that HSL, in addition to phosphorylation, can be stimulated or repressed according to its subcellular location. Hormone-sensitive lipase translocates to the lipid droplet when fat stores need to be mobilized after lipolytic stimulation. Insulin induces HSL to translocate back to the cytosol.

Mobilization of fat from adipose tissue is inhibited by numerous stimuli, the most significant of which is insulin. The insulin receptor, after binding to its ligand, autophosphorylates and then phosphorylates multiple downstream substrates including shc, PI-3-kinase, and Akt. Activation of PI-3-kinase leads eventually to the activation of phosphodiesterase-3B, which is inhibited by cyclic GMP. Phosphodiesterase-3B activation lowers cyclic AMP levels and decreases hormone-sensitive lipase activity. When an individual is in a well-fed state, insulin released from the pancreas prevents the inappropriate mobilization of stored fat. Instead, any excess fat and carbohydrate are incorporated into the triacylglycerol pool within the adipose tissue. Insulin thus inhibits lipolysis by activating a cyclic nucleotide phosphodiesterase that hydrolyzes cAMP. There is some evidence that insulin may also lead to increases in protein phosphatase activities, dephosphorylating HSL and thereby inhibiting it.

Lipolysis may also be inhibited by the presence of ketone bodies. During high rates of fatty acid oxidation, primarily in the liver, large amounts of acetyl-CoA are generated. If this acetyl-CoA exceeds the capacity of the citric acid cycle to oxidize it, one result is the synthesis of ketone bodies, or ketogenesis. Thus, ketone body formation leads to a negative feedback loop for the lipolytic cycle.

Lipolysis is induced by the following hormones: the catecholamines epinephrine and norepinephrine, as well as testosterone, glucagon, and cortisol. Catecholamines are powerful inducers of lipolysis both in vivo and in vitro. These hormones trigger activation of G protein-coupled receptors, which then activate adenylate cyclase. Adenylate cyclase is responsible for converting ATP into cyclic AMP (cAMP), which activates protein kinase A (PKA). PKA subsequently activates lipases found in adipose tissue.

Lipolysis and fatty acid oxidation are balanced against glycolysis such that an increase in the oxidation of FA causes a decrease in the utilization of glucose. Conversely, a reduction in the supply of free fatty acids leads to an increase in the use of glucose as the oxidative fuel. Lipid oxidation is induced by an increase in free fatty acid availability, and a reciprocal decrease in glucose oxidation occurs in this instance (Horowitz 2003; Lopaschuk et al. 2010; Lafontan et al. 2009; Duncan et al. 2007; Alberts et al. 2009).

Beta-Oxidation of Fatty Acids in the Mitochondria

Although β-oxidation of fatty acids may occur in either the mitochondria or the peroxisome, only mitochondrial β-oxidation will be discussed in this entry. The two processes are similar, but make use of different sets of enzymes. Peroxisomal β-oxidation enzymes are specific for longer chain lengths (14 and above) than are mitochondrial enzymes, which prefer C4-C12 chain lengths. Beta-oxidation of fatty acids in the mitochondria involves three stages:

1. Activation of fatty acids in the cytosol
2. Transport of fatty acids into mitochondria (carnitine shuttle)
3. Beta-oxidation proper, which occurs in the mitochondrial matrix


Activation of Fatty Acids

Before they can be degraded, fatty acids must be activated by coenzyme A, forming a fatty acyl-CoA thioester. This molecule then reacts with carnitine to form acylcarnitine, which is transported across the inner mitochondrial membrane in order to undergo oxidation. Short and medium length fatty acids are activated in the mitochondria. However, the larger long-chain fatty acids are unable to translocate across the mitochondrial membrane without a carrier. Thus, they are transformed into acylcarnitine derivatives by carnitine transferase I on the outer mitochondrial membrane. These derivatives are then transported across the inner membrane by a translocase and are passed to the enzyme carnitine transferase II on the matrix side. Carnitine transferase II reattaches the fatty acyl group to coenzyme A, resulting in a molecule identical to the original fatty acyl-CoA.

After the activation step, β-oxidation of saturated fatty acids consists of a recurring cycle of four steps, often called a “spiral” due to its sequential nature. The β-oxidation of fatty acids occurs two carbons at a time. These reactions occur in the mitochondria and are closely associated with the electron transport chain to produce energy in the form of ATP. The products of beta-oxidation include FADH2, NADH, H+, and acetyl-CoA, which then passes through the citric acid cycle. One round of beta-oxidation may be formulated as follows:

Fatty acyl-CoA (Cn) + CoA + NAD+ + FAD + H2O → fatty acyl-CoA (Cn−2) + acetyl-CoA + NADH + H+ + FADH2

Beta-Oxidation

Each cycle through the β-oxidation spiral produces one molar equivalent of NADH, one molar equivalent of FADH2, and one molar equivalent of acetyl-CoA (see above). Acetyl-CoA, the end product of each round, then enters the citric acid cycle, where it is further oxidized to CO2 with the concomitant generation of three molar equivalents of NADH, one molar equivalent of FADH2, and one molar equivalent of ATP. Beta-oxidation, so named because it proceeds via sequential removal of 2-carbon units by oxidation at the β-carbon position of the fatty acyl-CoA molecule, may be described as a cycle of four reactions, each round of which shortens the fatty acid hydrocarbon chain by two carbons.

Step 1: Oxidation (Dehydrogenation). Dehydrogenation occurs at the number two and three carbons to form a trans double bond. Concomitantly, FAD is converted to FADH2, which donates its electrons to the electron transport chain.

Step 2: Hydration. After dehydrogenation, a hydration reaction occurs. Water adds across the alkene formed in step one, placing a hydroxyl group on the beta-carbon. This reaction is the reason that carbons leave the molecule two by two.

Step 3: Oxidation. The alcohol formed in step two is oxidized to a ketone on the beta-carbon. At this point, NAD+ is also converted to NADH + H+, which links directly to the electron transport chain.

Step 4: Cleavage. The final step in the beta-oxidation spiral is the cleavage of acetyl-CoA from the original fatty acid chain. At the same time, a new coenzyme A molecule makes a new thioester bond to the β-carbon carbonyl group.

The four steps are repeated until the fatty acid molecule is completely digested. Oxidation of unsaturated fatty acids is essentially the same process as for saturated fats, except when a double bond is encountered. In such a case, the bond is isomerized by a specific enoyl-CoA isomerase and oxidation continues apace. These steps are summarized in Fig. 4 (Lopaschuk et al. 2010; Osmundsen et al. 1991; Schulz 2008; Stumpf 1969; Sugden and Holness 1994).
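The per-round and citric acid cycle yields stated above can be combined into a simple tally for a saturated, even-chain fatty acid. This is a minimal sketch rather than part of the original entry: the function name and dictionary keys are hypothetical, and the only numbers used are those given in the text (one NADH and one FADH2 per round; three NADH, one FADH2, and one ATP per acetyl-CoA oxidized in the citric acid cycle).

```python
def beta_oxidation_yield(chain_length: int) -> dict:
    """Tally reduced cofactors from complete oxidation of a saturated,
    even-chain fatty acyl-CoA, using the per-round yields quoted in the text."""
    if chain_length < 4 or chain_length % 2 != 0:
        raise ValueError("chain_length must be an even number >= 4")
    rounds = chain_length // 2 - 1   # the final round releases two acetyl-CoA
    acetyl_coa = chain_length // 2   # every two carbons end up as one acetyl-CoA
    return {
        "rounds": rounds,
        "acetyl_CoA": acetyl_coa,
        "NADH": rounds + 3 * acetyl_coa,   # spiral + citric acid cycle
        "FADH2": rounds + 1 * acetyl_coa,  # spiral + citric acid cycle
        "ATP_citric_acid_cycle": acetyl_coa,
    }


print(beta_oxidation_yield(16))  # palmitoyl-CoA
```

For palmitoyl-CoA (C16) the tally is seven rounds and eight acetyl-CoA. Converting the reduced cofactors into ATP would require ATP-per-cofactor ratios that the text does not give, so that step is deliberately left out.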

Fatty Acid Metabolism, Fig. 4 The beta-oxidation spiral of fatty acids. Fatty acids are oxidized two carbons at a time, similar to lipogenesis. However, the reactions of beta-oxidation are very different. The two carbons being cleaved in this turn of the spiral are shown in red for clarity. Each turn of the spiral produces one molar equivalent of acetyl-CoA (which feeds into the citric acid cycle), one molar equivalent of FADH2, and one molar equivalent of NADH (both of which feed directly into the electron transport chain). Each cycle shortens the chain length by two carbons until the molecule is digested completely

ω-Oxidation in the Endoplasmic Reticulum

There are two locations for fatty acid oxidation in vertebrates: the endoplasmic reticulum (ER) and the mitochondria. In vertebrates, the enzymes for ω-oxidation are located in the endoplasmic reticulum instead of in the mitochondria as with β-oxidation. This process involves the oxidation of the ω-carbon (the carbon most distant from the carboxyl group of the fatty acid) instead of the β-carbon. Omega-oxidation is normally a minor catabolic pathway for medium-chain fatty acids (10–12 carbon atoms), but becomes more important when β-oxidation is defective. There are three main steps in this pathway.

Step 1: Hydroxylation. Hydroxylation is carried out by the enzyme mixed function oxidase and introduces a hydroxyl group onto the ω-carbon.

Step 2: Oxidation. This step occurs via the action of the enzyme alcohol dehydrogenase, which oxidizes the added hydroxyl group to an aldehyde and produces NADH.

Step 3: Oxidation. This step occurs via the action of the enzyme aldehyde dehydrogenase, which oxidizes the aldehyde group to a carboxylic acid. This reaction also produces one molecule of NADH, and the final product is a fatty acid with a carboxyl group at each end.

After these oxidation steps, the fatty acid may be attached to coenzyme A, and the resulting molecule can enter the mitochondrial β-oxidation spiral (Wanders et al. 2011).

Regulation of Fatty Acid Oxidation

Synthesis and degradation of fatty acid molecules must be tightly regulated, as the energy requirements of organisms change constantly as they move between the fasting and the fed states. The primary organ for sensing an organism’s energy supply and needs is the pancreas, which monitors glucose concentrations in the blood. Low blood glucose stimulates the secretion of glucagon, whereas elevated blood glucose induces the secretion of insulin. The metabolism of fat is regulated by two distinct mechanisms. The first is short-term regulation. This mechanism acts in response to conditions such as substrate availability, allosteric effectors, and/or enzyme modification (usually β-oxidation is positively regulated in this way by glucagon and negatively regulated by insulin). The other mechanism, long-term regulation, is achieved by alteration of the rate of enzyme synthesis and turnover.

Hormone-sensitive lipase (HSL) found in adipose tissue is, like acetyl-CoA carboxylase (ACC), phosphorylated by protein kinase A (PKA); in the case of HSL, this phosphorylation activates the enzyme. Activation of HSL leads to an increase in the release of fatty acids into the blood, which in turn leads to an increase in the oxidation of fatty acids in other tissues. In the liver, the net result of this activity (due to increased acetyl-CoA levels) is the production of ketone bodies. Ketone synthesis occurs when carbohydrate stores and gluconeogenic precursors available in the liver are insufficient to allow glucose production for eventual degradation. The activity of HSL is also affected by its phosphorylation by AMPK. Surprisingly, in this case, phosphorylation inhibits the enzyme. To explain this seemingly paradoxical situation, it has been proposed that inhibition of HSL by AMPK-mediated phosphorylation is a mechanism to ensure that the rate of lipolysis does not exceed the rate at which fatty acids are either exported or oxidized.

Insulin, on the other hand, affects fatty acid metabolism in the opposite way to glucagon and epinephrine. Insulin increases the synthesis of triacylglycerols (and glycogen). One of the many effects of insulin is to lower cAMP levels, which leads to increased dephosphorylation through the enhanced activity of protein phosphatases such as protein phosphatase-1. With respect to β-oxidation, this leads to dephosphorylation and inactivation of hormone-sensitive lipase. The metabolism of lipids can also be regulated by inhibition of CPT I by malonyl-CoA. Inhibition of CPT I prevents de novo synthesized fatty acids from immediately entering the mitochondria and being re-oxidized, thereby shunting them to storage (Schulz 2008; Vance 1996; Sugden and Holness 1994).

Ketogenesis

Acetyl-CoA is produced by FA oxidation, which occurs primarily in the liver. The production of acetyl-CoA can sometimes exceed the capacity of the citric acid cycle to turn it into citrate. In this instance, acetyl-CoA is shunted into a pathway which results in the synthesis of ketone bodies (ketogenesis). Acetoacetate, β-hydroxybutyrate, and acetone are three ketone compounds synthesized in this way. When carbohydrate utilization is low or deficient, oxaloacetate levels will also decrease, resulting in reduced turning of the citric acid cycle. β-hydroxybutyrate synthesis increases when glycogen levels in the liver are produced at an increased rate. These conditions in turn lead to increased release of ketone bodies from the liver for use as fuel by tissues (except for brain). During the early stages of starvation, after most fats have already been oxidized, the heart and skeletal muscle will consume primarily ketone bodies to preserve glucose for use by the brain. Acetoacetate and β-hydroxybutyrate, in particular, also serve as major substrates for the biosynthesis of neonatal cerebral lipids.

Regulation of Ketogenesis

Ketogenesis takes place primarily in the liver and may be affected by several factors, depending on the individual. Firstly, the release of free fatty acids from adipose tissue directly affects the level of ketogenesis in the liver. Secondly, ATP demand may increase the further oxidation of acetyl-CoA to produce ketone bodies. Thirdly, the level of fat oxidation is regulated hormonally through phosphorylation of ACC, which may activate fat oxidation (in response to glucagon) or inhibit it (in the case of insulin) (Hegardt 1999).

Ketogenesis in Disease States

The production of ketone bodies occurs at a relatively low rate during normal feeding and under normal physiological conditions. The physiological response to carbohydrate shortages induces ketone body production in the liver and uses acetyl-CoA generated from fatty acid oxidation as a substrate. This allows the heart and skeletal muscles to primarily use ketone bodies for energy, thereby preserving any available glucose for use by the brain. Ketone body oxidation becomes a significant contributor to overall energy metabolism in numerous physiological states, including the neonatal period, starvation, the post-exercise period, and adherence to low-carbohydrate diets.

The most significant clinical outcome of ketosis occurs in untreated type 1 diabetes mellitus. Diabetic ketoacidosis (where ketone body concentrations can reach up to 20 mM) results from a reduced supply of glucose (due to a decrease in circulating insulin) and a concomitant increase in fatty acid oxidation (due to an increase in circulating glucagon). Increased production of acetyl-CoA leads to ketone body production that exceeds the ability of peripheral tissues to oxidize them. Ketone bodies are relatively strong acids (pKa around 3.5). Thus, increasing their concentration effectively lowers the pH of the blood, a dangerous situation, since hemoglobin-oxygen binding is disrupted at low pH (Fukao et al. 2004; Wakil and Abu-Elheiga 2009).

In addition to the clinical significance of ketoacidosis in diabetic patients, some recent evidence suggests that ketone bodies may play a significant role in myopathy of the heart. Hypertrophy of the heart may lead to increased ketone body utilization in the myocardium. While the data on ketone body metabolism in heart disease are currently limited, there is some indication that myocardial ketone body oxidation could promote worse outcomes (Cotter et al. 2013).
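The statement above that acids with a pKa around 3.5 lower blood pH can be made concrete with the Henderson-Hasselbalch relation, which is standard acid-base chemistry rather than something derived in this entry. The sketch below is illustrative only: the function name is hypothetical, the pKa of 3.5 is the value quoted in the text, and the blood pH of 7.4 is an assumed typical physiological value.

```python
def fraction_dissociated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid present as its conjugate base at a given pH.

    Rearranged Henderson-Hasselbalch: [A-] / ([HA] + [A-]) = 1 / (1 + 10**(pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


KETONE_BODY_PKA = 3.5   # approximate value quoted in the text
ASSUMED_BLOOD_PH = 7.4  # assumption: typical physiological blood pH

print(fraction_dissociated(KETONE_BODY_PKA, ASSUMED_BLOOD_PH))
```

Because the assumed blood pH lies almost four units above the pKa, essentially every ketone body acid molecule released into the blood gives up its proton, which is why concentrations in the millimolar range place a heavy load on the blood's buffering capacity.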

Inborn Errors of Fatty Acid Metabolism Fatty acid oxidation represents an important pathway for energy supply during the life cycle of many organisms. Therefore, disorders that affect this process are quite rare. Long-chain mitochondrial fatty acid oxidation defects are probably the most difficult to identify due to the episodic nature of their characteristic clinical manifestations. Difficulties in diagnosis may arise from the fact that symptoms are usually hidden during anabolism. The pathway of fatty acid beta-oxidation includes at least 25 enzymes and specific transport proteins, and deficiencies in more than 50% of these enzymes have been identified as disease causing. This section will focus on some of the more common

399

disorders: the oxidation disorders, carnitine palmitoyltransferase 2 (CPT2) deficiency, mitochondrial trifunctional protein (MTP) deficiency, and very-long-chain-acyl-CoA dehydrogenase (VLCAD) deficiency, and storage disorders such as Gaucher’s disease and Niemann-Pick disease. General Aspects of Fatty Acid Oxidation Disorders (FAODs). In fatty acid oxidation disorders, clinical symptoms are usually triggered by some metabolic stressor. These include infection, exercise, exposure to cold, or prolonged fasting. However, under non-stressed conditions, no episodes are triggered and patients can therefore lead a fairly normal life if they adhere to dietary and exercise restrictions. Because of the episodic nature of these disorders, they may remain concealed for a long time in some patients. In long-chain FAODs, as with many genetic disorders, the phenotypes range from relatively benign forms to severe phenotypes leading to early death. The most common symptom in the more benign forms is muscular impairment. Episodes are often triggered by fasting soon after birth in the more severe forms of this class of diseases. Later-onset patients usually present with exercise intolerance and muscle weakness. Myoglobinuria is another symptom of longchain FAODs and may lead to acute renal failure if fluid intake is decreased. Myopathy, manifesting mainly during periods of aerobic exercise, is another hallmark of these diseases. Carnitine Palmitoyltransferase 2 (CPT2) Deficiency. The CPT2 protein is located on the inner leaflet of the inner mitochondrial membrane and is related to the enzyme carnitine acyltransferase, discussed in section “o-Oxidation in the Endoplasmic Reticulum.” This enzyme catalyzes the addition of a coenzyme A moiety to long-chain acyl-carnitines. Deficiency in this enzyme is an autosomal recessive disorder. There are three phenotypes of CPT2 deficiency: The “classic muscular form” is most frequent and has a wide range of onset from childhood to adulthood. 
It presents with exercise-induced muscle weakness and rhabdomyolysis. A “severe neonatal form” presents in the newborn period with nonketotic hypoglycemia, cardiomyopathy, muscle weakness, and renal dysgenesis in some patients. Most of these patients die within days after birth. The

F

400

“infantile multisystemic phenotype” is also often fatal, but if the patients survive past the age of 13, they tend to have a much better prognosis. The latter presents with seizures, hepatomegaly, nonketotic hypoglycemia, cardiomyopathy, and muscle weakness. Mitochondrial Trifunctional Protein (MTP) Deficiency. MTP is an enzyme complex responsible for the beta-oxidation of fatty acids with chain lengths of C12 to C18 and is a component of the b-oxidation spiral (see Fig. 4). The MTP catalyzes three non-sequential reactions in the oxidation spiral and consists of the following enzymes: 3-hydroxyacyl-CoA dehydrogenase, 2-enoyl-CoA hydratase, and 3-ketoacyl-CoA thiolase. Similar to CPT2 deficiency, the severity of the phenotypes in this disorder varies widely and can be correlated with the severity of the patient’s mutation. Levels of 3-ketoacyl-CoA thiolase expression were especially found to correlate with the disease severity. Patients with MTP deficiency accumulate long-chain 3-hydroxyacylcarnitines and free fatty acids in blood as well as dicarboxylic acids in urine, although the presentation is variable. Again, just as in CPT 2 deficiency, three phenotypes have been reported in patients: a lethal form with predominating cardiac involvement, hypoketotic hypoglycemia, capillary leak syndrome and sudden death, an infancy-onset hepatic presentation, and a milder late-onset neuromyopathic form. The main features of the neuromyopathic form with late-onset are progressive peripheral neuropathy and exercise-induced myoglobinuria. Very-Long-Chain Acyl-CoA Dehydrogenase (VLCAD) Deficiency. The VLCAD protein is localized to the inner mitochondrial membrane and catalyzes the first step of the long-chain fatty acid beta-oxidation spiral (see Fig. 4). VLCAD deficiency is inherited in an autosomal recessive manner. 
As with the disorders described above, three phenotypes have been described: a severe infantile form presenting with hypertrophic cardiomyopathy and liver failure, a childhood-onset type with hypoketotic hypoglycemia, and a juvenile-/adult-onset muscular form characterized by recurrent episodes of rhabdomyolysis triggered by prolonged exercise or fasting (similar to CPT2


deficiency). Residual enzyme activity correlates strongly with less severe phenotypes in this disorder. The milder form of VLCAD deficiency is increasingly recognized due to the more widespread use of tandem mass spectrometry, which allows the detection of abnormal long-chain acylcarnitines during neonatal screening, prior to the onset of symptoms.

Medium-Chain Acyl-CoA Dehydrogenase (MCAD) Deficiency. Genetic defects in the enzyme medium-chain acyl-CoA dehydrogenase can produce a serious disease characterized by the inability to oxidize medium-chain fatty acids via β-oxidation. It is indicated by high urinary concentrations of medium-chain dicarboxylic acids produced via omega-oxidation.

General Principles in the Treatment of Fatty Acid Oxidation Disorders. Current treatment options for patients with long-chain fatty acid oxidation defects center on long-term dietary intervention: avoidance of fasting, a low-fat diet with restricted long-chain fatty acid intake, and substitution of long-chain fatty acids with medium-chain fatty acids. Patients must also avoid stressing the body with exercise. Although long-term outcomes in patients with fatty acid oxidation disorders (FAOD) have not yet been fully evaluated, many smaller studies indicate that good adherence to dietary restrictions leads to better outcomes. For most patients, clinical outcomes are also linked to the severity of the underlying enzyme defect. The incidence of overweight and obesity is increasing among children with long-chain FAOD, as these patients need a continuous supply of carbohydrate energy and are unable to maintain a vigorous exercise regimen due to the nature of their disease. A diet higher in protein and lower in carbohydrate may help to lower total energy intake while maintaining sufficient metabolic control (Sugden and Holness 1994; Bennett 2010; Das et al. 2010; Jones 2006).

References
Alberts B et al. (2009) Essentials of cell biology, 3rd edn. Garland Science
Asturias FJ, Chadick JZ, Cheung IK, Stark H, Witkowski A, Joshi AK, Smith S (2005) Structure and molecular organization of mammalian fatty acid synthase. Nat Struct Mol Biol 12:225–232
Bennett MJ (2010) Pathophysiology of fatty acid oxidation disorders. J Inherit Metab Dis 33(5):533–537
Beranger GE, Karbiener M, Barquissau V, Pisani DF, Scheideler M, Langin D, Amri EZ (2013) In vitro brown and “brite”/“beige” adipogenesis: human cellular models and molecular aspects. Biochim Biophys Acta 1831(5):905–914
Cotter DG, Schugar RC, Crawford PA (2013) Ketone body metabolism and cardiovascular disease. AJP – Heart 304(8):H1060–H1076
Das AM, Steuerwald U, Illsinger S (2010) Inborn errors of energy metabolism associated with myopathies. J Biomed Biotechnol 2010:340849
Duncan RE, Ahmadian M, Jaworski K, Sarkadi-Nagy E, Sul HS (2007) Regulation of lipolysis in adipocytes. Annu Rev Nutr 27:79–101
Ehehalt R, Füllekrug J, Pohl J, Ring A, Herrmann T, Stremmel W (2006) Translocation of long chain fatty acids across the plasma membrane–lipid rafts and fatty acid transport proteins. Mol Cell Biochem 284(1–2):135–140
Fukao T, Lopaschuk GD, Mitchell GA (2004) Pathways and control of ketone body metabolism: on the fringe of lipid biochemistry. Prostaglandins Leukot Essent Fatty Acids 70(3):243–251
Glatz JF, Luiken JJ, Bonen A (2010) Membrane fatty acid transporters as regulators of lipid metabolism: implications for metabolic disease. Physiol Rev 90(1):367–417
Hegardt FG (1999) Mitochondrial 3-hydroxy-3-methylglutaryl-CoA synthase: a control enzyme in ketogenesis. Biochem J 338(Pt 3):569–582
Horowitz JF (2000) Lipid metabolism during endurance exercise. Am J Clin Nutr 72:558S
Horowitz JF (2003) Fatty acid mobilization from adipose tissue during exercise. Trends Endocr Metab 14:386
http://themedicalbiochemistrypage.org/fatty-acid-oxidation.php. Accessed 15 Apr 2014
Jakobsson A, Westerberg R, Jacobsson R (2006) Fatty acid elongases in mammals: their regulation and roles in metabolism. Progr Lipid Res 45:237–249
Jones KL (2006) Smith’s recognizable patterns of human malformation, 6th edn. Elsevier, Philadelphia. ISBN 13: 978-0-7216-0615-6
Kalish BT, Fallon EM, Puder M (2012) A tutorial on fatty acid biology. J Parenter Enteral Nutr 36(4):380–388
Lafontan M, Langin D (2009) Lipolysis and lipid mobilization in human adipose tissue. Prog Lipid Res 48(5):275–297
Leibundgut M, Maier T, Jenni S, Ban N (2008) The multienzyme architecture of eukaryotic fatty acid synthases. Curr Opin Struct Biol 18(6):714–725
Lillis AP, Van Duyn LB, Murphy-Ullrich JE, Strickland DK (2008) LDL receptor-related protein 1: unique tissue-specific functions revealed by selective gene knockout studies. Physiol Rev 88(3):887–918
Lopaschuk GD, Ussher JR, Folmes CDL, Jaswal JS, Stanley WC (2010) Myocardial fatty acid metabolism in health and disease. Physiol Rev 90(207–258):208–239
Mansbach CM, Siddiqi SA (2010) The biogenesis of chylomicrons. Annu Rev Physiol 72:315–333
Mu H, Porsgaard T (2005) The metabolism of structured triacylglycerols. Prog Lipid Res 44(6):430–448
Osmundsen H, Bremer J, Pedersen JI (1991) Metabolic aspects of peroxisomal beta-oxidation. Biochim Biophys Acta 1085(2):141–158
Ramírez M, Amate L, Gil A (2001) Absorption and distribution of dietary fatty acids from different sources. Early Hum Dev 65(Suppl):S95–S101
Richard D, Picard F (2011) Brown fat biology and thermogenesis. Front Biosci 16:1233–1260
Schulz H (2008) Oxidation of fatty acids in eukaryotes. In: Vance DE, Vance J (eds) Biochemistry of lipids, lipoproteins and membranes, 5th edn. Elsevier, Amsterdam, pp 131–154
Stumpf PK (1969) Metabolism of fatty acids. Annu Rev Biochem 38:159–212
Sugden MC, Holness MJ (1994) Interactive regulation of the pyruvate dehydrogenase complex and the carnitine palmitoyltransferase system. FASEB J 8(1):54–61
Vance DE, Vance JE (1996) Biochemistry of lipids, lipoproteins, and membranes. Elsevier, Amsterdam
Wakil SJ, Abu-Elheiga LA (2009) Fatty acid metabolism: target for metabolic syndrome. J Lipid Res 50:S138–S143
Wanders RJ, Komen J, Kemp S (2011) Fatty acid omega-oxidation as a rescue pathway for fatty acid oxidation disorders in humans. FEBS J 278(2):182–194

5′ Capping ▶ Co-transcriptional mRNA Processing in Eukaryotes

Functional Annotation – Functional Description ▶ Plant Genome Annotation, Methods for

Functional Repeat ▶ Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications

Functional Site ▶ Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications


G

Gene Cloning ▶ Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of

Gene Finding – Gene Prediction ▶ Plant Genome Annotation, Methods for

Gene Networks ▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks

Gene Regulation
April Hill and Sarah Friday
Department of Biology, University of Richmond, Richmond, VA, USA

© Springer Science+Business Media, LLC 2018
R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Synopsis

One of the fundamental questions in biology is: How is gene expression regulated in a context-dependent manner such that appropriate phenotypes (e.g., RNA, proteins) are produced given different environmental conditions? Indeed, appropriate genetic regulation is the hallmark of cellular viability. In multicellular organisms, cells that share identical genomes produce different phenotypes depending on specific spatial and temporal signals, bacterial cells produce different proteins when metabolic changes occur in the environment, and stem cells become heart or brain cells only when appropriate triggers are detected. This introductory entry will provide an overview of gene regulation in prokaryotic and eukaryotic systems. The subsequent chapters will deal more specifically with a suite of mechanisms used by cellular organisms to regulate expression and mechanisms employed to control the processing and production of gene products.

Development of the Field

It is widely recognized that the earliest and most seminal work on a gene regulatory system was that of Jacques Monod and François Jacob (Jacob et al. 1960), who formulated the model describing the lac operon. Their study of E. coli mutants involved in the metabolism of lactose (along with similar studies in bacteriophage λ) led to the understanding that genomes encode genes whose main role is to modify expression of other genes, in this case depending on the presence or absence of lactose. Their work also resulted in the proposal that both positive and negative regulatory factors could act on regions


near the promoter of a gene to control expression of genes regulated by that promoter. Indeed, their articulation of a model in which cis-acting regulatory DNA sequences interact with trans-acting regulatory proteins to modulate transcription remains the major paradigm on which subsequent studies have built to elaborate the mechanisms of transcriptional control of gene expression in all organisms. A few key observations in eukaryotic organisms set the stage for the eventual understanding that transcriptional control is fundamental to gene regulation. The development of a variety of molecular methodologies to study the levels (e.g., Southern, northern, and western blotting; PCR and RT-PCR) and locations (e.g., in situ hybridization, immunohistochemistry) of DNA, mRNA, and protein in the cells of a large variety of organisms enabled the scientific community to propose and test hypotheses about the mechanisms by which genes are regulated. Importantly, the finding that the DNA content of different cell types within an organism generally remains fairly constant throughout time and space, while the mRNA and subsequent protein content can vary greatly among those cells, led to the proposal that a major component of gene regulation must operate at the level of transcription to produce different RNA populations in the various cell types. A large body of evidence from numerous sources (e.g., polytene chromosomes in Drosophila (Ashburner 1971), nuclear run-on assays measuring tissue-specific mRNAs (Derman et al. 1981)) indicated that gene activity is primarily regulated at the level of transcription. More recently, a growing body of evidence has shown that another key step in the regulation of gene expression in both prokaryotes and eukaryotes is mediated by short and long RNA molecules that can act at both the transcriptional and posttranscriptional levels.
There are many levels at which differential gene expression is controlled in living organisms. The regulatory processes that lead cells to alter their patterns of gene expression in a spatial and temporal manner in response to changing environmental signals are complex and promise to be an unfolding area of research for many years to come. Initial understanding of


gene regulation began with a study of the mechanisms by which bacteria alter gene expression in order to metabolize nutrients from their environment. The lessons learned from early work on the lac operon remain a standard by which all gene regulation at the transcriptional level can be understood. However, as more complex genomes and multicellular organisms were studied, additional levels at which genes can be regulated came to light. For example, it is now understood that gene expression can be controlled at the level of chromatin, where both small and large alterations in chromatin structure and/or modifications to cellular DNA can lead to differential regulation of gene expression. Common mechanisms for regulation at the level of transcription have been discovered in prokaryotic and eukaryotic cells and include both positive and negative controlling processes. In eukaryotic organisms, a variety of posttranscriptional processes play important roles in regulating the fate of mRNAs, including determining whether or not those mRNAs will ultimately be translated. Finally, any consideration of gene regulation must be placed in appropriate developmental and physiological contexts as well as understood through the lens of evolution and gene regulatory networks.

Prokaryotic Gene Regulation: A Case Study of the Lac Operon

The regulation of the lac operon in E. coli has been described in detail in numerous textbooks over the past several decades. Here, a brief description and analysis of salient points that demonstrate basic features of gene regulatory mechanisms will be provided; readers are encouraged to seek further reading for finer details on the methodology of this case. The lac operon region contains cis-regulatory sequences (i.e., operator, CAP-binding site) involved in regulation of the downstream structural genes (i.e., lacZ encoding β-galactosidase, lacY encoding lactose permease, lacA encoding galactoside O-acetyltransferase) that code for proteins with roles in lactose metabolism. The region also contains a gene encoding a trans-acting factor


Gene Regulation, Fig. 1 (a) Schematic representation of the lac operon with active lac repressor bound to the operator (top) and with inactive repressor bound to lactose inducer allowing low levels of transcription of target genes, lacZ, lacY, and lacA (bottom). (b) Addition of positive regulation by cAMP and CAP protein binding to the CAP site of the lac operon leads to higher levels of transcription of target genes

(i.e., lacI encoding the lac repressor) that binds to one of the cis-regulatory elements (i.e., the operator) to modulate, in this case to repress, expression of the structural genes (see Fig. 1a). The constitutively transcribed lacI gene, which encodes the transcriptional repressor, is located upstream of the lac operon structural genes and


their cis-regulatory control elements. When translated to protein, the product of the lac I gene is a transcription factor (referred to as the lac repressor) that contains two important regions. One region of the lac repressor binds directly to cis-regulatory DNA sequences in the operator region of the lac operon. When bound to this


region of DNA, the lac repressor blocks RNA polymerase from binding to the promoter for the structural genes. The lac repressor is a homotetramer. The crystal structure reveals that each tetramer contains two DNA-binding sites that can interact with multiple sequences on the operator to induce DNA looping, which is likely involved in the inhibition of RNA polymerase binding. The fundamental principle here is that the lac repressor binds to DNA in the operator region near the lac promoter to keep the genes needed for lactose metabolism in a transcriptionally repressed state. This is a key feature of the regulation of many (if not most) genes in both prokaryotic and eukaryotic systems: transcriptional repressors keep genes turned off until appropriate signals are received that lead to modulation of the gene’s expression. In the case of the lac operon, one signal from the environment indicating that the structural genes involved in the metabolism of lactose should be expressed is the presence of lactose itself. As in most organisms, the energy needed to produce many proteins is conserved until the right time and place when those proteins are needed. In this case, E. coli can use lactose as a nutrient source for production of energy in the form of glucose (especially when the cell does not have access to glucose itself). When E. coli are in an environment with lactose present, an isomer of lactose (allolactose, often referred to as the inducer) will bind to the other important structural domain of the lac repressor (referred to as the regulatory or core domain). The binding of allolactose to the regulatory domain induces a conformational change that renders the DNA-binding domain of the repressor unable to bind to the cis-regulatory DNA sequence in the operator region. Thus, the presence of lactose in E. coli results in availability of the lac promoter for binding by RNA polymerase because the lac repressor protein can no longer bind to the operator sequence (Fig. 1a). Again, this is a common feature in gene regulation: the presence of some environmental signal leads to removal of repressive factors inhibiting gene expression.


For the structural genes required for lactose metabolism to be expressed at a high level, a positive regulator of transcription is also required. Here, too, the environmental cue is important. If the E. coli has a source of glucose for energy, it will utilize glucose until that resource is nearly exhausted. Only when the cell has low levels of glucose and needs an alternative nutrient source will other sugars (in this case lactose) be metabolized. The mechanism for this component of regulation is that glucose, when present, inhibits a key cellular enzyme, adenylyl cyclase. Adenylyl cyclase is responsible for catalyzing the formation of cyclic AMP (cAMP) from ATP. Without cAMP, the trans-acting factor CAP (an activator of transcription) will not bind to its cis-regulatory DNA-binding site (CAP-binding site) located upstream of the lac promoter (Fig. 1b). When the cell has low glucose, however, adenylyl cyclase is no longer inhibited. cAMP is then produced, and cAMP molecules bind to the CAP protein, increasing the protein’s affinity for DNA. CAP/cAMP will bind to the cis-regulatory DNA sequence at the CAP-binding site, which opens up the DNA molecule to enhance binding of RNA polymerase. When RNA polymerase is able to bind to the promoter, it will transcribe the structural genes for lactose metabolism at a high rate. The final point here is that in addition to removal of negative regulators of transcription, a positive input is also required. This is usually in the form of a transcriptional activator whose binding to cis-regulatory DNA will enhance or increase levels of transcription. Most genes have multiple cis-regulatory sites that are involved in modulating the precise expression levels of the target gene(s) through the binding of both transcriptional activators and transcriptional repressors.
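The two inputs described above (release of the repressor when lactose is present, and activation by CAP/cAMP when glucose is low) combine like a small truth table. The following Python sketch is an illustrative simplification rather than a quantitative model; the function name `lac_operon_transcription` and the labels "repressed", "low", and "high" are assumptions made for this example.

```python
def lac_operon_transcription(lactose_present: bool, glucose_present: bool) -> str:
    """Qualitative expression level of lacZ, lacY, and lacA (illustrative only)."""
    # Negative control: without lactose there is no allolactose, so the
    # lac repressor stays bound to the operator.
    repressor_on_operator = not lactose_present

    # Positive control: glucose inhibits adenylyl cyclase, so cAMP is made
    # (and CAP/cAMP binds the CAP site) only when glucose is low.
    cap_camp_bound = not glucose_present

    if repressor_on_operator:
        return "repressed"
    return "high" if cap_camp_bound else "low"


if __name__ == "__main__":
    # Print the full truth table for the two environmental inputs.
    for lactose in (False, True):
        for glucose in (False, True):
            level = lac_operon_transcription(lactose, glucose)
            print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {level}")
```

In this sketch, only the combination of lactose present and glucose absent reaches the "high" state, which corresponds to the situation shown in Fig. 1b.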

Environmental Cues and Cell Signaling

While this chapter will not formally address the various ways in which a cell can receive a signal either from the external environment (e.g., Fig. 2, Box 1) or from a neighboring cell (e.g., Fig. 2, Box 2), it is important to note that changes in gene


expression, and thus all gene regulation, are a product of environmental or cellular signals that modulate gene regulatory pathways. An examination of cell signaling pathways in both prokaryotic and eukaryotic systems is required for a more robust understanding of the mechanisms of signal transduction (for further information see volume 6). Cell signaling pathways are the main regulators of the activity of the transcription factors that regulate gene expression. Thus, it is through signal transduction pathways that transcription factors either get synthesized or become activated (e.g., phosphorylated). In prokaryotic systems, studies on quorum sensing have shown that bacteria can respond to environmental cues sensed when they are at high density to coordinate gene expression for processes like the formation of biofilms. For instance, Vibrio harveyi, a bioluminescent marine bacterium, produces a small chemical (called an autoinducer) that, when present at high enough concentration, will bind to membrane-bound receptors on the surface of the bacteria. When Vibrio harveyi are present at sufficient density, the autoinducer will bind to these receptors and cause a conformational change that gives the receptors phosphatase activity. When unbound, these receptors act as kinases. In this cascade, the switch between kinase and phosphatase activity of the receptor is a mechanism that will ultimately lead to either repression or activation of a downstream transcription factor. When these bacterial cells do not have their receptor bound by the autoinducer, the overall response is to repress transcription. Here, the receptor will behave as a kinase and phosphorylate a protein called LuxO. LuxO, in turn, will act as a repressor of the lux operon by binding to DNA and promoting the transcription of a short RNA molecule. This short RNA molecule will bind the mRNA of the main transcriptional activator LuxR. This is another common mechanism of repression in many organisms: the inhibition of gene expression by small RNA molecules. Without signal, the cell will repress translation of the LuxR transcription factor, which results in overall repression of the downstream target genes in the lux operon. When cells receive the signal of the


autoinducer, the receptor will then behave as a phosphatase and remove the phosphate group from LuxO. In this scenario, LuxO will no longer bind to DNA and promote the synthesis of the small inhibitor RNA. Without the small inhibitor RNA, the mRNA for the LuxR transcription factor can be translated to protein, and the LuxR transcription factor will be able to activate the expression of several downstream target genes involved in light production (Bassler 1999). In eukaryotes there are a variety of signaling pathways that control cell fate decisions during the development and life-spans of these organisms. Indeed, many of the pathways have been conserved over the course of evolution and play important roles in all aspects of organismal physiology. However, regardless of the diversity and complexity of the cell-cell signaling pathway in biochemical mechanism or method of signal transduction, the primary outcome of the signaling is always the activation of specific target genes by transcription factors that control gene expression. The study of the way in which signaling pathways in eukaryotes ultimately link to transcriptional control has revealed that, as in the prokaryotic examples above, there are fundamental commonalities that link the mechanisms of cell signaling with transcriptional control of gene regulation (Barolo and Posakony 2002). One such pathway that has been highly characterized in the metazoans is the Wnt signaling pathway. The canonical Wnt pathway is utilized in animals to regulate cell fate decisions and stem cell pluripotency during development and disease states. The molecular signaling cascade of the Wnt pathway integrates signals from a variety of other well-characterized pathways (e.g., BMP, TGF-β, retinoic acid) across many cell and tissue types. The Wnt signaling pathways transduce signals from outside the cell through cell surface receptors that will ultimately send biological signals to regulate gene transcription. This is a common feature of many such pathways. In this case, without Wnt signaling, the β-catenin protein (a cell-cell adhesion adaptor protein that also serves as a master transcriptional co-regulator) is ultimately kept off because it is targeted for phosphorylation and subsequent degradation by the


APC/Axin/GSK-3β complex. When the Wnt-protein ligand (a secreted glycoprotein) is present, however, it binds to Frizzled receptors (seven-pass transmembrane G protein-coupled receptors) on the cell surface. This leads to activation of the Dishevelled (Dvl) protein, which, through a complicated cascade, leads to the disruption of the APC/Axin/GSK-3β complex. In this case, β-catenin can accumulate in the cytoplasm and will localize to the nucleus, where it will interact with TCF/LEF DNA-binding proteins to activate transcription of downstream target genes (e.g., Clevers and Nusse 2012). The prokaryotic and eukaryotic world is full of examples where some environmental cue, often in the form of a small chemical, will bind to receptors on the surface of cells; the changes in status of the receptor ultimately lead to activation or repression of transcription factors that control gene expression. The ways in which signal transduction pathways control transcription factors can be complex. In most cases, however, the receipt of a signal by a cell leads to activation of a transcription factor (or multiple transcription factors). In turn, these transcription factors will affect the expression of target genes whose protein products will lead to changes in cellular phenotypes.
There are many mechanisms by which transcription factors can be activated including, but not limited to, (1) transcriptional regulation of the transcription factor itself by another transcription factor; (2) regulation by small RNA molecules that keep the mRNAs of that transcription factor from being translated until such time as it is needed; (3) posttranslational modification (e.g., a transcription factor precursor protein is cleaved to make the active form, or the transcription factor protein is modified by chemical groups, as in phosphorylation, acetylation, methylation, ubiquitination, or sumoylation); (4) binding of the transcription factor by some ligand that induces a conformational change and thus activation; and (5) dissociation of the transcription factor from an inhibitor protein that is keeping it inactive. The end result in each case is modulation of gene expression through binding of combinations of gene-specific cis-regulatory DNA-binding proteins that either repress or activate transcription of target genes.
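The Vibrio harveyi quorum-sensing cascade described earlier in this section is a chain of on/off switches, and tracing it step by step can make the double-negative logic easier to follow. The Python sketch below is a minimal illustration under the simplifying assumption that each component is either fully on or fully off; the function name `lux_cascade` and the dictionary keys are hypothetical names chosen for this example.

```python
def lux_cascade(autoinducer_bound: bool) -> dict:
    """Trace the V. harveyi cascade as described in the text (illustrative only)."""
    # Unbound receptors act as kinases; autoinducer-bound receptors
    # act as phosphatases.
    receptor_activity = "phosphatase" if autoinducer_bound else "kinase"

    # The kinase phosphorylates LuxO; the phosphatase removes the phosphate.
    luxo_phosphorylated = receptor_activity == "kinase"

    # Phosphorylated LuxO binds DNA and promotes synthesis of the small
    # inhibitory RNA.
    small_rna_made = luxo_phosphorylated

    # The small RNA binds LuxR mRNA and blocks its translation.
    luxr_translated = not small_rna_made

    # LuxR activates the downstream genes involved in light production.
    light_genes_active = luxr_translated

    return {
        "receptor_activity": receptor_activity,
        "luxo_phosphorylated": luxo_phosphorylated,
        "small_rna_made": small_rna_made,
        "luxr_translated": luxr_translated,
        "light_genes_active": light_genes_active,
    }


if __name__ == "__main__":
    for bound in (False, True):
        print(f"autoinducer bound: {bound}")
        for step, value in lux_cascade(bound).items():
            print(f"  {step}: {value}")
```

Reading the returned dictionary from top to bottom follows the signal from the cell surface to the lux operon: low cell density (no autoinducer bound) ends with the light-production genes off, and high density ends with them on.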


Chromatin Alteration and Gene Regulation

While gene-specific transcriptional activators and repressors play important roles in both prokaryotic and eukaryotic gene regulation, eukaryotic organisms have evolved additional mechanisms to keep genes inactive until their activation is required at particular times and places in the organisms’ life-span. In eukaryotic organisms, the default state of chromatin, in general, is repression, and most eukaryotic genes remain inactive across cell types and throughout developmental time until activated by a specific signal. In the repressed state, chromatin remains “unavailable” to transcription factors and RNA polymerases so that most genes cannot be expressed at high levels. This is, of course, an oversimplification of the complex, highly orchestrated, and diverse mechanisms by which cells and tissues have their genes regulated by local and global (i.e., more than one gene is repressed in a region) chromatin modifications. Importantly, however, these transitory alterations to chromatin are responses to intrinsic and/or external stimuli (i.e., local and environmental signals) that regulate access or the process by which the transcriptional machinery is able to interact with cis-regulatory regions controlling gene expression. A more thorough treatment of chromatin remodeling and DNA modification in transcriptional regulation can be found in this volume (Section 10, Subsection 2). However, many of the pathways leading to chromatin alteration and ultimately changes in gene expression have to do with changing the interaction of DNA with the histone proteins that make up the nucleosomes that package DNA molecules. For example, acetylation of histones leads to removal of positive charges and thus decreases the interaction of the N-terminal regions of those histones with the negatively charged phosphate groups of DNA.
The outcome of this process (modulated by histone acetyltransferases) is the alteration of a condensed chromatin (i.e., heterochromatin) region to a more relaxed “open” structure (euchromatin) that is associated with higher levels of gene expression for the genes in that region of


Gene Regulation, Fig. 2 (Box 1) Schematic representation of three cells. Red boxes indicate areas of gene regulation, with arrows showing a closer look at the regulation steps, each marked by a numbered box. Jagged arrows indicate environmental signals (e.g., heat sources, UV, etc.) that can influence gene expression. (Box 2) Cell-cell communication is depicted with colored areas signifying chemical signals that modulate gene expression being transmitted between cells. (Box 3) Chromatin remodeling is shown with blue compounds representing enzymes with either histone acetyltransferase or histone deacetylase activity, prompting chromatin to unwind. Red dot linked with brownish compounds represents histone methyltransferases, which add methyl groups to condense/rewind chromatin. (Box 4) Transcriptional regulation is shown. All compounds are various transcription factors that bind to upstream or downstream promoters and enhancers to direct and/or influence transcription by RNA polymerase (big red compound). (Box 5) Posttranscriptional processing is shown including 5’-capping, 3’-polyadenylation, and mRNA splicing. Red compound is representing the spliceosome for intron removal. (Box 6)

Posttranscriptional regulation is shown. Part 1 of the figure depicts how one gene can be used to produce multiple mRNAs (that will ultimately be distinct proteins), as regulated by a system of proteins that bind to the pre-mRNA, including splicing activators that promote the splicing at particular sites. Part 2 of the figure depicts addition of shRNA and dsRNA with the Dicer enzyme (green compound) to form an RNA-induced silencing complex (RISC) that, in this scenario, will degrade all but one of the alternatively spliced mRNA, which then leaves the nucleus in preparation for translation. (Box 7) Translational regulation is depicted. Red, orange, purple, and green molecules represent proteins influencing the location of ribosome recruitment, and orange molecule (with rays of yellow) signifies the initiation codon. (Box 8) Posttranslational modification is shown with two scenarios. First scenario depicts phosphorylation (yellow compound) of a protein initiating a conformational change making the protein suitable for transport out of the cell to the rough ER for processing/distribution. The second scenario depicts addition of enzyme (green compound) that alters conformation leading to protein degradation

chromatin (Fig. 2, Box 3). This process can be reversed by the action of histone deacetylase activities, and this entire process itself is highly regulated. While many of the modifications to chromatin are reversible in cells and tissues and thus not likely to be propagated from one

generation to the next, some histone modifications and/or altered nucleosomal structures can be stable through multiple cell divisions, resulting in a form of cellular memory. In this case, “epigenetic states” may be established and moderated by a complex and structured system


whereby regions of chromatin are “primed” to respond to environmental signals via chromatin modification to mediate gene expression patterns over developmental time and space. A deeper examination of the ways in which these epigenetic mechanisms play roles in mediating gene regulation can be found in volume 3, Section 9, and in Allis et al. (2006). In summary, the state of chromatin will determine whether or not particular cis-regulatory regions of DNA (e.g., promoters, enhancers, silencers, insulators, locus control regions) are available to the transcription factors (i.e., repressors and activators) that modulate gene expression. In many instances, remodeled chromatin is the result of alterations to histones (and thus nucleosomes) that lead to patterns of chromatin accessibility throughout the genome. It is also clear that small interfering RNA (siRNA) associated with the RNA-induced transcriptional silencing (RITS) complex can determine the state of chromatin by recruiting histone methyltransferases or DNA methyltransferases, which promotes the formation of tightly packed heterochromatin. Recent work by the ENCODE project mapping DNase I hypersensitive sites in the human genome has shown that we are only beginning to understand the relationship between chromatin accessibility and regulation of gene expression. Given that DNase I hypersensitivity marks all known classes of cis-regulatory regions, it is now possible to map regulatory regions genome-wide in any cell or tissue type at any developmental time. This work reveals that patterns of chromatin accessibility are organized at a higher level than previously known and that a multitude of cis-regulatory elements (dozens to hundreds) may be co-activated by systematic long-distance regulatory patterns in ways that predict cell-type-specific functions (Thurman et al. 2012).

DNA Methylation and Gene Regulation Genes can also be regulated on global or local levels through alternating methylation patterns in regions either surrounding or within gene-coding
sequences. In almost all cases, DNA methylation is associated with silencing of transcription (either permanently or conditionally). Indeed, DNA methylation likely evolved as a mechanism of host defense to silence foreign DNA. In eukaryotes, DNA methylation is considered to be an epigenetic mechanism of gene regulation when heritable and can be important in a variety of cellular processes (e.g., embryonic development, genomic imprinting, and X chromosome inactivation in mammals). DNA methylation typically occurs at cytosine bases that are adjacent to guanine bases, where DNA methyltransferases catalyze the methylation reactions that lead to formation of 5-methylcytosines. For example, in humans it is known that most CpG dinucleotides are methylated and that unmethylated CpG islands occur at promoter regions of genes that are highly expressed. The exact mechanisms by which DNA methylation leads to changes in chromatin structure and accessibility are not completely understood; however, a host of factors (e.g., methyl-binding domain proteins; MBDs) are known to play roles in “reading” the methylation state of DNA (for further reading see volume 3, Section 8, Subsection 2).
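The CpG dinucleotides and CpG islands mentioned above can be made concrete with a short Python sketch. It counts CpG dinucleotides in a sequence and computes an observed/expected CpG ratio, a commonly used heuristic for flagging CpG-rich regions. The function name `cpg_stats`, the returned keys, and the choice of this particular ratio are assumptions made for illustration; they are not taken from this article, and no island threshold is implied.

```python
def cpg_stats(seq: str) -> dict:
    """Return simple CpG statistics for a DNA sequence (illustrative only)."""
    s = seq.upper()
    n = len(s)
    c_count = s.count("C")
    g_count = s.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact CpG count.
    cpg_count = s.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Observed/expected ratio: (CpG count * length) / (C count * G count).
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return {
        "cpg_count": cpg_count,
        "gc_fraction": gc_fraction,
        "obs_exp": obs_exp,
    }


if __name__ == "__main__":
    # Hypothetical sequences, not real promoter data.
    print(cpg_stats("CGCGATCG"))
    print(cpg_stats("ATATATAT"))
```

A sliding window over a longer sequence could apply the same function to locate candidate CpG-rich regions; whether such a region is methylated cannot be inferred from sequence alone.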

Transcriptional Control and Gene Regulation The main method of controlling gene expression across biological organisms is likely at the level of transcription. While this section will not give detail to the mechanism of transcription itself, a complete analysis of this process can be found in volume 2, Section 6. The central tenet of transcriptional control has to do with whether or not RNA polymerase (a single enzyme in prokaryotes or RNA polymerase I, II, or III in eukaryotes) is recruited to the upstream promoter region of the gene it transcribes. A suite of factors is involved in determining the availability of promoter regions for RNA polymerase binding and subsequent transcription as well as the rate and level of that transcription. In prokaryotic systems, the promoter region usually has conserved DNA sequence elements at the −10 and −35 regions
upstream of the transcription start site. When the promoter region is bound by the sigma factor, it enables RNA polymerase to bind the promoter and transcribe the genes downstream. The example given above regarding the lac operon in bacteria is an excellent case of transcriptional control of gene expression in a prokaryotic organism through the binding of various transcription factors (e.g., repressor protein, CAP-binding protein) to regulatory DNA sequences (e.g., operator, CAP-binding site) in order to arbitrate the expression of downstream target genes (e.g., lacZ, lacY) via the binding of sigma factor and RNA polymerase to the lac promoter. This example illustrates an important component of transcriptional control: that the binding of additional cis-acting sequences (often called operators in prokaryotes) by trans-acting protein factors is usually a necessary feature for recruiting RNA polymerase to the promoter sequence to facilitate transcription of target genes. In prokaryotes, the genes under transcriptional control are often arranged in clusters (operons) by related functions so that their transcriptional control occurs at only one point (i.e., by a common set of DNA control sequences and their transcription factors). Ultimately, the result of transcriptional control is to vary the numbers and types of proteins made in the cell in response to environmental signals. Transcriptional control can be both positive (e.g., an activator protein binds to the operator to stabilize RNA polymerase binding to the promoter) and negative (e.g., a repressor protein binds to the operator to inhibit RNA polymerase binding to the promoter), though much more complicated combinations of regulatory mechanisms are known. In eukaryotic systems, transcriptional control mechanisms are more complex. Like prokaryotic genes, all eukaryotic genes have promoter regions. 
Genes transcribed by RNA polymerase II (those that will become the mRNAs) often have a typical structure that includes a basal promoter region located about 30 bases upstream of the transcriptional start site. Included in this core region is often a consensus sequence called the TATA box (because of the thymine and adenine content of the region). Most genes also have additional upstream promoter elements that help drive
transcription of the gene and control the level of transcription. The assembly of a basal transcriptional complex (involving DNA-protein and protein-protein interactions) at the gene promoter is an essential step required for recruitment of RNA polymerase. Like the sigma factor in prokaryotes, eukaryotic systems require a transcription factor (i.e., TBP) to initiate binding to the DNA of the promoter region. Finally, in addition to the basal (core) promoter and other upstream promoter elements, most eukaryotic genes also have sequences that are cis-regulatory, but that act at a further distance from the transcriptional start site (either upstream or downstream; Fig. 2, Box 4). These sequences are called enhancers or silencers and, as their names indicate, are involved in either increasing or decreasing the rate of transcription of genes they control through the binding of a host of gene-specific transcription factors (see ▶ “Cis-Regulation of Eukaryotic Transcription” in this volume for further details). Recent efforts by the ENCODE project have been directed at mapping the distribution and content of cis-regulatory sequences in the human genome. This project has also examined the combinatorial binding profiles of transcription factors that specify the on- and off-states of the genes regulated by these network factors (e.g., Gerstein et al. 2012; Thurman et al. 2012). ENCODE has exposed sets of common conserved cis-regulatory sequence elements and combinations of co-associated and context-specific transcription factor-binding patterns that are responsive to a variety of stimuli. These types of studies promise to shed further light on the evolution and function of transcriptional-level gene control.

Posttranscriptional Processing In eukaryotic systems there are several key regulatory events that occur at the level of posttranscriptional processing (i.e., modifications to the mRNA transcript) and modulate the eventual creation of that gene’s protein product (Fig. 2, Box 5). First, eukaryotic mRNAs are modified posttranscriptionally at the 5’ end with the addition of a 7-methylguanine cap. Not only does the
cap enhance the eventual translation of the mRNA, but its removal can also facilitate the degradation of the mRNA. Concomitantly, many eukaryotic mRNAs are cleaved at a polyadenylation site at the 3’ end of the mRNA molecule followed by addition of adenosine residues (poly-A tail) of varying lengths. Polyadenylation increases the stability of mRNA molecules and, along with the 5’ cap, protects the mRNA from decay. In some cases, the poly-A tail may also regulate translational efficiency of the mRNA. Indeed, a point of regulation for eukaryotic mRNAs is related to mRNA half-life and rates of decay. Changes in RNA stability in response to specific regulatory signals in the cell have been described for many genes and are moderated by cis-acting sequences within the RNA itself (often these sequences are in the 3’ untranslated region of the mRNA). The rate of decay for any mRNA is directly related to the level of protein that will be made in the cell. This additional mechanism for transcriptional control is most often observed where rapid changes in synthesis of a particular protein are necessary or metabolically expensive. Perhaps the most important posttranscriptional processing event regarding gene regulation is that of mRNA splicing. During splicing, intervening RNA sequences (introns) are removed, while the coding sequences (exons) are joined together. This process is highly regulated by a suite of RNAs and proteins collectively referred to as the spliceosome. Regulatory and synergistic control of posttranscriptional processing (including splicing) is covered in detail in this volume (see ▶ “Co-transcriptional mRNA Processing in Eukaryotes”).
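The link between mRNA half-life and the amount of transcript available for translation can be illustrated with a minimal Python sketch. It assumes, for illustration only, constant synthesis and first-order decay, so that the decay constant is ln(2) divided by the half-life and the steady-state level is the synthesis rate divided by the decay constant. The function names, units, and numerical values are hypothetical and do not come from this article.

```python
import math


def decay_constant(half_life_min: float) -> float:
    """First-order decay constant (per minute) for a given half-life."""
    return math.log(2) / half_life_min


def mrna_remaining(m0: float, half_life_min: float, t_min: float) -> float:
    """Copies of an mRNA left after t_min minutes with no new synthesis."""
    return m0 * 0.5 ** (t_min / half_life_min)


def steady_state_mrna(synthesis_rate: float, half_life_min: float) -> float:
    """Steady-state level when synthesis (copies/min) balances decay."""
    return synthesis_rate / decay_constant(half_life_min)


if __name__ == "__main__":
    # Same synthesis rate, different stabilities (hypothetical numbers).
    for half_life in (10.0, 30.0):
        level = steady_state_mrna(2.0, half_life)
        print(f"half-life {half_life} min -> steady state {level:.1f} copies")
    print(mrna_remaining(100.0, 10.0, 20.0))
```

Under these assumptions, tripling the half-life triples the steady-state transcript level without any change in transcription, which is one way to see why stability elements in the 3’ untranslated region can act as a regulatory control point.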

Posttranscriptional Regulation One reason that mRNA splicing can be such a key control point of gene regulation is the phenomenon of alternative splicing (Fig. 2, Box 6, left). Through alternative splicing, one mRNA molecule can code for multiple protein products (by either including or excluding particular exons from the final product) depending on the cellular context and biological functions. Alternative
splicing allows for the production of many more proteins from the genome than there are genes encoded within that genome. For example, it is estimated that nearly 75% of genes in the human genome may be alternatively spliced, resulting in a diverse array of isoforms of protein products from a relatively small number of genes. Specific trans-acting proteins that recognize and bind to cis-acting regulatory sites (called splicing enhancers or silencers) near splice sites and within the primary mRNA transcript have been identified. These trans-acting proteins have been shown to regulate the production of alternatively spliced mRNAs. Like transcriptional control of gene expression, the regulation of gene expression by alternative splicing also results in tissue- or cell-type-specific gene products. It is also clear that transcription and alternative splicing are linked and depend on a variety of factors including cis-acting sequences within the actual RNA molecule. Another form of posttranscriptional gene regulation that leads to alteration of the mRNA sequence is RNA editing. RNA editing can lead to insertions, deletions, and base changes and is mediated by either site-specific deamination (of adenines or cytosines) or guide RNA-directed (uridine) insertion or deletion. Though RNA editing seems to be a rare point of posttranscriptional control, it has likely evolved multiple times as a supplement to transcriptional control. Once mRNAs are transcribed and processed, it is necessary that the mRNA be transported to the cytoplasm. Control of RNA transport is another important process for regulating gene expression. The transport of mRNAs to the cytoplasm is regulated by a host of RNA-associated proteins. There are also proteins that will inhibit transport of other RNAs. 
Furthermore, there are specific processes that are known to regulate the location of particular mRNAs within regions of the cytoplasm (e.g., specific localization of bicoid and nanos mRNAs at opposite poles of the Drosophila egg). Controlled transport and localization of mRNAs within cell types is one mechanism that leads to differential gene expression patterns.
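The way alternative splicing multiplies the number of products from one gene, as described earlier in this section, can be sketched in Python by enumerating the transcripts that result from including or skipping optional exons. The sketch assumes a deliberately simplified "cassette exon" model in which each optional exon is included or excluded independently; other modes of alternative splicing (for example, alternative splice sites or mutually exclusive exons) are not modeled. The function name and exon labels are hypothetical.

```python
from itertools import product


def enumerate_isoforms(exons, optional):
    """List every exon combination obtainable by skipping optional exons.

    exons    : ordered list of exon names in the primary transcript
    optional : set of exon names that may be included or skipped
    """
    optional_in_order = [e for e in exons if e in optional]
    isoforms = []
    for choice in product([True, False], repeat=len(optional_in_order)):
        included = dict(zip(optional_in_order, choice))
        # Constitutive exons are always kept; optional ones follow the choice.
        isoforms.append(tuple(e for e in exons if included.get(e, True)))
    return isoforms


if __name__ == "__main__":
    for isoform in enumerate_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
        print("-".join(isoform))
```

With k independent optional exons this model yields 2^k possible isoforms, which gives a rough intuition for how a relatively small number of genes can produce a much larger set of protein products.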

RNA Silencing: Posttranscriptional Inhibition of Gene Expression via Small RNAs As discussed several times above, small RNA molecules play a variety of important roles in gene regulation at several levels, including the transcriptional level via chromatin remodeling (e.g., see RITS complex above). RNA silencing, however, is a mechanism by which gene expression is downregulated or repressed at the posttranscriptional level through small RNA molecules that use sequence-specific binding in order to target specific mRNAs for degradation through an RNA-induced silencing complex (Fig. 2, Box 6, right). The RNA-induced silencing complex (RISC) uses small RNA molecules (siRNA or miRNA) that bind by complementary base pairing to an mRNA target. This event activates proteins in the RISC (i.e., Argonaute, Dicer), which will eventually cleave the mRNA, leading to degradation. This mechanism of gene silencing is widespread across many biological organisms, and a variety of gene silencing techniques (e.g., shRNA, dsRNA) based on RNAi have been developed for suppressing gene expression in model organisms for the study of gene regulation and function.
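The sequence-specific targeting by complementary base pairing described above can be illustrated with a small Python sketch that searches an mRNA for sites perfectly complementary to a small guide RNA. This is a simplification made for illustration: it assumes perfect Watson-Crick complementarity over the whole guide and ignores mismatches, G:U wobble pairs, and any protein contribution. The function names and the sequences are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def find_target_sites(guide: str, mrna: str) -> list:
    """Return 0-based start positions in mrna that pair perfectly with guide."""
    site = reverse_complement(guide)
    target = mrna.upper()
    return [
        i
        for i in range(len(target) - len(site) + 1)
        if target[i:i + len(site)] == site
    ]


if __name__ == "__main__":
    # Toy sequences, far shorter than real small RNAs.
    print(reverse_complement("UAGC"))
    print(find_target_sites("UAGC", "AAGCUAAA"))
```

Because the guide pairs in antiparallel orientation, the site searched for in the mRNA is the reverse complement of the guide rather than the guide sequence itself.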

Translation and Regulation of Gene Expression Gene regulation also occurs at the level of translation and often is involved in mediating the level of protein synthesized from a particular mRNA in a given cellular context. As is seen at other levels of gene regulatory control, the regulation can occur due to protein interactions and/or modifications or through cis-regulatory changes involving proteins that recognize and interact with sequences in the actual mRNA molecule to affect its translation (Fig. 2, Box 7). Oftentimes, a sequence in the 5’ untranslated region (UTR) of the mRNA is the site of recognition for particular proteins involved in regulating translation; however, other sites are possible (e.g., 3’UTR, polyadenylation site). Other levels of control that can affect translation include changes to the secondary structure of mRNA, use of alternative initiation codons, and changes in ribosome binding through control of ribosome recruitment to the initiation codon (Fig. 2, Box 7). In most instances, the control of translation is a supplement to transcriptional regulation and is most often seen in cases where tightly correlated regulation (e.g., rapid response) is required for a particular cellular context. It is also known that small RNAs can directly repress mRNA translation. This repression can occur during initiation by the ribosome or during elongation.

Posttranslational Regulation The level of active protein translated from a particular mRNA is also subject to regulatory control through both reversible (i.e., posttranslational modifications) and irreversible (e.g., proteolysis) means. Posttranslational modifications can greatly alter the functions and behavior of the protein by the addition of functional groups (e.g., phosphorylation; see Fig. 2, Box 8), by changing the chemical nature of an amino acid, or by altering the actual structure of the protein (e.g., formation of disulfide bridge). Protein activity can be regulated by enzymes that modify protein structure in various ways, including regulation of the degradation of the protein (Fig. 2, Box 8).

Future Outlook There is still much to discover about gene regulatory mechanisms in both prokaryotic and eukaryotic systems. For example, short and long RNA molecules have emerged in recent years as major mechanisms for modulating expression of genes at nearly every level (from chromatin to translation). The emergent theme in our understanding of gene regulation is that highly complex regulatory networks involving multiple levels, hierarchies, and sources of control are ultimately involved in controlling gene expression. These networks are coordinated in ways that fine tune expression of an array of target genes (for further information see articles on gene regulatory networks in this
volume). More recently, studies on the organization of chromatin have also elucidated ways in which the overall three-dimensional structure of chromatin can play roles in determining gene regulatory processes. For readers who would like a more detailed and robust review of eukaryotic gene regulation, an excellent treatment of this subject can be found in Latchman (2010). Future work, aided by computational biology and experimentation, will likely lead to a much greater understanding of the ways in which the expression of genes is modulated in response to environmental and cellular influences. These finely tuned cellular processes have led to the vast array of organismal complexity and diversity on the planet.

Cross-References
▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of
▶ Cis-Regulation of Eukaryotic Transcription
▶ Co-Transcriptional mRNA Processing in Eukaryotes
▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks
▶ Gene Regulatory Networks, Evolution of
▶ Prokaryotic Gene Regulation by Sigma Factors and RNA Polymerase
▶ Prokaryotic Gene Regulation by Small RNAs
▶ Transduction of Environmental Signals by Prokaryotic Two-Component Regulatory Systems

References
Allis CD, Jenuwein T, Reinberg D, Caparros M-L (eds) (2006) Epigenetics. Cold Spring Harbor Press, Cold Spring Harbor
Ashburner M (1971) Induction of puffs in polytene chromosomes of in vitro cultured salivary glands of Drosophila melanogaster by ecdysone and ecdysone analogues. Nature 230:222–224
Barolo S, Posakony JW (2002) Three habits of highly effective signaling pathways: principles of transcriptional control by developmental cell signaling. Genes Dev 16:1167–1181
Bassler BL (1999) How bacteria talk to each other: regulation of gene expression by quorum sensing. Curr Opin Microbiol 2:582–587
Clevers H, Nusse R (2012) Wnt/Beta-Catenin signaling and disease. Cell 149:1192–1205
Derman E, Krauter K, Walling L (1981) Transcriptional control in the production of liver-specific mRNAs. Cell 23:731–739
Gerstein M, Kundaje A, Snyder M, et al (2012) Architecture of the human regulatory network derived from ENCODE data. Nature 489:91–100
Jacob F, Perrin D, Sanchez C, Monod J (1960) The operon: a group of genes whose expression is coordinated by an operator. C R Séances Acad Sci 250:1727–1729
Latchman DS (2010) Gene control. Garland Science, New York
Thurman RE et al (2012) The accessible chromatin landscape of the human genome. Nature 489:75–82

Gene Regulatory Networks, Evolution of
Ajna Rivera and Andrea Sajuthi
Biological Sciences, University of the Pacific, College of the Pacific, Stockton, CA, USA

Synopsis In the past few decades, the fields of genetics, evolution, and development have converged to begin to understand the relationship between genotype and phenotype or, more simply, how an organism’s genes are used to build its body. A central finding in this research is that genes do not typically act alone, but rather act as parts of gene regulatory networks (GRNs). Early in development, regulatory genes turn on, interact with each other, and act to regulate the expression of other genes. These lower-level genes integrate the signals from the first set and turn on or off structural genes, genes that can change cellular structures and functions. The entire set of these interactions is a gene regulatory network. Changes to GRNs drive evolution. The nature of GRN regulation is such that small changes can have subtle or massive effects, depending on the type of change and level of regulatory hierarchy. By studying the ways genes within GRNs interact, we can understand how organisms develop their defining characteristics and how these characters have changed over evolutionary time.

Introduction Evolution and Molecular Biology Historically, evolution has been studied at the phenotypic level – examining changes in a group of organisms’ appearance, behavior, and metabolism over time. In the last few decades, with the advent of molecular tools, the focus of evolutionary biology has broadened to understand the changes in DNA that accompany morphological changes. In examining this topic, researchers have also broadened the central dogma of molecular biology (DNA makes RNA makes protein) to a more subtle version where DNA and RNA both play a more active role in cellular functions. The primary focus is on interactions between DNA and proteins because, though the role of RNA molecules in determining cellular function is an emerging field in biology, DNA and protein interactions are most well studied. These interactions take the form of binding – certain sequences of base pairs bind to certain protein domains under specific conditions. This DNA/protein binding (along with RNA interactions) is the basis of gene regulation. Since DNA mutates and changes over evolutionary time, so do protein sequences and, therefore, functions. These evolutionary changes can also change the interactions between DNA and proteins by changing binding sites on either type of molecule. While we cannot go back in time and examine these interactions in ancestral species, nor can we study molecular biology in fossils, we do have two primary tactics for elucidating the evolution of these interactions. First, we can look at small-scale changes in a particular organism. For example, we can mutate DNA that ordinarily binds to a protein and see how that affects both the interaction and the phenotype of the organism. Second, we can compare living organisms. For example, we can look at a particular protein and see if it binds to the same sequences of DNA near the same genes in different organisms. These two approaches have been fruitful in elucidating small pieces of networks. 
Higher-throughput approaches are now being taken to more quickly advance our understanding of the evolution of these interactions (see Tools section below).

What Is a Gene Regulatory Network? As described above, proteins and DNA can interact with one another by binding under specific conditions. When this happens, it can affect the ability of RNA polymerase to access a nearby gene and transcribe it into RNA. In this way, proteins can affect whether a gene gets turned on, or expressed. Since gene expression can be controlled, this means that the set of proteins in a cell is also under control. The specific set of proteins that affect gene expression, called transcription factors, therefore differs in different types of cells causing these cells to regulate their gene expression differently. A gene regulatory network is the set of interactions between a particular set of genes (Fig. 1). We generally think of gene regulatory networks (GRNs) in terms of a specific function or end point, for example, a GRN for eye development or cell growth. There are several ways to study gene regulatory networks, and these affect our understanding of them. For example, in the past many researchers have focused efforts on studying the various phenotypic effects that different mutations had on a particular phenotype. By comparing these effects, we can get an understanding of how different genes and gene products influence each other. Added to this, we can now study the molecular interactions between genes. That is, whether the proteins interact, whether one protein interacts with the DNA of the other gene, or whether they both have downstream targets that affect the same phenotype. For this reason, many researchers in this field have moved to molecular and computational approaches (see Tools section) that more directly test physical and functional interactions between genes and gene products. In this way, the focus of many studies has moved from correlative studies of phenotypic effects to more direct studies of gene regulatory effects. 
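The idea that different sets of transcription factors lead different cell types to express different genes can be sketched as a tiny Boolean model in Python. The rule used here is an assumption chosen for simplicity, not a rule stated in this article: a gene is "on" when at least one of its activators is present and none of its repressors is present. All gene and factor names (`geneA`, `TF1`, and so on) are hypothetical.

```python
# gene -> (set of activating factors, set of repressing factors)
REGULATION = {
    "geneA": ({"TF1"}, set()),
    "geneB": ({"TF1"}, {"TF2"}),
    "geneC": ({"TF3"}, set()),
}


def expressed_genes(tfs_present, regulation=REGULATION):
    """Return the genes switched on by the factors present in a cell."""
    on = set()
    for gene, (activators, repressors) in regulation.items():
        has_activator = bool(activators & tfs_present)
        has_repressor = bool(repressors & tfs_present)
        if has_activator and not has_repressor:
            on.add(gene)
    return on


if __name__ == "__main__":
    # Two hypothetical cell types that differ only in which factors they contain.
    print(sorted(expressed_genes({"TF1"})))
    print(sorted(expressed_genes({"TF1", "TF2"})))
```

Real regulatory logic is graded and combinatorial rather than strictly Boolean, so this should be read only as a way of picturing how the same DNA can yield different expression states.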
DNA and Proteins In most well-studied gene regulatory networks, the focus has been on DNA and protein interactions, though this has been changing to include RNA interactions as well. The most straightforward, but not the most accurate, way to study gene regulatory networks is to lower the levels of one protein and look at the effect on gene expression of other genes. All genes have regulatory regions in their DNA that determine whether or not they will be expressed (turned on). These pieces of DNA are called “cis-regulatory regions” or “cis-regulatory elements.” The proteins that bind these pieces of DNA are called “trans-regulatory elements” or “trans-regulatory factors.” They are also known as “transcription factors” because they regulate gene transcription (the process of making RNA from a DNA template). To summarize, cis-elements are part of the DNA and are always present. Trans-factors are proteins that bind to cis-elements. Specific trans-factors are not always present in a cell, and they themselves are subject to gene regulation. Once a gene has been transcribed, it is now present in the cell as RNA. In eukaryotes, many RNAs are translocated to the cytoplasm where they are translated into protein. Regulation of translation is accomplished by changing the stability of the mRNA molecule via proteins and specialized RNAs as well as by regulating the rate of translation of a particular mRNA. Genes can also be regulated once they are translated into proteins by protein/protein interactions. Thus, a particular gene can regulate another gene at many different stages of expression.

Gene Regulatory Networks, Evolution of, Fig. 1 Gene regulatory networks exist in a hierarchical structure, in which kernels (red) are at the top of the hierarchy, plug-ins (yellow) are in the middle, and batteries (green) are at the bottom (a). Kernel genes are typically transcription factors that can activate (arrows) or repress (lines with bars) plug-in and battery genes. Each network (shaded in gray) is able to affect and influence other networks through in/out switches (b)

The Importance of Understanding Gene Regulatory Networks For many years, the study of genetics focused on the action of single genes. While this approach led to many insights, it is limited in that a single gene does not act in a vacuum, but rather in a complex cellular environment that itself is responsive to the external environment. Because of this, deep study into gene action has shown time and time again that genes do not generally act as simple on/off switches for a process but rather in complex fashions that are often seemingly unpredictable. More recent research has shown that some genes are “structural,” while others are “regulatory.” The difference between these two types of genes is analogous to the difference between building materials and construction workers. The structural building materials are used to actually create the final product, but without the construction workers acting together to follow a blueprint, they would be useless. The construction workers are regulatory; they need to interact with one another as well as with the building materials to construct the building. Likewise, regulatory genes not only interact with structural genes to build and maintain cells and organisms but also interact with each other to follow the organismal blueprint. These interactions, as mentioned above, involve DNA, RNA, and proteins.

Understanding the interactions between the DNA, RNA, and protein in a cell involves a multidimensional network structure with high connectivity between many different components. Imagine a spider web where any segment of silk has the potential to be affected by pulling on any other segment of silk. The farther apart the segments are the less the effect and the closer and more intimately connected the segments are the greater the effect. Likewise, in thinking about a molecular cell network, different components influence each other in different ways. Changes in a highly connected node can have implications for nearly all functions in a cell, while changes in an outer, less connected, node might only have tiny reverberations (Fig. 1). While we do not currently have the tools to solve an entire network for an organism, much progress is being made through computational and functional studies, and key pieces of networks can now be understood. Examples of these and the types of research that are currently being used to understand cell function and organismal morphology at the molecular genetic level are given below. Gene Regulatory Networks and Evolution The nature of a network is such that mutational changes in one element can potentially have drastic implications for the functions of the other elements. Because of this, GRNs are particularly useful for understanding how phenotype links to genotype over evolutionary time and why some mutations have large effects on phenotype, while others have only a small effect. While most of the interactions in gene regulatory networks are not yet understood for any organism, studies comparing organisms, knocking down gene levels, and examining gene levels under different conditions have been very useful in determining basic principles of how GRNs act and how they have changed in different evolutionary lineages. The underlying principle of GRNs is that they are almost infinitely flexible. 
Rather than simply being on/off switches, GRNs can change their output in a graded fashion. Changing a GRN over evolutionary time can range in consequence from small changes in gene expression levels or small changes in gene expression domain to much
larger changes involving multiple aspects of a character (see, e.g., Wray 2007). These changes are thought to be able to evolve rapidly, compared to changes in structural proteins, because of the modularity and flexibility of both transcription factors and cis-regulatory regions (Wray 2007). Demonstrating this rapid evolution, Blekhman et al. (2008) compared three primate species and found a large number of genes where gene regulation was undergoing selection. Interestingly, many of these genes were related to metabolism and presumably evolved in response to changes in diet. Changes in gene regulation can lead to great changes in organismal morphology. In some cases, it can even lead to phenotypic novelty, or new anatomical features in an organism. One example of this is the changes in the outputs and connections of the neural plate gene regulatory network to produce the neural crest cells in vertebrates. These cells go on to form much of the craniofacial structures (including teeth) of the animal – giving vertebrates all of the head structures they are well known for. Another example is in the eyes. The gene regulatory network controlling eye development is very similar in many different kinds of animals, like flies and mice. But the downstream connections are different enough to produce very different types of eyes. Both eye types evolved largely due to changes in the eye gene regulatory network. In a typical regulatory region for a gene, there are multiple binding sites for transcription factors. These regulate gene expression in various ways and can be tissue and time specific. Even though a gene might be expressed in multiple tissues at many times, changing a single cis-regulatory binding site might only affect one tissue at a specific time (Box 1). On one hand, this type of change could have a small effect on many tissues. 
Because multiple protein-binding sites regulate gene expression, they can each evolve separately without affecting most of the gene’s functions (Box 1). Likewise, transcription factors are also often modular with separate sites for DNA binding and protein/protein interactions. Alternative splicing and short-linear motifs take advantage of this modularity and can confer new
functions on existing transcription factors (Wagner and Lynch 2008). Finally, modularity exists at the whole-GRN level as well. Cassettes of regulation can be swapped out for one another to change parts of the function of the GRN without changing the entire network. In the following section, we will examine the different components of GRNs and begin to see how these components can each be selected on to change GRNs over time in different ways. Gene Regulatory Network Components The components of GRNs are typically characterized by referring to a hierarchical system of organization. For example, in Babu’s “Structure and evolution of transcriptional regulatory networks,” GRN components are placed into organizational levels from the smallest interactions between transcription factors (motifs) to the more complex modules (semi-independent modules) to the entire network. Genes connecting motifs and modules are termed “nodes” (Babu et al. 2004). Babu’s system is similar to that described by Davidson (Davidson and Erwin 2006) wherein GRN components are organized by their level of effect, with components higher in the hierarchy having a more general effect than those lower in the hierarchy (Figs. 1 and 2). The organization of these components is important to analyze in order to understand how changes in this structure can affect development and evolution. First, each GRN component is defined, and then the ways in which links between components within a system allow for evolutionary change in development are examined. The Davidson system has particular relevance to animal development and evolution, and the Babu system was developed using data from yeast and bacteria. At the top of the Davidson hierarchy are the kernels. These are the most upstream subunits in the GRN and consist of high-level regulatory genes, controlling expression of one or more other genes that typically regulate developmental patterns, such as body axes and regions. 
As such, regulatory genes in a kernel define the locations in which a body part will form by controlling those components of the GRN that are lower in the hierarchy. For example, the retinal determination network (also termed PSED for its core members Pax/Six/Eyes Absent) specifies the location of the eyes for several animals (Silver and Rebay 2005). Likewise, the mesoderm specification kernel in sea stars and sea urchins differentiates mesodermal tissue from endoderm and primary mesenchymal tissues (Hinman et al. 2007; Fig. 2). The kernels set the foundation for body plan development. The details of this blueprint will be specified by the plug-ins.

Gene Regulatory Networks, Evolution of, Fig. 2 Kernels, plug-ins, batteries, and GRN evolution. The gene regulatory network specifying endomesodermal tissues in sea urchins (Echinodermata) has been largely solved (Peter and Davidson 2010). Here, an endomesodermal kernel regulates multiple plug-ins, which in turn regulate battery genes responsible for cell structure changes. Kernels, plug-ins, and batteries have been defined by gene function and by comparison to the homologous GRN in sea stars, a closely related echinoderm (Hinman et al. 2007). Hinman and Davidson found that in both echinoderms, endomesodermal tissue differentiates into endoderm and blastocoelar cells, a type of mesodermal cell (teal boxes). The battery genes regulated by the blastocoelar plug-in include genes necessary for the movement of the blastocoelar cells into the archenteron of the developing embryo, for example, cell matrix proteins. Hinman and Davidson also found that urchins contain a plug-in not found in sea stars (purple box). This plug-in regulates the development of a cell lineage (micromeres) absent in the sea star but present in sea urchins. The micromeres make the sea urchin embryonic spicule skeleton, and the battery genes for this plug-in include spicule matrix genes

Plug-ins are small sub-circuits of interacting genes and are regulated by the genes within a kernel. Unlike kernels, they are not dedicated to the formation of a specific body axis or region; instead, they can be found in a number of networks where their downstream effects vary between those networks. These plug-ins are conserved and exist in many different GRNs, and homologous processes in related animals may use a plug-in differently (Davidson and Erwin 2006). A signal transduction system is a good example of a plug-in. Signal transduction is the cascade that ensues when a cell surface receptor is activated by an extracellular signaling molecule. This interaction allows for various protein interactions within the cell that eventually lead to the expression or repression of various genes. Epidermal growth factor receptor (EGFR) is a gene expressed in many different tissue types within a single organism and thus exists in different networks within that organism. In Drosophila eye development, EGFR interacts with Spitz and Keren ligands to function in eye cell determination (Brown et al. 2007), while EGFR interactions with Vein ligand on muscle cells are responsible for intestinal stem cell maintenance and proliferation (Xu et al. 2011). Plug-ins, like the plug-in containing EGFR, thus act as gates, interpret signals, and stabilize and establish regulatory states for gene expression.

At the bottom of the GRN hierarchy are groups of protein-coding genes, called batteries. The genes in a battery are expressed in unison because they are under common regulatory control by kernels and plug-ins. Their products perform cell-type-specific functions, which essentially define the cell state. As batteries exist at the distal (or output) end of the GRN hierarchy, they have no regulatory or control function; instead they generate the cellular changes that build the specific body parts/axes defined by kernels and plug-ins.
The components of a GRN (kernels, plug-ins, and batteries) are often highly interconnected and are modulated by incoming extracellular and intracellular signals from outside the GRN. The genes that can be affected by these non-GRN signals are input/output (I/O) switches (similar to the genes that control the nodes in Babu et al. 2004). Because of the hierarchical nature of GRNs, these signals have the greatest effect when they change the expression or activation state of kernel genes and the smallest effect when they change the expression or activation state of the battery genes. The specific genes in a GRN module are difficult to clearly define, as a GRN can be highly connected to other GRNs and to environmental signals through I/O switches.

Principles of GRN Evolution

Changes to the I/O switches as well as changes to connections within a GRN can lead to drastic changes in body plan development. This has been seen at the single gene level by examining changes in cis-regulatory regions over evolutionary time that correspond with morphological changes like insect color patterns and human disease susceptibility (reviewed in Wray 2007). This has also been examined at the whole-genome level by comparing genomic sequence to phenotype in related yeast strains (reviewed in Li and Johnson 2010) and at the GRN level by comparing functional data from multiple species (reviewed in Davidson and Erwin 2006; Rebeiz et al. 2011).

GRNs Rely on Protein/DNA Interactions

GRNs include multiple types of regulation, including protein/protein interactions and microRNAs. For the most part, however, GRNs have been studied by examining protein/DNA interactions, and these interactions seem to explain many of the salient features of individual GRNs. The regulation of gene expression by cellular proteins has several effects on the evolution of GRNs. First, the modularity of the regulation of gene expression patterns allows for graded output of expression, rather than all-or-nothing changes (see Box 1). At the extreme end of this graded output is the finding that cis-regulatory regions can undergo some amount of evolutionary change and still have the same expression result (Swanson et al. 2011). Second, cis-regulatory mutations can act dominantly: for example, a gain of a binding site in a single copy of a chromosome can lead to ectopic expression of the gene without both copies having to have the same mutation (Ruvkun et al. 1991). Ectopic
expression of a GRN member will have consequences on all of the downstream components of the GRN. In this way, part of a GRN (e.g., a plug-in) can be co-opted to act in a new place or at a new time during development (Box 1). This ectopic expression can be selected on without detriment to the original GRN expression and function. In this way, the modularity of GRNs makes them particularly amenable to evolutionary tinkering. Third, mutations in either protein-coding or cis-regulatory sequence can lead to evolutionary changes in a GRN. A protein can gain the ability to regulate a different set of downstream genes, or a cis-regulatory sequence can gain, lose, or modify binding sites (making them stronger or weaker, Box 1). As described above, these changes can be modular, changing only a subset of a GRN’s output. All three of these features of GRNs are theorized to allow for rapid evolution of GRNs. This can lead to the diversification of a taxon to exploit different habitats. One example of this is found in yeast, where different species live in vastly different environments. Changes in GRN kernels have allowed different fungal strains to exploit various habitats. Saccharomyces cerevisiae, Kluyveromyces lactis, and Candida albicans are related yeast species that live in different environments; S. cerevisiae is the common lab yeast, K. lactis is a dairy yeast, and C. albicans is a human pathogen. Using the ChIP-Chip technique (see Tools section), Tuch et al. elucidated all the gene targets of Mcm-1 transcription factors in these three yeasts. They found that while some of the regulatory interactions remained stable, defining Mcm-1 kernels, others had changed. In this way, the fungi living under different conditions have changed the responsiveness of their I/O switches over time, making each fungus able to activate different downstream plug-ins and batteries under different environmental conditions (Tuch et al. 2008; Li and Johnson 2010).
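
The cross-species comparison described above amounts to asking which regulatory targets are shared by all species and which are unique to one. The sketch below illustrates that logic with Python sets. The gene names are placeholders rather than data from Tuch et al. (2008), and it assumes that targets have already been mapped to shared ortholog identifiers, which is a substantial step in practice.

```python
# Hypothetical target lists for one transcription factor in three yeasts.
# Gene names are placeholders, NOT results from any published study.
targets = {
    "S. cerevisiae": {"geneA", "geneB", "geneC", "geneD"},
    "K. lactis": {"geneA", "geneB", "geneE"},
    "C. albicans": {"geneA", "geneB", "geneF", "geneG"},
}


def conserved_targets(targets_by_species):
    """Targets bound in every species (a candidate conserved core)."""
    return set.intersection(*targets_by_species.values())


def lineage_specific_targets(targets_by_species):
    """Targets bound in exactly one species, keyed by that species."""
    result = {}
    for species, own in targets_by_species.items():
        # Union of the targets seen in all *other* species.
        others = set().union(
            *(t for s, t in targets_by_species.items() if s != species)
        )
        result[species] = own - others
    return result


core = conserved_targets(targets)
specific = lineage_specific_targets(targets)
```

In this toy data set, `core` holds the interactions retained in all three species, while `specific` lists the interactions that appear in only one lineage and are therefore candidates for evolutionary rewiring.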

GRNs Are Hierarchical

As mentioned above, the alteration or removal of those components higher up in the chain will likely lead to changes in downstream components
and functions. This hierarchical nature of GRNs leads to many of their evolutionary properties. Specifically, (1) kernels may be the most stable components of GRNs. (2) GRNs act to sequentially specialize tissues. (3) Early regulatory states restrict the possibilities for future states (canalization). While specific examples of each of these properties are available, the universality of them is, as yet, unknown. Below, the consequences of each of these proposed properties are examined, and several examples of the hierarchical nature of GRNs are offered with the caveat that these properties are currently theoretical, rather than proven. Because kernels are at the top of the hierarchy, they tend to be evolutionarily inflexible; if any one part of a kernel is destroyed, the genes it affects lower in the hierarchy will no longer be expressed in their normal patterns. This, in turn, leads to changes in the characters normally specified by the GRN. The inflexibility of kernels is made obvious when comparing organisms that are evolutionarily distant from each other. Several kernels have been found to be conserved in even extremely disparate species. This may be because changes to a kernel will change such a large portion of the body of an organism, that the organism will be unviable. If the organism is viable despite these changes, it could potentially lead to phylum and subphylum level differences between organisms. One compelling example of a conserved kernel is within the mesoderm specification GRN in echinoderms (Fig. 2). In both sea stars and sea urchins, regulatory interactions between several genes including Blimp1, Otx, Gatae, Foxa, and Brachyury specify mesodermal versus endodermal and primary mesenchymal cells. However, regulatory interactions outside this kernel have changed greatly between these two species, suggesting an evolutionarily inflexible kernel feeding into more flexible plug-ins, such as the Notch/Delta signal transduction pathway (Hinman et al. 
2007). Changes in lower levels of the GRN hierarchy have also occurred in the evolutionary time separating mice and humans. In mouse and human livers, kernels containing Apoh, Apoc4, and AldhIII are expressed. However, some genes
within these modules are species specific, such as DCXR in humans and Cat in mice. Despite the differences, these GRNs are both involved in similar processes including acute inflammatory response, glycerolipid metabolic processes, and regulation of coagulation (Yang and Su 2010). We can infer, then, that the changes in genes between the species’ modules have occurred lower on the organizational hierarchy of GRN components and, conversely, that those genes that are conserved are likely to be part of components higher up in the GRN organizational hierarchy. Sea urchins and sea stars share the same basic body plan, as do humans and mice. It is thought that alterations in plug-ins, differentiation gene batteries, and the I/O switches that connect them may lead to the morphological differences between species of organisms within a phylum. Gene batteries and the switches that control them are presumably altered most frequently because there are fewer downstream effects (Davidson and Erwin 2006). In insects, for example, the Ultrabithorax (Ubx) gene patterns certain thoracic segments of the body. Drosophila (fruit flies) and Tribolium (beetles) express Ubx in very similar thoracic patterns, suggesting similar upstream patterning by a conserved kernel. The GRN circuitry downstream of Ubx, however, differs markedly, with Drosophila Ubx repressing wing development and Tribolium Ubx promoting wing development (Tomoyasu et al. 2005). These opposite effects of a specific gene can be explained if one or more plug-ins or batteries downstream of Ubx were swapped out at some point in the ancestry of insects. The GRN hierarchy can also be seen in the sequential specialization of tissue over developmental time. Kernels, restricting cell fates in terms of body region and axis, often act first in development, followed by lower-level components, which subsequently specialize these regions.
Because of this, changes in the deployment of specific plug-ins may have a modest effect on specific regions, while changes in kernels may have a much larger effect, abolishing or increasing a region able to respond to particular developmental cues. Over evolutionary time scales, this has the consequence of canalizing particular developmental programs.
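
The difference in reach between kernels, plug-ins, and batteries can be pictured by treating a GRN as a directed graph and counting how many genes lie downstream of each component. The following is a minimal sketch on an invented three-tier network; the gene names and wiring are hypothetical and are not taken from any of the networks discussed here.

```python
from collections import deque

# Hypothetical three-tier GRN: each regulator maps to the genes it
# directly regulates. Names and edges are invented for illustration.
grn = {
    "kernel_gene": ["plugin_1", "plugin_2"],
    "plugin_1": ["battery_a", "battery_b"],
    "plugin_2": ["battery_c"],
    "battery_a": [],
    "battery_b": [],
    "battery_c": [],
}


def downstream(network, gene):
    """Return every gene reachable from `gene` along regulatory edges."""
    seen = set()
    queue = deque(network.get(gene, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(network.get(current, []))
    return seen


# Number of genes potentially affected when each gene is perturbed.
affected = {gene: len(downstream(grn, gene)) for gene in grn}
```

In this toy network the kernel gene reaches every other gene, each plug-in reaches only its own battery genes, and battery genes reach nothing, which mirrors the graded effect sizes described above.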

Canalization occurs when early regulatory states of a region restrict the number of possibilities for future development. For example, once a region has been specified as neuronal, it is no longer responsive to cues specifying non-neuronal cell types. In evolutionary terms, this can create a sort of “bottleneck” for evolution: particular morphological stages are seen across a wide variety of species in a phylum due to the fixation of high-level GRN components that restrict fates (Peter and Davidson 2011).

GRNs Transduce Signals

It is no coincidence that many described gene regulatory networks involve at least one signal transduction pathway. GRNs are, at their core, a means to translate multiple types of environmental signals into a cellular response. As such, they act by many of the same principles as signal transduction pathways. Several functional features found in multiple GRNs are also properties of many signal transduction pathways. Barolo and Posakony (2002) identify three features of signal transduction pathways used to amplify signals. Davidson and Levine (2008) find that these, as well as other signal transduction pathway properties, are key in regulating GRNs. These include (1) activator insufficiency, wherein multiple inputs are needed to activate a network. This keeps a GRN or portion of a GRN from deploying in response to a single activator. (2) Transcriptional exclusion of alternative states. This is similar to activator insufficiency, and here a negative feedback mechanism prevents activators from responding to a signal, thus keeping a particular regulatory state off despite an activating signal. (3) Cooperative activation. Like activator insufficiency, cooperative activation requires multiple activators. Here, the particular combination of activators is important. This increases the number of output possibilities without greatly increasing the number of genes in the genome. As this is regulated via the specific combination of proteins that can bind to a cis-regulatory region, it evolves via changes in cis-regulatory regions of downstream genes and DNA binding regions of upstream signaling proteins. (4) Feedback loops.
Different types of feedback loops, common in signal transduction pathways, also regulate GRNs by stabilizing initially transient signals, coordinating a large group of cells to be similarly responsive to signals, and creating oscillatory expression patterns to spatially or temporally pattern a tissue (Davidson and Levine 2008). In a particular organism, these mechanisms serve to amplify small initial differences between environmental factors and gene expression. Over evolutionary time, they serve to act as both a buffer system for small changes in GRN genes and as a way to overcome detrimental pleiotropic effects. Pleiotropy occurs when a mutation in a single gene has multiple effects on an organism. This happens because many genes are involved in specifying and organizing multiple cell types. The signal transduction properties of GRNs ensure that only specific environments activate a particular GRN. Because GRNs are modular, specific components can be activated under multiple conditions. If one gene in a network is mutated, it might only show its effect in specific tissues, sparing others. For an example of this, see Box 1. In this hypothetical organism, some types of mutation of a cis-regulatory region lead to potentially detrimental outcomes (total loss of limbs), while others may be adaptive (gain of a second set of limbs).

GRNs Can Evolve by Duplication and Divergence

The modularity of GRNs allows for redeployment of genes, plug-ins, cassettes, and kernels in different biological contexts. This modularity also allows GRNs to evolve by duplication and divergence mechanisms. At the smallest level, cis-regulatory sites can duplicate, providing more sites for a particular protein to bind. These duplicated sites can undergo further mutation to increase or decrease binding strength or to favor binding to other proteins that will become new members of the GRN. Genes themselves can also duplicate, with or without their cis-regulatory regions. After gene duplication, one of the copies may be able to mutate without detrimental effect on the organism, since an intact working copy still exists and functions (reviewed in Taylor and Raes 2004). This new copy can add complexity to the
original GRN, can accumulate mutations in its cis-regulatory region to become part of a new GRN, or can acquire an entirely new cis-regulatory region by being inserted into a new part of the genome. At a higher level, parts of GRNs or entire GRNs can also duplicate, for example, in a whole-genome duplication. This could lead to the evolution of new cell or tissue types. One place where this seems to have happened is in the neural crest cells of vertebrates. In vertebrates, a whole-genome duplication has created a set of neural plate gene homologues that are now expressed in the neural plate. Here, these duplicated genes drive the expression of battery genes that regulate cell adhesion and migration (for further references and examples, see Oakley and Rivera 2008).

Tools for Studying Gene Regulatory Network Evolution

Global Gene Expression

Global gene expression for a particular tissue type can be assayed using microarrays or high-throughput sequencing of the transcriptome. Both of these techniques begin with total mRNA from a cell type, tissue type, or developmental stage. This pool of mRNA represents every gene expressed in that tissue at a particular developmental time; genes expressed at higher levels will be represented by more copies of their particular mRNA. For both microarrays and high-throughput sequencing (next-generation sequencing), this mRNA is then reverse transcribed into DNA, called cDNA. Using high-throughput sequencing, every copy of this cDNA can then be sequenced. This is followed by bioinformatics analysis to annotate each sequence with the name of the gene it represents and the number of times it was sequenced. cDNAs of genes that are expressed at higher levels will be more highly represented in the resultant data and will be sequenced more times. In this way, a “tissue profile” containing all of the mRNA expression data for a tissue can be generated. Microarrays also generate a tissue profile for a tissue or developmental stage, but they do not generate sequence data for each of the expressed
genes. Microarrays are typically used when sequence data is already available. Short single-stranded DNA (ssDNA) fragments are bound to a substrate, like a glass slide, in spots. Each spot on the slide is made up of trillions of identical short ssDNA fragments and represents sequence from a known gene. Since DNA fragments are so small, tens of thousands of spots can fit on a small microarray slide. A microarray slide is treated with cDNA prepared from the tissue of interest that has been labeled with a fluorescent marker. The single-stranded cDNA hydrogen-bonds with complementary ssDNA probes bound to the slide. After the slide is washed, only cDNA bound to the ssDNA probes is retained. Fluorescence detection then reveals which probes have been bound and, therefore, which cDNAs were present in the sample. Both microarrays and high-throughput transcriptome sequencing can be used to compare the gene expression profiles of different tissues (e.g., developing neural tissue and fully developed neuronal tissue) and to compare gene expression between tissues under different conditions (e.g., different mutants). These profiling techniques are valuable in elucidating gene regulatory network interactions. By comparing different tissue types, we can see which genes are specific to a certain tissue and begin to make predictions regarding the GRNs specifying differentiation of that tissue. By comparing mutants to wild-type profiles, we can see what the effects of kernel or plug-in elements are on downstream effector genes.

Specific Gene Expression

Fine measurements of single gene expression patterns are often used to assess the subtleties of particular members of a GRN or to test for expression changes downstream of a GRN member (see Functional Methods below). Common techniques to assay single genes, or a few at a time, include in situ hybridization (or in situ) and quantitative real-time PCR (qPCR). These are complementary methods, with qPCR assaying the amount of gene expression and in situ precisely locating the tissues and cells expressing the gene(s) of interest. In qPCR, mRNA is extracted from tissue, cells, or whole bodies at a specific stage of
development. Reverse transcriptase converts the mRNA into single-stranded cDNA, which is used in a quantitative PCR reaction. Here, the amount of PCR product is quantified using a dye capable of binding double-stranded DNA, the PCR product, after each round of PCR. By comparing this data to standard controls, the amount of starting cDNA, and thus mRNA, for a particular gene can be quantified. In situ hybridization directly examines mRNA transcripts via a nucleotide probe complementary to the mRNA of the gene of interest. The probe itself is detected with an antibody producing a dark color or fluorescent signal. In situ hybridization, while not quantitative, can be very spatially sensitive, with resolution down to the subcellular level.

Specific DNA/Protein Interactions

There are several methods commonly used to assay interactions between DNA and proteins. The data generated from these methods demonstrates whether a particular transcription factor (protein) binds to a particular enhancer (DNA). Methods range from the low throughput (testing a single enhancer against a single protein) to high throughput (testing all DNA against a single protein or all proteins against a particular cis-regulatory sequence). All of these methods rely on binding assays: detecting the binding between a piece of DNA and a protein. Because of this, these methods are limited, and functionality cannot be assessed. Thus, the in vivo relevance, for example, whether the transcription factor acts as a repressor or as an activator, cannot be ascertained by this type of analysis. Nevertheless, binding assays definitively show that a particular protein is capable of binding a particular enhancer. Because of this, they are a key element in elucidating a gene regulatory network. Physical binding between a protein and a piece of DNA can be assayed in several ways. Popular methods include gel-shift assays (or electrophoretic mobility shift assays, a.k.a. EMSAs) and chromatin immunoprecipitation (ChIP). Gel-shift assays examine a specific protein/DNA combination. In these assays, a specific cloned and purified protein
is incubated with a labeled piece of DNA (oligonucleotide). This reaction is compared to a control oligonucleotide that has not been incubated with the protein. Both are run on a size-separating electrophoretic gel and probed for the labeled DNA. If binding between the protein and DNA occurred, the incubated sample will show a higher molecular weight (because the labeled DNA is weighted down by the bound protein) compared to the control. If binding did not occur, both samples will be the same size. While gel-shift assays rely on testing specific DNA sequences for binding in vitro, ChIP assays test for binding in vivo and thus do not limit the test to a specific DNA sequence or sequences. ChIP assays use an antibody against a specific protein to extract that protein and any associated DNA from cells. Subsequent steps include high-throughput sequencing of all of the different DNA species extracted (ChIP-Seq), PCR cloning of a subset of the DNA species extracted (ChIP-Display), or microarray analysis to determine the genes or genomic regions extracted (ChIP-on-chip). This final step allows researchers to see which genes are located near binding sites for the protein assayed. Actual binding site sequence, however, is not determined by this method. Because of this, ChIP and gel-shift assays are complementary in the type of data they produce, and both can be used in parallel to gain an understanding of the binding patterns of specific proteins. Subsequent computational analysis can compare binding patterns for a particular protein or protein family across species to explain changes in gene regulatory interactions over time.

Functional Methods

The study of GRNs is a branch of genetics and, as such, relies on the standard genetic tool of turning genes on and off to study their function. By examining gene expression levels (via microarray, qPCR, or in situ) after perturbation, we can elucidate downstream effects on a GRN. Classic experiments are performed with mutational analysis. However, mutant stocks for particular genes are typically only available for a small set of model organisms and thus do not allow us to study most evolutionary questions. To study the evolution of
gene regulatory networks, other methods of turning genes on and off have been developed for non-model organisms. Techniques to increase gene expression typically rely on the addition of RNA or DNA coding for a gene into a developing organism via injection or electroporation. This adds in gene coding sequence locally to a particular cell or group of cells. If stable, the gene may be expressed for several cell generations, making the protein in all the progeny of the original cell(s) affected. An advantage of this is the ability to study a gene in a particular cell lineage or tissue type. A disadvantage is that it is difficult to study the whole-organism effects of a gene. A novel method using “supercharged” proteins complexed to RNA or DNA is being developed that could possibly overcome this hurdle (McNaughton et al. 2009). New breakthroughs in gene knockdown technology have made this tool amenable to a huge variety of organisms, allowing for large-scale studies across phyla to assay the evolution of particular genes as well as the evolution of GRNs. RNA interference (RNAi) has rapidly become the foremost method to knock down expression levels of target genes in non-model organisms (e.g., Rivera et al. 2011). In this method, double-stranded RNA (dsRNA) that matches the sequence of the gene of interest is applied to an organism. This dsRNA uses normal cellular machinery to target and knock down mRNA levels of its target gene. Because dsRNA is relatively easy and inexpensive to make, researchers can make dsRNA in house and can target any gene they can clone. This is an improvement over morpholinos, another popular method of targeted gene knockdown, which are much more expensive and can only be synthesized in specialized laboratories.

Computational Methods

Computational or in silico methods to examine GRNs have exploded in recent years. Some of these have focused on understanding and organizing functional data (e.g., Longabaugh et al. 2005), and others have focused on searching sequence data to generate new hypotheses regarding
network connectivity (e.g., Bailey et al. 2009). Methods that organize functional data rely on expression data that can be obtained using microarray, transcriptomics, ChIP, qPCR, or in situ hybridization. The organizational methods then typically compare this data between different developmental stages, environmental conditions, species, or experimentally manipulated genetic backgrounds (i.e., mutant vs. wild-type expression). The end result is an understanding of which expressed genes act together and, with enough data, a wiring diagram of the GRN. Methods that generate new hypotheses use genomic data to search for cis-regulatory motifs that may be regulating particular expression patterns in response to specific transcription factors. To find these motifs, genomes can be searched for known transcription factor binding sites, or genomes can be compared to one another for potential conserved binding sites (reviewed in Aerts 2012). While many motif discovery programs currently exist, those that are able to validate known data or generate hypotheses that are subsequently validated will be the most useful in the future.
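
The simplest form of the binding-site search described above is to scan a sequence for a known consensus motif on both strands. The sketch below does this with a regular expression built from IUPAC degenerate base codes. The `find_sites` helper, the example sequence, and the use of the GATA-like consensus WGATAR are illustrative assumptions; dedicated motif discovery programs typically rely on position weight matrices and statistical scoring rather than exact consensus matching.

```python
import re

# IUPAC degenerate nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, consensus):
    """Return sorted (start, strand) pairs for every consensus match.

    `start` is always given in forward-strand coordinates (0-based).
    A lookahead is used so that overlapping matches are also reported.
    """
    body = "".join(IUPAC[base] for base in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")
    sequence = sequence.upper()
    n, k = len(sequence), len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(sequence)]
    rc = reverse_complement(sequence)
    hits += [(n - m.start() - k, "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Made-up 20-bp sequence with one site on each strand.
example = "CCAGATAAGGCCTTATCAGG"
sites = find_sites(example, "WGATAR")
```

Candidate sites found this way are hypotheses only. As the text notes, they become useful when they agree with known data or are confirmed by binding and functional assays.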

Conclusion

While the method of studying protein-coding DNA sequences and their direct products has led to advances in the biological and evolutionary world, these results do not represent the entirety of genetic information in a cell or organism. To understand this, we must also study gene interactions, in particular the regulation of gene expression. With current technologies, we are able to examine cis-regulatory regions and trans-regulatory elements, determine their effects within gene regulatory networks (GRNs), and identify each gene’s position within the GRN hierarchy. Once the members and interactions of a GRN are identified, we are able to compare that GRN in different developmental contexts within a single organism as well as compare it to GRNs in other species. By doing this, we are able to identify the evolutionarily inflexible “kernels” that are found in multiple contexts and species.
Conversely, we can also identify the elements that are more readily changed or mutated. Current thought is that the most inflexible elements sit at the top of the GRN hierarchy, regulating other regulatory genes, and that the more flexible ones lie towards the bottom, performing structural tasks within the organism. One major impact the study of GRNs has had on our modern understanding of biology is to allow us to quickly identify core genetic components when we first begin to study a developmental process. For example, the elucidation of a conserved kernel specifying eye development in vertebrates and fruit flies (reviewed in Silver and Rebay 2005) led several research groups to examine the genes of that kernel in many other animals (reviewed in Vopalensky and Kozmik 2009). Because of this, we now understand the basics of eye development in a wide array of non-model organisms without the need to laboriously study hundreds of candidate genes. The study of GRNs thus enhances our ability to study complex processes in a wide variety of organisms. In this way, it also helps to reveal evolutionary trends and allows us to deduce the mechanisms for biological change and evolution.

Explanation of Terms

Molecular and Genetic
Gene: A functional segment of DNA that codes for RNA. Typically, this RNA codes for a protein. The term can also refer to the downstream products of this DNA (its RNA and protein).
Gene expression: A gene is expressed when it is “turned on” – an RNA molecule is made from its DNA template.
Gene regulatory network: The interactions between a particular set of genes. Typically this involves proteins coded by genes interacting with one another and with the DNA regulating the expression of other genes in the network.
Cis-acting regulatory factor: A region of DNA near the coding portion of a gene that helps to regulate the gene’s expression.
Trans-acting regulatory factor: A diffusible element (typically a protein) that binds to cis-acting DNA to help regulate the gene’s expression.
Transcription factor: A protein that affects gene expression; a trans-acting regulatory factor.
Central dogma of molecular biology: DNA makes RNA makes protein. While this is often correct for any given gene, large exceptions exist. For example, not all DNA codes for RNA (noncoding DNA), and many species of RNA are themselves functional and do not code for protein.
Transcription: The normal cellular process of making an RNA from a complementary sequence of DNA using RNA polymerase. This is the first step in gene expression.
Translation: The normal cellular process of making protein from an mRNA molecule using a ribosome.
Genomics: The study of the entire DNA sequence (genome) of organisms.
Transcriptome: All of the expressed genes in a particular organism or tissue at a particular developmental stage.
Transcriptomics: The study of the entire sequence content of the transcriptome.

Tools
Reverse transcription: RNA is used as a template for DNA synthesis using a viral enzyme called reverse transcriptase. Because double-stranded DNA is more stable than single-stranded RNA, this technique is often used to study the specific types of RNA expressed in a particular cell type, tissue, or developmental stage.
High-throughput sequencing: DNA sequencing techniques that elucidate the sequence of hundreds to millions of DNA fragments in parallel. Often used to sequence “transcriptomes,” the full complement of genes expressed in a particular cell type, tissue, or developmental stage.
Gene knockdown: Experimental techniques used to lower the expression of a particular gene in a cell.
RNA interference (RNAi): A gene knockdown technique wherein double-stranded RNA (dsRNA) matching a particular gene’s sequence is introduced into an organism. The RNAi pathway in the cell uses these dsRNAs to knock down expression of the targeted gene, but not of other genes.

Evolution and Development
Character: A trait or feature of an organism that is heritable.
Germ layers: Tissues in animal development. Most animals have three germ layers: the endoderm (innermost layer), the ectoderm (outermost layer), and the mesoderm (middle layer). These three layers migrate and differentiate substantially during development.
Specification: The point in development when a cell’s fate has been roughly decided.
Differentiation: The developmental process wherein a cell changes its gene expression patterns and morphological characteristics to mature. Differentiated cells typically cannot revert to an undifferentiated state.

See Figs. 3 and 4

Gene Regulatory Networks, Evolution of

Cis-regulatory evolution. Changing the enhancer region of a gene can have drastic effects on an animal. To understand this, it is important to realize two things: first, that small mutations of only a few base pairs can disrupt or strengthen protein/DNA binding, and second, that all genes are present in the nucleus of every cell, but only a subset of these genes are turned on in any particular cell. Transcription factors expressed in a particular cell type or tissue can turn on multiple downstream genes in that cell type or tissue. The specific combination of negative and positive regulators of a gene in any given cell determines whether that gene is expressed as well as the strength of that expression – i.e., how many copies of the gene are transcribed. In this way, existing gene expression patterns drive the expression of downstream genes in a cell. In this hypothetical example, Gene A is required for limb development (1a). Gene A expression is regulated by three proteins: the positive regulators X and Y and the negative regulator Z. Its cis-regulatory region has binding sites for these three proteins (1a, upper panel). During relevant developmental stages, X is expressed in a gradient highest at the anterior of the thorax with lower levels towards the posterior of the animal (1a, lower right panel grey). Y is expressed globally (not shown), and negative regulator Z is expressed in the head and anterior thorax (1a, lower right panel dots). Together, these regulators drive the expression of Gene A in a band around the upper-mid thorax where it initiates limb development (1a, lower right panel). Small mutations in the cis-regulatory regions can eliminate, add, or otherwise change these protein-binding sites. Mutating the binding site for positive regulator Y diminishes the overall expression of Gene A, leading to smaller limbs (b). Conversely, mutating the binding site for negative regulator Z increases the extent of Gene A expression into the upper thorax (c). 
This results in an animal with two sets of limbs. An addition or decrease in the number of binding sites, via a small duplication or deletion mutation, can likewise change Gene A’s expression pattern. An addition of binding sites for X allows Gene A to be expressed at lower levels


Gene Regulatory Networks, Evolution of, Fig. 3


Gene Regulatory Networks, Evolution of, Fig. 4

of X, effectively increasing the expression of Gene A into posterior regions that have low but detectable X expression levels (d). A decrease in number of binding sites for X, then, has the opposite effect of allowing Gene A to only be expressed at very high levels of X, where Z is repressing Gene A. In this case, Gene A is not expressed (e). Mutating a low affinity X binding site (one that does not bind strongly to protein X)

to a high affinity X binding site (one that binds much more strongly to protein X) has the same effect as the addition of X binding sites (d). Likewise, decreasing the affinity of X binding sites has the same effect as decreasing the number of X binding sites (e). A final way that small mutations can drastically affect gene expression and, thus, morphology is by the addition of binding sites for new proteins.


The two primary ways to do this are to mutate a segment of non-protein-binding DNA so that it can now bind a new protein or to mutate an existing binding site so that its affinity changes from one protein to another. In the first case, protein W is expressed at the posterior-most end of the animal and acts as a positive regulator of transcription. When non-binding DNA mutates to a protein W binding site, W now positively regulates Gene A, turning it on at the animal’s posterior and inducing ectopic legs there (f). A binding site for W can also emerge if an existing protein-binding site mutates. In this case, the ectopic expression of Gene A is the same as in (f), but the ancestral binding site no longer regulates the gene.
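The hypothetical Gene A example in this caption can be restated as a small rule-based model. The Python sketch below is only an illustration of the caption's logic: the body regions, the numeric X levels, the activation thresholds per number of X binding sites, and the function names are all assumptions chosen so that the wild-type pattern is a single band in the mid thorax. None of these values come from measured data.

```python
# Hypothetical body axis, anterior to posterior.
REGIONS = ["head", "anterior thorax", "mid thorax",
           "posterior thorax", "abdomen", "tail"]

# Assumed regulator levels per region (arbitrary units).
X_LEVEL = [0, 10, 7, 4, 2, 0]                         # gradient, peak at anterior thorax
Z_PRESENT = [True, True, False, False, False, False]  # head and anterior thorax
W_PRESENT = [False, False, False, False, False, True] # posterior-most end only

# Assumed rule: more X binding sites lower the X level needed for activation.
X_THRESHOLD = {1: 9, 2: 6, 3: 3}


def gene_a_pattern(x_sites=2, y_site=True, z_site=True, w_site=False):
    """Return {region: 'off' | 'weak' | 'full'} for a given cis-regulatory state."""
    pattern = {}
    for region, x, z, w in zip(REGIONS, X_LEVEL, Z_PRESENT, W_PRESENT):
        activated = x >= X_THRESHOLD[x_sites] or (w_site and w)
        repressed = z_site and z
        if activated and not repressed:
            # Y is expressed everywhere; losing its site only weakens expression.
            pattern[region] = "full" if y_site else "weak"
        else:
            pattern[region] = "off"
    return pattern


def expressed_regions(**cis_state):
    """List the regions where Gene A is on at any level."""
    return [r for r, level in gene_a_pattern(**cis_state).items() if level != "off"]


print("wild type (a):", expressed_regions())
print("Y site lost (b):", gene_a_pattern(y_site=False)["mid thorax"])
print("Z site lost (c):", expressed_regions(z_site=False))
print("extra X site (d):", expressed_regions(x_sites=3))
print("fewer X sites (e):", expressed_regions(x_sites=1))
print("new W site (f):", expressed_regions(w_site=True))
```

Each keyword argument corresponds to one class of cis-regulatory mutation in the caption, so changing a single argument reproduces the matching panel (b through f) under these assumptions.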

References
Aerts S (2012) Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr Top Dev Biol 98:121–145. http://www.ncbi.nlm.nih.gov/pubmed/22305161. Accessed 21 May 2014
Babu MM, Luscombe NM et al (2004) Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14(3):283–291. http://www.sciencedirect.com/science/article/pii/S0959440X04000788. Accessed 21 May 2014
Bailey TL, Boden M et al (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(Web Server):W202–W208. http://nar.oxfordjournals.org/content/37/suppl_2/W202.full?sid=b457777b-ab84-41fd-bf80-1fc43d9156b2. Accessed 21 May 2014
Barolo S, Posakony JW (2002) Three habits of highly effective signaling pathways: principles of transcriptional control by developmental cell signaling. Genes Dev 16(10):1167–1181. http://genesdev.cshlp.org/content/16/10/1167.full?sid=73ccebe3-d202-4b96-b1db-425391212387. Accessed 21 May 2014
Blekhman R, Oshlack A et al (2008) Gene regulation in primates evolves under tissue-specific selection pressures. PLoS Genet 4(11):e1000271. https://doi.org/10.1371/journal.pgen.1000271. http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000271. Accessed 21 May 2014
Brown KE, Kerr M et al (2007) The EGFR ligands Spitz and Keren act cooperatively in the Drosophila eye. Dev Biol 307(1):105–113
Davidson E, Erwin DH (2006) Gene regulatory networks and the evolution of animal body plans. Science 311(5762):796–800. http://www.sciencemag.org/content/311/5762/796.full?sid=c7dd4af9-64f1-4db7-9214-1101f403b991. Accessed 21 May 2014
Davidson E, Levine M (2008) Properties of developmental gene regulatory networks. Proc Natl Acad Sci 105(51):20063–20066. http://www.pnas.org/content/105/51/20063.full?sid=b2e64971-b25c-41b8-bf3d-02ccf1c24c13. Accessed 21 May 2014
Hinman VF, Nguyen A et al (2007) Caught in the evolutionary act: precise cis-regulatory basis of difference in the organization of gene networks of sea stars and sea urchins. Dev Biol 312(2):584–595. http://www.sciencedirect.com/science/article/pii/S0012160607013425. Accessed 21 May 2014
Li H, Johnson AD (2010) Evolution of transcription networks–lessons from yeasts. Curr Biol 20(17):R746–R753
Longabaugh WJ, Davidson EH et al (2005) Computational representation of developmental genetic regulatory networks. Dev Biol 283(1):1–16
McNaughton BR, Cronican JJ et al (2009) Mammalian cell penetration, siRNA transfection, and DNA transfection by supercharged proteins. Proc Natl Acad Sci 106(15):6111–6116. http://www.pnas.org/content/106/15/6111.full?sid=6e098831-5ca4-4cac-9da8-a2af3984ab07. Accessed 21 May 2014
Oakley TH, Rivera AS (2008) Genomics and the evolutionary origins of nervous system complexity. Curr Opin Genet Dev 18(6):479–492. http://www.sciencedirect.com/science/article/pii/S0959437X08001846. Accessed 21 May 2014
Peter IS, Davidson EH (2010) The endoderm gene regulatory network in sea urchin embryos up to mid-blastula stage. Dev Biol 340(2):188–199. http://www.sciencedirect.com/science/article/pii/S001216060901313X. Accessed 21 May 2014
Peter IS, Davidson EH (2011) Evolution of gene regulatory networks controlling body plan development. Cell 144(6):970–985. http://www.sciencedirect.com/science/article/pii/S0092867411001310. Accessed 21 May 2014
Rebeiz M, Jikomes N et al (2011) Evolutionary origin of a novel gene expression pattern through co-option of the latent activities of existing regulatory sequences. Proc Natl Acad Sci 108(25):10036–10043. http://www.pnas.org/content/108/25/10036.full?sid=65fadf08-d28d-4b94-b91a-27cd4db8c106. Accessed 21 May 2014
Rivera AS, Hammel JU et al (2011) RNA interference in marine and freshwater sponges: actin knockdown in Tethya wilhelma and Ephydatia muelleri by ingested dsRNA expressing bacteria. BMC Biotechnol 11(1):67. http://www.biomedcentral.com/1472-6750/11/67. Accessed 21 May 2014
Ruvkun G, Wightman B et al (1991) Dominant gain-of-function mutations that lead to misregulation of the C. elegans heterochronic gene lin-14, and the evolutionary implications of dominant mutations in pattern-formation genes. Development 113:47–54. http://www.ncbi.nlm.nih.gov/pubmed/1742500. Accessed 21 May 2014
Silver SJ, Rebay I (2005) Signaling circuitries in development: insights from the retinal determination gene network. Development 132(1):3–13. http://dev.biologists.org/content/132/1/3.full. Accessed 21 May 2014
Swanson CI, Schwimmer DB (2011) Rapid evolutionary rewiring of a structurally constrained eye enhancer. Curr Biol 21(14):1186–1196. http://www.sciencedirect.com/science/article/pii/S0960982211006439. Accessed 21 May 2014
Taylor JS, Raes J (2004) Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38(1):615–643. http://www.annualreviews.org/doi/abs/10.1146/annurev.genet.38.072902.092831?url_ver=Z39.88-2003&rfr_dat=cr_pub%3Dpubmed&rfr_id=ori%3Arid%3Acrossref.org&journalCode=genet. Accessed 21 May 2014
Tomoyasu Y, Wheeler SR et al (2005) Ultrabithorax is required for membranous wing identity in the beetle Tribolium castaneum. Nature 433(7026):643–647. Nature Publishing Group. http://www.nature.com/nature/journal/v433/n7026/full/nature03272.html. Accessed 21 May 2014
Tuch BB, Galgoczy DJ et al (2008) The evolution of combinatorial gene regulation in fungi. PLoS Biol 6(2):e38. http://dx.doi.org/10.1371/journal.pbio.0060038. Accessed 21 May 2014
Vopalensky P, Kozmik Z (2009) Eye evolution: common use and independent recruitment of genetic components. Philos Trans R Soc Lond B Biol Sci 364(1531):2819–2832. http://rstb.royalsocietypublishing.org/content/364/1531/2819.short. Accessed 21 May 2014
Wagner GP, Lynch VJ (2008) The gene regulatory logic of transcription factor evolution. Trends Ecol Evol 23(7):377–385
Wray GA (2007) The evolutionary significance of cis-regulatory mutations. Nat Rev Genet 8(3):206–216. Nature Publishing Group. http://www.nature.com/nrg/journal/v8/n3/abs/nrg2063.html. Accessed 21 May 2014
Xu N, Wang SQ et al (2011) EGFR, Wingless and JAK/STAT signaling cooperatively maintain Drosophila intestinal stem cells. Dev Biol 354(1):31–43. http://www.sciencedirect.com/science/article/pii/S0012160611001825. Accessed 21 May 2014
Yang R, Su B (2010) Characterization and comparison of the tissue-related modules in human and mouse. PLoS One 5(7):e11730. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011730. Accessed 21 May 2014

Gene Silencing ▶ DNA Methylation and Cancer ▶ Genomic Imprinting ▶ Long-Term Genetic Silencing at Centromere and Telomeres


Genes and Genomes: Structure Lawrence I. Grossman Center for Molecular Medicine and Genetics, Wayne State University School of Medicine, Detroit, MI, USA

Synopsis The ability of living cells to persist depends on their ability to divide and produce either exact copies of themselves or programmed variations; they thus require a repository of knowledge for doing so. This repository is their genes, composed in cells of DNA and in aggregate referred to as their genome. The mechanism by which this DNA is duplicated is treated in Section I, DNA Replication. Here the concern is with its organization and function. These facets, organization and function, differ among the different repository levels (nuclei, organelles – mitochondria and chloroplasts – plasmids, and viruses) and are dealt with individually in each appropriate section. Major differences include the number of chromosomes (multiple chromosomes in eukaryotic nuclei, single chromosomes typically in organelles, plasmids, and DNA viruses), the number of genome copies (two in nuclei, hundreds to thousands in mitochondria and chloroplasts, various numbers >2 for plasmids and viruses), and their organization (DNA-protein complexes called chromatin in nuclei, defined DNA-protein complexes for viruses, looser and less well-defined associations of DNA and protein for organellar DNAs and plasmids). Whatever the details in each case, the genome organization used, in addition to being able to support its own replication, must be able to support expression and regulation of expression of its information content. These considerations must be borne in mind in focusing here primarily on the organization and information content of genomes and on the relative differences among related species.


Development of the Field The discovery of DNA (structure and function) is surely one of the great milestones in the history of knowledge. Although genetics (inheritance) could be described and studied in the absence of specific knowledge about how it is mediated – that is, what is the genetic material? – it could not really be understood in that context. The structure itself was worked out and announced dramatically (Watson and Crick 1953) – and has been the subject of much historical writing and personal recollection (e.g., Watson 1969; Sayre 1975) – but DNA did not spring full blown from the famous note in Nature by Watson and Crick. Rather, it was preceded by nearly a century of work, some of it on DNA as a chemical of unknown function, dating to a time when protein was suspected of being the genetic material because it could be seen to have enough complexity that it might, in fact, embody a code passed through generations; DNA, by contrast, had four seemingly repeating bases, which seemed too simple for the purpose. Even after the association of DNA with the genetic material (Griffith 1928; Avery et al. 1944; Hershey and Chase 1952), a good deal of work was carried out in chemical characterization (Chargaff et al. 1950) that set the stage for the model eventually proposed by Watson and Crick. Now that the structure of both genes and genomes is understood in some depth, it is astonishing to sit back and consider the diversity of chemical, structural, and topological forms that they can take in different organisms. Details about the molecular mechanisms that operate at each of these stages of DNA replication are described in individual essays of this section.

The Genetic Code, Genes, and Chromosomes This section, which is often the introductory chapter in books about genetics or other advanced works in modern biology and biomedicine, and which will be included at a later time in the online edition, embodies the history of molecular biology and encompasses the central subject matter of


decades of research that started during World War II and was called molecular biology. Historically, a gene is the specifier for an inherited trait, a formulation that was very much in keeping with earlier genetic analysis that followed the inheritance of visible features, such as the examples familiar to students of genetics of smooth versus wrinkled seeds. As will be developed in this section, that definition is too limited; a modern definition would consist of the sequence on the genome that specifies the gene product along with any regulatory or otherwise functional associated regions. A phenotypic difference, it is now clear, can result not only from protein (or RNA) coding region differences but also from a single-nucleotide variation in a regulatory region that might regulate whether or not a gene product is expressed, or at what level it is expressed, or in response to which stimulus it is expressed. The genetic code specifies the relation between DNA sequence and protein sequence. Early workers suspected that there would be a triplet code (three bases specify an amino acid) because a singlet code would allow only four amino acids to be specified, and even a doublet code did not allow enough possibilities (16) to uniquely specify the 20 known amino acids. A triplet code (64) was certainly possible; in fact, it would overspecify the number of amino acids, causing speculation that some codons could serve as punctuation of various kinds. Furthermore, there was a period of uncertainty as to whether the genetic code was overlapping (123, 234, 345...) or nonoverlapping (123, 456, 789...). A historic experiment resolved this question: the ability to sequence peptides had become available by the late 1950s, and Tsugita and Fraenkel-Conrat, who were working with tobacco mosaic virus (TMV), exploited it (discussed in Jukes 1962). They reasoned that in an overlapping code, a single-base mutation would change two or three amino acids, whereas in a nonoverlapping code it would change only one. They thus treated TMV with a mutagen that was known to lead to a single-base change in RNA and determined that only one amino acid was changed in the cognate protein. This experiment, which showed that the genetic code was nonoverlapping, did not prove that it was


a triplet. That was suggested by another experiment, by Crick, Brenner, and their colleagues (Crick et al. 1961). Working with the Escherichia coli bacteriophage T4, they used a mutagen that introduced a different kind of mutation than the one Tsugita and Fraenkel-Conrat used; their mutagen caused a frameshift (by adding or deleting a base in DNA instead of changing it for a different one). Thus, reading out the DNA sequence after the frameshift would result in a different, irrelevant protein. They were able to sort the mutations into those that added a base and those that removed one, because combining a mutation from one group with a mutation from the other usually restored function (ability to cause an infection), whereas combining two from the same group did not. The important observation was that three (but not two) from either group often restored function. The reasoning was that if the mutations were fairly close together, then the short region of incorrect or missing amino acids between them might not render the whole protein unusable. As is now well known, the genetic code is nonoverlapping and redundant and minimizes the effects of mutation by rendering a moderate proportion of them silent. It is also universal – at least in the nucleus. As the first complete sequence of mitochondrial DNA (Anderson et al. 1981) showed, mitochondria use a modified genetic code. It became clear later that the mitochondrial genetic code is not universal among mitochondria but differs somewhat in different genera. The endosymbiont hypothesis of mitochondrial origination posits that they arose from the symbiotic fusion of an oxygen-utilizing prokaryote with an emerging eukaryotic cell, followed by gene movement and elimination. One conclusion from the present-day difference between genetic codes, in which there are differences in which codons are used to signal protein termination, is that gene flow between subcellular compartments is no longer possible.
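The counting argument and the two experiments described above can be checked with a short Python sketch. It is a toy illustration only: the example sequences are invented, the familiar "THE FAT CAT" sentence stands in for a run of codons, and the helper names are hypothetical. It makes no attempt to reproduce the actual TMV or T4 data.

```python
def codons_nonoverlapping(seq):
    """Read triplets as 123, 456, 789, ... (trailing partial triplet dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codons_overlapping(seq):
    """Read triplets as 123, 234, 345, ..."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def changed_codons(read, original, mutant):
    """Count positions where the two sequences yield different triplets."""
    return sum(a != b for a, b in zip(read(original), read(mutant)))


def insert_bases(seq, positions, base="N"):
    """Insert `base` before each index in `positions` (indices refer to `seq`)."""
    for p in sorted(positions, reverse=True):
        seq = seq[:p] + base + seq[p:]
    return seq


def tail_restored(original, mutant, n=4):
    """True if the last n nonoverlapping triplets are unchanged."""
    return codons_nonoverlapping(mutant)[-n:] == codons_nonoverlapping(original)[-n:]


# Number of distinct code words for singlet, doublet, and triplet codes.
code_sizes = [4 ** n for n in (1, 2, 3)]

# A single-base substitution (Tsugita and Fraenkel-Conrat style reasoning).
original = "AAACCCGGGTTT"
mutant = "AAACTCGGGTTT"  # one base changed
hits_nonoverlapping = changed_codons(codons_nonoverlapping, original, mutant)
hits_overlapping = changed_codons(codons_overlapping, original, mutant)

# Frameshifts (Crick et al. style reasoning), with letters standing in for bases.
message = "THEFATCATATETHEBIGRAT"
one_insertion = insert_bases(message, [3])
two_insertions = insert_bases(message, [3, 5])
three_insertions = insert_bases(message, [3, 5, 7])

print(code_sizes, hits_nonoverlapping, hits_overlapping)
print(codons_nonoverlapping(three_insertions))
```

With three nearby insertions only the short stretch between the first and last insertion is garbled, and the downstream words read correctly again, which is the behavior expected of a triplet code.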
Chromosome is a term applied to the genetic material organized as a nucleoprotein structure. Eukaryotic cell chromosomes in particular are examples of highly organized but fluid structures which, on the one hand, need to be enormously condensed to fit inside a nucleus and, on the other


hand, able to undergo DNA replication or regulation. Regulation requires that condensed regions be opened and available to transcription machinery in response to specific signals. It is perhaps not surprising that the structural proteins that help DNA carry out these tasks (histones) are among the most evolutionarily conserved known.

Prokaryotic Genomes This section, which will be included at a later time in the online edition, deals with genomes of prokaryotes. Prokaryotes consist of two major domains, Bacteria and Archaea. Bacteria are generally single celled, divide by fission, and lack internal membrane-bound structures like nuclei and mitochondria. Indeed, many bacteria are similar in size to mitochondria. Archaea are similar to bacteria in size and shape but differ in their evolutionary history, including the type of metabolic pathways they use. Prokaryotic genomes have been well studied. However, a caveat may be in order: on the one hand, this is a mature subject about which we know a great deal given that the foundations of molecular biology were worked out over the last 75 or so years with bacteria. On the other hand, we know a great deal about relatively few organisms; the vast majority of prokaryotes have not been explored at all and, indeed, have likely not yet even been found. Nevertheless, many of the ones that have been studied have, in fact, been sequenced in their entirety (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html, last accessed 26 Mar 2014). Prokaryotic genome size is between 10⁶ and 10⁷ base pairs (bp), about three orders of magnitude smaller than many eukaryotes. The intestinal bacterium Escherichia coli, the most widely studied in the laboratory, is ~5 × 10⁶ bp; mammals like mice or humans have ~3 × 10⁹. A bacterial genome typically consists of a single molecule of double-stranded DNA in the form of a circle. In addition to its circular genome, which in cells is complexed with protein and called a nucleoid, many prokaryotes contain additional genes located on separate, small circular DNA


molecules within the cell called plasmids (▶ Plasmid Genomes, Introduction to). The organization of bacterial genomes is based on the operon concept (Jacob and Monod 1961; Beckwith 2011). That means that genes coding for proteins that function at sequential steps in a particular pathway are typically clustered together and co-regulated. Regulation of operons is carried out by the binding of proteins to a regulatory sequence at the start of the operon, termed the operator. Depending on their function, operons can be normally active and turned off when no longer needed by the binding of a regulatory protein (termed the repressor) to the regulatory DNA region, or they can be normally inactive and turned on when needed by the inactivation of the repressor. One of the best studied examples of a bacterial operon, the E. coli lac operon, operates by being normally inactive, so as to not make lactose metabolizing proteins in the absence of lactose, but is turned on in the presence of lactose (and the absence of glucose) by the binding of a lactose metabolite to the lac repressor, inactivating it.
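The on/off behavior of the lac operon described above can be written as a two-input truth table. The Python sketch below is a deliberately simplified Boolean model: the function name is hypothetical, expression is treated as all-or-none, and the glucose effect is included only as the stated condition, without modeling its mechanism.

```python
def lac_operon_on(lactose: bool, glucose: bool) -> bool:
    """Simplified Boolean model of lac operon expression."""
    # The repressor stays bound to the operator unless a lactose
    # metabolite binds it and inactivates it.
    repressor_active = not lactose
    # Expression requires an inactive repressor and the absence of glucose.
    return (not repressor_active) and (not glucose)


truth_table = {
    (lactose, glucose): lac_operon_on(lactose, glucose)
    for lactose in (False, True)
    for glucose in (False, True)
}

for (lactose, glucose), on in truth_table.items():
    print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> operon on: {on}")
```

Only one of the four input combinations switches the operon on, which is why the cell avoids making lactose-metabolizing proteins when they would be of no use.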

Eukaryotic Genomes Eukaryotic genomes vary widely in their properties. In terms of amount of DNA, among the least is found in the fission yeast Schizosaccharomyces pombe (13.8 million bp). The largest amounts are found in some plants; a Lilium species has 300,000 million bp. Most mammals contain about 3,000 million bp. Several features distinguish eukaryotic genomes from prokaryotic ones. The most notable is that the genome is divided into chromosomes, DNA-protein complexes whose number varies from a low of 2 to more than 1,000 in some plants; humans have 23 pairs. Second, the genome (chromosomes) is isolated in a subcellular compartment, the nucleus. One way among many to mark the rate of progress in genomic research is by observing the number of species for which a complete genome sequence is now available. The first eukaryote (yeast) was sequenced in 1996; currently 183 eukaryotic genomes have been fully sequenced and


annotated (http://www.genomesonline.org, last accessed 26 Mar 2014). Many more (thousands) prokaryotic genomes have been completed since they are a thousand or more times smaller. A striking consideration about eukaryotic genomes consists of what is sometimes called the packaging problem. The basic problem, for example, for the human genome, is that the total length of DNA is on the order of 2 m, but it must be packaged to fit into a microscopic nucleus and packaged in such a way that specific regions can be activated or inhibited under specific conditions in particular cell types. Although this problem has been addressed in broad outline, it remains an area of active investigation. Another area of great current interest falls under what is broadly called genomics. In general, one representative of an organism is sequenced. In comparing a particular locus across species, the question arises as to whether any difference is a property of the species or a variation in the particular individual chosen for sequencing. To clarify this, more than one individual needs to be assessed. In the case of humans, where variation can be related to disease or other phenotype, extensive population characterization of variation has been carried out (http://hapmap.ncbi.nlm.nih.gov, last accessed 26 Mar 2014). A fuller discussion of eukaryotic genomes will be included at a later time in the online edition.
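The figure of roughly 2 m for the packaging problem can be recovered with simple arithmetic. The Python sketch below uses the genome size and copy number given in this article (about 3 × 10⁹ bp, two copies per nucleus) together with the standard rise of about 0.34 nm per base pair for B-form DNA. The nucleus diameter of 10 µm is an illustrative assumption, not a value taken from this article.

```python
HAPLOID_GENOME_BP = 3e9      # approximate mammalian genome size (from the text)
COPIES_PER_NUCLEUS = 2       # two genome copies in a nucleus (from the text)
RISE_PER_BP_M = 0.34e-9      # ~0.34 nm per base pair, B-form DNA
NUCLEUS_DIAMETER_M = 10e-6   # assumed illustrative diameter (10 micrometers)

# End-to-end length of all the DNA in one nucleus, in meters.
total_length_m = HAPLOID_GENOME_BP * COPIES_PER_NUCLEUS * RISE_PER_BP_M

# How many nucleus diameters that length corresponds to.
length_in_nucleus_diameters = total_length_m / NUCLEUS_DIAMETER_M

print(f"Total DNA length per nucleus: {total_length_m:.2f} m")
print(f"Length relative to nucleus diameter: {length_in_nucleus_diameters:.0f}x")
```

Under these assumptions the DNA is on the order of a hundred thousand times longer than the compartment that holds it, which conveys the scale of compaction that chromatin must achieve while still leaving specific regions accessible.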

Plant Genomes Information about plant genomes has continued to increase, although not at the rate information about animal genomes has accumulated. There are two major reasons for this. One is that plant genomes are complex. Plants contain three separate genomes, residing in the nucleus, the chloroplasts, and the mitochondria. The major genome, in the nucleus, can have considerable variation in size, a phenomenon long recognized and referred to as the C-value paradox. C-value refers to the haploid DNA content of an organism, and the so-called paradox was the lack of relationship between the C-value and the apparent complexity of the organism. As Childs and Buell


discuss, just among angiosperms there is a 2,380-fold difference between smallest and largest known genomes, with a mean, interestingly, near that of mammals. The other reason for the information disparity is that more support has been available, largely through the National Institutes of Health but also from numerous charities often targeting specific diseases, for the more directly human health-related work thought to derive from animal genomics. What further complicates analysis of plant genomes in addition to size variation is the nature of the sequences. Plants contain a large number of transposable elements – fragments of DNA that can move from one location to another – resulting, as Jiang describes, in insertions, deletions, duplications, and chromosomal inversions. Their ability to amplify allows them in some cases to account for the majority of the DNA sequence of a species. For example, maize contains well over a million copies of transposable elements, which together make up 84% of its genome. And, as Shiu discusses in analyzing the causes of genome expansions, the ability of transposable elements to expand is per se a major reason for genome size increases. Finally, plants are frequently polyploid, with heterozygosity between the copies sufficiently extensive as to inhibit genome assembly. Therefore, as Buell describes, special versions must be bred to allow ready genome sequencing. The existence of so much sequence diversity in plants, as Hansey discusses, leads to phenotypic diversity. In a model organism like Arabidopsis thaliana, which has a relatively small and fully sequenced genome, genome diversity (single-nucleotide polymorphisms, insertions and deletions, copy number variation) can be evaluated and alleles responsible for phenotypic variation can start to be determined.

Plasmid Genomes Plasmids are autonomously replicating extrachromosomal DNA elements. They are of considerable interest, for at least two reasons. One is that they have been harnessed by molecular biologists as invaluable tools in modern molecular genetics;


indeed, the original work on recombinant DNA technology more than 30 years ago was facilitated by engineered plasmids that allowed insertion of foreign DNA into bacterial cells along with the use of selectable markers in the plasmids to identify and isolate those bacterial cells that contained them. The second reason is that plasmids have allowed a wide range of genes that encode resistance to rarely encountered environmental perils to exist in a population of organisms, ensuring survival of the species if the particular peril is encountered, but not overly burdening the genome with coding capacity for rarely if ever used genes in particular organisms. Antimony is one example among many of a toxin that is relatively uncommon. Considering the number of environmental toxins a bacterium might encounter, and postulating one or several genes devoted to detoxification for each, the bacterial genome would swell substantially if it contained the information for coping with each potential danger. Instead of devoting resources on an ongoing basis to anticipating infrequent perils, bacteria and other microorganisms have “outsourced” these genes to extrachromosomal DNA. Plasmids containing genes for a particular danger are not present in every organism but are often present in the population. Here we are not emphasizing this basic plasmid functional biology but, rather, focusing on the plasmid genomes themselves – their types, their replication, and the way they interact with their hosts. The introductory article by Thomas and Frost lays out and summarizes the major issues such as types of replication, copy number control, and regulation of replication. Many of these topics are treated in more detail in the specialized articles that follow. For example, Van Houdt and Mergeay describe chromids, which are large plasmids that carry genes indispensable for cell viability. However, a chromid differs from a chromosome in using a plasmid type of replication system.
In addition, chromid genes evolve more rapidly than chromosomal ones. Two plasmids cannot always coexist stably in the same cell line. Their inability to do so, called plasmid incompatibility, is described by Thomas, who points out that typically incompatibility
means that two plasmids are closely related and are competing for the same proteins. However, other mechanisms have also been observed. Suzuki, Brown, and Top discuss plasmid host range, which is their ability to maintain themselves in bacteria of divergent species. Clearly, the greater the host range, the more rapidly a plasmid can spread genes in its environment. Since it is difficult to test each organism in an environment but sequences of different species are accumulating, Suzuki et al. discuss predicting plasmid host range from genomic signatures, which are evaluated from a plasmid’s and a potential host’s nucleotide sequence. Stekel discusses the modeling of plasmid regulatory systems. In nature plasmids have evolved to contribute a negligible burden to the host’s metabolism. However, understanding this burden is necessary both for designing plasmids not found in nature and for trying to manipulate the burden for therapeutic or other interventions. Starting with simple models, considerable success in mathematical modeling has been achieved thus far in building dynamic and realistic models. Kreft further discusses mathematical modeling of plasmid dynamics, reviewing models that have given useful results in appropriate environments, and then moves on to models that incorporate spatial structure and are oriented to individual organisms rather than only population-level properties.
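The genomic-signature idea described above can be illustrated with a short sketch. It assumes that a signature is simply the vector of trinucleotide frequencies of a sequence and that similarity is measured by plain Euclidean distance; both are simplifying assumptions made here for illustration, not the statistics used in the cited essay. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_signature(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer in seq (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence has no usable k-mers")
    return [counts[m] / total for m in kmers]


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signatures (an assumed, simple metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def rank_hosts(plasmid_seq, host_seqs, k=3):
    """Return host names ordered from most to least similar signature."""
    plasmid_sig = kmer_signature(plasmid_seq, k)
    distances = {
        name: signature_distance(plasmid_sig, kmer_signature(seq, k))
        for name, seq in host_seqs.items()
    }
    return sorted(distances, key=distances.get)


# Toy sequences, invented for illustration only.
toy_plasmid = "ATTATAAT" * 5
toy_hosts = {
    "host_GC_rich": "GCCGCGGC" * 8,
    "host_AT_rich": "ATTATAAT" * 8,
}
print(rank_hosts(toy_plasmid, toy_hosts))
```

In this toy case the AT-rich host ranks first because its trinucleotide usage matches that of the plasmid. Real sequences would need far longer inputs and a distance measure that accounts for the variability of the signature within a host genome.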

Mitochondrial Genomes

The discovery of DNA in mitochondria (Nass and Nass 1963; Schatz et al. 1964), which helped to rationalize years of prior genetics results on cytoplasmic or non-Mendelian inheritance that had been studied largely in plants and in fungi, started what has been more than a half century of characterization of these genomes and their role in cellular function. This initial discovery in yeast was soon shown to be true also in mammalian cells (Corneo et al. 1966). Work on mitochondrial DNA (mtDNA) can be largely divided into work on animal mtDNA and work on all others, although all others contain vastly more types of organisms. This dichotomy
has two primary causes. One is that animal mtDNA, already known to be small, was also the beneficiary of one of the early milestones in the field, the discovery that mammalian mtDNA is circular (Van Bruggen et al. 1966); in fact, all vertebrate mtDNA is circular and has a similar genetic map. This discovery came soon after the initial discovery of closed circular DNA in small tumor viruses (Dulbecco and Vogt 1963; Weil and Vinograd 1963). Because the discovery was both novel and associated in the minds of the discoverers with tumorigenicity, substantial work took place on the properties of circularity and the associated property of supercoiling, including, soon enough, a purification method based on the discovery that the intercalating drug ethidium bromide showed restricted binding to closed circular DNA. Once mtDNA, which constitutes about 0.1% of total cellular DNA, could be highly purified, considerable strides could be made in characterizing it, leading within about 20 years to its complete nucleotide sequencing – one of the first and at that time the largest DNA to be sequenced (Anderson et al. 1981). The second cause was the discovery that mtDNA mutations could be linked to human disease (Holt et al. 1988; Wallace et al. 1988) and, additionally, to human migrations (Wallace 2005). These discoveries caused human and other animal mtDNAs to become the objects of considerable interest and, importantly, research resources. However, as the overview essay by Gray, as well as the individual essays, makes clear, mtDNA throughout eukaryotic life comes in a wide variety of sizes and forms. Although its gene content has a focus of energy metabolism, the form genomes have taken in descent from a bacterial ancestor is diverse. 
Furthermore, as compared to animal mtDNAs, with 37 genes encoding 13 respiratory proteins, 22 tRNAs, and 2 rRNAs, some genomes contain some or all ribosomal protein genes, some contain only some or no tRNA genes, and many contain additional and still unidentified open reading frames. Lee and Hua, in their essay about Archaeplastidian algae, describe mitochondrial genome variation in size by nearly 20-fold, variation in structure (circular, linear, branched, fragmented), and variation in information content (up to a fivefold difference in number of genes). Slamovits, describing alveolate protists, finds linear genomes in all sequenced cases containing 20–44 genes. One alveolate lineage, Apicomplexa, contains mitochondrial genomes generally between 6 and 8 kb and in one case appears to lack mtDNA altogether. In addition, some alveolates contain fragmented rRNA genes that must be edited. RNA editing is also present in a number of other groups, including Amoebozoa, as described by Miller. Chromista, as O’Brien and Lane discuss, contains a threefold variation in genome size as well as novel features not seen outside this group of organisms, such as large intergenic regions and some of the open reading frames. Novelty is also a feature of supergroup Excavata: as pointed out by Lukes, some members contain the largest gene set seen in mtDNA, with an operon type of organization, whereas other members have lost mtDNA altogether. More surprises come in fungal groups: although most fungal mitochondrial genomes are unsurprising, containing circular chromosomes with standard mitochondrial gene features, Lang points out that less well-characterized and faster-evolving lineages have many novel features. They contain different genome architectures, genetic code changes, unorthodox initiator tRNA structures and translation initiation mechanisms, and, most recently, group I intron-mediated mRNA trans-splicing. For vertebrate animals, Bogenhagen summarizes the features, perhaps best known for human mtDNA, that have been conserved for about 500 Ma. Invertebrate animals, by contrast, as Lavrov describes, reveal substantial diversity. Particularly in non-bilateral animals, gene content, genome architecture, and the mode and tempo of genome evolution show considerable variation. Both of the above groups are part of Metazoa (multicellular animals).
Unicellular animals, reviewed by Lavrov and Lang, also show wide variation. A member of Choanoflagellata, protists most closely related to the Metazoa, has a circular genome more than 76 kb in size, 86% A + T in composition, and 53 identified genes. More distant relatives include ones with linear genomes, repeat-containing noncoding regions, and greater size containing more genes. Lastly, Bonen discusses land plants, which contain some of the most unusual mitochondrial genomes. Plants such as liverworts and mosses have mitochondrial genomes of approximately 100–200 kb with a conservative set of about 70 genes, some of which retain bacterial-operon organization. By contrast, vascular plant mitochondrial genomes range from 200 to 3,000 kb. These mitochondrial genomes recombine readily and so exist in various physical forms and stoichiometries, as well as containing incorporated nuclear and chloroplast sequences. In addition, RNA editing of message takes place.

Chloroplast Genomes

Like mitochondria, chloroplasts have their own genomes. And, like mitochondria, they are postulated to have arisen by endosymbiosis, with most genes being transferred to the nucleus over time. As Childs and Buell write, this is an ongoing process that is still occurring both in plant mitochondria and in chloroplasts. Although novel findings have been seen in particular taxa (Knight et al. 2001), both chloroplast and mitochondrial genomes in plants use the universal genetic code, whereas in the animal kingdom, and in all metazoans examined, variations in the code occur that would prevent current transfer of organelle genes to the nucleus in a straightforward way. A full discussion of chloroplast genomes will be included at a later time in the online edition. In general, chloroplast genomes consist of circles that are 100–200 kb and that contain two inverted repeats, thereby dividing the molecule into single-copy and repeat regions. The inverted repeats contain the rRNAs and some other genes, but most genes are found in the single-copy regions. The single-copy genes are of three basic types. First, they contain chloroplast versions of genes for basic machinery that are also present in mitochondrial genomes. These include rRNAs, tRNAs, and core subunits of respiratory complexes such as NADH dehydrogenase, cytochrome c oxidase, and ATP synthase. Second, they contain genes for a number of proteins similar to ones found in mitochondria, where the corresponding genes are often (always, in the case of animal mitochondria) found in the nucleus. Examples are some subunits of chloroplast (cl)-RNA polymerase, cl-DNA polymerase, some ribosomal proteins, and others. Third, and importantly, they contain a number of genes not found elsewhere in the cell that suit their role in the major function of plants, photosynthesis. Examples of these are genes for photosystem I and II. It is interesting, because it is so puzzling, to compare organelle genomes in terms of their basic properties. mtDNAs from metazoan animals, whose size and gene order are essentially constant, use all of their genetic information for coding RNA or protein, except for the approximately 1-kb regulatory region. To reduce genome size, there is an example of overlapping genes on opposite strands, and there is even an example of laboring to save a single base pair: at the 3′ end of the gene for COX1, TCT encodes the final amino acid, serine, leaving AG before the 3′ end of tRNASer; polyadenylation of the mRNA subsequently creates AGA, the termination codon. Compared to this aggressive saving of a single base, one of the striking features of chloroplast genomes (and of higher plant mitochondrial genomes) is their large size and great size variability.
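The single-base economy described for COX1 can be made concrete with a small sketch. It assumes the vertebrate mitochondrial genetic code, in which AGA and AGG are listed as termination codons alongside TAA and TAG; the short open reading frame and the length of the poly(A) tail are hypothetical, chosen only so that the reading frame ends in TCT followed by AG, as in the text.

```python
# Termination codons in the vertebrate mitochondrial code
# (NCBI translation table 2), written in DNA letters.
VERT_MITO_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def codons(seq):
    """Split seq into consecutive triplets, dropping any incomplete tail."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def first_stop(seq):
    """Return (codon index, codon) of the first in-frame stop, or None."""
    for index, codon in enumerate(codons(seq)):
        if codon in VERT_MITO_STOPS:
            return index, codon
    return None


# Hypothetical toy reading frame ending in TCT (serine), followed by the
# two bases AG that remain before the downstream tRNA gene.
toy_orf = "ATGGCATCT"
primary_transcript = toy_orf + "AG"

# Polyadenylation appends A residues; the tail length here is arbitrary.
polyadenylated = primary_transcript + "A" * 10

print(first_stop(primary_transcript))
print(first_stop(polyadenylated))
```

Before polyadenylation the toy transcript has no complete in-frame stop codon; once the first A of the tail is added, the trailing AG becomes AGA in frame.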

Viral Genomes

Much of the early knowledge of molecular biology originated from the use of viruses both as objects of curiosity and as research tools. The early studies in E. coli benefitted from use of both the virulent T bacteriophages and the lysogenic phages, prototypically bacteriophage lambda, that could either be virulent or instead integrate into the host chromosome as a benign passenger. Their relatively compact genomes could be analyzed with the less sophisticated tools available ~50 years ago, and they were also among the first molecules whose DNA sequence was determined when the technology became available. Animal DNA viruses were able to effect oncogenic transformation and thus served as an early probe into cell transformation. In addition, as noted earlier, the fact that their genomes were closed circular drove a period of intense investigation of the inherent properties of circularity and of technology that would allow their separation from nuclear DNA. A discussion of viral genomes will be included at a later time in the online version.

Future Outlook

The future of work on genes and genomes continues to be full of interest, along with the occasional surprise, and there is no reason to think this will end any time soon. Much is known, to be sure. Still, lest we get too convinced that the main principles have been discovered and the most exciting work is behind us, it may be useful to remember a period about 40 years ago, after the broad details of replication, transcription, and translation had been worked out in microorganisms but before the discovery of introns, the regulatory RNAs, and so much else, when it seemed credible to claim that the period of great discovery in molecular genetics had come to an end and that all that remained was to fill in details. Some of the areas where we can expect to see new advances are:
• Prokaryotic genomes: Synthetic biology promises to better define the genes that constitute a minimum requirement for life and to inform the question of what additional genes both add and cost.
• Eukaryotic genomes: Some major issues in the field are regulation; codes for modifications about which little is known, such as protein glycosylation; and the role of noncoding RNA. On the last point, it has been known for some time that the protein-coding part of the eukaryotic genome occupies on the order of 3% of the genome, and for some years much of the remainder was considered “junk” DNA, possibly the vestiges of failed evolution experiments in the past. More recently, it became clear that most of the genome is transcribed and that this noncoding RNA is in some cases as conserved as coding regions, or more so. The functions of this noncoding RNA are just beginning to be unearthed. Furthermore, illuminating the functional effects of noncoding variation is a major area that is still in its infancy.
• Plant (including chloroplast and mitochondrial) genomes: The expansion of sequencing projects to plant materials promises a considerable growth in the amount of data available.
• Mitochondrial genomes: Among mammals, much is known by way of physical description because mtDNA is so readily sequenced and because, in humans, disease-causing mutations receive considerable study and collection (www.mitomap.org). However, there is a disconnect between major phenotypic effects attributed solely to mtDNA (Sharpley et al. 2012) and genome-wide polymorphism studies in populations (www.ncbi.nlm.nih.gov/gap) that typically do not examine mtDNA. In the future, studies that examine interaction between nuclear and mitochondrial variations are likely to be of considerable interest. Also likely to be of interest is regulation of the mitochondrial genome based on sensing of the cell’s metabolic state, a field that is in its early stages. The discovery that proteins thought to function in the cytoplasm and nucleus are also found in the mitochondria, at least conditionally, is driving discovery of the signaling pathways in which such proteins participate (e.g., Leigh-Brown et al. 2010; De et al. 2012).

Cross-References
▶ DNA Replication
▶ DNA Replication, Chemical Biology of
▶ DNA Topology and Topoisomerases
▶ Gene Regulation
▶ Genomic Sequence and Structural Diversity in Plants
▶ Mitochondrial Genomes
▶ Plant Genomes: From Sequence to Function Across Evolutionary Time
▶ Plasmid Genomes, Introduction to

References
Anderson S, Bankier AT, Barrell BG, De Bruijn MHL, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, Schreier PH, Smith AJH, Staden R, Young IG (1981) Sequence and organization of the human mitochondrial genome. Nature 290:457–465
Avery OT, Macleod CM, McCarty M (1944) Studies on the chemical nature of the substance inducing transformation of pneumococcal types: induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J Exp Med 79:137–158
Beckwith J (2011) The operon as paradigm: normal science and the beginning of biological complexity. J Mol Biol 409:7–13
Chargaff E, Zamenhof S, Green C (1950) Composition of human deoxypentose nucleic acid. Nature 165:756–757
Corneo G, Moore C, Sanadi DR, Grossman LI, Marmur J (1966) Mitochondrial DNA in yeast and some mammalian species. Science 151:687–689
Crick FH, Barnett L, Brenner S, Watts-Tobin RJ (1961) General nature of the genetic code for proteins. Nature 192:1227–1232
De S, Kumari J, Mudgal R, Modi P, Gupta S, Futami K, Goto H, Lindor NM, Furuichi Y, Mohanty D, Sengupta S (2012) Recql4 is essential for the transport of p53 to mitochondria in normal human cells in the absence of exogenous stress. J Cell Sci 125:2509–2522
Dulbecco R, Vogt M (1963) Evidence for a ring structure of polyoma virus DNA. Proc Natl Acad Sci U S A 50:236–243
Griffith F (1928) The significance of pneumococcal types. J Hygiene 27:113–159
Hershey AD, Chase M (1952) Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol 36:39–56
Holt IJ, Harding AE, Morgan-Hughes JA (1988) Deletions of muscle mitochondrial DNA in patients with mitochondrial myopathies. Nature 331:717–719
Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318–356
Jukes TH (1962) Relations between mutations and base sequences in the amino acid code. Proc Natl Acad Sci U S A 48:1809–1815
Knight RD, Freeland SJ, Landweber LF (2001) Rewiring the keyboard: evolvability of the genetic code. Nat Rev Genet 2:49–58
Leigh-Brown S, Enriquez JA, Odom DT (2010) Nuclear transcription factors in mammalian mitochondria. Genome Biol 11:215
Nass MM, Nass S (1963) Intramitochondrial fibers with DNA characteristics. I. Fixation and electron staining reactions. J Cell Biol 19:593–611
Sayre A (1975) Rosalind Franklin and DNA. Norton, New York
Schatz G, Haslbrunner E, Tuppy H (1964) Deoxyribonucleic acid associated with yeast mitochondria. Biochem Biophys Res Commun 15:127–132
Sharpley MS, Marciniak C, Eckel-Mahan K, Mcmanus M, Crimi M, Waymire K, Lin CS, Masubuchi S, Friend N, Koike M, Chalkia D, Macgregor G, Sassone-Corsi P, Wallace DC (2012) Heteroplasmy of mouse mtDNA is genetically unstable and results in altered behavior and cognition. Cell 151:333–343
Van Bruggen EF, Borst P, Ruttenberg GJ, Gruber M, Kroon AM (1966) Circular mitochondrial DNA. Biochim Biophys Acta 119:437–439
Wallace DC (2005) The mitochondrial genome in human adaptive radiation and disease: on the road to therapeutics and performance enhancement. Gene 354:169–180
Wallace DC, Singh G, Lott MT, Hodge JA, Schurr TG, Lezza AM, Elsas LJ 2nd, Nikoskelainen EK (1988) Mitochondrial DNA mutation associated with Leber’s hereditary optic neuropathy. Science 242:1427–1430
Watson JD (1969) Double helix. Atheneum, New York
Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171:737–738
Weil R, Vinograd J (1963) The cyclic helix and cyclic coil forms of polyoma viral DNA. Proc Natl Acad Sci U S A 50:730–738

Genetic Element Mobility, Regulation of

Adam R. Parks¹ and Joseph E. Peters²
¹ Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
² Department of Microbiology, Cornell University, Ithaca, NY, USA

Synopsis

Mobilization of genetic elements is typically a highly regulated process. Mobile elements must be active enough to maintain presence within a host, yet not move so frequently that they harm the host. Mobilization can be
coordinated with cellular metabolism and timed to coincide with specific cellular events. Expression of recombinase genes is a key step for the activation of mobility; however, there are additional factors that modulate recombinase activity. The formation of nucleoprotein complexes that are capable of DNA strand exchange is also a key event that is heavily regulated. Once a functional nucleoprotein complex is formed, each intermediate step in the recombination process often increases the stability of the complex, driving reactions toward completion. The process of target DNA selection is also regulated in certain genetic elements. Transposons move between sites that lack DNA homology, and some can select specific sites within the target DNA to insert into. Certain mobile elements can also discourage the insertion of other similar elements into the same target DNA through a process called target-site immunity.

Introduction

As parasitic entities, the success of mobile genetic elements is dictated both by the ability of the element to propagate and by the survival of the host organism. This means that elements must be active enough to maintain presence within a host, yet not too active, potentially harming the host with each new insertion (Craig 1997). Transposons and other mobile elements can harm a host in a variety of ways, for example, by interrupting essential host genes, causing damage to host chromosomes, destabilizing host genomes, or adversely affecting gene expression. In addition to the direct effects that transposons may have on host genomes, host repair mechanisms that are involved in repairing damaged DNA following a transposition event are often inherently error prone. Mobile elements are regulated by both host-encoded and element-encoded factors that coordinate mobilization and prevent or mitigate genotoxic damage (Fig. 1). Critical checkpoints for mobilization include:
• Expression of recombinase proteins
• Activity of recombinases
• Assembly of nucleoprotein complexes (i.e., transpososomes)
• Pairing of mobile element ends
• DNA strand cleavage
• Target DNA engagement
• DNA strand exchange
• Disassembly of recombinase complexes
• Repair of recombination intermediates

Each of these steps in the mobilization cycle can be regulated, affecting the frequency, timing, and location of mobile element insertion, thus affecting the risk of genotoxic damage wrought on the host (Gueguen et al. 2005; Nagy and Chandler 2004). For the purpose of this discussion, focus will be primarily on how transposons regulate mobility, as they embody many of the challenges that elements face in the mobilization cycle.

Genetic Element Mobility, Regulation of, Fig. 1 Controls on transposon activity can be found throughout the transposition cycle. (a) Expression of the transposase can be controlled by promoter activity. Some transposons encode incomplete -35 and -10 regions and must have a usable -35 region supplied by the host DNA molecule (blue triangles indicate inverted repeats and the red line indicates transposon DNA). In some elements, the transposase gene is interrupted by a frameshift (blue arrow with asterisk indicating a stop codon), truncating the gene product. The DNA-binding portion of the transposase (pink square) may be produced without the catalytic domain (pink half circle). Translational frameshifting or transcriptional slippage may produce a full-length gene product. (b) The DNA-binding domain by itself can act as a repressor of transcription. Truncated versions of the transposase may bind to the ends of the element, but are not active for transposition. (c) Assembly of the transpososome may require additional host factors (yellow hexagons) that modify the structure of the DNA. (d) Some transposons encode additional factors (orange circle), or interact with host factors, to select a site within the target DNA for insertion. (e) For some elements, a new insertion may not occur close enough to a -35 region that can be used by the RNA polymerase to initiate transcription, which disrupts the promoter for the transposase gene. This will prevent expression of the transposase gene and prevent the transposon from mobilizing again.

Regulation of the Timing of Transposition

Coordination of transposition with cellular events has been observed in diverse transposons. These cellular events include DNA replication, transfer of DNA between cells, transcription, and gene silencing. There is also evidence that mobile elements may be able to assess the metabolic status of the cell, enabling mobilization events to coincide with specific growth conditions, such as rapid growth or stress conditions (Peters and Craig 2001). Transposition is typically controlled by altering either the production or the activity of the recombinase enzyme in one or a combination of the following ways:
• Transcription of the recombinase gene
• Translation of the recombinase gene
• Posttranslational modification of the recombinase enzyme
• Interaction of the recombinase enzyme with other host- or element-encoded factors
• Synapsis of donor and target DNAs
• Reactivity of the DNA ends of the element

Transcription of recombinase genes is typically very low under most conditions; however, there are some notable exceptions, such as IS186 (Nagy and Chandler 2004). When transcription of the recombinase gene is not held at a low level, the recombinase is often tightly regulated in one of the other ways listed (e.g., initiation of translation may be inefficient, preventing abundant production of the recombinase). For insertion sequences and transposons, the promoter that drives the expression of the transposase gene is often encoded within one of the terminal repeats at the end of the element (Fig. 1). In some cases, these promoters are incomplete on their own and require insertion of the element into a site where a partial promoter sequence already exists. For some transposons, additional regulatory elements may influence the amount of transcription from these promoters. The presence of promoters within the terminal repeats of the element also enables regulation of the promoters by the binding of the recombinase protein or truncated portions of the recombinase (Fig. 1). Recombinase binding often blocks transcription of the recombinase gene itself.
In some systems, only the DNA-binding portion of the recombinase is abundantly translated, often encoded within the N-terminal portion of the gene. Only rarely is the catalytic domain of the recombinase produced along with the DNA-binding portion, limiting the amount of functional recombinase that is in the cell at a given time. Mobile elements employ a variety of mechanisms that separate the DNA-binding domain of the recombinase from the catalytic domain. In IS120, the coding sequence of the N-terminus of the transposase gene is out-of-frame with the coding sequence of the catalytic domain, and slippage of the transcription apparatus is required to generate transcripts with the two coding sequences fused (Baranov et al. 2006). For other elements, such as IS1, IS2, and IS911, fusion of the catalytic and DNA-binding domains depends on programmed translational frameshifting, whereby slippage of the ribosome generates a fused protein product. Host factors may also contribute to the activity of the recombinase promoter, either by inhibiting transcription or by activating transcription. IHF, HU, H-NS, cAMP-CRP, and Fis have been shown to affect the activity of a variety of mobile element promoters, including IS2 and the integron-associated integrase IntI1 (Cambray et al. 2010; Grindley et al. 2006; Gueguen et al. 2005). In IS50 and IS10, the activity of the promoter is also dictated by the methylation status of the DNA ends, enabling the coordination of transposase production with active replication (Nagy and Chandler 2004). In Escherichia coli, the Dam methylase adds a methyl group to the adenine of GATC sites, but this process occurs at a slow rate compared to DNA replication. Hemimethylation is a condition in which the GATC sites of a single strand of a DNA molecule lack methyl groups, and this condition coincides with recent DNA replication, since the Dam methylase has not had time to fully methylate the DNA.
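The link between replication and hemimethylation can be sketched as a toy state model. The sketch assumes only what is stated above: Dam marks the adenine in GATC on both strands, and a newly synthesized strand starts out unmethylated. The class, the function names, and the example sequence are hypothetical, and the model ignores the kinetics of remethylation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatcSite:
    position: int            # 0-based start of GATC on the top strand
    top_methylated: bool
    bottom_methylated: bool

    @property
    def status(self):
        marks = self.top_methylated + self.bottom_methylated
        return {0: "unmethylated", 1: "hemimethylated", 2: "fully methylated"}[marks]


def find_gatc(seq):
    """Start positions of GATC; the site is palindromic, so both strands share them."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 3) if seq[i:i + 4] == "GATC"]


def dam_methylate(sites):
    """Dam acting to completion: every site methylated on both strands."""
    return [GatcSite(s.position, True, True) for s in sites]


def replicate(sites):
    """Semiconservative replication: each daughter duplex keeps one parental
    strand and gains one new strand that carries no methyl groups yet."""
    daughter_a = [GatcSite(s.position, s.top_methylated, False) for s in sites]
    daughter_b = [GatcSite(s.position, False, s.bottom_methylated) for s in sites]
    return daughter_a, daughter_b


# Hypothetical transposon-end sequence containing two GATC sites.
toy_end = "CTGATCTTACGGATCA"
parental = dam_methylate([GatcSite(p, False, False) for p in find_gatc(toy_end)])
daughter_a, daughter_b = replicate(parental)

print([s.status for s in parental])
print([s.status for s in daughter_a])
print([s.status for s in dam_methylate(daughter_a)])
```

In this model every site in both daughter duplexes is hemimethylated immediately after replication and returns to the fully methylated state once Dam has acted, which is the transient window that a methylation-sensitive promoter could exploit.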
The increase in transposase production following replication enables transposition to occur when (i) the double-strand DNA break caused by transposition may be used to initiate homologous recombination with the sister chromosome, and (ii) the transposon may move in front of the active DNA replication fork, resulting in a duplication of the element. More than one promoter may drive expression of transposase genes too, as in IS50, where full-length transposase is expressed from a different promoter than the DNA-binding domain (Reznikoff 2003). The transposase of Tn10 also has a promoter that is encoded within the open reading frame of the transposase gene, producing a truncated, nonfunctional form of the transposase at much higher levels, even when DNA is methylated. Once the recombinase gene is transcribed, the transcript may also be subject to regulation, preventing its translation. Some messenger RNAs are particularly sensitive to degradation by host nucleases, limiting the window of opportunity for translation of the RNA to a very short time frame or ensuring that translation of the RNA is only achieved if translation is coupled with transcription. Rho utilization sites (rut) within the mRNA may also recruit the Rho termination protein, promoting transcription termination if translation is not actively occurring on the mRNA. Secondary structures in the mRNA, or the annealing of small RNAs, generate double-stranded RNAs that are a substrate for RNase III. RNase III processing of transcripts renders them more sensitive to other nucleases or separates the coding sequence for the recombinase from other regulatory elements, such as a ribosome-binding site that would be required for translation of the transcript. Secondary structures in the mRNA may also block the binding of the ribosomes to the ribosome-binding sequence, as observed in IS10 and IS50. Some mRNAs produced by mobile elements, such as IS186, lack leader sequences altogether and depend on translation occurring without the presence of a canonical ribosome-binding site (Nagy and Chandler 2004). After translation, the recombinase may be processed by proteases, deactivating the enzyme or altering its activity. In bacteria, the ClpXP and Lon proteases are involved in the processing of transposase proteins in bacteriophage Mu and IS903, respectively. In eukaryotic retroviruses and retrotransposons, proteolysis plays an even greater role in integration and transposition.
In HIV and Ty elements, mRNA is initially translated into large polyprotein precursors that must be cleaved before the proteins may carry out their functions (Merkulov et al. 2001). The importance of proteolysis in these elements is highlighted by the success of protease inhibitors, which are a component of the highly effective regimen known as highly active antiretroviral therapy (HAART) currently in use against HIV infections. Overexpression of recombinase proteins does not necessarily lead to high levels of recombination. In eukaryotes, some transposases, such as those found in Mos1 and Himar1 mariner, are regulated by a process termed transposition overproduction inhibition (Bire et al. 2013). In this mechanism, high concentrations of transposase protein form inactive soluble oligomers that are sequestered away from target DNA. For recombination to occur, the recombinases bound to DNA ends of the genetic element must be brought together into a nucleoprotein complex; for transposons, this complex is referred to as the transpososome (Gueguen et al. 2005). The transpososome establishes the appropriate arrangement of DNA and protein that leads to transposition, and its formation has been recognized as a key step in the regulation of this form of recombination. As transposition reactions proceed, the intermediate steps of the transpososome increase in stability, driving reactions toward completion. Assembly of the transpososome begins with the binding of transposon DNA ends by transposase protein. The ends are brought together and paired, forming a paired-end complex (PEC). The process of end pairing is sometimes also referred to as synapsis. PEC formation has very specific DNA topological requirements and often includes the participation of various host proteins (Gueguen et al. 2005). Supercoiling of DNA is often an important factor in synapsis, and elements such as Mu contain sequences that recruit gyrase or topoisomerase to promote proper supercoiling. For Tn10, IHF protein is required, in the absence of supercoiling, to assemble the paired-end complex. In Tn5, DnaA boxes and Fis-binding sites link transpososome assembly with cellular physiology (Reznikoff 1993; Weinreich and Reznikoff 1992).
The Fis protein is a nucleoid-associated protein and is abundant during early exponential phase, and DnaA is the master-regulator protein involved in DNA replication initiation. Fis binds preferentially to hemimethylated DNA, stimulating transposition after recent replication of the transposon
(Weinreich and Reznikoff 1992). DnaA binding to Tn5 ends also appears to stimulate transposition in vivo by a mechanism that is not yet understood. For most mobile elements, it is important that both ends are cleaved at the same time to prevent DNA damage. Some elements ensure that both ends are activated at the same time by a trans-cleavage process, wherein recombinases bound to the right end of the element actually perform strand cleavage and strand transfer at the opposite end of the element. Transposases that act at a single end, such as Mos1, can cause so-called self-inflicted wounds, since the resulting DNA lesions are often mutagenic and can inactivate the element (Lipkow et al. 2004; Lohe et al. 2000).

Some elements, such as Tn7 and Mu, require interaction with the target DNA molecule before the donor DNA can be cleaved; others do not have this requirement and can excise from the donor DNA without first identifying a target. The latter transposable elements can excise and never reinsert into a target DNA, a process referred to as abortive transposition. Elements that use conservative site-specific mechanisms can steer the outcome of reactions through DNA-bending or DNA-bridging proteins that change the architecture of the synaptic complex. An example of this phenomenon can be found in bacteriophage lambda, where Int and IHF alone promote insertion of the bacteriophage genome into the E. coli chromosome, whereas Int, IHF, and Xis together form a separate synaptic complex that drives excision of the lambda genome from the chromosome. A similar principle has been observed in transposons. For example, IHF influences DNA-bending events that change the pattern of target-site selection in Tn10, a process called channeling (Kleckner et al. 1996). Channeling leads to insertion of the element nearby and can promote the formation of compound transposons, potentially capturing nearby host-encoded genes.
In the case of Tn7, transposition does not occur until an appropriate target DNA molecule has been bound by specialized target-site selection proteins, either TnsE or TnsD, along with additional protein host factors. Some mobile elements, IS10, for example, require that target sites have a certain degree of physical flexibility, so that the necessary bent DNA structures can be adopted.

Target-Site Selection Virtually all mobile elements have a preference for the sequences into which they may insert. In the case of conservative site-specific recombination, the DNA-binding sequence of the recombinase determines the sequence specificity of recombination. Transposases often have preferred DNA-binding sequences or bind to DNA with certain structural characteristics. For IS10 and some non-LTR transposons, inherent DNA flexibility is an important feature of target DNA molecules. In some cases, target DNA sequences share some sequence similarity to the ends of the transposon. For some elements, such as the Ll.LtrB group II mobile intron in Lactococcus lactis (Bonen and Vogel 2001) and the IS200/IS605 family of DNA transposons (Barabas et al. 2008), base pairing between donor and target molecules serves an important role in selection of an insertion site.

Some mobile elements, such as Tn7 and related elements, have dedicated target-selection proteins that bind to target DNAs and direct transposition into those sites (Fig. 1). Tn7 has two such proteins, TnsE and TnsD, which independently identify two different classes of targets. Along with two other host factors, TnsD binds to a specific, highly conserved DNA sequence within the coding region of the glmS gene. This class of target selection ensures that Tn7 can find a target site, called attTn7, in almost any microorganism, without the risk of disrupting any genes. TnsE directs transposition into actively replicating DNA, with a strong bias toward conjugal plasmids that are being transferred into cells. TnsE identifies these targets through an interaction with specific DNA structures and an essential component of the DNA replication machinery, the β processivity factor.

DNA damage is sometimes used as a cue to activate the mobilization process. For example, activation of proteins involved in the DNA damage response in E. coli also activates transcription of bacteriophage lambda genes, ultimately leading to the excision of the element from the bacterial chromosome. The Schizosaccharomyces pombe element Tf1 targets the promoter regions of stress response genes under conditions of stress in which these genes are activated.

Target Immunity Tn7 and other mobile elements, such as Mu and Tn3, display a property known as target immunity. These elements discourage the insertion of additional, similar elements nearby. Preventing nearby insertions helps to ensure that mobile elements will not insert into themselves, creating nonfunctional elements, and also prevents destabilization of target DNA molecules. Tn7 discourages transposition into DNA up to 190 kilobases from an existing element. Copies of the same element inserted close together provide extended regions of DNA homology that, when involved in host-mediated homologous recombination, can result in deletions or inversions of large regions of DNA. In both Mu and Tn7, target immunity is mediated by a regulatory protein with an ATPase activity that modulates the mobility of the element according to whether the protein is in its ATP- or ADP-bound state. Transposon ends that already exist in a potential target DNA molecule and are bound by transposase stimulate the ATPase activity of the regulator protein, thereby deactivating it and preventing transposition. In the case of Tn7 and related elements, target immunity appears to be at least partly restricted to very closely related elements. Multiple similar but nonidentical Tn7-like elements have been detected within a single attTn7 site, demonstrating the limitations of target immunity for these elements.
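The distance-dependent aspect of target immunity can be illustrated with a toy check in Python. This is only a sketch: the function name, the positions, the use of simple linear coordinates, and the treatment of the 190-kilobase figure as a hard cutoff are all assumptions made for illustration. The actual mechanism, as described above, depends on the ATPase regulator protein rather than on a fixed threshold.

```python
# Toy model: target immunity treated as a hard distance cutoff (an assumption).
IMMUNITY_RANGE_BP = 190_000  # distance quoted in the text for Tn7


def is_immune(candidate_site, existing_elements, immunity_range=IMMUNITY_RANGE_BP):
    """Return True if a candidate insertion site lies within the immunity
    range of any existing element on the same linear DNA molecule."""
    return any(abs(candidate_site - pos) <= immunity_range
               for pos in existing_elements)


existing = [1_000_000]  # hypothetical position of a resident element
print(is_immune(1_150_000, existing))  # site 150 kb away from the element
print(is_immune(1_300_000, existing))  # site 300 kb away from the element
```

In this simplified picture, only the second site would remain available for a new insertion.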

Host Defenses Against Mobile Genetic Elements Host organisms employ a variety of strategies to protect their genomes from mobile elements. Prokaryotic organisms use CRISPR-Cas (clustered regularly interspaced short palindromic repeats) systems to selectively cleave mobile elements, including phage, plasmids, integrative conjugal elements, and transposons that can potentially invade the genome. These systems act as primitive immune systems: they store short nucleotide sequences that originate from mobile elements and use them to identify, specifically cleave, and inactivate those elements. In eukaryotes, defenses against transposons are often mediated by small noncoding RNAs that silence gene expression from transposons. For example, Piwi-interacting RNAs (piRNAs) mediate the modification of histones that prevents gene expression from transposons and surrounding genes.

Cross-References ▶ DNA Recombination, Mechanisms of ▶ DNA Repair ▶ DNA Replication ▶ Double-Strand Break Repair ▶ Mobile DNA: Mechanisms, Utility, and Consequences ▶ Transposons

References
Barabas O, Ronning DR, Guynet C, Hickman AB, Ton-Hoang B, Chandler M, Dyda F (2008) Mechanism of IS200/IS605 family DNA transposases: activation and transposon-directed target site selection. Cell 132:208–220
Baranov PV, Fayet O, Hendrix RW, Atkins JF (2006) Recoding in bacteriophages and bacterial IS elements. Trends Genet 22:174–181
Bire S, Casteret S, Arnaoty A, Piegu B, Lecomte T, Bigot Y (2013) Transposase concentration controls transposition activity: myth or reality? Gene 530:165–171
Bonen L, Vogel J (2001) The ins and outs of group II introns. Trends Genet 17:322–331
Cambray G, Guerout AM, Mazel D (2010) Integrons. Annu Rev Genet 44:141–166
Craig NL (1997) Target site selection in transposition. Annu Rev Biochem 66:437–474
Grindley ND, Whiteson KL, Rice PA (2006) Mechanisms of site-specific recombination. Annu Rev Biochem 75:567–605
Gueguen E, Rousseau P, Duval-Valentin G, Chandler M (2005) The transpososome: control of transposition at the level of catalysis. Trends Microbiol 13:543–549
Kleckner N, Chalmers RM, Kwon D, Sakai J, Bolland S (1996) Tn10 and IS10 transposition and chromosome rearrangements: mechanism and regulation in vivo and in vitro. Curr Top Microbiol Immunol 204:49–82
Lipkow K, Buisine N, Lampe DJ, Chalmers R (2004) Early intermediates of mariner transposition: catalysis without synapsis of the transposon ends suggests a novel architecture of the synaptic complex. Mol Cell Biol 24:8301–8311
Lohe AR, Timmons C, Beerman I, Lozovskaya ER, Hartl DL (2000) Self-inflicted wounds, template-directed gap repair and a recombination hotspot. Effects of the mariner transposase. Genetics 154:647–656
Merkulov GV, Lawler JF Jr, Eby Y, Boeke JD (2001) Ty1 proteolytic cleavage sites are required for transposition: all sites are not created equal. J Virol 75:638–644
Nagy Z, Chandler M (2004) Regulation of transposition in bacteria. Res Microbiol 155:387–398
Peters JE, Craig NL (2001) Tn7: smarter than we thought. Nat Rev Mol Cell Biol 2:806–814
Reznikoff WS (1993) The Tn5 transposon. Annu Rev Microbiol 47:945–963
Reznikoff WS (2003) Tn5 as a model for understanding DNA transposition. Mol Microbiol 47:1199–1206
Weinreich MD, Reznikoff WS (1992) Fis plays a role in Tn5 and IS50 transposition. J Bacteriol 174:4530–4537

Genomic Imprinting Scheherazade Khan and Angela K. Hilliker Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms Epigenetic DNA methylation; Gene silencing

Definition Genomic imprinting is a non-Mendelian pattern of inheritance where some genes are silenced, or imprinted, in the sperm or egg, and this silencing is maintained in the offspring. As a result, only one of the two copies of the offspring’s gene is expressed. The expression pattern is determined by whether the allele was inherited maternally or paternally. Some genes are imprinted in eggs, causing only the paternally inherited copy to be expressed; other genes are imprinted in sperm, allowing expression only from the maternally inherited chromosome.

Discussion It has been recognized for centuries that the mule, the offspring of a female horse and a male donkey, differs significantly from the hinny, the offspring of a male horse and a female donkey. How does the mixture of the same genomes, different only by the sex of the parent, create two such different animals? Their differences result from an epigenetic phenomenon called genomic imprinting. In genomic imprinting, only one of the two copies of a gene is activated, while the other is silenced. The silencing pattern is determined by which parent the gene was inherited from and is established when gametes (sperm and egg) are formed. The silencing pattern persists in the offspring and may affect the phenotype of the offspring. Genomic imprinting results in the expression of only the maternal allele for some genes and only the paternal allele for others. It explains, in the case of the mule and the hinny, why parental genomes appear to contribute unequally to the phenotype of their offspring.

Genomic imprinting is caused by DNA methylation. Recently, it has been found that the majority of imprinted genes or imprinted loci (locations harboring multiple imprinted genes) have regions of DNA that are methylated differently on the chromosomes of each parent. The differences in the methylation of these regions constitute the “imprint” that causes one copy of the gene to be silenced and the other to be expressed (Obata 2001). Usually, DNA methylation is correlated with repression of transcription, as described in the entry ▶ “Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of”. However, there are cases in which DNA methylation can activate the expression of a nearby gene. For this reason, it is important to examine each example of genomic imprinting in isolation (Pfeifer 2000).

Genomic Imprinting, Fig. 1 Control of Igf2 and H19 genes by genomic imprinting via DNA methylation. Because only the paternal chromosome is selectively imprinted, the Igf2 gene is only expressed from the paternal chromosome, while H19 is only expressed from the maternal chromosome. (a) Due to genomic imprinting, the paternal chromosome is selectively methylated at the DMR (differentially methylated region). This methylation extends into the H19 gene, causing it to be silenced. A nearby transcriptional activator (in red) can activate the transcription of the upstream Igf2 gene on the paternal chromosome. (b) The maternal chromosome has no methylation, so the H19 gene is transcribed. As the DMR is not methylated, the insulator protein CTCF can bind (gray triangles). CTCF blocks the enhancer (red) from activating Igf2 (Adapted from Watson et al. 2008, Fig. 17–26)

For instance, the H19 and Igf2 genes are neighbors in a group of imprinted genes that is highly conserved in mice and humans; there is a conserved DMR between these genes (Fig. 1). Although H19 and Igf2 are adjacent to one another, methylation of the paternal chromosome’s DMR results in the silencing of H19 and the expression of Igf2. Conversely, the maternal copy of H19 is active and Igf2 is inactive. The regulation of H19 is consistent with the usual expectation that DNA methylation leads to silencing, while Igf2 counters convention. The mechanism for this differential regulation depends on the DNA within the DMR, which includes the H19 gene itself and a regulatory element between H19 and Igf2 that binds an insulator protein, CTCF. Methylation of the DMR and H19 gene will inhibit transcription of the H19 gene, but will also block binding of the CTCF insulator protein to the DMR. The activation of Igf2 occurs because, in the absence of CTCF, nearby transcriptional activators can activate Igf2. On the maternal chromosome, the DMR is not methylated and the regulatory element binds the CTCF insulator, blocking Igf2 from activation (Fig. 1). This DMR illustrates that methylation can block the function of many types of DNA sequences, not merely promoters, which can lead to complex effects on transcription (Latchman 2010). Since imprinting leads to activation of only one of two copies of a gene, there are several genetic disorders related to imprinting. If the expressed
copy contains a mutation, then the imprinted copy, even if it is wild type, cannot compensate for the mutated copy. Imprinting creates non-Mendelian patterns of inheritance, by which two offspring can have the same heterozygous genotype but exhibit different phenotypes (unaffected or diseased), depending on which parent contributed the disease allele. For example, Angelman syndrome and Prader-Willi syndrome can each arise from a deletion of a particular region of chromosome 15 that contains several genes. Sections of the DNA that are important for preventing Prader-Willi syndrome are imprinted in the maternal chromosome, while other portions of the DNA that help prevent Angelman syndrome are imprinted in the paternal chromosome. If a person has one wild-type copy of chromosome 15 and one copy of chromosome 15 with this deletion, then they will have Prader-Willi or Angelman syndrome, depending on whether the mutant chromosome was inherited from their mother or father. If the mutated chromosome is maternal, then the child will have Angelman syndrome because the DNA important for preventing the disease is absent from the maternal chromosome and imprinted on the paternal chromosome. If the mutated chromosome is paternal, then the child will have Prader-Willi syndrome because the DNA important for preventing the disease is absent from the paternal chromosome and imprinted on the maternal chromosome. Thus,
each disease manifests because essential DNA sequences are imprinted in one parent and absent in the other. Genomic imprinting has been found in insects, plants, and mammals. It is estimated that about 1% of the human genome is imprinted. Yet, what is the purpose of imprinting? It is clear that it is an important process, as mutations that interfere with imprinting are commonly associated with cancer (Pfeifer 2000). However, the normal function of the process is largely mysterious. As many of the genes that undergo this process are vital, some have theorized that it evolved in order to prevent the development of embryos with no paternal contribution. Others theorize that imprinting creates a genetic tug of war between the sexes. It has been observed that imprinting of maternal genes tends to slow growth, while imprinting of paternal genes tends to increase growth. In some mammals, including cats, females carry litters that can arise from more than one father. Imprinting in males may have arisen to make their offspring bigger, in hope of outcompeting the littermates from another father. Conversely, the female’s imprinting may have arisen to encourage less growth to decrease competition and optimize the fitness of all of her offspring. From an evolutionary point of view, the contrasting goal of the mother (survival of the whole litter) versus the goal of the father (survival of his offspring exclusively) may create a tug of war facilitated by genomic imprinting.
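The parent-of-origin logic described above for the chromosome 15 deletion can be written out as a short Python sketch. The dictionary and function names are hypothetical and chosen only for illustration. The model covers just the single-deletion heterozygote discussed in the text and ignores other causes of these syndromes.

```python
# Parental copy on which the protective DNA is silenced (imprinted),
# following the description of the chromosome 15 region in the text.
IMPRINTED_ON = {
    "Prader-Willi": "maternal",  # protective DNA silenced on the maternal copy
    "Angelman": "paternal",      # protective DNA silenced on the paternal copy
}


def syndrome_from_deletion(deleted_parent):
    """Return the syndrome expected when the deletion lies on the given
    parental chromosome ('maternal' or 'paternal')."""
    if deleted_parent not in ("maternal", "paternal"):
        raise ValueError("deleted_parent must be 'maternal' or 'paternal'")
    for syndrome, silenced_parent in IMPRINTED_ON.items():
        # Disease arises when the only active copy (the one that is NOT
        # imprinted) is the copy that carries the deletion.
        if silenced_parent != deleted_parent:
            return syndrome


for parent in ("maternal", "paternal"):
    print(parent, "deletion:", syndrome_from_deletion(parent))
```

The same heterozygous genotype therefore maps to two different outcomes, which is the non-Mendelian pattern the entry describes.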

Cross-References ▶ Epigenetics ▶ Genomic Imprinting in Mammals: Memories of Generations Past

References
Latchman D (2010) Gene regulation. Taylor & Francis Group, New York
Obata Y (2001) Study on the mechanism of maternal imprinting during oocyte growth. J Reprod Dev 57:1–8
Pfeifer K (2000) Mechanisms of genomic imprinting. Am J Hum Genet 67:777–787
Watson JD et al (2008) Molecular biology of the gene, 6th edn. Pearson/Benjamin Cummings, San Francisco

Genomic Imprinting in Mammals: Memories of Generations Past Nora Engel Fels Institute for Cancer Research, Temple University School of Medicine, Philadelphia, PA, USA

Synopsis Genomic imprinting in mammals is the process in which the nonequivalence of parental genomes is established, leading to parent-of-origin effects on several processes, including transcription. DNA methylation has the dual role of marking genes as either maternally or paternally inherited and of silencing gene expression on one of the two alleles. Furthermore, the imprinting control extends over large domains of genes by mechanisms that have yet to be elucidated. Studies of imprinted regions have revealed many of the broader principles of the epigenetic control of transcription.

Introduction Mammals are diploid organisms, with two functionally nonequivalent parental genomes. The maternal and paternal genomes are made distinct in their potential for gene expression by specializations that occur during oogenesis and spermatogenesis. Additional specializations can occur after fertilization, before the two genomes are united. These differences are determined by epigenetic mechanisms, i.e., reversible chemical and structural modifications that are imposed on the DNA and that affect the transcriptional competence of genes, without altering the DNA sequence per se. Many of the epigenetic asymmetries between the maternal and paternal genomes are erased during early embryogenesis, restoring the ability of most genes to be expressed from both parental genomes. However, there is a group of genes that have an unusual behavior: they retain different expression potentials after fertilization, with the two parental copies displaying differential effects on phenotype. These are the so-called imprinted genes (McGrath and Solter 1984).

The Life Cycle of Imprinted Regions We can define genomic imprinting as the process by which DNA sequences, including genes or regulatory regions, are labeled epigenetically as paternal or maternal, allowing them to be distinguished during development (Fig. 1). By mapping the imprinting process onto the life cycle of the organism, several logical steps can be identified. During gametogenesis, the process by which male and female gametes are produced, the parental genomes undergo sex-specific modifications. After fertilization, the two genomes are eventually united within a single nuclear compartment. Some differences are erased in the ensuing cell divisions. Imprints, however, are retained in all or most cell types, depending on the individual gene (Latham 1995), and continue to distinguish the paternally derived from the maternally derived genome through a maintenance mechanism that allows them to persist through mitoses. Importantly, maintenance of imprints is needed for viability and healthy life. Shortly after implantation, a small number of germ line precursors of either sperm or oocytes, the primordial germ cells, are physically set apart from the soma. The primordial germ cells undergo an erasure of the imprints in order to reestablish them according to the sex of the developing embryo. To summarize, mammals inherit two genomes, one maternally and another paternally imprinted, but only transmit one imprinting state to their progeny, depending on whether they are male or female. This life cycle underlies the three main requirements for gene imprinting: it must be stable, heritable, and reversible.

Genomic Imprinting in Mammals: Memories of Generations Past, Fig. 1 The life cycle of imprints. Schematic representing the stages of fertilization, embryogenesis, and gametogenesis. P and M, paternal and maternal imprints. Processes of imprint maintenance, erasure, and establishment are indicated
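The erasure and re-establishment steps of the imprint life cycle can be summarized in a small Python sketch. The function name and the string labels are hypothetical, and the model deliberately ignores maintenance in somatic cells and any gene-specific behavior. It only shows that an individual inherits two imprint states but transmits one, set by its own sex.

```python
def transmit_imprint(individual_sex, inherited_imprints):
    """Model erasure and re-establishment of imprints in the germ line.

    inherited_imprints: the two imprint states carried by the individual,
    one "maternal" and one "paternal".
    Returns the imprint state carried by each of the two genome copies
    after germ line reprogramming.
    """
    if individual_sex not in ("female", "male"):
        raise ValueError("individual_sex must be 'female' or 'male'")
    if sorted(inherited_imprints) != ["maternal", "paternal"]:
        raise ValueError("expected one maternal and one paternal imprint")
    # Erasure in the primordial germ cells: both inherited marks are removed.
    erased = [None for _ in inherited_imprints]
    # Re-establishment according to the sex of the individual.
    new_state = "maternal" if individual_sex == "female" else "paternal"
    return [new_state for _ in erased]


print(transmit_imprint("female", ("maternal", "paternal")))
print(transmit_imprint("male", ("maternal", "paternal")))
```

Stability corresponds to the inherited states being kept outside the germ line, while reversibility corresponds to the erasure step modeled here.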

Functions of Imprinted Genes The two sets of imprints in the parental genomes are complementary, and thus both genomes are required for normal development (McGrath and Solter 1984). To date, approximately 100 imprinted genes have been identified in both humans and mice, most of which have crucial roles during embryogenesis (Fig. 2a) (for current catalogs of imprinted genes, see http://igc.otago.ac.nz/home.html). The fact that in all or most expressing cell types imprinted genes are only active from one allele means that any perturbation in their expression is dominant. In fact, proper imprinted gene regulation is vital, and alteration of their status in humans can lead to both genetic diseases and cancer (Fig. 2b, c).

Genomic Imprinting in Mammals: Memories of Generations Past, Fig. 2 Imprinted genes and potential disease mechanisms. (a) Maternal and paternal genomes are represented, with nonimprinted genes in gray. Actively transcribed genes exhibit broken arrows to indicate RNA production. Gene 1 is maternally silent and is only expressed from the paternal copy, whereas Gene 2 is paternally silent and maternally expressed. (b) Large red X represents inactivating mutation. When mutation occurs on a nonimprinted gene (gray boxes), there is still an active copy on the other parental chromosome. If imprinted Gene 1 is mutated on the paternal copy, which is normally the single expressed allele, there is complete loss of the gene product. (c) If the maternal copy of Gene 1 is activated (represented by red arrow) by a mutation or epimutation, such as loss of methylation, there is now double dosage of the gene product

Lists of the best-described imprinting disorders can be found in recent reviews (Bartolomei and Ferguson-Smith 2011). These diseases can be caused by a variety of mechanisms, all of which alter the dosage of imprinted genes, resulting in either a doubling of the normal amount of protein or no protein expression at all. Among other etiologies, there are chromosomal duplications, deletions, and uniparental disomies, in which the two copies of a chromosome have the same parental origin instead of being from the two parents; additionally, there are cases due to failure to establish, erase, or maintain imprints. These diseases, though rare, are severe, since they affect genes regulating embryonic development. In addition, loss of imprinting resulting from aberrant DNA methylation states can occur in later stages, and much effort is being invested to elucidate the effects of loss of imprinting in aging and cancer.
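The dosage outcomes summarized in Fig. 2 can be reproduced with a minimal Python sketch that counts active gene copies. The function and parameter names are hypothetical, and the model assumes that each active, unmutated allele contributes one unit of product, which is a simplification made for illustration.

```python
def active_copies(silent_allele=None, mutated=(), reactivated=()):
    """Count active copies of a gene in a diploid cell.

    silent_allele: "maternal", "paternal", or None for a nonimprinted gene.
    mutated: alleles carrying an inactivating mutation.
    reactivated: normally silent alleles that lost their imprint
                 (for example through loss of methylation).
    """
    count = 0
    for allele in ("maternal", "paternal"):
        silenced = allele == silent_allele and allele not in reactivated
        if not silenced and allele not in mutated:
            count += 1
    return count


# Nonimprinted gene with one mutated copy: one active copy remains.
print(active_copies(mutated=("paternal",)))
# Gene 1 (maternally silent) mutated on the paternal copy: complete loss.
print(active_copies(silent_allele="maternal", mutated=("paternal",)))
# Gene 1 with the maternal copy reactivated: double dosage.
print(active_copies(silent_allele="maternal", reactivated=("maternal",)))
```

A normal imprinted gene gives one active copy in this model, so both disease scenarios appear as deviations from a dosage of one.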

Mechanisms of Genomic Imprinting The “epigenetics” field is largely the legacy of the study of genomic imprinting, since investigation into the molecular mechanisms of monoallelic expression yielded a wealth of information on heritable regulatory mechanisms in general. Since both the active and the inactive copy of imprinted genes coexist in the same transcriptional environment, i.e., within reach of appropriate transcription factors, the differences between the alleles of imprinted genes, barring sequence variations, will account for the differences in transcriptional status and are, by definition, epigenetic. Thus, DNA methylation and histone modifications distinguish the parental alleles, just as they distinguish an active gene that becomes inactive or vice versa during development or in response to external cues (Barlow 2011).

In the case of imprinted genes, several questions arise: What is the primary mark that imprints a gene? When and how are imprints established? How are imprints targeted to specific regions? How are they maintained throughout development? How are those marks recognized by the transcriptional machinery, and how do they affect transcription?

Heritable silencing of one allele in the presence of the homologue that maintains its expression potential correlates with the presence of differential chemical or structural modifications of the DNA. A common feature of imprinted genes is the presence of regions of DNA that have been methylated in only one gamete and that are kept methylated on that parental chromosome throughout cell divisions. Some of these differentially methylated regions, or DMRs, have been shown to have regulatory functions by targeted mutations in the mouse. Thus, DNA methylation at DMRs associated with imprinted genes has the dual role of maintaining a memory of parental origin and, if located in a region involved in transcriptional regulation, of affecting gene expression. In fact, DNA methylation fulfills all the criteria expected of a mark that differentially identifies the parental alleles, since it is established in a gamete-specific manner, when the parental chromosome sets are in their separate compartments. Moreover, sequences that carry parent-specific methylation maintain these marks heritably through mitosis and are only subject to erasure during passage through the germ line. Further, DNA methylation of regulatory sequences can exclude binding of proteins to their recognition sites, with an outcome on gene expression depending on the role of the sequence element in the activation or repression of transcription.
Factors involved in establishing and removing DNA methylation imprints have been identified (Smallwood and Kelsey 2012). There are some emerging clues as to how they affect chromatin status, but how imprints are targeted to specific sequences is not known, since the enzymatic machinery responsible for de novo and maintenance methylation of imprinted regions during gametogenesis also carries out genome-wide methylation of other, nonimprinted genes. One possibility is that there is no sequence-specific recognition of imprinted genes and that methylation occurs by default during gametogenesis at loci that are accessible to the methylation machinery. This is not the whole story, though, because other sequences that are methylated during gametogenesis do not share the resistance to demethylation of imprinted genes during preimplantation development. Specifically, DNMT3A2, a germ line–specific de novo DNA methyltransferase, together with a cofactor, DNMT3L, are both required for establishment of male and female imprints. DNMT1 is responsible for heritable maintenance of methylation patterns over the whole genome, including imprints. These enzymes do not have sequence specificity. The protection of some, but not all, methylation marks at DMRs during phases of reprogramming and genome-wide demethylation requires PGC7/STELLA and ZFP57. Nevertheless, much remains to be explored to understand the molecular and/or structural differences between methylation at DMRs and other nonimprinted regions.

Transcriptional Interpretation of the Imprint Most of what we have learned about imprinting relates to the mechanisms that result in monoallelic expression, although it should be noted that there are likely to be other consequences of differential methylation that are unrelated to gene expression (Pardo-Manuel de Villena, de la Casa-Esperon et al. 2000). Elegant gene targeting experiments in the mouse have shown that monoallelic expression at imprinted genes always involves specific DNA sequences that are binding sites for regulatory proteins. Methylation of such sequences acts as a switch by impeding access of the protein to its recognition site. The resulting effect on transcription, i.e., whether it is turned on or off, depends on the nature of the regulatory sequence.


Genomic Imprinting in Mammals: Memories of Generations Past, Fig. 3 Transcriptional mechanisms of monoallelic expression at imprinted genes. Active genes are represented in colors (pink, maternal alleles; blue, paternal alleles) and inactive genes are shown in gray. (a) A partial schematic of the Kcnq1 domain, with the differentially methylated region, DMR, showing maternal methylation (filled lollipops) in an intron of the Kcnq1 gene. On the paternal allele, the same region is unmethylated and gives rise to a noncoding RNA (Lit1/Kcnq1ot1), represented as a wavy line, that acts in cis as a repressor of neighboring genes. Thus, Cdkn1c, Slc22a18, Phlda2 are only expressed from the maternal copies. (b) Representation of the H19/Igf2 imprinted region. In this case, the paternal chromosome carries a methylation mark upstream of the H19 gene, in a sequence that on the unmethylated maternal allele can bind CTCF to establish a genomic insulator (gray inverted triangle). The insulator blocks Igf2 from access to the enhancers downstream of the H19 gene (ovals). Thus, Igf2 is only active on the paternal copy

To date, two types of DNA elements are known to be involved in controlling transcriptional outcomes at imprinted loci: promoters and insulators. In the first case, a promoter is active on the unmethylated copy, but inactive on its methylated counterpart. All the examples involving this mechanism are maternally methylated and paternally unmethylated, and these constitute the majority of imprinted genes. There are various possible outcomes to this scenario: (1) In the Igf2r and Kcnq1 imprinted domains, an antisense noncoding RNA is produced from the paternally unmethylated promoter (Fig. 3a). Both the 108 kb Airn in the Igf2r domain and the 92 kb Kcnq1ot1 in the Kcnq1 region have silencing capabilities in cis, so some of their neighboring genes have monoallelic expression on the opposite chromosome as a secondary effect of the production of the ncRNA. Control over several neighboring genes by a noncoding RNA has also been shown for the Gnas and Snrpn domains. (2) In the H13/Mcts2 domain, these two genes overlap in their coding region, but there is differential methylation of the Mcts2 promoter, such that transcription of Mcts2 on the unmethylated allele interferes with the production of the full H13 RNA, which translates into an inactive peptide.

The second imprinting mechanism, exemplified by the H19/Igf2 domain, involves a genomic insulator that binds the CTCF protein on the unmethylated maternal allele (Engel and Bartolomei 2003). The bound insulator interferes with communication between the Igf2 promoter and specific enhancers, abrogating Igf2 expression from the maternal allele. The paternally methylated version of the insulator cannot bind CTCF, thus allowing contact between the Igf2 promoter and the enhancers and resulting in active paternal expression (Fig. 3b). It is interesting to note that in the four currently known cases of imprinted domains in which paternal methylation is involved, the mark occurs in intergenic regions.
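The insulator mechanism at the H19/Igf2 domain amounts to a simple chain of logical dependencies, sketched below in Python. The function name and the boolean simplification are assumptions for illustration. The rule that methylation silences H19 follows the description of this locus in the Genomic Imprinting entry, and real regulation is quantitative and tissue dependent.

```python
def h19_igf2_expression(dmr_methylated):
    """Return the expression state of H19 and Igf2 on one chromosome copy,
    given the methylation state of the region upstream of H19."""
    # CTCF binds the insulator only when the region is unmethylated.
    ctcf_bound = not dmr_methylated
    # A CTCF-bound insulator blocks the enhancers from reaching Igf2.
    igf2_on = not ctcf_bound
    # Methylation upstream of H19 silences H19 itself.
    h19_on = not dmr_methylated
    return {"H19": h19_on, "Igf2": igf2_on}


paternal = h19_igf2_expression(dmr_methylated=True)   # methylated copy
maternal = h19_igf2_expression(dmr_methylated=False)  # unmethylated copy
print("paternal:", paternal)
print("maternal:", maternal)
```

The two genes come out reciprocally expressed on each copy, which is why a single methylation mark can produce opposite effects on neighboring genes.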


The examples above do not preclude the existence of other mechanisms by which methylation can lead to monoallelic expression. For example, in the case of predominantly intergenic imprints, other regulatory elements in addition to insulators, such as enhancers and silencers, will likely be involved in imprinted gene expression. Interestingly, in most imprinted domains, the transcriptional consequences of methylation of a single DNA element are far-reaching, affecting clusters of genes and raising the question of how long-range regulation can be explained mechanistically. In conclusion, imprinted genes remember their history and sexual origin, and epigenetic marks such as DNA methylation function to preserve this memory after fertilization. Why imprinting exists at all is still under debate (for a discussion on the evolutionary theories of imprinting, see Hurst 1997), but its existence mandates that both maternal and paternal genomes are required for proper development, and disruption of imprinting by both mutations and epimutations poses significant risks to human health. The major challenges in the field lie in exploring cases of tissue-specific imprinting, in determining the effects of imprinting that are unrelated to gene expression, in elucidating the molecular mechanisms by which specific sequences are targeted for imprint establishment, and, finally, in understanding how and why imprinting emerged during evolution.

Cross-References ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of ▶ Epigenetics ▶ Gene Regulation ▶ Genomic Imprinting ▶ RNA-Induced Chromatin Remodeling

References Barlow DP (2011) Genomic imprinting: a mammalian epigenetic discovery model. Annu Rev Genet 45:379–403 Bartolomei MS, Ferguson-Smith AC (2011) Mammalian genomic imprinting. Cold Spring Harb Perspect Biol 3(7)

Engel N, Bartolomei MS (2003) Mechanisms of insulator function in gene regulation and genomic imprinting. Int Rev Cytol 232:89–127 http://igc.otago.ac.nz/home.html. Accessed 04 Apr 2014 Hurst LD (1997) Evolutionary theories of genomic imprinting, in genomic imprinting. IRL Press, Oxford Latham KE (1995) Stage-specific and cell type-specific aspects of genomic imprinting effects in mammals. Differentiation 59(5):269–282 McGrath J, Solter D (1984) Completion of mouse embryogenesis requires both the maternal and paternal genomes. Cell 37:179–183 Pardo-Manuel de Villena F, de la Casa-Esperon E et al (2000) Natural selection and the function of genome imprinting: beyond the silenced minority. Trends Genet 16(12):573–579 Smallwood SA, Kelsey G (2012) De novo DNA methylation: a germ cell perspective. Trends Genet 28(1):33–42

Genomic Sequence and Structural Diversity in Plants Candice N. Hirsch Department of Agronomy and Plant Genetics, University of Minnesota, Saint Paul, MN, USA

Synopsis Plant genomes can tolerate a wide range of variation derived from accumulation of mutations, hybridization, polyploidization, and other mechanisms. It is this diversity that underlies the array of phenotypes observed not only across the Plantae kingdom but also within each species. Characterization of sequence level variation (single nucleotide polymorphisms, insertions/deletions, presence/absence variation, copy number variation, and inversions) allows for the association of specific sequence variants with resulting phenotypes, understanding of genetic pathways, rapid varietal improvement by plant breeders, and continuing insight into basic biological phenomena. The following essay describes the sequence and structural diversity within plant genomes and the implications and applications of this diversity.


Introduction The phenotypic diversity exhibited within plant species has been widely documented. It is this diversity that has allowed plants to be used for human and animal consumption, for isolation of compounds used in an array of chemicals and medications, and, more recently, for production of biofuels. While there is a deep understanding and appreciation of the phenotypic diversity in plants, only recently have we begun to understand the genomic level variation present within plant species, and how that diversity is responsible for the observed phenotypic diversity. The first plant genome to be sequenced was the model species Arabidopsis thaliana (The Arabidopsis Genome Initiative 2000), which was followed by genome sequences for many agronomic and specialty crops. Genome sequencing projects such as that of A. thaliana, rice, maize, and others have attempted to generate a reference sequence for each plant species using a single accession. This approach, while necessarily practical, is reliant on there being a single representative genome for each species. In reality, the genomic sequence present in any single accession represents only a portion of the total sequence across a species. As a consequence, this single genome sequence does not provide a clear picture of the sequence level variation between accessions. Resequencing of additional accessions within species can provide great insights into the true genomic diversity

Genomic Sequence and Structural Diversity in Plants, Fig. 1 Sequence level variation in plant genomes. (a) Single-nucleotide polymorphism (SNP). (b)

that exists within plant species (Fig. 1). This sequence level diversity between accessions is the product of recombination, mutation, selection, gene conversion, polyploidization and subsequent diploidization, species hybridization, and many other mechanisms. In addition, the genome and genome diversity of each species have unique characteristics that reflect the evolutionary and domestication history of that species, its reproductive system, artificial selection imposed by humans, and other external factors such as environmental adaptation.
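The two simplest variant classes named above, SNPs and InDels, can be read directly off a pairwise alignment. The following is a minimal sketch rather than a production variant caller: it assumes the two sequences have already been aligned to equal length with "-" marking gaps, and the function name `classify_variants` is hypothetical. The example sequences are short illustrative strings of the kind shown in Fig. 1.

```python
def classify_variants(ref_aln, alt_aln):
    """Classify differences between two pre-aligned sequences.

    Both inputs must have equal length, with '-' marking alignment gaps
    (an assumption of this sketch). Returns a list of tuples
    (kind, position, ref_allele, alt_allele), where position is 0-based
    in ungapped reference coordinates.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0  # position in the ungapped reference
    i = 0
    n = len(ref_aln)
    while i < n:
        if ref_aln[i] == "-" or alt_aln[i] == "-":
            # Collect the whole run of gapped columns as one InDel.
            start, start_pos = i, ref_pos
            while i < n and (ref_aln[i] == "-" or alt_aln[i] == "-"):
                if ref_aln[i] != "-":
                    ref_pos += 1
                i += 1
            ref_allele = ref_aln[start:i].replace("-", "")
            alt_allele = alt_aln[start:i].replace("-", "")
            variants.append(("InDel", start_pos, ref_allele, alt_allele))
            continue
        if ref_aln[i] != alt_aln[i]:
            variants.append(("SNP", ref_pos, ref_aln[i], alt_aln[i]))
        ref_pos += 1
        i += 1
    return variants


# A single-base substitution and a five-base deletion relative to the reference.
snp_example = classify_variants("ATATATCGGCGCTCATCTCTA", "ATATATCGGCTCTCATCTCTA")
indel_example = classify_variants("ATATATCGGCGCTCATCTCTA", "ATATATCG-----CATCTCTA")
```

Presence/absence and copy number variation involve much larger segments and are usually inferred from read depth or hybridization signal rather than from base-by-base alignment.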

Diversity in a Small Self-Pollinated Genome: Arabidopsis thaliana The model species A. thaliana has one of the smallest known plant genomes at approximately 157 Mb, and the most recent version of the annotation has predicted only 27,416 genes (TAIR 2014). This small genome size, coupled with the inbred nature of the species, has allowed A. thaliana to pave the way for genome level studies in plants. The first version of the A. thaliana genome was completed in 2000 by the Arabidopsis Genome Initiative, a group of public researchers, using the Columbia (Col-0) accession (The Arabidopsis Genome Initiative 2000). The Col-0 genome was sequenced using a bacterial artificial chromosome (BAC)-by-BAC approach resulting in a high-quality sequence for this accession and providing a scaffold for which

Insertion/deletion (InDel). (c) Presence/absence variation (PAV). (d) Copy number variation (CNV)

sequences from other accessions can be readily aligned to evaluate genomic diversity. At the same time, complementary research was being conducted at Cereon Genomics LLC, using a whole-genome shotgun approach to sequence Landsberg erecta (Ler), another commonly used accession in Arabidopsis research. The Ler genome was sequenced to approximately 2× coverage, and assembly of the sequence reads totaled 92.1 Mb, approximately 70% of the genome (Jander et al. 2002). The availability of these two resources permitted the first whole-genome level evaluation of sequence diversity between accessions within a plant species. In the initial comparison between the Col-0 and Ler accessions, single nucleotide polymorphism (SNP) and insertion/deletion (InDel) variants were identified using a stringent set of criteria due to the relatively low sequence coverage available for Ler. Because of these stringent criteria and the 30% of the genome not included in the comparison, the reported polymorphism frequency underestimated the true diversity between these two accessions. However, the validation rate of the predicted polymorphisms was near 100%. In total, 56,670 high-confidence polymorphisms were identified. Of these, there was one InDel every 6.6 kbp with the insertions in Col-0 relative to Ler averaging 175 bp. The remaining polymorphisms (18,579; 1 every 3.3 kbp) were SNPs, with higher average density in introns compared to exons. Despite the underestimation of genomic diversity between these accessions, this landmark data set provided a high-quality set of markers for quantitative trait loci (QTL) and association mapping studies and for map-based cloning, in addition to being the first genome level evaluation of diversity in plants. Evaluation of genome level diversity has come a long way since the initial comparison of Col-0 and Ler. In 2008, the 1001 Genomes Project for A. 
thaliana was initiated, a worldwide collaboration with the goal of describing whole-genome sequence level variation in 1,001 A. thaliana accessions (1001 Genomes 2014; Weigel and Mott 2009). There is a large amount of phenotypic and genotypic variation in A. thaliana; however, despite this variation and the resources available as a model species, few alleles responsible for phenotypic variation among accessions have been identified owing to the time-consuming and difficult process of fine mapping and QTL dissection. The 1001 Genomes Project provides a new and powerful tool for the dissection of phenotypic variation through genome-wide association mapping, as well as a deeper understanding of the genomic content in the species by allowing evaluation of not only SNPs and small InDels but also structural variation that exists within the species. In the initial study from the 1001 Genomes Project, Col-0 and two divergent A. thaliana strains that represent much of the global population diversity were compared. A combination of mapping reads to the reference sequence and de novo assembly was employed to allow detection of SNPs (823,325), InDels (79,961), and structural variation (presence/absence variation (PAV) and copy number variation (CNV)) (Ossowski et al. 2008). The structural variation analysis identified >3.4 Mb from the Bur-0 and Tsu-1 genomes that were highly divergent, deleted, or duplicated compared to the Col-0 sequence, with the CNV determined based on read coverage and presence of SNPs within the accession. A subsequent analysis of the first 80 genomes from the 1001 Genomes Project representing genetic diversity in eight geographically distinct populations provided further insights into the basis of genetic diversity in A. thaliana (Cao et al. 2011). Using 10–20× coverage from paired-end reads, 4,902,039 SNPs and 810,469 1–20 bp InDels were identified, with over half of the variants present in at least two of the geographically diverse populations. The lowest frequency of polymorphisms was observed in the coding sequence and the frequency was highest in the transposable elements (TEs) and intergenic sequences. While only 56 accessions were needed to detect 98% of the SNPs, additional rare alleles will continue to be discovered with the sequence from additional accessions. 
Interestingly, many of these SNPs and InDels altered start and stop codons and introduced frame shifts; however, multiple InDels in combination were shown to restore the proper reading frame. In addition to the SNPs and small InDels, novel contigs and


CNV were identified. Since its inception, the 1001 Genomes Project has released whole-genome sequence for over 450 A. thaliana accessions. Continual sequencing of A. thaliana accessions will likely identify additional novel genomic sequences not present in the reference Col-0 sequence that contribute to phenotypic diversity throughout the species. In addition to identifying polymorphic loci and characterizing the distribution of such loci throughout the genome, characterization of the organization of genetic variation in natural populations can also be informative. A common way to evaluate genetic variation on a population level is by calculating linkage disequilibrium (LD), which is the nonrandom association of alleles at different loci, and the rate of LD decay. LD and the rate of LD decay are indicative of recombination, population history, breeding method, and selection. Evaluation of 19 A. thaliana accessions revealed that LD decays within 10 kbp (approximately 2 genes) and that local populations have high LD, but that on a larger geographic scale, the rate of decay is similar to that observed in humans. This is reflective of a large effective population size for A. thaliana globally (Kim et al. 2007). Using an expanded set of lines from the 1001 Genomes Project, regional differences in LD were observed with a negative relationship between LD and geographic space (Cao et al. 2011). Studies such as these are highly valuable not only in describing the structure of genomic diversity in populations but also in providing great insight into the feasibility of genome-wide association studies and the necessary marker density within a species.
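Linkage disequilibrium between two biallelic loci is commonly summarized by the coefficient D = p_AB − p_A·p_B and its normalized form r² = D² / (p_A(1 − p_A)·p_B(1 − p_B)). The sketch below computes both from a list of haplotypes; it assumes that haplotypes are directly observed (reasonable for inbred accessions) and that alleles are coded 0/1. The function name and the toy data are hypothetical and are not taken from the cited studies.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) pairs,
    with each allele coded as 0 or 1 (an assumption of this sketch).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


# Complete association: allele 1 at locus 1 always travels with allele 1 at locus 2.
d_full, r2_full = linkage_disequilibrium([(0, 0)] * 5 + [(1, 1)] * 5)

# Independent loci: all four haplotypes equally frequent.
d_none, r2_none = linkage_disequilibrium([(0, 0), (0, 1), (1, 0), (1, 1)])
```

LD decay is then described by computing r² for many pairs of loci and examining how it falls off with the physical distance separating them.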

Diversity in a Large Open-Pollinated Genome: Maize The maize genome offers a stark contrast to the small non-repetitive A. thaliana genome, at 2.3 Gb, over 39,000 predicted gene models, and TEs comprising 85% of the genome (Schnable et al. 2009; see ▶ “Plant Transposable Elements: Beyond Insertions and Interruptions”). High prevalence of amplified gene fragments throughout the genome is a reflection of high TE content and activity. While the highly repetitive nature of maize can make it difficult to study, it is the action of TEs in recombination, exon shuffling, and evolution of new proteins that contributes to the vast genotypic and ultimately phenotypic diversity observed between maize accessions. In addition to TE-mediated shuffling of the genome to generate diversity, the out-crossing nature of maize allows for continual shuffling of the genome and introduction of novel allelic combinations within the species and a rapid decay of LD as compared to self-pollinated species such as A. thaliana. While maize is a naturally heterozygous out-crossing species, it can also be readily self-pollinated to generate inbred lines suitable for genetic studies. As with A. thaliana, the release of the maize B73 reference sequence (Schnable et al. 2009) greatly facilitated genomic diversity studies at the whole-genome level. Skim sequencing of 27 diverse maize inbred lines and subsequent alignment of reads to the reference sequence allowed for compilation of the first genomic scale haplotype map in maize (Gore et al. 2009). Because of the large and highly repetitive nature of the maize genome, techniques to reduce the effective genome size and focus on the low-copy portion of the genome are often employed, such as the use of restriction enzyme adjacent sequencing (Fig. 2). Focusing on positions with coverage in more than 13 of the inbred lines, 3.3 million polymorphic loci (SNPs and InDels) were identified, amounting to one polymorphism every 44 bp. Owing both to the skim sequencing approach and to the diverse nature of the germplasm, and thus the difficulty in aligning to a single reference sequence, this is likely an underestimation of the polymorphism rate within the entire population of maize. 
Scanning the whole genome, regions of low diversity were identified predominantly near centromeres, although regions with low diversity were present throughout the genome. These low-diversity regions were likely the targets of selection during domestication from teosinte and subsequent improvement of maize but may also reflect selection in teosinte prior to domestication of maize.

Genomic Sequence and Structural Diversity in Plants, Fig. 2 Restriction enzyme-based method for reducing the effective genome size in resequencing projects. (a) Random shearing method. The genome is randomly sheared into small fragments that are then sequenced. With this method the sequence reads cover the entire genome. (b) Restriction enzyme adjacent sequencing method. In this method the genome is digested with a restriction enzyme and the fragments are sequenced. The sequence reads only cover the portion of the genome adjacent to a restriction enzyme cut site. Restriction enzymes that cut in the low-copy portion of the genome are typically used in this method
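The genome reduction achieved by restriction enzyme adjacent sequencing can be illustrated with an in silico digest. The sketch below is a simplified model under stated assumptions: the cut is placed at the first base of each recognition site, reads extend a fixed length on either side of the cut, and the toy genome and function names are hypothetical. It is not the protocol used in the cited maize studies.

```python
def find_sites(genome, site):
    """Return the 0-based start positions of every occurrence of `site`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions


def adjacent_intervals(genome, site, read_len):
    """Return merged (start, end) intervals within `read_len` bases of a cut site.

    Intervals are half-open; overlapping or touching intervals are merged.
    """
    merged = []
    for pos in find_sites(genome, site):  # positions are already ascending
        lo = max(0, pos - read_len)
        hi = min(len(genome), pos + read_len)
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def covered_fraction(genome, site, read_len):
    """Fraction of the genome that falls inside site-adjacent intervals."""
    covered = sum(hi - lo for lo, hi in adjacent_intervals(genome, site, read_len))
    return covered / len(genome)


# Toy genome with two copies of a six-base recognition sequence.
toy_genome = "A" * 20 + "CTGCAG" + "A" * 20 + "CTGCAG" + "A" * 20
toy_intervals = adjacent_intervals(toy_genome, "CTGCAG", read_len=5)
toy_fraction = covered_fraction(toy_genome, "CTGCAG", read_len=5)
```

In this toy case only 20 of 72 bases are sampled, which is the sense in which the method reduces the effective genome size relative to random shearing.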

Low-diversity regions produced by selective sweeps are consistently observed in domesticated crop species. As a result of linkage, the decreased diversity often extends beyond the target of selection, hampering the ability of modern-day plant breeders to make selective improvement at the linked low-diversity loci. In addition to SNPs and small InDels, genome content variation exists in maize from both PAV and CNV (Lai et al. 2010; Swanson-Wagner et al. 2010). While a common set of genes is present in all accessions (core genome), other genes/sequences are only present in a portion of maize accessions (dispensable genome). This phenomenon of a core and dispensable genome, while originally described in bacteria, has been observed across the Plantae and Animalia kingdoms. Evidence of structural variation has been demonstrated using comparative genome hybridization (CGH) methods (Fig. 3) as well as de novo assembly of whole-genome resequencing reads. Whole-genome sequencing offers a distinct advantage over hybridization approaches for the evaluation of structural variation, namely, that sequences present in the species yet absent in the reference sequence can be identified. Using the CGH approach across 33 maize and teosinte accessions, nearly 4,000 genes exhibiting PAV or CNV relative to the B73 reference sequence were identified (Swanson-Wagner et al. 2010). Additionally, using 6× sequence coverage of six related inbred lines and a combination of mapping and de novo assembly, 157 high-confidence novel genes were identified (Lai et al. 2010). Other sequence-based studies have estimated that the B73 genome may contain less than 70% of the low-copy sequences in the species (Gore et al. 2009). Additional data are needed to determine the full extent of PAV and CNV. In addition to contributing to basic knowledge about plant species and the development of markers for gene cloning and breeding efforts, whole-genome diversity studies, particularly in maize, have begun to elucidate economically important phenomena such as heterosis. Heterosis, defined as the superior performance of hybrids relative to their inbred parents, is at the core of modern-day corn breeding. It is currently believed that complementation of both allelic variation (SNPs and InDels) and structural variation (PAV and CNV) contributes to heterosis (Lai et al. 2010; Swanson-Wagner et al. 2010). Whole-genome diversity studies have identified significant residual heterozygosity in pericentromeric regions of inbred lines and revealed a negative relationship between residual heterozygosity and recombination frequency. In addition, they have determined that these regions of increased residual heterozygosity contain about one-third of all maize genes. These results implicate low recombination rate in

Genomic Sequence and Structural Diversity in Plants, Fig. 3 Analysis of presence/absence variation and copy number variation using a comparative genome hybridization approach. DNA from two genotypes (Genotype 1 and Genotype 2) is labeled with distinct fluorescent dyes and hybridized onto an oligonucleotide microarray. The hybridization intensity for each genotype is determined at each probe in the same way that expression is determined when RNA is hybridized to an oligonucleotide microarray. Using the ratio of the intensities, regions in the genome with higher or lower copy number (presence/absence variation and copy number variation) are identified. Boxes indicate examples of structural variation with increased copy number in genotype 1 (red) and genotype 2 (blue). [Plots: log2(G1/G2) versus genomic position, in two panels labeled “No Structural Variation” and “Structural Variation”]

the retention of residual heterozygosity and the inability to shuffle alleles into favorable combinations and ultimately provide evidence for a pseudo-overdominance mechanism of heterosis in maize (Gore et al. 2009).
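The comparative genome hybridization analysis summarized in Fig. 3 rests on a simple calculation: the log2 ratio of the two genotypes' hybridization intensities at each probe, followed by a search for runs of probes whose ratio departs from zero. The sketch below illustrates that idea only. The threshold of 1.0 (a twofold difference), the minimum run length of three probes, the function names, and the intensity values are all assumptions chosen for illustration; published analyses normalize the data and use dedicated segmentation methods.

```python
import math


def log2_ratios(intensity_g1, intensity_g2):
    """Per-probe log2(G1/G2) from two equal-length lists of positive intensities."""
    return [math.log2(a / b) for a, b in zip(intensity_g1, intensity_g2)]


def call_segments(ratios, threshold=1.0, min_probes=3):
    """Return runs of consecutive probes with |log2 ratio| >= threshold.

    Each result is (start_index, end_index_exclusive, label). The threshold
    and minimum run length are illustrative assumptions.
    """
    segments = []
    run_start = None
    run_sign = 0
    # A trailing 0.0 acts as a sentinel that closes any run still open.
    for i, r in enumerate(list(ratios) + [0.0]):
        sign = 1 if r >= threshold else -1 if r <= -threshold else 0
        if sign != run_sign:
            if run_sign != 0 and i - run_start >= min_probes:
                label = "higher in genotype 1" if run_sign > 0 else "higher in genotype 2"
                segments.append((run_start, i, label))
            run_start = i
            run_sign = sign
    return segments


# Hypothetical probe intensities along a chromosome segment.
g1 = [100, 100, 400, 400, 400, 100, 100, 25, 25, 25, 100]
g2 = [100] * 11
ratios = log2_ratios(g1, g2)
segments = call_segments(ratios)
```

Because every probe is designed from the reference genome, this approach cannot detect sequence that is absent from the reference, which is the limitation of hybridization methods noted in the text.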

Diversity Throughout a Genus: Oryza The discussion thus far has focused on genomic diversity at the species level. Comparing cultivated and wild-related species within a genus can provide additional insights into the full spectrum of potential diversity for a species. Across crop plants, the diversity in wild relatives is much higher than that observed in the related cultivated species owing to loss of variation during domestication and crop improvement. This extensive diversity in wild relatives provides a valuable yet underutilized source of genes and alleles for facing problems of increasing worldwide population size and changing environmental conditions. While new genes for problems such as disease resistance can be identified outside of the genus, working with related species can be advantageous by minimizing crossing barriers and issues of codon bias. The genus Oryza, of which cultivated rice (Oryza sativa) is a member, provides a prime system to explore genus level diversity. Within Oryza there are two cultivated species (O. sativa and Oryza glaberrima) as well as 22 wild species with 10 different genomes including polyploids (AA, comprising cultivated rice and wild relatives; BB; CC; BBCC; CCDD; EE; FF; HHJJ; KKLL; GG). Additionally, cultivated rice has a “gold standard” finished genome, which provides a superior framework for comparative studies. Finally, the genome size of rice is relatively small at 371 Mb with less repetitive content than other plant species with larger genomes. In 2003, the Oryza Map Alignment Project (OMAP) was initiated to explore the genomic diversity that exists in rice and related Oryza species (Goicoechea et al. 2010), similar to the Arabidopsis 1001 Genomes Project at the species level. OMAP generated BAC-based physical maps for four species with the AA genome, the

genome of cultivated rice, and one species each for the other nine Oryza genomes, providing an unprecedented tool for genus-wide diversity analysis in plants. Using paired-end sequencing and mapping to a reference sequence, insertions, contractions, and inversions can be identified based on reads mapping closer together, further away, or in an unexpected orientation, respectively. OMAP uncovered rampant structural variation among Oryza species. In a comparison between three of the wild species and O. sativa, a minimum of 674 contractions, 611 expansions, and 140 inversions were identified (Hurwitz et al. 2010). These insertions and contractions in the Oryza genome are dispersed throughout the genome with an increased frequency in unstable genomic regions (i.e., repetitive heterochromatic sequences and regions with large segmental duplications). Interestingly, genes with pathogen-resistance and antimicrobial properties, among others, were enriched in these variable regions of the genome. While the species in the OMAP study are part of the same genus and have substantial colinearity, their genome sizes differ by nearly threefold among diploid species. Structural variation such as this can play a major role in species diversification and can also provide substantial novel genetic variation for crop improvement and adaptation to new environments. Landraces within a species can also offer a vast array of unexploited genomic variation. Landraces are local varieties that have not gone through the rigors of a formal breeding program selecting for particular traits. Rather, they are shaped predominantly by the environment and local adaptation needs. As a result, landraces maintain much of the diversity that is typically lost during modern breeding. This type of germplasm offers an intermediate between the cultivated accessions and wild species. A study in O. 
sativa comparing 100 Mb of the genome across 20 diverse accessions and landraces identified nearly 160,000 SNPs. This study also identified introgression regions within the accessions reflective of their breeding history. One such introgression included the Sd1 semidwarf gene, important in the green revolution (Mcnally et al. 2009).
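The paired-end logic described above for detecting structural variation can be sketched as a simple classifier. It assumes that both reads of a pair map to the same chromosome, that a concordant pair has the left read on the "+" strand and the right read on the "-" strand, and that the expected span and tolerance are known from the library. The function name, labels, and numeric values are hypothetical; real tools distinguish additional orientations and require support from many read pairs.

```python
def classify_read_pair(left_pos, left_strand, right_pos, right_strand,
                       expected_span, tolerance):
    """Classify one mapped read pair relative to the reference.

    Returns one of: "concordant", "insertion", "contraction", "inversion".
    """
    # Order the two reads by mapped position.
    if left_pos > right_pos:
        left_pos, right_pos = right_pos, left_pos
        left_strand, right_strand = right_strand, left_strand

    # Unexpected orientation: treated here as a candidate inversion.
    if (left_strand, right_strand) != ("+", "-"):
        return "inversion"

    span = right_pos - left_pos
    if span < expected_span - tolerance:
        # Reads map closer together than the library predicts: the sampled
        # genome carries extra sequence between them.
        return "insertion"
    if span > expected_span + tolerance:
        # Reads map further apart: sequence present in the reference is
        # missing from the sampled genome.
        return "contraction"
    return "concordant"


pair_calls = [
    classify_read_pair(1000, "+", 1500, "-", expected_span=500, tolerance=50),
    classify_read_pair(1000, "+", 1400, "-", expected_span=500, tolerance=50),
    classify_read_pair(1000, "+", 1800, "-", expected_span=500, tolerance=50),
    classify_read_pair(1000, "+", 1500, "+", expected_span=500, tolerance=50),
]
```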


There is vast genomic diversity in plants whether at the cultivated species, landrace, or genus level. Understanding the range of diversity that exists and developing creative ways to utilize this diversity in agriculture is a huge challenge. Researchers using cutting edge methods in sequencing, variant identification, and trait association are just beginning to decipher the extent of genomic diversity that exists and how that diversity relates to important traits in sustainability, agriculture, and human health.

Cross-References ▶ Genes and Genomes: Structure ▶ Plant Genome Sequencing Methods ▶ Plant Genomes, Evolution of ▶ Plant Genomes: From Sequence to Function Across Evolutionary Time ▶ Plant Transposable Elements: Beyond Insertions and Interruptions

References 1001 Genomes (2014) 1001 genomes, A catalog of Arabidopsis thaliana genetic variation. Available at http://1001genomes.org. Verified 12 March 2014 Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, Wang X, Ott F, Muller J, Alonso-Blanco C, Borgwardt K, Schmid KJ, Weigel D (2011) Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet 43:956–963 Goicoechea J, Ammiraju J, Marri P, Chen M, Jackson S, Yu Y, Rounsley S, Wing R (2010) The future of rice genomics: sequencing the collective Oryza genome. Rice 3:89–97 Gore MA, Chia JM, Elshire RJ, Sun Q, Ersoz ES, Hurwitz BL, Peiffer JA, Mcmullen MD, Grills GS, Ross-Ibarra J, Ware DH, Buckler ES (2009) A first-generation haplotype map of maize. Science 326:1115–1117 Hurwitz BL, Kudrna D, Yu Y, Sebastian A, Zuccolo A, Jackson SA, Ware D, Wing RA, Stein L (2010) Rice structural variation: a comparative analysis of structural variation between rice and three of its closest relatives in the genus Oryza. Plant J 63:990–1003 Jander G, Norris SR, Rounsley SD, Bush DF, Levin IM, Last RL (2002) Arabidopsis map-based cloning in the post-genome era. Plant Physiol 129:440–450 Kim S, Plagnol V, Hu TT, Toomajian C, Clark RM, Ossowski S, Ecker JR, Weigel D, Nordborg M (2007)


Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat Genet 39:1151–1155 Lai J, Li R, Xu X, Jin W, Xu M, Zhao H, Xiang Z, Song W, Ying K, Zhang M, Jiao Y, Ni P, Zhang J, Li D, Guo X, Ye K, Jian M, Wang B, Zheng H, Liang H, Zhang X, Wang S, Chen S, Li J, Fu Y, Springer NM, Yang H, Wang J, Dai J, Schnable PS (2010) Genome-wide patterns of genetic variation among elite maize inbred lines. Nat Genet 42:1027–1030 Mcnally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE, Stokowski R, Ballinger DG, Frazer KA, Cox DR, Padhukasahasram B, Bustamante CD, Weigel D, Mackill DJ, Bruskiewich RM, Ratsch G, Buell CR, Leung H, Leach JE (2009) Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proc Natl Acad Sci U S A 106:12273–12278 Ossowski S, Schneeberger K, Clark RM, Lanz C, Warthmann N, Weigel D (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18:2024–2033 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, Chen W, Yan L, Higginbotham J, Cardenas M, Waligorski J, Applebaum E, Phelps L, Falcone J, Kanchi K, Thane T, Scimone A, Thane N, Henke J, Wang T, Ruppert J, Shah N, Rotter K, Hodges J, Ingenthron E, Cordes M, Kohlberg S, Sgro J, Delgado B, Mead K, Chinwalla A, Leonard S, Crouse K, Collura K, Kudrna D, Currie J, He R, Angelova A, Rajasekar S, Mueller T, Lomeli R, Scara G, Ko A, Delaney K, Wissotski M, Lopez G, Campos D, Braidotti M, Ashley E, Golser W, Kim H, Lee S, Lin J, Dujmic Z, Kim W, Talag J, Zuccolo A, Fan C, Sebastian A, Kramer M, Spiegel L, Nascimento L, Zutavern T, Miller B, Ambroise C, Muller S, Spooner W, Narechania A, Ren L, Wei S, Kumari S, Faga B, Levy MJ, Mcmahan 
L, Van Buren P, Vaughn MW et al (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115 Swanson-Wagner RA, Eichten SR, Kumari S, Tiffin P, Stein JC, Ware D, Springer NM (2010) Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor. Genome Res 20:1689–1699 TAIR (2014) The Arabidopsis information resource. Available at http://www.arabidopsis.org/. Verified 12 March 2014 The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815 Weigel D, Mott R (2009) The 1001 genomes project for Arabidopsis thaliana. Genome Biol 10:107

Genomic Signature Analysis to Predict Plasmid Host Range Haruo Suzuki1, Celeste J. Brown2 and Eva M. Top2 1 Department of Population Medicine and Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA 2 Department of Biological Sciences, Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, ID, USA

Synopsis Plasmids differ in their ability to maintain themselves in different hosts, and this determines the extent to which they can spread the genes they carry to bacteria of different species in their local environment. The rate at which new sequence information is accumulating means that it is impracticable to test the host range of all new plasmids empirically so ways of predicting host range from their sequence will provide important ways of classifying plasmids and in some cases assessing risk and directing resources. Different bacteria have different genomic sequence signatures of nucleotide composition, and it appears that plasmids that are associated permanently with bacteria of a single signature type tend to acquire that signature while plasmids that move around more between bacterial types are less adapted to any host type. This short entry describes progress in developing this simple concept as an approach to analyzing plasmid genomes and using this information to predict plasmid host range.

Introduction Plasmid host range is a key concept in plasmid biology and typically refers to the taxonomic range of bacterial hosts in which a plasmid can replicate. One can also include the ability to


transfer to such hosts in the definition, but since not all plasmids are self-transmissible or even mobilizable, it should not be an essential aspect. Host range differs widely among natural plasmids, with some being very restricted in their host range [narrow-host-range (NHR) plasmids] and others able to replicate in a much broader range of hosts [broad-host-range (BHR) plasmids]. Understanding the host range of plasmids is important, not only because it helps address basic ecological questions of gene flow among bacteria but also because of the medical interest in the reservoirs and dissemination trajectories of multidrug resistance and virulence plasmids. For example, recent comparative genomic analysis of the 2011 German enterohemorrhagic Escherichia coli (EHEC) O104:H4 outbreak suggests that the highly virulent outbreak clone evolved through gains and losses of chromosomal and plasmid-encoded virulence factors and extended-spectrum β-lactamase (ESBL) genes (Mellmann et al. 2011). The rapid spread of plasmids encoding ESBL and other new β-lactamases together with multiple other resistance genes is of growing concern to health-care professionals. A well-known example is the recently spreading New Delhi metallo-β-lactamase-1 (NDM-1), which is encoded on plasmids in various Enterobacteria and can thus be transferred horizontally. To slow down the alarmingly rapid spread of these unwanted traits in human pathogens, one first needs to understand where the corresponding genes are coming from and how they are so rapidly exchanged in bacterial communities of natural and clinical habitats. Thus, better insight is needed into the range of bacteria in which these plasmids have resided so far, as well as the potential hosts to which they could transfer in the near future. 
While plasmid host range has until recently only been assessed empirically, a few studies have begun to predict candidate hosts and the putative host range of a plasmid based on its genome sequence alone (Suzuki et al. 2008, 2010; Norberg et al. 2011). This short entry focuses on plasmids from the Proteobacteria to describe the rationale for using this novel approach, first results and future directions.


Determining Plasmid Host Range

In the literature, host range is usually expressed only qualitatively as narrow or broad and is generally based on two methods. One method relies on mating assays in the laboratory. A specific plasmid is transferred into a set of recipient strains or environmental samples. The recipient bacteria that acquired the plasmid represent the host range. Another method infers the host range from the diversity of hosts in which a plasmid is found in various habitats. For example, plasmids from the incompatibility (Inc) groups IncF, IncH, and IncI generally have been found in a narrow range of hosts, while IncN, IncP-1, IncQ, IncU, and IncW plasmids have been found in a moderately to extremely broad range of hosts (Suzuki et al. 2010). Among these, IncP-1 plasmids from natural environments are found in members of at least three classes within the phylum Proteobacteria (Alpha-, Beta-, and Gammaproteobacteria), and several have also been experimentally shown to transfer to and replicate in representatives of at least these three classes. However, a rigorous quantitative and uniformly comparable empirical determination of host range will require several adjustments to the usual practices. First, standardized methods for high-throughput testing of plasmid host range using many phylogenetically diverse recipients should be developed. Currently, empirical plasmid host range studies differ substantially in the experimental methods and donor and recipient strains used, making it difficult to compare results. Second, genetic distance of a conserved gene like the 16S rRNA gene is superior to taxonomic richness as a quantitative measure of the diversity of plasmid hosts (Suzuki et al. 2010). These adjustments will result in quantitative measures of host range.
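As a rough illustration of the second adjustment, the sketch below contrasts taxonomic richness with a distance-based summary of host diversity. The host names, class labels, and pairwise 16S rRNA distances are invented for illustration, and using the mean pairwise distance as the summary statistic is an assumption made here, not necessarily the measure used by Suzuki et al. (2010).

```python
from itertools import combinations

# Hypothetical hosts in which one plasmid was found, with their class (illustrative only).
host_class = {
    "host_A": "Gammaproteobacteria",
    "host_B": "Gammaproteobacteria",
    "host_C": "Betaproteobacteria",
}

# Hypothetical pairwise 16S rRNA gene distances (substitutions per site).
dist_16s = {
    frozenset({"host_A", "host_B"}): 0.02,
    frozenset({"host_A", "host_C"}): 0.18,
    frozenset({"host_B", "host_C"}): 0.17,
}


def taxonomic_richness(classes):
    """Count the distinct taxa (here: classes) represented among the hosts."""
    return len(set(classes.values()))


def mean_pairwise_distance(hosts, distances):
    """Average 16S rRNA distance over all unordered host pairs."""
    pairs = list(combinations(sorted(hosts), 2))
    if not pairs:
        return 0.0
    return sum(distances[frozenset(pair)] for pair in pairs) / len(pairs)


richness = taxonomic_richness(host_class)
diversity = mean_pairwise_distance(host_class, dist_16s)
print(f"classes: {richness}, mean 16S distance: {diversity:.3f}")
```

Richness only counts labels, so two closely related hosts from different taxa weigh as much as two distant ones; a distance-based summary avoids this, which is one way to read the preference for 16S rRNA distance stated above.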

Analysis of Genomic Signature of Plasmids and Host Chromosomes

To better understand plasmid host adaptation and the life history of plasmids, i.e., the putative hosts in which the plasmids may have resided, genome characteristics based on G + C content, GC skew, and “genomic signatures” have been compared between plasmids and their known hosts (Wilkins et al. 1996; Thorsted et al. 1998; Campbell et al. 1999; Rocha and Danchin 2002; Wong et al. 2002; van Passel et al. 2006a; Suzuki et al. 2008; Arakawa et al. 2009). The concept of “genomic signature” (or “genome signatures”) refers to the similarity in oligonucleotide composition throughout a genome (Karlin and Burge 1995) and is usually represented as a vector of oligonucleotide (k-mer) frequencies or their normalized derivative indices (Mrazek 2009). For example, dinucleotide (2-mer) frequencies normalized to factor out the embedded mononucleotide frequencies are given by $x_{ij} = f_{ij} / (f_i f_j)$, where i and j each stand for A, C, G, or T. The dinucleotide relative abundance value ($x_{ij}$) is the observed frequency of that dinucleotide ($f_{ij}$) divided by the expected frequency, which is the product of the individual mononucleotide frequencies ($f_i$ and $f_j$). These values combine counts from sliding dinucleotide windows along both strands of the sequence. The number of possible dinucleotide combinations is 16, that of trinucleotides is 64, and that of tetranucleotides is 256. Applications of genomic signature analysis can include taxonomic and phylogenetic classification of individual genomes and metagenome fragments and prediction of hosts for mobile genetic elements such as plasmids, phages, and other viruses (Abe et al. 2003, 2005; Simmons 2008; Willner et al. 2009). A good method for retrieving known plasmid-host pairs based on their genomic signature similarity is the Mahalanobis distance, which takes into account the variance and correlation in genomic signatures. This method is better than alternative distance metrics such as Euclidean distance, Manhattan distance, and its relatives like delta-distance (Suzuki et al. 2008). Dalevi et al. proposed the framework of fixed- and variable-order Markov chains to capture dependencies in genomic signatures (Dalevi et al. 2006a, b). The genomic signature carries a phylogenetic signal (van Passel et al. 2006b; Mrazek 2009). Generally, oligonucleotide compositions
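The dinucleotide relative abundance described above can be sketched in a few lines. This is a minimal illustration that pools counts from both strands; the function names and the example sequences are hypothetical. The delta-distance is taken here to be the mean absolute difference between two 16-value signatures; it is included only because it needs no covariance estimate, whereas the text reports that the Mahalanobis distance performs better.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def dinucleotide_signature(seq):
    """Return {dinucleotide: f_ij / (f_i * f_j)} with counts pooled over both strands."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(base for base in strand if base in BASES)
        # Sliding window of width 2; windows containing ambiguous bases are skipped.
        for a, b in zip(strand, strand[1:]):
            if a in BASES and b in BASES:
                di[a + b] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence must contain at least one unambiguous dinucleotide")
    signature = {}
    for a, b in product(BASES, repeat=2):
        expected = (mono[a] / n_mono) * (mono[b] / n_mono)
        observed = di[a + b] / n_di
        signature[a + b] = observed / expected if expected > 0 else 0.0
    return signature


def delta_distance(sig_a, sig_b):
    """Mean absolute difference between two dinucleotide signatures."""
    return sum(abs(sig_a[k] - sig_b[k]) for k in sig_a) / len(sig_a)


# Hypothetical toy sequences; real analyses use kilobases of sequence.
plasmid = "ATGCGCGATTACGCGTAGCTAGCTAGGATCC"
host = "GGCCGCGCGGATCCGCGGCCTTAAGGCC"
print(round(delta_distance(dinucleotide_signature(plasmid),
                           dinucleotide_signature(host)), 3))
```

Because both strands are pooled, each dinucleotide and its reverse complement (for example AC and GT) receive the same value, so the 16-value vector carries fewer independent components than its length suggests.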

Genomic Signature Analysis to Predict Plasmid Host Range, Fig. 1 Correlation of distances based on 3-mer genomic signature (with 50-kb sequence samples) and 16S rRNA genes among 141 bacterial strains of the Proteobacteria. Only one representative strain was randomly selected for each genus. The two distances are moderately and significantly correlated; i.e., the Spearman rank correlation coefficient was 0.49 (P-value

100). A peculiar feature of the TRPM family is that two of its members (TRPM6 and TRPM7) contain a C-terminal functional protein kinase domain. Therefore, these channels are sometimes named chanzymes. TRPM7 is also unusual in that it displays high permeability to Mg2+ (and other physiologically nonrelevant divalent cations). TRP channels display a wide range of activating mechanisms, often mutually overlapping and ranging from voltage to temperature changes, from stretch to ligand binding. This implies that their physiological roles in excitable and non-excitable cells are considerably diverse. The details cannot be reviewed here. In brief, TRP channels generally mediate environmental signals, both local (i.e., tissue environment) and external to the organism. Hence, these ion channels transduce sensory signals in the broadest sense, including thermo- and nociception. They are widely distributed in the Animal Kingdom, but they are also present in yeast. In mammalian tissues, TRP channels have been implicated in both genetic and nongenetic diseases, including cancer. At least four TRP genes may carry mutations linked to human diseases (Venkatachalam and Montell 2007).

Voltage-Gated H+ Channels

Proton channels were discovered very recently. They are coded by the HVCN1 gene, also known as HV1 or VSOP, which is expressed in cell types as diverse as neutrophils, basophils, B cells, spermatozoa, and airway epithelial cells. HV1 channels are structurally related to the other VG channels. They contain four transmembrane domains (S1–S4), but lack S5 and S6. HV1 thus retains the voltage dependence but not the pore domain, so its permeation mechanism must be radically different from those of the other VG channels. The native channel is thought to be a dimer of HV1 subunits, each containing a proton pore. The N- and C-terminal are both

Two-Pore Domain K+ Channels (K2P)

The prototype, isolated in 1996, is TWIK-1 (tandem of pore domains in a weak inward rectifying K+ channel, Type 1). A K+ channel with two pore domains per subunit was previously identified in yeast (TOK1), but its structure and function are rather different from those of the mammalian K2P. Based on sequence and physiological properties, K2P channels were divided into six subfamilies: TWIK, TREK (TWIK-related K+ channel), TASK (TWIK-related acid-sensitive K+ channel), TALK (TWIK-related alkaline pH-activated K+ channel), THIK (tandem pore domain halothane-inhibited K+ channel), and TRESK (TWIK-related spinal cord K+ channel). The systematic nomenclature is KCNK followed by a specific number for each subtype, e.g., TWIK-1 is KCNK1. The sequence similarity between subfamilies is relatively low, but the general structural features are retained. Each subunit has intracellular N- and C-termini and four transmembrane domains (TMS1–TMS4). Each subunit contains two P-loops, one between TMS1 and TMS2 and the other between TMS3 and TMS4. The functional channel is a dimer of such subunits and contains a K+-selective pore lined by four P-loops. K2P channels are also found in nonmammalian animals, including Drosophila and Caenorhabditis, and in plants. K2P channels constitute background K+ channels, i.e., ion channels that show weak voltage dependence and rectification and thus may contribute prominently to Vrest. More specific physiological roles in different tissues are currently under intense investigation (Enyedi and Czirják 2010).
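Why a background K+ conductance weighs so heavily on Vrest can be illustrated with the Goldman-Hodgkin-Katz voltage equation. The sketch below is only an illustration under stated assumptions: the GHK equation is a standard textbook relation that is not derived in this part of the entry, and the concentrations and relative permeabilities are assumed values, not taken from Table 1.

```python
import math

R = 8.314      # gas constant, J mol^-1 K^-1
F = 96485.0    # Faraday constant, C mol^-1
T = 310.0      # absolute temperature, K (about 37 degrees C)

# Illustrative mammalian concentrations in mM (assumed values).
CONC = {"K_o": 4.0, "K_i": 140.0, "Na_o": 145.0, "Na_i": 12.0,
        "Cl_o": 110.0, "Cl_i": 10.0}


def nernst_mv(c_out, c_in, z=1):
    """Equilibrium potential in mV for an ion of valence z."""
    return 1000.0 * R * T / (z * F) * math.log(c_out / c_in)


def ghk_voltage_mv(p_k, p_na, p_cl, c=CONC):
    """GHK voltage in mV for relative permeabilities to K+, Na+ and Cl-."""
    # Anion concentrations enter with inside and outside swapped.
    num = p_k * c["K_o"] + p_na * c["Na_o"] + p_cl * c["Cl_i"]
    den = p_k * c["K_i"] + p_na * c["Na_i"] + p_cl * c["Cl_o"]
    return 1000.0 * R * T / F * math.log(num / den)


e_k = nernst_mv(CONC["K_o"], CONC["K_i"])
v_low = ghk_voltage_mv(p_k=1.0, p_na=0.05, p_cl=0.45)   # modest background K+ permeability
v_high = ghk_voltage_mv(p_k=5.0, p_na=0.05, p_cl=0.45)  # fivefold larger K+ permeability
print(f"E_K = {e_k:.1f} mV, Vrest = {v_low:.1f} mV -> {v_high:.1f} mV")
```

With these assumed numbers, raising the relative K+ permeability moves the computed potential toward E_K, which is the sense in which background channels such as K2P can dominate Vrest.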

Ion Channels and Transporters, Fig. 3 Schematic structure of ligand-gated cys-loop channels. The top panel shows the basic structure of the typical subunit of cys-loop receptors (i.e., ionotropic receptors for ACh, serotonin, GABA, or Gly), with four transmembrane domains (M1–M4) and a large extracellular N-terminal domain which contributes to form the ligand-binding site. Subunits do not always contain a ligand-binding domain. Arrows indicate transduction of the conformational transition between the ligand-binding pocket and the filter region formed by M2 segments. In nAChRs, about 5 nm separate these channel portions. The top right panel shows the schematic arrangement of five subunits to surround a central pore, lined by M2 segments. Receptors can assemble as homo- or heteropentamers. Different subunit stoichiometries can result in the formation of 2–5 ligand-binding sites. More details about specific receptor types are given in the text

Figure 3 panels: Subunit topology (viewed from side); Subunit assembly around the pore (viewed from top); Heteromeric channels; Homomeric channels.


Ligand-Gated Channels: “Cys-Loop” Receptors

This group comprises four classes of structurally related channels, gated respectively by ACh, serotonin (5-hydroxytryptamine, 5-HT), γ-aminobutyric acid (GABA), and glycine (Gly). These receptors are all pentamers of different or identical subunits that surround a pore permeable to cations (for nAChRs and 5-HT3Rs) or anions (for GABAA and Gly receptors). The general structure of each subunit is thought to be similar to the one determined for nAChRs, with a large extracellular N-terminal domain containing the ligand-binding site, four hydrophobic transmembrane domains (named M1–M4 or TM1–TM4), and a short extracellular C-terminus (Fig. 3). The precise subunit stoichiometry is variable and splice variants are also present. The name “cys-loop” derives from the presence of a disulfide bridge which delimits a loop of 13 conserved amino acid residues at the base of the extracellular domain. This structure participates in transducing the ligand-binding signal to the pore (Miller and Smart 2010).

nAChRs

The muscle nAChR that mediates the postsynaptic response at the neuromuscular junction is a heteropentamer with stoichiometry 2α1β1εδ (in the adult; the embryonic form contains γ in place of ε). Many other subunits have been isolated from the mammalian nervous system (α2–α10; β2–β4). An exception is α8, which was isolated from avian tissue and has not yet been found in mammals. The nAChRs formed by these subunits are thus generally called “neuronal,” but increasing evidence indicates that they are frequently expressed in nonneuronal cells (Table 4). Subunits can associate to form homopentamers (typically of α7 or α9) that contain five ligand-binding pockets or heteropentamers of variable stoichiometry. Typical heteromeric forms contain α4/β2 in the CNS and α3/β4 in the PNS. Stoichiometric ratios of 2:3 and 3:2 are both functional, and which prevails in physiological conditions is a matter of current debate. In heteromeric nAChRs, the ligand-binding site is formed by contributions from adjacent α and β subunits.
The number of ligand-binding sites thus depends on the subunit stoichiometry. Several subunits, such as α5 and α10, do not contain ligand-binding pockets, but have regulatory roles. For an introduction to the literature, see the references in Miller and Smart (2010). Different nAChRs can be functionally distinguished based on kinetics, permeability, and pharmacology. These ionotropic receptors are generally permeable to cations, including Ca2+. The permeability to Ca2+ is especially high in the homomeric subtypes, such as (α7)5, which also desensitize more quickly. Besides mediating excitatory postsynaptic potentials, the nAChRs are often expressed at presynaptic or extrasynaptic locations, where they regulate transmitter release or dendrite excitability. In nonneuronal cells, they often seem to be implicated in controlling cell migration and the exocytosis of autocrine and paracrine factors, such as growth factors.

Ion Channels and Transporters, Table 4 Ligand-gated channels: nAChRs and GluRs

Subunit        Gene               Channel type
α1             CHRNA1             Muscle nicotinic AChR
β1             CHRNB1             Muscle nicotinic AChR
δ              CHRND              Muscle nicotinic AChR
ε              CHRNE              Muscle nicotinic AChR, adult subunit
γ              CHRNG              Muscle nicotinic AChR, embryonic subunit
α2–α10         CHRNA2–CHRNA10     Neuronal nicotinic AChR, also expressed in other tissues
β2–β4          CHRNB2–CHRNB4      Neuronal nicotinic AChR, also expressed in other tissues
NR1            GRIN1              Glutamate receptor (NMDA)
NR2A–NR2D      GRIN2A–GRIN2D      Glutamate receptor (NMDA)
NR3A–NR3B      GRIN3A–GRIN3B      Glutamate receptor (NMDA)
GluR1–GluR4    GRIA1–GRIA4        Glutamate receptor (AMPA)
GluR5–GluR7    GRIK1–GRIK3        Glutamate receptor (kainate)
KA-1–KA-2      GRIK4–GRIK5        Glutamate receptor (kainate)

5-HT Receptors

Among the 5-HT receptors, only 5-HT3 are ion channels, whereas the others are metabotropic receptors coupled to G proteins. The 5-HT3 receptors comprise A, B, and C subunits that can assemble to form heteropentamers. They are expressed in both CNS and PNS and, being nonselective cation channels, generally mediate excitatory synaptic transmission (Miller and Smart 2010).

Anion-Selective Channels: GABA and Glycine Receptors

The GABA receptors constitute a very large subfamily. The ionotropic GABA receptors are collectively named GABAA receptors (Ben-Ari et al. 2007). They are heteropentamers of a variety of subunits, the main being α1–α6, β1–β3, γ1–γ3, and δ. Additional subunits with more localized expression include ε, π, θ, and ρ1–ρ3 subunits. Receptors formed by ρ subunits are sometimes called GABAC. The most common subunit stoichiometry is 2α2βγ(δ). Glycine receptors are much less diverse and are pentamers of only two types of subunits: α1–α4 and β. The most common stoichiometry probably consists of 3 α and 2 β subunits. In physiological conditions, Cl− is the main permeant anion, but HCO3− also contributes significantly. In the adult, these channels are mainly responsible for the rapid postsynaptic inhibitory currents. GABA is the main inhibitory transmitter in the brain, whereas Gly tends to prevail in the spinal cord.
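Because Cl− is the main permeant anion, the direction of the current through these channels at a given membrane potential is set by the Cl− equilibrium potential, ECl. The following minimal sketch applies the Nernst equation; the concentrations and the resting potential are assumed, illustrative values and do not come from this entry.

```python
import math

R = 8.314      # gas constant, J mol^-1 K^-1
F = 96485.0    # Faraday constant, C mol^-1
T = 310.0      # absolute temperature, K

V_REST_MV = -65.0   # assumed resting potential
CL_OUT_MM = 120.0   # assumed extracellular Cl- concentration


def nernst_mv(c_out_mm, c_in_mm, z):
    """Equilibrium potential in mV for an ion of valence z."""
    return 1000.0 * R * T / (z * F) * math.log(c_out_mm / c_in_mm)


def gaba_effect(cl_in_mm):
    """Return (E_Cl in mV, direction of the shift caused by opening a Cl- channel)."""
    e_cl = nernst_mv(CL_OUT_MM, cl_in_mm, z=-1)
    return e_cl, "hyperpolarizing" if e_cl < V_REST_MV else "depolarizing"


# Assumed intracellular Cl- concentrations (mM) for two hypothetical cells.
for label, cl_in in {"low [Cl-]i": 7.0, "high [Cl-]i": 30.0}.items():
    e_cl, effect = gaba_effect(cl_in)
    print(f"{label}: E_Cl = {e_cl:.1f} mV, {effect} relative to {V_REST_MV:.0f} mV")
```

With the low intracellular Cl− assumed in the first case, ECl lies below the assumed resting potential, consistent with the inhibitory role described above.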
However, because of the variability of the intracellular Cl− concentration (and thus of ECl), these channels can also exert excitatory actions when [Cl−]i is sufficiently high. This certainly occurs during the development of the nervous system, when, because of a different expression of Cl− transporters compared to later stages, the ratio [Cl−]o/[Cl−]i is smaller than it is in mature circuits. In these conditions, activating GABAA receptors has depolarizing (and thus excitatory) effects. As is the case for other ligand-gated channels, recent evidence suggests that GABAA receptors are also significantly expressed in nonneuronal tissue, where their physiological roles are unclear.

Ligand-Gated Channels: Glutamate Receptors (GluRs)

GluRs are tetramers of subunits only weakly homologous to those of the cys-loop receptors. Each subunit contains a large extracellular N-terminal domain followed by four α-helical domains (M1–M4). Only M1, M3, and M4 span the plasma membrane, whereas M2 forms a loop that enters into and exits from the cytoplasmic side of the membrane. M2 constitutes the so-called P element, which is thought to form the pore selectivity filter. The ligand-binding site is formed by the extracellular N-terminal domain and the large extracellular loop between M3 and M4. The C-terminus is intracellular. Glu is the main excitatory transmitter in the CNS. Three main families of ionotropic GluRs are known (Table 4), pharmacologically distinguishable by three relatively specific agonists: N-methyl-D-aspartate (NMDA receptors, formed by subunits NR1, NR2A–NR2D, and NR3A–NR3B), α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid (AMPA receptors, formed by subunits GluR1–GluR4), and kainate (KA receptors, formed by GluR5–GluR7, KA-1, and KA-2 subunits). A fourth related family (“orphan”) is formed by δ1–δ2 subunits. These seem to function as a channel only in pathological conditions and the primary ligand is unknown (Wollmuth and Sobolevsky 2004). In the presence of Glu, non-NMDA receptors quickly activate and then desensitize within about 30 ms. They are expressed in most central neurons, in which they mediate the early large component of excitatory postsynaptic currents. They are usually permeable to Na+ and K+, but not Ca2+. Their currents can also be blocked by intracellular polyamines, as is the case of IRK and nAChRs. However, these permeation properties can be modulated by RNA editing of the GluR2 primary transcript.

The NMDA receptors have slower kinetics and thus carry the slow component of the excitatory postsynaptic currents. They are permeable to Na+, K+, and Ca2+ and are modulated by a variety of extra- and intracellular effectors. They generally require extracellular Gly for full activation and are inhibited by extracellular Mg2+ at negative Vm values. Hence, when Glu is released onto a resting neuron, NMDA channels contribute little current unless the stimulus is strong enough to produce large and sustained depolarization (brought about by other ion channels, usually non-NMDA receptors). When synaptic activation is strong enough to displace Mg2+ from the NMDA receptor pore, significant Ca2+ influx takes place. This may cause long-lasting synaptic remodeling through activation of intracellular calcium signals. Activity-dependent synaptic modifications such as these are thought to be implicated in learning and memory. Moreover, sustained levels of extracellular Glu may have neurotoxic effects (excitotoxicity) because tonic Ca2+ influx tends to activate the intracellular pathways that lead to apoptosis. Ionotropic GluRs can also be expressed in non-excitable cells, such as T-lymphocytes, microglia, and dendritic cells, where they regulate phagocytosis, cytokine secretion, and cell migration.

Ligand-Gated Channels: Purinergic Receptors

The purinergic receptors are classified into P1 and P2 receptors, whose main physiological ligands are, respectively, extracellular adenosine and ATP. P2 receptors include ionotropic P2X as well as metabotropic P2Y receptors, which are structurally unrelated. The known P2X subunits (P2X1–P2X7) have intracellular N- and C-termini separated by two transmembrane domains connected by a large extracellular loop. Subunits associate to form homo- and heterotrimeric channels widely expressed in nervous and nonnervous tissue. P2X channels are permeable to cations and thus tend to exert excitatory effects. In this way, they can modulate Glu release and are implicated in sensory transmission (particularly nociception) and inflammation. Other physiological roles of P2X have turned out to be difficult to recognize because of the lack of subtype-specific drugs. This may soon change because of the structural insight provided by the recent X-ray structure of P2X receptors in the closed state (Browne et al. 2010).

Ion Channels and Transporters, Table 5 Voltage-gated chloride channels in mammals

Channel type    Gene      Channel distribution
CLC-1           CLCN1     Skeletal muscle
CLC-2           CLCN2     Broad
CLC-Ka          CLCNKA    Kidney; inner ear
CLC-Kb          CLCNKB    Kidney; inner ear
CLC-3           CLCN3     Neurons; broad
CLC-4           CLCN4     Broad
CLC-5           CLCN5     Kidney
CLC-6           CLCN6     Nervous system
CLC-7           CLCN7     Osteoclasts; broad

Epithelial Na+ Channel (ENaC)/Degenerin (DEG) Family

This heterogeneous family of tetrameric channels is formed by subunits that generally contain two transmembrane segments (M1–M2), a large extracellular loop with several cysteine-rich domains, and short intracellular N- and C-terminal segments. It comprises ENaC, the acid-sensing ion channel (ASIC) expressed throughout the nervous system, and several channel types expressed in invertebrates, such as the Caenorhabditis DEG (Kellenberger and Schild 2002). These channels are involved in physiological functions as different as sensory transduction and transepithelial transport. In particular, ENaC is highly selective for Na+ and mediates apical Na+ reabsorption in tight epithelia, such as those lining the nephron distal tubule and cortical collecting duct. The subunit stoichiometry is 2αβγ. ENaC is virtually voltage independent and constitutively active at Vrest. However, its activity and expression are potently regulated by hormones such as aldosterone and vasopressin, which control blood pressure and volume.

VG Chloride Channels

The prototypic VG Cl− channel (named CLC-0) was cloned in Torpedo and led to the identification of nine mammalian homologous subunits (CLC-1 to CLC-7, CLC-Ka, and CLC-Kb; Table 5). CLC-0 and the mammalian CLC-1, CLC-2, CLC-Ka, and CLC-Kb are certainly Cl− channels, activated by membrane depolarization and permeable to different anions. CLC-3, CLC-4, and CLC-5 are expressed on the membrane of intracellular vesicles and are thought to function as Cl−/H+ antiporters. However, evidence about CLC-3 is controversial, as several lines of evidence indicate that it may be a double-barreled Cl− channel, with physiological roles both in neuronal function and cell volume control. CLC-6 and CLC-7 are detected in various tissues, but their function is unclear because attempts at expressing them in heterologous systems have so far been unsuccessful. CLC proteins are overall unusual because the same structural family combines channel and transporter features, which probably reflects the evolutionary origin of these proteins from H+/Cl− exchangers. In general, the CLC protein is formed by two identical subunits, each containing 18 α-helical segments named A–R. Both the N- and the C-termini are intracellular. The segments A–I (the N-terminal array) are homologous to J–R (C-terminal), but the two halves of the subunit have opposite orientation in the membrane. Further detailed structural information has been obtained from the crystal structure of a prokaryotic CLC channel. In general, CLC proteins contain two independent conduction pores and are expressed in a variety of cells, where they regulate Vm, cell volume, pH (particularly in intracellular vesicles), and transepithelial Cl− flux (Chen and Hwang 2008).

CFTR (Cystic Fibrosis Transmembrane Conductance Regulator)

CFTR belongs to the structural family of the ABC (ATP-binding cassette) transporters (see later) and has little resemblance to the other Cl− channels. However, studies in reconstituted systems have shown that it is a bona fide Cl− channel. CFTR is involved in transepithelial transport, particularly in secretion. It has been intensely studied because several mutations in the CFTR gene can cause cystic fibrosis, a very grave and unfortunately common genetic disease. CFTR is a large protein of about 1,500 amino acids, comprising two membrane-spanning domains (MSD1 and MSD2), each containing six putative transmembrane segments (M1–M6). Both the N- and C-termini are intracellular. Each of the two M6 segments is followed by an intracellular domain containing the nucleotide-binding domain. Therefore, CFTR contains two ATP-binding domains, whose effect on channel Po is thought to be controlled by the phosphorylation of several regulatory sites mainly located in the loop between the first nucleotide-binding domain and the M1 segment of MSD2 (Chen and Hwang 2008).

Intracellular Ion Channels and Gap Junctions

As mentioned above, ion channels are also expressed by intracellular vesicles and organelles, including the nucleus. Some of these intracellular channels belong to molecular families different from those already discussed. Thorough studies have been carried out on the complex regulation of [Ca2+]i homeostasis, which affects exocytosis, muscle contraction, cell motility, synaptic remodeling, the cell mitotic cycle, and other processes. Two main classes of Ca2+ channels have been found to mediate Ca2+ release from the endoplasmic and sarcoplasmic reticulum, i.e., channels gated by cytoplasmic IP3 (IP3Rs) and ryanodine receptors (RyRs) gated by cytoplasmic Ca2+ and modulated by compounds such as cyclic ADP ribose. These channels are tetramers of subunits showing different degrees of homology with VG channels and retaining a P-loop. They show complex regulation by Ca2+ itself and a variety of intracellular effectors. Particularly interesting is NAADP, the most potent mobilizer of intracellular calcium. Recent results indicate that this second messenger can target a new class of intracellular Ca2+ channels, localized on acidic organelles (lysosomes and endosomes) and named TPC (two-pore channels; Zhu et al. 2010). The TPC subunit bears homology to the VG channels, but contains two modules with S1–S6 and the P-loop. Therefore, each subunit contains two P-loops. Regulation of Ca2+ transport across the intracellular membranes is tightly coupled to that occurring through the plasma membrane. For instance, CRACs (Ca2+ release-activated Ca2+ channels) mediate Ca2+ influx from the extracellular space depending on the state of intracellular Ca2+ stores. For an introduction to the literature, see Zhu et al. (2010).

Finally, in many tissues, adjacent cells are connected by gap junctions constituted by two hexameric hemichannels, one residing in the plasma membrane of each cell. Gap junctions are not selective ion channels, as their wide pore (about 2 nm) also provides a passageway for polar molecules with a molecular weight up to approximately 1 kDa, which can include sugars, amino acids, nucleotides, and oligopeptides. Moreover, their gating kinetics is very slow. Gap junctions constitute the electrical synapses, important to coordinate nearby excitable cells such as cardiac myocytes. Cells connected by gap junctions can also be coordinated by the diffusion of molecular building blocks, nourishment, or regulatory compounds, which occurs in many adult and developing tissues.

Ion Channels in Disease

Channel malfunction can have manifold pathological effects. In a low percentage of cases, such alteration depends on mutations of channel-coding genes, thus leading to the so-called channelopathies. Although channelopathies are congenital diseases, the symptoms can manifest themselves even many years after birth. Such a delay may have many causes, such as slow developmental alterations (the maturation of cerebral circuits is particularly slow) or the requirement of concomitant facilitating environmental stimuli. In principle, a mutation can affect any of the factors in Eq. 4 and thus modify the total current through different molecular mechanisms. Because it is generally impossible to predict the effect of a mutation on a protein’s function, an important first step towards understanding the pathogenesis is usually a thorough study of the mutant channel properties in some expression system. This procedure is also applied to other mutant proteins linked to pathologies, but ion channels offer special advantages because their properties can be studied in living cells with highly sensitive methods. Moreover, the mechanistic insight offered by single-channel recording, when feasible, allows better discrimination between the different factors in Eq. 4 and thus yields invaluable information for pharmacologic treatment. In the most fortunate cases, the subsequent production of a transgenic murine model yields strains with pathological features comparable to those observed in humans. Evidence about channelopathies is now very ample and examples can be given of harmful mutations changing one or more of the factors summarized in Eq. 4. Particularly ample evidence regards epilepsy, neuromuscular disorders, and cardiac arrhythmias, but the range of channelopathies is by no means limited to excitable cells. Classical examples are the abovementioned cystic fibrosis, caused by mutant CFTRs (Chen and Hwang 2008), and pseudoaldosteronism (or Liddle syndrome), which is caused by mutant ENaC channels (Kellenberger and Schild 2002). Channelopathies of course constitute only a minority of the diseases in which ion channels are implicated. In fact, most antiepileptic and antiarrhythmic drugs modulate ion channels, and modern electrophysiological methods permit the study of the mechanism of action of these drugs in great detail. It should also be clear from the previous paragraphs that ion channels are implicated in the physiology and pathology of non-excitable tissues. For example, proper regulation of the renal ENaC is fundamental for normal Na+ reabsorption and control of blood pressure. Another example is provided by HV1, which was first identified in neutrophils because of its role in mediating the transmembrane charge balancing during the production of reactive oxygen species by NADPH oxidase during the phagocytic oxidative burst.
What is more, evidence accumulated during the last 2–3 decades indicates that different channel types are implicated in every stage of the neoplastic progression, from cell cycle control, to secretion of growth factors, to cell migration and invasiveness. Devising ever more specific and effective channel-targeting drugs is in fact a most active and promising field of pharmacological research.

Ion Channels and Transporters, Fig. 4 Example of primary and secondary active transporters. The stoichiometry of the sodium pump, of the Na+/Ca2+ and Na+/H+ exchangers, and of the synaptic glutamate transporters is shown, as indicated. “Out” labels the extracellular side

Figure 4 panels: Primary Active Transport: the Na+/K+ ATPase; Secondary Active Transport: ion exchangers.

Establishing the Ion Gradients

Steady-state ion gradients (Table 1) are maintained across the plasma membrane mainly by the action of the so-called primary active ion transporters, also known as ion pumps. These are integral membrane proteins that undergo cycles of conformational changes driven by ATP consumption or other energy sources, which transfer ions across the plasma membrane, against their electrochemical gradient. The typical resting extra- and intracellular ion concentrations are approximately the same throughout mammals. The ensuing transmembrane ion gradients are exploited by ion channels to allow passive ion fluxes across the membrane and by secondary active transporters to transport some other ion (e.g., H+) or organic compound (e.g., glucose or amino acids) against its (electro)chemical gradient (Fig. 4). The terms cotransport and symport are generally used to mean that solutes are transported in the same direction, whereas the terms countertransport, antiport, or exchange denote solutes flowing in opposite directions.

Figure 4 panel: Secondary Active Transport: glutamate transporters (GLT1, GLAST).

Primary Active Transporters (Ion Pumps)

The energy which drives the conformational cycles that produce ion transport can come from redox reactions (e.g., in the mitochondrial inner membrane), light (e.g., in bacteriorhodopsin), or ATP hydrolysis (Morth et al. 2011). As the former two mechanisms are discussed in other sections, the latter will be illustrated here. ATP-driven transporters can be subdivided into P-type (undergoing phosphorylation during the transport cycle), V-type (expressed in vacuoles and other intracellular organelles), F-type (mitochondrial coupling factors), and ABC transporters (Table 6).

Ion Channels and Transporters, Table 6 ATPase ion pumps

Pump class    Description                                                       Examples
P-type        Phosphorylated intermediate, blocked by La3+ and orthovanadate    Ubiquitous Na-K and Ca pumps, stomach H-K pump, proton pumps, Cu-ATPases
V-type        Vacuolar type, proton pumps                                       Synaptic vesicles, lysosomes, etc.
F-type        ATP synthase                                                      Mitochondria, chloroplasts, bacteria
ABC type      Functionally heterogeneous, contain an ATP-binding cassette       Multidrug resistance (MDR) protein, ER peptide transporters, CFTR

The Na/K ATPase

Evidence for the active transport of sodium was obtained in the early 1950s, but the first crystal structure, epitomizing more than 60 years of work on this transporter, came out in 2007 (Morth et al. 2011). In brief, at each cycle the sodium pump extrudes 3 Na+ ions and imports 2 K+ ions. Both ions are transported against their electrochemical gradient, and the energy is provided by the hydrolysis of one ATP molecule. In this way, the sodium pump maintains the transmembrane concentration gradients for Na+ and K+, which are essential for membrane excitability and control of cell volume. Moreover, these gradients can be exploited by other transporters to force the counter-gradient transport of some other solute (see later).

Because the transport stoichiometry is not 1:1, the sodium pump produces a net transmembrane current; that is, it is an electrogenic transporter. The sodium pump therefore contributes to Vrest in two ways. First, it gives a major, but indirect, contribution by setting the ion gradients that are exploited by the ion channels implicated in regulating Vrest. Second, it contributes a smaller direct effect by constantly producing an outward current, which accounts for a few mV of Vrest.

The electrogenic nature of the sodium pump is conceptually important for another reason. Because an electrogenic transporter produces a net charge flow across the membrane, it is possible to stop the pump by setting the membrane potential at a value sufficiently negative to block the outward current. This is the pump equilibrium potential (Veq). Veq can be calculated by equating the energy released by ATP hydrolysis (approximately 60 kJ/mol in physiological conditions) with the energy released by transporting 3 Na+ into the cell in exchange for 2 K+ (i.e., when the pump operates in reverse mode). The latter is given by the electric component zFVm plus the component that depends on the concentration gradients. When the energy released by ATP hydrolysis equals the energy released by ion transport in the reverse mode, it is energetically indifferent for the pump to transport the ions as usual by consuming ATP or to transport the ions in reverse mode and synthesize ATP. In these conditions, Vm is equal to Veq, the pump equilibrium potential:

$$3\left(RT\ln\frac{[\mathrm{Na^+}]_o}{[\mathrm{Na^+}]_i} - FV_{eq}\right) + 2\left(RT\ln\frac{[\mathrm{K^+}]_i}{[\mathrm{K^+}]_o} + FV_{eq}\right) = 60\ \mathrm{kJ\,mol^{-1}} \qquad (5)$$

where z, F, R, and T have their usual meanings. In physiological conditions, Veq turns out to be approximately -240 mV. Expressions such as these can be used either for calculating Veq with fixed ion concentrations or to find the ratio of ion concentrations at a given Vm. In the case of the sodium pump as well as other P-type transporters (which are often electrogenic), Veq is so negative as to be physiologically irrelevant. However, reversal potentials well within the physiological range are possible for many other electrogenic transporters, such as the Na+/Ca2+ exchanger, the Na+/HCO3- cotransporter, and many neurotransmitter transporters. This point is further discussed later.

Some Structure-Function Features of the P-Type ATPases

The structures of the Na+/K+ ATPase, the Ca2+-ATPase, and the H+-ATPase are very similar. The H+/K+ ATPase, which extrudes H+ into the stomach lumen during digestion, is also closely related to the Na+/K+ ATPase (Morth et al. 2011). In humans, the latter is a dimer of an α and a β subunit. Four isoforms (α1–α4) are known for the former and three (β1–β3) for the latter. The most common form in mammalian cells is α1β1. The α subunit exerts the main catalytic function, whereas β regulates insertion into the plasma membrane, as well as ion affinity and transport kinetics. Different isoforms are expressed in different cell types and confer somewhat different transport properties. For example, the glial Na+/K+ ATPase has a higher affinity for K+, compared to the neuronal isoform. Therefore, it comes into play when [K+] in the cerebrospinal fluid significantly increases above the physiological levels. This may occur, for example, during abnormal high-frequency neuronal firing, during which the capacity of the neuronal pumps to reabsorb K+ may saturate. Other subunits are also known (e.g., γ).

P-type ATPases have three cytoplasmic domains: P (phosphorylation), N (nucleotide binding), and A (actuator, i.e., the ATPase domain). The core of the ion transport domain is constituted by six transmembrane segments (M1–M6). This is often followed by a C-terminal transmembrane domain constituted by four supplementary segments (M7–M10). The transmembrane segments contain the ion-binding sites. Many P-type pumps also have an inhibitory regulatory R domain. For details about the pumping cycle (Post-Albers cycle), the reader is referred to the specialized literature (Morth et al. 2011). In brief, the ion-binding sites in the middle of the structure are thought to be alternately accessible from each side of the plasma membrane through hemichannels that open and close during the transport cycle.

The Ca2+-ATPases

The calcium pumps help to maintain a low free cytosolic calcium concentration (Table 1) by extruding Ca2+ into the extracellular space or accumulating it in the endoplasmic (ER) or sarcoplasmic (SR) reticulum. Accordingly, plasma membrane Ca2+-ATPases (PMCA) as well as SR/ER Ca2+-ATPases (SERCA) are known. SERCA was the first P-type ion pump for which a crystal structure was obtained. The calcium pumps cooperate with the plasma membrane Na+/Ca2+ exchanger and a wide variety of calcium channels (expressed in both the plasma membrane and intracellular organelles) to regulate calcium homeostasis. More recently, a secretory pathway Ca2+-ATPase (SPCA) was also identified. The latter is also able to transport Mn2+, thereby supplying this ion to the enzymes that use it as a cofactor in the Golgi system and regulating the cytosolic Mn2+ levels.

The catalytic mechanism is similar in these pumps. In terms of stoichiometry, SERCA transports 2 Ca2+ into the reticular cisternae in exchange for 1 H+. PMCA is also thought to work as a Ca2+/H+ exchanger, but the precise stoichiometry (and thus the electrogenicity) is still debated (Brini and Carafoli 2009). Each type of Ca2+-ATPase is encoded by a multigene family, and diversity is further increased by alternative splicing. Different isoforms differ in distribution, regulation, and functional features. Several genetic pump defects have recently been observed that lead to more or less severe alterations in calcium signaling. Alterations in the function of calcium pumps are also implicated in nongenetic pathologies (Brini and Carafoli 2009).

Other Ion Pumps

Proton pumps are discussed in the acid transport section. In brief, P-type proton ATPases in the plasma membrane are not very common in mammalian cells, whereas in plants and prokaryotes they establish the transmembrane H+ gradient that generally takes the role played in animals by Na+ in secondary active transport. The intracellular V-type proton pumps are much more widely distributed in mammals; they acidify the lumen of many intracellular organelles. An example is provided by synaptic vesicles (see later). The F-type pumps are also named ATP synthases and are expressed in the inner membrane of mitochondria, the thylakoid membrane of chloroplasts, and the plasma membrane of bacteria. They are structurally related to the V-type pumps and generally work in reverse mode, i.e., they exploit the transmembrane H+ gradient to synthesize ATP. They are also fully discussed in other sections. Other details are indicated in Table 6. Interestingly, increasing evidence indicates that P-type ATPase transporters also exist for other ions with more specialized functions, e.g., enzyme cofactors, such as Cu2+. Mutation of these transporters can lead to serious diseases (Lutsenko et al. 2007).

Secondary Active Transporters

These couple the free energy released by the flux of one solute (often Na+, in animals) to the counter-gradient transfer of other solutes. A great many secondary active transporters are known, too many to be summarized here. A few examples are shown in Fig. 4. Many of these transporters are discussed in the sections related to sugar transport, pH control, and mitochondrial transport. The following paragraphs briefly discuss the neurotransmitter transporters, which are not treated in the other sections.

Neurotransmitter Transporters

These transporters serve to reabsorb neurotransmitters from the synaptic cleft and to load synaptic vesicles. They have great pharmacological importance as they are common targets of psychoactive drugs (Blakely and Edwards 2012).

Vesicular Neurotransmitter Transporters

ACh is concentrated into the synaptic vesicles by the vesicular ACh transporter (VAChT). VMAT2 (vesicular monoamine transporter 2) loads synaptic vesicles with monoamine neurotransmitters (dopamine, norepinephrine, or serotonin, depending on the cell type). GABA is transported by the vesicular GABA transporter (VGAT) and glutamate is loaded into vesicles by VGLUTs. These vesicular transporters are often labeled in experimental work to identify specific fiber types in the CNS, and transporter isoforms may label different fiber subtypes. In general, the vesicular neurotransmitter transporters contain 12 transmembrane domains and exchange the neurotransmitter for two protons, thus exploiting the pH gradient established across synaptic vesicles by V-type H+ pumps.

Reuptake of Neurotransmitters

Specific transporters reabsorb the neurotransmitter after it has been released into the synaptic cleft. This is necessary to quickly interrupt the signal and permit subsequent reuse of the molecule. ACh is an exception in that it is degraded extracellularly into acetic acid and choline by the enzyme acetylcholinesterase. It is choline which is then taken up by specific choline transporters. All of these transporters exploit the transmembrane Na+ gradient to drive reuptake, but their stoichiometry is anything but simple. From the structural and functional point of view, one can distinguish two main classes (Blakely and Edwards 2012).

Glutamate transporters constitute the first class, comprising eight transmembrane domains. The main members are the glial EAAT1 (also known as GLAST), the glial EAAT2 (or GLT1), and the postsynaptic EAAT3 (or EAAC1). Work carried out in the last 15 years indicates that glutamate entry into the cell is accompanied by cotransport of 3 Na+ and 1 H+ and countertransport of 1 K+ (Fig. 4).

The other class is formed by the neurotransmitter sodium symporters (NSS). NSS belong to the structural superfamily characterized by 12 membrane-spanning domains, which also includes the amino acid transporters. In this case, the neurotransmitter entry into the cytosol is accompanied by 1–3 Na+ ions and 1 Cl-. Therefore, the neurotransmitter transporters are generally electrogenic. In most physiological conditions, the driving force is probably such that the molecule is reabsorbed into the cell. It is however believed that, in certain conditions, nonvesicular neurotransmitter release can occur by reversed uptake.
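The energy balance used for the pump equilibrium potential (Eq. 5) and the remark that exchangers such as the Na+/Ca2+ exchanger reverse within the physiological voltage range can be checked numerically. The following is a minimal sketch and not part of the original treatment. The ion concentrations, the temperature of 37 °C, and the 3 Na+ : 1 Ca2+ exchanger stoichiometry are illustrative assumptions (typical textbook values, not Table 1 of this entry), and all function names are hypothetical.

```python
import math

R = 8.314      # gas constant, J mol^-1 K^-1
F = 96485.0    # Faraday constant, C mol^-1
T = 310.15     # assumed temperature (37 degrees C), in K

# Assumed, illustrative concentrations in mM (not taken from the text)
NA_OUT, NA_IN = 145.0, 12.0
K_OUT, K_IN = 4.0, 140.0
CA_OUT, CA_IN = 1.5, 1e-4


def nernst(z, c_out, c_in):
    """Nernst equilibrium potential (volts) for an ion of valence z."""
    return (R * T) / (z * F) * math.log(c_out / c_in)


def pump_equilibrium_potential(dg_atp=60e3):
    """Solve Eq. 5 for Veq (volts); dg_atp is the ATP hydrolysis energy in J/mol."""
    chemical = (3 * R * T * math.log(NA_OUT / NA_IN)
                + 2 * R * T * math.log(K_IN / K_OUT))
    # 3(RT ln - F*V) + 2(RT ln + F*V) = dg_atp  =>  chemical - F*V = dg_atp
    return (chemical - dg_atp) / F


def ncx_reversal_potential():
    """Reversal potential (volts) of an assumed 3 Na+ : 1 Ca2+ exchanger.

    One net positive charge moves per cycle, which gives E = 3*E_Na - 2*E_Ca.
    """
    return 3 * nernst(1, NA_OUT, NA_IN) - 2 * nernst(2, CA_OUT, CA_IN)


if __name__ == "__main__":
    print(f"Pump Veq: {pump_equilibrium_potential() * 1e3:.0f} mV")
    print(f"Exchanger reversal: {ncx_reversal_potential() * 1e3:.0f} mV")
```

With these assumed concentrations the pump equilibrium potential comes out strongly negative, of the same order as the approximate value quoted above, whereas the exchanger reverses at a moderately negative potential that a cell can actually reach.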

References

Alam A, Jiang Y (2011) Structural studies of ion selectivity in tetrameric cation channels. J Gen Physiol 137:397–403
Ben-Ari Y, Gaiarsa J-L, Tyzio R, Khazipov R (2007) GABA: a pioneer transmitter that excites immature neurons and generates primitive oscillations. Physiol Rev 87:1215–1284
Blakely RD, Edwards RH (2012) Vesicular and plasma membrane transporters for neurotransmitters. Cold Spring Harb Perspect Biol 4:a005595
Brini M, Carafoli E (2009) Calcium pumps in health and disease. Physiol Rev 89:1341–1378
Browne LE, Jiang LH, North RA (2010) New structure enlivens interest in P2X receptors. Trends Pharmacol Sci 31:229–237
Capasso M, DeCoursey TE, Dyer MJS (2011) pH regulation and beyond: unanticipated functions for the voltage-gated proton channel, HVCN1. Trends Cell Biol 21:20–28
Chen T-Y, Hwang T-C (2008) CLC-0 and CFTR: chloride channels evolved from transporters. Physiol Rev 88:351–387
Craven KB, Zagotta WN (2006) CNG and HCN channels: two peas, one pod. Annu Rev Physiol 68:375–401
Enyedi P, Czirják G (2010) Molecular background of leak K+ currents: two-pore domain potassium channels. Physiol Rev 90:559–605
Hille B (2001) Ion channels of excitable membranes, 3rd edn. Sinauer Associates, Sunderland
Hille B, Catterall W (2012) Electrical excitability and ion channels. In: Brady ST, Siegel GJ, Wayne Albers R, Price DL (eds) Basic neurochemistry. Academic, Amsterdam, pp 63–80
Kellenberger S, Schild L (2002) Epithelial sodium channel/degenerin family of ion channels: a variety of functions for a shared structure. Physiol Rev 82:735–767
Lutsenko S, Barnes NL, Bartee MY, Dmitriev OY (2007) Function and regulation of human copper-transporting ATPases. Physiol Rev 87:1011–1046
Miller PS, Smart TG (2010) Binding, activation and modulation of Cys-loop receptors. Trends Pharmacol Sci 31:161–174
Morth JP, Pedersen BP, Buch-Pedersen MJ, Andersen JP, Vilsen B, Palmgren MG, Nissen P (2011) A structural overview of the plasma membrane Na+, K+-ATPase and H+-ATPase ion pumps. Nat Rev Mol Cell Biol 12:60–70
Venkatachalam K, Montell C (2007) TRP channels. Annu Rev Biochem 76:387–417
Wollmuth LP, Sobolevsky AI (2004) Structure and gating of the glutamate receptor ion channel. Trends Neurosci 27:321–328
Zhu MX, Ma J, Parrington J, Calcraft PJ, Galione A, Evans AM (2010) Calcium signaling via two-pore channels: local or global, that is the question. Am J Physiol Cell Physiol 298:C430–C441

K

Key Enzymes Used in Cloning, Some

Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Synopsis

A large number of different enzymes are used in DNA cloning procedures. DNA ligase is used to join two DNA molecules covalently, the key step in constructing a recombinant plasmid from a cloning vector and a DNA insert. The ends of DNA molecules can be modified to allow or prevent ligation by using the enzymes polynucleotide kinase or alkaline phosphatase. Enzymes that catalyze DNA synthesis, including DNA polymerases, reverse transcriptase, and terminal deoxynucleotidyl transferase, are used in many ways in cloning procedures: to modify DNA ends to control ligation, in the polymerase chain reaction, in DNA sequencing, to produce DNA copies of RNA molecules, and in other applications. A variety of nucleases are available that can be used to remove unwanted DNA, to modify DNA ends, and to delete larger portions of a DNA molecule.

Introduction

The isolation, modification, and joining of specific DNA fragments to produce recombinant products are essential steps in DNA cloning applications. The development of cloning technology was critically dependent on many years of basic research on enzymes that catalyze the cellular processes that involve DNA as the substrate. DNA ligases are the enzymes that catalyze formation of phosphodiester bonds that join two DNA molecules covalently. Polynucleotide kinase, alkaline phosphatases, DNA polymerases, and a variety of nucleases, which allow the structure of a DNA molecule to be modified to control its subsequent joining, are also essential in these processes. DNA polymerases and reverse transcriptases are also used in many applications to synthesize and amplify desired DNA sequences and to determine the base sequence of DNA.

© Springer Science+Business Media, LLC 2018 R.D. Wells et al. (eds.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

DNA and RNA Ligases

DNA ligases are found in all organisms and they are encoded in many bacteriophage genomes (Tomkinson et al. 2006). DNA ligases are responsible for catalyzing phosphodiester bond formation between newly synthesized DNA molecules during DNA replication, repair, and recombination. The DNA ligase from bacteriophage T4 is used most commonly for in vitro recombinant DNA applications, but ligases from other organisms and bacteriophages are also available commercially.

DNA ligases catalyze the reactions illustrated in Fig. 1: (1) Sealing nicks in one strand of a double-stranded DNA (dsDNA) molecule by synthesizing a phosphodiester bond between adjacent 5′-phosphoryl and 3′-hydroxyl groups (Fig. 1a). This reaction occurs during DNA replication and repair. (2) Sealing nicks in two double-stranded DNA molecules that are associated non-covalently by base-pairing of "sticky ends" produced by restriction endonuclease digestion (Fig. 1b). (3) Joining the ends of blunt-ended double-stranded DNA molecules (Fig. 1c). Reactions 2 and 3 are routine steps in many cloning applications in which DNA ends formed by restriction endonuclease cleavage are joined by ligation. Ligation of blunt ends is generally less efficient than ligation of sticky ends. Ligation of sticky ends is sensitive to the end structure, and base-pair mismatches inhibit ligation, especially when at the ligation site or upstream of the 3′-terminated ligation substrate (Tomkinson et al. 2006).

Key Enzymes Used in Cloning, Some, Fig. 1 DNA joining reactions catalyzed by DNA ligase. All of the reactions require energy from ATP or NAD+ (not shown; see Fig. 2). A. DNA ligase seals a nick in one strand of a double-stranded DNA molecule by joining a 3′-OH to a 5′-phosphoryl group (P) to form a phosphodiester bond. B. DNA molecules with complementary single-stranded terminal extensions ("sticky ends") associate non-covalently by base pairing, followed by phosphodiester bond synthesis to join both strands covalently. The single-stranded extension can be on the 5′-end as shown, or the 3′-end (not shown; see Fig. 3). C. DNA ligase joins two blunt-ended DNA molecules, regardless of the DNA sequence

DNA ligases catalyze formation of a phosphodiester bond that links a 3′-OH and a 5′-phosphoryl group in one strand of a double-stranded DNA molecule:

$$5'\text{-DNA-OH} + {}^{2-}\mathrm{O_3P}\text{-O-DNA-}3' + \mathrm{H^+} \longrightarrow 5'\text{-DNA-O-}\mathrm{PO_2^-}\text{-O-DNA-}3' + \mathrm{H_2O}$$

This reaction as written is thermodynamically unfavorable at neutral pH (ΔG° = +22.2 kJ/mol, Dickson et al. 2000). DNA ligases therefore require a source of free energy to enable the reaction to proceed in the direction of phosphodiester bond formation. Most DNA ligases use ATP in the reaction, but the ligases from eubacteria use NAD+ (Tomkinson et al. 2006). The complete reaction catalyzed by these enzymes is thus:

$$5'\text{-DNA-OH} + {}^{2-}\mathrm{O_3P}\text{-O-DNA-}3' + \mathrm{ATP} \longrightarrow 5'\text{-DNA-O-}\mathrm{PO_2^-}\text{-O-DNA-}3' + \mathrm{AMP} + \mathrm{PP_i} + \mathrm{H^+}$$

or

$$5'\text{-DNA-OH} + {}^{2-}\mathrm{O_3P}\text{-O-DNA-}3' + \mathrm{NAD^+} \longrightarrow 5'\text{-DNA-O-}\mathrm{PO_2^-}\text{-O-DNA-}3' + \mathrm{AMP} + \mathrm{NMN} + \mathrm{H^+}$$

where PPi is inorganic pyrophosphate and NMN is nicotinamide mononucleotide. Note that NAD+ is not used as a redox cofactor in these reactions, in contrast to the more common metabolic function of NAD+. As for most enzymes that use ATP as a substrate, the enzyme requires Mg2+ and the true substrate is the Mg2+•ATP complex.

The first step in the catalytic mechanism of DNA ligases is formation of a covalent enzyme-AMP adduct when the ε-amino group of an active-site lysine residue acts as a nucleophile to attack either ATP or NAD+ and displace pyrophosphate (PPi) or nicotinamide mononucleotide (NMN) (Tomkinson et al. 2006) (Fig. 2a). The subsequent steps are the same for both types of enzyme. The AMP is transferred from the enzyme to the phosphoryl group at the 5′-end of a DNA strand (Fig. 2b, c). The 3′-OH group then attacks the adenylylated DNA and displaces the AMP, forming the new phosphodiester bond (Fig. 2c, d). The free energy change of the ATP-dependent reaction is estimated to be ΔG° = -26.4 kJ/mol (Dickson et al. 2000).

A common application of DNA ligase in molecular cloning is to join two separate DNA molecules (cloning vector and DNA insert). In general one wants to maximize joining of the insert to the vector while minimizing intramolecular ligation of the two vector ends to form an empty circular vector molecule. DNA ligation has been the subject of both theoretical and experimental studies to enable calculation of DNA concentrations that will maximize bimolecular (intermolecular) ligation relative to the intramolecular reaction (Dugaiczyk et al. 1975; Revie et al. 1988; Shore et al. 1981). Insert-vector joining is a bimolecular reaction and so its efficiency depends on the concentration of the two reactant DNA molecules. The efficiency of the intramolecular reaction depends on the concentration of one DNA end in the vicinity of the other. This in turn depends on the DNA length. Short double-stranded DNA molecules are too stiff to bend into a circle to allow ligation, while the ends of excessively long DNA molecules are unlikely to come near each other due to the flexibility of long double-stranded DNA molecules.

RNA ligases catalyze formation of phosphodiester bonds between two RNA molecules by the same chemical mechanism used by DNA ligases (Wood et al. 2004). The RNA ligase encoded by bacteriophage T4 is available commercially. The phage produces RNA ligase to repair phage-encoded transfer RNAs that are cleaved by host ribonucleases, and for use in some RNA processing reactions. The enzyme can be used in vitro to join two single-stranded RNA (ssRNA) molecules, to seal nicks in double-stranded RNA, and to radiolabel the 3′-end of ssRNA by ligating it to [5′-32P]pCp (Cobianchi and Wilson 1987).
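Because insert-vector joining is bimolecular, ligation reactions are usually planned in molar rather than mass terms. The sketch below converts a DNA mass into picomoles of molecules and of ends, and computes the insert mass needed for a chosen insert:vector molar ratio. It is an illustration only: the average mass of 650 g/mol per base pair is a commonly used approximation, and the fragment sizes, the 3:1 ratio, and the function names are hypothetical choices rather than values from this entry.

```python
AVG_BP_MASS = 650.0  # assumed average molar mass of one base pair, g/mol


def pmol_dsdna(mass_ng, length_bp):
    """Picomoles of linear double-stranded DNA molecules in mass_ng nanograms."""
    # ng divided by g/mol gives nmol; multiply by 1000 to obtain pmol
    return mass_ng / (length_bp * AVG_BP_MASS) * 1000.0


def pmol_ends(mass_ng, length_bp):
    """Picomoles of DNA ends (two per linear molecule)."""
    return 2.0 * pmol_dsdna(mass_ng, length_bp)


def insert_mass_ng(vector_ng, vector_bp, insert_bp, molar_ratio=3.0):
    """Insert mass (ng) that gives molar_ratio insert molecules per vector molecule."""
    return vector_ng * (insert_bp / vector_bp) * molar_ratio


if __name__ == "__main__":
    # Hypothetical example: 50 ng of a 3000 bp vector and a 1000 bp insert
    print(f"vector: {pmol_dsdna(50, 3000):.4f} pmol")
    print(f"insert needed for 3:1: {insert_mass_ng(50, 3000, 1000):.1f} ng")
```

The same mass of a shorter fragment contains proportionally more molecules and therefore more ends, which is why the insert mass scales with the ratio of insert length to vector length.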

Polynucleotide Kinase and Alkaline Phosphatase

The DNA substrates for the reactions catalyzed by DNA ligase must have a phosphoryl group on the 5′-end. Restriction endonuclease products have this structure and can be joined by DNA ligase with no further processing (Fig. 3, reaction 1). However, in some cases a phosphoryl group must be added to the 5′-end to allow ligation. This reaction is achieved by the enzyme polynucleotide kinase (Fig. 3, reaction 5). Alternatively, the 5′-phosphoryl group can be removed from a DNA end to prevent ligation or to control the spectrum of ligation products that may be formed from a mixture of DNA molecules. Alkaline phosphatases are enzymes that are used for the dephosphorylation of DNA ends (Fig. 3, reaction 4).

Key Enzymes Used in Cloning, Some, Fig. 2 Catalytic mechanism of DNA ligase. (a) An active-site lysine residue reacts with ATP (or NAD+ for bacterial ligases) to form an adenylylated enzyme intermediate and inorganic pyrophosphate (PPi). rA = riboadenosine. (b) The 5′-phosphoryl group on the DNA reacts with the adenylyl enzyme to form an adenylylated-DNA intermediate that activates the phosphoryl group for phosphodiester bond formation. (c) The 3′-hydroxyl group attacks the 5′-phosphoryl, displacing AMP and forming the phosphodiester bond that seals the nick in the DNA (d)

The polynucleotide kinase (PNK) encoded by bacteriophage T4 can be used to phosphorylate the 5′-ends of DNA or RNA (Cobianchi and Wilson 1987). The enzyme reacts efficiently with single-stranded DNA (ssDNA) or dsDNA with short 5′-single-stranded extensions left by some restriction endonucleases. It is much less efficient with blunt ends or recessed 5′-ends (Wang et al. 2002). PNK is also used to phosphorylate the 5′-ends of chemically synthesized oligodeoxyribonucleotides, which lack a 5′-phosphoryl group after synthesis, and to prepare radioactively labeled DNA molecules by using [γ-32P]ATP as the substrate.

A DNA strand must bear a 5′-phosphoryl group to be ligatable. Thus removal of the 5′-phosphoryl group prevents ligation of that DNA end. Alkaline phosphatases catalyze hydrolysis of phosphoryl esters, including the 5′-phosphoryl ester on DNA (Cobianchi and Wilson 1987):

$${}^{2-}\mathrm{O_3P}\text{-O-DNA-}3' + \mathrm{H_2O} \longrightarrow \mathrm{HPO_4^{2-}} + \text{HO-DNA-}3'$$

This reaction is useful in a number of applications. One is to prevent ligation of the ends of an empty plasmid cloning vector that has been linearized by restriction endonuclease digestion. The plasmid cannot be recircularized by ligation of its two ends once the 5′-phosphoryl group is removed. The 3′-terminated strands of the vector (with a 3′-OH group) can still be joined by ligation to a DNA insert that has 5′-phosphoryl groups on its ends. The resulting nicked circular ligation product can be introduced into E. coli cells in which endogenous enzymes repair the nicks to produce a covalently closed circular recombinant plasmid molecule. The advantage of this approach is that the number of E. coli colonies from cells transformed with recircularized vectors that do not contain a DNA insert (but that do confer antibiotic resistance on the host cell) is very low. This facilitates identifying cells that contain recombinant plasmids by reducing the background of colonies containing empty vectors. Alkaline phosphatase is also used to remove a 5′-phosphoryl group from a cleaved plasmid DNA so that 32P radiolabel can be added using polynucleotide kinase and [γ-32P]ATP.

Key Enzymes Used in Cloning, Some, Fig. 3 Processing of DNA ends in cloning. Double-stranded DNA molecules with three types of end structure are illustrated in the middle of the figure (5′-overhang, blunt end, and 3′-overhang). These ends can be joined by DNA ligase to produce a joined DNA product (top) (Reaction 1). The 5′- or 3′-single-stranded overhangs can be removed and the termini converted to blunt ends by hydrolysis catalyzed by a 5′-3′- or 3′-5′-exonuclease (Reactions 2a and 2b). The 5′-overhang can also be converted to a blunt end by DNA synthesis catalyzed by a DNA polymerase (DNA Pol), using the recessed 3′-end as the primer and the overhang strand as the template (Reaction 3). The 5′-phosphoryl group can be removed by hydrolysis catalyzed by alkaline phosphatase, which renders the DNA ends unligatable (Reaction 4). Alkaline phosphatase can also remove the phosphoryl group from blunt ends and from the recessed 5′-ends in the molecules with 3′-overhangs. Polynucleotide kinase (PNK) catalyzes addition of a phosphoryl group to the 5′-hydroxyl group, with greatest activity on 5′-overhang ends (Reaction 5). Terminal deoxynucleotidyl transferase (TdT) catalyzes untemplated addition of nucleotide residues to single-stranded 3′-ends (Reaction 6)

Bacterial alkaline phosphatase (BAP) and the phosphatase from calf intestine (CIAP) have been used for many years. These enzymes are non-specific in that they will hydrolyze most phosphoester linkages as well as the phosphoanhydride bonds in ATP. Thus the phosphatase must be removed from or inactivated completely in a DNA sample before any procedure that uses ATP can be done, including ligation or reaction with polynucleotide kinase. These alkaline phosphatases are very stable, and extensive extraction with phenol or treatment with proteinase K is required to inactivate them in a mixture so that they do not inhibit subsequent procedures (Cobianchi and Wilson 1987; Sambrook et al. 1989).

The problems that arise from the stability of the bacterial and calf intestinal alkaline phosphatases have been overcome by using enzymes isolated from organisms that normally grow at low temperatures. An alkaline phosphatase gene from a psychrophilic bacterium found in Antarctica was cloned and the enzyme expressed and purified from E. coli (Rina et al. 2000). The enzyme is active at moderate temperatures (37 °C) and it can be inactivated easily by brief incubation at 50–70 °C. A heat-sensitive alkaline phosphatase from the cold-water shrimp Pandalus borealis is also available commercially (Nilsen et al. 2001).
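The logic of vector dephosphorylation described in this section can be made concrete with a small model. This is a sketch under simplifying assumptions: each double-stranded end is reduced to a single flag recording whether its 5′ terminus carries a phosphoryl group, every end is assumed to have a 3′-OH and a compatible overhang, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnaEnd:
    """One end of a double-stranded DNA molecule; a 3'-OH is assumed present."""
    five_prime_phosphate: bool


def sealed_strands(end_a, end_b):
    """Number of strands (0 to 2) that ligase can seal where two ends meet.

    Each strand of the junction needs a 5'-phosphoryl group, supplied by one
    end, next to the 3'-OH of the partner end.
    """
    return int(end_a.five_prime_phosphate) + int(end_b.five_prime_phosphate)


def alkaline_phosphatase(end):
    """Remove the 5'-phosphoryl group (Fig. 3, reaction 4)."""
    return DnaEnd(five_prime_phosphate=False)


def polynucleotide_kinase(end):
    """Add a 5'-phosphoryl group (Fig. 3, reaction 5)."""
    return DnaEnd(five_prime_phosphate=True)


if __name__ == "__main__":
    vector_end = alkaline_phosphatase(DnaEnd(True))   # phosphatase-treated vector
    insert_end = DnaEnd(True)                         # restriction fragment
    print(sealed_strands(vector_end, vector_end))     # 0: vector cannot recircularize
    print(sealed_strands(vector_end, insert_end))     # 1: joined, one nick remains
    print(sealed_strands(insert_end, insert_end))     # 2: both strands sealed
```

In this model a phosphatase-treated vector joined to a phosphorylated insert leaves one nick at each of the two junctions, which corresponds to the nicked circular product that is repaired after introduction into E. coli.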

Key Enzymes Used in Cloning, Some, Fig. 4 DNA synthesis reaction catalyzed by a DNA polymerase. The enzyme catalyzes nucleophilic attack by the 3′-hydroxyl group on the α-phosphoryl group of a 2′-deoxyribonucleoside triphosphate to form a phosphodiester bond, displacing inorganic pyrophosphate in the process and extending the primer strand by one nucleotide residue. (dN = a 2′-deoxyribonucleoside with one of the four bases A, C, G, or T.) The dNTP must form a Watson-Crick base pair to the base in the template strand (N′). Reverse transcriptases can use either DNA or RNA as the template strand

DNA Polymerases

DNA polymerases are enzymes that catalyze synthesis of the phosphodiester bonds in a DNA strand, using the information in a preexisting DNA single strand as the template (Fig. 4):

$$\mathrm{DNA}_n + \mathrm{dNTP} \longrightarrow \mathrm{DNA}_{n+1} + \mathrm{PP_i}$$

where DNAn represents one strand of a double-stranded DNA molecule, consisting of "n" nucleotide residues, dNTP is one of the four deoxyribonucleoside 5′-triphosphates (dATP, dCTP, dGTP, and dTTP), DNAn+1 is the DNA strand extended by the addition of one nucleotide residue, and PPi is inorganic pyrophosphate (Chen 2014; Kornberg and Baker 1992). These enzymes also require Mg2+ to chelate the dNTP substrate (Mg•dNTP). The reaction proceeds with nucleophilic attack on the dNTP by the 3′-OH group of an existing DNA or RNA strand (the primer strand). The incoming dNTP forms a Watson-Crick base pair with the base in the template strand. Some DNA polymerases dissociate from the DNA substrate after each dNTP incorporation reaction and are said to act distributively. Others act processively, remaining bound to and moving along the DNA to incorporate more than one dNTP for each DNA binding event.

Bacteria such as E. coli encode five DNA polymerases (Chen 2014). DNA polymerase III is responsible for the bulk of DNA synthesis during cellular DNA replication. DNA polymerase III is a remarkable and complex enzyme composed of ten different protein subunits. It has both high fidelity and high processivity, befitting an enzyme that must replicate a chromosome of 4.6 Mbp. DNA polymerase I is a single-subunit enzyme that acts during DNA replication and in some DNA repair pathways. The remaining polymerases (II, IV (aka DinB), and V (aka UmuCD)) are enzymes that act during lesion-bypass DNA synthesis on chemically damaged DNA templates.

Eukaryotes including humans encode a large number of DNA polymerases. The DNA polymerases α, δ, and ε are involved in nuclear DNA replication, DNA polymerase β acts in DNA repair, and DNA polymerase γ replicates mitochondrial DNA. A menagerie of lesion-bypass DNA polymerases (ζ, η, ι, κ, λ, μ, etc.) has also been discovered in eukaryotic cells. The properties of many of these DNA polymerases have been compiled online in the Polbase database (Langhorst et al. 2012).

Many DNA polymerases have intrinsic nuclease activity in addition to the signature DNA synthesis activity.
Most common is 3′-5′-exonuclease activity that has a proofreading function (Fig. 5a). DNA polymerases select and incorporate the correct nucleotide most of the time by Watson-Crick base pairing, typically making a misincorporation error only once in 10⁴ to 10⁶ base pairs synthesized (Kunkel 2004). The proofreading exonuclease activity is able to hydrolyze the phosphodiester bond joining a misincorporated nucleotide in the newly synthesized DNA strand, releasing a nucleoside monophosphate (dNMP) product. The proofreading activity raises the overall fidelity of DNA replication to 1 error in 10⁶ to 10⁸ bp synthesized (Kunkel 2004). Bacterial DNA polymerase I has an additional 5′-3′-exonuclease activity that is useful in nick translation applications (see below).

The DNA polymerases that catalyze the bulk of DNA synthesis in cellular DNA replication are complex multisubunit enzymes. Those that are most useful in DNA cloning are simpler, single subunit enzymes produced by bacteria, archaebacteria, or bacteriophages. DNA polymerases are used in vitro to convert sticky ends to blunt ends by DNA synthesis (Fig. 3, reaction 3) or by using the 3′-5′-exonuclease activity (Fig. 3, reaction 2b), to incorporate radioactive or fluorescent labels into DNA, in the polymerase chain reaction, in DNA sequencing, for site-directed mutagenesis, and in other applications.

DNA Polymerase I

Bacterial DNA polymerase I is a monomeric enzyme (103 kD in E. coli) that has DNA synthesis, 3′-5′-exonuclease, and 5′-3′-exonuclease activities (Patel et al. 2001). It has high fidelity and low processivity (Kornberg and Baker 1992). The enzyme uses the 5′-3′-exonuclease and DNA synthesis activities to remove RNA primers and replace them with DNA during DNA replication. It also acts in the base excision and nucleotide excision DNA repair pathways (Patel et al. 2001). The 5′-3′-exonuclease activity makes Pol I useful in nick translation, a method to incorporate a radioactive or fluorescent label into double-stranded DNA (Fig. 5b). The enzyme can bind to a 3′-OH group at a nick in dsDNA. The 5′-3′-exonuclease activity hydrolyzes the nicked strand in front of the enzyme, starting at the 5′-end, to form a single-stranded gap in the dsDNA. The enzyme catalyzes DNA synthesis using the 3′-OH as a primer terminus and the newly exposed single-stranded DNA as the template. The result is that the nick is moved (translated) on the DNA,


Key Enzymes Used in Cloning, Some

Key Enzymes Used in Cloning, Some, Fig. 5 Nuclease reactions catalyzed by DNA polymerases. (a) The 3′-5′-proofreading exonuclease enables the enzyme to correct replication errors. In this example the enzyme has misincorporated a T opposite a G in the template, creating a mismatched base pair. The 3′-5′-exonuclease active site of the enzyme catalyzes hydrolysis of the phosphodiester bond to release the misincorporated nucleotide as dTMP. The DNA polymerase active site can then extend the resulting 3′-hydroxyl group, adding the correct base (C) most of the time. The 3′-5′-exonuclease can also remove correctly base-paired nucleotides (not shown), particularly in the absence of dNTPs when DNA synthesis is not possible. (b) Nick translation by the 5′-3′-exonuclease activity of bacterial DNA polymerase I. The enzyme binds to a nick in a double-stranded DNA molecule. The 5′-3′-exonuclease activity in the small domain of the enzyme catalyzes hydrolysis of phosphodiester bonds to the 3′-side of the nick (right-hand side in this illustration), releasing short oligonucleotide products. The DNA polymerase activity catalyzes DNA synthesis (heavy line) using the newly exposed template strand. The result is that the nick is moved towards the 3′-end of the nicked DNA strand

with no net gain or loss of DNA. The DNA becomes labeled if the reaction is done in the presence of an α-³²P- or fluorescently labeled dNTP. The DNA strand to the 3′-side of the nick can also be simply unwound and displaced from the template strand as the enzyme synthesizes new DNA behind itself, in a process called strand-displacement synthesis.

DNA polymerase I is cleaved to produce two protein fragments when it is treated with the proteases subtilisin or trypsin (Kornberg and Baker 1992). The large fragment, known as the Klenow fragment, retains the DNA synthesis and 3′-5′-exonuclease activities, while the small fragment has the 5′-3′-exonuclease activity. The Klenow fragment is commercially available and has been used in a variety of applications, including site-directed mutagenesis, DNA sequencing, and filling DNA ends produced by restriction endonucleases that produce recessed 3′-ends (Cobianchi and Wilson 1987). The 3′-5′-exonuclease activities of both DNA polymerase I and the Klenow fragment can be used to remove 3′-overhangs left after restriction enzyme digestion to produce a blunt end.

Bacteriophage DNA Polymerases

DNA polymerases produced by several bacteriophages are used in recombinant DNA applications. The DNA polymerase from bacteriophage T4 is a monomeric enzyme of 114 kD (Sambrook et al. 1989). It has both DNA synthesis and 3′-5′-exonuclease activities, the latter being about 200 times more active than that of DNA polymerase I or the Klenow fragment. It lacks a 5′-3′-exonuclease activity and it does not do strand-displacement synthesis or nick translation. It is useful for degrading 3′-terminated single-stranded DNA, such as that produced by some restriction enzymes (Cobianchi and Wilson 1987).
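The two blunting routes described in this entry (filling in a recessed 3′-end by templated DNA synthesis, and removing a 3′-overhang with the 3′-5′-exonuclease) can be sketched on a toy model of a single duplex terminus. The string representation and the function names `fill_in` and `chew_back` are assumptions made purely for illustration; they do not correspond to any library or published protocol, and the model ignores enzyme kinetics entirely.

```python
# Toy model of one duplex terminus (the right-hand end of a fragment).
# top:    strand written 5'->3'; its 3'-end lies at this terminus.
# bottom: complementary strand written 3'->5', aligned base-for-base
#         under `top`; its 5'-end lies at this terminus.
# Both names and this representation are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq):
    """Return the base-by-base complement (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)


def fill_in(top, bottom):
    """Blunt a 5'-overhang: extend the recessed 3'-end of `top`
    using the protruding part of `bottom` as the template."""
    if len(bottom) <= len(top):
        return top, bottom  # no recessed 3'-end to extend
    overhang = bottom[len(top):]
    return top + complement(overhang), bottom


def chew_back(top, bottom):
    """Blunt a 3'-overhang: remove the unpaired 3'-terminal
    nucleotides of `top` (3'-5'-exonuclease action)."""
    if len(top) <= len(bottom):
        return top, bottom  # no 3'-overhang to remove
    return top[:len(bottom)], bottom


if __name__ == "__main__":
    # 5'-overhang (AATT, as left by EcoRI) filled in by synthesis.
    print(fill_in("ACGTG", "TGCACTTAA"))
    # 3'-overhang (TGCA, as left by PstI) removed by the exonuclease.
    print(chew_back("AAGCTGCA", "TTCG"))
```

In this sketch both routes end with strands of equal length, which is the defining property of a blunt end; the fill-in route preserves the overhang sequence while the chew-back route discards it.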


Treatment of such DNA with T4 DNA polymerase converts sticky ends to blunt ends in the presence of dNTPs by the combined degradation of the 3′-single-stranded end past the 5′-end of the other strand, followed by DNA synthesis to the end of the template strand. The DNA can be labeled in this way if a radioactive dNTP is included. The enzyme is also used in site-directed mutagenesis procedures in which strand-displacement synthesis is undesirable.

The DNA polymerase from bacteriophage T7 is an 80 kD protein that has DNA synthesis and 3′-5′-exonuclease activities (Sambrook et al. 1989). This protein forms a complex with thioredoxin from the E. coli host cell to constitute a very rapid and highly processive DNA polymerase. The enzyme lacks 5′-3′-exonuclease activity and is used in a variety of applications.

Bacteriophage φ29 infects Bacillus subtilis. It has a 20 kb linear dsDNA genome that is replicated by the 66 kD DNA polymerase encoded by gene 2 of the virus (Blanco et al. 1989). This enzyme has DNA synthesis and 3′-5′-exonuclease activities. Most importantly, it acts with very high processivity and is adept at strand-displacement DNA synthesis. These qualities have made the enzyme useful for whole genome amplification (WGA) (Dean et al. 2002), an in vitro method for producing complete copies of all of the DNA in a cell. In one example, genomic DNA from as few as ten human cells was copied by φ29 DNA polymerase to produce a collection of DNA fragments averaging 10 kb in size that encompassed the entire human genome (Dean et al. 2002).

Thermostable DNA Polymerases

The polymerase chain reaction (PCR) is an in vitro process by which a specific DNA molecule can be amplified to high concentration by repeated cycles of synthesis catalyzed by a DNA polymerase. A reaction mixture is cycled through a high temperature (ca. 95 °C) to denature the dsDNA template, followed by incubation at lower temperature to allow DNA primers to anneal to the template and DNA synthesis by a DNA polymerase. The first PCR reactions were done using the Klenow fragment of E. coli DNA polymerase I. Fresh enzyme was added after each denaturation step (Mullis and Faloona 1987; Saiki et al. 1985). The procedure was transformed into the widely used, automated workhorse of molecular biology and other fields when PCR was done using a thermostable DNA polymerase (Saiki et al. 1988). The first such enzyme used for PCR was the DNA polymerase I enzyme isolated from the thermophilic bacterium Thermus aquaticus (Taq polymerase). This organism grows at high temperatures and so the Taq DNA polymerase enzyme retains activity after the denaturation steps, obviating the need to add enzyme for each reaction cycle. Taq polymerase has low processivity and relatively low fidelity due to the lack of 3′-5′-exonuclease activity. The enzyme also has a terminal deoxynucleotidyl transferase (TdT) activity (see below), in which the enzyme adds an untemplated nucleotide (dA) to the 3′-OH end of a double-stranded DNA. The product from PCR using Taq polymerase thus has a single-stranded base (A) on the 3′-ends. This allows for the T/A cloning procedure, in which the PCR product is mixed with a linearized cloning vector that has unpaired T's on both 3′-ends, added using TdT and 2′,3′-dideoxyTTP (ddTTP). The single-stranded ends form A:T base pairs and the two DNA molecules can be joined by ligation.

Thermostable DNA polymerases are available that have been isolated from other thermophilic organisms (Pavlov et al. 2004), including the Pfu polymerase from Pyrococcus furiosus and Vent polymerase from Thermococcus litoralis. These enzymes have higher fidelity than Taq polymerase and they produce blunt-ended DNA products.

DNA Polymerases Used in DNA Sequencing

DNA sequencing technologies use a DNA polymerase to incorporate a modified dNTP during the DNA synthesis process (Chen 2014). The first widely used method, developed by Frederick Sanger, used 2′,3′-dideoxynucleoside triphosphates (ddNTPs). This method was later improved by developing ddNTPs that also carry fluorophores.
The original DNA polymerases used in these procedures were bacterial DNA polymerase I and the enzyme from bacteriophage T7. The active sites of these enzymes were altered


by protein engineering methods to improve their ability to catalyze DNA synthesis using these chemically modified substrates. The proofreading exonuclease activity was also attenuated to prevent degradation of the DNA products. The more recently developed "next-generation" DNA sequencing technologies use dNTPs that carry chemical groups on the 3′-hydroxyl or the phosphoryl groups. DNA polymerases from thermophilic bacteria and archaea (Taq polymerase and the 9°N enzyme from Thermococcus sp. 9°N-7) and from bacteriophage φ29 are used in these applications (Chen 2014).

Reverse Transcriptases

Reverse transcriptases (RT) are DNA polymerases with the unique ability to synthesize DNA using a single-stranded RNA as the template (Telesnitsky and Goff 1997). These enzymes are encoded by retroviruses to convert the viral RNA genome to a double-stranded DNA copy that can be inserted into the host cell chromosome and used as a template for transcription by the host RNA polymerase. RTs encoded by the Moloney murine leukemia virus (MLV) and avian myeloblastosis virus (AMV) are available commercially.

Reverse transcriptases have two enzymatic activities: DNA polymerase and ribonuclease H (RNaseH) (Telesnitsky and Goff 1997). The DNA polymerase can use either RNA or DNA as the primer and as the template for DNA synthesis. A ribonuclease H catalyzes hydrolysis of phosphodiester bonds in the RNA strand of a double-stranded RNA:DNA hybrid molecule. RTs lack both a proofreading 3′-5′-exonuclease activity and the 5′-3′-exonuclease found in DNA polymerase I. The commonly used commercial enzymes act as a single subunit. The DNA polymerase activity resides in an N-terminal domain and the RNaseH activity is in a C-terminal domain. The DNA polymerase activity replicates an RNA template to form an RNA:DNA hybrid. The RNaseH activity can cleave the RNA template strand at 2–15 nucleotide intervals and the 3′-OH groups of the resulting oligoribonucleotides can prime synthesis of a second DNA strand, using the newly synthesized DNA strand


as the template. RTs generally have low processivity, dissociating from the DNA substrate after synthesizing only a few hundred base pairs. They also have low fidelity (1 error in ca. 10⁴ bases synthesized), in part because of the absence of 3′-5′-exonuclease activity. The enzymes are capable of strand-displacement DNA synthesis but they do not carry out nick translation since they lack a 5′-3′-exonuclease activity.

Reverse transcriptases are particularly useful to generate complementary DNA (cDNA) copies of mature eukaryotic messenger RNAs. A large portion of the genes in humans and other higher eukaryotes is intron sequence that does not encode protein. The mature mRNA produced after the introns are removed by RNA splicing can be isolated by virtue of the poly-A tail on the 3′-end of the mRNA. The RNA is converted to a cDNA by RT. The cDNA can then be sequenced to reveal the sequence of that part of the complete gene that actually encodes protein, or inserted into an expression vector for production of the encoded protein.

Terminal Deoxynucleotidyl Transferase

Terminal deoxynucleotidyl transferase (TdT) is a DNA polymerase with the ability to synthesize DNA without a polynucleotide template (Fig. 3, reaction 6) (Motea and Berdis 2010). The enzyme combines dNTPs with the 3′-OH group of a single-stranded DNA molecule to form a phosphodiester bond and pyrophosphate (PPi), as do other DNA polymerases. The enzyme from calf thymus is commercially available. In the cell TdT is involved in generating antibody diversity as it adds random sequence to the ends of DNA molecules during V(D)J recombination. TdT adds any of the four dNTPs (dATP, dCTP, dGTP, or dTTP) in vitro. The enzyme exists in two forms, including a large form that has 3′-5′-exonuclease activity. The commercial enzyme lacks this nuclease activity. The enzyme can use ssDNA as short as three nucleotide residues as a primer. Unlike most DNA polymerases involved in DNA replication and repair, the enzyme acts distributively, meaning that it dissociates from the DNA chain after each nucleotide incorporation reaction (Motea and Berdis 2010).


TdT was used in some of the earliest cloning experiments to modify two DNA molecules to facilitate their joining by ligation (Berg and Mertz 2010). One DNA molecule was treated with TdT and dATP to add a short "tail" of A's to the 3′-end. The 3′-end of the other DNA was tailed with T's by using dTTP. The two DNAs were mixed and they associated due to A-T base pairing of the tails. The excess ssDNA was trimmed by a nuclease and the ends were ligated to join the DNAs covalently. TdT is also used in the TUNEL assay that detects DNA breaks (Motea and Berdis 2010).
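The fidelity figures quoted for the polymerases in this entry (one error in 10⁴ to 10⁶ bases without proofreading, one in 10⁶ to 10⁸ with proofreading, and roughly one in 10⁴ for reverse transcriptases) can be put in perspective by converting them to an expected number of misincorporations per copy of a template. The sketch below is plain arithmetic (template length divided by the number of bases per error); the function name and the 1,000,000-base example template are illustrative assumptions, not values taken from the cited sources.

```python
# Expected misincorporations per single copy of a template, given a
# fidelity expressed as "one error per N bases synthesized".
# The function name and the example template length are assumptions.

def expected_errors(template_length, bases_per_error):
    """Return the mean number of errors in one pass over the template."""
    if bases_per_error <= 0:
        raise ValueError("bases_per_error must be positive")
    return template_length / bases_per_error


# Fidelity ranges as quoted in the text (bases synthesized per error).
FIDELITY = {
    "no proofreading, low end": 10**4,
    "no proofreading, high end": 10**6,
    "with proofreading, low end": 10**6,
    "with proofreading, high end": 10**8,
    "reverse transcriptase (approx.)": 10**4,
}

if __name__ == "__main__":
    template_length = 1_000_000  # hypothetical 1 Mb template
    for label, bases_per_error in FIDELITY.items():
        errors = expected_errors(template_length, bases_per_error)
        print(f"{label}: {errors:g} expected errors per copy")
```

On these assumptions a proofreading enzyme at the high end of its range would be expected to introduce about 0.01 errors per copy of a 1 Mb template, whereas an enzyme at the low end of the non-proofreading range would introduce about 100.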

Nucleases

Nucleases are enzymes that catalyze hydrolysis of phosphodiester bonds in DNA or RNA substrates (Linn et al. 1993; Sambrook et al. 1989; Yang 2011). These enzymes have played a pivotal role in the development of recombinant DNA technologies. Most notable are the restriction endonucleases that are covered in a separate essay. A variety of other nucleases produced by eukaryotes, prokaryotes, and bacteriophages are available commercially. These enzymes, of which only DNases are covered here, differ in their specificity for ss- or dsDNA, the requirement for an end or the ability to act on a closed circular molecule, and the requirement for a divalent metal ion (usually Mg²⁺ or Mn²⁺) for catalytic activity (Yang 2011). Endonucleases are able to cleave bonds internal to a DNA end or in a circular molecule, while exonucleases, in the strictest definition, catalyze hydrolysis of every phosphodiester bond starting at an end to produce mononucleotide products (dNMPs).

Nucleases have found application in many ways in recombinant DNA technology and in DNA analysis. Applications of nucleases include destruction of undesired single-stranded or double-stranded DNA molecules in a mixture, conversion of single-stranded sticky ends produced by restriction endonucleases to blunt ends (Fig. 3, reactions 2a and 2b), removal or shortening of one DNA strand of a double-stranded molecule, and introducing nicks into double-stranded DNA in preparation for nick translation (Nichols 2011; Sambrook et al. 1989).

Single-stranded DNA can be hydrolyzed to mononucleotides using either the RecJ exonuclease or exonuclease I from E. coli. These enzymes attack single-stranded DNA with 5′-3′ and 3′-5′ polarity, respectively. The 3′-5′ proofreading exonuclease activity of the bacteriophage T4 DNA polymerase can be used to degrade 3′-single-stranded sticky ends produced by restriction endonucleases (see above). The fungal S1 and P1 endonucleases are also specific for single-stranded DNA or unpaired regions within a double-stranded molecule (Nichols 2011).

The E. coli RecBCD enzyme, an ATP-dependent enzyme also known as exonuclease V (Dillingham and Kowalczykowski 2008), can be used to destroy linear dsDNA while it has no activity on circular dsDNA. RecBCD can be used to remove linear, unligated DNA from circular ligated recombinant plasmids prior to their introduction in a host cell.

Several nucleases are available that catalyze hydrolysis of one strand of a dsDNA molecule. The Lambda exonuclease encoded by bacteriophage lambda is a highly processive 5′-3′-exonuclease that acts on dsDNA. It will hydrolyze the 5′-terminated strand while leaving the other strand intact, a process called end resection (Cobianchi and Wilson 1987). The enzyme is capable of acting on restriction fragments that have blunt ends and 3′-single-stranded overhangs (5′-recessed ends). Enzymes from bacteriophages T5 and T7 also have 5′-3′-exonuclease activity and are capable of end resection. Exonuclease III (ExoIII) degrades the 3′-terminated strand of a DNA duplex, starting from an end or from an internal nick (Cobianchi and Wilson 1987). ExoIII is also an AP endonuclease that acts in base excision repair in vivo. Bal 31 is a complex enzyme that is an endonuclease on single-stranded DNA or transiently unwound regions of double-stranded DNA. The enzyme can also remove nucleotides from the 3′-ends of linear double-stranded DNA. The two activities together can be used to generate deletions of sequence from the ends of the linear DNA


(Cobianchi and Wilson 1987; Nichols 2011; Sambrook et al. 1989). Nucleases are useful in a variety of other applications to probe DNA structure and DNA-protein complexes. DNaseI is a pancreatic endonuclease that acts on both single- and double-stranded DNA (Cobianchi and Wilson 1987; Nichols 2011). It can be used to introduce nicks at random positions in a dsDNA molecule in preparation for radiolabeling the DNA by nick translation using bacterial DNA polymerase I (see above). DNaseI is also used as a probe of DNA accessibility in footprinting applications to localize binding sites of sequence-specific DNA binding proteins. Micrococcal nuclease is used to characterize chromatin structure and accessibility of DNA bound to nucleosomes, among other applications (Nichols 2011).
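The polarity distinction drawn in this section between 5′-3′-exonucleases (such as Lambda exonuclease or RecJ) and 3′-5′-exonucleases (such as exonuclease III or exonuclease I) can be illustrated with a minimal sketch. It treats a strand as a string written 5′→3′ and removes a fixed number of terminal nucleotides from the appropriate end. The function name `exo_digest` and its parameters are assumptions for illustration; the model says nothing about processivity, substrate preference, or reaction rates.

```python
# Minimal illustration of exonuclease polarity on a single strand.
# The strand is written 5'->3', so index 0 is the 5'-terminal nucleotide.
# `exo_digest` is a hypothetical helper, not part of any library.

def exo_digest(strand_5to3, n, polarity):
    """Remove up to `n` nucleotides from the end attacked by the enzyme.

    polarity "5'-3'": digestion starts at the 5'-end (left of the string).
    polarity "3'-5'": digestion starts at the 3'-end (right of the string).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    n = min(n, len(strand_5to3))  # cannot remove more than is present
    if polarity == "5'-3'":
        return strand_5to3[n:]
    if polarity == "3'-5'":
        return strand_5to3[:len(strand_5to3) - n]
    raise ValueError("polarity must be \"5'-3'\" or \"3'-5'\"")


if __name__ == "__main__":
    strand = "ATGCCGTA"
    print(exo_digest(strand, 3, "5'-3'"))  # 5'-terminal ATG removed
    print(exo_digest(strand, 3, "3'-5'"))  # 3'-terminal GTA removed
```

Applied to one strand of a duplex, the first call corresponds to resection of the 5′-terminated strand and the second to degradation of the 3′-terminated strand, in each case leaving the complementary strand untouched.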

Cross-References ▶ Bacterial DNA Replicases ▶ Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of ▶ Polymerase Chain Reaction ▶ Recombination: Mechanisms, Pathways, and Applications ▶ Restriction Endonucleases

References
Berg P, Mertz JE (2010) Personal reflections on the origins and emergence of recombinant DNA technology. Genetics 184:9–17
Blanco L, Bernad A, Lazaro JM, Martin G, Garmendia C, Salas M (1989) Highly efficient DNA synthesis by the phage phi 29 DNA polymerase. Symmetrical mode of DNA replication. J Biol Chem 264:8935–8940
Chen CY (2014) DNA polymerases drive DNA sequencing-by-synthesis technologies: both past and present. Front Microbiol 5:305
Cobianchi F, Wilson SH (1987) Enzymes for modifying and labeling DNA and RNA. Methods Enzymol 152:94–110
Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P, Sun Z, Zong Q, Du Y, Du J et al (2002) Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 99:5261–5266
Dickson K, Burns C, Richardson J (2000) Determination of the free-energy change for repair of a DNA phosphodiester bond. J Biol Chem 275:15828–15831
Dillingham MS, Kowalczykowski SC (2008) RecBCD enzyme and the repair of double-stranded DNA breaks. Microbiol Mol Biol Rev 72:642–671
Dugaiczyk A, Boyer HW, Goodman HM (1975) Ligation of EcoRI endonuclease-generated DNA fragments into linear and circular structures. J Mol Biol 96:171–184
Kornberg A, Baker TA (1992) DNA replication, 2nd edn. W. H. Freeman, New York
Kunkel T (2004) DNA replication fidelity. J Biol Chem 279:16895–16898
Langhorst BW, Jack WE, Reha-Krantz L, Nichols NM (2012) Polbase: a repository of biochemical, genetic and structural information about DNA polymerases. Nucleic Acids Res 40:D381–D387
Linn SM, Lloyd RS, Roberts RJ (eds) (1993) Nucleases. Cold Spring Harbor Laboratory Press, Plainview, NY
Motea EA, Berdis AJ (2010) Terminal deoxynucleotidyl transferase: the story of a misguided DNA polymerase. Biochim Biophys Acta 1804:1151–1166
Mullis KB, Faloona FA (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155:335–350
Nichols NM (2011) Endonucleases. In: Current Protocols in Molecular Biology. Wiley
Nilsen I, Øverbø K, Olsen R (2001) Thermolabile alkaline phosphatase from Northern shrimp (Pandalus borealis): protein and cDNA sequence analyses. Comp Biochem Physiol B Biochem Mol Biol 129:853–861
Patel P, Suzuki M, Adman E, Shinkai A, Loeb L (2001) Prokaryotic DNA polymerase I: evolution, structure, and "base flipping" mechanism for nucleotide selection. J Mol Biol 308:823–837
Pavlov AR, Pavlova NV, Kozyavkin SA, Slesarev AI (2004) Recent developments in the optimization of thermostable DNA polymerases for efficient applications. Trends Biotechnol 22:253–260
Revie D, Smith DW, Yee TW (1988) Kinetic analysis for optimization of DNA ligation reactions. Nucleic Acids Res 16:10301–10321
Rina M, Pozidis C, Mavromatis K, Tzanodaskalaki M, Kokkinidis M, Bouriotis V (2000) Alkaline phosphatase from the Antarctic strain TAB5. Properties and psychrophilic adaptations. Eur J Biochem 267:1230–1238
Saiki RK, Scharf S, Faloona F, Mullis KB, Horn GT, Erlich HA, Arnheim N (1985) Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230:1350–1354
Saiki RK, Gelfand DH, Stoffel S, Scharf SJ, Higuchi R, Horn GT, Mullis KB, Erlich HA (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239:487–491
Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor
Shore D, Langowski J, Baldwin RL (1981) DNA flexibility studied by covalent closure of short fragments into circles. Proc Natl Acad Sci U S A 78:4833–4837
Telesnitsky A, Goff SP (1997) Reverse transcriptase and the generation of retroviral DNA. In: Coffin JM, Hughes SH, Varmus HE (eds) Retroviruses. Cold Spring Harbor Laboratory Press, Plainview, pp 121–160
Tomkinson AE, Vijayakumar S, Pascal JM, Ellenberger T (2006) DNA ligases: structure, reaction mechanism, and function. Chem Rev 106:687–699
Wang LK, Lima CD, Shuman S (2002) Structure and mechanism of T4 polynucleotide kinase: an RNA repair enzyme. EMBO J 21:3873–3880
Wood Z, Sabatini R, Hajduk S (2004) RNA ligase: picking up the pieces. Mol Cell 13:455–456
Yang W (2011) Nucleases: diversity of structure, function and mechanism. Q Rev Biophys 44:1–93

Kinetics of DNA Damage
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition

Kinetics is a measure of the rate at which a reactive molecule interacts with DNA, either non-covalently or covalently, and also the rate of rearrangement or decomposition of the molecule to products that do not react with DNA. Another factor is any reaction of the modified DNA (to other products).

Discussion

In some cases the chemical entities that react with DNA (R below) are stable in aqueous media and the rates of reaction with DNA can be described by either of two expressions:

\[ \mathrm{R} + \mathrm{DNA} \xrightarrow{\;k\;} \mathrm{R{-}DNA} \]

if the reaction is simply second order, or

\[ \mathrm{R} + \mathrm{DNA} \underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}} [\mathrm{R}\cdot\mathrm{DNA}] \xrightarrow{\;k_{2}\;} \mathrm{R{-}DNA} \]

in cases where a non-covalent complex is formed and then reacts. If the alkylating agent is unstable in H2O, then an additional reaction to consider is the hydrolysis reaction:

\[ \mathrm{R} + \mathrm{H_2O} \xrightarrow{\;k_{\mathrm{hydrolysis}}\;} \mathrm{R'} \]

In some cases all three reactions may be possible and will compete. An example of the complexity of the kinetics occurs with aflatoxin B1 8,9-exo-epoxide (Johnson and Guengerich 1997). The hydrolytic reaction is fast, k ≈ 1 s⁻¹ at neutral pH (Johnson et al. 1996). However, the compound reacts very efficiently with DNA in a reaction that can be described with a Kd of ~6 mM (based on DNA bases) and a "kcat" of 42 s⁻¹ (Johnson and Guengerich 1997) (Fig. 1). Similar kinetic measurements have been reported for a diol epoxide derivative of benzo[a]pyrene (Islam et al. 1987).

Another issue is the kinetics of hydrolysis of reactive chemicals as they move through cells. Chemicals have to move into cells and to the nucleus or be converted to reactive compounds in the cytosolic compartment of the cell. One question often asked is how this is possible. An answer has been published in mathematical form, in the case of aflatoxin B1 8,9-exo-epoxide (Johnson and Guengerich 1997). Second-order rates of encounter limited by diffusion can be calculated by

\[ k_{D} = \frac{4\pi N_{A}\,(D_{A} + D_{B})}{\displaystyle\int_{a}^{\infty} e^{U/kT}\,\frac{dr}{r^{2}}} \]

where r = radius of the atom, D = diffusion of the atom in the medium, N_A is Avogadro's number, T is absolute temperature, and U represents the Coulombic effect which is assumed to be null as the atom is uncharged. Simply put, the reason lies in the diffusion rates and the short distances to travel within cells.
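The competition between adduct formation and hydrolysis, and the diffusion-limited encounter rate for an uncharged species, can both be explored numerically. The sketch below rests on several assumptions that are not stated in the text: binding to DNA is treated as a rapid equilibrium with DNA in large excess, only the free (unbound) compound hydrolyzes, and only the bound complex forms the adduct, so that the fraction reacting with DNA is kcat·[DNA] / (kcat·[DNA] + k_hydrolysis·Kd). For the encounter rate, setting U = 0 reduces the integral to 1/a and gives k_D = 4π·N_A·(D_A + D_B)·a; the factor of 1000 converts m³ to litres so the result is in M⁻¹ s⁻¹. All function and parameter names, and the diffusion coefficient and distance used in the example, are illustrative choices rather than values from the cited work.

```python
import math

AVOGADRO = 6.02214076e23  # mol^-1


def fraction_reacting_with_dna(dna_conc, k_d, k_cat, k_hydrolysis):
    """Fraction of the reactive compound that ends up as DNA adduct.

    Assumed model (not from the source): rapid-equilibrium binding,
    DNA in excess, bound complex reacts at k_cat, free compound
    hydrolyzes at k_hydrolysis. dna_conc and k_d share one unit.
    """
    adduct_flux = k_cat * dna_conc
    hydrolysis_flux = k_hydrolysis * k_d
    return adduct_flux / (adduct_flux + hydrolysis_flux)


def diffusion_limited_rate(d_sum_m2_per_s, a_m):
    """Encounter rate constant for U = 0, in M^-1 s^-1.

    d_sum_m2_per_s: D_A + D_B in m^2/s; a_m: lower integration limit
    (distance of closest approach) in metres.
    """
    k_m3 = 4.0 * math.pi * AVOGADRO * d_sum_m2_per_s * a_m  # m^3 mol^-1 s^-1
    return k_m3 * 1000.0  # 1 m^3 = 1000 L


if __name__ == "__main__":
    # Rate constants as quoted in the text; the DNA concentration is a
    # hypothetical value chosen equal to Kd.
    frac = fraction_reacting_with_dna(dna_conc=6.0, k_d=6.0,
                                      k_cat=42.0, k_hydrolysis=1.0)
    print(f"fraction forming adduct: {frac:.3f}")

    # Hypothetical small-molecule values: D_A + D_B = 1e-9 m^2/s, a = 0.5 nm.
    k_d_lim = diffusion_limited_rate(1e-9, 0.5e-9)
    print(f"diffusion-limited k_D: {k_d_lim:.2e} M^-1 s^-1")
```

Under these assumptions, when the DNA concentration equals Kd the partition depends only on the ratio of the two first-order constants, kcat / (kcat + k_hydrolysis), which is why a compound that hydrolyzes within about a second can still modify DNA efficiently.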


Kinetics of DNA Damage, Fig. 1 Kinetic competition between aflatoxin B1 8,9-exo-epoxide hydrolysis and reaction with DNA (Johnson and Guengerich 1997)

Cross-References ▶ Base Intercalation in DNA ▶ Damaged DNA, Analysis of ▶ DNA Damage, Frequency of ▶ DNA Damage, Types of ▶ Electrophiles, Types of ▶ Hydrolytic, Deamination, and Rearrangement Reactions of DNA Adducts ▶ Selectivity of Chemicals for DNA Damage

References
Islam NB, Whalen DL, Yagi H et al (1987) pH dependence of the mechanism of hydrolysis of benzo[a]pyrene-cis-7,8-diol 9,10-epoxide catalyzed by DNA, poly(G), and poly(A). J Am Chem Soc 109:2108–2111
Johnson WW, Guengerich FP (1997) Reaction of aflatoxin B1 exo-8,9-epoxide with DNA: kinetic analysis of covalent binding and DNA-induced hydrolysis. Proc Natl Acad Sci USA 94:6121–6125
Johnson WW, Harris TM, Guengerich FP (1996) Kinetics and mechanism of hydrolysis of aflatoxin B1 exo-8,9-epoxide and rearrangement of the dihydrodiol. J Am Chem Soc 118:8213–8220

Knockdown ▶ RNA Interference

Knockout Mice ▶ Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols

L

Land Plants ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Linear Motif ▶ Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications

Localized Translation ▶ mRNA Localization and Localized Translation

Long-Term Genetic Silencing at Centromere and Telomeres
Scheherazade Khan and Angela K. Hilliker
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms
Chromatin structure; Gene silencing; Transcription repression

Definition
Certain regions of the chromosome, including the centromere and telomeres, are perpetually maintained in a heterochromatin state. These regions of the chromosome contain not only DNA and histones but also other proteins that help the DNA stay tightly packaged and unable to be transcribed. Centromeres and telomeres perform important functions unrelated to gene expression. Telomeres maintain the stability of each individual chromosome, while centromeres promote the faithful segregation of chromosomes during meiosis and mitosis. The DNA in these regions is tightly packed with proteins that aid their respective functions. These regions, as they are not normally accessible to transcription machinery, do not contain genes.

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Discussion
Many regions of eukaryotic chromosomes are subject to permanent or long-term silencing resulting from changes in chromatin structure. This silencing can be inherited from one generation to the next, but not all of these regions are silenced by DNA methylation, as is the case for genetic imprinting. Centromeres and telomeres are structurally important parts of eukaryotic chromosomes that are maintained in a permanent heterochromatin (silenced) state and are, therefore, naturally devoid of genes. Centromeres are required for proper segregation of chromosomes during meiosis and mitosis. Telomeres, or the ends of eukaryotic chromosomes, have a unique structure that prevents shortening of chromosomes or fusing between chromosomes. Thus, both centromeres and telomeres are essential for maintaining the integrity of the genome.

The proteins responsible for maintaining the heterochromatin state differ between centromeres and telomeres and among species (reviewed in Bühler and Gasser 2009). However, there are two general ideas that are common between centromeres and telomeres. First, specialized histone proteins help maintain or restrict the silencing. For example, a variant of histone H3 is present in centromeres (Bühler and Gasser 2009), while a variant of histone H2A helps restrict heterochromatin at telomeres from spreading (Meneghini et al. 2003). Second, specific DNA sequences in these regions recruit the proteins that promote a silenced state. For example, the TG-repeats in telomere DNA recruit DNA-binding proteins (such as Rap1 and Ku in budding yeast), which in turn recruit the silent information regulator (Sir) complexes. The Sir complex initiates silencing by promoting the deacetylation of adjacent histones to promote heterochromatin formation (reviewed in Bühler and Gasser 2009). While these regions are naturally devoid of genes, if a gene were placed within telomeres or centromeres, it would be silenced, simply by virtue of its proximity to the heterochromatic telomere or centromere structure. This phenomenon is referred to as position-effect variegation.

Cross-References ▶ Long-Term Genetic Silencing at Centromere and Telomeres

References
Bühler M, Gasser SM (2009) Silent chromatin at the middle and ends: lessons from yeasts. EMBO J 28:2149–2161
Meneghini MD, Wu M, Madhani HD (2003) Conserved histone variant H2A.Z protects euchromatin from the ectopic spread of silent heterochromatin. Cell 112:725–736

M

Main Chain ▶ Secondary Structure

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols
Robert Augustin and Eric Mayoux
Department of Cardiometabolic Diseases Research, Boehringer-Ingelheim Pharma GmbH & Co KG, Biberach an der Riss, Germany

Synonyms Glucose homeostasis; Glucose transport facilitator; Glucose transporter; GLUT; Inherited disorders; Knockout mice; SGLT; Sodium-dependent glucose transport; Symporter; Type 2 diabetes; Uric acid

Synopsis This entry summarizes the principal characteristics of sodium-dependent, active (SGLT; Wright et al. 2011) and facilitative (GLUT; Augustin 2010) sugar transport in mammalian cells, primarily focusing on the human transporters. Emphasis is placed on the physiology of such transporters, based on inherited disorders and syndromes in humans and the phenotypic characteristics of genetically modified mice.

Introduction Glucose represents the major energy source of mammalian cells. Due to its hydrophilic nature, glucose requires specific transporters in order to cross cellular membranes. Such transport is, in the case of glucose and also other monosaccharides, mediated by energy-coupled as well as facilitative mechanisms represented by protein families of sodium-driven sugar cotransporters (SGLTs) and glucose transporters (GLUTs), respectively. Active (i.e., SGLT-driven) transport primarily occurs at sites of sugar absorption and reabsorption, within the gastrointestinal (GI) tract and in the kidney, respectively. Glucose homeostasis within the body is mainly maintained by the various members of the GLUT protein family, which comprises 14 isoforms (Uldry and Thorens 2004). Within the GLUT protein family, three different subclasses can be distinguished based on primary sequence comparisons: class I comprises the classical transporters GLUT1–GLUT4 as well as the gene duplication of GLUT3 which is GLUT14, class II contains the isoforms GLUT5, GLUT7, GLUT9, and GLUT11, while GLUT6, GLUT8, GLUT10, GLUT12, and the proton-driven myoinositol transporter HMIT (GLUT13) belong to class III (Joost and Thorens 2001).

Current understanding of whole body glucose homeostasis under normal and, more importantly, under disease conditions is directly linked to the understanding of SGLT and GLUT physiology (Fig. 1). The active mechanism of glucose (as well as galactose) absorption in the intestine is primarily catalyzed by SGLT1 (Fig. 2b), while SGLT2 represents the predominant mechanism for glucose reuptake by the kidney (Fig. 2a). As the brain's main energy source, glucose needs to be transported across the blood–brain barrier. This process is facilitated by and is dependent on GLUT1 (Fig. 1). For insulin-induced clearance of blood glucose through uptake into the skeletal muscle, the heart, and adipose tissue, the rate-limiting step is defined by translocation of intracellular GLUT4 to the plasma membrane, and it is this signaling cascade that directly represents insulin sensitivity (Fig. 3b). The secretion of insulin by the pancreatic β-cells of the islets of Langerhans is dependent on GLUT2, which functions as a β-cell glucose sensor (Fig. 3c). However, the involvement of sugar transporters in the regulation of processes such as brain glucose sensing or glucose transport in the mammary gland is not yet well understood.

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 1 Overview on the principal physiological role for specific glucose transporters and their involvement in regulating glucose homeostasis in specific tissues. Panel labels: brain: central glucose homeostasis predominantly involves GLUT1 and GLUT3; skeletal muscle and adipose tissue: insulin-stimulated glucose disposal is predominantly mediated by GLUT4; liver: hepatic glucose uptake and output occurs mainly via GLUT2; pancreas and brain: glucose sensing mechanisms involve GLUT2; intestine and kidney: SGLTs mediate the active, ATP-dependent hexose absorption in the intestine and reabsorption in the kidney, at the luminal site of the epithelium.

[Fig. 2 panel labels: (a) proximal convoluted tubule and proximal straight tubule; (b) intestine. Elements shown: SGLT1, SGLT2, GLUT1, GLUT2, GLUT5, the Na+/K+ pump (3Na+/2K+, ATP to ADP+Pi), Na+, glucose, and fructose.]

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 2 SGLT and GLUT family members regulate intestinal absorption and renal reabsorption of hexoses. (a) Renal glucose reabsorption. In the kidney, proximal tubule transepithelial reabsorption of glucose


The SGLT Family

Synonyms

SGLT1–SGLT6; gene symbols: SLC5; sodium–glucose symporters

Summary

The model of active, ATP-dependent glucose transport against a concentration gradient was proposed in 1960 by Bob Crane (1960). Intestinal absorption of glucose by the intestinal epithelium requires sodium symport, which is ATP dependent via coupling to the sodium/potassium (Na+/K+) pump. Mechanistically, the inward sodium gradient at the apical site of epithelial cells is maintained by the ATP-driven active extrusion of sodium at the basolateral membrane (Figs. 2 and 4b). The sodium-dependent glucose transporters (SGLTs) are members of a larger gene family (>200 genes) of sodium–solute symporters (SSF) that contain a common SSF motif in the fifth transmembrane region (Wright et al. 2011). The human SGLT protein family (SLC5A) comprises 11 isoforms that are structurally characterized by 14 transmembrane domains, with the N- and C-termini facing the extracellular (luminal) side of the cell. The 11 family members share an amino acid identity of 21–70%. A broad range of substrates is transported by proteins encoded by the SLC5 genes. The focus of the current entry is on the sodium-dependent glucose transporters within the SLC5 gene family, namely, SGLT1–SGLT5, and on the sodium-driven myoinositol transporters SMIT1 (SLC5A3) and SMIT2 (SGLT6/SLC5A11), which are closely related in sequence homology and substrate specificity. More distant relatives within the SLC5A gene family are the iodide transporters NIS (sodium–iodide symporter [SLC5A5]) and AIT (apical iodide transporter [SLC5A8]), the Na+/Cl−/choline transporter (CHT [SLC5A7]), and the sodium-dependent multivitamin transporter (SMVT [SLC5A6]). NIS and AIT are expressed in the thyroid gland. While NIS is responsible for the iodide uptake that is required for the production of T3 and T4, AIT is thought to catalyze the movement of iodide from the thyrocyte cytoplasm to the lumen of the gland. SMVT is widely expressed, while CHT is mainly found in the central nervous system. Biochemically, CHT mediates Na+/choline cotransport in a chloride-dependent manner.

Structural Features and Substrate Specificities

SGLT family members 1–6 contain between 596 and 681 amino acids with 50–70% identity (67–84% similarity); divergence in sequence can be mainly attributed to the N- and C-terminal domains of the proteins. Alternative splicing has been described for SGLT4–SGLT6; however, whether the respective transcripts encode functional proteins with varying amino acid composition has yet to be shown. A common structural component among the large gene family of sodium–solute symporters (SSF) is the presence of a consensus pattern (Fig. 5a). The consensus sequence for the six SGLTs and for SMIT1 is located near the N-terminal domain of the proteins (Fig. 5a). A secondary structural model for human SGLT1 predicts the presence of 14 transmembrane helices. The model is based on N-glycosylation and cysteine scanning mutagenesis, antibody tagging, mass spectrometry, as well as computer algorithms predicting membrane-spanning regions (Fig. 5a). Freeze-fracture electron microscopy provided direct evidence that both SGLT1 and vSGLT (from Vibrio parahaemolyticus) function as 14-transmembrane-helix monomers. The

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 2 (continued) occurs at the apical membrane by SGLT2 and GLUT2 at the basolateral membrane. In the proximal straight tubule, the remaining glucose is reabsorbed by SGLT1 at the apical site of the epithelium and GLUT1 at the basolateral membrane. (b) Intestinal glucose absorption. In the intestine, transepithelial glucose uptake at the apical site is mediated by the Na+-dependent glucose transporter SGLT1, while fructose is absorbed by facilitated diffusion via GLUT5. These hexoses can all exit the basolateral membrane through GLUT2

[Fig. 3 panel titles: (a) hepatocyte; (b) skeletal muscle, adipose tissue; (c) pancreatic β-cell. Labels shown in the figure: fructose, glucose, urate, insulin, IR, IRS1/2, PI3K, Akt, Tbc1d1/4, GSK3, GSV, GLUT1?, GLUT2, GLUT4, GLUT9, fructokinase, glucokinase, glucose-6-P, glycolysis, glycogen synthesis, triglyceride synthesis, gluconeogenesis, ADP, AMP, IMP, ATP/ADP ratio, KATP channel, Ca++ channel, K+, and insulin secretion.]

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 3 GLUT family members facilitate glucose transport into tissues that control glucose homeostasis such as hepatocytes, skeletal muscle, adipose tissue, and the pancreatic β-cells of the islets of Langerhans. (a) Hexose transport in hepatocytes. GLUT2 mediates glucose uptake under feeding conditions into hepatocytes, where glucose is metabolized by glycolysis or incorporated into glycogen. In patients with FBS, fructose handling is normal; therefore, GLUT2 might not be exclusively involved in the uptake of this ketohexose by the hepatocyte (Manz et al. 1987). GLUT9 is highly expressed in the liver, and due to its capability to transport uric acid, its proposed function in humans might be the release of uric acid from the liver. In mice, GLUT9 is required for uric acid uptake into the liver for further breakdown by uricase to allantoin (Preitner et al. 2009). Whether GLUT9 also contributes to hexose transport, namely, of fructose, in hepatocytes is currently unknown. Fructose that is taken up by the liver mainly feeds into triglyceride synthesis and, via ATP depletion, stimulates AMP deaminase and thereby purine degradation, leading to an increased generation of uric acid. (b) Insulin-stimulated glucose uptake into the skeletal muscle and adipose tissue. In muscle and fat, glucose uptake is stimulated by insulin. After insulin binds to its receptor, autophosphorylation of the receptor triggers a signaling cascade that finally leads to a translocation of GLUT4 vesicles from an intracellular pool to the plasma membrane. The acute increase of GLUT4 molecules at the cell surface leads to an increase in glucose uptake and represents the rate-limiting step in insulin-stimulated glucose uptake in adipose and muscle tissues. (c) Pancreatic β-cells secrete insulin in response to elevations in blood glucose. GLUT2 mediates glucose uptake into β-cells. Phosphorylation of glucose by glucokinase is the rate-limiting step of glycolysis, which increases the ATP-to-ADP ratio of the cell, leading to closure of the KATP channel and subsequent opening of the Ca++ channels, presumably caused by changes in plasma membrane polarization. The opening of Ca++ channels raises intracellular Ca++ concentrations and induces exocytosis of the insulin granules

[Fig. 4 panel labels: (a) extracellular, Glc, states 1–4, intracellular; (b) extracellular, Na+, Glc, states 1–6, intracellular.]

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 4 Proposed models for the mechanism of facilitative and active glucose transport across cellular membranes. (a) A 6-state model is proposed for SGLT-mediated glucose transport. The empty transporter is assumed to have a valence of 2 (1). Sugar transport is initiated upon binding of two sodium ions to the open form of the outside gate (2). In the next step, glucose binds to the transporter, which induces a conformational change from an outward to an inward occluded state (3, 4). Upon opening of the inward gate, the glucose is released into the cytoplasm before the sodium (5). The transport cycle is completed by a conformational change to return the ligand-free inward-facing (6) structure to the ligand-free outward-facing structure (1). (b) Model of GLUT-mediated glucose transport. Glucose binds to an outward-facing site of the transporter (1) which induces a conformational change that allows movement of the hexose through the protein (2–3). After the release of the hexose from its inward-facing binding site into the cytosol, the transporter undergoes a reverse conformational change (4–1)
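The six-state cycle described in the caption can be written down as an ordered list of states that wraps around. The following Python sketch is only an illustration of that ordering: the state descriptions paraphrase the caption, and the names SGLT_CYCLE, next_state, and run_cycle are hypothetical helpers introduced here, not part of any published model code. No rate constants are included, so the sketch captures the sequence of states but not the kinetics.

```python
# Minimal sketch of the six-state SGLT transport cycle (ordering only).
# State descriptions paraphrase the Fig. 4 caption; all names are illustrative.
SGLT_CYCLE = [
    "1: ligand-free transporter, outward-facing",
    "2: two Na+ bound, outside gate open",
    "3: Na+ and glucose bound, outward occluded",
    "4: Na+ and glucose bound, inward occluded",
    "5: inward gate open, glucose released before Na+",
    "6: ligand-free transporter, inward-facing",
]


def next_state(index: int) -> int:
    """Return the index of the state that follows `index` in the cycle."""
    return (index + 1) % len(SGLT_CYCLE)


def run_cycle(start: int = 0) -> list:
    """Walk once around the cycle and return the visited states in order."""
    visited = []
    index = start
    for _ in range(len(SGLT_CYCLE)):
        visited.append(SGLT_CYCLE[index])
        index = next_state(index)
    return visited


if __name__ == "__main__":
    for state in run_cycle():
        print(state)
```

The modulo step in next_state encodes the return of the ligand-free inward-facing transporter (6) to the outward-facing form (1) that closes the cycle.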

[Fig. 5 panel labels: (a) SGLT model: luminal/extracellular and intracellular faces, NH2 and COOH termini, N, glucose-binding and translocation domain, extracellular and intracellular glucose binding, SSF signature, SGLT/SMIT motif. (b) GLUT models for class I and II family members and for class III family members: extracellular and intracellular faces, NH2 and COOH termini, N, substrate binding, CB binding, and sequence motifs as printed (GR, SG, L, Q, W, EX(6)RG, GRK/R, PETKG, PESPR, PXXPR, EX(6)R/K, D/ERAGRR, D/EXXGRR, LL).]

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 5 Putative secondary structural models for GLUT and SGLT proteins. (a) SGLT family members contain 14 transmembrane domains (21). Highlighted is the presence of the SSF motif present in all members of the solute symporter family (SSF) gene family. The motif that is shared between the SGLTs and sodium–myoinositol cotransporters (SMITs) is also indicated. The glucose-binding and translocation domain is located at the


recent crystal structure of the sodium/galactose symporter vSGLT demonstrated the presence of 14 transmembrane helices, providing evidence for the secondary structural model of human SGLT1. SGLTs are highly glycosylated membrane proteins; for example, SGLT1 contains an N-linked glycosylation site at position N248. However, for SGLT1, glycosylation appears not to be required for functional expression, indicating proper folding and membrane targeting in its absence.

The transport kinetics and substrate specificities have been intensively studied for SGLT1, SGLT2, SGLT3, and more recently SGLT4, mainly based on electrophysiological and biochemical studies upon heterologous expression of the transporters in Xenopus laevis oocytes. SGLT1–SGLT4, SMIT1, and SMIT2 all transport (or, in the case of human SGLT3, bind) D-glucose and the non-metabolizable alpha-methyl-D-glucopyranoside (α-MDG). Transport of those substrates is inhibited by the glycoside phlorizin (Ehrenkranz et al. 2005). SGLTs are only able to transport sugars with a pyranose ring; cyclic polyhydroxy alcohols are not transported. The importance of the individual hydroxyl groups for substrate recognition has been well characterized. The oxygen is essential for transport by human SGLT1: sulfur substitution lowers affinity, and nitrogen is not tolerated. This particular feature does not apply to SGLT3, for which imino sugars (containing an amine group in place of a hydroxyl group) are ligands. Based on mutational analysis of SGLTs and the crystal structure of vSGLT, which is 32% identical to human SGLT1, residues that coordinate substrate recognition have been shown to be relatively conserved. An exception is human SGLT3, a glucose sensor, which can be converted to a functional transporter by a single amino acid exchange. In addition to glucose and other monosaccharides, SGLTs also transport glycosides. These can be either substrates, such as indican and arbutin, or actual inhibitors, such as the highly potent, classic competitive SGLT inhibitor phlorizin, a naturally occurring β-glucoside (see below). The ability of SLC5 family members to recognize galactose as a substrate has been attributed to the presence of a threonine corresponding to amino acid 460 in human SGLT1.

Ion selectivity and stoichiometry have been well characterized for SGLTs. The transporters are selective for the cotransported cation Na+ (Km = 4 mM), and while Li+ (Km = 9 mM) and H+ (Km = 7 µM) can replace Na+, no other monovalent cation is accepted. The Na+-to-glucose transport stoichiometry is established for SGLT1–SGLT3: two Na+ ions bind to SGLT1 and SGLT3, whereas only one Na+ is required to drive SGLT2 activity. Despite the crystallographic information for vSGLT, the electron density is not sufficient to assign binding sites for small single ions such as sodium. However, using mutational analysis and superimposition of structural models from the solute symporters vSGLT, LeuT, and Mhp1, a sodium-binding site for vSGLT is suggested to be close to the sugar-binding residues in transmembrane domains 1 and 8. The predicted cation-binding site in vSGLT appears to allow accessibility to the cytoplasmic aqueous phase. The mechanism of sodium-driven glucose transport has been intensively investigated for SGLT1, applying various methodologies that allow the kinetics of transport to be determined using heterologous expression of the transporter

Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols, Fig. 5 (continued) COOH-terminus of the protein. The residues that are proposed to be involved in glucose binding at the extra- and intracellular sides of the membrane are highlighted. (b) GLUT proteins contain 12 transmembrane regions. Specific structural features for class I and II (upper panel) and class III (lower panel) family members are indicated, such as the proposed substrate-binding site, the N-linked glycosylation sites, and conserved signature sequences. The tryptophan residues implicated in cytochalasin B (CB) binding (positions 338 and 412 in GLUT1) and the N-terminal dileucine signal present in class III members (except for GLUT10) are also shown


in Xenopus laevis oocytes. From the kinetic measurements, a 6-state equilibrium model is proposed that integrates the conformational changes dependent on cation and sugar binding, transport, and cytoplasmic release. The six kinetic states describe the "empty" transporter, the sodium-bound form, and the sodium- and glucose-bound transporter at the external and internal plasma membrane surfaces (Fig. 4a). SGLT1-mediated glucose transport has been characterized regarding its kinetics, conformational changes, and the significance of residues for substrate/inhibitor binding. However, many questions remain unanswered, such as the precise identity of the second sodium-binding site for SGLT1 and the location of the phlorizin-binding site in SGLT1 and SGLT2, which may be of relevance for the SGLT2-selective inhibitors that are in development for the treatment of type 2 diabetes mellitus (T2DM) (see below).

SGLT1 (SLC5A1)

In 1987, the laboratory of Ernest Wright cloned the first sodium-dependent glucose transporter from rabbit intestinal mRNA by an expression cloning strategy using Xenopus laevis oocytes (Wright et al. 2011). SGLT1 is primarily expressed in the brush-border membrane of mature enterocytes in the small intestine and catalyzes the absorption of the dietary sugars glucose and galactose from the gut lumen. SGLT1 is also expressed in the kidney on the luminal surface of cells within the S3 segment of the proximal tubule, where it contributes to renal glucose reabsorption. SGLT1 is a high-affinity, low-capacity transporter with a Km of 0.5 mM for the substrate α-MDG in Xenopus laevis oocytes. Transport of glucose is coupled to symport of two sodium ions. The protein is highly glycosylated, which leads to an apparent molecular weight of 75 kDa.

SGLT1 Physiology

Humans with Deficiency for SGLT1 Display Glucose and Galactose Malabsorption

Mutations in the SGLT1 gene cause glucose–galactose malabsorption (GGM). GGM was first described in 1962 (Lindquist and Meeuwisse 1962) as a severe life-threatening diarrhea in newborn children, which is fatal within weeks unless lactose, glucose, and galactose are removed from the diet. The diarrhea returns immediately upon reintroduction of the respective sugars into the diet. GGM was predicted to be caused by defective intestinal sodium-coupled glucose transport, a hypothesis that was confirmed following the cloning of human SGLT1 and the identification of homozygous carriers of the D28N mutation, which encodes a nonfunctional protein (Turk et al. 1991). GGM is a rare autosomal recessive disease caused by missense, nonsense, frameshift, and splice-site mutations within the SGLT1 gene. Missense mutations, even single amino acid changes, have been demonstrated to cause missorting of the protein in cells, suggesting that slight conformational changes in the protein can interfere with proper folding and/or delivery and integration of SGLT1 into the plasma membrane, thereby affecting its function. More than 80 patients with GGM have been screened for mutations in the SGLT1 gene (Wright et al. 2011).

SGLT2 (SLC5A2)

SGLT2 was cloned from human kidney cDNA in 1992 and was found to encode a 672-amino-acid protein with 59% similarity to SGLT1. SGLT2 is almost exclusively expressed in the kidney and localizes to the apical domain of epithelial cells that line the S1/S2 segments of the proximal convoluted renal tubule. It has been characterized as a kidney-specific transporter controlling the initial step of renal glucose reabsorption, working in concert with SGLT1, which appears responsible for clearance of residual glucose in the more distal S3 segment of the proximal tubular system. In contrast to SGLT1, which transports glucose and galactose, SGLT2 represents a low-affinity, high-capacity sodium–glucose symporter with a Km for glucose of 6 mM and a sodium-to-glucose coupling ratio of 1:1, while having no affinity for galactose.

SGLT2 Physiology

Human Physiology: Familial Renal Glucosuria (FRG) Is Caused by Nonfunctional SGLT2

Glucosuria in the absence of both hyperglycemia and generalized proximal tubular dysfunction (the latter known as de Toni–Debré–Fanconi syndrome) is recognized as an inherited disorder and designated as familial renal glucosuria (FRG). FRG is an autosomal recessive disorder and is diagnosed by persistent isolated glucosuria (urine excretion >1 g/day) with normal fasting plasma glucose levels and oral glucose tolerance. Since the first SLC5A2 mutation in FRG was presented in 2002 (van den Heuvel et al. 2002), forty-four mutations have been identified, including premature stops, frameshifts, and missense mutations. Although the pattern of inheritance for FRG is one of codominance, a clear genotype–phenotype correlation has not been established. Individuals with similar or even identical mutations display a broad range of severity in glucosuria, indicating that environmental, as well as genetic, factors affect urinary glucose reabsorption. Since, thus far, none of the FRG mutations has been tested for its functional effects on SGLT2, it is unknown how the various mutations relate to the severity of glucosuria. FRG established the fundamental role of SGLT2 in renal glucose reabsorption. Since patients with FRG are not affected by severe clinical consequences, it is considered a benign condition, more a phenotype than a disease.

Mouse Models of SGLT2 Deficiency

The metabolic consequences of SGLT2 deficiency in mice have been investigated in a model of diet-induced obesity and associated insulin resistance and in a genetic model of T2DM, the db/db mouse strain (Jurczak et al. 2011). Deletion of SGLT2 leads to increased urine output and a tremendous increase in glucosuria that is associated with compensatory increases in feeding, drinking, and activity. SGLT2 knockout mice are protected from diet-induced hyperglycemia and glucose intolerance and have reduced plasma insulin concentrations.
In the diabetic db/db mouse, deficiency of SGLT2 prevents fasting hyperglycemia and is associated with normalized plasma insulin levels and preserved pancreatic β-cell function. These data confirm the concept of glucotoxicity, which was established by studying the antidiabetic effects of blocking renal glucose reabsorption in diabetic rats by pharmacological SGLT inhibition using phlorizin (Rossetti et al. 1987).

SGLT2 Inhibitors: A New Concept for the Treatment of Type 2 Diabetes

T2DM is characterized by hyperglycemia that results both from peripheral resistance to the action of insulin and from progressive failure of the pancreatic β-cell to compensate for the increasing demand for insulin. Chronic hyperglycemia culminates in glucotoxicity, a term summarizing the vicious cycle in which hyperglycemia induces β-cell dysfunction and insulin resistance, which in turn aggravates disease progression and leads to micro- and macrovascular complications. Current treatments for T2DM come with significant limitations regarding their potential to induce adverse effects. Metformin can cause gastrointestinal effects such as diarrhea and nausea, while sulfonylureas and insulin can induce hypoglycemia and are associated with weight gain. Thiazolidinediones, which act as insulin sensitizers, can induce weight gain, are associated with edema, and are potentially associated with an increased risk. GLP-1 analogs, which are incretin-mimicking agents, can cause nausea and diarrhea. New therapeutic strategies are needed that are not only effective in terms of glucose control but also provide excellent safety and potential add-on effects such as weight loss, lipid lowering, or reductions in blood pressure. The kidney has an important role in controlling blood glucose levels by mediating glucose reabsorption into the bloodstream. In patients with T2DM, increased renal reabsorptive capacity has been observed, indicating that blocking glucose reuptake by the kidney might be an attractive new strategy for the treatment of T2DM. However, glucosuria has historically been perceived as a manifestation of the disease, which makes this therapeutic concept seem rather counterintuitive.
The phenotype of subjects identified with FRG, as well as studies performed with phlorizin, indicated that correcting hyperglycemia via specific inhibition of SGLT2 might provide a new option for a safe and effective treatment of T2DM.
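The affinities quoted in this entry (a Km of 0.5 mM for SGLT1 with α-MDG and a Km of 6 mM for SGLT2 with glucose) and the description of phlorizin as a competitive inhibitor can be made concrete with a small calculation. The Python sketch below assumes simple Michaelis–Menten kinetics with textbook competitive inhibition, which is a simplification of the six-state model; the function name fractional_rate, the 5 mM substrate concentration, and the inhibitor settings are hypothetical values chosen for illustration and are not taken from the source.

```python
# Sketch: fraction of maximal transport rate (v/Vmax) under Michaelis-Menten
# kinetics, optionally with a competitive inhibitor. Illustrative only.

def fractional_rate(substrate_mM, km_mM, inhibitor=0.0, ki=None):
    """Return v/Vmax = S / (Km_app + S).

    With a competitive inhibitor, Km_app = Km * (1 + I/Ki);
    `inhibitor` and `ki` must be given in the same units.
    """
    apparent_km = km_mM
    if ki is not None:
        apparent_km = km_mM * (1.0 + inhibitor / ki)
    return substrate_mM / (apparent_km + substrate_mM)


if __name__ == "__main__":
    sugar = 5.0  # mM, an assumed illustrative concentration
    print("SGLT1 (Km 0.5 mM):", round(fractional_rate(sugar, 0.5), 3))
    print("SGLT2 (Km 6 mM):  ", round(fractional_rate(sugar, 6.0), 3))
    # Hypothetical inhibitor present at a concentration equal to its Ki,
    # which doubles the apparent Km of the transporter.
    print("SGLT2 + inhibitor:", round(fractional_rate(sugar, 6.0, inhibitor=1.0, ki=1.0), 3))
```

Under these assumptions the high-affinity transporter operates close to saturation at the chosen concentration, whereas the low-affinity transporter retains room to respond to rising substrate levels, and a competitive inhibitor lowers the transport rate by raising the apparent Km rather than by changing the maximal rate.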


Phlorizin, a potent SGLT inhibitor, proved to be an important tool for the investigation of the mechanisms and consequences of blocking renal sugar reabsorption. Its use established the concept of glucotoxicity: blocking of renal glucose reabsorption with phlorizin in diabetic rats normalized insulin levels and restored insulin sensitivity (Rossetti et al. 1987). Disadvantages of phlorizin include its lack of selectivity for SGLT2, poor bioavailability, short half-life, and potential for side effects caused, e.g., by blocking GLUT via its major metabolite phloretin. These disadvantages, which are inherent to the phlorizin molecule, led to research on new compounds in order to achieve proof of concept for selective SGLT2 inhibition in the treatment of T2DM. Although nonselective for SGLT2, T-1095 was the first orally available phlorizin derivative that was metabolically stable. When administered to diabetic animals, T-1095 corrected hyperglycemia and reduced hyperinsulinemia and hypertriglyceridemia (Chao and Henry 2010). These findings indicated that SGLT2 inhibition might be a viable approach to the treatment of T2DM. In the following years, the selective SGLT2 inhibitors sergliflozin and remogliflozin progressed to clinical trials. While many selective SGLT2 inhibitors went into clinical testing, development of O-glycosidic SGLT2 inhibitors was halted, presumably due to their unfavorable pharmacokinetic profile. In contrast, a number of C-glycosidic compounds, which differ from O-glycosides in structure and stability, are currently in clinical development, with the most advanced inhibitors dapagliflozin, canagliflozin, and BI10773 (empagliflozin) being investigated in phase III clinical trials (Chao and Henry 2010).

SGLT3 (SLC5A4)

The human SGLT3 cDNA was cloned from colon carcinoma and was found to encode a 659-amino-acid protein with 70% identity to human SGLT1.
SGLT3 mRNA is detected in the intestine, testes, uterus, lung, brain, and thyroid, while the protein is predominantly found in the intestine and skeletal muscle. Immunohistochemical analysis of the intestine identified cholinergic neurons in submucosal and myenteric plexuses as the site of SGLT3 expression. In the skeletal muscle, SGLT3 co-localized with the nicotinic acetylcholine receptor, indicating expression at the neuromuscular junction. Functional characterization of human SGLT3 demonstrated a lack of sugar transport activity. Instead, human SGLT3 was found to be a glucose-sensitive ion channel in which sugar binding induces plasma membrane depolarization in a saturable, sodium-dependent, and phlorizin-sensitive manner (Diez-Sampedro et al. 2003). Interestingly, this is in sharp contrast to pig and mouse SGLT3, which are able to transport glucose. The sodium-to-substrate stoichiometry is 2:1, which is similar to SGLT1, while substrate specificity appears closer to SGLT2, with no acceptance of galactose as a substrate. More in-depth characterization of SGLT3 substrate specificities found that human SGLT3, similarly to SGLT1, interacts with various glucosides, while pig SGLT3 was found to transport imino sugars with high affinity. The lack of transport activity by human SGLT3 has been shown to involve a specific amino acid: residue 457. This residue has been shown to be important for the function of human SGLT1, since mutations of that particular amino acid cause GGM. Structural information from vSGLT (Faham et al. 2008) revealed that the corresponding residue mediates direct interaction with the sugar. Accordingly, mutation of glutamate 457 in human SGLT3 to glutamine conferred transport activity with SGLT1-like characteristics with respect to substrate-to-sodium stoichiometry, sugar specificities, and affinities (Bianchi and Diez-Sampedro 2010). Physiologically, SGLT3 is hypothesized to act as a glucose sensor which, at the sites of its expression in cholinergic neurons and at the neuromuscular junction, might modulate action potentials of neurons and skeletal muscle cells in a glucose-dependent manner.
This hypothesis is supported by the observation that upon expression of human SGLT3 in sensory neurons of C. elegans, glucose sensing in vivo can be monitored (Bianchi and Diez-Sampedro 2010).
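The coupling ratios given in this entry (two Na+ per sugar for SGLT1 and SGLT3, one Na+ per glucose for SGLT2) determine how steep a sugar gradient a cotransporter can sustain at thermodynamic equilibrium. The Python sketch below applies the standard equilibrium relation for an electroneutral substrate cotransported with n Na+ ions, [S]in/[S]out = ([Na+]out/[Na+]in)^n · exp(−nFV/RT). This relation and all numerical inputs (140 mM and 10 mM Na+, a membrane potential of −60 mV, 310 K) are assumptions supplied for illustration and do not come from the source; max_accumulation_ratio is a hypothetical helper name.

```python
import math

FARADAY = 96485.33212       # C/mol
GAS_CONSTANT = 8.314462618  # J/(mol K)


def max_accumulation_ratio(n_sodium, na_out_mM, na_in_mM,
                           membrane_potential_V, temperature_K=310.0):
    """Equilibrium [S]in/[S]out for a neutral solute coupled to n Na+ ions."""
    # Chemical part of the driving force: the Na+ concentration gradient.
    chemical = (na_out_mM / na_in_mM) ** n_sodium
    # Electrical part: each Na+ carries one positive charge across the
    # membrane potential (negative inside favors uptake).
    electrical = math.exp(
        -n_sodium * FARADAY * membrane_potential_V
        / (GAS_CONSTANT * temperature_K)
    )
    return chemical * electrical


if __name__ == "__main__":
    # Assumed illustrative conditions, not values from the source.
    for name, n in (("1 Na+ : 1 glucose (SGLT2-like)", 1),
                    ("2 Na+ : 1 glucose (SGLT1-like)", 2)):
        ratio = max_accumulation_ratio(n, 140.0, 10.0, -0.060)
        print(f"{name}: {ratio:.3g}")
```

Because the coupling number enters as an exponent, the theoretical limit for a 2:1 transporter is the square of that for a 1:1 transporter under the same conditions. This fits the division of labor described above, with SGLT2 handling the bulk of renal glucose reabsorption and SGLT1 clearing residual glucose further along the tubule.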


SGLT4 (SLC5A9)

SGLT4 was cloned from human small intestinal cDNA libraries. The mRNA encoding SGLT4 is almost exclusively found in the small intestine and kidney. SGLT4 exhibits Na+-dependent transport of alpha-methyl-D-glucopyranoside (AMG) with a Km of 2.6 mM. Inhibition studies of SGLT4-mediated AMG transport indicated that SGLT4 appears to transport naturally occurring sugars with a rank order of mannose, glucose, fructose, and galactose. Transport studies using radiolabeled mannose indicated that SGLT4 might be physiologically relevant for intestinal absorption as well as renal reabsorption of mannose (Tazawa et al. 2005).

SGLT5 (SLC5A10)

SGLT5 was recently cloned from human kidney cDNA and characterized as a kidney-specific sodium-dependent mannose transporter which is also able to transport glucose and fructose (Grempler et al. 2012). While it is specifically expressed in the human kidney, its precise localization and physiological role remain unknown. Based on amino acid sequence homology, SGLT5 represents the closest homologue to SGLT4. In a manner reminiscent of the relationship between SGLT1 and SGLT2, it can be speculated that SGLT4 and SGLT5 may act as complementary mannose transporters that regulate intestinal absorption and renal reabsorption of mannose, respectively.

SMIT1 (SLC5A3)

A Na+/myoinositol cotransporter cDNA (SLC5A3) was cloned from canine renal cells and sequenced in 1992, followed by the human SLC5A3 in 1995. The human transporter is mainly expressed in the kidney, brain, placenta, pancreas, heart, skeletal muscle, and lung.

SMIT1 Physiology

Phenotype of Mice Deficient for SMIT1

Myoinositol is a precursor of the main inositol-containing phospholipids phosphatidylinositol and phosphatidylinositol-4,5-bisphosphate, a key molecule in cellular signal transduction. In addition, myoinositol has an important role in osmoregulation. The highest myoinositol levels are found in certain regions of the brain, with cerebrospinal fluid levels ranging from 2 to 25 mM, which are higher than levels in the blood. One hypothesis as to why lithium is effective in the treatment of bipolar disorders is based on its effect of reducing cellular concentrations of myoinositol (the inositol depletion model). Ablation of the murine SLC5A3 gene demonstrated the significant role of this transporter in maintaining central myoinositol concentrations. SMIT1 knockout mice have significantly reduced central inositol levels with no changes in phosphatidylinositol concentrations. Besides the severe myoinositol deficiency in the brain, those animals display congenital central apnea due to abnormal respiratory rhythmogenesis, leading to death shortly after birth. The neonatal lethality of SMIT1 knockout animals appears to be caused by failures in the development of peripheral nerves, specifically of the nerves controlling breathing. The peripheral nerve abnormalities can be corrected by prenatal myoinositol supplementation, suggesting that myoinositol is required for peripheral nerve development. Phenotypic analysis of homozygous SMIT1 knockout mice indicated that a reduction of central inositol levels is associated with lithium-like neurobehavioral effects. The phenotypic characteristics of SMIT1 knockout mice might therefore support the inositol depletion hypothesis as a mode of action for lithium (Bersudsky et al. 2008).

SGLT6/SMIT2 (SLC5A11)

SMIT2 was initially cloned by PCR from rabbit kidney cDNA. Sequence analysis indicated 49% and 43% protein sequence identity to SGLT1 and SMIT1, respectively. SMIT2 mRNA is detected in the brain, kidney, heart, skeletal muscle, spleen, liver, placenta, lung, leukocytes, and neurons. Three transcript variants named SMIT1a, SMIT1b, and SMIT1c have been identified for the SLC5A2 gene.
It was not until 2002 that the cloned product was functionally characterized and identified as a sodium-coupled myoinositol transporter with a Km of 120 µM and 13 mM for myoinositol and sodium, respectively (Coady et al. 2002). Transport mediated by SMIT2 is phlorizin sensitive (Ki of 76 µM). The substrate specificities of SMIT1 and SMIT2 are remarkably different: SMIT2 shows stereospecific transport of D-glucose and D-xylose without affinity for fucose, while SMIT1 transports L-fucose and L-xylose (but not their D-isomers) and does not distinguish between D- and L-glucose. In contrast to SMIT1, SMIT2 transports D-chiro-inositol.

The Family of Glucose Transport Facilitators

Synonyms

GLUT1–GLUT14, gene symbols: SLC2A1–SLC2A14, solute carrier family 2A1–14

Summary

Glucose transporters are uniporters that facilitate the diffusion of their respective substrates (e.g., glucose) across cellular membranes along a concentration gradient (Augustin 2010; Uldry and Thorens 2004). The protein family comprises 14 isoforms that share common structural features such as 12 transmembrane domains, N- and C-termini facing the cytoplasm of the cell, and an N-glycosylation site within either the first or fifth extracellular loop. Based on their sequence homology (14–63% identity), three classes can be distinguished: class I includes the “classic” glucose transporters GLUT1–GLUT4 and GLUT14; the class II members are GLUT5, GLUT7, GLUT9, and GLUT11; and the class III transporters comprise GLUT6, GLUT8, GLUT10, GLUT12, and the proton-driven myoinositol transporter HMIT (or GLUT13). Despite their structural similarities, the different isoforms are characterized by tissue-specific expression and distinct characteristics such as alternative splicing and (sub)cellular localization. With respect to their substrate specificities, the protein family includes transporters of glucose (GLUT1–GLUT4, GLUT8, GLUT14), fructose (GLUT5, GLUT7, GLUT11), polyols (GLUT12), myoinositol (GLUT13), and urate (GLUT9).
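The class assignments in the summary can be captured in a small lookup table. The following Python sketch is illustrative only: the names `GLUT_CLASSES`, `class_of`, and `gene_symbol` are hypothetical helpers introduced here, HMIT is entered under its alias GLUT13, and the gene-symbol rule simply follows the GLUTn/SLC2An naming listed under Synonyms.

```python
# Class membership as listed in the summary; HMIT is entered under its alias GLUT13.
GLUT_CLASSES = {
    "I": ["GLUT1", "GLUT2", "GLUT3", "GLUT4", "GLUT14"],
    "II": ["GLUT5", "GLUT7", "GLUT9", "GLUT11"],
    "III": ["GLUT6", "GLUT8", "GLUT10", "GLUT12", "GLUT13"],
}


def class_of(isoform):
    """Return the class label ("I", "II", or "III") of a GLUT isoform."""
    for label, members in GLUT_CLASSES.items():
        if isoform in members:
            return label
    raise KeyError(f"unknown isoform: {isoform}")


def gene_symbol(isoform):
    """Map GLUTn to SLC2An, following the naming scheme given under Synonyms."""
    if not isoform.startswith("GLUT") or not isoform[4:].isdigit():
        raise ValueError(f"not a GLUT isoform name: {isoform}")
    return "SLC2A" + isoform[4:]


if __name__ == "__main__":
    for name in ("GLUT4", "GLUT9", "GLUT13"):
        print(name, gene_symbol(name), "class", class_of(name))
```

A table of this kind makes it easy to check that the three classes together account for all 14 isoforms of the family.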

Structural Features and Substrate Specificity

The protein sequences of the 14 isoforms are between 14% and 63% identical and 30–79% conserved. Common to all isoforms of the GLUT protein family are the predicted 12 transmembrane helices that are based on the initial hydropathy plot for GLUT1 (Fig. 5b). The 12-helix model has been supported by studies applying glycosylation scanning and the use of epitope tags placed within the exofacial loops as well as antibodies directed against the predicted extra- and intracellular loops and the N- and C-terminal parts of the proteins. All family members are highly glycosylated membrane proteins harboring an N-linked glycosylation site. For class I and II family members, this is positioned in the first exofacial loop between transmembrane helices 1 and 2, while class III family members contain a shorter extracellular loop 1 and harbor the glycosylation site within the larger loop 9 (Joost and Thorens 2001). Sequence comparisons between all isoforms identified conserved residues that have been termed sugar transporter signatures (Joost and Thorens 2001). These include conserved glycine residues in helices 1, 2, 4, 5, 7, 8, and 10, indicating a critical role in the structure of these helices. In particular, helix 7 appears to be important for substrate binding from the exofacial site. Nonetheless, the primary sequences of the various isoforms do not allow prediction of their substrate specificity or kinetics of transport. However, on the basis of mutational analyses, several residues and motifs have been demonstrated to participate in substrate recognition by GLUT1 as well as other isoforms. GLUT1, GLUT3, and GLUT4, which transport glucose but not fructose, have the QLS sequence in helix 7. GLUT2 and GLUT5, which both transport fructose, have an HVA or MGG sequence in this position. Structure–function analysis of GLUT2 and GLUT3 chimeras expressed in Xenopus laevis oocytes demonstrated that GLUT3 can be converted to a glucose/fructose transporter with GLUT2 transport kinetics when the amino acid sequence of GLUT2 from the beginning of helix 7 to the COOH-terminus is inserted into GLUT3.
This demonstrates the impact of helix 7 on substrate specificity and kinetics of transport. A specific feature of all class II transporters is their ability to transport fructose, with GLUT5 representing the bona fide fructose transporter. Again, helix 7 is important for exofacial substrate recognition, since transport of fructose by class II isoforms has been linked to the presence of an NXV/NXI motif. Mutating the isoleucine position significantly reduced fructose transport, while glucose uptake was unaltered. Interestingly, class II isoforms were unable to transport 2-deoxy-D-glucose (2-DG) or galactose. A valine to isoleucine mutation in GLUT2 (V165I) within helix 5 abolished 2-DG transport, and introduction of the same point mutation in GLUT1 also resulted in a comparable reduction of 2-DG transport. For GLUT1, the topology and relative orientation of the 12 transmembrane helices with the outward-facing substrate-binding sites have been proposed in two models. More than 50% of the complete polypeptide sequence has been analyzed by cysteine scanning mutagenesis using the substituted cysteine accessibility method (SCAM), allowing a detailed prediction of the exofacial substrate-binding site and the folding of human GLUT1. A three-dimensional model for GLUT1 has been developed based on structural information from crystallized members of the major facilitator superfamily, the glycerol-3-phosphate transporter and lactose permease. Binding of glucose, forskolin, and phloretin was predicted in close proximity to the exofacial vestibule in this model. While a second binding site for forskolin and phloretin was predicted at the intracellular portion of GLUT1, cytochalasin B has been docked only at one particular endofacial position of the protein.

Class I Family Members

Isoforms of class I GLUTs are well-characterized transporters, with GLUT1 being the first isoform cloned and described in 1985.

GLUT1 (SLC2A1)

GLUT1, also known as the HepG2 or erythrocyte sugar transporter, is highly abundant in erythrocyte membranes, making up 3–5% of the total erythrocyte proteins. The high amount of GLUT1 in red blood cells allowed the generation and characterization of an antibody that was used for the molecular cloning of GLUT1 from a hepatoma cDNA expression library in 1985. Although not present in hepatocytes, GLUT1 represents the most ubiquitously expressed isoform. The transporter is already found throughout early mammalian embryo development from the oocyte to the blastocyst and is present at high levels in endothelial and epithelial-like barriers of the brain, the eye, peripheral nerves, the placenta, and especially in certain tumor cell lines and tissues. Also, GLUT1 is highly expressed in most routinely used laboratory cell lines. When assessed in Xenopus laevis oocytes, GLUT1 transports glucose with a Km of ~3 mM. Under equilibrium exchange conditions, GLUT1 has a Km of 20–21 mM for 3-O-methylglucose and 5 mM for 2-DG. Other hexoses transported by GLUT1 are galactose, mannose, and glucosamine. When expressed in S. cerevisiae, rat GLUT1 showed a Km for D-glucose of 3.4 mM, and transport was inhibited by cytochalasin B (IC50 = 0.44 µM), HgCl2 (IC50 = 3.5 µM), phloretin (IC50 = 49 µM), and phlorizin (IC50 = 355 µM).

GLUT1 Physiology

Human Physiology: The GLUT1 Deficiency Syndrome (OMIM #606777)

Mutations in the GLUT1 gene cause an autosomal dominant disorder characterized by infantile seizures, developmental delay, acquired microcephaly, and ataxia, which is assumed to result from the decreased rate of glucose transport from the blood into the cerebrospinal fluid. Defective glucose transport across the blood–brain barrier was first described in 1991 and linked to GLUT1 deficiency in 1998. About 100 cases have been identified worldwide since that time. A wide spectrum of heterozygous mutations, including nonsense, missense, insertion, deletion, and splice-site mutations, as well as hemizygosity of the GLUT1 gene, has been identified.
Since ketone bodies cross the blood–brain barrier via a monocarboxylic acid transporter (MCT1), they provide an alternative energy source for the brain under conditions of GLUT1 deficiency. Accordingly, a ketogenic diet is effective in controlling the seizures and other symptoms of the GLUT1 deficiency syndrome. However, this treatment is less effective regarding neurobehavioral symptoms. Correlations between genotype and phenotype still remain elusive.

Mouse Models of GLUT1 Deficiency

Although GLUT1 was the first GLUT isoform to be discovered and is well characterized, mouse models for GLUT1 deficiency were only described recently. Mice homozygous for a GLUT1 antisense transgene die during gestation; heterozygosity for the GLUT1 antisense cDNA was associated with growth retardation and developmental malformations. In mice, homozygous knockout of GLUT1 was associated with embryonic lethality around day E10dpc and E13-14dpc, while heterozygous animals were viable and showed no differences in body weight development and growth. Decreased brain weights were reported; however, histological abnormalities were not found. While plasma glucose levels were normal in the heterozygous animals, glucose was decreased in the cerebrospinal fluid (CSF). As shown by PET scan analysis, glucose uptake and metabolism were reduced in the brains of heterozygous GLUT1 knockout animals. These animals also showed deficits in motor activity, balance, and coordination, as well as spontaneous cortical seizures. Overall, heterozygosity for GLUT1 in mice resembles features of humans with the GLUT1 deficiency syndrome.

GLUT2 (SLC2A2)

The second transporter of the GLUT family was cloned in 1988 from human liver and kidney cDNA libraries. The initial characterization detected GLUT2 mainly in the liver, kidney, and intestine, but the transporter was later demonstrated to be present specifically in the insulin-producing β-cells of the pancreas. GLUT2 is a low-affinity, high-capacity transporter, and with a Km in the range of ~17 mM, it has the highest Km for glucose among the known members of the GLUT family. GLUT2 also transports galactose (~92 mM), D-mannose (~125 mM), and D-fructose (~76 mM). Recently, GLUT2 was shown to transport glucosamine with high affinity (Km = ~0.8 mM). Structurally, GLUT2 lacks the QLS motif in helix 7, which is thought to confer substrate specificity of the transporter; its absence may explain the high affinity for glucosamine. GLUT2 is located in the basolateral membrane of the epithelial cells of the intestine and kidney, where it participates in the release of absorbed (via SGLT1 in the intestine) or reabsorbed (via SGLT1 and SGLT2 in the kidney) glucose into the bloodstream.

GLUT2 Physiology

Human GLUT2 Deficiency and the Fanconi–Bickel Syndrome (OMIM #227810)

Rare homozygous or compound heterozygous mutations within the GLUT2 gene cause a type of glycogen storage disease (GSD), termed GSD XI. The first patient was described by Fanconi and Bickel; therefore, GLUT2 deficiency is referred to as Fanconi–Bickel syndrome (FBS). Thus far, 112 patients have been reported. Analysis of 63 patients revealed a total of 34 different GLUT2 mutations, none of them being particularly frequent. The clinical symptoms of FBS are hepatomegaly secondary to glycogen accumulation, glucose and galactose intolerance, fasting hypoglycemia, tubular nephropathy, and severely stunted growth. In contrast to the metabolism of glucose and galactose, utilization of orally or intravenously administered fructose is normal in FBS patients. Glucose homeostasis in FBS patients is heavily disturbed, and postprandial hyperglycemia is frequently observed. A few patients have been diagnosed with diabetes mellitus and have been treated with insulin. Fasting hypoglycemia is a feature of FBS; it has very frequently been documented, and plasma glucose levels as low as 18 mg/dl have been reported in FBS patients.
In contrast to other types of hepatic glycogen storage diseases, hepatic adenomas or malignancies have never been observed in patients with FBS. No specific treatment exists for patients with FBS. Symptomatic treatment is directed toward stabilization of glucose homeostasis and compensation for renal losses of various solutes. The amelioration of the consequences of renal tubulopathy includes the replacement of water and electrolytes. In order to control renal glucose loss and hepatic glycogen accumulation in FBS, patients receive a diet with adequate caloric intake and slowly absorbed carbohydrates.

Mouse Models of GLUT2 Deficiency

The consequences of GLUT2 deficiency have been analyzed in detail for the various tissues where GLUT2 is involved and/or essential in maintaining whole body glucose homeostasis. Early on, the critical role for GLUT2 in the β-cell for glucose-stimulated insulin secretion in vivo became apparent in mice with transgenic overexpression of a GLUT2 antisense RNA specifically in pancreatic β-cells. Upon an 80% reduction of GLUT2 in β-cells, those animals display hyperglycemia and develop diabetes. Whole body GLUT2 knockout in mice results in offspring that appear completely normal at birth, but neonates develop early symptoms similar to type 2 diabetes and do not survive beyond the age of 3 weeks. Homozygous GLUT2 deficiency in mice results in hyperglycemia and elevated plasma levels of free fatty acids and β-hydroxybutyrate. In vivo glucose tolerance is abnormal, and, in vitro, β-cells display a gradual loss of control of insulin gene expression by glucose. Glucose-stimulated insulin secretion in islets was impaired by loss of the first, but not the second phase of insulin secretion. GLUT2 knockout mice show marked hyperglucagonemia, and this is accompanied by alterations in the postnatal development of pancreatic islets, evidenced by a gradual inversion of the α- to β-cell ratio.
A direct link between diet-induced insulin resistance and β-cell dysfunction via disturbed GLUT2 plasma membrane localization has recently been demonstrated. High-fat diet feeding in mice results in intracellular retention of the transporter due to improper glycosylation of the protein, thereby leading to compromised glucose-stimulated insulin secretion. The early lethality of GLUT2-deficient mice shortly after birth hindered the analysis of GLUT2 physiology in the different tissues of its expression. Transgenic mice that overexpress GLUT1 specifically in β-cells under the control of the rat insulin promoter were generated (RIPGLUT1 × GLUT2−/−) in order to study the functional consequences of GLUT2 deficiency in tissues such as the liver, kidney, intestine, and also brain. GLUT2 deficiency in the liver was expected to dramatically affect hepatic glucose output under fasting conditions. Interestingly, hepatic glucose output and glucagon response of the livers from mice lacking GLUT2 were normal. No counterregulation of other transporters known at that time was observed (GLUT1, GLUT3, GLUT4, GLUT5, and SGLT1). Glucose output in GLUT2-deficient livers was not inhibited by cytochalasin B. An alternative membrane traffic-based pathway was proposed that releases glucose directly from the ER after glycogen breakdown or gluconeogenesis. The exact nature of this route has not been determined. In humans, GLUT2 deficiency is associated with a marked hypoglycemia in the fasting state owing to a diminished hepatic glucose output and a failure of glucagon to increase plasma glucose. However, human patients with FBS do not generally develop diabetes and do not display a complete loss of their β-cell function, indicating that functionality of human pancreatic β-cells does not solely depend on GLUT2. Glucosuria was observed in RIPGLUT1 × GLUT2−/− mice, pointing toward an essential role for GLUT2 in basolateral sugar reabsorption by tubular epithelial cells in the kidney. GLUT2 complements the active sugar uptake at the apical epithelium mediated by SGLT2. The functional relevance of GLUT2 in the human kidney is supported by the observation of impaired kidney glucose reabsorption in patients with FBS.

GLUT3 (SLC2A3)

GLUT3 was cloned from a human fetal muscle cDNA library. GLUT3 is considered a neuron-specific glucose transporter due to its dominant expression in the brain in various species. However, besides the brain, GLUT3 is also expressed in tissues with high demand for glucose such as testes (spermatozoa), placenta, preimplantation embryos, or certain cancer cells and cancer tissues. Its tissue distribution and function correspond with its high affinity (Km = 1.4 mM for 2-DG) and transport capacity for glucose. While galactose (Km = 8.5 mM), mannose, maltose, xylose, and dehydroascorbic acid are substrates for GLUT3, fructose is not. GLUT3 is inhibited by cytochalasin B (Ki = 0.4 µM), phloretin, and phlorizin.

GLUT3 Physiology

Single Nucleotide Polymorphisms (SNPs) in the GLUT3 Gene That Have Been Identified and Linked to Dyslexia

Dyslexia is one of the most common learning disorders in school-aged children. Dyslexic children show differences in event-related potential measurements, in particular for mismatch negativity (MMN), which reflects automatic speech deviance processing. Whole-genome association analysis in 200 dyslexic children, focusing on MMN measurements, identified two SNPs that both showed a significant association with mRNA expression levels of SLC2A3 on chromosome 12. It was suggested that a possible trans-regulation effect on SLC2A3 might lead to glucose deficits in dyslexic children that might cause their attenuated MMN in passive listening tasks.

Mouse Models of GLUT3 Deficiency

During mouse preimplantation development, GLUT3 is expressed at the apical membrane of the trophectoderm layer of the blastocyst and mediates glucose uptake by the embryo from the external (maternal) environment. Knockdown of the transporter by antisense RNA at this time point of development disrupts blastocyst development by diminishing uptake of glucose by the embryo. These data indicated a crucial role for GLUT3 during preimplantation embryo development, and its deficiency in mice was assumed to result in embryonic lethality before implantation.
Indeed, homozygous loss of GLUT3 leads to a complete loss of embryos at day 12.5. However, morulae develop normally to the blastocyst stage, and implantation is not affected by loss of GLUT3. Heterozygous GLUT3 knockout mice have been characterized especially for a potential neuron (brain)-specific phenotype. These animals exhibit significantly enhanced cerebrocortical activity and are slightly more sensitive to an acoustic startle stimulus. However, behavior of these animals regarding coordination, reflexes, motor abilities, anxiety, learning, and memory is normal. Zhao et al. described features of autism spectrum disorders in heterozygous GLUT3 knockout animals, namely abnormal spatial learning and working memory, electroencephalographic seizures, and perturbed social behavior with reduced vocalization and low-frequency stereotypies.

GLUT4 (SLC2A4)

Besides GLUT1, GLUT4 is one of the most intensively studied glucose transporters, which is attributed to its important physiological role in the rate-limiting step of insulin-stimulated glucose uptake in skeletal and cardiac muscle and in brown and white adipose tissue. Accordingly, impaired GLUT4 translocation is causally linked to insulin resistance and consequently to non-insulin-dependent diabetes mellitus. GLUT4 was cloned in 1989 by various groups from human, rat, and mouse tissues. GLUT4 displays an affinity for glucose similar to that of GLUT1, with a Km of ~5 mM, and is also capable of transporting dehydroascorbic acid and glucosamine (Km ~3.9 mM). When expressed in yeast (S. cerevisiae), rat GLUT4 is inhibited by the classical inhibitors cytochalasin B (IC50 = 0.2 µM), phloretin (IC50 = 10 µM), and phlorizin (IC50 = 140 µM).

GLUT4 Signaling and Cell Biology

GLUT4 contains unique sorting motifs at its N-terminus (FQQI) and C-terminus (dileucine) that are critical for its capability to traffic between specific intracellular compartments and translocate to the plasma membrane in response to different stimuli. Insulin and exercise are able to rapidly and acutely stimulate GLUT4 translocation to the plasma membrane and thereby influence glucose uptake in muscle and adipose tissue via distinct signaling mechanisms. Activation of the insulin receptor (IR) leads to its autophosphorylation and subsequent signaling through the insulin receptor substrate proteins (IRS); the PI 3-kinase recruited in this way catalyzes the formation of phosphatidylinositol (3,4,5)-trisphosphate (PIP3). PIP3 itself activates the protein kinase Akt via two intermediate protein kinases, PDK1 and Rictor/mTOR. Akt2 rather than the Akt1 or Akt3 isoforms appears to control GLUT4 trafficking. With the identification of the Akt substrates TBC1D4 (AS160), and more recently TBC1D1, two GTPase-activating proteins have been identified that appear to bridge the gap between insulin signaling and trafficking events. Exercise has also been shown to induce TBC1D4 phosphorylation but apparently via a distinct PI 3-kinase-independent mechanism that requires activation of AMPK. However, simultaneous disruption of AMPK and Akt failed to completely inhibit contraction-induced AS160 phosphorylation, hinting toward alternative signaling events leading to GLUT4 translocation. A PI 3-kinase-independent pathway has been proposed that involves the adaptor molecules APS and CAP that bind to the insulin receptor and recruit c-Cbl. c-Cbl signals to the guanine nucleotide exchange factor C3G, which activates the small GTP-binding protein TC10. However, recent data indicated a rather minor contribution for this pathway in insulin-mediated GLUT4 translocation.
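The signaling relationships described in this section can be summarized as a small directed graph. The Python sketch below is a deliberately simplified illustration, not a mechanistic model: the node names, the `SIGNALING_EDGES` table, and the `shortest_path` helper are assumptions introduced here, the minor APS/CAP/c-Cbl/C3G/TC10 route is omitted, and the direct edges from TBC1D4 and TBC1D1 to "GLUT4 translocation" stand in for trafficking steps that the text does not spell out.

```python
from collections import deque

# Simplified edges taken from the description above (assumed node names).
SIGNALING_EDGES = {
    "insulin": ["insulin receptor"],
    "insulin receptor": ["IRS"],
    "IRS": ["PI 3-kinase"],
    "PI 3-kinase": ["PIP3"],
    "PIP3": ["PDK1", "Rictor/mTOR"],
    "PDK1": ["Akt2"],
    "Rictor/mTOR": ["Akt2"],
    "Akt2": ["TBC1D4 (AS160)", "TBC1D1"],
    "exercise": ["AMPK"],
    "AMPK": ["TBC1D4 (AS160)"],
    "TBC1D4 (AS160)": ["GLUT4 translocation"],
    "TBC1D1": ["GLUT4 translocation"],
}


def shortest_path(start, goal, edges=SIGNALING_EDGES):
    """Breadth-first search; returns the shortest node list from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    for stimulus in ("insulin", "exercise"):
        print(" -> ".join(shortest_path(stimulus, "GLUT4 translocation")))
```

In this sketch the insulin route passes through PI 3-kinase and Akt2, whereas the exercise route reaches TBC1D4 via AMPK without touching PI 3-kinase, mirroring the distinction drawn in the text.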
GLUT4 Physiology

Human

Although the significance of GLUT4 for insulin-stimulated glucose uptake in the muscle and adipose tissue is well understood, thus far, no polymorphisms within the GLUT4 gene have been identified that would robustly be associated with impaired glucose homeostasis in humans under circumstances such as type 2 diabetes, increased fasting blood glucose levels, or obesity.

Conventional Knockout Mouse Models of GLUT4 Deficiency

GLUT4 knockout mice, surprisingly, are normoglycemic with insulin resistance and hyperinsulinemia in the fed state. The mice are growth retarded, with markedly reduced fat mass, cardiomegaly, and shortened lifespan, but no diabetes. In contrast, heterozygous GLUT4-null mice develop hyperglycemia and hyperinsulinemia associated with reduced muscle glucose uptake, hypertension, and morphological alterations in the heart and liver. Rather unexpectedly, about 50% of heterozygous GLUT4 knockout mice develop diabetes before the age of 6 months, a phenotype that can be reversed by selective overexpression of GLUT4 in the skeletal muscle.

Muscle-Specific GLUT4 Knockout Mice

In contrast to conventional GLUT4 knockout animals, muscle-specific GLUT4-deficient mice have normal body and fat pad weight and a normal lifespan. While skeletal muscle mass is unchanged, the heart weight is increased, similar to conventional GLUT4-deficient mice and heart-specific GLUT4 knockout animals. Basal and especially insulin- or contraction-induced glucose uptake into the skeletal muscle is reduced, which is causative for the hyperglycemia, glucose intolerance, and insulin resistance seen in those animals. A subset of animals develops diabetes. Surprisingly, insulin-stimulated glucose transport in adipose tissue and insulin-induced suppression of hepatic glucose production are also impaired, which is assumed to be secondary to the hyperglycemia in those animals.

Adipose Tissue-Specific GLUT4 Knockout Mice

Adipose tissue-selective GLUT4 inactivation, unlike the conventional GLUT4 disruption, does not affect growth, adipose mass, or size. Cardiac hypertrophy is not seen in those animals. Fat-specific GLUT4-null mice are insulin resistant and glucose intolerant, and a subset of animals develop diabetes, which is also observed in muscle-specific GLUT4-deficient mice. Muscle and liver are insulin resistant in those animals. Surprisingly, insulin resistance in the muscle was only observed in vivo but not ex vivo, indicating a systemic impact of adipose tissue on insulin sensitivity.

Cardiac Tissue-Specific GLUT4 Knockout Mice

Conventional GLUT4 knockout mice as well as muscle- and cardiac tissue-specific GLUT4-deficient mice all develop cardiac hypertrophy. While homo- and heterozygous GLUT4-null mice show cardiac dysfunction under normal conditions, cardiac-specific GLUT4 knockouts have normal contractile function under basal conditions but decreased recovery after hypoxia. Metabolically, heart-specific GLUT4 knockout mice are normal and have a normal lifespan.
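The glucose Km values quoted in the class I entries (GLUT1 ~3 mM, GLUT2 ~17 mM, GLUT4 ~5 mM) can be compared with a simple Michaelis-Menten calculation. The Python sketch below is a minimal illustration under stated assumptions: it treats uptake as a one-substrate saturable process, v/Vmax = S / (Km + S), which ignores the bidirectional nature of facilitated diffusion, and the 5 mM glucose concentration and the helper name `fractional_saturation` are chosen here for illustration only.

```python
# Glucose Km values (mM) as quoted in the class I entries above.
GLUCOSE_KM_MM = {"GLUT1": 3.0, "GLUT2": 17.0, "GLUT4": 5.0}


def fractional_saturation(substrate_mm, km_mm):
    """Michaelis-Menten fraction of the maximal transport rate, v/Vmax = S / (Km + S)."""
    if substrate_mm < 0 or km_mm <= 0:
        raise ValueError("concentration must be >= 0 and Km must be > 0")
    return substrate_mm / (km_mm + substrate_mm)


if __name__ == "__main__":
    glucose_mm = 5.0  # assumed glucose concentration, for illustration only
    for isoform, km in GLUCOSE_KM_MM.items():
        print(f"{isoform}: v/Vmax = {fractional_saturation(glucose_mm, km):.2f}")
```

At a substrate concentration equal to Km the function returns 0.5 by definition, so a transporter with a higher Km, such as GLUT2, is further from saturation at any given glucose concentration than GLUT1 or GLUT4.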

GLUT14 (SLC2A14)

A search of the human genome for additional genes coding for glucose transporters identified a putative gene, which was cloned and shows 95% identity to GLUT3 at the nucleotide level. The SLC2A14 gene maps to chromosome 12p13.3, at a distance of 10 Mb from GLUT3, and appears to be the result of a duplication of the GLUT3 gene. GLUT14 was shown to be specifically expressed in the testes. Two alternatively spliced forms of GLUT14 were identified. Interestingly, GLUT14 has no ortholog in mice, a finding that has also been made for GLUT11.

Class II Family Members

With GLUT5 being identified in 1990 and established as a fructose transporter, initial transport studies for GLUT7 and GLUT9 indicated that those transporters might also transport fructose. A specific feature of class II transporters is that cytochalasin B, a classical GLUT inhibitor, does not block glucose transport. Furthermore, none of the isoforms shows an affinity for 2-DG or galactose. Unique to GLUT9 and GLUT11 is the alternative splicing/promoter usage, by which two or three different mRNAs are transcribed, respectively. In the case of GLUT9, this leads to two proteins that differ only in their N-terminal region.

GLUT5 (SLC2A5)

Human GLUT5 was initially cloned from an intestinal epithelial cell line. GLUT5 is considered the prototypic fructose transporter: when expressed in Xenopus laevis oocytes, the human protein transports fructose with a Km of 6 mM without any noticeable glucose transport activity. Fructose transport is not inhibited by cytochalasin B, phloretin, or phlorizin. Besides fructose, the rat GLUT5 transports glucose, an activity that can be blocked by cytochalasin B. In humans, rats, and mice, GLUT5 is primarily expressed in the jejunal region of the small intestine. Lower levels of the protein are expressed in the kidney, the brain, skeletal muscle, and adipose tissue. GLUT5 mediates fructose absorption in the jejunum at the apical, and potentially at the basolateral, membrane of the epithelial cells into the portal vein (Fig. 2b).

GLUT5 Physiology

Mouse Models of GLUT5 Deficiency

GLUT5 deficiency is associated with reduced fructose absorption when animals are challenged by a high fructose diet. While wild-type mice upon high fructose feeding display an enhanced salt absorption in their jejuna and develop systemic hypertension, GLUT5 knockout mice do not show fructose-stimulated salt absorption. Instead, the animals display impaired nutrient absorption that is accompanied by hypotension. Absence of GLUT5 leads to a massive dilatation of the cecum and colon, consistent with severe malabsorption. On a normal chow diet, GLUT5-deficient mice have normal blood pressure and display normal weight gain. The phenotype of GLUT5-deficient mice demonstrates that this isoform is essential for fructose absorption by the intestine and thereby fundamentally involved in fructose-induced hypertension.

GLUT7 (SLC2A7)

The human GLUT7 was cloned from an intestinal cDNA library using a PCR-based strategy. GLUT7 is primarily expressed in the small intestine and colon, although mRNA has been detected in the testes and prostate as well. The protein has been localized to the apical membrane of the small intestine and colon. GLUT7 shows a rather high affinity for glucose and fructose (Km for glucose = 0.3 mM), while galactose, 2-DG, and xylose are not transported. Sugar transport by GLUT7 is not inhibited by cytochalasin B or phloretin. Sequence alignments between fructose- and non-fructose-transporting GLUT isoforms identified a motif in GLUT7 that potentially confers its ability to transport fructose. Mutational analysis of those residues in GLUT7 identified isoleucine 314 as an important determinant for fructose affinity. The finding of a specific residue within the extracellular vestibule of helix 7 that drives substrate specificity was extended to GLUT2, GLUT5, GLUT9, and GLUT11 and proposed as a common NXI/V consensus motif among isoforms capable of transporting fructose.

GLUT9 (SLC2A9)

Human GLUT9 cDNA was isolated by PCR amplification from a human kidney cDNA library on the basis of sequence information from ESTs and from its genomic sequence. GLUT9 mRNA is detected almost exclusively in the kidney and liver and at low levels in the small intestine, placenta, lung, and leukocytes. GLUT9 is also localized to the insulin-secreting β-cells of human and mouse islets, and downregulation of the protein by siRNA in rat and mouse insulinoma cells leads to a reduced glucose-stimulated insulin secretion. In humans as well as in mice, alternative splicing or promoter usage results in two proteins, GLUT9a and GLUT9b, which differ only in their N-terminal region. While human and mouse GLUT9b are mainly expressed in the kidney, placenta, and liver, GLUT9a shows a broader tissue distribution. The different N-termini of human GLUT9a and GLUT9b determine basolateral versus apical sorting in polarized cells in vitro, respectively. In mouse kidney, GLUT9 is localized to apical as well as basolateral membranes of distal convoluted tubules. Initial characterization of GLUT9 determined a rather low affinity for 2-DG. However, high-affinity transport has been reported for glucose and fructose with Km values of 0.6 and 0.4 mM, respectively. More recently, GLUT9 has been identified as a high-affinity uric acid transporter with a Km of 0.9 and 0.6 mM for the human and mouse protein, respectively. Although transport of glucose and fructose has not been observed by some investigators, GLUT9 is thought to exchange glucose or fructose for urate. The classical inhibitor cytochalasin B does not block GLUT9 function. GLUT9 has been established to represent a major regulator of urate homeostasis in dogs, mice, and men.

GLUT9 Physiology

In humans, GLUT9 is involved in renal uric acid reabsorption, and mutations in the GLUT9 gene affect plasma uric acid levels. Various genome-wide association studies identified SLC2A9 as one of the genes most significantly associated with gout and increased serum uric acid concentrations. Although the identified SNPs are located in intronic regions of the gene, evidence exists that increased RNA expression for GLUT9 positively correlates with increased serum uric acid concentrations. In contrast, exonic mutations in GLUT9 suggested that loss of GLUT9 function is associated with hypouricemia. Two distinct heterozygous missense mutations (R380W and R198C in GLUT9a) were described in three patients with hypouricemia. The two mutations were shown to result in loss of function when uric acid transport was studied in Xenopus laevis oocytes. A genome-wide homozygosity screen linked hereditary hypouricemia to two homozygous SLC2A9 mutations that lead to a missense mutation (L75R) or a 36 kb deletion. The homozygous loss-of-function mutations of GLUT9 caused a total defect of uric acid absorption, leading to severe hypouricemia complicated by nephrolithiasis and exercise-induced acute renal failure. Therefore, GLUT9 is essential for renal reabsorption of uric acid: increased expression is associated with hyperuricemia and gout, while loss of function leads to severe hypouricemia.
Mouse Models of GLUT9 Deficiency

GLUT9 deficiency in mice leads to hyperuricemia, massive hyperuricosuria, and an early-onset nephropathy, in contrast to the condition in humans, where dysfunctional GLUT9 is associated with hypouricemia. Hyperuricemia in mice due to GLUT9 deficiency appears to result from impaired urate uptake by the liver, so that urate cannot be degraded to allantoin by uricase. The nephropathy in GLUT9 knockout animals is characterized by obstructive lithiasis, tubulointerstitial inflammation, and progressive inflammatory fibrosis of the cortex. In contrast, liver-specific GLUT9 inactivation in adult mice results in severe hyperuricemia and hyperuricosuria in the absence of urate nephropathy or any structural abnormality of the kidney. Studies of GLUT9-deficient mice showed that GLUT9 is a functional urate transporter in vivo and a major player in urate homeostasis owing to its dual role in urate handling in the kidney and in the liver. Whether GLUT9 plays any role as a glucose/fructose transporter that potentially links fructose uptake by the liver and urate metabolism remains to be seen (Fig. 3a).

GLUT11 (SLC2A11)

Human GLUT11 was cloned by PCR on the basis of sequence information obtained from ESTs and a genomic sequence. Three variants of GLUT11 (GLUT11-A, GLUT11-B, and GLUT11-C) have been identified that differ only in their N-terminal sequences. Since for each of the variants the corresponding 5′ sequences upstream of exon 1 exhibit promoter activity, transcriptional regulation is assumed to occur by alternative promoter usage. The three isoforms are expressed in a tissue-specific manner: GLUT11-A is present in the heart, skeletal muscle, and kidney; GLUT11-B is expressed in placental, adipose, and kidney tissue; and GLUT11-C is found in adipose, heart, skeletal muscle, and pancreatic tissue. Glucose transport activity for GLUT11 was detected in liposomes reconstituted with GLUT11-containing membranes. The transporter shows affinity for glucose with a Km of 0.16 mM when measured in Xenopus laevis oocytes. GLUT11 also transports fructose but not galactose and shows a rather low affinity for cytochalasin B. Mutational analysis of the DSV motif in GLUT11, which corresponds to the “fructose” transporter motif “NAI,” showed that in GLUT11, too, this particular region of helix 7 determines the substrate selectivity of the transporter. Endogenous GLUT11 protein was localized in the heart and skeletal muscle tissue with an antibody raised against the C-terminus of GLUT11 that does not distinguish between the different variants. Human GLUT11 is expressed exclusively in slow-twitch muscle fibers and is unaffected by physiological and pathophysiological conditions except in primary myopathy. Surprisingly, the human SLC2A11 gene has no ortholog in the rat and mouse genomes.

Class III Family Members

Class III isoforms share specific features that are unique to this class. Structurally, all class III members carry their N-glycosylation site in the fifth extracellular loop. Also common to all the isoforms is the presence of an internalization signal (dileucine, or YSRI in the case of GLUT10) that retains these transporters at an intracellular localization under steady-state conditions. Thus far, a stimulus for translocation of class III isoforms to the plasma membrane has been proposed only for GLUT13, and this has not been confirmed.

GLUT6 (SLC2A6)

GLUT6 (formerly designated GLUT9) was cloned from human leukocyte cDNA by (RACE)-PCR on the basis of murine ESTs and the human genomic sequence. GLUT6 mRNA is expressed predominantly in the brain, spleen, and peripheral leukocytes. Hexose transport for GLUT6 was only shown when reconstituted in liposomes, where GLUT6 transport activity was found in the presence of 5 mM but not 1 mM glucose. GLUT6 exhibits a low cytochalasin B binding affinity. Thus far, no other data on kinetics of transport and potential substrates have been published for this isoform. GLUT6 contains an N-terminal dileucine motif that is responsible for intracellular retention of the protein when overexpressed in primary rat adipocytes. GLUT6 is only detected at the plasma membrane when the dileucine residues are mutated to alanine or when clathrin-dependent endocytosis is blocked by overexpression of a dominant-negative dynamin mutant. However, no cell-surface translocation of GLUT6 is observed in response to stimuli such as insulin, phorbol ester, or hyperosmolarity.


A gene expression profiling study aimed at identifying genes deregulated in chronic lymphocytic leukemia associated with trisomy 12 identified SLC2A6 among the seven genes with the strongest correlation. Although a significant deregulation was not confirmed subsequently by real-time PCR analysis, the specific expression of GLUT6 in leukocytes and the spleen might indicate an important role for this transporter in the normal physiology of this cell lineage.

GLUT8 (SLC2A8)

GLUT8 (formerly GLUTX1) was the first isoform of the extended SLC2A family to be identified by database mining. The human, rat, and mouse cDNAs were cloned by 5′ and 3′ RACE-PCR from testis cDNA samples. The transporter is mainly expressed in the testes, and lower levels are found in the brain (cerebellum), adrenal gland, liver, spleen, brown adipose tissue, and lung. Functional characterization of GLUT8 in Xenopus laevis oocytes immediately revealed the intracellular retention of the transporter due to the presence of a dileucine-based motif in its N-terminus. Mutating the two leucine residues to alanine leads to plasma membrane localization of GLUT8 in mammalian cells and Xenopus laevis oocytes, which allows the transport kinetics of the protein to be determined. GLUT8 shows a high affinity for glucose with a Km of ~2 mM. Glucose transport can be competitively inhibited by fructose, galactose, and cytochalasin B. The intracellular localization of GLUT8 raised the question of whether this isoform is insulin responsive and thereby compensates for GLUT4 in the respective knockout mice, which lack the insulin-responsive glucose transporter yet show near-normal glucose tolerance. Indeed, in mouse blastocysts, GLUT8 has been found to account for insulin-stimulated glucose uptake. However, extensive studies by various groups in primary rat adipocytes, 3T3-L1 adipocytes, insulin-responsive CHO cells, and neuronal cell types such as N2A, PC12, and hippocampal neurons all failed to identify a stimulus that leads to plasma membrane translocation of GLUT8. The intracellular localization of GLUT8 is also observed in vivo under steady-state conditions. In the testis, immunofluorescence microscopy shows intracellular localization of GLUT8 in a late endosomal/lysosomal compartment. In the brain, immunogold labeling electron microscopy localized GLUT8 to synaptic dense core vesicles of nerve terminals and secretory granules of vasopressin neurons. The intracellular retention signal of GLUT8 has been characterized and found to contain the consensus sequence [D/E]XXXL[L/I], which represents a late endosomal/lysosomal sorting motif. The DEXXXLL sorting signal of GLUT8 interacts with the adaptor proteins AP-1 and AP-2 and controls trafficking of the protein. Interestingly, both GLUT8 and GLUT12 contain a [D/E]XXXL[L/I] sorting signal; however, the exact composition of the XXX residues seems to fine-tune the routing and trafficking of the two proteins.

GLUT8 Physiology

Mouse Models of GLUT8 Deficiency

GLUT8 knockout mice appear healthy and exhibit normal growth, body weight development, and glycemic control, indicating that GLUT8 does not play a significant role in the maintenance of whole body glucose homeostasis. Offspring distribution from heterozygous matings deviated from the expected Mendelian distribution with regard to the birth of GLUT8 homozygous animals. This observation was attributed to decreased motility of GLUT8-deficient sperm, which is associated with lower ATP levels and a reduced mitochondrial membrane potential, while the number and survival rate of spermatozoa are unchanged. The reduced number of homozygous GLUT8 offspring is not related to impaired preimplantation embryo development (as might have been suggested by antisense studies in embryos), since mating of knockout mice produced viable, normally developing offspring in numbers comparable to those of a wild-type intercross. GLUT8 deficiency is associated with behavioral alterations, indicating a significant physiological role for this isoform in the central nervous system.
GLUT8-deficient mice show increased proliferation of hippocampal cells, and behavioral tests show increased arousal, a tendency toward altered grooming, and reduced risk assessment in those animals. However, despite the in-depth characterization of GLUT8 regarding its cell biology and physiological significance in vivo, its cellular function remains unknown.

GLUT10 (SLC2A10)

GLUT10 was cloned from human liver cDNA by 3′ and 5′ RACE-PCR based on an EST sequence that was identified via a homology search with known GLUT protein sequences. GLUT10 mRNA is present in the human heart, lung, brain, liver, skeletal muscle, pancreas, placenta, and kidney. Expression of GLUT10 is also detected in human and mouse white adipose tissue as well as in the human and mouse adipocyte cell lines SGBS and 3T3-L1, respectively. One remarkable structural feature is the absence of the PESPR motif just after helix 6 that is conserved in all the other GLUT isoforms. Heterologous expression of GLUT10 in Xenopus laevis oocytes demonstrated 2-DG transport with high affinity (Km ~0.3 mM). Uptake of 2-DG can be competed by galactose and glucose and inhibited by phloretin. Although no localization study has clearly demonstrated a plasma membrane or intracellular localization for GLUT10, as seen for other class III family members, immunocytochemical studies indicated an intracellular localization for GLUT10 under steady-state conditions. The presence of the potential internalization motif YSRI at the C-terminus of the transporter supports these findings. The GLUT10 gene is located in the chromosomal region 20q12-13.1, a susceptibility locus that has been linked to type 2 diabetes in Caucasian Americans. Therefore, there has been particular interest in GLUT10 as a potential candidate or susceptibility gene for the disease. However, polymorphisms in GLUT10 were not associated with type 2 diabetes in Caucasian American, Danish, Finnish, and Taiwanese populations.
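The Km values quoted so far for the class II and class III isoforms can be put on a common scale if one assumes simple Michaelis-Menten kinetics, v/Vmax = [S]/(Km + [S]). The sketch below is only an illustration under that assumption: the function names and the 5 mM test concentration are hypothetical choices, not taken from the cited studies, and only the Km values come from this entry.

```python
def fractional_saturation(substrate_mM: float, km_mM: float) -> float:
    """Return v/Vmax for simple Michaelis-Menten kinetics (an assumption)."""
    if substrate_mM < 0 or km_mM <= 0:
        raise ValueError("substrate must be >= 0 and Km must be > 0")
    return substrate_mM / (km_mM + substrate_mM)


# Km values (mM) as stated in this entry; the substrate is given in parentheses.
KM_MM = {
    "GLUT8 (glucose)": 2.0,
    "GLUT9 (glucose)": 0.6,
    "GLUT10 (2-DG)": 0.3,
    "GLUT11 (glucose)": 0.16,
}


def saturation_table(substrate_mM: float) -> dict:
    """Fractional saturation of each listed transporter at one concentration."""
    return {name: fractional_saturation(substrate_mM, km)
            for name, km in KM_MM.items()}


if __name__ == "__main__":
    # 5 mM is an illustrative concentration, not a value from the source.
    for name, frac in saturation_table(5.0).items():
        print(f"{name}: v/Vmax = {frac:.2f}")
```

At a substrate concentration equal to Km the function returns 0.5 by construction, which is the defining property of Km under this model.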

GLUT10 Physiology

GLUT10 and Arterial Tortuosity Syndrome (ATS, OMIM# 208050) in Humans

Deficiency for GLUT10 in humans has been found to be associated with ATS, a rare autosomal recessive connective tissue disease that is characterized by widespread arterial involvement with elongation, tortuosity, and aneurysms of the arteries. Homozygous mutations (deletion, nonsense, missense) in GLUT10 were found in six families with ATS. It is currently unknown how loss of GLUT10 leads to this connective tissue disorder (Coucke et al. 2006).

Mouse Models of GLUT10 Deficiency

Two groups reported the phenotypic analysis of mice with the amino acid substitutions G128E or S150F in GLUT10 (Callewaert et al. 2008; Cheng et al. 2009). Both substitutions are located in exon 2 of GLUT10 and affect residues that are conserved between the rodent and human proteins. Based on predictions, the substitutions G128E and S150F were expected to interfere with the helix structure of transmembrane regions 4 and 5, respectively. The mouse strains were generated after screening a mutant mouse library that was based on N-ethyl-N-nitrosourea (ENU) mutagenesis in healthy C3HeB/FeJ males. Callewaert et al. (2008) did not report any of the vascular, anatomical, or immunohistological abnormalities encountered in patients with ATS. Both mutant strains appeared normal at birth, gained weight appropriately, and survived to adulthood. The animals showed normal heart rhythm, heart structure, and ventricular function. No specific arterial tortuosity, stenosis, dilatation, or aneurysm in the cerebral vessel pattern was noted. However, histopathology in the study by Cheng et al. (2009) revealed thickening and an irregular vessel wall shape of arteries with increased elastic fibers. Furthermore, the animals displayed endothelial hypertrophy and disarranged elastic fibers that resulted in disruption of the internal elastic lamina in the aorta. Neither group analyzed whether the mutations caused any dysfunction or loss of the target protein; therefore, the reported phenotype of those mice remains inconclusive with respect to GLUT10 function.

GLUT12 (SLC2A12)

GLUT12 was identified by 5′ and 3′ RACE-PCR from the human breast cancer cell line MCF-7.


Strong GLUT12 expression is found in ductal cell carcinoma in situ when compared to benign ducts of breast cancer tissues. GLUT12 is mainly expressed in the skeletal muscle, heart, small intestine, and prostate tissues. When functionally characterized in Xenopus laevis oocytes, GLUT12 shows glucose transport that can be competed by fructose, galactose, 2-DG, and cytochalasin B. However, thus far, the affinity of GLUT12 for glucose is not known. As a class III isoform, GLUT12 also contains dileucine motifs, at both the N- and C-terminal ends of the protein. Endogenous as well as overexpressed GLUT12 protein localizes to intracellular compartments as well as to the plasma membrane in various cell lines. The N-terminal dileucine signal of GLUT12 is similar to that of GLUT8 and matches the [DE]XXXL[LI] consensus sequence that represents a late endosomal/lysosomal sorting signal. However, GLUT12 does not co-localize with GLUT8, but rather resides in the Golgi network and at the plasma membrane. Plasma membrane-associated GLUT12 is not endocytosed, which indicates the absence of a continuous cycling mechanism for GLUT12. The phenotypic characterization of mice deficient in GLUT4 indicates the presence of a second transporter that facilitates insulin-stimulated glucose transport. Because of its tissue expression and its biological characteristics, GLUT12 was studied for its ability to translocate to the plasma membrane in response to insulin. Indeed, in human skeletal muscle, insulin induces an increase in plasma membrane GLUT12 that is comparable to the insulin-stimulated GLUT4 translocation. Although GLUT12 expression in the skeletal muscle is unaltered under pathophysiological conditions such as obesity and type 2 diabetes, these data imply that an additional transporter is expressed in human muscle that is insulin responsive in a PI 3-kinase-dependent manner. The characterization of transgenic mice that overexpress GLUT12 appears to confirm these findings: those animals display increased insulin sensitivity in insulin-sensitive tissues, while basal (non-stimulated) glucose uptake into adipose tissue and skeletal muscle was unaffected (Purcell et al. 2011).
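The [DE]XXXL[LI] consensus described for GLUT8 and GLUT12 is a simple positional pattern, so candidate motifs in a protein sequence can be located with a regular expression. The following is a minimal sketch: the function name and the toy peptide are hypothetical and do not represent the actual GLUT8 or GLUT12 sequences, and a pattern hit by itself says nothing about whether a motif is functional.

```python
import re

# [DE], any three residues, L, then L or I. The lookahead lets overlapping
# candidates be reported as well.
SORTING_MOTIF = re.compile(r"(?=([DE][A-Z]{3}L[LI]))")


def find_sorting_motifs(sequence: str) -> list:
    """Return (1-based position, matched hexapeptide) for each candidate motif."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in SORTING_MOTIF.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical toy peptide, invented purely for illustration.
    toy_peptide = "MAAEGGSLLKKAADAAALIGG"
    for position, motif in find_sorting_motifs(toy_peptide):
        print(position, motif)
```

Because the three X positions are unconstrained in the pattern, two proteins can both match while differing in exactly the residues that the text suggests fine-tune routing.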

GLUT13; HMIT (SLC2A13)

Screening of public expressed sequence databases with the GLUT8 protein sequence identified a rat EST clone that allowed cloning of the rat and human HMIT (SLC2A13) cDNAs from spleen and frontal cortex cDNA libraries. Despite low-level expression in the adipose tissue and kidney, HMIT is predominantly expressed in the brain, with high expression found in the hippocampus, hypothalamus, cerebellum, and brainstem. The HMIT amino acid sequence contains all motifs known to be important for glucose transport activity. As for other class III GLUT family members, HMIT is restricted to an intracellular location. Functional characterization of the protein in Xenopus laevis oocytes and in mammalian cells has been possible through the introduction of various mutations that yielded significant plasma membrane expression. Surprisingly, no sugar transport activity has been found for HMIT. Instead, HMIT has been identified as an H+-coupled myoinositol symporter with a Km of about 100 µM. More recently, HMIT has been shown to transport inositol trisphosphate (IP3). HMIT is inhibited by the common GLUT inhibitors phloretin, phlorizin, and cytochalasin B, although only at high concentrations. Translocation of HMIT to the plasma membrane has been reported to occur in PC12 cells or primary neurons upon depolarization or protein kinase C (PKC) activation, resulting in functional HMIT at the plasma membrane as evidenced by increased myoinositol uptake in those cells. However, those initial findings were not reproduced by other groups, leaving uncertainty about a stimulus that induces plasma membrane translocation of the transporter. In the brain, myoinositol serves as the precursor for phosphatidylinositol, a key regulator of various signaling pathways. Dysregulation of phosphatidylinositol signaling has been implicated in psychiatric illnesses such as bipolar disorder. Standard therapies (lithium, valproic acid, and carbamazepine) alter neuronal growth cone morphology, a phenotype that is reversed by extracellular myoinositol. Because HMIT is predominantly expressed in the brain, in contrast to the two other myoinositol transporters, which are sodium coupled (SMIT1 and SMIT2), interest has been raised as to whether HMIT might play a role in the regulation of myoinositol/phosphatidylinositol physiology in neurons. However, studies of mice deficient in the transporter indicated that HMIT is not involved in the neuronal transport of inositol from the extracellular environment.

Cross-References

▶ Fatty Acid Metabolism

References

Augustin R (2010) The protein family of glucose transport facilitators: it’s not only about glucose after all. IUBMB Life 62:315–333
Bersudsky Y, Shaldubina A, Agam G, Berry GT, Belmaker RH (2008) Homozygote inositol transporter knockout mice show a lithium-like phenotype. Bipolar Disord 10:453–459
Bianchi L, Diez-Sampedro A (2010) A single amino acid change converts the sugar sensor SGLT3 into a sugar transporter. PLoS One 5:e10241
Callewaert BL, Loeys BL, Casteleyn C, Willaert A, Dewint P, De Backer J, Sedlmeier R, Simoens P, De Paepe AM, Coucke PJ (2008) Absence of arterial phenotype in mice with homozygous slc2A10 missense substitutions. Genesis 46:385–389
Chao EC, Henry RR (2010) SGLT2 inhibition – a novel strategy for diabetes treatment. Nat Rev Drug Discov 9:551–559
Cheng CH, Kikuchi T, Chen YH, Sabbagha NG, Lee YC, Pan HJ, Chang C, Chen YT (2009) Mutations in the SLC2A10 gene cause arterial abnormalities in mice. Cardiovasc Res 81:381–388
Coady MJ, Wallendorff B, Gagnon DG, Lapointe JY (2002) Identification of a novel Na+/myo-inositol cotransporter. J Biol Chem 277:35219–35224
Coucke PJ, Willaert A, Wessels MW, Callewaert B, Zoppi N, De Backer J, Fox JE, Mancini GM, Kambouris M, Gardella R, Facchetti F, Willems PJ, Forsyth R, Dietz HC, Barlati S, Colombi M, Loeys B, De Paepe A (2006) Mutations in the facilitative glucose transporter GLUT10 alter angiogenesis and cause arterial tortuosity syndrome. Nat Genet 38:452–457
Crane RK (1960) Intestinal absorption of sugars. Physiol Rev 40:789–825
Diez-Sampedro A, Hirayama BA, Osswald C, Gorboulev V, Baumgarten K, Volk C, Wright EM, Koepsell H (2003) A glucose sensor hiding in a family of transporters. Proc Natl Acad Sci U S A 100:11753–11758
Ehrenkranz JR, Lewis NG, Kahn CR, Roth J (2005) Phlorizin: a review. Diabetes Metab Res Rev 21:31–38
Faham S, Watanabe A, Besserer GM, Cascio D, Specht A, Hirayama BA, Wright EM, Abramson J (2008) The crystal structure of a sodium galactose transporter reveals mechanistic insights into Na+/sugar symport. Science 321:810–814
Grempler R, Augustin R, Froehner S, Hildebrandt T, Simon E, Mark M, Eickelmann P (2012) Functional characterisation of human SGLT-5 as a novel kidney-specific sodium-dependent sugar transporter. FEBS Lett 586:248–253
Joost HG, Thorens B (2001) The extended GLUT-family of sugar/polyol transport facilitators: nomenclature, sequence characteristics, and potential function of its novel members (review). Mol Membr Biol 18:247–256
Jurczak MJ, Lee HY, Birkenfeld AL, Jornayvaz FR, Frederick DW, Pongratz RL, Zhao X, Moeckel GW, Samuel VT, Whaley JM, Shulman GI, Kibbey RG (2011) SGLT2 deletion improves glucose homeostasis and preserves pancreatic beta-cell function. Diabetes 60:890–898
Lindquist B, Meeuwisse GW (1962) Chronic diarrhoea caused by monosaccharide malabsorption. Acta Paediatr 51:674–685
Manz F, Bickel H, Brodehl J, Feist D, Gellissen K, Gescholl-Bauer B, Gilli G, Harms E, Helwig H, Nutzenadel W et al (1987) Fanconi-Bickel syndrome. Pediatr Nephrol 1:509–518
Preitner F, Bonny O, Laverriere A, Rotman S, Firsov D, Da Costa A, Metref S, Thorens B (2009) Glut9 is a major regulator of urate homeostasis and its genetic inactivation induces hyperuricosuria and urate nephropathy. Proc Natl Acad Sci U S A 106:15501–15506
Purcell SH, Aerni-Flessner LB, Willcockson AR, Diggs-Andrews KA, Fisher SJ, Moley KH (2011) Improved insulin sensitivity by GLUT12 overexpression in mice. Diabetes 60:1478–1482
Rossetti L, Smith D, Shulman GI, Papachristou D, DeFronzo RA (1987) Correction of hyperglycemia with phlorizin normalizes tissue sensitivity to insulin in diabetic rats. J Clin Invest 79:1510–1515
Tazawa S, Yamato T, Fujikura H, Hiratochi M, Itoh F, Tomae M, Takemura Y, Maruyama H, Sugiyama T, Wakamatsu A, Isogai T, Isaji M (2005) SLC5A9/SGLT4, a new Na+-dependent glucose transporter, is an essential transporter for mannose, 1,5-anhydro-D-glucitol, and fructose. Life Sci 76:1039–1050
Turk E, Zabel B, Mundlos S, Dyer J, Wright EM (1991) Glucose/galactose malabsorption caused by a defect in the Na+/glucose cotransporter. Nature 350:354–356
Uldry M, Thorens B (2004) The SLC2 family of facilitated hexose and polyol transporters. Pflugers Arch 447:480–489
van den Heuvel LP, Assink K, Willemsen M, Monnens L (2002) Autosomal recessive renal glucosuria attributable to a mutation in the sodium glucose cotransporter (SGLT2). Hum Genet 111:544–547
Wright EM, Loo DD, Hirayama BA (2011) Biology of human sodium glucose transporters. Physiol Rev 91:733–794


Many Bacteria Use a Special Mutagenic Pol III in Place of Pol V

Charles S. McHenry
Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO, USA

Synopsis

All bacteria contain specialized polymerases that are required for bypassing unrepairable lesions and for mutagenesis. In the well-characterized E. coli system, these include the Pol Y family polymerases, Pol IV and Pol V. Yet, many bacteria lack Pol V and contain, in its place, a special mutagenic Pol III (ImuC) that appears to exert its action in concert with two additional proteins, often found in the same operon, ImuA and ImuB. ImuC differs from replicative DnaE by a series of conserved amino acid changes in the enzyme’s active site.

Introduction

In E. coli, a specialized class of Pol Y polymerases serves the role of induced mutagenesis and stress-induced adaptive modifications (reviewed in Foster 2005; McKenzie and Rosenberg 2001; Pages and Fuchs 2002; Tippin et al. 2004; Walker 2005). With the sequencing of multiple bacterial genomes, it has become apparent that many bacteria spread throughout numerous phyla have two E. coli-like dnaEs in their genomes (not including the PolC/DnaE combinations covered in “Bacterial Replicases” and “Organisms that Contain Multiple DNA Polymerase IIIs: Functions and Interactions”), with the second one apparently replacing Pol V, the major polymerase responsible for induced mutagenesis in E. coli. These two Pol IIIs have sometimes been designated DnaE1 and DnaE2. Two early elegant studies that associated dnaE2 with a role in induced mutagenesis were performed in Mycobacterium tuberculosis (Mtb) and Caulobacter crescentus (Ccr) (Boshoff et al. 2003; Galhardo et al. 2005). In Mtb, knockouts of dnaE2 resulted in the loss of the enhancement of mutation that accompanies UV irradiation (Boshoff et al. 2003). Overproduction of DnaE2 alone did not restore mutagenesis, in contrast to single-subunit Pol Y polymerases in other organisms, for which overproduction alone is characteristically sufficient (Boshoff et al. 2003; Kim et al. 1997).

Functions and Interactions of the Special Mutagenic Pol III, ImuC

In Ccr, knockouts of dnaE2 substantially reduced the stimulation of mutagenesis following UV irradiation or mitomycin C treatment (Galhardo et al. 2005). Furthermore, DNA polymerase IV (DinB) was not induced by UV irradiation or treatment with mitomycin C, analogous to the Mtb situation. In Ccr, dnaE2 was the distal gene in an operon, preceded by a small gene (imuA) showing weak similarity to E. coli sulA (and recA) and a gene showing similarity to Pol Y-like genes (imuB). Knockouts of imuA or imuB ablated induced mutagenesis and were epistatic to dnaE2. Knockouts of the Pol IV structural gene, dinB, did not result in a diminution of UV- or mitomycin C-induced mutation (Galhardo et al. 2005). It is now recognized that Mtb has ImuA and ImuB homologues that are required for DnaE2 function (Warner et al. 2010), so I will refer to DnaE2 as ImuC, for clarity, even when the original publication did not. This convention is a logical extension of the nomenclature of Menck and co-workers (Galhardo et al. 2005) and will aid in distinguishing this polymerase from DnaEs that coexist with Pol Cs, with which they are often confused in the literature. It has been demonstrated that in Mtb, ImuB interacts (in a yeast two-hybrid determination) with the replicative DnaE, ImuC, and ImuA (Warner et al. 2010). ImuB, in spite of being closely homologous to Pol Y error-prone polymerases, does not contain the triad of conserved Pol III catalytic acidic residues (Warner et al. 2010) and thus must be inactive as a polymerase. Mutation of ImuC’s predicted catalytic Asp residues ablates induced mutagenesis. Thus, ImuC is the error-prone polymerase in Mtb and presumably in other organisms that contain ImuA, ImuB, and ImuC and lack Pol V homologues. Yet, ImuB interacts with β2, but ImuC does not. Thus, it appears that ImuB serves an important gatekeeper role, interacting with β2, the replicative DnaE, and the error-prone polymerase ImuC.

What role might an inactive Pol Y polymerase play? Although not well understood, a mechanism must exist that permits Pol Y polymerases to access the replication fork, replacing the normal cellular replicase when blocking lesions are encountered. Perhaps the ImuB protein preserves these functions even though the trans-lesion catalytic function has been deferred to ImuC. If true, this may provide a favorable system for understanding the activities of Pol Y polymerases distinct from their polymerase catalytic activity. We note that a eukaryotic trans-lesion polymerase with feeble activity, Rev1, interacts with other trans-lesion polymerases, including Pol κ, Pol ζ, Pol ι, and Pol η (Acharya et al. 2005; Guo et al. 2003; Ohashi et al. 2004; Tissier et al. 2004). This may provide a functional parallel to the ImuB-ImuC interaction.

In P. aeruginosa, an imuC gene is preceded by a sulA (recA)-like gene (imuA) and a Pol Y-like structural gene (imuB) in an apparent operon. A LexA binding site lies immediately upstream of this operon, suggesting SOS regulation. imuA, imuB, and imuC are induced by agents that induce the SOS response and by ciprofloxacin treatment. Mutation of imuC abolishes UV-induced mutagenesis (Sanders et al. 2006). Knocking out imuB abolishes mutagenesis (Pope, Lindow, Dohrmann, and McHenry, unpublished), in agreement with the expectations from the Mtb and Ccr systems but differing from a report in another pseudomonad (Koorits et al. 2007).
There has been considerable confusion in the literature about the relationship between DnaE-like polymerases that coexist with Pol C in Pol C-containing strains (“Bacterial Replicases”, “Organisms that Contain Multiple DNA Polymerase IIIs: Functions and Interactions”) and those that coexist with replicative DnaEs. The consensus sequence from a comprehensive ImuC alignment was compared with an alignment of DnaEs that exist alone or with Pol Cs. Both of the latter are very similar and distinguishable from ImuC. Only a few sequence stretches occur in which the absolutely conserved residues in ImuC differ from those in DnaE (Fig. 1), other than differences arising from elements being absent from ImuC, such as an obvious β2 binding loop (Warner et al. 2010). These differences reside near the three catalytic aspartates and at the C-terminal end of the finger domain. Mapping the differences onto the DNA-dNTP-Taq Pol III α structure shows these residues lining the active site and the end of the DNA binding channel near the primer terminus (McHenry 2011). This would appear to be consistent with the error-prone function attributed to ImuC if these changes relax substrate binding in a way that diminishes fidelity and permits bypass. This hypothesis needs to be tested experimentally.

Might ImuC Play a Role in the Development of Drug Resistance in Tuberculosis and in Pseudomonas-infected Cystic Fibrosis Patients?

In a seminal study, a group in Madrid showed that cystic fibrosis patients are colonized early in life with a small number of P. aeruginosa strains and that these remain for the patient’s lifetime but adapt to the environment by mutation (Oliver et al. 2000), a factor contributing to drug resistance and treatment failure. With disease progression, these strains become progressively hypermutable, often by mutation of mismatch repair genes (Ciofu et al. 2005; Hassett et al. 2010; Hogardt et al. 2007; Oliver et al. 2000). In E. coli, it has been shown that induction of SOS mutagenesis is important for the evolution of drug resistance, even in the presence of mismatch repair deficiencies (Cirz and Romesberg 2006). In Mtb, persistence and evolution of drug resistance in animal models are diminished in imuC knockouts (Boshoff et al. 2003). Because the development of drug resistance within individuals affects treatment outcomes in cystic fibrosis, the ImuC system represents a potential target for chemotherapy, as has been suggested (Smith and Romesberg 2007).


Many Bacteria Use a Special Mutagenic Pol III in Place of Pol V, Fig. 1 Completely conserved residues in ImuC that differ from the corresponding position in DnaE. (a) Residues absolutely conserved in ImuC that differ from DnaE are highlighted in yellow, both here and in panel b. Other significant differences are highlighted in grey. Other than standard amino acid abbreviations, h represents hydrophobic, o polar, + basic, x no consensus, and # small. The first two sequence blocks show residues surrounding the three catalytic aspartates (PDID and hDh in the ImuC line). The third block shows sequences at the C-terminal end of the finger domain. (b) The active site of Taq Pol III α (Wing et al. 2008) showing residues that differ from those conserved in ImuC using grey or yellow as specified in panel a. The three catalytic Asp residues are highlighted in orange. The incoming dNTP is green, and the primer and template strands are shown in stick form in grey and wheat, respectively. The figure was prepared in PyMOL using PDB 3E0D. Only the palm (magenta) and fingers (blue) protein domains are shown. Gly 666, Arg 767, and Lys 771 are hidden for clarity

Cross-References

▶ Bacterial DNA Replicases

References

Acharya N, Haracska L, Johnson RE, Unk I, Prakash S, Prakash L (2005) Complex formation of yeast Rev1 and Rev7 proteins: a novel role for the polymerase-associated domain. Mol Cell Biol 25:9734–9740
Boshoff HI, Reed MB, Barry CE, Mizrahi V (2003) DnaE2 polymerase contributes to in vivo survival and the emergence of drug resistance in Mycobacterium tuberculosis. Cell 113:183–193
Ciofu O, Riis B, Pressler T, Poulsen HE, Hoiby N (2005) Occurrence of hypermutable Pseudomonas aeruginosa in cystic fibrosis patients is associated with the oxidative stress caused by chronic lung inflammation. Antimicrob Agents Chemother 49:2276–2282
Cirz RT, Romesberg FE (2006) Induction and inhibition of ciprofloxacin resistance-conferring mutations in hypermutator bacteria. Antimicrob Agents Chemother 50:220–225
Foster PL (2005) Stress responses and genetic variation in bacteria. Mutat Res 569:3–11
Galhardo RS, Rocha RP, Marques MV, Menck CF (2005) An SOS-regulated operon involved in damage-inducible mutagenesis in Caulobacter crescentus. Nucleic Acids Res 33:2603–2614
Guo C, Fischhaber PL, Luk-Paszyc MJ, Masuda Y, Zhou J, Kamiya K, Kisker C, Friedberg EC (2003) Mouse Rev1 protein interacts with multiple DNA polymerases involved in translesion DNA synthesis. EMBO J 22:6621–6630
Hassett DJ, Korfhagen TR, Irvin RT, Schurr MJ, Sauer K, Lau GW, Sutton MD, Yu H, Hoiby N (2010) Pseudomonas aeruginosa biofilm infections in cystic fibrosis: insights into pathogenic processes and treatment strategies. Expert Opin Ther Targets 14:117–130
Hogardt M, Hoboth C, Schmoldt S, Henke C, Bader L, Heesemann J (2007) Stage-specific adaptation of hypermutable Pseudomonas aeruginosa isolates during chronic pulmonary infection in patients with cystic fibrosis. J Infect Dis 195:70–80
Kim SR, Maenhaut-Michel G, Yamada M, Yamamoto Y, Matsui K, Sofuni T, Nohmi T, Ohmori H (1997) Multiple pathways for SOS-induced mutagenesis in Escherichia coli: an overexpression of dinB/dinP results in strongly enhancing mutagenesis in the absence of any exogenous treatment to damage DNA. Proc Natl Acad Sci U S A 94:13792–13797
Koorits L, Tegova R, Tark M, Tarassova K, Tover A, Kivisaar M (2007) Study of involvement of ImuB and DnaE2 in stationary-phase mutagenesis in Pseudomonas putida. DNA Repair 6:863–868
McHenry CS (2011) Breaking the rules: multiple replicases and DNA polymerase IIIs in bacteria. EMBO Rep 12:408–414
McKenzie GJ, Rosenberg SM (2001) Adaptive mutations, mutator DNA polymerases and genetic change strategies of pathogens. Curr Opin Microbiol 4:586–594
Ohashi E, Murakumo Y, Kanjo N, Akagi J, Masutani C, Hanaoka F, Ohmori H (2004) Interaction of hREV1 with three human Y-family DNA polymerases. Genes Cells 9:523–531
Oliver A, Cantón R, Campo P, Baquero F, Blázquez J (2000) High frequency of hypermutable Pseudomonas aeruginosa in cystic fibrosis lung infection. Science 288:1251–1254
Pages V, Fuchs RPP (2002) How DNA lesions are turned into mutations within cells? Oncogene 21:8957–8966
Sanders LH, Rockel A, Lu H, Wozniak DJ, Sutton MD (2006) Role of Pseudomonas aeruginosa dinB-encoded DNA polymerase IV in mutagenesis. J Bacteriol 188:8573–8585
Smith PA, Romesberg FE (2007) Combating bacteria and drug resistance by inhibiting mechanisms of persistence and adaptation. Nat Chem Biol 3:549–556
Tippin B, Pham P, Goodman MF (2004) Error-prone replication for better or worse. Trends Microbiol 12:288–295
Tissier A, Kannouche P, Reck MP, Lehmann AR, Fuchs RP, Cordonnier A (2004) Co-localization in replication foci and interaction of human Y-family members, DNA polymerase polη and REV1 protein. DNA Repair 3:1503–1514
Walker GC (2005) Lighting torches in the DNA repair field: development of key concepts. Mutat Res 577:14–23
Warner DF, Ndwandwe DE, Abrahams GL, Kana BD, Machowski EE, Venclovas C, Mizrahi V (2010) Essential roles for imuA’- and imuB-encoded accessory factors in DnaE2-dependent mutagenesis in Mycobacterium tuberculosis. Proc Natl Acad Sci U S A 107:13093–13098
Wing RA, Bailey S, Steitz TA (2008) Insights into the replisome from the structure of a ternary complex of the DNA polymerase III α-subunit. J Mol Biol 382:859–869

Mass Spectrometry Approaches

Scott Cooper and Anton Sanderfoot
Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis

Mass spectrometry (MS) separates ionized molecules, including proteins and peptides, according to their mass-to-charge ratio. Complex mixtures of proteins must first be separated, either by classical chromatography techniques or by two-dimensional gel electrophoresis. The separated proteins are then proteolytically digested to produce peptides, which are identified by MS through peptide mass fingerprinting or through tandem MS (MS/MS), in which the peptides are fragmented further. Electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) are common methods for ionizing protein samples for MS analysis.

Introduction

Many modern proteomics studies rely on identifying proteins in very small samples, down to the femtomole (10⁻¹⁵ mol) level. This cannot be accomplished with traditional N-terminal sequencing methods such as Edman degradation. Instead, mass spectrometry is increasingly used to identify entire peptides or fragments of peptides by their mass-to-charge ratio. This can be done in a top-down approach, in which an intact protein is ionized and then fragmented, and the fragments are analyzed. Alternatively, in a bottom-up approach, the protein is proteolytically digested into peptides, which are separated and analyzed, and the results are then combined to identify the original protein.


Mass Spectrometry (MS)

Mass spectrometry (MS) separates molecules according to their mass-to-charge ratio. To do this, the sample must first be turned into a gas (vaporized) and then ionized. The ionized sample is then passed through an electromagnetic field such that the ions travel in an arc. Depending on the charge and mass of an ion, its speed may be increased or decreased as it passes through the electric field, and its direction may be altered by the magnetic field. The amount of deflection depends on the mass-to-charge ratio of the ion. By Newton’s second law of motion (force equals mass times acceleration, F = ma), a given force accelerates a lighter ion more than a heavier one, so lighter ions are deflected by the magnetic force more than heavier ions of the same charge (Fig. 1). The streams of sorted ions pass from the analyzer to the detector, which records the relative abundance and the mass-to-charge ratio of each ion type. Biological samples are often quite complex and contain large molecules, which can make volatilization and ionization a challenge.

Separation of Proteins

To reduce the complexity of a sample, the proteins must be separated. This can be done by a variety of traditional chromatography methods such as liquid chromatography, but increasingly, two-dimensional (2D) gels are used. In a 2D gel, proteins are first separated by charge on thin strips containing a mixture of ampholytes that establishes a pH gradient. Proteins are electrophoresed until they reach the region of the strip at which the pH equals their isoelectric point (pI). At this point the protein carries no net charge and stops moving. The strip is then layered on top of an SDS-PAGE gel, and the proteins that were separated by pI are separated in the second dimension by molecular weight. Thus proteins in a fairly complex sample can be separated by both charge and size. A spot of interest can be excised from the 2D gel and digested with enzymes into peptides, which can then be identified by MS (Fig. 2).
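The first dimension of a 2D gel relies on each protein having a pH at which its net charge is zero. As a rough illustration, the following sketch estimates that pH for a short peptide from the Henderson-Hasselbalch relation. The pKa values are assumed, approximate textbook-style numbers (published tables differ by a few tenths of a pH unit), the function names are hypothetical, and real proteins deviate because of their folded environment.

```python
# Approximate pKa values for ionizable groups (assumed values for illustration).
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    # Every residue plus the two chain termini may contribute a charge.
    for group in list(sequence) + ["N_TERM", "C_TERM"]:
        if group in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[group]))
        elif group in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[group] - ph))
    return charge


def isoelectric_point(sequence, tol=1e-4):
    """Find the pH at which the net charge is zero by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged, so the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


pi_basic = isoelectric_point("AKKRKA")    # lysine/arginine-rich peptide
pi_acidic = isoelectric_point("ADEEDA")   # aspartate/glutamate-rich peptide
```

Under these assumptions the basic peptide focuses at a high pH and the acidic peptide at a low pH, which is the behavior the strip exploits to spread proteins out along the gradient.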

Mass Spectrometry Approaches, Fig. 1 Schematic diagram of a mass spectrometer (Courtesy of http://www2.chemistry.msu.edu/faculty/reusch/virttxtjml/spectrpy/massspec/masspec1.htm)
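The statement that lighter ions are deflected more can be made quantitative under one assumption that the entry does not spell out: a simple magnetic-sector geometry, in which an ion of mass m and charge q is first accelerated through a potential difference V and then bent by a uniform magnetic field B. The symbols V, B, v (ion speed), and r (radius of the arc) are introduced here for illustration only.

```latex
\begin{align}
  qV  &= \tfrac{1}{2}\, m v^{2}
      && \text{(energy gained in the electric field)} \\
  qvB &= \frac{m v^{2}}{r}
      && \text{(magnetic force supplies the centripetal force)} \\
  r   &= \frac{m v}{q B} = \frac{1}{B}\sqrt{\frac{2 V m}{q}}
      && \text{(radius of the ion's arc)}
\end{align}
```

For fixed V and B, the radius grows with the square root of the mass-to-charge ratio, so an ion with a small m/q follows a tighter arc, that is, it is deflected more.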

Peptide Identification by MS

There are two main ways in which MS is used to identify peptides generated from proteins. Peptide mass fingerprinting compares the masses of proteolytic peptides from a sample with the peptide masses predicted from a database of known protein sequences, using a search engine such as MASCOT. Multiple peptides from the sample must match a single protein to establish that the protein was present in the initial sample. If peptide mass fingerprinting yields ambiguous results, the peptides can be subjected to tandem mass spectrometry (MS/MS) for de novo sequencing. In this method, peptides are separated by 2D gel or liquid chromatography and then sent through a mass spectrometer to separate them by mass. Peptides of a specific mass are then fragmented further at peptide bonds by collision-induced dissociation and sent through a second mass spectrometer (Fig. 3). Because this second fragmentation most commonly occurs at peptide bonds, the resulting fragment peaks are less complex and will often allow the amino acid sequence to be deduced. In either of these methods, peptides are identified by searching a database of known proteins that could be in the sample or by comparison with all proteins encoded by a genome. Peptide mass fingerprinting tends to yield more accurate predictions but would miss proteins that are not in the database.

Mass Spectrometry Approaches, Fig. 2 Steps in two-dimensional gel electrophoresis (Courtesy of National Institutes of Health)

Mass Spectrometry Approaches, Fig. 3 Chromatography trace and MS2 spectra of a peptide. Panels: base peak chromatogram (relative abundance versus time in min, 0 to 65 min) and the MS2 spectrum recorded at a retention time of 50.06 min (relative abundance versus m/z) (Courtesy of http://mpproteomics.com/ms.html)

Mass Spectrometry Approaches, Fig. 4 MALDI-TOF. Schematic of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry, showing the laser, focusing lens, sample slide, ion acceleration, and detector; the detector signal is plotted as intensity versus time (Courtesy of iastate.edu)
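The matching step in peptide mass fingerprinting, as described in the section on peptide identification, can be sketched in a few lines. This is a minimal illustration, not the scoring algorithm of MASCOT or any other search engine: it assumes trypsin as the protease (cleavage after K or R, but not before P), uses standard monoisotopic residue masses, and counts matches within an assumed tolerance of 0.01 Da. The two "database" sequences and all function names are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(protein):
    """In-silico digest: cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fingerprint_score(observed, protein, tol=0.01):
    """Number of observed masses that match a theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


# Hypothetical two-entry database and "observed" masses, for illustration only.
database = {"protA": "MAGKLSDERWTPK", "protB": "GGSKAAVLRPEEK"}
observed = [peptide_mass(p) for p in tryptic_peptides(database["protA"])]
best = max(database, key=lambda name: fingerprint_score(observed, database[name]))
```

Requiring several matching peptides per protein, as the text notes, is what makes the identification convincing; a single matching mass could easily arise by chance in a large database.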

Volatilization and Ionization of Proteins

To volatilize and ionize large molecules such as peptides, electrospray ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI) are commonly used. Separation of peptides by liquid chromatography is most commonly coupled with ESI: the eluted sample is fed into the tip of a capillary, to which a high electric field is applied. The sample is sprayed into the electric field along with a flow of nitrogen to promote vaporization and ionization and then enters the mass spectrometer for mass-to-charge measurement. For most other samples, MALDI allows high sample throughput, and several proteins can be analyzed in a single experiment. A small fraction of the peptide sample is mixed with a 10,000-fold excess of an ultraviolet-absorbing matrix and pipetted onto a MALDI target. The target is inserted into the vacuum chamber of the mass spectrometer, and a pulsed laser beam transfers large amounts of energy into the matrix molecules, vaporizing both the matrix and the peptides. However, unlike MS/MS techniques, the laser ionizes but does not fragment the sample. Once in the gas phase, the peptide ions are accelerated in the electric field of the mass spectrometer and fly toward an ion detector, where their arrival is detected as an electric signal (Fig. 4). The time of flight (TOF) for an ion of mass m and charge z to travel the fixed distance to the detector is $t = k\,(m/z)^{1/2}$, where k is a constant of the instrument, so the mass-to-charge ratio is proportional to the square of the time of flight. Thus the time an ion takes to reach the detector can be used to calculate its mass.
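The square-root relation between flight time and mass-to-charge ratio can be turned into a small calculation. The sketch below assumes a single-point calibration with one ion of known m/z; the calibrant values and function names are hypothetical, and real instruments typically use multi-point calibration with additional correction terms.

```python
import math


def calibrate_k(t_ref, mz_ref):
    """Instrument constant k from one calibrant ion, using t = k * sqrt(m/z)."""
    return t_ref / math.sqrt(mz_ref)


def mz_from_time(t, k):
    """Invert t = k * sqrt(m/z) to obtain m/z = (t / k) ** 2."""
    return (t / k) ** 2


# Hypothetical calibrant: an ion of m/z 1000 arriving after 20 microseconds.
k = calibrate_k(20.0, 1000.0)

# An unknown ion that takes twice as long has four times the m/z.
mz_unknown = mz_from_time(40.0, k)
```

Doubling the flight time quadruples the inferred m/z, which is why heavier ions of the same charge arrive later at the detector.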

Cross-References

▶ Primary Structure


Mathematical Modeling of Plasmid Dynamics

Jan-Ulrich Kreft
School of Biosciences and Institute of Microbiology and Infection and Centre for Systems Biology, University of Birmingham, Edgbaston, Birmingham, UK

Synopsis

Plasmids allow the rapid spread of genes through bacterial populations by a number of routes (transformation, transduction, and, most specifically, conjugative transfer) followed by independent replication, rather than being constrained by the need for recombination in the new host. The dynamics of such spread are of both fundamental and practical importance, particularly for understanding the spread of antibiotic resistance and for developing plasmids as tools for delivering key functions into a variety of systems. Modeling can pinpoint key factors that facilitate or impede spread and can help to extend predictions from laboratory systems to practical applications. Mass action models have been useful for modeling plasmid dynamics in well-mixed systems but are much less successful when applied to structured bacterial communities, as found in biofilms and other types of growth on surfaces or in matrices. Initially, a second generation of models worked with simple surface systems and made simplifying assumptions about nutrient supply and growth rate, but more complex individual-based modeling is required to simulate more realistic scenarios. These tools now provide an important resource for understanding plasmid behavior and the selective advantages of properties such as incompatibility and host range.

Introduction

Plasmids form a key but variable part of bacterial genomes that can be lost in some contexts and spread rapidly through a population in others. This dynamic situation has great practical importance in many contexts, especially in the maintenance and spread of plasmid-mediated antibiotic resistance in both commensal and pathogenic bacteria. Following on from the consideration of intracellular control circuits encoded by plasmids for the benefit of themselves and their hosts, it is important to consider how computer modeling can provide an understanding of the distribution of plasmids in populations and microbial communities.

Mass Action Models

Levin and coworkers (Levin and Stewart 1976) began modeling plasmid dynamics in the 1970s, following the mass action approach to population dynamics pioneered by Lotka and Volterra in the 1920s. As the name mass action indicates, random encounters of predator and prey, or of donor and recipient, are treated like the collisions of molecules leading to chemical reactions. The rate of plasmid transfer is then simply proportional to the number of collisions and therefore to the product of the densities of the interacting “particles.” As with most other dynamical models, mass action models are built from the processes deemed relevant, each described in terms of its rate. The minimal ingredients are plasmid transfer, plasmid loss, and host growth, including how growth is affected by carrying a plasmid. From such simple models, a very important conclusion can already be drawn: plasmids can persist only if their rate of transfer into new hosts is high enough to compensate for the rate of decline of plasmid-bearing cells relative to plasmid-free cells due to the burden (fitness cost) caused by the plasmid, plus the rate of loss by segregation (Simonsen 1991; Bergstrom et al. 2000).

The mass action approach has been used successfully to describe plasmid transfer in well-mixed systems when accounting for the dependence of the transfer rate on substrate concentration (Smets et al. 1993; Licht et al. 1999) and type of donor (Dionisio et al. 2002). The study by Dionisio et al. (2002) is a good example of fitting mass action models of plasmid transfer to experimental data in order to obtain proper rate constants of plasmid transfer that are independent of donor and recipient densities. This is crucial for comparing results obtained in different systems, which are unlikely to have the same densities of donors and recipients. Work by Zhong et al. (2010) advanced the art of parameter inference by decomposing the overall transfer rate into several steps, leading to more intrinsic parameters that are less dependent on the specifics of the experimental setup and therefore more universal. See also their follow-up parameter inference study, which includes models with spatial structure (Zhong et al. 2012).
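The persistence condition stated for mass action models can be illustrated with a deliberately reduced sketch. It assumes a constant total cell density N (as in a chemostat at steady state), in which case the fraction p of plasmid-bearing cells obeys dp/dt = p(1 − p)(γN − c) − τp, where γ is the transfer rate constant, c the growth-rate cost of carrying the plasmid, and τ the segregational loss rate. This reduction, the parameter values, and the function name are assumptions made for illustration and are not taken from the cited studies.

```python
def simulate_fraction(p0, gamma_n, cost, tau, t_end=2000.0, dt=0.01):
    """Euler integration of dp/dt = p(1 - p)(gamma_n - cost) - tau * p.

    p0       initial fraction of plasmid-bearing cells
    gamma_n  transfer rate constant times total cell density (per hour)
    cost     growth-rate reduction caused by the plasmid (per hour)
    tau      segregational loss rate (per hour)
    """
    p = p0
    for _ in range(int(t_end / dt)):
        p += dt * (p * (1.0 - p) * (gamma_n - cost) - tau * p)
    return p


# Transfer outweighs cost plus loss (0.10 > 0.05 + 0.01): the plasmid persists
# and the fraction settles at 1 - tau / (gamma_n - cost) = 0.8.
persisting = simulate_fraction(0.01, gamma_n=0.10, cost=0.05, tau=0.01)

# Transfer is too slow (0.04 < 0.05 + 0.01): the plasmid is lost even when
# half of the population carries it initially.
declining = simulate_fraction(0.50, gamma_n=0.04, cost=0.05, tau=0.01)
```

In this reduced form the plasmid invades from rarity only when γN exceeds c + τ, which restates the verbal condition that transfer must compensate for both the fitness cost and segregational loss.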

Models Incorporating Spatial Structure

Lagido et al. (2003) were the first to include the effects of spatial structure on plasmid transfer in a model. They implemented a model for horizontal transfer of plasmids between colonies on flat surfaces, e.g., agar plates, describing the dynamics at the level of the colonies. Donor and recipient cells were placed randomly on the surface and formed separate colonies that grew exponentially at the same rate until nutrients were completely depleted. Plasmid transfer occurred when donor and recipient colonies met, and all recipients became donors instantaneously. Although this model described the observed trends resulting from varying donor and recipient numbers and ratios in the inoculum, it overestimated conjugation frequencies by orders of magnitude. Relaxing the assumption of instantaneous transfer by including lag times brought the predicted conjugation frequencies to within half a log unit of experimental values (Lagido et al. 2003).

Massoudieh et al. (2007) modeled plasmid transfer in saturated porous media such as the subsurface, with a view to extending this to unsaturated porous media such as soil. They considered the planktonic phase and the sessile phase on the colony scale, using a delay-difference equation approach to track the change of bacterial type from recipient to transconjugant during a transfer event, and found that including such delays yields a better fit to experimental results (Massoudieh et al. 2007).

Models that consider spatial structure on the scale of colonies (Lagido et al. 2003; Massoudieh et al. 2007) can simulate whole Petri dishes but would be too coarse for simulating complex biofilm structures or communities with many species. It is also advantageous if the same model can be used to simulate a range of environments, including the planktonic and biofilm phases, in which colonies are not always meaningful entities. Individual cells, by contrast, are always meaningful entities. Models on the finer scale of single cells, which describe the growth of individual cells, plasmid transfer between cells, fitness costs, and segregational loss, are called individual-oriented models and, if they allow individual variability, individual-based models. Such models are also ideal for incorporating models of subcellular dynamics, such as models of plasmid gene regulation.

Individual-Based Models

Individual-oriented models were first developed by Krone and coworkers (Krone et al. 2007). Simulated time courses of donor, recipient, and transconjugant densities for pB10 transfer in E. coli could be matched reasonably well to experiments, and simulations of sectorial outgrowth of the fitter segregants in colonies on agar plates matched experimentally observed patterns qualitatively (Fig. 1). An extension of this model could reproduce the observed dependence of IncP-1 plasmid infection and abundance on spatial structure and nutrient availability (Fox et al. 2008). Because these studies focused on modeling bacterial growth and plasmid transfer in macroscopic colonies on agar plates, some concessions had to be made to keep simulations of the large domains computationally feasible. Although individual bacteria are modeled, they are constrained to lie on a lattice and can reproduce only if there is vacant space in the lattice neighborhood (Krone et al. 2007). Nutrients were modeled as particles on the scale of bacteria, i.e., consumption of one food particle results in reproduction of the cell (Krone et al. 2007).

Mathematical Modeling of Plasmid Dynamics, Fig. 1 Comparison of simulations of the growth of pB10 plasmid-carrying E. coli cells and spontaneously forming segregants with experimental observations. Two simulations are shown (segregation rate 0.001, growth rate ratio 0.73) to indicate the variability of the simulated pattern, which is mainly due to stochastic plasmid loss. The experimental result is an E. coli colony grown at 37 °C for 6 days carrying the pB10 plasmid expressing rfp (red); sectors originate from an ancestor that lost the plasmid. In experiments: segregants white, plasmid-bearing cells pink. In simulations: nutrients green, two recipients light blue, one recipient dark blue, two donors or transconjugants pink, one donor or transconjugant red. Note that the spatial scales for simulations and for experiments differ; the region shown corresponds to approximately 1 mm and 5 cm per side, respectively (Adapted from Krone et al. 2007)

Aimed at a somewhat finer scale, the open-source individual-based simulation software iDynoMiCS, developed by Lardon et al. (2011), represents individual bacteria as spherical particles located in continuous space rather than on a grid. These “particles” take up space and, when they grow, shove neighbors away to make room rather than ceasing to grow. Substrates, being much smaller and more numerous, are not modeled as particles but as a continuum, i.e., as a solute concentration field. Concentrations are updated according to local substrate consumption rates and diffusion, allowing quantitative and accurate predictions of substrate concentration gradients and, from these, growth rates. iDynoMiCS can also simulate chemostat or batch cultures. Building on iDynoMiCS, Merkey et al. (2011) included plasmid transfer and made it dependent on growth rate in various ways, given that a number of studies have suggested various types of dependence. Figure 2 shows that some of the growth rate dependencies could reproduce the observation that plasmid spread into biofilms is often limited to the biofilm surface (Merkey et al. 2011). While iDynoMiCS is computationally more demanding than the model of Krone and colleagues, it is conceptually simpler (representing space, time, and substrates as continuous) and more flexible, as there are no restrictions on how many species, plasmids, or interactions can be included.

Another individual-based model, COSMIC-rules, was developed by Gregory et al. (2008). It consists of three levels: genomes (including plasmids), cells, and environments. The environment is seeded with various substances, such as nutrients and antibiotics, in various places. Possible interactions between cells, and between cells and substances in the environment, are decided by bit matching of gene and substance patterns, thereby implementing a mapping from genotype to phenotype (behavior rules coded by the genome). COSMIC-rules has been used to model the dynamics of compatible and incompatible resistance plasmids in environments with and without antibiotic selection, including some specifics such as fertility inhibition. However, this model has limitations; for example, it does not include diffusion and is not meant to make quantitative predictions for any particular real system. Rather, it is meant to be more abstract, to enable the evolution of novel characteristics of virtual organisms in a virtual world.
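To make the lattice-based, individual-oriented approach more concrete, the following toy sketch places cells on a square lattice, lets them reproduce only into vacant neighboring sites, and includes segregational loss and conjugation between neighbors. It is not the published model of Krone et al. (2007): nutrients are omitted, the update rule and the growth and transfer probabilities are assumptions, and only the default segregation rate (0.001) and growth rate ratio (0.73) are borrowed from the values shown in Fig. 1.

```python
import random

EMPTY, FREE, PLASMID = 0, 1, 2  # site states: vacant, plasmid-free, plasmid-bearing


def neighbours(i, j, size):
    """Four nearest lattice neighbours with wrap-around boundaries."""
    return [((i + 1) % size, j), ((i - 1) % size, j),
            (i, (j + 1) % size), (i, (j - 1) % size)]


def step(grid, rng, growth=0.5, ratio=0.73, loss=0.001, transfer=0.1):
    """Update one randomly chosen lattice site."""
    size = len(grid)
    i, j = rng.randrange(size), rng.randrange(size)
    cell = grid[i][j]
    if cell == EMPTY:
        return
    ni, nj = rng.choice(neighbours(i, j, size))
    target = grid[ni][nj]
    if target == EMPTY:
        # Reproduction needs vacant space; plasmid carriers grow more slowly.
        p_grow = growth * (ratio if cell == PLASMID else 1.0)
        if rng.random() < p_grow:
            daughter = cell
            if cell == PLASMID and rng.random() < loss:
                daughter = FREE  # segregational loss at division
            grid[ni][nj] = daughter
    elif cell == PLASMID and target == FREE and rng.random() < transfer:
        grid[ni][nj] = PLASMID  # conjugation: recipient becomes transconjugant


def run(size=20, steps=20000, seed=1, **rates):
    """Start from one donor next to one recipient; return (free, plasmid) counts."""
    rng = random.Random(seed)
    grid = [[EMPTY] * size for _ in range(size)]
    grid[size // 2][size // 2] = PLASMID
    grid[size // 2][size // 2 + 1] = FREE
    for _ in range(steps):
        step(grid, rng, **rates)
    flat = [c for row in grid for c in row]
    return flat.count(FREE), flat.count(PLASMID)


free_cells, plasmid_cells = run()
```

Because every event is local, the outcome depends on where cells sit relative to one another, which is exactly the information that a mass action description averages away.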


Mathematical Modeling of Plasmid Dynamics, Fig. 2 Simulations showing how plasmid spread into a 1-day-old biofilm after 18 h is limited to the biofilm surface if transfer depends on growth rate, as shown schematically to the left: no dependence (top panel), transfer rate proportional to growth rate (middle panel), and stronger growth dependence with a growth rate threshold for plasmid transfer (bottom panel). Invasion always starts with a single donor cell at the top of the biofilm. The lateral extent of spread, i.e., along the biofilm surface, is similar in all cases. Recipients red, donors blue, transconjugants yellow, and transconjugant daughter cells green, visualizing the contribution of horizontal transmission as yellow and vertical transmission as green (Adapted from Merkey et al. 2011)
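The three kinds of growth dependence compared in Fig. 2 can be written as simple rate functions. The functional forms below are assumptions chosen for illustration (in particular the threshold rule and its default value of 0.5); they are not the exact expressions used by Merkey et al. (2011), and the function and parameter names are hypothetical.

```python
def transfer_rate(growth_rate, max_growth, base_rate, mode, threshold=0.5):
    """Plasmid transfer rate of a donor under three growth dependencies.

    mode "none":         transfer is independent of growth
    mode "proportional": transfer scales linearly with relative growth rate
    mode "threshold":    no transfer below a relative growth-rate threshold
    """
    relative = growth_rate / max_growth
    if mode == "none":
        return base_rate
    if mode == "proportional":
        return base_rate * relative
    if mode == "threshold":
        return base_rate * relative if relative >= threshold else 0.0
    raise ValueError(f"unknown mode: {mode}")


# A slowly growing donor deep in the biofilm versus a fast one at the surface.
deep = [transfer_rate(0.25, 1.0, 2.0, m) for m in ("none", "proportional", "threshold")]
surface = [transfer_rate(0.75, 1.0, 2.0, m) for m in ("none", "proportional", "threshold")]
```

With growth-dependent transfer, slowly growing cells in the nutrient-poor interior of a biofilm transfer the plasmid rarely or not at all, which is one way to obtain spread that stays near the surface.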

Conclusions

In conclusion, mass action models can be fitted to population-level data in order to extract parameters. They do not require single-cell data or mechanistic knowledge. They are also a good choice for modeling dynamics in large, relatively homogeneous systems or when not enough is known about the heterogeneities to allow a meaningful description of the system. If the purpose is prediction of dynamics in different environments or situations rather than fitting the model to the data, individual-oriented models are to be preferred. This is because they are based on a mechanistic description of the dynamics of transfer and maintenance that includes the effects of relevant environmental conditions, e.g., how transfer rate depends on substrate concentration. Hence, they require at least a basic mechanistic understanding and more data, preferably on both the single-cell and the population level. This apparent disadvantage could in fact be an advantage, as they can make good use of data on both levels. Thus, using complementary large- and small-scale models in combination is the preferred way to model plasmid systems.

Cross-References

▶ Conjugative Transfer Systems and Classifying Plasmid Genomes
▶ Mathematical Models in the Sciences
▶ Metamobilomics – The Plasmid Metagenome of Natural Environments
▶ Plasmid Genomes, Introduction to
▶ Plasmid Incompatibility
▶ Plasmid Regulatory Systems, Modeling
▶ Plasmids as Secondary Chromosomes
▶ Plasmids, Naming and Annotation of
▶ Synthetic Plasmid Biology
▶ Transposable Elements and Plasmid Genomes

References

Bergstrom CT, Lipsitch M, Levin BR (2000) Natural selection, infectious transfer and the existence conditions for bacterial plasmids. Genetics 155:1505–1519
Dionisio F, Matic I, Radman M, Rodrigues OR, Taddei F (2002) Plasmids spread very fast in heterogeneous bacterial communities. Genetics 162:1525–1532
Fox RE, Zhong X, Krone SM, Top EM (2008) Spatial structure and nutrients promote invasion of IncP-1 plasmids in bacterial populations. ISME J 2:1024–1039
Gregory R, Saunders JR, Saunders VA (2008) Rule-based modelling of conjugative plasmid transfer and incompatibility. BioSystems 91:201–215
Krone SM, Lu R, Fox R, Suzuki H, Top EM (2007) Modelling the spatial dynamics of plasmid transfer and persistence. Microbiology 153:2803–2816
Lagido C, Wilson IJ, Glover LA, Prosser JI (2003) A model for bacterial conjugal gene transfer on solid surfaces. FEMS Microbiol Ecol 44:67–78
Lardon LA, Merkey BV, Martins S, Dötsch A, Picioreanu C, Kreft JU, Smets BF (2011) iDynoMiCS: next-generation individual-based modelling of biofilms. Environ Microbiol 13:2416–2434
Levin BR, Stewart FM (1976) Conditions for existence of conjugationally transmitted plasmids in bacterial populations. Genetics 83:S45–S46
Licht TR, Christensen BB, Krogfelt KA, Molin S (1999) Plasmid transfer in the animal intestine and other dynamic bacterial populations: the role of community structure and environment. Microbiology 145:2615–2622
Massoudieh A, Mathew A, Lambertini E, Nelson K, Ginn T (2007) Horizontal gene transfer on surfaces in natural porous media: conjugation and kinetics. Vadose Zone J 6:306–315
Merkey BV, Lardon LA, Seoane JM, Kreft JU, Smets BF (2011) Growth dependence of conjugation explains limited plasmid invasion in biofilms: an individual-based modelling study. Environ Microbiol 13:2435–2452
Simonsen L (1991) The existence conditions for bacterial plasmids: theory and reality. Microb Ecol 22:187–205
Smets BF, Rittmann BE, Stahl DA (1993) The specific growth rate of Pseudomonas putida PAW1 influences the conjugal transfer rate of the TOL plasmid. Appl Environ Microbiol 59:3430–3437
Zhong X, Krol JE, Top EM, Krone SM (2010) Accounting for mating pair formation in plasmid population dynamics. J Theor Biol 262:711–719
Zhong X, Droesch J, Fox R, Top EM, Krone SM (2012) On the meaning and estimation of plasmid transfer rates for surface-associated and well-mixed bacterial populations. J Theor Biol 294:144–152


Mathematical Models in the Sciences

John Wesley Cain
Department of Mathematics and Computer Science, University of Richmond, Richmond, VA, USA

Synopsis

A mathematical model is an attempt to describe a natural phenomenon quantitatively. Mathematical models in the molecular biosciences appear in a variety of ways: some models are deterministic while others are stochastic, some models regard time as a discrete quantity while others treat it as a continuous variable, and some models offer algebraic relationships between variables while others describe how those variables evolve over time. This entry begins with a coarse dichotomy of some of the most common types of mathematical models. As an illustration, two very different models of the same simple decay process are derived and contrasted. The entry concludes with a discussion of (i) limitations of mathematical models, (ii) validation of models against scientific data, and (iii) the iterative process of refining and improving a model.

Introduction

A mathematical model is an attempt to describe a natural phenomenon quantitatively. An example that most students encounter during their science training is Newton’s Second Law (usually written F = ma), which states that the net force F acting on an object is proportional to the acceleration a of the object, with the mass m of the object acting as the proportionality constant. As a mathematical model, Newton’s Second Law tends to be an excellent approximation for the interactions of macroscopic objects that one sees in daily life. However, at the nanoscale or at relativistic speeds, the model may perform poorly. Herein lies an important theme: models of natural systems always incorporate assumptions regarding the system being modeled. When the assumptions are not met, the model may be a poor representation of reality, in which case successive refinements may help. Indeed, mathematical modeling (the process of constructing a model) is often an iterative process, as highlighted in section “Validation, Improvement, and Limitations of Models” below.

Surveying the entire spectrum of mathematical models would be far too broad an undertaking, but the following basic distinctions are helpful in dichotomizing model types.

Algebraic versus evolution equations: Model equations such as the ideal gas equation PV = nRT or Newton’s Second Law express exact algebraic relationships between important variables. For example, doubling the absolute temperature of an ideal gas in a vessel of constant volume will also double the pressure. Algebraic model equations govern what would happen if a quantity is changed but say nothing of how the quantities change over time. A very different class of mathematical models falls under the heading of evolution equations: equations that describe how quantities of interest change over time. Evolution equations tend to contain derivatives (rates of change) of dependent variables with respect to time; see, for example, section “Differential Equation Model of a Decay Process” below.

Mathematical versus statistical models: It is worth distinguishing between mathematical models and statistical models. Mathematical models are usually constructed in a more “principle-driven” manner, e.g., by appealing to Fick’s Law to describe the rate of motion of a chemical diffusing in a stationary liquid. Statistical models aim to quantify relationships between random variables; hopefully the reader will find that term sufficiently suggestive to proceed without requiring a technical definition. On some level, the distinction between mathematical and statistical models is blurred, because many of the famous equations in chemistry are derived from rather sophisticated statistical mechanics (e.g., the Arrhenius equation for the dependence of reaction rates on temperature).

There are two related terms worth mentioning here: deterministic versus stochastic models. Roughly speaking, a deterministic evolution model is one for which the initial state of the system completely determines all future states; randomness is not taken into account. Stochastic models do incorporate randomness, which can matter in biochemical contexts in which random interactions between molecules are important.

Continuous versus discrete: Evolution equations can be subdivided into those for which time is regarded as a continuous variable and those for which it is regarded as discrete. As an illustration, consider two different phenomena relating to cardiac dynamics: (a) fluctuations in voltage across a cell membrane due to Na+, K+, and Ca2+ ion transport and (b) fluctuations in the peak voltage attained during each heartbeat. A model of the former might treat time as continuous if the goal is to use mathematics to predict voltage as a function of time (graphically rendered as a continuous trace), while a model of the latter might regard time as discrete due to the inherently discrete nature of the heartbeat. There are many instances in which it is convenient to consider all variables as varying continuously even if they are technically discrete. Concentration of a chemical in a vessel of fixed volume is a good example: technically, there are only discretely many concentrations that can be achieved since the number of molecules is an integer.

Example: Two Models of a Simple Decay Process

The various distinctions mentioned above can be understood through derivation of two different models of a simple decay process X → Y.

Differential Equation Model of a Decay Process

An equation that contains a derivative of some dependent variable of interest is called a differential equation (DE). There are several reasons that DE models are prevalent in the sciences. First, many biological, chemical, and physical principles give rise to evolution equations
which describe how important quantities change over time, as opposed to providing exact formulas for the quantities. Another advantage of DE models is that they tend to be deterministic: if a DE is supplemented with appropriate auxiliary information (such as initial conditions, which describe the state of the system at some reference time), then the future state of the system is uniquely determined by this information. Finally, there is a vast literature devoted to the analysis of DE models, and there are standard techniques for solving (at least approximately) such equations.

As the simplest illustration of where DE models arise in chemistry, consider a first-order, nonreversible decay process X → Y in which one chemical species is converted into another. Assume that the system is closed, so that neither species is artificially added to or subtracted from the system. If x(t) denotes the number of molecules of X at time t, how might x(t) change during a short time interval of duration $\Delta t$? Provided that $\Delta t$ is small, it is natural to postulate that the number of molecules of X which spontaneously convert to Y will be proportional to both (i) the number of molecules of X and (ii) the time interval $\Delta t$. Mathematically,

$$x(t + \Delta t) - x(t) \approx -k\,x(t)\,\Delta t,$$

where k is a positive kinetic constant which would need to be measured experimentally. The expression $k\Delta t$ approximates the fraction of the molecules of X which are converted to Y during the time interval from t to $t + \Delta t$. As it stands, the above equation is an example of a discrete-time model: given an initial condition (IC) x(0) which specifies the number of molecules of X at an “initial” reference time t = 0, the formula can be applied recursively to estimate x at times $\Delta t$, $2\Delta t$, $3\Delta t$, and so on. Transitioning from this discrete-time model to a continuous-time DE model involves a routine procedure from introductory calculus. By preliminary algebra,

$$\frac{x(t + \Delta t) - x(t)}{\Delta t} \approx -k\,x(t),$$
after which taking the limit as $\Delta t$ shrinks to 0 yields the DE

$$\frac{dx(t)}{dt} = -k\,x(t).$$

In words, this DE states that the rate of change of x is proportional to the amount of x present. The same DE could be used to describe a radioactive decay process, in which the rate of change of the number of radioactive atoms in a sample is proportional to the number of atoms in the sample. As DEs go, this one is very basic, a consequence of the simplicity of the underlying chemical process. The equation does not provide an exact representation of x(t), only an expression explaining how x(t) influences its own rate of change. Ideally, one wishes to produce an explicit formula for x(t) as a function of t, a formula that could be validated against experimental data and (hopefully) used for interpolation and extrapolation. Such formulas are referred to as solutions of the DE, and this DE has infinitely many solutions. To single out a specific solution of particular interest, one imposes auxiliary conditions that solutions must satisfy, the most common type being an IC. In the above DE, suppose that the initial number of molecules of X is x(0) = N, a positive constant. It is possible to use standard mathematical techniques to show that the unique solution of the DE $dx/dt = -kx$ subject to the IC x(0) = N is $x(t) = Ne^{-kt}$. This formula predicts simple, exponential decay of x(t) over time, agreeing with earlier intuition. A DE together with its ICs is called an initial value problem (IVP), a ubiquitous class of mathematical models in the sciences.

(Stochastic) Markov Chain Model of a Decay Process

The same nonreversible, first-order process X → Y can be modeled in a probabilistic manner as follows. Let ζ(t) be a time-dependent random variable corresponding to the number of molecules of X that remain at time t, and let $P_x(t)$ denote the probability that ζ(t) = x. The quantity ζ(t) can take on finitely many values, namely, the numbers 0, 1, 2, ..., N, where N denotes the initial number of
molecules of X. If ζ(t) = x, how might one approximate $P_{x-1}(t + \Delta t)$, the probability that precisely one molecule of X has transitioned to Y a short time $\Delta t$ later? Assume that $\Delta t$ is chosen so small that the probability of two or more molecules of X decaying during the time interval is negligible compared to the probability that a single molecule makes the transition. Then to a reasonable approximation, $P_{x-1}(t + \Delta t)$ can be split into a sum of two contributions, each a conditional probability weighted by the probability of the state being conditioned upon: (i) the probability that exactly one transition occurs during the time interval from t to $t + \Delta t$, namely, that $\zeta(t + \Delta t) = x - 1$ given that ζ(t) = x, and (ii) the probability that zero transitions occur during that time interval, namely, that $\zeta(t + \Delta t) = x - 1$ given that $\zeta(t) = x - 1$. The probability that an individual molecule of X experiences a transition is approximately proportional to $\Delta t$. Letting k denote the (positive) proportionality constant,

$$P_{x-1}(t + \Delta t) \approx kx\,\Delta t\,P_x(t) + \left(1 - k(x-1)\,\Delta t\right)P_{x-1}(t).$$

This model relates the state of the system at time $t + \Delta t$ to the state at time t and is an example of a discrete-time Markov chain model. There are many textbooks in which the theory and applications of Markov chain models are developed, and the most user-friendly introductions tend to appear in elementary texts on probability theory (see, for example, Ross 2012). There is an interesting link between this stochastic Markov chain model and the deterministic DE formulation in the previous section: even though the transitions X → Y are random and independent, the “average” or expected behavior of this system is deterministic. Subtracting $P_{x-1}(t)$ from both sides of the above equation, dividing by $\Delta t$, and taking the limit $\Delta t \to 0$ leads to the (continuous-time) DE

$$\frac{dP_{x-1}}{dt} = kx\,P_x - k(x-1)\,P_{x-1}, \qquad (1 \le x \le N).$$

Through a creative transformation (McQuarrie 1967), it is possible to solve a deterministic equation to obtain the average behavior of this system. One finds that the expected number of molecules
of X at time t is given by $Ne^{-kt}$, with a variance of $Ne^{-kt}(1 - e^{-kt})$. Notice that the expected behavior of the stochastic system agrees with the solution of the (deterministic) DE described in the previous section. Extracting exact formulas for the expected value and variance from a stochastic model is a rare luxury that is generally unavailable for more sophisticated processes (such as the Michaelis-Menten enzymatic reaction). This points to one disadvantage of stochastic models: the presence of randomness typically requires Monte Carlo simulation (repeated trials) in order to estimate expected values and variances associated with a process. Monte Carlo simulations can be computationally expensive and time-consuming even for the fastest computers. By contrast, because DE models are deterministic, there is no need to perform multiple trials using the same parameters and ICs.
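The comparison between the two decay models can be tried numerically. The following is a minimal sketch, not a definitive implementation: the values of N, k, the step size, and the number of trials are hypothetical choices made only for illustration, and the stochastic trial lets every remaining molecule convert independently with probability k·dt per step (a slight variation on the "at most one transition per step" approximation used in the text).

```python
import math
import random


def discrete_decay(n0, k, dt, steps):
    """Apply x(t + dt) = x(t) - k*x(t)*dt recursively, starting from x(0) = n0."""
    x = float(n0)
    for _ in range(steps):
        x -= k * x * dt
    return x


def stochastic_decay(n0, k, dt, steps, rng):
    """One Monte Carlo trial: in each short step, every remaining molecule
    of X converts to Y independently with probability k*dt."""
    x = n0
    for _ in range(steps):
        decays = sum(1 for _ in range(x) if rng.random() < k * dt)
        x -= decays
    return x


# Hypothetical parameter values, chosen only for illustration.
N, k, dt, steps, trials = 100, 0.5, 0.01, 200, 300
t_end = dt * steps

rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers
samples = [stochastic_decay(N, k, dt, steps, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / (trials - 1)

# Closed-form results quoted in the text.
exact_mean = N * math.exp(-k * t_end)
exact_var = exact_mean * (1 - math.exp(-k * t_end))

print(f"discrete-time recursion: {discrete_decay(N, k, dt, steps):.2f}")
print(f"DE solution N*exp(-k*t): {exact_mean:.2f}")
print(f"Monte Carlo mean:        {mean:.2f}")
print(f"Monte Carlo variance:    {var:.2f} (formula: {exact_var:.2f})")
```

The deterministic recursion needs a single pass, whereas the stochastic estimate requires many repeated trials, which is the cost difference described above.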

Validation, Improvement, and Limitations of Models

Because biochemical processes are so complex and experiments are always subject to measurement error, it is unrealistic to expect a mathematical model to perfectly reproduce lab data. Models must incorporate simplifying assumptions regarding a system’s behavior or environment, for otherwise the equations would be far too complex to write down. Ideally, these assumptions should be as realistic as possible and should not pigeonhole a model into an overly narrow or idealized setting. Only the most essential variables and parameters should be included, and the presumed quantitative relationships between those variables should be firmly grounded in sound biochemistry principles. When possible, a model should exhibit predictive power in a variety of experimental conditions.

The process of refining and improving a model is often iterative. Consider, for example, the simplest model for population growth for a species of fish in an environment with no predators and an abundance of food. Assume that the percentages of fish that are either (a) about to give birth or (b) about to die remain constant over time, so that
the rate of change of population is proportional to the population itself. If P(t) denotes the population at time t, the preceding statement yields the mathematical model $dP/dt = kP$, where k is a constant representing the difference between the intrinsic per capita birth and death rates. The solution of this DE is $P(t) = P(0)e^{kt}$, and assuming that k is positive (i.e., the birth rate exceeds the death rate), the model predicts exponential population growth. While exponential population growth may be sustainable over short time scales in an idealized environment, limited resources would prevent such growth from continuing indefinitely. One way to refine the model is to assume that the environment can only sustain some maximum population of M fish. There are numerous ways to modify the previous DE model to incorporate this new assumption, the most common being the inclusion of a new factor which shuts off the population growth as P approaches M. The most typical choice appears in the equation $dP/dt = kP(1 - P/M)$, in which the factor $(1 - P/M)$ approaches zero as the population approaches the environment’s carrying capacity. Accounting for harvesting (fishing) requires yet another model refinement: If fish are harvested at a constant rate, the equation could be modified to read $dP/dt = kP(1 - P/M) - r$, where r is the harvesting rate. On the other hand, if they are harvested at a rate proportional to their population, then the modified equation might read $dP/dt = kP(1 - P/M) - rP$. These harvesting models are slightly more complex than the original DE and offer the flexibility (or curse; see below) of including three parameters r, k, M instead of just one. Introduction of predators such as sharks would increase the complexity of the system in a much more significant way, as this would require tracking of two different populations (dependent variables) whose interactions must be modeled.
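The successive refinements above can be compared side by side with a short numerical sketch. Everything specific here is an assumption made for illustration: the parameter values, the initial population, the names `r_const` and `r_prop` for the two harvesting rates, and the choice of a classical fourth-order Runge-Kutta integrator (any standard DE solver could be swapped in).

```python
import math


def rk4(f, p0, t_end, n_steps):
    """Integrate dP/dt = f(P) from t = 0 to t = t_end with the classical
    fourth-order Runge-Kutta method and return P(t_end)."""
    h = t_end / n_steps
    p = p0
    for _ in range(n_steps):
        s1 = f(p)
        s2 = f(p + 0.5 * h * s1)
        s3 = f(p + 0.5 * h * s2)
        s4 = f(p + h * s3)
        p += h * (s1 + 2 * s2 + 2 * s3 + s4) / 6
    return p


# Hypothetical values, for illustration only.
k, M, P0 = 0.5, 1000.0, 200.0
r_const, r_prop = 50.0, 0.1

models = {
    "exponential":          lambda P: k * P,
    "logistic":             lambda P: k * P * (1 - P / M),
    "constant harvest":     lambda P: k * P * (1 - P / M) - r_const,
    "proportional harvest": lambda P: k * P * (1 - P / M) - r_prop * P,
}

final = {name: rk4(f, P0, 40.0, 4000) for name, f in models.items()}
for name, value in final.items():
    print(f"{name:22s} P(40) = {value:.4g}")
```

With these particular values, the exponential model grows without bound while the three refined models level off; under constant harvesting the long-run population sits below the carrying capacity M. A sufficiently small initial population would instead collapse under constant harvesting, which is one reason the choice of ICs matters.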
Evidently this process of adapting and refining a model can be carried out ad infinitum, leaving the researcher to decide where to draw the line. The process of validating a model against experimental data is a key step in the refinement process. In the above example, suppose that population data is collected at distinct times $t_0$,
$t_1, t_2, \ldots, t_n$, and let $P_i$ denote the measured population at time $t_i$. Letting the “initial” time $t_0 = 0$, the constant harvesting model can be written as the IVP

$$\frac{dP}{dt} = kP\left(1 - \frac{P}{M}\right) - r, \qquad P(0) = P_0.$$

Determining the choices of parameter values for which the model “best” fits the data is, in general, a highly challenging problem. The most common way to do so is via least squares regression, which aims to minimize the residual error between the solution of the model equation and the actual data points (see also ▶ “Mathematics of Fitting Scientific Data”). More precisely, let $P(t_i)$ denote the population at time $t_i$ as predicted by the solution of the DE model, and recall that $P_i$ denotes the actual population as measured experimentally. In this example, the goal of least squares regression is to determine the choices of the parameters k, M, and r that minimize the sum

$$\sum_{i=1}^{n} \left[P(t_i) - P_i\right]^2.$$

Once optimal parameter choices are identified, they may be substituted into the model equation(s), which can then be solved and compared to the actual data. Validating the model against a single data set is one of the first tests of its success or failure to interpolate, extrapolate, and make predictions. While a successful fit is certainly a good sign, it is unlikely that a single parameter set would allow the model to accurately simulate a broad range of experimental conditions. If additional trials are feasible, then the model can be “tuned” against other data sets. Ultimately, the researcher must decide whether a model has been sufficiently refined so that its predictive power may guide future experimental protocol. It is a pleasant occurrence when model simulations predict novel, scientifically interesting, or important phenomena, particularly when those phenomena are subsequently confirmed through experiments. As a rule, parameter fitting via least squares regression becomes vastly more difficult as the
number of parameters in a model increases. Two ways to help streamline the process include (a) restricting the ranges of allowable parameter choices so that their values remain reasonable and meaningful and (b) sensitivity analysis, identifying which parameters the model is least sensitive to and excluding those parameters from the initial stage of the regression process.
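The fitting procedure for the constant harvesting model, including suggestion (a) of restricting parameter ranges, might be sketched as follows. This is an illustrative sketch under stated assumptions rather than a recommended algorithm: the "measurements" are synthetic, noise-free values generated from known hypothetical parameters so that the search has a known answer, the integrator is a simple Runge-Kutta scheme, and the minimization is a naive search over a small grid of allowed values (practical software typically uses more sophisticated optimizers).

```python
import itertools


def simulate(k, M, r, p0, n_obs, substeps=20):
    """Return [P(0), P(1), ..., P(n_obs)] for dP/dt = k*P*(1 - P/M) - r,
    computed with fourth-order Runge-Kutta steps of size 1/substeps."""
    def f(p):
        return k * p * (1 - p / M) - r

    h = 1.0 / substeps
    p, values = p0, [p0]
    for _ in range(n_obs):
        for _ in range(substeps):
            s1 = f(p)
            s2 = f(p + 0.5 * h * s1)
            s3 = f(p + 0.5 * h * s2)
            s4 = f(p + h * s3)
            p += h * (s1 + 2 * s2 + 2 * s3 + s4) / 6
        values.append(p)
    return values


# Synthetic, noise-free "data" from hypothetical parameters (k, M, r).
P0, N_OBS = 500.0, 10
TRUE_PARAMS = (0.5, 1000.0, 30.0)
data = simulate(*TRUE_PARAMS, P0, N_OBS)


def rss(k, M, r):
    """Sum over i = 1..n of [P(t_i) - P_i]^2; the value at t_0 is fixed by the IC."""
    model = simulate(k, M, r, P0, N_OBS)
    return sum((pm - pd) ** 2 for pm, pd in zip(model[1:], data[1:]))


# Restricted, physically reasonable ranges for each parameter.
k_values = [0.3, 0.4, 0.5, 0.6, 0.7]
M_values = [800.0, 900.0, 1000.0, 1100.0, 1200.0]
r_values = [10.0, 20.0, 30.0, 40.0, 50.0]

best = min(itertools.product(k_values, M_values, r_values),
           key=lambda params: rss(*params))
print("best-fit (k, M, r):", best, "with RSS =", rss(*best))
```

Because the data were generated without noise from a parameter set that lies on the grid, the search recovers that set exactly; with real measurements the minimum RSS would be positive and the best grid point only approximate.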

Further Reading

The above survey of types of mathematical models, their construction and validation, is far from comprehensive. Differential equation and Markov chain models are certainly widespread in the molecular biosciences, but so are network flow models. Texts which introduce different types of models in biology and biochemistry include Beard and Qian (2008), Fall et al. (2002), Keener and Sneyd (2009), Murray (2002 and 2003), and Waterman (1995). Additionally, the manuscript of McQuarrie (1967) offers one of the earliest surveys of stochastic models of chemical kinetic processes.

Cross-References

▶ Mathematics of Fitting Scientific Data

References

Beard DA, Qian H (2008) Chemical biophysics: quantitative analysis of cellular systems. Cambridge University Press, Cambridge
Fall CP, Marland ES, Wagner JM, Tyson JJ (2002) Computational cell biology. Springer, New York
Keener JP, Sneyd J (2009) Mathematical physiology, 2nd edn, vols 1 & 2. Springer, New York
McQuarrie DA (1967) Stochastic approach to chemical kinetics. J Appl Prob 4:413–478
Murray JD (2002/2003) Mathematical biology, 3rd edn, vols 1 & 2. Springer, Berlin
Ross SM (2012) A first course in probability, 9th edn. Pearson Prentice Hall, Upper Saddle River
Waterman MS (1995) Introduction to computational biology: maps, sequences, and genomes. Chapman & Hall/CRC, London

Mathematics of Fitting Scientific Data

John Wesley Cain
Department of Mathematics and Computer Science, University of Richmond, Richmond, VA, USA

Synopsis

The ability to make predictions based upon scientific data is fundamentally important. Interpolation and extrapolation of data allow researchers to predict how a system will behave and sometimes elucidate the mechanisms responsible for observed behaviors. The usual way of fitting scientific data is via least squares regression, a systematic process for identifying curves that “best” fit a data set. This essay explains the process of least squares regression for fitting several types of curves (linear, power, exponential) to data sets. Also included are general guidelines for selecting which type of function to use, as well as a list of key issues to be aware of when fitting data.

Introduction

Processing and interpreting experimental data is one of the most vitally important steps in the process of scientific research. Once data is collected, researchers are typically concerned with questions such as:

1. Does the data look “believable” in the sense that there are no obvious signs of faulty instrument calibration or poor experimental design/protocol? For example, if temperature recordings are measured from a sample that is moved from an oven into a freezer and the data indicates that the temperature increases, then clearly something is wrong.

2. How might one quantify relationships between variables involved in a data set?

3. Once a relationship between variables is quantified, how can that information be used to
make predictions beyond what may be experimentally measurable (interpolation and extrapolation)? In what follows, assume that the first question is answered in the affirmative, as there is no point in trying to draw conclusions from a “bad” data set. The statistical procedure known as regression analysis specifically addresses the other questions. This article covers the basics of regression analysis for data sets involving one independent variable (usually time, t) and one dependent variable. Although there are well-established techniques for handling multiple independent and dependent variables, the relevant mathematics is more technical. One goal of a regression analysis is to generate the regression function, the graph of which is a smooth curve that is designed to stay as close as possible to the data points in a “holistic” sense to be made precise below. Regression functions allow one to interpolate and extrapolate data by choosing some test value of the independent variable, substituting it into the formula for the regression function, and using the result to estimate the corresponding value of the dependent variable.

Least Squares Regression and Three Special Cases

Consider, for example, an experiment involving a simple decay process A → ⋆ in which some chemical species A decays to an inert product, and suppose that the concentration of A is measured at various times. To establish notation, let $t_1, t_2, \ldots, t_n$ denote the times at which the measurements were recorded, and let $c_i$ denote the concentration of A recorded at time $t_i$. The data can be rendered graphically by plotting all of the ordered pairs $(t_i, c_i)$; here the ordering suggestively indicates that concentration is regarded as the dependent variable and time as the independent variable. The purpose of least squares regression can be stated as follows: given some class of mathematical function (e.g., linear, exponential, sinusoidal, etc.) which is believed to mimic the data set, identify the specific function within that class that “best” fits the data points in a sense to be made precise in the next paragraph. For the decay process described above, it is natural to try a simple exponential function, one of the form $f(t) = be^{mt}$ where b and m are parameters. In that case, least squares regression should produce the specific values of b and m for which the exponential function fits the data set “optimally.”

Mathematics of Fitting Scientific Data, Fig. 1 Schematic illustration of a regression function f(x) fit to data points (solid squares). The residual error is shown for one of the data points $(x_i, y_i)$

The general process of least squares regression for data involving a single independent variable x and a single dependent variable y is as follows. Given data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and a function f(x) defined for all possible choices of x, define the residual sum of squares (RSS) as the quantity

$$\mathrm{RSS} = \sum_{i=1}^{n} \left[y_i - f(x_i)\right]^2.$$

The goal of least squares regression is to identify the specific choice of f(x) (from within some class of mathematical functions) for which the RSS is minimized. It is not terribly difficult to understand the rationale behind minimizing the above sum, particularly with some graphical intuition (Fig. 1). The quantity $y_i - f(x_i)$ is called a residual error and measures the vertical separation between the data point $(x_i, y_i)$ and the point $(x_i, f(x_i))$ on the graph of f(x) for the same choice of the independent variable (i.e., $x = x_i$). Squaring a residual always returns a nonnegative quantity regardless of
whether $f(x_i)$ is an underestimate or overestimate of $y_i$, and $[y_i - f(x_i)]^2$ increases with the amount of separation between $y_i$ and $f(x_i)$. The main purpose of squaring the residuals (as opposed to not doing so) is to prevent “cancellations” of errors when f(x) underestimates some data points but overestimates others; i.e., the sum should impose a penalty whenever f(x) deviates from the data, no matter how that deviation occurs. More concretely, consider a data set with just two points. If one draws a line which overestimates one point by 1,000 and underestimates the other point by 1,000, it would be inappropriate to declare the fit perfect on the basis that the errors cancel out. The RSS offers a holistic indicator of the goodness of fit between f(x) and the actual data, with larger RSS signifying larger error. Minimizing the RSS ensures that, at least in some aggregate sense, the graph of f(x) remains “close” to the data for all values of the independent variable x at which measurements were provided.

There are three special classes of regression functions f(x) that are worth singling out: linear, exponential, and power. Those functions model a wide variety of natural phenomena and have the rare advantage of precise formulas for minimizing the RSS error.

Linear Regression

Lines of the form f(x) = mx + b are the simplest class of functions commonly used to fit experimental data. By varying the parameters m (the slope) and b (the intercept), it is possible to represent every possible line of finite slope, and the purpose of linear least squares regression is to find the choices of m and b that minimize the RSS for a given data set $(x_i, y_i)$, i = 1, 2, ..., n. It happens that there is an exact formula for these optimal choices of m and b: let $\bar{x}$ and $\bar{y}$ denote the mean (average) values of the $x_i$ and $y_i$, respectively, for a given data set. Then the regression line for the data set is achieved by choosing

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$b = \bar{y} - m\bar{x}.$$

These formulas are obtained by differentiating the RSS sum with respect to the parameters m and b (separately), setting the two results equal to 0, and solving the resulting system of two equations for m and b. For readers familiar with statistical parlance, notice that the slope m of the regression line is the ratio of the covariance of the data sets $x_i$ and $y_i$ to the variance of the data set $x_i$. An example illustrating the computation of a regression line for a sample data set is provided in the next section.

Exponential Regression

An exponential function is one of the form $f(x) = be^{mx}$, where b and m are parameters, and can simulate growth or decay according to whether m is positive or negative. (Note: If the “natural” base e = 2.71828... feels unnatural to the reader, converting an exponential function to another base such as 2 or 10 is not difficult and requires only minor modification of the following material.) Henceforth, assume that the parameter b and the measurements $y_i$ are positive, a reasonable assumption in biochemistry contexts in which the dependent variable is a (nonnegative) chemical concentration. This assumption is not needed for general exponential regression but allows avoidance of technical issues when taking logarithms in what follows. The problem of determining the best-fit exponential function to a data set $(x_i, y_i)$, i.e., finding the optimal b and m, can be reduced to a problem of linear regression. By taking the natural logarithm of $f(x) = be^{mx}$ and defining $\beta = \ln b$, one obtains

$$\ln f(x) = mx + \beta,$$

a form that looks familiar from the above discussion of linear regression. Consequently, if one defines $Y_i = \ln y_i$ and forms a new data set $(x_i, Y_i)$, the problem of fitting an exponential function $be^{mx}$ to the original data set is reduced to the problem of fitting a line $mx + \beta$ to the new data set. The optimal choices of m and β are given by the formulas from the linear regression case above, after which one may compute $b = e^{\beta}$.
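Returning briefly to the linear case, the differentiation step described there can be written out explicitly. The following worked derivation is a sketch that uses only the definitions already given (the RSS, the line f(x) = mx + b, and the means $\bar{x}$ and $\bar{y}$); it is one standard route to the formulas and is not necessarily the presentation the author had in mind.

```latex
\begin{align*}
\mathrm{RSS}(m,b) &= \sum_{i=1}^{n} \left[ y_i - (m x_i + b) \right]^2 \\[4pt]
\frac{\partial\,\mathrm{RSS}}{\partial b}
  &= -2 \sum_{i=1}^{n} \left[ y_i - m x_i - b \right] = 0
  \quad\Longrightarrow\quad b = \bar{y} - m\bar{x} \\[4pt]
\frac{\partial\,\mathrm{RSS}}{\partial m}
  &= -2 \sum_{i=1}^{n} x_i \left[ y_i - m x_i - b \right] = 0 \\[4pt]
\text{Substituting } b = \bar{y} - m\bar{x}: \quad
  & \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - m (x_i - \bar{x}) \right] = 0 \\[4pt]
\text{Since } \sum_{i=1}^{n} (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) = 0: \quad
  & \sum_{i=1}^{n} (x_i - \bar{x}) \left[ (y_i - \bar{y}) - m (x_i - \bar{x}) \right] = 0 \\[4pt]
\Longrightarrow\quad
  m &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The same two formulas then apply unchanged to the transformed data $(x_i, \ln y_i)$ used in exponential regression.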


Mathematics of Fitting Scientific Data, Table 1 Data showing 11 measurements of chemical concentration (mmol/L) during a decay process over a 5-min span. The last row shows the natural logarithm of the concentrations in the middle row, for use in exponential regression

Time (min)      0.0     0.5     1.0     1.5     2.0     2.5     3.0     3.5     4.0     4.5     5.0
Conc (mmol/L)   7.72    5.39    4.98    3.80    2.69    2.41    2.06    1.27    1.29    0.812   0.603
ln(conc)        2.04    1.68    1.61    1.34    0.990   0.879   0.721   0.239   0.252   −0.208  −0.505

Mathematics of Fitting Scientific Data, Fig. 2 Left: Semilog plot (Row 3 vs. Row 1 of Table 1) along with the least squares regression line; vertical axis ln(conc), horizontal axis time (min). Right: Exponential regression fit to concentration-versus-time data (Row 2 vs. Row 1 from Table 1); vertical axis conc (mmol/L), horizontal axis time (min)

Incidentally, a plot of the points $(x_i, Y_i) = (x_i, \ln y_i)$ is called a semilog plot of the original data set. Table 1 shows sample data of concentration (Row 2 in the table) over time (Row 1) for a simple chemical decay process. The natural logarithms of concentrations appear in Row 3 and are included for the purpose of fitting a regression line to the semilog plot. The left panel of Fig. 2, a semilog plot, includes a graph of the best-fit regression line (slope m = −0.488, intercept β = 2.04) as computed by the formulas appearing in the linear regression section above. The right panel of the figure shows the corresponding exponential regression function $f(x) = be^{mx}$, where $b = e^{\beta} = 7.69$. Using the regression function $f(x) = be^{mx} = 7.69e^{-0.488x}$ to interpolate and extrapolate the data in Table 1 is straightforward. To estimate the concentration at time t = 1.25 min (interpolation), simply calculate f(1.25) ≈ 4.18 mmol/L. Similarly, one may extrapolate the concentration at time t = 8.0 min by computing f(8.0) ≈ 0.155 mmol/L.
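The computation just described can be reproduced with a few lines of code. The sketch below applies the linear regression formulas to the semilog data from Table 1; the function names `linear_fit` and `f` are illustrative choices, and the logarithms are recomputed from the rounded concentrations in the table, so the last digit of each result may differ slightly from the values quoted above.

```python
import math

# Table 1: time (min) and concentration (mmol/L).
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
conc = [7.72, 5.39, 4.98, 3.80, 2.69, 2.41, 2.06, 1.27, 1.29, 0.812, 0.603]


def linear_fit(xs, ys):
    """Return (m, b) minimizing the RSS for the line f(x) = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Fit a line to the semilog data (x_i, ln y_i), then undo the logarithm.
m, beta = linear_fit(times, [math.log(c) for c in conc])
b = math.exp(beta)


def f(x):
    """Exponential regression function f(x) = b * exp(m * x)."""
    return b * math.exp(m * x)


print(f"slope m = {m:.3f}, intercept beta = {beta:.3f}, b = {b:.2f}")
print(f"interpolated f(1.25) = {f(1.25):.2f} mmol/L")
print(f"extrapolated f(8.0)  = {f(8.0):.3f} mmol/L")
```

Note that this procedure minimizes the RSS of the transformed data $(x_i, \ln y_i)$, which is the reduction described in the text, rather than the RSS of the original concentrations.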

Power Regression

A power function is one of the form $f(x) = bx^m$ where b and m are parameters. The important distinction between power and exponential functions is that, in the former, the base is variable and the exponent is constant, whereas in the latter, the base is constant and the exponent is variable. Much like the process of exponential regression, determining the best-fit power function to a data set can be distilled to a problem involving linear regression. To avoid similar technicalities to those mentioned previously, suppose that the data set $(x_i, y_i)$ is such that both coordinates are positive. Taking logarithms of both sides of the relationship $f(x) = bx^m$ and defining $\beta = \ln b$, one obtains

$$\ln f(x) = m \ln x + \beta,$$

implying that ln f(x) depends linearly on ln x. Hence, if one transforms the original data set $(x_i, y_i)$ by defining new variables $X_i = \ln x_i$ and $Y_i = \ln y_i$, then the problem of finding the best-fit power function $bx^m$ to the original data is reduced
to the linear regression problem for the points $(X_i, Y_i)$. The optimal slope m and intercept β = ln b of the regression line are then provided by the above formulas in the linear regression section, after which one may compute $b = e^{\beta}$. A plot of the points $(X_i, Y_i) = (\ln x_i, \ln y_i)$ is called a log-log plot of the original data set. As an example of when power regression might be appropriate, consider a simple, nonreversible kinetic process 2A → B in which two molecules of species A combine to form species B. Then the law of mass action would suggest that the rate of accumulation of B (say, in moles per unit time) depends quadratically on the number of moles of A. Hence, given data points $(x_i, y_i')$ where $x_i$ and $y_i'$ denote the concentration of A and rate of accumulation of B (respectively) at the ith measurement, one may fit a power function $bx^m$ by fitting a regression line to the log-log plot of the data set. If the law of mass action applies, one would expect the regression process to produce an optimal exponent of m ≈ 2 in this example, while the optimal choice of b represents an approximation of the kinetic constant for the process.
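The log-log reduction can be sketched in code for the 2A → B example. The data below are hypothetical and noise-free, generated from an assumed rate law (rate = 0.3 × concentration²) purely so that the expected answer is known; the function name `power_fit` is likewise an illustrative choice.

```python
import math


def power_fit(xs, ys):
    """Fit f(x) = b * x**m by linear regression on the log-log data
    (ln x_i, ln y_i); all xs and ys must be positive. Returns (b, m)."""
    X = [math.log(x) for x in xs]
    Y = [math.log(y) for y in ys]
    n = len(X)
    X_bar = sum(X) / n
    Y_bar = sum(Y) / n
    m = (sum((u - X_bar) * (v - Y_bar) for u, v in zip(X, Y))
         / sum((u - X_bar) ** 2 for u in X))
    beta = Y_bar - m * X_bar  # intercept of the log-log regression line
    return math.exp(beta), m


# Hypothetical concentrations of A and noise-free rates of accumulation of B.
conc_A = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
rate_B = [0.3 * a ** 2 for a in conc_A]

b, m = power_fit(conc_A, rate_B)
print(f"fitted exponent m = {m:.4f}, fitted constant b = {b:.4f}")
```

With measured (noisy) data the fitted exponent would only be approximately 2, and a value far from 2 would be a hint that the assumed mechanism does not describe the system.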

Choosing Functions and Parameters

The above are the most common types of functions fit to experimental data sets, but there are many other possibilities. The general guiding principle when selecting which class of functions to use is this: make reasonable choices that can be justified based upon scientific principles and/or mathematical models of the presumed dynamics of the system (see ▶ “Mathematical Models in the Sciences”). For a simple decay process, an exponential function $f(x) = be^{mx}$ is probably appropriate. For a saturation process, a sigmoidal function (S-curve; see example below) may suffice, or alternatively, a function of the form $f(x) = a - be^{-mx}$, where a, b, and m are positive constants. Fitting data involving biochemical oscillators might involve sinusoidal functions; however, such fits are far less common in the literature, largely because the underlying dynamics of oscillatory reactions are too complex to be
described using simple combinations of (scaled) sine and cosine functions. Here are a few other remarks to bear in mind: 1. Optimization can be computationally difficult: The luxury of precise mathematical formulas for optimal parameter choices (as in the linear, power, and exponential cases above) is a rarity. No such formulas exist for general classes of regression functions, in which case performing least squares fits may require sophisticated computer algorithms. 2. Computer software: Of course, statistical and mathematical exploration of a data set is facilitated through use of a computer. There are several major statistical software packages available for performing least squares regression (among other things), some of which are available cost-free. Many of them allow the user to select from a variety of function classes, including the ones described above. 3. Number of parameters: Increasing the number of parameters (degrees of freedom) in the class of functions being used to fit data vastly complicates the computation of a regression curve (even for a computer). For example, suppose that one wishes to fit a sigmoidal function (S-curve) of the form f ðxÞ ¼ a þ

b 1 þ expððx  gÞ=dÞ

to a data set describing a saturation process. The four parameters in this class of functions offer considerable flexibility: a shifts the graph of f (x) vertically, b stretches the graph vertically, g shifts the graph horizontally, and d stretches the graph horizontally. If a computer were to (naively) test a paltry 10 different values of each of those four parameters and calculate the RSS for each parameter set, there would be 104 = 10,000 parameter sets to check. By contrast, the linear, exponential, and power regression functions described above each include only two parameters, so that a similar “exhaustive” search of parameter sets requires testing only 102 = 100 parameter


pairs. For complicated models involving dozens of parameters, even the fastest computers could never complete an exhaustive check of every possible parameter set.
4. Reasonable parameter choices: Software that performs general least squares regression is often automated with mathematical algorithms for seeking parameters that will minimize the RSS. Whenever possible, it is helpful to specify acceptable ranges that each parameter is allowed to take, thereby preventing the computer from searching highly unrealistic parameter choices. Visual inspection of Rows 1 and 3 in Table 1 suggests that the slope of the regression line is negative but certainly not less than −1, implying that a restriction −1 ≤ m ≤ 0 could be appropriate. Of course, this is merely an illustration – for linear least squares regression, there would be no need to restrict the allowable ranges of m and b, because there are precise formulas for the optimal choices of those parameters. Generally, the more parameters there are, the more important it is to supply reasonable ranges for each parameter, and the narrower the range the better. Some software actually prompts the user to input a reasonable initial guess for the best-fit parameter values.
5. Size of the data set: Let N denote the total number of distinct values of the independent variable within a data set. Then, without getting into technical details, the total number of parameters in the class of functions being used to fit the data should not exceed N. For example, fitting a line f(x) = mx + b (two parameters, m and b) to a single data point would not make sense.
6. Visually inspect the goodness of fit: When a computer determines the best-fit parameter set for a regression curve, be sure to graph the curve on the same set of axes as the data itself. On some level, data fitting is a subjective pursuit in which the scientist must check that the function truly captures the desired trends within the data.
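The naive exhaustive search mentioned in remarks 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic and noise-free, the parameter ranges and grid spacings are invented for the example, and the sign convention inside the exponential follows the four-parameter S-curve as written in remark 3 (so a negative d gives a rising curve). The helper names `sigmoid`, `rss`, and `grid_search` are hypothetical, not taken from any statistics package.

```python
import itertools
import math


def sigmoid(x, a, b, g, d):
    """Four-parameter S-curve: a + b / (1 + exp((x - g) / d))."""
    return a + b / (1.0 + math.exp((x - g) / d))


def rss(params, xs, ys):
    """Residual sum of squares for one candidate parameter set."""
    return sum((y - sigmoid(x, *params)) ** 2 for x, y in zip(xs, ys))


def grid_search(xs, ys, grids):
    """Compute the RSS for every combination of grid values.

    Returns the best parameter set, its RSS, and the number of sets tested.
    """
    best, best_rss, tested = None, float("inf"), 0
    for params in itertools.product(*grids):
        tested += 1
        value = rss(params, xs, ys)
        if value < best_rss:
            best, best_rss = params, value
    return best, best_rss, tested


# Synthetic, noise-free data from known parameters (illustration only).
true_params = (1.0, 4.0, 5.0, -1.0)
xs = [float(i) for i in range(11)]
ys = [sigmoid(x, *true_params) for x in xs]

# Ten candidate values per parameter, each restricted to an assumed plausible range.
grids = [
    [0.0 + 0.25 * k for k in range(10)],  # a in [0, 2.25]
    [2.0 + 0.5 * k for k in range(10)],   # b in [2, 6.5]
    [3.0 + 0.5 * k for k in range(10)],   # g in [3, 7.5]
    [-2.0 + 0.2 * k for k in range(10)],  # d in [-2.0, -0.2]
]

best, best_rss, tested = grid_search(xs, ys, grids)
print(tested, best, best_rss)
```

With four parameters and ten values each, the loop evaluates the RSS 10,000 times. Adding a fifth parameter at the same resolution would multiply the work by ten, which is why practical software relies on guided optimization from a sensible starting guess rather than an exhaustive grid.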
Here is an extreme example of what might go wrong if data fitting is performed carelessly: Given a set of data points (x_i, y_i), where i = 1, 2, ..., n, for which the independent variable values x_i are all distinct, it is straightforward to construct a polynomial of degree at most (n − 1) which passes through all of the data points. However, the fact that such polynomials are called interpolating polynomials is something of a misnomer – they are often useless for the purpose of interpolation, oscillating wildly between consecutive data pairs and failing to capture overall trends within the data. Graphing data and regression curves can also aid in spotting outliers, isolated data points that are likely due to significant measurement error or, more rarely, scientific anomaly. Removing outliers from a data set can improve the goodness of fit but should always be accompanied with a disclosure that the fit was achieved by neglecting some of the data points.
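The oscillation of interpolating polynomials can be seen in a small numerical sketch. The test curve 1/(1 + 25x²) and the choice of 11 equally spaced points are assumptions made purely for illustration (this is a standard textbook example, not data from this entry), and `lagrange_eval` is a hypothetical helper that evaluates the unique polynomial of degree at most n − 1 through n points using the Lagrange form.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate at x the interpolating polynomial through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def curve(x):
    """Smooth bell-shaped curve used to generate the illustrative data."""
    return 1.0 / (1.0 + 25.0 * x * x)


# Eleven equally spaced data points on [-1, 1], sampled without noise.
n = 11
xs = [-1.0 + 2.0 * k / (n - 1) for k in range(n)]
ys = [curve(x) for x in xs]

# The degree-10 polynomial reproduces every data point ...
max_node_error = max(abs(lagrange_eval(xs, ys, x) - y) for x, y in zip(xs, ys))

# ... but strays far from the curve halfway between the first two points.
x_mid = 0.5 * (xs[0] + xs[1])
gap_error = abs(lagrange_eval(xs, ys, x_mid) - curve(x_mid))

print(max_node_error, gap_error)
```

Here the polynomial matches all 11 points, yet midway between the two leftmost points it misses the underlying curve by more than 1, even though the curve itself never exceeds 1. A plot of the fitted function against the data exposes this kind of failure immediately.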

Further Reading This overview of scientific data fitting is far from comprehensive and makes no attempt to describe the various technicalities and assumptions underlying regression analysis. For a deeper introduction to the relevant mathematics and statistics, refer to the texts of Chatterjee and Hadi (2012), Kutner et al. (2004), or Montgomery et al. (2006).

Cross-References ▶ Mathematical Models in the Sciences

References
Chatterjee S, Hadi AS (2012) Regression analysis by example, 5th edn. Wiley, Hoboken
Kutner M, Nachtsheim C, Neter J (2004) Applied linear regression models, 4th edn. McGraw-Hill/Irwin, Chicago
Montgomery DC, Peck EA, Vining GG (2006) Introduction to linear regression analysis, 4th edn. Wiley-Interscience, Hoboken


Megaplasmid ▶ Plasmids as Secondary Chromosomes

Meiotic Recombination Galina Petukhova1 and Hannah Klein2 1 Department of Biochemistry and Molecular Biology, Uniformed Services University of the Health Sciences, Bethesda, MD, USA 2 Department of Biochemistry, New York University School of Medicine, New York, NY, USA

Synopsis Meiotic recombination is an essential feature of the meiotic chromosome divisions and is required for the proper pairing and segregation of chromosomes at meiosis I. To achieve this, DNA double-strand breaks (DSBs) are initiated by the conserved Spo11 protein at multiple sites throughout the genome. A fraction of these DSBs are processed into recombination intermediates that are destined to become crossovers (CO), while most are repaired through a noncrossover pathway called synthesis-dependent strand annealing (SDSA). The crossover events are regulated such that at least one crossover occurs per chromosome arm. The crossovers physically link the paired homologous chromosomes together and thus ensure that they are correctly aligned and attached to the spindle apparatus of meiosis I that will segregate the homologous chromosomes. Failure to execute the recombination crossovers results in chromosome missegregation and aneuploid meiotic products or gametes.

Introduction Meiotic recombination is a carefully controlled DNA breakage and repair process that is needed to ensure proper segregation of homologous


chromosomes at the first meiotic division. In addition, meiotic recombination provides variation in chromosome haplotypes that are passed on to the next generation. This entry reviews the major features of meiotic recombination and notes how it differs from mitotic recombination, from its initiation process to the enzymes involved and the predominant types of recombination events that take place.

Initiation of Meiotic Recombination by the Spo11 Protein Mitotic recombination occurs in response to DNA damage induced by external or endogenous factors such as chemical mutagenesis, ionizing radiation, or oxidative damage and is usually manifest as blocks to DNA replication. Indeed, often the process of repair of a replication fork stalled at a site of DNA damage results in processing of the damaged area into a DSB, which then initiates a process of homologous recombination. Additionally, some mitotic DSBs may be repaired by the end-joining pathways, depending on the context and cell cycle phase of the damage. There are also programmed DSBs during mitosis, such as those that initiate mating-type switching in yeast or V(D)J recombination in the immune system cells. However, high levels of mitotic recombination are rare, and as previously discussed, most DSBs are repaired either using the sister chromatid as a template or, if the homologous chromosome is used, via the SDSA pathway, which does not result in crossover products. In contrast, meiotic recombination is high, is initiated in a defined way, and results in a significant number of crossovers. Meiotic DSBs are catalyzed by the conserved protein Spo11, which has a mode of action similar in part to that of DNA topoisomerases (Keeney and Neale 2006). Spo11 catalyzes two nucleophilic scissions, one on each strand of a duplex DNA, through the active site tyrosine, resulting in DSBs that have Spo11 covalently bound at each 5′ side of the break (Fig. 1). Although Spo11 is conserved across species, the mammalian protein has two major isoforms. While SPO11β alone induces


Meiotic Recombination, Fig. 1 Outline of the meiotic recombination pathways for crossover and noncrossover outcomes, with some of the indicated proteins

an apparently normal number of DSBs, SPO11α has been implicated in the introduction of late-forming DSBs in the homologous regions of the X and Y chromosomes (the pseudoautosomal region) and is specifically required for male meiosis (Kauppi et al. 2011). The number of DSBs introduced by SPO11 is strictly controlled, likely through negative feedback provided by the ATM protein (Lange et al. 2011).

Recombination Hotspots Distribution of recombination across the genome is not random, and the control of recombination rates occurs at different layers of chromatin

organization (Arnheim et al. 2007; Buard and de Massy 2007; Lichten 2008; May et al. 2008; Paigen and Petkov 2010). At the sub-chromosomal level, recombination is suppressed around centromeres and telomeres in yeast, although subtelomeric regions in human males show high recombination activity. At a regional scale, large chromosomal domains (tens of Kb in yeast and several Mb in mammals) with high or low recombination rates can be recognized. Individual recombination events predominantly cluster in highly discrete regions called recombination hotspots. In budding yeast the majority of hotspots are located in nucleosome-free gene promoter regions with no specific recognition sequence (Pan et al. 2011). In mammals the position of hotspots largely


depends on the DNA binding specificity of the PRDM9 protein (Baudat et al. 2010; Myers et al. 2010; Parvanov et al. 2010), a meiosis-specific methyltransferase that trimethylates lysine 4 of histone H3 at the sites of future DSBs. PRDM9 is a highly polymorphic protein with different alleles predicted to recognize distinct DNA sequences. Indeed, mouse strains with different Prdm9 alleles have virtually no overlap in the hotspot locations (Brick et al. 2012). In the Prdm9 knockout mice, the majority of meiotic DSBs are formed at promoters and other sites of PRDM9-independent H3K4 trimethylation. Such sites are rarely targeted in wild-type mice, suggesting a role of the PRDM9 protein in sequestering recombination machinery away from functional genomic elements (Brick et al. 2012). The number of recombination hotspots on a genome-wide scale (~3,600 in yeast and up to 18,000 in mice) far exceeds the estimated number of DSBs per genome per meiosis, indicating that not all potential hotspots are used in every cell in every meiosis.

Types of Meiotic Recombination Once Spo11 catalyzes DSB formation, the break is acted upon by a complex of proteins to engage the DSB in a search for homology and a homologous recombination reaction. This complex first processes the DSB end that has Spo11 covalently bound into an end that can be used for homologous recombination by removing the bound Spo11 protein. Spo11 is not directly removed. Rather, endonucleolytic processing of the DSB end removes Spo11 and some attached nucleotides, up to 40 nucleotides, to form a DSB end with a short 3′ overhang. Subsequent nucleolytic processing of the free 5′ end extends the 3′ single-strand tail to form a recombination substrate for the recombinases Rad51 and its meiotic counterpart Dmc1 (Fig. 1). Studies have suggested that there is an asymmetric binding of the recombinases to DSBs, with one end forming a Rad51 nucleoprotein filament while the other forms a Dmc1 nucleoprotein filament. The


dynamics and regulation of this process are under active investigation. The recombinases bind to the 3′ single-strand DNA tail and engage in homologous pairing and strand invasion with the homologous chromosome partner to form a joint molecule that has a D-loop and is called a single-end invasion (SEI) (Fig. 1). This molecular intermediate can be detected on DNA 2-D gels due to its unique shape and electrophoretic properties (Schwacha and Kleckner 1995). As the SEI intermediate is extended, the second DSB end is captured by strand annealing, forming the intermediate called the double Holliday junction (dHJ) (Fig. 1). This structure too has a unique shape and electrophoretic mobility and can be detected on DNA 2-D gels. Subsequent resolution of the dHJ molecule results in crossover products. Although it is theoretically possible to resolve the dHJ intermediate into crossover and noncrossover products, current studies suggest a bias toward the crossover products. The enzymes that resolve the dHJ molecules are not completely defined yet, but studies point toward the nucleases Mus81, Yen1/Gen1, and Exo1, together with Mlh1-Mlh3, as critical components (Zakharyevich et al. 2010). As discussed above, many Spo11-induced DSBs are not used for crossover recombination. Although the ratio of crossovers to noncrossovers varies between organisms, it is on the order of 1:1 to 1:10. Since unrepaired DSBs are lethal, clearly most meiotic DSBs must be repaired in a noncrossover mode. It now seems that the regulation of this is not at the level of dHJ resolution. Rather, most meiotic DSBs are repaired by a mechanism that is similar to, if not the same as, that used in mitosis, namely, the SDSA pathway. In this case, after strand invasion, the D-loop intermediate is transient but is extended by unwinding and synthesis until the newly replicated single-strand DNA can be dissociated and paired with the single-stranded tail from the other side of the DSB.
This ultimately results in DSB repair that is noncrossover and, when the homologous chromosome is used as a template for strand invasion and synthesis, can lead to the nonreciprocal recombination event called gene conversion (Fig. 1).


Meiotic Recombination Proteins Meiotic recombination utilizes the basic DSB repair recombination machinery. However, it also uses proteins that are either specific to meiosis or to the formation of crossover recombination. After Spo11 induces DSBs, it is removed by the action of Sae2/CtIP with the Mre11/Rad50/Xrs2 (MRX) complex in yeast (Mre11/Rad50/Nbs1 (MRN) in mammalian cells), and the break is then further processed by Exo1, Sgs1, Dna2, and possibly additional nucleases. Yeast MRX is also required for DSB formation per se, although this function of the complex is not evolutionarily conserved. Additional factors involved in Spo11-dependent DSB formation in yeast include the meiosis-specific proteins Mei4, Mer2, Rec102, Rec104, and Rec114 along with Ski8. Poor structural conservation has made it difficult to identify the orthologues of these proteins in mammals, where only three proteins essential for Spo11-mediated DSB formation have been identified so far: orthologues of yeast Mei4 and Rec114 and a novel protein MEI1 (Handel and Schimenti 2010; Kumar et al. 2010). The 3′ single-strand DNA tails generated by end resection are coated with Rad51 or Dmc1 to form the nucleoprotein filament that engages in the search for DNA homology and strand invasion into a homologous duplex DNA molecule. Additional meiosis-specific factors that promote the search for homology include Sae3, Mei5, Rec8, Hop2, and Mnd1. Although the meiotic role of the mammalian orthologues of Sae3 and Mei5 (SWI5 and SFR1 (MEI5), respectively) has not been explored, these factors are not meiosis specific and both proteins are involved in homologous recombination in somatic cells (Akamatsu and Jasin 2010; Yuan and Chen 2011). The crossover pathway in yeast involves Zip1, Zip2, Zip3, Mer3, Msh4, and Msh5.
Zip1, Zip2, and Zip3 are components of the synaptonemal complex, a proteinaceous structure that binds along the length of paired homologous chromosomes in meiotic prophase to promote pairing and crossover recombination. In their absence, the noncrossover pathway can proceed but crossovers


are eliminated, with subsequent problems for proper chromosome segregation at meiosis I. Although the synaptonemal complex is evolutionarily conserved, Zip1 and Zip2 proteins do not have structural homologues in mammals, at least at the amino acid level. Mammalian SYCP2 and SYCP3 proteins assemble on the chromosome axes to form axial/lateral elements that also contain meiosis-specific cohesin subunits SMC1β, STAG3, REC8, and RAD21L (Handel and Schimenti 2010; Jessberger 2011). As homologous chromosomes find each other, axial elements are zipped together by the central element of the synaptonemal complex consisting of SYCP1, TEX12, SYCE2, and SYCE3 proteins (Handel and Schimenti 2010). The HORMAD1 protein (structurally related to yeast Hop1) associates with asynapsed and desynapsed chromosomal axes but is specifically displaced by the assembly of the central element of the synaptonemal complex (Fukuda et al. 2010; Wojtasz et al. 2009). Furthermore, DSB formation appears to be reduced in Hormad1−/− mice (Daniel et al. 2011; Shin et al. 2010). Although all central element proteins are essential for CO formation, COs do form in Smc1β−/−, Sycp3−/−, and Hormad1−/− females (Handel and Schimenti 2010; Daniel et al. 2011). Deficiency of MSH4 or MSH5 proteins in mice leads to meiotic arrest with unrepaired DSBs and unsynapsed chromosomes, indicating that these factors play additional roles besides promoting CO formation (Baudat and de Massy 2007). The mouse RNF212 and HEI10 proteins have opposing roles in stabilizing recombination intermediates originally bound by MSH4/MSH5 complexes, ensuring that only a few of them proceed to the formation of CO-designated sites (Qiao et al. 2014). Allelic variants of both RNF212 and HEI10 contribute to the variation in recombination rates among humans (Kong et al. 2014).
The factors involved in dHJ resolution may include Mus81 and Mms4 (EME1 in mammals), Yen1/GEN1, and Slx1-Slx4/SLX1-SLX4 as structure-specific nucleases that recognize Holliday junctions and other branched DNA structures. The redundancy of these nucleases has made it difficult to assign roles in vivo, and some of the structure-specific nucleases have critical mitotic roles in resolving stalled replication forks (Schwartz and Heyer 2011; Table 1).

Meiotic Recombination, Table 1 Meiosis-specific recombination proteins
S. cerevisiae protein | Mammalian protein | Activity
Spo11 | SPO11 | DSB formation, related to type II topoisomerase
Mei4 | MEI4 | DSB formation
Mer2 | MEI1 | DSB formation
Rec102 | – | DSB formation
Rec104 | – | DSB formation
Rec114 | REC114 | DSB formation
Ski8 | WDR61 | DSB formation (yeast), promotes Spo11 protein interactions (yeast), has WD repeats
Dmc1 | DMC1 | Strand exchange, ATPase
Hop2-Mnd1 | HOP2-MND1 | Interacts with DMC1/RAD51, mediator activity
Mei5-Sae3 | SWI5-SFR1 (MEI5) | Interacts with Dmc1, mediator activity (yeast)
Rec8 | REC8 | Cohesin subunit
– | SMC1β | Cohesin subunit
– | STAG3 | Cohesin subunit
– | RAD21L | Cohesin subunit
Zip1 | – | Synaptonemal complex formation
Zip2 | – | Synaptonemal complex formation
Zip3 | RNF212 | Synaptonemal complex formation (yeast), crossover formation
Zip4 | TEX11 | Synaptonemal complex formation (yeast), crossover formation
– | SYCP2 | Synaptonemal complex formation
– | SYCP3 | Synaptonemal complex formation
– | HORMAD1 | DSB formation or processing, synaptonemal complex formation
– | HORMAD2 | Associates with unsynapsed chromosome axes
– | SYCP1 | Synaptonemal complex formation
– | SYCE2 | Synaptonemal complex formation
– | SYCE3 | Synaptonemal complex formation
– | TEX12 | Synaptonemal complex formation
Mer3 | HFM1 | Crossover formation, DNA helicase
Msh4-Msh5 | MSH4-MSH5 | Crossover formation (yeast), DSB repair (mammals), related to MutS family
– | HEI10 | Crossover formation

References
Akamatsu Y, Jasin M (2010) Role for the mammalian Swi5-Sfr1 complex in DNA strand break repair through homologous recombination. PLoS Genet 6:e1001160
Arnheim N, Calabrese P, Tiemann-Boege I (2007) Mammalian meiotic recombination hot spots. Annu Rev Genet 41:369–399
Baudat F, de Massy B (2007) Regulating double-stranded DNA break repair towards crossover or non-crossover during mammalian meiosis. Chromosome Res 15:565–577
Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, Przeworski M, Coop G, de Massy B (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327:836–840
Brick K, Smagulova F, Khil P, Camerini-Otero RD, Petukhova GV (2012) Genetic recombination is directed away from functional genomic elements in mice. Nature 485:642–645
Buard J, de Massy B (2007) Playing hide and seek with mammalian meiotic crossover hotspots. Trends Genet 23:301–309
Daniel K, Lange J, Hached K, Fu J, Anastassiadis K, Roig I, Cooke HJ, Stewart AF, Wassmann K, Jasin M et al (2011) Meiotic homologue alignment and its quality surveillance are controlled by mouse HORMAD1. Nat Cell Biol 13:599–610
Fukuda T, Daniel K, Wojtasz L, Toth A, Hoog C (2010) A novel mammalian HORMA domain-containing protein, HORMAD1, preferentially associates with unsynapsed meiotic chromosomes. Exp Cell Res 316:158–171
Handel MA, Schimenti JC (2010) Genetics of mammalian meiosis: regulation, dynamics and impact on fertility. Nat Rev Genet 11:124–136
Jessberger R (2011) Cohesin complexes get more complex: the novel kleisin RAD21L. Cell Cycle 10:2053–2054
Kauppi L, Barchi M, Baudat F, Romanienko PJ, Keeney S, Jasin M (2011) Distinct properties of the XY pseudoautosomal region crucial for male meiosis. Science 331:916–920
Keeney S, Neale MJ (2006) Initiation of meiotic recombination by formation of DNA double-strand breaks: mechanism and regulation. Biochem Soc Trans 34:523–525
Kong A, Thorleifsson G, Frigge ML, Masson G, Gudbjartsson DF, Villemoes R, Magnusdottir E, Olafsdottir SB, Thorsteinsdottir U, Stefansson K (2014) Common and low-frequency variants associated with genome-wide recombination rate. Nat Genet 46:11–16
Kumar R, Bourbon HM, de Massy B (2010) Functional conservation of Mei4 for meiotic DNA double-strand break formation from yeasts to mice. Genes Dev 24:1266–1280
Lange J, Pan J, Cole F, Thelen MP, Jasin M, Keeney S (2011) ATM controls meiotic double-strand-break formation. Nature 479:237–240
Lichten M (2008) Meiotic chromatin: the substrate for recombination initiation. Genome Dyn Stab 3:165–193
May C, Slingsby T, Jeffreys A (2008) Human recombination hotspots: before and after the HapMap project. Genome Dyn Stab 2:195–244
Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, MacFie TS, McVean G, Donnelly P (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327:876–879
Paigen K, Petkov P (2010) Mammalian recombination hot spots: properties, control and evolution. Nat Rev Genet 11:221–233
Pan J, Sasaki M, Kniewel R, Murakami H, Blitzblau HG, Tischfield SE, Zhu X, Neale MJ, Jasin M, Socci ND et al (2011) A hierarchical combination of factors shapes the genome-wide topography of yeast meiotic recombination initiation. Cell 144:719–731
Parvanov ED, Petkov PM, Paigen K (2010) Prdm9 controls activation of mammalian recombination hotspots. Science 327:835
Qiao H, Prasada Rao HB, Yang Y, Fong JH, Cloutier JM, Deacon DC, Nagel KE, Swartz RK, Strong E, Holloway JK et al (2014) Antagonistic roles of ubiquitin ligase HEI10 and SUMO ligase RNF212 regulate meiotic recombination. Nat Genet 46:194–199
Schwacha A, Kleckner N (1995) Identification of double Holliday junctions as intermediates in meiotic recombination. Cell 83:783–791
Schwartz EK, Heyer WD (2011) Processing of joint molecule intermediates by structure-selective endonucleases during homologous recombination in eukaryotes. Chromosoma 120:109–127
Shin YH, Choi Y, Erdin SU, Yatsenko SA, Kloc M, Yang F, Wang PJ, Meistrich ML, Rajkovic A (2010) Hormad1 mutation disrupts synaptonemal complex formation, recombination, and chromosome segregation in mammalian meiosis. PLoS Genet 6:e1001190
Wojtasz L, Daniel K, Roig I, Bolcun-Filas E, Xu H, Boonsanay V, Eckmann CR, Cooke HJ, Jasin M, Keeney S et al (2009) Mouse HORMAD1 and HORMAD2, two conserved meiotic chromosomal proteins, are depleted from synapsed chromosome axes with the help of TRIP13 AAA-ATPase. PLoS Genet 5:e1000702
Yuan J, Chen J (2011) The role of the human SWI5-MEI5 complex in homologous recombination repair. J Biol Chem 286:9888–9893
Zakharyevich K, Ma Y, Tang S, Hwang PY, Boiteux S, Hunter N (2010) Temporally and biochemically distinct activities of Exo1 during meiosis: double-strand break resection and resolution of double Holliday junctions. Mol Cell 40:1001–1015

Metabolic Activation ▶ Bioactivation of Carcinogens

Metamobilomics – The Plasmid Metagenome of Natural Environments Lili Li1, Wenting Luo1, Lars Hestbjerg Hansen1,2 and Søren Johannes Sørensen1 1 Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark 2 Department of Environmental Science, Aarhus University, Roskilde, Denmark

Synopsis The horizontal gene pool is the collection of genes on mobile elements that are distributed across potentially connected microbes and can be


accessed by those microbes via horizontal transfer. Characterization of the pool is limited by the fact that many microbes from a range of environments are unculturable, so determining which plasmids and associated genes they carry can be difficult. The plasmid metagenome can be accessed by pooling bacteria and extracting plasmid DNA without a cultivation stage, followed by a variety of amplification steps before sequencing. Technically, it is difficult to avoid biasing the library of fragments sequenced, particularly in favor of smaller plasmids, and the field is still searching for the best solutions. Nevertheless, the glimpses of metamobilomes already available are encouraging and, coupled with synthetic biology to test the properties of elements first discovered in silico, provide a justification for major effort in this field.

Introduction One of the most exciting outcomes of the recent genomic revolution is the appreciation of horizontal gene transfer (HGT) as a force that shapes microbial life on Earth in ways not found in plants and animals. Frequent HGT among bacteria suggests that this process plays a key role in shaping the organization of genomes and fuelling evolution. The analysis of around 20,000 genes from the genomes of 8 free-living prokaryotes demonstrated that HGT has accelerated the introduction of new genes into species (Jain et al. 2003). Intense HGT in microbial communities driven by genetic elements such as plasmids, transposons, and viruses, coupled with selection and competition, leads to a continuous gene shuffling within a communal gene pool. This is dramatically exemplified by the emergence and spread of multiple antibiotic resistant plasmids in and between potentially pathogenic bacteria (Kruse and Sorum 1994). However, this communal pool of genes may vary from one environment to another, and thus, information on the communal gene pool in specific environments (e.g., wastewater) can be instrumental for our understanding of the evolutionary potential of bacteria living in and spreading from these environments.

The Communal Gene Pool Virtually all known plasmid DNA sequences are derived from plasmids selected for enabling their bacterial hosts to degrade xenobiotics or become resistant to antibiotics or heavy metals. It can be speculated that this observation does not represent the normal composition of plasmids in natural environments but rather reflects the methods of investigation. The high abundance of known plasmids carrying genes encoding drug resistance, catabolic enzymes, or pathogenicity traits is therefore due to the simple fact that microbiologists have traditionally searched for these traits and because the presence of easily detectable genes is required due to technological limitations. In addition, an essential initial step in traditional plasmid isolation is cultivation of the bacteria, although only a minor fraction (1-5%) of the bacteria in the natural community can be cultured when applying traditional cultivation methods. Recently, new cultivation-independent approaches have been developed to study the communal gene pool in natural environments. In the following section, the current plasmid DNA isolation methods will be described briefly.

How to Study the Communal Gene Pool? Plasmids in natural environments have traditionally been isolated endogenously by selective cultivation of indigenous bacteria expressing specific plasmid-encoded traits such as metal or antibiotic resistance. Although critical information has been obtained by this approach, it is highly biased because it focuses only on the culturable minority of bacteria and is furthermore limited to plasmids encoding easily selectable traits (Bahl et al. 2007). Hence, it is not clear whether such endogenously isolated plasmids are representative of important phenotypic characteristics in the natural environment. Exogenous isolation is another traditionally used approach to isolate plasmids from culturable bacteria. For instance, in the triparental version of this approach, indigenous plasmids are captured purely by their ability to transfer, thereby circumventing the selection biases imposed on


Metamobilomics – The Plasmid Metagenome of Natural Environments, Fig. 1 Principle of endogenous isolation (A), biparental exogenous (B) and triparental exogenous (C) isolation methods

the endogenous plasmid isolation (Fig. 1; Smalla et al. 2000). Recently, researchers have developed cultivation-independent approaches to study the communal gene pool, taking advantage of the technological breakthrough in high-throughput sequencing techniques. Here, this approach is termed metamobilomics. The metamobilome is defined as a metagenome of all circular genetic elements in a certain bacterial community. Metamobilomic analysis provides insights into the composition and structure of environmental plasmid communities and circumvents the limitations of previous methods. Such studies can provide a much needed overview of the diversity and prevalence of plasmids in environmental samples and new knowledge of the accessory genetic elements residing in the plasmids. To date, few metamobilome libraries have been published, most of which study wastewater treatment plants (WWTP) and mammal intestinal samples.

A Glimpse of the Plasmid Metagenomes The first plasmid metagenome was published in 2008 by Schlüter and his colleagues (Schluter et al. 2008). They plated the activated sludge sample on Luria broth media with 12 clinically relevant antibiotics, pooled all culturable bacteria, and isolated total plasmid DNA by using an alkaline lysis method, followed by caesium chloride (CsCl) density ultracentrifugation and pyrosequencing. Their results showed that 37.1% of sequence reads belonged to the “replication, recombination and repair” orthologous group of proteins and a large number of reads hit the gene ontology “extrachromosomal circular DNA”. This indicates that they successfully collected plasmids from wastewater samples; however, only bacteria growing aerobically on LB media were selected. Thus, this approach is in fact a high-throughput version of classical endogenous plasmid isolation with all the traditional biases. The second mobile metagenome study, of the human gut, used the culture-independent transposon-aided capture method (TRACA), which does not rely on any plasmid-encoded traits. The authors captured mainly plasmids of Gram-positive origins and genes involved in plasmid mobilization and replication. However, the inserted transposons may inactivate genes of interest, and the currently available transposons may limit the plasmid range. The TRACA approach was also applied to total community DNA directly from wastewater samples using a standard bead-beating protocol to break up bacterial cells (Zhang et al. 2011). Interestingly, a large number of antibiotic resistance genes were revealed in this wastewater metamobilome. However, bead-beating DNA extraction will most likely destroy large plasmids, which may explain why only small plasmids were recovered in this study. A novel metamobilomic approach combining the multiple displacement amplification method and pyrosequencing technique has been suggested as a fast and efficient strategy to study the environmental plasmid metagenome, including plasmids


Metamobilomics – The Plasmid Metagenome of Natural Environments, Fig. 2 Steps for building the metamobilome library. (A) Sample was taken from Danish wastewater treatment plant and cells were harvested; (B) plasmids and sheared chromosomal DNA; (C) pure plasmids; (D) multiple displacement amplification by using the φ29 DNA polymerase to gain micrograms (μg) of DNA; (E) high-throughput sequencing

hosted in unculturable bacteria (Li et al. 2012). First, intact bacterial cells were harvested, and metamobilomic DNA was isolated using a standard plasmid isolation kit. After removal of sheared genomic DNA by treatment with Plasmid-Safe™ ATP-dependent DNase (EPICENTRE® Biotechnologies, USA), the covalently closed circular metaplasmid DNA was amplified using the φ29 DNA polymerase and pyrosequenced (Fig. 2; Li et al. 2012). This wastewater metamobilome library was compared with a wastewater metagenome library against chromosome, plasmid, phage, and IS element databases, respectively. The comparison revealed that very few strictly chromosomal reads were present in the mobilome library. Furthermore, data analysis indicated that this library was strongly enriched in genes encoding plasmid-selfish traits, such as replication, stability, and conjugation, and most strikingly, several hundred new putative plasmid replicases were recovered, unveiling the depth of unexplored plasmid diversity (Li et al. 2012). At the same time, a rumen plasmidome study, which employed a similar method except for using Illumina sequencing instead of pyrosequencing, revealed the mosaic nature of the rumen plasmidome and some gene functions enriched only in the rumen ecological niche (Kav et al. 2012). In the most recently published rat cecum metamobilome, the authors developed an in silico procedure for identifying complete small novel plasmids in metagenomic data sets from whole-genome shotgun sequencing (Jorgensen et al. 2014).
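The idea of recognizing complete circular elements in silico can be illustrated with a generic heuristic: when a circular molecule is assembled from shotgun reads, the resulting linear contig often begins and ends with the same stretch of sequence. The Python sketch below checks for such a terminal overlap. It is an illustrative assumption only and is not claimed to be the procedure used by Jorgensen et al. (2014); the function names and the `min_overlap` default are hypothetical.

```python
from typing import Optional


def circular_overlap(contig: str, min_overlap: int = 20) -> int:
    """Return the length of the longest terminal overlap (prefix == suffix).

    Only overlaps of at least `min_overlap` bases and at most half the
    contig length are considered; 0 means no overlap was found.
    """
    max_len = len(contig) // 2
    for k in range(max_len, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return k
    return 0


def trim_circular(contig: str, min_overlap: int = 20) -> Optional[str]:
    """Return one copy of a putatively circular sequence, or None.

    The repeated end is removed so that the returned string represents
    one full turn of the circle.
    """
    k = circular_overlap(contig, min_overlap)
    return contig[:-k] if k else None
```

A contig for which `trim_circular` returns `None` shows no terminal repeat under this heuristic; that does not prove it is linear, since assemblers may also break circles without leaving an overlap.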

Technical Issues to Acquire Good Libraries These sequencing results show that most of the sequence reads came from small plasmids, most likely due to the preferential amplification of small circular DNA molecules by the φ29 DNA polymerase. To overcome this amplification bias, a modified protocol including a gel electrophoresis step to isolate large plasmids (>10 kb) was recently established (Norman et al. 2014). This modified approach exclusively recovered a large number of proteins associated with mating-pair formation of conjugative plasmids, such as type IV secretion systems and T4SS-relaxosome coupling, indicating the successful capture of large conjugative plasmids (Norman et al. 2014).

Future Directions The metamobilomic approach provides novel opportunities to generate a cultivation-independent characterization of the plasmids present in various organisms and entire environments. However, up to 80% of the recovered sequences still have no homologues in the databases, illustrating the limited diversity that the databases represent. As more environmental samples are investigated and more sequencing reads become available, in combination with more precise and robust bioinformatic algorithms, this limitation will be alleviated. The new types of plasmid-selfish genes observed in the first few metamobilome studies strongly indicate that environmental metamobilomes are still largely untapped and that studies focused directly on metamobilomes can generate many new discoveries. The metamobilome isolation technique may also be beneficial in screening for novel vector systems for biotechnological tools. As shown earlier, even large conjugative plasmids can be synthesized, and new plasmid vectors based on the wastewater metamobilome library are under construction in the author’s group (see entry ▶ “Synthetic Plasmid Biology”).

Cross-References ▶ Conjugative Transfer Systems and Classifying Plasmid Genomes ▶ Plasmid Genomes, Introduction to ▶ Plasmids as Secondary Chromosomes ▶ Sequence Information to Assess Evolutionary Relationships ▶ Synthetic Plasmid Biology

References
Bahl MI, Hansen LH, Goesmann A, Sorensen SJ (2007) The multiple antibiotic resistance IncP-1 plasmid pKJK5 isolated from a soil environment is phylogenetically divergent from members of the previously established alpha, beta and delta sub-groups. Plasmid 58:31–43
Jain R, Rivera MC, Moore JE, Lake JA (2003) Horizontal gene transfer accelerates genome innovation and evolution. Mol Biol Evol 20:1598–1602
Jorgensen TS, Xu Z, Hansen MA, Sorensen SJ, Hansen LH (2014) Hundreds of circular novel plasmids and DNA elements identified in a rat cecum metamobilome. PLoS One 9(2):e87924
Kav AB, Sasson G, Jami E, Doron-Faigenboim A, Benhar I, Mizrahi I (2012) Insights into the bovine rumen plasmidome. Proc Natl Acad Sci U S A 109(14):5452–5457
Kruse H, Sorum H (1994) Transfer of multiple drug resistance plasmids between bacteria of diverse origins in natural microenvironments. Appl Environ Microbiol 60:4015–4021
Li LL, Norman A, Hansen LH, Sorensen SJ (2012) Metamobilomics – expanding our knowledge on the pool of plasmid encoded traits in natural environments using high throughput sequencing. Clin Microbiol Infect 18:8–11
Norman A, Riber L, Luo W, Li LL, Hansen LH, Sorensen SJ (2014) An improved method for including upper size range plasmids in metamobilomes. PLoS One 9(8):e104405
Schluter A, Krause L, Szczepanowski R, Goesmann A, Puhler A (2008) Genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant. J Biotechnol 136:65–76
Smalla K, Heuer H, Gotz A, Niemeyer D, Krogerrecklenfort E, Tietze E (2000) Exogenous isolation of antibiotic resistance plasmids from piggery manure slurries reveals a high prevalence and diversity of IncQ-like plasmids. Appl Environ Microbiol 66:4854–4862
Zhang T, Zhang XX, Ye L (2011) Plasmid metagenome reveals high levels of antibiotic resistance genes and mobile genetic elements in activated sludge. PLoS One 6(10):e26041

miRNA-Induced Deadenylation ▶ RNA Interference

miRNA-Induced Silencing ▶ RNA Interference

Mismatch Repair Guo-Min Li Graduate Center for Toxicology and Markey Cancer Center, University of Kentucky College of Medicine, Lexington, KY, USA

Synonyms Base substitution mutation; DNA heteroduplex; DNA mismatch; DNA mispair; Insertion-deletion mutation; Slip mispairing

Synopsis DNA repair refers to all aspects of the biological response to DNA damage and aberrant DNA structures, including the response to persistent DNA replication intermediates that accumulate when DNA lesions inhibit DNA replication and RNA transcription, and to products of illegitimate DNA recombination. DNA mismatch repair (MMR) is the process by which misincorporation errors during semiconservative replicative DNA

synthesis are corrected. To correctly repair such DNA mispairs, MMR must distinguish the parental strand from the nascent DNA strand in a newly replicated DNA duplex. MMR is highly conserved from Escherichia coli to humans, and defects in MMR enzymes increase the spontaneous mutation rate, leading to cellular dysfunction or, in higher organisms, to potentially life-threatening diseases including cancer. This entry describes the early history of MMR, its significance, the proposed molecular mechanisms and models of MMR in prokaryotic and eukaryotic cells, MMR in the context of the cellular response to DNA damage, and the link between defective MMR and human cancer.

Development of the Field The history of DNA mismatch repair (MMR) begins in the 1950s, when researchers observed that unusually high frequencies of genetic exchanges occur in localized regions of some fungal and bacteriophage genomes (Friedberg et al. 2006). This came to be understood as the result of high negative interference during homologous recombination. In the early 1960s, Robin Holliday (1964) and Evelyn Witkin (1964) suggested that these unstable genomic regions were associated with the formation of “hybrid” DNA containing DNA mispairs and their subsequent repair. Later, Matthew Meselson and Maurice Fox independently provided direct evidence for what is now known as MMR, performing transfection experiments with mispair-containing heteroduplex DNA (White and Fox 1975; Wildenberg and Meselson 1975). A critical question in MMR is how the MMR system distinguishes the parental strand (containing the correct information) from the nascent DNA strand (containing the mispaired base) in a newly replicated DNA duplex (see below for more discussion). In one of the most critical discoveries in the field of MMR, Matthew Meselson and his colleagues discovered that such discrimination is provided by transient undermethylation of adenine residues in the d(GATC) sequence in nascent DNA relative to parental

DNA (Pukkila et al. 1983). At the same time, Paul Modrich and his colleagues identified and characterized the components of the E. coli enzymatic pathway that first recognizes transiently hemi-methylated DNA and then replaces the misincorporated nucleotide in the nascent DNA, reconstituting the correct DNA base pair (Lu et al. 1983). Modrich and Richard Kolodner and their colleagues went on to conduct pioneering work on MMR in eukaryotic cells, which is independent of hemi-methylated d(GATC) sites (Fishel et al. 1993; Kolodner 1996; Kunkel and Erie 2005; Leach et al. 1993; Li 2008; Modrich and Lahue 1996; Parsons et al. 1993). In the early 1990s, Modrich, Kolodner, Bert Vogelstein, and others demonstrated that individuals with hereditary nonpolyposis colon cancer carry defects in human genes functionally essential for human MMR (Fishel et al. 1993; Kolodner 1996; Kunkel and Erie 2005; Leach et al. 1993; Modrich and Lahue 1996; Parsons et al. 1993). Since then, causal relationships (direct and indirect) between many other human diseases and defects in DNA repair or defects in other biological responses to DNA damage have been demonstrated. The years during which DNA mismatch repair pathways were initially recognized and characterized are among the more exciting phases of twentieth-century molecular biology, with large implications for cancer biology and the understanding of human disease.

Overview DNA mismatch repair is a highly conserved biological pathway that plays a key role in maintaining genome stability. The major biological role of MMR is to repair mismatches generated during DNA replication. MMR also suppresses DNA recombination between two divergent DNA sequences and plays a role in DNA damage signaling in mammalian cells. Escherichia coli MutS and MutL and their respective eukaryotic homologs, MutSα and MutLα, are the most important highly specialized enzymes involved in executing the MMR pathway. Proliferating cell nuclear antigen

Mismatch Repair, Fig. 1 Structure of DNA mismatches. (a) Base-base mismatches. Upper, a G:A mispair; lower, a T:G mispair. (b) DNA duplexes. Top, a homoduplex; middle, an A:G mismatched heteroduplex; and bottom, an insertion-deletion heteroduplex

(PCNA) and replication protein A (RPA) are multifunctional proteins that play a critical role during MMR and many other DNA transactions in eukaryotic cells. Defects in MMR are associated with genome-wide instability, predisposition to certain types of cancer including hereditary nonpolyposis colorectal cancer (HNPCC), resistance to certain chemotherapeutic agents in humans, and abnormalities in meiosis and sterility in mammalian systems. This article reviews well-established fundamental characteristics of MMR, including its molecular biochemical properties, biological roles, and the consequences of defects in MMR in bacterial and mammalian cells. Definition and Significance of DNA Mismatch Repair During the process of DNA replication, the entire genome of an organism is duplicated, such that two daughter DNA molecules, one for each progeny cell, are created from one parental genome. This process is complex and, for the genome of even the simplest bacterial cell, requires thousands of iterative steps, each of which is subject to error. Because the fidelity of DNA replication is essential to the integrity and viability of each cell, cells have evolved multiple systems whose role is to preserve genome integrity by one of the following mechanisms: (1) checking for and removing errors introduced into daughter DNA molecules during DNA replication, (2) detecting and responding to template strand DNA lesions, or (3) preventing or correcting the effects of other

processes that destabilize or change DNA sequence or structure. MMR contributes to all three types of genome maintenance reactions. A DNA mismatch is defined as a non-Watson-Crick nucleotide base pair (also called a heteroduplex) in dsDNA, where Watson-Crick base pairs (i.e., homoduplexes) are guanine:cytosine (G:C) and adenine:thymine (A:T). The structures of several heteroduplexes are shown in Fig. 1. MMR proteins recognize and repair eight different base-base mismatches (i.e., A:A, A:C, A:G, C:C, C:T, G:G, G:T, and T:T) as well as small insertion-deletion (ID) mispairs, although the relative efficiency of repair varies both with the specific mispair or ID and with the DNA sequence context. If a DNA mismatch persists in a proliferating cell, and the replicative cycle proceeds to completion, the genome of one of the two progeny cells will harbor a mutation at the site of the mismatch. Non-synonymous mutations change the amino acid sequence of the protein product of a gene, while silent mutations do not. In most cases, non-synonymous mutations confer a growth advantage or disadvantage at the level of the cell and a survival advantage or disadvantage at the level of the organism. In the absence of mutation, there could be no material for evolution and improvement of the species; on the other hand, random mutations (in the absence of selective pressure for survival) are more likely than not to have deleterious consequences. A large body of data collected in many experimental systems demonstrates that a high mutation rate, due to loss of

DNA replication fidelity or loss of DNA repair capacity, is incompatible with cell viability. MMR in Prokaryotes Early in the history of research on MMR, extensive biochemical and molecular genetic studies were carried out on MMR in the prototypical bacterial species, E. coli. These studies led to a detailed molecular model of MMR. One of the keys to deducing this model, and therefore understanding MMR in E. coli, was to identify all the essential protein factors required to carry out the multistep process of correcting a DNA mispair. The current model for E. coli MMR describes essential roles for the following E. coli proteins: MutS, MutL, MutH, DNA helicase II (MutU/UvrD), four exonucleases (ExoI, ExoVII, ExoX, and RecJ), single-stranded DNA-binding protein (SSB), DNA polymerase III holoenzyme, and DNA ligase (Table 1). As discussed below, many but not all of these proteins are highly conserved from bacteria through mammals and have one or more closely related homologs or analogs in human cells. MutS, MutL, and MutH initiate MMR and play specialized biological roles in MMR in E. coli, while the remaining proteins required for MMR in E. coli participate in several other DNA metabolic pathways. In the context of DNA replication, the first “problem” to be solved during MMR is recognition of a DNA mismatch generated by polymerase misincorporation. For example, E. coli DNA polymerase III holoenzyme might correctly insert two cytosines opposite two sequential guanine bases and then misincorporate an adenine opposite the third of three tandem guanines, generating an A:G mispair. If the A:G mispair was then extended by additional cycles of correct DNA polymerase-mediated nucleotide incorporation, the mispair would be surrounded by Watson-Crick G:C and A:T base pairs. How can the single A:G mispair be identified and repaired by MMR? This first problem in MMR is solved by structure-specific mismatch recognition by a MutS homodimer.
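The definition of a mismatch given above can be checked with a few lines of Python. The sketch below enumerates every unordered base pair that is not Watson-Crick and locates the mispair in the polymerase example from the text. It is only an illustration: the names are hypothetical, and the two strands are assumed to be written aligned position by position, which says nothing about how MutS actually recognizes a mismatch structurally.

```python
from itertools import combinations_with_replacement

# The two Watson-Crick pairs, stored as unordered sets of bases.
WATSON_CRICK = {frozenset("AT"), frozenset("GC")}


def base_base_mismatches():
    """List all unordered base pairs that are not Watson-Crick pairs."""
    pairs = combinations_with_replacement("ACGT", 2)
    return [f"{a}:{b}" for a, b in pairs
            if frozenset((a, b)) not in WATSON_CRICK]


def find_mispairs(template, nascent):
    """Return positions where the nascent base does not pair with the template.

    Both strands are assumed to be written aligned position by position.
    """
    return [i for i, (t, n) in enumerate(zip(template, nascent))
            if frozenset((t, n)) not in WATSON_CRICK]


if __name__ == "__main__":
    print(base_base_mismatches())
    # Two cytosines inserted correctly, then an adenine opposite the third G.
    print(find_mispairs("GGG", "CCA"))
```

The enumeration yields eight mismatches, the same eight listed in the text, and the example duplex reports a single mispair at the third position.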
MutS is a critical MMR protein that has been characterized extensively using biochemical and molecular genetic approaches. Molecular models of the MutS atomic structure and MutS bound to a DNA mismatch have been solved by X-ray crystallography (Fig. 2) (Lamers et al. 2000; Obmolova et al. 2000). Importantly, MutS distinguishes between homo- and heteroduplex DNA and binds with higher affinity to the mismatch than to flanking homoduplex DNA. Different mismatches are also bound by MutS with different affinities.

Mismatch Repair, Table 1 MMR components and their functions (a)

E. coli | Human | Function
(MutS)2 | MutSα (MSH2-MSH6) (b); MutSβ (MSH2-MSH3) | DNA mismatch/damage recognition
(MutL)2 | MutLα (MLH1-PMS2) (b); MutLβ (MLH1-PMS1); MutLγ (MLH1-MLH3) | Molecular matchmaker; endonuclease, termination of mismatch-provoked excision
MutH | ? (c) | Strand discrimination
UvrD | ? (c) | DNA helicase
ExoI, ExoVII, ExoX, RecJ | ExoI | DNA excision, mismatch excision
Pol III holoenzyme | Pol δ | DNA resynthesis
Pol III holoenzyme | PCNA | Initiation of MMR, DNA resynthesis, activation of MutLα endonuclease
SSB | RPA | ssDNA binding/protection, stimulating mismatch excision, termination of DNA excision, promoting DNA resynthesis
- | HMGB1 | Mismatch-provoked excision
- | RFC | PCNA loading, 3′ nick-directed repair, activation of MutLα endonuclease
DNA ligase | DNA ligase I | Nick ligation

(a) Modified with permission from Cell Res. 18:85–98, 2008
(b) Major components in cells
(c) Not yet identified

Mismatch Repair, Fig. 2 Crystal structure of Taq MutS bound to a heteroduplex DNA. The two MutS subunits are represented by ribbon diagrams in blue (right) and green (left). The DNA is shown in a space-filling model, in which the backbone atoms are red and bases are pink (Reprinted with permission from Nature 407, 703–710, 2000)

The second problem to be solved during MMR is discriminating the parental DNA strand from the newly replicated daughter DNA strand in nascent duplex DNA. In early research on DNA replication and replication fidelity in E. coli, it was discovered that E. coli DNA is methylated at the N6 position of adenine in dGATC sequences by DNA adenine methyltransferase (Dam). During DNA replication, DNA polymerase incorporates unmethylated adenine, so that methylated dGATC sequences become transiently hemi-methylated (only one strand methylated). Thus, a hemimethylated dGATC sequence is a molecular signature of nascent DNA. In the context of MMR, this is important, because the methylation status indicates which DNA strand carries the “correct” and which strand carries the “incorrect” base. The MutH protein, a latent endonuclease, is designed to read this signature by nicking (incising) the newly synthesized strand on the unmethylated

DNA strand in the dGATC site. The resulting strand break is the starting point for exonuclease-catalyzed DNA excision from the nick up to and beyond the mispaired base, which precedes strand-specific DNA synthesis. In the latter step, DNA polymerase III holoenzyme catalyzes gap-filling DNA repair synthesis using the methylated parental DNA strand as a template. Thus, in E. coli, the hemi-methylated dGATC site closest to a mismatch determines the strand specificity of repair for that mismatch. Extensive biochemical and molecular genetic studies of MMR-proficient and MMR-deficient E. coli cells and protein extracts have been carried out. The results of these studies underpin the current understanding of the MMR pathway in E. coli (Fig. 3), which can be divided into three phases: initiation, excision, and resynthesis (Li 2008). The initiation reaction involves mismatch recognition, assembly of the MMR complex, and location and incision of a strand discrimination signal up to 1 kb from the site of the mismatch. The excision reaction includes DNA unwinding and excision from the nick up to and beyond the mismatch. The resynthesis reaction involves DNA gap filling and ligation. In the first phase of MMR, the MutS homodimer recognizes base-base mismatches and small nucleotide insertion-deletion (indel) mispairs. MutS possesses intrinsic ATPase activity, an activity that must be present to preserve functional MMR in E. coli. The binding of a mismatch by MutS recruits MutL to the complex, which then recruits and activates the latent endonuclease activity of MutH. Like MutS, MutL functions as a homodimer and possesses ATPase activity. The activated MutH endonuclease recognizes and specifically incises the unmethylated daughter strand at the closest hemi-methylated dGATC site either 5′ or 3′ to the mismatch.
The second phase of the MMR pathway is excision of non-parental nascent DNA, starting from the nick and extending up to and beyond the mismatch. In the presence of MutS and MutL, helicase II loads at the nick and unwinds the duplex from the nick towards the mismatch, generating single-stranded DNA, which is rapidly bound by single-stranded

Mismatch Repair, Fig. 3 Molecular mechanism of MMR in E. coli. The MMR pathway in E. coli involves three phases. In phase 1, the mismatch is recognized, MMR proteins are recruited, and the daughter DNA strand is cleaved at the hemi-methylated dGATC site closest to the mismatch. This phase requires MutS, MutL, MutH, and ATP. In phase 2, non-parental nascent DNA is excised from the nick up to and beyond the mismatch. Phase 2 requires helicase II, SSB, one of the 4 exonucleases (ExoI, ExoX, ExoVII, and RecJ), and continued presence of MutS and MutL. In phase 3, the ssDNA gap surrounding the mismatch is repaired and continuous homoduplex DNA is restored. Phase 3 requires DNA polymerase III holoenzyme, SSB and DNA ligase, DNA nucleotides, and ATP

DNA-binding protein (SSB) and transiently protected from nuclease attack. Depending on the position of the strand break relative to the mismatch, ExoI or ExoX (3′→5′ exonucleases) or ExoVII or RecJ (5′→3′ exonucleases) excises the nicked strand from the nicked dGATC site up to and slightly past the mismatch. It has been proposed that MutL and SSB play critical roles in terminating DNA excision after mismatch removal, but the mechanism remains to be investigated. In the last phase of MMR, DNA polymerase III holoenzyme, SSB, and DNA ligase carry out repair DNA synthesis to close and seal the gap and restore continuous homoduplex DNA.
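The three phases described above can be summarized in a toy Python model. It is a sketch under simplifying assumptions, not a biochemical simulation: the nascent strand is written 5′→3′ from left to right, the template is written aligned beneath it, methylation is not modeled (the nascent strand is simply taken to be the unmethylated one), and all function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def nearest_gatc(nascent, mismatch):
    """Phase 1: index of the GATC site in the nascent strand closest to the mismatch."""
    sites = [i for i in range(len(nascent) - 3) if nascent[i:i + 4] == "GATC"]
    if not sites:
        raise ValueError("no GATC site available as a strand signal")
    return min(sites, key=lambda i: abs(i - mismatch))


def choose_exonucleases(nick, mismatch):
    """Phase 2: pick exonucleases by the position of the nick on the nascent strand.

    A nick 5' to the mismatch needs 5'->3' excision; a nick 3' to the
    mismatch needs 3'->5' excision.
    """
    return ("ExoVII", "RecJ") if nick < mismatch else ("ExoI", "ExoX")


def repair(template, nascent, mismatch):
    """Phases 2 and 3: excise from the nick through the mismatch, then resynthesize."""
    nick = nearest_gatc(nascent, mismatch)
    lo, hi = sorted((nick, mismatch))
    # Gap filling copies the parental template across the excised tract.
    patch = "".join(COMPLEMENT[b] for b in template[lo:hi + 1])
    return nascent[:lo] + patch + nascent[hi + 1:]


if __name__ == "__main__":
    template = "CTAGTTGGGAAC"   # parental strand, aligned to the nascent strand
    nascent = "GATCAACCATTG"    # A misinserted opposite the third G (index 8)
    print(choose_exonucleases(nearest_gatc(nascent, 8), 8))
    print(repair(template, nascent, 8))
```

The model always repairs the nascent strand, which reflects the strand specificity conferred by the hemi-methylated dGATC site; the roles of MutS, MutL, helicase II, and SSB are deliberately omitted.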

Notably, during MMR in E. coli, the hemi-methylated dGATC site is located and cleaved by the concerted actions of MutS, MutL, MutH, and ATP. However, this process is not completely understood. In fact, three models have been proposed to address how mismatch binding by MutS leads to cleavage of a nearby hemi-methylated dGATC site in E. coli or the analogous interaction between the damage-sensing protein complex and the strand discrimination signal in eukaryotic cells (see discussion below). The early studies on E. coli MMR demonstrated three key features of the pathway: first, repair is specifically targeted to the newly synthesized DNA strand, ensuring replication fidelity; second, repair is bidirectional, proceeding 5′→3′ or 3′→5′ from the nick to the site of the mismatch; and third, MMR has broad substrate specificity including base-base mismatches and small ID mispairs. All of these properties require functional MutS, MutL, and MutH. Because the mechanism of MMR is highly conserved throughout evolution, E. coli MMR is an excellent model for MMR in eukaryotic cells. Nevertheless, eukaryotic MMR is significantly different from bacterial MMR, and a number of questions remain to be answered about MMR in both prokaryotic and eukaryotic cells. Unique features of human MMR are described below. MMR in Human Cells Studies on eukaryotic MMR have focused heavily but not exclusively on human experimental systems. Early studies showed that the similarity between human and prototypical E. coli MMR is very high. These similarities include substrate specificity, bidirectionality, and nick-directed strand specificity.
However, human MMR also exhibits significant differences from bacterial MMR including no role for hemi-methylated dGATC sites, apparent absence of homolog or analog of MutH, heterodimer instead of homodimer composition and multiple isoforms of human MutS and MutL, dramatically increased complexity of MMR protein machinery, and acquisition of additional biological roles for MMR in human cells. Despite these changes, the

deduced model for human MMR strongly resembles the deduced model for E. coli MMR. The protein components that carry out MMR in human cells have been characterized extensively. A major achievement in the field was the purification and biochemical characterization of the critical human MMR proteins, MutSα and MutLα, and the in vitro reconstitution of MMR using model DNA substrates and purified recombinant human proteins and/or selectively depleted human cell nuclear extracts. The genes encoding all human MMR proteins were cloned and sequenced independent of and prior to completion of the Human Genome Project. Naturally occurring mutant alleles of MutS and MutL homologs have been identified at significant frequency in human populations, and these alleles and their biological significance have been studied extensively. Human MutS and MutL mutant proteins are of major interest and importance because they confer strong predisposition to certain types of cancer (see below). Human MutS and MutL proteins were identified based on their homology to E. coli and yeast counterparts (Table 1). However, unlike E. coli MutS and MutL proteins, which are homodimers, human MutS and MutL homologs are heterodimers. The MSH2 subunit heterodimerizes with the MSH6 or the MSH3 subunit to form MutSα or MutSβ, respectively. At least four human MutL homologs (MLH1, MLH3, PMS1, and PMS2) have been identified. MLH1 heterodimerizes with PMS2, PMS1, or MLH3 to form MutLα, MutLβ, or MutLγ, respectively. MutSα and MutSβ are both ATPases that play a critical role in mismatch recognition and MMR initiation. MutSα preferentially recognizes base-base mismatches and ID mispairs of 1 or 2 nucleotides, while MutSβ preferentially recognizes larger ID mispairs. MutLα is required for MMR and MutLγ plays a role in meiosis, but no specific biological role has been identified for MutLβ. MutLα is also an ATPase, which is essential for MMR in human cells.
In in vitro reconstituted human MMR assays, MutLα regulates termination of mismatch-provoked excision. Recent studies show that MutLα possesses a PCNA/RFC-dependent endonuclease activity, which plays a critical role in 3′ nick-directed MMR involving EXO1. PCNA plays critical roles in the initiation and DNA resynthesis steps of MMR. It interacts with MSH6 and MSH3 via a conserved PCNA interaction protein motif termed the PIP box. It has been proposed that PCNA may help localize MutSα and MutSβ to mispairs in newly replicated DNA. Interestingly, recent studies have revealed that human MutSα is recruited to replicating chromatin through its physical interaction with a histone mark called H3K36me3 (histone 3 Lys36 trimethylation) (Li 2014). Although PCNA is absolutely required during 3′ nick-directed MMR, it is not essential for 5′ nick-directed MMR. This observation might be explained by the fact that EXO1, a 5′→3′ exonuclease, is involved in both 5′- and 3′-directed MMR. Like PCNA, EXO1 also interacts with MutSα and MutLα. While EXO1 can readily carry out 5′-directed mismatch excision in the presence of MutSα or MutSβ and RPA, recent in vitro studies showed that involvement of EXO1 in catalyzing 3′ nick-directed excision requires the MutLα endonuclease, whose activation is dependent on PCNA and replication factor C (RFC). This activity seems to be critical for removing mispairs in the leading strand, where only a 3′ nick is available. The current model suggests that after recognition of the 3′ nick and the mismatch by the MMR initiation complex, MutLα endonuclease makes an incision 5′ to the mismatch in a PCNA-dependent manner, and the resulting nick is used by EXO1 to initiate excision and remove the mismatch. However, exo1 null mutants in mice and yeast have a weak mutator phenotype; thus, it is likely that additional as yet unidentified exonucleases are involved in eukaryotic MMR. It is also interesting to note that recent studies have discovered another mechanism for repairing leading strand misincorporation. Misincorporation of ribonucleotides and their subsequent cleavage by RNase H2 generate strand breaks 5′ to a misincorporated base, and these nicks can be used as the starting point by EXO1 to remove the mispaired base (Williams and Kunkel 2014).
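The dependence of excision on nick polarity described above can be written as a small decision function. This Python sketch only encodes the in vitro model summarized in the text; the function name, argument names, and step descriptions are hypothetical, and possible EXO1-independent routes are not represented.

```python
from typing import List


def excision_route(nick_side: str, pcna: bool = True, rfc: bool = True) -> List[str]:
    """Return the ordered excision steps for a nick 5' or 3' to the mismatch.

    An empty list means that, in this simplified model, EXO1 has no
    entry point and excision does not start.
    """
    if nick_side == "5'":
        # EXO1 is a 5'->3' exonuclease and can start at the existing nick.
        return ["EXO1 excises 5'->3' from the nick through the mismatch"]
    if nick_side == "3'":
        if not (pcna and rfc):
            # Without PCNA and RFC the MutLα endonuclease is not activated.
            return []
        return [
            "MutLα endonuclease incises 5' to the mismatch",
            "EXO1 excises 5'->3' from the new nick through the mismatch",
        ]
    raise ValueError("nick_side must be \"5'\" or \"3'\"")
```

In this sketch, `excision_route("5'", pcna=False)` still returns an EXO1 step, mirroring the observation that PCNA is not essential for 5′ nick-directed excision, whereas `excision_route("3'", pcna=False)` returns an empty list.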

Other essential protein components involved in human MMR include the single-stranded DNA-binding protein RPA, RFC, DNA polymerase δ (pol δ), and DNA ligase I. RPA seems to be involved in all stages of MMR: it binds to nicked heteroduplex DNA before MutSα and MutLα, stimulates mismatch-provoked excision, protects the ssDNA gapped region generated during excision, and facilitates DNA resynthesis. Furthermore, RPA is phosphorylated after pol δ has been recruited to the gapped DNA substrate. Recent studies indicate that phosphorylation reduces the affinity of RPA for DNA, that unphosphorylated RPA stimulates mismatch-provoked DNA excision more efficiently than phosphorylated RPA, and that phosphorylated RPA facilitates MMR-associated DNA resynthesis more efficiently than unphosphorylated RPA. These results are consistent with the idea that a high affinity RPA-DNA complex might be required to protect nascent ssDNA and to displace DNA-bound MutSα/MutLβ, while a lower affinity RPA-DNA complex might facilitate DNA resynthesis by pol δ. One difference between E. coli and human cell DNA metabolism is that human cells do not methylate dGATC sites, and therefore, human cells lack replication-associated hemi-methylated DNA regions. In addition, human cells carry out MMR without a MutH homolog. Clearly, this has implications regarding the molecular mechanism of human MMR. Importantly, in vitro studies demonstrate that the strand specificity of human MMR can be modulated by introduction of a strand-specific nick 5′ or 3′ to a unique DNA mismatch in a model DNA substrate. Based on convincing in vitro evidence, it is presumed that endogenous strand-specific nicks in replicating DNA perform this role in human and other eukaryotic cells in vivo. This presumption is based on the fact that the lagging DNA strand is replicated discontinuously, such that transient strand breaks are located in between Okazaki fragments. As discussed above, although the leading strand only contains a 3′ end, the MutLα endonuclease activity and the cleavage of misincorporated ribonucleotides by RNase H2 can provide 5′ nicks to remove mispaired bases in the leading strand.

MMR and DNA Damage Signaling in Human Cells The primary role of MMR is to correct misincorporation and slip-mispairing errors introduced into nascent DNA during DNA replication; however, human MMR plays several other important roles in the cell. One of these roles is to facilitate DNA damage signaling leading to programmed cell death (also known as apoptosis) (Jiricny 2006; Li 2008). This feature of human MMR was first recognized in the context of the response (or lack of response) of bacterial and human cancer cells to DNA damaging agents, many of which are used as chemotherapeutic drugs. For example, N-methyl-N′-nitro-N-nitrosoguanidine (MNNG) is a cytotoxic DNA alkylating agent that rapidly kills most proliferating cells in vitro. However, experiments performed with E. coli in the 1980s demonstrated that mutS and mutL cells are resistant to killing by MNNG, and similar results were obtained with isogenic human cells with or without defects in MutSα or MutLα. For example, the MSH6-deficient lymphoblastoid cell line MT1, which was derived from TK6 cells via mutagenesis in the presence of a high dose of MNNG, is 500-fold more resistant to killing by MNNG than the TK6 cells. Similarly, the MLH1-deficient colorectal tumor cell line HCT116 is more tolerant to MNNG than HCT116 cells carrying a wild-type MLH1 gene on an ectopic chromosome introduced by chromosome transfer. These experiments show that functional MMR mediates the cytotoxicity of DNA alkylating agents. Similar observations have been made regarding the response to other types of DNA damaging agents, including cisplatin, some environmental carcinogens, and ionizing radiation. These observations also apply in the context of experimental animal model systems including the mouse and worm.
Although the mechanism by which MMR proteins facilitate the cytotoxicity of DNA damaging agents is not completely clear, it appears that an active MMR complex engages in a nonfunctional manner with DNA lesions when they are distributed widely in the genome. One model to explain this phenomenon is called the “futile cycle” model. This model proposes that DNA adduct-induced misincorporation triggers repeated cycles of MMR, without removing the DNA adduct from the template strand. Repeated cycles of MMR and persistent DNA adducts in the template DNA trigger a DNA damage response leading to cell cycle arrest or apoptosis. The latter events are likely mediated by ATR/ATM signaling networks (Fig. 4). A traditional view of MMR is that its primary biological role is to promote genomic stability by enhancing and maintaining an extremely high level of DNA replication fidelity in dividing cells. The novel apoptotic function of MMR is to eliminate cells with widespread DNA damage from the growing population. These cells are usually at risk of accumulating high levels of mutations, leading to tumorigenesis. Thus, the ability of MMR to promote apoptosis can be considered a cancer prevention mechanism and has important implications for cancer chemotherapy.

Mismatch Repair, Fig. 4 MMR-mediated DNA damage signaling. A DNA adduct (solid black circle) induces misincorporation during DNA replication. The resulting DNA mismatch triggers strand-specific MMR, which removes the misincorporated base but does not remove the DNA adduct in the template strand. Misincorporation recurs during the DNA resynthesis step (phase 3) of MMR. This generates another mismatch and repeated cycles of MMR. This futile repair cycle activates the ATR/ATM-dependent signaling network, leading to cell cycle arrest and/or apoptosis (Adapted from Li (2008))
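The logic of the futile cycle model can be illustrated with a deliberately simple sketch. It is not a quantitative model of MMR: the function name, the mispairing probability, and the number of repair rounds assumed to trigger damage signaling are all hypothetical parameters chosen for illustration. The only feature taken from the model itself is that excision and resynthesis act on the nascent strand, so the adduct in the template persists and can provoke a mismatch again.

```python
import random


def futile_cycle(mispair_prob, signal_threshold=3, seed=0):
    """Toy model: count MMR rounds at one persistent template adduct.

    mispair_prob      assumed chance that synthesis opposite the adduct mispairs
    signal_threshold  assumed number of repair rounds that triggers signaling
    """
    rng = random.Random(seed)
    rounds = 0
    # Replication first copies the adducted template.
    mismatch = rng.random() < mispair_prob
    while mismatch:
        # MMR excises the nascent strand; the adduct stays in the template.
        rounds += 1
        if rounds >= signal_threshold:
            return rounds, "damage signaling (arrest or apoptosis)"
        # Resynthesis copies the same adducted template again.
        mismatch = rng.random() < mispair_prob
    return rounds, "no mismatch remains"


print(futile_cycle(1.0))  # adduct always mispairs: repair can never succeed
print(futile_cycle(0.0))  # adduct never mispairs: MMR is not engaged
```

When the adduct always directs a mispair, every round of repair recreates the mismatch and the assumed threshold is inevitably reached, which is the essence of the "futile" cycle.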

Models for MMR Signaling

As mentioned above, three models have been proposed to explain how MMR proteins coordinately and cooperatively facilitate DNA damage recognition and subsequent strand-specific DNA excision and repair. In E. coli, the known molecular steps involved in this process include binding of MutS to a mismatch, cleavage of the newly synthesized strand at a nearby up- or downstream hemi-methylated dGATC site, and directional excision (5′→3′ or 3′→5′) from the strand-specific nick up to and slightly beyond the mismatch. In eukaryotic cells, the strand discrimination signal is presumed to be a nick or gap in nascent DNA on the lagging DNA strand and the 3′ end on the leading strand at a nearby DNA replication fork. However, it remains unclear how MMR proteins facilitate communication between two physically distant DNA sites: the mismatch and the strand discrimination signal. Several alternative models have been proposed, which can be classified into stationary (or trans-) and moving (or cis-) models (Fig. 5), based on whether the model requires physical translocation of MutS proteins from the mismatch to the strand discrimination signal. The stationary model (Fig. 5, right) proposes that interactions among MMR proteins induce DNA loop formation that brings the mismatch and strand break together, while MutS proteins (i.e., MutS in E. coli and MutSα or MutSβ in eukaryotes) remain bound at the mismatch or DNA lesion (Junop et al. 2001). Support for the stationary model came from the following experiments: (1) recognition of a mismatch by MutS on one DNA molecule activates MutH cleavage of a hemi-methylated dGATC site on a separate DNA molecule without a mismatch and (2) mismatch-provoked excision in human cell extracts occurs even when a biotin-streptavidin blockade is placed between the mismatch and a preexisting nick. However, the interpretation of these data is controversial, because MutH can cleave a hemi-methylated dGATC sequence independently of MutS and MutL, and mismatch-provoked excision of nicked DNA in crude human cell extracts could be due to nonspecific nucleases.

Mismatch Repair, Fig. 5 Models for MMR signaling. Three models have been proposed to explain how MMR proteins mediate communication between a DNA lesion (a mismatch or a DNA adduct) and a distant strand discrimination signal on the daughter DNA strand. The models apply to both 3′ and 5′ nick-directed MMR. In the stationary model (right), MutS protein (e.g., MutSα) remains bound to the mismatch and interacts with other MMR proteins, including MutLα and EXO1. These interactions facilitate bending or looping of the DNA in between the mismatch and the strand discrimination signal, which brings the two distant sites together to initiate the excision reaction. In the “translocation” (left) and the “molecular switch” (middle) models, MutSα binds to the mismatch and then moves away from the mismatch to search for the strand discrimination signal. In the translocation model, ATP-dependent translocation of MutSα generates a loop that brings the mismatch and the strand break together. In the molecular switch model (middle), a MutSα-ADP complex binds to the mismatch and undergoes a conformational change that promotes ATP-ADP exchange and bidirectional sliding of the protein away from the mismatch. The sliding away from the mismatch allows another MutSα-ADP complex to bind the mismatch. Once MutSα reaches the strand discrimination signal, the excision reaction starts (Adapted from Li (2008))

There are two moving models, one called the “translocation” model (Allen et al. 1997) and the other called the “molecular switch” or “sliding clamp” model (Fishel 1998). The “translocation” model (Fig. 5, left) suggests that MutS proteins unidirectionally translocate along the DNA helix to reach a strand discrimination signal in either orientation in a manner dependent on ATP hydrolysis. This process forms an α-shaped loop. In contrast, the “molecular switch” model proposes bidirectional sliding of MutS proteins along the DNA helix in a reaction that requires ADP to ATP exchange. Although these two models differ in terms of the number of MutS proteins involved and the way in which MutS moves along the DNA helix, both models suggest that MutS binds to a mismatch/lesion and then moves away from the site to search for a strand break, where exonucleases are recruited to initiate excision. A recent study argues in favor of a moving model. In a minimal in vitro system containing E. coli MutS, MutL, and MutH, when a dsDNA break or a protein “roadblock” is placed between the mismatch and the strand discrimination signal, the hemi-methylated site is not incised and DNA excision does not initiate. This study directly conflicts with evidence supporting the stationary model. Therefore, additional thorough investigations are required to resolve this controversial question.

MMR Deficiency and Cancer

Awareness of the public health significance and scientific importance of DNA repair metabolism increased significantly in the early 1990s, when mutations in human MMR genes, especially MSH2, were linked to an inherited syndrome of colon cancer susceptibility (Fishel et al. 1993;
Kolodner 1996; Leach et al. 1993; Modrich and Lahue 1996; Parsons et al. 1993), known as hereditary nonpolyposis colorectal cancer (HNPCC). In addition, clinicians noticed that the cellular phenotype associated with HNPCC was characterized by genetic instability, as indicated and readily monitored in patient biopsies by frequent insertion and deletion mutations in simple repeat (microsatellite) sequences. The latter phenomenon is known as microsatellite instability (MSI). These results were rapidly confirmed in sporadic colon cancer and in yeast strains with single or double knockouts in MSH2, MLH1, or PMS1 (equivalent to PMS2 in mammalian cells). Biochemical studies also showed that cell extracts derived from MSI-positive tumor cells were completely defective in MMR. Thus, genetic and biochemical evidence converged to support the hypothesis that defective MMR plays a direct role in cancer susceptibility in humans. In the case of HNPCC, predisposing mutations in MMR genes appear to have the greatest impact on colorectal cancer risk, although a correlation between defective MMR and increased risk of other cancers is observed to a lesser degree. However, the tissue specificity of cancer risk associated with MMR deficiency remains poorly understood. It is worth mentioning that the vast majority of HNPCC patients carry mutations in MSH2 or MLH1, with few patients carrying mutations in PMS1, PMS2, and MSH6 but none yet identified with mutations in MSH3. These observations are consistent with the fact that MSH2 and MLH1 are obligate subunits of human MutS and MutL heterodimers, while MSH3, MSH6, PMS1, and PMS2 play important but partially redundant and/or dispensable roles in MMR. MSI has been observed in many tumors other than colorectal cancer, including endometrial, ovarian, gastric, cervical, breast, skin, lung, glioma, prostate, and bladder cancers as well as hematological malignancies such as leukemia and lymphoma. This result suggests that absent or low capacity for MMR could promote cellular transformation in many cell types, not only in proliferative colonic cells. Interestingly, in many sporadic non-colonic tumors with MSI, the MLH1 promoter is hypermethylated, an epigenetic
modification that inactivates MMR by suppressing expression of the MLH1 gene. Thus, defective MMR correlates with MSI and carcinogenesis in colonic and non-colonic, sporadic and inherited cancers with mutations or epigenetic modifications in MMR genes. Knockout mice carrying homozygous null mutations in MSH2, MSH3, MSH6, MLH1, MLH3, PMS1, PMS2, or EXO1 have been generated and characterized (Buermeyer et al. 1999). Most of these MMR-defective strains are MSI positive and cancer prone and demonstrate a mutator phenotype. However, the primary cancer susceptibility of MSH2, MLH1, and PMS2 knockout mice is lymphoma, not colorectal cancer as in humans, with secondary cancer susceptibilities in the mice being gastrointestinal tumors, skin neoplasms, and/or sarcomas. MSH2 knockout mice are fertile and MSI positive, develop lymphoma within 1 year of age, and have a significantly shorter life span than wild-type mice (i.e., 50% mortality by 6 months of age). MSH6 knockout mice have a similar phenotype; however, tumors appear with longer latency and MSI is weak or absent. MSH3 knockout mice have a repair defect but are not cancer prone. In contrast, MLH1, MLH3, and PMS2 knockout mice are sterile and susceptible to cancer and display genomic instability. PMS1 knockout mice are fertile, lack cancer susceptibility, and appear to be MSI negative. EXO1 defective mice are also sterile. It is clear that the loss of fertility in these knockout mice is caused by abnormal meiosis. These studies show that MMR plays a tissue- and species-specific role in preventing carcinogenesis, and this specificity is poorly understood. The role of MMR proteins during meiosis in humans is also poorly characterized. However, MMR clearly plays a critical role during meiosis and/or gamete formation in mice.

Future Outlook

The field of MMR might seem a minor subject in the context of cellular nucleic acid metabolism. However, at the time of writing this chapter, studies of and knowledge about MMR have significantly changed how we think about and understand many aspects of nucleic acid metabolism, as well as making an impact on human clinical and non-clinical studies of carcinogenesis, mutagenesis, and genetic instability. As mentioned above, MMR pathways and enzymes in eukaryotic cells are strikingly complex when compared to prototypical prokaryotic MMR, and such complexity reflects novel, previously unrecognized functions for eukaryotic MMR proteins. Even a simple list of such functions is long, and the putative novel roles for eukaryotic MMR proteins include pairing and segregation of meiotic chromosomes, homologous recombination, chromatin remodeling, chromatin structure, and somatic hypermutation; additional functions for MMR proteins may remain to be discovered, and much work is yet to be done to understand these functions in detail. Another relatively poorly understood area is repair of heteroduplex DNA in the context of chromatin (Li 2014) and in the mitochondrion, a membrane-bound subcellular organelle in eukaryotic cells that contains its own DNA and plays critical roles in cellular metabolism. It appears that exciting discoveries may continue to emerge from the field of MMR for many years to come.

Foundational Concepts

1. DNA mismatch – Non-Watson-Crick base pairs, i.e., non-G:C or non-A:T DNA base pairs or small insertion-deletion (unpaired) nucleotides in duplex DNA. Duplex DNA containing mismatches and/or insertion-deletion mispairs is called heteroduplex DNA; Watson-Crick base-paired DNA is called homoduplex DNA.
2. Mutator phenotype – Cells that accumulate heritable genomic mutations at a rate significantly higher than a wild-type cell in the absence of selection and the absence of an exogenous source of DNA damage.
3. Replication fidelity – The mechanisms that ensure faithful production of two identical daughter chromosomes from a single parental chromosome.
4. Hereditary nonpolyposis colorectal cancer (HNPCC) – An autosomal dominant inherited human disease syndrome characterized by genomic instability, colorectal cancer susceptibility, and defective MMR.
5. Microsatellite and microsatellite instability – A microsatellite is defined as a locus (or a region within DNA sequences) where short sequences of DNA are repeated in tandem arrays, i.e., the sequences are repeated one right after the other, e.g., (A)n, (CA)n, or (CAG)n. Microsatellite instability (MSI) occurs when a microsatellite shows higher than normal copy number variation in a population of dividing cells or tissue sample. MSI is attributed to increased frequency of contraction or expansion events in tandem repeated sequences during DNA replication.
6. DNA damage signaling – The ability of DNA damage to provoke changes in cell cycle progression, cell proliferation, and/or rate of cell death. A complex protein network mediates such signaling in eukaryotic cells.
7. Apoptosis – An active process that allows a cell that is old, unhealthy, or severely damaged to commit suicide. The process is usually associated with loss of cell membrane asymmetry, loss of attachment to surfaces or other cells, cell shrinkage, chromatin condensation, and chromosomal DNA fragmentation.
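The definition of microsatellite instability given above can be made concrete with a short sketch. It counts the copies of a repeat unit in each sequence read from a locus and flags the locus when the tumor sample contains repeat lengths that are absent from the matched normal sample. The function names and the example reads are hypothetical, and the simple "any novel allele length" criterion is an assumption for illustration; clinical MSI testing relies on defined marker panels and scoring rules that are not reproduced here.

```python
import re


def repeat_copies(seq, unit):
    """Number of copies in the longest uninterrupted tandem run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)


def is_unstable(normal_reads, tumor_reads, unit):
    """True if the tumor shows repeat lengths not seen in normal tissue."""
    normal = {repeat_copies(read, unit) for read in normal_reads}
    tumor = {repeat_copies(read, unit) for read in tumor_reads}
    return bool(tumor - normal)


# Hypothetical (CA)n locus: the tumor has gained a contracted (CA)3 allele.
normal = ["GGCACACACATT"]
tumor = ["GGCACACACATT", "GGCACACATT"]
print(repeat_copies(normal[0], "CA"))    # copies in the normal allele
print(is_unstable(normal, tumor, "CA"))  # novel allele length present?
```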

Cross-References

▶ DNA Damage, Types of
▶ DNA Replication
▶ DNA Replication, Chemical Biology of
▶ Plasmid Incompatibility

References

Allen DJ, Makhov A, Grilley M, Taylor J, Thresher R, Modrich P, Griffith JD (1997) MutS mediates heteroduplex loop formation by a translocation mechanism. EMBO J 16:4467–4476
Buermeyer AB, Deschenes SM, Baker SM, Liskay RM (1999) Mammalian DNA mismatch repair. Annu Rev Genet 33:533–564
Fishel R (1998) Mismatch repair, molecular switches, and signal transduction. Genes Dev 12:2096–2101
Fishel R, Lescoe MK, Rao MR, Copeland NG, Jenkins NA, Garber J, Kane M, Kolodner R (1993) The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 75:1027–1038
Friedberg EC, Walker GC, Siede W, Wood RD, Schultz RA, Ellenberger T (2006) DNA repair and mutagenesis. ASM Press, Washington, DC
Holliday RA (1964) A mechanism for gene conversion in fungi. Genet Res 5:282–304
Jiricny J (2006) The multifaceted mismatch-repair system. Nat Rev Mol Cell Biol 7:335–346
Junop MS, Obmolova G, Rausch K, Hsieh P, Yang W (2001) Composite active site of an ABC ATPase: MutS uses ATP to verify mismatch recognition and authorize DNA repair. Mol Cell 7:1–12
Kolodner R (1996) Biochemistry and genetics of eukaryotic mismatch repair. Genes Dev 10:1433–1442
Kunkel TA, Erie DA (2005) DNA mismatch repair. Annu Rev Biochem 74:681–710
Lamers MH, Perrakis A, Enzlin JH, Winterwerp HH, de Wind N, Sixma TK (2000) The crystal structure of DNA mismatch repair protein MutS binding to a G·T mismatch. Nature 407:711–717
Leach FS, Nicolaides NC, Papadopoulos N, Liu B, Jen J, Parsons R, Peltomaki P, Sistonen P, Aaltonen LA, Nystrom LM et al (1993) Mutations of a mutS homolog in hereditary nonpolyposis colorectal cancer. Cell 75:1215–1225
Li GM (2008) Mechanisms and functions of DNA mismatch repair. Cell Res 18:85–98
Li GM (2014) New insights and challenges in mismatch repair: Getting over the chromatin hurdle. DNA Repair (Amst) 19:48–54. https://doi.org/10.1016/j.dnarep.2014.03.027
Lu A-L, Clark S, Modrich P (1983) Methyl-directed repair of DNA base pair mismatches in vitro. Proc Natl Acad Sci U S A 80:4639–4643
Modrich P, Lahue R (1996) Mismatch repair in replication fidelity, genetic recombination, and cancer biology. Annu Rev Biochem 65:101–133
Obmolova G, Ban C, Hsieh P, Yang W (2000) Crystal structures of mismatch repair protein MutS and its complex with a substrate DNA. Nature 407:703–710
Parsons R, Li GM, Longley MJ, Fang WH, Papadopoulos N, Jen J, de la Chapelle A, Kinzler KW, Vogelstein B, Modrich P (1993) Hypermutability and mismatch repair deficiency in RER+ tumor cells. Cell 75:1227–1236
Pukkila PJ, Peterson J, Herman G, Modrich P, Meselson M (1983) Effects of high levels of DNA adenine methylation on methyl-directed mismatch repair in Escherichia coli. Genetics 104:571–582
White RL, Fox MS (1975) Genetic consequences of transfection with heteroduplex bacteriophage lambda DNA. Mol Gen Genet 141:163–171
Wildenberg J, Meselson M (1975) Mismatch repair in heteroduplex DNA. Proc Natl Acad Sci U S A 72:2202–2206
Williams JS, Kunkel TA (2014) Ribonucleotides in DNA: origins, repair and consequences. DNA Repair (Amst) 19:27–37. https://doi.org/10.1016/j.dnarep.2014.03.029
Witkin EM (1964) Pure clones of lactose-negative mutants obtained in Escherichia coli after treatment with 5-bromouracil. J Mol Biol 8:610–613

Mitochondrial Genomes

Michael W. Gray
Department of Biochemistry and Molecular Biology, Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, NS, Canada

Abbreviations

CMS      Cytoplasmic male sterility
cob      Gene encoding apocytochrome b of ETC Complex III
cox1     Gene encoding subunit 1 of cytochrome c oxidase (Complex IV)
DNAP     DNA polymerase
DNase    Deoxyribonuclease
EGT      Endosymbiotic gene transfer
ETC      Electron transfer chain
kb       Kilobase pair
kDNA     Kinetoplast DNA (in kinetoplastid protozoa)
LSU      Large subunit
mRNA     Messenger RNA
MRO      Mitochondrion-related organelle
mtDNA    Mitochondrial DNA
nuDNA    Nuclear DNA
ORF      Open reading frame
RNAP     RNA polymerase
rnl      Large subunit rRNA gene
rns      Small subunit rRNA gene
rRNA     Ribosomal RNA
ssRNAP   T3/T7 phage type single-subunit RNA polymerase
SSU      Small subunit
trn      Transfer RNA gene
tRNA     Transfer RNA
UTL      Untranslated leader

Synopsis

When considered from a phylogenetic perspective, mitochondrial genomes display tremendous variability in just about every characteristic: size, physical form, genome organization, gene content, replication, gene expression, and evolutionary pattern. This variability is perplexing, from both functional and evolutionary viewpoints. Functionally, throughout the broad range of eukaryotes, mitochondria retain a basic set of functions, notably including ATP generation through coupled electron transport–oxidative phosphorylation. This conserved function is evidently able to be served by an incredibly diverse range of mitochondrial DNA (mtDNA) gene content and mode of expression. Models of mitochondrial genome evolution must come to grips with the fact that although mitochondrial genomes in diverse eukaryotic lineages may look nothing alike, they all descend from a common bacteria-like ancestor. Evidently, very different evolutionary forces operate in these diverse lineages. Mutations in mtDNA are now known to cause a variety of specific human mitochondrial diseases, and mitochondrial mutations are increasingly being considered in wider aspects of human health, including aging and cancer.

Introduction

The recognition that mitochondria contain DNA (i.e., a genome) had profound implications for subsequent research into mitochondrial biogenesis and function, as well as for hypotheses about the evolutionary origin of this vital eukaryotic organelle. In the first place, the fact that essential mitochondrial functions are encoded in mitochondrial DNA (mtDNA) means that mitochondrial biogenesis requires coordinate expression of two genomes, nuclear as well as mitochondrial, each specifying protein components (and sometimes RNAs) required for formation of a functional mitochondrion. Further, the presence of essential genes in both mtDNA and nuclear DNA (nuDNA) strongly suggested that mitochondrial dysfunction (and mitochondrial diseases) might be caused by mutations in either genome. Finally, the presence of mtDNA-encoded genes offered the prospect, through sequence determination and comparison, of elucidating their evolutionary origin, and that of the genome encoding them, thereby providing solid molecular data with which to test competing hypotheses about the evolutionary origin of mitochondria. The ever-expanding body of information about mitochondrial genome structure, function, and evolution has been comprehensively chronicled in a number of books and review articles published during the past three decades (Wolstenholme and Jeon 1992; Gillham 1994; Boore 1999; Burger et al. 2003; Gray et al. 1998, 2004). This subsection takes a phylogenetic approach to describing mitochondrial genome structure, expression, and evolution, with citations to original work emphasizing the more recent literature. Various authors (cited throughout) provide descriptions of key characteristics of mitochondrial genomes in major eukaryotic groups, focusing particularly on more recent work in this area. The overview that follows provides a broad look at mitochondrial genome size, physical form, organization, gene content, expression, and evolution, emphasizing the dictum that as far as mitochondrial genomes are concerned, “anything goes” (Burger et al. 2003).

Discovery of Mitochondrial DNA

In 1963, Nass and Nass published the first evidence for the presence in mitochondria (chick embryo) of material staining as DNA and sensitive to deoxyribonuclease (DNase). Subsequently, in 1964, Luck and Reich showed that mtDNA and nuDNA from the fungus Neurospora crassa exhibited different buoyant density properties during ultracentrifugation in cesium gradients, indicative of differences in base compositions, a finding Rabinowitz et al. confirmed in 1965
using DNA preparations from chick heart and liver. Throughout eukaryotes, mtDNA and nuDNA in the same organism often differ substantially in base composition (and hence buoyant density), and this property has proven to be extremely useful in the separation of small amounts of mtDNA from much larger amounts of nuDNA when mtDNA is isolated from total cellular DNA. The overall base composition of mtDNA varies widely depending on the organism, from over 50% in some animals (birds) to less than 20% in some fungi and protists (eukaryotic microbes). Marked compositional heterogeneity may exist within a given mitochondrial genome: for instance, in the ascomycete yeast Saccharomyces cerevisiae, roughly half of the mtDNA comprises A + T-rich spacers having a G + C content 130 in total). At the other extreme, plant mitoribosomes have sedimentation values and SSU and LSU rRNA sizes that more closely approximate those of cytoplasmic 80S ribosomes, due to large insertions in variable regions of the otherwise highly bacteria-like rRNAs. The most bacteria-like rRNAs in size and sequence are found within protists, such as R. americana, that harbor minimally derived (ancestral) mitochondrial genomes. In spite of these marked structural differences, mitoribosomes are functionally alike in being differentially sensitive to the antibiotic chloramphenicol and resistant to cycloheximide, the latter a selective inhibitor of cytoplasmic ribosome function. Like mitochondrial rRNAs, mitochondrial tRNAs display considerable variation in structure, with minimal deviations from the canonical cloverleaf secondary structure in many cases but striking alterations (involving absence of one or more cloverleaf arms) and departure from conventional helical structure in others. 
Nevertheless, tertiary structure modeling has demonstrated that even the most radically diverged mitochondrial tRNA sequences can assume a higher-order structure that maintains the critical distance between anticodon and tRNA 3′ end (the site of aminoacylation) required for the ribosome-mediated peptidyl transferase reaction in protein synthesis. The detailed biochemical mechanism of translation has been investigated in only a few systems, notably mammals and yeast (S. cerevisiae). The results of these studies reveal very different modes of translation initiation. In mammals,
mitochondrial mRNAs are basically leaderless, with little or no 5′ untranslated leader (5′ UTL) sequence, so that the initiation codon is very close to or indeed right at the 5′ end of the message. The 5′ UTL normally provides an entry point for the small ribosomal subunit during the initiation phase of translation; in bacteria, this occurs via the Shine–Dalgarno sequence that pairs with a complementary sequence at the 3′ end of the SSU rRNA. In yeast, on the other hand, mitochondrial mRNAs contain unusually long 5′ UTLs that interact with a host of nucleus-encoded, mRNA-specific translation factors that are required for proper translation of the mRNA to which they bind. Shine–Dalgarno motifs are not evident in the highly A + U-rich 5′ UTL sequences of yeast mitochondrial mRNAs, but intriguingly they do appear in the expected position just upstream of protein-coding ORFs in the ancestral R. americana mtDNA sequence (Lang et al. 1997). Initiation codons in addition to the standard AUG are utilized during translation in a number of mitochondrial systems. For example, in human mitochondria, both AUG and AUA are recognized as initiation codons, while nonstandard initiation codons of the form AUN or NUG are employed in the ciliate protozoon, Tetrahymena pyriformis. As in mammals, T. pyriformis mitochondrial mRNAs have little or no 5′ UTL (Ref. 92 in Gray et al. (2004)). Although in most cases the structural basis for the expanded recognition of initiation codons is unknown, in the case of human mitochondria, this relaxation of specificity has been attributed to a modified nucleoside at the first (wobble) position of the anticodon of the mitochondrial initiator tRNAMet, which permits anticodon pairing with AUA as well as AUG. Translation termination also shows peculiarities in various mitochondrial systems. Following posttranscriptional processing, a number of mammalian mitochondrial mRNAs lack a complete termination codon, ending instead in U or UA. Polyadenylation of these transcripts restores a complete UAA termination codon. Initially, it appeared that the arginine codons AGA and AGG had been reassigned as termination codons in human mitochondria; however, recent work has
demonstrated that a −1 frameshift at the position of these codons allows translation to terminate at standard UAG codons. On the basis of sequence comparisons, standard sense codons have been proposed to function as mitochondrial translation termination codons in various eukaryotes, e.g., UCA (normally serine) in the green alga, Scenedesmus obliquus (Refs. 167, 209 in Gray et al. (2004)). Conversely, as noted earlier, the standard termination codon UGA is frequently reassigned as a second tryptophan codon in mitochondria. In dinoflagellates, comparison of mitochondrial DNA and RNA sequences demonstrates that some protein-coding transcripts lack identifiable stop codons altogether. How translation in these cases is terminated remains a mystery (see ▶ “Mitochondrial Genomes in Alveolates”).
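Two of the peculiarities described above, codon reassignment and completion of a truncated stop codon by polyadenylation, can be illustrated with a minimal sketch. The codon table is the standard genetic code, modified only by reading UGA as tryptophan and AUA as methionine, as in human mitochondria; the transcript, the function names, and the poly(A) tail length are hypothetical and chosen for illustration only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}
# Reassignments discussed in the text: UGA read as Trp, AUA read as Met.
MITO = dict(STANDARD, UGA="W", AUA="M")


def polyadenylate(mrna, tail=10):
    """Append a poly(A) tail (the tail length is an arbitrary choice)."""
    return mrna + "A" * tail


def translate(mrna, code):
    """Translate from position 0; return (protein, stop_codon_found)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = code[mrna[i:i + 3]]
        if aa == "*":
            return "".join(protein), True
        protein.append(aa)
    return "".join(protein), False


# Hypothetical leaderless transcript ending in an incomplete stop codon (UA).
mrna = "AUAUGAGGCUA"
print(translate(mrna, MITO))                 # no complete stop codon
print(translate(polyadenylate(mrna), MITO))  # poly(A) completes UAA
print(translate(mrna, STANDARD))             # standard code stops at UGA
```

Read with the standard code, the same message would terminate prematurely at UGA, which illustrates why such code differences matter when mitochondrial genes are expressed outside the organelle.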

Evolution of Mitochondrial Genomes

Mitochondrial genomes fall into three broad categories, which have been referred to as ancestral, reduced derived, and expanded. Ancestral mitochondrial genomes are gene-rich and relatively slowly evolving. They bear a high degree of similarity to bacterial genomes in terms of gene sequence, gene organization (e.g., similar operon-like structures), and rRNA and tRNA secondary structures, as well as (often) presence of a 5S rRNA gene. The most ancestral (least derived) mitochondrial genomes, those of jakobid flagellates, exhibit additional primitive characteristics, such as extra protein-coding genes, including especially rpo genes specifying a bacteria-like RNAP, and genes encoding other RNA species characteristic of bacteria, such as RNase P RNA and tmRNA (Table 1). These genomes tend to evolve by gradual loss of genes as well as by wholesale changes in gene order. Reduced derived mitochondrial genomes (exemplified, e.g., by animal mtDNAs) are characterized by extensive gene loss, often accompanied by extreme genome streamlining (severe reduction in length of intergenic spacer sequences, compaction of coding sequences, truncation of rRNA and tRNA sequences). Reduced derived mtDNAs tend to be relatively rapidly
evolving at the sequence level. Fragmentation of genes, both protein and RNA, and scrambling of the subgenomic fragments are often observed (see ▶ “Mitochondrial Genomes of Green, Red and Glaucophyte Algae”). Mitochondrial genomes in this category tend to exhibit a high ratio of coding to noncoding sequence. The large mtDNAs found in land plants exemplify the expanded type of mitochondrial genome. Despite their large size, plant mitochondrial genomes contain even fewer genes than the much smaller ancestral mtDNAs. Mitochondrial gene sequences are generally slowly diverging in land plants, more slowly than in any other eukaryotes (but see Sloan et al. 2012 for exceptions), whereas gene order is exceptionally fluid in plant mtDNA, even varying between different cultivars of the same plant species. Large mitochondrial genomes such as those of land plants are characterized by extensive noncoding sequence of uncertain origin, often enriched in repeated sequences of varying size. Thus, plant and other expanded mtDNAs mostly consist of noncoding DNA. Given that the mitochondrial genome is thought to have evolved from a much larger bacterial progenitor possessing a much more extensive gene repertoire, it is evident that wholesale elimination of genes from mtDNA has occurred during its evolution. Many of the genes initially assumed to have been present appear to have been lost entirely, but other genes have been functionally relocated to the nuclear genome through EGT. This process requires not only “movement” of the sequence in question to the nucleus but also the provision of the appropriate signals for correct nuclear gene expression, as well as information for targeting the nucleus-encoded protein product back into the mitochondrion, either via an N-terminal extension that provides a mitochondrial targeting peptide or through (an) internal targeting signal(s).
EGT may be “phylogenetically complete” in the sense that the gene for a given mitochondrial protein (e.g., cytochrome c) is now encoded in the nuclear genome in all eukaryotes; however, in other cases, EGT is “incomplete” in that genes that are now nuDNA-encoded in some organisms are still mtDNA-encoded in others.

Genetic code differences between the mitochondrial and nucleocytoplasmic systems in a given organism would seem to preclude continuing EGT, so that in these cases it would appear that the gene content of the mtDNA in question is now effectively locked in (“frozen”). In land plants, no such barrier to EGT exists because the standard genetic code is used in all three cellular compartments (nucleocytoplasmic, mitochondrial, and plastid). In fact, mitochondrion-tonucleus EGT is an on-going process in the land plant lineage, with numerous examples in angiosperms (flowering plants) having been documented over the past several decades. How EGT occurs is not entirely clear, but because plant mitochondrial mRNAs undergo C-to-U RNA editing during their posttranscriptional maturation process, it has been possible in these cases to tell whether or not EGT has involved direct transfer of mtDNA to the nucleus. Although sequencing of plant nuclear genomes has provided examples of such DNA-to-DNA transfer, these transferred mtDNA sequences appear to be nonfunctional. Rather, functional EGT of a plant mitochondrial protein-coding gene involves reverse transcription of the corresponding edited mRNA, followed by incorporation of the resulting cDNA into the nuclear genome. A notable aspect of mitochondrion-to-nucleus EGT is the observation that a given gene may be transferred in pieces to the nucleus, rather than as a single entity. In these cases, the subgenomic pieces end up as separately expressed genes within the nuclear genome; their protein products are thought to be independently targeted to and imported into the mitochondrion, where they interact in trans to provide the same function as their covalently continuous counterparts in other eukaryotes. In these cases, fragmentation of the mtDNA-encoded gene presumably occurs first, followed by separate EGT of the individual pieces. 
In some cases, transfer is not complete, so that one portion of the fragmented gene is moved to and expressed from the nuclear genome, while the other portion remains in and continues to be expressed from the mitochondrial genome. A recently described case of this type involves the cox1 gene, which in a number of eukaryotes has undergone gene fission, with a resulting small C-terminal portion being relocated to the nuclear genome from where it is now expressed (Gawryluk and Gray 2010).
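
The editing-based test for the transfer route in land plants, described above, can be sketched in Python. At positions where the mitochondrial mRNA is edited from C to U, a nuclear copy derived from an edited transcript would be expected to carry T, whereas a copy transferred directly as mtDNA would still carry C. The function name, the toy sequences in the usage note, and the simple majority rule are hypothetical illustrations, not a published method.

```python
def classify_transfer(mt_genomic: str, mt_edited_cdna: str, nuclear_copy: str) -> str:
    """Guess the transfer route of a nuclear gene copy from its state at editing sites.

    Assumption: the three sequences are pre-aligned, equal-length DNA strings.
    The edited mRNA is supplied as cDNA, so an edited U appears as T.
    """
    if not (len(mt_genomic) == len(mt_edited_cdna) == len(nuclear_copy)):
        raise ValueError("sequences must be aligned to the same length")

    # Editing sites: a genomic C that reads as T in the edited transcript.
    sites = [
        i
        for i, (g, e) in enumerate(zip(mt_genomic, mt_edited_cdna))
        if g == "C" and e == "T"
    ]
    if not sites:
        return "uninformative"  # no editing sites, so the test cannot discriminate

    edited = sum(nuclear_copy[i] == "T" for i in sites)
    unedited = sum(nuclear_copy[i] == "C" for i in sites)

    # Hypothetical decision rule: a simple majority of informative sites.
    if edited > unedited:
        return "RNA-mediated"
    if unedited > edited:
        return "DNA-mediated"
    return "ambiguous"
```

For example, with a toy genomic sequence `ATGCCACCA` whose edited transcript reads `ATGTCATCA`, a nuclear copy identical to the transcript would be classified as RNA-mediated and one identical to the genomic sequence as DNA-mediated.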

Mitochondrial Genomes in Mitochondrion-Related Organelles (MROs)

In a number of anaerobic protists, conventional mitochondria are absent, replaced instead by derived mitochondria termed “mitochondrion-related organelles” (MROs) (Gray 2011). MROs are of two basic types, distinguished by whether or not they are able to generate ATP. Hydrogenosomes completely lack a classical electron transport chain, instead synthesizing ATP via substrate-level phosphorylation and releasing hydrogen in the process. A large number of pathways typical of aerobic mitochondria are absent from hydrogenosomes, notably including DNA replication, transcription, and translation, since these organelles completely lack a genome. Another DNA-lacking, even more derived type of MRO is the mitosome, which retains no ATP-generating capacity at all.

Recently, organelles intermediate between conventional mitochondria and DNA-deficient MROs have been characterized. In these cases, a genome is still present, encoding many of the normal functions of mtDNA; notably, however, genes encoding subunits of respiratory complexes III, IV, and V are completely absent, so that only a partial electron transport chain (comprising complexes I and II) is present in these intermediate MROs. Even though these MRO genomes retain many more genes than the smallest mtDNAs from aerobic eukaryotes, mitochondria of the latter nevertheless have a complete mitochondrial electron transport chain and carry out coupled oxidative phosphorylation. These variations in mtDNA gene content and function emphasize the dynamic and versatile nature of mitochondrial genome evolution, which may proceed to a point at which this genome disappears entirely.


What Mitochondrial Genomes Tell Us About the Origin of Mitochondria

A bacterial, endosymbiotic origin of mitochondria has become a widely accepted theory in cell biology (Gray 1999, 2012b and references therein). Although this theory dates to the early part of the twentieth century, it was only firmly established after the discovery of mtDNA in the 1960s, when it was recognized that the mitochondrial genome had a coding function. While efforts to identify and sequence mitochondrial genes were initially directed at defining the genetic role of mtDNA and studying the function(s) of individual genes, it was soon realized that these genes might contain clues to their evolutionary origin, which indeed proved to be correct. Such evolutionary-focused studies eventually took a comparative mitochondrial genomics approach, based on complete sequencing of diverse mtDNAs.

Data arguing in favor of a single, bacterial origin of the mitochondrial genome are of three sorts (Gray et al. 1999). First, the genes encoded by diverse mtDNAs are a subset of those encoded by the most ancestral mitochondrial genomes (Table 1); that is, mitochondrial genomes across the spectrum of eukaryotes always encode more or less the same small set of genes. Assuming that the mitochondrial genome originated from a much more gene-rich bacterial ancestor, it is highly improbable that the genomes of independently acquired endosymbionts would have undergone severe genome reduction leading to retention of precisely the same limited array of genes.

Second, plant and many ancestral protist mitochondrial genomes retain vestiges of operon-like structure characteristic of the same gene clusters in bacteria (e.g., ribosomal protein operons). Tellingly, however, the mitochondrial gene clusters lack a number of the genes that are present in the bacterial operons, due to translocation to other parts of the mtDNA or EGT to the nucleus.
For example, in the mitochondrial version of the eubacterial S10 operon, comprising a cluster of 11 ribosomal protein genes, the same six genes (rpl3-rpl4-rpl23, rpl22, rpl29-rps17) are missing in those characterized mitochondrial genomes that encode clustered ribosomal protein genes. This pattern, encompassing a range of mitochondrial genomes, argues strongly that the mitochondrial genome that was ancestral to contemporary ones already contained these characteristic gene deletions, pointing again to a single origin of the mitochondrial genome.

Third, phylogenetic trees reconstructed from alignments of mitochondrial protein and rRNA sequences not only affirm the common origin (monophyly) of these sequences but also clearly point to an origin of mitochondria from within a particular lineage of bacteria, α-Proteobacteria. Within this group, mitochondria appear to be most closely related to Rickettsiales, the α-proteobacterial group that includes obligate intracellular parasites such as Rickettsia, Anaplasma, and Ehrlichia. Thus, analysis of mitochondrial genome sequence and structure has proven to be a powerful tool for uncovering the mitochondrion’s closest extant bacterial relatives.

Sequencing of minimally derived mitochondrial genomes, particularly those of jakobid flagellates, has considerably bolstered the inference of a bacterial ancestry of the mitochondrial genome. In addition to the characteristics summarized above, jakobid mtDNA exhibits a number of bacterial features, such as highly bacteria-like rRNA and RNase P RNA secondary structures, bacteria-like rpo genes, and the presence of Shine–Dalgarno-like motifs upstream of protein-coding genes, all of which justify the description of the jakobid mitochondrial genome as an “ancestral mitochondrial DNA resembling a eubacterial genome in miniature” (Lang et al. 1997).
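
The S10 operon argument above can be made concrete with a small Python sketch. The six missing genes are taken from the text; the full 11-gene order is an assumption (the Escherichia coli S10 operon order, written with organelle-style gene names), and the function names are hypothetical. The sketch lists the genes that remain in the mitochondrial cluster and groups the missing ones into contiguous deletion blocks.

```python
from itertools import groupby

# Assumption: E. coli S10 operon gene order, written with organelle-style names.
S10_OPERON = [
    "rps10", "rpl3", "rpl4", "rpl23", "rpl2", "rps19",
    "rpl22", "rps3", "rpl16", "rpl29", "rps17",
]

# The six genes the text reports as missing from mitochondrial clusters.
MISSING = {"rpl3", "rpl4", "rpl23", "rpl22", "rpl29", "rps17"}


def retained_genes(operon, missing):
    """Genes still present in the cluster, in their original order."""
    return [gene for gene in operon if gene not in missing]


def deletion_blocks(operon, missing):
    """Group the missing genes into runs that are contiguous in the operon."""
    blocks = []
    for is_missing, run in groupby(operon, key=lambda gene: gene in missing):
        if is_missing:
            blocks.append(list(run))
    return blocks
```

Under the assumed gene order, the missing genes fall into three blocks that match the grouping given in the text (rpl3-rpl4-rpl23, rpl22, rpl29-rps17). Finding the same three blocks in independently sequenced mitochondrial genomes is the kind of shared pattern the single-origin argument relies on.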

Mitochondrial Genome Mutations

As in any other genome, heritable mutations may arise and be fixed in mtDNA as a result of DNA damage, particularly in the form of base substitutions and deletions. Mitochondrial DNA is especially vulnerable to the mutagenic effects of reactive oxygen species, generated in the course of oxidative phosphorylation. In some instances, intragenomic recombination involving repeated segments of the genome may lead to deletion of relatively large parts of the mtDNA, including essential protein-coding, rRNA, and/or tRNA genes, giving rise to a respiratory-deficient phenotype. Such mutants are viable in organisms (such as S. cerevisiae) that are able to grow on fermentable carbon substrates, generating ATP through glycolysis. In obligately aerobic organisms, mutated, nonfunctional mtDNA molecules coexist with their wild-type counterparts (a state termed “heteroplasmy”); in this case, depending on the ratio of mutant to wild-type molecules, the deleterious effects of the mutations become manifest only under particular growth conditions or in selected tissues.

The yeast petite mutation, mentioned earlier, was the first mtDNA mutation to be discovered. Petite (ρ−) mutants exhibit deletions of wild-type (ρ+) mtDNA, while ρ0 mutants lack mtDNA entirely. Petite mutants arise spontaneously and at high frequency (1–2% per cell in each generation) as a result of recombination between small, directly repeated sequences distributed throughout the yeast mitochondrial genome. Such recombination deletes blocks of essential genes, inactivating mitochondrial translation, electron transport, and/or coupled oxidative phosphorylation. Petite mutants derive their name from the small (“petite”) colonies they form relative to wild-type cells when grown on solid medium containing glucose. In another fungus, N. crassa, the slow-growth, respiratory-deficient poky mutant also has a deletion defect in mtDNA, in this instance one that compromises formation of mitochondrial ribosomes, with mitochondrial translation being severely affected as a result.

In flowering plants, recombination mediated by direct repeats not only plays a major role in rearranging the mitochondrial genome over evolutionary time but also underlies the phenomenon of cytoplasmic male sterility (CMS).
The phenotypic effects of CMS mutations are often manifested through the creation of novel ORFs, whose protein products have a deleterious effect on mitochondrial function. A remarkable feature of CMS mutations is their tissue specificity, in that they uniquely seem to affect and inactivate the pollen-producing cells of plants. For that reason, CMS has been widely exploited by plant breeders in the production of hybrid seed.

Finally, mutations in human mtDNA, both point mutations and deletions, are responsible for a number of mitochondrial diseases (Wallace 1999). These mutations affect both protein-coding and tRNA genes, the latter usually having more severe phenotypic consequences because of their generalized effect on mitochondrial protein biosynthesis. Here again, the phenomenon of heteroplasmy comes into play, with both mutant and normal mtDNA molecules coexisting within a cell. Research into human mitochondrial diseases has led to the recognition of a “threshold effect,” whereby a phenotypic change is not apparent until mutant mtDNA molecules reach a sufficiently high proportion (often more than 90%) to compromise the overall bioenergetic capacity of the cell. Increasingly, in addition to specific mitochondrial syndromes, mtDNA mutations are being considered as potential contributors to wider aspects of human health, including aging and cancer.
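
The threshold effect described above is, at its simplest, a comparison between the mutant fraction of a heteroplasmic cell and a cutoff. The following Python sketch illustrates that arithmetic. The function names are hypothetical, and the default threshold of 0.9 is only an illustrative value taken from the "often more than 90%" figure in the text; real thresholds depend on the mutation and the tissue.

```python
def mutant_fraction(mutant_copies: int, wild_type_copies: int) -> float:
    """Proportion of mtDNA molecules in a cell that carry the mutation."""
    total = mutant_copies + wild_type_copies
    if total == 0:
        raise ValueError("cell contains no mtDNA copies")
    return mutant_copies / total


def phenotype_expected(mutant_copies: int, wild_type_copies: int,
                       threshold: float = 0.9) -> bool:
    """True if the mutant fraction exceeds the (illustrative) threshold."""
    return mutant_fraction(mutant_copies, wild_type_copies) > threshold
```

Under this simplification, a cell with 950 mutant and 50 wild-type copies would be expected to show a phenotype, whereas a cell with equal numbers of each would not.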

Conclusions

In just about every respect (size, physical form, genetic function, genome organization, replication, gene expression, and evolutionary pattern), mitochondrial genomes vary tremendously. Yet despite this variability, throughout the broad range of eukaryotes, mitochondria retain a basic set of functions, notably including coupled electron transport–oxidative phosphorylation. This dichotomy emphasizes the amazing structural and functional flexibility of the mitochondrial genome. Perhaps the relatively small number of genes encoded in mtDNA relaxes constraints on genome structure, function, and evolution. A phylogenetic perspective on mitochondrial genomes, where “anything goes” (Burger et al. 2003), emphasizes that the mitochondrion is not only a key eukaryotic organelle with a fascinating evolutionary history but also a site of unparalleled molecular biological experimentation.


Cross-References

▶ Mitochondrial Genomes in Alveolates
▶ Mitochondrial Genomes of Excavata
▶ Mitochondrial Genomes in Fungi
▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae
▶ Mitochondrial Genomes in Invertebrate Animals
▶ Mitochondrial Genomes in Land Plants
▶ Mitochondrial Genomes in Unicellular Relatives of Animals
▶ Mitochondrial Genomes in Vertebrate Animals

References

Alfonzo JD, Söll D (2009) Mitochondrial tRNA import – the challenge to understand has just begun. Biol Chem 390:717–722
Allen JF (2003) The function of genomes in bioenergetic organelles. Philos Trans R Soc Lond B Biol Sci 358:19–38
Bendich AJ (2010) The end of the circle for yeast mitochondrial DNA. Mol Cell 39:831–832
Bonawitz ND, Clayton DA, Shadel GS (2006) Initiation and beyond: multiple functions of the human mitochondrial transcription machinery. Mol Cell 24:813–825
Boore JL (1999) Animal mitochondrial genomes. Nucleic Acids Res 27:1767–1780
Bullerwell CE, Burger G, Gott JM, Kourennaia O, Schnare MN, Gray MW (2010) Abundant 5S rRNA-like transcripts encoded by the mitochondrial genome in Amoebozoa. Eukaryot Cell 9:762–773
Burger G, Gray MW, Lang BF (2003) Mitochondrial genomes: anything goes. Trends Genet 19:709–716
Burger G, Gray MW, Forget L, Lang BF (2013) Strikingly bacteria-like and gene-rich mitochondrial genomes throughout jakobid protists. Genome Biol Evol 5:418–438
Gawryluk RMR, Gray MW (2010) An ancient fission of mitochondrial cox1. Mol Biol Evol 27:7–10
Gillham NW (1978) Organelle heredity. Raven, New York
Gillham NW (1994) Organelle genes and genomes. Oxford University Press, New York
Gray MW, Burger G, Lang BF (1999) Mitochondrial evolution. Science 283:1476–1481
Gray MW (2003) Diversity and evolution of mitochondrial RNA editing systems. IUBMB Life 55:227–233
Gray MW (2011) The incredible shrinking organelle. EMBO Rep 12:873
Gray MW (2012a) Evolutionary origin of RNA editing. Biochemistry 51:5235–5242
Gray MW (2012b) Mitochondrial evolution. Cold Spring Harb Perspect Biol 4:a011403
Gray MW, Lang BF, Cedergren R, Golding GB, Lemieux C, Sankoff D, Turmel M, Brossard N, Delage E, Littlejohn TG, Plante I, Rioux P, Saint-Louis D, Zhu Y, Burger G (1998) Genome structure and gene content in protist mitochondrial DNAs. Nucleic Acids Res 26:865–878
Gray MW (1999) Evolution of organellar genomes. Curr Opin Genet Dev 9:678–687
Gray MW, Lang BF, Burger G (2004) Mitochondria of protists. Annu Rev Genet 38:477–524
Lang BF, Burger G, O’Kelly CJ, Cedergren R, Golding GB, Lemieux C, Sankoff D, Turmel M, Gray MW (1997) An ancestral mitochondrial DNA resembling a eubacterial genome in miniature. Nature 387:493–497
Shutt TE, Gray MW (2006) Bacteriophage origins of mitochondrial replication and transcription proteins. Trends Genet 22:90–95
Sloan DB, Alverson AJ, Chuckalovcak JP, Wu M, McCauley DE, Palmer JD, Taylor DR (2012) Rapid evolution of enormous, multichromosomal genomes in flowering plant mitochondria with exceptionally high mutation rates. PLoS Biol 10:e1001241
Wallace DC (1999) Mitochondrial diseases in man and mouse. Science 283:1482–1488
Wolstenholme DR, Jeon KW (eds) (1992) Mitochondrial genomes, vol 141. Jeon KW (ed) International review of cytology: a survey of cell biology. Academic, San Diego

Mitochondrial Genomes in Alveolates

Claudio Slamovits
Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada

Synopsis

Emerging sequence and molecular data from the mitochondria of alveolate protists are revealing an array of unusual features in the genomes of these organelles, including structural rearrangements involving extensive fragmentation of ribosomal RNA genes and massive loss of protein-coding and tRNA genes. In addition, changes in conserved features of protein translation have occurred in some lineages, along with the acquisition of RNA editing, trans-splicing, and other RNA-processing events. Analyzing these genomes in a phylogenetic framework reveals how some of the unusual characteristics developed progressively over the history of the group, shedding light on the functional and evolutionary significance of these organelles.

Mitochondrial Genomes in Alveolates, Table 1 Some genomic and molecular features of alveolate mitochondrial genomes

Feature | Ciliates | Apicomplexans | Dinoflagellates
Genome size (kb) | 40–47 | 6–8 | ~10–40?
Protein-coding genes | 20–44 | 3 | 2 or 3
rRNA genes | 1 or 2 rnl; both rnl and rns split in two in some species | Extensively fragmented, probably reduced | Extensively fragmented, probably reduced
tRNA genes | 3–8 | 0 | 0
Editing | No | No | None in basal spp., abundant in other spp.
Start/stop codons | AUG, NUG, AUN/UAA | AUG, AUU, AUA/UAA | AUU, AUA, ?/UAA, ?

Introduction

Alveolates comprise a large group of protists (single-celled eukaryotes) defined by the presence of cortical alveoli (flattened membranous vesicles) located beneath the plasma membrane. Collectively, alveolates are very diverse, encompassing tens of thousands of described species that colonize a vast range of ecological niches including oceans, rivers and lakes, soils, and animal cells. Alveolates include three distinct lineages (ciliates, dinoflagellates, and apicomplexans) as well as a number of poorly characterized forms of uncertain phylogenetic placement (Cavalier-Smith 2004). Ciliates and dinoflagellates are generally free-living, typically highly motile protists that can be found in virtually any type of water body and are important components of marine and freshwater ecosystems worldwide. Ciliates are heterotrophic, whereas dinoflagellates include both heterotrophic and photosynthetic species. In contrast, all known apicomplexan species are obligate intracellular parasites of animals. Some genera, such as Plasmodium and Toxoplasma, are human parasites that cause severe diseases (e.g., malaria, toxoplasmosis) and thus represent an enormous burden for society.

In spite of the great diversity and importance of this group, comparatively few alveolate mitochondrial genomes have been sequenced and analyzed. The available data show that the mitochondrial genomes of alveolates have experienced a great deal of change throughout the evolutionary history of the group, resulting, as described below, in some of the most bizarre organellar genomic organizations among eukaryotes.

Ciliates

Protein-coding genes and transcripts. Mitochondrial genomes have been sequenced in ten species of ciliates from four genera and three different classes. These genomes are all linear DNA molecules with sizes ranging from about 40 to 47 kb (Table 1) and contain between 20 and 44 genes, specifying proteins of diverse functional categories as well as rRNA and tRNA components of the mitochondrial translation system. Genome size, gene repertoire, gene order, and overall structure and composition are very similar among species from the genera Tetrahymena, Paramecium, and Euplotes (Ref. 57 in Gray et al. 2004; Barth and Berendonk 2011; de Graaf et al. 2011). These mtDNAs encode about 50 genes separated by very short intergenic spacers, and approximately half of the genes correspond to proteins of discernible homology to known mitochondrial proteins. The identifiable proteins include subunits of oxidative phosphorylation complexes, ribosomal proteins, and a single gene encoding the transport and maturation protein YejR. The remaining genes are open reading frames (ORFs) of unknown function with no detectable sequence similarity outside ciliates. None of these genes, collectively termed “ymf,” is present in all surveyed species, and most of them are conserved only within closely related species. Tetrahymena and Paramecium species, for example, have 13 ymf genes in common, but in some cases, the homology is inferred from positional co-localization (Ref. 57 in Gray et al. 2004). It has been suggested that these seemingly ciliate-specific genes might correspond to genes encoding highly divergent oxidative phosphorylation proteins. However, only one of the ymf genes shared by Paramecium and Tetrahymena could be found in Euplotes species and none in Nyctotherus ovalis (de Graaf et al. 2009, 2011). Likewise, E. minuta and E. crassus share 14 ymf genes with varying degrees of similarity, but none of them was found in N. ovalis.

Ribosomal proteins are commonly encoded in mitochondrial genomes in plants and protists but not in animals. Ciliate mitochondrial genomes encode a limited number of ribosomal proteins, most of which are common to all ciliates. A total of 14 mitochondrial genes for ribosomal proteins were identified in the ten species of ciliates sequenced. Of those, ten are present in Tetrahymena and Paramecium species. Euplotes species have seven and N. ovalis has six. Many mitochondrial ribosomal proteins are not highly conserved at the amino acid sequence level, and the genes for several are especially difficult to identify in the mtDNA of some organisms. Thus, some of the ymf genes could represent divergent ribosomal proteins that might be identified using meticulous comparative structural and proteomic analyses.
Many ciliate mitochondrial genes do not have ATG codons within the region where translation is predicted to start, and close inspection of the 5′ ends of protein-coding transcripts indicated that transcripts are not edited posttranscriptionally, implying that alternative start codons must be used for protein synthesis. In Tetrahymena, an array of codons that differ from AUG by one nucleotide at either the first or third position (i.e., NUG or AUN) function as initiators of translation (Ref. 92 in Gray et al. 2004). The initiation codons are located just a few nucleotides after the 5′ terminus (only one nucleotide in some cases), a feature that is favored by the existence of several possible initiation codons.

N. ovalis has the most divergent mtDNA in terms of gene content and structure among the sequenced ciliate mitochondrial genomes. While the total size is not significantly different from that of other ciliate mitochondrial genomes, N. ovalis mtDNA contains only 20 genes, or roughly half the number seen in the other species. This difference is reflected in substantially longer, and highly heterogeneous, intergenic spacers (de Graaf et al. 2011). N. ovalis inhabits the hindgut of cockroaches, an environment characterized by very low availability of oxygen. Consequently, mitochondria of N. ovalis have adapted to this condition and produce ATP anaerobically using fumarate as the ultimate electron acceptor, in the process producing hydrogen. Initially considered to be a hydrogenosome (a membrane-bound organelle that produces ATP and hydrogen and lacks a genome), the organelle from N. ovalis was shown to contain a bona fide mitochondrial genome and thus is classified as a hydrogen-producing anaerobic mitochondrion, possibly an evolutionary link between mitochondria and the highly reduced hydrogenosomes. The pronounced loss of genes in this species involves all the components from complexes III, IV, and V of the respiratory chain, while nine genes for Complex I components are still present (de Graaf et al. 2011). Drastic evolutionary shifts, such as the acquisition of a parasitic lifestyle or strong metabolic alterations, are often accompanied by changes in the genome involving loss and gain of genes, intron loss or reduction, extreme rearrangements, or marked compositional shifts.

RNA genes. Transfer RNA (tRNA) genes are also scarce in ciliate mitochondrial genomes.
Three genes, corresponding to Trp, Phe, and Tyr, are found in the ten genomes sequenced. Tetrahymena species encode four additional genes, one of which (tRNAMet) is also present in P. tetraurelia and the Euplotes species. E. minuta also contains a tRNA gene for Gln. While it is possible that some other highly divergent tRNAs may be present in ciliate mitochondrial genomes, most tRNA molecules presumed to be indispensable for the organelle are encoded in the nucleus and imported into the mitochondrion by specific carriers (Duchêne et al. 2009). The mitochondrial genome of Tetrahymena species encodes two slightly different copies of rnl, encoding the large subunit (LSU) rRNA, whereas all other ciliates with known mitochondrial genomes have one copy each of rnl and rns (which encodes the small subunit (SSU) rRNA). Another particularity of the Tetrahymena and Paramecium species is that their mitochondrial rnl and rns are each encoded in two parts (Refs. 140, 141, and 271 in Gray et al. 2004). The two fragments of each rRNA are likely co-transcribed, matured by endonucleolytic cleavage, and assembled separately into their respective ribosomal subunits.

Genome composition and structure. Most ciliate mitochondrial genomes sequenced so far exhibit a very compact architecture characterized by very short intergenic regions that represent 4–13% of the total mitochondrial genome. Gene order is maintained almost completely between closely related species (within genera), but the positional equivalence of homologous genes is progressively disrupted at larger evolutionary distances. For example, large syntenic blocks can be recognized between Tetrahymena and Paramecium species (two oligohymenophorids), but no synteny is maintained between species belonging to different ciliate classes (e.g., among Tetrahymena/Paramecium, Euplotes, and Nyctotherus). Nucleotide composition is generally biased towards increased A+T content, reaching about 80% in the mtDNA of Tetrahymena species, whereas the lowest recorded A+T content (58%) is in P. tetraurelia mtDNA. It is still unclear how nucleotide composition and genome organization evolve across the group because data are very sparse. Species of Tetrahymena and some Paramecium species, including P. caudatum, have similar A+T content, as well as similar gene content and genome organization. However, a significant shift in nucleotide composition and codon usage, but not in genome structure, has occurred within the genus Paramecium, exemplified by a sharp reduction in A+T content in P. tetraurelia and other closely related species (Barth and Berendonk 2011).
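
The A+T content figures quoted above are simple base-composition ratios. As a minimal sketch of how such a value is computed from a sequence, the hypothetical Python function below counts A and T among the unambiguous bases; ignoring ambiguous symbols such as N is an assumption of this sketch, not something stated in the text.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T) of a DNA sequence.

    Assumption: ambiguous symbols (e.g., N) and gaps are ignored, not counted.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "AT" for b in bases) / len(bases)
```

Applied to whole mtDNA sequences, a function of this kind would be expected to return values near 0.8 for Tetrahymena species and near 0.58 for P. tetraurelia, according to the figures given in the text.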

Apicomplexa

In Apicomplexa, mitochondrial genomes have reached an extreme degree of reduction and structural divergence. The total size of the genomes that have been sequenced ranges from 6 to 8 kb, and in the basally branching apicomplexan Cryptosporidium, the highly degenerate mitochondrion appears to lack DNA altogether. The short, ultracompact mitochondrial genomes can be circular or form linear strings of many tandemly arranged units. The most striking sign of reduction is the significant loss of protein-coding genes relative to the number of genes found in ciliate mitochondrial genomes. All the mitochondrial genomes sequenced so far in this phylum (30 at the time of writing) contain only three protein-coding genes, which are the same in all cases: apocytochrome b (cob) and subunits 1 and 3 of cytochrome c oxidase (cox1 and cox3, respectively).

Several alternative start codons have been proposed in apicomplexans. A standard AUG is the initiation codon for cob in all observed cases, whereas AUU and AUA appear to initiate translation in transcripts of the other genes. However, the known diversity of start codons in apicomplexans could increase when more species are studied. For example, in Eimeria species from chickens, GUU and GUG are other start codon alternatives for the cox1 gene (Lin et al. 2011). In contrast, transcripts of all three genes use the standard UAA stop codon exclusively.

The three genes encoding proteins account for roughly half of the apicomplexan mitochondrial genome. Virtually all the remaining DNA consists of rRNA genes and very short intergenic spacers. As in ciliates, the rRNA genes in apicomplexan mtDNA are fragmented, but here, the fragmentation is rampant: the LSU and SSU rRNA genes are split into multiple small fragments scattered throughout the flanking regions of the protein-coding genes. In species of Plasmodium, where these genes have been thoroughly examined, 12 fragments of the LSU and 7 fragments of the SSU gene have been identified, but fewer could be matched in species of other genera. The fragments typically range between a few dozen and a few hundred bp and are distributed in seemingly random fashion relative to their order in a typical prokaryotic rRNA gene. Moreover, in all cases, the identified fragments do not account for the entire rRNA molecule. While the functional apicomplexan mitochondrial rRNA could be shorter than homologues in other mitochondria, it is also likely that additional, still unidentified fragments of rRNA are encoded in the short intergenic regions.

Another consequence of extreme reduction is the complete absence of tRNA genes in the mitochondrial genomes of apicomplexans. Absence of some tRNA genes from organellar genomes is known from several organisms (Duchêne et al. 2009), and in these cases, the missing tRNA species are encoded in the nucleus and imported from the cytoplasm. Complete absence of tRNA genes is rare but not unique to the mitochondria of alveolates. In the euglenozoan Trypanosoma brucei, all tRNAs except the eukaryotic-type tRNAMet-I and tRNASec are imported (Barbrook et al. 2010).

In contrast with the remarkable conservation in size and gene content, the linear order of genes between species from different genera is completely rearranged, but at shorter evolutionary distances, genomic plasticity is less disruptive. For example, synteny is almost perfectly maintained among the 23 species of Plasmodium whose mitochondrial genomes have been analyzed (Hikosaka et al. 2011).
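
The alternative start codons mentioned in this entry can be compared with the two classes reported for Tetrahymena, NUG and AUN, which differ from AUG at the first or third position only. The Python sketch below performs that classification. The function name and the "other" label are hypothetical, and the sketch says nothing about whether a given codon actually initiates translation in any organism.

```python
RNA_BASES = "ACGU"


def start_codon_class(codon: str) -> str:
    """Classify a codon relative to AUG: 'AUG', 'NUG', 'AUN', or 'other'.

    DNA-style input (T) is accepted and converted to RNA (U).
    """
    codon = codon.upper().replace("T", "U")
    if len(codon) != 3 or any(base not in RNA_BASES for base in codon):
        raise ValueError(f"not a valid codon: {codon!r}")
    if codon == "AUG":
        return "AUG"
    if codon[1:] == "UG":   # differs from AUG at the first position only
        return "NUG"
    if codon[:2] == "AU":   # differs from AUG at the third position only
        return "AUN"
    return "other"          # differs at the second position or at two or more positions
```

With this scheme, the apicomplexan initiators AUU and AUA fall in the AUN class and the Eimeria codon GUG in the NUG class, whereas GUU differs from AUG at two positions and so falls outside both classes.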

Dinoflagellates

Apicomplexans and dinoflagellates are close evolutionary relatives, and as such, they share many of the particularities found in their mitochondrial genomes. The most conspicuous similarities are in gene content: cox1, cox3, and cob, the same three protein genes encoded in apicomplexan mitochondria, are also the only ones found to date in the mitochondrial genomes of dinoflagellates. Similarly, rRNA genes are extensively fragmented and tRNA genes appear to be absent.

In contrast, the physical organization of the mitochondrial genome is very different in apicomplexans and dinoflagellates. While in apicomplexans the genomes occur as circular-mapping molecules or tandem arrays of well-defined, highly streamlined 6–8 kb units, in dinoflagellates the mitochondrial genome appears to comprise a highly complex collection of linear fragments of varying sizes and indefinable number. Early studies using electrophoresis of DNA and hybridization of specific probes targeting mitochondrial genes gave confusing results, as molecules of defined size hybridizing to known mitochondrial sequences could not be found, and attempts to purify dinoflagellate mtDNA were largely unsuccessful. The use of high-throughput methods such as EST and shotgun sequencing using next-generation technologies has enabled the examination of mtDNA in several species of dinoflagellates, including species from basal lineages such as Oxyrrhis marina and Hematodinium sp., and also more derived dinoflagellates (Nash et al. 2008; Waller and Jackson 2009).

Genome structure. The most remarkable characteristic, found in all studied dinoflagellate species, is the complex, seemingly chaotic structural arrangement of the mtDNA. Because of the intrinsic difficulty in physically separating the heterogeneous mtDNA from the nuclear DNA, PCR amplification and transcriptomic surveys have typically been used to analyze the organization of dinoflagellate mtDNA. The approximate size of the fragments that make up the mtDNA of Amphidinium carterae has been estimated at ~30 kb, but the total amount of distinct sequence is likely to be much larger. PCR surveys in several species using primers to known regions of cox1, cox3, and cob as well as the few rRNA fragments identified to date have revealed seemingly endless combinations of the three protein-coding genes and rRNA fragments in numerous different genomic contexts, which are likely to be reshuffled by recombination.

Genes and transcripts.
Because transcripts of protein-coding and rRNA genes are polyadenylated in the mitochondria of dinoflagellates, EST projects have provided abundant data on gene structure and molecular characteristics of

M

714

Mitochondrial Genomes in Alveolates

Mitochondrial Genomes in Alveolates, Fig. 1 A schematic representation of the evolutionary relationships of the three main lineages of alveolates, pinpointing the relative origin of some of the distinctive features of alveolate mitochondrial genomes. At the right, a representative mitochondrial genome from each of the three groups is

shown. Thin lines represent the mitochondrial DNA; solid rectangles represent genes as follows: red, cob; green, cox1; yellow, cox3; blue, rRNA-coding regions; grey, ymf genes; and black, other protein-coding genes. Arrows indicate direction of transcription

the mitochondrial genes. As in ciliates and apicomplexans, dinoflagellate mitochondrial mRNAs use alternative start codons; AUU and AUA have been identified as potential initiators of translation, and the same codons have been described in transcripts of two of the three genes in P. falciparum mtDNA. Consistent with this feature, 5′ untranslated regions are very short, with no more than a few nucleotides between the start of the transcript and the predicted initiation codon. In O. marina, a short stretch of uridylate (U) residues, not encoded in the genome, appears to be added to the 5′ terminus of the mRNAs. If authenticated, this addition may represent a novel type of 5′ processing (Nash et al. 2008; Waller and Jackson 2009).

At the other end of the transcripts, another unusual feature seen in ciliates and apicomplexans is taken one step further. While the other alveolates use only TAA as a stop codon, in dinoflagellate mitochondria two of the three protein-coding transcripts contain no evident translation termination codon. In the cox3 transcript, no stop codon is encoded at the DNA level, but the transcript ends with a U nucleotide, which upon polyadenylation becomes part of an in-frame standard UAA codon. It is unclear how termination of translation occurs in the transcripts of the two genes that lack stop codons. One possibility is that lysines are added to the carboxy terminus of the protein by reading the AAA codons created by polyadenylation (Slamovits et al. 2007). Alternatively, there may be some other mechanism involving specialized tRNAs or other types of RNA molecules (Waller and Jackson 2009).
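The way polyadenylation can complete a stop codon is easy to see in a toy model. The following is a minimal sketch only: the two sequences are hypothetical, not real dinoflagellate transcripts, and the function names `polyadenylate` and `first_in_frame_stop` are illustrative. It assumes, following the text, that UAA is the only stop codon in use.

```python
# Toy model: can a poly(A) tail complete an in-frame UAA stop codon?
# All sequences below are hypothetical illustrations, not real genes.

STOP_CODONS = {"UAA"}  # assumption from the text: UAA (TAA) is the only stop used


def polyadenylate(transcript: str, tail_length: int = 20) -> str:
    """Append a poly(A) tail of the given length to an RNA sequence."""
    return transcript + "A" * tail_length


def first_in_frame_stop(mrna: str, start: int = 0):
    """Return (index, codon) of the first in-frame stop codon, or None."""
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            return i, codon
    return None


# AUA start codon, two sense codons, then a single terminal U (cox3-like case).
ends_in_u = "AUAGGCCUGU"
# Same reading frame without the terminal U (a transcript lacking any stop).
no_final_u = "AUAGGCCUG"

print(first_in_frame_stop(ends_in_u))                  # before polyadenylation
print(first_in_frame_stop(polyadenylate(ends_in_u)))   # U + AA completes UAA
print(first_in_frame_stop(polyadenylate(no_final_u)))  # tail is read as AAA codons
```

In this sketch the transcript ending in U gains an in-frame UAA only after the tail is added, whereas the transcript without a terminal U is followed by AAA (lysine) codons and no stop, which corresponds to the unresolved case described above.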


Analysis of transcripts from Karlodinium micrum revealed that cox3 is encoded as separate subgenomic fragments that are spliced together to form the full-length mRNA, presumably by a novel type of trans-splicing (Waller and Jackson 2009). This type of processing of cox3 is probably widespread among dinoflagellates, as fragmented cox3 genes were found at the DNA level in several species, with the exception of O. marina, which represents a basal dinoflagellate lineage. Instead, cox3 in O. marina is part of a cob–cox3 transcript. This transcript encodes a single ORF in which no stop codon is found at the end of the predicted cob coding sequence, and cox3 follows in-frame with no detectable intervening sequence. Thus, translation of this transcript would generate a cob–cox3 fusion protein, but it is unknown whether the polypeptide is cleaved posttranslationally (Slamovits et al. 2007).

RNA editing is a type of posttranscriptional process by which nucleotides in the unprocessed transcript can be inserted, deleted, or modified to form a different base, resulting in a different sequence in the mature mRNA. Transcripts of mitochondrial genes in dinoflagellates have been found to undergo substitutional editing at up to 6% of the bases (Waller and Jackson 2009). In other eukaryotic systems, substitutional editing consists only of transitions (i.e., purine to purine or pyrimidine to pyrimidine changes), but in dinoflagellates, transversions have also been observed, although in a minor proportion. The emerging pattern suggests that mitochondrial transcripts in basal species such as Perkinsus marinus and O. marina do not undergo editing, but those of more derived species do. This scenario indicates that editing arose after the origin of the early dinoflagellate lineages and became more pronounced in the more diverged groups (Fig. 1).
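The distinction between transitions and transversions, and the notion of an editing fraction, can be made concrete with a short sketch. The sequences are hypothetical, the function names are illustrative, and the comparison assumes that the unedited and mature sequences are already aligned and of equal length (substitutional editing only).

```python
# Classify substitutional RNA edits as transitions or transversions.
# Sequences are hypothetical; both are written in the RNA alphabet.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def classify_substitution(unedited: str, edited: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    if unedited == edited:
        raise ValueError("no substitution at this position")
    pair = {unedited, edited}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def editing_summary(unedited_rna: str, mature_rna: str):
    """Count edit types and return (counts, fraction of bases edited)."""
    if len(unedited_rna) != len(mature_rna):
        raise ValueError("substitutional editing keeps the length unchanged")
    counts = {"transition": 0, "transversion": 0}
    for before, after in zip(unedited_rna, mature_rna):
        if before != after:
            counts[classify_substitution(before, after)] += 1
    fraction = sum(counts.values()) / len(unedited_rna)
    return counts, fraction


unedited = "AUGGCAUUCGAUCCGUAAGC"  # hypothetical genome-encoded sequence
mature = "AUGACAUUUGAUCCGUACGC"    # hypothetical edited mRNA

counts, fraction = editing_summary(unedited, mature)
print(counts, f"{fraction:.0%} of bases edited")
```

The toy example is deliberately over-edited (3 of 20 bases) so that both classes appear; the text reports editing at up to 6% of bases, with transversions in the minority.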

Conclusion

Alveolate protists contain highly divergent mitochondrial genomes that in some cases exhibit unique features that depart from widely conserved rules of molecular biology. The distribution of some of the features in the different groups of alveolates indicates an evolutionary progression characterized by reduction in genome size, achieved by loss and migration of genes to the nucleus and accompanied by extensive fragmentation of rRNA genes. Other conspicuous characteristics such as alternative start and stop codons suggest that important changes in the process of protein translation must have occurred.

References

Barbrook AC, Howe CJ, Kurniawan DP, Tarr SJ (2010) Organization and expression of organellar genomes. Philos Trans R Soc Lond B Biol Sci 365:785–797
Barth D, Berendonk TU (2011) The mitochondrial genome sequence of the ciliate Paramecium caudatum reveals a shift in nucleotide composition and codon usage within the genus Paramecium. BMC Genomics 12:272
Cavalier-Smith T (2004) Only six kingdoms of life. Proc Biol Sci 271:1251–1262
de Graaf RM, van Alen TA, Dutilh B, Kuiper J, van Zoggel H, Huynh M, Gortz H-D, Huynen M, Hackstein J (2009) The mitochondrial genomes of the ciliates Euplotes minuta and Euplotes crassus. BMC Genomics 10:514
de Graaf RM, Ricard G, van Alen TA, Duarte I, Dutilh BE, Burgtorf C, Kuiper JWP, van der Staay GWM, Tielens AGM, Huynen MA, Hackstein JHP (2011) The organellar genome and metabolic potential of the hydrogen-producing mitochondrion of Nyctotherus ovalis. Mol Biol Evol 28:2379–2391
Duchêne AM, Pujol C, Maréchal-Drouard L (2009) Import of tRNAs and aminoacyl-tRNA synthetases into mitochondria. Curr Genet 55:1–18
Gray MW, Lang BF, Burger G (2004) Mitochondria of protists. Annu Rev Genet 38:477–524
Hikosaka K, Watanabe Y-I, Kobayashi F, Waki S, Kita K, Tanabe K (2011) Highly conserved gene arrangement of the mitochondrial genomes of 23 Plasmodium species. Parasitol Int 60:175–180
Lin R-Q, Qiu L-L, Liu G-H, Wu X-Y, Weng Y-B, Xie W-Q, Hou J, Pan H, Yuan Z-G, Zou F-C, Hu M, Zhu X-Q (2011) Characterization of the complete mitochondrial genomes of five Eimeria species from domestic chickens. Gene 480:28–33
Nash EA, Nisbet RER, Barbrook AC, Howe CJ (2008) Dinoflagellates: a mitochondrial genome all at sea. Trends Genet 24:328–335
Slamovits CH, Saldarriaga JF, Larocque A, Keeling PJ (2007) The highly reduced and fragmented mitochondrial genome of the early-branching dinoflagellate Oxyrrhis marina shares characteristics with both apicomplexan and dinoflagellate mitochondrial genomes. J Mol Biol 372:356–368
Waller RF, Jackson CJ (2009) Dinoflagellate mitochondrial genomes: stretching the rules of molecular biology. Bioessays 31:237–245


Mitochondrial Genomes in Amoebozoa

Dennis Miller
Department of Molecular and Cell Biology, The University of Texas at Dallas, Richardson, TX, USA

Synopsis Amoebozoa is a supergroup including the cellular (dictyostelid) and acellular (myxomycete) slime molds and the lobose amoebae, as well as several other groups of amoeboid protozoa. The mitochondrial DNAs (mtDNAs) from prototype organisms of each of these three groups have been completely sequenced and characterized. Although these mitochondrial genomes have a number of features in common as anticipated for related organisms, there is an unexpected similarity in mtDNA features between Acanthamoeba castellanii (lobose amoeba) and Dictyostelium discoideum (slime mold), while there are a surprising number of differences in mtDNA features between Physarum polycephalum and Dictyostelium discoideum (both slime molds). Similarities and differences in mtDNA size, gene content, and gene organization as well as mitochondrial gene expression are compared and contrasted. The evolutionary implications of these comparisons are discussed.

Introduction Amoebozoa is a supergroup of eukaryotes that unifies the slime molds, archamoebae and lobose amoebae (Ref. 62 in Gray et al. 2004). The monophyly of Amoebozoa is supported by molecular phylogenies based on small subunit rRNA genes (Ref. 98 in Gray et al. 2004) and combined analyses of protein sequences (Refs. 13, 14, 109 in Gray et al. 2004). The supergroup Amoebozoa branches with the common ancestor of animals and fungi (opisthokonts) in what is thought to constitute a higher order monophyletic group, the unikonts.


These molecular phylogenies indicate that Amoebozoa is composed of two related but anciently divergent groups, Lobosea (or Lobosa) and Conosea (or Conosa). Conosea includes Archamoebae (genera Entamoeba and Mastigamoeba), acellular slime molds (Myxomycetozoa) such as Physarum polycephalum, and cellular slime molds (Dictyostelida) such as Dictyostelium discoideum and Polysphondylium pallidum. Lobosea comprises lobose amoebae having broad, thick pseudopodia: a diverse assemblage including tubulinids (e.g., Hartmannella vermiformis) and acanthamoebids (such as Acanthamoeba castellanii).

Mitochondrial DNAs (mtDNAs) have been sequenced from a number of representative organisms from Amoebozoa. In Conosea the archamoebae are amitochondriate, but the complete sequence of the mtDNA of the prototype organism of the myxomycetes, Physarum polycephalum, has been reported (Ref. 291 in Gray et al. 2004), and the mtDNA of the closely related myxomycete, Didymium iridis, has been partially sequenced (Hendrickson and Silliker 2010). Mitochondrial DNAs from two dictyostelid genera have been completely sequenced, those of Dictyostelium discoideum (Ref. 220 in Gray et al. 2004) and Polysphondylium pallidum (G. Burger et al., unpublished). The mtDNAs from several other Dictyostelium species (citrinum, mucoroides, and fasciculatum) have also been sequenced (see Bullerwell et al. 2010). Two representative mtDNAs have been sequenced from Lobosea, those of Acanthamoeba castellanii, the prototype of the acanthamoebids (Ref. 55 in Gray et al. 2004), and Hartmannella vermiformis, a representative of tubulinids (G. Burger et al., unpublished). The mtDNAs from these six representative genera share a number of features but also display some fundamental differences. Based on nuclear SSU rRNA gene sequence comparisons, a schematic molecular phylogeny of these six genera is shown in Fig. 1a. Surprisingly, based on mtDNA features discussed below, a different phylogenetic topology can be inferred (Fig. 1b). Using these mtDNA features as evolutionary criteria, myxomycetes appear to be related to, but perhaps anciently divergent from, the other members of Amoebozoa; in contrast, the mtDNAs of dictyostelids (Conosea) and the acanthamoebid A. castellanii (Lobosea) have more similarities with each other than either has with the mtDNA of P. polycephalum.

Mitochondrial Genomes in Amoebozoa, Fig. 1 Schematic phylogenetic trees of six representative amoebozoans. (a) Reconstructed from small subunit (SSU) rRNA data; (b) reconstructed from mtDNA features

Comparison of the Mitochondrial Genomes of Three Representative Amoebozoons

The well-characterized mtDNAs of three prototype amoebozoons, A. castellanii, P. polycephalum, and D. discoideum, are compared below using a number of criteria. Similarities and differences are summarized in Table 1.

Mitochondrial genome size and gene content. All three prototype mitochondrial genomes are double-stranded, circular-mapping DNAs between 40 and 65 kb in size. A. castellanii mtDNA comprises 41,591 bp with 57 genes, while D. discoideum mtDNA is 55,564 bp in size with 59 genes (Table 1). Mitochondrial DNA in P. polycephalum ranges in size from 60 to 63 kb and contains at least 44 identified genes and 28 significant, but unassigned, open reading frames (ORFs) (Table 1). Forty genes are common to all three mtDNAs. Each mtDNA has the genes needed for coupled electron transport and oxidative phosphorylation as well as for mitochondrial protein synthesis (rRNAs, tRNAs, and ribosomal proteins), plus unassigned ORFs (Tables 2 and 3). Four tRNA genes, three rRNA genes, 16 ribosomal protein genes, and 17 genes necessary for electron transport and oxidative phosphorylation are common among the three mtDNAs. Fifty genes are shared between A. castellanii and D. discoideum, the unique genes in each mtDNA being primarily tRNA genes and unassigned ORFs.

Genome organization and gene order. While all characterized genes and ORFs in the


mitochondrial genomes of both A. castellanii and D. discoideum are in the same transcriptional orientation, the genes in P. polycephalum mtDNA are transcribed from both strands (51 genes and ORFs clockwise, 21 genes and ORFs counterclockwise). A. castellanii and D. discoideum mtDNAs also have several blocks of genes exhibiting a conserved gene order. The gene sequences rpl16-rpl14-rpl5-rps14-rps8-rpl6-rps13-rps11, rps12-rps7-rpl2-rps19, and nad9-nad7-atp6 are conserved in both mtDNAs. The gene clusters rps12-rps7-rpl2-rps19 and rps14-rps8-rpl6-rps13 are also found in P. polycephalum mtDNA and are therefore conserved in all three amoebozoan mitochondrial genomes. Both A. castellanii and D. discoideum mtDNAs have genes encoding subunits 1 and 2 of cytochrome c oxidase (cox1 and cox2) fused into a single ORF, designated cox1/2. The presence of this unusual feature in both organisms indicates that it arose in a common ancestor. This fusion is absent in P. polycephalum mtDNA. Another feature common to A. castellanii and D. discoideum mtDNAs but absent from that of P. polycephalum is the presence of group I introns in the large subunit ribosomal RNA gene (rnl). A. castellanii rnl has three group I introns, whereas D. discoideum has a single group I intron in rnl and four group I introns in the fused cox1/2 gene. Group I introns have not been detected in the mtDNA of P. polycephalum or in other myxomycete mitochondrial genomes.

Mitochondrial Genomes in Amoebozoa, Table 1 Comparison of three prototype amoebozoan mtDNAs

Feature                                         Acanthamoeba castellanii   Dictyostelium discoideum   Physarum polycephalum
Size (bp)                                       41,591                     55,564                     61,019
Identified protein-coding genes                 33                         34                         36
Unassigned ORFs                                 5                          4                          28
tRNA genes                                      16                         18                         5
rRNA genes                                      3                          3                          3
Fused cox1/2 gene                               +                          +                          –
All genes in same transcriptional orientation   +                          +                          –
Group I introns                                 +                          +                          –
Insertional RNA editing                         –                          –                          +
5′-Replacement tRNA editing                     +                          +                          +
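The gene totals quoted in the text can be cross-checked against the per-category counts in Table 1. The following is a minimal sketch: the dictionary keys and the function name `total_genes` are illustrative choices, and the numbers are simply those given in Table 1.

```python
# Cross-check the gene totals in the text against the counts in Table 1.
# Keys and function names are illustrative; values are taken from Table 1.

TABLE_1 = {
    "Acanthamoeba castellanii": {"protein": 33, "orf": 5, "trna": 16, "rrna": 3},
    "Dictyostelium discoideum": {"protein": 34, "orf": 4, "trna": 18, "rrna": 3},
    "Physarum polycephalum": {"protein": 36, "orf": 28, "trna": 5, "rrna": 3},
}


def total_genes(counts: dict, include_orfs: bool = True) -> int:
    """Sum identified genes, optionally including unassigned ORFs."""
    total = counts["protein"] + counts["trna"] + counts["rrna"]
    return total + counts["orf"] if include_orfs else total


for species, counts in TABLE_1.items():
    print(species,
          "identified:", total_genes(counts, include_orfs=False),
          "with ORFs:", total_genes(counts))
```

Under this reading, the totals of 57 and 59 genes quoted for A. castellanii and D. discoideum include the unassigned ORFs, whereas the figure of 44 identified genes for P. polycephalum excludes its 28 ORFs (44 + 28 = 72, matching the 51 + 21 genes and ORFs transcribed from the two strands).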

Mitochondrial Genomes in Amoebozoa, Table 2 Gene content comparison of three prototype amoebozoan mtDNAs

Gene            Acanthamoeba castellanii   Dictyostelium discoideum   Physarum polycephalum
nad1            +                          +                          +
nad2            +                          +                          +
nad3            +                          +                          +
nad4            +                          +                          +
nad4L           +                          +                          +
nad5            +                          +                          +
nad6            +                          +                          +
nad7            +                          +                          +
nad9            +                          +                          +
nad11 (nadG)    +                          +                          +
cob             +                          +                          +
cox1            +                          +                          +
cox2            +                          +                          +
cox3            +                          +                          +
atp1            +                          +                          +
atp6            +                          +                          +
atp8            –                          +                          +
atp9            +                          +                          +
rps2            +                          +                          +
rps3            +                          ORF425                     +
rps4            +                          +                          +
rps7            +                          +                          +
rps8            +                          +                          +
rps11           +                          +                          +
rps12           +                          +                          +
rps13           +                          +                          +
rps14           +                          +                          +
rps16           –                          –                          +
rps19           +                          +                          +
rpl2            +                          +                          +
rpl5            +                          +                          ERF2 (php23)
rpl6            +                          +                          +
rpl11           +                          +                          ERF3 (php22)
rpl14           +                          +                          +
rpl16           +                          +                          +
rpl19           –                          –                          +

Mitochondrial Genomes in Amoebozoa, Table 3 RNA gene content comparison of three prototype amoebozoan mtDNAs

Gene     Acanthamoeba castellanii   Dictyostelium discoideum   Physarum polycephalum
trnA     +                          +                          –
trnC     –                          +                          –
trnD     +                          –                          –
trnE     +                          +                          +
trnF     +                          +                          –
trnG     –                          –                          –
trnH     +                          +                          –
trnI1    +                          +                          –
trnI2    +                          +                          –
trnI3    –                          +                          –
trnK     +                          +                          +
trnL1    +                          +                          –
trnL2    +                          +                          –
trnM1    +                          +                          +
trnM2    –                          –                          +
trnN     –                          +                          –
trnP     +                          +                          +
trnQ     +                          +                          –
trnR     –                          +                          –
trnS     –                          –                          –
trnT     –                          –                          –
trnV     –                          –                          –
trnW     +                          +                          –
trnX     +                          –                          –
trnY     +                          +                          –
rnl      +                          +                          +
rns      +                          +                          +
rrn5     +                          +                          +

trn tRNA gene, rnl large subunit rRNA gene, rns small subunit rRNA gene, rrn5 5S rRNA gene

Mitochondrial Gene Expression in Amoebozoa

Mitochondrial DNA transcription, promoters. Mitochondrial DNAs throughout eukaryotes are primarily transcribed by a highly conserved and dedicated mitochondrial RNA polymerase (Miller and Miller 2008; Le et al. 2009). However, promoter consensus sequences and transcriptional patterns of mtDNAs vary significantly. The gene arrangement and asymmetric location of the genes on one strand of the mtDNA of A. castellanii and D. discoideum indicate the possibility of polycistronic transcripts whose synthesis is initiated from a single promoter or a small number of promoters. Le et al. (2009) detected eight major polycistronic transcripts from the mtDNA of D. discoideum, which they showed to be derived from rapid co-transcriptional processing of a single primary transcript initiated from a single promoter. The lone transcription initiation site, as detected by 5′ capping analysis, is located upstream of rnl (Le et al. 2009). The arrangement and distribution of genes on both strands of the mtDNA of P. polycephalum imply the need for multiple sites of transcription initiation. Bundschuh et al. (2011) have characterized the mitochondrial transcriptome of the plasmodial form of P. polycephalum. They detected the potential for at least eight polycistronic transcripts and showed that many of the adjacent genes are indeed transcribed as a single polycistronic RNA. Analysis of the regions upstream of the potential initiation sites has not revealed a possible promoter consensus sequence. Interestingly, 25 of the 26 significant unassigned ORFs do not appear to be transcribed in the plasmodial (diploid) form of P. polycephalum. Consistent with this lack of transcription is the observation that the mtDNA of some strains of P. polycephalum has deletions in these reading frames without apparent phenotypic effect. Further analysis of the mtDNA of related myxomycetes should reveal whether these ORFs are transcribed under other developmental or environmental conditions (e.g., in the amoebal form of the myxomycete) or are an evolutionary artifact resulting from the capture of DNA through recombination with a non-mtDNA such as a mitochondrial plasmid, as described by Nakagawa et al. (1998) and Nomura et al. (2005). The mF mitochondrial plasmid has been shown to enhance recombination of mtDNA


causing mtDNA gene rearrangement in some Physarum strains.

RNA self-splicing by group I introns. A necessary step of mitochondrial gene expression in A. castellanii and D. discoideum is the self-splicing of group I introns to produce the mature LSU rRNA in both organisms and the cox1/2 transcript in D. discoideum. The three introns of rnl in A. castellanii each have an ORF, as do introns 2 (two ORFs), 3, and 4 of cox1/2 in D. discoideum. These ORFs presumably code for maturases or homing endonucleases necessary for efficient splicing and/or intron mobility. These introns probably have the potential for mobility, based on the presence of an ORF internal to intron 2 of cox1/2 that codes for a site-specific DNA homing endonuclease (Ref. 220 in Gray et al. 2004).

RNA editing in amoebozoan mitochondria. Several types of RNA editing are necessary to produce functional, mature rRNAs, tRNAs, and mRNAs in mitochondria of various amoebozoons. One type of RNA editing present in all three prototype organisms is a 5′-replacement tRNA editing activity that repairs tRNAs that overlap other genes or have acceptor stem mismatches. First characterized in mitochondrial tRNAs of A. castellanii (Ref. 187 in Gray et al. 2004), 5′-replacement editing of mitochondrial tRNAs has been inferred in D. discoideum based on tRNA gene sequence and demonstration of 5′-tRNA editing activity in vitro (Abad et al. 2011), and 5′-edited mitochondrial tRNAs have been characterized in P. pallidum (E. Schindel and M.W. Gray, unpublished) and recently in P. polycephalum (Gott et al. 2010). This 5′-replacement tRNA editing is based on an activity that is able to add nucleotides to the 5′ end of RNAs in a template-dependent manner (3′–5′ polymerase). In D. discoideum, a tRNAHis guanylyltransferase (Thg1)-like protein has been implicated in the 5′ editing of tRNAs (Abad et al. 2011). While 13 out of 15 tRNAs encoded in the A. castellanii mitochondrial genome are edited (Refs. 187, 252 in Gray et al. 2004), only two of the five tRNAs encoded by P. polycephalum mtDNA are altered by 5′-replacement editing. In each case, a single nucleotide is


replaced by a G to create a G-C base pair at the 1:72 position of the acceptor stem of the tRNA (Gott et al. 2010). C-to-U RNA editing has also been identified at a limited number of sites in amoebozoan mitochondrial RNAs: four sites in P. polycephalum cox1 mRNA (Gott et al. 1993) and one site in the mitochondrial SSU rRNA of D. discoideum (Barth et al. 1999). Horton and Landweber (Ref. 145 in Gray et al. 2004) have shown that C-to-U editing occurs in the cox1 mRNA in other myxomycetes.

The most extensive type of RNA editing is the insertional editing found exclusively in mitochondria of the myxomycetes, a signature feature of this amoebozoan assemblage. The absence of insertional RNA editing in the mitochondria of non-myxomycetes is a further example of the differences between the myxomycetes and the other amoebozoans. This type of RNA editing has been found in five orders (Physarales, Stemonitales, Echinosteliales, Trichiales, and Liceales) and in over 60 myxomycete species (Ref. 145 in Gray et al. 2004; Krishnan et al. 2007). This broad but exclusive distribution of insertional RNA editing serves as a defining characteristic of the myxomycetes. RNAs produced from the mtDNA of P. polycephalum have insertions at 1,324 sites relative to the mtDNA sequence producing the RNAs (Bundschuh et al. 2011). These inserted nucleotides constitute about 4% of the total mRNA sequence and 2% of the sequences of tRNAs and rRNAs. Insertions are primarily single cytidines (1,255 sites) or single uridines (43 sites) or a subset of the possible dinucleotides (AA, 4 sites; UU, 2 sites; UG/GU, 4 sites; UC/CU, 9 sites; UA, 2 sites; GC/CG, 2 sites; 23 sites total). At three sites, there is a single purine insertion (a G at 2 sites and an A at 1 site). Of the 47 transcribed genes and ORFs in the mtDNA of the plasmodial form of P. polycephalum, 45 require insertional RNA editing to produce functional tRNAs, rRNAs, and mRNAs (Bundschuh et al. 2011). These nucleotide insertions occur co-transcriptionally, as the RNA is produced, by addition of non-templated nucleotides to the 3′ end of the nascent RNA (Miller and Miller 2008; Cheng et al. 2001). RNA editing sites are distributed relatively uniformly throughout genes whose transcripts are edited. The distribution of sites is consistent with the random creation of sites at any location, constrained only by a 9-nucleotide limitation in the proximity of adjacent sites (Krishnan et al. 2007). The observed pattern is an average spacing of about 25 nucleotides between editing sites in mRNA, about 72% of saturation density. This dynamic model of editing site fixation is consistent with the variation of editing site location in analogous genes of related myxomycetes (Ref. 145 in Gray et al. 2004; Hendrickson and Silliker 2010; Krishnan et al. 2007; Antes et al. 1998).
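The site counts reported for insertional editing in P. polycephalum can be tallied directly. This is a minimal sketch that uses only the numbers quoted above from Bundschuh et al. (2011); the variable names are illustrative, and the total number of inserted nucleotides is a derived figure that assumes every dinucleotide site contributes exactly two nucleotides.

```python
# Tally the insertional RNA editing sites reported for P. polycephalum.
# Counts are those quoted in the text; variable names are illustrative.

single_nucleotide_sites = {"C": 1255, "U": 43, "G": 2, "A": 1}
dinucleotide_sites = {
    "AA": 4, "UU": 2, "UG/GU": 4, "UC/CU": 9, "UA": 2, "GC/CG": 2,
}

n_single = sum(single_nucleotide_sites.values())
n_dinucleotide = sum(dinucleotide_sites.values())
total_sites = n_single + n_dinucleotide

# Derived quantity (assumption: each dinucleotide site adds two nucleotides).
inserted_nucleotides = n_single + 2 * n_dinucleotide

cytidine_share = single_nucleotide_sites["C"] / total_sites

print("editing sites:", total_sites)
print("dinucleotide sites:", n_dinucleotide)
print("inserted nucleotides:", inserted_nucleotides)
print(f"single-C sites: {cytidine_share:.1%} of all sites")
```

The categories add up to the reported 1,324 sites, and single cytidine insertions account for roughly 95% of them.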

Evolutionary Implications of mtDNA Structure and Features

A strong case can be made for the common ancestry of the organisms in Amoebozoa based on their mtDNA similarities. These genomes have a similar size and gene content, each with 17–18 genes for electron transport and oxidative phosphorylation (17 genes in common), 10–11 genes coding for proteins of the small subunit of the mitoribosome (10 genes in common), and 6–7 genes coding for proteins of the large subunit of the mitoribosome (6 genes in common), as well as three genes specifying LSU rRNA, SSU rRNA, and a putative 5S rRNA (Bullerwell et al. 2010). The surprising observation, in light of phylogenies based on the nuclear SSU rRNA genes or other non-mitochondrial criteria, is the number of similarities between the mtDNA features of A. castellanii and D. discoideum and the number of differences in mtDNA features between D. discoideum, the prototype of the cellular slime molds (dictyostelids), and P. polycephalum, the prototype of the acellular or plasmodial slime molds (myxomycetes). Determining whether these differences are due to a relatively ancient divergence of the dictyostelids and myxomycetes, with the development of some convergent features and morphology, or to a relatively late but rapid divergence of the mitochondrial genome of the myxomycetes from an ancestral pattern will require characterization of the mtDNAs of additional representatives of Amoebozoa, especially other amoebozoan groups that are thought to be monophyletic with the dictyostelids and myxomycetes.
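The shared-gene counts cited in this entry (17, 10, and 6 genes per category; 40 genes common to all three mtDNAs; 50 shared by A. castellanii and D. discoideum) can be reproduced with set intersections over the gene content in Tables 2 and 3. The following is a sketch under stated assumptions: the presence and absence calls are read from those tables as extracted, entries listed under an ORF name (ORF425, ERF2, ERF3) are counted as present, and the species keys and helper names are illustrative.

```python
# Reproduce the shared-gene counts from Tables 2 and 3 with set operations.
# Assumption: genes listed under an ORF name in Table 2 count as present.

ETC = ("nad1 nad2 nad3 nad4 nad4L nad5 nad6 nad7 nad9 nad11 "
       "cob cox1 cox2 cox3 atp1 atp6 atp8 atp9").split()
SSU = "rps2 rps3 rps4 rps7 rps8 rps11 rps12 rps13 rps14 rps16 rps19".split()
LSU = "rpl2 rpl5 rpl6 rpl11 rpl14 rpl16 rpl19".split()
RRNA = ["rnl", "rns", "rrn5"]

# Protein-coding genes marked absent in Table 2 (illustrative species keys).
ABSENT_PROTEIN = {
    "Ac": {"atp8", "rps16", "rpl19"},   # Acanthamoeba castellanii
    "Dd": {"rps16", "rpl19"},           # Dictyostelium discoideum
    "Pp": set(),                        # Physarum polycephalum
}

# tRNA genes marked present in Table 3.
TRNA = {
    "Ac": ("trnA trnD trnE trnF trnH trnI1 trnI2 trnK trnL1 trnL2 "
           "trnM1 trnP trnQ trnW trnX trnY").split(),
    "Dd": ("trnA trnC trnE trnF trnH trnI1 trnI2 trnI3 trnK trnL1 trnL2 "
           "trnM1 trnN trnP trnQ trnR trnW trnY").split(),
    "Pp": "trnE trnK trnM1 trnM2 trnP".split(),
}


def gene_set(species: str) -> set:
    """All identified genes (protein, tRNA, rRNA) for one species."""
    proteins = set(ETC + SSU + LSU) - ABSENT_PROTEIN[species]
    return proteins | set(TRNA[species]) | set(RRNA)


def count_in(genes: set, category: list) -> int:
    """Number of genes from a functional category present in a gene set."""
    return len(genes & set(category))


common = gene_set("Ac") & gene_set("Dd") & gene_set("Pp")
shared_ac_dd = gene_set("Ac") & gene_set("Dd")

print("common to all three:", len(common))
print("shared by A. castellanii and D. discoideum:", len(shared_ac_dd))
print("common ETC/SSU/LSU:",
      count_in(common, ETC), count_in(common, SSU), count_in(common, LSU))
```

With these inputs the per-species totals also match the identified-gene counts in Table 1, which gives some confidence that the flattened table columns were read in the right order.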

References

Abad MG, Long Y, Willcox A, Gott JM, Gray MW, Jackman JE (2011) A role for tRNAHis guanylyltransferase (Thg1)-like proteins from Dictyostelium discoideum in mitochondrial 5′-tRNA editing. RNA 17:613–623
Antes T, Costandy H, Mahendran R, Spottswood M, Miller D (1998) Insertional editing of tRNAs of Physarum polycephalum and Didymium nigripes. Mol Cell Biol 18:7521–7527
Barth C, Greferath U, Kotsifas M, Fisher PR (1999) Polycistronic transcription and editing of the mitochondrial small subunit (SSU) ribosomal RNA in Dictyostelium discoideum. Curr Genet 36:55–61
Bullerwell CE, Burger G, Gott JM, Kourennaia O, Schnare MN, Gray MW (2010) Abundant 5S rRNA-like transcripts encoded by the mitochondrial genome in Amoebozoa. Eukaryot Cell 9:762–773
Bundschuh R, Antmuller J, Becker C, Nurnburg P, Gott JM (2011) Complete characterization of the edited transcriptome of the mitochondrion of Physarum polycephalum using deep sequencing of RNA. Nucleic Acids Res 39:6044–6055
Cheng YW, Visomirski-Robic LM, Gott JM (2001) Non-templated addition of nucleotides to the 3′ end of nascent RNA during RNA editing in Physarum. EMBO J 20:1405–1414
Gott JM, Somerlot BH, Gray MW (2010) Two forms of RNA editing are required for tRNA maturation in Physarum mitochondria. RNA 16:482–488
Gott JM, Visomirski LM, Hunter JL (1993) Substitutional and insertional RNA editing of the cytochrome c oxidase subunit 1 mRNA of Physarum polycephalum. J Biol Chem 268:25483–25486
Gray MW, Lang BF, Burger G (2004) Mitochondria of protists. Annu Rev Genet 38:477–524
Hendrickson PG, Silliker ME (2010) RNA editing is absent in a single mitochondrial gene of Didymium iridis. Mycologia 102:1288–1294
Krishnan U, Barsamian A, Miller DL (2007) Evolution of RNA editing sites in the mitochondrial small subunit rRNA of the myxomycetes. Methods Enzymol 424:197–220
Le P, Fisher PR, Barth C (2009) Transcription of the Dictyostelium discoideum mitochondrial genome occurs from a single initiation site. RNA 15:2321–2330
Miller ML, Miller DL (2008) Non-DNA-templated addition of nucleotides to the 3′ end of RNAs by the mitochondrial RNA polymerase of Physarum polycephalum. Mol Cell Biol 28:5795–5802
Nakagawa CC, Jones EP, Miller DL (1998) Mitochondrial DNA rearrangements associated with mF plasmid integration and plasmodial longevity in Physarum polycephalum. Curr Genet 33:178–187
Nomura H, Moriyama Y, Kawano S (2005) Rearrangements in the Physarum polycephalum mitochondrial genome associated with a transition from linear mF-mtDNA recombinants to circular molecules. Curr Genet 47:100–110

Mitochondrial Genomes in Fungi

B. Franz Lang
Centre Robert Cedergren, Département de Biochimie, Université de Montréal, Montréal, QC, Canada

Synopsis Most fungal mitochondrial genomes are unsurprising in their features, exhibiting a single, circular-mapping chromosome, a standard gene set, close-to-standard bacteria-like tRNA and rRNA structures, and a minimally derived genetic code (UGA “stop” codons specifying tryptophan). More substantial variation occurs in genome size, accounted for by intergenic regions and intron content, and in gene order, which changes rapidly even at short evolutionary distances. In contrast, research on less well-characterized, fast-evolving fungal lineages has uncovered a growing list of highly unorthodox features, which are the focus of this review. Among these novelties are a host of different genome architectures (circular, linear, multi-chromosomal), genetic code changes, reduction in the number of tRNA genes, tRNA editing, previously unknown mechanisms of translation initiation together with unorthodox initiator tRNA structures, rRNA genes in pieces, mobile endonuclease open reading frames (ORFs), and, most recently, group I intron-mediated mRNA trans-splicing. Evidently, mitochondria of certain fungal groups might be described as “Nature’s most advanced genetic laboratory.”


Introduction Few reviews focus specifically on fungal mitochondrial genomes and genes. This lack is in part due to the fact that even as recently as 8 years ago, the count of publicly available complete fungal mtDNA sequences was just 22, at a time when many hundreds of animal mtDNA sequences were available. To some degree, this situation has changed due to a flurry of fungal genome projects, which, however, have centered primarily on ascomycetes and basidiomycetes and have consequently tended to provide “more of the same” rather than a broad view across fungal diversity. This entry, rather than offering a detailed update on “standard” fungal mtDNAs, focuses instead on features that deviate from the typical pattern.

Mitochondrial genes, on their way out. The typical fungal mitochondrial genome has a small, standard set of genes (~30–40, see list below; for an example, see Fig. 1), two more than in animals.

Genes of Identified Function in Fungal Mitochondrial DNAs

1. Electron transport and oxidative phosphorylation
   Complex I: nad1, 2, 3, 4, 4L, 5, 6
   Complex III: cob
   Complex IV: cox1, 2, 3
   Complex V (ATP synthase): atp6, 8, 9
2. Translation
   rRNAs: rns, rnl
   tRNAs: complete set (often 24 or 25 distinct gene sequences)
   Ribosomal proteins: rps3
3. tRNA processing
   RNase P RNA: rnpB

(Genes in bold occur in all mitochondriate fungi; metazoans never have rps3 or rnpB.)

Mitochondrial Genomes in Fungi, Fig. 1 Genetic map of Glomus irregulare mtDNA. The circular-mapping genome of G. irregulare (DAOM 197198) was arbitrarily opened upstream of rnl. Genes on the outer and inner circumference are transcribed in clockwise and counterclockwise direction, respectively. Arcs indicate coding regions interrupted by introns. Boxes of coding regions are filled black, intron ORFs dark gray, and introns light gray. Regions with similarity to mitochondrial plasmid-like DNA polymerase are marked light blue. Gene and corresponding product names are atp6 ATP synthase subunit 6; cob apocytochrome b; cox1-3 cytochrome c oxidase subunits; nad1-6 and 4L NADH dehydrogenase subunits; rnl and rns large and small subunit rRNAs, respectively; and A-W, tRNAs (letters correspond to the amino acid specified by a particular tRNA)

Substantial gene loss in mtDNAs likely occurred during the evolutionary transition from fungal ancestors to fungi (e.g., Thecamonas trahens, a unicellular protist diverging close to the animal/fungal origin, encodes an additional 15 ribosomal protein genes in its mtDNA), but it is also an ongoing process. For example, genes for subunits of the mitochondrial NADH dehydrogenase complex (Complex I of the respiratory chain) are absent from the mtDNA of fission yeasts (Schizosaccharomyces) (Bullerwell et al. 2003a) and of the distantly related budding yeasts of the Saccharomyces group. Likewise, rps3 (encoding a ribosomal protein) and rnpB (specifying the RNA component of mitochondrial RNase P) have a scattered distribution in fungal mtDNAs. The failure to identify rnpB has several possible explanations: (i) highly divergent rnpB genes remain unidentified, (ii) a protein-only enzyme substitutes for rnpB function (as in human and Arabidopsis mitochondria), or (iii) the RNA subunit is nucleus encoded and imported into the mitochondria (as in Neurospora and Podospora; B. F. Lang, unpublished). Loss of tRNA genes also occurs, e.g., loss of a trnI(cau) gene capable of decoding AUA codons as isoleucine (due to a posttranscriptional modification of the C residue in its CAU anticodon) and concomitant loss of AUA codons from mitochondrial genes (e.g., in Schizosaccharomyces octosporus, the zygomycete Rhizopus oryzae, and the basidiomycete Schizophyllum commune). Further, in chytridiomycetes of the orders Monoblepharidales, Chytridiales, and Spizellomycetales, a multitude of tRNAs have to be imported from the cytoplasm, as only 4–9 mitochondrial tRNA genes remain (four in the very fast-evolving Rhizophlyctis rosea, whose mtDNA exceeds 200 kb in size; C. E. Bullerwell and B. F. Lang, unpublished). The tRNA genes that remain in these systems are apparently those that are not easily replaceable by cytoplasmic counterparts, such as initiator methionine tRNA and tRNAs that recognize codons with altered specificity, such as UGA (tryptophan) or UAG (leucine).

Mitochondrial genome structure, controversial and poorly understood. Most fungal mitochondrial genomes assemble into a single,


circular-mapping contig. On this basis, in the absence of supporting experimental data, many authors assume that the genome architecture is truly circular, whereas others point to the alternative, a linear concatemeric (tandemly repeated) structure that assembles into or maps as a circle, as experimentally demonstrated in a few cases (Bendich 1996). Yet, there is little point in defining genome structure without knowledge of the underlying replication mechanism. Mitochondrial DNA of certain yeast (and probably most fungal) species replicates by a rolling-circle mechanism (Valach et al. 2011), and therefore, linear, circular-mapping, tandem repeat units (concatemers) will occur side by side with the circular, replicative form of DNA: that is, the actual genome architecture is neither circular nor linearly repeated, but a combination of both. Unfortunately, experimental demonstration of circular DNA molecules is difficult, except for small genomes (such as yeast, see above). In this context, the mitochondrial genome of a chytrid (Spizellomyces punctatus) may be of interest for model investigations. This genome comprises three circular-mapping chromosomes, with the two smallest ones only 1.1 and 1.4 kb in size (GenBank # NC_003052, NC_003060, NC_003061). Preliminary data indeed reveal that most of the DNA of the two small chromosomes occurs in the form of linear concatemers, together with a small fraction of supercoiled, monomeric circles (B. F. Lang, unpublished). Mono- or multimeric linear-mapping fungal mtDNAs are rare overall (e.g., in the chytrid Hyaloraphidium), occurring frequently only across budding yeasts, where this form of genome architecture has emerged independently several times. In some instances, even multipartite (fragmented) linear genomes are observed. For details and a discussion of the underlying, sometimes complex replication mechanisms, see Valach et al. (2011) and references therein.

Relicts of plasmid insertions in mitochondrial genomes. Fungal mitochondria sometimes harbor autonomously replicating, circular or linear plasmids of enigmatic origin, which encode either a reverse transcriptase (i.e., in retroplasmids) or a T7 phage-like, single-subunit DNA polymerase


(dpo), RNA polymerase (rpo), or both (for a recent review, see Hausner 2011). These plasmids tend to integrate into mtDNA. With one potential exception in the yeast Candida subhashii (Fricova et al. 2010), transferred polymerase genes are unlikely to be functional in mtDNA replication; apparently not under selection, they are rapidly fragmented and lost. In some instances, several distantly related dpo or rpo fragments from the same gene region coexist in mtDNAs, testifying to repeated plasmid insertion and elimination cycles. For instance, in the arbuscular mycorrhizal fungus Glomus irregulare, a remarkable five out of nine evolutionarily distinct dpo fragments occupy the same carboxy-terminal region (Fig. 1). Relicts of plasmid insertions are easily overlooked when fragments of the recognizable plasmid polymerase genes contain no initiation codon or are very short. A hidden Markov model (HMM) search – a sensitive technique that uses a statistical profile derived from numerous aligned sequences – against known fungal mtDNAs indicates the presence of dpo insertions in Mortierella and Smittium (distant relatives of Glomus); in the chytrid Spizellomyces; in the two basidiomycetes, Pleurotus and Moniliophthora; and in the aforementioned ascomycete yeast Candida subhashii. Fragments of rpo occur as well in Spizellomyces, Pleurotus, and Moniliophthora, as well as in Mycosphaerella and Podospora.

Transcription, RNA processing, and RNA editing. Most of the available experimental data on mitochondrial transcription and RNA processing have been obtained from S. cerevisiae and S. pombe and some of their relatives (Schäfer et al. 2005 and references therein). In the two fungi, transcription starts at short A + T-rich promoter motifs, recognized by a typical phage-like RNA polymerase. Fission yeast has one major transcription start exactly at the first nucleotide position of the large subunit rRNA (two further start sites make a minor contribution), whereas S.
cerevisiae has a multitude of transcription units and promoters. Consequently, two models of RNA precursor processing have been suggested. In the fission yeast, S. pombe, a tRNA punctuation model has been proposed, with 5′ and 3′ tRNA processing liberating single-gene RNA units, which in most instances are further 3′ processed at pyrimidine-rich motifs. In budding yeasts, transcription units span only one to a few genes, and processing follows a variety of mechanisms, from tRNA punctuation, through 3′ processing at sites displaying a distinct motif (similar to fission yeast), to further unidentified maturation steps. In any case, intron sequences are removed by mechanisms specific to the two intron groups (I and II). Several of the abovementioned chytrid species with highly reduced tRNA gene counts have taken tRNA evolution a step further. The encoded sequences are incapable of folding into proper tRNA structures, due to a lack of pairing in the first few positions of the acceptor stem. As in various non-fungal systems, this defect is repaired by RNA editing at 5′ termini, using the respective 3′ sequence as a template (Laforest et al. 2004).

Translation initiation, Nature’s experiments with tRNAs and their recognition sites. In bacteria, initiation of translation depends on a specialized initiator methionine tRNA that recognizes AUG, GUG, as well as UUG, in descending order of efficiency. In the case of UUG initiation, only the second and third codon positions (U and G) are able to interact effectively with the second and first positions (A and C), respectively, of the tRNA anticodon. Whereas in bacteria, stabilization of codon–anticodon interactions and precise codon positioning is achieved by Shine–Dalgarno (SD) sequence motifs (at a defined distance upstream of initiation codons), SD motifs do not exist in fungal mitochondrial systems. How then is effective and precise translation initiation achieved? Apparently, in many species, the most upstream AUG (sometimes also GUG) codon in a messenger RNA is used for translation initiation, potentially identified by some sort of ribosomal scanning mechanism.
In these examples, AUG codons close to the expected initiation site are usually unique, and coding regions tend to be flanked by upstream, very A + T-rich sequence. Such a system would no longer require a specific initiator tRNA, and in fact, sometimes only one tRNAMet is encoded in a given fungal mtDNA,


which most likely serves in both initiation and elongation. A most interesting variation exists in monoblepharidalian fungi, in which almost every protein-coding gene has a guanosine (G) residue upstream of the predicted AUG or GUG start codons (Bullerwell et al. 2003b). This conserved G residue correlates with the presence of an unorthodox cytosine (C) residue at position 37 in the anticodon loop of the assumed mitochondrial initiator tRNA, suggesting a four-bp interaction between a CAUC anticodon and quartet GAUG/GGUG codons and a precise start position for the ribosome (Fig. 2). Finally, in the mycorrhizal fungus Gigaspora rosea, initiation at UUG is most frequent, contrary to established rules (B. F. Lang, unpublished). This is because the predicted initiator tRNA has a G residue at position 32 that is otherwise always a pyrimidine (for a recent review on mitochondrial tRNA structure, see Lang et al. 2011), and in the three-dimensional tRNA structure, G-32 is in close proximity to U-36 of the anticodon and the first U of UUG initiation codons. Interaction of G-32 with the two U residues will thus stabilize the U–U pair.

Evolution of the genetic code and reassignment of tRNA identity. The genetic code is standard in only a few fungal groups (e.g., Blastocladiales). Otherwise, there is a strong tendency to adopt UGA stop codons for tryptophan (as, for instance, in metazoan animals). In order to recognize both UGA and UGG as Trp, the anticodon of the respective tRNA is usually not CCA but UCA. Similarly, in several chytrids (e.g., Spizellomyces, Rhizophydium, and Rhizophlyctis), UAG stop codons are read as leucine, and the corresponding tRNA has a matching CUA anticodon. This tRNA has a long extra arm characteristic of tRNALeu, was apparently recruited by gene duplication, and is likely recognized by a leucyl-tRNA synthetase. The only stop codon that has so far been spared from reassignment is UAA. Finally, in a very recent investigation, a recruitment of the S.
cerevisiae mitochondrial tRNAThr from tRNAHis was demonstrated by phylogenetic inference, and the decoding mechanism was investigated by biochemical approaches (Su et al. 2011). Surprisingly, the mitochondrial tRNAThr in question has a UAG anticodon, which pairs with CUN codons that specify leucine in the standard genetic code. It has been demonstrated that both the standard and the new unorthodox tRNAThr are now charged by the same threonyl-tRNA synthetase. One might question why a transition has occurred specifically from CUN(Leu) to CUN(Thr), i.e., whether other changes would have been possible. In fact, according to interpretation of multiple protein alignments, A. gossypii CUN codons are most likely translated as alanine (Lang et al. 2011). In support of this interpretation, the putative new A. gossypii tRNAAla has a G3:U70 base pair, which is a well-known key recognition element for alanyl-tRNA synthetases. Based on phylogenetic analysis, A. gossypii diverges as a sister species to those fungi with a CUN(Thr) codon reassignment, i.e., at an early time point when CUN codons became available for change.

Mitochondrial Genomes in Fungi, Fig. 2 Model of mitochondrial initiator tRNA in Monoblepharidales, interacting with mRNA. The proposed four-bp interaction involved in translation initiation is shown; initiation codons in the mRNA are colored blue. Conserved, unorthodox features are indicated in red, including the non-Watson–Crick base pair at the base of the anticodon stem (facilitating opening of the anticodon loop, to allow for better interaction between the quartet codons and the tRNA) and the cytosine at position 37 of tRNAfMet

Introns, mobile endonucleases, and other mobile genetic elements. Proven or putative mobile elements have a widespread distribution in fungal mitochondrial genomes, and they represent one of the major sources of variability in mtDNAs. The most abundant proven mobile elements are introns, which are present in the four principal divisions of Fungi but are highly variable in terms of individual presence or absence. For example, introns are absent from both Harpochytrium species and from the basidiomycete S. commune but frequent in their relatives, such as the chytridiomycete Monoblepharella15 and the basidiomycete Microbotryum violaceum. Mobile endonuclease ORFs (mORFs) form a less frequent class of mobile elements, which propagate in fungi by insertion into protein-coding genes, most curiously with preference for atp6 and atp9. A loss of gene function is avoided because mORFs carry a copy of the partial gene sequence that is displaced by the insertion, thus creating an intact ORF through the insertion (Paquin et al. 1994). Such elements are found in the chytrid Allomyces macrogynus and various members of “zygomycetes” (a paraphyletic group) including Rhizopus and Glomus species. Finally, there have been various reports of conserved double-hairpin elements (DHE) throughout fungi that are suspected traces of mobile elements (e.g., Bullerwell et al. 2003b).

Mitochondrial Genomes in Fungi, Fig. 3 Mobile endonuclease ORF in Allomyces. Schematic view of the atp6 gene in A. arbusculus (yellow box, without mORF insertion) and A. macrogynus (with insertion). Dotted lines denote the insertion element that carries a partial atp6 sequence (orange), which is joined to the amino-terminal fragment to form a functional atp6 gene. Nucleotide sequences in the corresponding regions of yellow boxes are identical. The carboxy-terminal fragment of atp6 (orange) differs from the native gene sequence at a few positions, thereby preventing an endonucleolytic cut of the hybrid atp6 by the ORF-encoded endonuclease

Genes in pieces and gene insertions. Fragmentation of rRNAs has been observed in a wide variety of mitochondrial systems. In fungi, the mitochondrial small subunit rRNA gene (rns) is fragmented in Monoblepharidales (H. curvatum, Monoblepharella15, Harpochytrium94, and Harpochytrium105) and in the mycorrhizal fungus G. rosea. RNA secondary structure modeling predicts that break points locate to variable regions and that the two rRNA pieces have the potential to assemble by intermolecular base pairing, without a requirement for ligation. Another fragmented gene is cox1 in G. rosea where, however, a complete mRNA is created by group I intron-mediated trans-splicing (B. F. Lang, unpublished) (Fig. 3). It is well known that mitochondrial rRNAs may not only fragment in variable regions of the structure but that these regions may also carry large inserts. Examples are found in S. pombe, Schizosaccharomyces japonicus, and in the abovementioned G. rosea. To avoid misinterpretation of inserts as introns, RT-PCR experiments with primer pairs flanking inserts may be performed. Alternatively, true group I or II introns may be identified by the presence of orthodox intron RNA structures. To facilitate RNA modeling, an intron search engine has been developed (RNAweasel; http://megasun.bch.umontreal.ca/RNAweasel).

Fungal mitochondrial genome annotation. As new genome data continue to accumulate at an unprecedented rate, detailed genome annotation by the end user becomes increasingly challenging, in particular for fungal mtDNAs that carry a large number of introns. Annotation is facilitated by the development of freely available online tools, including the one mentioned above for intron identification, as well as an integrated organelle intron annotator (MFannot; http://megasun.bch.umontreal.ca/papers/MFannot). Identification of weakly conserved genes is achieved by sensitive HMM searches (http://hmmer.janelia.org) that are as fast as BLAST but far more sensitive and reliable. For instance, the abovementioned identification of plasmid DNA and RNA polymerase fragments was enabled by MFannot, as was the identification of genes for RNase P RNAs that are often highly derived and thus difficult to find.
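When translation starts are curated by hand, the codon–anticodon pairings discussed above for translation initiation can be checked mechanically. The Python sketch below is an illustration only: the function name `pairing` and its simple rule set (Watson–Crick pairs plus the G·U wobble pair, with no modified bases and no tertiary contacts such as the proposed G-32 interaction) are assumptions made here and are not part of MFannot, RNAweasel, or any other tool.

```python
# Base pairs written as (mRNA base, tRNA base).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pairing(mrna_site, trna_site):
    """Classify each antiparallel base pair between an mRNA site and a tRNA site.

    Both strings are written 5'->3'. The tRNA site is reversed so that the
    first mRNA base faces the last tRNA base. Returns one label per mRNA
    position: "WC", "wobble", or "mismatch".
    """
    mrna_site = mrna_site.upper()
    trna_site = trna_site.upper()
    if len(mrna_site) != len(trna_site):
        raise ValueError("sites must have equal length")
    labels = []
    for m, t in zip(mrna_site, reversed(trna_site)):
        if (m, t) in WATSON_CRICK:
            labels.append("WC")
        elif (m, t) in WOBBLE:
            labels.append("wobble")
        else:
            labels.append("mismatch")
    return labels


if __name__ == "__main__":
    # Quartet codons against the CAUC site (anticodon CAU plus C-37).
    for quartet in ("GAUG", "GGUG"):
        print(quartet, pairing(quartet, "CAUC"))
    # UUG against a plain CAU anticodon: only positions 2 and 3 pair.
    print("UUG", pairing("UUG", "CAU"))
```

Under these simplified rules, GAUG pairs with CAUC at all four positions, GGUG needs one G·U wobble pair, and UUG leaves its first position unpaired against CAU, which matches the description of UUG initiation given earlier.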

References

Bendich AJ (1996) Structural analysis of mitochondrial DNA molecules from fungi and plants using moving pictures and pulsed-field gel electrophoresis. J Mol Biol 255:564–588
Bullerwell CE, Leigh J, Forget L, Lang BF (2003a) A comparison of three fission yeast mitochondrial genomes. Nucleic Acids Res 31:759–768
Bullerwell CE, Forget L, Lang BF (2003b) Evolution of monoblepharidalean fungi based on complete mitochondrial genome sequences. Nucleic Acids Res 31:1614–1623
Fricova D, Valach M, Farkas Z, Pfeiffer I, Kucsera J, Tomaska L, Nosek J (2010) The mitochondrial genome of the pathogenic yeast Candida subhashii: GC-rich linear DNA with a protein covalently attached to the 5′ termini. Microbiology 156:2153–2163
Hausner G (2011) Introns, mobile elements, and plasmids. In: Bullerwell CE (ed) Organelle genetics. Springer, Berlin/Heidelberg, pp 329–358
Laforest MJ, Bullerwell CE, Forget L, Lang BF (2004) Origin, evolution, and mechanism of 5′ tRNA editing in chytridiomycete fungi. RNA 10:1191–1199
Lang BF, Lavrov D, Beck N, Steinberg V (2011) Mitochondrial tRNA structure, identity and evolution of the genetic code. In: Bullerwell CE (ed) Organelle genetics. Springer, Berlin/Heidelberg, pp 431–474
Paquin B, Laforest MJ, Lang BF (1994) Interspecific transfer of mitochondrial genes in fungi and creation of a homologous hybrid gene. Proc Natl Acad Sci U S A 91:11807–11810
Schäfer B, Hansen M, Lang BF (2005) Transcription and RNA-processing in fission yeast mitochondria. RNA 11:785–795
Su D, Lieberman A, Lang BF, Simonović M, Söll D, Ling J (2011) An unusual tRNAThr derived from tRNAHis reassigns in yeast mitochondria the CUN codons to threonine. Nucleic Acids Res 39:4866–4874
Valach M, Farkas Z, Fricova D, Kovac J, Brejova B, Vinar T, Pfeiffer I, Kucsera J, Tomaska L, Lang BF, Nosek J (2011) Evolution of linear chromosomes and multipartite genomes in yeast mitochondria. Nucleic Acids Res 39:4202–4219

Mitochondrial Genomes in Invertebrate Animals

Dennis V. Lavrov
Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA

Synopsis

Animal mtDNA is commonly described as a small, circular molecule, remarkably uniform in size, gene content, and genomic organization. The results of recent studies contradict this view and reveal substantial diversity in animal mtDNA organization. As should be expected, most of this diversity is found in non-bilaterian animals: phyla Cnidaria, Ctenophora, Placozoa, and Porifera, each of which displays a unique mode and tempo of mitochondrial genome evolution. Mitochondrial DNA in the phylum Cnidaria is characterized by low rate of sequence evolution and loss of all but one or two tRNA genes. In addition, linear mitochondrial genome architecture has evolved within this phylum. In the phylum Ctenophora, mtDNA is characterized by small size, high rate of sequence evolution, and loss of at least 24 genes, including all tRNA genes and two protein-coding genes. In the phylum Placozoa, mtDNA is large in size and contains a substantial number of introns. It also displays such unusual features as mRNA editing and group I intron trans-splicing. Finally, mtDNA patterns are remarkably different among the four main lineages of sponges, with variation in size, genome organization, genetic code, gene content, presence/absence of introns, tRNA structures and editing, rates of evolution, and proliferation of repetitive elements.

Introduction

The subdivision of animals into vertebrates and invertebrates separates a single subphylum (Vertebrata) from two other subphyla within the phylum Chordata, as well as from more than 30 other animal phyla. This division is artificial and likely reflects our anthropocentric view of biology. Yet it persists in our scholarship and academic culture in the form of textbooks, courses, and departmental structures and is followed in this review in part because more research effort has been invested in the study of vertebrate mtDNA than the mtDNA of all other animals put together. Within invertebrates, this entry follows a more phylogenetic approach and recognizes five main lineages: sponges, placozoans, cnidarians, ctenophores, and bilaterian animals (minus vertebrates). Each of these lineages constitutes an ancient branch of the animal tree of life and each has evolved unique features in mitochondrial genome architecture. The relationships among these lineages are still unresolved; one of the proposed phylogenies is presented in Fig. 1. This entry begins with a brief description of mtDNA in bilaterian animals and then discusses mtDNA in four lineages of non-bilaterian animals.

Mitochondrial DNA of Bilaterian Animals: Well-Conserved Genome Organization, High Rate of Sequence Evolution

Invertebrates with bilateral symmetry (Bilateria minus Vertebrata) represent the majority of animal species and include economically, scientifically, and medically important groups such as arthropods, nematodes, flatworms, and molluscs. Not surprisingly, initial information about invertebrate mtDNA came from bilaterian animals and the first completely sequenced invertebrate mtDNA was that of Drosophila (Clary and Wolstenholme 1985) (Fig. 2). Like its counterpart in vertebrate animals, the Drosophila mitochondrial genome is small (~16 kb), contains 37 genes (13 protein genes for complexes I, III, IV, and V of the electron transfer chain, 2 rRNA genes, and 22 tRNA genes; Fig. 3), and has a well-conserved gene order. It contains a single large noncoding region, encompassing the origin of transcription and replication, but otherwise has few intergenic nucleotides. The genetic code used for mitochondrial translation in Drosophila is different from both the standard genetic code in the nucleus and that in vertebrate mitochondria. Relative to the standard code, these differences include reassignments of AGR, ATA, and TGA codons to serine, methionine, and tryptophan, respectively (known as “invertebrate mitochondrial genetic code”; NCBI translation table 5). Several additional genetic codes have been inferred in invertebrate mitochondria, including echinoderm and flatworm (translation_table 9), ascidian (translation_table 13), trematode (translation_table 21), and Pterobranchia (translation_table 24). At the time of writing, 731 non-vertebrate animal mitochondrial genomes were available in the NCBI genome database, including 644 from bilaterian animals.

Mitochondrial Genomes in Invertebrate Animals, Fig. 1 Phylogenetic relationships within Metazoa. One of several alternative reconstructions of animal relationships is presented. Only the taxa that are mentioned in the text are included. Larger font and blue color indicate the four major lineages of non-bilaterian animals. The large number of species of bilaterian animals is represented by a green triangle. Taxa shown: Porifera (Homoscleromorpha, Calcarea, Demospongiae, Hexactinellida), Placozoa, Ctenophora, Cnidaria (Anthozoa and the medusozoan classes Hydrozoa, Scyphozoa, and Cubozoa), and Bilateria

Interestingly, the first report of a complete invertebrate mtDNA already introduced a notion that “all metazoa, from Platyhelminthes to mammals possess a mitochondrial genome that consists of a single circular molecule ranging in size from 14.5 to 19.5 kb” and that both gene content and, to some extent, gene order are conserved in animal mtDNA (i.e., “typical animal mtDNA”) (Clary and Wolstenholme 1985). Subsequent studies generally supported the idea that mtDNA of bilaterian animals is relatively uniform, although several notable exceptions have been found, including mitochondrial genomes that are much larger in size (e.g., in the isopod Armadillidium vulgare), have lost multiple genes (e.g., within Chaetognatha), have genes with introns (annelid Nephtys sp.), have experienced rapid gene rearrangement (e.g., most tunicates), or consist of more than one chromosome (e.g., in rotifers, some nematodes, human lice). One of the most unusual systems was found in bivalve molluscs, which inherit not one but two mitochondrial genomes, one transmitted by mother to all offspring and the second by father exclusively to sons (Zouros et al. 1992). Other unusual genetic and genomic features of invertebrate mtDNA include unorthodox translation-initiation codons, highly modified structures of encoded transfer RNAs, a high rate of sequence evolution, a relatively low rate of gene rearrangements, and the presence of a single large noncoding “control” region. The notion of the “typical animal mtDNA” was propagated widely throughout the animal mtDNA literature, yet the range of studied species was limited primarily to bilaterian animals. As expected from phylogenetic considerations, and as will be seen in the following sections, mtDNA in non-bilaterian animals is both substantially different from and more diverse than that of bilaterian animals.

Mitochondrial Genomes in Invertebrate Animals, Fig. 2 A sample of mitochondrial diversity within Metazoa. Genomes shown: Lubomirskia baicalensis (Porifera; 28,958 bp), Trichoplax adhaerens (Placozoa; 43,079 bp), Mnemiopsis leidyi (Ctenophora; 10,326 bp), Metridium senile (Cnidaria; 17,443 bp), Alatina moseri (Cnidaria; ~28.4 kbp), and Drosophila yakuba (Bilateria; 16,019 bp). Protein (blue) and ribosomal RNA (green) genes are atp6 and atp8-9, subunits 6, 8 and 9 of F0 adenosine triphosphatase (ATP) synthase; cox1-3 cytochrome c oxidase subunits 1–3; cob apocytochrome b; nad1-6 and nad4L NADH dehydrogenase subunits 1–6 and 4L; tatC twin-arginine translocase component C; and rns and rnl small and large subunit rRNAs. tRNA genes (black) are identified by the one-letter code for their corresponding amino acid; subscripts denote different genes for isoacceptor tRNAs. Gray regions indicate introns; yellow, gene overlaps. The two Metridium senile genes colored in yellow are located within the nad5 intron. Genes indicated on the outer and inner circumference are transcribed in clockwise and counterclockwise direction, respectively. The size of the circle is proportional to the size of the genome

Mitochondrial Genomes in Invertebrate Animals, Fig. 3 Gene content in animal mtDNA. The five main animal lineages are shown. Genes listed: atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, tatC, mutS, polB, rns, rnl, trnA, trnC, trnD, trnE, trnF, trnG, trnH, trnI1, trnI2, trnK, trnL1, trnL2, trnMf, trnMe, trnN, trnP, trnQ, trnR1, trnR2, trnS1, trnS2, trnT, trnV, trnW, and trnY. Dark blue, light blue, and white rectangles signify genes present in all, some, or no representatives within each lineage. Red rectangle indicates an inferred change in tRNA amino acid (but not anticodon) identity. D, G, H, and C under Porifera signify Demospongiae, Hexactinellida (glass sponges), Homoscleromorpha, and Calcarea, respectively. A and M under Cnidaria stand for Anthozoa and Medusozoa. Phylum Chaetognatha (Ch), which experienced a loss of 23 mitochondrial genes, is shown separately from the rest of bilaterians

Mitochondrial DNA of Sponges: Major Differences Among Main Lineages

Sponges (phylum Porifera) are one of the most ancient groups of animals, often considered to be the sister group to the rest of Metazoa. Hence, they should have the same taxonomic rank as all other animals taken together. Not surprisingly, sponge mitochondrial genome evolution is not only different from that in other animals but also highly variable among major lineages within the phylum: traditional classes Demospongiae, Hexactinellida, and Calcarea, plus the newly recognized group Homoscleromorpha. So far, most mitochondrial genomes have been determined for the class Demospongiae, including at least one genome for each order in this group (Wang and Lavrov 2008). Demosponge mitochondrial genomes are circular-mapping molecules of 20–25 kb in size that are characterized by the retention of several ancestral (non-metazoan) features, including a minimally modified genetic code, presence of extra genes (most notably atp9), conserved structures of tRNA genes, and the existence of multiple noncoding regions (Figs. 2 and 3). In all but one case, genes have identical transcriptional orientation. Another notable feature of demosponge mtDNA is proliferation of repetitive stem-loop elements – similar to those found in mtDNA of many non-metazoan eukaryotes – which occurred in several independent lineages. The rate of nucleotide substitutions in demosponges is low (0.5–1.6 × 10⁻⁹ per site per year in freshwater sponges), comparable to that in corals and plants, although a significant acceleration in evolutionary rates occurred in the Keratosa (G1) lineage (Wang and Lavrov 2008).

The mitochondrial genomes of homoscleromorphs, a recently recognized major group of sponges, are similar to those of demosponges and have retained similar ancestral genomic features. However, two different mitochondrial organizations have been found within this group: all genomes from the family Oscarellidae are characterized by the presence of tatC, a gene for subunit C of the twin arginine translocase, otherwise not known in animal mtDNA, as well as 27 tRNA genes, while those from the family Plakinidae lack all but five tRNA genes present in Oscarellidae (Gazave et al. 2010) (Fig. 3). In addition, within the Plakinidae, one or two introns are present in cox1 of Plakinastrella sp., Plakina crypta, and Plakina trilopha but are absent elsewhere (Gazave et al. 2010). Interestingly, introns in the same positions within cox1 have also been found in the demosponge family Tetillidae.

Class Hexactinellida (glass sponges) is currently represented by mitochondrial genomes of three species: Iphiteon panicea, Sympagella nux, and Aphrocallistes vastus. These are circular-mapping molecules that evolved a genome organization superficially similar to that of bilaterian animals. Glass sponges and bilaterian animals share several mitochondrial features, most strikingly, the reassignment of the mitochondrial AGR codons from arginine to serine, and some of them have experienced similar changes in the secondary structure of the tRNASer(UCU) that translates these codons (Haen et al. 2007). These changes are rare, complex events that, so far, are known to occur only in the Bilateria and Hexactinellida. However, glass sponges lack another change in the genetic code characteristic of bilaterian animals: the reassignment of the AUA codon from isoleucine to methionine. In addition, the mtDNA of glass sponges contains atp9, which to date has been found only in sponges (among animals), but lacks atp8. Finally, glass sponges are characterized by widespread usage of translational frameshifting in mitochondrial translation (Lavrov et al. 2013).

Mitochondrial DNA in calcareous sponges remains poorly characterized. However, recently the complete mitochondrial genome sequence of the calcinean sponge Clathrina clathrus was determined (Lavrov et al. 2013). The genome of this sponge is highly unusual and consists of six linear chromosomes. It is also characterized by fragmented rRNA genes, tRNA editing, changes in the genetic code, and a high rate of sequence evolution. Studies of the intraspecific variation within Clathrina clathrus and Leucetta chagosensis suggest that mitochondrial genes will be useful for phylogeographic studies of Calcarea.
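The codon reassignments mentioned in this entry can be made concrete with a small translation routine. The Python sketch below is a minimal illustration and not a replacement for curated NCBI translation tables: it builds the standard code, derives the invertebrate mitochondrial code (NCBI translation table 5; AGR to serine, ATA to methionine, TGA to tryptophan), and adds a hypothetical table, named `AGR_SERINE_ONLY` here, that carries only the AGR change reported above as shared by glass sponges. That last table is an assumption for illustration and is not a complete hexactinellid genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (third base varies fastest).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Invertebrate mitochondrial code (NCBI translation table 5):
# AGR -> Ser, ATA -> Met, TGA -> Trp.
INVERTEBRATE_MITO = {**STANDARD, "AGA": "S", "AGG": "S", "ATA": "M", "TGA": "W"}

# Hypothetical table with only the AGR -> Ser change; it is NOT a complete
# glass sponge code and serves only to contrast with table 5.
AGR_SERINE_ONLY = {**STANDARD, "AGA": "S", "AGG": "S"}


def translate(seq, table):
    """Translate a DNA or RNA sequence codon by codon; trailing bases are ignored."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


def reassigned(table):
    """Return the codons whose meaning differs from the standard code."""
    return {codon: aa for codon, aa in table.items() if STANDARD[codon] != aa}


if __name__ == "__main__":
    toy = "ATGATAAGATGATAA"  # made-up sequence: ATG ATA AGA TGA TAA
    print(translate(toy, STANDARD))           # MIR**
    print(translate(toy, INVERTEBRATE_MITO))  # MMSW*
    print(sorted(reassigned(INVERTEBRATE_MITO).items()))
```

Translating the same toy sequence under both tables shows why the choice of genetic code matters for annotation: two apparent stop codons and one arginine in the standard reading become sense codons in the mitochondrial reading.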

Mitochondrial DNA of Placozoans: Large Genomes with Several Unusual Features

Mitochondrial genomes of four placozoans have been reported (Signorovitch et al. 2007). These genomes are large (32–43 kbp), circular-mapping molecules with genes encoded in both transcriptional orientations (Fig. 2). The percentage of intergenic nucleotides is between 15 and 22, similar to demosponges. Placozoan mitochondrial genomes lack atp8, the gene present in mtDNA of most animals, but encode 24 tRNA genes, sufficient to translate all codons using the minimally modified genetic code inferred for placozoan mtDNA, as well as several large ORFs (Fig. 3). Placozoan genes for the large subunit rRNA consist of two subgenic modules separated by ~20 kb. Each placozoan mtDNA contains 8–9 introns, including both group I and group II, otherwise rare in animals (Burger et al. 2009). Some of the introns contain ORFs and genes for reverse transcriptase. The structure of cox1 is the most unusual: not only does this gene contain the majority of introns (6–7) but it also consists of three segments with different transcriptional polarities. It has been shown that the transcripts of these fragments are joined together by group I intron trans-splicing, one of few reports of such a process in vivo (Burger et al. 2009). In addition, cox1 transcripts undergo U-to-C mRNA editing, the only known case of mRNA editing in animal mtDNA.
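For annotation purposes, a gene whose segments lie on different strands, as described for placozoan cox1, can be represented as an ordered list of strand-aware intervals. The Python sketch below is a toy model under stated assumptions: the function names, the coordinate convention (0-based, end-exclusive), and the example sequence are all invented for illustration, and the code models only the bookkeeping of joining exon sequences, not the trans-splicing or editing chemistry.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def assemble_transcript(genome, segments):
    """Join exon segments, listed in mRNA order, into one coding sequence.

    Each segment is (start, end, strand), with 0-based, end-exclusive
    coordinates on the forward strand and strand given as "+" or "-".
    Segments on the "-" strand are reverse-complemented before joining.
    """
    parts = []
    for start, end, strand in segments:
        piece = genome[start:end].upper()
        if strand == "-":
            piece = reverse_complement(piece)
        elif strand != "+":
            raise ValueError(f"unknown strand: {strand!r}")
        parts.append(piece)
    return "".join(parts)


if __name__ == "__main__":
    # Made-up 18-nt "genome": the middle exon is encoded on the reverse strand.
    toy_genome = "ATGGCCGGGTTTTGGTAA"
    toy_segments = [(0, 6, "+"), (6, 12, "-"), (12, 18, "+")]
    print(assemble_transcript(toy_genome, toy_segments))
```

Keeping the segments in mRNA order, rather than genomic order, is a deliberate choice here: the two orders need not coincide when exons are scattered across a genome.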

Mitochondrial DNA of Cnidarians: Evolution of Linear Organization in Medusozoa

Mitochondrial DNA sequences in the phylum Cnidaria have been determined for more than 40 species representing most major lineages (Kayal et al. 2011; Medina et al. 2006). Within Cnidaria, two major mitochondrial genome organizations have been reported: all mitochondrial genomes in the class Anthozoa (corals, soft corals, sea anemones) are circular molecules, while those in the classes Scyphozoa, Hydrozoa, and Cubozoa (collectively known as Medusozoa) are linear molecules. Furthermore, in several species of Hydrozoa and all species of Cubozoa, mtDNA consists of several linear chromosomes (Fig. 2). One feature that is shared among all cnidarians is the loss of all but one or two tRNA genes from mtDNA. A recent study indicates that this is a genuine loss rather than nuclear translocation and that it co-occurred with the loss of all but two mitochondrial aminoacyl-tRNA synthetases encoded by the nuclear genome (Haen et al. 2010). Other unusual features of cnidarian mtDNA include the presence of two atypical genes: mutS encoding a putative mismatch repair protein in octocorals and polB specifying a DNA-dependent DNA polymerase in several medusozoan cnidarians (Kayal et al. 2011). In addition, group I introns, with or without homing endonucleases of the LAGLIDADG type, are found in hexacorals.


Mitochondrial DNA of Ctenophores: Reduced Genome with a High Rate of mtDNA Evolution Only ~150 living species of ctenophores are known and only one mitochondrial sequence from this group has been determined (Pett et al. 2011). The mitochondrial genome of Mnemiopsis leidyi is a small (10.3 kb), circular-mapping molecule atypical in its gene content and the extent of sequence evolution (Fig. 2). It has lost at least 25 genes, including atp6 and all tRNA genes (Fig. 3). It also displays an exceptionally high rate of sequence evolution, resulting in highly derived structures of encoded proteins and rRNA. The atp6 gene has been transferred to the nucleus and has acquired a targeting pre-sequence, and loss of tRNA genes from the mtDNA is accompanied by the loss of nucleus-encoded mitochondrial aminoacyl-tRNA synthetases. Rapid evolution of the mitochondrial rRNA is correlated with accelerated evolution of nucleus-encoded mitochondrial ribosomal proteins.

General Trends in Invertebrate mtDNA Evolution As emphasized in this essay, animal mtDNA is far more diverse than previously thought. Most diversity, as should be expected, is found within non-bilaterian animals, in particular sponges, which are characterized by at least three different genetic codes, linear and circular genome organizations, extra genes (atp9, tatC), large variation in the number of tRNA genes (2–27), presence/absence of introns, tRNA editing, fragmented rRNA genes, translational frameshifting, variable rates of evolution, proliferation of stem-loop elements, and large variation in size. More research is needed to characterize the full extent of mitochondrial diversity in non-bilaterian animals, in particular calcareous sponges and ctenophores.

References
Burger G, Yan Y, Javadi P, Lang BF (2009) Group I-intron trans-splicing and mRNA editing in the mitochondria of placozoan animals. Trends Genet 25:381–386
Clary DO, Wolstenholme DR (1985) The mitochondrial DNA molecule of Drosophila yakuba: nucleotide sequence, gene organization, and genetic code. J Mol Evol 22:252–271
Gazave E, Lapebie P, Renard E, Vacelet J, Rocher C, Ereskovsky AV, Lavrov DV, Borchiellini C (2010) Molecular phylogeny restores the supra-generic subdivision of homoscleromorph sponges (Porifera, Homoscleromorpha). PLoS ONE 5:e14290
Haen KM, Lang BF, Pomponi SA, Lavrov DV (2007) Glass sponges and bilaterian animals share derived mitochondrial genomic features: a common ancestry or parallel evolution? Mol Biol Evol 24:1518–1527
Haen KM, Pett W, Lavrov DV (2010) Parallel loss of nuclear-encoded mitochondrial aminoacyl-tRNA synthetases and mtDNA-encoded tRNAs in Cnidaria. Mol Biol Evol 27:2216–2219
Kayal E, Bentlage B, Collins AG, Kayal M, Pirro S, Lavrov DV (2011) Evolution of linear mitochondrial genomes in medusozoan cnidarians. Genome Biol Evol 4:1–12
Lavrov D, Pett W, Voigt O, Worheide G, Forget L, Lang B, Kayal E (2013) Mitochondrial DNA of Clathrina clathrus (Calcarea, Calcinea): six linear chromosomes, fragmented rRNAs, tRNA editing, and a novel genetic code. Mol Biol Evol 30:865–880
Medina M, Collins AG, Takaoka TL, Kuehl JV, Boore JL (2006) Naked corals: skeleton loss in Scleractinia. Proc Natl Acad Sci U S A 103:9096–9100
Pett W, Ryan JF, Pang K, Mullikin JC, Martindale MQ, Baxevanis AD, Lavrov DV (2011) Extreme mitochondrial evolution in the ctenophore Mnemiopsis leidyi: insight from mtDNA and the nuclear genome. Mitochondrial DNA 22:130–142
Signorovitch AY, Buss LW, Dellaporta SL (2007) Comparative genomics of large mitochondria in placozoans. PLoS Genet 3:e13
Wang X, Lavrov D (2008) Seventeen new complete mtDNA sequences reveal extensive mitochondrial genome evolution within the Demospongiae. PLoS ONE 3:e2723
Zouros E, Freeman KR, Ball AO, Pogson GH (1992) Direct evidence for extensive paternal mitochondrial DNA inheritance in the marine mussel Mytilus. Nature 359:412–414

Mitochondrial Genomes in Land Plants Linda Bonen Department of Biology, University of Ottawa, Ottawa, Canada

Synopsis Plants have the distinction of possessing the largest known mitochondrial genomes among eukaryotes, although their size does not reflect a proportionate increase in gene content. Early-diverging lineages of land plants, such as liverworts and mosses, have mitochondrial genomes of approximately 100–200 kb with a conservative set of about 70 genes, some of which retain bacterial-operon organization reflecting their ancestry. In contrast, the mitochondrial genomes of vascular plants range from about 200 kb to almost 3,000 kb. They are highly recombinogenic and exist in multiple physical forms with varying stoichiometries. In addition, they have a mosaic composition due to the incorporation of “foreign” chloroplast and nuclear sequences, as well as exogenous DNA through horizontal transfer. Such sequences are acquired in a sporadic, lineage-specific fashion and this includes functional chloroplast-origin tRNA genes that have replaced certain native mitochondrial counterparts. However, the origin of much of the intergenic DNA remains elusive, with extensive rearrangements obscuring its history. Even though vascular plant mitochondrial genomes are large, their tRNA gene sets are incomplete and nucleus-encoded cytosol tRNAs are imported for translation. Another notable characteristic is C-to-U RNA editing of mitochondrial transcripts, as well as U-to-C editing in certain early-diverging plant lineages. Mitochondrial gene sets also differ somewhat among plants because the movement of functional genes to the nucleus is still ongoing, with ribosomal protein gene migration being the most successful. The rearranging nature of flowering plant mitochondrial genomes is also related to forms of cytoplasmic male sterility (CMS), where the inappropriate expression of novel chimeric reading frames is correlated with the inability to produce functional pollen.

Introduction For more than 30 years, it has been appreciated that the mitochondrial DNAs (mtDNAs) of vascular plants are exceptionally large and recombinogenic in nature. This conclusion was initially based on complex, nonstoichiometric restriction profiles and reassociation kinetic data, but since then it has been corroborated by genomic sequence data from a wide variety of plant species. In 1997, the first complete mitochondrial genome from a vascular flowering plant, Arabidopsis thaliana, was reported. Its 367-kb genome contains about 60 genes encoding ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and proteins required for respiratory function, and it has a complex physical structure because of the presence of two sets of recombinationally active repeat elements. This pattern contrasts with that of the mitochondrial genome of the nonvascular liverwort Marchantia polymorpha, published in 1992, which is about half that size (about 187 kb) and has a simple circular form as confirmed by electron microscopy. It nevertheless contains more genes (about 70 in total) than does Arabidopsis mtDNA. By late 2011, complete genome sequence data were available for about 30 different species of vascular and nonvascular plants, with representatives being shown in Fig. 1. It is clear that the mitochondrial genomes of land plants have followed a variety of evolutionary pathways, and illustrative examples of their diverse and intriguing features are discussed below.

Physical Complexity of Plant Mitochondrial Genomes Plant mitochondrial genomes are not only very large, but they also exhibit a remarkable range in size and complexity among land plants (Fig. 1). The early-diverging lineages (liverworts and mosses) have mitochondrial genome sizes ranging from about 100–200 kb and a relatively conserved set of genes involved in respiratory function and gene expression. Gene order is for the most part conservative and certain genes (particularly ribosomal protein ones) retain a bacterial operon-type organization. Only a limited number of genomic rearrangements are seen among the mitochondrial genomes of liverworts and mosses. Such features are reminiscent of those seen in the chloroplast genomes of both nonvascular and vascular plants. In contrast, much greater complexity is seen for the mitochondrial genomes of vascular plants.


[Fig. 1 graphic: cladogram grouping the genera as eudicots (Arabidopsis, Brassica, Vigna, Cucumis, Citrullus, Cucurbita, Vitis, Silene, Beta, Nicotiana), monocots (Zea, Oryza, Triticum), gymnosperm (Cycas), lycophytes (Isoetes*, Selaginella*), hornworts (Phaeoceros, Megaceros), mosses (Anomodon, Physcomitrella), and liverworts (Treubia, Pleurozia, Marchantia), set against a genome-size axis marked at 200, 1000, and 2000 kb]

Mitochondrial Genomes in Land Plants, Fig. 1 Diversity of mitochondrial genome size in land plants. Arabidopsis thaliana [366,924 bp, NC_001284], Brassica napus [221,853 bp, NC_008285], Vigna radiata [401,262 bp, NC_015121], Cucumis sativus [1,684,592 bp divided in 3 chromosomes 1,555,935 bp, 83,817 bp, and 44,840 bp, NC_016004-NC_016006], Citrullus lanatus [379,236 bp, NC_014043], Cucurbita pepo [982,833 bp, NC_014050], Vitis vinifera [773,279 bp, NC_012119], Silene latifolia [253,413 bp, NC_014487], Beta vulgaris [368,801 bp, NC_002511], Nicotiana tabacum [430,597 bp, NC_006581], Zea mays [569,630 bp, NC_007982], Oryza sativa [490,520 bp, NC_011033], Triticum aestivum [452,528 bp, NC_007579], Cycas taitungensis [414,903 bp, NC_010303], Isoetes engelmannii [minimum genetic complexity of 57,571 bp based on contig data, FJ010859, FJ176330, FJ390841], Selaginella moellendorffii [261,212 bp, JF338143-JF338147], Phaeoceros laevis [209,482 bp, NC_013765], Megaceros aenigmaticus [184,908 bp, NC_012651], Anomodon rugelii [104,239 bp, NC_016121], Physcomitrella patens [105,340 bp, NC_007945], Treubia lacunosa [151,983 bp, NC_016122], Pleurozia purpurea [168,526 bp, NC_013444], Marchantia polymorpha [186,609 bp, NC_001660]. Sequences and journal references are available at the NCBI accession numbers. Asterisks indicate genome was reported as “network” rather than “master chromosome” because of high number of recombinational breakpoints. Cladogram to illustrate plant relationships is not to scale
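The size ranges quoted for vascular and nonvascular plants can be checked directly against the values in the Fig. 1 caption. The following is a minimal sketch in Python; the dictionary and function names (`VASCULAR_BP`, `NONVASCULAR_BP`, `size_range`, `fold_range`) are hypothetical and exist only for this illustration. Isoetes engelmannii is left out on the assumption that its minimum-complexity estimate is not comparable to a complete genome size.

```python
# Genome sizes in bp, copied from the Fig. 1 caption. Isoetes engelmannii is
# omitted because the caption gives only a minimum-complexity estimate for it.
VASCULAR_BP = {
    "Arabidopsis thaliana": 366_924,
    "Brassica napus": 221_853,
    "Vigna radiata": 401_262,
    "Cucumis sativus": 1_684_592,
    "Citrullus lanatus": 379_236,
    "Cucurbita pepo": 982_833,
    "Vitis vinifera": 773_279,
    "Silene latifolia": 253_413,
    "Beta vulgaris": 368_801,
    "Nicotiana tabacum": 430_597,
    "Zea mays": 569_630,
    "Oryza sativa": 490_520,
    "Triticum aestivum": 452_528,
    "Cycas taitungensis": 414_903,
    "Selaginella moellendorffii": 261_212,
}
NONVASCULAR_BP = {
    "Phaeoceros laevis": 209_482,
    "Megaceros aenigmaticus": 184_908,
    "Anomodon rugelii": 104_239,
    "Physcomitrella patens": 105_340,
    "Treubia lacunosa": 151_983,
    "Pleurozia purpurea": 168_526,
    "Marchantia polymorpha": 186_609,
}
# The three cucumber chromosomes listed in the caption.
CUCUMBER_CHROMOSOMES_BP = (1_555_935, 83_817, 44_840)


def size_range(sizes):
    """Return (smallest, largest) genome size in bp."""
    return min(sizes.values()), max(sizes.values())


def fold_range(sizes):
    """Return the ratio of the largest to the smallest genome size."""
    smallest, largest = size_range(sizes)
    return largest / smallest


if __name__ == "__main__":
    for label, sizes in (("vascular", VASCULAR_BP), ("nonvascular", NONVASCULAR_BP)):
        low, high = size_range(sizes)
        print(f"{label}: {low:,} to {high:,} bp ({fold_range(sizes):.1f}-fold)")
    print(f"cucumber chromosomes sum to {sum(CUCUMBER_CHROMOSOMES_BP):,} bp")
```

With these inputs the two groups do not overlap in size, and the three cucumber chromosomes sum to the total given in the caption.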

Among the completely sequenced mitochondrial genomes from angiosperms and gymnosperms, the sizes range from more than 200 kb (in Brassica napus) to close to 1,700 kb (in cucumber), and for certain other members of the cucurbit family (such as muskmelon), estimates approach 3,000 kb, thus exceeding the genome sizes of some free-living bacteria. This expansion does not, however, reflect an increase in gene content (see below). One contributing factor to the increased size is the integration of “foreign” intracellular DNA from the chloroplast and nucleus. In certain cases, bacterial and/or viral sequences have been identified as well. Intron length can also inflate genome size, but to a lesser degree. The genomes also typically contain sets of repeat elements that are active in intra- and intermolecular recombination. The outcome of intramolecular recombination depends on whether repeats are in direct or inverted orientation in the parental molecule (Fig. 2). The complexity of the resulting heterogeneous population of subgenomic physical forms will depend on the copy number of a given repeat as well as the number of different recombinational repeats present. For example, in Brassica napus, there is just one type of recombinational repeat and it is present in two copies, whereas the Silene latifolia mitochondrial genome has six copies of a different repeat element. Thus, even though angiosperm mitochondrial genomes are usually depicted as single circular “master chromosomes,” the actual situation in vivo is more complex, with mixtures of subgenomic circular, linear, and branched molecules. The mitochondrial genomes of lycophytes, which are a sister clade to angiosperms and gymnosperms, also have complex multipartite structures. In fact, the genomes for Isoetes engelmannii and Selaginella moellendorffii have been depicted as networks since the high number of recombination breakpoints has precluded the assembly of circular “master chromosome” maps.

Mitochondrial Genomes in Land Plants, Fig. 2 Schematic of a simplified plant mtDNA molecule that contains one set of recombinationally active repeat elements (arrows) in a direct (a) or indirect orientation (b). The complexity of subgenomic physical forms generated by homologous recombination will depend on the copy number of a given repeat and the number of different repeats present in the “master chromosome”
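The dependence of the recombination outcome on repeat orientation can be illustrated with a small simulation. This is a minimal sketch, not a model taken from the source: the segment labels (R, A, B, C), the `Segment` type, and the functions `invert` and `recombine` are hypothetical names chosen for this example. It assumes a single circular molecule and one reciprocal crossover between two copies of the same repeat.

```python
from typing import List, Tuple

# A segment is (label, strand); strand is +1 or -1.
Segment = Tuple[str, int]


def invert(segments: List[Segment]) -> List[Segment]:
    """Reverse the order of the segments and flip each strand."""
    return [(label, -strand) for label, strand in reversed(segments)]


def recombine(circle: List[Segment], i: int, j: int) -> List[List[Segment]]:
    """Cross over between the repeat copies at positions i and j of a circle.

    Direct repeats (same strand) split the circle into two smaller circles,
    each carrying one repeat copy. Inverted repeats (opposite strands) keep a
    single circle but invert the region lying between the two copies.
    """
    if not 0 <= i < j < len(circle):
        raise ValueError("need 0 <= i < j < len(circle)")
    (label_i, strand_i), (label_j, strand_j) = circle[i], circle[j]
    if label_i != label_j:
        raise ValueError("positions must hold copies of the same repeat")
    if strand_i == strand_j:
        # Direct orientation: two subgenomic circles.
        return [circle[i:j], circle[j:] + circle[:i]]
    # Inverted orientation: one circle, an isomer of the parental molecule.
    return [circle[: i + 1] + invert(circle[i + 1 : j]) + circle[j:]]


# Hypothetical master circles with one repeat pair each.
DIRECT = [("R", 1), ("A", 1), ("B", 1), ("R", 1), ("C", 1)]
INVERTED = [("R", 1), ("A", 1), ("B", 1), ("R", -1), ("C", 1)]

if __name__ == "__main__":
    print("direct repeats  ->", recombine(DIRECT, 0, 3))
    print("inverted repeats ->", recombine(INVERTED, 0, 3))
```

In this sketch the direct-repeat circle resolves into two subgenomic circles (R-A-B and R-C), whereas the inverted-repeat circle stays intact with the A-B region flipped. Adding more repeat families to the list, and applying `recombine` repeatedly, gives a sense of how quickly the number of possible physical forms grows.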

Although recombination repeat sequences are typically dispersed in the genome and present in relatively few copies, there are also numerous other inactive repeats of shorter length. The Cycas taitungensis mitochondrial genome has a family of mobile elements called “Bpu sequences” that comprise about 5% of the genome, and in the huge cucumber mitochondrial genome, repetitive DNA of various lengths contributes about 36% of the genome. In angiosperms, some of the subgenomic forms (called “sublimons”) are typically present at very low levels but can undergo amplification under certain conditions, such as passage through tissue culture. This phenomenon has been termed substoichiometric shifting of sublimons. The comparison of mitochondrial genomes from the same species (e.g., among Arabidopsis accessions or maize cytotypes) is shedding light on the nature of very recent DNA rearrangements and points to the involvement of short repeats across which either reciprocal or nonreciprocal recombination occurs.

Even taking into consideration all of the factors described above, much of the intergenic DNA in the large mitochondrial genomes of vascular plants still remains unidentified. It is believed that amplification of native mtDNA followed by extensive rearrangement, which obscures its origin, may well be another major contributing factor, and the frequent presence of short mitochondrial coding and noncoding fragments in spacer regions supports this view.

In addition to the mitochondrial genome, self-replicating linear or circular plasmid DNAs have also been identified within the mitochondria of certain plants. They sometimes contain intact or remnant “foreign” genes such as phage-type DNA polymerase or RNA polymerase as well as unidentified ORFs. Such sequences can become integrated into the “master chromosome,” as is the case of an RNA polymerase ORF in the Vitis vinifera mitochondrial genome. The mitochondrial plasmids are typically 10 kb or smaller and are not included in the genome size estimate because they lack mitochondrial genes. It should be noted that the cucumber mitochondrial genome is composed of three chromosomes, the two smaller of which (about 84 kb and 45 kb) lack any identified genes but do contain characteristic plant mtDNA features, such as recombinationally active repeats and mitochondrial pseudogene fragments.

Variable Gene Content Among Plant Mitochondrial Genomes Plant mitochondrial genomes typically share a similar core set of rRNA, tRNA, and protein-coding genes important for respiratory function and translation (Table 1). Unlike animal and fungal mitochondrial genomes, those of plants contain a 5S rRNA gene in addition to the LSU and SSU rRNA genes. However, the mitochondrial genome of the lycophyte Selaginella moellendorffii lacks a 5S rRNA gene and the one in the angiosperm Silene latifolia is exceptionally divergent. Notably, one of the SSU rRNA gene copies in Silene latifolia mitochondria appears to have undergone gene conversion with a chloroplast-origin homologue, illustrating yet another form of genetic plasticity. The mitochondrial protein-coding genes typically include ones for NADH dehydrogenase (nad) subunits, cytochrome oxidase (cox) subunits, cytochrome b, succinate dehydrogenase (sdh) subunits, ATP synthase (atp) subunits, cytochrome c biogenesis (ccm) components, a bacterial-type twin-arginine translocase (called tatC or mttB), and ribosomal proteins (rps and rpl). Seed plants have a subset of the ancestral-type set of 41 mitochondrial protein-coding genes found in liverworts and mosses (Table 1), and although the number exceeds that of animal or fungal mitochondria, it is lower than in certain protists such as Reclinomonas americana. Hornworts have a reduced set of mitochondrial genes compared to liverworts and mosses, and this reduction is primarily due to the almost complete loss of ribosomal protein genes (see “Plasticity of Plant Mitochondrial Genomes and Gene Transfer to the Nucleus” section below). They also lack the four ccm-type genes typically found in plant mitochondria, as does the liverwort Treubia lacunosa. Interestingly, hornworts have a high number of pseudogenes that appear to linger in their mitochondrial genomes. The two lycophytes for which mitochondrial genomic data are available also have a much reduced gene set, with Selaginella moellendorffii having no mitochondrion-encoded ribosomal protein genes. In angiosperms, most of the variation in protein-coding gene number is due to the loss of ribosomal protein (rps and rpl) and succinate dehydrogenase (sdh) genes, with Silene latifolia having the smallest set to date. Although the mitochondrial genomes in vascular plants are very large, they do not encode a complete set of tRNA genes. The deficit is compensated for by nucleus-encoded cytosol-type tRNAs that are imported into the mitochondrion. The most extreme case is for the lycophyte Selaginella moellendorffii where no mitochondrion-located tRNA genes have been identified.
In vascular plants, a third class of tRNA (in addition to the native mitochondrial type and the imported cytosol type) is involved in translation, namely, chloroplast-origin tRNAs whose genes have been integrated into the mitochondrial genome by intracellular horizontal transfer. The number and identity within this category vary among plants (Table 1, parentheses), illustrating the dynamic and mosaic nature of the translation machinery in plant mitochondria.

A subset of mitochondrial genes in vascular and nonvascular plants possesses introns (Table 1), which have been assigned to the group I or group II families of ribozyme mobile genetic elements because of their distinctive features. Angiosperms have almost exclusively the group II type, and there is a bias for their presence in nad genes. The sole exception is a fungal-origin group I intron known to favor a homing site in the cox1 gene; this intron has been sporadically acquired in disparate plant lineages. In nonvascular plants and lycophytes, both group I and II introns are found, with the cox1 gene typically containing a relatively large number of introns. Although the core set of introns in angiosperms and gymnosperms is at the same sites (consistent with presence in their ancestor), few are at shared sites between vascular and nonvascular plants. Among nonvascular plants, there is wide variation reflecting the mobile nature of these introns. In vascular plants, DNA rearrangements have occurred within the introns of several genes so that the exons are dispersed in the genome, yet independently transcribed precursor RNAs can reassemble into a structure competent for splicing in trans. The conversion of cis-splicing introns to trans-splicing forms has occurred sporadically during plant evolution as tracked by examining their status in early-diverging land plant lineages. In the highly rearranging mitochondrial genome of the lycophyte Selaginella moellendorffii, trans-splicing of both group I and group II introns has been documented. Group I and group II introns that are fully competent mobile elements contain ORFs whose products are involved in mobility and splicing. A subset of the mitochondrial introns in nonvascular plants contains such ORFs (called matR or rtl for group II and HEG (homing endonuclease gene) for group I), but just one intact matR gene is present in the mitochondria of angiosperms and gymnosperms. Intronic ORFs have not been included in the protein-coding gene number in Table 1. Curiously, in Selaginella moellendorffii, the nad4L gene is located within one of the nad1 introns, again illustrating the plasticity and rearranging nature of these genomes.

Mitochondrial Genomes in Land Plants, Table 1 Variation in mitochondrial gene content among land plants. Accession numbers and sizes for the mitochondrial genomes are given in Fig. 1. See Supplemental Fig. 1 in Alverson et al. (2011) and Tables S1 and S3 in Liu et al. (2011). Numbers do not include duplicate copies of genes or intronic ORFs found within group I or II introns. Transfer RNA numbers in parentheses denote chloroplast-origin genes, and protein-coding numbers in parentheses indicate ribosomal protein genes. It should be noted that the same number for two plants does not necessarily reflect an identical gene (or intron) set because there are lineage-specific differences in gene migration (mitochondrion-to-nucleus and chloroplast-to-mitochondrion). Asterisks indicate that the number includes trans-splicing introns

Plant                                       rRNA genes  tRNA genes  Protein-coding genes  Group I introns  Group II introns
Angiosperms
  Eudicots
    Arabidopsis thaliana (thale cress)      3           17 (6)      30 (7)                0                23*
    Brassica napus (rapeseed)               3           18 (7)      31 (8)                0                23*
    Vigna radiata (mung bean)               3           16 (5)      30 (8)                0                22*
    Cucumis sativus (cucumber)              3           22 (13)     36 (13)               1                23*
    Citrullus lanatus (watermelon)          3           21 (8)      37 (12)               1                24*
    Cucurbita pepo (zucchini)               3           25 (15)     37 (12)               0                24*
    Vitis vinifera (grape)                  3           23 (11)     38 (15)               0                25*
    Silene latifolia (bladder campion)      3           9 (3)       24 (1)                0                19*
    Beta vulgaris (sugar beet)              3           21 (8)      29 (6)                0                20*
    Nicotiana tabacum (tobacco)             3           21 (9)      36 (13)               0                23*
  Monocots
    Zea mays (maize)                        3           17 (8)      31 (8)                0                22*
    Oryza sativa (rice)                     3           18 (8)      34 (11)               0                23*
    Triticum aestivum (wheat)               3           15 (5)      32 (9)                0                22*
Gymnosperm
    Cycas taitungensis                      3           22 (4)      40 (15)               0                25*
Lycophytes
    Isoetes engelmannii (quillwort)         3           13 (0)      22 (4)                3*               27
    Selaginella moellendorffii (spikemoss)  2           0           17 (0)                3*               33*
Hornworts
    Phaeoceros laevis                       3           20 (0)      18 (2)                1                33
    Megaceros aenigmaticus                  3           18 (0)      19 (3)                0                30
Mosses
    Anomodon rugelii                        3           24 (0)      40 (15)               3                24
    Physcomitrella patens                   3           24 (0)      40 (15)               3                24
Liverworts
    Treubia lacunosa                        3           25 (0)      38 (17)               7                21
    Pleurozia purpurea                      3           25 (0)      41 (17)               7                24
    Marchantia polymorpha                   3           27 (0)      41 (17)               7                25
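The parenthesized values in Table 1 lend themselves to two simple derived quantities: the share of mitochondrial tRNA genes that are of chloroplast origin, and the number of protein-coding genes left once ribosomal protein genes are set aside. The sketch below uses a handful of rows copied from Table 1; the `GeneCounts` type, the `TABLE1_SUBSET` dictionary, and both function names are hypothetical and introduced only for this illustration.

```python
from typing import NamedTuple


class GeneCounts(NamedTuple):
    trna: int               # total tRNA genes
    trna_chloroplast: int   # of which chloroplast-origin (parenthesized value)
    protein: int            # total protein-coding genes
    ribosomal_protein: int  # of which ribosomal protein genes (parenthesized)


# Selected rows copied from Table 1.
TABLE1_SUBSET = {
    "Arabidopsis thaliana": GeneCounts(17, 6, 30, 7),
    "Cucurbita pepo": GeneCounts(25, 15, 37, 12),
    "Silene latifolia": GeneCounts(9, 3, 24, 1),
    "Cycas taitungensis": GeneCounts(22, 4, 40, 15),
    "Selaginella moellendorffii": GeneCounts(0, 0, 17, 0),
    "Physcomitrella patens": GeneCounts(24, 0, 40, 15),
    "Marchantia polymorpha": GeneCounts(27, 0, 41, 17),
}


def chloroplast_trna_fraction(counts: GeneCounts) -> float:
    """Fraction of mitochondrial tRNA genes that are of chloroplast origin."""
    return counts.trna_chloroplast / counts.trna if counts.trna else 0.0


def non_ribosomal_protein_genes(counts: GeneCounts) -> int:
    """Protein-coding genes remaining after removing ribosomal protein genes."""
    return counts.protein - counts.ribosomal_protein


if __name__ == "__main__":
    for plant, counts in TABLE1_SUBSET.items():
        print(
            f"{plant}: {chloroplast_trna_fraction(counts):.0%} chloroplast-origin tRNA genes, "
            f"{non_ribosomal_protein_genes(counts)} non-ribosomal protein genes"
        )
```

For the rows chosen here, the non-ribosomal protein gene count stays within a much narrower band than the ribosomal protein gene count, which fits the statement that ribosomal protein genes account for much of the variation in gene number.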
The genetic information carried in plant mitochondrial genomes typically must undergo correction at the RNA level to enable the proper protein sequences to be synthesized. The form of RNA editing seen in seed plants is the conversion of certain specific cytosines to uridines, and this occurs mostly at non-synonymous sites within protein-coding sequences. In certain early-diverging plant lineages, such as hornworts and Isoetes engelmannii (but not Selaginella moellendorffii), U-to-C editing is also required. Of all land plants examined to date, only marchantiid liverworts such as Marchantia polymorpha lack RNA editing, and that is believed to be due to secondary loss. That said, the degree of editing varies dramatically among plants. In angiosperms, there are typically about 350–500 sites, but in the lycophyte Selaginella moellendorffii, 2,139 editing sites have been identified, while in the moss Physcomitrella patens there are only 11 sites. This illustrates the dynamic interplay between acquisition of such “mutations” (apparently tolerated because machinery is available for correction at the RNA level) and their loss (through retroprocessing templated by edited mRNA copies) over time. The high degree of editing in Selaginella moellendorffii mitochondria is correlated with a very high genomic GC content of 68%, compared to the 41–49% GC seen among other land plants whose mitochondrial genomes have been sequenced.

In addition to the conventional mitochondrial genes, a large number of ORFs (typically annotated as such when longer than 100 codons) have been identified in the large plant mitochondrial genomes. However, they are rarely conserved among plant species and appear to be nonfunctional reading frames fortuitously generated by DNA rearrangements. That said, there are instances when such novel chimeric ORFs, which contain segments of known mitochondrial coding and/or regulatory sequences, are expressed and have a detrimental effect. This is the case for many forms of cytoplasmic male sterility (CMS) where normal pollen is not produced. This trait is of agronomic interest to crop breeders for the production of hybrid seed.
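The effect of C-to-U editing on the encoded protein can be made concrete with a toy example. The following is a minimal sketch: the transcript `TRANSCRIPT` and the site positions in `EDIT_SITES` are invented for illustration and do not come from any real gene, and the function names are hypothetical. Translation is assumed to follow the standard genetic code.

```python
from itertools import product

# Standard genetic code, built in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate an RNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i : i + 3]] for i in range(0, usable, 3))


def edit_c_to_u(rna: str, sites) -> str:
    """Return the transcript with each listed 0-based site changed from C to U."""
    bases = list(rna)
    for pos in sites:
        if bases[pos] != "C":
            raise ValueError(f"site {pos} is {bases[pos]!r}, not 'C'")
        bases[pos] = "U"
    return "".join(bases)


def nonsynonymous_sites(rna: str, sites):
    """Return the sites whose individual editing changes the encoded amino acid."""
    changed = []
    for pos in sites:
        start = pos - pos % 3
        before = CODON_TABLE[rna[start : start + 3]]
        after = CODON_TABLE[edit_c_to_u(rna, [pos])[start : start + 3]]
        if before != after:
            changed.append(pos)
    return changed


# Hypothetical unedited transcript: AUG CCA UCA CGG UUC UAA
TRANSCRIPT = "AUGCCAUCACGGUUCUAA"
EDIT_SITES = [4, 7, 9, 14]  # hypothetical editing positions (0-based)

if __name__ == "__main__":
    edited = edit_c_to_u(TRANSCRIPT, EDIT_SITES)
    print("unedited:", TRANSCRIPT, translate(TRANSCRIPT))
    print("edited:  ", edited, translate(edited))
    print("non-synonymous sites:", nonsynonymous_sites(TRANSCRIPT, EDIT_SITES))
```

In this toy transcript, three of the four edits change the amino acid and one (the third-position change in UUC) is silent, mirroring the observation that editing falls mostly at non-synonymous sites.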

Plasticity of Plant Mitochondrial Genomes and Gene Transfer to the Nucleus As discussed above, a hallmark of vascular plant mitochondrial genomes is the integration of foreign DNA sequences from various sources. Another form of intracellular DNA flux is the movement of functional genes from the mitochondrion to the nucleus (Fig. 3). This transfer is distinct from the random incorporation of mitochondrial sequences into nuclear DNA (called numt sequences) that has been seen in virtually all eukaryotes. Rather, this movement represents successful transfer of genes coupled with their loss from the mitochondrion. It typically involves an RNA intermediate to eliminate the need for editing or group I/II splicing of the translocated gene copy. Gene migration to the nucleus has been inferred to occur sporadically over time (in both vascular and nonvascular plants), although some lineages have experienced more extensive loss from the mitochondrion than others (Table 1). Among angiosperms, transfer appears particularly rampant in the Silene lineage, in that the Silene latifolia mitochondrial genome has the smallest gene set identified to date, plus a large number of remnant pseudogenes. Successful transfer requires not only integration into a compatible site for expression but also signals for import of the protein back into the mitochondrion. The most frequently transferred genes are ribosomal protein ones (Table 1, parentheses), and the hydrophilic nature of these gene products (in contrast to the membrane-associated respiratory chain subunits) may contribute to their success. There are several documented cases of functional genes still being present in both compartments, reflecting “genes in transition.” There are also examples of gene fission where part of the ancestral gene is retained in the mitochondrion (and expressed) whereas the other part has been relocated to the nucleus. Occasionally, a mitochondrial gene is truly lost and its function taken over by another gene such as a nucleus-located copy of a chloroplast homologue. Remarkably, there also are cases of horizontal plant-to-plant transfer of mitochondrial sequences, even between monocots and dicots. This plasticity in gene makeup and organization is in sharp contrast to the typically slow rate of nucleotide substitution in plant mitochondria (notably slower than for chloroplast or nuclear genes). That said, there are some exceptional cases where genes rapidly accumulate nucleotide substitutions (such as atp9 in Silene latifolia) or plant lineages where overall rates have been elevated during their history, such as for Plantago or Pelargonium.

Mitochondrial Genomes in Land Plants, Fig. 3 DNA flux influencing mitochondrial genetic makeup in seed plants. Black arrows denote movement of functional genes from chloroplast-to-mitochondrion (e.g., tRNA genes) and from mitochondrion-to-nucleus (e.g., ribosomal protein ones). Gray arrows denote the transfer of nonfunctional sequences: nucleus-to-mitochondrion; chloroplast-to-mitochondrion and from exogenous sources such as bacteria, viruses, and intron mobile genetic elements; and plant-to-plant horizontal transfer. For simplicity, arrows depicting chloroplast-to-nucleus and exogenous-to-nucleus DNA movement have been omitted

In summary, the mitochondrial genomes in plants have followed very different evolutionary pathways than those of other eukaryotes during the 450–500 million years of land plant evolution. In the ancestor of seed plants, there appears to have been a dramatic shift in mode of gene organization and genome structure. This switch from a rather conservative genome organization with residual bacterial-type gene order to a much more relaxed one has been accompanied by rampant DNA rearrangements and frequent gene flux both into and out of the mitochondrion. The underlying triggers and mechanisms leading to such pronounced diversity in genome size, structure, and gene content remain to be elucidated.

Recommended Reading
Adams KL, Palmer JD (2003) Evolution of mitochondrial gene content: gene loss and transfer to the nucleus. Mol Phylogenet Evol 29:380–395
Alverson AJ, Rice DW, Dickinson S, Barry K, Palmer JD (2011) Origins and recombination of the bacterial-sized multichromosomal mitochondrial genome of cucumber. Plant Cell 23:2499–2513
Chaw SM, Shih AC, Wang D, Wu YW, Liu SM, Chou TY (2008) The mitochondrial genome of the gymnosperm Cycas taitungensis contains a novel family of short interspersed elements, Bpu sequences, and abundant RNA editing sites. Mol Biol Evol 25:603–615
Davila JI, Arrieta-Montiel MP, Wamboldt Y, Cao J, Hagmann J, Shedge V, Xu YZ, Weigel D, Mackenzie SA (2011) Double-strand break repair processes drive evolution of the mitochondrial genome in Arabidopsis. BMC Biol 9:64
Hecht J, Grewe F, Knoop V (2011) Extreme RNA editing in coding islands and abundant microsatellites in repeat sequences of Selaginella moellendorffii mitochondria: the root of frequent plant mtDNA recombination in early tracheophytes. Genome Biol Evol 3:344–358
Kitazaki K, Kubo T (2010) Cost of having the largest mitochondrial genome: evolutionary mechanism of plant mitochondrial genome. J Bot 2010. Article ID 620137, 12 p. https://doi.org/10.1155/2010/620137
Knoop V (2004) The mitochondrial DNA of land plants: peculiarities in phylogenetic perspective. Curr Genet 46:123–139
Liu Y, Xue JY, Wang B, Li L, Qiu YL (2011) The mitochondrial genomes of the early land plants Treubia lacunosa and Anomodon rugelii: dynamic and conservative evolution. PLoS One 6:e25836
Oda K, Yamato K, Ohta E, Nakamura Y, Takemura M, Nozato N, Akashi K, Kanegae T, Ogura Y, Kohchi T, Ohyama K (1992) Gene organization deduced from the complete sequence of liverwort Marchantia polymorpha mitochondrial DNA: a primitive form of plant mitochondrial genome. J Mol Biol 223:1–7
Unseld M, Marienfeld JR, Brandt P, Brennicke A (1997) The mitochondrial genome of Arabidopsis thaliana contains 57 genes in 366,924 nucleotides. Nat Genet 15:57–61

Mitochondrial Genomes in Unicellular Relatives of Animals Dennis V. Lavrov1 and B. Franz Lang2 1 Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA, USA 2 Centre Robert Cedergren, Département de Biochimie, Université de Montréal, Montréal, QC, Canada

Definition Metazoa (multicellular animals) are part of the eukaryotic supergroup Opisthokonta, which also includes Fungi and several groups of protists (Baldauf and Palmer 1993). Although both Metazoa and Fungi are mainly multicellular assemblages, their members are closely related to unicellular eukaryotes within Opisthokonta (Fig. 1a). The group that unites animals and their unicellular relatives is known as Holozoa (Lang et al. 2002) and includes Choanoflagellata, Filasterea, and Ichthyosporea. There is a general consensus that Choanoflagellata are the closest relatives of Metazoa. Furthermore, recent multigene phylogenies place Filasterea as the sister group of Metazoa + Choanoflagellata (the Filozoa hypothesis), with Ichthyosporea forming the deepest divergence within Holozoa (Shalchian-Tabrizi et al. 2008; Torruella et al. 2011 and references therein; Fig. 1a). Mitochondrial genomes of two holozoan protists, the marine choanoflagellate Monosiga brevicollis and the freshwater ichthyosporean Amoebidium parasiticum, have been described previously (Burger et al. 2003). Mitochondrial DNA sequences of Capsaspora owczarzaki (GenBank accession #KC573038), a symbiont of a tropical freshwater snail, and of Ministeria vibrans (GenBank accession #KC573040), a free-living protist with slender radiating tentacles, are published as part of this review (Fig. 1b). Additional sequences are expected to become available through the Origins of Multicellularity Project at the Broad Institute (http://www.

[Fig. 1 graphic: (a) cladogram showing Metazoa; Choanoflagellata (Monosiga brevicollis, Salpingoeca rosetta, Monosiga ovata); Filasterea (Ministeria vibrans, Capsaspora owczarzaki); Ichthyosporea (Sphaeroforma arctica, Amoebidium parasiticum); Fungi; and Nucleariidae. (b) Genetic maps of Ministeria vibrans mtDNA (55,918 bp) and Capsaspora owczarzaki mtDNA (196,908 bp)]

Mitochondrial Genomes in Unicellular Relatives of Animals, Fig. 1 Mitochondrial genomes of unicellular relatives of animals. (a) Phylogenetic relationships among animals, fungi, and their unicellular relatives as inferred in several recent multigene studies (Shalchian-Tabrizi et al. 2008; Torruella et al. 2011 and references therein). Species names for which mtDNA sequences have been described previously are shown in blue. Two species for which mitochondrial genome data are newly reported in this entry are in red. (b) Genetic map comparison of M. vibrans and C. owczarzaki mtDNAs. The linear mitochondrial genome of M. vibrans is depicted as a circle, with linear ends marked by red, filled circles. The genome architecture of C. owczarzaki is unknown, potentially also linear. Standard genes (in black) are atp6 and atp8-9: subunits 6, 8, and 9 of Fo adenosine triphosphate (ATP) synthase; cox1-3: cytochrome c oxidase subunits 1–3; cob: apocytochrome b; nad1-6 and nad4L: NADH dehydrogenase subunits 1–6 and 4L; rns and rnl: small and large subunit rRNAs. tRNA genes are identified by the one-letter code for their corresponding amino acid; subscripts denote different genes for isoacceptor tRNAs. Genes that do not occur in the standard (fungal/animal) mtDNA (blue) are ccmC, ccmF: cytochrome c maturation protein CcmC and heme lyase; rps[N] and rpl[N]: small and large subunit ribosomal proteins; tatC: twin-arginine translocase component C. ORFs in M. vibrans (green). Genes shown on the outer and inner circumference are transcribed in clockwise and counterclockwise direction, respectively. The gray region within C. owczarzaki cox1 indicates a group IB intron that comprises a trnA gene (colored red). Because of the high number of ORFs in C. owczarzaki, these are not named; instead, respective arcs are colored green

M

744

broadinstitute.org/). This review begins with a brief description of the four mitochondrial genomes of holozoan protists, followed by a summary of similarities and differences in mtDNA organization in this group. Mitochondrial DNA of the choanoflagellate Monosiga brevicollis is a circular mapping molecule, 76,568 bp in size, with 86% A+T content. It contains a total of 53 identified genes that specify 26 proteins (including tatC, a gene for the subunit C of the twin-arginine translocase and 11 ribosomal proteins), two rRNAs, and 25 tRNAs. In addition, two large open reading frames (ORFs) are present in the genome. Intergenic regions of M. brevicollis mtDNA are unusually large, several hundred to several thousand bp long, and A+T-rich (93%). They encompass multiple arrays of sequence repeats that consist nearly exclusively of A + T and constitute more than half (53%) of the genome. Four group I introns are present in M. brevicollis mitochondrial genes: three in cox1 and one in nad5. The mitochondrial genome of Ministeria vibrans is a linear molecule, ~56 kb in size, having an A+T content of 75% and with large (~1 kb) inverted terminal repeats that encompass the gene for aspartate tRNA (trnD) (Fig. 1b). A total of 50 genes encode 24 proteins, including nine ribosomal proteins, two rRNAs, and 24 tRNAs. In addition, ten large ORFs are present in the genome, three of them exceeding a surprising 1,000 codons in length. The genes with opposite transcriptional polarities are arranged in two clusters, which are subdivided by the largest (1,351 bp) noncoding region in the genome. No introns are found in M. vibrans mitochondrial genes. The mitochondrial genome of Capsaspora owczarzaki is ~200 kb in size, nearly four times larger than that of its closest relative, M. vibrans, but it has a similar A+T composition (72%). 
It contains 49 identified genes coding for 21 proteins, two rRNAs, and 26 tRNAs, plus 52 ORFs longer than 100 codons (ORFs that contain the long arrays of short sequence repeats that are very common in this mtDNA are not counted). Among the protein genes that are unusual for opisthokonts are ccmC and ccmF, involved in heme maturation and delivery, and six ribosomal protein genes. The otherwise common atp8 is lacking. Note, however, that this genome proved difficult to assemble because of its very high content of sequence repeats; it is therefore only tentatively complete and linear in architecture, as depicted in Fig. 1b. Only one intron (inside cox1) is present in C. owczarzaki mitochondrial genes, and it contains a gene for alanine tRNA (trnA). Accordingly, the large size of the genome is due to the large repeat-containing noncoding regions.

Finally, the mitochondrial genome of Amoebidium parasiticum consists of several hundred linear chromosomes and has a total estimated size of approximately 300 kb and an A+T content of ~62% (Burger et al. 2003). Three different types of chromosomes have been described: an abundant class of small molecules without identified coding function, medium-sized molecules carrying a single gene, and a few large molecules harboring several genes separated by relatively short (80–200 bp) intergenic spacers. All chromosomes share a common terminal repeat structure. About half of A. parasiticum mtDNA has been sequenced, and a total of 44 genes have been identified (including three for ribosomal proteins). In addition, 30 ORFs are present in the characterized part of the genome. Many tRNA genes in the A. parasiticum mitochondrial genome are duplicated (with up to four copies), and numerous gene fragments are present. The genome also contains the largest number of introns in Holozoa (even though not all genes have been identified yet): at least 21 group I and two group II introns. Among individual genes, cox1 has eight group I introns and one group II intron, and rnl has five group I introns and one group II intron.
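The gene tallies quoted above can be cross-checked with a few lines of Python. This is a minimal sketch: the numbers are copied from the text and from Fig. 1b, while the `MtGenome` class and its field names are illustrative assumptions rather than any published annotation format. A. parasiticum is left out because its 44 identified genes are not broken down by type in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MtGenome:
    """Gene counts for one mitochondrial genome, as quoted in the entry."""
    species: str
    size_bp: int        # genome size from the text or Fig. 1b
    proteins: int       # identified protein-coding genes (ORFs excluded)
    rrnas: int
    trnas: int
    stated_total: int   # total of "identified genes" given in the text

    def identified_genes(self) -> int:
        # Sum of the three gene classes; should match the stated total.
        return self.proteins + self.rrnas + self.trnas


GENOMES = [
    MtGenome("Monosiga brevicollis", 76_568, 26, 2, 25, 53),
    MtGenome("Ministeria vibrans", 55_918, 24, 2, 24, 50),
    MtGenome("Capsaspora owczarzaki", 196_908, 21, 2, 26, 49),
]


def size_ratio(larger: MtGenome, smaller: MtGenome) -> float:
    """Return how many times larger one genome is than another."""
    return larger.size_bp / smaller.size_bp


if __name__ == "__main__":
    for g in GENOMES:
        status = "ok" if g.identified_genes() == g.stated_total else "MISMATCH"
        print(f"{g.species}: {g.identified_genes()} genes ({status})")
    ratio = size_ratio(GENOMES[2], GENOMES[1])
    print(f"C. owczarzaki / M. vibrans size ratio: {ratio:.2f}")
```

With the sizes from Fig. 1b, the C. owczarzaki genome comes out at roughly 3.5 times the size of the M. vibrans genome.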

Discussion

Comparison of the mitochondrial genomes of M. brevicollis, M. vibrans, C. owczarzaki, and A. parasiticum reveals a large diversity of mtDNA organization in unicellular relatives of animals. The four studied genomes show three different genome structures: single-chromosome circular mapping, single-chromosome linear, and multiple-chromosome linear. In addition, each of these genomes has a unique gene content, which includes a distinct set of ribosomal protein genes, tatC in M. brevicollis, and ccmC and ccmF in C. owczarzaki (Fig. 2). Because most of the variation in organellar gene content is due to gene loss rather than gain, the observed differences suggest that the ancestral holozoan mitochondrial genome was substantially more gene-rich than those of modern representatives of this group.

[Figure 2. Venn diagram with one oval each for Oscarella, Amoebidium, Monosiga, Capsaspora, and Ministeria.]

Mitochondrial Genomes in Unicellular Relatives of Animals, Fig. 2 Gene complement of mitochondrial genomes in Holozoa shown as a Venn diagram. For comparison, the gene content of the homoscleromorph demosponge Oscarella carmela, the largest among animals, is shown. Each oval corresponds to one organism; genes included within an oval are present in the mtDNA of the given organism. Only protein-coding genes are shown. Gene names are abbreviated as in Fig. 1

At the same time, the characterized mitochondrial genomes of holozoan protists share several features. They encode a nearly identical set of proteins involved in oxidative phosphorylation (the only variation being the absence of atp8 in C. owczarzaki mtDNA; Fig. 2), some ribosomal protein genes, and an apparently complete set of tRNAs required for mitochondrial protein synthesis. In addition, all these organisms use the same “minimally derived” mitochondrial genetic code, which differs from the standard genetic code in that UGA codons specify tryptophan rather than termination. Finally, positions of some cox1 and nad5 introns are conserved between A. parasiticum, M. brevicollis, and several non-bilaterian animals. Data from additional mitochondrial genomes of unicellular relatives of animals are needed to clarify whether these shared features represent synapomorphies for Holozoa (shared derived traits inherited from a common ancestor) and to better understand the evolution of mtDNA in this important group of eukaryotes.
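The effect of the "minimally derived" code described in this entry can be illustrated with a short translation sketch. It assumes only the single reassignment stated in the text (UGA, written TGA in DNA, read as tryptophan); the function names and the toy sequence are hypothetical, and any further differences from the standard code are deliberately not modeled.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
STANDARD_AA = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_code(overrides=None):
    """Return a codon -> amino acid table, optionally with reassignments."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AA))
    table.update(overrides or {})
    return table


STANDARD = build_code()
# Only the change named in the text: UGA (TGA) encodes Trp instead of stop.
MINIMALLY_DERIVED = build_code({"TGA": "W"})


def translate(dna, table):
    """Translate in frame from position 0 until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    toy_orf = "ATGTGATGGTAA"  # hypothetical sequence: ATG TGA TGG TAA
    print(translate(toy_orf, STANDARD))           # truncated at TGA
    print(translate(toy_orf, MINIMALLY_DERIVED))  # TGA read through as Trp
```

Under the standard table the toy reading frame stops at the internal TGA, whereas the derived table reads through it. This is why mitochondrial genes annotated with the wrong code appear to be full of premature stops.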

References

Baldauf SL, Palmer JD (1993) Animals and fungi are each other’s closest relatives: congruent evidence from multiple proteins. Proc Natl Acad Sci U S A 90:11558–11562
Burger G, Forget L, Zhu Y, Gray MW, Lang BF (2003) Unique mitochondrial genome architecture in unicellular relatives of animals. Proc Natl Acad Sci U S A 100:892–897
Lang BF, O’Kelly C, Nerad T, Gray MW, Burger G (2002) The closest unicellular relatives of animals. Curr Biol 12:1773–1778
Shalchian-Tabrizi K, Minge MA, Espelund M, Orr R, Ruden T, Jakobsen KS, Cavalier-Smith T (2008) Multigene phylogeny of Choanozoa and the origin of animals. PLoS One 3:e2098
Torruella G, Derelle R, Paps J, Lang BF, Roger AJ, Shalchian-Tabrizi K, Ruiz-Trillo I (2011) Phylogenetic relationships within the Opisthokonta based on phylogenomic analyses of conserved single copy protein domains. Mol Biol Evol (in press)


Mitochondrial Genomes in Vertebrate Animals

Daniel Bogenhagen
Department of Pharmacological Sciences, Stony Brook University, Stony Brook, New York, NY, USA

Synopsis

While cytoplasmic genetic inheritance was initially discovered in plants and in yeast, the physical isolation of mtDNA was first accomplished in the mid-1960s from chicken embryos and mouse liver, because these vertebrate mtDNAs are small and circular. The subsequent sequencing of human mtDNA was an early landmark of genomics, quickly followed by sequencing of bovine and murine mtDNA. This accomplishment, along with an extensive background of molecular biological investigations, led to an appreciation that vertebrate mtDNAs share a highly conserved organization: genes for 13 polypeptide subunits of respiratory complexes are compressed into a roughly 16.5 kb genome along with a minimal set of structural RNAs required to support translation within the organelle, namely, large and small subunit rRNAs and 22 tRNAs. The gene content, gene order, and nonstandard genetic code of vertebrate mtDNA have all been conserved through roughly 500 million years of evolution featuring remarkable morphological variation. This entry summarizes features of mtDNA organization and key advances in understanding the replication, transcription, RNA processing, and inheritance of vertebrate mtDNA genomes. Research on these topics has been stimulated dramatically by the growing realization of the importance of mtDNA for human health and disease.

Introduction

Our current understanding of vertebrate mtDNAs has as a foundation the seminal accomplishments of sequencing the human, bovine, and murine mitochondrial genomes in the early 1980s (Anderson et al. 1981; Bibb et al. 1981). This descriptive effort crystallized understanding of many critical features of mitochondrial biology, including the following points:

• Vertebrate mitochondrial genomes are highly compact circular DNAs encoding an essentially invariant set of 13 polypeptide subunits of respiratory complexes I, III, IV, and V.
• The standard genetic code, previously considered to be “universal,” cannot permit decoding the mRNAs for the 13 mtDNA-encoded polypeptides. Instead, vertebrate mitochondria employ an alternate genetic code.
• The small circular mtDNA encodes tandemly arrayed rRNAs of reduced size along with a minimal set of 22 tRNAs required for expression of the mitochondrial mRNAs.
• In contrast to nuclear genes, which commonly contain introns, the genes encoded in vertebrate mtDNAs lack introns.
• Mitochondrial mRNAs typically have no extensive 5′ untranslated leader sequences to serve as ribosome binding sites and, in some cases, lack complete termination codons. In these instances, UAA termination codons are only generated by posttranscriptional addition of a short poly(A) tail.
• Generation of all necessary mitochondrial transcripts requires symmetrical transcription of both strands of essentially the entire genome to generate polycistronic RNAs bearing rRNAs and mRNAs frequently flanked by tRNAs. The tRNA punctuation model holds that larger transcripts are often generated by default as by-products of endonucleolytic excision of tRNAs.
• Vertebrate mitochondrial genomes contain a single noncoding region that contains some of the cis-acting sequence elements required for replication and transcription.
• Vertebrate mtDNAs encode none of the key DNA polymerases, RNA polymerases, and factors required for maintenance of the organelle genome. All of these enzymes and factors are nuclear gene products that must be imported into mitochondria.

These features will be discussed briefly in this review.

Mitochondrial Genomes in Vertebrate Animals, Fig. 1 Human mtDNA organization. ND1-6 are subunits of NADH dehydrogenase; Cytb is apocytochrome b, a component of complex III; COI-III are the three largest subunits of cytochrome oxidase (complex IV); and ATP6 and ATP8 are subunits of complex V, ATPase. tRNAs are identified using the single-letter code. Other features are discussed in the text (Adapted from Falkenberg et al. 2007)
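The point in the list above about incomplete termination codons can be made concrete with a small sketch. It assumes a toy reading frame (hypothetical, not a real mitochondrial mRNA) whose 3′ end carries only the U or UA of a stop codon, and shows that appending a poly(A) tail completes an in-frame UAA. The function names and the tail length are illustrative choices.

```python
def polyadenylate(transcript, tail_length=10):
    """Append a poly(A) tail of the given length to an RNA sequence."""
    return transcript + "A" * tail_length


def first_in_frame_stop(rna, stops=("UAA",)):
    """Return the 0-based index of the first in-frame stop codon, or None.

    Only UAA is checked by default, because that is the codon the text
    says is created by polyadenylation.
    """
    for i in range(0, len(rna) - 2, 3):
        if rna[i:i + 3] in stops:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical ORFs ending in an incomplete stop codon ("U" or "UA").
    for orf in ("AUGGCCU", "AUGGCCUA"):
        before = first_in_frame_stop(orf)
        after = first_in_frame_stop(polyadenylate(orf))
        print(f"{orf}: stop before tailing = {before}, after = {after}")
```

In both toy cases no complete stop codon exists in the primary transcript, and an in-frame UAA appears at the same position once the tail is added.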

Vertebrate mtDNA Genomes Have a Conserved, Compact Organization

The sequences of over 1,350 complete vertebrate mitochondrial genomes (including numerous congeneric examples) are currently available and have been curated in the MitoZoa database (D’Onorio de Meo et al. 2011). With few exceptions, the size, gene content, and gene order closely resemble those of human mtDNA shown in Fig. 1. Invertebrates often have altered gene content or gene order, as discussed by Boore (Boore 1999). The 13 polypeptides retained in the vertebrate mitochondrial genome are all intrinsic membrane components of the proton-translocating respiratory complexes I, III, IV, and V (Wallace 2007). As shown in Table 1, these mtDNA-encoded subunits comprise a small fraction of the approximately 80 subunits of the electron transport chain. The mechanisms whereby the synthesis and assembly of mtDNA- and nuclear DNA-encoded subunits are coordinated are a subject of intense current investigation.

Mitochondrial Genomes in Vertebrate Animals, Table 1 Composition of mitochondrial respiratory complexes(a)

Complex   I    II   III   IV   V
mtDNA     7    0    1     3    2
nuDNA     39   4    10    10   14

(a) Numbers refer to proteins in each complex encoded by either the mitochondrial genome (mtDNA) or the nuclear genome (nuDNA)

Figure 1 illustrates how individual rRNAs and mRNAs are commonly separated by tRNA genes such that the larger RNAs can be generated by excision of tRNAs from long primary transcripts according to the tRNA punctuation model of Ojala et al. (1981). However, it is interesting to note that in some cases, the flanking tRNA sequences are not located on the same strand as the mRNAs, which may require that the antisense version of a tRNA helps signal certain RNA cleavage events. In most cases, there are few or no nucleotides “wasted” between stable transcripts. This overall organization suggests a remarkable parsimony if, indeed, RNA-processing events result in efficient production of all possible mature RNAs from a single precursor. However, it has not been proven that all possible RNA products are preserved during these processing steps. In fact, there appears to be considerable variation in the steady-state abundance of various RNA species.

The presence of only a single copy of most tRNA genes is a particularly unusual feature, since bacteria and the eukaryotic nucleus encode multiple iso-accepting tRNAs for each amino acid. Thus, the mitochondrial genome is unusually sensitive to point mutations in tRNAs, since mutation of a key residue in a single tRNA can compromise all mitochondrial protein synthesis, leading to serious human diseases.

As shown in Fig. 1, there is a marked imbalance in the number of genes encoded on the two strands of vertebrate mtDNA. The rRNAs, 12 of 13 mRNAs, and the majority of tRNAs are transcribed from the same strand which, by convention, is referred to as the heavy, or H-, strand. ND6 and eight tRNAs are encoded on the complementary (L-) strand. The H- and L-strand nomenclature refers to the fact that the separated strands of vertebrate mtDNA have a differential base composition that results in a substantial difference in buoyant density in alkaline CsCl gradients.
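The share of respiratory-complex subunits contributed by mtDNA can be tallied directly from Table 1. The sketch below simply sums the table's values; the dictionary layout and function names are illustrative assumptions. The total includes complex V and all five rows of the table, so it need not match the rounded figure quoted in the text.

```python
# Table 1 values: complex -> (mtDNA-encoded subunits, nuDNA-encoded subunits)
TABLE_1 = {
    "I": (7, 39),
    "II": (0, 4),
    "III": (1, 10),
    "IV": (3, 10),
    "V": (2, 14),
}


def subunit_totals(table):
    """Return (mtDNA total, nuDNA total) summed over all complexes."""
    mt_total = sum(mt for mt, _ in table.values())
    nu_total = sum(nu for _, nu in table.values())
    return mt_total, nu_total


def mtdna_fraction(table):
    """Fraction of all listed subunits that are encoded by mtDNA."""
    mt_total, nu_total = subunit_totals(table)
    return mt_total / (mt_total + nu_total)


if __name__ == "__main__":
    mt_total, nu_total = subunit_totals(TABLE_1)
    print(f"mtDNA-encoded subunits: {mt_total}")
    print(f"nuDNA-encoded subunits: {nu_total}")
    print(f"mtDNA share of listed subunits: {mtdna_fraction(TABLE_1):.1%}")
```

The mtDNA column sums to the 13 polypeptides discussed throughout the entry, which is a useful consistency check on the reconstructed table.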

The apparent evolutionary pressure to minimize the size of vertebrate mtDNA is also consistent with the observation that the large and small subunit rRNA genes (1,559 and 954 nucleotides in humans) are about 40% shorter than their counterparts in eubacteria (e.g., 2,904 and 1,542 nucleotides in E. coli). The corresponding large and small subunits of the mitochondrial ribosome have a larger number of protein subunits and a much higher protein-to-RNA ratio than bacterial ribosomes. The size of several mitochondrial tRNAs is reduced as well.
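The "about 40% shorter" figure can be reproduced from the four rRNA lengths given in this paragraph. A minimal sketch follows; the lengths are those quoted in the text, and the helper names are assumptions made for illustration.

```python
# rRNA gene lengths in nucleotides, as quoted in the text.
RRNA_LENGTHS = {
    "large subunit": {"human_mt": 1559, "e_coli": 2904},
    "small subunit": {"human_mt": 954, "e_coli": 1542},
}


def percent_shorter(mito_len, bacterial_len):
    """Percentage by which the mitochondrial rRNA is shorter."""
    return 100.0 * (1.0 - mito_len / bacterial_len)


def combined_percent_shorter(lengths):
    """Same comparison on the summed lengths of both rRNAs."""
    mito = sum(entry["human_mt"] for entry in lengths.values())
    bacterial = sum(entry["e_coli"] for entry in lengths.values())
    return percent_shorter(mito, bacterial)


if __name__ == "__main__":
    for name, entry in RRNA_LENGTHS.items():
        pct = percent_shorter(entry["human_mt"], entry["e_coli"])
        print(f"{name} rRNA: {pct:.1f}% shorter than in E. coli")
    print(f"combined: {combined_percent_shorter(RRNA_LENGTHS):.1f}% shorter")
```

The large subunit rRNA comes out at roughly 46% shorter and the small subunit rRNA at roughly 38% shorter, which brackets the rounded figure in the text.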

Transcription and Replication Control Elements

Despite their high degree of compaction, vertebrate mitochondrial genomes contain one substantial region lacking a coding function, known as the control region. In many vertebrates, the control region contains a partially replicated segment with a displaced single strand known as the D-loop, as shown in Fig. 1. The size of this noncoding region varies from about 700 base pairs in most mammals to about 1,700 base pairs in the African frog, Xenopus. The control region segment near the tRNA(Phe) gene contains transcriptional promoters and three conserved sequence blocks (CSB1, 2, and 3) with poorly understood functions. In contrast, the promoter sequences have diverged extensively in vertebrates so that the transcription machinery in humans, for example, cannot accurately initiate transcription from mouse promoters and vice versa. Mitochondrial promoters have been studied in detail in very few vertebrates, principally in human, mouse, and Xenopus. Definitive evidence for transcription initiation generally depends on in vitro capping of primary transcripts, which have triphosphate termini in vivo. Vertebrate mitochondria share with the yeast S. cerevisiae the feature that promoters consist largely of a short core consensus sequence surrounding the start site, but the yeast mtDNA genome has evolved to contain multiple independent transcription units with discrete promoters.

The 3′ end of the D-loop is also highly divergent in both size and sequence, a feature that was first recognized even before sequences were determined, since mtDNA strands from closely related animals, such as sheep and goats, fail to hybridize in this region. No clear mechanism has emerged to explain how constraints on sequence divergence are selectively relaxed in this portion of the mtDNA. Conserved sequence motifs have been identified at the extreme 3′ end of the D-loop and have been designated termination-associated sequences, or TAS elements, on the assumption that they influence arrest of D-loop DNA synthesis. However, accurate termination of replication by the mitochondrial replicase, DNA polymerase γ (DNA pol γ), has not been reconstituted at these sites.

A large body of experimental evidence supports a standard model for vertebrate mtDNA replication in which transcription from the light strand promoter (LSP) primes leading strand replication at a distance, with RNA-to-DNA conversions occurring near CSB2 and CSB1 and, at least in humans, at an additional downstream site in the 5′ portion of the D-loop. CSB2 has received considerable attention since this sequence is distinguished by a long GC tract, related to the human sequence 5′-AAGCGGGGGAGGGGGGGTTTGG, which has the potential to support persistent GC-rich RNA-DNA structures or guanine tetrads that may signal transcription termination or RNA-processing events involved in formation of long RNA primers. However, the majority of 5′ termini of both D-loop DNAs and longer replication intermediates map to CSB1 or downstream sites. Together, these sequence elements are referred to as the leading strand replication origin, or OH.

The standard strand-asynchronous model for mtDNA replication (Clayton 1982), first established through EM analysis of replication intermediates and pulse-chase labeling studies, holds that leading strand replication proceeds for about 10 kb with continued displacement of the parental H-strand to expose a lagging strand origin, OL. Strong support for this model is provided by the reconstitution of replication initiation at OL primed by mtRNA polymerase along with EM studies showing replication intermediates coated by the mitochondrial single-strand binding protein (mtSSB). Importantly, extensively deleted mtDNA genomes observed in patients and experimental animals virtually always retain both OH and OL. Nevertheless, this model has been questioned on the basis of two-dimensional gels of mtDNA species, which have been variously interpreted to suggest strand-synchronous, bidirectional replication from an imprecise origin region and/or a variant of the strand-asynchronous model in which the displaced parental H-strand is coated by RNA instead of the abundant mtSSB. Future research will no doubt provide additional definitive tests of these alternative replication models.
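The G-rich character of the human CSB2-related sequence quoted above can be inspected with a short script. This is only a descriptive sketch: it locates uninterrupted G-tracts and computes G+C content, and it makes no prediction about whether RNA-DNA hybrids or guanine tetrads actually form. The function names and the minimum tract length of three are illustrative assumptions.

```python
import re

# Human CSB2-related sequence as quoted in the text (5' to 3').
CSB2_HUMAN = "AAGCGGGGGAGGGGGGGTTTGG"


def g_tracts(seq, min_len=3):
    """Return (start, length) for every run of at least min_len G residues."""
    pattern = re.compile("G{%d,}" % min_len)
    return [(m.start(), len(m.group())) for m in pattern.finditer(seq.upper())]


def gc_fraction(seq):
    """Fraction of residues that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    for start, length in g_tracts(CSB2_HUMAN):
        print(f"G-tract at position {start}, length {length}")
    print(f"G+C fraction: {gc_fraction(CSB2_HUMAN):.2f}")
```

The two long G-tracts separated by a single A are the feature that distinguishes CSB2 in the description above.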

Trans-acting Factors Involved in mtDNA Maintenance

Transcription. Only one RNA polymerase has been described in vertebrate mitochondria (Falkenberg et al. 2007), a single-subunit enzyme of about 140 kDa with homology to T7 RNA polymerase at both the primary sequence and tertiary structural levels. The mitochondrial RNA polymerase (mtRNAP) binds to core promoters along with an accessory factor, TFB2M, to initiate transcription that is markedly stimulated by the adjacent binding of an abundant HMG-box protein, TFAM. TFAM binds and bends mtDNA dramatically to position its C-terminal tail adjacent to the core machinery of TFB2M/mtRNAP. TFAM is a dual-function protein that also plays a significant architectural role in packaging the mtDNA in compact nucleoids. The TFAM gene is essential in the mouse, where tissue-specific knockouts of the gene have provided valuable models for mtDNA depletion. At least one mitochondrial termination factor has been described, MTERF1, that uses a repetitive pumilio-like fold to bind tightly to a sequence element at the end of the tandem rRNA genes. MTERF1 presents a more dramatic block to transcription elongation into the rRNA region, thus acting to reduce the incidence of collisions between two opposing transcription complexes. A related protein, MTERF3, plays an essential role in modulating mitochondrial transcript levels. A large number of proteins better known for their roles in nuclear gene transcription have been reported to be capable of import into mitochondria, but the precise mechanisms whereby these proteins might influence transcription are generally poorly understood.

Replication and repair. Mitochondria employ a single DNA polymerase, DNA pol γ, for both replication and repair. Vertebrate DNA pol γ is a heterotrimer containing one copy of the catalytic subunit, a family A DNA polymerase structurally related to E. coli DNA polymerase I and T7 DNA polymerase, along with a dimeric accessory factor related to prokaryotic tRNA synthetases. This accessory factor, PolγB, alters several properties of the catalytic subunit but is best known as a processivity factor. Mitochondrial SSB, a T7-like DNA helicase (known as C10orf2 or Twinkle), RNase H, and a mitochondrial isoform of DNA ligase III all serve predictable roles as accessory replication factors. The holoenzyme replicates DNA with high fidelity due to the action of its proofreading exonuclease domain but is nevertheless thought to be responsible for a large proportion of mtDNA mutations due to the absence of a highly efficient mismatch repair system in vertebrates. Vertebrate mitochondria also lack nucleotide excision repair machinery that would assist in repair of bulky DNA adducts, such as pyrimidine dimers and covalently bound carcinogens and chemotherapeutic agents. DNA repair is limited to short- and long-patch base excision repair of abasic sites generated by spontaneous base loss or by the action of damage-specific DNA glycosylases. Due to the high rates of depurination and oxidative damage in mitochondria, these repair processes appear to be indispensable.

RNA processing. As noted above, the compact organization of vertebrate mitochondrial genomes requires the synthesis of long primary transcripts containing the sequences specified by the 37 mitochondrial genes. According to the tRNA punctuation model, the initial processing of these transcripts, at least in mammals, involves incision at the 5′ end of the tRNA by an RNase P followed by 3′ incision, generally by RNase ZL (also known as ELAC2). Several RNA termini are not accurately accounted for by this mechanism, indicating that additional research into RNA-processing activities in mitochondria is in order. Interestingly, the key initial factor in RNA processing, RNase P, is well characterized as a ribozyme in the eukaryotic nucleus and in eubacteria, and this is conserved in the yeast S. cerevisiae, which maintains a mtDNA-encoded RNase P RNA. While there is some evidence that vertebrate mitochondria may be able to import an RNase P RNA, a form of mammalian RNase P has been characterized that requires no RNA component. It is possible that recognizing all 22 mitochondrial tRNAs demands a versatility that only more than one form of RNase P can provide. Mitochondrial tRNA biogenesis requires posttranscriptional addition of a CCA tail and numerous internal site-specific modifications including generation of pseudouridine and covalent addition of methyl groups and other side chains. In contrast, the rRNAs in vertebrate mitochondria are thought to feature only a limited number of modifications, far fewer than in bacteria or in cytoplasmic ribosomes. The rRNA modifications that have been studied to date appear to be extremely important for function, as knockout of either the TFB1M 12S rRNA dimethyladenosine methyltransferase or of MTERF4, the binding partner of the NSUN4 16S rRNA methyltransferase, gives an embryonic lethal phenotype.
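The logic of the tRNA punctuation model can be sketched as an interval-splitting problem: every tRNA contributes a 5′ cut (RNase P) and a 3′ cut (RNase Z), and whatever lies between consecutive cuts is released as a separate RNA. The precursor below is a toy. Its gene order and all coordinates are assumptions made for illustration, and only the two rRNA lengths (954 and 1,559 nucleotides) come from the text. Junctions that are not flanked by a tRNA, which the text notes are not explained by this mechanism, are not modeled.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    name: str
    kind: str    # "tRNA", "rRNA", or "mRNA"
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive


def punctuate(precursor_length: int,
              features: List[Feature]) -> List[Tuple[int, int, List[str]]]:
    """Split a polycistronic precursor at the boundaries of every tRNA.

    Returns (start, end, names of features inside) for each product,
    in 5'-to-3' order.
    """
    cut_sites = {0, precursor_length}
    for feature in features:
        if feature.kind == "tRNA":
            cut_sites.add(feature.start)  # 5' incision, attributed to RNase P
            cut_sites.add(feature.end)    # 3' incision, attributed to RNase Z
    bounds = sorted(cut_sites)
    products = []
    for left, right in zip(bounds, bounds[1:]):
        inside = [f.name for f in features
                  if left <= f.start and f.end <= right]
        products.append((left, right, inside))
    return products


# Hypothetical precursor; coordinates are invented for the example.
TOY_PRECURSOR = [
    Feature("tRNA-Phe", "tRNA", 0, 70),
    Feature("12S rRNA", "rRNA", 70, 1024),
    Feature("tRNA-Val", "tRNA", 1024, 1093),
    Feature("16S rRNA", "rRNA", 1093, 2652),
    Feature("tRNA-Leu", "tRNA", 2652, 2727),
    Feature("ND1", "mRNA", 2727, 3683),
]

if __name__ == "__main__":
    for start, end, names in punctuate(3683, TOY_PRECURSOR):
        print(f"{start:>5}-{end:<5} {', '.join(names)}")
```

In this toy layout every rRNA and mRNA is released as a by-product of tRNA excision, which is the "default" generation of larger transcripts that the model proposes.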

Inheritance of Vertebrate mtDNA

The last 20 years have witnessed an explosion of information bearing on the importance of mtDNA mutations in human disease (DiMauro and Schon 2003). This knowledge has, in turn, focused considerable attention on the generation and expansion of mutations within the cellular mtDNA population. Deep sequencing efforts have revealed that the mtDNA population in normal cells is extensively heteroplasmic, with multiple rare sequence variations. We now know that every cell in a vertebrate organism, including those in post-replicative tissues, continually replicates and turns over its mtDNA complement. The rate of turnover has only rarely been studied but has been shown to yield tissue-specific half-lives of 9–31 days in rodents. Some newly arising mtDNA mutations are lost through genome turnover, while others expand through the population. It is well known that a low frequency of mtDNA mutations is typically inconsequential until a high proportion of the mitochondrial genomes in cells are affected. The threshold for such pathology is about 75% mutant mtDNA in tissues with high energy requirements. The high copy number of mtDNA offers some protection against the development of disease as the remaining normal mitochondrial genomes complement one another through the exchange of gene products via mitochondrial fission and fusion. Nevertheless, the continuous turnover of mtDNA provides ample opportunity for somatic mutations to expand through the mtDNA population in a stochastic manner with advancing age. This process is well documented in tumors, which frequently accumulate essentially homoplasmic mtDNA mutations during the many rounds of cell division required for a clonal tumor cell to expand to a detectable size. Whether such mtDNA mutations contribute to tumorigenesis is a subject of active investigation.

These continuous processes of mtDNA mutation and turnover contribute to the high rate of evolutionary divergence of mtDNA sequences. It is reasonable to ask how any species maintains a specific dominant mtDNA sequence. Work from many laboratories has shown that there is an intergenerational bottleneck in the maternal inheritance of mtDNA, as documented first in cattle (Hauswirth and Laipis 1982). This bottleneck reflects a marked contraction in the mtDNA copy number in primordial germ cells during oocyte development. Recent studies tracking the inheritance of mtDNA mutations through multiple generations have revealed a significant purifying selection against deleterious mutations. The fact that pathological mtDNA mutations can be transmitted from a mother to her offspring indicates that this purifying selection is not entirely efficient. Still, this process provides some measure of control over inheritance of deleterious mtDNA mutations.
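Two of the quantities in this section lend themselves to a quick calculation: how much of the mtDNA pool is replaced over a given time for the reported half-lives, and whether a given mutant load exceeds the approximate 75% threshold. The sketch below assumes simple first-order (exponential) turnover, which is a modeling assumption and not a claim from the text; the function names and example copy numbers are likewise hypothetical.

```python
def fraction_replaced(days, half_life_days):
    """Fraction of the original mtDNA pool replaced after `days`.

    Assumes first-order turnover, so half of the remaining original
    molecules are replaced during every half-life.
    """
    return 1.0 - 0.5 ** (days / half_life_days)


def exceeds_threshold(mutant_copies, total_copies, threshold=0.75):
    """True if the mutant fraction reaches the pathogenic threshold."""
    return mutant_copies / total_copies >= threshold


if __name__ == "__main__":
    # Half-lives at the two ends of the 9-31 day range reported for rodents.
    for half_life in (9, 31):
        replaced = fraction_replaced(31, half_life)
        print(f"half-life {half_life} d: {replaced:.0%} replaced after 31 d")
    # Hypothetical cells with 1,000 mtDNA copies each.
    for mutant in (500, 800):
        print(f"{mutant}/1000 mutant copies above threshold: "
              f"{exceeds_threshold(mutant, 1000)}")
```

Under this assumption, a tissue at the fast end of the reported range replaces about 90% of its mtDNA in the time a tissue at the slow end replaces half, which illustrates how much opportunity turnover gives a new variant to be lost or to expand.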
Acknowledgments

Research in the author’s laboratory is supported by grants from the NIH and from the Ellison Medical Foundation.


References

Anderson S, Bankier AT, Barrell BG, deBruijn MHL, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, Schreier PH, Smith AJH, Staden R, Young IG (1981) Sequence and organization of the human mitochondrial genome. Nature 290:457–465
Bibb MJ, Van Etten RA, Wright CT, Walberg MW, Clayton DA (1981) Sequence and gene organization of mouse mitochondrial DNA. Cell 26:167–180
Boore JL (1999) Animal mitochondrial genomes. Nucleic Acids Res 27:1767–1780
Clayton DA (1982) Replication of animal mitochondrial DNA. Cell 28:693–705
D’Onorio de Meo P, D’Antonio M, Griggio F, Lupi R, Borsani M, Pavesi G, Castrignanò T, Pesole G, Gissi C (2011) MitoZoa 2.0: a database resource and search tools for comparative and evolutionary analyses of mitochondrial genomes in Metazoa. Nucleic Acids Res 40:D1168–D1172
DiMauro S, Schon E (2003) Mitochondrial respiratory-chain diseases. N Engl J Med 348:2656–2668
Falkenberg M, Larsson N-G, Gustafsson CM (2007) DNA replication and transcription in mammalian mitochondria. Annu Rev Biochem 76:679–699
Hauswirth WW, Laipis PJ (1982) Mitochondrial DNA polymorphism in a maternal lineage of Holstein cows. Proc Natl Acad Sci U S A 79:4686–4690
Ojala D, Montoya J, Attardi G (1981) tRNA punctuation model of RNA processing in human mitochondria. Nature 290:470–474
Wallace DC (2007) Why do we still have a maternally inherited mitochondrial DNA? Insights from evolutionary medicine. Annu Rev Biochem 76:781–821

Mitochondrial Genomes of Chromists (Stramenopiles, Haptophytes and Cryptophytes)

Megan A. O’Brien and Christopher E. Lane
Department of Biological Sciences, University of Rhode Island, Kingston, RI, USA

Synopsis

Twenty-seven mitochondrial genome sequences are available from the Kingdom Chromista: 24 from stramenopiles, one from a haptophyte, and two from cryptophytes. Mitochondrial gene order and arrangement are remarkably varied across the chromists, with the mtDNA containing genes specifying 2–3 rRNAs, 27–52 proteins, and 16–27 tRNAs. The mitochondrial genomes of chromists are generally closely packed, with small intergenic spaces, few introns, and occasional repeats. Canonical chromist mitochondrial genomes are predominantly circular mapping, and the overall size ranges from 27.7 to 77.3 kb. Many features of chromist mitochondria are unique and are not found outside this group of organisms.

Introduction Kingdom Chromista has been recognized since 1981 and contains mostly aquatic and photosynthetic species. Chromista is divided into three major groups: stramenopiles (including brown algae, diatoms, and chrysophytes), haptophytes (coccolithophores), and cryptophytes (marine and freshwater phytoplankton). Recent phylogenetic analyses have suggested that Chromista is actually paraphyletic (Fig. 1), with stramenopiles grouping with alveolates (Phylum Alveolata), while cryptophytes and haptophytes form a separate monophyletic group (Burki et al. 2007). Tripartite tubular flagellar hairs (called mastigonemes) and/or chloroplasts surrounded by a periplastid membrane within the rough endoplasmic reticulum lumen are features that are shared among chromists. Additionally, chromists contain chlorophyll c and auxiliary pigments such as fucoxanthin, peridinin, and other xanthophylls, which give them their characteristic gold/brown color. The photosynthetic members of Chromista include brown algae, diatoms, chrysophytes, raphidophytes, xanthophytes, coccolithophores, cryptophytes, and a variety of lesser-known phytoplankton. Several classes of chromists are non-photosynthetic and non-pigmented, but most contain a plastid of red algal endosymbiotic origin, with a four-membrane plastid envelope. The most striking evidence of secondary endosymbiosis is the presence of a miniaturized red algal nucleus, termed a nucleomorph, in the cryptophytes (Archibald 2007). There are no known non-photosynthetic haptophytes, but the oomycetes, labyrinthulids, and bicosoecids are part of a large assemblage of non-photosynthetic stramenopiles. The genus Goniomonas and several species of Cryptomonas (currently all recognized as C. paramecium) comprise the non-photosynthetic cryptophytes. Energy in these organisms is not derived from metabolism of starch but instead from stored leucosin (protein) or laminarin (polysaccharide).

Mitochondrial Genomes of Chromists (Stramenopiles, Haptophytes and Cryptophytes), Fig. 1 Major eukaryotic groups and their evolutionary relationships. Chromist taxa are signified in bold type, including the cryptophytes, haptophytes, and photosynthetic members of the stramenopiles. Relationships among the chromalveolate taxa (shown in blue) remain unresolved, as indicated by the dashed lines

[Fig. 1 tree labels. Major eukaryotic groups: Excavata, Alveolata, Amoebozoa, Rhizaria, Opisthokonta (animals, fungi), Archaeplastida (red & green algae, land plants), Stramenopiles, Haptophyta, Cryptophyta. Stramenopile lineages: Bacillariophyta, Bicosoecida, Chrysophyceae, Dictyochophyceae, Eustigmatales, Hypochytriales, Labyrinthulomycetes, Opalinata, Pelagophyceae, Peronosporomycetes, Phaeophyceae, Pinguiochrysidales, Raphidophyceae, Synurales, Xanthophyceae]


Chromista Mitochondria Of the nearly 500 eukaryotic mitochondrial genomes sequenced to date, only 10% are from protists, of which fewer than 30 are from Chromista. A typical chromist mitochondrial genome contains genes for 2–3 rRNAs, 27–52 proteins, and 16–27 tRNAs (Table 1) and is generally closely packed, with small intergenic spaces, few introns, and occasional repeats. The overall size of chromist mtDNAs ranges from 27.7 to 77.3 kb. All chromist mitochondrial genomes are circular mapping with the exception of those in the stramenopiles Ochromonas danica and Proteromonas lacertae, which are linear and bracketed by two large inverted repeats (Pérez-Brocal et al. 2010). The circular-mapping mitochondrial genome of the oomycete Saprolegnia ferax also contains two large inverted repeats (Martin et al. 2007). Many unique features, not found outside this group of organisms, characterize chromist mitochondrial genomes. Intergenic regions greater than 5,000 bp have arisen independently only a few times across all eukaryotes, including in the cryptophytes Hemiselmis andersenii and Rhodomonas salina (Kim et al. 2008). Interspersed with palindromic sequences capable of forming similar stem-loop hairpins, the large repeat region likely arose prior to the diversification of the cryptophytes. Mitochondrial DNAs in the two red tide-forming raphidophycean algae, Heterosigma akashiwo and Chattonella marina, independently experienced partial genome duplication events after separation of the genera. Both genomes contain seven open reading frames (ORFs) of unidentified function, with individual ORFs in one genome having apparent homologs in the other genome (Masuda et al. 2011). Interestingly, mtDNA in the haptophyte Emiliania huxleyi encodes a DNA adenine methyltransferase (dam) gene (Sanchez-Puerta et al. 2004), which is not present in any other sequenced mitochondrial genome.
The dam gene produces an enzyme that catalyzes the transfer of a methyl group from S-adenosyl-L-methionine to the N6 position of a specific adenine in its target DNA sequence. Blastocystis, an anaerobic parasite found in the gut of humans, has the smallest mitochondrial genome and most reduced gene collection within chromists, containing only 60–65% of the typical set of chromist tRNA genes and missing the fundamental cox1, cox2, cox3, cob, atp6, atp8, and atp9 genes that form an essential part of a functional electron transport chain. This genome is contained within the mitochondrion-related organelle (MRO) of Blastocystis, which is considered to represent an intermediate stage between classical mitochondria and the highly reduced MROs, hydrogenosomes, and mitosomes of other anaerobic protists (see Hjort et al. 2010). Similarly, the mitochondrial genome of Proteromonas lacertae, another Opalinata species, also lacks cox, cob, and atp genes (Pérez-Brocal et al. 2010). With only one representative mitochondrial genome sequence for haptophytes and two for cryptophytes, no general mtDNA characteristics can be inferred within these two groups. The single haptophyte mtDNA, that of E. huxleyi, lacks nad7, nad9, nad11, and atp8 as well as many ribosomal protein genes but interestingly retains atp4, which is rare in mitochondrial genomes (Sanchez-Puerta et al. 2004). Between the two cryptophyte mitochondrial genomes, there have been approximately 31 rearrangements, and the genomes have large blocks (4.7 and 20 kb) of unrelated direct repeat sequences (Kim et al. 2008). The genes rps1, atp4, sdh4, and tatA, found infrequently in mitochondrial genomes, are nevertheless present in R. salina mtDNA (Hauth et al. 2005). R. salina mtDNA also harbors two group II introns, but the H. andersenii mtDNA is devoid of introns, and all genes are encoded on the same strand (Kim et al. 2008). Sequenced stramenopile mitochondrial genomes include 3 Bacillariophyta, 1 Chrysophyceae, 1 Synurales, 9 Phaeophyceae, 4 Peronosporomycetes, 1 Bicosoecida, 2 Raphidophyceae, and 3 Opalinata. Once thought to have a narrow range of variation (Chesnick et al. 2000), stramenopile mitochondrial genomes are now seen to exhibit considerable variation in size, gene content, and overall arrangement. Stramenopile mtDNAs encode 2–3 rRNAs, 27–52 proteins, and 16–27 tRNAs, with mitochondrial gene order highly varied across the lineage. All nine Phaeophyceae mitochondrial genomes include genes for three rRNAs (23S, 16S, and 5S),


Mitochondrial Genomes of Chromists (Stramenopiles, Haptophytes and Cryptophytes), Table 1 Characteristics of chromist mitochondrial genomes. Four Saccharina species (S. japonica, S. longipedalis, S. angustata, and S. coriacea) are combined for Saccharina spp. Two strains of Blastocystis are reported as Blastocystis spp.

Species | Genome size (bp) | Gene number (# tRNA genes) | 5S rRNA | % GC | Circular | Group II introns | Inverted repeats | Repeat region (kb)
Haptophyte
Emiliania huxleyi | 29,013 | 48 (25) | No | 28 | Yes | No | No | -
Cryptophytes
Hemiselmis andersenii | 60,553 | 66 (28) | No | 29 | Yes | No | No | 19.7
Rhodomonas salina | 48,063 | 69 (27) | No | 29 | Yes | 2 | Yes | 4.7
Stramenopiles
Phaeodactylum tricornutum | 77,356 | 60 (24) | No | 35 | Yes | 4 | No | 35.4
Synedra acus | 46,657 | 59 (24) | No | 32 | Yes | 3 | No | 4.9
Thalassiosira pseudonana | 43,827 | 61 (25) | No | 30 | Yes | 1 | Yes | 5.0
Heterosigma akashiwo | 38,690 | 69 (25) | Yes | 36 | Yes | No | No | -
Chattonella marina | 44,772 | 69 (25) | Yes | 28 | Yes | 2 | No | -
Saccharina spp. | 37,500–37,657 | 66 (25) | Yes | 35 | Yes | No | No | -
Desmarestia viridis | 39,049 | 68 (26) | Yes | 37 | Yes | No | No | -
Dictyota dichotoma | 31,617 | 66 (25) | Yes | 37 | Yes | No | No | -
Fucus vesiculosus | 36,392 | 67 (26) | Yes | 35 | Yes | No | No | -
Laminaria digitata | 38,007 | 67 (25) | Yes | 35 | Yes | No | No | -
Pylaiella littoralis | 58,507 | 79 (23) | Yes | 38 | Yes | 7 | No | -
Cafeteria roenbergensis | 43,159 | 58 (22) | No | 27 | Yes | No | No | -
Ochromonas danica | 41,035 | 75 (29) | No | 26 | Linear | No | Yes | -
Chrysodidymus synuroideus | 34,119 | 61 (23) | No | 24 | Yes | No | No | -
Saprolegnia ferax | 46,930 | 77 (22) | No | 23 | Yes | No | Yes | -
Phytophthora ramorum | 39,314 | 67 (25) | No | 22 | Yes | No | No | -
Phytophthora sojae | 42,977 | 67 (25) | No | 22 | Yes | No | No | -
Phytophthora infestans | 37,957 | 67 (25) | No | 22 | Yes | No | No | -
Proteromonas lacertae | 48,663 | 53 (23) | No | 21 | Linear | No | Yes | -
Blastocystis spp. | 27,719–28,382 | 45 (16) | No | 22 | Yes | No | No | -


25–26 tRNAs, and 35 identifiable proteins, as well as two ORFs unique to the mtDNAs of all brown algae (Oudot-Le Secq et al. 2001; Yotsukura et al. 2010). Within the chromists, retention of the mtDNA-encoded rpl31 gene is exclusive to Phaeophyceae, while 5S rRNA and rpl5 genes are found in both phaeophycean and raphidophycean mitochondrial genomes (Oudot-Le Secq et al. 2001; Masuda et al. 2011). The brown alga Pylaiella littoralis possesses the second-largest stramenopile mitochondrial genome (at 58.5 kb) sequenced to date, with an expanded set of 79 genes (Oudot-Le Secq et al. 2001). The large mtDNA size is partly attributable to several lateral transfers into the genome of phage-like RNA polymerase genes and group II introns (Oudot-Le Secq et al. 2001). Pylaiella littoralis mtDNA has also retained many primitive features, including α-proteobacteria-like promoters and clustered ribosomal protein genes. Within the oomycetes, mitochondrial genomes from three Phytophthora species and one Saprolegnia sp. encode two rRNAs, 25–26 tRNAs, and 25–35 proteins (Martin et al. 2007). Whereas 67 coding regions are shared and the gene composition is relatively constant, these oomycete genomes display considerable variation in gene order due to inversions and repeats. The nad11 gene employs a unique stop codon (TGA) compared to all other genes in the Phytophthora mitochondrial genomes, which use TAA, as does nad11 in Saprolegnia ferax mtDNA (Martin et al. 2007). Both Phytophthora ramorum and Saprolegnia ferax contain inverted repeats: 1,150 bp in P. ramorum (containing trnR(ugu) and the unique orf176) and 8,618 bp in S. ferax (containing rnl, rns, rps7, rps10, nad2, part of nad5, and five trn genes). Gene order is highly conserved within the genus Phytophthora, the only difference being two inversions containing several genes in P. infestans. Within chromists, the atp1 gene is found only in the oomycete mitochondrial genomes, and there are six ORFs that are shared across all three Phytophthora genomes, with two having homologs in S. ferax. The three sequenced Bacillariophyta mitochondrial genomes represent the three major categories of diatoms: the centrics, bilateral pennates, and raphid pennates. The Bacillariophyta mitochondrial genomes specify two rRNAs, 24–25 tRNAs, and 35 proteins (Oudot-Le Secq and Green 2011). The Phaeodactylum tricornutum mitochondrial genome is the largest chromist mitochondrial genome characterized to date, almost double the size of other diatom mitochondrial genomes due to the presence of a 35.5-kb block of direct repeats. Thalassiosira pseudonana mtDNA contains a 5 kb region mainly composed of short tandem repeats and flanked by a single 183 bp inverted repeat, whereas mtDNA in Synedra acus also has a 5 kb repeat composed primarily of two 1.3 kb direct repeats and a few tandem repeats. The P. tricornutum cox1 gene has a +1 translational frameshift (a change from CCC to CCT), at which the ribosome skips a single nucleotide and continues in another reading frame; this is the first frameshift of this kind recognized in an algal species. Additionally, the P. tricornutum nad11 gene is split by 27 bp into two ORFs. In the much-debated Kingdom Chromista, comparative mitochondrial genomics can provide additional evidence to resolve the evolutionary history of the groups within this lineage. Although Chromista is one of the largest eukaryotic kingdoms, disproportionately few of its mitochondrial genomes have been sequenced. In view of the considerable amount of gene rearrangement and variation in gene composition, mitochondrial genome sequences from additional chromist species will have to be obtained for effective and evolutionarily informative comparative genomic studies. On a broader scale, the high variation and numerous unique features observed in chromist mitochondrial genomes provide a wide array of opportunities to study organellar genome evolution.
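The summary figures quoted in this entry (27 genomes, 24 of them stramenopile, and a size range of roughly 27.7 to 77.3 kb) can be re-derived from the Table 1 genome sizes. The following is a minimal sketch; the tuple layout and function names are hypothetical, and the Saccharina and Blastocystis rows are stored as ranges with the number of genomes they combine.

```python
from collections import Counter

# (species, group, smallest size in bp, largest size in bp, genomes in row),
# transcribed from Table 1.
TABLE1 = [
    ("Emiliania huxleyi", "haptophyte", 29013, 29013, 1),
    ("Hemiselmis andersenii", "cryptophyte", 60553, 60553, 1),
    ("Rhodomonas salina", "cryptophyte", 48063, 48063, 1),
    ("Phaeodactylum tricornutum", "stramenopile", 77356, 77356, 1),
    ("Synedra acus", "stramenopile", 46657, 46657, 1),
    ("Thalassiosira pseudonana", "stramenopile", 43827, 43827, 1),
    ("Heterosigma akashiwo", "stramenopile", 38690, 38690, 1),
    ("Chattonella marina", "stramenopile", 44772, 44772, 1),
    ("Saccharina spp.", "stramenopile", 37500, 37657, 4),
    ("Desmarestia viridis", "stramenopile", 39049, 39049, 1),
    ("Dictyota dichotoma", "stramenopile", 31617, 31617, 1),
    ("Fucus vesiculosus", "stramenopile", 36392, 36392, 1),
    ("Laminaria digitata", "stramenopile", 38007, 38007, 1),
    ("Pylaiella littoralis", "stramenopile", 58507, 58507, 1),
    ("Cafeteria roenbergensis", "stramenopile", 43159, 43159, 1),
    ("Ochromonas danica", "stramenopile", 41035, 41035, 1),
    ("Chrysodidymus synuroideus", "stramenopile", 34119, 34119, 1),
    ("Saprolegnia ferax", "stramenopile", 46930, 46930, 1),
    ("Phytophthora ramorum", "stramenopile", 39314, 39314, 1),
    ("Phytophthora sojae", "stramenopile", 42977, 42977, 1),
    ("Phytophthora infestans", "stramenopile", 37957, 37957, 1),
    ("Proteromonas lacertae", "stramenopile", 48663, 48663, 1),
    ("Blastocystis spp.", "stramenopile", 27719, 28382, 2),
]


def size_range_bp(rows):
    """Return (smallest, largest) genome size in bp across all rows."""
    return min(r[2] for r in rows), max(r[3] for r in rows)


def genomes_per_group(rows):
    """Count sequenced genomes per major chromist group."""
    counts = Counter()
    for _, group, _, _, n_genomes in rows:
        counts[group] += n_genomes
    return counts
```

For example, `size_range_bp(TABLE1)` returns the two extremes in base pairs, and `sum(genomes_per_group(TABLE1).values())` gives the total number of genomes the table represents.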

Cross-References ▶ Mitochondrial Genomes ▶ Mitochondrial Genomes in Alveolates ▶ Mitochondrial Genomes of Excavata ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae


References
Archibald JM (2007) Nucleomorph genomes: structure, function, origin and evolution. Bioessays 29:392–402
Burki F, Shalchian-Tabrizi K, Minge M, Skjæveland Å, Nikolaev SI, Jakobsen KS, Pawlowski J (2007) Phylogenomics reshuffles the eukaryotic supergroups. PLoS One 2:e790
Chesnick JM, Goff M, Graham J, Ocampa C, Lang BF, Seif E, Burger G (2000) The mitochondrial genome of the stramenopile alga Chrysodidymus synuroideus. Complete sequence, gene content and genome organization. Nucleic Acids Res 28:2512–2518
Hauth AM, Maier UG, Lang BF, Burger G (2005) The Rhodomonas salina mitochondrial genome: bacteria-like operons, compact gene arrangement and complex repeat region. Nucleic Acids Res 33:4433–4442
Hjort K, Goldberg AV, Tsaousis AD, Hirt RP, Embley TM (2010) Diversity and reductive evolution of mitochondria among microbial eukaryotes. Philos Trans R Soc Lond B Biol Sci 365:713–727
Kim E, Lane CE, Curtis BA, Kozera C, Bowman S, Archibald JM (2008) Complete sequence and analysis of the mitochondrial genome of Hemiselmis andersenii CCMP644 (Cryptophyceae). BMC Genomics 9:215
Martin FN, Bensasson D, Tyler BM, Boore JL (2007) Mitochondrial genome sequences and comparative genomics of Phytophthora ramorum and P. sojae. Curr Genet 51:285–296
Masuda S, Kamikawa R, Ueda M, Oyama K, Yoshimatsu S, Inagaki Y, Sako Y (2011) Mitochondrial genomes from two red tide forming raphidophycean algae Heterosigma akashiwo and Chattonella marina var. marina. Harmful Algae 10:130–137
Oudot-Le Secq M-P, Green BR (2011) Complex repeat structures and novel features in the mitochondrial genomes of the diatoms Phaeodactylum tricornutum and Thalassiosira pseudonana. Gene 476:20–26
Oudot-Le Secq M-P, Fontaine JM, Rousvoal S, Kloareg B, Loiseaux-de Goër S (2001) The complete sequence of a brown algal mitochondrial genome, the ectocarpale Pylaiella littoralis (L.) Kjellm. J Mol Evol 53:80–88
Pérez-Brocal V, Shahar-Golan R, Clark CG (2010) A linear molecule with two large inverted repeats: the mitochondrial genome of the stramenopile Proteromonas lacertae. Genome Biol Evol 2:257–266
Sanchez-Puerta MV, Bachvaroff TR, Delwiche CW (2004) The complete mitochondrial genome sequence of the haptophyte Emiliania huxleyi and its relation to heterokonts. DNA Res 11:1–10
Yotsukura N, Shimizu T, Katayama T, Druehl LD (2010) Mitochondrial DNA sequence variation of four Saccharina species (Laminariales, Phaeophyceae) growing in Japan. J Appl Phycol 22:243–251

Mitochondrial Genomes of Excavata Julius Lukeš Biology Centre, Institute of Parasitology, Czech Academy of Sciences, and Faculty of Science, University of South Bohemia, České Budĕjovice, Czech Republic

Synopsis Supergroup Excavata comprises a variety of biflagellated aerobic and anaerobic protists whose defining characteristic is a central grooved cytostome (cell mouth). Of all the eukaryotic supergroups, excavates exhibit the widest variety of mitochondrial genome forms and gene content. One group, the jakobids, possesses the most ancestral (least derived) mitochondrial genome yet characterized: a circular-mapping DNA containing the largest known mitochondrial gene set and displaying bacterial operon-like gene arrangements and expression signals. At the other extreme, certain excavates have lost mtDNA entirely. One excavate phylum, Euglenozoa, contains three well-delineated lineages: kinetoplastids, diplonemids, and euglenids. Some kinetoplastids, including many parasitic genera such as Trypanosoma (the causative agent of African sleeping sickness), have a complex kinetoplast DNA that consists of an interlocked mass of maxicircles and minicircles. Maxicircles correspond to the mitochondrial genome in other organisms, but their protein-coding genes are “cryptic” in that their transcripts must undergo an intricate process of posttranscriptional uridine insertion/deletion editing in order to become translatable. Minicircles encode small guide RNAs that provide the information in trans for this RNA editing. In Diplonema papillatum (a diplonemid), the mtDNA consists of numerous small circular molecules encoding only portions of genes, whose transcripts must be correctly trans-spliced in order to be translated. In contrast, in Euglena gracilis (a euglenid), the mtDNA takes the form of small linear fragments, and the few identified genes are embedded in a sea of noncoding repeated sequence and interspersed with small fragments of authentic genes.

Introduction Excavata is a taxonomic unit usually considered to be at the rank of a superkingdom, comprising aerobic and anaerobic protists having a central cytostome in the form of a groove. Major phyla with representative genera include Euglenozoa (Euglena, Trypanosoma), Heterolobosea (Naegleria), Jakobida (Jakoba, Reclinomonas), Preaxostyla (Trimastix), Fornicata (Giardia), and Parabasalia (Trichomonas). In most cases, one flagellum is positioned to bring food into the groove, while the second, forward-directed flagellum propels the organism. However, numerous members of this group have lost some or all of these features and are placed in this superkingdom on the basis of molecular phylogenetic evidence only (Hampl et al. 2009). Excavata is of particular interest because of its possibly basal position in the eukaryotic phylogenetic tree. Because light and electron microscopic studies have failed to find typical cristae-containing, double-membrane-bound vesicles in a number of excavates, these were at one time considered to be amitochondriate protists that had branched off the eukaryotic tree before the endosymbiotic event that led to mitochondria. However, with the advent of molecular methods, which have been instrumental in the identification of mitochondrial or mitochondrion-derived prokaryotic genes in all carefully examined excavates, this view has been abandoned. At present, organelles containing the products of “mitochondrial” genes are classified into three morphologically and functionally distinct categories: mitochondrion, hydrogenosome, and mitosome. It is worth noting, however, that the distinctions among these three categories are not as clearly defined as was earlier thought. Recent studies indicate the existence of a continuum of organelles, in which a conventional mitochondrion represents one extreme and a highly reduced mitosome the other. Therefore, given the current paucity of information, the conservative approach is to refer to the cristae-deficient vesicles of most excavates as “mitochondrion-related organelles” (MROs).

Mitochondrial Genome Diversity in Excavates In Excavata, only the ultrastructurally typical mitochondria, usually containing respiratory complexes, are known to contain mtDNA. Interestingly, in terms of amount of DNA, number of (protein-coding) genes, and overall organization, the mitochondrial genome is extremely variable in these protists (Gray et al. 2004). In fact, one can argue that excavates exhibit higher mitochondrial genome diversity than the rest of eukaryotes combined: one group, the jakobids, harbors extremely gene-rich mitochondrial genomes, whereas the “petite” mutants of trypanosomes lack any organellar DNA whatsoever. Apparently, the progressive transfer of genes from the endosymbiont-derived mitochondrion to the cell nucleus occurred at dramatically different rates in the various lineages of Excavata. Sequencing of the entire circular-mapping mitochondrial genome of one excavate, the jakobid Reclinomonas americana, was a true breakthrough. The total number of 98 mitochondrial genes, of which 67 are protein coding, greatly surpassed the highest number of mitochondrion-encoded genes known up to that time (Lang et al. 1997) and still represents the largest known gene repertoire for any mitochondrial genome. While most of the genes in R. americana mtDNA code for subunits of respiratory complexes, 18 of them, including 4 subunits of a eubacterial-type RNA polymerase, were found to be associated with a mitochondrial genome for the first time (Lang et al. 1997). Since all other eukaryotes encode a single-subunit, phage T3/T7-like mitochondrial RNA polymerase in their nucleus, the most plausible explanation of these findings is that, in all other eukaryotes, the nucleus-encoded, phage-type RNA polymerase replaced the original mitochondrion-encoded, eubacterial enzyme, which is retained only in R. americana.


Subsequently, mitochondrial genomes of a number of other jakobids have been sequenced, revealing that indeed all of them are extremely gene- as well as A+T-rich, with only relatively small differences among the encoded gene sets (Gray et al. 2004). The mitochondrial genome of Jakoba libera (100 kb in size) is notable for its linearity and the telomere-like repeats that cap both ends. Finally, RNA editing has been noted in two tRNAs encoded by the mitochondrial genome of another jakobid, Seculamonas ecuadoriensis. While the MRO (hydrogenosome) of the anaerobic ciliate Nyctotherus possesses a genome (Akhmatova et al. 1998), no DNA is present in the homologous organelle of the excavate Trichomonas and related protists. The well-studied MROs of the latter organism were originally characterized as molecular-hydrogen-producing vesicles but were recently shown also to be the site of iron-sulfur cluster synthesis and perhaps other processes (Tachezy 2008). MROs (mitosomes) are found in another excavate, the anaerobic intestinal parasite Giardia spp., but similar organelles evolved independently in other eukaryotic lineages as well. Although early studies claimed that the mitosomes of the non-excavate Entamoeba histolytica contain DNA, the present view is that no DNA has been retained in any of the mitosomes studied so far. MROs have been observed in ultrastructural sections of a number of free-living excavates (e.g., Trimastix, Carpediemonas, Retortamonas, Dysnectes, and Chilomastix), as well as their parasitic or commensal relatives (e.g., diplomonads, retortamonads, oxymonads, and parabasalids). Biochemical properties of these organelles remain largely unknown, but in all cases where extensive expressed sequence tag projects were launched, homologues of genes coding for mitochondrial proteins have been found, testifying to an origin of these genome-lacking organelles from an ancestral, genome-containing mitochondrion (Tachezy 2008).

Mitochondrial DNA in Kinetoplastids Kinetoplastid flagellates (phylum Euglenozoa) harbor a large and complex mass of DNA, termed kinetoplast (k) DNA, composed of so-called maxicircles and minicircles. Maxicircles are the counterpart of a typical mitochondrial genome (Jensen and Englund 2012), as they encode mitoribosomal RNAs as well as protein subunits of respiratory complexes and mitochondrial ribosomes; however, the function of minicircles was established only after the discovery of RNA editing (Stuart et al. 2005) (see below). Kinetoplast DNA exists principally in two different arrangements: as free DNA circles in the suborder Bodonidae and as a single, three-dimensional kDNA network in the suborder Trypanosomatidae (Lukeš et al. 2002). Bodonidae is a paraphyletic group ancestral to Trypanosomatidae and includes parasitic, commensal, and free-living protists with two flagella. The kDNA in these groups comes in several forms: (1) pro-kDNA, found in Bodo spp., is characterized by non-interlocked relaxed DNA circles, concentrated in a single bundle located near the basal body of the flagella; (2) pan-kDNA, described in members of the genus Cryptobia, is composed of tens of thousands of minicircles that are either free or form small catenanes, are invariably supercoiled, and are evenly distributed throughout the mitochondrial lumen; (3) poly-kDNA is characteristic of Dimastigella spp. and Cruzella marina, the minicircles of which are non-interlocked and relaxed and distributed throughout the organellar lumen at multiple foci; (4) finally, in Trypanoplasma borreli, kDNA exists in a unique arrangement, termed mega-kDNA, in which minicircles are tandemly linked into large circular molecules, more or less evenly spread throughout the mitochondrial lumen. In most bodonids, the kDNA is truly gigantic, in size approaching that of the respective nuclear genomes (Lukeš et al. 2002).
It is likely that yet another type of kDNA organization will be described in Perkinsella (or Ichthyobodo-like organism), an early-branching aflagellar kinetoplastid that became an endosymbiont of another protist, Neoparamoeba. With regard to kDNA structure, the situation is simpler in the other group within Kinetoplastida, Trypanosomatidae, which encompasses solely parasitic forms. Some, such as Trypanosoma spp., Leishmania spp., and Phytomonas spp., are serious pathogens of humans, other vertebrates, and economically important plants, while others are monoxenous parasites of insects. In Trypanosoma brucei, the most intensively studied kinetoplastid and causative agent of African sleeping sickness, and in the model flagellate, Crithidia fasciculata, replication and maintenance of the kDNA network have been investigated in great detail. The maxicircles and minicircles are mutually interlocked into a single huge network within the mitochondrion, densely packed into a disk located close to the basal body of the flagellum (Jensen and Englund 2012). The diameter and the thickness of the disk are correlated with the number and size of minicircles, respectively. A unique feature of the kDNA circles is that they are not supercoiled but relaxed. Their replication relies on a highly organized enzymatic machinery: individual circles are released from the network, replicated in a specialized zone, and reattached to the network at two antipodal sites. So far, six DNA polymerases and six DNA helicases have been identified as part of the replication machinery, which is estimated to comprise more than 100 different proteins, including topoisomerases, primases, ligases, and single- and double-strand-binding proteins. Somewhat surprisingly, only a single RNA polymerase is responsible for the transcription of both maxicircles and minicircles (Jensen and Englund 2012). It has been known for decades that some Trypanosoma species contain less kDNA than others. Some have lost parts of or the entire maxicircle component but have retained minicircles (=dyskinetoplastic trypanosomes), whereas others have lost all kDNA (=akinetoplastic trypanosomes). Only recently was it shown that these forms are to various extents “petite” mutants of T. brucei and that the gradual loss of kDNA can also be induced under laboratory conditions (Lai et al. 2008).
As mentioned above, most maxicircle-encoded genes exist in an encrypted form, which means that their protein-coding transcripts have to be extensively edited via a complex process of uridine (U) insertion and deletion in order to become translatable. RNA editing occurs in all studied kinetoplastids; however, its extent varies depending on species and gene. Although RNA editing has been extensively studied only in T. brucei and Leishmania tarentolae, available data indicate that the underlying mechanism is highly conserved within Kinetoplastida. Information for the posttranscriptional, site-specific U insertions into and U deletions from pre-edited mRNAs is provided by small (50–70-nucleotide-long) RNA molecules termed guide RNAs, which are encoded predominantly in the kDNA minicircles (Stuart et al. 2005). Moreover, several large protein complexes assist in this intricate task. In T. brucei, it is estimated that up to 1,000 different guide RNAs and over 70 proteins are involved in the editing and processing of 18 maxicircle-encoded transcripts. The best-studied complex is the editosome, which encompasses the main enzymatic activities for editing events, in particular (1) cleavage of pre-edited mRNA at the editing site, specified by guide RNA; (2) U insertion or deletion by a terminal uridylyl transferase or exonuclease, respectively; and (3) sealing by RNA ligase (Aphasizhev and Aphasizheva 2011). The MRP1/2 complex is thought to facilitate the formation of duplexes between guide RNAs and pre-edited mRNAs (Stuart et al. 2005). A mitochondrial poly-A polymerase (kPAP) complex modulates, in collaboration with additional proteins, the synthesis of 3′ poly-A/U tails in pre-edited, edited, and never-edited transcripts (Aphasizhev and Aphasizheva 2011). Finally, a recently described mitochondrial RNA binding complex (MRB1) is composed of dozens of proteins that usually lack any recognizable domains and are conserved only within the kinetoplastid flagellates. The composition of the MRB1 complex is uncertain, as different purification protocols lead to different sets of protein subunits.
Moreover, knockdowns for individual subunits result in dramatically different phenotypes, ranging from general destabilization of guide RNAs through impact on all mRNAs to a limited effect on partially or extensively edited transcripts (Aphasizhev and Aphasizheva 2011). Evidently, additional proteins awaiting discovery are involved in the byzantine editing and processing of mitochondrial transcripts in trypanosomes. Although the evolutionary advantage of this U insertion/deletion type of editing is unclear, the edited transcripts are translated in the organelle, and the resulting functional proteins are incorporated into mitochondrial respiratory complexes and ribosomes, as in conventional mitochondria in which no such editing process exists.
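The three editosome steps listed above (cleavage, U insertion or deletion, ligation) can be sketched as string operations on a pre-edited transcript. This is a toy illustration only: the `(site, delta)` edit list is a hypothetical stand-in for the information a guide RNA supplies, sites are assumed not to overlap, and the function name is invented for this example.

```python
def edit_transcript(pre_mrna, edits):
    """Apply U insertions/deletions to a pre-edited mRNA (5'->3' string).

    `edits` is a list of (site, delta) pairs. `site` is a 0-based index in
    the pre-edited RNA; cleavage occurs just 5' of that index. A positive
    `delta` inserts that many U residues at the cut; a negative `delta`
    removes that many U residues from the 3' end of the 5' fragment.
    Sites are processed from the 3' end toward the 5' end, so indices
    into the pre-edited sequence stay valid throughout.
    """
    rna = pre_mrna
    for site, delta in sorted(edits, reverse=True):
        five_prime, three_prime = rna[:site], rna[site:]  # (1) cleavage
        if delta > 0:
            five_prime += "U" * delta                     # (2) U insertion
        elif delta < 0:
            n = -delta
            if five_prime[-n:] != "U" * n:
                raise ValueError(f"no run of {n} U residues 5' of site {site}")
            five_prime = five_prime[:-n]                  # (2) U deletion
        rna = five_prime + three_prime                    # (3) ligation
    return rna


edited = edit_transcript("AUGGAUUUCG", [(3, 2), (8, -2)])
```

Here two U residues are inserted before position 3 and two are removed before position 8, so `edited` is `"AUGUUGAUCG"`. Real editing is directed by base pairing between the guide RNA and the mRNA, which this sketch does not model.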

Mitochondrial DNA in Euglenids and Diplonemids

In addition to Kinetoplastida, the phylum Euglenozoa contains two other well-defined monophyletic groups, Euglenida and Diplonemida. Euglenids are free-living, ecologically significant protists carrying a secondarily acquired green plastid, which has been lost in some lineages. Diplonemids are a small, poorly studied group of parasites and commensals that branches off between kinetoplastids and euglenids and has recently been considered to be more closely related to the former.

Although the plastid genome of Euglena gracilis was among the first plastid DNAs to be completely sequenced, the mitochondrial genome of this protist, like those of related species, has proven very refractory to study, and our knowledge about it remains quite limited. Since the description of the mtDNA-encoded gene specifying cytochrome c oxidase subunit 1 (cox1) in the mid-1990s (Refs. 294 and 320 in Gray et al. 2004), only one study has been published further characterizing E. gracilis mtDNA (Spencer and Gray 2011). The mitochondrial genome of E. gracilis is represented by a collection of heterodisperse linear DNA fragments, most of which are about 4 kb long, although small fragments centered around 7.5 kb were also seen. This observation is in agreement with early studies of the physical structure of E. gracilis mtDNA, in which only small DNA pieces were detected. In E. gracilis mtDNA, genes encoding all three subunits of the cytochrome c oxidase complex (cox1, cox2, and cox3) have been found, in most cases flanked by repeat sequences of varying size. However, full-size open reading frames were encountered rather rarely, whereas DNA fragments containing small pieces of the cox genes (as well as rRNA genes), in various arrangements, were abundant. Transcripts of protein-coding genes appear not to require RNA editing or RNA splicing. The small and large mitoribosomal rRNAs are each present in the form of two fragments, yet their transcripts do not appear to be spliced together. Each half of the small subunit rRNA is encoded by a separately transcribed subgenomic module.

Although diplonemids are less well known than euglenids in all other aspects of cellular and molecular biology, a substantial amount of information is available about diplonemid mtDNA. Described by Vlček et al. (2011) as “[a]rguably the most bizarre mitochondrial DNA,” the mitochondrial genome of Diplonema papillatum is located in a single large reticulated organelle, which is characterized by a low number of exceptionally large and flat mitochondrial cristae (Marande et al. 2005). The organellar DNA is not concentrated in a single or a few locations, as in the related kinetoplastids, but is distributed in a seemingly random fashion throughout the lumen of the mitochondrion. Staining with DNA-specific dyes strongly indicates that the amount of mtDNA is extremely high and may even approach that of the nuclear genome. Electron microscopy and agarose gel electrophoresis of the mtDNA, separated from nuclear DNA, revealed that it comprises two categories of circular chromosomes, 6 and 7 kb in size (designated A and B, respectively), that exist as relaxed and supercoiled circles, each present in multiple copies in free form (Marande et al. 2005).

While the sequence data obtained in conjunction with this initial study already revealed that at least the cox1 gene is split into several fragments, each residing on a different circular chromosome (Marande et al. 2005), it was not obvious at that point whether these modules were simply nonfunctional gene fragments present in addition to a full-size functional copy of this gene or whether no intact version of cox1 existed. A follow-up study demonstrated that the cox1 fragments are indeed transcribed individually from different circular chromosomes, with the contiguous, translatable cox1 transcript being generated by splicing together, in trans and in a perfectly orderly fashion, the separate transcripts of nine cox1 subgenomic modules located in the mtDNA (Marande and Burger 2007). Moreover, this work revealed a single case of insertion of a stretch of six uridines between two modules (Marande and Burger 2007), reminiscent of the uridine insertion/deletion type of RNA editing characteristic of kinetoplastids (see previous chapter). While nothing is yet known about the machinery responsible for this insertion, it has been shown that an insertion of six non-encoded uridines of exactly the same size and location occurs in the related diplonemids D. ambulator, Diplonema sp. 2, and Rhynchopus euleeides.

Extensive sequencing of the mitochondrial genome of D. papillatum revealed that it contains a typical set of protein-coding genes, which are, however, all split into fragments in the manner originally described for cox1 (Vlček et al. 2011). So far, a complete set of modules for cytochrome b, cox1, cox2, cox3, and nad7 (encoding NADH dehydrogenase subunit 7 of respiratory complex I), as well as dozens of modules varying in length from 60 to 350 bp and encoding other subunits of the respiratory chain, have been mapped. Some genes that are often present and conserved in mitochondrial genomes, such as rps12 (ribosomal protein S12) and nad9 (NADH dehydrogenase subunit 9), have so far not been encountered in the D. papillatum sequences, while the absence of tRNA genes in diplonemids (Vlček et al. 2011) is a character shared with kinetoplastid mitochondrial genomes. Furthermore, as in the case of cox1 transcripts, the gene modules encoding other mitochondrial proteins are transcribed and the products trans-spliced into mature transcripts. The position of the gene modules in the A and B circles is highly conserved, with only one module invariably present per circle. Using comparative sequence analysis, a constant region in each circle has been identified, in addition to the gene-encoding portion flanked by conserved motifs (Vlček et al. 2011).
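The orderly joining of module transcripts described above can be pictured with a small sketch. It is a toy illustration only: the nine module sequences are invented placeholders, the function name `assemble_transcript` is hypothetical, and the junction chosen for the six non-encoded uridines is an arbitrary assumption, since the text states only that the insertion lies between two modules.

```python
from typing import Sequence


def assemble_transcript(modules: Sequence[str], u_after_module: int,
                        n_uridines: int = 6) -> str:
    """Join module transcripts in order and add a non-encoded U tract.

    u_after_module is the 1-based number of the module that precedes the
    inserted uridines; the tract must lie between two modules.
    """
    if not 1 <= u_after_module < len(modules):
        raise ValueError("the U tract must fall between two modules")
    parts = []
    for number, module in enumerate(modules, start=1):
        parts.append(module)
        if number == u_after_module:
            parts.append("U" * n_uridines)  # uridines absent from the mtDNA
    return "".join(parts)


# Nine invented module transcripts, written in lowercase so that the added
# U tract stands out. These are placeholders, not real cox1 sequence.
cox1_modules = ["augg", "ccua", "gaac", "uugc", "acgu",
                "ggca", "cuag", "aucc", "guaa"]

# Hypothetical junction (after module 4), chosen only for illustration.
mature = assemble_transcript(cox1_modules, u_after_module=4)
print(mature)
```

The mature transcript is longer than the sum of its modules by exactly the length of the U tract, which is the feature that makes the insertion detectable when transcripts are compared with the genomic modules.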

The existence of small mtDNA fragments in diplonemids and euglenids indicates that the common ancestor of Euglenozoa likely already had a fragmented mitochondrial genome, which might have had important consequences for extant mitochondrial genomes within this lineage (Flegontov et al. 2011). As proposed recently by Spencer and Gray (2011), antisense transcripts of gene fragments might exemplify the ancestral form of guide RNAs, genes for which are abundantly present in the kinetoplast DNA.

References
Akhmatova A, Voncken F, Van Alen T, Van Hoek A, Boxma B, Vogels G, Veenhuis M, Hackstein JHP (1998) A hydrogenosome with a genome. Nature 396:527–528
Aphasizhev R, Aphasizheva I (2011) Mitochondrial RNA processing in trypanosomes. Res Microbiol 162:655–663
Flegontov P, Gray MW, Burger G, Lukeš J (2011) Gene fragmentation: a key to mitochondrial genome evolution in Euglenozoa? Curr Genet 57:225–232
Gray MW, Lang BF, Burger G (2004) Mitochondria of protists. Annu Rev Genet 38:477–524
Hampl V, Hug L, Leigh JW, Dacks JB, Lang BF, Simpson AGB, Roger AG (2009) Phylogenetic analyses support the monophyly of Excavata and resolve relationships among eukaryotic “supergroups”. Proc Natl Acad Sci U S A 106:3859–3864
Jensen RE, Englund PT (2012) Network news: the replication of kinetoplast DNA. Annu Rev Microbiol 66:473–491
Lai D-H, Hashimi H, Lun Z-R, Ayala FJ, Lukeš J (2008) Adaptation of Trypanosoma brucei to gradual loss of kinetoplast DNA: T. equiperdum and T. evansi are petite mutants of T. brucei. Proc Natl Acad Sci U S A 105:1999–2004
Lang BF, Burger G, O’Kelly CJ, Cedergren R, Golding GB, Lemieux C, Sankoff D, Turmel M, Gray MW (1997) An ancestral mitochondrial DNA resembling a eubacterial genome in miniature. Nature 387:493–497
Lukeš J, Guilbride DL, Votýpka J, Zíková A, Benne R, Englund PT (2002) The kinetoplast DNA network: evolution of an improbable structure. Eukaryot Cell 1:495–502
Marande W, Burger G (2007) Mitochondrial DNA as a genomic jigsaw puzzle. Science 318:415
Marande W, Lukeš J, Burger G (2005) Unique mitochondrial genome structure in diplonemids, the sister group of kinetoplastids. Eukaryot Cell 4:1137–1146
Spencer DF, Gray MW (2011) Ribosomal RNA genes in Euglena gracilis mitochondrial DNA: fragmented genes in a seemingly fragmented genome. Mol Genet Genomics 285:19–31
Stuart K, Schnaufer A, Ernst NL, Panigrahi AK (2005) Complex management: RNA editing in trypanosomes. Trends Biochem Sci 30:97–105
Tachezy J (ed) (2008) Hydrogenosomes and mitosomes: mitochondria of anaerobic eukaryotes. Springer, Berlin/Heidelberg
Vlček C, Marande W, Teijeiro S, Lukeš J, Burger G (2011) Systematically fragmented genes in a multipartite mitochondrial genome. Nucleic Acids Res 39:979–988

Mitochondrial Genomes of Green, Red and Glaucophyte Algae
Robert W. Lee and Jimeng Hua
Department of Biology, Dalhousie University, Halifax, NS, Canada

Abbreviations
CW: Clockwise flagellar basal bodies
DO: Directly opposed flagellar basal bodies
mtDNA: Mitochondrial DNA

Synonyms
Archaeplastida; Archaeplastidians; Charophytes; Charophytes plus chlorophytes; Chloroplastida; Chloroplastidians; Clade; Embryophytes; Glaucophyta; Glaucophytes; Green algae; Green plants; Land plants; Monophyletic group; Plantae; Red algae; Rhodophyta; Rhodophytes; Streptophyte algae; Viridiplantae

Synopsis
The archaeplastidian algae (green, red, and glaucophyte algae), a group united by being the first known photosynthetic eukaryotes derived from the symbiosis of a non-photosynthetic eukaryotic host and a free-living cyanobacterium, show an impressive diversity of mitochondrial genome architectural features among the more than 30 complete sequences now available. Two green algal species have mitochondrial DNAs (mtDNAs) with an excess of guanine and cytosine, which contrasts with all the known mtDNAs of other archaeplastidian algae and all but a few species outside this group. Sizes of known mitochondrial genomes of archaeplastidian algae vary from 13 to 201.8 kb. This variation is due in part to differences in gene complement, which in one subgroup is reduced to about one-fifth of that seen in the more gene-rich mtDNAs of other archaeplastidian algal groups; in addition, the fraction of intergenic and intronic DNA varies widely among archaeplastidian algal mtDNAs. Among archaeplastidian algae, a few species have been identified with linear genome-sized mtDNAs harboring distinct telomeres, and at least one species appears to have a circular unit-genome-sized mtDNA. However, the great majority of mtDNAs in the archaeplastidian algae are characterized as “circular mapping,” which is consistent with a circular unit-genome structure but also with a branched structure composed of linear head-to-tail concatemeric molecules, as seen in some plant and protist mtDNAs. Finally, most mtDNAs among archaeplastidian algae are intact molecules; however, fragmented forms composed of two sub-chromosomal elements have been identified and proposed to result from recombinational events between intact mtDNA precursors.

Introduction
The green, red, and glaucophyte algae represent an attractive group in which to study the evolution of mitochondrial genome structure because of the surprising array of mitochondrial DNA (mtDNA) architectural features observed in these algae and the increased and ongoing understanding of phylogenetic relationships within the group. This entry will deal first with the phylogeny and taxonomy of this group, followed by a discussion of the diversity of mitochondrial genome features, including nucleotide composition, size, shape, structural continuity, gene complement, and compactness. For many of these features, comments will be made about possible evolutionary forces and/or mechanisms responsible for their variation. More detailed information on green, red, and glaucophyte mtDNAs can be found in Gray et al. (2004), Burger and Nedelcu (2012), and Leliaert et al. (2012).

Phylogeny and Taxonomy
The green (Chloroplastida minus the land plants), red (Rhodophyta), and glaucophyte (Glaucophyta) algae (Fig. 1) together form a large assemblage of usually autotrophic organisms, ranging from unicellular to multicellular forms, that live in marine, freshwater, or terrestrial habitats. Together with the land plants, this group forms Archaeplastida, a clade widely accepted to be descended from a common ancestor that gained its plastid, and with it the ability to photosynthesize, directly from a cyanobacterial endosymbiont. The monophyly of the archaeplastidian group is well supported (Keeling 2010) but remains contentious. Moreover, although there is consensus for the monophyly of each of the three main branches within Archaeplastida (the chloroplastidian, rhodophyte, and glaucophyte lineages), the exact branching order of these lineages remains uncertain. This entry will use the term “archaeplastidian algae” to represent the green, red, and glaucophyte algal group.

Fig. 1 Evolutionary relationships among lineages in Archaeplastida. The three main branches include Chloroplastida (comprising the sister phyla Streptophyta and Chlorophyta), Rhodophyta (red algae), and Glaucophyta (glaucophytes). The green algae consist of Chlorophyta and the non-land plant members of Streptophyta (charophytes). Dotted lines represent uncertain topologies

The green algae are not a monophyletic group, as they are represented by all members of Chlorophyta and the non-land-plant members (charophytes) of Streptophyta (Fig. 1). Charophytes (streptophyte algae) appear to be a paraphyletic group, and the phylogeny of this assemblage is still emerging. Chlorophyta, the archaeplastidian algal group containing by far the greatest diversity, consists of three monophyletic classes, Ulvophyceae, Trebouxiophyceae, and Chlorophyceae (the UTC group), and a paraphyletic assemblage, the prasinophytes, which contains several clades whose phylogenetic relationships are not well resolved. Moreover, the precise branching pattern of the three major lineages in the UTC group is also uncertain, likely in part because of insufficient taxon sampling and potentially short time spans between the nodes of the three branches (Smith et al. 2011). Chlorophyceae includes two distinct clades, initially characterized as the CW (“clockwise”) and DO (“directly opposed”) groups on the basis of their flagellar basal bodies; these distinct branches have been designated Chlamydomonadales and Sphaeropleales, respectively. Within the CW group, which is relatively well sampled for mitochondrial genome sequences, several sub-clades have been identified, but their phylogenetic relationships are not fully resolved. The genus Chlamydomonas is a polyphyletic grouping because its members are interspersed with species of other genera among the different sub-clades of the CW group.

Although the ancestor of Chloroplastida is proposed to have been a marine, scaly, flagellated prasinophyte-like unicell, one descendant sub-lineage, the streptophyte algae, is thought to have been the first chloroplastidian algal group to gain a solid foothold in freshwater habitats. It thereby became preadapted to terrestrial existence and, with its evolution into land plants, transformed the planet (Becker and Marin 2009).

Phylogenetic relationships of lineages within Rhodophyta and Glaucophyta are not well established, and the taxonomies of these divisions are under development (Fig. 1). Rhodophyta, the group with the second greatest diversity among archaeplastidian algae, inhabits primarily marine habitats; its members are predominantly multicellular, but some unicellular species exist. Important features that distinguish red algae from other archaeplastidian algae include the absence of both centrioles and flagella from all life stages. Although they share with glaucophytes the presence of phycobilin photosynthetic pigments, the red algae uniquely contain phycoerythrin, a specific phycobilin that is responsible for their red color. Several distinct red algal lineages have been identified. Glaucophyte algae are a small group of unicellular or colonial freshwater algae. A chief distinguishing feature of this group is the presence of plastids termed cyanelles, which contain a peptidoglycan layer between their two membranes, making them more similar to cyanobacteria than are the plastids of the remaining archaeplastidian algae.

Mitochondrial DNA Features

Nucleotide composition. The great majority of sequenced mitochondrial genomes in archaeplastidian algae, like mtDNAs generally, have an excess of adenine and thymine (AT) over guanine and cytosine (GC) (Table 1), possibly because of AT mutation pressure coupled with inefficient mtDNA repair processes (Smith et al. 2011). Two mtDNAs in the study group, namely, those of the obligate heterotroph Polytomella capuana (Chlorophyceae) and Coccomyxa sp. C-169 (Trebouxiophyceae), have GC contents of 57.2% and 53.2%, respectively, which stand out against the AT-rich mtDNAs of their close chlorophyte relatives, other archaeplastidian algae, and all but a few sporadic examples among other eukaryotic species (Smith et al. 2011). In the P. capuana and Coccomyxa examples, the GC bias is most evident in the least conserved regions of the mtDNA, i.e., codon third positions of protein-coding genes and intergenic regions. According to the negative selection principle of the neutral theory of molecular evolution, this pattern supports a nonadaptive basis for the GC bias, possibly as a result of GC-biased gene conversion. Sequence data from the mtDNA-encoded cox1 and cob genes from several taxa of the Oogamochlamys clade (CW group, Chlorophyceae) also reveal elevated levels of GC at the codon third positions in these genes.

Genome size. Among the completely sequenced mitochondrial genomes of archaeplastidian algae (Table 1), 75% are between 25 and 50 kb, with a 15-fold size range between the smallest (13 kb in P. capuana; CW group, Chlorophyceae) and the largest (201.8 kb in the streptophyte alga (charophyte) Chlorokybus atmophyticus). Genome size is influenced by coding DNA content, which tends to increase with the number and size of the genes in the mtDNA, and by noncoding DNA content (intergenic DNA and intronic DNA, the latter from both group I and group II introns). The large size of the C. atmophyticus mitochondrial genome compared to those of other archaeplastidian algae reflects its non-compact, or bloated, nature, which is mainly the result of an expanded noncoding DNA content but also of a gene content that is rich by archaeplastidian algal standards. The smallest mitochondrial genomes in archaeplastidian algae are found in the CW group of Chlorophyceae, which have low gene contents and compact architectures. Interestingly, the largest identified mitochondrial genome in the CW group, that of Volvox carteri f. nagariensis, has a size that rivals those of many more gene-rich mtDNAs found in other archaeplastidian algal groups, owing to its high fraction of noncoding DNA.

Table 1 General characteristics of green algal, red algal, and glaucophyte mitochondrial genomes

Taxon | Size (kb) | Shape | GC (%) | Coding [a] (%) | Intron (%) | Intergenic (%) | Telomeric (%) | Gene number: protein [b]/rRNA/tRNA/other [c] | Introns [d] I/II
Streptophyta
Chlorokybus atmophyticus | 201.8 | C | 39.8 | 22.7 | 21.8 | 55.5 | 0 | 40/3/28/1 | 6/14
Mesostigma viride | 42.4 | C | 32.2 | 75.4 | 11.1 | 13.5 | 0 | 36/3/26/0 | 4/3
Chara vulgaris | 67.7 | C | 40.9 | 52.3 | 38.5 | 9.2 | 0 | 37/3/26/3 | 14/13
Chaetosphaeridium globosum | 56.6 | C | 34.4 | 60.4 | 15.9 | 23.7 | 0 | 36/3/28/0 | 9/1
Chlorophyta
Chlorophyceae
Chlamydomonas reinhardtii (CW) | 15.8 | L | 45.2 | 83.1 | 0 | 9.9 | 7 | 7/2/3/1 | 0/0 [e]
Polytomella capuana (CW) | 13.0 | L | 57.2 | 82.0 | 0 | 4.0 | 14 | 7/2/1/0 | 0/0
Polytomella parva (CW) | 13.1, 3.0 | FL | 41.0 | 65.5 | 0 | 1.5 | 33 | 7/2/1/0 | 0/0
Polytomella piriformis (CW) | 13.0, 3.1 | FL | 41.9 | 65.8 | 0 | 1.2 | 33 | 7/2/1/0 | 0/0
Volvox carteri f. nagariensis [f] (CW) | ~35 | C | 35.7 | 33.4 | 22.5 | 44.1 | 0 | 7/2/3/0 | 2/1
Chlamydomonas moewusii [g] (CW) | 22.9 | C | 34.6 | 53.5 | 31.0 | 15.5 | 0 | 7/2/4/0 | 9/0
Dunaliella salina (CW) | 28.3 | C | 34.4 | 41.6 | 28.1 | 30.3 | 0 | 7/2/3/0 | 22/0
Chlorogonium elongatum (CW) | 22.7 | C | 37.8 | 53.4 | 35.6 | 11.0 | 0 | 7/2/3/0 | 6/0
Scenedesmus obliquus [h] (DO) | 42.9 | C | 36.3 | 52.8 | 7.8 | 39.4 | 0 | 13/2/27/0 | 2/2
Trebouxiophyceae
Pedinomonas minor | 25.1 | C | 22.2 | 57.2 | 3.3 | 39.5 | 0 | 11/2/8/0 | 0/1
Prototheca wickerhamii | 55.3 | C | 25.8 | 62.1 | 8.5 | 29.4 | 0 | 32/3/26/0 | 5/0
Coccomyxa sp. C-169 | 65.5 | C | 53.2 | 48.3 | 10.8 | 40.9 | 0 | 29/3/26/1 | 1/4
Helicosporidium sp. ATCC 50920 | 49.3 | C | 25.6 | 67.1 | 8.6 | 24.3 | 0 | 32/3/25/0 | 3/0
Ulvophyceae
Pseudendoclonium akinetum | 95.9 | C | 39.3 | 50.6 | 11.4 | 38.0 | 0 | 30/2/25/0 | 7/0
Oltmannsiellopsis viridis | 56.8 | C | 33.4 | 57.7 | 8.6 | 33.7 | 0 | 28/3/24/0 | 2/1
Prasinophytes
Ostreococcus tauri | 44.2 | C | 38.2 | 92.5 | 0 | 7.5 | 0 | 35/3/27/1 | 0/0
Micromonas sp. | 47.4 | C | 34.6 | 82.5 | 0 | 17.5 | 0 | 33/3/28/1 | 0/0
Nephroselmis olivacea | 45.2 | C | 32.8 | 72.2 | 8.4 | 19.4 | 0 | 35/3/26/1 | 4/0
Pycnococcus provasolii | 24.3 | C | 37.8 | 87.7 | 0 | 12.3 | 0 | 18/2/16/0 | 0/0
Rhodophyta
Gracilariophila oryzoides | 25.2 | C | 28.1 | 89.4 | 0 | 10.6 | 0 | 23/3/20/2 | 0/1
Gracilariopsis andersonii | 27.0 | C | 28.0 | 87.0 | 0 | 13.0 | 0 | 23/3/20/2 | 0/1
Chondrus crispus | 25.8 | C | 27.9 | 94.1 | 1.8 | 4.1 | 0 | 23/3/23/2 | 0/1
Plocamiocolax pulvinata | 25.9 | C | 23.9 | 91.3 | 0 | 8.7 | 0 | 23/3/23/2 | 0/2
Cyanidioschyzon merolae | 32.2 | C | 27.1 | 87.4 | 7.6 | 5.0 | 0 | 29/3/25/5 | 0/0
Porphyra purpurea | 36.8 | C | 33.5 | 77.4 | 13.5 | 9.1 | 0 | 22/2/24/4 | 0/2
Glaucophyta
Glaucocystis nostochinearum [i] | 34.1 | C | 25.7 | 92.0 | 0 | 8.0 | 0 | 38/3/25/0 | 0/0
Cyanophora paradoxa [i] | 51.6 | C | 26.0 | 84.7 | 0 | 15.3 | 0 | 38/3/27/0 | 0/0

C circular mapping, L linear, FL fragmented linear, CW clockwise flagellar basal body group, DO directly opposed flagellar basal body group
[a] Coding sequences include intergenic ORFs and exons of structural RNA genes and standard and nonstandard proteins
[b] Standard protein-coding genes
[c] rnpB, tatA, tatC, ccmA, ccmB, ccmC, ccmF, dpo, rtl
[d] Numbers of group I and group II introns
[e] No introns have been identified in the mtDNA from the standard laboratory strain of C. reinhardtii; however, in total, five unique optional introns (three group I and two unclassified) have been found in other geographical isolates of this species
[f] Irresolvable gaps within the intergenic and intronic regions due to repetitive elements may affect some parameters
[g] Listed as Chlamydomonas eugametos in GenBank (NC001872)
[h] Strain UTEX 78 (GenBank AF204057). There is an S. obliquus (strain KS3-2) mtDNA sequence (GenBank NC002254) with similar features
[i] Unpublished sequences described by Burger and Nedelcu (2012)

Shape. Little is known about the in vivo structure of archaeplastidian algal mitochondrial genomes. Based mainly on DNA sequencing, the great majority of well-studied mtDNAs from this group have been characterized as “circular mapping” (Table 1). This feature is consistent with a unit-genome-sized circular structure, as is the case for most animal mtDNAs, that of the green alga Chlamydomonas moewusii, and probably those of other related CW-group chlorophyceans. It does not, however, eliminate the possibility of linear head-to-tail concatemeric sequences that generate complex branched structures by recombination-driven replication, resembling the mechanism employed by the circularly permuted linear phage T4 DNA. A similar structure and replication mechanism are likely for the mtDNAs of liverwort, some yeasts (Schizosaccharomyces pombe and wild-type Saccharomyces cerevisiae), and the red bread mold (Neurospora crassa) (Bendich 2010). Not all archaeplastidian algal mitochondrial genomes are circular mapping, however: unit-genome-sized linear mtDNAs with distinct telomere structures have been identified in the Reinhardtinia clade of Chlorophyceae (Chlorophyta), and linear mtDNA forms may exist among taxa in the closely related clade Oogamochlamydinia. The telomeric sequences and structures identified in the Reinhardtinia-clade taxa include inverted repeats, 3′ overhangs, and closed single-stranded loops. Because mitochondria lack telomerase, it is presumed that the diverse termini of these linear mtDNAs help the genome overcome the end replication problem. Interestingly, the mtDNA of Volvox carteri f. nagariensis has been characterized as a circular-mapping molecule, which so far is unique among Reinhardtinia-clade members. Such an observation may indicate that some mtDNA transitions have occurred within this clade between genome-sized linear molecules and genome-sized circular ones or complex concatemeric structures, as described above.

Structural continuity. Generally speaking, mitochondrial genomes are composed of a single kind of DNA molecule. Fragmented mtDNAs, however, have been observed in diverse eukaryotic lineages, including archaeplastidian algae, in which known examples of this trait are so far limited to the mitochondrial genomes of certain Reinhardtinia-clade species (Table 1) and possibly species in the closely related Oogamochlamydinia clade. In the Reinhardtinia clade, examples of fragmented mtDNA are so far limited to two Polytomella species, which contain 13 and 3 kb fragments of the mitochondrial genome. Comparison with the sequence structure of the intact mtDNA of P. capuana, which represents an earlier diverging lineage, has led to a suggested fragmentation scheme of a P. capuana-like mtDNA involving GC-rich inverted-repeat sequences.

Gene complement. Early studies (Turmel et al. 1999) described two distinct mitochondrial genome types among the five archaeplastidian algal mitochondrial genomes sequenced at that time (all were from Chlorophyta, with no streptophyte, red, or glaucophyte algal representatives). One mtDNA type, dubbed “reduced-derived” or Chlamydomonas-like, was characterized by small genome size (16–25 kb), limited gene content (no ribosomal protein or 5S rRNA genes and only a few respiratory protein and tRNA genes), and the presence of fragmented and scrambled rRNA-coding regions. The other mtDNA form, termed “ancestral” or Prototheca-like, featured a larger size (45–55 kb), a more complex set of protein-coding genes (including ones for ribosomal proteins), a complete or almost complete set of tRNA genes, a 5S rRNA gene, and conventional continuous rRNA genes.
With the expanded database of complete mtDNA sequences, which now includes those of streptophyte, red, and glaucophyte algae as well as an ulvophycean representative of Chlorophyta (Table 1), it appears that the original two main mitochondrial genome types might still apply to the larger archaeplastidian algal group; however, a slight blurring of this distinction can be seen. For example, red, glaucophyte, and streptophyte algal mtDNAs have two or three succinate dehydrogenase genes, which are missing in the known mtDNA sequences of chlorophyte algae. In addition, fragmented rRNA genes have been identified in several prasinophyte mtDNAs (Burger and Nedelcu 2012), and a few extra protein-coding genes (cox2 and cox3) and a much larger collection of tRNA genes compared to the Chlamydomonas-type mtDNA have been identified in the mtDNA of Scenedesmus obliquus (DO group of Chlorophyceae). Moreover, the mtDNA of the prasinophyte Pycnococcus provasolii and the known red algal mtDNAs are missing some respiratory proteins and a number of ribosomal proteins relative to other ancestral-like mtDNAs. With only one possible exception, the reduced-derived type of mtDNA is found so far only in the chlorophycean lineage, especially in the CW group. The mtDNA of Pedinomonas minor, which is currently placed in Trebouxiophyceae, has a fragmented large subunit rRNA gene and a gene content resembling the reduced-derived type of archaeplastidian algal mtDNA, otherwise seen only in the chlorophycean lineage. However, the phylogenetic placement of Pedinomonas is not fully resolved; its placement in Chlorophyceae, where it seems to belong based on mtDNA features, has not been ruled out and may even be likely considering recent phylogenetic analyses with complete mtDNA sequences (Smith et al. 2011). In the case of archaeplastidian algal mtDNAs that lack genes present in the mitochondrial genomes of other archaeplastidian algae, it is presumed that functional copies of the missing genes have been transferred to the nucleus, from where each gene is expressed and its product targeted back to the mitochondrion.
It is not known why some archaeplastidian algal species (i.e., the chlorophyceans) have had more of their mitochondrial genes transferred to the nucleus than other archaeplastidian algae, although both natural selection and nonadaptive mechanisms can be envisioned. However, one nonadaptive (neutral) explanation, that the functional net movement of genes from mitochondria to the nucleus is favored by higher mutation rates in the mitochondrial genome compared to the nuclear one, is not supported by empirical studies with the chlorophyte alga C. reinhardtii and the streptophyte alga Mesostigma viride (Hua et al. 2012); both appear to have similar relative mutation rates in the mitochondrial and nuclear genetic compartments yet differ by a factor of five in the number of genes they contain in their mtDNAs (Table 1).

Compactness. Known archaeplastidian mtDNAs show considerable variation in their compactness (Table 1), defined here as the fraction of the genome encoding proteins and structural RNAs. The most compact mtDNAs, i.e., those having more than 80% and sometimes more than 90% coding DNA, tend to be found in glaucophyte, rhodophyte, and certain prasinophyte species. Some red algae, notably Chondrus and Cyanidioschyzon, have carried coding density to the point where some genes now overlap one another. Were it not for the telomeric sequences in the linear mtDNAs of some chlorophyceans such as C. reinhardtii and Polytomella spp. (sequences classified here as noncoding even though they are likely to be functionally important), these mitochondrial genomes would rival the compactness of their most coding-dense glaucophyte, rhodophyte, and prasinophyte counterparts. At the other end of the spectrum, some mitochondrial genomes are rather bloated by archaeplastidian algal standards. The most extreme known case is the mtDNA of the charophyte alga Chlorokybus atmophyticus, which is only 23% coding sequence as the result of a moderately high proportion of intronic sequence (21.8%) and a record-breaking (for archaeplastidian algae) fraction of intergenic DNA (55.5%). At 33% coding, the mtDNA of the multicellular chlorophyte Volvox carteri f. nagariensis is the second-most bloated of the characterized archaeplastidian algal mtDNAs because of a moderately high fraction of intronic DNA (22.5%) and a high fraction of intergenic DNA (44.1%).

The ultimate cause of compactness variation in mtDNA among eukaryotes generally, and archaeplastidian algae in particular, is not well understood. On the one hand, it has been proposed that compact mtDNAs could be the result of natural selection, e.g., for increased speed of replication, while on the other hand, nonadaptive forces could be at play (Lynch 2007). In the latter category, one possible explanation is the mutational-hazard hypothesis, which suggests that the principal forces governing organelle genome size and structure are mutation and random genetic drift (Lynch 2007). The hypothesis argues that adornments in genomic sequence, such as introns and intergenic DNA, are a liability because they represent targets for potentially deleterious mutations: the higher the mutation rate, the greater the burden of the adornment. The hypothesis further posits that species with large effective population sizes (i.e., large Ne) are more efficient at perceiving and eliminating this burdensome DNA than those with small effective population sizes. Insight into the combined effect on a genome of the effective population size (Ne) and the number of mutations per nucleotide site per generation (μ), i.e., the product Neμ, can be gained from its nucleotide diversity (among members of its species) at silent sites (πsilent), which are defined as synonymous sites, such as codon third positions of protein-coding genes, and noncoding (intronic and intergenic) sites. Available πsilent measurements from V. carteri are on average very low (around 0.0004, 0.0006, and 0.005 for the mitochondrial, plastid, and nuclear DNA, respectively) relative to C. reinhardtii and other protists (Leliaert et al. 2012). This suggests that all three V. carteri genetic compartments have a small Neμ and therefore a reduced ability to detect and eradicate excess DNA, which could help explain why the organelle and nuclear genomes of this organism are among the most bloated among archaeplastidian algae.
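The nucleotide-diversity measure referred to above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the cited studies: it assumes the silent sites have already been extracted and aligned without gaps or ambiguous bases, the function name `nucleotide_diversity` is hypothetical, and the toy alignment is invented. Under standard neutral theory the resulting value is expected to scale with Neμ (roughly 2Neμ for a haploid organelle genome).

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(alignment: Sequence[str]) -> float:
    """Average number of pairwise differences per site among aligned sequences."""
    if len(alignment) < 2:
        raise ValueError("need at least two sequences")
    length = len(alignment[0])
    if length == 0 or any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must share the same non-zero length")
    pairs = list(combinations(alignment, 2))
    # Count mismatching sites for every pair of individuals.
    differences = sum(
        sum(a != b for a, b in zip(first, second)) for first, second in pairs
    )
    return differences / (len(pairs) * length)


# Invented silent-site alignment from four individuals of one species.
silent_sites = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
pi_silent = nucleotide_diversity(silent_sites)
print(f"pi_silent = {pi_silent:.3f}")
```

In this toy case one individual differs from the other three at a single site out of ten, so three of the six pairwise comparisons contribute one difference each.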
The mutational burden hypothesis predicts a low psilent for the mtDNA of the streptophyte algae Chlorokybus atmophyticus, which is even more bloated than that of V. carteri and a high psilent for the mtDNAs of the prasinophyte Ostreococcus tauri, the glaucophyte Glaucocystis nostochinearum, and the rhodophyte Chondrus crispus, which are among the most compact of archaeplastidian algae.
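The nucleotide diversity statistic referred to above can be illustrated with a short Python sketch. It computes the average number of pairwise differences per site among aligned sequences. The function name `nucleotide_diversity` and the toy alignment are hypothetical, and the sketch assumes that the input has already been restricted to silent sites and contains no gaps or ambiguity codes; it is an illustration, not the procedure used in the cited studies.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site among aligned sequences.

    Assumes (for illustration only) that every sequence is the same
    length, already limited to silent sites, and free of gaps.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and equally long")

    pairs = list(combinations(seqs, 2))
    # Count mismatched positions for every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment of four individuals at ten silent sites (made-up data).
toy_alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAT",
    "ACGTACGAAC",
]
pi_silent = nucleotide_diversity(toy_alignment)
print(round(pi_silent, 4))
```

Real estimates would also need to identify synonymous and noncoding positions and to handle missing data, which this sketch leaves to upstream processing.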


References

Becker B, Marin B (2009) Streptophyte algae and the origin of embryophytes. Ann Bot 103:999–1004
Bendich AJ (2010) The end of the circle for yeast mitochondrial DNA. Mol Cell 39:831–832
Burger G, Nedelcu AM (2012) Mitochondrial genomes of algae. In: Bock R, Knoop V (eds) Genomics of chloroplasts and mitochondria. Series on advances in photosynthesis and respiration. Springer, Dordrecht, pp 127–157
Gray MW, Lang BF, Burger G (2004) Mitochondria of protists. Annu Rev Genet 38:477–524
Hua J, Smith DR, Borza T, Lee RW (2012) Similar relative mutation rates in the three genetic compartments of Mesostigma and Chlamydomonas. Protist 163:105–115
Keeling PJ (2010) The endosymbiotic origin, diversification and fate of plastids. Philos Trans R Soc Lond B Biol Sci 365:729–748
Leliaert F, Smith DR, Moreau H, Herron M, Verbruggen H, Delwiche CF, De Clerck O (2012) Phylogeny and molecular evolution of the green algae. Crit Rev Plant Sci 31:1–46
Lynch M (2007) The origins of genome architecture. Sinauer Associates, Sunderland
Smith DR, Burki F, Yamada T, Grimwood J, Grigoriev IV, Van Etten JL, Keeling PJ (2011) The GC-rich mitochondrial and plastid genomes of the green alga Coccomyxa give insight into the evolution of organelle DNA nucleotide landscape. PLoS ONE 6:e23624
Turmel M, Lemieux C, Burger G, Lang BF, Otis C, Plante I, Gray MW (1999) The complete mitochondrial DNA sequence of Nephroselmis olivacea and Pedinomonas minor. Two radically different evolutionary patterns within green algae. Plant Cell 11:1717–1729

Mobile DNA: Mechanisms, Utility, and Consequences

Adam R. Parks¹ and Joseph E. Peters²
¹ Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
² Department of Microbiology, Cornell University, Ithaca, NY, USA

Synopsis

Mobile DNA elements contain sequences that enable them to physically move within or between different DNA molecules in a cell. These elements are ubiquitous in nature: they are found throughout each of the three domains of life and in many ectopic DNA molecules, such as viral genomes and plasmids. They can perform a variety of functions for a host organism and mobilize genetic information from one host to another. These elements adopt three major strategies to mobilize: transposition, conservative site-specific recombination, and target-primed reverse transcription. Mobilization of genetic elements is typically a highly regulated process and can have important consequences or perform vital functions for organisms.

Introduction

All organisms depend on faithful reproduction of their genetic material for their continued survival. However, in this section we will consider mobile genetic elements, segments of DNA that challenge the view that genetic order is strictly maintained. Mobile DNA elements contain sequences that enable them to physically move within or between different DNA molecules in a cell. As we will describe, mobile elements have been highly successful at ensuring their own propagation and have contributed greatly to the evolution of organisms and genetic systems in general. Mobile genetic elements can be grouped loosely by their mechanism of movement (Fig. 1):

• Transposition
• Conservative site-specific recombination
• Target-primed reverse transcription

The unifying characteristic of all of these elements is the programmed ability to move or copy genetic information from one site in a DNA molecule to another, changing the order of sequence within the DNA (Curcio and Derbyshire 2003; Siguier et al. 2014). DNA sequences provide instructions regarding which portion of the DNA is to be moved or copied, and they encode DNA-binding proteins that bind to the mobile element sequences and carry out the necessary

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 1 A comparison of three general mechanisms of genetic element mobility. (a) Transposition: a transposase protein (tan circles) binds to the ends of the genetic element by recognizing inverted repeat DNA sequences. The transposase carries out the breaking and joining reactions that liberate the element from the donor DNA molecule (shown as black lines) and insert it into a recipient DNA molecule (shown as red lines). Transposition is characterized by the ability of an element to move from one site within the DNA to another, where there is no requirement for homologous DNA sequence, and typically results in small target-site duplications. (b) Site-specific recombination: conservative site-specific recombination involves the recognition of specific DNA sequences (blue boxes) on both the donor DNA (black lines) and the recipient DNA (red lines). Strand-exchange reactions are carried out by either tyrosine recombinases or serine recombinases (blue circles). In the site-specific recombination mechanism, there is no net gain or loss of genetic information (i.e., there is no target-site duplication). (c) Target-primed reverse transcription: the element is copied out of the donor DNA (black lines) by host RNA polymerases (green oval), resulting in an RNA copy of the element (purple squiggled line). A reverse transcriptase enzyme (purple circle) nicks the recipient DNA molecule (red lines) and uses the free 3′-end generated by the nick to prime the formation of a DNA copy of the RNA element, causing the new DNA copy of the element to be colinear with the donor DNA

catalytic reactions for moving or copying the DNA segment. The DNA-binding proteins are often assisted by other proteins that are involved in the normal maintenance of DNA within cells, such as DNA polymerases, ligase, and topoisomerases, to complete the transaction. Since mobile genetic elements catalyze their own movement, they are often considered entities separated from the host organisms they occupy, parasites within an established and successful system, much like viruses. Once in place, they depend on the replication machinery of a host organism for their own propagation. In light of information presented here and elsewhere, this type of recombination may seem risky for the

integrity of the host genome. Regardless of the risk to the host organism, mobile genetic elements are present in nearly all living organisms and many ectopic DNA molecules, such as viral genomes and plasmids. Indeed, genes that are associated with the activities displayed by mobile genetic elements are the most abundant and ubiquitous of all genes observed to date (Aziz et al. 2010). Thirty to fifty percent of the human genome is thought to have originated from various mobile genetic elements, although they are typically less abundant in many single-celled organisms where selective pressures impose a heavy cost on unnecessary genetic content (Kazazian 2004). As autonomous genetic elements, their specific

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 2 Mobile genetic elements can disrupt host genomes in a variety of ways. A hypothetical host gene is presented as a blue arrow with a black arrow representing a promoter. An insertion sequence (IS) is presented as a green box. (a) The wild-type condition in which there is no disruption of genetic function. (b) Insertion into the middle of a gene, truncating the coding sequence and disrupting its function. (c) A promoter sequence within the insertion sequence may substitute for the native promoter, causing altered regulation of the host gene. (d) In a less direct effect that mobile elements may have on host genes, homologous recombination between two insertion sequences may cause the inversion or deletion of host genes

activities are constrained by selective pressures within a host cell, and as members of the host’s genome, they also participate in the evolution of the host organism itself. There are a variety of problems that mobile elements present to the host, including (Fig. 2):

• Deactivation of host genes – Insertion into a gene may disrupt its normal coding sequence, thereby inactivating it.
• Introduction of harmful genes – In addition to overtly harmful genetic content, such as toxins and nucleases, mobile genetic elements are a burden on host organism resources.
• Altered regulation of host genes – Misregulation of genes may lead to altered expression of genes, preventing expression when they are needed or causing overexpression when not needed.
• Rearrangement of DNA higher-order structure such as the relocation of centromeres or the contribution of chromatin organization sequences.

Given the dangers involved in frequent movement of mobile genetic elements, organisms have mechanisms that restrict the movement of these elements, and the elements themselves often have regulatory mechanisms that prevent frequent movement. Frequent movement carries with it the risk of death to the host organism, and infrequent movement may result in neutralization of the genetic element through random inactivating mutations (Nagy and Chandler 2004). Increasing the frequency of mobility in many mobile elements can be achieved through mutation of the recombinases responsible for mobility, suggesting that there are selective pressures exerted upon the elements themselves, preventing them from being so active that they damage the host organism. Given the autonomy with which mobile elements move, their ubiquity, and the dangers that they present to the host organism, mobile genetic elements have variously been referred to as “junk DNA” or “selfish DNA,” implying that they serve no beneficial function within an organism (Kazazian 2004). However, it should also be noted that some elements serve beneficial roles. Moreover, the interplay between mobile element activity and host genetic systems has resulted in some profound evolutionary changes. In addition to being mobile, genetic elements leave behind remnants of the genetic material that has been transferred, even after reductive evolution has disabled and dismantled the mobilization


components of an element, making it important to distinguish between transferred and transferable genetic elements. Mobile genetic elements can be seen as mediators of the evolutionary process through the following activities:

• Brokers of horizontally transferred genes. Mobile genetic elements enable the movement of genes from one organism to another, sometimes improving the selective advantage of the new recipient under certain conditions. In some systems, elements can mediate the accretion of selectable genes into large transferable bundles known as genomic islands, or fitness islands, leading to the transfer of many genes at once.
• Gene duplication and diversification. Movement of genetic elements can lead to the duplication of genes, which promotes the diversification of gene function. One gene may maintain its original function, while the new copy may mutate without detriment to the host. Increased gene number may also lead to gene dosage effects that can change host physiology.
• Altered gene regulation. Changes in gene regulation are thought to have contributed to altered developmental processes. For example, mobile elements are thought to be largely responsible for genetic changes that drove the evolution of humans from primate ancestors.
• Programmed genetic rearrangement. In some cases, frequent, controlled generation of genetic diversity is beneficial, such as in the V(D)J recombination system of the human immune system (Zhou et al. 2004) and the phase variation system of bacterial pathogens.

Many recombination systems employed by a wide variety of organisms evolved from mobile genetic elements. There are two basic steps in the mobilization process:

• Liberating the mobile element from the donor DNA
• Joining of the mobile element to the target DNA


These steps are accomplished in a variety of different ways, and there are several regulatory mechanisms that can occur at each of these steps. Liberation of the genetic information within the original donor DNA can occur in one of two ways (Curcio and Derbyshire 2003). First, the element may be “cut out” with the DNA physically excised from the donor molecule. Second, the element may be “copied out” (Fig. 3), either by replication using the host DNA replication machinery or by transcription of the DNA segment by a host-encoded RNA polymerase. If the element is copied out by transcription, the element must also undergo reverse transcription, converting the element to DNA from the RNA copy. Most elements that require reverse transcription also encode their own reverse transcriptase (RT) enzyme. Once an appropriate target, or recipient, DNA has been located, the mobile element must break the target DNA backbone and join the mobile DNA to the target DNA molecule (insertion). This process is called strand transfer in transposons and site-specific recombinases. In elements that use target-primed reverse transcription mechanisms, insertion occurs through the reverse transcription process. Restoration of DNA structure following DNA strand transfer is an important step for both element and host. DNA repair is often necessary to complete the mobilization cycle for transposons and target-primed reverse transcription by-products. DNA intermediates that must be repaired can include single-strand nicks, gapped DNA, 5′-DNA flaps, or Holliday junctions. While many of these repair steps are not under the regulatory control of the mobile element, there are examples where the host machinery is actively recruited, such as the recruitment of PriA by the transposable bacteriophage element Mu.

Transposons

Transposons use a molecular process that allows them to move from one site in a DNA molecule to a new site that contains no sequence similarity (McClintock 1950). This activity is typically accomplished using a DDE-type transposase (or integrase) protein; however, there are some

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 3 Examples of mobile elements that illustrate the cut-out, copy-out, paste-in, and copy-in terms are shown. (a) Cut-out: elements such as Tn10 and Tn5 use a cut-out strategy to remove the element from donor DNA. The transposase (beige circles) “cuts out” the element from the donor DNA by joining the top and bottom strands at either end of the element, leaving behind a double-strand DNA break. (b) Copy-out: retro-elements such as Ty1 and L1 are “copied out” of donor DNA as RNA transcripts. (c) Paste-in: mobile elements that are “pasted in,” such as Tn10 and Tn5, undergo a strand-exchange reaction in which DNA that encodes the element is joined with the target DNA, as opposed to a polymerase enzyme making a copy of the element at the new site (see d, copy-in). (d) Copy-in: elements that are “copied in” require the activity of a DNA polymerase or a reverse transcriptase (as shown in the figure) to make a new copy of the element at the target site, using the target-site DNA as a primer for synthesis and using the DNA or RNA element as a template. Examples of elements that use this strategy include L1 and R2

elements called transposons that use other mechanisms instead (Dyda et al. 2012). DDE-type recombinases contain conserved aspartate (D), aspartate (D), and glutamate (E) amino acid residues that coordinate two divalent magnesium ions required to catalyze the chemical reactions involved in breaking and joining DNA. These magnesium ions create a break in the donor DNA by using an activated water molecule as the nucleophile in a hydrolytic cleavage of the donor DNA backbone. The recombinase then uses the free 3′-OH at the end of the DNA element in a nucleophilic attack on the sugar-phosphate backbone of the target molecule. This step is referred to as a transesterification reaction, resulting in the simultaneous breakage of the recipient DNA strand and joining of the donor strand (Fig. 4).

Retrotransposons do not actually cleave the donor DNA, but are copied out by host RNA polymerase (Curcio and Derbyshire 2003). RNA polymerase transcribes the information encoded within the DNA element, making an RNA copy of the DNA while leaving the original DNA molecule intact. The RNA information is later converted to DNA by a reverse transcriptase protein encoded by the mobile element. Retrotransposons in yeast embody the diffuse barrier between different types of mobile elements; they are packaged into virions along with the RNA copy of the genome and reverse transcriptase, much like a retrovirus. However, after leaving the nucleus they return without ever leaving the cell as an infectious particle.

The DDE transposases must coordinate activities at both ends of the transposon and with the target DNA molecule. The ends of the transposon are delineated by DNA sequences that are specifically recognized by the transposase proteins and contain bases that promote the strand-transfer reaction (Fig. 5). These sequences appear as inverted DNA repeat sequences. They are inverted so that the recombinase is situated in the same orientation on both ends of the element

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 4 DDE-type transposase proteins contain a highly conserved aspartate (D), aspartate (D), and glutamate (E) motif that allows them to coordinate two divalent magnesium cations. The magnesium ions stabilize the transition state of the DNA hydrolysis and transesterification reactions involved in DNA strand exchange. The sequence of reactions is shown in the inset, where black lines represent donor DNA and thick green lines represent the target DNA

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 5 Target-site duplication is the exact duplication of short sequences flanking a transposable element following insertion into a new DNA molecule (in red). The duplication event is a result of transposase activity and host DNA repair activity. (a) The transposase proteins (beige circles) bind different base pair junctions in the top and bottom strands of the target DNA. (b) Consequently, small gaps are produced at either end of the newly inserted element (gapped DNA intermediate), where one gap is derived from the top strand of the target DNA and the opposite end gap is derived from the bottom strand of DNA. (c) Gap repair (DNA replication): host-encoded DNA polymerases fill in the gaps using the 3′-ends provided by the newly inserted DNA element to prime DNA synthesis (blue letters)
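The staggered-cut model summarized in Fig. 5 can be sketched in a few lines of Python. This is a minimal illustration rather than a model of any particular transposase: the function name `insert_with_tsd`, its parameters, and the toy sequences are hypothetical, only the top strand is tracked, and gap repair by the host polymerase is assumed always to succeed.

```python
def insert_with_tsd(target, element, position, stagger):
    """Insert `element` into `target` with cuts staggered by `stagger` bp.

    The two strands are cut `stagger` base pairs apart, so after gap
    repair the bases between the cuts appear on both sides of the
    element. Returns (product, duplicated_bases), top strand only.
    """
    if stagger < 0 or position < 0 or position + stagger > len(target):
        raise ValueError("cut sites fall outside the target sequence")

    duplicated = target[position:position + stagger]
    # Left flank keeps the bases between the cuts; gap repair then
    # re-synthesizes the same bases on the right flank.
    left_flank = target[:position + stagger]
    right_flank = target[position:]
    return left_flank + element + right_flank, duplicated


# Toy example loosely following Fig. 5, with a 4-bp TAGC duplication.
toy_target = "AATAGCGG"
toy_element = "ccccgggg"  # lowercase marks the (made-up) mobile element
product, tsd = insert_with_tsd(toy_target, toy_element, position=2, stagger=4)
print(product, tsd)
```

Setting `stagger` to 0 models a blunt insertion with no duplication, which is one way to see why the length of the duplication reflects the spacing of the two strand-transfer events.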


allowing the breaking and joining events to occur at the very termini of the element.

Target-Site Duplication

In its simplest form (such as in the bacterial element IS3 or the maize element Ds), a single transposase can catalyze the excision of the element from the donor DNA molecule and mediate its insertion into the target; however, most transposons coordinate the activities of at least two transposase monomers, one at either end of the transposon, acting on opposite strands of the DNA (Turlan and Chandler 2000). These transposases also bind to target DNA, but the binding of each transposase monomer is offset across the DNA backbone. Therefore, when strand transfer occurs, the ends of the transposon are inserted at different base junctions, resulting in short gaps at the ends of the transposon following resolution (Fig. 5). The gaps are ultimately filled in by host DNA polymerases, resulting in short direct duplications of DNA sequence at either end of the transposon. These duplications are referred to as target-site duplications and are a distinctive hallmark of transposition reactions. They may range in length from 2 to nearly 250 base pairs (Plikaytis et al. 1998), depending on the type of transposon, although 3–10 base pair duplications are more common.

Replicative Transposition

Without cleavage of the second, non-transferred DNA strand, the element remains tethered between the donor and target DNA (Derbyshire and Grindley 1986). Elements that do not make a break in the non-transferred strand are called replicative transposons because replication of the elements plays a central role in their lifecycle (Fig. 6). The resolution of the transposition intermediate involves the duplication of the transposon and results in the cointegration of the donor DNA with the target DNA. These elements use the free 3′-OH that is made available in the host DNA to initiate replication using host-encoded DNA polymerases, mediating DNA replication that extends across the entire transposon both in the old and new locations. For most replicative transposons, resolution of the new target-mobile element junction is necessary after strand transfer has been completed in transposition. Some elements, such as Tn3 and Tn917, encode a separate site-specific recombination system that will remove the donor DNA, including a copy of the transposon, from the target DNA (see section “Conservative Site-Specific Recombination” below). In the bacteriophage Mu we see yet another example of how the lifestyles of intracellular parasites can interweave. The Mu element is packaged as a virus, but undergoes DNA replication using the process of replicative transposition. In this system, cointegration of target and donor molecule is left unresolved because the host genome will be abandoned after packaging of the virus.

Second Strand Cleavage

Cut-and-paste transposons cut the non-transferred (second) strand to liberate the entire element during transposition (Turlan and Chandler 2000). Some elements (such as Tn5 and Tn10) achieve second strand cleavage by using the 3′-OH from the transferred strand to attack the non-transferred strand, forming a hairpin structure. This reaction causes the element to be completely excised from the donor DNA at the time of strand transfer and results in a double-strand DNA break in the donor DNA where the transposon used to be. This double-strand DNA break is often repaired by homologous recombination and DNA replication, using the sister chromosome as a template so that the transposon is maintained in the original donor DNA site. The liberated transposon remains bound by recombinase proteins at either end in a DNA-protein complex, called a nucleoprotein complex. The nucleoprotein complex then finds a suitable target, re-breaks the ends of the transposon just as before using an activated water molecule as the nucleophile, and mediates the strand-transfer reaction into the target DNA molecule. Some transposons, such as Tn7, encode a separate endonuclease-like protein (TnsA) that is dedicated to cleavage of the non-transferred strand in a reaction that simply nicks (or causes a single-strand DNA break) at the 5′-end of the element, completing the release

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 6 In replicative transposition, extensive DNA replication copies in the DNA element. (a) Transposase proteins bind to the ends of the DNA element (red lines) and mediate a strand-exchange reaction between the donor DNA (black lines) and the recipient DNA (blue lines). (b) The 3′-DNA ends generated in the recipient DNA by the strand-exchange reaction are used to prime DNA synthesis. (c) Since the DNA element is now continuous with the template strand of the recipient DNA, the element is copied twice, once in each direction, since both the top and bottom strands had 3′-ends. (d) Following replication of the element, the donor DNA and the mobile element are both connected to the target DNA, in an intermediate called a cointegrate. Many elements also encode site-specific recombinases that will resolve the cointegrate intermediate

of the element from the donor DNA (Hickman et al. 2000). Tn7 can be converted to a replicative transposon through mutation of tnsA that inactivates the non-transferred strand cleavage activity of this element (May and Craig 1996; Peters and Craig 2001). Another transposon, IS903, is capable of switching between replicative and cut-and-paste transposition depending on DNA sequences flanking the element (Curcio and Derbyshire 2003). In another example of the subtle boundary between cut-and-paste and copy-and-paste mechanisms, bacteriophage Mu integrates into the bacterial genome as a cut-and-paste transposon, but multiplies through a replicative transposition mechanism once within the cell.

Composite Transposons of Bacteria

The simplest transposons in bacteria, called insertion sequences (IS), are characterized by a single gene encoding the transposase, flanked by inverted repeat sequences that are recognized by the transposase (Fig. 7). Inverted repeat sequences delineate the ends of the transposon; all DNA between these ends is transferred. More complex transposons contain a variety of other genes. In addition to movement of all genetic information between a single pair of ends, composite transposons (such as Tn10) can be formed (Craig 2002; Tropp 2012). Composite transposons are the result of two insertion sequences that have inserted close together, and subsequent transposition events that employ the outermost ends of both

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 7 Mobile genetic elements exist at an array of different levels of complexity. Figure key: inverted repeats; disabled repeat; transposase gene; antibiotic resistance or other fitness-related gene; resolution site; site-specific recombinase. (a) MITEs: the simplest DNA elements are composed of inverted repeats that may be recognized by a transposase encoded elsewhere in the genome. (b) Insertion sequences are the simplest autonomous mobile elements, encoding the transposase that is necessary to carry out strand-exchange reactions, and they possess the required inverted repeats that are recognized by the transposase proteins. (c) Compound transposons are elements that are composed of two insertion sequences flanking genes that were not previously associated with the insertion sequences. In these elements (such as Tn10) at least one of the inside inverted repeats is often disabled, so only the outermost inverted repeats can be used for successful transposition events. (d) Complex transposons: there are many other transposons (such as Tn7, Tn3, and Tn21) that have more complicated arrangements that may also include other recombination systems that are either required for additional processing of the transposon after it has been mobilized (e.g., resolution systems for replicative transposons) or are simply mobilized along with the rest of the transposon (e.g., integron cassette systems)

elements mobilize all genetic information between the ends. This arrangement enables the capture of genes from DNA molecules that house the insertion sequences, and is typically accompanied by the disabling of internal ends of the former insertion sequences.
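The role of inverted repeats in delineating the ends of an insertion sequence can be illustrated with a short Python sketch that measures how far the left end of an element matches the reverse complement of its right end. The function names and the toy element are hypothetical, and the sketch looks only for perfect matches, whereas inverted repeats in real elements are often imperfect.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element, max_len=50):
    """Length of the longest perfect terminal inverted repeat.

    Compares the left end of `element` with the reverse complement of
    its right end, base by base, stopping at the first mismatch.
    """
    rc = revcomp(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


# Made-up element: a 7-bp inverted repeat flanking an 8-bp interior.
toy_is = "CTGACTC" + "AAAACCCC" + "GAGTCAG"
print(terminal_inverted_repeat_length(toy_is))
```

Under this simplified view, a composite transposon would be bounded by the outermost repeats of its two flanking insertion sequences, while a disabled internal repeat would fail such a match.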

transposable elements (or MITEs) (Feschotte et al. 2002). They are nonautonomous transposons, commonly found in plants, that rely on a transposase encoded elsewhere in the genome. The transposase genes can be found in the autonomous counterparts of these elements, which do encode their own transposase genes and can invoke their own mobilization. In the case of MITEs, the transposase protein can act on any accessible transposon ends, regardless of what lies between them, mobilizing these much smaller elements that typically lack open reading frames. MITEs are simply a pair of inverted repeats that have evolved from the full-length transposons whose transposase they now rely on for mobilization. They are typically quite small (100 within an individual integron cassette system reported in some Vibrio cholerae strains)

excisionase (Xis), for excision from the genome. Int recognizes a conserved attachment site within the E. coli chromosome, called attB, and one within the λ genome, called attP. Recombination between the two sites results in two hybrid att sites, attL and attR, flanking the now integrated λ genome. For many bacteriophages that use a similar integration system, the site within the bacterial host genome where recombination takes place, often called the attachment site, is a tRNA gene, which recombines with a similar tRNA gene within the bacteriophage genome. Since the site-specific recombination process does not disrupt the host-encoded gene, the bacteriophage is able to use this highly conserved sequence to gain access to the chromosome of diverse bacteria without disabling any essential functions in the process. Integron cassette systems also rely on the activity of a tyrosine recombinase to incorporate selectively advantageous genes (such as antibiotic resistance genes) flanked by site-specific recombinase recognition sequences (Cambray et al. 2010; Fig. 11). Integron systems facilitate mobilization of gene cassettes and optimization of gene expression for factors that are not essential, but are occasionally necessary for the survival of the host organism. Since the site-specific recombination sites are maintained with each additional cassette, more cassettes can be added, leading to the accretion of horizontally transferred, selectable genes within the integron cassette region.

Resolvases and Invertases: Examples of Serine Recombinases

Members of the resolvase-invertase family utilize a conserved serine residue to accomplish the same conservative recombination outcome described above. Resolvases are involved in sorting out cointegration intermediates observed in replicative transposition events (e.g., Tn3) by excising the DNA sequence between two recognition sequences, ensuring the maintenance of chromosomal or plasmid stability (Cambray et al. 2010). Invertases mediate the inversion of DNA sequences flanked by their cognate

Mobile DNA: Mechanisms, Utility, and Consequences

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 12 Orientation of site-specific recombinase binding sites dictates inversion versus deletion. (a) Recombinase binding sites oriented in the same direction mediate the excision of the intervening DNA sequence. (b) Recombinase binding sites that are oriented in opposing directions mediate the inversion of the intervening DNA sequence
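The orientation rule summarized in Fig. 12 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a model of any particular recombinase: the 5-bp site `GATTC` is a hypothetical asymmetric recognition sequence, DNA is represented as a single top-strand string, and the function name `recombine` is introduced here only for illustration.

```python
# Minimal sketch: site orientation decides deletion versus inversion.
# SITE is a hypothetical asymmetric recognition sequence (an assumption).

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
SITE = "GATTC"


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def recombine(left: str, middle: str, right: str, inverted: bool = False):
    """Recombine two sites flanking `middle`.

    Direct repeats (inverted=False): the intervening DNA is excised as a
    circle that carries one copy of the site.
    Inverted repeats (inverted=True): the intervening DNA is flipped in place.
    Returns (substrate, product, excised_circle_or_None).
    """
    second_site = revcomp(SITE) if inverted else SITE
    substrate = left + SITE + middle + second_site + right
    if inverted:
        product = left + SITE + revcomp(middle) + second_site + right
        return substrate, product, None
    product = left + SITE + right
    circle = middle + SITE  # circular molecule written as a linear string
    return substrate, product, circle


if __name__ == "__main__":
    for flag in (False, True):
        substrate, product, circle = recombine("AAAA", "ACGGT", "TTTT", inverted=flag)
        print("inverted sites" if flag else "direct repeats")
        print("  substrate:", substrate)
        print("  product:  ", product)
        print("  circle:   ", circle)
```

Because both outcomes regenerate intact sites, the inversion product can be recombined again to restore the original orientation, which is the basis of reversible switches such as phase variation.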

DNA-binding sites, thereby altering gene expression in the vicinity of the element (e.g., the Hin/hix phase variation system of Salmonella). Both resolvases and invertases contain a catalytic serine in their active site, but differ in the recognition sequence arrangement. Resolvases require directly repeated res sites to excise the intervening DNA sequence, whereas invertases require inverted target sites to flip the intervening DNA sequence (Fig. 12). Both forms of recombination have strict DNA topological requirements, requiring appropriately supercoiled substrate DNA. In addition to the supercoiling requirement, invertases also require DNA bending proteins, such as FIS in bacteria. Site-specific recombination reactions mediated by serine recombinases do not involve a Holliday junction intermediate, but proceed by concerted breaking and joining reactions that exchange both strands at once. Serine recombinases also interact with the sugar backbone of the DNA in a slightly different way compared to tyrosine recombinases. Serine recombinases leave 3′-OH groups after protein-DNA bond formation as opposed to the 5′-OH left by tyrosine recombinases. There are many examples of thematic overlap between transposon and site-specific recombination systems. Y-transposons and S-transposons are elements that utilize a tyrosine or serine recombinase, respectively, yet do not require significant homology between target and donor DNA molecules. Recently, a class of transposons has been described, whose members include the

IS200/605 DNA transposons, which use similar chemistry to mediate the transfer of a single strand of DNA and insert into sites that lack homology.

Target-Primed Reverse Transcription

In systems that utilize a target-primed reverse transcriptase mechanism, no true strand transfer is needed. The free 3′-OH of a broken DNA molecule is used to prime the initiation of the reverse transcriptase enzyme that makes a DNA copy of the element (Curcio and Derbyshire 2003; Tropp 2012; Fig. 13). Since the target DNA was used as the priming end, the information encoded within the element is now covalently attached to the target DNA molecule. The process of target-primed reverse transcription is more variable with respect to the kinds of DNA repair that are required following insertion. The proteins encoded by these elements commonly possess endonucleolytic activity, but this activity is not commonly controlled by a specific sequence. This means that the endonuclease can cleave upstream or downstream of the insertion site, leading to either a gap or a flap of extraneous DNA. Host-associated enzymes repair the DNA damage, leaving behind target-site duplications in the event of a gap or target-site deletions in the event of a flap. Some elements are capable of making the break directly opposite to the original break, resulting in a blunt insertion. Once the initial DNA copy of the element is established, another break in the target DNA is made, and either the element-encoded reverse transcriptase or a host

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 13 Target-primed reverse transcription occurs through an RNA intermediate. (a) The mobile element, consisting of 5′- and 3′-UTR regions and an open reading frame encoding a reverse transcriptase gene, is transcribed by a host RNA polymerase (green oval), producing an RNA transcript of element (red box) including a poly(A) tail (commonly added to mRNA in eukaryotes, purple line). (b) The reverse transcriptase gene is translated, and the resulting RT protein binds to the RNA transcript. (c) The RT enzyme nicks the target DNA, generating a free 3′-OH. (d) The free 3′-OH generated is used to prime the synthesis of a new DNA copy of the element at the site of the nick. The opposite DNA strand of the target DNA is also nicked by the RT enzyme. (e) The gap in the target DNA is repaired, by either a host DNA polymerase or the RT enzyme and ligase. (f) The new copy of the mobile element differs from the original copy by the addition of more (A) nucleotides to the 3′-end

DNA polymerase can be used to make the complementary strand of the element. In some cases homologous recombination plays an even greater role in resolving recombinant molecules.
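The relationship between the positions of the two target breaks and the resulting target-site duplication, deletion, or blunt insertion can be sketched as follows. This is an illustrative toy model, not a description of any specific element: the target is a single top-strand string, nick positions are string indices, and the sign convention (second nick downstream of the first gives a gap, upstream gives a flap) is an assumption chosen for the example. The function name `tprt_insert` and the placeholder element `[L1]` are hypothetical.

```python
# Toy model of insertion outcomes after target-primed reverse transcription.
# Assumed convention (for illustration only):
#   second_nick > first_nick  -> gap  -> staggered bases are duplicated
#   second_nick < first_nick  -> flap -> staggered bases are deleted
#   second_nick == first_nick -> blunt insertion


def tprt_insert(target: str, element: str, first_nick: int, second_nick: int):
    """Return (product, outcome, staggered_bases)."""
    lo, hi = sorted((first_nick, second_nick))
    staggered = target[lo:hi]
    if second_nick > first_nick:
        # Gap repair copies the staggered bases onto both sides of the element.
        return target[:hi] + element + target[lo:], "duplication", staggered
    if second_nick < first_nick:
        # The flap is removed, so the staggered bases are lost from the target.
        return target[:lo] + element + target[hi:], "deletion", staggered
    return target[:lo] + element + target[lo:], "blunt", ""


if __name__ == "__main__":
    target = "AACCGGTT"
    for first, second in ((2, 5), (5, 2), (3, 3)):
        product, outcome, bases = tprt_insert(target, "[L1]", first, second)
        print(f"{outcome:12s} {product}  (staggered: {bases or 'none'})")
```

In the duplication case the staggered bases appear on both sides of the inserted element, which is how target-site duplications are recognized in sequence data.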

LINEs and SINEs

Long interspersed elements (LINEs) move via an RNA intermediate and use a reverse transcriptase enzyme, but do not use DDE-type recombinases to accomplish transfer to a new site. LINEs were first recognized as repeat sequences in the human genome before they were known to be mobile elements and are sometimes called non-LTR elements because they lack the long terminal repeats found in retrotransposons mentioned above. These elements make a break in a single strand of the target DNA and use the newly generated free 3′-OH to prime reverse transcription of the RNA element, essentially amounting to a molecular bait-and-switch. These elements do not require specific DNA sequences at their ends, but there does appear to be a preference for DNA that is naturally bent (Tropp 2012). In some cases elements may actually use the 3′-OH from the host DNA replication process as a primer, or random nicks in the target DNA may be used to initiate reverse transcription. The newly synthesized single strand of DNA is colinear with a single strand of the target DNA. To complete the process, the opposite strand is nicked and used to prime second strand synthesis, using the new DNA copy of the element as template. Some reverse transcriptases include an RNase H domain for the removal of the original RNA template during second strand synthesis. LINEs are highly abundant in the human genome, comprising ~21% of the total genome; however, almost all of these elements are inactivated remnants (Kazazian 2004). There are ~100 active LINE elements in the human genome. SINEs (or short interspersed elements) are nonautonomous derivatives of LINEs that are much shorter in length and very abundant in many genomes, including in humans. The human genome contains 1.1 million copies of the Alu element (Kazazian 2004). SINEs are

Mobile DNA: Mechanisms, Utility, and Consequences, Fig. 14 Group II mobile introns. (a) The element is copied out of DNA by a host RNA polymerase enzyme (green oval). (b) Translation of the open reading frame within the element produces a reverse transcriptase protein (RT, purple oval). (c) Self-splicing of the intron from the mRNA proceeds by the following steps: (i) nucleophilic attack of the splice donor site by a 2′-OH group within the intron, leading to the formation of a lariat structure. (ii) The splice donor serves as a nucleophile, attacking the splice acceptor site at the 3′-end of the intron. (iii) The intron is then released as a lariat-shaped RNA molecule with the mRNA splice donor and acceptor sites fused. (d) A new target site is selected through base pairing between the intron and the target DNA, and a reverse splicing reaction joins the intron with DNA. (e) Reverse transcriptase nicks the target DNA generating a 3′-OH that is used in priming reverse transcription. (f) RT uses an RNase H domain to remove the RNA copy of the intron (or it is displaced) during second strand synthesis, generating a double-strand DNA copy of the intron at the new site

typically less than 1,000 base pairs in length and can be mobilized by closely related LINEs. The relationship between LINEs and SINEs is analogous to the previously mentioned transposable MITEs. The reverse transcriptase of LINEs can sometimes act on cellular mRNAs, resulting in processed pseudogenes, which can become new hybrid genes. Estimates of processed pseudogene formation suggest that ~0.5% of the coding capacity of the human genome can be attributed to the activity of L1 elements.

Mobile Group II Introns

Mobile Group II introns also use a target-primed reverse transcription mechanism (Fig. 14). These elements are common in prokaryotes and eukaryotic organelle genomes and are thought to be the progenitors of introns found within eukaryotic coding sequences. Mobile Group II introns encode at least one protein with two main functions, a reverse transcription function and an RNA chaperone activity (Lambowitz and Zimmerly 2004). These proteins are typically referred to as intron-encoded proteins (IEP). The RNA of most Group II introns possesses a ribozyme activity that enables them to perform both splicing and reverse splicing activities, releasing them from RNA. The splicing reaction begins with the self-cleavage of the intron at the intron-exon junction by nucleophilic attack of the 5′-phosphate by the 2′-OH of a conserved adenine ribonucleotide. This process is typically aided by the RNA chaperone function of the IEP, although for some elements this activity is dispensable. The splicing reaction produces an RNA loop, known as a lariat structure. The reverse transcriptase function of the IEPs enables the production of a DNA copy of the element once it has been reverse-spliced into

a section of DNA. It is unclear whether the reverse transcriptase or host enzymes replace the remaining RNA copy of the element with DNA.

Mobile Element Involvement in the Evolution of Host Organisms

Transposons participate in the evolution of organisms actively by movement of the element within the host genome and passively by providing regions of homology that can participate in homologous recombination. Active transposition can interrupt genes or change their regulation. The original discovery of transposons by Barbara McClintock involved the movement of the Ds element, a nonautonomous derivative of the Ac element, abrogating or restoring a gene that makes pigment in corn kernels. Some components of mobile elements have been domesticated by their host organisms and incorporated into basic organismal functions. An example of the domestication of mobile elements can be found in the V(D)J recombination system that generates antibody diversity in mammals by a mechanism that is similar to mobilization of the Hermes transposon (Zhou et al. 2004). In both the V(D)J and Hermes recombination systems, the transesterification reaction generates hairpins in the donor DNA that flanks the mobile element. In V(D)J recombination, the donor DNA molecules rejoin to coding sequences of the antibody, generating new variants of antibodies. In the case of the Hermes transposon, the free 3′-ends of the element are used for subsequent insertion of the transposon. In a process that harkens back to a transposon ancestry, “mistakes” in V(D)J recombination occasionally occur in which DNAs recombine elsewhere within the genome, leading to B-cell malignancies (Curcio and Derbyshire 2003). Frequent mobilization events are typically not beneficial for host organisms. For this reason, host genetic systems are capable of deactivating transposons. Nor is it to the benefit of the essentially parasitic mobile elements to destroy the host in which they reside. 
In eukaryotes, transposons are typically silenced by RNAi mechanisms, inhibiting transposition and preventing damage to the host genome. Mobilization of genetic elements is an especially large problem in the germ

line of plants and animals. Some organisms go through a stage in which transposons are reactivated within the vegetative cells that accompany gametes (in nurse cells and in pollen) (Slotkin et al. 2009). siRNA that neutralizes the mobilized elements is produced and transferred to the germ cells. This process ensures that potentially active elements will be silenced without ever causing damage within the germ cells, where changes to the genome could be particularly harmful and heritable. Some mobile elements, L1 in particular, have the ability to move adjacent DNA to new locations along with their own movement. This property is known as 3′-transduction. The 3′-transduction process leads to the production of new exons or altered gene expression through promoter and enhancer shuffling. It has been estimated that ~1% of the haploid genome of humans is a result of this process (Callinan and Batzer 2006). Alu elements are involved in a process known as exonization or “Aluternative splicing.” In this process, the presence of an Alu element in an intron of a gene prevents splicing of that intron, leaving the sequence to encode protein. This phenomenon is thought to be responsible for ~5% of all alternative splicing events in humans. As in all host-parasite relationships, it is beneficial for both the host and the mobile element to control propagation. Both frequency of mobilization and site of insertion must be controlled. From the perspective of the mobile element, moving too frequently will risk harming the host so severely that the host organism dies, and the element will no longer be propagated along with the host (Craig 2002; Nagy and Chandler 2004; Tropp 2012). Infrequent movement leaves the element vulnerable to deactivation through mutation and recombination. Similarly, indiscriminate target-site selection may result in disruption of essential genes and, therefore, neutralization of both host and element. 
Mobile elements often maintain mechanisms that limit transposition and in some cases direct movement to sites that will not endanger the host. In a particularly elegant example, transposon Tn7 does not activate transposition until an acceptable target has been identified. Transposition regulation and targeting

proteins enable Tn7 to monitor suitable insertion sites, metabolic status of the host, replication status of the host genome, and the arrival of plasmids and other genetic entities. Tn7 also displays a phenomenon known as target-site immunity. Once one copy of the element has been inserted into the genome, insertion of another copy of the same element is discouraged. The target-immunity mechanism prevents damage to the host genome and prevents the transposon from destroying itself by inserting into itself. The region of the chromosome that is “immune” to Tn7 insertion can be extensive, spanning >190 kilobases of DNA sequence. This process is probably very important in conjugal plasmids, which are a preferred target for Tn7, via its TnsE-mediated pathway of transposition; multiple insertions in a conjugal plasmid would likely destabilize the plasmid. Many mobile elements, particularly in bacteria, contain genes that benefit the host under certain selective conditions. These genes may encode factors that confer antibiotic or heavy metal resistance, systems that establish resistance to acquiring other mobile genetic elements, and metabolic genes that enable the utilization of additional nutrients. Beneficial genes provide incentives for the maintenance of the genetic element by the host organism and prevent loss of the element under selective conditions. Some genetic elements encode “host addiction” systems that include toxin and antitoxin genes, which ensure the maintenance of the element. In these systems stable toxins are produced along with unstable antitoxin proteins. If the element carrying the antitoxin gene is ever lost, the toxin will kill the host or at least prevent its replication until the antitoxin-carrying element can be reacquired. 
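The logic of a host addiction system, a stable toxin paired with an unstable antitoxin, can be illustrated with a simple decay calculation. The sketch below is a deliberately simplified model under stated assumptions: after the element is lost, neither protein is synthesized, both decay exponentially with fixed half-lives, one antitoxin molecule neutralizes one toxin molecule, and dilution by cell growth is ignored. The function names and all numerical values are hypothetical and chosen only for illustration.

```python
# Simplified toxin-antitoxin model (all numbers are illustrative assumptions).
import math


def level(initial: float, half_life: float, t: float) -> float:
    """Amount remaining after time t, assuming simple exponential decay."""
    return initial * 2 ** (-t / half_life)


def time_to_free_toxin(toxin0: float, antitoxin0: float,
                       toxin_half_life: float, antitoxin_half_life: float) -> float:
    """Time after element loss at which antitoxin drops below toxin.

    Solves antitoxin0 * 2**(-t/h_A) = toxin0 * 2**(-t/h_T) for t.
    Returns 0.0 if toxin is already in excess, and infinity if the
    antitoxin is at least as stable as the toxin (no addiction).
    """
    if antitoxin0 <= toxin0:
        return 0.0
    if antitoxin_half_life >= toxin_half_life:
        return math.inf
    rate_difference = 1 / antitoxin_half_life - 1 / toxin_half_life
    return math.log2(antitoxin0 / toxin0) / rate_difference


if __name__ == "__main__":
    # Hypothetical cell: 4-fold antitoxin excess, half-lives in minutes.
    t = time_to_free_toxin(toxin0=100, antitoxin0=400,
                           toxin_half_life=240, antitoxin_half_life=15)
    print(f"Free toxin appears about {t:.0f} min after loss of the element")
```

With these assumed values the antitoxin excess is exhausted in roughly half an hour; the qualitative point is that the outcome depends on the difference in stability, not on the starting excess alone.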
Archaea and some bacteria possess clustered regularly interspaced short palindromic repeats (CRISPR) systems that serve as an immune system, protecting the genome of the organism from invasion by viruses, plasmids, and transposons (Barrangou 2013). The CRISPR systems store sequence information of elements that they have developed resistance to and show evidence of protection against a broad range of horizontally transferred mobile genetic elements. Surprisingly,

CRISPR systems have also been found within transposable elements and bacteriophage (Parks and Peters 2009).

Mobile Genetic Elements in Biotechnology

Mobile genetic elements have long been used for the manipulation of genes and genomes for industrial uses. They have many benefits and can be very versatile. Among other uses, mobile elements are used for the following:
• Random and targeted mutagenesis of eukaryotic and prokaryotic genomes
• Aid in sequencing of long contiguous DNA molecules
• Capture or transfer of genetic information in cloning and subcloning strategies
• Transcriptional or translational fusion of reporter genes to promoters or open reading frames

Transposons have been extensively used as tools of random mutagenesis to knock out genes within a genome (Craig 2002). By performing exhaustive mutagenesis of a genome, libraries of organisms containing individual transposon insertions can be created and tested for growth or display of a certain phenotype under a given set of conditions. After determining which transposon insertion results in the observed phenotype, the function of a gene can be surmised. Similarly, transposons that contain inducible or repressible promoters near their ends can be used to conditionally activate or shut off genes, providing an additional element of control to functional genomics studies. Transposons are frequently used in Drosophila for introducing transgenes, making deletions, making gene fusions, and producing RNAi substrates for the silencing of genes. The P element is typically used in Drosophila, while other elements, such as hobo, Minos, and piggyBac, are gaining popularity in other model insect systems. Transposons have also been used to fuse reporter proteins to native proteins within cells. This technique allowed investigators to determine the cellular localization of some gene products. A clever adaptation of this technique has also been used to

fuse fluorescent proteins at random positions within a given protein, allowing easy identification of fusion proteins that remain functional despite the addition of a large fluorescent protein (Gregory et al. 2010). Transposons may be used to aid in sequencing projects involving long contiguous stretches of DNA. One of the challenges of DNA sequencing technologies is short read length. By randomly inserting a transposon with known sequence into larger DNA molecules, investigators establish sites within that DNA molecule that may be used to prime sequencing reactions. Some transposons have been engineered to contain a conditional origin of replication, allowing nearby genes to be easily cloned by endonuclease restriction and ligation (Tropp 2012). Site-specific recombinases are also commonly used as tools in molecular biology laboratories. A popular application of the lambda integration system is the Gateway system (Invitrogen, Carlsbad, CA). This system employs the Int and IHF proteins to clone DNA sequences from PCR products containing the Int recognition sequences from phage lambda. A counter-selectable marker allows efficient identification of plasmids that contain the insert of interest. Once cloned, subcloning into plasmids for other purposes is accomplished by using the Int, Xis, and IHF proteins to move the gene of interest into other plasmids that contain lambda attachment sites. A major benefit of the Gateway approach is that recombination steps can be done entirely in vitro. The Cre/lox system from bacteriophage P1 has been extensively used in plants and animals. Conditional genomic deletions and rearrangements can be achieved by engineering lox sites within a DNA target and expressing the Cre recombinase from an inducible promoter. The FLP/FRT system from yeast has been used for similar purposes. Group II mobile introns may also be used to modify genomes. 
A benefit to the use of these elements is that they can be targeted to virtually any desired sequence. Homing introns use DNA sequence homology between the element and recipient DNA to target insertion. By modifying

the mobile intron to contain the desired target sequence, the element can be potentially directed to any given site (Tropp 2012). Due to the ubiquity of some mobile elements, an important consideration is the possibility of mobilizing elements at inappropriate times after an experiment has been carried out or inadvertent activation of endogenous elements. Some well-studied mobile elements can be used to perform genetic manipulations in vitro, preventing activation of endogenous mobile elements within living cells.

Medical Implications and Therapeutic Uses of Mobile Elements

Given that mobile elements are so abundant in the human genome, it is no surprise that mobilization of these elements may cause disease. It is estimated that 0.1–0.27% of all genetic disorders result from a mobilization event someplace in the chromosome (Callinan and Batzer 2006). It is thought that 1 in 50 people experiences the mobilization of L1 elements either early in development or in the parental germ line. Mobile elements can affect human health in a variety of ways and have led to disease states such as certain types of hemophilia, muscular dystrophy, and Coffin-Lowry syndrome. In some cases mobile genetic elements can mitigate the effects of genetic disorders that afflict the host by disabling or reducing the expression of genes that cause disorders. Mobile elements can be useful in gene therapy strategies. Gene therapy is an approach whereby aberrant genetic information that leads to a disease state may be corrected by introduction of DNA containing a functional gene. In gene therapy an abnormal gene may be replaced by the insertion of a functional copy of the gene, or in some cases, the regulation of a gene can be altered. These strategies require the permanent addition of functional genetic information, for which mobile genetic elements are well suited. Retroviral vectors have been used with some success; however, the insertion of the viral vector has caused problems of its own. 
Viral vectors have the benefit of potential administration directly to the patient, while other

approaches require ex vivo manipulation of cells and the reintroduction of the cell line into the patient. Efficient transposition in mammalian cells by the Sleeping Beauty and piggyBac elements may help in the development of gene therapy strategies (VandenDriessche et al. 2009). Both of these systems have demonstrated sufficient functionality within human cells, allowing the transfer of engineered genetic constructs into the genome of somatic cells. However, they both have the drawback of inserting nonspecifically, potentially influencing the function of other systems within cells. Despite this characteristic, they display some benefits over retroviral vectors that have been used for gene therapy approaches in the past. In early gene therapy clinical trials, integration resulted in the development of leukemia in four individuals treated for severe combined immunodeficiency X1. Site-specific recombinases are also being developed to remove genetic information. Investigators used a directed evolution strategy to evolve the Cre recombinase to recognize the LTR of HIV-1 proviral DNA and remove it from the genome. Therapeutic application of this technique is still very far off, but in conjunction with other new technologies, the promise of this technique is enticing. Using combinations of transposon systems and site-specific recombination systems, researchers can produce induced pluripotent stem (iPS) cells used in regenerative medicine (VandenDriessche et al. 2009). In this strategy c-Myc, Klf4, Oct4, and Sox2 were transposed into fibroblast cells using the piggyBac transposase and were later removed using the same transposase once the fibroblast cells had been converted into pluripotent stem cells. New mobile elements continue to be discovered and characterized, and they are often quickly adapted for use in medicine and biotechnology. 
As natural agents of recombination and diversification, mobile elements will clearly continue to be a mainstay of the biotechnology toolbox.

Cross-References
▶ DNA Recombination, Mechanisms of
▶ DNA Repair Polymerases
▶ DNA Replication
▶ Double-Strand Break Repair
▶ Homologous Recombination in Lesion Bypass
▶ V(D)J Recombination

References
Aziz RK, Breitbart M, Edwards RA (2010) Transposases are the most abundant, most ubiquitous genes in nature. Nucleic Acids Res 38:4207–4217
Barrangou R (2013) CRISPR-Cas systems and RNA-guided interference. Wiley Interdiscip Rev RNA 4:267–278
Callinan PA, Batzer MA (2006) Retrotransposable elements and human disease. Genome Dyn 1:104–115
Cambray G, Guerout AM, Mazel D (2010) Integrons. Annu Rev Genet 44:141–166
Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome evolution. Nat Rev Genet 10:691–703
Craig NL (1997) Target site selection in transposition. Annu Rev Biochem 66:437–474
Craig NL (2002) Mobile DNA II. ASM Press, Washington, DC
Curcio MJ, Derbyshire KM (2003) The outs and ins of transposition: from mu to kangaroo. Nat Rev Mol Cell Biol 4:865–877
Derbyshire KM, Grindley ND (1986) Replicative and conservative transposition in bacteria. Cell 47:325–327
Dyda F, Chandler M, Hickman AB (2012) The emerging diversity of transpososome architectures. Q Rev Biophys 45:493–521
Feschotte C, Jiang N, Wessler SR (2002) Plant transposable elements: where genetics meets genomics. Nat Rev Genet 3:329–341
Gregory JA, Becker EC, Jung J, Tuwatananurak I, Pogliano K (2010) Transposon assisted gene insertion technology (TAGIT): a tool for generating fluorescent fusion proteins. PLoS One 5:e8731
Grindley ND, Whiteson KL, Rice PA (2006) Mechanisms of site-specific recombination. Annu Rev Biochem 75:567–605
Hickman AB, Li Y, Mathew SV, May EW, Craig NL, Dyda F (2000) Unexpected structural diversity in DNA recombination: the restriction endonuclease connection. Mol Cell 5:1025–1034
Kazazian HH Jr (2004) Mobile elements: drivers of genome evolution. Science 303:1626–1632
Lambowitz AM, Zimmerly S (2004) Mobile group II introns. Annu Rev Genet 38:1–35
May EW, Craig NL (1996) Switching from cut-and-paste to replicative Tn7 transposition. Science 272:401–404
McClintock B (1950) The origin and behavior of mutable loci in maize. Proc Natl Acad Sci USA 36:344–355
Nagy Z, Chandler M (2004) Regulation of transposition in bacteria. Res Microbiol 155:387–398
Parks AR, Peters JE (2009) Tn7 elements: engendering diversity from chromosomes to episomes. Plasmid 61:1–14
Peters JE, Craig NL (2001) Tn7: smarter than we thought. Nat Rev Mol Cell Biol 2:806–814
Plikaytis BB, Crawford JT, Shinnick TM (1998) IS1549 from Mycobacterium smegmatis forms long direct repeats upon insertion. J Bacteriol 180:1037–1043
Siguier P, Gourbeyre E, Chandler M (2014) Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev https://doi.org/10.1111/1574-6976.12067. [Epub ahead of print]
Slotkin RK, Vaughn M, Borges F, Tanurdzic M, Becker JD, Feijo JA, Martienssen RA (2009) Epigenetic reprogramming and small RNA silencing of transposable elements in pollen. Cell 136:461–472
Tropp BE (2012) Molecular biology: genes to proteins, 4th edn. Jones & Bartlett Learning, Sudbury
Turlan C, Chandler M (2000) Playing second fiddle: second-strand processing and liberation of transposable elements from donor DNA. Trends Microbiol 8:268–274
VandenDriessche T, Ivics Z, Izsvak Z, Chuah MK (2009) Emerging potential of transposons for gene therapy and generation of induced pluripotent stem cells. Blood 114:1461–1468
Wu X, Burgess SM (2004) Integration target site selection for retroviruses and transposable elements. Cell Mol Life Sci 61:2588–2596
Zhou L, Mitra R, Atkinson PW, Hickman AB, Dyda F, Craig NL (2004) Transposition of hAT elements links transposable elements and V(D)J recombination. Nature 432:995–1001

Modification of DNA
▶ Electrophiles, Types of

Molecular Cloning
▶ Enzymes and Cloning Vectors Used to Create Recombinant DNA, Characteristics of

Monophyletic Group
▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

mRNA Decay
▶ Cytoplasmic mRNA, Regulation of

mRNA Degradation
▶ Cytoplasmic mRNA, Regulation of

mRNA Localization and Localized Translation
Angela K. Hilliker
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms
Localized translation; RNA localization

Definition
Some mRNAs are translated in specific locations of the cell. Localized translation is an efficient way to localize a particular protein. Often localized translation depends on the mRNA itself being localized to a particular place. mRNAs to be localized contain cis-elements that recruit “zip code” proteins that help localize the mRNA. During localization, the mRNA also recruits proteins that repress its translation, so that it cannot be translated until it reaches its destination.

Discussion

There is a growing number of examples of mRNAs that are localized to a particular place within a cell and only translated once localized. mRNA localization (and subsequent localized translation) has several possible advantages (as reviewed in Martin and Ephrussi 2009). First, mRNA localization is an efficient method to localize protein. Rather than localizing each protein, only a few mRNAs are localized, each of which can generate many proteins. Second, localizing translation can protect the cell against proteins that have a deleterious effect elsewhere. Third, it might allow faster regulation of translation, as the protein can be made both at the right place and at the right time. The number of known localized mRNAs is increasing, suggesting that this is an important aspect of gene expression regulation. In Drosophila embryos, 71% of mRNAs tested (about a quarter of the transcriptome) showed distinct localization in a number of exquisite patterns (Lecuyer et al. 2007). The localization of these mRNAs causes localized translation of their proteins, some of which are critical for setting up the body plan of the fly. Fibroblasts, highly motile cells in our connective tissue, have a complex network of actin in their lamellipodia that allows these cells to move. Actin mRNA is localized to the lamellipodia to produce actin protein at the edge of the cell, where it can drive cellular movement (reviewed in Martin and Ephrussi 2009). mRNA localization is important for several functions within our neurons, including synapse plasticity. A neuron has protrusions called axons, which transmit signals, and dendrites, which receive signals. An axon from one neuron connects to a target cell (such as a muscle cell) or a dendrite from another neuron to create a synapse. Chemical signals are passed through synapses to send messages through the nervous system. 
Synaptic plasticity is the ability of individual synapses to change in response to signals induced by experiences; this is thought to be the biochemical underpinning of learning and memory. Synapses can be quite far from the nucleus of a neuron (up to 1 m in humans), so hundreds of translationally repressed mRNAs are localized to the dendrite at each synapse. Emerging evidence

hints that specific stimuli can promote the translation of some, but not all, mRNAs. These early results suggest that synapses record our experiences through the translation of localized mRNAs, which strengthens the synapse. Firm evidence linking synaptic plasticity with mRNA localization and localized translation is challenging to achieve with current technology. An elegant experiment in mice shows the importance of the localized translation of a Ca2+/calmodulin-dependent protein kinase II (CaMKIIα). This mutant mouse expresses CaMKIIα mRNA that lacks a 3′ UTR, so CaMKIIα mRNA is not localized to dendrites. The protein is still produced throughout the neuron, but not by localized translation at the dendrites. These mice showed long-term memory defects, suggesting that specifically localized translation is important. FMRP (fragile X mental retardation protein) is a trans-factor that localizes and represses the translation of many critical neuronal mRNAs (Martin and Ephrussi 2009; Martin and Zukin 2006). Mutations in FMRP are associated with fragile X syndrome, which is the most common cause of mental retardation in boys. Studies of mRNA localization in many systems have revealed the following themes. The mRNA to be localized contains cis-elements that interact with trans-factors that direct the mRNA to the right part of the cell. The cis-element is often referred to as a zip code, and, if the mRNA is a postcard, then the trans-factor it binds is the postman. An mRNA’s localization signals are often in the 3′ UTR and are usually repeated to increase the efficiency of localization. Just as a postcard arrives with other mail, the mRNA is often delivered with other RNAs and their associated proteins in an RNA transport granule. This RNA transport granule hops a ride on a motor protein and travels to its appropriate destination by the cell’s highway system, the cytoskeleton. 
Once in position, the mRNA may be maintained at its destination by localized proteins or other factors. Often, if an mRNA is mislocalized, it is subject to degradation. Once the mRNA is localized, derepression of translation occurs, often because the repressor proteins are inactivated or competed away. While there are several examples of localized mRNAs that fit these themes, it is important to note that there are several mechanisms of localization and that some mRNAs deviate from this general description. For example, nanos mRNA in Drosophila is localized by entrapment, not by movement along the cytoskeleton. That is, there are proteins in the posterior of the embryo that bind nanos mRNA and keep it located posteriorly. Additionally, nanos mRNA that is not properly localized is destroyed.

One of the best-studied examples of mRNA localization is the Ash1 mRNA in the baker's yeast Saccharomyces cerevisiae, which is localized to the bud that will eventually become an independent daughter cell. Ash1 protein helps the daughter cell maintain its mating type, while the mother cell switches mating type, allowing both mating types to be present in a population.

mRNA Localization and Localized Translation, Fig. 1 Model of Ash1 mRNA localization to the bud tip of yeast. (1) Some of the localization factors assemble on Ash1 mRNA while it is still in the nucleus. (2) After nuclear export, the Ash1 localization mRNP includes Khd1 and Puf6, which block different steps in translation to keep Ash1 mRNA repressed. (3) She2 links Ash1 mRNA to myosin motor (green) so it can transport Ash1 along the actin cytoskeleton (red) to the bud tip. (4) Kinases in the membrane of the bud (pink lines) phosphorylate (orange circles) Puf6 and Khd1, forcing them to fall off the mRNA. (5) Their removal means that translation can occur (Adapted by permission from Macmillan Publishers Ltd: [Nature Reviews Molecular Cell Biology] (Besse and Ephrussi 2008), copyright (2014))

Ash1 mRNA contains several cis-elements that act as a zip code and bind several trans-factors, including Puf6, Khd1, and She3 protein (Fig. 1). Puf6 and Khd1 bind within the coding sequence of Ash1 mRNA to repress translation while the mRNA is in transit. She3 binds the 3′ UTR and functions as an adaptor, connecting the Ash1 mRNA to a motor protein called myosin. Myosin can transport the Ash1 mRNA along the actin cytoskeleton into the bud tip. Ash1 mRNA is observed accumulating in cytoplasmic foci; it is unclear whether these foci represent RNA transport granules (Bertrand et al. 1998; Lavut and Raveh 2012). Once Ash1 mRNA is localized, kinases embedded in the membrane at the bud tip phosphorylate Puf6 and Khd1, inactivating these repressors so that Ash1 translation can begin (Besse and Ephrussi 2008).


References

Bertrand E, Chartrand P, Schaefer M, Shenoy SM, Singer RH, Long RM (1998) Localization of ASH1 mRNA particles in living yeast. Mol Cell 2:437–445
Besse F, Ephrussi A (2008) Translational control of localized mRNAs: restricting protein synthesis in space and time. Nat Rev Mol Cell Biol 9:971–980
Lavut A, Raveh D (2012) Sequestration of highly expressed mRNAs in cytoplasmic granules, P-bodies, and stress granules enhances cell viability. PLoS Genet 8:e1002527
Lecuyer E, Yoshida H, Parthasarathy N, Alm C, Babak T, Cerovina T, Hughes TR, Tomancak P, Krause HM (2007) Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell 131:174–187
Martin KC, Ephrussi A (2009) mRNA localization: gene expression in the spatial dimension. Cell 136:719–730
Martin KC, Zukin RS (2006) RNA trafficking and local protein synthesis in dendrites: an overview. J Neurosci 26:7131–7134

mRNA Regulation
▶ Cytoplasmic mRNA, Regulation of

mRNA Storage
▶ Cytoplasmic mRNA, Regulation of

Multidimensional NMR
▶ NMR Basis for Biomolecular Structure (Theory)

Mutation and Cancer
▶ DNA Damage, Relevance to Cancer

Mutation Assays
▶ DNA Damage, Practical Screening for

N

NMR Approaches to Determine Protein Structure
Qi Hu
Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA

Synonyms
NMR, nuclear magnetic resonance; NOE, nuclear Overhauser effect; PDB, Protein Data Bank; RDC, residual dipolar coupling; RMSD, root-mean-square deviation; TROSY, transverse relaxation-optimized spectroscopy

Synopsis
NMR spectroscopy has developed very rapidly during the past decades, and modern NMR spectroscopy is widely used in structural biology. As one of the major methods for macromolecular structure determination at the atomic level, NMR is unique in its ability to determine macromolecular structure in aqueous solution close to the physiological condition. The ability to detect molecular motions of macromolecules over a large range of time scales makes NMR spectroscopy a tool suitable not only for structure determination but also for molecular dynamics studies.

This short entry briefly introduces the general process of solving a protein structure using NMR spectroscopy. Instead of emphasizing the theory, it introduces, step by step, the practical aspects of solving a solution structure by NMR spectroscopy.

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Introduction
In the past decades, NMR spectroscopy has developed rapidly and has been applied widely in biology, chemistry, materials science, medical science, and other fields. In structural biology, NMR is one of the most important methods for solving macromolecular structures at atomic resolution. While NMR is at present limited in the size of macromolecules that can be studied, the solution structures obtained by NMR are usually believed to reflect conditions close to the physiological state. Although X-ray crystallography makes the main contribution to the macromolecular structures deposited in the Protein Data Bank (PDB), NMR has clear advantages in handling proteins with different degrees of flexibility. A multidomain protein with a flexible linker or an unstructured domain can be difficult to crystallize, and NMR can be used to characterize its structure. Furthermore, its advantages in probing protein dynamics make NMR a suitable method for characterizing intrinsically disordered proteins, which are increasingly found to be implicated in many biological processes.

Sample Preparation
The first step in structure determination by NMR spectroscopy is to prepare a good sample with proper isotopic labeling at an appropriate concentration and purity. A good sample gives good NMR spectra and saves much time in both collecting and analyzing the data.

15N- or 15N,13C-Labeled Protein
For a well-folded protein of small size (< […] > 25 kD), the double-labeled samples may not give good spectra because of the fast relaxation of 1H in large proteins. For these cases, deuteration is a good way to overcome the obstacle by replacing most of the protons with deuterons (Fiaux et al. 2004). Deuteration dramatically decreases the line width of the signals and enables NMR spectra of improved quality to be obtained for even larger proteins. The level of deuteration of a sample depends on the deuterium content of the materials used for cell culture and protein expression. For partial deuteration in a triple labeling (15N, 13C, and 2H) using 15NH4Cl and 13C-glucose, with D2O in place of H2O, 70–80% deuteration of all the non-exchangeable aliphatic and aromatic protons is obtained. If perdeuteration is required, deuterated 13C-glucose needs to be used in place of the non-deuterated one. In order to record the normal 15NH-based experiments, it is better to purify the protein in normal aqueous solution so that 15N-D exchanges back to 15N-H. Deuteration in combination with transverse relaxation-optimized spectroscopy (TROSY) and other NMR techniques usually works very efficiently to extend the molecular weight limit of proteins that NMR spectroscopy can study.

Special Labeling Strategies
For some specific cases or very large proteins, special labeling strategies can be adopted to obtain satisfactory spectra. For large proteins produced by triple labeling with perdeuteration, usually only the NH groups are visible in NMR spectra. Too few NOEs are then available for high-quality structure calculation, especially NOEs between aliphatic protons in the hydrophobic core, which is important in stabilizing the overall fold of the structure. Tugarinov and Kay developed a labeling method that specifically labels the methyl groups of isoleucine, valine, and leucine residues in a protein (Tugarinov and Kay 2003). The labeling scheme produces protein that is uniformly 15N, 13C, 2H-labeled, except that the side chains of isoleucines, leucines, and valines are labeled as follows: one of the two methyl groups in the three residues is labeled as 13CH3, the other as 12CD3. This labeling method provides important restraints for the buried core of the structure.

There are also other labeling strategies. Segmental labeling produces proteins that are labeled in only one section of the amino acid sequence. Stereo-array isotope labeling (SAIL), which uses a cell-free technique, provides a means of stereo-selectively labeling the methyl and methylene groups in a protein. These methods are very useful for the study of very large or multidomain proteins whose signals overlap severely when regular labeling strategies are used.


Chemical Shift Assignments
Many software packages can be used for NMR data processing and chemical shift assignment, such as NMRPipe (Delaglio et al. 1995), NMRView, and Sparky.

Backbone Assignment
The aim of the analysis of NMR spectra is to gain as much information as possible about the interatomic distances and torsion angles. To obtain this information, the chemical shifts of the atoms in each residue must first be assigned. Generally speaking, the assignment can be divided into two parts: the sequential backbone assignment of the amino acids and the assignment of the amino acid side chains. Different assignment strategies may be applied for samples with different labeling status and different spectra.

For proteins smaller than 10 kDa, homonuclear spectra of an unlabeled sample can be used for assignment. Homonuclear experiments such as 2D-COSY, 2D-TOCSY, and 2D-NOESY are most frequently used. The assignment typically begins with the identification of certain amino acids that have a characteristic pattern of cross signals. Glycine has two Hα with chemical shifts near 4 ppm and can therefore be readily identified, while valine, leucine, and isoleucine can be recognized by their two methyl groups, which give a characteristic row of cross peaks between 1 and 2 ppm. Homonuclear experiments can provide enough information for the assignment of small proteins or peptides. However, as protein size increases, the overlap of peaks from different atoms in the spectra gets worse, and it becomes hard to obtain sufficient chemical shift information for the subsequent structure calculation. Therefore, triple-resonance experiments with 15N- and 13C-labeled samples are necessary for the assignment of larger proteins.

In a triple-resonance experiment, the sequential correlations between amino acids are established via J coupling (indirect dipole-dipole coupling). Usually, a series of triple-resonance experiments is recorded in order to complete the assignment. Magnetization transfer pathways of some commonly used triple-resonance experiments are shown in Fig. 1. The general strategy of sequential assignment with triple-resonance experiments can be explained using the CBCANH and CBCA(CO)NH spectra as an example. Both spectra have three frequency axes: 1H, 15N, and 13C. The CBCANH spectrum correlates an amide proton with Cα and Cβ from the same residue as well as those from the preceding residue, while the CBCA(CO)NH spectrum correlates the amide proton only with Cα and Cβ from the preceding residue. Thus, the residues of a protein chain can be linked one by one based on the correspondence of the peaks in the CBCANH and CBCA(CO)NH spectra, as illustrated in Fig. 2. However, proline residues interrupt the linkage because they lack an amide proton.

CBCANH and CBCA(CO)NH spectra provide plenty of information for sequentially assigning the backbone chain; however, other spectra are usually also needed to resolve ambiguities in these two spectra. For bigger proteins, HNCA and HN(CO)CA spectra are sometimes more useful than CBCANH, since they usually have better signal-to-noise ratios and are thus a good supplement to CBCANH. HNCA and HN(CO)CA provide the same information on the Cα chemical shift as CBCANH and CBCA(CO)NH but no Cβ chemical shift. The disadvantage of HNCA and HN(CO)CA spectra is that the Cα chemical shift provides less information about the amino acid type than the Cβ chemical shift and is less dispersed. Another pair of triple-resonance experiments, HNCO and HN(CA)CO, offers an independent alternative for checking the sequential connectivities.

Side-Chain Assignment
Since hydrogen atoms on the side chains are usually involved in long-distance interactions, their assignment provides important information for structure calculation. A straightforward method is to begin with a set of HBHA(CBCACO)NH, HCC(CO)NH-TOCSY, and CC(CO)NH-TOCSY spectra.
The combination of these three spectra enables the assignment of most of the hydrogen and side-chain carbon chemical shifts. For some residues with long side chains containing two methyl groups, such as valines, leucines, and isoleucines, it may be difficult to distinguish the two methyl groups and the adjacent methylenes or methines, and these spectra do not show which hydrogen is attached to which carbon. At this stage, HCCH-TOCSY and HCCH-COSY are the most useful spectra for obtaining such information. The three frequency axes in HCCH-TOCSY and HCCH-COSY are 1H-13C-1H. For each carbon plane in the HCCH-TOCSY spectrum, one of the two hydrogen dimensions is from the hydrogen attached to that carbon, and the other is from the other hydrogens belonging to the same side chain. The HCCH-COSY spectrum is similar to HCCH-TOCSY, except that instead of detecting all hydrogens belonging to the same side chain in the third dimension, only those hydrogens attached to the neighboring carbon atoms are detected.

Besides experiments for assigning the aliphatic hydrogens, there are also some 2D experiments for assigning aromatic hydrogens, such as CB(CC)HDCOSY, HBCBCGCDHD, and HBCBCGCDCEHE.

NMR Approaches to Determine Protein Structure, Fig. 1 Magnetization transfer of some triple-resonance experiments for backbone and side-chain chemical shift assignment. Panels: HNCO, HN(CA)CO, HCACO, HCA(CO)N, CBCA(CO)NH, CBCANH, HBHA(CBCA)NH, HBHA(CBCACO)NH, C(CO)NH-TOCSY, H(C)(CO)NH-TOCSY

NMR Approaches to Determine Protein Structure, Fig. 2 An example of sequential assignments using CBCA(CO)NH and CBCANH spectra
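The sequential linking strategy described for CBCANH and CBCA(CO)NH can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SpinSystem` class, the 0.3 ppm matching tolerance, and the chemical shift values are hypothetical and are not taken from any assignment program. Each spin system holds the (Cα, Cβ) shifts of its own residue (seen only in CBCANH) and of the preceding residue (seen in CBCA(CO)NH); two spin systems are linked when the "own" shifts of one match the "preceding" shifts of the other.

```python
from dataclasses import dataclass


@dataclass
class SpinSystem:
    """Peaks sharing one amide 1H/15N position (hypothetical container)."""
    label: str
    own: tuple        # (CA, CB) of residue i, from CBCANH only
    preceding: tuple  # (CA, CB) of residue i-1, from CBCA(CO)NH


def shifts_match(a, b, tol=0.3):
    """True if every pair of shifts agrees within tol ppm (assumed tolerance)."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def link_spin_systems(systems, tol=0.3):
    """Map each spin system to its successor when the match is unambiguous."""
    links = {}
    for s in systems:
        candidates = [t.label for t in systems
                      if t is not s and shifts_match(s.own, t.preceding, tol)]
        if len(candidates) == 1:  # ambiguous or missing matches are left unlinked
            links[s.label] = candidates[0]
    return links


def build_chain(links, start):
    """Follow the links from a starting spin system until the chain breaks."""
    chain = [start]
    while chain[-1] in links and links[chain[-1]] not in chain:
        chain.append(links[chain[-1]])
    return chain


# Made-up shifts (ppm) for three spin systems
systems = [
    SpinSystem("A", own=(58.2, 30.1), preceding=(55.0, 41.0)),
    SpinSystem("B", own=(62.5, 69.8), preceding=(58.2, 30.1)),
    SpinSystem("C", own=(52.4, 19.0), preceding=(62.5, 69.8)),
]
links = link_spin_systems(systems)
print(build_chain(links, "A"))
```

In practice a chain breaks at prolines, which have no amide proton, and glycines contribute no Cβ shift; ambiguous matches are the reason additional spectra such as HNCA/HN(CO)CA or HNCO/HN(CA)CO are recorded.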


Structure Calculation

Constraints for Structure Calculation
When most of the chemical shifts are assigned, the next step is to transform the information into distances and torsion angles for structure calculation. The most important spectra used for this purpose are 2D-NOESY, 3D-15N-NOESY-HSQC, and 3D-13C-NOESY-HSQC. Generally, each cross peak in these spectra carries distance information for a specific pair of hydrogen atoms and can be transformed into a distance constraint. The signal intensity (NOE) depends on the distance (r) between the two hydrogens that give rise to the peak as NOE ∝ 1/r^6. Usually, the NOEs are divided into three classes based on the separation of the two protons in the primary structure: intra-residue NOEs, medium-range NOEs, and long-range NOEs. Medium-range NOEs are mainly indicative of the protein backbone conformation and are used to determine secondary structure. The long-range NOEs reflect the global fold of a protein and are much more useful in tertiary structure determination. Typically, two protons that generate a cross peak in the NOESY spectra are within 5 Å of each other. For some very weak peaks, the distance can be extended to 6 Å.

In addition to the distance restraints, torsion angle (dihedral angle) information is also required for a good structure calculation. This information can be obtained either from specific experiments, e.g., 3D HNHA for φ and 3D (H)NNHTOCSY for ψ, or from an empirical prediction of the protein φ and ψ backbone torsion angles using software such as TALOS, which is installed as part of the NMRPipe system.

Determination of Secondary Structure
To determine the secondary structure, the torsion angles of the backbone are necessary information. Chemical shifts of protein backbone atoms are very sensitive to the local conformation, and homologous proteins show similar patterns in the chemical shifts of secondary structure. This relation was used in reverse to develop the TALOS software, which searches a database for triplets of adjacent residues whose secondary chemical shifts and sequence similarity provide the best match to the query triplet of interest (Cornilescu et al. 1999). It uses the chemical shifts of five backbone atoms for the prediction: 1HA, 13CA, 13CB, 13CO, and 15N. The predicted dihedral angles φ and ψ are used directly for structure calculation.

Besides the prediction with TALOS or other software, the secondary structure is also reflected in the pattern of correlation peaks in the 15N-NOESY and 13C-NOESY spectra. This is most obvious for helices. For an amide proton located within a helix, characteristic cross peaks can be found close to its own diagonal peak in the 15N-NOESY spectrum. These interproton NOEs come from the interaction between the detected amide proton and the adjacent amide protons within the helix. Thus, if such characteristic, consecutively correlated peaks are present in the 15N-NOESY spectrum for a sequentially connected fragment of a protein, the fragment is most probably a helix.

Tertiary Structure Calculation
The general principle of tertiary structure calculation is to use calculation software, together with empirical input data such as the bond lengths and bond angles of all covalently attached atoms, to convert the distance and torsion angle data into a three-dimensional structure. A starting structure is generated from the empirical data and the amino acid sequence of the protein. The structure calculation software then tries to fold the starting structure so that all the experimentally derived interproton distance and torsion angle constraints are satisfied. Commonly used software for this purpose includes CYANA, Xplor-NIH, and CNS. Here, we take CNS (Brunger 2007) as an example to show the process of NMR structure calculation.

First, a file containing the molecular topology information (.mtf file) must be generated for the structure, using the protein sequence as the input file. This information about molecular connectivity is used in the next step to generate the starting coordinates (extended conformation). For some specific cases, such as proteins containing disulfide bonds or cis-prolines, it is important to include that information at this step. Then, the molecular topology information is used to generate an extended conformation as the starting model for the subsequent structure calculation. The next step is simulated annealing using the experimentally measured interproton distance and dihedral angle constraints. The starting structure is heated to a high temperature in the simulation and then taken through many discrete cooling steps so that it evolves towards the energetically favorable final structure under the restraint of a force field derived from the experimental constraints. The calculation generates a family of structures instead of one exactly defined structure. The quality of the calculated structures can be assessed by the root-mean-square deviation (RMSD) of each structure in the family from an average structure. A smaller RMSD means a better constrained conformational space of the protein structures.

To improve the accuracy and quality of the calculated structure, refinement is a necessary step. To this end, additional constraints such as residual dipolar coupling data or paramagnetic relaxation enhancement data can be included. Multiple parameters can be used to evaluate and validate the quality of an NMR structure: the RMSD of the backbone and of all atoms, violations of the NOE constraints, molecular energy, etc. PROCHECK (Laskowski et al. 1993) is one of the programs used to evaluate the quality of an NMR structure. It performs a number of checks of structural quality and generates a Ramachandran plot illustrating the distribution of the φ and ψ angles in the structures.
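The RMSD of each member of a structure family from the average structure can be illustrated with a minimal Python sketch. It assumes that all models have already been superimposed onto a common frame (real pipelines perform that fitting first) and that each model is a list of (x, y, z) coordinates in the same atom order; the function names are hypothetical and do not correspond to CNS or PROCHECK routines.

```python
import math


def mean_structure(ensemble):
    """Average the coordinates of each atom over all models."""
    n_models = len(ensemble)
    n_atoms = len(ensemble[0])
    return [
        tuple(sum(model[i][k] for model in ensemble) / n_models for k in range(3))
        for i in range(n_atoms)
    ]


def rmsd(a, b):
    """Root-mean-square deviation between two structures with matched atoms."""
    squared = sum((p - q) ** 2
                  for atom_a, atom_b in zip(a, b)
                  for p, q in zip(atom_a, atom_b))
    return math.sqrt(squared / len(a))


def ensemble_rmsd(ensemble):
    """Return the RMSD of every model to the mean structure and their average."""
    mean = mean_structure(ensemble)
    per_model = [rmsd(model, mean) for model in ensemble]
    return per_model, sum(per_model) / len(per_model)


# Two toy models of a two-atom "protein" that differ by a 2 Å shift along z
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
]
per_model, average = ensemble_rmsd(ensemble)
print(per_model, average)
```

Restricting the atom list to backbone atoms or to all heavy atoms gives the backbone and all-atom RMSD values that are usually reported for an NMR ensemble.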

TROSY-Based Experiments for Large Proteins
For biological macromolecules with molecular weights above 30 kDa, the line widths of the signals in the NMR spectra become much wider; therefore, the signal-to-noise ratio is much reduced when spectra are recorded with conventional multidimensional NMR experiments. This problem can be solved either by the use of deuteration to eliminate proton-mediated relaxation pathways or by the use of transverse relaxation-optimized spectroscopy (TROSY). In many cases, these two methods are combined to obtain optimal spectra (Fernandez and Wider 2003). TROSY is based on the fact that, at high field, interference (cross-correlated relaxation) between chemical shift anisotropy (CSA) and dipole-dipole interactions can dramatically reduce the transverse relaxation rate of one multiplet component (Pervushin et al. 1997; Zhu et al. 2004). For TROSY using single-transition-to-single-transition polarization transfer, only 50% of the spin population is selected to have a TROSY effect. Thus, the advantage of TROSY-type experiments can only be effectively realized for a protein whose transverse relaxation time, T2, is sufficiently short. Under these circumstances, the gain from TROSY compensates for its intrinsic loss in sensitivity. Generally, the larger the protein is, the more pronounced the effect of TROSY will be. Higher field strength also enhances the effect of TROSY.

Residual Dipolar Coupling
Residual dipolar coupling (RDC) measurement provides orientational information that complements the conventional use of NOEs in structure calculation. Conventional NOEs are local distance restraints, whereas RDCs provide orientational information between the domains of the entire protein. Because diamagnetic molecules at moderate field strength have little orientational preference owing to tumbling, RDCs must be measured in particular media in which the protein molecules are partially aligned along a certain orientation. Many alignment media are commercially available, such as lipid bicelles, liquid crystalline bicelles, and bacteriophage Pf1. In the experiments used to measure them, RDCs appear as an additional contribution to the scalar J-coupling splitting. For two dipole-coupled nuclei, A and B, the dipolar coupling in solution, $D^{AB}$ in the following equation, can be determined by taking the difference between the splitting under anisotropic conditions (J+D) and under isotropic conditions (J) (Lipsitz and Tjandra 2004) (Fig. 3):

$$D^{AB}(\theta,\phi) = D_a^{AB}\left[\left(3\cos^2\theta - 1\right) + \frac{3}{2}\,R\,\sin^2\theta\,\cos 2\phi\right] \qquad (1)$$

Here θ and φ are the polar angles that describe the orientation of the A-B internuclear vector in the principal axis system of the alignment tensor. The two parameters $D_a$ and R in Eq. 1 (the axial component and the rhombicity of the alignment tensor) can be estimated using the RDC data and a histogram method (Clore et al. 1998). These two parameters are necessary for the structure calculation.

NMR Approaches to Determine Protein Structure, Fig. 3 An example of 2D IP-AP [15N, 1H]-HSQC spectra for measuring RDCs. Left. Spectrum of sample without Pf1 bacteriophage to measure the scalar coupling, J. Right. Spectrum of sample with Pf1 bacteriophage to measure the scalar coupling and RDCs, J+D. Axes: ω2 - 1H (ppm) and ω1 - 15N (ppm)

Cross-References

▶ NMR Approaches to Determine Protein Structure

References

Brunger AT (2007) Version 1.2 of the crystallography and NMR system. Nat Protoc 2:2728–2733
Clore GM, Gronenborn AM, Bax A (1998) A robust method for determining the magnitude of the fully asymmetric alignment tensor of oriented macromolecules in the absence of structural information. J Magn Reson 133:216–221
Cornilescu G, Delaglio F, Bax A (1999) Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J Biomol NMR 13:289–302
Delaglio F, Grzesiek S, Vuister GW, Zhu G, Pfeifer J, Bax A (1995) NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J Biomol NMR 6:277–293
Fernandez C, Wider G (2003) TROSY in NMR studies of the structure and function of large biological macromolecules. Curr Opin Struct Biol 13:570–580
Fiaux J, Bertelsen EB, Horwich AL, Wuthrich K (2004) Uniform and residue-specific 15N-labeling of proteins on a highly deuterated background. J Biomol NMR 29:289–297
Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 26:283–291
Lipsitz RS, Tjandra N (2004) Residual dipolar couplings in NMR structure analysis. Annu Rev Biophys Biomol Struct 33:387–413
Pervushin K, Riek R, Wider G, Wuthrich K (1997) Attenuated T2 relaxation by mutual cancellation of dipole-dipole coupling and chemical shift anisotropy indicates an avenue to NMR structures of very large biological macromolecules in solution. Proc Natl Acad Sci U S A 94:12366–12371
Tugarinov V, Kay LE (2003) Ile, Leu, and Val methyl assignments of the 723-residue malate synthase G using a new labeling strategy and novel NMR methods. J Am Chem Soc 125:13868–13878
Zhu G, Xia Y, Lin D, Gao X (2004) TROSY-based correlation and NOE spectroscopy for NMR structural studies of large proteins. Methods Mol Biol 278:57–78

NMR Basis for Biomolecular Structure (Theory)
Gaofeng Cui
Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA

Synonyms
Coherence transfer; Density matrix formalism; Multidimensional NMR; Nuclear magnetic resonance; Product operator formalism

Synopsis
Nuclear magnetic resonance (NMR) refers to the phenomenon in which a nucleus with spin, placed in a magnetic field, is excited by electromagnetic radiation of a specific frequency (the Larmor frequency). The resonance frequency depends on the magnetic field strength, the atom type, and the chemical environment in which the atom exists (the chemical shift phenomenon). Chemical shift analysis can provide plenty of structural information about a molecule, but the overcrowded resonance signals in conventional (one-dimensional) spectra restricted its use on biomolecules. The strength of modern NMR is that it disperses the overcrowded resonance signals into multiple dimensions. Density matrix formalism is the core of modern NMR; it describes the evolution of a nuclear spin system mathematically and precisely. Product operator formalism, a simplified version of density matrix formalism, is more convenient and practical for NMR pulse program design. There are three types of pathway for coherence transfer: through scalar coupling, through dipolar coupling, and through physical or chemical exchange. A multidimensional NMR pulse program is composed of four different types of period: preparation period, evolution period, mixing period, and acquisition period.

Introduction
Nuclear magnetic resonance (NMR) has been a powerful tool for chemical analysis since the 1970s and has become one of the two methods for obtaining biomolecular structures at the atomic level. More than 10,000 biomolecular structures, corresponding to ~10% of the total structures in the Protein Data Bank, have been contributed by NMR, and the number is growing, with more than 500 solution structures released each year. Beyond structure determination, an advantage of NMR is its ability to detect biomolecular interactions with great sensitivity. It has become a powerful tool in life science. The purpose of this entry is to introduce NMR theory briefly for non-NMR scientists. NMR history, classical NMR theory, and modern NMR theory are summarized in this entry. The most basic NMR concepts, such as spin, Larmor precession, relaxation, chemical shift, scalar coupling, and dipolar coupling, are covered in section "Classical NMR Theory Introduction," while density matrix formalism, product operator formalism, the principle of Fourier transform NMR, coherence transfer, and multidimensional NMR are presented in section "Modern NMR Theory Introduction."

NMR History
Enthusiasm for nuclear magnetic resonance (NMR) spectroscopy arose after Bloch and Purcell (1952 Nobel Prize in Physics winners) independently observed NMR signals in bulk matter in 1945 (Bloch et al. 1946; Purcell et al. 1946). The passion for NMR would have vanished if the nature of the chemical shift had not been revealed by Proctor and Dickinson in 1949 (Dickinson and Wimett 1949; Proctor 1949). The dependence of the nuclear magnetic resonance frequency on the nucleus' chemical environment led to the widespread use of NMR for chemical analysis. In the early days, very low sensitivity was a fundamental difficulty in NMR. Sensitivity was improved with the application of Fourier transformation pioneered by Ernst (1991 Nobel Prize in Chemistry winner) in 1966 (Ernst 1966). With the development of a new generation of NMR spectrometers based on superconducting materials, further improvement in the quality of spectra, in both resolution and sensitivity, was seen in the 1970s. However, overcrowded spectra remained a problem for complex systems. A milestone came in 1975 when Ernst developed two-dimensional NMR experiments (Ernst 1975). Ernst's tremendous contributions to the field made it possible to use NMR to study biomolecules. The pioneer of NMR application to biomolecules was Wuthrich (2002 Nobel Prize in Chemistry winner), who developed a conceptual framework for protein structure determination and solved the first solution structure, that of proteinase inhibitor IIA, in 1985 (Williamson et al. 1985). At present, NMR spectroscopy and X-ray crystallography are the two most common methods for determining biomolecular structures at the atomic level.

NMR Basis for Biomolecular Structure (Theory), Table 1 Nuclear property and spin quantum number

Mass number    Atomic number    Nuclear spin
Odd            Odd or even      Half-integer
Even           Odd              Integer
Even           Even             0

Classical NMR Theory Introduction

Spin

Spin is a fundamental property of elementary particles in quantum mechanics and particle physics. The spin angular momentum ($\mathbf{I}$) is characterized by the nuclear spin quantum number $I$. The spin quantum number depends on the atomic number and atomic mass number (Table 1). For NMR spectroscopy of biomolecules, the most important nuclei with $I = 1/2$ are 1H, 13C, 15N, 19F, and 31P; the most important nucleus with $I = 1$ is the deuteron (2H). By convention, the value of the z-component of $\mathbf{I}$ is specified by the following equation:

$$I_z = \frac{h}{2\pi}\, m$$

in which $h$ is Planck’s constant and the magnetic quantum number $m = (-I, -I+1, \ldots, I-1, I)$. The nuclear magnetic moment $\boldsymbol{\mu}$ is defined by $\boldsymbol{\mu} = \gamma \mathbf{I}$, and thus,

$$\mu_z = \gamma I_z = \gamma \frac{h}{2\pi}\, m$$

in which the magnetogyric ratio, $\gamma$, is a characteristic constant for a given nucleus. In an external magnetic field, the nuclear magnetic moment interacts with the field, and the energy levels of the spin states split. This is called the Zeeman effect. The potential energy of a spin state is

$$E = -\boldsymbol{\mu} \cdot \mathbf{B} = -m \frac{h}{2\pi} \gamma B_0$$

in which $B_0$ is the external magnetic field strength.
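The relations above for $m$ and $E$ can be tabulated directly. The following is a minimal sketch, not part of the original entry: the function names are hypothetical, and the magnetogyric ratio of 1H is an assumed literature value rather than one given in the text.

```python
import math
from fractions import Fraction

H_PLANCK = 6.62607015e-34   # Planck's constant h, in J s
GAMMA_1H = 2.6752218744e8   # assumed literature value for 1H, in rad s^-1 T^-1


def magnetic_quantum_numbers(spin):
    """Return m = -I, -I+1, ..., I-1, I for a spin quantum number I."""
    spin = Fraction(spin)
    count = int(2 * spin) + 1          # there are 2I + 1 Zeeman levels
    return [-spin + k for k in range(count)]


def zeeman_energy(m, gamma, b0):
    """Energy E = -m (h / 2 pi) gamma B0 of the state m, in joules."""
    return -float(m) * H_PLANCK / (2 * math.pi) * gamma * b0


def zeeman_levels(spin, gamma, b0):
    """List of (m, E) pairs for all Zeeman levels of a nucleus."""
    return [(m, zeeman_energy(m, gamma, b0)) for m in magnetic_quantum_numbers(spin)]
```

For a spin-1/2 nucleus the sketch returns two levels, and for the deuteron ($I = 1$) three; with a positive $\gamma$ the state with $m = +1/2$ has the lower energy.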


Boltzmann Distribution

In the absence of an external magnetic field, the magnetic dipoles are randomly oriented and the net magnetization is zero. When an external magnetic field is applied, the magnetic dipoles interact with it and orient themselves parallel or antiparallel to the field. At equilibrium, the relative population of a spin state is governed by the Boltzmann distribution:

$$\frac{N_m}{N} = \exp\left(-\frac{E_m}{k_B T}\right) \Bigg/ \sum_{m=-I}^{I} \exp\left(-\frac{E_m}{k_B T}\right) \approx \frac{1}{2I+1}\left(1 + \frac{m h \gamma B_0}{2\pi k_B T}\right)$$

in which $N_m$ is the number of nuclei in the $m$th state, $N$ is the total number of spins, $T$ is the absolute temperature, and $k_B$ is the Boltzmann constant ($1.3806 \times 10^{-23}$ J/K). For $I = 1/2$ nuclei, the population ratio between the spin states is

$$\frac{N_\beta}{N_\alpha} = e^{-\Delta E / k_B T} = e^{-h\nu / k_B T}, \qquad \nu = \frac{\gamma B_0}{2\pi}$$

in which $N_\alpha$ and $N_\beta$ represent the number of nuclei in the lower-energy ($\alpha$) and higher-energy ($\beta$) spin states, $\Delta E$ is the energy difference between the two spin states, and $\nu$ is the frequency of irradiation needed for the transition between the two spin states. The relative population difference between the energy levels for 1H is on the order of $10^{-4}$ in an 18.7-T magnetic field (800 MHz NMR spectrometer). For this reason, NMR is quite insensitive compared with other spectroscopic techniques.

Larmor Precession

Nuclei are positively charged particles, and a spinning nucleus generates a magnetic field along its spin axis. The spin axis is oriented parallel or antiparallel to the external magnetic field and rotates about the external magnetic field axis with an angular frequency of $\omega = \gamma B_0$ (Fig. 1). This phenomenon is called Larmor precession. The magnitude of the precession frequency (Larmor frequency) is identical to the frequency of electromagnetic radiation required to excite transitions between Zeeman levels.

NMR Basis for Biomolecular Structure (Theory), Fig. 1 Larmor precession. The spin of a proton creates a local magnetic field ($\mu$) along its spin axis. The spin axis rotates about the external magnetic field ($B_0$) axis with an angular frequency $\omega$

Relaxation and Bloch Equation

Excited nuclei must return to the ground state (Boltzmann equilibrium), and the NMR signal they generate decays accordingly. This process is called relaxation. Two relaxation processes are responsible for the decay of NMR signals. One is known as longitudinal, or spin–lattice, relaxation and accounts for the reestablishment of the Boltzmann equilibrium distribution. The spin–lattice relaxation is expressed as

$$\frac{dM_z(t)}{dt} = R_1 \left[ M_0 - M_z(t) \right]$$

in which $R_1$ is the spin–lattice relaxation rate constant. The spin–lattice relaxation time constant $T_1$ is the reciprocal of $R_1$:

$$T_1 = \frac{1}{R_1}$$
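As a numerical check on the Boltzmann and Larmor relations given earlier in this section, the following minimal sketch evaluates the Larmor frequency and the population ratio for 1H. It is an illustration only: the function names are hypothetical, the temperature of 298 K is an assumption, and the value of $\gamma$ for 1H is an assumed literature value not stated in the entry.

```python
import math

H_PLANCK = 6.62607015e-34   # Planck's constant h, in J s
K_B = 1.380649e-23          # Boltzmann constant, in J/K
GAMMA_1H = 2.6752218744e8   # assumed literature value for 1H, in rad s^-1 T^-1


def larmor_frequency_hz(gamma, b0):
    """Larmor frequency nu = gamma * B0 / (2 pi), in Hz."""
    return gamma * b0 / (2 * math.pi)


def population_ratio(gamma, b0, temperature):
    """Ratio N_upper / N_lower = exp(-h nu / (k_B T)) for a spin-1/2 nucleus."""
    nu = larmor_frequency_hz(gamma, b0)
    return math.exp(-H_PLANCK * nu / (K_B * temperature))


def fractional_excess(gamma, b0, temperature):
    """(N_lower - N_upper) / (N_lower + N_upper), the net polarization."""
    ratio = population_ratio(gamma, b0, temperature)
    return (1 - ratio) / (1 + ratio)
```

Called with `GAMMA_1H`, a field of 18.7 T, and the assumed 298 K, the exponent $h\nu / k_B T$ comes out at roughly $1.3 \times 10^{-4}$, consistent with the order of magnitude quoted for an 800 MHz spectrometer.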

Generally, $T_1$ depends mainly on the NMR resonance frequency and the external magnetic field strength. The other relaxation process is called transverse or spin–spin relaxation. It accounts for the decay of the transverse magnetization in the x–y plane following a radio frequency pulse (rf pulse). The spin–spin relaxation is expressed as

$$\frac{dM_x(t)}{dt} = -R_2 M_x(t), \qquad \frac{dM_y(t)}{dt} = -R_2 M_y(t)$$

in which $R_2$ is the spin–spin relaxation rate constant. The spin–spin relaxation time constant $T_2$ is the reciprocal of $R_2$:

$$T_2 = \frac{1}{R_2}$$

where $T_2$ is less sensitive than $T_1$ to the external magnetic field strength.

Chemical Shift

When an atom is placed in a magnetic field, the field induces a secondary magnetic field that opposes it. This effect is nuclear shielding. As a result, the effective magnetic field at the nucleus is less than the external magnetic field:

$$B_{\mathrm{eff}} = B_0 (1 - \sigma)$$

in which $B_{\mathrm{eff}}$ is the effective magnetic field and $\sigma$ is the shielding constant. The shielding constant varies according to the type of nucleus and its electronic environment. A different effective field at each nucleus leads to a different resonance frequency, which is therefore slightly shifted from the theoretically predicted value. This is the chemical shift phenomenon. The effective field is proportional to the static field, $B_0$. Consequently, the absolute chemical shift difference between two resonance signals varies linearly with $B_0$. To remove the influence of the static field, chemical shifts are expressed relative to the static field and referenced to the resonance signal of a standard compound:

$$\delta = \frac{\Omega - \Omega_{\mathrm{ref}}}{\Omega_0} \times 10^{6}$$

in which $\Omega$ and $\Omega_{\mathrm{ref}}$ are the resonance frequencies of the sample and the standard compound, respectively, $\Omega_0$ is the operating frequency of the spectrometer, and $\delta$ (in parts per million, ppm) is the chemical shift. Tetramethylsilane (TMS; $\delta$ = 0.00) is the most common chemical shift reference for samples in organic solvents. For biomolecules in aqueous solution, 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS; $\delta$ = 0.00) is the most widely used chemical shift reference because TMS is insoluble in aqueous solution.

Scalar Coupling and Dipolar Coupling

Two adjacent nuclear spins interact with each other. This is called coupling. There are two types of coupling between nuclear spins: scalar coupling and dipolar coupling. Scalar coupling, also named J-coupling, indirect dipole–dipole coupling, or spin–spin coupling, refers to two nuclear spins interacting through chemical bonds. Scalar coupling leads to signal splitting. The distance between signals in a multiplet is the coupling constant, denoted as $^{n}J_{ab}$, where $n$ is the number of bonds between atoms $a$ and $b$. J-couplings are mostly restricted to 2–3 bonds. Dipolar coupling refers to two nuclear spins interacting directly through space, mostly within a distance of 5–6 Å, and is proportional to the inverse cube of the distance between the two nuclei. Dipolar coupling causes spin relaxation, resulting in magnetization transfer between atoms that are close in space. Dipolar coupling provides distance information between such atoms, making it very useful in determining biomolecular structures.
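The chemical shift definition given earlier in this section can be made concrete with a short conversion between frequency offsets in Hz and shifts in ppm. This is a minimal sketch with hypothetical function names; the frequencies used in any call are illustrative values, not data from the entry.

```python
def chemical_shift_ppm(omega_hz, omega_ref_hz, omega0_hz):
    """Chemical shift delta = (Omega - Omega_ref) / Omega_0 * 1e6, in ppm."""
    return (omega_hz - omega_ref_hz) / omega0_hz * 1e6


def shift_difference_hz(delta_ppm, omega0_hz):
    """Absolute frequency separation, in Hz, of a shift difference in ppm."""
    return delta_ppm * omega0_hz * 1e-6
```

A shift difference of 1 ppm corresponds to 400 Hz on a 400 MHz spectrometer and 800 Hz on an 800 MHz spectrometer, which illustrates why the absolute separation scales linearly with $B_0$ while the value in ppm does not.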

Modern NMR Theory Introduction

FT-NMR

Fourier transform NMR (FT-NMR) changed the traditional method of exciting nuclei. In FT-NMR, a short radio frequency pulse (rf pulse) irradiates all the frequencies within a specific frequency range simultaneously. All the nuclei with a Larmor frequency within this range are excited. After the rf pulse, the excited nuclei relax to the thermal equilibrium state, inducing an observable decaying signal named the free induction decay (FID). The time domain FID signals are recorded by the instrument and converted to a frequency domain spectrum by Fourier transformation:

$$F(\omega) = \int_{-\infty}^{+\infty} f(t) \exp(-i\omega t)\, dt$$

$$f(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F(\omega) \exp(i\omega t)\, d\omega$$

in which $F(\omega)$ is a frequency domain data set, $f(t)$ is a time domain data set, and $\exp(i\omega t) = \cos(\omega t) + i \sin(\omega t)$. $F(\omega)$ and $f(t)$ are a Fourier pair. $F(\omega)$ is a complex function with real and imaginary parts:

$$\mathrm{Re}(F(\omega)) = \int_{-\infty}^{+\infty} f(t) \cos(\omega t)\, dt$$

$$\mathrm{Im}(F(\omega)) = -\int_{-\infty}^{+\infty} f(t) \sin(\omega t)\, dt$$

The rf pulse and its excitation profile are a Fourier pair too. If the pulse length is $\tau$, then the frequencies it excites span a bandwidth of $1/\tau$ centered at frequency $F$. A specific excitation frequency range can be obtained by changing the pulse length. A shorter pulse excites a larger frequency range without selection. This kind of pulse is a hard pulse. A longer pulse selectively excites a narrow range of frequencies. This kind of pulse is a soft pulse, also called a selective pulse.

Density Matrix Formalism

Density matrix formalism was introduced by John von Neumann in 1927. It is extremely useful for quantum calculations on mixed states (Cavanagh et al. 1996). Any microscopic particle can be described by the Schrödinger equation:

$$\frac{\partial \Psi(t)}{\partial t} = -\frac{i}{\hbar} H \Psi(t)$$

The operator $H$ is the Hamiltonian of the system, which is an observable in physics. The solution of the Schrödinger equation, $\Psi(t)$, is called the wavefunction. In quantum mechanics, every physically observable quantity ($A$) is associated with a Hermitian operator $A$ that satisfies

$$A f(t) = \lambda f(t)$$

in which $f(t)$ is an eigenfunction of $A$ and $\lambda$ is the corresponding eigenvalue. All the eigenfunctions of $A$ form a complete orthonormal set and satisfy

$$\int f_i^{*}(t) f_j(t)\, dt = \delta_{i,j}$$

If $i = j$, $\delta_{i,j} = 1$; otherwise $\delta_{i,j} = 0$. If $\psi_n$ represents a complete set of orthonormal eigenfunctions, then the solution of the Schrödinger equation $\Psi(t)$ can be written as a linear combination of $\psi_n$:

$$\Psi(t) = \sum_{n=1}^{N} c_n \psi_n$$

in which $c_n$ are complex numbers that may depend on time. The probability density ($P$) of the system state represented by $\Psi(t)$ at time $t$ is

$$P(t) = \Psi^{*}(t) \Psi(t), \qquad P = \int \Psi^{*}(t) \Psi(t)\, d\tau$$

in which $\Psi^{*}$ is the complex conjugate of $\Psi$. The Dirac notation is used to simplify the symbolic manipulation of scalar products. In the Dirac notation, $\Psi$ is represented by $|\Psi\rangle$, and $\Psi^{*}$ is represented by $\langle\Psi|$. The probability $P$ can be written as

$$P = \int \Psi^{*} \Psi\, d\tau = \langle \Psi | \Psi \rangle$$

Biomolecules in solution can be treated as a mixed state. A spin $I = 1/2$ system has an $\alpha$ state and a $\beta$ state. The expectation value of $A$ can be expressed as

$$\langle A \rangle = \int \Psi^{*}(t) A \Psi(t)\, d\tau = \langle \Psi | A | \Psi \rangle = \sum_{n,m} c_m^{*} c_n \langle m | A | n \rangle$$

in which $\langle m|$ and $|n\rangle$ represent the complete sets of orthonormal eigenfunctions of the $\alpha$ and $\beta$ states, and $c_m^{*}$ and $c_n$ are the coefficients of $\langle m|$ and $|n\rangle$, respectively. The products $c_m^{*} c_n$ form a matrix, which is referred to as the density matrix. The density matrix is the matrix representation of the density operator $\sigma$, with elements

$$\sigma_{nm} = c_m^{*} c_n$$

The density operator evolves in time according to

$$\frac{d\sigma(t)}{dt} = -i \left[ A, \sigma(t) \right]$$

in which the operator $A$ is here the Hamiltonian of the spin system, expressed in angular frequency units.
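The relation between the coefficients $c_n$, the density matrix, and an expectation value can be illustrated for a single spin-1/2. The sketch below is an assumption-laden illustration, not the authors' method: the function names are hypothetical, and the spin matrices $I_x$, $I_y$, and $I_z$ are the standard spin-1/2 matrices written in the ($\alpha$, $\beta$) basis.

```python
import math

# Standard spin-1/2 operators in the (alpha, beta) basis.
IX = [[0.0, 0.5], [0.5, 0.0]]
IY = [[0.0, -0.5j], [0.5j, 0.0]]
IZ = [[0.5, 0.0], [0.0, -0.5]]


def density_matrix(coeffs):
    """Elements sigma[n][m] = c_n * conj(c_m) for the state sum_n c_n |n>."""
    return [[cn * cm.conjugate() for cm in coeffs] for cn in coeffs]


def trace_product(a, b):
    """Trace of the matrix product a @ b for square nested-list matrices."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


def expectation(coeffs, operator):
    """<A> = sum_{n,m} c_m* c_n <m|A|n>, evaluated as Tr(sigma A)."""
    return trace_product(density_matrix(coeffs), operator).real
```

For the pure $\alpha$ state (`coeffs = [1, 0]`) the sketch gives $\langle I_z \rangle = 1/2$, whereas an equal superposition of $\alpha$ and $\beta$ has no net z-component and a nonzero transverse component carried by the off-diagonal elements of the density matrix.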

The solution for $\sigma(t)$ is

$$\sigma(t) = \exp(-iAt)\, \sigma(0)\, \exp(iAt)$$

Pauli spin matrices are defined as a set of spin operators for a single spin-1/2 system:

$$I_x = \frac{1}{2}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad I_y = \frac{1}{2}\begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \quad I_z = \frac{1}{2}\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

and

$$|\alpha\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad |\beta\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad \langle\alpha| = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad \langle\beta| = \begin{bmatrix} 0 & 1 \end{bmatrix}$$

in which $I_x$, $I_y$, and $I_z$ are Cartesian product operators. $|\alpha\rangle$ and $|\beta\rangle$ form a complete set of orthonormal eigenfunctions for the $\alpha$ and $\beta$ states. $\langle\alpha|$ and $\langle\beta|$ are the conjugate transposes of $|\alpha\rangle$ and $|\beta\rangle$, respectively. The density operator $\sigma(t)$ is a linear combination of the spin operators $I_x$, $I_y$, and $I_z$. At thermal equilibrium, $\sigma_{\mathrm{eq}} = I_z$, as only z-magnetization exists.

The spin operators satisfy the commutation relations:

$$[I_x, I_y] = iI_z, \quad [I_y, I_z] = iI_x, \quad [I_z, I_x] = iI_y$$

Coherence

For a one-spin system, the density operator can be described by the Cartesian product operators $I_x$, $I_y$, and $I_z$. For a two-spin system, 16 Cartesian operators are required to describe the density operator. These are $\frac{1}{2}E$, $I_x$, $I_y$, $I_z$, $S_x$, $S_y$, $S_z$, $2I_xS_z$, $2I_yS_z$, $2I_zS_z$, $2I_zS_x$, $2I_zS_y$, $2I_xS_x$, $2I_yS_y$, $2I_xS_y$, and $2I_yS_x$. For convenience, shift operators are defined as

$$I^{+} = I_x + iI_y, \quad I^{-} = I_x - iI_y, \quad I^{0} = \sqrt{2}\, I_z$$

The shift operators can be transformed back to Cartesian operators:

$$I_x = \frac{1}{2}\left(I^{+} + I^{-}\right), \quad I_y = \frac{1}{2i}\left(I^{+} - I^{-}\right), \quad I_z = \frac{1}{\sqrt{2}} I^{0}$$

Correspondingly, the 16 operators in the shift basis are:

Single quantum coherence: $I^{+}$, $I^{-}$, $S^{+}$, $S^{-}$, $2I^{+}S^{0}$, $2I^{0}S^{+}$, $2I^{0}S^{-}$, and $2I^{-}S^{0}$
Double quantum coherence: $2I^{+}S^{+}$ and $2I^{-}S^{-}$
Zero quantum coherence: $\frac{1}{2}E$, $I^{0}$, $S^{0}$, $2I^{0}S^{0}$, $2I^{+}S^{-}$, and $2I^{-}S^{+}$

Each Cartesian operator or shift operator corresponds to a 4 × 4 matrix in a two spin-1/2 system. Both Cartesian operators and shift operators are very useful for describing the evolution of the density operator. Cartesian operators are more convenient for describing the effect of rf pulses on the density operator. Shift operators are more convenient for describing coherence evolution in NMR experiments. Coherence is the correlation between two spins. It is associated with an NMR transition. An off-diagonal element of the density matrix indicates coherence between the two spin states that it connects. In the two-spin basis, with rows and columns ordered $\alpha\alpha$, $\alpha\beta$, $\beta\alpha$, $\beta\beta$, the normalized matrix representation of $I^{+}$ is

$$I^{+} = \frac{1}{\sqrt{2}} \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

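The coherence operator $I^{+}$ indicates transitions from $|\beta\alpha\rangle$ to $|\alpha\alpha\rangle$ and from $|\beta\beta\rangle$ to $|\alpha\beta\rangle$. For both transitions, the I spin changes state from $\beta$ ($m = -1/2$) to $\alpha$ ($m = 1/2$). The magnetic quantum number changes by $\Delta m = +1$. $I^{+}$ denotes a single-quantum transition. Therefore, it is a single-quantum coherence operator. With rows and columns ordered $\alpha\alpha$, $\alpha\beta$, $\beta\alpha$, $\beta\beta$,

$$I^{+}S^{+} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$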

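The coherence operator $I^{+}S^{+}$ represents a transition from $|\beta\beta\rangle$ to $|\alpha\alpha\rangle$. In this case, both the I and S spins change their states from $\beta$ ($m = -1/2$) to $\alpha$ ($m = 1/2$). The total magnetic quantum number changes by $\Delta m = +2$. $I^{+}S^{+}$ denotes a double-quantum transition. It is a double-quantum coherence operator. With rows and columns ordered $\alpha\alpha$, $\alpha\beta$, $\beta\alpha$, $\beta\beta$,

$$I^{+}S^{-} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$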

The coherence operator $I^{+}S^{-}$ stands for a transition from $|\beta\alpha\rangle$ to $|\alpha\beta\rangle$. Here, the I spin changes state from $\beta$ to $\alpha$ ($\Delta m = +1$), while the S spin changes state from $\alpha$ to $\beta$ ($\Delta m = -1$). The total magnetic quantum number changes by $\Delta m = 0$. $I^{+}S^{-}$ is a zero quantum coherence operator.

Product Operator Formalism

The density matrix provides a precise mathematical description of the evolution of a nuclear spin system, but the matrix calculation is cumbersome for a complicated spin system. The product operator formalism (Kessler et al. 1988) is a simplified version of the density matrix formalism. Similar to the Bloch vector model, it treats the effects of rf pulses and delays as geometrical rotations. It tracks only the outcome of applying an operator and omits the cumbersome matrix calculation. In the product operator formalism, if three operators satisfy the commutation relationship $[A, B] = iC$, then

$$\exp(-i\theta C)\, A\, \exp(i\theta C) = A \cos(\theta) + B \sin(\theta)$$

1. Product operator acting on free precession (Cavanagh et al. 1996)

During a free precession period (delay), the evolution is governed by chemical shift and scalar coupling. For spin I, the chemical shift Hamiltonian is $H = \Omega_I I_z$, in which $\Omega_I$ is the offset of spin I. During a delay $t$, spin I evolves as follows:

$$I_x \xrightarrow{\Omega_I I_z t} I_x \cos(\Omega_I t) + I_y \sin(\Omega_I t)$$

$$I_y \xrightarrow{\Omega_I I_z t} I_y \cos(\Omega_I t) - I_x \sin(\Omega_I t)$$

$$I_z \xrightarrow{\Omega_I I_z t} I_z$$

For a weakly coupled two-spin system I and S, the scalar coupling Hamiltonian is $H = 2\pi J_{IS} I_z S_z$. During a delay $t$, the evolution of a single-spin operator is as follows:

$$I_x \xrightarrow{2\pi J_{IS} I_z S_z t} I_x \cos(\pi J_{IS} t) + 2I_yS_z \sin(\pi J_{IS} t)$$

$$I_y \xrightarrow{2\pi J_{IS} I_z S_z t} I_y \cos(\pi J_{IS} t) - 2I_xS_z \sin(\pi J_{IS} t)$$

$$I_z \xrightarrow{2\pi J_{IS} I_z S_z t} I_z$$

The evolution of a double-spin operator is as follows:

$$2I_xS_z \xrightarrow{2\pi J_{IS} I_z S_z t} 2I_xS_z \cos(\pi J_{IS} t) + I_y \sin(\pi J_{IS} t)$$

$$2I_yS_z \xrightarrow{2\pi J_{IS} I_z S_z t} 2I_yS_z \cos(\pi J_{IS} t) - I_x \sin(\pi J_{IS} t)$$

$$2I_zS_z \xrightarrow{2\pi J_{IS} I_z S_z t} 2I_zS_z$$

2. Product operator acting on rf pulse (Cavanagh et al. 1996)

The Hamiltonian for an rf pulse can be written as $H = \alpha I_x$ or $H = \alpha I_y$, in which $\alpha = \omega t_p$ is the flip angle of the rf pulse and $t_p$ is the rf pulse length. The spin angular momentum operators evolve as follows:

$$I_x \xrightarrow{\alpha I_x} I_x$$

$$I_y \xrightarrow{\alpha I_x} I_y \cos\alpha + I_z \sin\alpha$$

$$I_z \xrightarrow{\alpha I_x} I_z \cos\alpha - I_y \sin\alpha$$

$$I_x \xrightarrow{\alpha I_y} I_x \cos\alpha - I_z \sin\alpha$$

$$I_y \xrightarrow{\alpha I_y} I_y$$

$$I_z \xrightarrow{\alpha I_y} I_z \cos\alpha + I_x \sin\alpha$$

An NMR pulse sequence is a series of rf pulses and delays. The product operator rules can be applied to each event sequentially. If a pulse is applied to the I and S spins simultaneously, it can be written as

$$\sigma_1 \xrightarrow{\alpha I_y + \alpha S_y} \sigma_3$$

and be treated as a cascade of single events:

$$\sigma_1 \xrightarrow{\alpha I_y} \sigma_2 \xrightarrow{\alpha S_y} \sigma_3$$

Coherence Transfer

Coherence can be transferred from one spin to another under the influence of an rf pulse. Coherence transfer can happen through bonds (scalar coupling), as in COSY and TOCSY. The transfer can also occur through space (Nuclear Overhauser Effect, NOE; residual dipolar coupling) and through physical or chemical exchange.

Through Bonds

COSY (correlation spectroscopy)-type coherence transfer requires the density operator to contain anti-phase terms and the pulse-interrupted evolution to be restricted to weak scalar couplings. For example,

$$2I_xS_z \xrightarrow{\frac{\pi}{2} I_y} -2I_zS_z \xrightarrow{\frac{\pi}{2} S_y} -2I_zS_x$$

The original anti-phase coherence on the I spin is passed to the S spin with a total transfer time of $1/J_{IS}$. TOCSY (total correlation spectroscopy)-type coherence transfer is an in-phase coherence transfer and requires strong coupling conditions. It needs a time of $1/(2J_{IS})$ to transfer coherence from the I spin to the S spin. For a three-spin IRS system, coherence can be transferred from R to S even if $J_{RS} = 0$ by a two-step transfer: $R_z \rightarrow I_z \rightarrow S_z$. This kind of coherence transfer has to satisfy the Hartmann–Hahn condition, which requires that the irradiation energy be the same for all the spins. In practice, an isotropic mixing pulse sequence is used to generate an averaged Hamiltonian. Coherence is transferred periodically among all the spins in a strongly scalar-coupled network under the influence of the isotropic mixing pulse.

Through Dipolar Coupling

In a dipolar-coupled spin system (two nuclear spins within 5–6 Å), when one spin is saturated by an rf pulse, a disturbance in the population of the other spin is induced via cross-relaxation. Magnetization is transferred from the saturated spin to the other spin during a period $\tau_m$, the mixing time. This is the Nuclear Overhauser Effect (NOE).
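The product operator rules and the COSY-type transfer described earlier in this section can be checked numerically with explicit 4 × 4 matrices. The following is a minimal sketch under stated assumptions, not the authors' procedure: all function names are hypothetical, the basis ordering ($\alpha\alpha$, $\alpha\beta$, $\beta\alpha$, $\beta\beta$) is assumed, and the rotation helper relies on the identity $\exp(-i\theta C) = \cos(\theta/2)E - 2i\sin(\theta/2)C$, which holds only for operators with $(2C)^2 = E$, such as $I_y$ or $2I_zS_z$ for spin-1/2 nuclei.

```python
import math

# Single spin-1/2 operators in the (alpha, beta) basis.
E2 = [[1, 0], [0, 1]]
IX2 = [[0, 0.5], [0.5, 0]]
IY2 = [[0, -0.5j], [0.5j, 0]]
IZ2 = [[0.5, 0], [0, -0.5]]


def matmul(a, b):
    """Matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product; the first factor acts on spin I, the second on spin S."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def combine(*terms):
    """Linear combination of matrices from (coefficient, matrix) pairs."""
    n = len(terms[0][1])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(n)]
            for i in range(n)]


def close(a, b, tol=1e-12):
    """True when two matrices agree element by element within tol."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rotate(op, axis, theta):
    """Return exp(-i*theta*C) op exp(+i*theta*C) for Hermitian C with (2C)^2 = E."""
    n = len(axis)
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    u = [[(c if i == j else 0) - 2j * s * axis[i][j] for j in range(n)]
         for i in range(n)]
    u_dag = [[u[j][i].conjugate() for j in range(n)] for i in range(n)]
    return matmul(matmul(u, op), u_dag)


# Two-spin operators in the assumed basis (aa, ab, ba, bb).
IX, IY, IZ = kron(IX2, E2), kron(IY2, E2), kron(IZ2, E2)
SX, SY, SZ = kron(E2, IX2), kron(E2, IY2), kron(E2, IZ2)


def product(a, b):
    """Two-spin product operator 2*A*B, for example 2 Ix Sz."""
    return combine((2, matmul(a, b)))


def cosy_transfer():
    """Apply a (pi/2) Iy pulse and then a (pi/2) Sy pulse to 2 Ix Sz."""
    after_i_pulse = rotate(product(IX, SZ), IY, math.pi / 2)
    return rotate(after_i_pulse, SY, math.pi / 2)
```

Comparing `cosy_transfer()` with `combine((-1, product(IZ, SX)))` using `close` confirms that the anti-phase term on the I spin ends up as an anti-phase term on the S spin. The same `rotate` helper can be used with `IZ` or `product(IZ, SZ)` as the axis to check the free precession rules.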


Through Physical or Chemical Exchange

Magnetization can be transferred by chemical exchange. Suppose a spin exists under two conditions, A and B. Because of physical or chemical exchange, the Larmor frequencies of the spin under condition A ($\omega_A$) and condition B ($\omega_B$) are different. $k_{ex}$ represents the exchange rate between the two conditions. If $k_{ex} \gg |\omega_B - \omega_A|$, an averaged spin magnetization of the two conditions evolves, leading to a single signal. This is the fast exchange case. If kex

20 μL of protein solution (5–15 mg/ml) is mixed with an equal volume of reservoir solution containing buffer, salt, and precipitant. This mixture is placed in the sitting position of the experimental chamber and sealed with clear plastic seal. For the hanging drop method, a drop of the protein and reservoir mixture is placed on a coverslip, inverted, and sealed with grease over the reservoir chamber (500–1,000 μL). Setting up drops by sandwiching them between a coverslip and the sitting drop position is called the sandwich drop method. For larger volumes of protein solutions (>20 μL) or buffer solutions with lower surface tensions (e.g., containing alcohols), the sitting drop method or sandwich drop method can be used, whereas the hanging drop method can be used for drops with

2,800 predicted genes were regarded as rice or monocot specific in a comparison between rice and A. thaliana annotated proteomes (IRGSP 2005). One intriguing explanation for the presence of these apparently species-specific sequences is that they contribute to aspects of morphological, developmental, and/or physiological differences between plant species. It is possible that these species-specific genes have arisen de novo or are sequences that evolve too fast between plant species. Aside from the possibility that these species-specific sequences are genes, it is also possible that some of them may represent false-positive gene predictions (see ▶ “Plant Genome Annotation, Methods for”). Recent studies have revealed that there are abundant transcriptional activities in parts of

plant genomes that are not known to be genes. An explanation is that these regions are novel genes that have evaded annotation (see ▶ “Plant Genome Annotation, Methods for”). Consistent with this view, some of these transcriptional activities are shown to be derived from regulatory RNAs. Through comparative studies across plant species, a number of these regulatory RNAs are also shown to be more conserved than neutrally evolving sequences. This relatively higher degree of conservation is regarded as evidence for natural selection. Most transcriptional activities in these “intergenic” regions do not have clear evidence of conservation. It is possible that some transcribed intergenic regions are real genes but our current approaches for detecting selection do not work well enough. However, given the absence of conservation evidence, one cannot rule out the possibility that some of the intergenic transcriptional activities represent transcriptional error. The same type of comparative studies that led to the identification of conserved regulatory RNAs has also revealed the presence of conserved sequences without evidence of transcription in plant genomes. While some of these conserved sequences coincide with regulatory sequences such as transcriptional factor binding sites, the

functions of most of the conserved intergenic sequences remain to be established.

Plant Genomes, Evolution of, Fig. 3 Mechanisms of gene duplication. (a) Tandem duplication: genes are duplicated in close proximity. (b) Segmental duplication: segments of chromosomes are duplicated on the same or different chromosomes. This can also be due to larger-scale duplication (such as at the whole-genome level) followed by chromosome rearrangements. (c) Whole-genome duplication: the whole set of chromosomes is doubled. (d) Replicative transposition: transposons are not only involved in moving genes around but sometimes lead to gene duplication. (e) Retrogene formation: a transcript is reverse transcribed into DNA which is inserted back to the genome. The hallmark of retrogenes is the loss of intron(s)

Gene Duplication and Consequences

Mutations in a genome may take the form of a single-base substitution, insertion, deletion, translocation, or duplication. Over millions of years of evolution, these mutations led to substantial differences in genome content and, as a result, gene content between plant species. Through comparative genomic studies, the most apparent change in gene content is due to prevalent gene duplications where 17–65% of genes in prokaryotes and eukaryotes are found to be products of gene duplication (Zhang 2003). Despite the fact that genes in all species have duplicated to a great extent, plants stand out in that they have by far the highest proportion of genes that were apparently derived from gene duplications. For example, in A. thaliana 65% of the genes are duplicates, substantially higher than that in human (38%). Why are there so many duplicate genes? Why are there so many more duplicates in plants than in other species? To answer these questions, a discussion is necessary on how the duplicates can be generated and what the potential fates are for duplicate genes. Before plant genome sequences became available, it was apparent that some related plant genes are located close to each other on the chromosome (in tandem); such duplicates are generated through mispairing of the chromosomes (Fig. 3a). In a typical plant genome, thousands of genes can have at least one relative that is close by. In A. thaliana, for


example, ~17% of the genes are likely derived from tandem duplication. Gene duplicates can also be generated through duplication of a single, a few, or the entire set of chromosomes. In contrast to tandem duplication that affects only a few genes, chromosome-level duplications result in doubling of hundreds to tens of thousands of genes in one event (Fig. 3b, c). The most dramatic chromosome-level duplication event involves duplication of the whole set of chromosomes (polyploidization, Fig. 3c). It has been estimated that 20–70% of extant plant species are polyploids (Otto and Whitton 2000). In the A. thaliana lineage, after its divergence from monocots ~150 million years ago, there are genomic signatures of three rounds of large-scale, presumably whole-genome duplication events. The number of polyploidy species and signatures of chromosome-level duplications is significantly higher in plants than any other species. This may explain why there are so many more duplicate genes in plants. The prevalence of polyploidy in plants may be due to the advantage conferred by doubling the genome where many genes may have adopted new functions. However, it is also possible that plants are simply more tolerant to polyploidization. Aside from tandem and chromosome-level events, there exist other duplication mechanisms and some involve transposable elements (Fig. 3d, e). After a gene is duplicated, provided that such duplication does not have an immediate, negative effect on survival, both copies will persist initially. If the presence of both copies is more advantageous than a single copy, both copies will be retained as long as the selective advantage exists. If the additional copy does not confer significant advantage, mutations will then accumulate in both duplicates. In most situations, mutations accumulate to a point that the original gene function is disrupted and results in a pseudogene. In rare situations, the mutations that have accumulated may lead to new functions. 
If such new functions confer selective advantage and the original function is still required, both copies will be retained. However, duplicate genes can also be retained in the absence of new function. For example, mutations may lead to partitioning of the original


functions. In this situation, both copies will also be retained because of the need to carry out the original functions. Regardless of the explanations of why duplicate genes are retained, consideration of both gene function and duplication has led to the findings that plant genes involved in phosphorylation, protein degradation, transport, and response to environmental stimuli tend to have high rate of retention (Hanada et al. 2008).

References
Gregory TR (2005) The evolution of the genome. Academic, London
Hanada K, Zou C et al (2008) Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol 148(2):993–1003
IRGSP (2005) The map-based sequence of the rice genome. Nature 436(7052):793–800
Otto SP, Whitton J (2000) Polyploid incidence and evolution. Annu Rev Genet 34:401–437
Zhang J (2003) Evolution by gene duplication: an update. Trends Ecol Evol 18(6):292–298

Plant Genomes: From Sequence to Function Across Evolutionary Time

Kevin L. Childs and C. Robin Buell
Department of Plant Biology, Michigan State University, East Lansing, MI, USA

Synopsis

It has only been a little over a decade since the first plant genome sequence was made available. Since then, major advancements in the understanding of not only plant genomes but also plant biology have been made. From genome sequence data, the overall distribution of genes, transposable elements, and other chromosome landmarks in multiple species that span the taxa within angiosperms and lower land plants is known. At the genome level, an area of active investigation is the structure and function of protein-coding genes, and their evolution as plant genomes undergo rampant gene and genome duplication events. Indeed, access to genome sequence and annotation has revealed that all plant genomes contain a high number of paralogous gene families, which provide a template for diversification that can lead to phenotypic diversity. The application of next-generation sequencing technology to plant genomes has enabled the sequencing of hundreds of genomes within a species. At a population level, plant genomes are diverse with copious polymorphisms ranging from single-nucleotide polymorphisms to presence/absence variations that contribute to the phenotypic variation within a species. As technologies improve, recalcitrant genomes will become available from species from a wider range of taxa including gymnosperms and heterozygous, polyploid angiosperms with large genomes. Collectively, the genome sequence and accompanying annotation will provide resources for furthering our understanding of biological processes, agricultural traits, and key evolutionary events in the radiation and adaptation of plants in the ecosystem.

size (98%), indicating a recent origin for these transfers.

Use of Model Species for Plant Genome Research

Model species are selected species with features that permit facile and rapid progress on understanding biological phenomena. Typically, such features include small stature and amenability to genetic analyses. In the angiosperms, there are two major classifications of plants: dicotyledonous species (dicots) that have two seed leaves (cotyledons), net venation, and floral parts in sets of fours or fives and monocotyledonous species (monocots) that have a single seed leaf (cotyledon), parallel venation, and floral parts in sets of threes. In plant research, there are model species for both the dicots (A. thaliana) and monocots (Oryza sativa; rice), and these were the first two plant species with sequenced genomes.

Arabidopsis thaliana. One genome had to be first. When the genomics era began, a single Sanger-based sequencing read was close to $10 per reaction with a read length of less than 500 bases. Thus, perhaps the driving factor in selection of the first plant genome to sequence was that of size, the smaller the better. The estimated genome size of A. thaliana is 157 Mb, which is substantially less than that of other plant species, especially crop species (Bennett et al. 2003). Other features of A. thaliana that drove the decision to sequence this genome in the 1990s centered on its status as a premier species for basic research in plant biology. A. thaliana is


a small-statured, fast-growing plant with high fecundity (Fig. 4). These attributes are necessary for a genetic model species and were essential to the decision by key plant scientists in the 1980s to promote A. thaliana as a model for plant biology (Somerville and Koornneef 2002). Indeed, even before the start of the Arabidopsis Genome Initiative in 1996 in which a group of scientists had self-organized with an objective of sequencing the complete genome of A. thaliana, a wide range of sequenced or sequence-ready genomics resources were available for A. thaliana. These included a collection of expressed sequence tags that were the initial foray into genomics in the early 1990s (Adams et al. 1993) as a rapid gene discovery method. Large insert clones such as cosmid, yeast artificial chromosome (YAC), P1 artificial chromosome, and bacterial artificial chromosome (BAC) clones were made to facilitate positional cloning efforts in A. thaliana and served as the initial clone resources for sequencing the genome. The A. thaliana genome was completed and published in 2000 (Arabidopsis Genome Initiative 2000) and is unparalleled in the quality and completeness of the genome, even a decade later. In addition to the complete repertoire of genes, major discoveries in the genome include the high degree of tandem gene duplication and segmental duplication. As a consequence, there is a high degree of paralogous gene families in A. thaliana, permitting diversification of function of individual genes over the course of evolution (▶ Plant Genomes, Evolution of). Parallel with the development of resources for genome sequencing and annotation, the overall Arabidopsis community began to develop and make publicly available resources to facilitate functional genomics. 
Demonstration of the ability to make and screen ethyl methanesulfonate-, X-ray-, and fast neutron-generated mutants, coupled with the ability to generate transgenic plants through a simple floral dip procedure (Clough and Bent 1998), contributed to the growth and availability of functional genomics resources on a community scale and to the ability of researchers to rapidly perform forward and reverse genetic screens. To date, the genetic resources for A. thaliana are bountiful. Not one


Plant Genomes: From Sequence to Function Across Evolutionary Time

Plant Genomes: From Sequence to Function Across Evolutionary Time, Fig. 4 Model species for plant genome research. Arabidopsis thaliana, a dicotyledonous angiosperm, has 5 chromosomes with a total genome size of 157 Mb (Bennett et al. 2003). The sequenced genome from the Columbia (Col-0) accession is 119 Mb; sequences absent from the Col-0 reference A. thaliana genome include a few limited gaps in the euchromatin and centromeric sequences in all 5 chromosomes. Oryza sativa, a monocotyledonous angiosperm, has 12 chromosomes with a total genome size of 389 Mb (International Rice Genome Sequencing Project 2005). The sequenced rice genome from the Nipponbare cultivar is 374 Mb; sequences absent from the Nipponbare reference sequence include limited gaps in the euchromatin arms, some of the telomeres, and 10 of the 12 centromeres

(Arabidopsis Genome Initiative 2000), but 1,001 A. thaliana genomes (Weigel and Mott 2009) will be available in the next few years (▶ Genomic Sequence and Structural Diversity in Plants). Large collections of indexed mutants (Alonso et al. 2003) and full-length cDNA clones are available to facilitate rapid genetic analyses of genes, phenotypes, and pathways. Thus, while it has been over a decade since the publication of the first A. thaliana genome (Arabidopsis Genome Initiative 2000), the exponential growth of resources and knowledge for A. thaliana, enabled in large part by the genome sequence, contributes to the dominance of A. thaliana as the premier model species in plant biology research. Oryza sativa. Rice (O. sativa) was the second plant species to have its genome sequenced. The main reason is the importance of rice as a food crop. No other plant provides more calories for human consumption than rice. Additionally, rice is a model species for plant scientists. Rice has been well characterized agronomically, is easily genetically transformed, and has a relatively small genome size of 389 Mb (Fig. 4). Rice is a member of the Poaceae (the grass family), a major taxonomic family within the monocotyledonous angiosperms that contains the cereals and grasses. This family includes key agronomic species such as Festuca species (fescue), Hordeum vulgare (barley), O. sativa (rice), Saccharum species (sugarcane), Sorghum bicolor (sorghum), Triticum aestivum (wheat), and Zea mays (maize, corn). The agronomic importance of rice and the fact that many researchers study rice physiology and development make rice a scientifically important member of the Poaceae, and all of these factors contributed to the decision to sequence the rice genome.


Plant Genomes: From Sequence to Function Across Evolutionary Time, Fig. 5 Synteny among Poaceae species. A five-gene region (A–E) is shown in Brachypodium distachyon in which the genes are not interrupted by transposable elements (TEs). Examples of rearrangements, deletions, and insertions of novel genes (X, Y, or Z) are shown in three related Poaceae species

When the rice genome was being sequenced, there were in fact four separate sequencing projects. The most widely referenced of these genome sequences was produced by the International Rice Genome Sequencing Project (IRGSP) (International Rice Genome Sequencing Project 2005). The IRGSP sequenced the genome of the rice cultivar Nipponbare, which is a member of the japonica subspecies of rice. The IRGSP used a BAC-by-BAC approach to sequence the Nipponbare genome, and the final sequence consisted of 12 pseudomolecules with a combined length of 370 Mb, which was estimated to represent 95% of the complete genome. The chromosomes of rice range in size from 22.7 to 43.3 Mb. The final sequence contained 62 gaps that occurred in regions that were difficult to clone or assemble, such as centromeres, telomeres, and other highly repetitive regions. Examination of the genome allowed researchers to identify 37,544 protein-coding genes that were not related to transposable elements. At least 35% of the genome of Nipponbare, the reference accession that was sequenced, is composed of transposable element-related sequences, a major difference from A. thaliana and reflective of many other plant genomes. At the time that the rice genome sequence became available, there were many questions in the minds of plant researchers as to the similarity between plant genomes. As in A. thaliana, gene duplication, particularly tandem duplication, is abundant in rice. In A. thaliana, 17% of all genes are found in tandem duplications, and in rice, 14% of genes were found to exist in duplicated tandem arrays. However, these percentages are for immediately adjacent duplication events. A less stringent analysis that identified duplications in 5 Mb windows found that nearly one third of all rice genes have been involved in duplication events. Surprisingly, the genes of rice and A. thaliana are very similar: almost 90% of A. thaliana genes have a homolog in rice, but only 71% of rice genes have a homolog in the A. thaliana genome. This difference may be partly due to the larger number of genes in rice. The rice genes that do not have homologs in A. thaliana have a variety of putative functions: seed storage proteins, seed allergens, proteinase inhibitors, chitinases, and defense-related genes (International Rice Genome Sequencing Project 2005). However, the majority of rice-specific genes currently lack a known function. Since the rice genome was sequenced, rice has served as a model species for the Poaceae. The conservation of gene order between two species is referred to as synteny, and the degree of synteny between members of the Poaceae is very high. There have been numerous rearrangements and duplications of chromosomal regions, but in general, genes that are found near each other in one Poaceae species are likely to be found near each other in another Poaceae species (Fig. 5). The conservation of gene order also extends to conservation of gene sequence and function, and thus, knowledge gained from the rice genome sequencing effort has greatly facilitated the understanding
of gene function throughout the Poaceae family. Today, the genomes of other grass species have been sequenced, but rice, like A. thaliana, was sequenced using a strategy that resulted in a very high-quality genome sequence, and thus, for the foreseeable future, the rice genome will be the “gold standard” to which all other Poaceae genomes are compared. Not Just Two Genomes: How Multiple Plant Genome Sequences Have Changed Our Understanding of Plant Biology The A. thaliana and O. sativa genomes are the only “gold standard” genomes available to date. However, improvements in throughput and computational hardware and software, coupled with decreased costs, enabled the sequencing of many other plant genomes that span a range of taxa across the Angiosperms as well as the lower land plants (see below). Prior to 2008, these genomes were all sequenced using the Sanger sequencing platform (▶ Plant Genome Sequencing Methods). However, the emergence of the next-generation sequencing platforms has contributed to the availability of genome sequences for a wide range of taxa. This has enabled major insights into plant genome evolution (▶ Plant Genomes, Evolution of). What is apparent is that plant genomes are dynamic and have undergone expansion at many levels. This includes tandem duplication events, segmental duplication events, and even whole genome duplication. These duplications lead to gene families and, as a consequence, potential evolution of gene function among family members. Also abundant in some plant genomes are TEs that can contribute significantly to genome size, as in the case of maize where 85% of the genome is composed of TEs. In select examples in maize, TE duplication has contributed to gene evolution. Access to the genomes of more than one Oryza species has provided insight into the evolution of species within a single genus and will be instrumental in understanding key events in domestication of Oryza sativa. 
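The two ways of counting duplicated genes mentioned for rice (immediately adjacent tandem copies versus copies within a 5 Mb window) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each gene has already been assigned to a gene family and is given as a tuple `(gene_id, chromosome, start_bp, family)`; the function names and the input layout are hypothetical and do not reproduce the method used by the genome projects.

```python
from collections import defaultdict

def adjacent_tandem(genes):
    """Genes whose immediate neighbour on the same chromosome is in the same family."""
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))  # by chromosome, then start
    hits = set()
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left[1] == right[1]
        same_family = left[3] == right[3]
        if same_chrom and same_family:
            hits.update((left[0], right[0]))
    return hits

def windowed_duplicates(genes, window_bp=5_000_000):
    """Genes with another same-family gene within window_bp on the same chromosome."""
    groups = defaultdict(list)
    for gene_id, chrom, start, family in genes:
        groups[(chrom, family)].append((start, gene_id))
    hits = set()
    for members in groups.values():
        members.sort()
        # Comparing consecutive family members is enough: if any two members
        # lie within the window, the nearest neighbours do as well.
        for (s1, g1), (s2, g2) in zip(members, members[1:]):
            if s2 - s1 <= window_bp:
                hits.update((g1, g2))
    return hits
```

Dividing the size of either returned set by the total number of genes gives a percentage comparable in spirit to those quoted for A. thaliana and rice; the window criterion is always at least as inclusive as the adjacency criterion.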
Furthermore, understanding the genotypic diversity within a single species is a research topic for many scientists. Multiple cultivars, ecotypes, and accessions are currently being sequenced from several important plant species (▶ Genomic Sequence and Structural Diversity in Plants). What is surprising from these studies is that there is substantial diversity at the genome level within a species. This diversity can be simple, such as single-nucleotide polymorphisms (SNPs) or small insertions/deletions (indels). However, it can also be more substantial and include large indels (>100 bp), gene copy number variations, and gene presence/absence variations. While this may be anticipated in heterozygous or outcrossing species such as maize, where allelic diversity is high, the A. thaliana genome also exhibits substantial diversity at the population level. Phylogenetic Sampling of Plant Genomes While A. thaliana and O. sativa are representatives of two major groups of angiosperms, the dicots and monocots, they do not represent the full diversity of plant species or even just the angiosperms. Genome sequences are available for Chlorophyta (Chlamydomonas reinhardtii, green alga), Bryophyta (Physcomitrella patens, moss), Lycopodiophyta (Selaginella moellendorffii, spikemoss), numerous angiosperm families that represent the diversity of the monocots and dicots, and a few gymnosperms. The majority of initial genome projects focused on agricultural crops of obvious importance. These include members of the Poaceae family, the grasses, which includes major cereals such as sorghum, maize, barley, and wheat. Before 2012, five additional Poaceae genome sequences were available, including those of the feed crops sorghum and maize. A key Poaceae genome sequenced was that of Brachypodium distachyon (hereafter referred to as Brachypodium). Brachypodium is a small-statured, fast-cycling annual grass that is related to all Poaceae species but more closely related to wheat and barley. What is remarkable about Brachypodium is that it has an extremely small genome relative to other Poaceae species, 272 Mb (The International Brachypodium Initiative 2010). 
The small size is due to the lack of polyploidization and the lack of expansion of TEs compared to other Poaceae species such as maize and wheat. The provision of genome sequences from multiple species within


[Fig. 6 image: genome browser view of rice Chr3, 31,220–31,233 kb. Tracks shown: MSU Osa1 Rice Loci (LOC_Os03g54920, expressed protein; LOC_Os03g54930, VIP4, putative, expressed); FGenesh Predictions; MSU Osa1 Rice Gene Models (LOC_Os03g54920.1, LOC_Os03g54930.1); Rice FL-cDNA (AK099399.1); Pistil OSN_AE; Oryza Repeats; Best Arabidopsis Hit (AT3G05010, AT5G61150); Best Maize Hit (GRMZM2G123732, GRMZM2G048494); CSRDB smRNAs]

Plant Genomes: From Sequence to Function Across Evolutionary Time, Fig. 6 Graphical representation of a two locus region in the rice genome. Shown are two genes, LOC_Os03g54920 and LOC_Os03g54930, which encode an expressed protein and a VIP4 protein. Shown below the locus track is the output of the FGENESH prediction, an ab initio gene finder used for structural annotation. Gene models with exons (boxes) and introns (lines) are shown. Full-length cDNA support is available for LOC_Os03g54920 but not LOC_Os03g54930. Expression support is also provided for both gene models by RNA-seq data from pistil tissue as shown by the dark blue boxes under Pistil OSN_AE. There are repetitive sequences in this region (burgundy boxes). Both loci have homologous sequences in A. thaliana and maize and glyphs are shown for these loci which are hyperlinked to the A. thaliana and maize databases, respectively, for more details. Small RNAs are also annotated in this region, shown in the orange boxes and corresponding to the location of the repetitive sequences

a taxonomic clade has enabled robust assessments of synteny, in which gene order is shared between species suggesting an evolutionary relationship (Fig. 5). Indeed, researchers have been able to identify large syntenic blocks among the Poaceae and have been able to reconstruct the ancestral chromosome (The International Brachypodium Initiative 2010).
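The idea of a syntenic block, as drawn in Fig. 5, can be sketched in a few lines of Python. The sketch assumes that each species is represented simply by an ordered list of gene names along a chromosome and that orthologous genes share the same name (as genes A–E do in Fig. 5); real analyses start from sequence similarity searches and tolerate far more noise. The function name `syntenic_blocks` and its parameters are hypothetical.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Return runs of shared genes that are collinear in both gene orders.

    order_a, order_b: gene names in chromosomal order (names assumed unique
    within each list). A run may be in the same or the inverted orientation.
    """
    shared = set(order_a) & set(order_b)
    genes_a = [g for g in order_a if g in shared]
    genes_b = [g for g in order_b if g in shared]
    rank_b = {gene: i for i, gene in enumerate(genes_b)}

    blocks = []
    current = []
    direction = 0  # +1 same orientation, -1 inverted, 0 not yet known
    for gene in genes_a:
        if current:
            step = rank_b[gene] - rank_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
            if len(current) >= min_genes:
                blocks.append(current)
        current = [gene]
        direction = 0
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

Because genes absent from either species are dropped before ranks are assigned, an inserted novel gene (X, Y, or Z in Fig. 5) or a deleted gene does not break a block, whereas an inversion starts a new block in the opposite orientation.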

Data Storage, Access, and Data Mining Plant genomes are large, in the range of millions to billions of base pairs. The annotations of these genomes are equally large, with annotated protein-coding genes nearing 50,000 in some species. As the genomes from additional species and from multiple individuals from single species are sequenced, the total amount of sequence and annotation will grow at an extraordinary pace. This amount of data is prohibitive to data mining using simple documents or files. Genome sequence and annotation are stored and made available for data mining through public archives such as the National Center for Biotechnology Information (NCBI) and through large public Internet-based resources. The data are stored in relational databases that permit querying of the sequence and annotation by biologists. These sequence- or text-based searches provide information at the single-gene level to enable biology-based inquiries. More complex queries are possible in select resources. Bioinformaticists can also develop custom relational databases that permit robust analyses of not only a single genome but multiple genomes. Perhaps the most widely used tool to view and analyze plant genomes is a graphical viewer in which the genome, annotated genes, and other features are displayed (Fig. 6) with a limited number of search and analysis tools.
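How a relational database supports coordinate-based and text-based searches of genome annotation can be sketched with Python's built-in sqlite3 module. The table layout, column names, and function names below are assumptions made for illustration and do not describe the schema of any public resource. The locus names and annotations are those shown in Fig. 6, but the coordinates are placeholders, not the real gene boundaries.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory annotation table holding two example loci."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        " locus TEXT PRIMARY KEY, chrom TEXT,"
        " start_bp INTEGER, end_bp INTEGER, annotation TEXT)"
    )
    # Locus names and annotations follow Fig. 6; the coordinates are
    # placeholders chosen for illustration, NOT the real gene boundaries.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [
            ("LOC_Os03g54920", "Chr3", 31_221_000, 31_225_000,
             "expressed protein"),
            ("LOC_Os03g54930", "Chr3", 31_227_000, 31_232_000,
             "VIP4, putative, expressed"),
        ],
    )
    return conn

def genes_in_region(conn, chrom, start_bp, end_bp):
    """Coordinate search: loci overlapping the interval [start_bp, end_bp]."""
    return conn.execute(
        "SELECT locus, annotation FROM gene"
        " WHERE chrom = ? AND end_bp >= ? AND start_bp <= ?"
        " ORDER BY start_bp",
        (chrom, start_bp, end_bp),
    ).fetchall()

def genes_by_keyword(conn, keyword):
    """Text search: loci whose annotation contains the keyword."""
    rows = conn.execute(
        "SELECT locus FROM gene WHERE annotation LIKE ? ORDER BY locus",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]
```

A graphical viewer such as the one in Fig. 6 can be thought of as issuing a region query of this kind for each displayed track and drawing the returned features.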

Conclusions As new sequencing technologies continue to evolve, it will become cheaper and easier to sequence plant genomes. Furthermore, the quality of those sequences is expected to improve. Therefore, in the near future, the genomes of thousands of important crop and non-crop species will be sequenced. As sequencing costs drop and quality increases, it will likely become routine for researchers to sequence the genome of every cultivar, ecotype, or accession that is important in their research. Annotation and comparative analyses will accompany these new genome sequences, and further insights into the genes, their regulation, and their evolution into the diversity of extant plant species will be made.

Cross-References ▶ Epigenetics ▶ Genomic Sequence and Structural Diversity in Plants ▶ Mitochondrial Genomes in Land Plants ▶ Plant Genome Annotation, Methods for ▶ Plant Genomes, Evolution of ▶ Plant Genomes: From Sequence to Function Across Evolutionary Time ▶ Plant Transposable Elements: Beyond Insertions and Interruptions

References
Adams MD, Soares MB, Kerlavage AR, Fields C, Venter JC (1993) Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4:373–380
Alonso JM, Stepanova AN, Leisse TJ, Kim CJ, Chen H, Shinn P, Stevenson DK, Zimmerman J, Barajas P, Cheuk R, Gadrinab C, Heller C, Jeske A, Koesema E, Meyers CC, Parker H, Prednis L, Ansari Y, Choy N, Deen H, Geralt M, Hazari N, Hom E, Karnes M, Mulholland C, Ndubaku R, Schmidt I, Guzman P, Aguilar-Henonin L, Schmid M, Weigel D, Carter DE, Marchand T, Risseeuw E, Brogden D, Zeko A, Crosby WL, Berry CC, Ecker JR (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301:653–657
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
Bennett MD, Leitch IJ, Price HJ, Johnston JS (2003) Comparisons with Caenorhabditis (approximately 100 Mb) and Drosophila (approximately 175 Mb) using flow cytometry show genome size in Arabidopsis to be approximately 157 Mb and thus approximately 25% larger than the Arabidopsis genome initiative estimate of approximately 125 Mb. Ann Bot 91:547–557
Brown JW, Waugh R (1989) Maize U2 snRNAs: gene sequence and expression. Nucleic Acids Res 17:8991–9001
Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR (2006) Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics 7:327
Clough SJ, Bent AF (1998) Floral dip: a simplified method for Agrobacterium-mediated transformation of Arabidopsis thaliana. Plant J 16:735–743
Gill BS, Appels R, Botha-Oberholster AM, Buell CR, Bennetzen JL, Chalhoub B, Chumley F, Dvorak J, Iwanaga M, Keller B, Li W, McCombie WR, Ogihara Y, Quetier F, Sasaki T (2004) A workshop report on wheat genome sequencing: International Genome Research on Wheat Consortium. Genetics 168:1087–1096
Green BR (2011) Chloroplast genomes of photosynthetic eukaryotes. Plant J 66:34–44
Haas BJ, Wortman JR, Ronning CM, Hannick LI, Smith RK Jr, Maiti R, Chan AP, Yu C, Farzad M, Wu D, White O, Town CD (2005) Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol 3:7
He Y (2009) Control of the transition to flowering by chromatin modifications. Mol Plant 2:554–564
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436:793–800
Kubo T, Newton KJ (2008) Angiosperm mitochondrial genomes and mutations. Mitochondrion 8:5–14
Lauria M, Rossi V (2011) Epigenetic control of gene regulation in plants. Biochim Biophys Acta 1809:369–378
Meyer IM (2007) A practical guide to the art of RNA gene prediction. Brief Bioinform 8:396–414
Nagaki K, Cheng Z, Ouyang S, Talbert PB, Kim M, Jones KM, Henikoff S, Buell CR, Jiang J (2004) Sequencing of a rice centromere uncovers active genes. Nat Genet 36:138–145
Somerville C, Koornneef M (2002) A fortunate choice: the history of Arabidopsis as a model plant. Nat Rev Genet 3:883–889
Stupar RM, Lilly JW, Town CD, Cheng Z, Kaul S, Buell CR, Jiang J (2001) Complex mtDNA constitutes an approximate 620-kb insertion on Arabidopsis thaliana chromosome 2: implication of potential sequencing errors caused by large-unit repeats. Proc Natl Acad Sci U S A 98:5099–5103
The International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463:763–768
Weigel D, Mott R (2009) The 1001 genomes project for Arabidopsis thaliana. Genome Biol 10:107
Zhang X (2008) The epigenetic landscape of plants. Science 320:489–492

Plant Transposable Elements: Beyond Insertions and Interruptions

Plant Transposable Elements: Beyond Insertions and Interruptions Ning Jiang Department of Horticulture, Michigan State University, East Lansing, MI, USA

Synopsis Transposable elements (TEs) are fragments of DNA that can move, or transpose, from one location to another in the genome. Due to their mobility, they are also called “jumping genes.” Plant TEs are extremely powerful mutagens, resulting in insertions, deletions, duplications, and chromosomal inversions. In addition to their well-described destructive roles as mutagens and molecular parasites, recent studies indicate that TE activity can have constructive roles in genomes. These include duplication of gene sequences, modification of gene expression patterns, and functionality as a chromosomal domain. This essay introduces the fundamental aspects of plant TEs with a focus on the recent discoveries of their constructive role in plant genomes and their interaction with the host genome and environment.

Introduction TEs were first discovered by Barbara McClintock in the 1940s using maize as a model organism. Maize, a member of the grass (Poaceae) family, is an excellent model species for genetic studies due to its large flowers as well as its separate male (the tassels) and female (the ears) reproductive organs, which enable easy genetic crosses. Furthermore, the large sizes of maize chromosomes make them readily visible under a light microscope. For a normal maize plant, chromosome breakage is a rare event. However, McClintock noticed that in one maize line, breakage on chromosome 9 occurred very often and always at one particular locus (McClintock 1948). Subsequently, she discovered that two factors were essential for the
breakage. One was located at the site of breakage and was called Ds (Dissociation). The other, which was distinct from Ds, was required to “activate” the breakage and was therefore called Ac (Activator). Because the locations of Ac and Ds appeared to vary over time in the genome, McClintock proposed that they were genetic elements capable of transposition (McClintock 1948). In 1983, more than 30 years after the initial identification of TEs, McClintock won a Nobel Prize for her discovery. Around the same time, the Ac and Ds elements were cloned and sequenced by multiple research groups. It turned out that Ac encodes a transposase (Tpase) protein that is responsible for the transposition of Ac itself as well as Ds. Meanwhile, more TEs were identified from maize and other organisms including other plant species, animals, fungi, algae, and bacteria. These TEs form distinct superfamilies and families; Ac/Ds represents only one of numerous TE families.

Classification of TEs: Two Classes of Elements Based on their transposition mechanism, TEs fall into two classes. Class 1, the retrotransposons, uses a “copy and paste” mechanism and utilizes an RNA intermediate for transposition (Fig. 1a, b). Class 1 elements can be further divided into several groups, including the long terminal repeat (LTR) elements, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs). During their transposition, the element mRNAs are converted into cDNA through the action of reverse transcriptase which is encoded within the element, and the TE cDNAs are then inserted into a target site in the genome. Due to their replicative transposition mechanism, class 1 elements can amplify very rapidly and contribute the largest portion of most plant genomes in terms of DNA content. Class 2, the DNA transposons, is often associated with terminal inverted repeats (TIRs) and transposes via a DNA intermediate (Fig. 1c, d). In plants, there are several superfamilies of DNA transposons


Plant Transposable Elements: Beyond Insertions and Interruptions, Fig. 1 Structure and transposition of two classes of TEs. Coding regions are depicted as colored boxes. LTRs and TIRs are shown as black triangles. TSDs are shown as small horizontal arrows flanking LTRs or TIRs. RNAs are shown as waved lines. Other DNA sequences are shown as black lines. (a) Structure of an autonomous and a nonautonomous LTR element. (b) Transposition of an LTR element using the “copy and paste” mechanism. (c) Structure of an autonomous DNA element, nonautonomous DNA element, and a MITE. (d) Transposition of a DNA element through “cut and paste” mechanism

including Ac/Ds (hAT), Spm/dSpm (CACTA), Mutator/Mutator-like element (MULE), Tc1/Mariner/Stowaway, PIF/Harbinger/Tourist, and Helitron. In general, DNA transposons excise from one site and reinsert elsewhere in the genome, resembling a “cut and paste” mechanism (Fig. 1d). Based on their coding capacity, both classes of TEs can be divided into autonomous and nonautonomous elements. Autonomous elements, such as Ac, encode the protein products (Tpase or reverse transcriptase) required for their transposition. Nonautonomous elements, such as Ds, do not encode the relevant products and rely on their cognate autonomous elements (Ac) for transposition (Fig. 1). When an element inserts into a genomic site, it often duplicates a small piece of its flanking sequence, which is called target site duplication (TSD) (Fig. 1). Each superfamily creates a unique length of TSD that can be used to
identify the element. Helitrons are an exception to this classification because they do not generate a TSD.
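The use of TSD length as a diagnostic can be illustrated with a short Python sketch. It measures the longest direct repeat immediately flanking an annotated element and looks up which superfamilies typically produce a TSD of that length. The function names are hypothetical, and the lengths in the lookup table are commonly reported typical values supplied here as assumptions; they are not taken from this entry, and some superfamilies share the same length or vary around it.

```python
# Typical TSD lengths in base pairs. These values are ASSUMPTIONS added for
# illustration (commonly reported figures), not data from this entry.
TYPICAL_TSD_BP = {
    "hAT (Ac/Ds)": 8,
    "CACTA (Spm/dSpm)": 3,
    "Mutator/MULE": 9,
    "Tc1/Mariner": 2,
    "PIF/Harbinger": 3,
    "LTR retrotransposon": 5,
}

def tsd_length(seq, start, end, max_len=12):
    """Length of the longest direct repeat flanking the element seq[start:end].

    Compares the k bases immediately upstream of the element with the k bases
    immediately downstream, for k = 1..max_len, and returns the largest k
    that matches exactly (0 if none does).
    """
    best = 0
    for k in range(1, max_len + 1):
        if start - k < 0 or end + k > len(seq):
            break
        if seq[start - k:start] == seq[end:end + k]:
            best = k
    return best

def candidate_superfamilies(tsd_bp):
    """Superfamilies whose typical TSD length equals tsd_bp."""
    return sorted(name for name, n in TYPICAL_TSD_BP.items() if n == tsd_bp)
```

Very short matches of 1–3 bp can arise by chance, and an element such as a Helitron, which generates no TSD, would be expected to return 0, so in practice TSD length is combined with other features such as TIR or LTR sequences.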

TEs as Major Components of the Genome: Genes Are Buried in an Ocean of Transposable Elements

All of the TEs that were discovered in early research endeavors, such as Ac/Ds and Spm/dSpm, are associated with rather low copy numbers in the genome (up to a dozen). As a result, TEs were considered a rare curiosity existing in only a few special organisms such as maize. This paradigm has changed as a consequence of the development of genome-scale sequencing technologies. To date, TEs have been detected in almost all the sequenced prokaryotic and eukaryotic genomes. It is clear that TEs comprise a major portion of most plant genomes except Arabidopsis thaliana, which has a streamlined genome (157 Mb) with only 14% of its genome derived from recognizable TEs (Arabidopsis Genome Initiative 2000). In contrast, maize harbors at least 1.3 million copies of TEs that account for 84% of its genome (Schnable et al. 2009). This partially explains why TEs were first discovered in maize. Although plant genomes contain both class 1 and class 2 elements, class 1 elements often contribute more to the expansion of genome size due to their replicative transposition mechanism (see above). This is well demonstrated in the grasses, a critical clade as most cereal crops are included in this family. Grasses are derived from a common ancestor about 70 million years ago, so their genomes are similar in terms of gene content and order. However, there is an 18-fold variation in genome size among the diploid grass species (272 Mb for Brachypodium distachyon and 5 Gb for barley). Genes in the B. distachyon genome are closely linked to each other, while the same genes in maize or barley are separated by large blocks of LTR retrotransposons. Therefore, the amplification of LTR elements is largely responsible for the genome size expansion in grasses. TEs in maize significantly outnumber normal genes (1,300,000 vs. 39,000). In other words, each gene is accompanied by about 33 TEs!

Regulation of TE Activity: Turned On by Stress

If a TE inserts inside a gene or its regulatory region, the insertion may have serious implications for the function of the relevant gene and the fitness of the host organism. The insertion could either abolish the transcription of the gene or change its coding capacity. Given the abundance of TEs that could be highly mutagenic, it is intriguing how plants maintain their genome integrity and survive. Apparently, TEs and the genome have established a balance through long-term evolution. First, most TEs have a targeting specificity, that is, a preference to insert in certain locations or specific sequences within the genome. LTR retrotransposons, which make up the largest fraction of plant genomes, are mostly located in heterochromatic regions where there are fewer genes, the so-called safe havens in the genome. This minimizes the possibility of them inserting into genes. On the other hand, the elements that are most frequently located in genic regions are the small DNA transposons or miniature inverted-repeat transposable elements (MITEs, such as Tourist and Stowaway elements, Fig. 1c). Due to their small size, most insertions by MITEs result in subtle changes in the genes they insert into (Naito et al. 2009). More importantly, most existing TEs are products of past transposition events, and few of them are still actively transposing in the genome. In maize, there are over 1,000 families of TEs, yet only a few TE families are still active in a subset of maize lines (Schnable et al. 2009). A variety of factors contribute to the inactivation of TEs. Some TEs lose their activity due to the accumulation of mutations in the region encoding Tpases. Other TEs are silenced by host silencing mechanisms such as RNA interference. Due to their
repetitive features, fortuitous TE transcripts with sense and antisense orientations often coexist and form double-stranded RNAs, which are further processed by the RNA silencing machinery (Dicer-like enzymes and the RNA-induced silencing complex) to form small interfering RNAs (sRNAs). The sRNAs may bind to messenger RNAs (mRNAs) with sequence homology and result in the degradation of the relevant mRNA. Alternatively, the sRNAs may mediate de novo methylation of the relevant loci through RNA-DNA interaction, which suppresses transcription. Some active TEs are quiescent under normal growth conditions and are only activated when the plants are subjected to biotic or abiotic stresses such as pathogen attack, wounding, and tissue culture (Feschotte et al. 2002). This is because many TEs contain cis-regulatory elements that are responsive to stress signals. The rice MITE mPing, for example, contains a cis-element that is responsive to cold (Naito et al. 2009). mPing can also be activated by tissue culture and radiation, and its copy number varies from 1 to over 1,000 in different rice cultivars (Naito et al. 2009). Collectively, the chance for TEs to generate disruptive effects in a normal plant is rather low. However, when the “balance” is broken under stress, TEs may undergo massive amplification and reshape the genome.

Structural Component of the Genome: TEs in Plant Centromeres Centromeres are the chromosomal domains essential for the assembly of the chromosomal structures that mediate faithful segregation at mitosis and meiosis. Centromeres interact with the spindle apparatus to enable chromosome disjunction. The function of centromeres is conserved across different species. Like other chromosomal domains, the centromere/kinetochore complex is composed of DNA and structural proteins. One of the centromere-specific proteins is called CENH3, which replaces the regular histone H3 in centromeric chromatin and is critical in the establishment of kinetochores in various organisms. In plants, satellite DNA and LTR retrotransposons are the main DNA components in centromeres (Jiang et al. 2003). Multiple families of LTR elements are present in centromeric and pericentromeric regions. While some of them are also found in other chromosomal domains, certain elements are centromere specific, such as the “centromeric retrotransposon (CR)” family in grasses. Unlike other retrotransposons that diverge rapidly, the CR elements (CRR in rice and CRM in maize) are conserved and specifically located in centromeres in almost all the grass species (Jiang et al. 2003). The CRM retrotransposon directly interacts with CENH3, and the interaction is sufficient to induce centromere identity and function. It was speculated that the centromere-associated transcripts, including those from CR elements, might be essential in the recruitment of CENH3 (Jiang et al. 2003). Therefore, the CR elements and their transcripts may facilitate chromosome segregation through their interaction with CENH3. Interestingly, recently active CRM elements are found in the center of current centromeres, and relatively ancient CRM elements are located in pericentromeric regions. This suggests that the centromere is a dynamic complex which continuously recruits newly transposed elements in the center of the centromere and “pushes” older elements to the flanking regions (Wolfgruber et al. 2009). The rapid turnover of centromeric TEs suggests that TEs in centromeres may play a key role in reproductive isolation and the emergence of new species.

Acquisition and Duplication of Genes: Making Genes Jump

One of the recent findings in plant genomes is the presence of large amounts of “atypical” TEs, which carry genomic sequences including intact genes or gene fragments. For example, about 3,000 Pack-MULEs (Mutator-like elements carrying genes or gene fragments) were identified in rice, and many of them carry gene fragments from different loci (Jiang et al. 2004) (Fig. 2). In maize, there are 1,930 intact Helitron elements, and most of them have captured gene fragments (Schnable et al. 2009). As a result, duplication, amplification, and mobilization of genes and gene


Plant Transposable Elements: Beyond Insertions and Interruptions, Fig. 2 A generic Pack-MULE element carrying sequences from two genes. Exons are shown as colored boxes, and introns are depicted as “v”-shaped lines connecting exons. Other sequences are diagrammed as in Fig. 1. Homologous sequences are connected by dashed lines. Small RNAs (wavy lines) generated by the Pack-MULE would act on both the Pack-MULE and the parental genes

fragments by TEs is one of the major mechanisms for shuffling of coding and regulatory sequences in plants. The acquisition of genomic sequences by retrotransposons likely occurs during the transposition process. One component of the retrotransposition machinery of an LTR element is the virus particle in which the element mRNAs are included and subsequently converted into cDNAs. During this process, mRNAs of normal genes can be fortuitously packed into the particle and thereafter retrotransposed. Since these retrogenes are generated through reverse transcription and transposition, they are distinguished from their parental forms by the presence of a poly (A) tract, the lack of introns, and the presence of a TSD. Some of the retrogenes are associated with the relevant elements, i.e., gene fragments may be found between the two LTRs of a single LTR element. Other retrogenes may be present independently, making it difficult to deduce the element responsible for their formation. The mechanism for sequence acquisition by DNA elements is less clear, and it is not known whether the acquisition is linked to transposition. Since introns are observed in the acquired portion of the DNA elements, it is clear that the acquisition involves DNA sequences, not RNA or cDNA sequences (Jiang et al. 2004). Given the abundance of gene-carrying elements in plants, a critical question is whether some of these evolve into bona fide genes. Emerging evidence suggests this is the case. In tomato,


the Rider LTR element is responsible for the creation of the sun locus, which is responsible for the oval shape of tomato fruit (Xiao et al. 2008). The Rider element duplicated a 25 kb DNA fragment from chromosome 10 and inserted the duplication into the sun locus which is located on chromosome 7. In tomato cultivars with round tomato fruit, there is only one SUN gene on chromosome 10, with no detectable expression. For the cultivar with oval fruit, an additional SUN gene is located in the duplicated region on chromosome 7. The additional SUN gene duplicated by Rider is surrounded by novel regulatory elements, which leads to its expression in flower and young fruits, conferring the oval fruit phenotype (Xiao et al. 2008). Thus, it is very likely that a subset of genes duplicated by TEs serve as functional genes.

Regulatory Roles of TEs: Modifying the Expression of Existing Genes

When TEs insert into the regulatory regions of genes, they often alter the expression of the adjacent genes, and the outcome of such alteration varies with individual insertions. Commonly, the insertion abolishes the expression of the relevant gene by disrupting the structure and function of the promoter. In other cases, the TE insertion may upregulate the expression of the adjacent gene. For instance, a somatic mutation termed reiterated reproductive meristem (RRM) is responsible for the overbranching of fruit meristems in grape and is caused by the insertion of a hAT family DNA transposon in the promoter of the VvTFL1A regulatory gene (Fernandez et al. 2010). The hAT element acts as an enhancer which leads to the elevated expression of the VvTFL1A gene and the reiteration of branching in the floral meristem. In rice, most of the recently amplified mPing elements either upregulate the expression of their adjacent genes or have no detectable effects (Naito et al. 2009). In addition, the presence of cold-responsive regulatory sequences in mPing (see above) renders some of the nearby genes stress inducible.


In addition to the influence on the expression of their adjacent genes, TEs, such as Pack-MULEs, may also regulate the expression of their parental genes due to the sequence homology between the TEs and the genes. As mentioned above, repetitive TEs are often associated with the generation of sRNAs. Since some of the sRNAs are generated from the acquired gene fragments, those sRNAs are also effective in suppressing the expression of the parental genes. In this way, TEs become the “remote control” of their parental genes. As a consequence, TEs are capable of regulating gene expression both in “cis” and in “trans.”

Cross-References

▶ Genomic Sequence and Structural Diversity in Plants
▶ Plant Genome Annotation, Methods for
▶ Plant Genomes, Evolution of
▶ Plant Genomes: From Sequence to Function Across Evolutionary Time
▶ Target-site Selection
▶ Transposable Elements and Plasmid Genomes

References

Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
Fernandez L, Torregrosa L, Segura V, Bouquet A, Martinez-Zapater JM (2010) Transposon-induced gene activation as a mechanism generating cluster shape somatic variation in grapevine. Plant J 61:545–557
Feschotte C, Jiang N, Wessler SR (2002) Plant transposable elements: where genetics meets genomics. Nat Rev Genet 3:329–341
Jiang J, Birchler JA, Parrott WA, Dawe RK (2003) A molecular view of plant centromeres. Trends Plant Sci 8:570–575
Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431:569–573
McClintock B (1948) Mutable loci in maize. Carnegie Inst Wash Yearb 47:155–169
Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Richardson AO, Okumoto Y, Tanisaka T, Wessler SR (2009) Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature 461:1130–1134
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, Chen W, Yan L, Higginbotham J, Cardenas M, Waligorski J, Applebaum E, Phelps L, Falcone J, Kanchi K, Thane T, Scimone A, Thane N, Henke J, Wang T, Ruppert J, Shah N, Rotter K, Hodges J, Ingenthron E, Cordes M, Kohlberg S, Sgro J, Delgado B, Mead K, Chinwalla A, Leonard S, Crouse K, Collura K, Kudrna D, Currie J, He R, Angelova A, Rajasekar S, Mueller T, Lomeli R, Scara G, Ko A, Delaney K, Wissotski M, Lopez G, Campos D, Braidotti M, Ashley E, Golser W, Kim H, Lee S, Lin J, Dujmic Z, Kim W, Talag J, Zuccolo A, Fan C, Sebastian A, Kramer M, Spiegel L, Nascimento L, Zutavern T, Miller B, Ambroise C, Muller S, Spooner W, Narechania A, Ren L, Wei S, Kumari S, Faga B, Levy MJ, McMahan L, van Buren P, Vaughn MW, Ying K, Yeh CT, Emrich SJ, Jia Y, Kalyanaraman A, Hsia AP, Barbazuk WB, Baucom RS, Brutnell TP, Carpita NC, Chaparro C, Chia JM, Deragon JM, Estill JC, Fu Y, Jeddeloh JA, Han Y, Lee H, Li P, Lisch DR, Liu S, Liu Z, Nagel DH, McCann MC, SanMiguel P, Myers AM, Nettleton D, Nguyen J, Penning BW, Ponnala L, Schneider KL, Schwartz DC, Sharma A, Soderlund C, Springer NM, Sun Q, Wang H, Waterman M, Westerman R, Wolfgruber TK, Yang L, Yu Y, Zhang L, Zhou S, Zhu Q, Bennetzen JL, Dawe RK, Jiang J, Jiang N, Presting GG, Wessler SR, Aluru S, Martienssen RA, Clifton SW, McCombie WR, Wing RA, Wilson RK (2009) The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112–1115
Wolfgruber TK, Sharma A, Schneider KL, Albert PS, Koo DH, Shi J, Gao Z, Han F, Lee H, Xu R, Allison J, Birchler JA, Jiang J, Dawe RK, Presting GG (2009) Maize centromere structure and evolution: sequence analysis of centromeres 2 and 5 reveals dynamic loci shaped primarily by retrotransposons. PLoS Genet 5:e1000743
Xiao H, Jiang N, Schaffner E, Stockinger EJ, van der Knaap E (2008) A retrotransposon-mediated gene duplication underlies morphological variation of tomato fruit. Science 319:1527–1530

Plantae ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Plasmid Classification

▶ Plasmid Incompatibility

Plasmid Cloning Vectors

Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Synopsis

Plasmids are cloning vectors that are maintained in cells as autonomously replicating circular double-stranded DNA molecules. A great many cloning vectors that are in use today were derived from naturally occurring plasmids. DNA sequences that may allow for selection of host cells containing the plasmid, convenient addition of DNA inserts, partitioning of replicated plasmids to daughter cells during cell division, high-level expression of genes cloned into the plasmid, and other functions are found in or have been added to these natural molecules. Plasmid cloning vectors exist for use in bacteria, yeast, and higher eukaryotic cells. This short entry describes the history and features of some of the important plasmid cloning vectors.

Introduction

Many important cloning vectors are derived from naturally occurring plasmids. Plasmids are circular DNA molecules that are maintained as an episome, or extrachromosomal DNA molecule, inside a cell (Sherratt 1974). The plasmid must contain a DNA sequence that serves as an origin of replication (Ori) so that the plasmid DNA is propagated as the cell undergoes the cell division cycle. Some plasmids contain genes that encode proteins involved in plasmid DNA replication, plasmid partitioning to daughter cells during cell division, and self-transmissibility from one cell to another (conjugation). Plasmids may also encode proteins that confer functions beneficial to the host cell, such as resistance to antibiotics or to heavy metals. Cloning vectors used in bacteria typically have been constructed using DNA from several different sources to provide the most convenience to the experimenter. Cloning vectors used in yeast cells are either derived from natural plasmids or constructed from DNA elements taken from the yeast chromosomes, while many plasmids used in mammalian cells are derived from viruses. The replication origin and associated control elements in a plasmid are referred to as a replicon (Sambrook et al. 1989). Many different vectors may carry the same replicon and thus have the same or similar DNA replication mechanism. Several bacterial plasmids and cloning vectors, along with their properties, are listed in Table 1.

Plasmid Cloning Vectors, Table 1 Bacterial plasmids

| Plasmid | Replicon | Size (kb) | Copy number | References |
|---|---|---|---|---|
| ColE1 | ColE1 | 6.9 | 10–20 | Kornberg and Baker (1991) |
| pSC101 | pSC101 | 9 | 4–6 | Kornberg and Baker (1991) |
| F factor | | 99 | 1–2 | Kornberg and Baker (1991) |
| pBR322 | pMB1 (ColE1) | 4.4 | 20 | Lin-Chao et al. (1992) |
| pUC18 | pMB1 | 2.7 | 75 | Lin-Chao et al. (1992) |
| pACYC184 | p15A | 4.2 | 10–12 | Sambrook et al. (1989) |

F Plasmid

The term “plasmid” was coined by Lederberg in 1952 to signify an “extrachromosomal hereditary determinant” (Lederberg 1952, 1998). He applied the term to a wide range of then poorly understood genetic phenomena, including bacteriophages, genetic inheritance of chloroplasts and mitochondria, and insect endosymbionts, and, later, to the genetic element responsible for genetic exchange in bacteria via conjugation, the fertility factor (Cavalli et al. 1953). This latter element turned out to be a plasmid by the subsequent usage of the


Plasmid Cloning Vectors, Fig. 1 Map of the F plasmid. The locations of the three replicons (RepF1A, RepF1B, and RepF1C) and of the genes that encode functions for self-transmission to a recipient cell (“transfer region”) are indicated. Map is based on F plasmid sequence and annotation in GenBank (Acc. No. AP001918) and Kornberg and Baker (1991), Firth et al. (1996)

term. It is now known as the F plasmid and is considered the first plasmid to be discovered (Firth et al. 1996). The F plasmid is about 100 kb in size and carries about 60 genes (Kornberg and Baker 1991; Firth et al. 1996), including genes and DNA sequences that control its own DNA replication and its partitioning to daughter cells at host cell division and that allow for conjugative transfer of the plasmid from a donor cell to a recipient cell (Fig. 1). F plasmid replication and partitioning are regulated so that the plasmid is maintained at only one to two copies per cell. The plasmid contains three replicons, referred to as RepF1A, RepF1B, and RepF1C. RepF1A alone is sufficient for plasmid replication and stability, RepF1B can also support plasmid replication, and RepF1C is inactive (Firth et al. 1996). The RepF1A replicon includes an origin that directs bidirectional replication (oriV) and one that directs unidirectional replication (oriS) (Firth et al. 1996). F plasmid

DNA replication requires at minimum the oriS sequence and the plasmid-encoded RepE protein, along with several host-encoded proteins (Kornberg and Baker 1991). The F plasmid is too large and complex to be a useful cloning vector in its natural form. However, useful vectors can be derived from the F plasmid because most of the plasmid is dispensable for plasmid replication and stability in the cell. A 9 kb mini-F plasmid, consisting of bp #43,276–52,852 from the F plasmid, includes the essential oriS replication origin and genes that encode proteins needed for plasmid replication and maintenance, including the repE and sop (par) genes. Other features from F that are present in mini-F are shown in Fig. 2. The ccdA and ccdB genes encode a poison/antidote pair that ensures stable maintenance of the plasmid in a cell (Van Melderen 2002). The 11.7 kDa CcdB protein is toxic to the cell because it binds to and traps an intermediate in the reaction catalyzed by DNA


Plasmid Cloning Vectors, Fig. 2 Map of the mini-F plasmid. Mini-F consists of base pairs 43,276–52,852 from the F plasmid, including the RepFIA replicon (see Fig. 1). Genetic elements found in mini-F (also present in F) are indicated. See text for functions of these elements. Arrows indicate the direction of transcription (arrowhead is 3-prime end). Based on sequence and annotation in GenBank (Acc. No. M12987.1)

gyrase, in which the cell’s DNA is cleaved. Accumulation of DNA double-strand breaks leads to cell death. The 8.7 kDa CcdA protein counteracts the toxic effect of CcdB. This protects the cell as long as it contains the plasmid, so that the unstable CcdA protein is produced continuously. Mini-F also contains three sets of DNA repeats, called IncB, IncC, and IncD (sopC), that regulate plasmid replication and copy number and are responsible for incompatibility of two different recombinant F plasmids within one cell (Novick 1987). Importantly, the transfer genes from F are not present in mini-F. This means that mini-F cannot be transferred from cell to cell during culture. The mini-F plasmid is present at one to two copies per cell. Very large DNA fragments (300 kb) can be inserted into the mini-F plasmid and maintained as a bacterial artificial chromosome (BAC) in a bacterial host cell (Shizuya et al. 1992). Fosmids are a second type of cloning vector derived from the F plasmid (Kim et al. 1992). These are cosmids (9.5 kb) that have the oriS from the F plasmid, the repE and sop genes, an antibiotic resistance gene, and the cos site from phage λ. Large DNA fragments (40 kb) can be inserted, and the recombinant DNA can be packaged into

phage particles in vitro. The recombinant phages are used to infect host cells, in which the DNA is re-circularized and maintained as a single-copy circular DNA episome. Fosmids and BACs are beneficial because large and complex DNA inserts are often unstable in bacterial cells, undergoing deletion of parts of the DNA insert. The low copy number of a fosmid or BAC alleviates the instability problem.
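The mini-F coordinates quoted above can be checked with simple arithmetic. The sketch below assumes that the coordinates are inclusive and uses the approximate 100 kb size of F given in the text; the function and constant names are illustrative only.

```python
def segment_length(first_bp: int, last_bp: int) -> int:
    """Length of a segment given inclusive, 1-based coordinates."""
    if last_bp < first_bp:
        raise ValueError("last_bp must not be smaller than first_bp")
    return last_bp - first_bp + 1


# Coordinates of the mini-F segment within F, as quoted in the text.
MINI_F_LENGTH = segment_length(43_276, 52_852)

# Approximate size of the F plasmid (assumption: the "about 100 kb" from the text).
F_PLASMID_BP = 100_000
MINI_F_FRACTION = MINI_F_LENGTH / F_PLASMID_BP

if __name__ == "__main__":
    print(f"mini-F segment: {MINI_F_LENGTH} bp "
          f"({MINI_F_FRACTION:.1%} of an assumed 100 kb F plasmid)")
```

Under these assumptions the segment is 9,577 bp, in line with the roughly 9 kb size quoted for mini-F, and it represents a little under one tenth of the parent plasmid.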

ColE1, pBR322, and pUC Plasmids

The first important cloning vectors were derived from the plasmid ColE1 (Fig. 3). ColE1 is a 6.6 kb plasmid that is not self-transmissible. It encodes the colicin E1 toxin, which forms ion channels in the bacterial inner cell membrane and dissipates the transmembrane electrochemical gradient, resulting in death of the cell (Cramer et al. 1990). The imm gene encodes an inner membrane protein that makes a cell containing the ColE1 plasmid resistant to the colicin E1 toxin. ColE1 DNA replication depends on an origin sequence, but the plasmid encodes no proteins that are essential for plasmid replication or for partitioning during cell division. The ColE1


replication process is controlled by two plasmid-encoded RNA molecules, called RNA I and RNA II, and the plasmid-encoded rop protein. ColE1 is limited to 10–20 copies per cell because of this regulation by RNA I/II and rop. RNA II is transcribed from a template sequence that is 550 bp upstream of the Ori. The RNA II remains annealed to the DNA template as an RNA-DNA hybrid. The RNA in the hybrid is cleaved by the cellular RNase H, and the resulting 3′-terminal hydroxyl groups in the cleaved RNA serve as primer termini for plasmid DNA replication. Cleavage of RNA II is controlled by RNA I, whose gene overlaps that for RNA II but is transcribed in the opposite direction. RNA I is a negative regulator that binds to RNA II and prevents RNA II from hybridizing to the DNA. The RNA I-RNA II interaction is stabilized by the rop protein, a 63 amino acid residue protein encoded 400 bp downstream of the Ori. One of the most important early cloning vectors was pBR322 (Bolivar et al. 1977a, b). pBR322 was made starting with pMB8, one of several plasmids related to ColE1 that were found in clinical isolates of E. coli (Bolivar et al. 1977a; Betlach et al. 1976). A gene encoding resistance to tetracycline (TetR) from the plasmid pSC101 was joined to pMB8 by ligation. A gene encoding resistance to ampicillin (AmpR), carried on the TnA transposon, was joined to the Tet-resistant plasmid by transposition in vivo. One of the resulting AmpR-TetR plasmids, called pBR312, was then treated with EcoRI under star-cleavage conditions, the digest mixture was re-circularized using DNA ligase, and plasmids that conferred resistance to both Amp and Tet were selected again. This treatment removed DNA that was necessary for transposition by TnA, which stabilized the AmpR gene in the plasmid. Additional DNA was removed from one of these plasmids, called pBR313, by restriction enzyme digestion and ligation, to produce pBR322 (Bolivar et al. 1977b). 
pBR322 (4,361 bp) is thus a composite of DNA from three precursors (pMB8, pSC101, and TnA). It has the ColE1 origin of replication and the RNA I, RNA II, and rop genes, so it is limited to the same moderate copy number as is ColE1. A useful feature of pBR322 is that foreign DNA


can be inserted into unique restriction endonuclease sites located in either the AmpR gene (e.g., PstI, PvuI, or ScaI sites; see Fig. 3) or the TetR gene (e.g., EcoRV, BamHI, SphI, or SalI sites). Cells transformed with the resulting recombinant plasmid would then be resistant to one antibiotic but sensitive to the other. This provides a convenient way to isolate recombinant pBR322 vectors that contain insert DNA from amongst empty vectors. The pUC family of plasmids was constructed from pBR322 by Messing and colleagues. First, a DNA fragment that carried the tetracycline resistance gene and the rop gene was deleted from pBR322, to give pUR1 (Ruther 1980). A DNA fragment from M13mp7, carrying the lac operator and promoter, the multi-cloning site, and the lacZα gene fragment, was ligated into pUR1. Several restriction sites were eliminated from the resulting plasmid, ultimately producing pUC8 and pUC9 (Vieira and Messing 1982). The multi-cloning site was enlarged by adding additional restriction sites, producing pUC18 (Fig. 3) and pUC19 (Norrander et al. 1983). These plasmids are more useful than pBR322 because of the unique restriction sites in the multi-cloning site (Fig. 4), which make it very convenient to insert foreign DNA by ligation. The lac region allows for selection of recombinant plasmids by blue-white screening on agar plates containing X-gal. The copy number control of these plasmids is compromised due to the deletion of the rop gene and because of a G-to-A mutation in the DNA that encodes RNA II (Lin-Chao et al. 1992). The pUC plasmids are present at 50 copies per cell at 37 °C and 175 per cell at 42 °C (Lin-Chao et al. 1992). Plasmids have undergone tremendous further elaboration to make them suitable for a wide variety of applications. Several plasmid vectors are specifically designed for easy cloning of PCR products. 
Some DNA polymerases used in PCR have a terminal transferase activity that adds an untemplated deoxyadenosine residue (dA) to the 3′-end of the PCR product (Clark 1988). TA-cloning vectors are supplied as linear DNA molecules with a single thymidine (dT) added to the 3′-ends. The dT on the vector end can form


Plasmid Cloning Vectors, Fig. 3 Maps of ColE1, pBR322, and pUC18. See text for description of functions encoded in transcribed regions (arrows; arrowheads indicate 3-prime ends). Ori shows location of transition from primer RNA (RNA II) to DNA during plasmid DNA replication. Locations of some restriction endonuclease recognition sites in pBR322 are shown. MCS: multi-cloning site. Maps are based on sequences and annotations in GenBank: ColE1, Acc. No. J01566; pBR322, Acc. No. J01749; pUC18, Acc. No. L08752. pUC18 is shown in the opposite orientation from the sequence given in GenBank

a base pair with the dA on the PCR product end, and the two molecules can be joined by ligation (Mead et al. 1991). Other vectors are supplied in linear form with blunt ends that can be ligated to PCR products made using DNA polymerases that produce blunt-ended products.
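The antibiotic-based screen for pBR322 recombinants described earlier in this section can be written out as a small decision rule. The sketch below is illustrative only: the function name, the string labels, and the idea of scoring each colony on two replica plates (one with ampicillin, one with tetracycline) are assumptions made for the example, not part of any published protocol.

```python
def classify_colony(grows_on_amp: bool, grows_on_tet: bool, disrupted_gene: str) -> str:
    """Interpret replica-plating results for a pBR322 cloning experiment.

    disrupted_gene names the resistance gene that contains the cloning
    site used: "tet" (e.g., BamHI or SalI) or "amp" (e.g., PstI).
    """
    if disrupted_gene not in ("amp", "tet"):
        raise ValueError("disrupted_gene must be 'amp' or 'tet'")

    if grows_on_amp and grows_on_tet:
        # Both resistance genes are intact: the vector carries no insert.
        return "empty vector"
    if not grows_on_amp and not grows_on_tet:
        # Neither resistance is expressed: the cell carries no plasmid.
        return "no plasmid"

    # Exactly one antibiotic is tolerated. A recombinant keeps the marker
    # that was NOT used as the cloning site and loses the other one.
    keeps_intact_marker = grows_on_amp if disrupted_gene == "tet" else grows_on_tet
    return "recombinant" if keeps_intact_marker else "unexpected"


if __name__ == "__main__":
    # Insert cloned into the BamHI site within the TetR gene.
    print(classify_colony(True, False, "tet"))
    print(classify_colony(True, True, "tet"))
```

In practice, colonies are first selected on the antibiotic whose resistance gene remains intact and then tested for sensitivity to the second antibiotic, which is the comparison the function encodes.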

Expression Vectors

A major application of cloning is the high-level overexpression of the protein encoded by a gene of interest. The overexpressed protein can then be purified and studied in vitro, used for structural studies, used as an immunogen, etc. Some bacterial expression vectors contain only a promoter that drives transcription of an

insert gene, in which case the insert DNA must include a Shine-Dalgarno ribosome binding site and a start codon. Alternatively, the ribosome binding site and start codon may be part of the vector, so that only the open reading frame (ORF) of the gene of interest (start codon to stop codon) need be inserted. The start codon (ATG) in the vector is often part of a unique recognition site for either NdeI (CA/TATG) or NcoI (C/CATGG). The insert DNA must simply be altered so that the start codon of the ORF is also part of an NdeI or NcoI site. Ligation of the insert into the vector cut with one of these enzymes places the ORF at the correct position relative to the ribosome binding site for efficient translation of the mRNA. Several different strong promoters have been used for protein expression in E. coli, including


Plasmid Cloning Vectors, Fig. 4 Sequence of the lac promoter, lac operator, multi-cloning site (MCS), and a part of the lacZα coding sequence in pUC18. DNA bases in uppercase are in the MCS. Amino acids in lowercase are those encoded by the MCS sequence, while those in uppercase are the natural lacZ sequence. Bases in the Lac repressor binding site (i.e., lacO) are underlined. The −10 and −35 sequences of the lac promoter are in uppercase. The sequence and base numbering correspond to the orientation shown in Fig. 3. The upper DNA strand is the template strand for RNA synthesis, and transcription proceeds from the lower right to upper left

Plasmid Cloning Vectors, Fig. 5 Map of the pET-15b expression vector. “pBROri” indicates the RNA I and RNA II genes and Ori site from pBR322 (see Fig. 3). lacI: gene encoding E. coli Lac repressor protein. PT7: promoter sequence for transcription by bacteriophage T7 RNA polymerase. The gene encoding the protein to be expressed is inserted between the NdeI and either the XhoI or the BamHI site (red). The map is based on sequence obtained from http://www.emdmillipore.com

the lac promoter and the lacUV5 mutant promoter, the synthetic tac and trc promoters, and the pL promoter from bacteriophage lambda. Perhaps the most widely used are promoters recognized by the

RNA polymerase from bacteriophage T7 contained in pET vectors such as pET-15b (Figs. 5 and 6; Studier and Moffatt 1986; Studier et al. 1990). The pET vectors have the strong


Plasmid Cloning Vectors, Fig. 6 Sequence of the cloning site in pET-15b. DNA bases constituting the T7 promoter (PT7), Lac repressor binding site (lac operator), and T7 transcription terminator are underlined. The amino acid sequence encoded by the region, including the hexahistidine tag, is shown. The site for proteolytic cleavage of the fusion protein catalyzed by thrombin is indicated by the arrow

promoter and ribosome binding sites from the φ10 gene of T7 phage, just downstream of which are NdeI and/or NcoI sites into which the foreign ORF can be inserted. The plasmid-borne gene insert is expressed at a high level if T7 RNA polymerase is present in the host cell. The RNA polymerase can be provided by infecting the cells with a phage that expresses T7 RNA polymerase, usually phage CE6, a lambda phage that contains the gene for T7 RNA polymerase (Studier et al. 1990). More commonly, expression is achieved by introducing the recombinant vector into an E. coli strain that is lysogenic for the DE3 phage (Studier and Moffatt 1986). The DE3 prophage is a bacteriophage lambda integrated into the chromosome. The prophage contains the gene encoding T7 RNA polymerase downstream of the E. coli lacUV5 promoter. The gene is repressed by the Lac repressor protein. The T7 RNA polymerase is synthesized when IPTG is added to the growth medium, or when cells are grown in the presence of lactose (Studier 2005). There are no sequences in the E. coli chromosome that are recognized as promoters by T7 RNA polymerase, so the enzyme transcribes only the plasmid-borne gene. Transcription by T7 RNA polymerase is about fivefold faster than that by E. coli RNA

polymerase so that a very large amount of insert gene mRNA is produced, leading in most cases to synthesis of a large amount of the target protein. pET vectors are also available to express proteins fused to a variety of additional peptide sequences, such as a secretion signal sequence, hexahistidine tag (His6-tag), glutathione S-transferase (GST) tag, thioredoxin, etc., that facilitate folding, secretion, or purification of the overexpressed protein. Mutually compatible expression vectors have been developed for co-expression of several different genes in one host cell, which is useful for studies of multi-subunit proteins and enzymes (see ▶ “Plasmid Incompatibility”). The T7 expression system is particularly useful for expressing proteins that are toxic in E. coli. The T7 φ10 promoter sequence is unrelated to an E. coli promoter sequence, so a gene controlled by this promoter is not expressed in a cell that does not contain the T7 RNA polymerase gene. The gene encoding a protein of interest that happens to be toxic in E. coli can be cloned into the pET vector using an E. coli strain that does not produce the T7 polymerase. Expression of the protein is very low, and often not a problem, in the expression strain before induction of T7 RNA polymerase synthesis by IPTG or lactose, so that the strain


will often grow in spite of the toxic gene insert. However, a small amount of the T7 RNA polymerase is usually present in the cells due to leaky expression of the lacUV5 promoter under non-inducing conditions, and this may produce sufficient toxic protein to inhibit the growth of the cells. The uninduced expression can be reduced even further by co-transforming with a plasmid (pLysE or pLysS) that expresses the T7 lysozyme protein, a natural inhibitor of T7 RNA polymerase.
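The requirement described earlier in this section, that the start codon of the ORF be made part of an NdeI (CA/TATG) or NcoI (C/CATGG) site, can be checked with a short script. This is a minimal sketch under stated assumptions: the function names and the report format are hypothetical, the ORF is supplied as a plain DNA string beginning with ATG, and only the recognition sequences given in the text are used.

```python
NDEI_SITE = "CATATG"  # NdeI recognition site (cut CA/TATG); contains ATG
NCOI_SITE = "CCATGG"  # NcoI recognition site (cut C/CATGG); contains ATG followed by G


def count_sites(seq: str, site: str) -> int:
    """Count occurrences of a recognition site, including overlapping ones.

    Both sites are palindromic, so scanning one strand is sufficient.
    """
    return sum(seq.startswith(site, i) for i in range(len(seq)))


def start_codon_site_report(orf: str) -> dict:
    """Report how an ORF's start codon can be embedded in an NdeI or NcoI site."""
    orf = orf.upper()
    if not orf.startswith("ATG"):
        raise ValueError("ORF must begin with an ATG start codon")

    # NdeI: adding CAT upstream of ATG gives CATATG without touching the ORF.
    ndei_fusion = "CAT" + orf

    # NcoI: the base after ATG must be G; otherwise the first base of the
    # second codon would have to be changed, which may alter that residue.
    ncoi_needs_base_change = orf[3:4] != "G"
    ncoi_fusion = None if ncoi_needs_base_change else "CC" + orf

    return {
        "ndei_fusion": ndei_fusion,
        "ncoi_fusion": ncoi_fusion,
        "ncoi_needs_base_change": ncoi_needs_base_change,
        # Internal sites would also be cut and complicate the cloning.
        "internal_ndei_sites": count_sites(orf, NDEI_SITE),
        "internal_ncoi_sites": count_sites(orf, NCOI_SITE),
    }


if __name__ == "__main__":
    for example_orf in ("ATGGCTAGCAAATAA", "ATGAAACATATGTAA"):
        print(example_orf, start_codon_site_report(example_orf))
```

The first example ORF already has a G after the start codon and so fits an NcoI site unchanged, while the second contains an internal NdeI site and would need either a different enzyme or a silent mutation before cloning.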

Yeast Plasmids

There are two main types of plasmid vectors that can be used in yeast cells: those based on the natural 2-μm plasmid and ARS/CEN plasmids. The 2-μm plasmid (Fig. 7) is a 6,318 bp circular dsDNA element that is found naturally in most strains of Saccharomyces. It is present in multiple copies in the nucleus where it replicates once during the cell cycle (Futcher 1988). Its replication and partitioning require an autonomously replicating sequence (ARS), the plasmid-encoded REP1 and REP2 proteins, and the cis-acting STB (REP3) DNA sequence element. The copy number is kept constant by a partitioning system that uses the REP1 and REP2 proteins encoded by the plasmid (Ghosh et al. 2006). These proteins are present in most yeast strains due to an endogenous 2-μm plasmid, and so they need not be encoded on a plasmid used as a cloning vector. The D protein also contributes to plasmid stability (Futcher 1988). The plasmid also encodes the FLP site-specific recombinase, which binds to two FLP recombination target (FRT) sequences and catalyzes an intramolecular inversion of the intervening DNA. The plasmid can therefore exist in either of two forms, called A and B (Fig. 7). YEp plasmids (yeast episomal plasmids, such as YEp24, Fig. 8) contain a portion of the 2-μm plasmid that includes the replication origin (ARS sequence) to enable the plasmid to be propagated in the yeast cell. The 2-μm segment includes only parts of other genes, such as FLP, that encode partial, inactive proteins. The vectors also include sequences from pBR322 that enable the plasmid


to replicate and be maintained in E. coli. This is convenient for an investigator, since initial cloning can be done using the bacterial host. The YEp plasmids also include selectable markers for both bacteria (e.g., Amp) and yeast (e.g., URA3). YCp plasmids (yeast chromosomal plasmids; Fig. 8) were constructed from DNA elements that control chromosomal DNA replication and partitioning in yeast. Autonomously replicating sequences (ARS, 125 bp) are the origins of replication for yeast chromosomes. Plasmids containing an ARS are replicated during the cell cycle. Insertion of a centromere sequence (CEN, 100 bp) enables the plasmid DNA to be partitioned equally between mother and daughter cells during mitosis and meiosis, ensuring plasmid stability. These plasmids are maintained at one to two copies per cell. YEp and YCp plasmids also contain a bacterial plasmid origin of replication and antibiotic resistance gene, for cloning and maintenance in bacterial cells, and a selectable marker gene for use in yeast. These are generally genes that encode biosynthetic enzymes (e.g., LEU2, HIS3, URA3, TRP1). The plasmid is introduced into a yeast host cell with a chromosomal mutation in one of these genes, making the cell unable to grow in medium lacking, e.g., leucine, histidine, uracil, or tryptophan. The plasmid-borne gene complements the chromosomal mutation and enables the plasmid-containing cell to grow in this medium.
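The selection by complementation described above follows a simple logic that can be sketched in code. The example below is a simplified illustration: the function name and the set-based representation of genotype and medium are assumptions, and the gene-to-nutrient pairing follows the order given in the text.

```python
# Marker genes named in the text and the nutrient each one allows the cell to make.
MARKER_NUTRIENT = {
    "LEU2": "leucine",
    "HIS3": "histidine",
    "URA3": "uracil",
    "TRP1": "tryptophan",
}


def can_grow(host_mutations: set, plasmid_markers: set, medium_lacks: set) -> bool:
    """Predict growth of a yeast strain on a dropout medium.

    host_mutations: marker genes that are mutated in the host chromosome.
    plasmid_markers: functional marker genes carried on the plasmid.
    medium_lacks: nutrients omitted from the medium.
    """
    for gene, nutrient in MARKER_NUTRIENT.items():
        needs_synthesis = nutrient in medium_lacks
        chromosomal_copy_defective = gene in host_mutations
        complemented = gene in plasmid_markers
        if needs_synthesis and chromosomal_copy_defective and not complemented:
            return False
    return True


if __name__ == "__main__":
    host = {"URA3", "LEU2"}  # hypothetical host with ura3 and leu2 mutations
    # A URA3-bearing plasmid (such as YEp24) on medium lacking uracil:
    print(can_grow(host, {"URA3"}, {"uracil"}))
    # The same host without a plasmid on the same medium:
    print(can_grow(host, set(), {"uracil"}))
```

Only cells that have taken up the plasmid are predicted to grow on the dropout medium, which is what makes the biosynthetic gene usable as a selectable marker.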

Plasmid Cloning Vectors, Fig. 7 Map of the 2-μm plasmid from yeast. The 599 bp inverted repeat sequences are in gray. FLP protein catalyzes site-specific recombination at the two FLP recombination target (FRT, blue) sequences, which interconverts the A and B forms. The origin of replication (ARS) is in red. Other genes are discussed in the text. Arrows indicate the direction of transcription (arrowhead is 3-prime end). Map is based on sequence and annotation in GenBank (Acc. No. J01347.1)

Plasmid Cloning Vectors, Fig. 8 Plasmid cloning vectors used in yeast. (a) YEp24. Thick circle indicates DNA from the 2-μm plasmid (bp # 1–2,247; see Fig. 7). Base pairs 2,244–2,278 and 3,438–7,769 are from pBR322 (“pBR-Ori” indicates the RNA I and RNA II genes and Ori site from pBR322 (see Fig. 3)). The URA3 gene is from yeast (S. cerevisiae). Map is based on sequence and annotation in GenBank (Acc. No. L09156). (b) YCp50. Thick circle is DNA from yeast (including URA3 gene); thin circle is DNA from pBR322. Map is based on sequence and annotation in GenBank (Acc. No. X70276)

Plasmids for Mammalian Cells DNA elements that are maintained as episomes in mammalian cells have been derived from viruses such as the BK virus, papilloma viruses, and Epstein-Barr virus. Like plasmids, these viruses have dsDNA genomes, and they exist in multiple copy number in the nucleus of an infected cell. Human BK virus is a polyomavirus with a 5,153 bp genome. Its replication requires a viral Ori sequence and the viral large T antigen protein. It undergoes a lytic infection in human cells and can be grown as a plasmid in other mammalian cells. BKV vectors include the replication origin and T antigen gene, but other viral genes are removed. These cloning vectors contain a selectable marker for human cells (e.g., neomycin) and a bacterial plasmid Ori and antibiotic resistance gene. They are maintained in multiple copies (20–40 or 75–120, depending on the cell type) (Van Craenenbroeck et al. 2000), and they do not integrate into the chromosome of the host. They lack genes required to produce infectious virus, so they are simply maintained as a plasmid once introduced into a eukaryotic cell. Papilloma viruses (e.g., bovine papilloma virus (BPV)) require an Ori sequence and the E1 and E2 proteins for their replication. Epstein-Barr virus (EBV) has a linear, 172 kb dsDNA genome. It has two origins of replication – OriP, for replication as a latent virus in the infected cell, and Ori Lyt, for replication during lytic growth. EBV vectors contain OriP, the gene encoding the viral EBNA1 protein, selectable markers for both eukaryotic and bacterial cells, and a bacterial plasmid Ori (Yu et al. 2009). They are present at 5–100 copies in a eukaryotic cell.

Cross-References ▶ Artificial Chromosomes ▶ Blue/White Selection ▶ Plasmid Incompatibility ▶ Selection with Antibiotics

References
Betlach M, Hershfield V, Chow L, Brown W, Goodman H, Boyer HW (1976) A restriction endonuclease analysis of the bacterial plasmid controlling the EcoRI restriction and modification of DNA. Fed Proc 35:2037–2043
Bolivar F, Rodriguez RL, Betlach MC, Boyer HW (1977a) Construction and characterization of new cloning vehicles. I. Ampicillin-resistant derivatives of the plasmid pMB9. Gene 2:75–93
Bolivar F, Rodriguez RL, Greene PJ, Betlach MC, Heyneker HL, Boyer HW, Crosa JH, Falkow S (1977b) Construction and characterization of new cloning vehicles. II. A multipurpose cloning system. Gene 2:95–113
Cavalli LL, Lederberg J, Lederberg EM (1953) An infective factor controlling sex compatibility in Bacterium coli. J Gen Microbiol 8:89–103
Clark JM (1988) Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res 16:9677–9686
Cramer WA, Cohen FS, Merrill AR, Song HY (1990) Structure and dynamics of the colicin E1 channel. Mol Microbiol 4:519–526
Firth N, Ippen-Ihler K, Skurray RA (1996) Structure and function of the F factor and mechanism of conjugation. In: Neidhardt FC (ed) Escherichia coli and Salmonella. ASM Press, Washington, DC, pp 2377–2401
Futcher AB (1988) The 2 micron circle plasmid of Saccharomyces cerevisiae. Yeast 4:27–40
Ghosh SK, Hajra S, Paek A, Jayaram M (2006) Mechanisms for chromosome and plasmid segregation. Annu Rev Biochem 75:211–241
Kim UJ, Shizuya H, de Jong PJ, Birren B, Simon MI (1992) Stable propagation of cosmid sized human DNA inserts in an F factor based vector. Nucleic Acids Res 20:1083–1085
Kornberg A, Baker TA (1991) DNA replication, 2nd edn. W. H. Freeman, New York
Lederberg J (1952) Cell genetics and hereditary symbiosis. Physiol Rev 32:403–430
Lederberg J (1998) Plasmid (1952–1997). Plasmid 39:1–9
Lin-Chao S, Chen WT, Wong TT (1992) High copy number of the pUC plasmid results from a Rom/Rop-suppressible point mutation in RNA II. Mol Microbiol 6:3385–3393
Mead DA, Pey NK, Herrnstadt C, Marcil RA, Smith LM (1991) A universal method for the direct cloning of PCR amplified nucleic acid. Biotechnology (NY) 9:657–663
Norrander J, Kempe T, Messing J (1983) Construction of improved M13 vectors using oligodeoxynucleotide-directed mutagenesis. Gene 26:101–106
Novick RP (1987) Plasmid incompatibility. Microbiol Rev 51:381–395
Ruther U (1980) Construction and properties of a new cloning vehicle, allowing direct screening for recombinant plasmids. Mol Gen Genet 178:475–477
Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor
Sherratt DJ (1974) Bacterial plasmids. Cell 3:189–195
Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, Simon M (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci U S A 89:8794–8797
Studier FW (2005) Protein production by auto-induction in high density shaking cultures. Protein Expr Purif 41:207–234
Studier FW, Moffatt BA (1986) Use of bacteriophage T7 RNA polymerase to direct selective high-level expression of cloned genes. J Mol Biol 189:113–130
Studier FW, Rosenberg AH, Dunn JJ, Dubendorff JW (1990) Use of T7 RNA polymerase to direct expression of cloned genes. Methods Enzymol 185:60–89
Van Craenenbroeck K, Vanhoenacker P, Haegeman G (2000) Episomal vectors for gene expression in mammalian cells. Eur J Biochem 267:5665–5678
Van Melderen L (2002) Molecular interactions of the CcdB poison with its bacterial target, the DNA gyrase. Int J Med Microbiol 291:537–544
Vieira J, Messing J (1982) The pUC plasmids, an M13mp7-derived system for insertion mutagenesis and sequencing with synthetic universal primers. Gene 19:259–268
Yu J, Hu K, Smuga-Otto K, Tian S, Stewart R, Slukvin II, Thomson JA (2009) Human induced pluripotent stem cells free of vector and transgene sequences. Science 324:797–801

Plasmid Family ▶ Plasmid Incompatibility

Plasmid Genomes, Introduction to
Christopher M. Thomas1 and Laura S. Frost2
1 Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, UK
2 Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada

Synopsis Plasmids are defined as extrachromosomal DNA that can replicate autonomously, a feature that has been used in plasmid vector construction and in plasmid classification schemes. Replication functions can also be used to identify and classify plasmid sequences from genome sequencing projects using modern bioinformatic analysis algorithms. The replication genes form the replicon and are usually found clustered together on the plasmid along with auxiliary stable inheritance functions. These include partitioning genes in low copy number plasmids or multimer resolution functions in high copy number plasmids. Addiction systems that prevent plasmid loss complete this suite of core replication and maintenance functions, known as the “backbone” genes. A secondary set of core plasmid functions is the mobilization and mating pair formation components that promote conjugative plasmid and chromosomal transfer between cells. The core functions are controlled by regulatory elements that coordinate expression of some or all of these disparate plasmid features. The remainder of the plasmid genome may contain accessory or “cargo” genes that often define key phenotypes of their hosts such as, most famously, antibiotic resistance. These genes are often introduced onto the plasmid by transposable elements. The evolutionary history of the plasmid can be discerned from the sequence and pattern of this DNA acquisition. A knowledge and understanding of plasmid genomes is important both for building new plasmids using synthetic biology and for dealing with plasmid-mediated horizontal gene transfer.

Introduction Since whole books have been written on basic plasmid biology (Funnell and Phillips 2004; Thomas 2000), this discussion of plasmids will avoid a standard approach. Instead the focus will be on plasmids as autonomously replicating genomes, such that each aspect of the plasmid is considered from the genomic perspective. Why should plasmid genomes be of interest? The simple answer is that they confer properties on their hosts that have a profound effect on human lives, and they provide useful tools that are vital for modern biotechnology. By definition, plasmids, unlike chromosomes, are not essential for the survival of their host; “curing” the plasmid should not affect its host under benign growth conditions. However, plasmids often carry genes that allow survival under challenging circumstances such as in the presence of antibiotics or environmental pollutants. There are an increasing number of elements that look remarkably like plasmids but carry essential genes and, therefore, are not formally plasmids (see ▶ “Plasmids as Secondary Chromosomes”). Plasmids usually double in number during the growth cycle of their host and seem to replicate in step with the chromosome

even though they may be present at a copy number higher than that of the chromosome. Thus, they are chromosome-like but have the useful properties of being free to be lost or to change through mutation, thereby providing a genetic locus where genes may evolve faster than in the chromosome; they have been described as nature’s scratchpad for evolution (B. M. Wilkins, personal communication; Lawley et al. 2004). Plasmids are also able to spread from host to host by conjugation, transformation, or transduction, thereby promoting horizontal gene transfer. Genes that become linked to plasmids are thus more likely to spread, or be “mobilized,” and the collective set of genes carried in this way is referred to as the “mobilome” as opposed to the chromosomal genes that are less mobile and have been referred to as the “stay-at-‘ome” (see ▶ “Metamobilomics – The Plasmid Metagenome of Natural Environments”). Thus, plasmids (the “plasmid genome” collectively) are a special part of the genome that responds to selective pressure in remarkable ways as illustrated by the dramatic spread of antibiotic resistance genes over the last 60 years or the evolution of catabolic pathways in bacteria isolated from polluted water and soil. These characteristics have made plasmids an important basis for genetic manipulation (GM) that has revolutionized molecular biology. Plasmids, generally, do not associate with one specific host but are found within separate species or genera depending on their host range, which can be broad or narrow. This poses a problem for submission to sequence databases, which are interested in clear-cut classification schemes (see ▶ “Plasmids, Naming and Annotation of”). In this sense, each plasmid forms a separate and complete entity, even if it cannot exist on its own, and thus is worthy of study in its own right.
Many researchers fail to appreciate plasmids in their entirety, but those who do think about plasmids marvel at the beauty of their compact yet versatile natures as this collection of articles will illustrate. Nor is this knowledge without application – not only does it underpin GM technology but it will also contribute to novel ways to combat plasmid-mediated antibiotic resistance and virulence mechanisms. This introduction will explain the

basics and lead the reader through modern methodologies (▶ “Synthetic Plasmid Biology”) and end up with applications of manipulating plasmid genomes that will justify the effort that plasmid biologists bring to their subject. Since plasmids are, by definition, extrachromosomal genetic elements that replicate autonomously in step with their host and are thus inherited vertically from generation to generation, all other properties or functions encoded by plasmids are additional, although they may contribute to the success of the plasmid in critical ways. The core traits or “plasmid backbone” can be divided into two types: those genes that ensure vertical inheritance (replication, copy number, partitioning, stability) and are essential and those genes that are important but inessential and code for horizontal spread. The genes that are obviously outside the backbone region and that may contribute a phenotypic advantage to their host are referred to as the “genetic cargo.”

Development of the Field The appreciation of plasmids as a distinct part of the bacterial genome was first established through genetic studies in the 1950s, and the idea that there are distinct plasmid types that can be grouped by incompatibility arose from genetic approaches during the 1960s. Their physical structure was demonstrated when methods such as equilibrium density and velocity gradient centrifugation showed the presence of discrete DNA species that correlated with the presence of these elements in the late 1960s and early 1970s. Electron microscopy using the Kleinschmidt technique demonstrated the circular nature and size of these elements, and heteroduplex analysis, among other techniques, made it possible to compare the differences and similarities between related plasmid genomes. The production of purified restriction enzymes and the development of gel electrophoresis combined with recombinant DNA technology in the 1970s made it possible to build physical maps of these genomes and dissect them genetically to establish the function of particular genes and operons. DNA sequencing

in the late 1970s and 1980s and associated genetic analysis allowed the nature of the components of these genomes to be established. At the same time the complete sequences of a number of small plasmid genomes were established, but it was not until the 1990s that the first complete genome of a conjugative plasmid was assembled. The 1980s and 1990s saw the progressive understanding of the molecular biology of each plasmid core function – replication, transfer, multimer resolution, addiction, and finally partitioning. In parallel, the discovery of the multiple layers that can coordinate these functions led to a greater appreciation of plasmid genomes as finely tuned systems. The development of fluorescence microscopy in the 1990s to track subcellular organization of the bacterial genome allowed the plasmid genomes and their movement to be visualized. The 2000s saw the increasingly rapid expansion in the numbers of fully sequenced plasmid genomes which has allowed analysis of the evolution of their core function as well as a better appreciation of their role in the dynamic nature of the bacterial genome.

The Nature of Plasmid Genomes Plasmid genomes vary in size from as little as 1 kb, just big enough to encode one protein, to over 1,000 kb. Examples of plasmids across this size range are illustrated in Fig. 1. The largest are thus bigger than the smallest bacterial genomes (about 400 kb). Like chromosomes, they can be linear but are more frequently circular. Since plasmid isolation techniques often depend on the circularity of the plasmid molecules, it may be that we have missed a significant number of linear plasmid genomes in the past. Fortunately, sequencing whole bacterial genomes is now comparable in cost to laboriously isolating and sequencing plasmid DNA and should reduce this bias. For small plasmids, the copy number tends to be high due to a simple, stable inheritance strategy (many copies, randomly segregated), whereas for large plasmids, the copy number tends to be low. Thus, the percentage of the DNA in the bacterial cell that is derived from

Plasmid Genomes, Introduction to, Fig. 1 Examples of modular organization in plasmid genomes. Core to every genome are replication and maintenance functions (cyan) consisting of replication (rep, essential), partitioning (par), multimer resolution (mrs), and addiction functions (psk post-segregational killing). Many but not all plasmids have transfer and mobilization functions (yellow) – smaller plasmids tend to have mobilization, mob, functions that allow them to be mobilized by self-transmissible plasmids that have both transfer origin, oriT, DNA processing, and mating pair formation functions, tra/trb. Plasmids also carry a variety of cargo functions that may be associated with transposition and recombination functions (blue): insertion sequences, IS; transposons, Tn; and the more specialized PRE function (plasmid recombination) that promotes co-integration with other plasmids. Plasmid F has three replicons, hence the numbers IA, IB, and II. Phenotypic determinants can include almost any type of function but commonly include resistance determinants (red; aph aminoglycoside phosphotransferase, bla beta lactamase, str streptomycin resistance, sul sulfonamide resistance, tet tetracycline resistance) and bacteriocin production, cea, and immunity, imm (lime green). Other genetic loci not yet defined: cer ColE1 Resolution, eex entry exclusion (blocks entry of specific plasmid types by conjugative transfer by preventing the formation of productive mating pairs), rom repressor of maintenance. Note that yellow arrows indicate position and direction of transfer from oriT. These maps are diagrammatic representations and are not to scale

plasmids tends to be about 1–5%. There may be a barrier in many bacteria that prevents this percentage from increasing because of the undue metabolic burden on the host. The smallest plasmid genomes tend to be composed almost exclusively of replication and essential maintenance functions, and, providing they do not have a significant effect on the host properties (e.g., reduction in growth rate), they can be regarded as archetypal selfish DNA elements that exist for their own propagation rather than the benefit of their host. These plasmids are called “cryptic” which simply refers to the absence of a major phenotypic effect on the host. Larger plasmids may have more complex backbones of core functions including replication and maintenance functions and, possibly, mobilization or transfer functions as well. The maximum size of this plasmid core varies among bacteria. For example, in Streptomyces species, transfer between one strain

and another seems to be much simpler than the mating apparatus needed for conjugative transfer between Enterobacteriaceae, with core plasmid functions, including transfer functions, as little as 10 kb in size. By contrast, among the known IncP-1 plasmids, there are examples containing backbone functions of 50 kb although we do not understand how the complexity of these backbones contributes to the success of the plasmid. Another gray area is defining the plasmid backbone of cointegrates where two or more plasmids fuse together resulting in plasmids with multiple copies of core functions as in the case of the large family of F-like plasmids. Plasmid genomes that contain large cargos of additional genes generally tend to cluster their core functions together in one or more defined regions. In the case of IncP-1 plasmids, although there is more than one cluster of core functions, it is very clear that there are only a few distinct

Plasmid Genomes, Introduction to, Fig. 2 Evolution of a plasmid genome. A plasmid comes into existence as a self-replicating DNA element but should rapidly acquire or evolve a mechanism for controlling initiation to limit the metabolic load that it imposes on its host. A mechanism is necessary to counter the effect of recombination between identical copies of the plasmid that can result in the accumulation of multimers (dimer catastrophe). Copy number cannot fall below a level sufficient to allow random partitioning to ensure stable inheritance unless the plasmid acquires an active partitioning system. This allows the plasmid copy number to drop potentially to one copy per chromosome, and under these circumstances the plasmid can grow to hundreds or thousands of kb with minimized consequences. Thus, there is an inverse correlation between plasmid size and plasmid copy number, and only low copy number plasmids tend to be self-transmissible due to the large number of genes required for a functional transfer region. Exceptions are illustrated by the IncX plasmids like R6K that have a copy number of >15 copies/chromosome. Because known plasmids are the result of many recombination events, plasmids in which all the basic functions are clustered tend to be selected because clustering decreases the separation/loss of any of these core-acquired functions

places where additional genes can be inserted. Whether cargo genes were relatively recently acquired or not, the clustering of the core genome is likely to be due to repeated shuffling of the genome functions due to acquisition and loss of DNA segments. Certain rules appear to apply to the acquisition of additional cargo genes. Firstly, insertions must not disrupt the function of the core; otherwise, the resulting plasmid would be lost rapidly. Secondly, insertions that do not interfere with the core but are a burden to the host will also be lost since the host is less competitive and is eventually diluted out. Therefore the first insertion must occur at a safe location (from the plasmid’s perspective) that does not have deleterious effects. Once this has occurred, then subsequent

insertions are likely to occur in the same location, partly because transposition often occurs preferentially into other transposable elements. This expanded cargo region then becomes a bigger target for future insertion events (Fig. 2). While the cores of some plasmids tend to be remarkably resistant to genetic drift as deduced by sequence alignments of isolates from diverse locations and dates, the rest of the genome can be remarkably malleable. The classic integration of F into the E. coli chromosome by recombination between IS elements to form an Hfr strain and the subsequent excision by recombination between different pairs of IS elements to create an F plasmid carrying a segment of the chromosome illustrates this flexibility. Interestingly, F plasmids

can maintain large segments of the chromosome in a stable manner with little rearrangement of its own or the host’s DNA (at least over measurable periods of time). In the Rhizobiaceae, there appears to be a constant rearrangement of the plasmid via the acquisition and loss of host sequences as has been demonstrated by the polymerase chain reaction (PCR) using primers for the unique sequences flanking all copies of a particular IS element in the genome. This demonstrated that essentially all possible integration and excision events that could happen by homologous recombination between IS elements did happen, providing a large enough sample size was used (Lozano et al. 2010). Thus, the plasmid genome can be envisaged as a core that undergoes gain and loss of blocks of genes from the cellular environments it finds itself in, while selective pressure preserves those products that are advantageous to the existing bacterial community and to the plasmid itself.
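The link between copy number, random segregation, and the share of cellular DNA contributed by a plasmid can be made concrete with a little arithmetic. The sketch below rests on simplifying assumptions that are not taken from the text: each plasmid copy is assumed to go to either daughter cell independently with probability 1/2, and the 4,600 kb chromosome size is an assumed, roughly E. coli-like value. The function names are hypothetical.

```python
def loss_probability(copies_at_division: int) -> float:
    """Probability that one of the two daughter cells receives no plasmid,
    assuming each copy segregates independently with probability 1/2."""
    if copies_at_division < 1:
        return 1.0
    # All copies go to daughter A, or all copies go to daughter B.
    return 2.0 * 0.5 ** copies_at_division


def plasmid_dna_fraction(size_kb: float, copies_per_chromosome: float,
                         chromosome_kb: float) -> float:
    """Fraction of total cellular DNA contributed by the plasmid."""
    plasmid_kb = size_kb * copies_per_chromosome
    return plasmid_kb / (plasmid_kb + chromosome_kb)


CHROMOSOME_KB = 4600  # assumed chromosome size, for illustration only

for copies in (1, 2, 10, 20, 40):
    print(copies, loss_probability(copies))

# A small high copy number plasmid and a large single-copy plasmid
print(plasmid_dna_fraction(5, 40, CHROMOSOME_KB))   # 200 / 4800
print(plasmid_dna_fraction(100, 1, CHROMOSOME_KB))  # 100 / 4700
```

Under these assumptions, a plasmid at one or two copies is lost from a large share of daughter cells at every division unless it is actively partitioned, whereas at a few tens of copies random segregation alone makes loss rare. Both example plasmids contribute a few percent of the cellular DNA, in line with the range quoted above.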

Plasmid Replication Systems Since the fundamental property that defines a plasmid is its ability to replicate autonomously, it is not surprising that the most commonly used system for classifying plasmids is based on their replication system(s) and how they are related to each other. This encyclopedia will start from the premise that understanding the replication system is key to understanding plasmid genomes and that all contributions will be built on this foundation. This not only allows for plasmid classification but also provides an understanding of how the different replication systems impose constraints on plasmid evolution. For example, rolling circle replication is generally confined to smallish, high copy number plasmids, whereas large, low copy number plasmids invariably use particular types of theta or linear replication systems. An important concept in thinking about replication and plasmid genomes is the replicon. As Jacob, Brenner, and Cuzin originally defined it, the replicon is the unit of DNA that is replicated (Jacob et al. 1963). Thus, a bacterial chromosome, which generally contains a single, bidirectional

replication origin from which replication proceeds in both directions to a terminus, normally constitutes a single replicon. But “replicon” has also come to have a second meaning – the genetic information required for a DNA molecule to replicate autonomously within a bacterial cell. Since there are many genes encoding replication functions dotted around the chromosome, “replicon” has increasingly become associated with the genetic region that encodes the origin of replication (oriV) and the protein that specifically activates it. This is certainly the way that it is used with respect to plasmids, and this is important because although there are many plasmids with a single replicon in this sense, there are also many plasmids that contain multiple replicons. At this point it is worth considering the issue of copy number. While prokaryotic genomes must duplicate prior to cell division, the idea of a baby cell with a single copy of the chromosome is likely to be an oversimplification for many bacteria. Indeed, for some bacteria such as Deinococcus radiodurans, the existence of multiple copies per cell is essential and allows the chromosome to be rebuilt by recombination between the random fragments created by ionizing radiation. In rapidly growing bacteria, there is a gradient of gene dosage moving from oriC with the highest dosage to the terminus with the lowest because the genes nearest the replication origin will have been replicated at least once if not twice before the previous round has been completed. Against this backdrop, determining the copy number of a plasmid, which is normally defined as copies per chromosome equivalent (rather than number of copies of the plasmid per bacterium), is not straightforward. Originally copy number was determined by labeling bacterial DNA by growth in the presence of radioactive precursors and then separating plasmid DNA from chromosomal DNA by buoyant density (normally CsCl) centrifugation. 
It can also be estimated by Southern blotting of total DNA with plasmid and chromosome markers but is now probably most accurately assessed by comparing read depth after high-throughput sequencing. Whatever method is used to determine copy number, it is clear that each plasmid (and groups of closely

related plasmids) has a characteristic copy number that is controlled by a variety of types of circuitry (del Solar and Espinosa 2000). Replication is stimulated when the copy number per cell falls below this level and is shut off when copy number rises above this level. Thus, the plasmid genome is more than just a single copy of the plasmid but is comprised of the complement of plasmid molecules that normally exist together in a bacterial cell. Many of the properties of a plasmid, for example, the range of auxiliary stable inheritance functions it carries or the level of antibiotic resistance it confers on its host, are characteristic of its copy number per cell rather than just the genes encoded on a single plasmid. Plasmids with multiple replicons are most likely to have arisen by the joining of two plasmids (co-integration – the integration of two plasmids into one another) due to recombination or transposition processes that work constantly in the cell. This is followed by a selective event that allows the co-integrant to survive at the expense of its neighbors. One such event could be the cotransfer of the phenotypic determinants carried by both plasmids, which could benefit from cointegrate formation. When the replicons involved have similar properties (particularly in relation to copy number), as in the case of certain plasmids of the IncF or IncH families, these appear to coexist stably in the same DNA molecule. In cases where there are both high and low copy number replicons, instability results. When the two plasmids within the cointegrate are relatively small, the higher copy number replicon dominates – its replication is not suppressed until the copy number has risen to a level higher than that of the lower copy number plasmid that has already stopped replicating. 
When the low copy number replicon is part of a large genome, it may be impossible for the high copy number replicon to take over completely, and this may create problems that favor bacteria in which the two plasmids have split apart again. There may also be instability when one of the replicons generates single-stranded replication intermediates as occurs, for example, in rolling circle replication (RCR) replicons or sequential, bidirectional theta replicons as found in the IncQ plasmids (see

below). The single-stranded DNA may stimulate recombinational breakdown of the cointegrate. Stable multi-replicon plasmid genomes may therefore require that one of the replicons is inactivated. In the case of certain large conjugative plasmids from Staphylococcus aureus, capture of the resistance determinants of small high copy number plasmids occurs through transpositional inactivation of the rep gene of the smaller, high copy number plasmid (Berg et al. 1998; Jensen et al. 2010). Thus, the interplay between different replicons is a key feature that underpins our understanding of the structure and composition of plasmid genomes. It is important to think about the nature of the plasmid replicons that exist and the sorts of plasmid genomes with which they are associated (Fig. 3). As in the case of phages like lambda, we can divide plasmid replicons into those with intermediates that resemble the Greek letter theta (▶ Theta-Replicating Plasmids, Large), involving separation of the DNA strands at the origin followed by synthesis of a short RNA molecule to prime leading-strand synthesis, and those with intermediates that resemble the Greek letter sigma that involve nicking of one DNA strand at the origin to create a free 3′ end that primes leading-strand synthesis. The latter plasmids are also called rolling circle replication (RCR) plasmids (del Solar and Espinosa, Rolling circle replicating plasmids). Theta replicons often use protein-based replication initiation whereby the Rep protein binds to recognition sequences at oriV. These sequences are often multiple short repeats in either direct or inverted orientations (or both). These repeats or “iterons” confer many important properties on plasmids including incompatibility (Thomas, Plasmid incompatibility) and copy number.
Theta replicons can be further divided into those that use a helicase to unwind the origin followed by a specific primase to create an RNA primer that initiates leading-strand synthesis and those that employ RNA polymerase (RNAP) to create a primer from which DNA polymerase I (DNAP I) can start leading-strand synthesis, which switches to DNA polymerase III (DNAP III) as the replication bubble increases in size.
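The read-depth approach to estimating copy number mentioned earlier in this section reduces to a ratio of sequencing depths. The following is a minimal sketch: the function name and the use of medians are assumptions made for the example, and the depth values are invented for illustration, not measurements. Because gene dosage in growing cells falls from oriC to the terminus, chromosome depth should be sampled across the whole chromosome rather than at a single locus.

```python
from statistics import median


def copy_number_from_depth(plasmid_depths, chromosome_depths):
    """Estimate plasmid copies per chromosome equivalent as the ratio of
    median read depth on the plasmid to median read depth on the chromosome.

    Both arguments are sequences of per-window read depths (hypothetical
    inputs, e.g. computed beforehand from an alignment)."""
    chromosome_median = median(chromosome_depths)
    if chromosome_median <= 0:
        raise ValueError("chromosome depth must be positive")
    return median(plasmid_depths) / chromosome_median


# Invented example values, for illustration only
plasmid = [310, 295, 305, 290]
chromosome = [20, 19, 21, 20, 22]
print(copy_number_from_depth(plasmid, chromosome))  # 15.0
```

The median is used here instead of the mean so that a few windows with unusual coverage, such as repeated elements, have less influence on the estimate. That is a design choice of this sketch rather than an established standard.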


Plasmid Genomes, Introduction to

Plasmid Genomes, Introduction to, Fig. 3 Replication strategies and associated genomic functions. (a) Theta replication initiated by a transcript. ColE1 exemplifies replicons that do not require a plasmid-encoded protein: the primer for leading-strand initiation is produced by processing of an approximately 500 nt transcript (RNA II) created by RNA polymerase. RNA II is folded into a productive form, and this is repressed by a counter-transcript (RNA I) whose concentration increases in proportion to plasmid copy number and eventually shuts off replication at a copy number that is characteristic of the plasmid. The small, dimeric protein Rom, encoded by the rom gene, binds to the complex between RNA I and RNA II and potentiates repression. The primer is extended by DNAP I until the single-stranded DNA bubble is large enough to allow DNAP III and the primosome to enter. (b) Replication of rolling circle replication (RCR) plasmids involves a replication protein that binds to a recognition sequence (generally an inverted repeat that can be extruded to form a hairpin). It then nicks the adjacent sequence, becoming covalently attached to the 5′ end and leaving a 3′ end that can prime replication, thereby displacing the existing strand until the replication fork has progressed all the way around the plasmid genome. Rep once again recognizes and nicks the origin, reforming the double-stranded circle and releasing a single-stranded circle. One Rep subunit is left attached to a short oligonucleotide that is a product of this process so that it is inactivated and is not able to participate in initiation again. Control is generally exerted by a counter-transcript to the rep gene. The replication origin (nick site) may be within the rep gene or adjacent to it. (c) Theta replication driven by a Rep protein. The gene encoding the replication protein Rep can be autogenously, negatively controlled by Rep binding to its promoter region, by counter-transcript RNA regulating transcription or translation of the rep gene, or by an additional repressor that can be either closely linked or encoded independently. The Rep protein acts positively at the origin to attract either additional plasmid proteins (e.g., for IncQ plasmids like RSF1010) or host proteins that unwind the origin and then prime leading-strand synthesis. Replication of such plasmids can be regulated by adjusting the supply of Rep or by “handcuffing” two related replicons by Rep dimers, monomeric Rep being the form that normally activates oriV

Primase-based replicons can be further subdivided into the majority that recruit the host initiation machinery for both leading- and lagging-strand syntheses and thus function like minichromosomes and those that are relatively independent, encoding their own leading-strand initiation system but providing no system for lagging-strand synthesis. Plasmids that have the latter sort of replicons, termed sequential bidirectional replicons, as well as those using RCR, generate substantial single-stranded DNA intermediates. This is because either the lagging strand is synthesized as a continuous process from the complementary synthesis that originated at the origin or the lagging-strand origin is activated by the formation of an in cis double-stranded DNA hairpin which is recognized by RNAP and used to create a primer. Either way, neither of these processes is suited to large genomes since long segments of single-stranded DNA induce the SOS response, which increases recombinational activity and often results in deletions or rearrangements, particularly if the DNA contains repeated or inverted DNA segments.

It is worth explicitly mentioning replication of linear plasmid genomes, which are most frequently encountered in Actinomycetes and Borrelia species, since they face specific problems with replication of their ends. Although some phages can initiate replication from the ends of their genomes, linear plasmids generally initiate replication from one or more internal sites (Zhang et al. 2010) using mechanisms similar to those for theta replicons and then deal with replication of their ends in specific ways. In Streptomyces plasmids, the 5′ ends of the linear genome are protected by covalent attachment of a terminal protein, whereas the 3′ ends form an extension of 250–300 bp that adopts a protective, complex, secondary structure based on multiple inverted repeat sequences. This allows the 3′ end to prime replication back on itself using a hairpin structure. A second plasmid-encoded protein, Tap, encoded upstream of tpg (the terminal protein gene) is thought to bind to specific recognition elements in the 3′ sequence and recruit the terminal protein which then cleaves the double-stranded DNA, remaining attached to the 5′ end and releasing the 3′ end (Huang et al. 2007). The bacteriophage N15, whose prophage state is a plasmid, is a double-stranded DNA molecule with covalently closed and looped back ends that becomes a single-stranded circle when denatured (Ravin 2011). The linear plasmids of Borrelia species have similar, covalently looped back ends with a telomere resolvase produced to deal with the dimeric daughter molecules (Chaconas and Kobryn 2010).

Auxiliary Stable Inheritance Functions

Following the logic described above, specific types of replication systems will be affected by associated stable inheritance functions. For instance, large plasmids cannot be maintained at high copy number because this would impose an unacceptable metabolic load on their hosts. Only plasmids that can control their copy number will be able to exceed a certain size limit. This has been estimated by analyses of the sizes of sequenced genomes in conjunction with experimentally determined copy number estimates and functional classification of plasmid replicons.
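The link between copy number and stable inheritance can be made concrete with a short calculation. The sketch below assumes purely random segregation, in which each physically separate plasmid unit is assigned independently to either daughter cell with probability 1/2; the function name and the copy numbers printed are illustrative choices, not values taken from a particular plasmid.

```python
def loss_probability(units: int) -> float:
    """Chance that a dividing cell produces one plasmid-free daughter.

    Assumption: each of `units` independently segregating plasmid units
    goes to either daughter with probability 1/2 (random segregation).
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    # Either all units go to daughter A, or all go to daughter B.
    return 2.0 * 0.5 ** units


if __name__ == "__main__":
    # Illustrative copy numbers only.
    for n in (2, 4, 10, 20, 40):
        print(f"{n:>3} units: P(plasmid-free daughter) = {loss_probability(n):.2e}")
```

Under this assumption the loss probability halves with every additional segregating unit, which is why random segregation may suffice at high copy number while a plasmid with only a few copies per cell needs additional help to be inherited stably.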


Identifying additional stable inheritance functions such as active partitioning, multimer resolution, and post-segregational killing can provide the basis for predicting a plasmid’s properties or behavior and understanding its evolution. Active partitioning. Active partitioning systems, by definition, function to distribute the copies of a plasmid in a better-than-random way to either side of the plane of cell division, either throughout the cell cycle or just prior to cell division. The presence of genes in a plasmid genome that are known to, or are predicted to, encode active partitioning functions indicates that the plasmid is or was of a low enough copy number to require this assistance. One way of thinking about plasmid evolution is that a plasmid could initially seek to maximize its copy number thereby out-replicating its genomic context. Thus, the copy number would rise initially to confer maximum stability, but possibly this would put an unnecessary energetic burden on its host. As selective pressure favors the acquisition of additional phenotypic determinants, this burden would increase to unacceptable levels leading to plasmid instability. Acquisition of a partition system can therefore allow the plasmid to escape from this situation, reducing the copy number by changing the strength of the negative feedback loops that control replication, without sacrificing segregational stability (Fig. 2). A number of families of partitioning systems have been identified and studied and have been reviewed extensively (Schumacher 2012). Suffice to say that they normally consist of a cis-acting site, termed the centromere-like region, because of the obvious formal similarity to the region of eukaryotic chromosomes where the spindles attach, and one or more proteins that bind to this region that are able to direct the movement of the plasmid DNA molecule(s) in the cell after (or possibly even before) replication has taken place. 
These families are based on similarity of the partitioning motor proteins with actin, with Walker ATPases, and with FtsZ, the cell division protein. Multimer resolution. A second type of auxiliary function is represented by the multimer resolution systems (mrs) that have evolved to ensure


that each copy of the plasmid is physically separate so that at cell division random segregation can play a role in ensuring that both daughter bacteria have at least one copy of the plasmid (Hallett et al. 2004). Such mrs functions are found on both low copy number (e.g., in a plasmid genome rsfF-D functions of plasmid F) and high copy number plasmids (e.g., the cer-xer system of ColE1). In the former case multimer resolution can be viewed as preventing the negative effect of all copies of the plasmid becoming part of the same DNA molecule (catenated) and being unable to be split between the two daughter cells, known as the dimer catastrophe (Field and Summers 2011). In the case of high copy number plasmids, it provides a positive function by ensuring that there are enough physically separate segregating units to allow efficient segregation. One point that has not yet been resolved satisfactorily is why the observed number of plasmid foci for low copy number plasmids is lower than expected on the basis of estimated copy number. This raises the possibility that low copy number plasmids form physically linked clusters, containing two or more plasmid genomes that are actively partitioned. This could be a disadvantage in the absence of an active partitioning mechanism. Plasmid addiction systems. One of the most unexpected plasmid maintenance strategies is the carriage of addiction systems by plasmids which actively reduce host viability if the plasmid is lost (Engelberg-Kulka and Glaser 1999; Mochizuki et al. 2006). These consist essentially of two elements: one is lethal to the host on its own (toxin), whereas the other in some way prevents this lethality either by blocking production of the lethal element or by neutralizing its effect (antidote or antitoxin). The classic examples of these two types are the hok-sok system of plasmid R1 and the CcdA-CcdB system of F. 
In the case of hok-sok, translation of the stable and highly structured mRNA encoding the lethal polypeptide Hok (host killing) is regulated first by folding of the mRNA that sequesters the ribosome binding site and then by the unstable sok antisense RNA (suppressor of killing). The result is a time-delayed production of Hok, which depolarizes membranes and causes loss of energetic function, but only if the plasmid has been lost or transcription has been prevented (Gerdes et al. 1997). The alternative model illustrated by the Ccd system involves a stable toxin, in this case the gyrase poison CcdB, which is normally complexed with the unstable antidote, in this case, CcdA. Again, plasmid loss prevents production of more CcdA, resulting in the rapid depletion of the antidote pool and activation of the toxin (De Jonge et al. 2012). The antidote and toxin do not always need to interact directly as illustrated by restriction-modification (RM) systems, which were recognized by Kobayashi as constituting an addiction system when encoded by plasmids (Kobayashi 2004). If an RM plasmid is lost, then the ability to modify all of the restriction targets is eventually lost as the enzymes are diluted out, whereas the partner restriction enzyme, providing it has a longer half-life, needs to cut at only a few sites in order to have a lethal effect. Examples of currently known addiction systems are listed in Table 1. Such addiction systems are also encoded as part of many bacterial chromosomes indicating that their function cannot simply be to aid plasmid stability. The key discovery is that these functions are generally not bactericidal when switched on but rather bacteriostatic, inducing a dormant state in which the bacteria are described as “persisters” and can survive a variety of stress conditions (Maisonneuve et al. 2011). The possibility that plasmids that carry addiction systems may increase the number of persisters in a population and thus aid the survival of their host is consistent with data recently obtained with bacteria carrying the F plasmid (Tripathi et al. 2012). Being able to recognize the presence of such addiction systems in plasmid genomes is important when attempting to displace or “cure” the plasmid carrying them; this was the key factor in developing an efficient plasmid curing strategy for complex plasmids like those of the F-group.
Attempts to displace F-like plasmids can be problematic because they carry multiple replicons and multiple addiction systems. Even if the function of all the replicons is blocked, addiction systems can either kill the plasmid-free segregants or induce stress responses that elevate the mutation rate.


Plasmid Genomes, Introduction to, Table 1 Examples of plasmid addiction systems

System      Plasmid             Host                     Toxin      Antitoxin   Target

Type I antisense RNA systems
Hok-sok     R1 (IncFII)         Enterobacteriaceae       Hok        Sok RNA     Membrane
pnd         R483 (IncI alpha)   Enterobacteriaceae       PndA       PndB RNA    Membrane
par         pAD1                Enterococcus faecalis    Fst        RNAII       Membrane
srnB        F                   E. coli                  SrnB’      SrnC RNA    Membrane

Type II protein-based systems
Ccd         F (IncFI)           E. coli                  CcdB       CcdA        Gyrase
ParD/Pem    R1/R100 (IncFII)    Enterobacteriaceae       Kid/PemK   Kis/PemI    ss 5′-UA(C/A)-3′ in RNA
Phd-doc     P1                  E. coli                  Doc        Phd         Translation inhibition via 30S subunit
ParDE       RK2 (IncPalpha)     Gram-negative bacteria   ParE       ParD        Gyrase
SegB        pSM1903 (Inc18)     S. pyogenes              Zeta       Epsilon     Inhibits ATP/GTP binding to Zeta, a phosphotransferase
Mvp         pMYS6000            Shigella flexneri        MvpT       MvpA

See (Kobayashi 2004)

This leads to the formation of plasmid-free segregants that are not isogenic with the plasmid-positive parent. Based on knowledge and understanding of the plasmid genome, the solution was the creation of a plasmid, pCURE2, that could interfere with each replicon either by production of an appropriate repressor or titration of an appropriate positive function and that could neutralize each addiction system present. pCURE2 is 100% efficient in a stress-free process and is sufficiently unstable itself that pCURE2-free bacteria can easily be isolated by counterselection against the sacB gene that it carries. This strategy can be used to displace other plasmids but depends on the ability to recognize these addiction systems using bioinformatic analysis (Hale et al. 2010).
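The timing argument behind addiction systems of the stable-toxin, unstable-antidote kind discussed earlier in this section can be illustrated with a toy calculation. The sketch below assumes that, once the plasmid is lost, neither component is replenished, that both pools decay exponentially, and that one antidote molecule neutralizes one toxin molecule; the function name, pool sizes, and half-lives are hypothetical and are not measurements for any real system.

```python
import math


def time_to_free_toxin(antitoxin0: float, toxin0: float,
                       antitoxin_half_life: float,
                       toxin_half_life: float) -> float:
    """Time after plasmid loss at which toxin first exceeds antitoxin.

    Assumptions: no new synthesis after plasmid loss, first-order decay of
    both pools, and 1:1 neutralization of toxin by antitoxin.
    """
    if antitoxin0 <= toxin0:
        return 0.0  # toxin is already in excess
    if antitoxin_half_life >= toxin_half_life:
        return math.inf  # antitoxin never falls below toxin
    k_a = math.log(2) / antitoxin_half_life
    k_t = math.log(2) / toxin_half_life
    # Solve antitoxin0 * exp(-k_a * t) == toxin0 * exp(-k_t * t) for t.
    return math.log(antitoxin0 / toxin0) / (k_a - k_t)


if __name__ == "__main__":
    # Hypothetical values: twofold antitoxin excess, half-lives in minutes.
    print(time_to_free_toxin(2.0, 1.0, 15.0, 60.0))
```

In this simplified picture the delay before the toxin is unmasked depends only on the initial antidote excess and on the difference between the two decay rates, which is consistent with the requirement that the antidote be the less stable partner.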

Conjugative Transfer

The ability to replicate autonomously allows a plasmid to establish itself in a new host without the need for homologous recombination (insertion into an existing genome) providing it can make use of the generic host factors (polymerases,

gyrases, helicases, etc.) needed for replication. When a replicon is capable of functioning in many different bacteria, the plasmid is capable of spreading horizontally into diverse genetic backgrounds and is called a broad host range plasmid. Plasmids that have more stringent requirements for replication in a new host have a more limited, or narrow, host range (Top, Genomic signature analysis to predict plasmid host range). It is not surprising that many plasmids contain complex mechanisms that promote the transfer of DNA between bacterial cells. These conjugative transfer systems create a pore through which single- or double-stranded DNA is passed. Because the most complex conjugative transfer systems involve many genes and require a considerable amount of DNA to encode the whole apparatus, they tend to be found only on larger, lower copy number plasmids. Conjugative transfer requires two different processes: first, the construction of a conjugative pore between the bacteria (mating pair formation or Mpf) and, second, the interaction of the plasmid with the pore and subsequent DNA transfer (Dtr). Pore formation can be a generic process, whereas the


interaction that drives transfer needs to be specific for each plasmid. Many plasmids “borrow” the conjugative pore of large plasmids to transfer themselves and only provide the Dtr genes needed to link the plasmid to the pore and initiate transfer, a process termed mobilization (Mob). Plasmids that are mobilized by borrowed conjugative pores can be smaller and have a higher copy number than the plasmids that code for the whole conjugative apparatus. Thus, the transfer apparatus can be used as a second subsidiary system for plasmid classification that helps make sense of their biology and the nature of their genomes (Garcillán-Barcia et al. 2009; Garcillán-Barcia et al. 2011; ▶ “Conjugative Transfer Systems and Classifying Plasmid Genomes”). There are three main types of known transfer systems that have quite different consequences for the genetic space they occupy on plasmid genomes. The first type is self-transmissible and uses a type IV secretion system (T4SS) to build a pilus (Gram-negative) or other structure (Grampositive) that spans the cell envelope and is exposed on the bacterial surface. This allows contact with a recipient cell to create a conjugative pore. The plasmid is then attached to the pore via an ATPase called the coupling protein that brings the DNA-protein complex (relaxosome) to the pore, resulting in replicative transfer (DNA transfer or mobilization). Such transfer systems are found in both Gram-negative and Gram-positive bacteria as well as the archaea, the best studied systems being plasmids within the Enterobacteriaceae and also the cocci (Enterococci and Staphylococci) (Grohmann et al. 2003). The genes for such systems usually constitute the largest segment of the plasmid backbone. In many cases the genes for Mpf and Dtr are separated into two or more different transcriptional units (Fig. 4). 
The presence of a T4SS in a plasmid genome can indicate self-transmissibility but could also indicate the presence of a protein export system not associated with plasmid transfer. This ambiguity can arise when simple bioinformatic analyses are used, leading to incorrect prediction of function and inappropriate naming of the genes (Frost and Thomas, Naming and annotation of plasmids). There are several important hints to the presence of a true conjugative system. The most important is the presence of the coupling protein, now known as the VirD4 family. Other clues are the presence of a relaxase and orthologues of an ATPase involved in protein transport (VirB4). A subclass of T4SS-based conjugative systems contains homologues belonging to the VirB11 family of ATPases (such as IncP plasmids), whereas another subclass, represented by F-type systems, has a complex mating pair stabilization and pilus retraction system that contains cysteine-rich proteins, such as members of the TraN family (Lawley et al. 2003; Zechner et al. 2010).

The second type of transfer system is found in plasmids of the Actinomycetes and the archaea. In Streptomyces, for instance, the cells grow by hyphal extension, and plasmid transfer involves two separate stages. The first stage of transfer takes advantage of hyphal fusion and does not involve a complex transfer apparatus but, instead, uses a single trans-envelope protein to form the conjugative pore that actively transports double-stranded DNA across the membrane. The second stage involves transfer or “spread” across the septa that occur at intervals along the hyphae during spore formation. The genomic signature of such systems is quite distinct from those of the first type and resembles the chromosomal partitioning protein, FtsK, and SpoIIIE. The former ensures that chromosomal DNA does not get trapped by the closing septum during cell division, whereas the latter ensures efficient delivery of a copy of the chromosome into the forespore during sporulation in Bacillus (Grohmann et al. 2003). Thus, only a relatively small number of plasmid genes are needed for these functions (Fig. 4).

The third type of transfer system works by promoting fusion of the donor and recipient cells by a pheromone signaling system.
This has been worked out in detail for Enterococcus faecalis with occasional reports that it might also be used in other Gram-positive cocci (Clewell 2011). In the Enterococcus system, the recipient cell emits a small peptide pheromone


Plasmid Genomes, Introduction to, Fig. 4 Plasmid gene clusters associated with different conjugative transfer strategies. ColE1 represents the simplest conjugative or mobilization organization: the oriT is within the mob genes for DNA metabolism; exc represents an exclusion protein. pWW0 is an IncP9 plasmid that has gene clusters for mob, containing DNA transfer genes, and the oriT separate from the mpf (mating pair formation) genes. MpfR is a regulator of the mpf operon. pAD1 is a Gram-positive plasmid from Enterococcus faecalis that has a pheromone-responsive conjugative system. The pheromone response is controlled by the tra genes, including the regulatory RNA mD, that in turn controls expression of the tra genes including the exclusion locus sea1 and the gene for aggregation substance (AS), asa1. pAD1 has two oriT (Clewell 2011). RK2 is an IncP1 alpha plasmid that has two transfer regions, one for DNA transfer and pilin processing (tra) and primase (pri) and one for the mating pair formation genes (trb) that are controlled by the trbA gene product. Note that the oriT is within the tra region. F is unusual in that its DNA transfer genes (tra) are separated by the mating pair formation genes (tra/trb) with the oriT lying outside the transfer operons. The main activator is encoded by traJ, which is regulated by the antisense RNA finP and by finO, which has been interrupted by an IS3 element. Transfer regions are gold, regulatory genes are red, DNA transfer genes are green with oriT shown as arrows or boxes in darker green, and IS3 is dark blue. Regulatory RNAs are shown as small arrows

that is detected at extremely low levels and is taken up by the donor cell. This sets in motion a complex pattern of gene expression resulting in the production of aggregation substance (AS) that is displayed on the surface of the donor cell. AS binds to the lipoteichoic acid of the recipient cell and brings the two cells together. The transfer of DNA through the conjugative pore is similar to that for other Gram-positive bacteria that use a modified T4SS suited to bacteria that lack a periplasmic space. After transfer has occurred, expression of the pheromone is repressed allowing differentiation between the new donor and other recipient cells. These systems also involve many gene products and therefore occupy significant segments of plasmid genomes (Fig. 4). A fourth system is worth mentioning at this point because it illustrates the blurring of the line between phage and plasmids. “Plasmids,” such as R391, were difficult to isolate although they

exhibited incompatibility (R391 was in the IncJ group) with other “plasmids” and were capable of conjugative transfer. These “plasmids” turned out to be derived from lysogenic phages that can insert and excise from the chromosome using an integrase/excisionase mechanism. They do not appear to replicate autonomously but do excise prior to conjugative transfer by an F-type conjugative mechanism. These “plasmids” are now called integrating conjugative elements (ICEs) and are increasingly being identified through genome sequencing projects. One ICE that was identified many years ago was conjugative transposon, CTn916, which has a conjugative system related to many found in Gram-positive plasmids (Burrus and Waldor 2004; Clewell 2011). It is also important to mention P1, the Jekyll and Hyde in the world of plasmids and phages. P1 can be either a benign plasmid with a replicon almost identical to that of


Plasmid Genomes, Introduction to, Fig. 5 Coordinate regulation of expression of plasmid backbone functions. For an increasing number of plasmid genomes, it is clear that there are mechanisms for co-regulation of gene expression. IncP-1 plasmid RK2 illustrates this principle (Bingle and Thomas 2001). Here the central control operon, which encodes active partitioning and transcriptional repressor functions, autogenously regulates its own expression as well as repressing all operons of the plasmid backbone or core

plasmid F or a phage that can be induced to form particles and lyse the cell. Thus, it moves from cell to cell as an infectious phage particle rather than by conjugation (Łobocka et al. 2004).
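The hints given earlier in this section for recognizing a true T4SS-based conjugative system can be written down as a simple decision rule. The sketch below is a deliberate simplification: the feature labels ("VirD4", "VirB4", "relaxase") stand for hits from whatever homology search the analyst trusts, the function name and return labels are hypothetical, and the rule covers neither the FtsK/SpoIIIE-like systems of the Actinomycetes nor integrating conjugative elements.

```python
from typing import Iterable


def classify_transfer(features: Iterable[str]) -> str:
    """Rough transfer classification from detected protein families.

    Assumption: `features` holds labels such as "VirD4" (coupling protein),
    "VirB4" (T4SS ATPase) and "relaxase", produced by an upstream search.
    """
    found = set(features)
    has_relaxase = "relaxase" in found
    has_coupling = "VirD4" in found
    has_t4ss = "VirB4" in found

    if has_relaxase and has_coupling and has_t4ss:
        return "conjugative"    # Dtr and Mpf functions both present
    if has_relaxase:
        return "mobilizable"    # Dtr only; relies on a borrowed pore
    if has_t4ss:
        return "t4ss_only"      # may be protein export, not DNA transfer
    return "none"


if __name__ == "__main__":
    print(classify_transfer(["VirB4", "VirD4", "relaxase"]))
    print(classify_transfer(["relaxase"]))
    print(classify_transfer(["VirB4", "VirB11"]))
```

The "t4ss_only" outcome is the case the text warns about: a T4SS without a coupling protein or relaxase should not be annotated as a conjugative system on sequence similarity alone.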

Regulation and Coordination of Plasmid Gene Expression

One fascinating aspect to emerge from the detailed study of plasmid genomes is the extent to which they have developed regulatory circuits to ensure that their genes are able to be highly active when needed but can otherwise be highly repressed and minimize the burden they represent to their bacterial host (Bingle and Thomas 2001). Early examples of such circuits include systems that evolved to control replication such as that in ColE1. It encodes an antisense RNA (RNA I) that regulates the replication of the high copy number ColE1-type plasmids. This provides a way of titrating the long, complex transcript, RNA II, required for initiating leading-strand synthesis. The action of RNA I is aided by the small basic protein Rom that mediates RNA I-RNA II interaction and helps lower the copy number. These circuits have been analyzed by mathematical modeling, which helps to determine how different parameters shape the overall behavior of the system (▶ Stekel, Modelling Plasmid Regulatory Systems). Regulatory RNAs that control the level of replication initiation were also found in other plasmids, including the F-like plasmid R1 and pIP501 in Gram-positive bacteria. The transfer systems of many F-like plasmids are regulated by an antisense RNA, FinP, which reduces conjugation potential in the absence of recipient cells, thereby reducing the metabolic burden on the host. FinP RNA provides plasmid specificity by interacting with the leading region of the mRNA for the transfer operon activator protein TraJ. FinP exhibits very little repression on its own and requires the FinO protein to mediate FinP-traJ mRNA duplexing (Frost and Koraimann 2010). Although one could fill many pages with examples of coordinated circuitry, it has become apparent that many plasmids have acquired ways to coordinate the expression of many of their basic functions (Bingle and Thomas 2001). Thus, the genes on the plasmid backbone (those genes that are needed to allow the plasmid to function as a genome) have become more integrated through the co-regulation of their expression. Some of these systems of coordination depend on genes being physically close to each other, as, for example, the inclusion of both rep and par functions in the repABC plasmids or the divergent transcription of the rep and tra functions of Ti plasmids such that the activities of the rep and tra functions are inversely related. Wider coordination of expression can be identified within IncP plasmids (Fig. 5). The replication and transfer functions are co-regulated at overlapping, divergent promoters in conjunction with repeated operator sequences spread across the genome that are


controlled by KorAB regulators. In other systems, such as the kis-kid post-segregational killing system of the F-like plasmid R1, coordination is through the targeting of RNAs by a ribonuclease whose level rises when the plasmid fails to segregate correctly (López-Villarejo et al. 2012). One consequence of this coordination of expression is that the plasmid genome is no longer composed of a set of independent modules which can be reshuffled or reused with complete flexibility. Attempts to use one module in the absence of the master regulator may be unsuccessful because of inappropriate levels of gene expression which may lead to either loss of function or deleterious effects on the host or other components of the system.
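The kind of mathematical modeling mentioned at the start of this section can be illustrated with a deliberately minimal copy number model. The sketch below assumes that an inhibitor (such as RNA I in ColE1) is present in proportion to copy number N and reduces the per-plasmid replication rate hyperbolically, while cell growth dilutes plasmids at rate mu; the functional form, the function names, and all parameter values are assumptions made for illustration and are not taken from the cited studies.

```python
def steady_state_copy_number(k_max: float, mu: float, K: float) -> float:
    """Analytical steady state of dN/dt = N * (k_max / (1 + N / K) - mu)."""
    return K * (k_max / mu - 1.0)


def simulate_copy_number(n0: float, k_max: float, mu: float, K: float,
                         dt: float = 0.01, steps: int = 20_000) -> float:
    """Forward-Euler integration of the same toy model from copy number n0."""
    n = n0
    for _ in range(steps):
        # Replication is inhibited more strongly as copy number rises.
        replication_rate = k_max / (1.0 + n / K)
        n += dt * n * (replication_rate - mu)
    return n


if __name__ == "__main__":
    # Hypothetical parameters (per generation): k_max = 3, mu = 1, K = 10.
    print(steady_state_copy_number(3.0, 1.0, 10.0))
    print(simulate_copy_number(1.0, 3.0, 1.0, 10.0))    # starts too low
    print(simulate_copy_number(100.0, 3.0, 1.0, 10.0))  # starts too high
```

Because the inhibition term grows with N, the simulated copy number returns to the same value whether it starts too low or too high, which is the qualitative behavior expected of a negative feedback loop that sets a characteristic copy number.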

Transposable Elements and Integrons

While transposable elements and integrons are found in all parts of a genome (chromosome, plasmids, and prophage), they appear to make up a higher proportion of plasmid genomes than of any other type of genome (▶ Hobman, Transposable elements and plasmid genomes). This can create difficulty in generating an accurate sequence of a plasmid genome when there are multiple identical or near identical copies of a single IS element. There are a variety of reasons why they accumulate particularly in plasmids. Since viruses are constrained by the need to be packaged into a virion capsid, the size of the phage genome and the maximum capacity of the capsid appear to have coevolved such that they cannot acquire much more than an extra 10% of mobile DNA. Since chromosomes encode a large number of essential genes, the accumulation of a large amount of mobile DNA with the associated increase in metabolic and phenotypic loading may also be disfavored. A striking feature of mobile DNA in plasmids is the clustering of such elements. This can be rationalized by imagining that in a relatively small and compact plasmid genome, there may be only a few places to insert mobile DNA without disrupting a function important for plasmid replication, maintenance, or transfer. This is certainly


true for the broad host range plasmid RSF1010 (aka R1162 and R300B) that is a poor candidate for use as a cloning vector. If the first insertion of mobile DNA does not have a dramatic negative effect on plasmid fitness, it will help define a site where further insertions will also not be harmful. Expanding the length of nonessential mobile DNA at this site may also positively encourage insertions at that particular region. Therefore, segments of plasmid genomes composed of multiple insertion events, one on top of the other, can be very complicated. Reconstructing the history of such regions is difficult but does give some idea of the evolutionary development of the segment (Fig. 6). Initially, this history can be traced using known sequences at the borders of mobile DNA elements (IS, integrons, transposons) and their genomic signatures to define the ends of elements in the region. For example, Tn1 has 23 bp inverted repeats at its ends and generates 5 bp direct repeats at its borders on insertion. If a large insertion disrupts a copy of Tn1, it should be possible to identify the original ends, if they exist, by identifying matches to the 23 bp sequences that define the ends of the element as well as the 5 bp direct repeats that border the initial Tn1 element. Often, the initial DNA segment, such as Tn1, occurs elsewhere in a different genome (plasmid, chromosome) in an uninterrupted form, ensuring that its cargo of genes is preserved at one location at least. Thus, it should also be possible to distinguish between a simple insertion of an element and two adjacent insertions in the same orientation that result from recombination. Particularly well-described examples are the plasmids of Staphylococcus aureus, which use these strategies to build a formidable arsenal of resistance determinants (Jensen et al. 2010).
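The border-tracing approach described above, matching the inverted repeats at the ends of an element and the direct repeats generated on insertion, can be sketched in a few lines. The example below is a minimal illustration that requires perfect matches; the function names are hypothetical, and the toy sequence uses a shortened 8 bp inverted repeat rather than the real terminal sequence of Tn1 or any other element.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_insertions(genome: str, left_ir: str, dr_len: int):
    """Find elements bounded by inverted repeats and flanked by direct repeats.

    Returns (start, stop, direct_repeat) tuples, where genome[start:stop]
    spans the element from its left to its right inverted repeat.
    Assumption: exact matches only; real ends are often imperfect.
    """
    right_ir = reverse_complement(left_ir)
    hits = []
    start = genome.find(left_ir)
    while start != -1:
        end = genome.find(right_ir, start + len(left_ir))
        while end != -1:
            stop = end + len(right_ir)
            left_dr = genome[start - dr_len:start] if start >= dr_len else ""
            right_dr = genome[stop:stop + dr_len]
            # The same short sequence on both sides marks one insertion event.
            if len(left_dr) == dr_len and left_dr == right_dr:
                hits.append((start, stop, left_dr))
            end = genome.find(right_ir, end + 1)
        start = genome.find(left_ir, start + 1)
    return hits


if __name__ == "__main__":
    ir = "GGGTCTGA"  # toy inverted repeat, not a real transposon end
    toy = ("AAAA" + "CATGC" + ir + "TTTTCCCCTTTT"
           + reverse_complement(ir) + "CATGC" + "AAAA")
    print(find_insertions(toy, ir, dr_len=5))
```

An element whose two ends are found but whose flanking direct repeats differ would not be reported by this sketch; in a real analysis that pattern is itself informative, since it suggests that the ends derive from different insertion events or that recombination has occurred.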

Phenotypic Load

Following on from the previous section, one can think about the nature of the noncore part of the plasmid genome as a cargo carried by the plasmid which can be analyzed in terms of what sorts of genes are carried, how they are assembled, and in what combinations they are usually found. Thus,


Plasmid Genomes, Introduction to, Fig. 6 Genome insertions. Plasmids tend, on average, to carry a much larger number of insertions (per unit length) acquired by transposition than chromosomes and phages. The history of insertions can be uncovered by identifying the ends of insertion elements and transposons, using the direct repeats associated with them to pinpoint particular insertion events. The results are illustrated by the analysis of pWW0, the best studied TOL plasmid associated with biodegradation of aromatic organic compounds and the focus of much research into biosensors and bioremediation (Greated et al. 2012)

one of the intriguing questions is why plasmids, which often carry antibiotic resistance or catabolic functions separately, are so common, whereas plasmids that carry both antibiotic resistance and catabolic functions simultaneously are rare or, perhaps, nonexistent? Also, why are some very useful cargo genes usually found on plasmids and not chromosomes, and how do they differ from “chromosomal” genes? There are many ideas about this – from those that focus on the possibility that plasmid genes are ones that have recently spread through or are spreading through a population to the idea that these genes provide relatively simple autonomous functions. This debate has been informed by the analysis of genetic information using the latest bioinformatic tools and computer modeling (Rankin et al. 2011; ▶ “Mathematical Modelling of Plasmid Dynamics”).

In addition, it is interesting to consider how carriage of transposable elements on a plasmid may have a long-term beneficial effect, which could explain why some plasmids seem to have accumulated a surprisingly large number of them. First, since transposable elements often insert preferentially into or near other elements, and since they can impose a fitness cost if they jump into functionally necessary genes, having a plasmid that attracts insertions away from the chromosome may be an advantage. Second, some elements control their frequency of transposition by negative feedback circuits, so if a plasmid carries multiple copies, it may reduce the rate of movement of the element and thus protect the host from further insertional inactivation.

Although one can think of the core backbone genes as not expressing a phenotype whereas the cargo DNA must confer one under the right conditions or circumstances (e.g., the presence of an antibiotic), the reality is rarely as clear-cut as this. Thus, a “cryptic” plasmid may confer a phenotype that can be as simple as a change in growth rate, either up or down. Conjugative plasmids can have profound effects on their hosts, although these may be very subtle in the absence of an appropriate assay. For instance, by virtue of expressing the conjugative apparatus itself, both positive (chromosomal evolution, biofilm formation) and negative (sensitivity to phages such as M13 or PRD1, decrease in fitness) traits can be conferred on the host. A consequence of this may be the fact that in F-like plasmids the expression of conjugative transfer genes is tightly regulated, so that only a few bacteria in a steady-state population express the genes for the full transfer apparatus and these genes remain on for only a relatively short time. On the other hand, immediately after the initial transfer, the new transconjugants help spread the plasmid rapidly throughout a population in a process called epidemic spread. Certainly, the good must outweigh the bad since conjugative plasmids are usually very stably maintained.

Future Outlook: Exploitation of Plasmid Genomes

The advent of recombinant DNA was in part driven by the discovery of plasmid genomes that provided the basis for creating small self-replicating genetic systems that were tailored to the needs of the researcher or the particular application. Thus, natural plasmid genomes have been extensively remodeled to provide the basis of the tools for the GM revolution. Indeed it is instructive to think of vector development as a large-scale exercise to dissect and reengineer plasmid genomes. The first stage was identifying minireplicons from a variety of plasmids to provide the core of a range of vectors and separating these from mobilization functions to satisfy safety requirements prompted by the concern that plasmids carrying unnatural genes might spread through bacterial populations. The second stage was joining these to suitable selectable markers, together with convenient restriction sites and ways of screening for inserts, to create cloning vectors. In parallel, controlled promoters were added to make expression vectors to allow overproduction of the products of cloned genes. The discovery that some plasmid replication systems can be split into cis- and trans-acting functions allowed the construction of plasmids that could replicate in a host that carried the essential trans-acting replication function but not in a new host lacking it. This resulted in the creation of suicide vector plasmids that could transfer into, but not replicate in, a new host, forcing them to integrate into the chromosome via shared transposable elements or other homologous sequences in order to survive.

The other side of plasmids as tools is the use of plasmids as models for bacterial genomes. This was one of the original drivers behind the study of plasmids because they were small enough to manipulate and, being nonessential, could be subject to a wider range of mutational analyses. However, it was also based on the premise that genomic segments of large, low copy number plasmids should, in general, be no different from other parts of the collective bacterial genome. A good example of this is genome sampling in order to identify proteins that bind to a particular segment of DNA. Butala et al. (2009) constructed a plasmid containing DNA that was flanked by sites for the yeast homing endonuclease I-SceI. This enzyme cuts at an 18 bp sequence that does not normally exist in the bacterial genome. By having the gene for this nuclease under the tight control of the ara promoter, it is possible to cut this fragment out by the simple addition of arabinose. By including the lac operator between these sites and supplying a FLAG-tagged LacI protein, it is possible to purify the fragment by magnetic bead affinity pull-down technology. This makes it feasible to clone an uncharacterized DNA fragment into this vector in the context of the host of interest and then purify the DNA segment with its native complement of bound proteins. Thus, the cloned segment becomes part of the plasmid genome, proteins bind to it as normal, and its release from the genome allows those proteins to be analyzed.


This has particular application in identifying components, particularly the nucleoid-associated proteins, that are involved in a range of DNA-protein complexes that are important for normal chromosome or plasmid function. The growing understanding of the role of plasmid-encoded nucleoid-associated proteins in minimizing the burden of plasmids on their hosts and modulating the activity of the host genome is a good example of this.

A second revolution is taking place in the exploitation of plasmids, driven by the simplicity and low cost of DNA synthesis. Synthetic biology is making the de novo construction of plasmid genomes a reality (▶ “Synthetic Plasmid Biology”). As it becomes easier to do this, the limitation may be the knowledge that underpins the design process. At one level it should be simple to build plasmid genomes from scratch using the various essential components that we have described covering replication, partitioning, multimer resolution, addiction, and transfer. Already there are many “biobricks” or gene cassettes available that represent such plasmid components. The synthetic process allows the synthesis of “natural” gene sets that differ from existing gene sequences primarily with respect to G+C content, codon usage, and the presence of restriction sites. Problems may arise if such gene sets are regulated by other factors within their normal context. In addition, it is not known how well different sets of functions will work together, especially if this involves copy numbers very different from those of their normal context. Use of mathematical models to analyze the behavior of individual gene sets and their control circuits may provide a robust framework for predicting which components will work well together, but until this is tested empirically, the success of these predictions remains unknown. Such modeling may also allow us to design and build synthetic systems and associated regulatory circuits from even more basic constituent parts. This could generate systems that do not interact with their host and that do not share specificity with any natural plasmid; for example, they could be outside all natural incompatibility groupings, with obvious advantages for coexisting in natural systems.

Foundational Concepts

The fundamental features of plasmid genomes are the cis-acting sites responsible for the various processes that define the plasmid properties of the genome.

Replicon: (1) A unit of DNA that is replicated from a single origin uni- or bidirectionally, normally the whole plasmid or chromosome. (2) The set of genetic information required for autonomous replication, generally the replication origin (oriV) and whatever genetic information is required to activate it.

Relaxosome: The complex formed at the transfer origin oriT by the proteins that bind to it and are required to activate it, a process that normally involves a nicking reaction.

Segrasome: The complex formed by the centromere-like sequence and proteins that associate with it to create the apparatus that drives better-than-random partitioning of the plasmid population.

Mobilome: The collective genetic information carried on plasmids and other mobile elements in the relevant bacterial population.

Incompatibility: The inability of two plasmid genomes carrying related replicons to be stably inherited in the same host over multiple generations due to use of, and competition for, one or more stages of replication or segregation.

Cross-References

▶ Conjugative Transfer Systems and Classifying Plasmid Genomes
▶ Genomic Signature Analysis to Predict Plasmid Host Range
▶ Mathematical Modeling of Plasmid Dynamics
▶ Metamobilomics – The Plasmid Metagenome of Natural Environments
▶ Plasmid Incompatibility
▶ Plasmid Regulatory Systems, Modeling
▶ Plasmids as Secondary Chromosomes
▶ Plasmids, Naming and Annotation of
▶ Rolling Circle Replicating Plasmids
▶ Synthetic Plasmid Biology
▶ Theta-Replicating Plasmids, Large
▶ Transposable Elements and Plasmid Genomes

References

Berg T, Neville F, Apisiridej S, Hettiaratchi A, Leelaporn A, Skurray RA (1998) Complete nucleotide sequence of pSK41: evolution of staphylococcal conjugative multiresistance plasmids. J Bacteriol 180:4350–4359
Bingle LH, Thomas CM (2001) Regulatory circuits for plasmid survival. Curr Opin Microbiol 4:194–200
Burrus V, Waldor MK (2004) Shaping bacterial genomes with integrative and conjugative elements. Res Microbiol 155:376–386
Butala M, Busby SJW, Lee DJ (2009) DNA sampling: a method for probing protein binding at specific loci on bacterial chromosomes. Nucleic Acids Res 37:e37
Chaconas G, Kobryn K (2010) Structure, function, and evolution of linear replicons in Borrelia. Annu Rev Microbiol 64:185–202
Clewell DB (2011) Tales of conjugation and sex pheromones: a plasmid and enterococcal odyssey. Mob Genet Elem 1:38–54
De Jonge N, Simic M, Buts L, Haesaerts S, Roelants K, Garcia-Pino A, Sterckx Y, De Greve H, Lah J, Loris R (2012) Alternative interactions define gyrase specificity in the CcdB family. Mol Microbiol 84:965–978
del Solar G, Espinosa M (2000) Plasmid copy number control: an ever-growing story. Mol Microbiol 37:492–500
Engelberg-Kulka H, Glaser G (1999) Addiction modules and programmed cell death and anti death in bacterial cultures. Annu Rev Microbiol 53:43–70
Field CM, Summers DK (2011) Multicopy plasmid stability: revisiting the dimer catastrophe. J Theor Biol 291:119–127
Frost LS, Koraimann G (2010) Regulation of bacterial conjugation: balancing opportunity with adversity. Future Microbiol 5:1057–1071
Funnell B, Phillips GJ (2004) In: Funnell B, Phillips GJ (eds) Plasmid biology. ASM Press, Washington, DC
Garcillán-Barcia MP, Francia MV, de la Cruz F (2009) The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol Rev 33:657–687
Garcillán-Barcia MP, Alvarado A, de la Cruz F (2011) Identification of bacterial plasmids based on mobility and plasmid population biology. FEMS Microbiol Rev 35:936–956
Gerdes K, Gultyaev AP, Franch T, Pedersen K, Mikkelsen ND (1997) Antisense RNA-regulated programmed cell death. Annu Rev Genet 19:49–61
Greated A, Lambertson L, Williams PA, Thomas CM (2002) Complete sequence of the IncP-9 TOL plasmid pWW0. Environ Microbiol 4:856–871
Grohmann E, Muth G, Espinosa M (2003) Conjugative plasmid transfer in gram-positive bacteria. Microbiol Mol Biol Rev 67:277–301
Hale L, Lazos O, Haines AS, Thomas CM (2010) An efficient stress-free strategy to displace stable bacterial plasmids. Biotechniques 48:223–228
Hallett B, Vanhooff V, Cornet F (2004) DNA site-specific resolution systems. In: Funnell B, Phillips GJ (eds) Plasmid biology. ASM Press, Washington, DC, pp 145–180
Huang CH, Tsai HH, Tsay YG, Chien YN, Wang SL, Cheng MY, Ke CH, Chen CW (2007) The telomere system of the Streptomyces linear plasmid SCP1 represents a novel class. Mol Microbiol 63:1710–1718
Jacob F, Brenner S, Cuzin F (1963) On the regulation of DNA replication in bacteria. Cold Spring Harb Symp Quant Biol 28:329–438
Jensen SO, Apisiridej S, Kwong SM, Yang YH, Skurray RA, Firth N (2010) Analysis of the prototypical Staphylococcus aureus multiresistance plasmid pSK1. Plasmid 64:135–142
Kobayashi I (2004) Genetic addiction: a principle of gene symbiosis in a genome. In: Funnell B, Phillips GJ (eds) Plasmid biology. ASM Press, Washington, DC, pp 105–144
Lawley TD, Klimke WA, Gubbins MJ, Frost LS (2003) F factor conjugation is a true type IV secretion system. FEMS Microbiol Lett 269:1–15
Lawley T, Wilkins BM, Frost LS (2004) Conjugation in gram-negative bacteria. In: Funnell B, Phillips GJ (eds) Plasmid biology. ASM Press, Washington, DC, pp 203–226, chapter 9
Łobocka MB, Rose DJ, Plunkett G 3rd, Rusin M, Samojedny A, Lehnherr H, Yarmolinsky MB, Blattner FR (2004) Genome of bacteriophage P1. J Bacteriol 186:7032–7068
López-Villarejo J, Diago-Navarro E, Hernández-Arriaga AM, Díaz-Orejas R (2012) Kis antitoxin couples plasmid R1 replication and parD (kis, kid) maintenance modules. Plasmid 67:118–127
Lozano L, Hernández-González I, Bustos P, Santamaría RI, Souza V, Young JP, Dávila G, González V (2010) Evolutionary dynamics of insertion sequences in relation to the evolutionary histories of the chromosome and symbiotic plasmid genes of Rhizobium etli populations. Appl Environ Microbiol 76:6504–6513
Maisonneuve E, Shakespeare LJ, Jørgensen MG, Gerdes K (2011) Bacterial persistence by RNA endonucleases. Proc Natl Acad Sci U S A 108:13206–13211
Mochizuki A, Yahara K, Kobayashi I, Iwasa Y (2006) Genetic addiction: selfish gene’s strategy for symbiosis in the genome. Genetics 172:1309–1323
Rankin DJ, Rocha EPC, Brown SP (2011) What traits are carried on mobile genetic elements and why? Heredity 106:1–10
Ravin NV (2011) N15: the linear phage-plasmid. Plasmid 65:102–109
Schumacher MA (2012) Bacterial plasmid partition machinery: a minimalist approach to survival. Curr Opin Struct Biol 22:72–79
Thomas CM (2000) In: Thomas CM (ed) The horizontal gene pool: bacterial plasmids and gene spread. Harwood Academic, Amsterdam
Tripathi A, Dewan PC, Barua B, Varadarajan R (2012) Additional role for the ccd operon of F-plasmid as a transmissible persistence factor. Proc Natl Acad Sci U S A 109:12497–12502
Zechner EL, Lang S, Schildbach JF (2010) Assembly and mechanisms of bacterial type IV secretion machines. Philos Trans R Soc Lond B Biol Sci 367:1073–1087
Zhang R, Peng S, Zhongjun QZ (2010) Two internal origins of replication in Streptomyces linear plasmid pFRL1. Appl Environ Microbiol 76:5676–5683

Plasmid Grouping

▶ Plasmid Incompatibility

Plasmid Incompatibility

Christopher M. Thomas
Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, UK

Synonyms

Plasmid classification; Plasmid family; Plasmid grouping; Replicon type

Definition

Plasmid incompatibility refers to the inability of two plasmids to coexist stably over a number of generations in the same bacterial cell line. Generally, closely related plasmids tend to be incompatible, while distantly related plasmids tend to be compatible. The most frequent reason for two plasmids being incompatible is that they both possess a replicon with the same specificity of Rep protein or controlling elements. However, incompatibility can also be due to other types of competition, for example, between the same or closely related partitioning systems. Incompatibility may be reciprocal, in that both plasmids have the same chance of being lost from a cell line that starts with both, or it can be unidirectional, if one of the plasmids has additional features that give it an advantage, for example, possession of a second replicon. Incompatible plasmids are placed in the same group, and there are well-established classification systems for certain hosts or groups of hosts, but many plasmids fall outside these systems because incompatibility tests have not been carried out systematically on all known plasmids.

Discussion

The fundamental property of a plasmid genome is its ability to replicate autonomously, and thus the possession of an independent replicon (see ▶ “Plasmid Genomes, Introduction to”). During the analysis and assembly of genome or metagenome (▶ Metamobilomics: The Plasmid Metagenome of Natural Environments) sequence data, one feature that will mark out a contig (an assembly of overlapping sequence reads into a continuous block of reliable DNA sequence data) as coming from a plasmid is the presence of a replicon (although some “plasmids” have become “chromosomes” by virtue of acquiring genes essential to their hosts; see ▶ “Plasmids as Secondary Chromosomes”). Knowing the replicon type(s) present in a plasmid allows one to make predictions about the likely properties of the plasmid, such as copy number, host, and host range, which is important for naming and annotation (see ▶ “Plasmids, Naming and Annotation of”). Therefore, it is logical to use replicon type as the basis on which to group plasmids, although for other sorts of questions classifying plasmids by the transfer system they carry can also be justified (see ▶ “Conjugative Transfer Systems and Classifying Plasmid Genomes”). Modern technology such as PCR and DNA sequencing allows rapid determination of replicon sequence type, and because plasmid groups are often referred to as incompatibility groups, this process results in a plasmid being allocated to an incompatibility group without a real appreciation of what that implies.

Plasmid incompatibility typing was developed as a basis for classifying plasmids before DNA sequences were available (Novick 1987). When gene cloning techniques allowed the functional dissection of plasmids and the isolation of their replicons, it became clear that incompatibility was generally linked to the replicon type. Incompatibility can be explained as a simple consequence of the random selection of plasmid molecules competing for a critical stage in the stable inheritance processes, whether it is random replication events or the partitioning event that follows replication (Austin and Nordstrom 1990; Ebersbach et al. 2005; Fig. 1). Depending on what determines the specificity of this competition (see ▶ “Plasmid Regulatory Systems, Modeling”), some plasmids that are almost identical can coexist stably due to a few critical sequence differences in regulatory RNA (e.g., in ColE1-type replicons), while plasmids that show wide divergence from a common ancestor can still exhibit incompatibility (e.g., IncP-1 replicons). The population behavior of such plasmids can be modeled mathematically (see ▶ “Mathematical Modeling of Plasmid Dynamics”).

Plasmid Incompatibility, Fig. 1 Random selection from a mixed pool of plasmids is the basis of incompatibility. Two plasmids can be incompatible if they carry a replication or partitioning system with the same specificity. (a) Traditional view. The first stage in incompatibility arises from random selection of DNA molecules for replication. It is possible that replicated plasmid molecules remain paired and are efficiently partitioned prior to cell division, but if not, then random selection of DNA molecules for the partitioning process will result in uneven distribution of the two plasmid types in daughter cells, and this will lead to further imbalance which eventually leads to loss of one or other type. (b) Revised view of partition-based incompatibility based on results from fluorescence microscopy (Ebersbach et al. 2005). According to this view, partitioning results in random distribution of plasmid genomes at even intervals along the bacterial cell, and thus, plasmids sharing a partitioning system compete for these positions. It is proposed that the critical position to occupy is at the cell division plane so that replication and segregation result in distribution to both daughter bacteria

A number of systematic sets of tools have been created to aid the classification process, including collections of mini-plasmids, hybridization probes, and PCR primers, in order to establish the replicon type of unclassified plasmids. Pittard and coworkers (Davey et al. 1984) carried out pioneering work, collecting the mini-replicons that had been cloned and joining them to an easily screened gene so that loss of the mini-replicon caused by the unclassified plasmid could be easily detected. Subsequently, Couturier et al. (1988) used these mini-replicons as the basis of hybridization probes to classify plasmids, and a number of different microarray systems have used variants of this approach to identify the replicon types present in a sample. Finally, the most recent and comprehensive system is that of Carattoli et al. (2005), which depends on sets of PCR primers to distinguish between plasmids of different groups.

A further complication of plasmid typing arises when plasmids contain multiple replicons. This is well illustrated by plasmids of the IncF group that can contain two, three, or more replicons (see ▶ “Theta-Replicating Plasmids, Large”). This has practical implications because such plasmids are not easily displaced by plasmids with a single replicon. A key part of studying the properties conferred by plasmids is to displace the plasmid (a process called curing) and determine what properties are lost by the host bacterium. Having efficient procedures to achieve curing is important, but traditional methods involve treatment of the bacteria with physical (heat treatment or electric shock) or chemical (detergents or DNA modifying agents) stress that may induce the SOS response and increase the mutation rate. A stress-free strategy to displace such plasmids, based on this knowledge, involves incorporating key genetic components of the resident plasmid into a vector which can first displace the resident by blocking the function of all replicons and neutralizing all addiction systems and can then be displaced itself (Hale et al. 2010).

With the increasing ease with which we can determine DNA sequences of new plasmids, it is important to remember that incompatibility is strictly a practical property that relates to the ability of two plasmids to be inherited together over a number of generations. Unless two plasmids are identical, one cannot be 100% sure that they will be incompatible. Therefore, if one wishes to talk about the actual incompatibility properties of a plasmid, real wet lab experiments will be needed to give a definitive answer. Conversely, if one wants to build a synthetic plasmid (see ▶ “Synthetic Plasmid Biology”) that is compatible with known plasmid vectors, then understanding what normally determines these relationships is essential.
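The random-selection explanation of incompatibility given above can be illustrated with a small simulation. The sketch below is a toy model, not one taken from the cited literature, and every modeling choice in it is an assumption: a newborn cell always contains exactly `copy_number` plasmids, templates for replication are drawn at random in proportion to their current abundance until the pool has doubled, each followed daughter receives exactly `copy_number` molecules drawn at random, and neither plasmid type has a selective advantage.

```python
import random

def generations_until_loss(copy_number, rng, max_generations=100_000):
    """Follow one cell lineage that starts with two plasmid types, A and B,
    sharing a replicon; return the generation at which one type is lost."""
    if copy_number < 2:
        raise ValueError("need at least two plasmid copies to hold both types")
    a = copy_number // 2
    b = copy_number - a
    for generation in range(1, max_generations + 1):
        # Replication: copy control only counts total molecules, so each
        # replication event picks its template at random from the mixed pool.
        pool_a, pool_b = a, b
        while pool_a + pool_b < 2 * copy_number:
            if rng.random() < pool_a / (pool_a + pool_b):
                pool_a += 1
            else:
                pool_b += 1
        # Partition: the daughter we follow receives a random half of the pool.
        daughter = rng.sample(["A"] * pool_a + ["B"] * pool_b, copy_number)
        a = daughter.count("A")
        b = copy_number - a
        if a == 0 or b == 0:
            return generation
    return None  # not reached in practice for small copy numbers

def mean_generations_until_loss(copy_number, runs=200, seed=1):
    """Average loss time over many independent lineages (fixed seed for repeatability)."""
    rng = random.Random(seed)
    times = [generations_until_loss(copy_number, rng) for _ in range(runs)]
    return sum(times) / len(times)

low_copy = mean_generations_until_loss(2)
high_copy = mean_generations_until_loss(16)
```

Under these assumptions, loss of one plasmid type is inevitable in every lineage, and it takes longer at higher copy number because each random step shifts the ratio of the two types by a smaller amount.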

Cross-References

▶ Conjugative Transfer Systems and Classifying Plasmid Genomes
▶ Mathematical Modeling of Plasmid Dynamics
▶ Metamobilomics – The Plasmid Metagenome of Natural Environments
▶ Plasmid Genomes, Introduction to
▶ Plasmid Regulatory Systems, Modeling
▶ Plasmids as Secondary Chromosomes
▶ Plasmids, Naming and Annotation of
▶ Synthetic Plasmid Biology
▶ Theta-Replicating Plasmids, Large

References

Austin S, Nordstrom K (1990) Partition-mediated incompatibility of bacterial plasmids. Cell 60:351–354
Carattoli A, Bertini A, Villa L, Falbo V, Hopkins KL, Threlfall EJ (2005) Identification of plasmids by PCR-based replicon typing. J Microbiol Methods 63:219–228
Couturier M, Bex F, Bergquist PL, Maas WK (1988) Identification and classification of bacterial plasmids. Microbiol Rev 52:375–395
Davey RB, Bird PI, Nikoletti SM, Praszkier J, Pittard J (1984) The use of mini-gal plasmids for rapid incompatibility grouping of conjugative R-plasmids. Plasmid 11:234–242
Ebersbach G, Sherratt DJ, Gerdes K (2005) Partition-associated incompatibility caused by random assortment of pure plasmid clusters. Mol Microbiol 56:1430–1440
Hale L, Lazos O, Haines AS, Thomas CM (2010) An efficient stress-free strategy to displace stable bacterial plasmids. Biotechniques 48:223–228
Novick RP (1987) Plasmid incompatibility. Microbiol Rev 51:381–395

Plasmid Regulatory Systems, Modeling

Dov J. Stekel
School of Biosciences, University of Nottingham, Loughborough, UK

Synopsis

The success of plasmids as stably inherited, autonomously replicating units depends on control circuits that ensure that positive events such as replication occur efficiently at a set average frequency and that the genetic load carried by the plasmid is at minimal metabolic cost to the host. While selective pressure has ensured that natural plasmids do achieve this, the wish to exploit plasmids or interfere with their survival mechanisms for biotechnological applications means that we need to understand the critical features that are needed for success. Mathematical modeling of the intracellular control circuits can help to explore different systems and to distinguish between key parameters and those whose variation will have little effect on the system. The relatively low complexity of plasmids makes them ideal systems to model, and they also provide suitable systems to test predictions from the models. In the past, plasmid modeling has particularly focused on the ColE1 and R1 plasmids, using both deterministic and stochastic approaches; more recent work has started to address plasmids with more complex regulatory architectures, such as RK2. This has developed our understanding of the contrasting regulatory mechanisms found in high and low copy number plasmids. The combination of mathematical modeling with robust statistical methods for parameter estimation can integrate experimental data into the model, leading to more realistically parameterized mathematical models. These have greater predictive power and are likely to play a crucial future role in the rational design of plasmids for use in biotechnology and bioprocessing.

Introduction

The essential feature of a plasmid genome is its possession of a replicon that allows the plasmid to maintain itself at a certain average copy number in its host. The systems studied to date have all been found to involve one or more negative feedback circuits to achieve this. In addition, the other basic plasmid functions that may help the plasmid to achieve stable inheritance, mobilizability, and self-transmissibility all involve gene sets that are tightly controlled, either separately or coordinately. This regulation arguably has evolved in order to minimize the burden on the plasmid host (Herman et al. 2012). Because these genomes are relatively simple compared to bacterial chromosomes and amenable to experimental investigation due to the nonessential nature of plasmids, they represent opportunities to use mathematical modeling to understand how systems work. They come with the added benefit that being able to predict how plasmids behave has practical applications.

A molecular biological system, such as a system controlling plasmid replication, can be described by a detailed series of chemical reactions. These might include transcription, translation, DNA replication, and interactions between DNA, RNA, and protein molecules. A mathematical model of a molecular system is a formal and quantitative description of these chemical reactions and can be thought of as an encapsulation of a set of scientific hypotheses about the relevant reactions in the system.

Types of Models

There are many types of mathematical models that can be characterized in different ways. A good model would strive to have four defining features: to be mechanistic in that it describes the molecular processes thought to give rise to the phenotypes under study; to be realistic in that it incorporates known biology and parameter values based on experimental data; to be dynamic in that it includes the element of time; and to be predictive in that it can make predictions that can be tested against experimental data. Models are also valuable for their explanatory power, in deepening our understanding of why the biological systems we study have evolved to be the way that they are. Thus, ideally, they should allow one to investigate aspects of the system which may be difficult to address empirically.

Models can be deterministic or stochastic. Deterministic models describe the rate of change of concentrations of molecules as a function of time. They have the advantage of being easier to describe, simulate, and fit to data; however, they can only describe the average behavior in a population and are less realistic for describing systems with small numbers of molecules. Stochastic models, on the other hand, include randomness in the molecular processes of the system. They are harder to work with, but have the advantages of being able to describe the variability in a population, as well as being more realistic in systems with small numbers of molecules.
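The difference between the two model types can be made concrete with a deliberately simple example. The sketch below is not any of the published plasmid models: it assumes a single variable, the copy number n, with replication at a per-plasmid rate R/(1 + n/K) (a generic negative feedback) and loss at rate MU per plasmid as a crude stand-in for dilution by cell growth. The parameter values and function names are invented for illustration. The same two reactions are integrated once as an ordinary differential equation and once with Gillespie's stochastic simulation algorithm.

```python
import math
import random

# Illustrative parameters (assumptions, not measured values).
R = 3 * math.log(2)   # maximal replication rate per plasmid per generation time
MU = math.log(2)      # loss rate per plasmid, standing in for dilution by growth
K = 10.0              # copy number at which feedback halves the replication rate

def replication_rate(n):
    """Total replication rate with hyperbolic negative feedback on copy number."""
    return n * R / (1.0 + n / K)

def deterministic(n0, t_end, dt=0.001):
    """Euler integration of dn/dt = replication_rate(n) - MU * n."""
    n = float(n0)
    for _ in range(int(round(t_end / dt))):
        n += dt * (replication_rate(n) - MU * n)
    return n

def stochastic(n0, t_end, rng):
    """Gillespie simulation of the same two reactions with integer copy numbers."""
    n, t = int(n0), 0.0
    while n > 0:
        birth, death = replication_rate(n), MU * n
        t += rng.expovariate(birth + death)   # waiting time to the next event
        if t > t_end:
            break
        n += 1 if rng.random() * (birth + death) < birth else -1
    return n

# Steady state of the deterministic model: R / (1 + n/K) = MU.
STEADY_STATE = K * (R / MU - 1.0)

average_behavior = deterministic(5, 50.0)
rng = random.Random(0)
single_cells = [stochastic(20, 20.0, rng) for _ in range(50)]
```

The deterministic run returns one number, the population average, whereas repeated stochastic runs return a spread of integer copy numbers around it. That spread is the cell-to-cell variability that only the stochastic description can capture.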
Most work in plasmid modeling has been deterministic, but stochastic models have been particularly effective in describing the molecular regulation in the control of copy number variability (Paulsson and Ehrenberg 1998, 2000).

Plasmids as Models

Plasmids are particularly attractive for mathematical modeling because they are considerably simpler than prokaryotic or eukaryotic cells. For example, mathematical models for the control of replication in ColE1 plasmids can be expressed in a relatively small number of equations: Brendel and Perelson’s detailed model contains 10 differential equations and 19 parameters (Fig. 1; adapted from Brendel and Perelson 1993). In contrast, mathematical models for the cell cycle of even a simple eukaryote such as Saccharomyces cerevisiae need to capture considerably greater biological complexity: Chen et al.’s model for the yeast cell cycle contains 36 differential equations, 26 algebraic constraints, and 143 parameters (Chen et al. 2004). The size of a model is particularly important. The estimation of biologically realistic parameters and the evaluation of the sensitivity of a model to those parameters are critical steps in model construction and evaluation. Thus, a model with fewer parameters, such as a model of a plasmid system, is easier to produce, analyze, and compare with biological data and, therefore, is much more likely to be realistic and of greater explanatory and predictive value.

Plasmid Regulatory Systems, Modeling, Fig. 1 Kinetic scheme for Brendel and Perelson’s ColE1 replication model. Free plasmid DNA (D) replicates via the replicative cycle (in red) through four steps: the formation of short-length plasmid-bound RNA II (DSII), the formation of long-length plasmid-bound RNA II (DLII), primed DNA, and finally replication from the primed DNA. Inhibition of DNA replication takes place by the cycle in blue. The short-length plasmid-bound RNA II forms an unstable complex with RNA I (D*C). This complex can either convert directly to the stable complex DC or associate with Rom protein (M) to form DM. The stable complex DC converts back to free DNA after the dissociation of the RNAs. This scheme is modeled with 10 differential equations (Figure and legend adapted from Brendel and Perelson 1993)
[Figure not reproduced. The diagram shows two linked cycles, labeled Replicative Cycle and Inhibitory Cycle.]

The earliest mathematical model of plasmid regulation was on the synthetic, phage-derived, ldv plasmid (Lee and Bailey 1984). However, the majority of early modeling work focused on explaining the mechanisms of copy number control found in the ColE1 and R1 plasmids. These provide an interesting contrast: ColE1 is a high copy number plasmid lacking an active partitioning system, while R1 is a low copy number plasmid with an active partitioning system.

ColE1 as a High Copy Number Model

Initial work on copy number control in ColE1 focused on the role of Rom protein in stabilizing the complex between the inhibitor RNA I and the preprimer RNA II (Atai and Shuler 1986; Perelson and Brendel 1989). Atai and Shuler (1986) compared model prediction with experimental data on strains with or without Rom as


a way of testing different hypotheses about the mechanism of action of Rom. The experimental data were consistent with a model that encoded increased binding of RNA I to RNA II by Rom, but not consistent with the hypothesis that Rom might increase the susceptibility of RNA II to endonuclease action, allowing the latter hypothesis to be rejected. Later models continued to highlight the importance of the rate of formation of the RNA I–RNA II complex in copy number control (Brendel and Perelson 1993). The most sophisticated modeling showed that the multistep association between RNA I and RNA II leads to a much finer “exponential” control of inhibition, as opposed to the “hyperbolic” control provided by a single step (Paulsson et al. 1998). This in turn leads to tighter control of copy number. By developing a stochastic model of copy number control, it was shown that high transcription of RNA II combined with multistep inhibition provides an optimal trade-off between segregational stability and metabolic cost (Paulsson and Ehrenberg 1998). Moreover, in order to achieve a similar level of segregational stability under single-step control, plasmid copy number would need to be 1.4 times higher than with multistep control.

Additional Lessons from Low Copy Number Plasmid R1

The modeling of copy number regulation in R1 plasmids followed a similar pattern. The earliest models (on the related NR1 or R100 plasmid) identified the rate of translation of the RepA protein as the most important parameter governing copy number control (Womble and Rownd 1986). Later models showed how the multistep accumulation of RepA molecules at the origin of plasmid replication, with translation of RepA inhibited by CopA, could explain the “eclipse time,” i.e., the delay between successive plasmid replications, and runaway plasmid replication in CopA mutant strains (Ehrenberg and Sverredal 1995).
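Both systems above rely on multistep mechanisms. For the ColE1 case, the difference between “hyperbolic” and “exponential” inhibition can be illustrated with a minimal numerical sketch. The functional forms below are common idealizations assumed here for illustration; they are not taken from the cited models, and the function names and the constant k are hypothetical.

```python
import math


def hyperbolic_escape(inhibitor: float, k: float) -> float:
    """Single-step inhibition: probability that a primer escapes the inhibitor."""
    return 1.0 / (1.0 + inhibitor / k)


def multistep_escape(inhibitor: float, k: float, steps: int) -> float:
    """Escape probability when inhibition can act at `steps` successive stages.

    Assumed form: each stage is hyperbolic with constant steps * k, so one
    step reproduces hyperbolic_escape and many steps approach exp(-c / k).
    """
    return (1.0 + inhibitor / (steps * k)) ** (-steps)


def exponential_escape(inhibitor: float, k: float) -> float:
    """Limiting case of very many inhibition steps."""
    return math.exp(-inhibitor / k)


if __name__ == "__main__":
    k = 1.0  # arbitrary inhibition constant (illustrative units)
    for c in (0.5, 1.0, 2.0, 4.0):
        print(
            f"c={c}: hyperbolic={hyperbolic_escape(c, k):.3f}  "
            f"10-step={multistep_escape(c, k, 10):.3f}  "
            f"exponential={exponential_escape(c, k):.3f}"
        )
```

Under these assumptions, doubling the inhibitor concentration can at most halve the escape probability in the hyperbolic case but squares it in the exponential case, which is one way to read the statement that multistep inhibition gives finer control.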
Finally, the precise impact of the multistep accumulation of RepA proteins on plasmid copy number was explored with a stochastic model (Paulsson and Ehrenberg 2000). Paulsson and Ehrenberg used their models of plasmid regulation to discuss in detail the comparison between the multistep copy number control mechanisms in ColE1 and R1 plasmids. The ColE1 copy number control system can correct variations due to segregation, but at the cost of a higher copy number. This is necessary because of the absence of an active partitioning system. In contrast, R1 has a highly efficient partitioning system (Nordström et al. 1980). The translational control mechanism of R1 minimizes variation in copy number due to plasmid replication; although it cannot correct variations in copy number due to segregation, this is not needed because of the active partitioning system. Thus, the partitioning and replicative mechanisms in R1 have evolved to minimize two different sources of copy number variation.

Plasmids with More Complex Control Circuitry

Some efforts have now been made to model plasmids with more complex regulatory mechanisms, such as RK2. This plasmid has 74 genes and multiple regulatory circuits for replication, conjugation, and stable inheritance, controlled by seven transcriptional regulators, including four global regulators. Efforts to understand the regulatory mechanism of this plasmid have focused on its central control operon, in which two of the global regulators, KorA and KorB, negatively and cooperatively regulate their own expression. Because this is a more complicated system, parameter identification and evaluation are considerably harder, and modeling efforts are aided by combining mathematical models with Bayesian Monte Carlo statistical inference to allow better estimation of parameters from a wider range of data (Herman et al. 2011). Modeling work has been used to evaluate the potential evolutionary benefits of the cooperative regulation (Herman et al. 2012), leading to the conclusion that the regulatory architecture has a major impact on the host energy required to express these plasmid proteins (Fig. 2; adapted from Herman et al. 2012).
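One property compared across regulatory architectures in Fig. 2b is the time taken to reach half of the steady-state regulator concentration. The sketch below illustrates that idea with a generic one-variable model of negative autoregulation. It is not the RK2 model of Herman et al.: the single regulator, the Hill-type repression term, and all parameter values are assumptions chosen only so that both systems share the same steady state.

```python
import math


def simulate(production, decay=1.0, dt=0.001, steps=20000):
    """Forward-Euler integration of dx/dt = production(x) - decay * x from x = 0."""
    x = 0.0
    trajectory = [x]
    for _ in range(steps):
        x += dt * (production(x) - decay * x)
        trajectory.append(x)
    return trajectory


def time_to_half(trajectory, dt=0.001):
    """First time at which the trajectory reaches half of its final value."""
    target = 0.5 * trajectory[-1]
    for i, x in enumerate(trajectory):
        if x >= target:
            return i * dt
    raise ValueError("half of the final value was never reached")


def unregulated(x):
    """Constant production; steady state is 1.0 when decay = 1.0."""
    return 1.0


def autorepressed(x):
    """Hill-type self-repression (hypothetical parameters); steady state is also 1.0."""
    return 5.0 / (1.0 + (x / 0.5) ** 2)


if __name__ == "__main__":
    for label, production in (("unregulated", unregulated), ("autorepressed", autorepressed)):
        trajectory = simulate(production)
        print(f"{label}: steady state {trajectory[-1]:.3f}, "
              f"time to half {time_to_half(trajectory):.3f}")
```

In this toy setting the self-repressed system starts with a higher production rate and is throttled as the regulator accumulates, so it reaches half of the shared steady state sooner than the unregulated system.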
Plasmid Regulatory Systems, Modeling, Fig. 2 Adaptation for protein synthesis efficiency in RK2 plasmids. Comparison of the ability of four different regulatory networks to optimize different desirable properties while maintaining a set level of protein abundance: CCO is the wild-type system; CCOnoC is a system with two regulators but with no cooperativity between the regulators; CCOregB is a system with just a single dimeric regulator; CCOnoR is a system with no regulation. Bar heights are means and error bars are standard errors across 20 replicates. (a) Fluctuations in KorB regulator concentration at steady state. There is little improvement in protein fluctuations between the models, with a minor improvement observed when a second regulator is introduced. (b) Time to reach half of the mean KorA concentration: the systems with strong regulation reach half of the mean KorA concentration more quickly than those with weak or no regulation, but there is little difference between the systems that do or do not include cooperativity. (c) Burden to host, as measured by the number of mRNA molecules produced per generation after the model has reached steady state. There is a small reduction in mRNA usage after the introduction of a single regulator; a second regulator brings a 20-fold improvement in mRNA usage; the introduction of cooperativity brings a further threefold improvement (Figure and legend adapted from Herman et al. 2012)
[Figure not reproduced. Panel axes: (a) σ/μ; (b) Time [s]; (c) mRNA [molecules]. Each panel compares CCO, CCOnoC, CCOregB, and CCOnoR.]

Perspective

In conclusion, mathematical modeling has had considerable success in explaining the architecture of plasmid regulatory systems. This success derives from building dynamical models that are mechanistic and realistic and that are founded upon and can be tested against experimental data. Looking to the future, mathematical modeling of plasmid regulation is likely to play an increasingly important role in the rational design of biological agents with desired behaviors in synthetic biology and bioengineering. Bioengineering is likely to play a prominent role in the future commercial production of chemicals, including fuel, enzymes, plastics, drugs, and food products. In this scenario, mathematical modeling of plasmid regulation will become increasingly prevalent in the rational design of new synthetic biological agents.

Cross-References

▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks
▶ Gene Regulation
▶ Mathematical Models in the Sciences
▶ Plasmid Genomes, Introduction to
▶ Synthetic Plasmid Biology

References

Atai MM, Shuler ML (1986) Mathematical model for the control of ColE1 type plasmid replication. Plasmid 16:204–212
Brendel V, Perelson AS (1993) Quantitative model of ColE1 plasmid copy number control. J Mol Biol 229:860–872
Chen KC, Calzone L, Csikasz-Nagy A, Cross FR, Novak B, Tyson JJ (2004) Integrative analysis of cell cycle control in budding yeast. Mol Biol Cell 15:3841–3862
Ehrenberg M, Sverredal A (1995) A model for copy number control of the plasmid R1. J Mol Biol 246:472–485
Herman D, Thomas CM, Stekel DJ (2011) Global transcription regulation of RK2 plasmids: a case study in the combined use of dynamical mathematical models and statistical inference for integration of experimental data and hypothesis exploration. BMC Syst Biol 5:119
Herman D, Thomas CM, Stekel DJ (2012) Adaptation for protein synthesis efficiency in a naturally occurring self-regulating operon. PLoS One 7(11):e49678
Lee SB, Bailey JE (1984) A mathematical model for ldv plasmid replication: analysis of wild-type plasmid. Plasmid 11:151–165
Nordström K, Molin S, Aagard-Hansen H (1980) Partitioning of plasmid R1 in Escherichia coli I. Kinetic loss of plasmid derivatives deleted of the par region. Plasmid 4:215–227
Paulsson J, Ehrenberg M (1998) Trade-off between segregational stability and metabolic burden: a mathematical model of plasmid ColE1 replication control. J Mol Biol 279:73–88
Paulsson J, Ehrenberg M (2000) Molecular clocks reduce plasmid loss rates: the R1 case. J Mol Biol 297:179–192
Paulsson J, Nordström K, Ehrenberg M (1998) Requirements for rapid plasmid ColE1 copy number adjustments: a mathematical model of inhibition modes and RNA turnover rates. Plasmid 39:215–234
Perelson AS, Brendel V (1989) Kinetics of complementary RNA-RNA interaction involved in plasmid ColE1 copy number control. J Mol Biol 208:245–255
Womble DD, Rownd RH (1986) Regulation of IncFII plasmid DNA replication: a quantitative model for control of plasmid NR1 replication in the bacterial cell cycle. J Mol Biol 192:529–548

Plasmids as Secondary Chromosomes

Max Mergeay and Rob Van Houdt
Unit of Microbiology, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium

Synonyms

Chromid; Megaplasmid; Secondary chromosome


Synopsis

Large replicons secondary to the main chromosome have been termed both “second chromosomes” if they carry essential genes and are indispensable for cell viability and “megaplasmids” if they do not use chromosome-type but plasmid-type replication systems. Recently, the term “chromid” was introduced to distinguish this replicon as it is neither a chromosome nor a plasmid. Three criteria were defined: (i) chromids have plasmid-type maintenance and replication systems, (ii) chromids have a nucleotide composition close to that of the chromosome, and (iii) chromids carry core genes that are found on the chromosome in other species (Harrison et al. 2010). Although this adds to the complexity of the nomenclature found in the literature, it reflects the necessity to clearly differentiate these types of replicons.
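The three criteria can be read as a simple conjunction. The following is a minimal sketch of that reading, assuming that “nucleotide composition close to that of the chromosome” is approximated by a difference in GC fraction below a tolerance. The class, field names, tolerance, and example values are all hypothetical and serve only to illustrate the logic; they are not taken from Harrison et al. (2010).

```python
from dataclasses import dataclass


@dataclass
class Replicon:
    name: str
    replication_system: str   # "plasmid-type" or "chromosome-type"
    gc_fraction: float        # GC content as a fraction between 0 and 1
    carries_core_genes: bool  # core genes found on the chromosome in other species


def is_chromid(replicon: Replicon, chromosome_gc: float, gc_tolerance: float = 0.01) -> bool:
    """Apply the three criteria; gc_tolerance is an illustrative assumption."""
    plasmid_type = replicon.replication_system == "plasmid-type"            # criterion (i)
    similar_gc = abs(replicon.gc_fraction - chromosome_gc) <= gc_tolerance  # criterion (ii)
    return plasmid_type and similar_gc and replicon.carries_core_genes      # criterion (iii)


if __name__ == "__main__":
    # Made-up replicons of a made-up genome whose chromosome has GC fraction 0.635.
    candidates = [
        Replicon("replicon_A", "plasmid-type", 0.630, True),
        Replicon("replicon_B", "plasmid-type", 0.550, False),
    ]
    for candidate in candidates:
        print(candidate.name, is_chromid(candidate, chromosome_gc=0.635))
```

A replicon that fails any one criterion, for example a large plasmid with divergent nucleotide composition and no core genes, is not classed as a chromid under this reading.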

Introduction

Large replicons in addition to the chromosome that are indispensable for cell viability but do not use chromosome-like replication systems are increasingly being discovered through the exponential increase in the number of sequenced bacterial genomes. Different terms (second chromosome or megaplasmid) have been used, but they do not reflect the nature of this replicon as it is neither a chromosome nor a plasmid. This entry highlights the incidence and peculiarities of this replicon, recently termed “chromid.”

Discussion

Of the 1,625 complete bacterial genomes in the NCBI database as of September 2011, 102 carry a chromid. Most of these (87) are from the proteobacteria, including 10 genera from the α-proteobacteria, 4 from the β-proteobacteria, and 4 from the γ-proteobacteria. Chromids are also found in members of the Bacteroidetes, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, and Spirochaetes (Table 1). The habitats of these strains are diverse, being aquatic, terrestrial, or host associated (including symbionts and pathogens of plants, animals, and humans). This diversity in lifestyles and the fact that carrying a chromid is genus specific indicate that their distribution is based on phylogeny rather than ecology. For instance, both Cupriavidus metallidurans CH34 and Delftia acidovorans SPH-1 thrive in similar environments, have similar genome sizes (6.91 and 6.77 Mb, respectively), and share approximately 64% of their proteins (including those encoded by multiple mobile genetic elements). However, C. metallidurans CH34 includes a chromosome, a chromid, and two megaplasmids, while D. acidovorans SPH-1 has only one chromosome.

Plasmids as Secondary Chromosomes, Table 1 Occurrence of chromids in bacterial genomes

Phylum                   Order                    No.
Alphaproteobacteria      Caulobacterales          1
                         Rhizobiales              23
                         Rhodobacterales          7
                         Sphingomonadales         2
Betaproteobacteria       Burkholderiales          36
Gammaproteobacteria      Alteromonadales          2
                         Vibrionales              16
Chloroflexi              Sphaerobacterales        1
Cyanobacteria            Chroococcales            2
Bacteroidetes/chlorobi   Bacteroidales            1
Firmicutes               Clostridiales            1
Spirochaetes             Spirochaetales           6
Deinococcus-thermus      Deinococcales            3
Other bacteria           Unclassified bacteria    1
Total                                             102

The following paragraphs highlight some peculiarities of chromids, illustrated with emphasis on the closely related genera Cupriavidus and Ralstonia (β-proteobacteria, Burkholderiaceae). The illustrations are not exhaustive but cover a variety of observations and functions, which may be suggestive of the specific role of chromids in β-proteobacterial gene evolution and, more generally, in the construction of a bacterial horizontal gene pool.

1. Chromids carry core genes (housekeeping and central metabolism): all chromids from the Cupriavidus and Ralstonia genera carry genes coding for the biosynthesis of trehalose, the metabolism of galactarate, the Entner-Doudoroff pathway of glycolysis, nitric oxide reductase, and assimilatory nitrate reductase and carry almost all genes involved in chemotaxis and the synthesis of flagella.
2. Chromids carry genus-specific genes: for example, approximately 40 chromosomal genes conserved in the Cupriavidus genus are chromid-borne within the Ralstonia genus.
3. Chromids often carry specialized genes beneficial in certain niches or conditions. While the megaplasmids of C. metallidurans CH34 (pMOL28 and pMOL30) play a major role in the adaptation to high concentrations of bioavailable metals (Mergeay et al. 2003, 2009; Monchy et al. 2007; Janssen et al. 2010), the C. metallidurans CH34 chromid is also rich in gene clusters involved in resistance to heavy metals, especially for the HME-RND (for Heavy Metal Efflux-Resistance Nodulation Division) family of tri-component efflux systems (Mergeay et al. 2003; Janssen et al. 2010). The HME-RND complex spans the complete cell wall and mediates efflux via a chemiosmotic gradient from the periplasm to the exterior of the cell. The archetype is CzcCBA, which is encoded by genes of pMOL30 and confers resistance to Co, Zn, and Cd (see Monchy et al. 2007). The HME-RND systems in C. metallidurans CH34 comprise two groups. One group is conserved in all Cupriavidus chromids, including pumps such as HmyCBA and HmvCBA (both not yet characterized). The other group is conserved in Ralstonia chromids, including pumps such as ZneCBA (Zn) and ZniCBA (not yet characterized). Thus Cupriavidus chromids may share a common heritage with respect to response to heavy metals, while the C. metallidurans chromid has recruited supplementary genetic determinants from other sources.
Other examples for niche specification include the following: symbiotic plant host invasion on pSymB of Sinorhizobium meliloti and pathogenic plant host invasion on the chromid of Ralstonia solanacearum, biodegradation of acetone on Cupriavidus chromids, metabolism of phosphonoacetaldehyde on Ralstonia chromids, and catabolism of aromatic compounds on the chromid of Ralstonia pickettii 12J. In fact, catabolic genes involved in the biodegradation of aromatic compounds are located more often on chromids in Burkholderiales (Perez-Pantoja et al. 2011).
4. Chromids may be evolutionary test beds where genes are weakly preserved and evolve more rapidly (Cooper et al. 2010). Bavishi et al. (2010) also concluded that chromids evolve faster than primary chromosomes, resulting in different conservation at the inter- and intraspecies levels. Thus, chromid gene conservation is around 27% for the sequenced Cupriavidus species (C. metallidurans, C. pinatubonensis JMP134, C. taiwanensis, and C. eutrophus H16) (Janssen et al. 2010) and around 52% for C. metallidurans strains (Van Houdt et al. 2012). Four potential mechanisms were proposed by Cooper et al. (2010). First, delayed replication of the chromid could limit gene copy number and minimize expression, which has been recently shown for Vibrio parahaemolyticus (Dryselius et al. 2008). Reduced chromid gene expression is also observed for Cupriavidus metallidurans CH34 (unpublished results). Second, chromid genes appear to be more dispensable. Third, chromids exhibit increased homologous recombination. The last and most speculative mechanism is a systematically higher mutation rate (e.g., increased rate of transversions and dNTP pool asymmetry).
5. Chromids depend on plasmid-related replication and partitioning systems. Alphaproteobacterial chromids use a repABC-based system, which is also the abundant plasmid type in α-proteobacteria, while others have an iteron-based replication. In the Cupriavidus and Ralstonia chromids, the gene cluster cspA repA csp parA parB is strongly conserved. Specifically for C. metallidurans CH34, the repA upstream region contains three putative DnaA boxes with only one mismatch to the consensus, at locations –392, –419, and –1,080 with respect to the start codon of repA. This region also contains at least 12 repetitive 17 nt-long elements with a highly conserved motif that may serve as RepA binding sites (Janssen et al. 2010). For all Cupriavidus chromids this conserved cluster is extended downstream of the parB gene with a xerD-like gene coding for a tyrosine-based site-specific recombinase, which could putatively assist in resolving chromid dimers. The partition proteins belong to the parAB superfamily, as for most chromids, which is widely distributed on both chromosomal and plasmid replicons. However, chromid- and plasmid-encoded Par proteins tend to group in phylogenetic analyses and separate from those encoded by the primary chromosome (Dubarry et al. 2006). Currently, seven Bsr (Burkholderiales secondary replicon) families of Par systems are identified in Burkholderiales that are exclusive to chromids and large plasmids, which further indicates this phylogenetic linkage (Passot et al. 2012). These families include, among others, family Bsr2 comprising chromosome 2 of Burkholderia species, Bsr3 comprising chromosome 3 of the Burkholderia cepacia complex, and Bsr4 comprising the Par systems of Cupriavidus/Ralstonia chromids. Putatively, Ralstonia pickettii 12J could be a remarkable case where a chromosome 3 (or second chromid) is emerging, with an integrated 380 kb plasmid carrying a Bsr3-like Par system similar to that of chromosome 3 of the Burkholderia cepacia complex (Passot et al. 2012). Finally, unlike plasmid replication, chromid replication needs to be coordinated with the cell cycle. Therefore, additional regulators are required in addition to the basic mechanisms of plasmid copy number control.

Thus chromids correspond well to the chimeric structure of their name with typical chromosomal and plasmidic features. Chromids arise rarely, as they are only found in a minority of bacteria, but appear to be stably maintained.
However, once a chromid becomes a stable part of the host, there does not appear to be an extra burden associated with it compared to maintaining one large replicon (with similar total genome size), considering the successful colonization of a wide range of different niches by both types of genome organization [e.g., Vibrio (chromid) and Pseudomonas (no chromid)]. Some Burkholderia genomes even carry more than one chromid and have an extraordinary total number of replicons. They are also remarkably ubiquitous and versatile (soil and water, plant symbiosis, and opportunistic pathogenicity).

Cross-References

▶ Conjugative Transfer Systems and Classifying Plasmid Genomes
▶ DNA Replication
▶ Metamobilomics – The Plasmid Metagenome of Natural Environments
▶ Plasmid Genomes, Introduction to
▶ Plasmids, Naming and Annotation of
▶ Theta-Replicating Plasmids, Large
▶ Transposable Elements and Plasmid Genomes

References

Bavishi A, Abhishek A, Lin L et al (2010) Complex prokaryotic genome structure: rapid evolution of chromosome II. Genome 53:675–687
Cooper VS, Vohr SH, Wrocklage SC et al (2010) Why genes evolve faster on secondary chromosomes in bacteria. PLoS Comput Biol 6:e1000732
Dryselius R, Izutsu K, Honda T et al (2008) Differential replication dynamics for large and small Vibrio chromosomes affect gene dosage, expression and location. BMC Genomics 9:559
Dubarry N, Pasta F, Lane D (2006) ParABS systems of the four replicons of Burkholderia cenocepacia: new chromosome centromeres confer partition specificity. J Bacteriol 188:1489–1496
Harrison PW, Lower RP, Kim NK et al (2010) Introducing the bacterial ‘chromid’: not a chromosome, not a plasmid. Trends Microbiol 18:141–148
Janssen PJ, Van Houdt R, Moors H et al (2010) The complete genome sequence of Cupriavidus metallidurans strain CH34, a master survivalist in harsh and anthropogenic environments. PLoS ONE 5:e10433
Mergeay M, Monchy S, Vallaeys T et al (2003) Ralstonia metallidurans, a bacterium specifically adapted to toxic metals: towards a catalogue of metal-responsive genes. FEMS Microbiol Rev 27:385–410
Mergeay M, Monchy S, Janssen PJ et al (2009) Megaplasmids in Cupriavidus genus and metal resistance. In: Schwartz E (ed) Microbial megaplasmids. Springer, Berlin, pp 209–238
Monchy S, Benotmane MA, Janssen PJ et al (2007) Plasmids pMOL28 and pMOL30 of Cupriavidus metallidurans are specialized in the maximal viable response to heavy metals. J Bacteriol 189:7417–7425
Passot FM, Calderon V, Fichant G et al (2012) Centromere binding and evolution of chromosomal partition systems in the Burkholderiales. J Bacteriol 194:3426–3436
Perez-Pantoja D, Donoso R, Agullo L et al (2011) Genomic analysis of the potential for aromatic compounds biodegradation in Burkholderiales. Environ Microbiol 14:1091–1117
Van Houdt R, Monsieurs P, Mijnendonckx K et al (2012) Variation in genomic islands contribute to genome plasticity in Cupriavidus metallidurans. BMC Genomics 13:111

Plasmids, Naming and Annotation of

Laura S. Frost1 and Christopher M. Thomas2
1 Department of Biological Sciences, University of Alberta, Edmonton, AB, Canada
2 Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, UK

Synopsis

Genome sequences are being added to the public databases at phenomenal rates as sequencing becomes faster and cheaper. Plasmids are an important subset of extrachromosomal elements that often make significant contributions to the character of their hosts. Predicting these potential attributes depends on analyzing and cataloging the plethora of sequencing data in consistent and sensible ways. Currently, there is no consensus within the plasmid community on plasmid and gene names or how to handle the annotation of plasmids during the submission process to databases such as GenBank, but there are good models which can form the basis of a general naming system. It is also important to have clearer rules for the naming of plasmid core functions such as replication, partitioning, and conjugative transfer, among others. This entry explores these issues and makes some proposals for a more sustainable and rational system for plasmid naming, annotation, and analysis as consensus is achieved among plasmid biologists. Based on the system used for naming plasmids from the Rhizobiaceae, each natural plasmid should be given a unique name that, wherever possible, indicates its natural host, the host used in plasmid capture experiments (exogenous isolation), or the source in metagenome sequencing projects. This unique designation will allow less ambiguous linkage to relevant experimental data.

Introduction

From the perspective of those interested in plasmid genomes, the current standard of plasmid annotation is far from ideal, despite the large amount of effort that goes into the deposition of plasmid sequences in public databases (Klimke et al. 2011). This can be contrasted with the better defined classification and annotation of viruses and bacteriophages (Brister et al. 2010). Plasmids do not have the defining features of phages (morphology, nucleic acid type, etc.) and often defy classification. Although the naming of a plasmid and its bacterial host is required for submission to NCBI (National Center for Biotechnology Information), GenBank, and other databases (see INSDC at http://www.insdc.org/, the International Nucleotide Sequence Database Collaboration between three databases GenBank, DDBJ, and ENA), the natural hosts for promiscuous plasmids and plasmids identified via metagenomic sequencing projects are often not known. In addition, plasmid capture experiments using exogenous isolation techniques (e.g., Sen et al. 2011), by their very nature, make it impossible to identify the original host. Plasmids are also extremely adept at changing their composition in ways that phages are incapable of doing. Plasmids consist of backbone replication genes plus a wide assortment of accessory genes including, but not limited to, conjugation and mobilization; resistance to antibiotics, heavy metals, pollutants, and organic solvents; virulence determinants; and mobile elements (transposons, insertion sequences, integrons, etc.) as well as many genes of known and unknown functions that appear to be duplicates of chromosomal genes. They also undergo deletion, amplification, recombination, and mutation resulting in inactivated replicons, pseudogenes, and many other genetic peculiarities. These features are not usually found in phages since maintenance of their genetic content and size constraints imposed by their capsid dimensions ensure a more rigorous uniformity. Because there is no formal taxonomy scheme for plasmids, they present unusual challenges not found in other mobile genetic elements, which has led to ad hoc solutions surrounding the naming of plasmids and their genes.

The Key Issues

Natural plasmids can be divided into three groups: historically important plasmids, newly discovered plasmids that are studied in detail at the level of gene function, and plasmids that are by-products of genome or metagenome sequencing projects. Many of the latter, in all likelihood, will never receive much individual attention although they may inform the properties and evolution of a plasmid group. Any proposal for naming plasmids and their genes needs to accommodate these possibilities. The issues associated with plasmid sequences in the databases can be boiled down to two distinct points: naming and annotation. This entry raises some of these issues and makes suggestions to help resolve them, but there is a need for the wider plasmid community to debate them and adopt acceptable and sensible standards.

Plasmid naming is inconsistent and confusing for a variety of reasons. A process for the naming of new plasmids, including synthetic plasmids, was proposed by Novick et al. (1976). Plasmids were to be named using a lower case “p” followed by an alphanumeric designation composed of a two-letter combination that reflected the researcher’s institution followed by a number. Thus, pUC18 was the 18th plasmid from the University of California. However, the enormous number of plasmids that have been isolated or constructed since the mid-1970s has overwhelmed these simple rules. Either the rules are not followed or duplicate names arise because there is no simple way to check whether a plasmid name is unique. Added to this, the names of many of the well-studied plasmids such as F, R1, RK2, and ColE1 do not conform to these rules. Perhaps if these plasmids had been given new names 35 years ago, the naming of plasmids would have developed differently. Similarly, there has been no clear distinction between natural plasmids that make up the genomes of one or more strains or species and plasmids that have been created by recombinant DNA techniques or more recently by synthetic biology.

Plasmid annotation (naming of genes) is also in disarray for many reasons. Aside from inconsistent naming of genes and gene products or the lack of annotation altogether, the legacy of different names for the same functions has exacerbated the problem. The transfer genes in plasmid F were named in the order in which they were identified using groundbreaking genetic techniques (Achtman et al. 1971). Thus, there is traYALEKBZ (in no particular order; traZ was later dropped) followed by trbA-J (Frost et al. 1994). The transfer genes in plasmid RP4 were named, more usefully, according to their position in the two main operons Tra1 and Tra2 (traABC-M and trbABC-P) (Pansegrau et al. 1994). The virulence locus of Agrobacterium tumefaciens Ti plasmid, which is involved in tumorigenesis in plants, consists of seven virulence operons, virA-G, with the cistrons in these operons being sequentially numbered, e.g., virB1-11. The VirB1-11 gene products as well as VirD4 have become paradigms for the type IV secretion system (T4SS) and associated coupling protein (T4CP), respectively (Alvarez-Martinez and Christie 2009). The coupling protein, which is required for conjugation, is named TraD in F, TraG in RP4, and VirD4 in the Ti plasmid.
However, F also encodes a mating pair stabilization protein called TraG, which has an unrelated function. Well-meaning researchers have adopted gene names from these and other paradigmatic systems without regard to the function of the gene or the position of the cistron within the operon. For instance, many coupling proteins involved in conjugation are named VirD4 even though there is no evidence that they are involved in virulence or that they are the fourth gene in the fourth (D) operon within a vir locus.
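The naming conflict described above can be made explicit by indexing the same records in both directions. The short sketch below uses only the protein names and functions mentioned in the text; the record layout and the helper function are hypothetical.

```python
from collections import defaultdict

# (plasmid, protein name, function), as stated in the text above.
RECORDS = [
    ("F", "TraD", "coupling protein"),
    ("RP4", "TraG", "coupling protein"),
    ("Ti", "VirD4", "coupling protein"),
    ("F", "TraG", "mating pair stabilization"),
]


def group(records, key_index, value_index):
    """Map each distinct key field to the set of value fields seen with it."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record[key_index]].add(record[value_index])
    return dict(grouped)


functions_by_name = group(RECORDS, 1, 2)
names_by_function = group(RECORDS, 2, 1)

# Names that denote more than one function across plasmids.
ambiguous_names = sorted(
    name for name, functions in functions_by_name.items() if len(functions) > 1
)

if __name__ == "__main__":
    print("Names used for the coupling protein:",
          sorted(names_by_function["coupling protein"]))
    print("Names with more than one function:", ambiguous_names)
```

One function carries three names, and one name (TraG) carries two functions, so neither a name-based nor a function-based search of annotations is reliable without curation.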

Plasmids in GenBank

Several years ago, bacterial plasmids had their own link on the GenBank home page (http://www.ncbi.nlm.nih.gov/genbank/). However, the number of plasmids is now enormous, and the NCBI deemed some overarching organizational framework necessary. It has now moved toward a system based on the phylogeny of related organisms, with plasmids found within certain species or subspecies listed alongside the host. The best list of complete plasmids, which includes plasmids with no known host that are included in RefSeq (see below), can be found within the NCBI FTP directory at the following address: ftp://ftp.ncbi.nih.gov/genomes/Plasmids/. Although the NCBI website has become extremely complicated, with many subdirectories and links to web pages/pdfs that must be read, the annotation of sequences is fairly well explained. A useful overview is provided at http://www.ncbi.nlm.nih.gov/books/NBK21105/. The RefSeq platform (http://www.ncbi.nlm.nih.gov/RefSeq/) is also very helpful in providing examples of sequences annotated to GenBank standards. This includes organisms with multiple plasmids; the suggestions for naming these plasmids and the genes therein in a sequential order are well illustrated. It is interesting to compare the annotation for “plasmid F” in GenBank (AP001918.1) and RefSeq (NC_002483.1) and see the level of detail in the latter that was added to the original GenBank submission. The gold standards for annotation are probably the entries for E. coli K12 in GenBank at U00096 and in RefSeq at NC_000913, both of which have consistent annotation and are updated on a semiregular basis. This illustrates the importance of updating annotations as more information accumulates, a process that is currently lacking in most cases.

A Proposal for Naming Plasmids
Although plasmid biologists generally agree that plasmid names should follow the scheme proposed by Novick et al. (1976), i.e., the format pAlphanumeric, using more than two letters where needed, this format has deficiencies with respect to both uniqueness and information content. A system based on plasmids from Rhizobiaceae (Sciaky et al. 1978) could be used as the basis for creating the formal, unique name for each plasmid. As detailed below, the name starts with “p” to indicate “plasmid” and is followed by a series of elements that are unique for that plasmid and give information about its host, strain, and sample source, or by the locus tag (which reflects this information), and then by a letter (for known plasmids using this system) or number (for newly identified plasmids) that is unique when more than one plasmid is present within a given genome or metagenome. For plasmids discovered and analyzed by classic procedures, the “p” designation should still be used as soon as a plasmid is identified but, for sequencing-based discovery, only complete, annotated plasmids should receive the “p” designation. Contigs that appear to contain plasmid sequences should be left as contigs until it is clear that the whole sequence has been assembled and at least a preliminary annotation is completed. The rules for handling contigs are discussed at http://www.ncbi.nlm.nih.gov/genbank/wgs under Whole Genome Shotgun Submissions (WGS). The “p” should be followed by a contraction of the genus and species names, as in the case of restriction enzymes (Roberts et al. 2003). Thus, Eco would signify E. coli. This would be followed by the strain designation, for instance, EcoK12, so that the F plasmid would be pEcoK12_1. In the name of the plasmid pRetCFN42f, “Ret” is derived from its host, Rhizobium etli, CFN42 is the strain designation, and “f” indicates that it is the sixth and largest plasmid found in the strain.
This was (and is) a useful name for this plasmid even before it was sequenced. The proper name could be altered slightly to pRetCFN42_6. Note the use of an underscore between the strain designation and the plasmid number. The reason for changing over to numbers is to allow an open-ended system not limited by the number of letters available. It would be helpful to have standard categories for plasmids that identify their source. Thus, natural plasmids could be split into endogenously isolated (isolated from a natural host, pEND), exogenously isolated (isolated by capture into a permissive host, pEXO), or reconstructed virtually from a metagenome project (pMET). Unnatural plasmids can be split into vectors (pVEC), constructs derived from vectors (pCON), derivatives of natural plasmids (pDER), or synthetic plasmids (pSYN). At some stage, these terms could also be incorporated into the long version of the plasmid name. As examples, pEcoK12_1 would be pEND_EcoK12_1 and pRetCFN42_6 would be pEND_RetCFN42_6. Alternatively, plasmids that have been sequenced and submitted to GenBank could make use of the locus tag, which is unique for each sequencing project, to generate a shorter, informal plasmid name. The locus tag is used to construct systematic gene identifiers for each gene within a complete genome (chromosome and plasmids). It is an alphanumeric string of 3–12 characters in which the first character must not be a digit (http://www.ncbi.nlm.nih.gov/genbank/genomesubmit#locus_tag). The locus tag contains information about the host and strain as well as the method of isolation that is required by GenBank during the sequence submission process. If the locus tag (REH) had been used for pEND_RetCFN42_6, the plasmid name would be pREH_6, and pEND_EcoK12_1 would be pFpla_1, where Fpla is the locus tag for the F plasmid. In the case of exogenously isolated plasmids, the host used in the plasmid capture experiment and the source could be described in a series of codes of three or more letters.
Thus, a plasmid isolated from Mildred Lake in Northern Alberta, using exogenous techniques, would be named pEXO_PaeO_LMI_1, which indicates that Pseudomonas aeruginosa strain O was used to exogenously capture plasmid “1” from an isolate from Lake Mildred (LMI). Alternatively, if the plasmid sequence was submitted to GenBank, the locus tag could be used to give pPAO_1 where PAO is the locus tag for this sequencing project.
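
The naming rules proposed above can be sketched in a few lines of Python. This is only an illustration of the proposal, not an NCBI or community tool: the function names (host_code, formal_name, informal_name) are hypothetical, and the locus tag check encodes only the rule quoted in the text (3–12 alphanumeric characters, first character not a digit).

```python
import re

# Source categories proposed in the text (pEND, pEXO, pMET, pVEC, pCON, pDER, pSYN).
CATEGORIES = {"END", "EXO", "MET", "VEC", "CON", "DER", "SYN"}

# Locus tag rule as quoted in the text: 3-12 alphanumeric characters,
# and the first character must not be a digit.
LOCUS_TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


def host_code(genus, species):
    """Contract a binomial name as for restriction enzymes, e.g. 'Eco'."""
    return genus[0].upper() + species[:2].lower()


def formal_name(category, parts, number):
    """Join category, descriptive elements, and plasmid number with underscores."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown plasmid category: {category}")
    if number < 1:
        raise ValueError("plasmid numbers start at 1")
    return "_".join([f"p{category}", *parts, str(number)])


def informal_name(locus_tag, number):
    """Build the short name from a locus tag, e.g. pREH_6."""
    if LOCUS_TAG_RE.fullmatch(locus_tag) is None:
        raise ValueError(f"invalid locus tag: {locus_tag}")
    return f"p{locus_tag}_{number}"


# Examples taken from the text.
print(formal_name("END", [host_code("Escherichia", "coli") + "K12"], 1))
print(formal_name("END", [host_code("Rhizobium", "etli") + "CFN42"], 6))
print(formal_name("EXO", [host_code("Pseudomonas", "aeruginosa") + "O", "LMI"], 1))
print(informal_name("REH", 6))
```

Keeping the descriptive elements as a list lets the same function cover the shorter endogenous names and the longer exogenous names, which carry an extra source code.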

For plasmids identified within metagenome sequencing projects, the locus tag prefix assigned once the Metagenome BioProject ID has been given could again be used. See http://www.ncbi.nlm.nih.gov/genbank/metagenome for details. The source should be designated in a way that is acceptable to GenBank, and the plasmids should be numbered rather than lettered since this does not limit the number of plasmids that could be found. Thus, pMET_BHO_SKN_4 would be the formal name for the 4th plasmid from skin (SKN would be a GenBank-approved designation) in a metagenomic project from Birmingham Hospital, which has been given the locus tag BHO. Using only the locus tag, the informal designation would be pBHO_4. Since the resulting formal names are quite lengthy, traditional names or shorter names derived from the locus tag could be used. Authors would normally state the official name as well as the shorter name used in their publications. It should then be possible to give older plasmids official names but retain their short names, for example, F or RP4, for everyday usage. If a sequencing project reveals the presence of more than one plasmid, they could be numbered consecutively from largest to smallest in size (or vice versa, in the tradition of studies on plasmids from Rhizobiaceae). For example, the plasmids pREB1-9 could be the informal names for pEND_AmaMBIC11017_1-9 from Acaryochloris marina, strain MBIC11017 (see NC_009926-34), where REB is the locus tag for the A. marina MBIC11017 genome. In the event that the same plasmid is identified in two separate sequencing projects, each copy should be named according to the rules outlined above. Thus, identical plasmids from separate sources should have different formal names, but their identity can be noted in the databases or relevant publications. Informal names can reflect either the locus tag for that particular project or the first or accepted name for that plasmid.
Modern sequencing methods are often incapable of circularizing a plasmid or, in the case of linear plasmids, providing sequence at the ends of the plasmid. Incomplete sequences represent a valuable source of information and should be submitted to GenBank as contigs but should not be given plasmid names. If further information becomes available, the database entry should be updated and the corrections noted. Examination of current practices in GenBank and RefSeq should illustrate what is currently acceptable.
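
The numbering rule (consecutive numbers by size, with incomplete sequences left as contigs) could be sketched as follows. The Replicon class, the contig labels, the sizes, and the locus tag XYZ are all hypothetical and serve only to illustrate the proposal; labels are assumed to be unique within a project.

```python
from dataclasses import dataclass


@dataclass
class Replicon:
    label: str      # working label, e.g. a contig ID (hypothetical)
    size_bp: int    # assembled length in base pairs
    complete: bool  # True only if fully assembled and at least preliminarily annotated


def assign_names(replicons, locus_tag, largest_first=True):
    """Map each working label to an informal plasmid name or leave it as a contig."""
    complete = [r for r in replicons if r.complete]
    ordered = sorted(complete, key=lambda r: r.size_bp, reverse=largest_first)
    names = {r.label: f"p{locus_tag}_{i}" for i, r in enumerate(ordered, start=1)}
    # Incomplete sequences keep their contig label and receive no plasmid name.
    for r in replicons:
        names.setdefault(r.label, r.label)
    return names


# Hypothetical sequencing project with two complete plasmids and one open contig.
project = [
    Replicon("contig_A", 120_000, True),
    Replicon("contig_B", 45_000, False),
    Replicon("contig_C", 300_000, True),
]
print(assign_names(project, "XYZ"))
```

Passing largest_first=False gives the smallest-to-largest order used in the Rhizobiaceae tradition mentioned above.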

Naming Plasmid Genes
The protocol for naming genes and gene products is well described in GenBank under the Prokaryotic Annotation Guide (http://www.ncbi.nlm.nih.gov/GenBank/genomesubmit_annotation). Below is the entry for F plasmid TraD from RefSeq:

     gene            89804..91957
                     /gene="traD"
                     /locus_tag="Fpla104"
                     /operon="transfer operon"
                     /db_xref="GeneID:1263585"
     CDS             89804..91957
                     /gene="traD"
                     /locus_tag="Fpla104"
                     /operon="transfer operon"
                     /experiment="experimental evidence, no additional details recorded"
                     /note="type IV secretion system coupling protein; similar to F plasmid TraD"
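
The qualifiers in the entry above can be read programmatically. The following is a minimal sketch that handles only the simple one-line-per-qualifier layout transcribed here; the parse_features function is hypothetical, and a real GenBank record (with wrapped qualifiers, joins, and complements) would call for an established parser.

```python
import re

# Feature text transcribed from the traD entry quoted above.
FEATURE_TEXT = '''
gene 89804..91957
/gene="traD"
/locus_tag="Fpla104"
/operon="transfer operon"
/db_xref="GeneID:1263585"
CDS 89804..91957
/gene="traD"
/locus_tag="Fpla104"
/operon="transfer operon"
/experiment="experimental evidence, no additional details recorded"
/note="type IV secretion system coupling protein; similar to F plasmid TraD"
'''

FEATURE_RE = re.compile(r"(gene|CDS)\s+(\d+)\.\.(\d+)")
QUALIFIER_RE = re.compile(r'/(\w+)\s*=\s*"([^"]*)"')


def parse_features(text):
    """Return a list of features, each with type, coordinates, and qualifiers."""
    features = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        feature = FEATURE_RE.fullmatch(line)
        if feature:
            features.append({
                "type": feature.group(1),
                "start": int(feature.group(2)),
                "end": int(feature.group(3)),
                "qualifiers": {},
            })
            continue
        qualifier = QUALIFIER_RE.fullmatch(line)
        if qualifier and features:
            # Attach the qualifier to the most recently opened feature.
            features[-1]["qualifiers"][qualifier.group(1)] = qualifier.group(2)
    return features


for feature in parse_features(FEATURE_TEXT):
    length = feature["end"] - feature["start"] + 1
    print(feature["type"], feature["qualifiers"]["locus_tag"], length, "bp")
```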

The F plasmid (pFpla_1) has 108 genes, which are given sequentially numbered locus tags Fpla1–108, with traD being the 104th gene. In the case of traD, extensive experimental evidence exists regarding its function, although no updates providing these references are given, as suggested by the /experiment line. The /note line indicates its putative function, with “putative” being the word of choice at GenBank (refrain from using “potential” or “predicted”). The /note line could read “ATPase; type IV secretion system-associated coupling protein; similar to F plasmid TraD; VirD4 family.” This indicates that F TraD is a paradigm for a subset of coupling proteins within the VirD4 family. In the absence of any experimental data, a situation more and more frequently encountered, genes should remain as locus tag designations since they are unique. The /gene field need not be filled in. In this hypothetical case, Fpla104 would be referred to as gene Fpla104 in the literature. If the function of a plasmid gene product is highly probable and there has been some human oversight in the annotation process (as opposed to routine automated annotation), a more descriptive gene name that describes known or putative phenotypes based on the literature can be used. The following are generally recognized as key gene names in plasmids: rep, par, cop, inc, stb, tra, sfx, eex, rlx, pri, ssb (replication, partition, copy number, incompatibility, stability, transfer, surface exclusion, entry exclusion, relaxase, primase, single-stranded DNA binding protein), and others could surely be added to the list. For instance, cpl could be used for the coupling protein since it is essential in the conversion of a T4SS to a conjugative system and deserves its own gene name. The usual format for gene names (abcD), as described in Demerec et al. (1966), should be used, with the last, uppercase letter reflecting the placement of that particular gene within an operon. In long operons such as the transfer operon in plasmid F (33 kb), genes with no known function or with a function that is most likely not involved in transfer could remain as the locus tag designation. Some plasmids have adopted a different naming scheme, most notably the Ti family of plasmids in A. tumefaciens, whereby the gene and gene product names reflect the operon as well as the position within the operon, e.g., virD4. This is well entrenched in the literature and has many advantages that need to be debated by the plasmid community but is not the currently accepted practice at NCBI. In summary, /gene is the unique gene name within that genome using the standard bacterial gene naming conventions, either abcA or abcA1. /locus_tag is normally an alphanumeric designator linked to the name of the genome and the position of the gene in that genome.
It can be used as the informal gene name in the absence of detailed annotation. /note allows reportage of the likely function identified/predicted by bioinformatics and other analyses.
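
As a rough illustration of these conventions, the sketch below checks a name against the abcD or abcA1 pattern and looks up the three-letter prefix in the list of key plasmid gene names given above. The describe_gene function and the regular expression are assumptions made for this example, not part of any annotation pipeline.

```python
import re

# Key plasmid gene prefixes and their meanings, as listed in the text.
PLASMID_GENE_PREFIXES = {
    "rep": "replication",
    "par": "partition",
    "cop": "copy number",
    "inc": "incompatibility",
    "stb": "stability",
    "tra": "transfer",
    "sfx": "surface exclusion",
    "eex": "entry exclusion",
    "rlx": "relaxase",
    "pri": "primase",
    "ssb": "single-stranded DNA binding protein",
}

# Three lowercase letters, optionally followed by one uppercase letter
# and then optional digits (abc, abcD, abcA1).
GENE_NAME_RE = re.compile(r"([a-z]{3})(?:([A-Z])(\d*))?")


def describe_gene(name):
    """Split a gene name into its parts, or return None if it is not in abcD form."""
    match = GENE_NAME_RE.fullmatch(name)
    if match is None:
        return None  # e.g. a locus tag such as Fpla104
    prefix, letter, number = match.groups()
    return {
        "prefix": prefix,
        "letter": letter,
        "number": int(number) if number else None,
        "function": PLASMID_GENE_PREFIXES.get(prefix),
    }


print(describe_gene("traD"))
print(describe_gene("virD4"))
print(describe_gene("Fpla104"))
```

A name such as virD4 fits the pattern but has no entry in the prefix list, which mirrors the point made in the text that such names describe operon position rather than a recognized plasmid function.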

Plasmids often carry other, smaller, mobile elements such as insertion sequences, transposons, and integrons, or gene clusters for well-characterized traits such as antibiotic or heavy metal resistance and virulence. The conventions for naming these elements and genes should be followed as outlined in the Annotation Guide and in references such as Siguier et al. (2012) for IS elements.

Problems with Using Top BLAST Hits for Annotation
The bugbear in the naming of genes appears to be BLAST, a wonderful tool that often instills unwarranted confidence in its user. Many annotations reflect unswerving faith in the ability of BLAST to correctly name a gene. This often leads to naming new genes after the number one hit, the one with the highest identity in a BLAST search, regardless of whether this name makes sense or not. It may be that the most closely related gene does indeed perform a similar function but was incorrectly named during its submission process. Not only is this confusing, but it has the potential to propagate incorrect gene names, which pop up time and time again in subsequent BLAST searches. One of the most egregious examples is the naming of newly found genes or gene products that are associated with type IV secretion systems after the Vir gene products of the Ti plasmid. Naming a transfer gene in a non-virulent plasmid virD4, for example, is confusing to anyone not intimately familiar with the coupling protein literature. Membership within the VirD4 family should be reserved for the /note line during the annotation process. Once there is some evidence for the function of a particular gene product, including evidence from a literature search, the gene can be named as described above, for instance, cplA. When deciding on gene product function, some tools are better than others. Because of the problems associated with BLAST, where erroneous functions are propagated in an alarming manner, care must be taken that the correct function is assigned. Swiss-Prot (http://web.expasy.org/groups/swissprot/) is probably the most informative and accurate databank for assigning function, followed by RefSeq at NCBI, where sequences are annotated by NCBI staff. The next most reliable source is annotation done by individual labs, followed by the great bulk of sequences entered into GenBank after automated annotation without much formal review.
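
One way to act on this ranking is to prefer the hit from the most reliable source rather than the hit with the highest identity. The sketch below assumes a simplified, hypothetical hit record (it does not parse real BLAST output), hypothetical source labels for the four tiers described above, and an arbitrary identity cutoff of 30% chosen only for illustration.

```python
from dataclasses import dataclass

# Reliability ranking as described in the text (lower rank = more trusted).
# The labels "lab-curated" and "automated" are hypothetical shorthand.
SOURCE_RANK = {"Swiss-Prot": 0, "RefSeq": 1, "lab-curated": 2, "automated": 3}


@dataclass
class Hit:
    description: str  # annotation attached to the database entry
    identity: float   # percent identity reported for the alignment
    source: str       # one of the keys in SOURCE_RANK


def pick_annotation_source(hits, min_identity=30.0):
    """Choose the hit whose annotation should inform the /note line."""
    candidates = [
        h for h in hits
        if h.identity >= min_identity and h.source in SOURCE_RANK
    ]
    if not candidates:
        return None
    # Source reliability comes first; identity only breaks ties within a source.
    return min(candidates, key=lambda h: (SOURCE_RANK[h.source], -h.identity))


# Hypothetical hits for a newly sequenced coupling protein gene.
hits = [
    Hit("VirD4", 92.0, "automated"),
    Hit("type IV secretion system coupling protein", 61.0, "Swiss-Prot"),
]
best = pick_annotation_source(hits)
print(best.description if best else "no usable hit; keep the locus tag")
```

Even the selected description is only a candidate for the /note line; as argued above, a gene name should wait for some evidence of function.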

Future Perspectives
Although it would have been helpful to discuss nomenclature for plasmids and their genes ten years ago, it is never too late to start. In the absence of guidelines from the plasmid community, researchers have used their common sense and followed the GenBank rules for submission to produce, for the most part, useful annotations of plasmids. However, the growing confusion about plasmid and gene names needs to be addressed. Some simple solutions, as described above, are hereby proposed to initiate discussion within the plasmid biology community. The lists of standardized abbreviations used in naming plasmids and genes should be curated by the databases. Once the plasmid community has decided on these standardized abbreviations, the annotation process itself would be simplified, and the abbreviations would aid database managers in understanding plasmid provenance and gene function. Since this proposal is neither particularly complex nor in need of extensive maintenance by interested parties, it should be relatively easy to make the case for its general adoption. The ability to use both long and short or common names, as is done for enzymes, for instance, should avoid the charge of a bureaucracy gone mad. The naming of genes remains more problematic than the naming of plasmids because of the sheer number of genes and the multiple errors currently present and being propagated throughout the databases. Using BLAST to generate names for new genes, coupled with the use of automated annotation services that are insensitive to these errors, has contributed mightily to this situation. These simple solutions will hopefully generate discussion within the plasmid community and with automated annotation services. Hopefully, problematic gene names will be diluted out or updated in the near future.

Cross-References
▶ Conjugative Transfer Systems and Classifying Plasmid Genomes
▶ Genomic Signature Analysis to Predict Plasmid Host Range
▶ Mathematical Modeling of Plasmid Dynamics
▶ Metamobilomics – The Plasmid Metagenome of Natural Environments
▶ Plant Genome Annotation, Methods for
▶ Plasmid Cloning Vectors
▶ Plasmid Genomes, Introduction to
▶ Plasmid Incompatibility
▶ Plasmid Regulatory Systems, Modeling
▶ Plasmids as Secondary Chromosomes
▶ Rolling Circle Replicating Plasmids
▶ Synthetic Plasmid Biology
▶ Theta-Replicating Plasmids, Large
▶ Transposable Elements and Plasmid Genomes

Acknowledgments
The authors wish to thank Celeste Brown (University of Idaho), Bill Klimke (NCBI), and Miguel Cevallos (CFN, Mexico) for useful discussions.

References
Achtman M, Willetts NS, Clark AJ (1971) Beginning a genetic analysis of conjugational transfer determined by the F factor in Escherichia coli by isolation and characterization of transfer-deficient mutants. J Bacteriol 106(2):529–538
Alvarez-Martinez CE, Christie PJ (2009) Biological diversity of prokaryotic type IV secretion systems. Microbiol Mol Biol Rev 73(4):775–808
Brister JR, Bao Y, Kuiken C et al (2010) Towards viral genome annotation standards, report from the 2010 NCBI annotation workshop. Viruses 2(10):2258–2268
Demerec M, Adelberg EA, Clark AJ et al (1966) A proposal for a uniform nomenclature in bacterial genetics. Genetics 54(1):61–76
Frost LS, Ippen-Ihler K, Skurray RA (1994) Analysis of the sequence and gene products of the transfer region of the F sex factor. Microbiol Rev 58(2):162–210
FTP directory of genomes/plasmids. ftp://ftp.ncbi.nih.gov/genomes/Plasmids/. Accessed 16 Mar 2014
International Nucleotide Sequence Database Collaboration (INSDC). http://www.insdc.org/. Accessed 16 Mar 2014
Klimke W, O’Donovan C, White O et al (2011) Solving the problem: genome annotation standards before the data deluge. Stand Genomic Sci 5(1):168–193
NCBI GenBank bacterial genome submission guide (annotation). http://www.ncbi.nlm.nih.gov/GenBank/genomesubmit_annotation. Accessed 16 Mar 2014
NCBI GenBank bacterial genome submission guide (locus tag). http://www.ncbi.nlm.nih.gov/genbank/genomesubmit#locus_tag. Accessed 16 Mar 2014
NCBI GenBank home page. http://www.ncbi.nlm.nih.gov/genbank/. Accessed 16 Mar 2014
NCBI GenBank metagenome submission guide. http://www.ncbi.nlm.nih.gov/genbank/metagenome. Accessed 16 Mar 2014
NCBI GenBank whole genome shotgun submissions guide. http://www.ncbi.nlm.nih.gov/genbank/wgs. Accessed 16 Mar 2014
NCBI handbook. http://www.ncbi.nlm.nih.gov/books/NBK21105/. Accessed 16 Mar 2014
NCBI reference sequence database. http://www.ncbi.nlm.nih.gov/RefSeq/. Accessed 16 Mar 2014
Novick RP, Clowes RC, Cohen SN et al (1976) Uniform nomenclature for bacterial plasmids: a proposal. Bacteriol Rev 40(1):168–189
Pansegrau W, Lanka E, Barth PT et al (1994) Complete nucleotide sequence of Birmingham IncP alpha plasmids. Compilation and comparative analysis. J Mol Biol 239(5):623–663
Roberts RJ, Belfort M, Bestor T et al (2003) A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Res 31(7):1805–1812
Sciaky D, Montoya AL, Chilton MD (1978) Fingerprints of Agrobacterium Ti plasmids. Plasmid 1(2):238–253
Sen D, Van der Auwera GA, Rogers LM et al (2011) Broad-host-range plasmids from agricultural soils have IncP-1 backbones with diverse accessory genes. Appl Environ Microbiol 77(22):7975–7983
Siguier P, Varani A, Perochon J et al (2012) Exploring bacterial insertion sequences with ISfinder: objectives, uses, and future developments. Methods Mol Biol 859:91–103
Swiss-Prot Group. http://web.expasy.org/groups/swissprot/. Accessed 16 Mar 2014

PolyGln Diseases ▶ Polyglutamine Folding Diseases

Polyglutamine Diseases ▶ Polyglutamine Folding Diseases

Polyglutamine Folding Diseases
Shallee T. Page
Division of Environmental and Biological Sciences, University of Maine at Machias, Machias, ME, USA

Synonyms
CAG repeat pathologies; PolyGln diseases; Polyglutamine diseases; PolyQ diseases

Synopsis
The presence of expanded repeats of the trinucleotide sequence CAG results in pathogenic stretches of glutamine in the gene product. The polyglutamine (polyQ) stretches, in turn, tend to aggregate and to induce other susceptible proteins to aggregate. This protein aggregation results in pathological states in certain cell types. Each pathology is discussed, as well as potential mechanisms and possible treatment strategies.

Introduction
There are a number of disease states, including six spinocerebellar ataxias, which originate from protein misfolding due to an extended stretch of glutamine residues. These stretches of polyglutamine (polyQ) cause the misfolded protein to aggregate, leading to neural cytotoxicity. PolyQ diseases originate with the presence of extended unstable stretches of the trinucleotide CAG in specific genes, which places them within a broader group of “triplet repeat diseases.” Normal individuals have fewer than 27 CAG repeats in these genes. The triplet CAG encodes the amino acid glutamine; thus, the CAG repeat gives rise to extended stretches of glutamine. The expansion of the number of these genomic trinucleotide repeats leads to protein aggregation and disease, particularly in certain neural cells. In pathological states, these stretches of CAG can number in the dozens. The threshold for pathology varies with disease but is typically around 40 repeats. The number of these CAG repeats can increase in successive generations, correlating with earlier onset and/or increased severity of the pathology. The mechanism of initiation is likely polymerase slippage, leading to increased copy number of trinucleotide repeats. The resulting polyQ tracts disrupt proper folding, negatively impacting protein function. For most of these diseases, there are no successful treatments, only palliative care. However, scientists are actively examining intervention at each stage of protein production.
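
Because each CAG codon contributes one glutamine, the length of an uninterrupted CAG run can be counted directly from a DNA sequence. The following is a minimal sketch; the function name longest_cag_run and the example sequence are hypothetical, and the search ignores reading frame, which a real analysis would need to take into account.

```python
import re


def longest_cag_run(dna):
    """Return the number of repeats in the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Hypothetical sequence: a start codon, five CAG repeats, an interruption,
# and two further CAG repeats.
example = "ATG" + "CAG" * 5 + "CCG" + "CAG" * 2
print(longest_cag_run(example))
```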

Pathologies
The polyQ diseases are characterized by progressive degeneration of neural function, typically with an adult onset of clinical signs. The diseases are categorized by disease state, such as spinocerebellar ataxias (SCAs), but many have historical names after the individuals who characterized the disorder (e.g., Huntington’s disease) or affected families (e.g., Machado-Joseph disease). Spinocerebellar ataxias (SCAs) are neurodegenerative diseases that affect neurons in the spine and in the cerebellum, the part of the brain responsible for motor movement. Thus, these diseases lead to progressive symptoms, including lack of coordination, involuntary eye movement, and impairment of motor movement. Of more than 25 SCAs, CAG repeat expansions encoding polyQ tracts are known to be the causative defect in SCA1, SCA2, SCA3, SCA6, SCA7, SCA17, and DRPLA (Paulson 2009). Spinocerebellar ataxia-1 (SCA1) is a neurodegenerative disorder caused by a CAG expansion of greater than 43 repeats in the sca1 gene and subsequent polyQ expansion in the protein ataxin-1 (Paulson 2009). The polyQ proteins form neuronal filaments, leading to neurodegeneration. The polyQ is necessary but not sufficient for pathology; localization, folding, and phosphorylation also play a role in the course of the disease.

In alleles of ataxin-2 (ATXN2) associated with spinocerebellar ataxia-2 (SCA2), the polyQ repeats range from 34 to 59. In the range of 32–34 repeats, the penetrance can vary among affected individuals. Notably, SCA2 is the most likely spinocerebellar ataxia to arise spontaneously in an individual without prior family history (Paulson 2009). SCA3 is also known as Machado-Joseph disease (MJD). SCA3 often manifests as staring eyes with infrequent blinking and progressive difficulty speaking and swallowing. In SCA3, the protein containing the repeats is referred to as ataxin-3 (Paulson 2009). As in other polyQ diseases, the number of repeats correlates with both the severity and the extent of disease. Ataxin-3 displays 12–42 repeats in healthy individuals and approximately 52–84 repeats in symptomatic individuals (Paulson 2009). Spinocerebellar ataxia type 6 (SCA6) is a dominantly inherited late-onset progressive ataxia. In some cases, the ataxia is accompanied by other neurological defects, such as impaired upward gaze and difficulty tracking movement. Later symptoms can include hyper-reflexivity and spasticity. Neurodegeneration is seen in the cerebellar neurons known as Purkinje cells (PC). In contrast to most of the polyQ diseases, as few as 18 repeats can trigger the pathology, though symptoms tend to be milder than in many of the other SCAs and are not fatal. The pathogenic CAG repeats are found within CACNA1A, the gene for the α-subunit of the P/Q-type calcium channel (CaV2.1). The channel appears to remain functional, though its kinetics are altered, and the polyQ protein accumulates in the cell (Paulson 2009). SCA7 includes retinal degeneration, culminating in blindness (Paulson 2009). SCA7 is very sensitive to repeat copy number, especially in paternal inheritance: normal individuals have fewer than 18 repeats, affected individuals have more than 37 repeats, and larger expansions can even cause symptoms in neonates.
Although repeat instability upon transmission is common in many polyQ diseases, it is particularly so in SCA7 (Paulson 2009). SCA17 (sometimes called Huntington’s disease-like 4, or HDL4) results from a stretch of more than 42 repeats in the transcriptional regulator TBP (TATA-box-binding protein), where the normal range is from 25 to 42 repeats. The broad function of the affected protein yields a range of symptoms: ataxia, dementia, extrapyramidal movement and seizures, and neuropsychoses (Stevanin and Brice 2008). Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease leading to paralysis and death. PolyQ repeats of 27–33 in the ataxin-2 protein, encoded by ATXN2, were found (along with a number of other risk factors) to increase the risk for ALS (Yu et al. 2011). Given the identified association between ALS and ataxin-2 polyQ expansion, a number of other polyQ disease genes were examined to see whether polyglutamine expansion in a number of proteins was overwhelming the folding machinery of the cell. However, the polyQ expansion was detected only in ataxin-2, indicating a specific defect related to ataxin-2. It has also been shown that the presence of intervening CAA codons in the middle of the CAG repeats conferred partial protection against ALS, significantly delaying age of onset (Yu et al. 2011). Huntington’s disease (HD) is the most common polyQ disease, particularly among those of Western European descent. It is a dominantly inherited disease displaying polyQ repeats encoded in the first exon of the gene for the protein huntingtin (HTT), which leads to a toxic phenotype in neurons. The presence of CAG repeats at the 5′ end results in a lengthened glutamine tract. Pathology requires at least 36 CAG repeats (Zuccato et al. 2010). An individual with 36–39 repeats in the HTT gene may or may not experience symptoms, but individuals with 40 or more repeats will invariably experience HD symptoms. The defective protein aggregates and is deposited in inclusion bodies in neurons. The resulting fatal neurodegeneration generally appears between the ages of 30 and 50 and is evidenced by chorea and cognitive decline, but there is significant phenotypic variation.
The length of the repeats determines about 70% of the variability of the age of onset of Huntington’s disease (Zuccato et al. 2010); children with over 100 repeats experience symptoms within the first 6 years of life. Furthermore, the repeat tends to expand further over time in postmitotic neurons. Thus, somatic instability is a critical factor in age of onset (Zuccato et al. 2010). Dentatorubral-pallidoluysian atrophy (DRPLA) is a spinocerebellar degeneration similar to HD, displaying ataxia, chorea, seizures, and eventual dementia. DRPLA arises from polyQ expansions encoded in the atrophin-1 gene (ATN1). The extent of pathology varies greatly (Paulson 2009). Spinal and bulbar muscular atrophy (SBMA) is a disease caused by polyQ repeats (>36) that arise in the gene for the androgen receptor, which is found on the X chromosome. As a result, SBMA is an X-linked disorder that primarily affects males; the expansion apparently makes the receptor toxic to nerve cells. SBMA is a late-onset disease, displaying muscle wasting as well as neuron degeneration. Curiously, the normal functions of the androgen receptor, including binding, translocation to the nucleus, and transcription factor recruitment, appear to be required for pathology (Nedelsky et al. 2010).
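
The repeat thresholds quoted above for Huntington’s disease, and the observation that CAA codons can interrupt a CAG run without interrupting the glutamine tract, can be illustrated with a short sketch. Both functions are hypothetical, use only the figures stated in this entry, and are in no sense a diagnostic tool; longest_polyq assumes the sequence is supplied in the correct reading frame.

```python
GLN_CODONS = {"CAG", "CAA"}  # both codons encode glutamine


def classify_htt_repeats(n_repeats):
    """Classify an HTT CAG repeat count using the ranges given in the text."""
    if n_repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if n_repeats < 36:
        return "below pathological threshold"
    if n_repeats <= 39:
        return "reduced penetrance"
    return "full penetrance"


def longest_polyq(dna):
    """Return (tract length, CAA count) for the longest in-frame glutamine tract."""
    seq = dna.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    best = (0, 0)
    length = caa = 0
    for codon in codons:
        if codon in GLN_CODONS:
            length += 1
            caa += codon == "CAA"
            if length > best[0]:
                best = (length, caa)
        else:
            # Any non-glutamine codon ends the current tract.
            length = caa = 0
    return best


print(classify_htt_repeats(37))
print(longest_polyq("CAG" * 8 + "CAA" + "CAG" * 4))
```

In the second example the protein carries 13 consecutive glutamines even though the longest pure CAG run is only 8 repeats, which is the distinction drawn above for CAA-interrupted ATXN2 alleles.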

Inheritance
The effect of the extended polyQ repeats is dominant; thus, the offspring of those suffering from many of the polyQ diseases have a 50% chance of inheriting the faulty gene and the disease. Furthermore, the gene defect is the tendency to accumulate CAG repeats, so the number of these repeats can increase in successive generations. As mentioned, the number of repeats correlates with earlier onset and increased severity of the disease.

Potential Mechanisms
The origin of polyQ repeats lies with the DNA polymerase. The ability of the polymerase to faithfully reproduce a genomic DNA sequence is compromised by trinucleotide repeats. Polymerase slippage at the repeating unit or at a hairpin, base excision repair, or mismatch repair is thought to give rise to errors. In fact, the mechanism may be tissue specific, with the greatest instability in neurons and the brain. These mitotic errors can be deletions, insertions, or replication errors. However, once the number of CAG repeats passes a certain threshold, an increase in the number of repeats is the most likely error (Orr 2012). The normal functions of the genes affected vary widely, and the loss of function is often not the primary cause of pathology. Rather, the polyQ stretches interfere with the proper folding of the protein, leading to the formation of intrinsically disordered proteins (IDPs). Denatured proteins tend to aggregate, and the protein aggregates (identified microscopically as “inclusion bodies”) accumulate in the cell, many in the nucleus. The intrabackbone interactions in the polyQ IDPs mediate at least part of the misfolding. A number of studies have shown that sequences flanking polyQ stretches can modulate aggregation and disease (Zuccato et al. 2010). Clearly, expression of certain polyQ proteins is necessary and sufficient to cause disease in some systems. Expression of CAG repeats in unassociated proteins can induce neurodegeneration reminiscent of polyQ diseases, as can expression of the first exon of the huntingtin (HTT) protein when it contains polyQ. In addition, exogenous polyQ peptides can greatly accelerate the rate of aggregate formation by elongating aggregation nuclei and fostering neural degeneration (Zuccato et al. 2010). The exact role of the aggregation is not clear; although repeat length is related to disease progression, the degree of aggregation does not clearly correlate with it. The aggregation in some diseases is in equilibrium with soluble mutant protein. Some have argued that aggregation can even be neuroprotective (Takahashi et al. 2008). One model is that mutations leading to subclinical misfolding can be exacerbated by other genomic polymorphisms, overwhelming the cellular chaperone and quality control mechanisms (Gidalevitz et al. 2006). This helps explain why dysfunction caused by polyQ repeats can disrupt a range of cellular pathways.
When the polyQ proteins flood the protein folding quality control mechanisms, proper folding of a host of proteins can suffer. Expression of polyQ proteins in C. elegans leads to aggregation of a range of other susceptible proteins that rely on the protein folding machinery (Gidalevitz et al. 2006), which is notable since a number of normal proteins also contain shorter polyQ sequences. Several of the polyQ proteins affect transcription, and some directly bind transcription factor complexes. Finally, studies have shown that transcription of the repeat regions can often occur bidirectionally and, in some cases, produce toxic RNA products, such as in SCA8. These RNAs may affect disease penetrance and severity (Batra et al. 2010).

Potential Therapeutic Strategies
There are palliative medications; however, there is no cure for these diseases. A number of strategies have been suggested that may yield therapeutic treatments in the future. The most effective treatment has been applied to SBMA. Since SBMA results from polyQ expression in the androgen receptor, administration of a gonadotropin-releasing hormone antagonist has proven to be a successful treatment (Katsuno et al. 2010). Since this treatment rests on the unique function of the affected protein, though, it is not applicable to other pathologies. It is presumed that nonspecific silencing of the polyQ proteins could exacerbate pathologies, since it would cause loss of normal function as well as reduce the levels of aberrant protein. Antisense oligonucleotides targeted just to the CAG repeats may leave the wild-type protein unaffected. Some of the defects cause transcriptional repression, so relieving that repression may reduce the impact of the HTT mutation (Ross and Shoulson 2009). Disruption of posttranslational modifications can also have a palliative effect. It has been shown that mutating ataxin-1 to mimic phosphorylation can lead to a similar pathology in the absence of polyQ stretches. Similarly, by eliminating the phosphorylation in HTT, the effect of the polyQ stretch can be overcome. Posttranslational cleavage also can be important: only N-terminal fragments of mutant HTT aggregate, and they exhibit higher toxicity than full-length mutant HTT. Transgenic mice expressing modified HTT incapable of cleavage by caspase-6 were less susceptible to neurotoxicity than the wild type (Graham et al. 2006). Thus, compounds that modulate posttranslational modifications such as phosphorylation or proteolytic processing would have therapeutic value. Stem cells have been explored as therapeutic agents. Stem cell therapy in animal models of HD has been reported, but efficacy is lacking and mechanisms remain to be elucidated (Kim et al. 2008). William Balch and coworkers have proposed that chaperones and folding enzymes can be used as interventions for folding diseases. An experiment that provided confirmation of a protein folding defect, as well as pointing toward therapeutic intervention, was performed in a Drosophila model of Parkinson’s disease. Aggregation of human α-synuclein expressed in Drosophila was blocked by co-expressing Hsp70, a heat-shock protein that presumably acts as a chaperone to assure proper folding of the α-synuclein (Auluck et al. 2010). The upregulation of other cell homeostasis functions may also serve to decrease misfolding. Particularly in light of the contribution of RNA products to pathogenesis, RNAi has been proposed as an intervention for these amino acid repeat diseases (Boudreau and Davidson 2010). Allele-specific silencing of the pathogenic protein arising from amino acid repeats has been achieved in several studies in rodent models, indicating that these treatments may hold promise for human interventions.

References
Auluck PK, Caraveo G, Lindquist S (2010) α-Synuclein: membrane interactions and toxicity in Parkinson’s disease. Annu Rev Cell Dev Biol 26:211–233
Batra R, Charizanis K, Swanson MS (2010) Partners in crime: bidirectional transcription in unstable microsatellite disease. Hum Mol Genet 19:R77–R82. https://doi.org/10.1093/hmg/ddq132
Boudreau RL, Davidson BL (2010) RNAi therapeutics for CNS disorders. Brain Res 1338:112–121. https://doi.org/10.1016/j.brainres.2010.03.038
Gidalevitz T, Ben-Zvi A, Ho KH et al (2006) Progressive disruption of cellular protein folding in models of polyglutamine diseases. Science 311:1471–1474. https://doi.org/10.1126/science.1124514
Graham RK, Deng Y, Slow EJ et al (2006) Cleavage at the caspase-6 site is required for neuronal dysfunction and degeneration due to mutant huntingtin. Cell 125:1179–1191. https://doi.org/10.1016/j.cell.2006.04.026
Katsuno M, Banno H, Suzuki K et al (2010) Efficacy and safety of leuprorelin in patients with spinal and bulbar muscular atrophy (JASMITT study): a multicentre, randomised, double-blind, placebo-controlled trial. Lancet Neurol 9:875–884. https://doi.org/10.1016/S1474-4422(10)70182-4
Kim M, Lee S-T, Chu K, Kim SU (2008) Stem cell-based cell therapy for Huntington disease: a review. Neuropathology 28:1–9. https://doi.org/10.1111/j.1440-1789.2007.00858.x
Nedelsky NB, Pennuto M, Smith RB, Palazzolo I, Moore J, Nie Z, Neale G, Taylor JP (2010) Native functions of the androgen receptor are essential to pathogenesis in a Drosophila model of spinobulbar muscular atrophy. Neuron 67:936–952. https://doi.org/10.1016/j.neuron.2010.08.034
Orr HT (2012) Cell biology of spinocerebellar ataxia. J Cell Biol 197:167–177. https://doi.org/10.1083/jcb.201105092
Paulson H (2009) The spinocerebellar ataxias. J Neuro-Ophthalmol 29:227–237. https://doi.org/10.1097/WNO.0b013e3181b416de
Ross C, Shoulson I (2009) Huntington disease: pathogenesis, biomarkers, and approaches to experimental therapeutics. Parkinsonism Relat Disord 15(Suppl 3):S135–S138. https://doi.org/10.1016/S1353-8020(09)70800-4
Stevanin G, Brice A (2008) Spinocerebellar ataxia 17 (SCA17) and Huntington’s disease-like 4 (HDL4). Cerebellum 17:170–178
Takahashi T, Kikuchi S, Katada S et al (2008) Soluble polyglutamine oligomers formed prior to inclusion body formation are cytotoxic. Hum Mol Genet 17:345–356. https://doi.org/10.1093/hmg/ddm311
Yu Z, Zhu Y, Chen-Plotkin AS et al (2011) PolyQ repeat expansions in ATXN2 associated with ALS are CAA interrupted repeats. PLoS One 6:e17951. https://doi.org/10.1371/journal.pone.0017951
Zuccato C, Valenza M, Cattaneo E (2010) Molecular mechanisms and potential therapeutical targets in Huntington’s disease. Physiol Rev 90:905–981. https://doi.org/10.1152/physrev.00041.2009

Polymerase Chain Reaction
Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Definition
The polymerase chain reaction (PCR) is a procedure by which DNA is copied cyclically in vitro, leading to amplification of a specific DNA sequence. The method can be used to increase the amount of a DNA segment that is present in small amount in an initial sample or to selectively amplify one sequence from among a large amount of other DNA sequences. A PCR mixture contains a double-stranded DNA template, synthetic DNA oligonucleotides that can form base pairs at the ends of the desired sequence and serve as primers for DNA synthesis, a DNA polymerase, and the four deoxyribonucleoside triphosphates (dNTPs). The method generally involves the following steps (Fig. 1): (1) the reaction mixture is heated at ca. 95 °C to denature the DNA; (2) the mixture is cooled to allow the primers to anneal to the denatured DNA; (3) the DNA polymerase synthesizes dsDNA, using the dNTPs, the denatured DNA as the template, and the oligonucleotides as the primers; and (4) steps 1–3 are repeated cyclically, resulting in exponential amplification of the dsDNA targeted by the oligonucleotide primers.

Polymerase Chain Reaction, Fig. 1 The steps in a typical polymerase chain reaction. The initial mixture (upper left) contains dsDNA, a large excess of DNA primers (arrows), DNA polymerase, and dNTPs. The steps are (1) the DNA is heat denatured, (2) the primers form base pairs with the ssDNA, and (3) the DNA polymerase synthesizes DNA (heavy line) starting from the 3′-ends of the primers (arrowheads). These three steps are repeated in the second and subsequent cycles. The final products are dsDNA molecules whose extent is determined by the 5′-ends of the primers

Discussion
PCR was invented by Dr. Kary Mullis at the Cetus Corporation in the 1980s (Mullis and Faloona 1987), for which Mullis was awarded the Nobel Prize in Chemistry in 1993. The procedure was first done using the Klenow fragment of E. coli DNA polymerase I. Since the DNA polymerase activity was destroyed during each 95 °C DNA denaturation step, fresh enzyme had to be added for each cycle, after each denaturation step. The method was made widely applicable when it was done using a thermostable DNA polymerase. The first such enzyme used was the DNA polymerase I from Thermus aquaticus (Taq polymerase) (Saiki et al. 1988). T. aquaticus is a thermophilic bacterium that naturally grows at elevated temperatures. The Taq polymerase is optimally active at 80 °C (Chien et al. 1976), and it is able to survive the 95 °C denaturation step, obviating the need to add new enzyme for each reaction cycle.

The PCR method is now automated using a thermocycler to change the reaction mixture temperature as programmed by the investigator. A typical PCR run involves about 30 cycles of heat denaturation (95 °C), primer annealing at a temperature appropriate for the DNA template and primers (50–70 °C), and DNA synthesis (elongation, at 70 °C), all in about 2–4 h. The amount of targeted DNA is doubled at each cycle (assuming 100% efficiency at each cycle), for more than a millionfold amplification of the target DNA after 20 cycles. DNA polymerases from a variety of thermophiles are now available for PCR. They differ in synthesis rate, fidelity, processivity, and the structure of the dsDNA product ends. Several software tools are available for designing primers that anneal to the template with an appropriate melting temperature, that avoid annealing at nontarget sites in the DNA template, and that do not form primer dimers by base-pairing to each other.
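The doubling arithmetic described above can be illustrated with a short Python sketch. It is not part of the original entry: the function names and the per-cycle `efficiency` parameter (1.0 meaning perfect doubling) are assumptions introduced here for illustration, and the model ignores the plateau that real reactions reach when primers or dNTPs run low.

```python
def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase in target copies after a number of PCR cycles.

    `efficiency` is the (hypothetical) fraction of templates copied in
    each cycle; 1.0 corresponds to perfect doubling.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return (1.0 + efficiency) ** cycles


def cycles_to_reach(fold: float, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles giving at least `fold` amplification."""
    if fold < 1.0:
        raise ValueError("fold must be at least 1")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be greater than 0 and at most 1")
    cycles = 0
    current = 1.0
    # Multiply cycle by cycle rather than using logarithms, so that exact
    # powers of two are not thrown off by floating-point rounding.
    while current < fold:
        current *= 1.0 + efficiency
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(fold_amplification(20))               # perfect doubling for 20 cycles
    print(cycles_to_reach(1e6, efficiency=0.9))  # cycles needed at 90% efficiency
```

With perfect doubling, 20 cycles give 2^20 = 1,048,576-fold amplification, which is the "more than a millionfold" figure quoted in the text. At lower assumed efficiencies, more cycles are needed to reach the same level.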

Applications of PCR
Gene Cloning. The availability of genome sequences from many organisms makes gene cloning by PCR much faster than earlier cloning procedures. PCR primers are designed to anneal to DNA upstream and downstream of the gene of interest. The primers are mixed with genomic DNA isolated from the organism and used for PCR. The PCR product comprising the gene is amplified to much higher concentration than any other sequence in the genome and can easily be ligated to an appropriate cloning vector and introduced into bacterial or other host cells.

Quantitative PCR (qPCR and RT-PCR). PCR can be used to give relative or absolute amounts of a specific DNA or RNA sequence in a sample (VanGuilder et al. 2008). PCR is done in the presence of a fluorescent probe such as a dye that becomes fluorescent upon its intercalation into dsDNA. The fluorescence signal is measured during the ongoing PCR (hence, “real-time” PCR, sometimes abbreviated “RT-PCR”), giving a measure of the DNA concentration present during the reaction. The number of PCR cycles required for the fluorescence signal to reach a specified threshold level depends on the amount of DNA target in the original sample. RNA amounts are quantitated by first converting the RNA to cDNA using reverse transcriptase, followed by quantitative PCR (reverse transcription PCR, also abbreviated “RT-PCR”). This approach is particularly useful as a way to quantitate RNA levels in gene expression studies.

DNA Sequencing. DNA amplification by PCR is a crucial step in some “next-generation,” or “second-generation,” DNA sequencing methods (Metzker 2010). The DNA to be sequenced, such as genomic DNA, is first broken into small fragments, and a short synthetic dsDNA linker is attached to each end by ligation. The DNA fragments are then denatured and attached to tiny beads via the linker DNA, in such a way that only a single DNA fragment is attached to each bead. The DNA on the bead is amplified by PCR using primers that anneal to the linker DNA. The amplified DNA is itself captured by annealing to other linkers on the bead. The beads can be distributed individually to reaction wells (“picotiter” plates) and the DNA sequencing reaction carried out with the amplified, bead-bound DNA as the template. Sequencing reactions are monitored by detecting an optical signal simultaneously from 10⁸ beads, each linked to a different DNA template. Alternatively, the individual genomic DNA fragments can be attached to a solid surface and amplified to produce copies of the various fragments in isolated patches, which can be sequenced and monitored simultaneously.

Forensics. PCR is useful both to amplify tiny amounts of DNA that might be found at a crime scene, e.g., in a blood sample, hair, cigarette butt, etc., and to provide evidence that may incriminate an individual in a crime or rule out their involvement (Morling 2009). The primers used in these applications give amplification of human DNA regions that are known to contain repeated sequences that vary in repeat length among individuals. The sizes of DNA fragments in several such regions amplified from a forensic sample are compared to those from a suspect. A match between the two samples implicates the individual as the source of the blood, hair, etc.
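The comparison of repeat-length profiles described under Forensics can be sketched in Python as follows. This is only an illustration under stated assumptions: the locus labels (`locus_A` and so on) and repeat counts are invented, each locus is represented by the repeat counts of its two alleles rather than by measured fragment sizes, and the helper names are hypothetical. Real casework also assigns a statistical weight to a match, which this sketch does not attempt.

```python
from typing import Dict, List, Tuple

# A profile maps a locus label to the repeat counts of its two alleles.
Profile = Dict[str, Tuple[int, int]]


def normalize(profile: Profile) -> Profile:
    """Sort the two alleles at each locus so that allele order is ignored."""
    return {locus: tuple(sorted(alleles)) for locus, alleles in profile.items()}


def compare_profiles(evidence: Profile, suspect: Profile) -> Tuple[List[str], List[str]]:
    """Return (matching loci, mismatching loci) over the loci typed in both samples."""
    ev, su = normalize(evidence), normalize(suspect)
    shared = sorted(set(ev) & set(su))
    matches = [locus for locus in shared if ev[locus] == su[locus]]
    mismatches = [locus for locus in shared if ev[locus] != su[locus]]
    return matches, mismatches


def is_excluded(evidence: Profile, suspect: Profile) -> bool:
    """A mismatch at any shared locus rules the suspect out as the source."""
    _, mismatches = compare_profiles(evidence, suspect)
    return bool(mismatches)


if __name__ == "__main__":
    # Invented example data, for illustration only.
    evidence = {"locus_A": (12, 15), "locus_B": (7, 9), "locus_C": (16, 16)}
    suspect = {"locus_A": (15, 12), "locus_B": (9, 7), "locus_C": (16, 16)}
    print(compare_profiles(evidence, suspect))
    print(is_excluded(evidence, suspect))
```

Sorting the alleles at each locus reflects the fact that the two fragments from a locus carry no inherent order. A profile that agrees at every shared locus is consistent with the suspect being the source, whereas a single mismatch is enough to exclude.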

Cross-References
▶ Key Enzymes Used in Cloning, Some

References
Chien A, Edgar DB, Trela JM (1976) Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J Bacteriol 127:1550–1557
Metzker ML (2010) Sequencing technologies – the next generation. Nat Rev Genet 11:31–46
Morling N (2009) PCR in forensic genetics. Biochem Soc Trans 37:438–440
Mullis KB, Faloona FA (1987) Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155:335–350
Saiki RK, Gelfand DH, Stoffel S, Scharf SJ, Higuchi R, Horn GT, Mullis KB, Erlich HA (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239:487–491
VanGuilder HD, Vrana KE, Freeman WM (2008) Twenty-five years of quantitative PCR for gene expression analysis. Biotechniques 44:619–626

Polypeptide
▶ Secondary Structure

PolyQ Diseases
▶ Polyglutamine Folding Diseases

Posttranscriptional Regulation
▶ Cytoplasmic mRNA, Regulation of

Post-Translational Modifications in DNA Double Strand Break Repair, Roles of
Prabha Sarangi and Xiaolan Zhao
Molecular Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA

Synopsis
Genomic integrity is constantly challenged by DNA lesions, several thousands of which occur in each human cell every day. A particularly hazardous type of DNA lesion is the double-strand break (DSB), which can lead to large genetic alterations if not repaired accurately. To cope with DSBs, cells have evolved three highly conserved repair mechanisms: homologous recombination (HR), non-homologous end joining (NHEJ), and telomere addition. These competing mechanisms can lead to different outcomes, and all are tightly regulated at the levels of pathway choice and repair efficiency. Increasing evidence highlights the important roles of post-translational modifications (PTMs) in this regulation. This essay summarizes the current understanding of how PTMs contribute to DSB repair, with an emphasis on the most recent progress in the field.

Introduction
Among the three pathways that heal DSBs, HR can be the most faithful form of repair. HR begins with a two-stage 5′–3′ resection of the DNA end to generate 3′ ssDNA. This is best understood in budding yeast, where it is initiated by the removal of short DNA fragments (end clipping) by the nuclease functions of the Mre11-Rad50-Xrs2 (MRX) complex and Sae2. The second stage of extensive resection is mediated by two parallel pathways: one involving the exonuclease Exo1 and the other entailing the helicase Sgs1 and the nuclease Dna2. The ssDNA generated by resection is rapidly bound by the ssDNA-binding protein complex replication protein A (RPA) that is later displaced by Rad51 with the help of mediators such as Rad52 and Rad51 paralogs. The Rad51 nucleofilament then performs homology search, followed by strand invasion, DNA synthesis, and ligation of the fragments. The resulting DNA structures, called Holliday junctions (HJs), are either dissolved by the helicase Sgs1 in concert with its partners Top3 and Rmi1 to exclusively yield non-crossover products, or resolved by structure-specific endonucleases such as Mus81-Mms4 and Yen1 to give rise to either crossover or non-crossover products. Alternatively, the newly synthesized DNA strand can be displaced without generating HJs using the synthesis-dependent strand annealing (SDSA) mechanism. HR is usually error-free when repairing from the sister chromatid, but can lead to loss of heterozygosity when the homolog is used as template. Besides the canonical HR pathways, other forms of HR also exist, such as single-strand annealing (SSA), which leads to deletion of the intervening sequence between two direct repeats, and break-induced replication (BIR), which involves extensive copying of another genomic region.

While HR is the preferred DSB repair pathway in yeast, NHEJ is the predominant form of repair in higher eukaryotes. In NHEJ, broken DNA ends are stitched together based on short regions of homology in the vicinity of the break. This requires the Ku70/Ku80 heterodimer that binds with avidity to DSB ends. Nearly simultaneously, the MRX complex also binds to and bridges the ends. The Ku80 and Xrs2 subunits of the two complexes recruit the ligase Dnl4 and the ligase-associated factor Lif1, respectively, after which alignment of the homology regions results in ligation of the ends to produce an intact DNA molecule. In human cells, additional proteins such as DNA-PKcs are also involved in NHEJ. NHEJ is frequently inaccurate and, in extreme cases, can lead to chromosomal translocations.

Competing with HR and NHEJ is the de novo telomere addition pathway. The machinery that mediates legitimate telomere elongation at native telomeres can also add telomere repeats at DSB ends, particularly when the ends contain some telomere repeat sequences. This requires the recruitment of telomerase to the DSB site and is influenced by multiple positive and negative telomerase regulators, such as the Pif1 helicase, the ssDNA-binding protein Cdc13, and the Ku complex. Telomere addition often results in the loss of genetic information and is a major form of gross chromosomal rearrangement seen in yeast.

PTMs Provide the Required Dynamics and Adaptability in DSB Repair
The choice of repair by HR or NHEJ or chromosome healing by telomere addition not only differs between organisms, but is also closely linked with the cell cycle stage. HR requires a homologous donor sequence, ideally on a sister chromatid, and hence is invoked in the S and G2/M phases, while NHEJ can function throughout the cell cycle. End healing by telomere addition is likely inhibited in all cell cycle stages and functions as a last resort or is a consequence of incomplete HR. In addition, there are multiple levels of regulation within each repair pathway. For example, steps within a pathway are tightly coordinated, repair efficiency can be upregulated in response to increasing DNA damage, and repair can be restricted to specific nuclear regions. This temporal and spatial regulation of DSB healing demands dynamic mechanisms. Because PTMs can impart exquisite versatility to the normal function of a protein, they are ideally suited to regulate DSB repair at all levels. By introducing charges, new protein-binding surfaces, or altering protein conformation, PTMs can lead to the recruitment of DNA repair proteins, modulation of their activity at lesion sites, or their deactivation after completion of function. Several PTMs such as phosphorylation, ubiquitination, and sumoylation have been implicated in DSB repair. Since several recent reviews have discussed the role of phosphorylation in DSB repair, this chapter only touches upon new findings in this field and focuses instead on ubiquitination and sumoylation, modifications that are fast emerging as pivotal to successful DSB repair.

Phosphorylation Regulates HR at the Beginning, the Middle, and the End
Phosphorylation-based regulation can sanction repair at different cell cycle stages. To restrict HR pathways to S and G2/M phases, the first step of resection is activated by two phosphorylation events. Upon entry into S phase, S-CDK phosphorylates Sae2 and its homologs, thereby stimulating their activity and tilting the balance in favor of HR over NHEJ (Polo and Jackson 2011, and references therein). In addition, S-CDK phosphorylates Dna2, consequently facilitating its recruitment to DSBs and promoting resection of the DSB end (Chen et al. 2011). Furthermore, a large number of proteins involved in DSB healing are phosphorylated by checkpoint kinases in response to increased levels of DNA damage. For example, at least 12 mammalian proteins involved in HR are phosphorylated by the Mec1 homolog, ATR, to favor HR. A case in point is the Mec1-dependent phosphorylation of the recombinase Rad51 that can promote HR repair, possibly by triggering inactivation of the protein after strand exchange, thus facilitating its turnover (Flott et al. 2011). Phosphorylation can also determine the fate of recombination products. A new study has shown that Cdc5-mediated phosphorylation of Mms4 activates it in G2/M phase, while phosphorylation of Yen1 maintains it in an inactive state until anaphase (Matos et al. 2011). The activation of HJ resolvases only in the later cell cycle stages makes the Sgs1/Top3/Rmi1 dissolvase the preferred enzyme to process HJs in a non-crossover configuration in S phase, resulting in fewer changes in genetic information, a more favorable outcome in mitotic cells.

Ubiquitin Chains at DSB Sites in Vertebrate Cells
The conjugation of ubiquitin (a 76-amino-acid protein) to lysine residues of substrate proteins requires an E1 activating enzyme, an E2 conjugating enzyme, and an E3 ligase that recognizes the substrate. Typically, mammals possess 2 E1s, 30 E2s, and several hundred E3s. Substrates can be modified by the attachment of a ubiquitin monomer, but more commonly by ubiquitin polymers or chains. At least seven types of ubiquitin chains can be formed by attaching ubiquitin to different lysines of another ubiquitin molecule, and these are decoded differently. While K48-linked ubiquitin chains often result in proteolysis, K63 and K6 linkages mainly lead to signal transduction and protein recruitment.

Evidence linking ubiquitination to DSB repair comes primarily from studies in vertebrates (Polo and Jackson 2011, and references therein). Ubiquitin conjugates and several E3s implicated in genomic integrity, such as BRCA1, RNF8, RNF168, RAD18, HERC2, and PRC1, accumulate at DSB sites within minutes of irradiation. These E3s mediate the modification of multiple proteins, thereby promoting the formation of repair platforms or enhancing the activities of repair enzymes. For example, ubiquitination of the histones H2A, H2AX, and H2B has been implicated in facilitating the DSB localization of downstream repair proteins such as 53BP1 and seeding additional ubiquitination events by the recruitment of the E3s BRCA1 and RNF168. BRCA1, a tumor suppressor, collaborates with a conformationally similar protein, BARD1, to catalyze ubiquitination. One of BRCA1’s targets is CtIP, the human homolog of Sae2. This modification increases the local concentration of CtIP, possibly by affecting its chromatin-binding ability. Since BRCA1 is involved in a diverse range of cellular pathways, it likely links DNA repair with cell cycle control and transcription to maintain genomic stability.

Other important ubiquitin enzymes involved in DSB repair include the E3 Rad18 and the E2 Ubc13 (Polo and Jackson 2011, and references therein). Although the yeast proteins are primarily involved in post-replicative repair, their vertebrate counterparts have direct roles in DSB repair. RAD18 interacts with the Rad51 paralog, RAD51C, and possibly promotes its accumulation at damage sites. Intriguingly, the RING domain of RAD18, the signature domain of this class of E3s, is required chiefly for the recognition of RNF8-dependent ubiquitin chains leading to its recruitment to DSB sites, but not for its conjugation function. In addition, depletion of UBC13 in both human and DT40 chicken cells results in a decrease in HR, likely by affecting the early steps of DSB end processing. These data and others suggest that the communication of these ubiquitination enzymes with the HR machinery via the different forms of ubiquitin chains aids efficient and coordinated DSB repair in vertebrate cells.

Sumoylation, a New but Important Regulator at DSBs
SUMO is a small protein similar to ubiquitin found in all eukaryotes and essential in most. Like ubiquitination, sumoylation is a three-step process involving E1, E2, and E3 enzymes, but only one E1, one E2, and a few E3s have been found in most organisms. While lower eukaryotes possess only one SUMO, mammals express at least three isoforms: SUMO1 and the highly related SUMO2 and SUMO3, which differ in only three N-terminal residues.

Although a late entrant into the DSB repair research scene, sumoylation appears to have a strong influence on DSBs in both yeast and vertebrate cells (Polo and Jackson 2011; Bergink and Jentsch 2009, and references therein). The E2 UBC9, the E3s PIAS1 and PIAS4, and all three SUMO isoforms localize to DSB sites in mammalian cells. Additionally, depletion of PIAS1 and PIAS4 causes defects in both HR and NHEJ. One way in which sumoylation could affect DSB repair is by modifying the key repair proteins BRCA1 and 53BP1. Sumoylation of BRCA1 leads to a 10- to 20-fold stimulation of its ubiquitin ligase activity in vitro, indicating that sumoylation of BRCA1 contributes to its activity. Consistent with this, depletion of PIAS1/4 results in a decrease of K6-ubiquitin conjugates in γH2AX-labeled chromatin. In comparison, how SUMO affects 53BP1 is still enigmatic, although the protein is sumoylated at both its N and C termini.

Several other proteins functioning in DSB repair have also been identified as SUMO substrates (Polo and Jackson 2011; Bergink and Jentsch 2009, and references therein). In yeast and human cells, the recombination protein Rad52 is sumoylated in response to DSBs in both mitotic and meiotic cells. Sumoylation modulates several aspects of Rad52 function such as protein stability, localization, and dynamics. Mutant Rad52 that cannot be sumoylated is unstable when recombination intermediates accumulate, implying a role for sumoylation in shielding the protein from proteolysis, possibly by sequestering it into repair foci. Additionally, this mutant exhibits a defect in the nucleolar exclusion of Rad52, a process thought to be important for preventing catastrophic recombination events within the highly repetitive ribosomal DNA locus. Finally, Rad52 sumoylation may aid in recombination pathway selection, since the rad52 sumoylation-defective mutant displays higher levels of gene conversion at the expense of SSA and exhibits significantly shorter foci life span.

A conserved sumoylation of RPA has been reported both in yeast and humans. RPA70, the largest subunit of the human complex, is maintained in a hyposumoylated form during S phase through a physical association with SENP6, a desumoylating enzyme. However, upon treatment with the TopoI inhibitor camptothecin, this interaction is lost, leading to the sumoylation of RPA70, which in turn facilitates the recruitment of the recombinase Rad51 to damage foci. The importance of this modification is underlined by the camptothecin sensitivity and low levels of HR repair in cell lines that are defective for RPA sumoylation. Other HR proteins known to be sumoylated include Sgs1 and its human homologs BLM and WRN. While the sumoylation of BLM appears to promote its interaction with Rad51 at stalled replication forks, Sgs1 sumoylation seems to be uniquely required for its role in telomere-telomere recombination.


Proteins functioning in NHEJ are also targets for sumoylation. Sumoylation of XRCC4, the human homolog of Lif1, appears to be important for regulation of its nuclear localization, as mutating the sumoylation site on the protein causes its mislocalization to the cytoplasm. In addition, the Ku70/Ku80 heterodimer is sumoylated in a DNA damage-dependent manner, although the function(s) of this modification has not been elucidated to date. Several telomere proteins in yeast and human cells are also sumoylated. For example, sumoylation of Cdc13 is inhibitory to telomerase function, and that of the telomere protection complex, shelterin, regulates telomere recombination (Nagai et al. 2011, and references therein). It will be interesting to know whether these regulatory roles also impinge on de novo telomere addition at DSBs.

Interplay Between the Various PTMs in DSB Repair
Crosstalk between PTM pathways underlies the complex choreography of DNA repair processes at DSBs. An illuminating example is the initial damage-signaling cascade in mammalian cells, in which protein recruitment to DSB sites is aided by the coordinated action of all three PTMs discussed above (Polo and Jackson 2011, and references therein). The checkpoint kinase ATM phosphorylates histone H2AX within chromatin surrounding the break, which assists the recruitment of the checkpoint mediator protein MDC1. Subsequent phosphorylation of MDC1 by ATM recruits the ubiquitin E3 RNF8 to polyubiquitinate H2A and H2AX via K63 linkages, thereby transducing the damage-sensing phosphorylation signal into a ubiquitination cascade. This initiates a wave of protein recruitment that utilizes ubiquitinated histones as a platform for binding. The E3 RNF168 recognizes and binds to the K63-linked polyubiquitin chains and, in collaboration with the E2 UBC13 and the E3 HERC2, further amplifies the ubiquitin signal by expanding the H2A polyubiquitination domain. The sequential actions of RNF8 and RNF168 result in the recruitment of yet another ubiquitin E3, BRCA1, that can catalyze K6-linked ubiquitin chains and promote DSB repair by modifying targets such as CtIP. Sumoylation also plays important roles in this signaling cascade, since RNF168 accumulation at DSB sites is affected by PIAS4 depletion. Additionally, sumoylation of BRCA1 stimulates its ubiquitin ligase activity in vitro. More examples of such SUMO-regulated ubiquitin ligases (SRUBLs) are likely to emerge, as RNF168 and HERC2 are also modified by SUMO.

Beyond Recruitment and Activation: Other Roles for PTMs at DSBs
The role of PTMs also extends to protein disassembly and inhibition (Polo and Jackson 2011, and references therein). For instance, the phosphorylation of Chk1 by ATM and ATR in mammalian cells correlates with its release from chromatin, possibly enabling further propagation of the checkpoint signal. Along similar lines, phosphorylation of the fission yeast Rad9 component of the 9-1-1 complex permits its dissociation from chromatin and subsequent signal amplification. Additionally, dissociation of the NHEJ factors DNA-PKcs from the Ku complex, and Ku80 from DNA, by autophosphorylation and K48-linked polyubiquitination, respectively, may allow the access of downstream NHEJ factors to the DSB site.

PTMs by one enzyme can regulate all three DSB healing pathways. For example, Mec1 promotes HR and NHEJ by mediating the phosphorylation of Rad51 and Nej1, respectively, but disfavors telomere addition by causing the modification of both Pif1 and Cdc13. More broadly speaking, PTMs by the same enzyme can coordinate DSB repair with other processes that occur on chromatin surrounding the break. A recent example implicates ATM-mediated histone ubiquitination in the prevention of transcription in the vicinity of the break, in addition to its previously characterized function in recruiting repair factors (Shanbhag et al. 2010).

Predictions from Sequence

983

Other PTMs Involved in DSB Repair

References

Apart from the three modifications discussed above, several other PTMs have also been implicated in DSB repair. It is known that methylation of histone H3K79 is required for the recruitment of 53BP1 to DSB sites. Recently, it was shown that acetylation of Sae2 is important for its function in both yeast and humans. In the former case, inhibition of Sae2 deacetylation leads to its degradation, while in the latter, SIRT6-mediated deacetylation of CtIP promotes resection and HR (Kaidi et al. 2010; Robert et al. 2011). A recent study has unveiled a role for palmitoylation in regulating telomere protein function. Palmitoylation of Rif1, a protein that has roles in telomere dynamics and silencing, anchors it to the inner nuclear membrane and promotes the formation of discrete nuclear foci (Park et al. 2011).

Bergink S, Jentsch S (2009) Principles of ubiquitin and SUMO modifications in DNA repair. Nature 458:461–467 Chen X, Niu H, Chung WH, Zhu Z, Papusha A, Shim EY, Lee SE, Sung P, Ira G (2011) Cell cycle regulation of DNA double-strand break end resection by Cdk1dependent Dna2 phosphorylation. Nat Struct Mol Biol 18:1015–1019 Flott S, Kwon Y, Pigli YZ, Rice PA, Sung P, Jackson SP (2011) Regulation of Rad51 function by phosphorylation. EMBO Rep 12:833–839 Kaidi A, Weinert BT, Choudhary C, Jackson SP (2010) Human SIRT6 promotes DNA end resection through CtIP deacetylation. Science 329:1348–1353 Matos J, Blanco MG, Maslen S, Skehel JM, West SC (2011) Regulatory control of the resolution of DNA recombination intermediates during meiosis and mitosis. Cell 147:158–172 Nagai S, Davoodi N, Gasser SM (2011) Nuclear organization in genome stability: SUMO connections. Cell Res 21:474–485 Park S, Patterson EE, Cobb J, Audhya A, Gartenberg MR, Fox CA (2011) Palmitoylation controls the dynamics of budding-yeast heterochromatin via the telomerebinding protein Rif1. Proc Natl Acad Sci U S A 108: 14572–14577 Polo SE, Jackson SP (2011) Dynamics of DNA damage response proteins at DNA breaks: a focus on protein modifications. Genes Dev 25:409–433 Robert T, Vanoli F, Chiolo I, Shubassi G, Bernstein KA, Rothstein R, Botrugno OA, Parazzoli D, Oldani A, Minucci S et al (2011) HDACs link the DNA damage response, processing of double-strand breaks and autophagy. Nature 471:74–79 Shanbhag NM, Rafalska-Metcalf IU, Balane-Bolivar C, Janicki SM, Greenberg RA (2010) ATM-dependent chromatin changes silence transcription in cis to DNA double-strand breaks. Cell 141:970–981

Concluding Remarks The collective evidence reveals that the chromatin landscape around DSBs is shaped to a large degree by different protein modifications. The increasing number of PTMs found at DSB sites underscores their importance in regulating the dynamics of repair through direct modification of chromatin structures and repair factors. Although only a few substrates have been studied thus far, the observation that the ubiquitination and sumoylation machineries are recruited to DSBs suggests that many more will be detected. The next stage of research in this area will be directed towards both identifying additional substrates and providing a more mechanistic understanding of the roles of these modifications at DSB repair sites.

Cross-References ▶ DNA Recombination, Mechanisms of ▶ Double-Strand Break Repair ▶ End Joining, Classical and Alternative

Predictions from Sequence Scott Cooper and Anton Sanderfoot Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis All of the information about the final structure and function of a protein is encoded in the primary sequence of amino acids. Soon, it may be possible to determine much about the shape, function, and interactions of a given protein just by examining the peptide sequence. At present, this is rarely possible, but the application of bioinformatics may make it easier in the future.

Introduction Proteins have at least three levels of structure that determine their final shape and function. The primary structure of a protein is the linear order of amino acids encoded in the nucleic acid sequence of the gene. Secondary structure involves local folding of the peptide backbone, which forms hydrogen bonds with nearby parts of the backbone to produce alpha helices or beta sheets. Tertiary structure involves the three-dimensional folding of the local secondary structures into larger domains. It involves further interactions among the peptide backbone and the amino acid side chains, including hydrogen bonds, ionic interactions, hydrophobic effects, and (at least in secreted or lumenal proteins) disulfide bonds between spatially adjacent cysteine residues. Some proteins further assemble into quaternary structures that involve the same types of interactions as tertiary structure, except that the interactions are with other protein chains in a larger multiprotein structure. It is important to note that the majority of the shape/function information is encoded in the primary sequence. Using the primary sequence of a protein, it is possible to identify motifs and domains that have been shown to have functional implications (see the Protein Sequence Information to Identify Structural Motifs essay). In fact, as these algorithms have matured with new information, the boundaries between motifs, domains, and tertiary structure have begun to blur. As a first approach, one should probably begin with the algorithms discussed in that essay to learn more about the potential functions of a given protein sequence. This entry will only discuss those algorithms that explicitly attempt to predict the structure of a protein based solely on a peptide sequence.

Secondary Structure Prediction Early studies into the amino acid composition of secondary structural elements (alpha helices and beta sheets) showed that remarkable sequence heterogeneity is possible in otherwise identical elements (Chou and Fasman 1974). In other words, many different sequences of amino acids are capable of forming structurally similar alpha helices or beta sheets. Yet different primary sequences can still result in very different tertiary structures. It is not that an alpha helix is always produced from one particular amino acid sequence and a beta sheet from another. Instead, it is the summed tendency of a particular stretch of amino acids to form (or not form) a particular secondary element that determines the local folding. In the 1970s, Chou and Fasman tabulated the tendency of each amino acid to be found within a defined secondary structure in the tertiary structures of proteins known at the time. Automating these tables into algorithms was straightforward, and many sites still host an algorithm that will perform a Chou-Fasman analysis of a given peptide sequence. The output of such an analysis is a prediction of the tendency of a particular stretch of amino acids to form a helix, a sheet, or a “coil” (i.e., neither helix nor sheet). This can be useful as a first approach, but it is sensitive to the limited number of solved tertiary structures available at the time of tabulation. It also provides no information about how those secondary elements interact to produce tertiary structure. Later improvements were made by Garnier, Osguthorpe, and Robson (the GOR method; Garnier et al. 1978), who added Bayesian probability to the predictive algorithm. Again, many sites offer a GOR analysis or more recent approaches that add prediction of turns, coiled coils, and other secondary elements. These algorithms are likewise sensitive to the structural data available at the time of tabulation and are limited in the amount of information they produce. While these types of predictions can be useful, the addition of homology modeling and the explosion of new solved structures have made them almost obsolete.
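The idea of a "summed tendency" over a stretch of residues can be sketched in a few lines of Python. This is a minimal illustration only, not the Chou-Fasman algorithm itself: the propensity values below are invented for the example and cover just six residues, and the window size, threshold, and function names are assumptions. A real analysis would use the published propensity tables together with the nucleation and extension rules of the original method.

```python
# Illustrative propensities only: these numbers are made up for the example
# and are NOT the published Chou-Fasman values.
HELIX = {"A": 1.4, "E": 1.5, "V": 1.0, "I": 1.0, "G": 0.6, "P": 0.6}
SHEET = {"A": 0.8, "E": 0.4, "V": 1.7, "I": 1.6, "G": 0.75, "P": 0.55}


def mean_propensity(window, table):
    """Average propensity of the residues in one window."""
    return sum(table[residue] for residue in window) / len(window)


def predict_windows(seq, helix=HELIX, sheet=SHEET, window=5, threshold=1.0):
    """Label each full-length window as helix (H), sheet (E), or coil (C).

    A window is called H or E when the corresponding mean propensity exceeds
    the threshold; if both do, the larger mean wins. Otherwise it is coil.
    """
    labels = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        h = mean_propensity(segment, helix)
        e = mean_propensity(segment, sheet)
        if h > threshold and h >= e:
            labels.append("H")
        elif e > threshold:
            labels.append("E")
        else:
            labels.append("C")
    return "".join(labels)


# One label per window: helix-like start, sheet-like middle, coil-like end.
print(predict_windows("AEAEAVIVIVGPGPG"))
```

Note that the result has one label per window rather than per residue; mapping window calls back onto individual residues is one of the design decisions a real implementation has to make.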

Tertiary Structure Prediction by Homology Modeling The basics of this type of approach are the same as described for motif and domain analysis in the Protein Sequence Information to Identify Structural Motifs essay, so reading that essay first is recommended. This method first requires identifying a homolog of the peptide sequence for which some structural evidence is available. The general assumption of all homology modeling methods is that a homologous sequence will produce a homologous structure. Any areas of gapped alignment are assumed to represent “loops” or elements external to the homologous fold. The method improves when a multiple sequence alignment is used to identify families of proteins with presumed structural homology. Further improvements become possible if several members of the multiple sequence alignment have structural information. Of course, such models are extremely sensitive to a poor choice of homology model (i.e., if the model protein is not a good homolog or if the alignment is poor). An example of a homology modeling algorithm is SWISS-MODEL from the Swiss Institute of Bioinformatics Web site (http://swissmodel.expasy.org; Arnold et al. 2006). The process begins with a BLAST search for homologous proteins in its database of known structures. In the absence of a good BLAST match, the algorithm will attempt to use hidden Markov models to identify weaker homology regions. The query sequence is then modeled onto the known match, and the match is assessed by various means to determine the probability of a good model. Other algorithms include MODELLER (Fiser and Sali 2003) and the various Rosetta@home projects (http://boinc.bakerlab.org/rosetta), which include ab initio, crowdsourcing, and other computational approaches to identify the structure of unknown sequences. Importantly, each of these approaches tries to estimate the accuracy of its predictions. It is important to pay attention to the level of accuracy and to understand the limitations of these approaches.
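Two quantities from the paragraph above can be read directly off a pairwise alignment: how similar the query is to the template, and where the gapped regions (candidate "loops") fall. The following Python sketch shows one way to compute them. It is not part of SWISS-MODEL, MODELLER, or Rosetta; the function name, the use of `-` as the gap character, and the example sequences are all assumptions made for illustration.

```python
def alignment_summary(query_aln, template_aln, gap="-"):
    """Summarize a pairwise alignment given as two equal-length gapped strings.

    Returns (percent_identity, gap_runs), where percent identity is computed
    over ungapped columns only and gap_runs is a list of (start, end) column
    ranges (end exclusive) in which either sequence has a gap.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    aligned = identical = 0
    gap_runs = []
    run_start = None
    for column, (q, t) in enumerate(zip(query_aln, template_aln)):
        if q == gap or t == gap:
            if run_start is None:
                run_start = column          # a new gapped region begins
        else:
            if run_start is not None:
                gap_runs.append((run_start, column))
                run_start = None
            aligned += 1
            identical += (q == t)
    if run_start is not None:               # alignment ends inside a gap
        gap_runs.append((run_start, len(query_aln)))

    identity = 100.0 * identical / aligned if aligned else 0.0
    return identity, gap_runs


# Hypothetical aligned query/template pair.
identity, loops = alignment_summary("MKT-AYIAKQR", "MKTLAYV--QR")
print(identity, loops)
```

A low identity or many long gapped regions would be a warning that the template may be a poor choice, which is the sensitivity described above.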

Tertiary Structure Prediction by Threading A second approach to tertiary structure prediction is protein threading. Overall, the approach is similar to homology modeling in that it requires a model of known structure to begin, though in this case the model is a database of all known protein folds (e.g., the Structural Classification of Proteins, SCOP; http://scop.mrc-lmb.cam.ac.uk/scop/), rather than a single homologous structure. The concept is that there are only so many ways to fold a protein, and any given sequence will match one of those possibilities if a large enough database of folds is tested. The method begins by choosing a given fold, overlaying the query protein sequence onto it, and determining whether the sequence can plausibly fit into that fold. The same process is then tried on a second fold, and so on. Eventually, one particular fold is identified as the best fit, and the model is produced based on that series of folded elements. Compared to homology modeling, threading is generally more useful for proteins with poor homology to known proteins; however, the results and their limitations should be treated with appropriate caution. Examples of threading algorithms include HHpred (Söding et al. 2005) and the Protein Homology/analogY Recognition Engine (Phyre; Kelley et al. 2011). Each is part of a larger package of algorithms with guidance on how best to use the software. Users should consult the instructions included with each package to determine how best to proceed in any investigation of the structure of their unknown protein sequence.
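The loop described above (try one fold, score the fit, move to the next, keep the best) can be sketched as a toy in Python. Everything in this example is a simplifying assumption: folds are reduced to strings marking each position as buried (`b`) or exposed (`e`), the scoring rule simply rewards hydrophobic residues at buried positions and polar residues at exposed ones, and no gaps are allowed. Real threading programs such as those named above use gapped alignments and far richer statistical scoring, so this is only meant to show the shape of the search.

```python
# Residues treated as hydrophobic in this toy example (an assumption).
HYDROPHOBIC = set("AVILMFWC")


def thread_score(sequence, environment):
    """Score a sequence laid onto one fold, position by position.

    `environment` is a string of 'b' (buried) or 'e' (exposed), one character
    per residue. Each compatible position scores +1, each mismatch -1.
    """
    if len(sequence) != len(environment):
        raise ValueError("this toy example does not allow gaps")
    score = 0
    for residue, env in zip(sequence, environment):
        is_hydrophobic = residue in HYDROPHOBIC
        is_buried = env == "b"
        score += 1 if is_hydrophobic == is_buried else -1
    return score


def best_fold(sequence, fold_library):
    """Try every fold in the library and return (best_name, all_scores)."""
    scores = {name: thread_score(sequence, env)
              for name, env in fold_library.items()}
    return max(scores, key=scores.get), scores


# Hypothetical three-fold "library" for an eight-residue query.
library = {
    "alternating": "bebebebe",
    "buried_core": "eebbbbee",
    "all_exposed": "eeeeeeee",
}
print(best_fold("LKVEIKAD", library))
```

Even in this toy form, the output makes the point in the text: a best-scoring fold is always returned, so the score itself has to be inspected before trusting the result.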

Conclusions Considering the speed with which structural genomics projects (e.g., the Structural Genomics Consortium (http://www.thesgc.org); Protein Structure Initiative (http://www.nigms.nih.gov/Research/SpecificAreas/PSI/Pages/default.aspx/); Riken Structural Genomics/Proteomics Initiative (http://www.rsgi.riken.go.jp/rsgi_e/index.html)) have been adding new protein structural data to databases, the simplest way to know the structure of an unknown protein might be to simply wait for someone to solve it for you. Nonetheless, a common result for a peptide sequence is too often “unknown protein.” Using bioinformatics approaches, it is becoming more common that even in an “unknown protein,” some elements of sequence will fit a known motif or domain or match a known tertiary structure. Using the approaches described in the Protein Sequence Information to Identify Structural Motifs essay will likely provide a starting point. The next step, modeling the whole tertiary structure, may become possible using the approaches given here. Of course, appropriate caution must be taken with the results of any structure prediction, but improvements to the algorithms and to the ever-expanding databases of known structures occur each year. It may one day be possible to know almost everything about the shape, function, and interactions of an unstudied protein, just by looking at the sequence.

Cross-References ▶ Predictions from Sequence ▶ Primary Structure ▶ Sequence Information to Identify Motifs ▶ Tertiary Structure Domains, Folds and Motifs

References
Arnold K, Bordoli L, Kopp J, Schwede T (2006) The SWISS-MODEL workspace: a web-based environment for protein structure homology modeling. Bioinformatics 22:195–201
Chou PY, Fasman GD (1974) Prediction of protein conformation. Biochemistry 13(2):222–245
Fiser A, Sali A (2003) Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol 374:461–491
Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120:97–120
Kelley L, Bennett-Lovsey R, Herbert A, Fleming K (2011) Phyre: protein homology/analogY recognition engine. Structural Bioinformatics Group, Imperial College, London. Retrieved 22 Apr 2011
Protein Structure Initiative (2014) http://www.nigms.nih.gov/Research/SpecificAreas/PSI/Pages/default.aspx. Accessed 21 Apr 2014
Riken Structural Genomics/Proteomics Initiative (2014) http://www.rsgi.riken.go.jp/rsgi_e/index.html. Accessed 21 Apr 2014
Rosetta@home (2014) http://boinc.bakerlab.org/rosetta. Accessed 21 Apr 2014
Söding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucl Acids Res 33(Web Server issue):W244–W248
Structural Classification of Proteins (2014) http://scop.mrc-lmb.cam.ac.uk/scop/. Accessed 21 Apr 2014
Structural Genomics Consortium (2014) http://www.thesgc.org. Accessed 21 Apr 2014
Swiss-Model (2014) http://swissmodel.expasy.org. Accessed 21 Apr 2014

Pre-mRNA Processing ▶ Co-transcriptional mRNA Processing in Eukaryotes

Pre-mRNA Splicing ▶ Co-transcriptional mRNA Processing in Eukaryotes

Primary Structure Scott Cooper and Anton Sanderfoot Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis The primary structure of a protein is its amino acid sequence. This information determines the ultimate structure and function of the protein.

Primary Structure, Fig. 1 Condensation reaction forming a peptide bond

Introduction Proteins are made of chains of amino acids joined through peptide bonds, forming the primary sequence of the protein. The primary amino acid sequence of a protein determines the three-dimensional structure that the protein will fold into. In addition, scientists can use the primary sequence to identify many functional attributes of a protein, such as enzyme active sites, ligand and metal binding sites, motifs, and domains. Finally, protein sequences are used to study protein families within a species and taxonomic relationships between species. Because of this, several methods have been developed to determine the primary sequence of a protein, both from the protein itself and from the gene or mRNA that encodes it.

Primary Structure of Proteins Amino acids have a carboxylic acid (COOH) at one end, an amino group (NH2) at the other end, and a side chain (or R group) attached to the central α carbon. Proteins are composed of these amino acids joined together by covalent amide bonds called peptide bonds. To form a peptide bond, the amino group of one amino acid carries out a nucleophilic attack on the carboxyl carbon of an adjacent amino acid, displacing water. The resulting dehydration or condensation reaction forms an amide linkage between the two amino acids (Fig. 1). The reaction is reversed when water breaks the peptide bond in a hydrolysis reaction. A peptide bond has electron resonance that gives it a partial double bond character. Because of this, protein backbones have some restrictions on rotation and are also more reactive at the peptide bond linkage, which has a direct influence on protein sequencing reactions. There are 20 common naturally occurring amino acid R groups, and the chemical properties of these R groups affect the structure and function of a protein (Fig. 2). The order of amino acids in a protein is called its primary sequence, with one end of the protein having a free amino group (N terminus) and the other end having a free carboxylic acid (C terminus). It is useful to determine the amino acid sequence of a protein for several reasons. Primary sequence can be used to predict the secondary structures (α helices and β sheets) and the functional domains and motifs that a protein will assume. The primary sequence of a protein can also be used to identify an unknown protein by comparing it to the translated sequences in a genome. Finally, primary sequences can be compared between species in phylogenetic and taxonomic studies. There are two approaches to determining the primary sequence of a protein: (1) sequence the protein directly or (2) sequence the RNA or DNA that encodes the protein and translate the open reading frame into an amino acid sequence.
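The second approach, translating an open reading frame into an amino acid sequence, is simple enough to sketch in Python. The example below assumes the standard genetic code and a deliberately naive definition of an open reading frame (the first ATG on the given strand, read until the first in-frame stop codon). The function name and the test sequence are hypothetical; real gene annotation also has to consider both strands, all reading frames, introns, and alternative start sites.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base T
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_first_orf(nucleotides):
    """Translate from the first ATG to the first in-frame stop codon.

    Accepts DNA or RNA (U is treated as T). Returns the one-letter amino acid
    sequence, or an empty string if no start codon is found. Translation also
    stops at the end of the sequence if no stop codon is reached.
    """
    dna = nucleotides.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":               # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_first_orf("CCATGGCTAAATGA"))  # codons ATG GCT AAA, then stop TGA
```

The resulting string is written from the N terminus to the C terminus, matching the direction in which the primary sequence is conventionally reported.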

Primary Structure, Fig. 2 Amino acid side chains, shown in four panels (a–d) with pKa values for the ionizable groups. pKa data: CRC Handbook of Chemistry, v. 2010. Figure by Dan Cojocari, Department of Medical Biophysics, University of Toronto, 2011 (Courtesy U. Toronto)


Cross-References ▶ Predictions from Sequence


Processing Bodies ▶ Cytoplasmic mRNA, Regulation of

Product Operator Formalism ▶ NMR Basis (Theory) for Biomolecular Structure

Prokaryotic Gene Regulation by Sigma Factors and RNA Polymerase Mark Paget School of Life Sciences, University of Sussex, Falmer, Brighton, UK

Synopsis In bacteria, sigma factors are dissociable protein subunits of RNA polymerase that play a key role in promoter selection and transcription initiation. Together with an essential principal sigma factor that is required for the expression of most housekeeping genes, bacteria often deploy alternative sigma factors in response to specific signals in order to express specialized sets of genes. These alternative sigma factors can control a wide range of adaptive responses, such as morphological development, starvation survival, and oxidative stress. This entry initially describes the basic function of sigma factors in transcription initiation and their classification into phylogenetically and structurally distinct families. Following this, the posttranslational control of sigma factor activity, primarily by protein inhibitors called anti-sigma factors, is described. Finally, the entry covers three fundamentally distinct mechanisms through which sigma factor activity is activated in response to an appropriate signal.

Introduction

Transcription of genes is a cyclic process that can be roughly divided into initiation, elongation, and termination stages. Initiation itself occurs in several steps: the recognition of promoter DNA elements by RNA polymerase (RNAP) to form a closed (double-stranded) complex; the unwinding and localized melting of the DNA to form an open complex, which reveals the template strand to the active site of the enzyme; the cyclic production of short “abortive” transcripts; and the eventual “escape” of the enzyme into the processive stage of elongation. RNAP, with the aid of elongation factors, then extends the RNA chain until a termination sequence is reached, whereupon it dissociates from both the DNA template and the nascent RNA. The basic core RNAP, which consists of five subunits (α2, β, β′, and ω), is catalytically active but rather promiscuous in vitro, unable to distinguish between promoter and non-promoter DNA. Specific initiation at promoters requires an additional dissociable subunit named sigma (σ) that provides specific DNA-binding determinants in both closed and open complexes. However, sigma is not required for elongation and, following promoter escape, dissociates from elongating RNAP in a stochastic manner. This gives rise to a sigma cycle that functions within the larger transcription cycle, whereby the dissociated sigma enters a pool of competing sigma factors that can rebind core RNAP. Upon identifying the principal and essential sigma factor in Escherichia coli (σ70), Richard Burgess and colleagues (Burgess et al. 1969) immediately realized that this sigma cycle would facilitate a regulatory strategy involving the controlled production of additional (alternative) sigma factors with specificity for different initiation sites, allowing the reprogramming of RNAP for the expression of specific regulons. It is now clear that alternative sigma factors represent a major mechanism for controlling gene expression in bacteria. However, the extent to which alternative sigma factors are deployed is highly variable and appears to roughly correlate with environmental complexity. For example, soil-dwelling organisms such as the Streptomyces or Myxobacteria can possess more than 60 sigma factors, whereas simple intracellular bacteria such as Mycoplasma genitalium might possess just one. This essay will describe the basic structure and functional organization of sigma factors and some general principles for how gene expression can be modulated by the production and control of alternative sigma factors.

Two Major Families of Sigma Factors and Their Promoters Sigma factors can be classified in several ways. At the highest level, they can be divided into two families based on their homology to two distinct sigma factors in E. coli, σ70 and σ54, named after their apparent molecular weights when first described. The σ54 family is widespread in bacteria, although most species possess only a single member. These sigma factors direct initiation at promoters that are characterized by short sequence motifs centered 24 bp (GG) and 12 bp (TC) upstream from the transcription initiation point (located at +1). Unlike the σ70 family (see below), RNAP containing σ54 invariably requires the input of bacterial enhancer proteins and ATP hydrolysis to drive the process of promoter melting (isomerization). Although originally described for their role in nitrogen assimilation, σ54 factors are now known to play a role in a wide range of biological activities, from the expression of virulence genes to motility. This family will not be discussed further, and the reader is directed elsewhere (Buck et al. 2000) for further information. The extensive σ70 family can be further divided into four main groups (groups I-IV) based on phylogenetic relationships (including the presence and absence of domains) and biological function (Paget and Helmann 2003). Each recognizes short primary promoter elements centered at −35 and −10 with respect to the transcription initiation point. Some promoters have additional sequence elements that interact with distinct regions of sigma, in particular short motifs located just upstream from the −10 sequence (extended −10 element) and between the −10 sequence and the +1 position (discriminator sequence). These additional elements can influence the strength and regulation of the promoter and, in the case of the extended −10 element, can remove the requirement for a −35 element.
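To make the idea of −35 and −10 promoter elements concrete, the short Python sketch below scans a DNA sequence for the best match to the canonical σ70 consensus hexamers TTGACA (−35) and TATAAT (−10). It is an illustration under stated assumptions, not a validated promoter predictor: the spacer lengths of 16 to 18 bp between the two hexamers, the simple match-counting score, the function names, and the test sequence are all choices made for this example, and the extended −10 and discriminator elements described above are ignored.

```python
MINUS_35 = "TTGACA"   # canonical sigma-70 -35 consensus
MINUS_10 = "TATAAT"   # canonical sigma-70 -10 consensus


def count_matches(site, consensus):
    """Number of positions at which a candidate site matches the consensus."""
    return sum(a == b for a, b in zip(site, consensus))


def scan_promoter(dna, spacers=(16, 17, 18)):
    """Find the best-scoring -35/-10 pair on the given strand.

    Returns (score, pos35, pos10, spacer) with 0-based start positions, where
    score is the total number of consensus matches (maximum 12), or None if
    the sequence is too short. The spacer lengths are an assumption.
    """
    dna = dna.upper()
    best = None
    for pos35 in range(len(dna) - len(MINUS_35) + 1):
        for spacer in spacers:
            pos10 = pos35 + len(MINUS_35) + spacer
            if pos10 + len(MINUS_10) > len(dna):
                continue
            score = (count_matches(dna[pos35:pos35 + 6], MINUS_35)
                     + count_matches(dna[pos10:pos10 + 6], MINUS_10))
            if best is None or score > best[0]:
                best = (score, pos35, pos10, spacer)
    return best


# Hypothetical test sequence: both hexamers, separated by a 17-bp spacer.
test_dna = "GC" + "TTGACA" + "A" * 17 + "TATAAT" + "GCGCA"
print(scan_promoter(test_dna))
```

Natural promoters rarely match both hexamers perfectly, so in practice the score would be compared against a cutoff rather than expected to reach the maximum.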

Structure and Function of the σ70 Family of Sigma Factors All bacteria have a single essential group I sigma factor that is closely related to σ70 and is responsible for the expression of most housekeeping genes. The primary promoter elements recognized by group I sigma factors are related to the canonical consensus elements recognized by σ70: TTGACA and TATAAT centered at −35 and −10, respectively. The group I sigma factors have four conserved regions (1–4) that reflect distinct functional and structural domains (Fig. 1) (Murakami and Darst 2003). Region 1.1 (σ1.1 domain) is negatively charged and appears to play a role in inhibiting the DNA-binding activity of free sigma through interacting with σ2 and σ4 (see below). Region 1.2–2.4 (σ2 domain) forms part of a major interface with RNAP through contacting a conserved coiled coil in the β′ subunit. Upon DNA melting, it also makes base-specific interactions with the non-template strand in both the −10 element and the discriminator sequence (GGGa), thereby capturing the DNA and stabilizing the open complex. Region 3 (σ3 domain) contacts a TG consensus motif, which is separated by one base from the −10 element at extended −10 promoters. Region 4 (σ4 domain) forms part of another major interface with RNAP by contacting the “flexible flap” domain of the β subunit (β-flap) of RNAP and also makes contact with the −35 promoter element through a classical helix-turn-helix structural motif. The three α-helical domains are separated by flexible linkers and are spread over one face of RNAP, with the σ2 and σ4 domains spaced appropriately to contact the −10 and −35 promoter elements, respectively. Strikingly, the linker between σ3 and σ4 (Region 3.2) loops into the RNAP close to the active site and out through the RNA exit channel.

Prokaryotic Gene Regulation by Sigma Factors and RNA Polymerase, Fig. 1 Structure and function of principal sigma factor. (a) Domain organization based on the structure of the Thermus aquaticus SigA. The sigma factor is divided into four conserved domains σ1.1, σ2, σ3, and σ4, separated by flexible unstructured linkers. Each domain is divided into several regions and subregions with specific functions. NCR, nonconserved region between subregions 1.2 and 2.1, present in T. aquaticus. (b) Consensus sequence of promoters recognized by the principal sigma factor σ70 in E. coli. Key interactions between protein and DNA are indicated. Transcription initiates at +1. DISCRIM, discriminator sequence

Group II sigma factors are structurally highly similar to the group I proteins, yet play specialized, nonessential roles. For example, in E. coli σ38 (also known as σS) is responsible for the general stress response and is required for survival during stationary phase. It recognizes promoters very similar to those recognized by σ70, although one key difference appears to be an interaction between σ3 and a C nucleotide located at −13 in an extended −10 promoter element.

Group III sigma factors are structurally and functionally diverse but tend to include all three of the globular domains (σ2–σ4). Their functions include coordinating flagellar biosynthesis (E. coli σ28), global stress responses (Bacillus subtilis σB), heat shock responses (E. coli σ32), and sporulation (B. subtilis σF). The primary promoter elements are distinct from those recognized by group I and II sigma factors, although some key nucleotides in the −10 region are conserved (A at −11). In some cases (e.g., E. coli σ28 and σ32), but not others, there are contacts between σ3 and an extended −10 element.

The group IV sigma factor family is also known as the extracytoplasmic function (ECF) family since many members are involved in cellular responses to the external environment. These are the smallest sigma factors because they lack both the σ1.1 and σ3 domains, and they are highly diverged both within the group and from the other groups. This is numerically the largest family, although the number of ECF sigma factors encoded by bacterial genomes varies widely, with some organisms possessing more than 45 members. ECF sigma factors are involved in a diverse range of cellular responses including oxidative stress, outer membrane misfolding, and iron uptake.

Control of Alternative Sigma Factor Activity by Sequestration Alternative sigma factors tend to synchronously switch on a regulon of genes in response to a specific signal, and therefore the primary level of regulation of these genes is through varying the active level of the sigma factor itself. A wide range of mechanisms for controlling sigma factor activity have been uncovered. Many sigma factors are controlled at the level of synthesis and degradation; however, since such mechanisms are common to many other cellular proteins, they are not covered here. Some “pro-σ” factors are activated by controlled proteolysis, involving the removal of an inhibitory extension, commonly located at the N-terminus (e.g., the activation of pro-σE and pro-σK in the mother cell during sporulation in B. subtilis, see below). Most characteristic, though, is control by sequestration by binding partners known as anti-sigma factors, which usually prevent the association of the sigma factor with core RNAP (Österberg et al. 2011). Sigma and anti-sigma genes are often co-transcribed, which might ensure stoichiometric co-expression. The sigma factor is released in response to a signal that is perceived either by the anti-sigma factor itself or by additional components in more complex signal-relay-type systems. Anti-sigma factors are therefore often modular, with at least two domains: a sigma-binding domain and a more varied sensory domain that has evolved to sense one of a multitude of signals, either inside or outside of the cell.

Mechanism of Sigma Factor Inhibition

As noted above, the interface between sigma factors and RNAP is extensive and includes distinct binding determinants located in σ2 and σ4. In order to effectively sequester sigma factors from RNAP, anti-sigma factors therefore need to mask both these regions. Structural information revealing how this is achieved is available for several sigma factors including the group III sigma factor σ28 from Aquifex aeolicus and the group IV sigma factors σE from E. coli and σE from Rhodobacter sphaeroides (Campbell et al. 2008). In many motile bacteria, σ28 or its orthologues coordinate the expression of late genes in flagella biosynthesis. It is produced before late gene expression is switched on but held inactive by the anti-sigma factor FlgM. FlgM is a small extended protein with three helices that wraps around the outside of σ28 and occludes both σ2 and σ4 from binding RNAP. In the σ28–FlgM complex, the structure of σ28 is strikingly different from what would be expected if it were bound to RNAP, with the promoter-binding determinants buried within a compact unit. Interestingly, this form, which appears to be self-inhibitory for DNA binding, occurs in the absence of FlgM; the role of FlgM appears to be both to stabilize this structure and to mask the RNAP-binding determinants. The cognate anti-sigma factors of E. coli σE and R. sphaeroides σE are RseA and ChrR, respectively. While RseA and ChrR are unrelated in sequence, they each possess an N-terminal anti-sigma domain (ASD) fold that is widespread in bacteria, potentially controlling up to 35% of all group IV sigma factors. In ChrR, this fold is stabilized by a structural zinc atom, characteristic of the ZAS (zinc-binding anti-sigma) subgroup, each member of which possesses a highly conserved HisxxxCysxxCys motif that includes three of the four zinc-binding ligands. In contrast, in RseA the fold is stabilized by extensive hydrophobic interactions. In each case the anti-sigma factors interact directly with the σ2 and σ4 domains of their cognate sigma factor, thereby preventing its interaction with RNAP.

Mechanisms of Sigma Factor Activation

The release of sigma factors from anti-sigma inhibition is classified here into three basic mechanisms: partner switching, regulated proteolysis, and direct sensing (Fig. 2). In partner switching a third protein, known as an anti-anti-sigma factor, sequesters the antisigma factor resulting in the release of the sigma factor (Fig. 2(iii)). This mechanism is widespread in Gram-positive bacteria where it is often used to control general stress responses. Partner switching is also used to activate the first of several compartment-specific sigma factors during the production of endospores in B. subtilis ((Hilbert and Piggot 2004). An early stage of sporulation in B. subtilis is the formation of a polar septum that divides the cell into a larger mother cell and a smaller forespore. Key to the continued development of the forespore into a mature endospore is a crisscross pattern of regulation in which four sigma factors, sF, sE, sG, and sK, are sequentially activated in either the forespore (sF and sG) or the mother cell (sE and sK).

Prokaryotic Gene Regulation by Sigma Factors and RNA Polymerase

993

Prokaryotic Gene Regulation by Sigma Factors and RNA Polymerase, Fig. 2 Controlling sigma factor activity by sequestration. Sigma factors (blue) are prevented from binding to RNAP (gray) by sequestration to antisigma factors (red) that occlude the RNAP-binding determinants. Three basic mechanisms for controlling the

release (activation) of the sigma factors in response to a sigma are (i) controlled proteolysis of the anti-sigma factor; (ii) direct sensing of a signal by the anti-sigma factor, causing a structural change; and (iii) partner switching involving an anti-anti-sigma factor (green)

A landmark event in this pathway is the activation of the first sporulation-specific sigma factor, σF, in the forespore. Although σF is present throughout the cell prior to polar septum formation, it is held inactive by its anti-sigma factor SpoIIAB. σF activation occurs only in the forespore compartment through the switching of SpoIIAB from σF to the anti-anti-sigma factor, SpoIIAA. This is driven by the accumulation of SpoIIAA in a non-phosphorylated (SpoIIAB-binding) state, which occurs more efficiently in the forespore due to the higher concentration of a septum-located SpoIIAA phosphatase called SpoIIE. In a second, unusual example, the anti-anti-sigma factor PhyR in the alphaproteobacterium Methylobacterium extorquens contains a sigma-like domain that competes with the sigma factor σEcfG1 for the anti-sigma factor NepA. The activity of PhyR appears to be controlled by phosphorylation, which unmasks the sigma-like domain, thereby leading to increased NepA-binding activity.

In regulated proteolysis, a protease or a series of proteases cleaves the anti-sigma factor, thereby releasing the sigma factor (Fig. 2(i)). This mechanism appears to be particularly important in the cytoplasmic activation of sigma factors in response to extracytoplasmic signals such as cell envelope stress. A well-studied example is the σE-RseA couple, which controls the envelope stress response in E. coli (Chaba et al. 2007). RseA is a membrane-traversing anti-sigma factor that indirectly senses signals associated with outer membrane stress, resulting in a series of RseA-directed proteolytic events that ultimately lead to the release of σE in the cytoplasm. The signal is actually perceived by the first protease to target RseA, the periplasm-located DegS, which becomes activated when the C-terminal regions of unfolded or nascent outer membrane proteins accumulate in the periplasm. The initial cleavage of the periplasmic portion of RseA is followed by the cleavage of its transmembrane domain by the membrane-embedded protease RseP and the release of the N-terminal anti-sigma domain in the cytoplasm. The full release and activation of σE then requires the cleavage of the anti-sigma domain by the cytoplasmic protease ClpXP.

In direct sensing, the anti-sigma factor itself senses the signal, which leads to a conformational change causing the release of the sigma factor. The ZAS anti-sigma factor RsrA controls the activity of σR in Streptomyces coelicolor and related actinomycetes (Kang et al. 1999). In response to an oxidative shift in the cellular thiol-disulphide redox balance, RsrA forms an intramolecular disulphide bond between two of the zinc ligands, which causes a conformational change in RsrA that prevents it from interacting with σR. Like many of these stress responses, this system is homeostatically controlled because targets of σR include the enzymes involved in maintaining the cellular thiol-disulphide redox balance (e.g., thioredoxin), which can also reduce the oxidized RsrA back to its σR-binding conformation.

Sigma Factor Competition and Sequestration of the Principal Sigma Factor In growing cells it appears that core RNAP is present in limiting amounts, which means that an increase in the active level of one sigma factor can result in a decrease in the activity of another sigma factor through competition. However, the impact of competition is likely to be subject to a wide range of parameters, including the relative concentrations and affinities of RNAP and sigma factors and the affinity of holoenzymes for promoter sequences versus nonspecific DNA (Grigorova et al. 2006). In addition, the global signaling molecule ppGpp, along with its protein partner DksA, can redistribute RNAP from promoters involved in rapid growth (e.g., rRNA promoters) to those associated with stationary phase or stress responses, although the mechanisms involved are currently unclear. Among the seven sigma factors in E. coli, σ70 is present at the highest concentration and also has the highest affinity for core RNAP and so is most likely to impact on the activity of other sigma factors. The question of how alternative sigma factors can compete effectively with σ70 to instigate their cellular response has been at least partially addressed for the activation of σ38 (σS). In E. coli, three sigma-binding factors, two protein-based and one RNA-based, have been proposed to influence the global reprogramming of RNAP during the change from exponential to stationary phase. In general, this involves the switch from genes related to growth and cell division to those involved with maintaining the integrity of the non-replicating cell during stress and nutrient deprivation. The anti-sigma factor Rsd was identified in a direct search for an inhibitor of σ70 to account for increased σS activity during stationary phase despite the maintenance of high levels of σ70. Like many anti-sigma factors, Rsd inhibits σ70 activity by binding directly to it, thereby occluding key RNAP-binding determinants. Although the phenotype of rsd mutants in E. coli is weak, the Rsd-related protein AlgQ in Pseudomonas aeruginosa appears to play a key role in reducing σ70 activity, which is required for the alternative sigma factor-directed production of virulence factors. The second protein-based factor is Crl, which is required for the full expression of several σS-dependent promoters in enteric bacteria. Crl is unusual because, whereas most sigma-binding proteins inhibit RNAP binding, Crl appears to stimulate holoenzyme formation. The RNA-based binding factor is 6S RNA, a small regulatory RNA that accumulates to high levels during stationary phase and binds specifically to σ70-RNAP, sequestering it from promoters. 6S RNA is predicted to fold into a largely double-stranded structure with a central single-stranded region that mimics the open promoter complex. Remarkably, σ70-RNAP not only binds to this structure but can initiate RNA-dependent transcription of small RNA products, which facilitates the release of σ70 upon outgrowth from stationary phase.
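The competition for limiting core RNAP described in this section can be illustrated with a minimal equilibrium sketch. It assumes simple 1:1 binding of each sigma factor to core RNAP and ignores DNA binding, ppGpp/DksA, and anti-sigma factors, so it is far simpler than the model of Grigorova et al. (2006). The function name, concentrations, and dissociation constants are hypothetical and chosen only for illustration.

```python
def holoenzyme_levels(core_total, sigmas, iterations=200):
    """Solve a simple competitive-binding equilibrium by bisection.

    core_total : total core RNAP concentration (arbitrary units)
    sigmas     : dict mapping name -> (total sigma concentration, Kd for core)
    Returns (free core concentration, dict of holoenzyme concentrations).
    Assumption: 1:1 binding, no DNA, no anti-sigma factors.
    """
    def bound(free_core):
        # Holoenzyme formed by each sigma at a given free core concentration.
        return {name: total * free_core / (kd + free_core)
                for name, (total, kd) in sigmas.items()}

    lo, hi = 0.0, core_total
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        # Mass balance: free core + all holoenzymes must equal total core.
        if mid + sum(bound(mid).values()) > core_total:
            hi = mid
        else:
            lo = mid
    free = (lo + hi) / 2.0
    return free, bound(free)


# Hypothetical numbers: the principal sigma binds core more tightly (lower Kd).
growing = {"sigma70": (700.0, 1.0), "sigmaS": (100.0, 5.0)}
stationary = {"sigma70": (700.0, 1.0), "sigmaS": (600.0, 5.0)}
_, holo_growing = holoenzyme_levels(500.0, growing)
_, holo_stationary = holoenzyme_levels(500.0, stationary)
print(holo_growing["sigma70"], holo_stationary["sigma70"])
```

In this toy setting, raising the level of the alternative sigma factor lowers the amount of σ70 holoenzyme even though σ70 itself is unchanged, which is the qualitative behaviour the text attributes to limiting core RNAP.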
In conclusion, the transient association of sigma with RNAP during the transcription cycle has allowed the evolution of many sigma-based regulatory networks in bacteria. These systems can efficiently redirect the cellular transcription machinery to specific sets of genes thereby allowing the organism to sense and respond to a wide range of physiological and environmental signals.

Prokaryotic Gene Regulation by Small RNAs

References
Buck M, Gallegos MT, Studholme DJ, Guo Y, Gralla JD (2000) The bacterial enhancer-dependent σ54 (σN) transcription factor. J Bacteriol 182(15):4129–4136
Burgess RR, Travers AA, Dunn JJ, Bautz EK (1969) Factor stimulating transcription by RNA polymerase. Nature 221(5175):43–46
Campbell EA, Westblade LF, Darst SA (2008) Regulation of bacterial RNA polymerase sigma factor activity: a structural perspective. Curr Opin Microbiol 11(2):121–127
Chaba R, Grigorova IL, Flynn JM, Baker TA, Gross CA (2007) Design principles of the proteolytic cascade governing the sigmaE-mediated envelope stress response in Escherichia coli: keys to graded, buffered, and rapid signal transduction. Genes Dev 21(1):124–136
Grigorova IL, Phleger NJ, Mutalik VK, Gross CA (2006) Insights into transcriptional regulation and sigma competition from an equilibrium model of RNA polymerase binding to DNA. Proc Natl Acad Sci U S A 103(14):5332–5337
Hilbert DW, Piggot PJ (2004) Compartmentalization of gene expression during Bacillus subtilis spore formation. Microbiol Mol Biol Rev 68(2):234–262
Kang JG, Paget MS, Seok YJ, Hahn MY, Bae JB, Hahn JS, Kleanthous C, Buttner MJ, Roe JH (1999) RsrA, an anti-sigma factor regulated by redox change. EMBO J 18(15):4292–4298
Murakami KS, Darst SA (2003) Bacterial RNA polymerases: the whole story. Curr Opin Struct Biol 13(1):31–39
Österberg S, del Peso-Santos T, Shingler V (2011) Regulation of alternative sigma factor use. Annu Rev Microbiol 65:37–55
Paget MS, Helmann JD (2003) The σ70 family of sigma factors. Genome Biol 4(1):203

Prokaryotic Gene Regulation by Small RNAs Erin Murphy1, William Broach2 and Andrew B. Kouse2 1 Department of Biomedical Sciences, Life Sciences Building, Ohio University Heritage College of Osteopathic Medicine, Athens, OH, USA 2 Department of Biological Sciences, Ohio University, Athens, OH, USA

Synopsis Plasticity of gene expression is essential for survival within the frequently changing and often


extreme environments encountered by prokaryotes. The importance of small RNA molecules (sRNAs) in regulating prokaryotic gene expression has become abundantly clear within the last decade and has been the subject of several excellent reviews, many of which are referenced within. While in its infancy, the study of archaeal sRNAs is a promising field and one that is likely to produce exciting findings in the near future. In light of the significant advances made in the identification and characterization of bacterial sRNAs, these molecules are the focus of this review, with special emphasis placed on the molecular mechanisms underlying their regulatory functions.

Introduction The definition of a bacterial sRNA is continually evolving as additional details regarding the structure and function of these incredibly versatile regulatory elements are revealed. The definition must capture the conserved features and overarching function of sRNAs yet acknowledge the inherent diversity in their structure and in the molecular mechanisms that govern their regulatory activities. In response to these demands, Liu and Camilli recently proposed that a bacterial sRNA be defined as “any bacterial RNA molecule
99% identity).
• No variations in repeat length occur in this second class. The level of sequence identity maintained across the repeats may be as little as 5% but is generally much higher.
• The third set of repeats display some small variations in length and are conserved predominantly in chemical character rather than sequence identity.
• The sequence repeats in the fourth group are generally non-integral in length and are observed in one or more residue groupings, such as acidic, basic, or apolar. As a consequence of the repeat being non-integral, no residue is conserved absolutely in any position.
• The fifth group of repeats generally represent functional rather than structural motifs, and although they can occur multiple times contiguously in the same protein, they rarely do so. They do occur in different proteins, however, where the repeats may show small variations in length. Some residues within these repeats may be conserved, but this is not an absolute requirement.

The first four of these classes of repeat are commonly contiguous, thereby leading to a considerable length of sequence with whatever special characteristic that particular repeat gives the protein. In contrast, the last of the repeat types can occur at (apparently) unrelated sites along the length of the sequence (though there are some examples of these motifs occurring contiguously) or, alternatively, as an important and conserved functional feature in a wide diversity of proteins.

Structural/Functional Relevance of Protein Repeats

First Class of Repeats These repeats vary enormously in length and can be as short as a single residue (Q), as in human huntingtin. This residue is repeated contiguously 10–35 times in the normal population but as many as 120 times in Huntington disease patients. Two-residue repeats (AQ) occur in the silk from Pachylota audouinii. In this case the protein is believed to form an extended β-chain with one face containing only alanine residues (A) and the other only glutamine residues (Q). An example of a non-contiguous motif is the tripeptide (K-S-P), a phosphorylation site in the neurofilament heavy chain. This is often found within a longer repeat of the type described in the third class. Chick scale keratin displays a 13-residue repeat four times contiguously, and this is believed to form an eight-stranded β-sheet. However, much longer repeats also exist. One such example is the 534-residue repeat in human epiplakin. In this protein the repeats occur five times contiguously, but there are no data yet available as to the structure adopted by each repeat. A repeat of this type implies that the protein can tolerate only the most minor variation if the appropriate structure and/or function is to exist or, alternatively, that the gene duplication event giving rise to the repeats is a relatively recent one.

Second Class of Repeats A very large number of different repeat lengths have been observed, and it is probably true that all possible values will ultimately be observed. The level of sequence identity present, however, may be as low as 5% when repeats or families of repeats are compared one with another.

Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications

To give some examples, the β-keratins from the skin, claw, scale, or feathers each contain a wide variety of sequence repeats in their terminal domains. Often, but not always, these are arranged contiguously, and where this occurs, the numbers of repeats lie typically between 2 and 16. Most of these repeats are predominantly based around G, Y, and L residues and thereby favor β-structures. Repeat lengths of two, three, four, six, seven, and eight residues have all been observed. A spectacular example of a very special repeat in the second class occurs in the >1,000 residues in the Type I collagen α-chains. These display a remarkable tripeptide substructure based on glycine (G). Indeed, the (GXY)₃₃₈ repeats, with X and Y representing any amino or imino acid, are a requisite for structure formation. Each α-chain has a left-handed structure similar to that in polyglycine II. Three of these chains then assemble parallel to one another to form a three-stranded right-handed coiled coil with 10 residues in 3 turns (Fig. 1a). The glycines in every third position are absolutely conserved because these are located internally in the structure, and any residue larger than glycine would impose such severe stereochemical problems along the helix core that the structure would be unable to form without a major distortion occurring. Further assembly into functional collagen fibrils would also be severely compromised. Other types of collagen chains, of which many have now been characterized, all contain the same GXY repeat, though fewer contiguous repeats occur than in the example cited above. Few examples of this very precise type of repeat have been reported, excluding those seen in members of the collagen family. Such repeats have a very obvious character and are normally spotted by visual inspection alone.
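Although collagen-like GXY repeats are usually spotted by eye, the rule stated above (glycine at every third position, any residue at X and Y) is simple enough to check programmatically. The following is a minimal sketch; the function name and the test sequence are hypothetical and are not taken from any real collagen chain.

```python
def longest_gxy_run(seq):
    """Return (start, n_triplets) for the longest contiguous run of G-X-Y triplets.

    A triplet is counted when a glycine is followed by two further residues
    of any kind. Ties are resolved in favour of the earliest start.
    """
    best_start, best_count = -1, 0
    n = len(seq)
    for start in range(n):
        count = 0
        pos = start
        # Step in threes while a complete triplet beginning with G is present.
        while pos + 3 <= n and seq[pos] == "G":
            count += 1
            pos += 3
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


# Hypothetical sequence: two flanking residues, ten G-P-P triplets, two more residues.
print(longest_gxy_run("MK" + "GPP" * 10 + "AA"))
```

A single substitution of a glycine inside the run splits it into two shorter runs, which mirrors the structural intolerance to non-glycine residues at that position described in the text.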
A few other examples of the repeats in the second class include the following:
• Fifty-four contiguous 10-residue repeats in human ribosome receptor
• Nine contiguous 14-residue repeats in the human neurofilament chain
• Fourteen contiguous 22-residue repeats in human nestin
• Eleven contiguous 24-residue repeats in giardia from median body
• Eighteen 44-residue repeats in hamster nestin
• Twenty contiguous 106-residue repeats in human erythrocyte α-spectrin
• Ten contiguous 128-residue repeats in a keratinocyte plasma-membrane-associated protein
• Eighteen to more than 30 contiguous 250- or 255-residue repeats in mouse profilaggrin

Third Class of Repeats
Variable Length Repeats with Some Sequence Conservation

A number of examples now exist for this class of repeat. One of these is the WD or β-transducin repeat. In this case the repeat is about 40 residues long and it generally terminates in a tryptophan-aspartic acid (WD) dipeptide (Fig. 1b). As many as 16 contiguous repeats have been observed. A 7–8-bladed β-propeller is the most commonly observed structure resulting from WD repeats. Further examples of this type of repeat are the leucine-rich repeats (LRR), which are about 27–29 residues long and occur 15 times contiguously in porcine ribonuclease inhibitor, and the HEAT-like repeats of lengths 38–50 and 27–29 residues that occur about 29 and 15 times contiguously in PA200 and huntingtin, respectively. The HEAT-like repeats probably represent what is currently the most difficult repeat to detect due to very low sequence identity and considerable length variations.

Fixed Length Repeats with Little Sequence Conservation

Within this third class, it may well be that the heptad (7 residues), hendecad (11 residues), and pentadecad (15 residues) are the most important, especially the former. All of these display low levels of sequence identity (typically about 15%), unlike the others described above. The heptad, hendecad, and pentadecad repeats are found in diverse proteins and, in each instance, display a high conservation of apolar and hydrophilic residues in specific positions within each of the repeats. The heptad is often written as

Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications, Fig. 1 (a) Three-chain structure of a collagen molecule. Each α-chain, characterized by its GXY sequence repeat, forms a left-handed helical structure with threefold symmetry similar to polyglycine II. Three such chains (pink, green, and yellow) then assemble to form a right-handed coiled coil with ten residues in three turns. (b) Seven-bladed β-propeller formed from WD or β-transducin repeats. (c) Two-stranded parallel chain coiled coil in the GCN4 leucine zipper arises from an α-helix-rich heptad substructure in each chain. The apolar residues therein are alternating three and four residues apart and become internalized when the coiled coil is formed. (d) Intermediate filament molecules contain two, long, left-handed coiled coils (segments 1B and 2B). Each segment has a periodicity (9.55 residues in segment 1B and 9.85 residues in segment 2B) in the distribution of both its acidic and basic residues. The periods are out of phase with respect to one another (designated by alternating red and white blocks). Molecular aggregation, via intermolecular ionic interactions, occurs between antiparallel molecules with their 1B segments overlapped (A11), 2B segments overlapped (A22), and with the entire molecules largely

(a-b-c-d-e-f-g) with positions a and d predominantly apolar (occupancy 70–75%) and positions e and g commonly charged. Likewise, the hendecad and pentadecad repeats can be written, respectively, as (a-b-c-d-e-f-g-h-i-j-k) with apolars largely in the a, d, e, and h positions and hydrophilic residues elsewhere and (a-b-c-d-e-f-g-h-i-j-k-l-m-n-o) with apolars predominantly in the a, d, e, h, and l positions and hydrophilics elsewhere. In addition, each repeat contains residues that strongly favor α-helix formation. These motifs, especially the heptad, occur widely in both fibrous and globular proteins, and their primary role is to facilitate oligomerization. The leucine zipper GCN4 is a particularly well-characterized example (Fig. 1c; O’Shea et al. 1991). The coiled-coil structure (fibrous proteins) or the α-helical bundle (globular proteins) thus formed occurs in a wide variety of forms ranging from two stranded through to seven stranded. In addition, the strands may be parallel or antiparallel (Parry et al. 2008).
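The heptad description above (apolar residues concentrated in positions a and d) suggests a simple occupancy count. The sketch below is illustrative only: the set of residues treated as apolar, the function names, and the idealized test sequence are assumptions, and this is not the algorithm used by established coiled-coil predictors.

```python
# Assumption: which residues count as "apolar" is a simplification for illustration.
APOLAR = frozenset("AFILMVWY")


def heptad_occupancy(seq, frame=0, period=7):
    """Fraction of apolar residues at each repeat position (a, b, c, ...).

    frame is the index of the first residue assigned to position a.
    """
    counts = [0] * period
    totals = [0] * period
    for i, aa in enumerate(seq):
        pos = (i - frame) % period
        totals[pos] += 1
        if aa in APOLAR:
            counts[pos] += 1
    return [c / t if t else 0.0 for c, t in zip(counts, totals)]


def best_heptad_frame(seq):
    """Frame (0-6) that maximizes apolar occupancy summed over positions a and d."""
    def score(frame):
        occ = heptad_occupancy(seq, frame)
        return occ[0] + occ[3]
    return max(range(7), key=score)


# Hypothetical idealized sequence: leucine at every a and d position.
seq = "EK" + "LKELEKK" * 5
frame = best_heptad_frame(seq)
print(frame, heptad_occupancy(seq, frame))
```

The same function can be called with period=11 or period=15 to tabulate hendecad or pentadecad positions, with the positions of interest (a, d, e, h and a, d, e, h, l) read from the returned list.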

Fourth Class of Repeats A common acidic residue (D, E) and basic residue (K, R) period observed in long, coiled-coil, α-helical segments termed 1B (period 9.55 residues) and 2 (period 9.85 residues) in intermediate filament chains favors modes of molecular assembly in which intermolecular ionic interactions are maximized (Fig. 1d). Likewise, the coiled-coil, α-helical tropomyosin molecules found in the thin filaments of vertebrate skeletal muscle have a 39.2-residue period in the linear distributions of their acidic and apolar residues. This matches the period of the actin molecules in the thin filaments, thereby allowing each actin molecule to be regulated in a quasi-equivalent manner. The period is strongly halved and this provides an “on” and “off” position per actin molecule (Fig. 1e). Common periods in myosin and paramyosin molecules (28 residues with a strong 9.33 residue sub-period) allow co-assembly through intermolecular ionic interactions. The paramyosin acts as the core on which the myosin is laid down. Yet another example occurs in the collagen α-chain. In this case the period is very long (234 residues) and is superimposed on top of the underlying GXY triplet structure. The numbers of possible interchain ionic and apolar interactions show maxima at multiples of 234 residues (equivalent to 67 nm as observed in native Type I/II collagen fibrils by X-ray diffraction and electron microscopy). A variety of sub-repeats is also observed. The special non-integral characteristic of this set of repeats seems confined to fibrous proteins that require a mixture of intermolecular charged or apolar interactions to ensure that appropriate molecular alignment occurs in the filamentous structures adopted in vivo.

Fifth Class of Repeats Examples here include the Ca2+ EF hands (Fig. 1f) that are about 29 residues long and which consist of a pair of α-helices that define a site capable of binding Ca2+ (or Mg2+) (http://structbio.vanderbilt.edu/chazin/cabp_database), zinc fingers about 28 residues long that define a family of structurally diverse zinc-stabilized compact domains (http://prodata.swmed.edu/zndb), and DNA-binding motifs that are about 22 residues long and which are characterized by their helix-turn-helix conformation (http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/HTHquery/index.pl). Other repeats in this class, often termed “linear motifs,” are very short and generally do not occur contiguously in any one protein. Indeed, there may

Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications, Fig. 1 (continued) overlapped (A12). (e) Two-stranded parallel chain tropomyosin molecule (yellow and blue) has a 14-fold periodicity in the linear distribution of both its acidic and apolar residues (shown by horizontal bars). The 5.5-nm period present in the actin thin filament structure matches the sevenfold period (two bars). The movement of tropomyosin in the thin filament between an “on” and “off” position allows each actin to be regulated by tropomyosin in a quasi-equivalent manner in both states. (f) EF hands typically consist of a pair of α-helices involved in Ca2+ (or Mg2+) binding

be but a single copy in a particular protein but the same motif may be found widely, nonetheless, in a diverse range of other proteins. In general, these motifs are only about 3–11 residues long, and their role lies most commonly in recognition and targeting. Examples of these include the C-a-a-X box in lamin that targets lamin chains to the nuclear envelope, the four-residue Src binding motif (P-X-X-P) that binds to a hydrophobic surface of SH3 domains, the Rb binding motif (L-X-C-X-E) found in the retinoblastoma-binding protein, and the proline-rich motif (P-P-X-Y or A/P-P-P-A/P-Y) that binds the WWP domain in proteins such as dystrophin.
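Short linear motifs of this kind map naturally onto regular expressions. The sketch below encodes four of the patterns quoted above. The motif names, the choice of A, I, L, and V as the "aliphatic" residues of the C-a-a-X box, and the test sequence are assumptions made for illustration; this is not the pattern syntax or content of any particular motif database. Because the patterns are so short, matches in real sequences will include many false positives.

```python
import re

# Assumed regular-expression renderings of motifs named in the text.
# "." stands for X (any residue); "$" anchors the C-a-a-X box at the C-terminus.
MOTIFS = {
    "PxxP": r"P..P",
    "LxCxE": r"L.C.E",
    "PPxY": r"PP.Y",
    "CaaX": r"C[AILV][AILV].$",
}


def find_motifs(seq):
    """Return {motif name: [(start index, matched text), ...]} with 0-based starts.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    hits = {}
    for name, pattern in MOTIFS.items():
        regex = re.compile("(?=(" + pattern + "))")
        hits[name] = [(m.start(), m.group(1)) for m in regex.finditer(seq)]
    return hits


# Hypothetical sequence constructed to contain one instance of each motif.
print(find_motifs("MAPPGYLDCEEPSTPQKCVIM"))
```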

Detection of Protein Repeats Where the repeats are contiguous, they are most commonly recognized by inspection alone without the need to employ bioinformatics. However, most protein structuralists prefer the use of websites such as http://www.expasy.org (Gasteiger et al. 2003) or http://www.well.ox.ac.uk/ariadne/prospero.shtml that are designed specifically to tackle this type of problem. There are, of course, a number of other websites available, and these may be identified through a simple web search. Tools are provided within each website to undertake similarity searches, pattern recognition, and sequence alignment, as well as many other important tasks in the field of proteomics. These tools can be of particular value in recognizing repeats that are separated (non-contiguous) from one another within a particular sequence. The detection of non-integral periodicities in protein sequences can be tackled by a single method: fast Fourier transform analysis. Suppose, for example, it is suspected that a periodicity may exist in the linear distribution of acidic residues in a sequence, or perhaps a fragment of that sequence. Each acidic amino acid would be assigned a value of one, and all other amino acids would be assigned values of zero. The binary sequence thus formed is then embedded in an array of zeroes to give a total array length of (typically) 2,048 or 4,096 (this must be a power of two, 2^n). This array is then fast Fourier transformed and scaled so that any intensity peak (I) seen in the transform has a probability of occurring by chance of exp(−I) (McLachlan and Stewart 1976). The embedding technique allows greater sampling of the transform and hence a more accurate evaluation of the actual periodicity present. Various routines now exist that can recognize two- and three-stranded heptad repeats from sequence considerations alone. These include COILS (http://www.ch.embnet.org/software/COILS_form.html) and PAIR-COIL (http://groups.csail.mit.edu/cb/paircoil2/). However, recognition of the more common antiparallel coiled coils and multi-stranded bundles (>3 strands) from sequence data awaits further development. Linear motifs, including nuclear localization signals, phosphorylation and glycosylation sites, as well as those noted in the section entitled “Fifth Class of Repeats”, have been summarized in two main databases (ELM, http://elm.eu.org/; Gould et al. 2010; Dinkel et al. 2012; MiniMotif Miner, http://mnm.engr.uconn.edu/MNM/SMSSearchServlet; Rajasekaran et al. 2009). Because these repeats are so short, there is a high possibility that searches of new sequences will identify false positives. Great care must therefore be taken to interpret the predictions conservatively and in line with other experimental evidence.
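The Fourier procedure described above (binary coding of a residue class, embedding in zeroes, transforming, and scaling) can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the mean is subtracted before padding to suppress the zero-frequency term, and the intensity is scaled by the variance expected for a random arrangement of the same composition, which is one way to obtain intensities whose chance probability is approximately exp(−I). All function and parameter names are hypothetical.

```python
import cmath
import math

ACIDIC = frozenset("DE")


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def residue_periodicity(seq, residues=ACIDIC, n_fft=2048,
                        min_period=2.0, max_period=50.0):
    """Return (period in residues, scaled intensity) of the strongest peak.

    Assumed scaling: I = |F|^2 / (L * p * (1 - p)), where L is the sequence
    length and p the fraction of residues in the chosen class.
    """
    length = len(seq)
    if length > n_fft or n_fft & (n_fft - 1):
        raise ValueError("n_fft must be a power of two and >= len(seq)")
    signal = [1.0 if aa in residues else 0.0 for aa in seq]
    p = sum(signal) / length
    if p in (0.0, 1.0):
        raise ValueError("sequence must contain a mixture of residue classes")
    # Subtract the mean, then embed the sequence in an array of zeroes.
    padded = [v - p for v in signal] + [0.0] * (n_fft - length)
    spectrum = fft(padded)
    expected = length * p * (1.0 - p)
    best = (0.0, 0.0)
    for k in range(1, n_fft // 2):
        period = n_fft / k
        if min_period <= period <= max_period:
            intensity = abs(spectrum[k]) ** 2 / expected
            if intensity > best[1]:
                best = (period, intensity)
    return best


# Hypothetical test sequence: acidic residues placed about every 9.55 positions.
positions = {int(9.55 * i + 0.5) for i in range(30)}
seq = "".join("E" if i in positions else "A" for i in range(280))
print(residue_periodicity(seq, min_period=6.0, max_period=20.0))
```

Restricting the period window, as in the usage line, keeps harmonics of a strong period from being reported in place of the fundamental; in practice the whole spectrum would normally be inspected.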

Summary Sequence repeats occur in a variety of forms, but the five most common (archetypical) ones have been depicted here in terms of their implications for secondary/tertiary structure and their functions in vivo. In addition, a selection of the methods by which these repeats may be most easily recognized in sequences has been described.

References
Andrade MA, Perez-Iratxeta C, Ponting CP (2001) Protein repeats: structures, functions, and evolution. J Struct Biol 134:117–131
Dinkel H, Michael S, Weatheritt RJ et al (2012) ELM – the database of eukaryotic linear motifs. Nucleic Acids Res 40:D242–D251
Gasteiger E, Gattiker A, Hoogland C et al (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 31:3784–3788
Gould CM, Diella F, Via A et al (2010) ELM: The status of 2010 eukaryotic linear motif resource. Nucleic Acids Res 38:D167–D180
http://elm.eu.org. Accessed 29 Mar 2014
http://groups.csail.mit.edu/cb/paircoil2/. Accessed 29 Mar 2014
http://mnm.engr.uconn.edu/MNM/SMSSearchServlet. Accessed 29 Mar 2014
http://prodata.swmed.edu/zndb. Accessed 29 Mar 2014
http://structbio.vanderbilt.edu/chazin/cabp_database. Accessed 29 Mar 2014
http://www.ch.embnet.org/software/COILS_form.html. Accessed 29 Mar 2014
http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/HTHquery/index.pl. Accessed 29 Mar 2014
http://www.expasy.org. Accessed 29 Mar 2014
http://www.well.ox.ac.uk/ariadne/prospero.shtml. Accessed 29 Mar 2014
McLachlan AD, Stewart M (1976) The 14-fold periodicity in α-tropomyosin and the interaction with actin. J Mol Biol 103:271–298
O’Shea EK, Klemm JD, Kim PS et al (1991) X-ray structure of the GCN4 leucine zipper, a two-stranded, parallel coiled coil. Science 254:539–544
Parry DAD (2005) Structural and functional implications of sequence repeats in fibrous proteins. Adv Protein Chem 70:11–35 (eds DAD Parry and JM Squire)
Parry DAD, Fraser RDB, Squire JM (2008) Fifty years of coiled-coils and α-helical bundles: a close relationship between sequence and structure. J Struct Biol 163:258–269
Rajasekaran S, Balla S, Gradie P et al (2009) Minimotif miner 2nd release: a database and web system for motif search. Nucleic Acids Res 37:D185–D190

Replication Origin of E. coli and the Mechanism of Initiation Jon M. Kaguni Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

Synopsis The process of DNA replication can be formally divided into the stages of initiation, the elongation of nascent DNA, and termination. In E. coli, DNA replication initiates at a specific location (oriC) on


the circular duplex chromosome. This site is where the replication fork machinery is assembled, leading to duplication of the E. coli chromosome. At oriC, specific biochemical events must take place in a step-wise manner in order to establish the enzymatic machinery that will operate at a replication fork. The first step involves the recognition of DNA sequence elements in oriC by the replication initiator (DnaA). Its interaction with these DNA sequences leads to the assembly of a DnaA oligomer that unwinds a region within oriC. DnaA then interacts with DnaB in a complex with DnaC to load the replicative DNA helicase (DnaB) onto the single-stranded DNA in the unwound region. After helicase loading and its activation, involving the binding of primase (DnaG) to DnaB and primer formation, the primers are utilized by DNA polymerase III holoenzyme for DNA synthesis and duplication of the E. coli chromosome.

Introduction In the operon model presented in 1961, Jacob and Monod proposed that regulatory factors control the transcription of a set of genes that are expressed from a single promoter (Jacob and Monod 1961). By analogy with this model, the replicon model has two central elements (Jacob et al. 1963). One is the replicator, which is specifically recognized by the initiator to start DNA replication. We now call the replicator the replication origin. In E. coli and other bacteria, this genetic site is named oriC. The second element of the replicon model is the initiator, a genetic locus encoding a protein that recognizes the replication origin. In bacteria, the initiator protein is named DnaA. Other critical functions provided by this protein are its ATP-dependent unwinding of a region within oriC and its recruitment of DnaB in a complex with its partner named DnaC to the unwound region. The general finding that eukaryotic cells use similar mechanisms to initiate DNA replication supports the universality of the replicon model. The proteins that initiate DNA replication from the E. coli replication origin are multifunctional.


As described in detail below, DnaA, DnaB, and DnaC perform specific tasks in establishing the replication fork machinery at oriC. ATP binding by each protein is critical for activity. Whereas ATP binding and hydrolysis support unwinding of the parental duplex DNA by DnaB as the replicative DNA helicase, the binding of ATP to DnaA or DnaC controls their activity in the initiation process. This essay summarizes recent advances in our understanding of the E. coli replication origin (oriC) and the roles of these and other proteins in the initiation of DNA replication.

The E. coli Replication Origin (oriC)

Mutational analysis of the E. coli replication origin reveals that it is a contiguous DNA sequence of 245 base pairs (Leonard and Grimwade 2011; Oka et al. 1980; von Meyenburg and Hansen 1980). Various sites within it have been identified that are essential for its function (Fig. 1). One is the DnaA box, a nine-base-pair (bp) DNA sequence that is specifically recognized by DnaA protein (Fuller et al. 1984; Matsui et al. 1985). Five such DnaA boxes, named R1–R5, are present in oriC. R1 and R4 are identical, R2 and R3 each contain one mismatch, and R5 contains two mismatches. Because DnaA binds to ATP or ADP with affinities (KD) of 0.03 and 0.1 µM, respectively (Sekimizu et al. 1987), the effect of these cofactors has been examined. These studies reveal that the affinity of DnaA for these DnaA boxes is similar whether DnaA is complexed with ATP or ADP (hereafter cited as DnaA-ATP or DnaA-ADP, respectively) (Kawakami et al. 2005; McGarry et al. 2004). However, the affinities of the individual boxes differ, in the order R4 > R1 > R2 > R5 > R3. These different binding affinities apparently reflect the effect of variation of specific bases in and flanking a specific DnaA box on their interaction with individual amino acids in the DNA-binding domain of DnaA (Margulies and Kaguni 1996; Schaper and Messer 1995).

In contrast, sites named I1, I2, and I3 (McGarry et al. 2004), other sites named t1 and t2 (Kawakami et al. 2005), and a third class of sites named C1–C3 (Rozgaja et al. 2011) are specifically bound by DnaA-ATP as measured by their protection from DNase I digestion or from chemical modification by dimethyl sulfate. These sites are bound with lesser affinity than the higher-affinity DnaA boxes named R4, R1, and R2. DnaA-ATP also evidently binds preferentially to a 6-mer sequence (AGATCT) found in an AT-rich region near the left border of oriC (Speck and Messer 2001). These 6-mer sequences are embedded within 13-mer sequences that are unwound by DnaA-ATP (Bramhill and Kornberg 1988; Duderstadt et al. 2011; Ozaki and Katayama 2012). As described in more detail in a separate section, it is likely that this unwinding of the 13-mers requires the binding of the 6-mers by a DnaA oligomer.

Binding sites for Fis (factor for inversion stimulation), IHF (integration host factor), and SeqA are also present in oriC (Brendler et al. 1995; Brendler and Austin 1999; Gille et al. 1991; Polaczek 1990; Slater et al. 1995). Fis and SeqA are also discussed in the section entitled ▶ “Control of Initiation in E. coli” by Kaguni in this volume. Mutation of the respective sites disrupts oriC function, so the binding of Fis and IHF appears to be important for initiation. Because null mutations in fis or himD, which encodes one of the subunits of the IHF heterodimer, cause abnormal timing of initiation relative to the bacterial cell cycle (Boye et al. 1993; Flamm and Weisberg 1985; Kano and Imamoto 1990), it appears that the binding of Fis and IHF at oriC ensures that initiation is timed properly in the cell cycle. Both proteins bend DNA upon binding (Finkel and Johnson 1992; Rice et al. 1996), suggesting that their role is to alter the conformation of oriC and/or the relative arrangement of DnaA molecules bound to this site. In support, IHF stimulates the DnaA-dependent unwinding of oriC (Hwang and Kornberg 1992) and appears to reorganize DnaA assembled at oriC (Grimwade et al. 2000). These observations are consistent with the binding of IHF to oriC at the time of initiation in vivo (Cassler et al. 1995).
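The nucleotide affinities quoted above for DnaA can be turned into fractional occupancies with the standard single-site binding isotherm. The sketch below is only an illustration: it assumes simple 1:1 binding with no competition between ATP and ADP, the function name `fraction_bound` is hypothetical, and the free nucleotide concentrations in the loop are made-up inputs expressed in the same unit as the quoted KD values.

```python
def fraction_bound(ligand: float, kd: float) -> float:
    """Fraction of protein carrying ligand for simple 1:1 binding.

    ligand and kd must be given in the same concentration unit.
    """
    if ligand < 0 or kd <= 0:
        raise ValueError("ligand must be >= 0 and kd must be > 0")
    return ligand / (kd + ligand)


# Dissociation constants quoted in the text (Sekimizu et al. 1987).
# Only their ratio to the ligand concentration matters here.
KD_ATP = 0.03
KD_ADP = 0.1

# Hypothetical free nucleotide concentrations (same unit as the KD values).
for free_nucleotide in (0.03, 0.1, 1.0):
    atp = fraction_bound(free_nucleotide, KD_ATP)
    adp = fraction_bound(free_nucleotide, KD_ADP)
    print(f"{free_nucleotide}: ATP-bound {atp:.2f}, ADP-bound {adp:.2f}")
```

At a ligand concentration equal to the KD, half of the protein is occupied, so the roughly threefold difference between the two constants means that ATP half-saturates DnaA at about one third of the concentration needed for ADP.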
In contrast, Fis inhibits in vitro DNA replication of oriC-containing plasmids (Hiasa and Marians 1994; Margulies and Kaguni 1998), which correlates with studies showing that the binding of Fis to oriC occludes the binding of DnaA to the


Replication Origin of E. coli and the Mechanism of Initiation, Fig. 1 DNA motifs of Escherichia coli oriC and steps in the initiation of DNA replication. The DnaA boxes are recognized by DnaA bound to either ATP or ADP, whereas DnaA-ATP specifically binds to the I-, t-, and C-sites. E. coli oriC also carries binding sites for IHF and Fis and the 13-mer sequences that become unwound by DnaA complexed to ATP. The green bilobed object


DnaA box named R3 (Cassler et al. 1995; Gille et al. 1991; Ryan et al. 2002) and with more recent studies showing that Fis interferes with the binding of IHF to its site in oriC. Other in vitro evidence attributes the inhibitory effect of Fis to its ability to sequester the negative superhelicity of the oriC-containing plasmid (Hiasa and Marians 1994; Margulies and Kaguni 1998), raising the possibility that Fis acts by two separate mechanisms to inhibit initiation from oriC. However, mutations of DnaA box R1 or R4 that block DnaA binding to these sites lead to a requirement for Fis and IHF for initiation, suggesting that the structure of the DnaA-oriC complex and its function are somewhat plastic (Kaur et al. 2014). Taken together, these observations suggest that these DNA-binding proteins modify the architecture of DnaA oligomerized at this site.

Steps in the Process of Initiation

The initiation of DNA replication from the E. coli replication origin (oriC) requires DnaA-ATP as a sequence-specific DNA-binding protein. As described above, DnaA-ATP recognizes the DnaA boxes and the I-, t-, and C-sites in oriC. DnaA also acts in the recruitment of DnaB complexed to DnaC to load the replicative DNA helicase onto the unwound region of oriC. The individual activities of DnaA, DnaB, and DnaC are described in the essay entitled ▶ “DnaA, DnaB, DnaC” by Kaguni elsewhere in this volume. A detailed description of their roles in the different steps of the initiation process is presented below.

DnaA binds to specific sites in oriC (Fig. 1). Initiation of E. coli DNA replication requires the recognition by DnaA of the high-affinity DnaA boxes (R1, R2, and R4), the low-affinity DnaA boxes (R5 and R3, but see the discussion below), and the t-, I-, and C-sites in oriC. Early in vivo footprinting experiments showed that DnaA remains bound to the DnaA boxes named R1, R2, and R4 throughout the cell cycle (Cassler et al. 1995). In contrast, DnaA binds to the region containing R3 at the time of initiation, suggesting that the affinity of DnaA for this site is weaker than for the high-affinity DnaA boxes and that a sufficient level of DnaA must accumulate for its binding to R3. This weak binding to R3 agrees well with independent biochemical studies. One study showed that DnaA binds to R3 with the weakest affinity, with progressively greater affinities for R2, R1, and R4 (Margulies and Kaguni 1996). DnaA occupancy of R3 correlated with the level of DnaA needed for in vitro DNA replication of an oriC-containing plasmid. Another report described that the binding affinity for an oligonucleotide containing R3 was comparable with that for an oligonucleotide of random sequence, suggesting that DnaA does not bind specifically to this DnaA box, whereas the binding affinities of DnaA for oligonucleotides containing DnaA boxes R1 and R4 were, respectively, about 11- and 60-fold greater than for R2 (Schaper and Messer 1995). However, the R3-containing oligonucleotide lacked sites named C2 and C3, which overlap on either side of R3. As recent studies suggest that DnaA in a complex with ATP may instead interact with C2 and C3, the R3

Replication Origin of E. coli and the Mechanism of Initiation, Fig. 1 (continued) represents DnaA, with its functional domains denoted by the overlaid numbers. Step 1: DnaA-ATP as a self-oligomer bound to the sites described above unwinds a region containing the 13-mers near the left border of oriC. For simplicity, the interactions via domain 1 in the DnaA oligomer are not shown. Step 2: DnaA loads one DnaB6-DnaC3 complex on each DNA strand of the unwound region. Step 3: Primase interacts with the small domain of DnaB. Step 4: As oligonucleotide synthesis by primase begins on the top and bottom strands, it induces the dissociation of DnaC from DnaB. Primer synthesis is coordinated with the translocation of DnaB. Step 5: As primase completes the synthesis of primers, DnaB continues to move on each single-stranded DNA toward the apex of each replication fork. These primers will be used by DNA polymerase III holoenzyme (not shown) in DNA synthesis of each leading strand for the rightward and leftward progressing replication forks. During this process, DnaB at the junction of each replication fork will act as a DNA helicase to unwind the parental duplex DNA. (This figure is adapted from an article by Bell and Kaguni 2013.)


DnaA box may have been incorrectly identified and may not be a site recognized by DnaA (Rozgaja et al. 2011). Of interest, inverting R1, R2, or R4 inactivated oriC (Langer et al. 1996). Changing the helical spacing between R4 and sequences to its left or R2 and sequences to its right by less than a helical turn of DNA (10 bp) also disrupted the function of oriC, whereas maintaining the helical phasing by insertion of 10 bp did not (Crooke et al. 1993; Messer et al. 1992). Together, these results suggest that the orientation of DnaA monomers and their relative organization at oriC in a specific nucleoprotein structure are critical for initiation. In support of this model, the X-ray crystal structure of a truncated form of Aquifex aeolicus DnaA (lacking domains 1 and 2) complexed to AMP-PCP (an ATP analogue) has been modeled as a right-handed helical filament that reveals interactions between DnaA monomers (Duderstadt et al. 2010; Erzberger et al. 2006; see below also). It has been suggested that the strong and weak binding sites in oriC interact with the DNA-binding domain (domain 4) of each DnaA monomer in the oligomerization of DnaA at oriC. However, the DNA-binding domain of A. aeolicus DnaA modeled as a right-handed helical filament is buried. As an explanation, this domain may rearrange from this buried state to one that can bind DNA by rotation of the α-helix that connects the ATP-binding domain (domain 3) with the DNA-binding domain.

DnaA unwinds the region of oriC containing the 13-mers. After binding to oriC, DnaA complexed to ATP induces the unwinding of the region of oriC containing three 13-mers (Fig. 1). The amount of DNA unwound in vitro is estimated to be 26 bp (Gille and Messer 1991). If single-stranded DNA-binding protein is present, 44–52 bp of DNA become unwound. Reaction conditions that are optimal to form the open complex include a temperature of 38 °C and 5 mM ATP (Sekimizu et al. 1987). Inasmuch as ATPγS (a nonhydrolyzable ATP analogue) but not ADP supports this reaction, the hydrolysis of ATP bound to DnaA is not required. These results suggest that DnaA bound to ATP has a specific conformation that differs from DnaA-ADP.
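The helical-phasing experiments described earlier in this section (spacing changes of less than one helical turn disrupt oriC, whereas a 10-bp insertion does not) rest on simple geometry: inserting n base pairs rotates the flanking sites relative to each other around the helix axis. A minimal sketch follows, assuming the round figure of 10 bp per turn used in the text; the function name `rotational_offset_deg` and the example insert sizes are hypothetical.

```python
BP_PER_TURN = 10.0  # round figure used in the text; other values can be passed in


def rotational_offset_deg(inserted_bp: int, bp_per_turn: float = BP_PER_TURN) -> float:
    """Angle in degrees (0 to <360) by which sites flanking an insertion
    are rotated relative to each other around the helix axis.

    Negative values of inserted_bp model deletions.
    """
    return (inserted_bp % bp_per_turn) * 360.0 / bp_per_turn


# Hypothetical insert sizes: a full turn restores the original phasing,
# whereas half a turn places the flanking sites on opposite helix faces.
for n in (0, 3, 5, 10):
    print(n, rotational_offset_deg(n))
```

Under this assumption a 10-bp insertion returns the flanking DnaA-binding sites to the same face of the helix, whereas a 5-bp insertion rotates them by half a turn, which is consistent with the observation that only phase-preserving insertions are tolerated.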


Insight into the different conformations and the influence of the bound adenine nucleotide has been obtained by X-ray crystallography of domains 3 and 4 of A. aeolicus DnaA. In the DnaA-ADP co-crystal, six DnaA monomers assemble into a closed ring with a central cavity (Erzberger et al. 2002). In contrast, DnaA complexed with AMP-PCP is modeled as a right-handed helical filament in which each DnaA molecule is arranged head to tail (Erzberger et al. 2006). The structure of this truncated DnaA complexed to both AMP-PCP and a dodecamer of dAMP supports the proposal that the unwound DNA interacts with positively charged and hydrophobic amino acids located in the inner channel of the filament (Duderstadt et al. 2010, 2011). On the basis that A. aeolicus DnaA can interact with A. aeolicus DnaC and that the X-ray crystallographic structure of a deletion mutant of A. aeolicus DnaC can be modeled as a helical filament like that of A. aeolicus DnaA, it has been suggested that oligomeric DnaC interacts with the DnaA oligomer to load DnaB onto the unwound region of oriC (Mott et al. 2008). Supporting the model that positively charged amino acids in the interior of the DnaA filament interact with DNA, a separate study examined mutant DnaAs of E. coli bearing alanine substitutions for V211 or R245 (Ozaki and Katayama 2012). Relative to a homology model of E. coli DnaA-ATP assembled as a filament, V211 and R245 reside in the inner channel. The mutant proteins but not wild-type DnaA were inactive in unwinding the AT-rich region of oriC using a linear DNA carrying oriC. Assuming that the mutant DnaAs retain the ability to oligomerize at oriC, these results support the idea that these amino acids bind to the unwound DNA. Of interest, DnaA only needs to bind to part of oriC to unwind it (Ozaki et al. 2008; Ozaki and Katayama 2012). 
When full-length oriC was compared with the portion extending from the left border to just beyond the I2 site, carried either in a plasmid or in a DNA fragment, DnaA was active in unwinding the AT-rich region of both DNAs. With the truncated origin, however, DnaA was inefficient in loading DnaB onto the unwound DNA, suggesting a role for DnaA molecules bound to the right half of oriC. Considering that E. coli can


tolerate removal of the right half of oriC from the chromosome, although initiation then becomes asynchronous in vivo (Bates et al. 1995; Stepankiw et al. 2009), these results suggest that the defect in the proper timing of initiation relates to helicase loading.

DnaA recruits the DnaB-DnaC complex to oriC. After unwinding, DnaA-ATP loads one DnaB-DnaC complex on each of the unwound strands of oriC (Fig. 1; Carr and Kaguni 2001; Fang et al. 1999). Two regions of DnaA appear to interact directly with DnaB to load the helicase. One region is within domain 3, a conclusion based on a monoclonal antibody that inhibits the interaction between DnaA and DnaB by binding to an epitope within amino acids 111–148 of DnaA (Marszalek and Kaguni 1994; Sutton et al. 1998). This region was localized to residues 135–148 by deletion analysis (Seitz et al. 2000). The second interacting region is near the N-terminus of DnaA. Mutant DnaAs lacking this region or bearing either E21A or F46A substitutions are specifically impaired in activities that involve the DnaB-DnaC complex (Keyamura et al. 2009; Seitz et al. 2000). Other experiments show directly that the mutant containing the F46A substitution is defective in interacting with DnaB (Keyamura et al. 2009). Because DnaA oligomer formation, presumably as a helical filament, is also required (Felczak et al. 2005), these two regions in either a single DnaA molecule or separate molecules of the DnaA oligomer mediate the loading of the DnaB-DnaC complex. The region bound by the DnaB-DnaC complex on each of the unwound strands has been determined by footprint analysis with potassium permanganate, a reagent that preferentially reacts with single-stranded DNA. On the top strand of oriC, whose polarity is 5′-to-3′ relative to the genetic map of the E. coli chromosome, the region bound is near the left border of oriC (Fang et al. 1999). On the bottom strand, the DnaB-DnaC complex binds to an area containing the 13-mer nearest the DnaA box named R1. Other results that quantify the number of DnaB-DnaC complexes establish that a single DnaB-DnaC complex is bound to each separated strand of oriC (Carr and Kaguni 2001; Fang et al. 1999; Makowska-Grzyska and Kaguni 2010).

After helicase loading, DnaC must be released from DnaB in order to activate DnaB as a DNA helicase. Activation appears to require primase and primer formation (Fig. 1; Makowska-Grzyska and Kaguni 2010). Of interest, DnaC interacts with the larger domain of DnaB whereas primase interacts with DnaB’s smaller N-terminal domain with a stoichiometry of three primase molecules per DnaB hexamer (Barcena et al. 2001; Galletto et al. 2003; Mitkova et al. 2003). The interaction of primase with DnaB, which is transient, is required for primer synthesis. These observations suggest that the binding of primase to DnaB induces a conformational change that causes DnaC to dissociate from DnaB. Like DnaA, DnaC is an AAA+ protein and binds to ATP or ADP, but the interaction of DnaC with either nucleotide is not needed for DnaC to form a complex with DnaB, nor do adenine nucleotides affect the affinity of this interaction (Davey et al. 2002; Galletto et al. 2003; Mott et al. 2008). Nevertheless, substitution of a conserved lysine with arginine in the Walker A box leads to undetectable ATP binding and inactivity in DNA replication of an oriC-containing plasmid (Davey et al. 2002). Similarly, in vivo studies show that missense mutations in each AAA+ motif impair DnaC function (Hupert-Kocurek et al. 2007; Ludlam et al. 2001). Hence, adenine nucleotide binding by DnaC is essential at a step after formation of the DnaB-DnaC complex, but what is this event? In assays that measure the helicase activity of DnaB, DnaC presumably complexed to ATP but not ADP inhibits the ability of DnaB to unwind DNA (Davey et al. 2002). These results suggest that the hydrolysis of ATP bound to DnaC leads to the dissociation of DnaC from DnaB. If so, what is the signal that induces ATP hydrolysis? Arginine 220 (the arginine finger) in the box VII motif of DnaC (see ▶ “DnaA, DnaB, DnaC” by Kaguni in this volume) is thought to coordinate ATP hydrolysis with a conformational change based on studies of other AAA+ proteins. Arginine 216 is proposed to be involved in formation of a DnaC oligomer. Using mutant DnaC proteins bearing alanine substitutions for these highly


Replication Origin of E. coli and the Mechanism of Initiation, Fig. 2 Structures of replication forks that start from E. coli oriC and then move bidirectionally. In 1, primers synthesized by primase for leading strand synthesis are shown as black squiggles. In 2, DNA polymerase III holoenzyme extends these primers in the synthesis of the leading strand. In 3, the leading strand is extended. In addition, subsequent primers synthesized by primase on the top and bottom strands are used by DNA polymerase III holoenzyme in the synthesis of Okazaki fragments

conserved arginines, primer formation by primase in DNA replication of an oriC-containing plasmid did not induce these mutants to dissociate from DnaB (Makowska-Grzyska and Kaguni 2010). Together, these results suggest that the interaction of primase with DnaB and primer formation trigger the hydrolysis of ATP bound to DnaC so that DnaC can dissociate from DnaB and that these arginines play a role in this process. Following the synthesis of oligonucleotide primers by primase, DNA polymerase III holoenzyme assembles on these primers (see ▶ “Initiation Complex Formation, Mechanism of” and ▶ “Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis” by McHenry in this volume). Their extension by this DNA replicase results in the synthesis of the leading progeny strand (Fig. 2). The interaction of the τ subunit of this DNA polymerase with DnaB mediates rapid replication fork movement at a rate of nearly 1,000 nucleotides per second (Kim et al. 1996). In the absence of this interaction, the rate of DNA synthesis of about 35 nucleotides per second is comparable to the unwinding rate by DnaB alone. Thus, the helicase and the DNA polymerase facilitate each other’s activities. As DnaB translocates

in the 5′-to-3′ direction on the bound DNA, ATP hydrolysis provides the energy for separation of the duplex DNA and movement of the enzyme. In vitro evidence indicates that Rep helicase interacts with DnaB to facilitate DNA unwinding, so Rep and DnaB may act together in vivo to advance replication forks (Atkinson et al. 2011). At this stage of elongation of DNA replication, primase occasionally interacts with DnaB to form primers. The assembly of DNA polymerase III holoenzyme at these primers followed by their extension explains how the lagging strand is synthesized as Okazaki fragments.

Explanation of Terms

oriC: the replication origin of the Escherichia coli chromosome.
DnaA box: a nine-base-pair sequence that is specifically recognized by DnaA complexed with either ATP or ADP. The DnaA boxes in oriC are named R1, R2, R3, R4, and R5 to denote their specific positions.
I-site: a DNA sequence recognized by DnaA complexed to ATP.
t-site: another type of DNA sequence recognized by DnaA complexed to ATP.
13-mer: 13-base-pair sequences near the left border of oriC that reside in an AT-rich region.
C-site: a third type of DNA sequence recognized by DnaA complexed to ATP.
DnaA: the replication initiator that recognizes specific sites in the E. coli replication origin (oriC) in assembling a DnaA oligomer at this locus, unwinds an AT-rich region within oriC, and then loads one DnaB-DnaC complex on each single-stranded DNA in this unwound region.
DnaB: the replicative DNA helicase that, driven by the energy provided by ATP hydrolysis, unwinds the parental duplex DNA.
DnaC: the partner of DnaB that when complexed with DnaB inhibits its activity as an ATP-dependent DNA helicase.
Primase (DnaG): the enzyme that interacts with DnaB to form primers that are then extended by DNA polymerase III holoenzyme during semiconservative DNA replication.
SeqA: a protein that binds specifically to hemimethylated GATC sequences in newly synthesized DNA.
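For a sense of scale, the two fork rates quoted in the discussion of elongation (nearly 1,000 nucleotides per second when the polymerase is coupled to DnaB, about 35 per second otherwise) can be converted into rough whole-chromosome copying times. This is a back-of-the-envelope sketch only. The chromosome size of about 4.6 million bp is an assumption that does not come from this entry, the calculation assumes two forks moving at a constant rate from a single origin, and the function name `minutes_to_replicate` is hypothetical.

```python
GENOME_BP = 4_600_000  # approximate E. coli K-12 chromosome size (assumption, not from this entry)
FORKS = 2              # bidirectional replication from oriC


def minutes_to_replicate(rate_nt_per_s: float,
                         genome_bp: int = GENOME_BP,
                         forks: int = FORKS) -> float:
    """Minutes needed to copy the chromosome if each fork moves at a constant rate."""
    if rate_nt_per_s <= 0:
        raise ValueError("rate must be positive")
    return genome_bp / forks / rate_nt_per_s / 60.0


coupled = minutes_to_replicate(1000)  # rate quoted for the coupled polymerase and helicase
uncoupled = minutes_to_replicate(35)  # rate quoted without the coupling interaction

print(f"coupled: about {coupled:.0f} min")
print(f"uncoupled: about {uncoupled / 60:.0f} h")
```

Under these assumptions the coupled rate copies the chromosome in well under an hour, whereas the uncoupled rate would take most of a day, which illustrates why the physical coupling of helicase and polymerase matters.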

Cross-References

▶ Control of Initiation in E. coli
▶ Cycling of the Lagging Strand Replicase During Okazaki Fragment Synthesis
▶ DNA Replication
▶ DnaA, DnaB, DnaC
▶ Initiation Complex Formation, Mechanism of

Acknowledgments I thank the members of my lab for their support while I wrote. This work is supported by Grant GM090063 from the National Institutes of Health and by the Michigan Agricultural Experiment Station.

References

Atkinson J, Gupta MK, McGlynn P (2011) Interaction of Rep and DnaB on DNA. Nucleic Acids Res 39(4):1351–1359
Barcena M et al (2001) The DnaB.DnaC complex: a structure based on dimers assembled around an occluded channel. EMBO J 20(6):1462–1468
Bates DB et al (1995) The DnaA box R4 in the minimal oriC is dispensable for initiation of Escherichia coli chromosome replication. Nucleic Acids Res 23(16):3119–3125
Bell SP, Kaguni JM (2013) Helicase loading at chromosomal origins of replication. Cold Spring Harb Perspect Biol 5(6):pii:a010124
Boye E et al (1993) Regulation of DNA replication in Escherichia coli. In: Fanning E, Knippers R, Winnacker EL (eds) DNA replication and the cell cycle, vol 43, Mosbacher Kolloquium. Springer, Berlin, pp 15–26
Bramhill D, Kornberg A (1988) Duplex opening by dnaA protein at novel sequences in initiation of replication at the origin of the E. coli chromosome. Cell 52(5):743–755
Brendler T, Austin S (1999) Binding of SeqA protein to DNA requires interaction between two or more complexes bound to separate hemimethylated GATC sequences. EMBO J 18(8):2304–2310
Brendler T, Abeles A, Austin S (1995) A protein that binds to the P1 origin core and the oriC 13mer region in a methylation-specific fashion is the product of the host seqA gene. EMBO J 14(16):4083–4089
Carr KM, Kaguni JM (2001) Stoichiometry of DnaA and DnaB protein in initiation at the Escherichia coli chromosomal origin. J Biol Chem 276(48):44919–44925
Cassler MR, Grimwade JE, Leonard AC (1995) Cell cycle-specific changes in nucleoprotein complexes at a chromosomal replication origin. EMBO J 14(23):5833–5841
Crooke E et al (1993) Replicatively active complexes of DnaA protein and the Escherichia coli chromosomal origin observed in the electron microscope. J Mol Biol 233(1):16–24
Davey MJ et al (2002) The DnaC helicase loader is a dual ATP/ADP switch protein. EMBO J 21(12):3148–3159
Duderstadt KE et al (2010) Origin remodeling and opening in bacteria rely on distinct assembly states of the DnaA initiator. J Biol Chem 285(36):28229–28239
Duderstadt KE, Chuang K, Berger JM (2011) DNA stretching by bacterial initiators promotes replication origin opening. Nature 478(7368):209–213
Erzberger JP, Pirruccello MM, Berger JM (2002) The structure of bacterial DnaA: implications for general mechanisms underlying DNA replication initiation. EMBO J 21(18):4763–4773
Erzberger JP, Mott ML, Berger JM (2006) Structural basis for ATP-dependent DnaA assembly and replication-origin remodeling. Nat Struct Mol Biol 13(8):676–683
Fang L, Davey MJ, O’Donnell M (1999) Replisome assembly at oriC, the replication origin of E. coli, reveals an explanation for initiation sites outside an origin. Mol Cell 4(4):541–553
Felczak MM, Simmons LA, Kaguni JM (2005) An essential tryptophan of Escherichia coli DnaA protein functions in oligomerization at the E. coli replication origin. J Biol Chem 280(26):24627–24633
Finkel SE, Johnson RC (1992) The Fis protein: it’s not just for DNA inversion anymore [published erratum appears in Mol Microbiol 1993 Mar;7(2):1023]. Mol Microbiol 6(22):3257–3265
Flamm EL, Weisberg RA (1985) Primary structure of the hip gene of Escherichia coli and of its product, the beta subunit of integration host factor. J Mol Biol 183(2):117–128
Fuller RS, Funnell BE, Kornberg A (1984) The dnaA protein complex with the E. coli chromosomal replication origin (oriC) and other DNA sites. Cell 38(3):889–900
Galletto R, Jezewska MJ, Bujalowski W (2003) Interactions of the Escherichia coli DnaB helicase hexamer with the replication factor the DnaC protein. Effect of nucleotide cofactors and the ssDNA on protein-protein interactions and the topology of the complex. J Mol Biol 329(3):441–465
Gille H, Messer W (1991) Localized DNA melting and structural pertubations in the origin of replication, oriC, of Escherichia coli in vitro and in vivo. EMBO J 10(6):1579–1584
Gille H et al (1991) The FIS protein binds and bends the origin of chromosomal DNA replication, oriC, of Escherichia coli. Nucleic Acids Res 19(15):4167–4172
Grimwade JE, Ryan VT, Leonard AC (2000) IHF redistributes bound initiator protein, DnaA, on supercoiled oriC of Escherichia coli. Mol Microbiol 35(4):835–844
Hiasa H, Marians KJ (1994) Fis cannot support oriC DNA replication in vitro. J Biol Chem 269(40):24999–25003
Hupert-Kocurek K et al (2007) Genetic method to analyze essential genes of Escherichia coli. Appl Environ Microbiol 73(21):7075–7082
Hwang DS, Kornberg A (1992) Opening of the replication origin of Escherichia coli by DnaA protein with protein HU or IHF. J Biol Chem 267(32):23083–23086
Jacob F, Monod J (1961) Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3:318–356
Jacob F, Brenner S, Cuzin F (1963) On the regulation of DNA replication in bacteria. Cold Spring Harb Symp Quant Biol 28:329–348
Kano Y, Imamoto F (1990) Requirement of integration host factor (IHF) for growth of Escherichia coli deficient in HU protein. Gene 89(1):133–137
Kaur G et al (2014) Building the bacterial orisome: high-affinity DnaA recognition plays a role in setting the conformation of oriC DNA. Mol Microbiol 91(6):1148–1163
Kawakami H, Keyamura K, Katayama T (2005) Formation of an ATP-DnaA-specific initiation complex requires DnaA arginine 285, a conserved motif in the AAA+ protein family. J Biol Chem 280(29):27420–27430
Keyamura K et al (2009) DiaA dynamics are coupled with changes in initial origin complexes leading to helicase loading. J Biol Chem 284(37):25038–25050
Kim S et al (1996) Coupling of a replicative polymerase and helicase: a tau-DnaB interaction mediates rapid replication fork movement. Cell 84(4):643–650
Langer U et al (1996) A comprehensive set of DnaA-box mutations in the replication origin, oriC, of Escherichia coli. Mol Microbiol 21(2):301–311
Leonard AC, Grimwade JE (2011) Regulation of DnaA assembly and activity: taking directions from the genome. Annu Rev Microbiol 65:19–35
Ludlam AV et al (2001) Essential amino acids of Escherichia coli DnaC protein in an N-terminal domain interact with DnaB helicase. J Biol Chem 276(29):27345–27353
Makowska-Grzyska M, Kaguni JM (2010) Primase directs the release of DnaC from DnaB. Mol Cell 37(1):90–101
Margulies C, Kaguni JM (1996) Ordered and sequential binding of DnaA protein to oriC, the chromosomal origin of Escherichia coli. J Biol Chem 271(29):17035–17040
Margulies C, Kaguni JM (1998) The FIS protein fails to block the binding of DnaA protein to oriC, the Escherichia coli chromosomal origin. Nucleic Acids Res 26(22):5170–5175
Marszalek J, Kaguni JM (1994) DnaA protein directs the binding of DnaB protein in initiation of DNA replication in Escherichia coli. J Biol Chem 269(7):4883–4890
Matsui M et al (1985) Sites of dnaA protein-binding in the replication origin of the Escherichia coli K-12 chromosome. J Mol Biol 184(3):529–533
McGarry KC et al (2004) Two discriminatory binding sites in the Escherichia coli replication origin are required for DNA strand opening by initiator DnaA-ATP. Proc Natl Acad Sci U S A 101(9):2811–2816
Messer W et al (1992) The complex for replication initiation of Escherichia coli. Chromosoma 102(1 Suppl):S1–S6
Mitkova AV, Khopde SM, Biswas SB (2003) Mechanism and stoichiometry of interaction of DnaG primase with DnaB helicase of Escherichia coli in RNA primer synthesis. J Biol Chem 278(52):52253–52261
Mott ML et al (2008) Structural synergy and molecular crosstalk between bacterial helicase loaders and replication initiators. Cell 135(4):623–634
Oka A et al (1980) Replication origin of the Escherichia coli K-12 chromosome: the size and structure of the minimum DNA segment carrying the information for autonomous replication. Mol Gen Genet 178(1):9–20
Ozaki S, Katayama T (2012) Highly organized DnaA-oriC complexes recruit the single-stranded DNA for replication initiation. Nucleic Acids Res 40(4):1648–1665
Ozaki S et al (2008) A common mechanism for the ATP-DnaA-dependent formation of open complexes at the replication origin. J Biol Chem 283(13):8351–8362
Polaczek P (1990) Bending of the origin of replication of E. coli by binding of IHF at a specific site. New Biol 2(3):265–271
Rice PA et al (1996) Crystal structure of an IHF-DNA complex: a protein-induced DNA U-turn. Cell 87(7):1295–1306
Rozgaja TA et al (2011) Two oppositely oriented arrays of low-affinity recognition sites in oriC guide progressive binding of DnaA during Escherichia coli pre-RC assembly. Mol Microbiol 82(2):475–488
Ryan VT et al (2002) IHF and HU stimulate assembly of pre-replication complexes at Escherichia coli oriC by two different mechanisms. Mol Microbiol 46(1):113–124
Schaper S, Messer W (1995) Interaction of the initiator protein DnaA of Escherichia coli with its DNA target. J Biol Chem 270(29):17622–17626
Seitz H, Weigel C, Messer W (2000) The interaction domains of the DnaA and DnaB replication proteins of Escherichia coli. Mol Microbiol 37(5):1270–1279
Sekimizu K, Bramhill D, Kornberg A (1987) ATP activates dnaA protein in initiating replication of plasmids bearing the origin of the E. coli chromosome. Cell 50(2):259–265
Slater S et al (1995) E. coli SeqA protein binds oriC in two different methyl-modulated reactions appropriate to its roles in DNA replication initiation and origin sequestration. Cell 82(6):927–936
Speck C, Messer W (2001) Mechanism of origin unwinding: sequential binding of DnaA to double- and single-stranded DNA. EMBO J 20(6):1469–1476
Stepankiw N et al (2009) The right half of the Escherichia coli replication origin is not essential for viability, but facilitates multi-forked replication. Mol Microbiol 74(2):467–479
Sutton MD et al (1998) E. coli DnaA protein: the N-terminal domain and loading of DnaB helicase at the E. coli chromosomal origin. J Biol Chem 273:34255–34262
von Meyenburg K, Hansen FG (1980) The origin of replication, oriC, of the Escherichia coli chromosome: genes near oriC and construction of oriC deletion mutations. ICN-UCLA Symp Mol Cell Biol 19:137–159

Replicative DNA Helicases and Primases

Panos Soultanas¹ and Edward Bolt²
¹School of Chemistry, Centre for Biomolecular Sciences, University of Nottingham, Nottingham, UK
²School of Life Sciences, University of Nottingham, Nottingham, UK

Synopsis

Replication of a cell’s genetic material is one of the most fundamental functions in biology. The genetic information defining each species is encoded within the sequence of the DNA double helix, and to be copied into a new genome, the sequence of the parental DNA must be revealed and made available to copying enzymes known as DNA polymerases. Watson and Crick, in their seminal 1953 paper on the “Molecular Structure of Nucleic Acids” (Watson and Crick 1953), pointed out that “the specific base pairing (Guanine to Cytosine and Adenine to Thymine) immediately suggests a possible copying mechanism for the genetic material.” They further postulated that “prior to duplication the hydrogen bonds break and the two chains unwind and separate” and asked “what makes the pair of chains unwind and separate?” The answer to this important question is DNA helicases.

Introduction

DNA helicases are molecular motors that convert chemical energy from NTP (nucleoside triphosphate) binding and hydrolysis to mechanical energy, translocating along the DNA polymer and unwinding the double helix in the process. Helicase activity was first demonstrated in vitro in fractionated Escherichia coli extracts (Abdel-Monem et al. 1976). Because of the fundamental and ubiquitous nature of DNA replication, replicative DNA helicases are ubiquitous and essential enzymes. They are the “engine motors” of multicomponent protein machines, known as replisomes, evolved to replicate genomes rapidly and accurately. Replication is a complex process that, in addition to duplex unwinding, has to deal with a multitude of challenges such as the inability of DNA polymerases to synthesize DNA de novo, the unidirectional synthesis of DNA on the two antiparallel parental strands, the topological problems associated with unwinding superhelical DNA, the roadblocks encountered on protein-associated genomes, and the potentially mutagenic DNA lesions in bases or breaks in the sugar-phosphate backbone. Different proteins dealing with these problems have been assembled on replisomes that drive replication forks forward. The inability of DNA polymerases to synthesize DNA de novo is overcome by primases. Primases copy the DNA template to synthesize short RNA primers de novo, which initiate DNA synthesis. DNA polymerases then use the free 3′-OH ends of these primers to extend them unidirectionally, copying the DNA template into new DNA strands. Helicases and primases work cooperatively during DNA replication.

DnaB Is the Main Replicative Helicase in Bacteria
The principal replicative DNA helicase in bacteria, DnaB, has likely evolved from duplication of a recA-like ancestral gene (Leipe et al. 2000). DnaB homologues are present in eukaryotes but are not required for eukaryotic nuclear DNA replication. The DnaB homologue of Caenorhabditis elegans contains a divergent, inactive primase (DnaG) domain, suggesting that eukaryotic nuclear DnaB homologues are related to the bacteriophage T7 gp4 helicase-primase protein and may have been introduced into eukaryotes by horizontal transfer. By contrast, DnaB sequences in chloroplast genomes are homologous to bacterial DnaB, suggesting that they too were acquired horizontally, in this case from the bacterial symbionts that were the ancestors of plastids. These organelle DnaB helicases are involved in chloroplast DNA replication. The bacterial replicative helicase is not an orthologue of the replicative helicases in archaea and eukaryotes. The two families of replicative helicases have evolved independently of each other. The nomenclature of some bacterial replication proteins is unfortunate and can confuse nonspecialists. The replicative helicase, homologous to Escherichia coli DnaB, is known as DnaC in Bacillus subtilis, while Bacillus subtilis DnaB is an essential primosomal protein unrelated to the Escherichia coli DnaB helicase. To complicate matters further, DnaC is the helicase loader in Escherichia coli, homologous to the Bacillus subtilis DnaI helicase loader, and should not be confused with the DnaC replicative helicase of Bacillus subtilis. The replicative helicase monomer comprises approximately 454 (Escherichia coli) to 471 (Bacillus subtilis) amino acid residues, corresponding to a molecular mass of 50 to 53 kDa depending on the bacterial species, and forms functional hexamers (see below).


MCM Is the Replicative Helicase in Archaea and Eukaryotes
Archaea and eukaryotes use the Mcm (minichromosome maintenance) protein to unwind parental template DNA strands for copying by DNA polymerases within replisomes. The function of Mcm in DNA strand separation during replication is therefore analogous to that of bacterial DnaB. DnaB and Mcm share strong similarities in the overall structural aspects of helicase unwinding but differ fundamentally in their reaction mechanisms and regulation. The presence of Mcm in all sequenced archaeal and eukaryotic genomes reflects a more widespread conservation of other replisome and genome information processing proteins in archaea and eukaryotes. A few bacterial species possess helicase-active, Mcm-like proteins with sequence homology to the Mcm AAA+ domain, but these are encoded on prophage elements (Samuels et al. 2009). Mcm was discovered in Saccharomyces cerevisiae by mapping chromosomal mutations that left cells unable to maintain plasmids engineered to behave as autonomously replicating minichromosomes (Maine et al. 1984). In eukaryotes and archaea, Mcm forms multimeric ring complexes that are assembled at origins of replication (Bowers et al. 2004; Duggin et al. 2008) and catalyze DNA strand separation in vivo and in vitro (Bochman and Schwacha 2008; Chong et al. 2000; Kelman et al. 1999). Eukaryotic Mcm assembles from six distinct monomers (Mcm2-7) into heterohexameric rings, in contrast to archaeal Mcm and bacterial DnaB, which form homohexameric rings. A subcomplex of eukaryotic Mcm2-7 (Mcm467) is also an effective helicase in vitro (reviewed in Bochman and Schwacha 2009). Details of Mcm helicase activities and their regulation are described in a later section. Other MCM proteins are also present in some eukaryotes (e.g., Mcm8, Mcm9, Mcm10) and have various roles in aiding DNA replication or its regulation.
Mcm homologues are detected mainly by sequence homology to a conserved ATP processing domain of about 200 amino acids within Mcm2-7. The function of this domain is described in detail in subsequent sections. The complexity of eukaryotic MCMs reflects gene duplication from a single archaeal MCM progenitor, a monomer of about 650 amino acids that assembles into a homohexameric or homoheptameric complex that catalyzes DNA unwinding. Much of the biochemical analysis of archaeal Mcm has used purified protein from the thermophilic species Methanothermobacter thermautotrophicus and Sulfolobus solfataricus, proteins that have also been amenable to high-resolution structure determination of the Mcm helicase. Structural information on the function and arrangement of the eukaryotic Mcm2-7 and Mcm467 hetero-complexes is inferred from homologous archaeal structures and from detailed biochemical analysis.

Primases: De Novo Synthesis and Lagging Strand Versus Leading Strand Synthesis
The inability of DNA polymerases to synthesize DNA de novo is complemented by DNA-dependent RNA polymerases known as primases. The highly conserved bacterial primases are encoded by the dnaG gene. They recognize short trinucleotide sequences (recognition sites) on the parental DNA and synthesize short complementary RNA primers. DNA polymerases unidirectionally elongate the RNA primers by sequentially adding dNTPs at the 3′-OH end of the growing nascent DNA strand. This is particularly important during discontinuous lagging strand synthesis, where each Okazaki fragment is assembled by extension of an RNA primer synthesized by the primase. The RNA primers are then replaced by DNA, and neighboring Okazaki fragments are joined together by a DNA ligase. Several primase recognition sites have been identified, and it appears that members of the same phylogenetic class share trinucleotide template specificity (Larson et al. 2010). The T4 primase primes at 5′-d(GTT) and the T7 primase at 5′-d(GTC), while primases from Firmicutes prime predominantly at 5′-d(CTA) and 5′-d(TTA), from γ-Proteobacteria at 5′-d(CTG), and from Aquificae at 5′-d(CCC). Synthesis starts at the anti-penultimate position and forms the first dinucleotide, which is then extended further. The first two bases are always purines, stacking strongly to enhance complex formation and dinucleotide condensation. Only NTPs that can form Watson-Crick H-bonds with the DNA template are incorporated into the primer, but the fidelity of incorporation is relatively low. In that respect, primases resemble the lesion bypass DNA polymerases of the Y-family. Eukaryotic primases have minimal template specificity. They prefer to bind to pyrimidine-rich templates in vitro, but it remains unknown how initiation specificity is controlled in vivo. Most primases, with the exception of archaeal enzymes, have an intrinsic ability to synthesize primers of a defined length, a property known as “counting.” In bacteriophage and prokaryotic primases, counting results from one primase binding to the template and presenting it in trans to a second primase, which initiates primer synthesis (Corn et al. 2005; Qimron et al. 2006). This intermolecular interaction acts as a “molecular brake” restricting primer length. In the two-subunit eukaryotic primase (p49/p58), the Pol β-like domain in p58 is critical for counting. It binds at a defined distance ahead of p49, and as p49 elongates the primer, it bumps into p58, displacing it from the template in a “hinge opening-closing” type of mechanism. The apparent lack of counting in archaeal primases reflects their hybrid RNA-DNA polymerase nature. They can efficiently initiate primer synthesis using dNTPs instead of NTPs, a reaction that classical DNA polymerases cannot carry out. They can also efficiently incorporate dNTPs. By comparison, prokaryotic primases can only do so inefficiently, while eukaryotic primases such as the human enzyme cannot do so at all.
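The class-specific trinucleotide preferences listed above can be illustrated with a short script. This is a minimal sketch rather than a model of primase behavior: it only locates the trinucleotides quoted in the text on a single-stranded template written 5′→3′. The dictionary labels, the function name find_initiation_sites, and the example template are assumptions introduced for illustration.

```python
# Trinucleotide initiation sites quoted in the text, written 5'->3' on the
# template strand. The dictionary keys are informal labels chosen for this sketch.
PRIMASE_SITES = {
    "T4": ["GTT"],
    "T7": ["GTC"],
    "Firmicutes": ["CTA", "TTA"],
    "gamma-Proteobacteria": ["CTG"],
    "Aquificae": ["CCC"],
}


def find_initiation_sites(template, group):
    """Return sorted 0-based start positions of the group's sites in template (5'->3')."""
    template = template.upper()
    hits = []
    for site in PRIMASE_SITES[group]:
        start = template.find(site)
        while start != -1:
            hits.append(start)
            # Advance by one base so that overlapping sites are also reported
            start = template.find(site, start + 1)
    return sorted(hits)


template = "AACTGTTCTGACCCCTA"  # hypothetical ssDNA template, 5'->3'
for group in PRIMASE_SITES:
    print(group, find_initiation_sites(template, group))
```

The script reports candidate sites only. Whether a primase actually initiates at a given site in vivo also depends on the helicase and the other fork proteins described in this entry.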

Replicative DNA Helicases and Primases, Fig. 1 Replicative helicases form ring structures. (a) The hexameric ring structure of the Geobacillus stearothermophilus DnaB helicase (Bailey et al. 2007). Six monomers (cyan, gold, green, blue, red, and pink) are arranged side by side in the same orientation to form a two-tier ring. The NTD tier defines the back face of the helicase and has threefold symmetry, while the CTD tier defines the front face and has sixfold symmetry. Side, back face, and front face views are shown from left to right. (b) The crystal structure of the Sulfolobus solfataricus Mcm NTD in a side view (left) and top view (right) reveals a single-tier hexameric ring wide enough to accommodate ssDNA (Liu et al. 2008). The nonclassical Zn ribbon motifs of each monomer are oriented toward the same face (bottom) of the ring. Diametrically opposite pairs of monomers are colored differently (cyan, green, and gold).

Helicase Structures and Oligomerization: Replicative Helicases Form Rings
Most replicative helicases belong to the AAA+ (ATPases associated with various cellular activities) family with 3′-5′ polarity or the RecA-like family with 5′-3′ polarity with respect to the 5′-phosphate and 3′-OH ends of the DNA strand along which they translocate. Bacterial replicative helicases belong to the RecA-like superfamily 4 (SF4), otherwise known as DnaB-like DNA helicases. They form characteristic homohexameric ring structures (Fig. 1a). Each monomer consists of two domains, an N-terminal domain (NTD) of helical nature structurally homologous to the C-terminal (helicase-interacting) domain of the bacterial primase DnaG (Syson et al. 2005) and a C-terminal domain (CTD) with a RecA-like fold characteristic of many helicases and translocases (Bailey et al. 2007). The six monomers are arranged around the ring in the same orientation, placing all the

NTDs adjacent to each other on the back face of the ring and all the CTDs next to each other on the opposite front face of the ring. This arrangement results in the formation of a double-tier ring with a threefold symmetric back NTD ring tier sitting on top of a sixfold symmetric front CTD ring tier, giving the impression of threefold or sixfold symmetry when viewed from the back or front face, respectively. The translocating strand threads through the central cavity of the ring (see subsequent sections). The size of the central cavity can vary significantly. Crystal structures of the SV40 LTag helicase show a progressive increase in the size of the central cavity as the enzyme binds ATP and ADP or is free of nucleotide (Gai et al. 2004). Such “molecular breathing” is a likely consequence of cooperative NTP hydrolysis and


coupling to unidirectional translocation along the DNA lattice and can result in either single-stranded or double-stranded DNA being threaded through the central cavity. Recent structural data from the Aquifex aeolicus replicative helicase DnaB suggest that the NTD ring tier acts as an autoregulatory hub that controls the transition between different functional states during NTP binding and hydrolysis, as well as the ability of the helicase to be functionally modulated through direct interactions with partner and accessory proteins (Strycharska et al. 2013). It is not clear whether the NTD is generally required for helicase activity: in some DnaB helicases it is essential (Bird et al. 2000), yet in others it is dispensable (Nitharwal et al. 2007). In the crystal structure of the Geobacillus kaustophilus DnaC helicase, the inner ring surface of the NTD tier accommodated three symmetrical single-stranded 9-mer oligonucleotides (Lo et al. 2009). They likely represent three positions that the ssDNA can adopt as it emerges from the back end of the forward-moving ring helicase. The CTD provides the helicase motor, as it binds and hydrolyzes NTPs and has all five sequence motifs (H1, H1a, H2, H3, and H4) characteristic of SF4 helicases (Ilyina et al. 1992). The inner ring surface of the CTD tier encircles and engages the translocating ssDNA strand during movement (discussed in detail in a subsequent section). Like DnaB, archaeal and eukaryotic Mcm proteins assemble into toroidal multimeric rings. Electron microscopy and size exclusion chromatography studies have indicated structural polymorphism in the assembly of Mcm monomers into hexameric and heptameric rings and helical filaments (reviewed in Sakakibara et al. 2009). Double hexameric rings (dodecamers) are formed by yeast and Xenopus Mcm2-7 (Evrin et al. 2009; Gambus et al. 2011) and by Mcm from the archaeon Methanothermobacter thermautotrophicus (Chong et al. 2000).
Several atomic resolution crystal structures of Mcm monomers have been reported, each of which can be modeled into hexameric rings, as detailed below. The arrangement of monomers relative to one another in eukaryotic Mcm2-7 differs from that in archaeal Mcm: Mcm2-7 forms a gapped hexameric ring, with a dynamic or transient interaction between Mcm2 and Mcm5. This forms a “gate” or “discontinuity” within the hexameric ring that has been implicated in coupling of ATPase and DNA binding activities, discussed below. Mcm is a member of the AAA+ protein superfamily, based on sequence conservation of its ATP processing domain, although Mcm proteins often contain structural insertions into the typical AAA+ morphology. In addition to an AAA+ ATPase domain (200–300 amino acid residues), each Mcm monomer has distinct N-terminal (NTD, approximately 250 amino acid residues) and C-terminal (CTD, approximately 100 residues) domains that coordinate various functions coupling DNA binding and unwinding to ATP hydrolysis and regulation of helicase activity. High-resolution X-ray crystallographic structures have been solved for Mcm from three archaeal species: Methanothermobacter thermautotrophicus (Mth) (Fletcher et al. 2003), Sulfolobus solfataricus (Sso) (Brewster et al. 2008), and Methanopyrus kandleri (Mka) (Bae et al. 2009). The most complete Mcm structures, from Sso and Mka, show N-terminal and C-terminal domains of monomeric Mcm forms that can be modeled into hexameric rings (Fig. 1b). The Sso Mcm structure is at relatively low resolution (4.5 Å) but was the first near-full-length Mcm structure and identified key features of β-hairpins in each monomer that contribute to function, described further below. The Mka Mcm structure is at high resolution, with the caveat that it is of Mka Mcm2, which has so far shown no helicase activity in vitro and does not hexamerize. No Mcm-DNA co-structure has been solved, but the central channel formed by Mcm from Methanothermobacter thermautotrophicus has a diameter that could accommodate dsDNA. The equivalent channel of Sulfolobus solfataricus Mcm is predicted to accommodate only ssDNA.
Fluorescent labeling studies have shown that Mcm from Sulfolobus loads onto ssDNA for 3′-5′ translocation but also interacts with the 5′ ssDNA tails of forked substrates (McGeoch et al. 2005). Archaeal Mcm complexes also contain six side channels, running perpendicular to the central channel at each monomer-monomer interface. These channels connect the central channel to the exterior of the Mcm complex, similar to the channel organization of SV40 LTag. The three major domains of each monomer (NTD, CTD, and AAA+) each contain sub-domains with specific functions. NTD structures from euryarchaeal and crenarchaeal species reveal three sub-domains, A, B, and C, each of which contributes to successful oligomerization of Mcm. Sub-domain A seems to contribute to control of helicase unwinding by hexameric Mcm rings, perhaps through specific contacts with substrate DNA. A zinc finger in sub-domain B and a β-hairpin in sub-domain C dictate DNA binding by Mcm, described in more detail in a later section. Sub-domain C facilitates communication between DNA binding by the NTD and ATPase functions of the AAA+ domain via an N-terminal communication loop (NCL). Comparison of electron microscopy and atomic structures of archaeal Mcm has also detected intriguing differences in the conformation of sub-domain A, indicating significant rotation of this sub-domain depending on conditions. Biochemical and mutagenesis analysis of sub-domain A residues indicated that a conformational switch in this sub-domain may be a basis for control of helicase unwinding, drawing interesting parallels with control of Mcm function in eukaryotes, discussed further in the section on regulation of Mcm helicase. The AAA+ domain also forms sub-domains, called the P-loop (also called α/β) sub-domain and the lid sub-domain. Both are critical for Mcm monomers to bind and hydrolyze ATP, as described in detail in the next section.

Primases: Structural Considerations
Primases can be divided into two main structural groups: the two-subunit archaeal/eukaryotic enzymes and the one-subunit bacterial/phage and plasmid-encoded enzymes with TOPRIM architecture (the DnaG superfamily). The former are associated with Pol α, while the latter are associated with helicases. Bacterial primases comprise three distinct domains: a Zn-binding N-terminal domain (ZBD) that binds to the DNA template, a central RNA polymerization domain (RPD) with a TOPRIM fold, and a C-terminal helicase-binding domain (CTD) that is structurally homologous to the NTD of the DnaB helicase (Fig. 2a, b). The ZBD belongs to the zinc ribbon subfamily, with a distinct β-sheet involved in template binding (Pan and Wigley 2000). Key residues in the β4 strand confer class-associated initiation site specificity. Specificity can be transferred across bacterial classes simply by changing certain residues to those found in a different class (Larson et al. 2010). The TOPRIM fold of the RPD contains an α/β core with four conserved strands characteristic of topoisomerases Ia and II and the RecR-related DNA repair proteins. A positively charged groove on the surface of the RPD outside the active site also binds the DNA template, which then bends by 90° to enter the active site (Corn et al. 2008). Synthesis of the primer forms a primer-template heteroduplex stabilized by binding along the basic ridge (Fig. 2a). SAXS modeling places the ZBD on top of the RPD template binding groove, supporting the notion that initiation specificity is conferred in trans by the ZBD of another primase, as the linker that connects the ZBD and RPD is too short to allow binding of the ZBD to the nascent primer-template heteroduplex emerging from the basic ridge at the bottom of the structure (Fig. 2a). Heterodimeric archaeal/eukaryotic primases comprise a small and a large subunit. The catalytic small subunit contains an RNA recognition motif (RRM) fold similar to those found in reverse transcriptases, cyclic nucleotide transferases, and repair DNA polymerases of the A, B, and Y families. The large subunit is essential, but its precise role is poorly understood.
It contains a highly conserved C-terminal domain with a 4Fe-4S cluster that is lacking in the archaeal proteins. Disruption of this cluster abolishes primase activity. Eukaryotic primases co-purify with Pol α and the polymerase b subunit in a tetrameric complex. The close association of primase activity with polymerase activity is exemplified by the prim-pol family of the archaeal/eukaryotic primases. The prim-pols exhibit both polymerase and primase activities. They include enzymes found in crenarchaeal and Gram-positive bacterial plasmids. Certain eukaryotic and viral primases belong to a distinct family of the archaeal/eukaryotic superfamily. They include enzymes from the herpes-pox subfamily, characterized by a C-terminal β-strand-rich region containing a Zn ribbon similar to that found in DnaG-like primases, and from the iridovirus subfamily, characterized by a C-terminal PriCT α-helical domain. The Zn ribbon is a likely functional equivalent of the Zn ribbon found in the TOPRIM DnaG-like bacterial primases, determining specificity of binding to the DNA template. Primases work in conjunction with other proteins at the replication fork, but three proteins in particular, DNA polymerase, DNA helicase, and single-strand DNA binding protein, greatly influence primase activity. Primases are either fused physically to one of these activities or these activities are inherent components for efficient primer synthesis.

Replicative DNA Helicases and Primases, Fig. 2 The structure of the bacterial primase. (a) The RPD of the Escherichia coli DnaG primase bound to ssDNA (Corn et al. 2008). An arrow shows how the ssDNA (blue) is thought to bend by 90° into the active site and the RNA-DNA heteroduplex stabilized within the basic ridge. A SAXS-based model of the ZBD sitting on top of the RPD is shown in cyan. The linker connecting the RPD and the ZBD is not long enough to allow the ZBD to swing over toward the active site, which is consistent with a “trans” mechanism where a ZBD of one primase cooperates with the RPD of another in trans during primer synthesis. (b) Structures of the CTD (Syson et al. 2005) and ZBD (Pan and Wigley 2000) from Geobacillus stearothermophilus are also shown.

Cross-References
▶ PCNA Loading by RFC, Mechanism of
▶ PCNA Structure and Interactions with Partner Proteins
▶ Replication Origin of E. coli and the Mechanism of Initiation
▶ Rolling Circle Replicating Plasmids
▶ Theta-Replicating Plasmids, Large

References
Abdel-Monem M, Durwald H, Hoffmann-Berling H (1976) Enzymic unwinding of DNA: 2. Chain separation by an ATP-dependent DNA unwinding enzyme. Eur J Biochem 65:441–449
Bae B, Chen YH, Costa A, Onesti S, Brunzelle JS, Lin Y, Cann IK, Nair SK (2009) Insights into the architecture of the replicative helicase from the structure of an archaeal MCM homolog. Structure 17:211–222
Bailey S, Eliason WK, Steitz TA (2007) Helicase and its complex with a domain of DnaG primase. Science 318:459–463
Bird LE, Pan H, Soultanas P, Wigley DB (2000) Mapping protein-protein interactions within a stable complex of DNA primase and DnaB helicase from Bacillus stearothermophilus. Biochemistry 39:171–182
Bochman ML, Schwacha A (2008) The Mcm2-7 complex has in vitro helicase activity. Mol Cell 31:287–293
Bochman ML, Schwacha A (2009) The Mcm complex: unwinding the mechanism of a replicative helicase. Microbiol Mol Biol Rev 73:652–683
Bowers JL, Randell JC, Chen S, Bell SP (2004) ATP hydrolysis by ORC catalyzes reiterative Mcm2-7 assembly at a defined origin of replication. Mol Cell 16:967–978
Brewster AS, Wang G, Yu X, Greenleaf WB, Carazo JM, Tjajadia M, Klein MG, Chen XS (2008) Crystal structure of a near-full-length archaeal MCM: functional insights for an AAA+ hexameric helicase. Proc Natl Acad Sci USA 105:20191–20196
Chong JP, Hayashi MK, Simon MN, Xu RM, Stillman B (2000) A double-hexamer archaeal minichromosome maintenance protein is an ATP-dependent DNA helicase. Proc Natl Acad Sci USA 97:1530–1535
Corn JE, Pease PJ, Hura GL, Berger JM (2005) Crosstalk between primase subunits can act to regulate primer synthesis in trans. Mol Cell 20:391–401
Corn JE, Pelton JG, Berger JM (2008) Identification of a DNA primase template tracking site redefines the geometry of primer synthesis. Nat Struct Mol Biol 15:163–169
Duggin IG, McCallum SA, Bell SD (2008) Chromosome replication dynamics in the archaeon Sulfolobus acidocaldarius. Proc Natl Acad Sci USA 105:16737–16742
Evrin C, Clarke P, Zech J, Lurz R, Sun J, Uhle S, Li H, Stillman B, Speck C (2009) A double-hexameric MCM2-7 complex is loaded onto origin DNA during licensing of eukaryotic DNA replication. Proc Natl Acad Sci USA 106:20240–20245
Fletcher RJ, Bishop BE, Leon RP, Sclafani RA, Ogata CM, Chen XS (2003) The structure and function of MCM from archaeal M. thermoautotrophicum. Nat Struct Biol 10:160–167
Gai D, Zhao R, Li D, Finkielstein CV, Chen XS (2004) Mechanisms of conformational change for a replicative hexameric helicase of SV40 large tumour antigen. Cell 119:47–60
Gambus A, Khoudoli GA, Jones RC, Blow JJ (2011) MCM2-7 form double hexamers at licensed origins in Xenopus egg extract. J Biol Chem 286:11855–11864
Ilyina TV, Gorbalenya AE, Koonin EV (1992) Organization and evolution of bacterial and bacteriophage primase-helicase systems. J Mol Evol 34:351–357
Kelman Z, Lee JK, Hurwitz J (1999) The single minichromosome maintenance protein of Methanobacterium thermoautotrophicum DeltaH contains DNA helicase activity. Proc Natl Acad Sci USA 96:14783–14788
Larson MA, Griep MA, Bressani R, Chintakayala K, Soultanas P, Hinrichs SH (2010) Class-specific restrictions define primase interactions with DNA template and replicative helicase. Nucleic Acids Res 38:7167–7178
Leipe DD, Aravind L, Grishin NV, Koonin EV (2000) The bacterial replicative helicase DnaB evolved from a RecA duplication. Genome Res 10:5–16
Liu W, Pucci B, Rossi M, Pisani FM, Ladenstein R (2008) Structural analysis of the Sulfolobus solfataricus MCM protein N-terminal domain. Nucleic Acids Res 36:3235–3243
Lo YH, Tsai KL, Sun YJ, Chen WT, Huang CY, Hsiao CD (2009) The crystal structure of a replicative hexameric helicase DnaC and its complex with single-stranded DNA. Nucleic Acids Res 37:804–814
Maine GT, Surosky RT, Tye BK (1984) Isolation and characterization of the centromere from chromosome V (CEN5) of Saccharomyces cerevisiae. Mol Cell Biol 4:86–91
McGeoch AT, Trakselis MA, Laskey RA, Bell SD (2005) Organization of the archaeal MCM complex on DNA and implications for the helicase mechanism. Nat Struct Mol Biol 12:756–762
Nitharwal RG, Paul S, Dar A, Choudhury NR, Soni RK, Prusty D, Sinha S, Kashav T, Mukhopadhyay G, Chaudhuri TK, Gourinath S, Dhar SK (2007) The domain structure of Helicobacter pylori DnaB helicase; the N-terminal domain can be dispensable for helicase activity whereas the extreme C-terminal region is essential for its function. Nucleic Acids Res 35:2861–2874
Pan H, Wigley DB (2000) Structure of the Zn-binding domain of Bacillus stearothermophilus DNA primase. Structure 8:231–239
Qimron U, Lee SJ, Hamdan SM, Richardson CC (2006) Primer initiation and extension by T7 DNA primase. EMBO J 25:2199–2208
Sakakibara N, Kelman LM, Kelman Z (2009) Unwinding the structure and function of the archaeal MCM helicase. Mol Microbiol 72:286–296
Samuels M, Gulati G, Shin JH, Opara R, McSweeney E, Sekedat M, Long S, Kelman Z, Jeruzalmi D (2009) A biochemically active MCM-like helicase in Bacillus cereus. Nucleic Acids Res 37:4441–4452
Strycharska MS, Arias-Palomo E, Lyubimov AY, Erzberger JP, O’Shea VL, Bustamante CJ, Berger JM (2013) Nucleotide and partner-protein control of bacterial replicative helicase structure and function. Mol Cell 52:844–854
Syson K, Thirlway J, Hounslow AM, Soultanas P, Waltho JP (2005) Solution structure of the helicase-interaction domain of the primase DnaG; a model for helicase activation. Structure 13:609–616
Watson JD, Crick FH (1953) Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171:737–738

Replicon Type ▶ Plasmid Incompatibility

Residue ▶ Secondary Structure


Restriction Endonucleases
Douglas A. Julin
Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Synopsis
Restriction endonucleases are enzymes that bind to a specific double-stranded DNA sequence and catalyze hydrolysis of phosphodiester bonds in both DNA strands, within or near that sequence. Biologically, they function as part of bacterial restriction-modification systems, which consist of the nuclease and a DNA methyltransferase that modifies the bacterial DNA within the same sequence recognized by the nuclease. There are four types of restriction endonucleases (types I, II, III, and IV). Of these, the type II enzymes are the most useful for recombinant DNA applications, and several thousand are known. Type II restriction endonucleases that recognize and cleave DNA sequences 4, 5, 6, or 8 base pairs in length are commercially available. These enzymes catalyze an SN2-like nucleophilic attack by water on the phosphodiester bond, assisted by Mg²⁺ ions bound in the active site. There is little amino acid sequence similarity among these enzymes, but many share an active site motif, Pro-Asp-(Xxx)n-Asp/Glu-Xxx-Lys, where Xxx stands for any residue. The acidic residues (Asp or Glu) bind the Mg²⁺ ions. A restriction enzyme catalyzes hydrolysis of its specific DNA sequence at least 10⁶-fold faster than that of sequences that differ from it by as little as one base pair. The discovery and characterization of restriction endonucleases was a key step in the development of recombinant DNA technology, in which they are used in a wide range of applications.

Introduction
Restriction endonucleases, also called restriction enzymes, are enzymes that bind to specific base sequences in double-stranded DNA and catalyze hydrolysis of phosphodiester bonds in both strands of the DNA, within or near the specific site. The discovery of restriction enzymes was an essential step in the development of recombinant DNA and cloning technology. These enzymes provide the ability to generate a limited number of DNA fragments of predictable length and sequence from a DNA substrate. The ends of the DNA products can be joined to any other DNA molecule cleaved with the same enzyme, using the enzyme DNA ligase, to produce a recombinant DNA molecule. The tremendous utility of restriction enzymes has inspired a great deal of effort to identify and isolate new enzymes from many sources, so that today a very large number of these enzymes are available commercially. Examples of restriction endonucleases that recognize and cleave sites that are 4, 5, 6, and 8 base pairs in length are shown in Table 1 and in Fig. 1.
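The idea that a restriction enzyme yields fragments of predictable length and sequence can be illustrated with a short in silico digest. This is a minimal sketch under stated assumptions: it treats a linear DNA molecule as a single top-strand string, supports only the one-letter base codes used in Table 1, and takes the recognition site in the “/” notation of that table. The function name digest and the example sequence are hypothetical. Enzymes that cut outside their recognition site (such as FokI) or whose two strands are listed separately are not handled.

```python
import re

# One-letter base codes used in Table 1 (N stands for any base)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "S": "[CG]", "Y": "[CT]", "R": "[AG]", "N": "[ACGT]"}


def digest(seq, site):
    """Cut the top strand of a linear sequence at every match of `site`.

    `site` is written 5'->3' with "/" marking the cleavage position,
    e.g. "G/AATTC" for EcoRI. Returns the fragments in 5'->3' order.
    """
    cut_offset = site.index("/")
    bases = site.replace("/", "")
    # A lookahead is used so that overlapping occurrences are all found
    pattern = re.compile("(?=" + "".join(IUPAC[b] for b in bases) + ")")
    cuts = [m.start() + cut_offset for m in pattern.finditer(seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "TTGAATTCAAGTCAACGGGAATTCTT"  # hypothetical sequence
print(digest(dna, "G/AATTC"))  # EcoRI
print(digest(dna, "GTY/RAC"))  # HindII
```

For a palindromic site the bottom strand is cut at the symmetry-related position, so the top-strand fragments are enough to predict fragment lengths. The single-stranded overhangs follow from the offset of “/” within the site.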

Restriction Endonucleases, Table 1 Some restriction endonucleases (a)

Enzyme  Recognition site (5′–3′) (b)                      Type  Organism of origin
HaeIII  GG/CC                                             IIP   Haemophilus aegyptius
DpnI    GA(Me)/TC (c)                                     IIM   Diplococcus pneumoniae G41
BstNI   CC/WGG (W = A or T)                               IIP   Bacillus stearothermophilus N
EcoRV   GAT/ATC                                           IIP   E. coli J62, plasmid pLG74
HindII  GTY/RAC (Y = T or C; R = A or G)                  IIP   Haemophilus influenzae Rc
NaeI    GCC/GGC                                           IIE   Nocardia aerocolonigenes
SmaI    CCC/GGG                                           IIP   Serratia marcescens
BamHI   G/GATCC                                           IIP   Bacillus amyloliquefaciens H
BglII   A/GATCT                                           IIP   Bacillus globigii
EcoRI   G/AATTC                                           IIP   Escherichia coli RY13
KpnI    GGTAC/C                                           IIP   Klebsiella pneumoniae OK8
PstI    CTGCA/G                                           IIP   Providencia stuartii 164
Bpu10I  5′-CC/TNAGC; 3′-GGANT/CG                          IIA   Bacillus pumilus 10
FseI    GGCCGG/CC                                         IIP   Frankia species Eu1b
NotI    GC/GGCCGC                                         IIP   Nocardia otitidis-caviarum
BglI    GCCNNNN/NGGC                                      IIP   Bacillus globigii
FokI    5′-GGATG(N)9/N; 3′-CCTAC(N)13/N                   IIS   Flavobacterium okeanokoites
TspRI   5′-NNNCASTGNN/N; 3′-N/NNGTSACNNN (S = C or G)     IIB   Thermus species R

(a) From REBASE (http://rebase.neb.com)
(b) The cleavage site is indicated by “/.” “N” stands for any base
(c) Requires N-6-methyl-adenosine in both strands of the recognition site for DNA cleavage

Main Text
Discovery of Restriction Endonucleases
The biological phenomenon of restriction was discovered in the 1950s (Smith 1978). Bacteriophages propagated in one bacterial strain were found to be unable to replicate and propagate in related strains. This strain-specific growth was referred to as “restriction.” Further study showed that the phage DNA was destroyed in restrictive bacterial host cells. The biochemical basis of restriction became clear when restriction endonucleases were discovered in the late 1960s. The first to be discovered was a complex enzyme, now classified as a type I enzyme (see below), that required ATP and S-adenosyl methionine (AdoMet) for activity (Nathans and Smith 1975). This enzyme was later found not to cleave DNA at specific sequences, and so it is not particularly useful in biotechnology applications. The first site-specific restriction endonuclease was discovered in the late 1960s by Hamilton Smith and co-workers at Johns Hopkins University (Smith 1978), for which Smith was awarded the Nobel Prize in Physiology or Medicine in 1978 (along with Werner Arber and Daniel Nathans). In the course of studies on DNA recombination in the bacterium Haemophilus influenzae, Smith discovered an enzymatic activity in H. influenzae cell extracts that could catalyze hydrolysis of DNA from bacteriophage P22 but had no activity on DNA from H. influenzae itself. The purified enzyme required Mg²⁺ for activity, but it required neither ATP nor AdoMet. The P22 DNA was degraded to fragments that were about 100 base pairs in size rather than to mononucleotides, suggesting that the enzyme acted as an endonuclease and, more importantly, that it cleaved the DNA at specific sequences. Smith confirmed the sequence specificity and identified the recognition sequence by applying a primitive DNA sequencing procedure to the DNA products from the reaction. The DNA products were first labeled with ³²P

and/or 33P and then the ends were digested with the non-sequence-specific nucleases pancreatic DNase I, snake venom phosphodiesterase, or E. coli exonuclease I. The resulting labeled products were identified by their electrophoretic migration relative to mono- and dinucleotide markers. The results showed that the enzyme, now called HindII (Roberts and Halford 1993), recognizes a set of sequences designated 5'-G-T-Py-Pu-A-C-3', where “Py” indicates either of the pyrimidine bases C or T and “Pu” indicates either purine base A or G. The enzyme catalyzes hydrolysis of the phosphodiester bond between Py and Pu in that sequence, in both DNA strands. An important feature of restriction is that the restriction endonuclease fails to hydrolyze the DNA from the bacterial cell itself. Protection of the cellular DNA against degradation was hypothesized to result from a chemical modification of the


Restriction Endonucleases, Fig. 1 DNA cleavage by restriction endonucleases. The recognition sequence and DNA hydrolysis products are shown for some representative restriction endonucleases. “N” stands for any of the four bases

DNA, predicted by Arber to be methylation. Smith proceeded to isolate DNA methyltransferases from H. influenzae (Smith 1978). One of the four such activities in the organism was found to confer protection to DNA against digestion by HindII. It was later found to catalyze methylation of the A residue in the same site that was recognized and cleaved by the restriction enzyme, using AdoMet as the methyl group donor. The combination of a restriction endonuclease and a DNA methyltransferase that recognize and act on the same DNA sequence is known as a restriction-modification system. The methylase modifies the host organism’s DNA, protecting it from attack by the endonuclease. The endonuclease cleaves foreign DNA that may enter the cell, such as a bacteriophage that was produced in an organism

that lacks the restriction-modification system, thereby inactivating the phage.

Types of Restriction Endonucleases

Restriction endonucleases have been classified into four types (Roberts et al. 2003). Type I and type III enzymes are complex, multisubunit enzymes. Individual subunits of type I enzymes carry DNA methylation, DNA sequence recognition, and DNA cleavage activities, in one multiprotein complex. The type I enzymes require Mg2+, ATP, and AdoMet for endonuclease activity. They recognize a specific sequence in DNA, but they cleave the DNA randomly and at some distance from that sequence, making these enzymes of little use in the laboratory. Type III enzymes have two subunits, one with the DNA binding and


methylation activity and the second with the cleavage activity. These enzymes also need ATP for activity. Type IV restriction enzymes cut methylated DNA without sequence specificity and use GTP, and they are not part of restriction-modification systems. The vast majority of restriction endonucleases used in recombinant DNA work are type II enzymes. Type II restriction enzymes are homodimers or homotetramers that generally require Mg2+ but no other cofactor for enzymatic activity. They are further classified into a number of subgroups, including type IIP (enzymes that recognize and cleave both DNA strands within a palindromic recognition sequence), type IIA (recognize asymmetric sites), type IIB (cleave outside, and on both sides of, the recognition site), type IIE (must bind simultaneously to two recognition sites for cleavage activity), type IIS (cleave one or both DNA strands outside of the recognition sequence), and others. Type IIP are the most common and useful restriction enzymes. Information about the vast number of known restriction endonucleases can be found in REBASE (the Restriction Enzyme Database; http://rebase.neb.com), compiled and updated by New England Biolabs Corp. (Roberts et al. 2010). The database has information about the properties and commercial sources (if any) for type II enzymes from AaaI to Zsp2I. Examples of type II restriction endonucleases are listed in Table 1.

Basic Properties of Restriction Endonucleases

Restriction endonucleases catalyze an SN2-like nucleophilic attack by water on the phosphorus atom in a phosphodiester bond in DNA (Fig. 2; Pingoud et al. 2005). The leaving group is the 3'-OH of the 2-deoxyribose in the DNA chain. The cleaved DNA products have a phosphoryl group on the 5'-end of the broken DNA strand and a hydroxyl group on the 3'-end.
Some enzymes cleave phosphodiester bonds that are directly opposite each other in the recognition sequence, to produce DNA fragments with blunt ends (e.g., EcoRV; see Fig. 1). More commonly, the cleaved phosphodiester bonds are offset by


Restriction Endonucleases, Fig. 2 Restriction endonuclease reaction mechanism. Restriction endonucleases catalyze nucleophilic attack by water on the phosphodiester bond in DNA, with the 5'-nucleotide as the leaving group. The reaction is assisted by magnesium ions bound to acidic residues (Asp or Glu) in the active site. The acidic residues are part of the PD...D/ExK active site motif (see text and Fig. 3). Two magnesium ions are shown, although mechanisms involving one or three magnesium ions have also been proposed. After Pingoud et al. (2005)

one or more nucleotides in the two strands, and the DNA products have short single-stranded overhangs, called “sticky ends,” on either the 5'- or 3'-end (e.g., EcoRI or PstI; see Fig. 1). The 5'-phosphorylated and 3'-OH DNA ends are the substrate structures used for DNA end-joining catalyzed by DNA ligase. Thus, any two DNA fragments produced by cleavage with a given restriction enzyme, e.g., EcoRI, can be ligated together because the sticky ends will anneal by base pairing, and the 5'-end of one fragment can be ligated to the 3'-end of the other. Any two DNA fragments produced by enzymes that produce blunt ends can be ligated, regardless of the recognition sequence of the enzymes used for the original cleavage. Ligation of blunt ends is less efficient than that of sticky ends, and the ligation of blunt-ended DNA fragments made by enzymes with different recognition sites produces a dsDNA product that neither restriction enzyme will cleave. The 3'-OH group in a cleaved product with a 5'-single-stranded overhang can also be used as a primer-terminus for DNA synthesis by a DNA polymerase to convert the sticky end into


a blunt end. In a few cases, more than one enzyme is known to recognize and cleave the same recognition site. These enzymes are referred to as isoschizomers.

Restriction Endonuclease Structures

A large number of restriction endonucleases are known from many different organisms. These enzymes all catalyze a similar reaction (DNA hydrolysis), but they are diverse in their amino acid sequences, with essentially no overall sequence similarity among them. Many of these enzymes are similar at the level of three-dimensional structure, in spite of the lack of sequence similarity. In particular, a similar core structure consisting of an alpha-helix flanking four beta-strands in the vicinity of the active site has been noted among enzymes whose three-dimensional structures have been determined (Niv et al. 2007; Fig. 3). The enzymes shown in Fig. 3, and a number of others, share an active site motif: Pro-Asp-(Xxx)n-Asp/Glu-Xxx-Lys, where Xxx stands for any residue (also written as PD...D/ExK). The carboxylate-containing residues (Asp or Glu) chelate Mg2+ ions that are thought to participate directly in catalysis of phosphodiester bond cleavage (Pingoud et al. 2005; see Fig. 2). Interestingly, other nucleases involved in DNA recombination and repair, such as the bacteriophage lambda exonuclease, the RecB subunit of RecBCD enzyme, and the mismatch repair endonuclease MutH, have a three-dimensional fold and an active site motif that are similar to those of restriction endonucleases (Laganeckas et al. 2011). A few restriction enzymes are related to nucleases in the H-N-H (e.g., KpnI) or GIY-YIG superfamilies (Bujnicki et al. 2001; Saravanan et al. 2004). These superfamilies also include homing endonucleases encoded by some self-splicing introns. Homing endonucleases recognize and bind to large (12–40 bp) DNA sequences (Belfort and Roberts 1997) and catalyze cleavage of both DNA strands, an activity similar to that of restriction enzymes.
Several homing endonucleases are available commercially. They are used to generate unique genomic double-strand DNA breaks upon expression in eukaryotic cells (Stoddard 2011).
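Why a long recognition sequence can give a unique break in a genome, while a short one gives many fragments, follows from simple counting. The sketch below is a back-of-the-envelope estimate, not a description of any real genome: it assumes a random sequence with all four bases equally frequent, a fully specified (non-degenerate) site, and an illustrative genome size of 3 x 10^9 bp. The function names are hypothetical.

```python
def expected_spacing(site_length):
    """Average distance (bp) between occurrences of one specific sequence,
    assuming random DNA with equal base frequencies."""
    return 4 ** site_length

def expected_sites(genome_size_bp, site_length):
    """Expected number of occurrences of the site in a genome of that size."""
    return genome_size_bp / expected_spacing(site_length)

GENOME_BP = 3.0e9  # illustrative genome size (assumption)

for n in (4, 6, 8, 12, 18):
    print(f"{n:2d}-bp site: one per {expected_spacing(n):,} bp, "
          f"about {expected_sites(GENOME_BP, n):.3g} sites expected")
```

Under these assumptions a 6-bp site is expected roughly once every 4,096 bp, whereas an 18-bp site is expected less than once in a genome of this size. Real genomes deviate from the random model because of base composition and sequence repeats.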


Structural and Mechanistic Basis for Cleavage Specificity of Restriction Endonucleases

Restriction enzymes catalyze cleavage of the specific site at least 10^5-fold more rapidly than cleavage of a site that differs by only one base pair from that sequence (Alves et al. 1995; Taylor and Halford 1989; Thielking et al. 1990; Yang et al. 1994). The sequence specificity of a given enzyme breaks down if the enzyme is used under suboptimal reaction conditions, a phenomenon called “star activity” (Wei et al. 2008). Variations in conditions, such as Mg2+ concentration, pH, and glycerol concentration, lead to relaxation of the specificity. Under such conditions, the enzyme will catalyze cleavage of sites that differ from the expected site by one or more base pairs and produce more than the expected number of DNA fragment products. The molecular basis of sequence recognition and discrimination by restriction enzymes has been studied both as a fundamental problem in molecular recognition and in attempts to engineer enzymes with new specificities. The individual subunits of a dimeric restriction enzyme generally bind to the two halves of the palindromic recognition site, and each subunit catalyzes cleavage of one of the strands (Pingoud et al. 2005). Structural studies of enzyme-DNA complexes, both specific and nonspecific, have shown that restriction endonucleases make a large number of hydrogen bonds to bases in the recognition sequence, mainly via the major groove. The enzymes also make van der Waals interactions to nonpolar groups on the DNA and ionic and hydrogen bonds to the sugar-phosphate backbone. The three-dimensional structures of both the protein and the DNA are different in the specific enzyme-DNA complex compared to the nonspecific complex. The structural analysis has led to the hypothesis that the enzyme must undergo a conformation change to a catalytically active form in order for DNA cleavage to occur.
The interactions with the recognition sequence promote this conformation change, while nonspecific DNA does not. This is thought to allow for greater sequence specificity than can be accounted for only by the direct interactions between the protein and the specific DNA recognition sequence (Stahl et al. 1996; Winkler et al. 1993).
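To give a feel for the size of the discrimination quoted above, a rate ratio can be converted into an apparent free-energy difference. The sketch below assumes the usual transition-state relationship, in which a ratio of rates corresponds to a difference in activation free energy of RT ln(ratio); the temperature of 298.15 K and the function name are assumptions made for illustration only.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def ddg_from_rate_ratio(ratio, temp_k=298.15):
    """Apparent difference in activation free energy (kcal/mol) that
    corresponds to a given ratio of cleavage rates."""
    return R_KCAL * temp_k * math.log(ratio)

for ratio in (1e5, 1e6):
    print(f"{ratio:.0e}-fold faster: {ddg_from_rate_ratio(ratio):.1f} kcal/mol")
```

With these assumptions a 10^5-fold preference corresponds to roughly 6.8 kcal/mol and a 10^6-fold preference to roughly 8.2 kcal/mol, which is the energetic gap that any proposed recognition mechanism has to account for.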


Restriction Endonucleases, Fig. 3 Structures of restriction endonucleases bound to DNA. The DNA is shown end-on. Three residues from the PD...D/ExK active site motif are shown in spacefill and in red (Asp side chain), yellow (Asp/Glu), or blue (Lys). (a) A single subunit of BglI bound to ssDNA (17 nt). Active site residues are Asp-116 (red), Asp-142 (yellow), and Lys-144 (blue). PDB ID 1DMU. (b) A single subunit of EcoRI bound to ssDNA (13 nt). Active site residues are Asp-91 (red), Glu-111 (yellow), and Lys-113 (blue). PDB ID 1ERI. (c) FokI bound to dsDNA (20 bp). The DNA is bound by the recognition domain of this type IIS enzyme (residues 1–372), while DNA cleavage is catalyzed by a separate cleavage domain (residues 389–576) (Wah et al. 1997). Active site residues are Asp-450 (red), Asp-467 (yellow), and Lys-469 (blue). PDB ID 1FOK. (d) A single subunit of EcoRV bound to dsDNA (11 bp). Active site residues are Asp-74 (red), Asp-90 (yellow), and Lys-92 (blue). PDB ID 1AZ0


The structural factors that influence cleavage specificity by a restriction endonuclease have been studied with particular care for the enzyme EcoRV. The enzyme was found to catalyze hydrolysis of sequences differing from the correct, or cognate, sequence by only one base pair (e.g., 5'-GTTATC vs. 5'-GATATC (cognate)). However, the non-cognate sequences were hydrolyzed at least 10^6-fold more slowly than the correct site (Taylor and Halford 1989). Discrimination of this magnitude between such similar sequences is unlikely to result from differences of one to two hydrogen bonds in binding to the cognate versus non-cognate sites. Further insight came from three-dimensional structures of EcoRV bound to a small dsDNA molecule containing its cognate site (5'-GGGATATCCC) compared to the enzyme bound to a non-cognate DNA (5'-CGAGCTCG), both in the absence of Mg2+. The cognate DNA was bent substantially (50°) in the enzyme-DNA complex, while the non-cognate DNA remained in the ordinary, straight, B-form (Winkler et al. 1993). Differences were also evident in the enzyme conformation in the two structures. Interestingly, in the absence of Mg2+, EcoRV binds its specific DNA sequence and non-cognate DNA with about equal affinity (Taylor et al. 1991). Conversion of the DNA to a bent conformation when bound to the enzyme would presumably be energetically costly. The fact that the binding affinities for cognate and non-cognate DNAs are the same indicates that new bonding interactions between the protein and the DNA, and between the protein monomers in the dimer, compensate for the unfavorable energetics of bending the DNA. Moreover, Mg2+ (required for DNA cleavage) was bound by the cognate enzyme-DNA complex much more tightly than by a non-cognate enzyme-DNA complex.
These observations led to the proposal that upon binding to the correct DNA sequence, the enzyme and DNA undergo a conformation change to a catalytically active form that is able to bind Mg2+ and catalyze DNA hydrolysis. The conformation change to the catalytically active enzyme form does not occur with the non-cognate site because the energetic cost of DNA bending is


not compensated by new, favorable, protein-DNA interactions. Further insight into the ability of the enzyme to discriminate cognate from non-cognate sequences came from kinetic studies on heterodimer forms of EcoRV (Stahl et al. 1996). Dimeric enzymes were generated in which one monomer was wild type and the other monomer had a mutation either in the hydrolysis active site (D90A mutation) or in a DNA recognition loop (N188Q and T186S mutations). A WT/D90A heterodimer catalyzed DNA hydrolysis about half as fast as a wild-type homodimer (WT/WT), as expected since the active site of one monomer was inactive. However, a heterodimer in which one monomer was impaired in binding to the specific site by either the N188Q or T186S mutation, but both monomers had the normal (WT) catalytic active site, cleaved the DNA 30 and 50 times more slowly than the wild-type homodimer. These results were interpreted to indicate that the two subunits of the homodimeric enzyme communicate with each other, such that both monomers must be bound to the cognate sequence in order for the enzyme to undergo the change to the active conformation.

Applications of Restriction Endonucleases

A straightforward use of restriction enzymes is in a “shotgun” cloning procedure. A cloning vector and a target DNA such as genomic DNA are cleaved with a restriction enzyme, mixed, and joined by DNA ligase and ATP. Ligation produces a large collection of joined DNA molecules, including some in which both ends of the vector are joined to the ends of a target DNA fragment, producing a circular recombinant DNA molecule. The ligation mixture is introduced into a host cell by transformation or electroporation, and the transformed host cells are spread on plates and allowed to form colonies. The colonies must then be screened to identify those that contain the desired recombinant plasmid.
Subcloning involves digesting an existing recombinant DNA molecule with a restriction enzyme and ligating one or more of the resulting fragments to a new, cleaved, vector molecule.
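Whether a fragment and a cleaved vector can be joined depends on the ends that the enzymes leave, as described under Basic Properties. The following sketch derives the end type from a site written in the Table 1 notation and compares two enzymes. It is a simplified illustration: the function names are hypothetical, and it assumes non-degenerate palindromic sites (type IIP), for which the bottom-strand cut position follows from the symmetry of the site.

```python
def overhang(site):
    """Describe the end left by a palindromic site written like 'G/GATCC'.

    Returns (end type, single-stranded overhang read 5'->3').
    """
    k = site.index("/")            # top-strand cut offset within the site
    bases = site.replace("/", "")
    n = len(bases)
    if 2 * k == n:                 # cuts directly opposite each other
        return ("blunt", "")
    if 2 * k < n:                  # top strand cut 5' of the site center
        return ("5' overhang", bases[k:n - k])
    return ("3' overhang", bases[n - k:k])

def compatible(site_a, site_b):
    """True if both sites leave the same kind of end with the same overhang."""
    return overhang(site_a) == overhang(site_b)

print(overhang("G/AATTC"))               # EcoRI
print(overhang("CTGCA/G"))               # PstI
print(overhang("GAT/ATC"))               # EcoRV
print(compatible("G/GATCC", "A/GATCT"))  # BamHI vs. BglII
```

In this model BamHI and BglII are reported as compatible because both leave a GATC 5' overhang, even though their recognition sites differ; the joined product would contain neither complete site. Degenerate sites such as CC/WGG need extra care, since an overhang containing W is not guaranteed to pair with every other such end.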


A specific, desired DNA fragment may be isolated from the mixture of restriction products before ligation, to simplify the mixture of possible ligation products. Finally, PCR may be done with primers that introduce new restriction enzyme sites into the DNA product. The PCR product is then treated with the restriction enzymes and ligated to a vector cleaved with the same enzymes. Restriction enzyme cleavage can also be used either to map the location of recognition sites in a DNA molecule of unknown sequence or to confirm the structure of a recombinant product. The DNA is cleaved with the enzyme and the mixture analyzed by gel electrophoresis. The sizes of the restriction products can be estimated by comparing their mobility with that of standards of known size. Cleavage with each of two enzymes separately, and double digests of the DNA with both enzymes simultaneously, followed by analysis of the DNA product sizes, enables the relative positions of the recognition sites to be mapped in the original DNA molecule.
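The logic of mapping with single and double digests can be sketched by running it in the forward direction: propose a map, predict the fragment sizes for each digest, and compare the predictions with the sizes read from the gel. The example below does only that forward step. The function name, the 10,000-bp molecule, and the site positions are hypothetical, complete digestion is assumed, and cut positions are assumed to lie strictly inside a linear molecule.

```python
def fragments(length, cut_sites, circular=False):
    """Sorted fragment sizes from a complete digest of one DNA molecule."""
    cuts = sorted(set(cut_sites))
    if not cuts:
        return [length]
    if circular:
        # Adjacent cuts define fragments; the last one wraps past the origin.
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(length - cuts[-1] + cuts[0])
    else:
        bounds = [0] + cuts + [length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(sizes)

# Hypothetical candidate map of a 10,000-bp linear DNA.
LENGTH = 10_000
sites_a = [2000, 7000]   # enzyme A (assumed positions)
sites_b = [4500]         # enzyme B (assumed position)

print(fragments(LENGTH, sites_a))            # -> [2000, 3000, 5000]
print(fragments(LENGTH, sites_b))            # -> [4500, 5500]
print(fragments(LENGTH, sites_a + sites_b))  # -> [2000, 2500, 2500, 3000]
```

In the double digest the 5000-bp fragment from enzyme A disappears and two 2500-bp fragments appear, which places the enzyme B site inside that fragment. A candidate map whose predicted sizes disagree with the gel can be rejected in the same way.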

References

Alves J, Selent U, Wolfes H (1995) Accuracy of the EcoRV restriction endonuclease: binding and cleavage studies with oligodeoxynucleotide substrates containing degenerate recognition sequences. Biochemistry 34:11191–11197
Belfort M, Roberts RJ (1997) Homing endonucleases: keeping the house in order. Nucleic Acids Res 25:3379–3388
Bujnicki JM, Radlinska M, Rychlewski L (2001) Polyphyletic evolution of type II restriction enzymes revisited: two independent sources of second-hand folds revealed. Trends Biochem Sci 26:9–11
Laganeckas M, Margelevicius M, Venclovas C (2011) Identification of new homologs of PD-(D/E)XK nucleases by support vector machines trained on data derived from profile-profile alignments. Nucleic Acids Res 39:1187–1196
Nathans D, Smith HO (1975) Restriction endonucleases in the analysis and restructuring of DNA molecules. Annu Rev Biochem 44:273–293
Niv MY, Ripoll DR, Vila JA, Liwo A, Vanamee ES, Aggarwal AK, Weinstein H, Scheraga HA (2007) Topology of Type II REases revisited; structural classes and the common conserved core. Nucleic Acids Res 35:2227–2237
Pingoud A, Fuxreiter M, Pingoud V, Wende W (2005) Type II restriction endonucleases: structure and mechanism. Cell Mol Life Sci 62:685–707
Roberts RJ, Halford SE (1993) Type II restriction endonucleases. In: Linn SM, Lloyd RS, Roberts RJ (eds) Nucleases. Cold Spring Harbor Laboratory Press, Plainview, pp 35–88
Roberts RJ, Belfort M, Bestor T, Bhagwat AS, Bickle TA, Bitinaite J, Blumenthal RM, Degtyarev S, Dryden DT, Dybvig K et al (2003) A nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases and their genes. Nucleic Acids Res 31:1805–1812
Roberts RJ, Vincze T, Posfai J, Macelis D (2010) REBASE – a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res 38:D234–D236
Saravanan M, Bujnicki JM, Cymerman IA, Rao DN, Nagaraja V (2004) Type II restriction endonuclease R.KpnI is a member of the HNH nuclease superfamily. Nucleic Acids Res 32:6129–6135
Smith HO (1978) The Nobel Prize in physiology or medicine 1978 (Nobelprize.org)
Stahl F, Wende W, Jeltsch A, Pingoud A (1996) Introduction of asymmetry in the naturally symmetric restriction endonuclease EcoRV to investigate intersubunit communication in the homodimeric protein. Proc Natl Acad Sci U S A 93:6175–6180
Stoddard BL (2011) Homing endonucleases: from microbial genetic invaders to reagents for targeted DNA modification. Structure 19:7–15
Taylor JD, Halford SE (1989) Discrimination between DNA sequences by the EcoRV restriction endonuclease. Biochemistry 28:6198–6207
Taylor JD, Badcoe IG, Clarke AR, Halford SE (1991) EcoRV restriction endonuclease binds all DNA sequences with equal affinity. Biochemistry 30:8743–8753
Thielking V, Alves J, Fliess A, Maass G, Pingoud A (1990) Accuracy of the EcoRI restriction endonuclease: binding and cleavage studies with oligodeoxynucleotide substrates containing degenerate recognition sequences. Biochemistry 29:4682–4691
Wah DA, Hirsch JA, Dorner LF, Schildkraut I, Aggarwal AK (1997) Structure of the multimodular endonuclease FokI bound to DNA. Nature 388:97–100
Wei H, Therrien C, Blanchard A, Guan S, Zhu Z (2008) The Fidelity Index provides a systematic quantitation of star activity of DNA restriction endonucleases. Nucleic Acids Res 36:e50
Winkler FK, Banner DW, Oefner C, Tsernoglou D, Brown RS, Heathman SP, Bryan RK, Martin PD, Petratos K, Wilson KS (1993) The crystal structure of EcoRV endonuclease and of its complexes with cognate and non-cognate DNA fragments. EMBO J 12:1781–1795
Yang CC, Baxter BK, Topal MD (1994) DNA cleavage by NaeI: protein purification, rate-limiting step, and accuracy. Biochemistry 33:14918–14925


Rhodophyta ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Rhodophytes ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

RMSD, Root-Mean-Square Deviation ▶ NMR Approaches to Determine Protein Structure

RNA Interference

Angela K. Hilliker
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms

Knockdown; miRNA-induced deadenylation; miRNA-induced silencing; RNAi

Definition

RNA interference is the binding of small RNAs, called microRNAs (miRNAs) or small interfering RNAs (siRNAs), to mRNAs, which triggers either translation repression or degradation of the target mRNA. If the miRNA/siRNA binds the mRNA with perfect complementarity, it will cause cleavage of the mRNA, which will trigger its degradation. If the miRNA binds the mRNA imperfectly, it will trigger either translation repression or deadenylation-dependent decay. It is estimated that between half and two thirds of human mRNAs are targets of RNAi, suggesting that miRNAs are a major source of regulation of gene expression.

Discussion

Proteins are not the only trans-factors that affect the function of individual mRNAs; small RNAs (miRNAs and siRNAs) can bind mRNAs through Watson-Crick base pairing to decrease their translation or stability. miRNAs are 21–23 nucleotide RNAs that bind cis elements in mRNAs (usually in the 3' UTR). miRNAs usually bind the target with imperfect complementarity, that is, there will be mismatches within the base pairing. miRNAs trigger either translation repression or deadenylation and decay of the target mRNA. siRNAs bind with perfect complementarity, which leads to internal cleavage of the mRNA and its subsequent degradation (Fig. 1). All of these mechanisms are generally referred to as RNA interference (RNAi). miRNAs and siRNAs are both expressed as longer double-stranded RNA precursors. A distinction between miRNAs and siRNAs is their origin. miRNAs are endogenous to the genome, whereas siRNAs are exogenous and come from viruses, transposons, or experimental manipulation. Both miRNA and siRNA precursors are processed by Dicer (miRNA precursors are first cleaved by Drosha) into a 21–23 nucleotide double-stranded RNA duplex that contains the miRNA/siRNA (also called the guide strand) and its complementary strand (the passenger strand; Fig. 1). The short double-stranded RNA assembles with several proteins to form an RNA-induced silencing complex (RISC). A key component of RISC is Argonaute (Ago) protein, which interacts with the 5' and 3' ends of the miRNA (Carthew and Sontheimer 2009). Ago recruits GW182 protein, which is required for RNAi, although how it acts is not clear. GW182 does interact with polyA binding protein (PABP), so GW182 might interfere with PABP’s ability to promote translation or protect the mRNA (Fabian et al. 2010; Huntzinger and Izaurralde 2011). Together, the proteins in RISC remove the passenger strand, help the guide strand bind its target mRNA, and then repress translation or trigger decay (Fig. 1).
siRNA-containing RISC will cleave the mRNA, while miRNA-containing


RNA Interference, Fig. 1 Simplified model of RNA interference. Long double-stranded RNAs (dashed lines indicate additional length) are processed by Dicer (and Drosha) into 21–23 nucleotide double-stranded miRNAs/siRNAs. The strand that will bind the mRNA is the guide strand (blue) and the other strand is the passenger strand (grey). The small double-stranded RNAs bind RISC, which includes an Ago protein. The passenger strand is discarded and the guide strand binds its target. Depending on the sequence specificity of the miRNA/siRNA, the mRNA will either be cleaved internally by Ago, degraded via recruitment of a deadenylation complex, or prevented from being translated

RISC will either repress translation or trigger deadenylation and decay (Fig. 1). The exact mechanism of miRNA-mediated translation repression is not clear, but it requires GW182 (Fig. 1). There may be several different mechanisms for miRNA-mediated repression, as experiments suggest that different steps in translation initiation are targeted by miRNA-mediated repression, including eIF4E/cap recognition, 43S ribosome joining, or 60S ribosome joining (alternatives reviewed in Fabian et al. 2010). Differences in miRNAs or RISC may be responsible for distinct mechanisms of translation inhibition (reviewed in Fabian et al. 2010; Huntzinger and Izaurralde 2011).
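The rule of thumb stated in this entry, perfect complementarity leading to cleavage and imperfect pairing leading to repression or deadenylation, can be written as a small sketch. It is a deliberate simplification: the function names and sequences are hypothetical, only strict Watson-Crick pairs count as matches (G-U wobble pairs are scored as mismatches), the guide and its site are assumed to be the same length, and real target recognition involves factors that this model ignores.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def mismatches(guide, target_site):
    """Count mismatched positions between a guide RNA and an mRNA site.

    Both sequences are written 5'->3'; the guide pairs antiparallel to the site.
    """
    if len(guide) != len(target_site):
        raise ValueError("guide and target site must have the same length")
    perfect_site = guide.translate(COMPLEMENT)[::-1]  # reverse complement
    return sum(a != b for a, b in zip(perfect_site, target_site))

def predicted_outcome(guide, target_site):
    """Apply the simplified perfect-versus-imperfect pairing rule."""
    if mismatches(guide, target_site) == 0:
        return "cleavage by Ago"
    return "translation repression or deadenylation-dependent decay"

# Hypothetical 5-nt example kept short so it can be checked by eye.
print(predicted_outcome("AAGGC", "GCCUU"))  # fully complementary site
print(predicted_outcome("AAGGC", "GCCAU"))  # one mismatch
```

The first call reports cleavage and the second reports repression or decay, mirroring the siRNA-like and miRNA-like outcomes in Fig. 1.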

While miRNA-containing RISC can repress translation, it can also trigger the degradation of the target mRNA via deadenylation (see ▶ Cytoplasmic mRNA, Regulation of, Fig. 7). miRNA-mediated mRNA decay requires RISC and requires Ago to recruit GW182. In this case, GW182 recruits the CCR4-NOT1 deadenylase complex (Fig. 1), which removes the polyA tail and triggers mRNA decay (reviewed in Fabian et al. 2010). It is not clear why some miRNA-containing RISCs cause only translation repression, only mRNA decay, or a combination of the two. It may depend on differences in miRNA/mRNA binding or in the protein composition of RISC, as not every RISC is identical. For example, in mammals, there are four


Ago proteins, each with different properties, three GW182 proteins, and four Ago-related proteins. In addition, RISC can contain other proteins, including RNA helicases and RNA binding proteins. siRNA-containing RISC degrades mRNAs by a different mechanism (Fig. 1). In this pathway, the siRNA is completely complementary to the target mRNA and the RISC contains an Ago protein with endonucleolytic activity. Not all Ago proteins are endonucleases; in humans, only Ago2 can cleave mRNA. These RISCs cleave the mRNA where the siRNA binds. As a consequence of this cleavage, the mRNA is cut into two pieces, which are quickly degraded (Fig. 1). siRNA-mediated cleavage and decay has become an essential experimental tool to study gene function. It is difficult to knock out genes in many eukaryotes, but most eukaryotes have RNAi machinery (the budding yeast Saccharomyces cerevisiae is a notable exception). If one introduces either double-stranded RNA or a DNA construct that will be transcribed into double-stranded RNA, then the cell will process the double-stranded RNA into siRNAs (small interfering RNAs). Investigators design the siRNA to be perfectly complementary to their mRNA of interest. This technique is called a “knockdown” rather than a “knockout” because the gene is still present, but its mRNA and protein products are greatly (although not entirely) reduced by the siRNA. Not only has RNAi revolutionized scientists’ ability to study gene function, it is also a leading candidate for drug design. If siRNAs can be delivered to the correct cells, then any mRNA, in theory, could be knocked down, which would be useful for the treatment of a wide variety of diseases. Additionally, one could introduce small RNAs that interfere with the binding of miRNAs, causing an increase in protein expression. According to microRNA.org, there are over 16 million predicted miRNA binding sites in about two thirds of human genes, suggesting that miRNAs bind and regulate a significant portion of mRNAs.
Many miRNAs have a small effect on their target. However, mRNAs are often targeted by more than one miRNA, allowing for combinatorial control. This combinatorial control may help miRNAs modulate mRNAs, where one


combination of miRNAs may downregulate an mRNA slightly, while another combination of miRNAs can downregulate the mRNA significantly. Thus, miRNAs provide a powerful way to fine-tune the protein production of individual mRNAs.
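The idea of combinatorial control can be made concrete with a toy calculation. The model below is purely an assumption introduced for illustration and is not taken from the studies cited here: each miRNA is taken to act independently and to reduce protein output by a fixed fraction, so the remaining output is the product of the individual remaining fractions. The function name and the repression values are hypothetical.

```python
from math import prod

def remaining_expression(repressions):
    """Fraction of protein output left after independent repression events.

    `repressions` holds the fractional reduction caused by each miRNA
    (0.2 means a 20% reduction). Independence is an assumption.
    """
    return prod(1.0 - f for f in repressions)

# Hypothetical combinations of weak and strong miRNA effects.
print(round(remaining_expression([0.1]), 3))            # one weak miRNA
print(round(remaining_expression([0.2, 0.2, 0.2]), 3))  # three modest miRNAs
print(round(remaining_expression([0.5, 0.4, 0.3]), 3))  # three stronger miRNAs
```

Under this assumed model several small effects combine into a substantial reduction, which is one simple way to picture how different sets of miRNAs could tune the output of the same mRNA to different levels.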

Cross-References ▶ Cytoplasmic mRNA, Regulation of

References

Carthew RW, Sontheimer EJ (2009) Origins and mechanisms of miRNAs and siRNAs. Cell 136:642–655
Fabian MR, Sonenberg N, Filipowicz W (2010) Regulation of mRNA translation and stability by microRNAs. Annu Rev Biochem 79:351–379
Huntzinger E, Izaurralde E (2011) Gene silencing by microRNAs: contributions of translational repression and mRNA decay. Nat Rev Genet 12:99–110

RNA Localization ▶ mRNA Localization and Localized Translation

RNA Quality Control

Angela K. Hilliker
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms

No-go decay (NGD); Nonfunctional rRNA decay (NRD); Nonsense-mediated decay (NMD); Nonstop decay (NSD)

Definition

RNA can accumulate mutations through transcription of mutated DNA, through errors in transcription of wild-type DNA, or through damage to the RNA itself from mutagens. Some mutant or damaged mRNAs cause stalls in translation elongation or termination. Eukaryotic cells have developed several RNA quality control mechanisms that trigger the decay of such mRNAs. Without these surveillance mechanisms, the cell would produce mutant proteins or would lose ribosomes that remained stalled on aberrant transcripts. While this section focuses on quality control of mRNA, mRNA represents a fraction of the RNA in a cell and other RNAs are also subject to quality control.

Discussion

It has long been appreciated that DNA can be damaged by mutagens and that extensive repair pathways exist to repair damaged DNA. Many of the same agents that damage DNA also damage RNA. RNA is more abundant in the cell than DNA; experiments in yeast show that there is 100 times more RNA by weight than DNA. This suggests that the cell must degrade and replace damaged RNA to avoid serious ramifications for gene expression and cellular function. Several RNA surveillance mechanisms that turn over aberrant RNAs have been discovered. Messenger RNAs (mRNAs) may contain mutations that were encoded in the DNA, that arose from errors in transcription or mRNA processing, or that accumulated due to damage from metabolism or mutagens. In most cases, missense mutations that change one or a few amino acids will not look aberrant to the cell and are not caught by known RNA quality control mechanisms. However, there are at least three types of mRNA mutations that trigger one of the quality control pathways. The first type is premature stop codons (also referred to as nonsense mutations). If these mRNAs are translated, they will produce truncated peptides that, at the very least, are wasteful, but could also be harmful to the cell. These mRNAs are degraded by a process called nonsense-mediated decay (NMD). Second, mRNAs may be truncated or mutated in a way that removes the natural stop codon, which is the ribosome’s signal to stop translating. These mRNAs will have ribosomes stalled at their 3' end and are degraded by a process called

1081

nonstop decay (NSD). Third, mRNAs may have sequences within them that slow or stall the ribosome. To avoid ribosomes from remaining stuck on these transcripts, a third quality control mechanism, no-go decay (NGD), is employed. The general theme that is emerging from these quality control pathways is that these mutations cause an unusual translation termination event that is recognized by key proteins. These proteins then recruit degradation machinery. Nonsense-mediated decay (NMD) targets mRNA with premature stop codons. This includes not only mutant mRNAs but also some normal mRNAs that are natural NMD substrates. Thus, NMD is both a quality control mechanism for aberrant mRNAs and a means to control the stability of some endogenous mRNAs. How does the NMD machinery detect that an mRNA contains a premature stop codon? In mammals, each pre-mRNA contains several introns, typically within the protein coding sequence. When an intron is removed, a complex of proteins (called the exon junction complex or EJC) is deposited on the mRNA, marking the site of the splicing event. During translation, if the ribosome encounters a stop codon with an EJC further downstream, it treats the stop codon as premature (Wu and Brewer 2012). In yeast, many pre-mRNAs lacks introns, and therefore, their mature mRNAs will lack EJC complexes, yet these mRNAs can still be targets of NMD. Long 30 UTRs and a large distance between the stop codon and the polyA tail binding protein, Pab1, have been implicated in triggering NMD; however, they cannot explain all cases (Parker 2012). mRNAs that undergo NMD have an unusual translation termination event that recruits Upf1, an RNA-dependent ATPase. When Upf1 binds, the ribosome changes how it interacts with the mRNA, but the precise consequence of that change is unknown. After Upf1 binding, the proteins Upf2 and Upf3 bind. It is thought that Upf2 and Upf3 help activate Upf1, which can use the energy from ATP hydrolysis to rearrange the mRNA. 
The exact nature of this rearrangement is not known (Parker 2012). Some mRNAs are aberrant because they lack a stop codon, which can occur if the polyA tail is

R

1082

added upstream of the natural stop codon. When the ribosome reaches the 30 end of the mRNA, it cannot terminate. These stalled complexes will trigger NSD, which requires another set of proteins, Ski7 and partners, to trigger degradation of the mRNA via the exosome. No-go decay (NGD) is another quality control mechanism that targets mRNAs with ribosomes stalled on them. So far, NGD has only been observed in yeast and its natural substrates are not known. When reporter mRNAs are modified so that they contain strong secondary structure, runs of rare codons, or runs of the same codon, these reporter mRNAs are quickly degraded. All of these sequence aberrations would cause a ribosome to slow down or stall on the mRNA. It is hypothesized that mRNAs damaged by mutagens could also cause stalled ribosomes. During normal translation termination, release factors (eRF1 and eRF3) bind the A site and trigger the removal of the ribosome. NGD is hypothesized to be activated when a ribosome is stalled, for any reason, leaving its A site empty. NGD factors Dom34 and Hbs1 mimic the function of the release factors to trigger premature translation termination. The mRNA is then cleaved by an unknown nuclease and degraded by mRNA decay machinery (Parker 2012). mRNA only represents about 1% of the total RNA in a cell. Most of the RNA is ribosomal RNA (rRNA) and transfer RNAs (tRNA), which are also susceptible to damage. rRNA with mutations in the peptidyl transferase center is decayed by a process termed 25S NRD (nonfunctional rRNA decay). rRNA with mutations in the decoding center is decayed by a process called 18S NRD. Both forms of NRD allow the cell to destroy defective rRNA. 18S NRD requires some of the same factors used in NGD, suggesting that the pathways to monitor the mRNA and rRNA are connected (Cole et al. 2009). For a review of the quality control mechanisms for other RNAs, see (Parker 2012).
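The three mRNA surveillance pathways described in this entry can be summarized as a small decision rule. The sketch below is only an illustration under simplifying assumptions: the `Transcript` fields and the function name are hypothetical, the premature-stop test uses the mammalian EJC rule alone, and real cells rely on additional signals (for example, long 3′ UTRs in yeast) that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Transcript:
    """Toy description of an mRNA; every field is an illustrative assumption."""
    stop_codon_pos: Optional[int]          # position of the first in-frame stop, None if absent
    ejc_positions: Tuple[int, ...] = ()    # exon junction complex sites left by splicing
    stall_site: Optional[int] = None       # strong stall (hairpin, rare-codon run), if any


def surveillance_pathway(t: Transcript) -> str:
    """Return the decay pathway expected to act on the transcript."""
    # A stall upstream of the stop codon halts the ribosome before termination: no-go decay.
    if t.stall_site is not None and (
        t.stop_codon_pos is None or t.stall_site < t.stop_codon_pos
    ):
        return "NGD"
    # No stop codon at all: the ribosome runs to the 3' end and cannot terminate.
    if t.stop_codon_pos is None:
        return "NSD"
    # A stop codon with an EJC further downstream is read as premature (mammalian rule).
    if any(ejc > t.stop_codon_pos for ejc in t.ejc_positions):
        return "NMD"
    return "none"


print(surveillance_pathway(Transcript(stop_codon_pos=300, ejc_positions=(450,))))
```

Checking for a stall first reflects the order in which a ribosome would meet these features along the message; it is a modeling choice, not a statement about pathway priority in the cell.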

Cross-References ▶ Cytoplasmic mRNA, Regulation of ▶ DNA Repair

References
Cole SE, LaRiviere FJ, Merrikh CN, Moore MJ (2009) A convergence of rRNA and mRNA quality control pathways revealed by mechanistic analysis of nonfunctional rRNA decay. Mol Cell 34:440–450
Parker R (2012) RNA degradation in Saccharomyces cerevisiae. Genetics 191:671–702
Wu X, Brewer G (2012) The regulation of mRNA stability in mammalian cells: 2.0. Gene 500:10–21

RNA-Dependent Heterochromatin ▶ RNA-Induced Chromatin Remodeling

RNAi ▶ RNA Interference

RNAi-Dependent Heterochromatin ▶ RNA-Induced Chromatin Remodeling

RNA-Induced Chromatin Remodeling Scheherazade Khan and Angela K. Hilliker Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms RNA-dependent heterochromatin; RNAi-dependent heterochromatin

Definition

RNAs can influence the genes from which they were transcribed by acting as scaffolds that recruit chromatin-modifying factors, resulting in gene silencing. Long non-coding RNAs can silence large parts of chromosomes or even an entire X chromosome. Small interfering RNAs, which are typically thought to affect gene expression at the level of mRNA, can sometimes promote heterochromatin formation around their gene. Both classes work in cis, meaning that they appear to silence the copy of the gene from which they were transcribed, not the second copy on the homologous chromosome. Both types of RNAs appear to function by recruiting chromatin-modifying enzymes that modify the histones and/or DNA to form heterochromatin.

Discussion In mammals, females have two X chromosomes, while males have one X and one Y chromosome. The X chromosome contains over 1,000 genes. Some of these genes are required for proper development, but some are problematic if both copies are expressed. Therefore, in females, one of the X chromosomes is inactivated (Lyon 1961). Exactly one X chromosome must remain active, and all others are inactivated. Thus, in rare XXX females, two of the three X chromosomes must be inactivated. This inactivation occurs early in development and is random. Therefore, females are genetic mosaics, as some cells are derived from progenitors that inactivated the maternal X chromosome, while other cells are derived from progenitors that inactivated the paternal X chromosome.

One visible example of female mosaicism is the patchwork coat of a calico cat. One of the genes that affect coat color is located on the X chromosome. If a female cat is heterozygous for coat color, she has one dominant and one recessive allele for this gene. Early in development, some cells will inactivate the X chromosome with the dominant allele; the skin cells that arise from these progenitor cells will retain the same inactivated X chromosome and will produce black fur. Cells that have inactivated the chromosome with the recessive allele will produce orange fur. In the adult female, the fur will appear as patches of orange and black, depending on which X chromosome was inactivated in the skin cells underneath. As this phenomenon is due to random X-inactivation, calico cats are almost always female.

A long non-coding RNA called XIST (X inactivation specific transcript) plays an essential role in the inactivation of X chromosomes. Long non-coding RNAs (lncRNAs) are greater than 200 nucleotides in length and do not code for a protein product. In mammalian genomes, these lncRNAs outnumber protein coding mRNAs (reviewed in Saxena and Carninci 2011). In mice, over half of the transcribed RNA does not encode protein (reviewed in Carninci and Hayashizaki 2007). The XIST gene, which produces the 17,000 nucleotide XIST RNA, is located near the X-inactivation center (Xic), the locus that controls X-inactivation. XIST RNA is transcribed only from the X chromosome that will be inactivated (Xi) and is tethered to that X chromosome by proteins; it is thought that this tethering occurs while XIST RNA is being transcribed. The XIST RNA coats the entire Xi chromosome and recruits chromatin remodeling machinery that promotes the formation of heterochromatin, silencing the chromosome (reviewed in Saxena and Carninci 2011). The amount of XIST RNA increases with the number of X chromosomes. Thus, an XXX cell will make more XIST RNA than an XX cell. The amount of XIST RNA is controlled to allow the cell to count the number of X chromosomes that need to be inactivated (reviewed in Lee 2011).

XIST RNA serves as a paradigm for the role of non-coding RNAs in the silencing of transcription by influencing chromatin state. A handful of other lncRNAs have been found to act similarly to XIST. For example, the 108,000 nucleotide Airn RNA is expressed from the paternal copy of chromosome 17 and silences three genes spanning a 400,000 nucleotide section of that same chromosome (reviewed in Saxena and Carninci 2011). In all of these cases, the lncRNA appears to direct chromatin remodeling machinery to the gene from which it was expressed and forms a scaffold for the assembly of this machinery. These chromatin remodeling enzymes then modify histones, which correlates with transcriptional silencing (reviewed in Saxena and Carninci 2011).

XIST RNA is not the only regulator of X chromosome inactivation. Several other long non-coding RNAs are transcribed from genes near the Xic and help regulate X-inactivation, including TSIX, which is the reverse complement of the XIST RNA. TSIX RNA is highly transcribed from the active X chromosome (Xa) and poorly transcribed from Xi. TSIX RNA remains associated with the Xa and recruits methyltransferases to methylate the promoter of the XIST gene, silencing its transcription. At least five other long non-coding RNAs are thought to influence X-inactivation (reviewed in Lee 2011). This work has led to the model that long non-coding RNAs form a series of switches that allow one X to remain active while the other is silenced.

Recent evidence suggests that small interfering RNAs (siRNAs) can alter chromatin to inhibit transcription, particularly around transposons. siRNAs are generally thought to downregulate gene expression in the cytoplasm in a process called RNA interference (RNAi) by binding mRNA to decrease translation or promote mRNA degradation. It is now known that some of these siRNAs reenter the nucleus and promote chromatin modifications that correlate with a decrease in transcription. It is crucial for cells to maintain transposons in a heterochromatic state to silence them. Transposons are DNA elements that can become mobile and jump into other places in the genome; if they jump into a gene, they will likely disrupt the gene’s function. Transposons exhibit low-level transcription, and their transcribed RNAs are cut into siRNAs, small 21–23 nucleotide pieces of RNA produced by an enzyme called Dicer. The siRNAs, along with Dicer and other RNAi machinery, such as Ago1, are associated with these heterochromatin sites and are required for maintaining the heterochromatin. The siRNAs and RNAi machinery recruit methylases and HDACs that help modify histones into a state that correlates with transcriptional silencing. These modified histones, in turn, recruit and stabilize the RNAi machinery and siRNAs. What comes first, the production of heterochromatin or the production of siRNAs? The answer is not clear, but the two processes do form a feedback loop, with the siRNA-RNAi machinery promoting heterochromatin formation and the heterochromatin aiding in the production of more siRNAs (reviewed in Grewal 2010).
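The counting rule and the mosaicism described in this entry can be illustrated with a short sketch. It assumes only what the text states: exactly one X chromosome stays active per cell, and the choice between the maternal and paternal X is random in each progenitor. The function names and the fixed random seed are hypothetical conveniences, and the model ignores the molecular switches (XIST, TSIX, and others) that implement the choice.

```python
import random


def inactivated_x_count(n_x: int) -> int:
    """Number of X chromosomes silenced so that exactly one remains active."""
    if n_x < 1:
        raise ValueError("a cell needs at least one X chromosome")
    return n_x - 1


def simulate_mosaic(n_progenitors: int, seed: int = 0) -> list:
    """Randomly pick which X (maternal or paternal) each XX progenitor silences."""
    rng = random.Random(seed)  # seeded only to make the illustration repeatable
    return [rng.choice(["maternal", "paternal"]) for _ in range(n_progenitors)]


for karyotype, n_x in [("XY", 1), ("XX", 2), ("XXX", 3)]:
    print(karyotype, "inactivates", inactivated_x_count(n_x), "X chromosome(s)")

# Each entry stands for one progenitor cell and the clone (patch) descended from it.
print(simulate_mosaic(8))
```

Because every descendant of a progenitor keeps that progenitor's choice, the list returned by `simulate_mosaic` corresponds to the patches seen in a calico coat.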

Cross-References ▶ RNA Interference ▶ Transposons

References
Carninci P, Hayashizaki Y (2007) Noncoding RNA transcription beyond annotated genes. Curr Opin Genet Dev 17:139–144
Grewal SI (2010) RNAi-dependent formation of heterochromatin and its diverse functions. Curr Opin Genet Dev 20:134–141
Lee JT (2011) Gracefully ageing at 50, X-chromosome inactivation becomes a paradigm for RNA and chromatin control. Nat Rev Mol Cell Biol 12:815–826
Lyon MF (1961) Gene action in the X-chromosome of the mouse (Mus musculus L.). Nature 190:372–373
Saxena A, Carninci P (2011) Long non-coding RNA modifies chromatin: epigenetic silencing by long non-coding RNAs. Bioessays 33:830–839

Rolling Circle Replicating Plasmids Gloria del Solar¹, Cris Fernández-López¹, José Angel Ruiz-Masó¹, Fabián Lorenzo-Díaz² and Manuel Espinosa¹
¹Centro de Investigaciones Biológicas, CSIC, Madrid, Spain
²Unidad de Investigación, Hospital Universitario Nuestra Señora de Candelaria and Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias, Centro de Investigaciones Biomédicas de Canarias, Universidad de La Laguna, Santa Cruz de Tenerife, Spain

Synopsis The replication of plasmids by the rolling circle replication mechanism represents one of the simplest strategies, relying as it does on nicking one strand to generate the primer for leading strand initiation and on a single priming site for lagging strand synthesis. The generation of single-stranded intermediates, which are potentially unstable and can also induce the SOS response, means that there is selection against such plasmids growing very large. Their genomes therefore consist of a limited number of basic elements: leading strand initiation and control, lagging strand origin, phenotypic determinants, and mobilization, generally in that order of frequency. Although phenotypic determinants may be part of transposable elements, it also appears that recombination mediated by short repeated sequences can drive construction and reassortment of plasmid cassettes. This can create enigmatic variations where core functions may exhibit high conservation while being interspersed with totally different DNA segments that constitute the phenotypic load.

Introduction The rolling circle mode of replication has only been found in small (30 kb) plasmid genomes, and the economy of their genetic information is a hallmark of this type of replicon (Espinosa 2013; Khan 2005). These so-called RCR plasmids exhibit a classical modular arrangement in which three main coding modules and at least one locus containing the origin of lagging strand replication (single-strand origin, sso) may be found (see Fig. 1). The module involved in leading strand replication initiation and control (LIC) is the only essential one and defines the replicon families. In fact, some RCR plasmids only carry the LIC cassette (del Solar et al. 1993). In addition to LIC, there may be another cassette involved in plasmid mobilization (MOB), which harbors an origin of transfer (oriT) and a relaxase protein (Mob) (Grohmann et al. 2003). A third module that might be present is formed by an antibiotic resistance determinant (module DET), and, less frequently, modules encoding metabolic functions can be found in plasmids from strains used in biotechnological processes (Petrova et al. 2003).

Rolling Circle Replicating Plasmids, Fig. 1 The modular organization of RCR plasmids: (a) Organization of the streptococcal promiscuous plasmid pMV158 as the paradigm of plasmids harboring all representative gene cassettes and (b) organization of plasmids representative of the different families. To illustrate the modular organization and the cassettes they harbor, the following plasmids are shown: pMV158 from Streptococcus agalactiae (Acc. X15669); pE194 from Staphylococcus aureus (Acc. NC_005908.1); pADB201 from Mycoplasma mycoides (Acc. M25059); pWVO1 from Lactococcus lactis (Acc. X56954); and pT181, pC221, pC194, and pUB110 from S. aureus (Acc. NC_001393.1, NC_006977, NC_002013.1 and M19465.1, respectively). Modules with high percentage of sequence identity, hence belonging to the same family, carried on different plasmids are indicated with the same color and filling (Modified from Espinosa 2013)

Replication Modules The LIC module includes a cis-acting region, termed the double-strand origin (dso), as well as the genes encoding the replication initiator Rep protein and the element(s) controlling synthesis of Rep and hence initiation of replication and plasmid copy number. The dso in turn contains the sequences where the Rep protein binds specifically to the DNA (locus bind) and generates the specific nick required for replication initiation (locus nic). Based on similarities in the genetic arrangement of the LIC module and on sequence homologies both in the nic locus and in the Rep protein, RCR plasmids have been classified into different families, the best-known member(s) of each being the prototype. The pT181/pC221, pC194/pUB110, and pMV158 families are of special relevance because their replicons are the best characterized (Fig. 1). Members of each of these families have been isolated from a variety of bacteria, and some of them have been shown to replicate in species, genera, or even phyla other than those from which they were isolated (del Solar et al. 1987; Goursot et al. 1982). The ability of many RCR plasmids to replicate in a variety of hosts has been attributed to the fact that they use conserved host proteins to duplicate their DNA, the key enzymes being RNA polymerase, DNA polymerases I and III, DNA gyrase, single-stranded DNA binding protein(s), and, in several cases, the PcrA helicase. The recent solution of the three-dimensional structure of the hexameric RepB protein from plasmid pMV158 has shed some light on the promiscuity exhibited by this replicon (Boer et al. 2009). The hexameric ring formed by the oligomerization domain (OD) of RepB, which is located in the C-terminal moiety of the protein, shows striking structural similarity to those of viral Rep proteins belonging to the SF3 helicase superfamily. This singular structure of the Rep protein might underlie, together with the above-mentioned employment of common host functions, the extremely broad host range of pMV158 and other members of its plasmid family.

Replication Control With respect to the replication control elements, all natural RCR plasmids characterized so far encode one or two small countertranscribed (ct) RNAs (i.e., transcribed in the opposite direction to the rep mRNA) that control replication either by promoting premature termination of the rep mRNA or by inhibiting translation of the rep gene (Khan 2005; López-Aguilar et al. 2013). In addition, the plasmids of the pMV158 family also encode small transcriptional repressor proteins (Cop) that negatively control expression of the essential rep gene. In the prototype pMV158 plasmid, the CopG protein, which has a dimeric ribbon-helix-helix conformation (Gomis-Rüth et al. 1998), controls synthesis of the RepB initiator by hindering the binding of the RNA polymerase to the promoter that governs transcription of the repB gene (Hernandez-Arriaga et al. 2009). Remarkably, the two control elements of the replicons belonging to the pMV158 family are, respectively, the smallest transcriptional repressor proteins and plasmid-encoded ctRNAs so far described (Gomis-Rüth et al. 1998; López-Aguilar and del Solar 2013). The two regulatory elements (ctRNA and Cop protein) work together to sense and correct fluctuations in the plasmid copy number (del Solar et al. 1995).


Single-Strand Origins A different replicative module is the so-called single-strand origin (sso), three main types of which (ssoU, ssoA, and ssoT) have been described (Kramer et al. 1998). Each consists of a DNA region that can form complex secondary structures on single-stranded DNA molecules, which are the intermediates of rolling circle replication and its replicative hallmark. The sso is recognized by host factors (RNA polymerase, primase, or primosome) that synthesize a small RNA, which primes the synthesis of the lagging strand by DNA polymerases I and III. Since no plasmid-encoded factors seem to be involved in recognition of the sso, the functionality of these signals should depend only on the host, even though homologies between the different types of sso have been found (Kramer et al. 1998). All natural RCR plasmids appear to contain at least one sso that functions efficiently in their natural host, although the streptococcal plasmid pMV158 has two of these origins: ssoU and ssoA (Fig. 1). In general, it would appear that the ssoA type is fully functional only in the host from which the plasmid was isolated, whereas the ssoU type seems to function in different hosts. Consistent with this, it has been demonstrated that whereas the pMV158-ssoU participates in plasmid mobilization between different bacterial species, the pMV158-ssoA is involved mainly in intraspecific transfer (Lorenzo-Díaz and Espinosa 2009). Perhaps the presence of these two origins in the same plasmid could account for its promiscuity. Although the sso is not strictly required for replication, its removal frequently results in plasmid instability, and the lack of a functional signal may decrease fitness of the plasmid-containing bacteria, thus leading to overrepresentation of plasmid derivatives that have acquired a functional sso.

Phenotypic Determinants The DET module, when present, is generally formed by a single gene encoding an antibiotic resistance determinant that can be inducible or constitutive. Inducible determinants include those present in the staphylococcal plasmids pC194, pE194, and pT181, which carry genes encoding resistance to chloramphenicol (cat), erythromycin (erm), and tetracycline (tet), respectively. Curiously enough, a tetL determinant highly homologous to the one carried by pT181, albeit constitutive, is present in plasmid pMV158 and in the chromosome of Bacillus subtilis 168.

Mobilization The MOB cassette is composed of the cis-acting oriT and the gene (or genes) involved in plasmid conjugal transfer. The economy of genetic information in RCR plasmids means that they can harbor only a MOB module: they cannot be conjugative (self-transferable), but they can be mobilized by functions provided by auxiliary plasmids (Grohmann et al. 2003). Thus, many, but not all, RCR plasmids encode only a mobilization (relaxase) protein, generically called Mob. In the case of plasmid pC221, the MOB cassette has, in addition, an accessory protein that is needed to generate the DNA-protein complex, termed the relaxosome, assembled at the oriT (Smith and Thomas 2004). The Mob relaxase generates a stable protein-DNA complex that would be transferred to the recipient cell by means of a set of proteins provided by the auxiliary plasmid: the coupling protein and the type IV secretion system (Llosa et al. 2002). So far, the best-characterized MOB cassette from RCR plasmids is that of pMV158, which defines a novel family of relaxases, termed MOBV (Garcillán-Barcia et al. 2011), composed of nearly 100 members, among them the relaxase of the staphylococcal RCR plasmid pUB110 and, more interestingly, the relaxase of a conjugative transposon of the Gram-negative bacterium Bacteroides (Smith et al. 1998).

Overall Genome Structure In general, codon usage and G+C content of RCR plasmids match well those of their natural bacterial hosts, an indication of coevolution. In this sense, it is worth pointing out that Mycoplasma mycoides uses UGA to specify tryptophan instead of a stop codon. The mycoplasma RCR plasmid pADB201 (one of the smallest RCR plasmids reported so far, matching the smallness of the genome of its host) also has three UGA codons in its rep gene, suggesting a limitation of its replicative abilities: the plasmid now cannot leave its host to colonize different bacterial species (del Solar et al. 1996). There is still a wealth of information stored in these types of replicons that awaits in-depth exploitation of their potential to spread between many different hosts. The finding of homologues of genes carried by RCR plasmids in bacterial chromosomes (Espinosa 2013), in conjunction with the ability of RCR plasmids to integrate into the bacterial chromosome in a recA-independent manner by the use of very short (just 6–14 bp) regions of homology (Dempsey and Dubnau 1989), makes these kinds of plasmids a fascinating substrate to study the evolutionary bases of DNA genomes.
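The consequence of reading UGA as tryptophan can be made concrete with a toy translation routine. The sketch below builds the standard genetic code from its conventional compact encoding and optionally reassigns TGA (UGA in the mRNA) to tryptophan, as the text describes for Mycoplasma mycoides. The test sequence is invented for illustration and is not taken from pADB201; the function names are likewise hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ('*' marks a stop).
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(uga_is_trp: bool = False) -> dict:
    """Map DNA codons to one-letter amino acids; optionally read TGA as Trp."""
    table = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}
    if uga_is_trp:
        table["TGA"] = "W"
    return table


def translate(dna: str, uga_is_trp: bool = False) -> str:
    """Translate from the first base in frame, stopping at the first stop codon."""
    table = codon_table(uga_is_trp)
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_gene = "ATGTGGTGAAAATAA"  # hypothetical sequence: Met, Trp, TGA, Lys, stop
print(translate(toy_gene))                   # standard code
print(translate(toy_gene, uga_is_trp=True))  # mycoplasma-style reading of TGA
```

Under the standard code the toy gene yields a two-residue product because TGA terminates translation, whereas the mycoplasma-style reading continues to the TAA stop. This mirrors why a rep gene containing UGA codons would be expected to give a truncated initiator in a host that reads UGA as a stop.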

References
Boer DR, Ruiz-Masó JA, López-Blanco JR, Blanco AG, Vives-Llàcer M, Chacón P, Usón I, Gomis-Rüth FX, Espinosa M, Llorca O, del Solar G, Coll M (2009) Plasmid replication initiator RepB forms a hexamer reminiscent of ring helicases and has mobile nuclease domains. EMBO J 28:1666–1678
del Solar G, Díaz R, Espinosa M (1987) Replication of the streptococcal plasmid pMV158 and derivatives in cell-free extracts of Escherichia coli. Mol Gen Genet 206:428–435
del Solar G, Moscoso M, Espinosa M (1993) Rolling circle-replicating plasmids from gram-positive and gram-negative bacteria: a wall falls. Mol Microbiol 8:789–796
del Solar G, Acebo P, Espinosa M (1995) Replication control of plasmid pLS1: efficient regulation of plasmid copy number is exerted by the combined action of two plasmid components, CopG and RNA II. Mol Microbiol 18:913–924
del Solar G, Alonso JC, Espinosa M, Díaz-Orejas R (1996) Broad host range plasmid replication: an open question. Mol Microbiol 21:661–666
Dempsey LA, Dubnau DA (1989) Identification of plasmid and Bacillus subtilis chromosomal recombination sites used for pE194 integration. J Bacteriol 171:2856–2865
Espinosa M (2013) Plasmids as models to study macromolecular interactions: the pMV158 paradigm. Res Microbiol 164:199–204
Garcillán-Barcia MP, Alvarado A, de la Cruz F (2011) Identification of bacterial plasmids based on mobility and plasmid population biology. FEMS Microbiol Rev 35:936–956
Gomis-Rüth FX, Solá M, Acebo P, Párraga A, Guasch A, Eritja R, González A, Espinosa M, del Solar G, Coll M (1998) The structure of plasmid-encoded transcriptional repressor CopG unliganded and bound to its operator. EMBO J 17:7404–7415
Goursot R, Goze A, Niaudet B, Ehrlich SD (1982) Plasmids from Staphylococcus aureus replicate in yeast Saccharomyces cerevisiae. Nature 298:488–490
Grohmann E, Muth G, Espinosa M (2003) Conjugative plasmid transfer in gram-positive bacteria. Microbiol Mol Biol Rev 67:277–301
Hernandez-Arriaga AM, Rubio-Lepe TS, Espinosa M, del Solar G (2009) Repressor CopG prevents access of RNA polymerase to promoter and actively dissociates open complexes. Nucl Acids Res 37:4799–4811
Khan SA (2005) Plasmid rolling-circle replication: highlights of two decades of research. Plasmid 53:126–136
Kramer MG, Espinosa M, Misra TK, Khan SA (1998) Lagging strand replication of rolling-circle plasmids: specific recognition of the ssoA-type origins in different gram-positive bacteria. Proc Natl Acad Sci U S A 95:10505–10510
Llosa M, Gomis-Rüth FX, Coll M, de la Cruz F (2002) Bacterial conjugation: a two-step mechanism for DNA transport. Mol Microbiol 45:1–8
López-Aguilar C, del Solar G (2013) Probing the sequence and structure of in vitro synthesized antisense and target RNAs from the replication control system of plasmid pMV158. Plasmid 70:94–103
López-Aguilar C, Ruiz-Masó JA, Rubio-Lepe TS, Sanz M, del Solar G (2013) Translation initiation of the replication initiator repB gene of promiscuous plasmid pMV158 is led by an extended non-SD sequence. Plasmid 70:69–77
Lorenzo-Díaz F, Espinosa M (2009) Lagging strand DNA replication origins are required for conjugal transfer of the promiscuous plasmid pMV158. J Bacteriol 191:720–727
Petrova P, Miteva V, Ruiz-Masó JA, del Solar G (2003) Structural and functional analysis of pt38, a 2.9 kb plasmid of Streptococcus thermophilus yogurt strain. Plasmid 50:176–189
Smith MCA, Thomas CD (2004) An accessory protein is required for relaxosome formation by small Staphylococcal plasmids. J Bacteriol 186:3363–3373
Smith CJ, Tribble GD, Bayley DP (1998) Genetic elements of Bacteroides species: a moving story. Plasmid 40:12–29

S

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Sanger – Dideoxy Chain Termination Sequencing ▶ Plant Genome Sequencing Methods

ScRad51 ▶ Rad51 and Dmc1 Recombinases

Secondary Chromosome ▶ Plasmids as Secondary Chromosomes

Secondary Structure Gabriel S. Brandt Franklin & Marshall College, Lancaster, PA, USA

Synonyms Amide bond; Amino acid monomer; Backbone; Main chain; Peptide bond; Polypeptide; Protein; Residue

Synopsis The structure of a protein is critical to its proper function. The proper function of proteins is critical to life itself. What kinds of structures can a polymer made of amino acids adopt? The possibilities are constrained, both by the identity of the amino acid side chains and also by the nature of the amide bond that connects them. Conventionally, the medium-scale structural unit of a polypeptide chain is referred to as its secondary structure. The three secondary structure elements are the alpha helix, the beta-sheet, and the turn. This review examines each type of secondary structure and touches on contemporary research questions regarding secondary structure.

Introduction The three-dimensional structure of proteins is a subject of truly fundamental importance to biochemistry (Petsko and Ringe 2004). Of the matter of which living things are made, the polymers known as proteins are generally responsible for the mechanics of life. The derivation of energy from the sun or from food, for example, is dependent on proteins. Each and every protein of defined structure acquires its three-dimensional structure autonomously, by folding up on its own, or with assistance from other proteins (Dill and MacCallum 2012). In talking about protein structure, there is a kind of Linnaean hierarchy that is used. Primary structure refers to the linear sequence of amino acids that are linked together to make the polypeptide chain. Secondary structure broadly refers to interactions in three-dimensional space that are mediated by the peptide backbone or main chain. The overall structure of a protein, the tertiary structure, reflects the disposition of the various secondary structure elements in space, mediated by interactions between both side chains and the main chain. Quaternary structure refers to interactions between protein molecules. This essay will first address the fundamental structure of polypeptides and the forces that lead to protein structure and then will individually treat the three elements of secondary structure: alpha helices, beta-sheets, and turns.

Proteins are Biopolymers

Secondary Structure, Fig. 1 The rudiments of polypeptide structure. (a) Three of the 22 possible genetically encoded amino acids, glycine, phenylalanine, and threonine, depicted in the form of line structures used by organic chemists. (b) The three-residue polypeptide resulting from the condensation of the three amino acids shown in panel (a). (c) Ball-and-stick representation of the structural coordinates corresponding to the sequence shown in panel (b), taken from the X-ray structure of a crystal of the acetylcholine binding protein from Aplysia californica (PDB ID: 2BR7). Note that, for clarity, hydrogen atoms are not shown. Carbon atoms are shown in black, oxygen in red, and nitrogen in blue. Covalent bonds are indicated by white rods. (d) Space-filling model of the polypeptide from panel (c), using the same coloring convention. Hydrogen atoms are shown in white

A protein molecule is a polymer (Kuriyan et al. 2012). It is made up of amino acid monomers covalently linked through amide bonds: the amino acids used by a cell’s protein-making machinery contain an amine at one end and a carboxylate at the other. These two functional groups are condensed, generally with the assistance of the ribosome, to form an amide, or peptide, bond (Fig. 1). The resulting carbon-nitrogen bond has significant double bond character, so rotation around it is restricted. In addition, the bond is polarized, with partial positive charge around the nitrogen and its attached hydrogen and partial negative charge around the oxygen and its lone pairs. The amino acids have unique side chains, with a variety of shapes and chemical properties, ranging from acids and bases to hydrocarbons. There are 22 amino acids currently known to be used in translation across all kingdoms of life.

The linear polypeptide chain is in constant motion. This dynamism leads the chain to actively twist and coil upon itself, sampling myriad possible conformations. A real-world analogy is difficult to frame, partly because it’s difficult to picture inanimate objects moving of their own accord. But, on the atomic scale, the world is in constant motion. As the protein chain writhes around, specific interactions permit different parts of the chain to stick together. If these interactions are stronger than random thermal motion, this sticking will persist, and the protein will adopt a kind of structure. One could imagine, perhaps, a blacksmith forging a chain by drawing individual links out of a bucket of 22 different magnetic shapes. If the resulting chain is stretched out, thrown into a barrel, and then violently shaken for a few minutes, complementary shapes that adventitiously come together could stick together, if their magnetic polarities match properly. Upon removing the chain from the barrel, local structures may be observed. A particularly fortuitous combination of links could even lead to a stable global structure or structures.

Factors that Influence the Structure of the Polypeptide Backbone

Three features of the polypeptide chain determine the kinds of three-dimensional structures that a protein may adopt (Anslyn and Dougherty 2006). First is the constraint of the limited flexibility of the amide connection between amino acids. Second is the charge distribution of the backbone, which gives rise to local electrostatic attraction and repulsion. Third is entropy, largely driven by the more numerous solvent molecules that rearrange in response to discrete protein structures.

Geometry of the Peptide Bond

The amide bond that connects individual amino acids in a polypeptide has two limiting geometries, the cis and trans conformations (Fig. 2). This geometry is defined by the dihedral described by the four atoms between and inclusive of the alpha carbons of two adjacent residues. By the usual convention, the dihedral is assigned a value of 180° for the trans case and 0° for the cis case. The trans conformation is more energetically favorable and represents the preferred equilibrium conformation. An amide bond in solution may isomerize between the two forms, but interconversion is fairly slow (approximately 0.3 min⁻¹) at physiological temperatures. The great majority of amide bonds in a protein are trans, but cis peptide bonds are not uncommon, particularly in tight turns. Between these two extremes, the angles actually observed in a given protein structure are described by a statistical mapping called a Ramachandran plot.

Electrostatic Interactions and Hydrogen Bonding

In addition to the geometrical constraints of the amide backbone, protein structure is guided by electrostatic interactions. The peptide bond comprises a hydrogen bond donor (the amide hydrogen) and a hydrogen bond acceptor (the carbonyl oxygen). The three canonical secondary structures observed in proteins are ones in which backbone hydrogen bonding is internally satisfied. The alpha helical arrangement perfectly matches hydrogen bond donors and acceptors within a stretch of amino acids. The beta-sheet arrangement offers the opportunity for complementary hydrogen bonding between two stretches of amino acids adjacent in space. Beta-sheets do not provide complete complementarity, the way alpha helices do, but rather satisfy every other hydrogen bonding position. Because of this, a given beta strand is free to interact with two other beta strands simultaneously. Parallel and antiparallel arrangements are possible.
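The peptide dihedral described under Geometry of the Peptide Bond can be computed directly from the coordinates of the four atoms involved (Cα, C, N, and Cα of two adjacent residues). The following Python sketch is illustrative only: the function names `dihedral` and `classify_peptide_bond`, the 30° tolerance, and the toy coordinates are assumptions made for this example, not values taken from the text or from any deposited structure. The sign convention assumed here gives 180° for trans and 0° for cis.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points in 3D space."""
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p1, p0)          # first bond vector
    b1 = sub(p2, p1)          # central bond vector (the rotation axis)
    b2 = sub(p3, p2)          # last bond vector
    n1 = cross(b0, b1)        # normal to the plane of atoms 0, 1, 2
    n2 = cross(b1, b2)        # normal to the plane of atoms 1, 2, 3
    b1_len = math.sqrt(dot(b1, b1))
    b1_unit = tuple(x / b1_len for x in b1)
    x = dot(n1, n2)
    y = dot(b1_unit, cross(n1, n2))
    return math.degrees(math.atan2(y, x))

def classify_peptide_bond(omega_deg, tolerance_deg=30.0):
    """Label a peptide dihedral as trans, cis, or twisted (tolerance is an assumption)."""
    if abs(abs(omega_deg) - 180.0) <= tolerance_deg:
        return "trans"
    if abs(omega_deg) <= tolerance_deg:
        return "cis"
    return "twisted"

# Toy coordinates (not experimental): the first and last atoms lie on
# opposite sides of the central bond, which is the trans arrangement.
omega = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(classify_peptide_bond(omega))
```

With real data, the four points would be the Cα, C, N, and Cα coordinates read from a structure file; parsing such a file is outside the scope of this sketch.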
Entropy and the Hydrophobic Effect

The third feature affecting the formation of protein secondary structure is entropy. The biological milieu of a protein is also fundamental to its structure, since water plays a critical role. Proteins would not have the structures they do, if it weren’t for water. The immiscibility of oil and water drives protein structure. Some of the amino acids are primarily hydrocarbons and thus hydrophobic. In the same way that oil droplets in water will coalesce, these hydrophobic side chains are driven together, a phenomenon referred to as the hydrophobic effect. Minimizing the hydrophobic surface area minimizes the number of water molecules forced into a relatively less fluid and more ordered state at the hydrophobic surface, offsetting the ordering inherent in bringing lipid droplets together. Within the cell membrane, a slightly different set of constraints operates, as the relatively fluid lipid molecules of the membrane act as a cosolvent with water.

Secondary Structure, Fig. 2 The rudiments of polypeptide structure: constraints of the amide linkage. (a) The lower-energy trans conformation of a representative dipeptide, shown as a line structure, with the atoms involved in the characteristic peptide dihedral in red. (b) The less favorable cis conformation of a representative dipeptide. (c) Ball-and-stick representation of a trans dipeptide. Inset shows the same model, rotated 90° to look down the peptide axis. Note that, for clarity, the amide hydrogen atoms are not shown. Carbon atoms are shown in black, oxygen in red, and nitrogen in blue. Covalent bonds are indicated by white rods. (d) Ball-and-stick representation of a cis dipeptide, highlighting the potential for steric clash, as both side chains are brought into apposition

Energetics and Structural Dynamics

Even for a fully folded, stable protein, the balance between electrostatic forces and entropy is precarious. It would seem naturally advantageous for the amide backbone to make energetically favorable hydrogen bonds with itself, but it must be remembered that the polypeptide is born in water, meaning that solvent molecules with excellent hydrogen bonding capability are always present to satisfy these bonding opportunities. For a given carbonyl, trading a hydrogen bond with a water molecule for a hydrogen bond with a nearby amide is essentially energetically neutral. Since this is true for all hydrogen bonding partners, the overall energetic impulse for a polypeptide to become structured is usually close to zero. The fact that the biosphere is made from structured proteins is largely a testament to the power of entropy.

Thus, for most proteins, the thermodynamics of folding are just barely on the side of structure, a state of affairs with extremely important biological consequences. For relatively little energy input, proteins can be de-structured, or denatured. This denaturation can be complete and essentially irreversible, as in the case of boiling an egg, or it can be partial. Partial denaturation can take many forms, from the breathing motion of a resting protein, to swapping of structural domains between adjacent proteins, to the formation of toxic semi-folded aggregates. All of these phenomena contribute to the dynamics of protein structure. It has been relatively recently appreciated that many diseases are accompanied by misfolding of proteins (Valastyan and Lindquist 2014).
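How narrow the margin between structure and denaturation is can be illustrated with a simple two-state model, in which a protein is either folded or unfolded and the populations follow the Boltzmann relation K = exp(−ΔG/RT). This Python sketch is an illustration under that assumption; the two-state model, the function name `fraction_folded`, the default temperature, and the free-energy values in the loop are all choices made for this example and are not taken from the text.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)

def fraction_folded(delta_g_fold_kj_mol, temperature_k=310.15):
    """Fraction of molecules folded in a two-state model.

    delta_g_fold_kj_mol is the free energy of folding (unfolded -> folded);
    negative values favor the folded state.
    """
    k_eq = math.exp(-delta_g_fold_kj_mol / (R_KJ_PER_MOL_K * temperature_k))
    return k_eq / (1.0 + k_eq)

# Illustrative values only: a modest change in free energy shifts the
# population from mostly folded to mostly unfolded.
for dg in (-20.0, -5.0, 0.0, 5.0):
    print(f"dG_fold = {dg:+6.1f} kJ/mol -> fraction folded = {fraction_folded(dg):.3f}")
```

Because the dependence on ΔG is exponential, a folding free energy close to zero leaves a substantial unfolded population, which is consistent with the partial denaturation phenomena described above.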

Elements of Protein Secondary Structure

The mechanics of how a polymer synthesized as a linear chain folds up into a defined three-dimensional structure are not clear (Neira 2013). Only some sequences of amino acids will fold up into a defined structure, but this essay concerns secondary structure and has focused on the general polypeptide chain, without particular attention to the identities of the individual amino acids. Even for this simplest case, though, the details of how folding occurs are not established. The individual steps of the process may require local secondary structures to form, establishing a structured core that nucleates the acquisition of further structure. The particular kinds of secondary structure that are observed in proteins are generally taken to be alpha helices, beta-sheets, and turns.

Alpha Helices

The alpha helix is not the most locally regular structure (Fig. 3). One turn of an idealized helix is not a whole number of residues, but rather 3.6, so the helix repeats exactly only after 18 amino acids (five turns). The side chains of every fourth amino acid are roughly aligned on one side of the helix, but the overlap is not perfect. An idealized helix makes one turn through 5.4 Å (0.54 nm), positioning the i and i + 4 side chains roughly 5 Å apart, usually too far to interact with each other. Alpha helices observed in proteins may deviate considerably from the standard dimensions. Some forms of helices are honored with their own names, such as the 3₁₀ and π helices. The 3₁₀ helix is a more tightly packed helix than the conventional alpha, with backbone hydrogen bonding between every third residue. The 3₁₀ is perhaps the most mathematically satisfying helix, with a repeat unit of three amino acids (each of which makes a 120° turn, of course) and a rise of 2 Å per residue. The π helix, on the other hand, is a more extended structure, with hydrogen bonding between every fifth amino acid.

The convention of polypeptide directionality is from the amino group toward the carboxylate, thus from the N-terminus of the protein toward the C-terminus. Under this convention, most observed alpha helices are right-handed. This handedness arises from the allowed angles that the amide backbone is able to adopt, combined with the chirality of the naturally occurring L-amino acids. Amino acids that are able to adopt atypical backbone angles, such as glycine and proline, can form left-handed helices under certain circumstances. The polyproline helix, of which both left- and right-handed versions are observed in nature, is an important biological recognition element used, for example, by Src-family kinases, such as the target of the anticancer drug Gleevec.

A further consequence of the directionality of the protein chain is that the individual dipoles of each dipeptide pair sum in the direction of the C-terminus, creating an overall dipole moment called the helical dipole. Perhaps a more general way of thinking about this dipole is to ask what the hydrogen bonding partners are for the first and last residues in a helix. Here, these hydrogen bonds are sometimes formed with the side chain of the adjacent residue. There are particular propensities for the kinds of residues that cap helices at the N- and C-termini. As it happens, these propensities are consistent with the expected charge of a helical dipole (residues with anionic side chains preferentially cap the presumably electropositive N-terminal helical residue, and basic residues predominate after the C-terminal helical residue).
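The idealized helix dimensions quoted above (3.6 residues per turn and 5.4 Å per turn) imply a rise of 1.5 Å and a rotation of 100° per residue. The Python sketch below works through that arithmetic; the function names are hypothetical and chosen only for illustration. It also shows why side chains at i and i + 4 are only roughly aligned (they are offset by 40° around the axis) and why the helix returns to exactly the same orientation only after a whole number of turns, which first happens at 18 residues and again at every multiple of 18, such as 36.

```python
from fractions import Fraction

RESIDUES_PER_TURN = Fraction(18, 5)   # 3.6 residues per turn (idealized helix)
PITCH_ANGSTROM = 5.4                  # distance along the axis per full turn

# Exact rotation about the helix axis contributed by each residue (100 degrees).
DEGREES_PER_RESIDUE = Fraction(360) / RESIDUES_PER_TURN

def rise_per_residue():
    """Translation along the helix axis per residue, in angstroms."""
    return PITCH_ANGSTROM / float(RESIDUES_PER_TURN)

def wheel_angle(offset):
    """Angular position (degrees, 0-360) of residue i + offset relative to residue i."""
    return float((offset * DEGREES_PER_RESIDUE) % 360)

def smallest_exact_repeat():
    """Smallest number of residues after which the helix orientation repeats exactly."""
    n = 1
    while (n * DEGREES_PER_RESIDUE) % 360 != 0:
        n += 1
    return n

print(rise_per_residue())        # angstroms per residue
print(wheel_angle(4))            # angular offset between residues i and i + 4
print(smallest_exact_repeat())   # residues per exact repeat
```

Exact fractions are used for the angular arithmetic so that the repeat length is found without floating-point rounding.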


Secondary Structure, Fig. 3 Structure of the alpha helix. (a) All-atom depictions of the transmembrane alpha helix (NVLLSAAINFFLIAFAVYFLVV) found in the mechanosensitive ion channel MscL from M. tuberculosis. Space-filling and ball-and-stick representations utilize experimental atom coordinates from the MscL crystal structure (PDB ID: 2OAR). Inset shows the same model, rotated 90° to look down the helix axis. Note that, for clarity, hydrogen atoms are not shown in the ball-and-stick model. (b) The alpha helix with amino acid side chains stripped away, showing the helical nature of the backbone. The translucent ribbon in the ball-and-stick model further illustrates the helix axis. (c) Diagram of the main chain, showing internal hydrogen bonding (springs) between N and C atoms of the peptide backbone. (d) Model of the MscL structure showing, in red, the location of the helix depicted in panel (a). Note the representation used in these Richardson diagrams of the alpha helices as schematic cylinders or tubes, where the arrow head indicates the C-terminus of the helix

Alpha helices are characterized by very local interactions, between neighboring residues in linear sequence. As a result, the alpha helix has often been proposed as a spontaneously forming structural element that might nucleate or template the acquisition of global structure by a linear polypeptide. There are proteins for which this seems to be the case, but there are also counterexamples. Of course, not every protein structure contains alpha helices at all, so no one is proposing that this is the only mechanism of structure acquisition. Certainly, the temporal role that secondary structure plays in the development of overall protein structure is an area of very active research (Neira 2013).

Beta-Sheets

The second canonical element of protein secondary structure is the beta-sheet. As with the helix, the beta-sheet is a structure that aligns the polypeptide backbone in such a way as to maximize mutual hydrogen bonding (Fig. 4). A big difference between the two is that the beta-sheet relies on hydrogen bonding not with adjacent residues in linear sequence, but with residues that are spatially proximal in the three-dimensional structure of the protein. Where the alpha helix presents a rod of electron density, the beta-sheet is a flatter structure, as the name implies. However, the two-dimensional representation gives a much flatter appearance than is warranted, as the Cα carbon of an amino acid is, of course, tetrahedral. Some authors prefer the term “pleated sheet” to reflect this.

As is readily apparent from Fig. 4, beta-sheets are perfectly possible between polypeptide chains in the same N-to-C orientation or in the opposite orientation. These possibilities are referred to as parallel or antiparallel beta-sheets. It will also be noted from this figure that the beta-sheet is not as efficient as the helix in maximizing hydrogen bonding among backbone atoms. This has extremely important structural consequences, in that the presence of unsatisfied hydrogen bonding partners means that a stretch, or strand, of polypeptide can form beta-sheet interactions with two other strands, one on each side. Thus, laterally extended beta-structures are extremely common. These extended beta-sheets can take many forms, the details of which are a subject of the study of tertiary structure.

The interstrand distance of an idealized, fully extended beta-sheet is 4.7 Å. The side chains of the beta strand appear on alternate faces of the sheet, staggered so that every other side chain is on the same face. The potential for side-chain interaction between strands of a beta-sheet is very great, with interdigitated side-chain hydrogen bonds potentially contributing to the strength of the structure. A role for beta-sheets in the formation of the fibrils associated with neurodegenerative disease has been recently confirmed (Knowles et al. 2014). The amyloid plaques associated with Alzheimer’s disease, for example, consist of an extended chain of Aβ protein fragments held together by beta-sheet interactions between strands on adjacent Aβ molecules.

Secondary Structure, Fig. 4 Structure of the beta-sheet. (a) Two strands of a parallel beta-sheet from the structure of chicken triosephosphate isomerase (PDB ID: 1TIM). Ball-and-stick representation of the coordinates corresponding to the backbone atoms of the polypeptide chain, with hydrogens omitted for clarity and hydrogen bonds indicated as springs. The schematic below illustrates the directionality of each strand. (b) Three antiparallel strands of a beta-sheet from the E. coli maltose transporter maltoporin (PDB ID: 1MAL). Note the ability of a strand to participate in two-strand interactions simultaneously. (c) All-atom side view of two of the strands from the antiparallel sheet in panel (b), showing the alternating distribution of side chains above and below the plane of the sheet. Note the pleated nature of the sheet and that hydrogen atoms are omitted for clarity. (d) Model of the maltoporin structure showing, in red, the location of the three strands depicted in panel (b). Note the representation used in Richardson diagrams of the beta-sheets as flattened ribbons, where the arrow head indicates the C-terminus of the sheet


Turns

The third canonical secondary structural element is the turn (Fig. 5). A turn is simply a region where the polypeptide backbone doubles back on itself. It may seem surprising that this topological necessity is described as a specific class of structure. Indeed, there are an infinite number of turns, even with the constrained geometry of the backbone. As with alpha helices and beta-sheets, a great deal of attention has been paid to internal hydrogen bonding. Generally, classes of turns have been established based on where these interactions are made, although it should be pointed out that not all turns have internal hydrogen bonds at all. There is an additional classification based on how tight the turn is, with various dihedral angles arranged into numerical classes. Thus, a β-turn might be class I or class II (all the way up to VIII), depending on the angles between adjacent residues in the turn. The most commonly observed turn is the β-turn, presumably because the flexibility of the backbone is such that it most easily accommodates two amino acids between the hydrogen-bonded pair.

As with secondary structural elements generally, research has focused on the extent to which formation of these elements nucleates or templates protein folding. Attempts to make synthetic peptide-like molecules have often focused on the turn as the minimal structural unit, and non-peptide analogs exist for these structures. A natural example of this is morphine, which has a molecular structure that likely allows it to mimic the β-turn of the opioid peptide met-enkephalin, allowing the plant-derived alkaloid to activate the peptide’s receptor.

Secondary Structure, Fig. 5 The structure of a variety of turns from the E. coli protein maltoporin. (a) An α-turn, where the turn is stabilized by a hydrogen bond (shown as a spring) between the i and i + 4 residues. (b) A β-turn, with hydrogen bonding between the i and i + 3 residues. (c) A γ-turn, with hydrogen bonding between i and i + 2. Note that this turn also contains a second hydrogen bond (light blue). (d) A beta hairpin, where the turn is further stabilized by hydrogen bonds between the antiparallel strands that it brings into contact. (e) Model of the maltoporin structure showing, in red, the location of the beta hairpin depicted in panel (d)
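The hydrogen-bond-based naming of turns given in the Fig. 5 caption (i to i + 2, i + 3, or i + 4) amounts to a simple lookup on the sequence separation of the bonded pair. The following Python sketch encodes only that rule; the function name `classify_turn` and the fallback labels are hypothetical, and the finer numerical classes based on dihedral angles (class I through VIII) are deliberately not attempted here.

```python
# Turn names keyed by the sequence separation of the hydrogen-bonded residues,
# as listed in the figure caption above.
TURN_BY_OFFSET = {2: "γ-turn", 3: "β-turn", 4: "α-turn"}

def classify_turn(i, j=None):
    """Name a turn from the indices of its hydrogen-bonded residues.

    i and j are residue positions in the sequence; j is None when the turn
    has no internal hydrogen bond, which the text notes is possible.
    """
    if j is None:
        return "no internal hydrogen bond"
    return TURN_BY_OFFSET.get(abs(j - i), "unclassified")

print(classify_turn(10, 13))  # hydrogen bond between residues i and i + 3
```

A real assignment would first have to detect hydrogen bonds from atomic coordinates, which is outside the scope of this sketch.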


Conclusion

On the one hand, secondary structure simply represents an artificial hierarchical designation that is used to describe protein structure. On the other hand, as secondary structure represents the minimal domain of a folded structure, it continues to attract a great deal of research attention. Attempts to make synthetic peptide-like molecules often focus on whether or not they can adopt helical or sheet structures (Tomasini et al. 2013). The big question of how proteins assume their final structure has directed attention to the role of secondary structure elements in templating structure. Does secondary structure form first? Finally, the role of specific secondary structures in protein pathologies has directed research toward ideas like developing inhibitors to the formation of extended beta-sheet structures as a way of warding off neurodegenerative diseases, such as Alzheimer’s disease.

Cross-References

▶ Chemical Denaturation

Secondary Structure by Circular Dichroism, Experimental Assessment of

Marina Ramirez-Alvarado
Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA

Synopsis

All proteins in nature have evolved, through selective pressure, to perform specific functions. These functions depend upon their three-dimensional structures, which arise from sequences of amino acids in the polypeptide chain that adopt secondary structures and interact through long-range contacts within the three-dimensional structure. In this entry, a review of the structure of proteins will be presented, as well as the spectroscopic and energetic basis of circular dichroism (CD), which is used extensively to assess the secondary structure of proteins.

References

Anslyn EV, Dougherty DA (2006) Modern physical organic chemistry. University Science Books, Sausalito
Dill KA, MacCallum JL (2012) The protein-folding problem, 50 years on. Science 338(6110):1042–1046
Knowles TPJ, Vendruscolo M, Dobson CM (2014) The amyloid state and its association with protein misfolding diseases. Nat Rev Mol Cell Biol 15(6):384–396
Kuriyan J, Konforti B, Wemmer D (2012) The molecules of life: physical and chemical principles. Garland Science
Neira J (ed) (2013) Archives of biochemistry and biophysics. Special issue. Protein Folding and Stability 531(1–2):1–135
Petsko GA, Ringe D (2004) Protein structure and function. New Science Press, London
Tomasini C, Huc I, Aitken DJ, Fülöp F (2013) Foldamers. Eur J Org Chem 2013(17):3408–3409
Valastyan JS, Lindquist S (2014) Mechanisms of protein-folding diseases at a glance. Dis Model Mech 7(1):9–14

Introduction

Polypeptide Structure

An amino acid is the basic building block of proteins. All amino acids have a central carbon atom (called Cα) to which a hydrogen atom, an amino group (NH2), and a carboxyl group (COOH) are attached. The major distinction between amino acids lies in the side chain attached to the Cα through the fourth valence. The four groups attached to the Cα atom are chemically different for all the amino acids except glycine, where two H atoms bind to the Cα atom. All amino acids (except glycine) are thus chiral molecules that can exist in two different forms: L- or D- forms. All amino acids in nature are L-amino acids. There are 20 natural amino acids that are the main component of all proteins, specified by the genetic code. Other side chains occur in rare cases as the products of enzymatic modifications after translation (Branden and Tooze 1999).

In proteins, amino acids are joined end to end when the carboxyl group of one amino acid condenses with the amino group of the next amino acid to eliminate water. This process is repeated as the chain elongates during protein translation. Peptide bond formation takes place by the reaction between the polypeptide chain bound to the peptidyl-tRNA in the P site of the ribosome and the amino acid carried by the aminoacyl-tRNA in the A site of the ribosome (Lewin 2000) (Fig. 1).

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 1 Peptide bonds in proteins. Schematic representation of the protein backbone. The amide group (CONH) is highlighted with a box. R represents side chains in amino acids. Labels in the diagram: amino (N) terminus, carboxyl (C) terminus, amide group; R groups define the 20 different amino acids

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 2 Ramachandran plot. This plot depicts the allowed regions of peptide bond dihedral angles Φ and Ψ. Most of the plot areas are not allowed. The different secondary structures are depicted in their known regions. Axes: Φ (degrees) and Ψ (degrees), each from −180 to +180; labeled regions: β-sheets, α-helix, left-handed α-helix

Secondary Structure Elements

The α-helix is the classic element of protein structure. It was first described by Pauling, Corey, and Branson in 1951 (Pauling et al. 1951). Pauling et al. predicted that the α-helix was a highly stable and energetically favorable structure in proteins.


Their prediction received strong experimental support from diffraction patterns of hemoglobin crystals obtained by Max Perutz in Cambridge in 1960. Alpha helices in proteins are formed when a stretch of consecutive residues all have the required dihedral angles, known as φ and ψ, of approximately −60° and −45°, corresponding to the allowed region in the bottom left quadrant of the Ramachandran plot. Significant deviations can be found in the protein database, especially for the first and the last helical turns (Jiménez et al. 1994; Muñoz and Serrano 1994) (Fig. 2). The α-helix has 3.6 residues per helical turn with hydrogen bonds between C′=O of residue i and NH of residue i + 4. All NH and C′=O groups are joined with hydrogen bonds except the first NH and the last C′=O groups at the ends of the α-helix. As a consequence, the ends of α-helices are polar and are almost always at the surface of protein molecules (Fig. 3).

In globular proteins, α-helices vary considerably in length, ranging from four or five amino acids to over 40 residues. The average length is around ten residues, corresponding to three turns. In theory, an α-helix can be either right handed or left handed depending on the screw direction of the polypeptide chain. However, left-handed helices are not allowed for L-amino acids due to the close steric constraints of the side chains and the C′=O group. Short regions of left-handed α-helices are observed in very few cases.

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 3 The α-helix structure. The N-terminus of the helix is depicted at the bottom and continues towards the top of the scheme. Hydrogen bonds are depicted as dashed magenta lines. Amino acid numbers are depicted next to each atom. Atoms are labeled along the peptide bond trace. The direction of the hydrogen bond network is depicted with an arrow, which is the opposite direction of the helix dipole moment (Adapted from Branden and Tooze (1999))

The second major structural element found in globular proteins is the β-sheet. This secondary structure is built up from a combination of several regions of the polypeptide chain, in contrast to the α-helix, which is formed by one continuous region. The building blocks of the β-sheets, the β-strands, are usually from five to ten residues long. These β-strands are in an almost purely extended conformation, with φ and ψ angles within the broad structurally allowed region in the upper left quadrant of the Ramachandran plot. The β-strands are aligned adjacent to each other such that hydrogen bonds can form between C′=O groups of one β-strand and NH groups on an adjacent β-strand and vice versa. The β-sheets that are formed from several β-strands are pleated with Cα atoms successively a little above and below the plane of the β-sheet. Side chains follow the above and below pattern such that within a β-strand, they point alternately above and below the β-sheet (Figs. 4 and 5).

The β-strands can interact in a parallel way in which the N- and C- termini of the two strands are running in the same direction, while antiparallel β-sheets have one β-strand running N- to C- and the other C- to N-. The hydrogen bonding pattern of each type of β-sheet is unique: antiparallel β-sheets have narrowly spaced hydrogen bond pairs that alternate with widely spaced pairs, while parallel β-sheets have evenly spaced hydrogen bonds that bridge the β-strands at an angle. It has been proposed that the nucleation event in the building of an antiparallel β-sheet is the formation of a β-hairpin, defined by a turn region flanked by two β-strands arranged in an antiparallel way.

Alpha helices are true local structural elements with interactions taking place solely between residues that are close to each other in the protein sequence.
In β-sheets, however, interactions are formed between residues from different strands that can be distant in the primary sequence or can even be part of different molecules, depending on whether the β-sheet is intra- or intermolecular. Therefore, the nonlocal context effects in β-sheets are important, as has been found in the analysis of β-sheet forming tendencies of amino acids (Minor and Kim 1994, 1996) and interstrand interactions in β-sheet residues (Smith and Regan 1995) in model protein systems.

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 4 Schematic representation of an antiparallel β-sheet. The hydrogen bonds between the backbone amide groups (CO and NH) are depicted as dashed lines. The direction for the polypeptide chain is shown in the arrows on the right (Modified from Ramírez-Alvarado et al. (1999))

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 5 Schematic representation of a parallel β-sheet. The hydrogen bonds between the backbone amide groups (CO and NH) are depicted as dashed lines. The direction for the polypeptide chain is shown in the arrows on the right

Almost all β-sheets show a right-handed twist in protein structures. Based on theoretical and empirical calculations, this right-handed twist is suggested to represent the conformation of lower free energy when compared with straight β-sheets or a left-handed twist (Chothia 1973).

Most protein structures are built up from combinations of secondary structure elements, α-helices and β-strands, connected by loop regions of various lengths and irregular shapes. A combination of secondary structure elements contributes to the stable core of the protein that is formed by amino acids with side chains that are apolar (hydrophobic). The loop regions are at the surface of the molecule. The main chain C′=O and NH groups of these loop regions form hydrogen bond networks with themselves and with water molecules. Loop regions are rich in charged and polar hydrophilic residues. Loop regions that connect two adjacent antiparallel β-strands are called β-turns or reverse turns. The two most frequently occurring turns are type I and type II turns. Long loop regions are often flexible and can frequently adopt several different conformations.

Circular Dichroism

Circular dichroism (CD) is a type of absorption spectroscopy that uses circularly polarized light. It is very sensitive to the conformations, chiralities, and environment of molecules. The CD signal arises from the interactions between light and the dipoles present in the molecule that are involved in its absorption. CD is especially sensitive to the relative orientations of these dipoles. CD measures the difference in absorption of left and right circularly polarized light by a molecule. A CD signal can be positive (D molecules) or negative (L molecules). When one of the two circularly polarized components is absorbed by the sample to a greater extent than the other, the resultant radiation is elliptically polarized, i.e., the resultant traces out an ellipse (Kelly and Price 1997). Mirror image molecules or enantiomers have CD spectra that are identical but of opposite sign. CD is exhibited not only by intrinsically chiral (optically active) chromophores but also by a chromophore that becomes effectively chiral by being covalently linked to a nearby chiral center or placed in an asymmetric environment.
For example, a polymer of achiral monomers can adopt a helical conformation similar to a DNA helix or a protein α-helix that can be either left handed or right handed. In other words, the helix is chiral. Helices that are identical except for their handedness will have CD signals of identical magnitude but opposite sign.


A single chromophore has only a minimal CD signal, but putting two or more chromophores in close proximity in defined orientations can generate a strong CD signal. Consequently, CD is widely used to characterize the conformations of proteins and nucleic acids. CD can give a reasonably accurate estimate of the fraction of residues in a globular protein that are in α-helical segments or other types of secondary structures. CD cannot determine how many α-helical segments exist in a protein or their location in the sequence. These limitations are outweighed by the sensitivity and ease of use of CD. The peaks or bands of CD spectra generally coincide with peaks of absorbance, although not all absorbance bands exhibit a CD signal. In the case of proteins and nucleic acids, CD is observed only in the ultraviolet (UV) spectral region, unless chromophores that absorb in the visible region are also present. Many CD instruments are calibrated in units of ellipticity, θ, which is an angular unit based on the minor and major axes of the resultant ellipse. One can calculate mean residue ellipticity (MRE) as a normalization of the signal by dividing ellipticity by the number of amino acid or nucleotide residues and taking into account the concentration and the path length used in the experiment. MRE is commonly used for the CD of DNA, RNA, and proteins in the far UV spectral region (Creighton 2010).

The peptide groups of the polypeptide backbone of proteins absorb strongly in the far UV spectral region. This polypeptide bond is chiral; therefore, the CD signal from the polypeptide backbone dominates the protein CD spectrum. The peptide (amide) group has three π centers (C, O, and N) and therefore three π orbitals. With four π electrons, there are two ππ* transitions: the well-known one near 190 nm (π0 → π*) and the π+ → π* transition, which lies at high energy and has not yet been identified (Woody 1996). In addition to the π orbitals, there are two lone pairs on the carbonyl oxygen. The highest energy lone pair (n orbital) is largely localized on the carbonyl oxygen in a nearly pure 2p orbital with its axis in the amide plane and perpendicular to the carbonyl bond. The other lone pair (n′) is at substantially lower energy and mixes much more strongly with the σ orbitals.
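The MRE normalization described above can be sketched in a few lines. This is a minimal illustration, assuming the commonly used convention in which the measured ellipticity is in millidegrees, the concentration in mol/L, and the path length in cm, so that the result is in deg cm² dmol⁻¹; the function name and the sample values are hypothetical and do not come from the text.

```python
def mean_residue_ellipticity(theta_mdeg, conc_molar, path_cm, n_residues):
    """Normalize a measured ellipticity to mean residue ellipticity.

    Assumed convention: theta in millidegrees, concentration in mol/L,
    path length in cm; the result is in deg cm^2 dmol^-1.
    """
    if conc_molar <= 0 or path_cm <= 0 or n_residues <= 0:
        raise ValueError("concentration, path length and residue count must be positive")
    # The factor 10 converts mdeg, mol/L and cm into deg cm^2 dmol^-1.
    return theta_mdeg / (10.0 * conc_molar * path_cm * n_residues)


# Hypothetical measurement: -30 mdeg at 222 nm, 10 uM protein,
# 0.1 cm cuvette, 100 residues.
mre = mean_residue_ellipticity(theta_mdeg=-30.0, conc_molar=10e-6,
                               path_cm=0.1, n_residues=100)
print(round(mre))
```

Dividing by the number of residues is what makes spectra of proteins of different lengths comparable on the same scale.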


Secondary Structure by Circular Dichroism, Experimental Assessment of

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 6 Molecular orbitals and electronic transitions of the amide group (from Fasman). Occupied orbitals are depicted in a dark color. The different electronic transitions are shown with arrows. [Figure labels: π orbitals (π+, π0, π*); lone pairs (n, n′); transitions π0π* and nπ*]

Secondary Structure by Circular Dichroism, Experimental Assessment of, Fig. 7 The electric dipole transition moment of the amide ππ* transition (μππ*), the magnetic dipole transition moment of the amide nπ* transition (mnπ*), and the upper plane of the orbitals of the amide bond. The lower plane has the same lobes but of opposite sign (Adapted from Fasman book and Schellman and Schellman 1964)

The wavelength of the amide nπ* transition is quite sensitive to solvent, ranging from 230 nm in apolar solvents to 210 nm in strong hydrogen bond donors. The nπ* transition for polypeptides usually lies between 215 and 222 nm. The nπ* transition at 220 nm absorbs light weakly but generates strong CD signals (Fig. 6). Both the absorbance and the CD signals are strong for the π0π* transition at 190 nm. The strong electric dipole moment of this ππ* transition means that transitions in neighboring peptide groups can interact with each other, giving rise to two or more absorbance or CD bands. This phenomenon is called exciton splitting and is most notable in α-helices (Fig. 7).

α-Helix CD Spectrum
The most distinctive and strongest CD spectrum belongs to α-helices. The typical spectrum has two negative bands (also referred to as minima) at 222 and 208 nm, plus a strong positive band (maximum) at 190 nm. The intensity of the band at 222 nm is used to quantify the amount of α-helix present in a protein. The long-wavelength negative band at 222 nm has been assigned to the nπ* transition based on calculations that give negative rotational strengths for this transition in a right-handed α-helix. The dominant mechanism by which the nπ* transition acquires its rotational strength is that of static field mixing with the ππ* transition in the same amide group.


The 208 nm negative band and the 190 nm positive band result from exciton splitting of the ππ* absorption band into a long-wavelength component polarized along the helical axis (208 nm) and a short-wavelength degenerate pair of bands polarized perpendicular to the helix axis (190 nm).

β-Sheets CD Spectrum
Proteins adopting β-sheet conformations have a characteristic CD spectrum with a negative band (minimum) around 217 nm and a positive band (also called maximum) between 195 and 200 nm. The negative band near 217 nm is assigned to the nπ* transition, whereas the positive band near 198 nm is assigned to ππ* exciton components. The nπ* transition is predicted to have negative rotational strength in both the antiparallel and the parallel β-sheets. In the β-sheet, the μ-m mechanism dominates the nπ*-ππ* mixing, in contrast to the α-helix. CD spectra of β-sheets are more variable than the spectra of α-helices because β-sheets present a more divergent, variable structure than α-helices. Much of this variation in the β-sheet spectra is probably due to the right-handed twist found in β-sheets. Theoretically, the absolute value of the ratio of the ellipticity at the positive maximum around 197 nm to that of the negative minimum at 217 nm should increase with increased twisting of the β-sheet. This ratio should be greater for parallel than for antiparallel twisted β-sheets. Exciton splitting occurs in β-sheet polypeptides, although it is less dramatic than in the α-helix. In the ideal planar antiparallel β-sheet, there are four residues per unit cell, and so four exciton bands are predicted. Of these bands, transitions to one exciton level are forbidden both electronically and magnetically.
A transition to each of the remaining three levels gives rise to an electric dipole transition moment polarized along one of the three unit cell directions: along the chain direction (y-axis); in-plane and perpendicular to the chain direction, along the hydrogen bonds (x-axis); and normal to the plane (z-axis). Interactions among the transition moments cause these exciton bands to occur at increasing energy in the order given. The exciton component with the strongest absorption is that directed along the H-bond direction. The position of this component depends on the width of the sheet: it is strongly shifted to the blue (shorter wavelengths) in one- and two-stranded β-sheets but shifts to the red (longer wavelengths) with increasing numbers of β-strands. In wide β-sheets, this component is shifted to slightly longer wavelengths from the monomer position, in agreement with observations of λmax for β-sheets in high molecular weight polypeptides near 195 nm. In the case of the ideal parallel β-sheets, the unit cell consists of only two residues, so only two exciton components are predicted. The low energy exciton component is polarized along the chain direction and is predicted to have a positive rotational strength, and the high energy component is polarized in the plane perpendicular to the chain direction with a negative rotational strength. Parallel β-sheets, just like antiparallel β-sheets, are predicted to have a positive exciton couplet centered near 190 nm in circular dichroism.

Qualitatively, the CD spectra for both types of β-sheets are very similar: a negative nπ* band and a positive ππ* exciton couplet. The most useful criterion for distinguishing between the two types of β-sheets is the difference in λmax for CD and absorption. The difference in λmax is predicted to be 5 nm for wide antiparallel β-sheets and 13 nm for parallel β-sheets. The effect of the right-handed twist observed in β-sheets has been explored by several researchers using calculations. Twisting leads to a stronger nπ* band and to a strong positive couplet in the ππ* region. The strong positive ππ* couplets originate in the twisting within the individual strands. The strand contributions are additive for parallel β-sheets, but there is some destructive interference between adjacent strands in the antiparallel β-sheets. The results from these calculations suggest a criterion for the degree of twist in β-sheets.
Weakly twisted sheets have nπ* and ππ* maxima of approximately equal magnitude, whereas for strongly twisted sheets the ππ* band near 200 nm is much stronger than the nπ* band.

β-Turns CD Spectra
The CD characteristics of β-turns vary, as do their structures. Type I turns have an α-helix-like CD spectrum, with negative bands around 220 and 210 nm and a positive band around 190 nm. Type II turns, in contrast, have CD spectra that resemble the spectrum of a β-sheet, where the negative band is found between 220 and 225 nm and the positive band is shifted 5–10 nm from a β-sheet spectrum (between 200 and 210 nm).

The CD spectra of folded proteins of known structure are generally similar to those expected from their content of secondary structure. CD spectra are therefore widely used as a quick and easy method to estimate the secondary structure content of proteins of unknown structure. Protein CD data can be analyzed initially by fitting the CD spectrum to a linear combination of the spectra of model polypeptides for all secondary structure components. This secondary structure estimation is generally not satisfactory, suggesting that the CD spectra of globular proteins are too complex to be represented by the individual secondary components. This is not entirely surprising because the secondary structure elements found in proteins are generally not as long as those of model polypeptides. A number of more sophisticated methods have been developed to estimate the secondary structure content of proteins of unknown structure. CD spectra are deconvoluted to determine the CD contributions of each secondary structure. Deconvolution methods provide useful estimates of the amount of α-helix, β-sheet, β-turn, and unordered structure and make it possible to assign a protein to the general broad classification of all-α, all-β, α + β, or α/β proteins. These methods are usually improved by extending the analysis to shorter wavelengths, by including more proteins in their datasets, and by weighting them flexibly, so the intrinsic limitations of this approach are not yet being reached.
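The initial analysis mentioned above, fitting a spectrum to a linear combination of model spectra, can be illustrated with an ordinary least-squares fit. The sketch below is only a toy: the basis values in `BASIS` are invented placeholders that mimic the band signs described in this entry, not measured reference spectra, and the function names are hypothetical. Real deconvolution programs use curated reference sets and constrained fitting.

```python
# Placeholder "spectra" at five wavelengths (nm). The numbers are invented
# for illustration; they only reproduce the band signs described in the text.
WAVELENGTHS = [190, 200, 208, 217, 222]
BASIS = {
    "helix": [70.0, 15.0, -33.0, -32.0, -35.0],
    "sheet": [20.0, 30.0, -5.0, -18.0, -12.0],
    "other": [-30.0, -20.0, -8.0, 2.0, 1.0],
}


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("basis spectra are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_fractions(spectrum, basis):
    """Unconstrained least-squares weights for each basis spectrum."""
    names = list(basis)
    # Normal equations: (A^T A) f = A^T y, with basis spectra as columns of A.
    ata = [[sum(p * q for p, q in zip(basis[i], basis[j])) for j in names]
           for i in names]
    aty = [sum(p * q for p, q in zip(basis[i], spectrum)) for i in names]
    return dict(zip(names, solve_linear_system(ata, aty)))


# Build a synthetic spectrum from known fractions and recover them.
true_fractions = {"helix": 0.5, "sheet": 0.3, "other": 0.2}
synthetic = [sum(true_fractions[k] * BASIS[k][i] for k in BASIS)
             for i in range(len(WAVELENGTHS))]
estimated = fit_fractions(synthetic, BASIS)
```

The fit recovers the fractions exactly here only because the synthetic spectrum was built from the same basis; with real protein spectra the mismatch between model polypeptides and globular proteins is what makes this simple approach unsatisfactory.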

References
Branden C, Tooze J (1999) Introduction to protein structure, 2nd edn. Garland Press, New York, NY. ISBN 0-8153-2305-0
Chothia C (1973) Conformation of twisted beta-pleated sheets in proteins. J Mol Biol 75:295–302
Creighton TE (2010) The physical and chemical basis of molecular biology. Helvetian Press. ISBN 0-95647810-8. Distributed by Gardners Books, East Sussex, UK
Jiménez MA, Muñoz V, Rico M, Serrano L (1994) Helix stop and start signals in peptides and proteins. The capping box does not necessarily prevent helix elongation. J Mol Biol 242:487–496
Kelly SM, Price NC (1997) The application of circular dichroism to studies of protein folding and unfolding. Biochim Biophys Acta 1338:161–185
Lewin B (2000) Genes VII. Oxford University Press, Oxford. ISBN 0-19-879276-X
Minor DL Jr, Kim PS (1994) Context is a major determinant of beta-sheet propensity. Nature 371:264–267
Minor DL Jr, Kim PS (1996) Context-dependent secondary structure formation of a designed protein sequence. Nature 380:730–734
Muñoz V, Serrano L (1994) Intrinsic secondary structure propensities of the amino acids, using statistical phi-psi matrices: comparison with experimental scales. Proteins 20(4):301–311
Pauling L, Corey RB, Branson HR (1951) The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci U S A 37:205–211
Ramírez-Alvarado M, Kortemme T, Blanco FJ, Serrano L (1999) Beta-hairpin and beta-sheet formation in designed linear peptides. Bioorg Med Chem 7(1):93–103
Schellman JA, Schellman C (1964) Conformation of polypeptide chains in proteins. In: Neurath H (ed) Proteins, vol 2, 2nd edn. Academic Press, Waltham, Massachusetts, pp 1–137
Smith CK, Regan L (1995) Guidelines for protein design: the energetics of beta sheet side chain interactions. Science 270:980–982
Woody RW (1996) Circular dichroism and the conformational analysis of biomolecules. In: Fasman GD (ed) Theory of circular dichroism of proteins. Plenum Publishing, New York. ISBN 0-306-45142-5

Secondary Structure, Theoretical Aspects of
Chanaka Mendis
Department of Chemistry, University of Wisconsin Platteville, Platteville, WI, USA

Synopsis
Defined as the local spatial arrangement of the backbone atoms of a polypeptide chain, the secondary structure of a protein is often categorized as alpha (α)-helices, beta (β)-sheets, and others. Each category of secondary structure has a unique shape and characteristics due to its distinctive hydrogen bonding pattern and dihedral angles. In α-helices, hydrogen bonding between the oxygen atom of a backbone carbonyl group and the nitrogen atom of an amino group located four amino acids away along the backbone results in a helical conformation, while in β-sheets these two atoms can be located in the same polypeptide chain or in two adjacent chains running in a parallel or antiparallel direction (having the same or opposite amino-to-carboxyl arrangements, respectively), resulting in a more extended zigzag structure. Any irregularities in α-helices and β-sheets result in the formation of other secondary structures such as 3₁₀-helices and β-bulges. Complexity in the structure and folding of proteins also causes many secondary structures to combine and form supersecondary structures such as βαβ, αα, and β-meander. The dihedral angles phi (φ) and psi (ψ) designate rotations about the N–Cα and Cα–C bonds, respectively, and take characteristic repeating values. The α-helix dihedral angles are φ (−57°) and ψ (−47°), while the β-sheet dihedral angles are φ (−119°)/ψ (113°) and φ (−139°)/ψ (135°) for parallel and antiparallel sheets, respectively.

Introduction
Introduced in 1952 as an integral component of the protein structure organization hierarchy, the definition of the secondary structure of a protein has evolved and been redefined multiple times in the last few decades (Linderstrøm-Lang 1952). Dependent somewhat on the primary sequence, secondary structures play a significant role in the functions and the characteristics of many types of proteins. Experiments with the protein ribonuclease A (RNase A) provided strong evidence of the correlation between shape (in aqueous solution) and sequence (Anfinsen 1973). Influenced even more by the characteristics of hydrogen bonds between backbone atoms and by the pattern of backbone dihedral angles, a secondary structure can be defined as the local spatial arrangement of main-chain atoms without regard to the conformation of side chains or to other segments. Frequently, secondary structures are categorized as α-helices, β-sheets, and others, or as α-helices, β-sheets, β-turns, and others. Yet each type of secondary structure has a set of unique features in the form of dihedral angles, distances between amino acids (AA), and even, to a certain extent, type of AA. Similar to secondary structures are the supersecondary structures, or structural motifs, such as coiled coils, β-hairpins, α-helix hairpins, and βαβ hairpins, which differ in geometric arrangement because they combine multiple secondary structures. Even though the secondary structure of a protein can also be determined by multiple membrane-spanning regions of AA and by the solvent accessibility of individual amino acids, these topics will not be explored here. Here an attempt will be made to better understand how secondary structures are determined by localized features that contribute to minimizing the potential energy function, such as hydrogen bonding patterns and dihedral angles, how secondary structures can be predicted by computational methods with almost 75% accuracy, and how certain types of secondary structure are assigned using the Dictionary of Protein Secondary Structure (DSSP) (Singh 2005; Kabsch and Sander 1983).

Phi (φ) and Psi (ψ) Bond Angles
A peptide backbone can be visualized as a series of rigid planes that limit the range of possible conformations because rotation occurs only at the common point Cα, specifically about the N–Cα and the Cα–C bonds (Fig. 1). These limited peptide conformations are defined by dihedral (torsion) angles called φ and ψ, each of which is defined by three bond vectors connecting four atoms in the peptide backbone. In the case of dihedral angle φ, rotation occurs between the N and Cα atoms in the C–N–Cα–C sequence of atoms, and in dihedral angle ψ, rotation occurs between the Cα and C atoms in the N–Cα–C–N sequence. When a segment of AA retains identical dihedral angles throughout, a specific secondary structure such as an α-helix or a β-sheet is attained. If the dihedral angles vary throughout the segment, then the segment is said to have a random coil or undefined secondary structure. Critical in defining the percentage of a specific type of secondary structure, dihedral angles depend on the hydrogen bonding pattern and, indirectly, on the type of AA. Together, these factors minimize the potential energy and influence the formation of a unique secondary structure.

[Figure: peptide bond geometry with labeled bond lengths 0.152, 0.133, 0.145, 0.123, and 0.1 nm and bond angles 121.1°, 123.2°, 121.9°, 119.5°, 115.6°, and 118.2°]

Secondary Structure, Theoretical Aspects of, Fig. 1 The peptide bond is shown in its usual trans conformation of carbonyl O and amide H. The Cα atoms are the α-carbons of two adjacent amino acids joined in peptide linkage. The dimensions and angles are the average values observed by crystallographic analysis of amino acids and small peptides. The peptide bond is the light gray bond between C and N (Ramachandran et al. 1974)
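Since each of φ and ψ is defined by three bond vectors connecting four backbone atoms, both can be computed from atomic coordinates with the same routine. The following is a minimal sketch using the standard atan2 formulation of a torsion angle; the function names and the test coordinates are illustrative assumptions and are not taken from this entry.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)          # first bond vector
    b2 = _sub(p2, p1)          # central bond (rotation axis)
    b3 = _sub(p3, p2)          # third bond vector
    n1 = _cross(b1, b2)        # normal of the plane p0, p1, p2
    n2 = _cross(b2, b3)        # normal of the plane p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / b2_len
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """phi: C(i-1), N(i), CA(i), C(i)."""
    return dihedral_degrees(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """psi: N(i), CA(i), C(i), N(i+1)."""
    return dihedral_degrees(n, ca, c, n_next)


# Idealized test geometry: the fourth atom is rotated by +90 degrees.
angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Applying `phi` and `psi` to every residue of a structure gives the pairs of angles that are plotted in a Ramachandran diagram.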

Ramachandran Plot
The Ramachandran diagram has been further improved by utilizing tens of thousands of residues from high-resolution protein structures deposited in the Protein Data Bank (PDB) and determined by X-ray crystallography, as illustrated in Fig. 2 (Ramachandran et al. 1963; Lovell et al. 2003). In this improved analysis, the geometry around the Cα is validated with a new measure, the Cβ deviation. This measure is useful because the Cβ deviation is sensitive to incompatibilities between side chain and backbone caused by multiple factors, including misfit conformations. The updated dihedral angle plot utilizes close to 80,000 non-glycine, non-proline, and non-preproline AA from 500 high-resolution proteins that show sharp secondary structural characteristics.

Secondary Structure, Theoretical Aspects of, Fig. 2 The plot is generated by using data points from a large set of high-resolution structures and contours for favored and for allowed conformational regions for the general case (all amino acids except Gly, Pro, and prePro) (Lovell et al. 2003)

Hydrogen Bonds
Hydrogen bonds are formed between backbone atoms of amino acids and play a crucial role in contributing to the localized order and structure of proteins. Hydrogen bonds are non-covalent attractions between a hydrogen atom attached to an electronegative atom of one group and an electronegative atom (nitrogen [N], oxygen [O], or sulfur [S]) of another group. These bonds are formed to avoid the high energy cost associated with breaking hydrogen bonds between water molecules and the polar backbone of AA. Forming hydrogen bonds pairs up the polar groups of buried backbone atoms, allowing the protein to achieve a fixed conformation. Even though hydrogen bonds can be defined using different bond angles and distances (Baker and Hubbard 1984), Coulomb bond energy calculations have largely taken precedence over other definitions. For example, protein secondary structure is assigned in the Dictionary of Protein Secondary Structure (DSSP) by calculating the electrostatic attraction as shown in Fig. 3 (Gu and Bourne 2011), and DSSP uses single-letter codes to describe the possible secondary structures (Fig. 4).


DSSP is based solely on backbone–backbone hydrogen bonds, which are counted when the bond energy is below −0.5 kcal/mol in the Coulomb hydrogen bond calculation (Fig. 3). The eight-state DSSP code is further simplified to three states (helix, sheet, and coil) in the majority of subsequent secondary structure prediction methods.

DSSP Assignment and Secondary Structure Predictions
The DSSP program was designed to standardize secondary structure assignments and currently functions as a tool to assign mainly α-helix and β-sheet structures (Kabsch and Sander 1983). According to the DSSP assignment, each type of helix (Fig. 4) must have at least two consecutive hydrogen bonds in the helix. When a segment begins with two consecutive amino acids that have hydrogen bonds between i and i + 4 AA and ends likewise with two consecutive hydrogen bonds between i and i + 4 AA, it is assigned an α-helix structure (H). Assignments given to 3₁₀-helices (G) and π-helices (I) follow a similar pattern, with hydrogen bonding between i and i + 3 AA and between i and i + 5 AA, respectively. Antiparallel and parallel β-sheets (E) are defined as either having two hydrogen bonds in the sheet or being surrounded by two hydrogen bonds in the sheet. A β-sheet-like structure consisting of only two AAs at each partner segment is labelled as a β-bridge (B). The remaining two DSSP states, S and (space), indicate a bend in the chain and unassigned/other, respectively (Gu and Bourne 2011).

Since the 1970s, various methods have been utilized to investigate proteins with known secondary structures, to determine which amino acids (or amino acid combinations) most frequently favor a particular secondary structure, and to use this statistical information to predict the secondary structure of newly sequenced proteins. Currently, for any soluble protein, the secondary structure at any amino acid can be predicted with about 75% accuracy. Early approaches to secondary structure prediction were based on single AA statistics and properties and were limited by the small number of proteins with solved structures. These factors limited prediction accuracy to about 50–60% (Chou and Fasman 1974).

$$E = f\,\delta^{+}\delta^{-}\left(\frac{1}{r_{NO}} + \frac{1}{r_{HC'}} - \frac{1}{r_{HO}} - \frac{1}{r_{NC'}}\right)$$

Secondary Structure, Theoretical Aspects of, Fig. 3 Coulomb hydrogen bond calculation used by DSSP. E is the Coulomb energy, f = 332 Å kcal/(e² mol) is the dimensional factor, and δ+ and δ− are the polar charges given in units of the elementary electron charge e. The H atom position is not given and requires extrapolation, which assumes planar geometry for the peptide bond. A cutoff level has been set for the weakest acceptable hydrogen bond, so that the resulting energy is bounded by E < −0.5 kcal/mol in DSSP (Gu and Bourne 2011)
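The Coulomb criterion of Fig. 3 is straightforward to evaluate once the four atom positions are known. The sketch below is a minimal illustration rather than the DSSP program itself. The partial charges 0.42 e and 0.20 e are the values given in the original DSSP paper (Kabsch and Sander 1983) and are not stated in this entry; the function names and the idealized coordinates are hypothetical.

```python
import math

F = 332.0             # dimensional factor from Fig. 3, in A kcal/(e^2 mol)
DELTA_CO = 0.42       # partial charge on C=O in e (Kabsch and Sander 1983)
DELTA_NH = 0.20       # partial charge on N-H in e (Kabsch and Sander 1983)
CUTOFF = -0.5         # kcal/mol, weakest accepted hydrogen bond


def hbond_energy(n, h, c, o):
    """Coulomb energy (kcal/mol) between a donor N-H and an acceptor C=O.

    All arguments are (x, y, z) coordinates in angstroms.
    """
    r_no = math.dist(n, o)
    r_hc = math.dist(h, c)
    r_ho = math.dist(h, o)
    r_nc = math.dist(n, c)
    return F * DELTA_CO * DELTA_NH * (1 / r_no + 1 / r_hc - 1 / r_ho - 1 / r_nc)


def is_hbond(n, h, c, o):
    """True when the energy is below the DSSP cutoff."""
    return hbond_energy(n, h, c, o) < CUTOFF


# Idealized linear arrangement C=O ... H-N with an O...H distance of 2.0 A.
c_atom, o_atom = (-1.23, 0.0, 0.0), (0.0, 0.0, 0.0)
h_atom, n_atom = (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)
energy = hbond_energy(n_atom, h_atom, c_atom, o_atom)
```

For this idealized geometry the energy comes out at roughly −2.6 kcal/mol, well below the cutoff, whereas moving the donor about 10 Å away brings the energy close to zero and the pair is no longer counted.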

G = 3-turn helix (3₁₀ helix). Minimum length of 3 AA.
H = 4-turn helix (α helix). Minimum length of 4 AA.
I = 5-turn helix (π helix). Minimum length of 5 AA.
T = hydrogen bonded turn (3, 4, or 5 turn).
E = extended strand in parallel and/or antiparallel β-sheet conformation. Minimum length of 2 AA.
B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation).
S = bend (the only non-hydrogen-bond based assignment).

Secondary Structure, Theoretical Aspects of, Fig. 4 Amino acid residues which are not in any of the above conformations are assigned as the eighth type “Coil”: often codified as “ ” (space), C (coil), or “-” (dash). The helices (G, H, and I) and sheet conformations are all required to have a reasonable length. This means that two adjacent residues in the primary structure must form the same hydrogen bonding pattern. If the helix or sheet hydrogen bonding pattern is too short, they are designated as T or B, respectively. Other protein secondary structure assignment categories exist (sharp turns, Omega loops, etc.), but they are less frequently utilized
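The reduction of the eight DSSP states to three states (helix, sheet, and coil) can be written as a simple lookup. The text does not specify which codes map to which class, so the grouping below (G, H, I to helix; E, B to sheet; everything else to coil) is an assumption that follows one commonly used convention; other reductions exist, and the function name is hypothetical.

```python
# Assumed grouping: one common convention, not specified in the text.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and isolated bridge
    "T": "C", "S": "C",             # turn and bend
    " ": "C", "-": "C", "C": "C",   # coil / unassigned
}


def reduce_to_three_state(dssp_string):
    """Map a string of DSSP codes to the three states H, E and C."""
    try:
        return "".join(EIGHT_TO_THREE[code] for code in dssp_string)
    except KeyError as err:
        raise ValueError(f"unknown DSSP code: {err.args[0]!r}") from None


reduced = reduce_to_three_state("HGIEBTS -")
```

Because the choice of grouping changes the apparent helix and sheet content, any comparison of prediction accuracies should use the same reduction for predicted and observed states.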


Recent developments in protein secondary structure prediction have been aided tremendously by the large amount of available protein sequence data and further improved by better remote homology detection (e.g., using PSI-BLAST or hidden Markov models). PSI-BLAST is an iterative database searching method that uses homologues found in one iteration to build a profile used for searching in the next iteration. Even though existing approaches do not take into consideration the effect of long-range interactions on secondary structure prediction, it is clear that protein secondary structure is greatly influenced by both short-range and long-range interactions (Aloy et al. 2003). A further complication is that α-helices are predicted 9.5% more accurately than β-strands.
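Accuracy figures such as those quoted above are usually computed per residue by comparing predicted and observed three-state strings. The sketch below assumes that definition (often called Q3, a name not used in this entry) and adds a per-state variant to show how helix and strand accuracy can differ; the function names and example strings are made up for illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)


def per_state_accuracy(predicted, observed, state):
    """Fraction of residues observed in `state` that were predicted as `state`."""
    positions = [i for i, o in enumerate(observed) if o == state]
    if not positions:
        return None
    return sum(1 for i in positions if predicted[i] == state) / len(positions)


# Hypothetical ten-residue example.
observed = "HHHHEEEECC"
predicted = "HHHHEECCCC"
overall = q3_accuracy(predicted, observed)
helix_acc = per_state_accuracy(predicted, observed, "H")
strand_acc = per_state_accuracy(predicted, observed, "E")
```

In this toy example all helical residues are recovered while half of the strand residues are missed, which is the kind of imbalance the text describes.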

Characteristics of Secondary Structures
Specific characteristics of the hydrogen bonds formed between backbone atoms of AAs are useful tools in differentiating one secondary structure from another. For example, in α-helices, hydrogen bonds are seen between i and i + 4 AA, resulting in a turn of 3.6 AA that is about 5.4 Å in length. Forming these hydrogen bonds between a carbonyl oxygen (partial negative charge) and an amino hydrogen (partial positive charge) 4 AAs further along the chain stabilizes the 13 atoms per loop in an α-helical structure. These α-helices are sometimes referred to as 3.6₁₃-helices and show a counterclockwise rotation (right handed). Even though under physiological conditions formation of both left- and right-handed helices is theoretically possible, formation of left-handed helices is improbable for L-AA due to the steric hindrance between the β-carbon and the carbonyl oxygen. These secondary structures are also arranged so that all side groups of AA (R-groups) point outward, and they have dihedral angles of φ = −57°, ψ = −47°. The number of AA in an α-helix can range from four or five to a couple of hundred.
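The two helix parameters quoted above fix the per-residue geometry. As a short worked calculation using only the values from the text (3.6 AA per turn and about 5.4 Å per turn):

```latex
\[
\text{rise per residue} = \frac{5.4\ \text{\AA}}{3.6} = 1.5\ \text{\AA},
\qquad
\text{rotation per residue} = \frac{360^{\circ}}{3.6} = 100^{\circ}
\]
```

A helix of 20 AA would therefore be roughly 20 × 1.5 Å = 30 Å long under these idealized values.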


On the other hand, β-strands can contain anywhere between 5 and 100 AAs and form β-sheets through hydrogen bonds with other β-strands. These β-strands are normally quite distant in sequence and can have parallel or antiparallel arrangements. Unlike α-helices, β-strands are more extended, contain only 2 AA per turn, and have dihedral angles of −119° (φ)/113° (ψ) and −139° (φ)/135° (ψ) for parallel and antiparallel sheets, respectively; they rarely form hydrogen bonds between carbonyl oxygens and amino hydrogens of neighboring AAs. The side groups of AA (R-groups) point above and below the plane of the sheet, which sometimes results in the formation of hydrogen bonds and ionic bonds between the outward-pointing side groups. β-sheets are rarely flat; they are twisted, with the peptide bond planes of different strands forming a right-handed coil, and large antiparallel sheets may sometimes form structures referred to as β-barrels.

There are also remarkable differences in the types of AA that favor each of the above two secondary structures, even though the side chains of AA do not play a direct role in the formation of the hydrogen bonds. This preference is twofold: first, the effect of an R-group on the ability of the neighboring main-chain atoms to adopt given φ/ψ angles, and second, the effect of an R-group on the R-groups of adjacent AAs. For example, proline rarely occurs in α-helices because its nitrogen atom cannot assume the required φ angle; instead it introduces a destabilizing kink in an α-helix. Bulky R-groups can prevent the tight packing of AAs required for the formation of α-helices, while α-helices are further stabilized when R-groups with opposite charges are in close proximity. The α-helical conformation permits the atoms that form hydrogen bonds with each other to adopt a linear arrangement, which increases stability.

Cross-References ▶ Predictions from Sequence ▶ Primary Structure


References
Aloy P, Stark A, Hadley C, Russell R (2003) Prediction without templates: new folds, secondary structure, and contacts in CASP5. Protein Struct Funct Bioinform 53:436–456
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181:223–230
Baker EN, Hubbard RE (1984) Hydrogen bonding in globular proteins. Prog Biophys Mol Biol 44:97–179
Chou P, Fasman G (1974) Prediction of protein conformation. Biochemistry 13:222–245
Gu J, Bourne PE (2011) Structural bioinformatics. Wiley and Sons, New York, NY
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637
Linderstrøm-Lang KU (1952) Proteins and enzymes. Lane medical lectures, Stanford University publications, University series, medical sciences, vol 6. Stanford University Press, Stanford, CA
Lovell SC, Davis IW, Arendall WB III, de Bakker PIW, Word JM, Prisant MG, Richardson JS, Richardson DC (2003) Structure validation by Cα geometry: φ, ψ and Cβ deviation. Proteins 50:437–450
Ramachandran GN, Ramakrishnan C, Sasisekharan V (1963) Stereochemistry of polypeptide chain configurations. J Mol Biol 7:95–99
Ramachandran GN, Kolaskar AS, Ramakrishnan C, Sasisekharan V (1974) The mean geometry of the peptide unit from crystal structure data. Biochim Biophys Acta 359:298–302
Singh M (2005) Handbook of computational molecular biology, Chapman & Hall CRC computer and information science series. Chapman & Hall/CRC, Boca Raton

Selection with Antibiotics Douglas A. Julin Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, USA

Definition
Growth of cells in the presence of an antibiotic is an efficient way to select a small number of cells that contain a plasmid from among a much larger number that do not have the plasmid. Antibiotics kill cells by inhibiting cellular components that are necessary for vital processes. Plasmid cloning vectors generally contain a gene that encodes a factor that makes a cell containing that plasmid resistant to an antibiotic and thus able to grow in its presence. A culture of cells, some of which contain the plasmid, can be spread on an agar plate that contains the antibiotic. Cells that contain the plasmid are able to grow and form a colony on the plate, while cells that do not contain the plasmid are unable to grow.

Discussion
Antibiotics slow the growth of or kill bacteria by inhibiting a variety of cellular processes, including cell wall biosynthesis, protein synthesis, DNA replication, and transcription. The antibiotics used most commonly for recombinant DNA work in E. coli are listed in Table 1 (Sambrook et al. 1989; Walsh 2003), and their structures are shown in Fig. 1 (Walsh 2003). Table 1 also lists the cellular targets of these antibiotics, the resistance determinants carried on plasmid vectors, and the resistance mechanisms.

The antibiotics used in the laboratory are secondary metabolites produced naturally by microorganisms. Humans use antibiotics to kill bacteria, both as selection tools in the laboratory and to treat and cure disease. It is often assumed that microorganisms in the natural environment produce antibiotics for the same reason, as a way to prevent the growth of competitor organisms (Walsh 2003; Martinez 2008). However, antibiotics may serve different functions when present at natural concentrations, which are far below the concentrations used in the laboratory, including acting as signaling molecules among bacterial cells in a population (Martinez 2008; Aminov 2009; Romero et al. 2011). An example is the aminoglycoside tobramycin, which is toxic at high concentrations due to binding to the ribosome. At lower, subinhibitory concentrations, tobramycin binds to a receptor protein and stimulates biofilm formation through the second messenger cyclic di-GMP (Hoffman et al. 2005).


Selection with Antibiotics, Table 1 Antibiotics for selection in bacterial cells (Sambrook et al. 1989; Walsh 2003)

Antibiotic | Target | Resistance determinant | Resistance mechanism
Ampicillin | Inhibits cell wall biosynthesis | β-Lactamase | Hydrolysis of β-lactam ring
Chloramphenicol | Binds 50S ribosome subunit | Chloramphenicol acetyltransferase | Acetylation prevents binding to the ribosome
Kanamycin | Binds 30S ribosome subunit | Aminoglycoside phosphotransferase | Phosphorylation prevents entry into the cell
Streptomycin | Binds S12 protein in 30S ribosome subunit | Aminoglycoside phosphotransferase or aminoglycoside adenyltransferase | Modification prevents entry into cell
Tetracycline | Binds 30S ribosome subunit | 42 kDa membrane protein | Efflux pump

Selection with Antibiotics, Fig. 1 Structures of selected antibiotics (Walsh 2003)

References
Aminov RI (2009) The role of antibiotics and antibiotic resistance in nature. Environ Microbiol 11:2970–2988
Hoffman LR, D’Argenio DA, MacCoss MJ, Zhang Z, Jones RA, Miller SI (2005) Aminoglycoside antibiotics induce bacterial biofilm formation. Nature 436:1171–1175
Kohanski MA, Dwyer DJ, Collins JJ (2010) How antibiotics kill bacteria: from targets to networks. Nat Rev Microbiol 8:423–435
Martinez JL (2008) Antibiotics and antibiotic resistance genes in natural environments. Science 321:365–367
Romero D, Traxler MF, Lopez D, Kolter R (2011) Antibiotics as signal molecules. Chem Rev 111:5492–5505
Sambrook J, Fritsch EF, Maniatis T (1989) Molecular cloning: a laboratory manual, 2nd edn. Cold Spring Harbor Laboratory Press, Cold Spring Harbor
Walsh C (2003) Antibiotics: actions, origins, resistance. ASM Press, Washington, DC

Selectivity of Chemicals for DNA Damage Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition Reactive chemicals differ in their tendencies to react with different biological molecules, e.g., proteins versus DNA. Further, chemicals are also selective for which DNA bases (and which atoms of DNA bases) they react with.

Discussion As pointed out elsewhere, many carcinogens require activation to reactive forms that will modify DNA. This is a part of what is called the somatic mutation theory of cancer. The first reports of DNA modification involved N7-guanyl adducts formed with nitrogen mustard (Lawley 1984). The modification of DNA follows fundamental chemical principles. One important consideration in DNA modification by alkylating (and aralkylating) species is SN1 versus SN2 reactions (Fig. 1a). Alkylation reactions can be described in the context of a Swain–Scott relationship (Lawley 1984; Swain et al. 1958): $s \cdot n_y = \log_{10}(k_y/k_0)$, where $s$ is a factor for each electrophile, $n_y$ is the nucleophilicity of a reactant relative to a reference “0,” and the rate constants $k$ are for the reference ($k_0$) and the nucleophile of interest ($k_y$). SN1 reactions show less dependence on $n$ than do SN2 reactions.

Selectivity of Chemicals for DNA Damage, Fig. 1 (a) Basic reaction chemistry relevant to modification of DNA. SN1 and SN2 reactions. (b) Hard–soft acid–base theory

Another issue in the chemical modification of DNA is hard–soft acid–base theory (Fig. 1b), as developed by Pearson (Lawley 1984; Pearson 1968). Briefly, “soft” nucleophiles (e.g., thiols) tend to react with soft electrophiles (e.g., Michael acceptors), and hard nucleophiles (e.g., exocyclic amines of nucleic acid bases) tend to react with hard electrophiles (e.g., SN1 alkylating agents). With some SN1 alkylating agents, rearrangements might occur prior to reaction, e.g., an n-propyl carbocation could rearrange to an isopropyl carbocation before reacting with DNA. However, this was not observed in the alkylation of nucleic acids by N-nitrosodi-n-propylamine (Park et al. 1980). Evidence has been presented that reaction of an isopropyl carbocation occurs by a pre-association mechanism in which the cationic intermediate reacts in the solution shell in which it is generated (Blans and Fishbein 2004).
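The Swain–Scott relationship above can be illustrated numerically. The following is a minimal sketch, not a treatment taken from the cited references: the values of s and n are hypothetical and were chosen only to show why an electrophile with a small s (SN1-like) discriminates less between nucleophiles than one with a large s (SN2-like).

```python
def relative_rate(s, n_y):
    """Swain-Scott: log10(k_y / k_0) = s * n_y, so k_y / k_0 = 10 ** (s * n_y)."""
    return 10 ** (s * n_y)


def selectivity(s, n_strong, n_weak):
    """Ratio of rates for a strong versus a weak nucleophile with one electrophile."""
    return relative_rate(s, n_strong) / relative_rate(s, n_weak)


# Hypothetical, illustrative values (assumptions, not measured constants).
S_SN2_LIKE = 1.0   # electrophile that is sensitive to nucleophilicity
S_SN1_LIKE = 0.3   # electrophile that is much less sensitive
N_STRONG, N_WEAK = 5.0, 2.0

if __name__ == "__main__":
    print("SN2-like selectivity:", selectivity(S_SN2_LIKE, N_STRONG, N_WEAK))
    print("SN1-like selectivity:", selectivity(S_SN1_LIKE, N_STRONG, N_WEAK))
```

With these assumed numbers the SN2-like electrophile favors the stronger nucleophile by a factor of 10^3, whereas the SN1-like electrophile favors it by less than a factor of 10, which is the sense in which SN1 reactions depend less on n.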


Selectivity of Chemicals for DNA Damage, Fig. 2 (a) Free radical reactions. (b) Formation of nitrenium ions

Comparable yields for reactions at hard and soft sites of purines led to the conclusion that nucleophilicity is unimportant in site selectivity (by an isopropyl cation) but that differences in the pre-association complexes drive the reaction selectivity (Blans and Fishbein 2004). The hard–soft reaction paradigm can be used to understand why Michael acceptors (soft, e.g., C=CH-C=O) react with proteins (e.g., sulfhydryls) in preference to nucleic acids (Guengerich et al. 1981). Acylating agents (e.g., acyl chlorides) are also electrophilic but do not generally undergo reactions with DNA. These are much more inclined to react with protein nucleophiles (Cai and Guengerich 2001) and are not very mutagenic. Another group of reactive chemicals is free radicals (Fig. 2a), including those derived from oxygen (see entry ▶ “Reactive Oxygen Species”). Radicals form multiple products, although one common site of reaction is the C8 atom of guanine (Maeda et al. 1974). Nitrenium ions are derived from the loss of a leaving group from a hydroxylamine (Fig. 2b). These primarily generate C8 and N2 adducts on guanine. The N2 adducts can be considered hydrazine derivatives. A direct reaction at the C8 atom is possible, although evidence has been presented that the initial attack is at the N7 atom of guanine, followed by rearrangement (Humphreys et al. 1992).

Cross-References
▶ Damage DNA, Natural Products that
▶ DNA Base Pairing, Modes of
▶ DNA Damage by Endogenous Chemicals
▶ DNA Damage, Frequency of
▶ DNA Damage, Types of
▶ DNA Replication
▶ Electrophiles, Types of
▶ Exocyclic Adducts
▶ Hydrolytic, Deamination, and Rearrangement Reactions of DNA Adducts
▶ Reactive Oxygen Species
▶ Synthesis of Modified Oligonucleotides

References
Blans P, Fishbein JC (2004) Determinants of selectivity in alkylation of nucleosides and DNA by secondary diazonium ions: evidence for, and consequences of, a preassociation mechanism. Chem Res Toxicol 17:1531–1539
Cai H, Guengerich FP (2001) Reaction of trichloroethylene oxide with proteins and DNA: instability of adducts and modulation of functions. Chem Res Toxicol 14:54–61
Guengerich FP, Mason PS, Stott WT et al (1981) Roles of 2-haloethylene oxides and 2-haloacetaldehydes derived from vinyl bromide and vinyl chloride in irreversible binding to protein and DNA. Cancer Res 41:4391–4398
Humphreys WG, Kadlubar FF, Guengerich FP (1992) Mechanism of C8 alkylation of guanine residues by activated arylamines: evidence for initial adduct formation at the N7 position. Proc Natl Acad Sci U S A 89:8278–8282
Lawley PD (1984) Carcinogenesis by alkylating agents. In: Searle CE (ed) Chemical carcinogens, vol 1, 2nd edn. American Chemical Society, Washington, DC, pp 325–484
Maeda M, Nushi K, Kawazoe Y (1974) Studies on chemical alterations of nucleic acids and their components. VII. C-Alkylation of purine bases through free radical process catalyzed by ferrous ion. Tetrahedron 30:2677–2682
Park KK, Archer MC, Wishnok JS (1980) Alkylation of nucleic acids by N-nitrosodi-n-propylamine: evidence that carbonium ions are not significantly involved. Chem Biol Interact 29:139–144
Pearson RG (1968) Hard and soft acids and bases, HSAB, Part 1. J Chem Educ 45:581–587
Swain CG, Stivers EC, Reuwer JF Jr et al (1958) Use of hydrogen isotope effects to identify the attacking nucleophile in the enolization of ketones catalyzed by acetic acid. J Am Chem Soc 80:5885–5893

Sequence Determination, Classic Approaches to Scott Cooper and Anton Sanderfoot Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis Determining the amino acid sequence of a protein can yield information about its secondary structure and activity.

Introduction Proteins are made of chains of amino acids joined through peptide bonds. Knowing the order, or sequence, of these amino acids is useful in identifying or studying a protein. Several methods are used to determine this sequence. The protein can be sequenced from the amino-terminus using a chemical reaction called the Edman degradation. More commonly, the sequence of amino acids is inferred from the sequence of the mRNA or DNA that encodes the protein. mRNA is first converted into complementary DNA (cDNA), which, like genomic DNA, can then be sequenced using a variety of DNA sequencing techniques. A cDNA sequence contains only exon sequences, so it can be scanned directly for open reading frames. Genomic DNA sequences must be analyzed for markers that identify the start and stop of a gene, along with intron/exon splice sites, to reconstruct the DNA regions that would code for protein. Once the open reading frame is identified, it is translated into an amino acid sequence using the genetic code.

Amino-terminal protein sequencing. Most of the modern direct sequencing methods will be discussed in the next short entry. A classical direct amino-terminal sequencing method is the Edman degradation, developed by Pehr Edman. In this method, phenyl isothiocyanate is reacted under alkaline conditions with the amino-terminus of a protein to form a phenylthiocarbamoyl derivative. The amino-terminal amino acid is then cleaved with acid and extracted with an organic solvent, yielding a cyclic phenylthiohydantoin (PTH) amino acid that can be identified by liquid chromatography. This reaction is then repeated with the next amino-terminal amino acid. Because this is a chemical reaction, it does not work at 100% efficiency, and typically only the sequence of the first 20–30 amino acids can be resolved before the background signal gets too large. To sequence an entire protein, it would first be partially digested with trypsin or another protease to produce multiple peptides with free amino-termini. A second drawback of the Edman degradation is that in some proteins the amino-terminus is blocked, making generation of peptides with free termini a necessity (Fig. 1).

Translating nucleic acid sequences into amino acid sequences. With the explosion in sequenced genomes, and the ease of sequencing complementary DNA (cDNA) made from messenger RNA (mRNA), most protein sequences are obtained by translating the sequence of the gene or mRNA that encodes a protein. In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to identify systematically. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically hundreds or thousands of base pairs long.
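Because such a coding sequence is one contiguous ORF, finding and translating it can be automated. The sketch below is illustrative only: it assumes a DNA string containing just A, C, G, and T, uses the standard genetic code written with T in place of U, treats ATG as the only start codon, and scans the three reading frames of each strand for the longest ORF that is closed by a stop codon. The function names are hypothetical and do not come from any particular software package.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered by first, second, and third base in T, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop. A trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def longest_orf(dna):
    """Return (protein, strand, frame) for the longest ATG-to-stop ORF in six frames."""
    dna = dna.upper()
    best = ("", "+", 0)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Segments before each '*' end at a stop codon; the last piece does not.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[0]):
                    best = (segment[start:], strand, frame)
    return best


if __name__ == "__main__":
    # Hypothetical toy sequence: ATG GCT TGG TAA begins at offset 2 of the forward strand.
    print(longest_orf("CCATGGCTTGGTAAG"))
```

Real gene finders add much more (alternative start codons, minimum length cutoffs, promoter and ribosome-binding-site signals), so this is a sketch of the principle rather than a usable annotation tool.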
Thus, once a bacterial genome is sequenced, it is relatively straightforward to obtain the sequences of all of the proteins encoded by the genome.

In eukaryotes, a gene is first transcribed into a primary RNA, which is then processed by splicing out intervening sequences (introns) and combining the remaining protein-encoding sequences (exons) to form an mRNA. To sequence these protein-encoding regions, cDNA is produced by first purifying mRNA from a tissue and then incubating it with reverse transcriptase. The resulting cDNAs can then be cloned into a library and screened with an antibody or oligonucleotide probe for the protein of interest. Alternatively, with high-throughput sequencing, all of the cDNAs can be sequenced, producing a transcriptome.

Sequence Determination, Classic Approaches to, Fig. 1 Edman degradation reaction (http://www.chem.wisc.edu/areas/reich/handouts/elecpush/epmechanisms.htm)

Sequence Determination, Classic Approaches to, Fig. 2 Six possible open reading frames of a cDNA. The 5′ end of the mRNA and the N-terminus of the protein are shown for orientation. The upper case letters in the DNA show the forward strand of DNA and the lower case letters the reverse strand (Courtesy UW-La Crosse)

Once the cDNA sequences have been obtained, they can be translated in all six possible reading frames to find the longest open reading frame (Fig. 2). Each strand of DNA has three possible reading frames, set by the three-base codons that each encode an amino acid. An open reading frame has a start codon and a stop codon in the same frame. Because 3 of the 64 possible codons are stop codons and the other 61 encode amino acids, by random chance a stop codon should appear about once every 21 codons (64/3 ≈ 21). Thus, if an open reading frame of 100 or more codons appears, it is unlikely to be due to chance and is most likely maintained by natural selection. All six reading frames are analyzed, and the longest open reading frame is assumed to be the actual reading frame. This is then translated using the genetic code to obtain a hypothetical amino acid sequence (Fig. 3).

Sequence Determination, Classic Approaches to, Fig. 3 Genetic code. The codon bases of the mRNA are read in the order: 1, left column; 2, top row; 3, right column. For example, tyrosine can have the codons UAU or UAC (Courtesy UW-La Crosse)

First position | Second position U | Second position C | Second position A | Second position G | Third position
U | UUU Phenylalanine | UCU Serine | UAU Tyrosine | UGU Cysteine | U
U | UUC Phenylalanine | UCC Serine | UAC Tyrosine | UGC Cysteine | C
U | UUA Leucine | UCA Serine | UAA STOP | UGA STOP | A
U | UUG Leucine | UCG Serine | UAG STOP | UGG Tryptophan | G
C | CUU Leucine | CCU Proline | CAU Histidine | CGU Arginine | U
C | CUC Leucine | CCC Proline | CAC Histidine | CGC Arginine | C
C | CUA Leucine | CCA Proline | CAA Glutamine | CGA Arginine | A
C | CUG Leucine | CCG Proline | CAG Glutamine | CGG Arginine | G
A | AUU Isoleucine | ACU Threonine | AAU Asparagine | AGU Serine | U
A | AUC Isoleucine | ACC Threonine | AAC Asparagine | AGC Serine | C
A | AUA Isoleucine | ACA Threonine | AAA Lysine | AGA Arginine | A
A | AUG Methionine | ACG Threonine | AAG Lysine | AGG Arginine | G
G | GUU Valine | GCU Alanine | GAU Aspartate | GGU Glycine | U
G | GUC Valine | GCC Alanine | GAC Aspartate | GGC Glycine | C
G | GUA Valine | GCA Alanine | GAA Glutamate | GGA Glycine | A
G | GUG Valine | GCG Alanine | GAG Glutamate | GGG Glycine | G

Isolating mRNA from tissue, creating cDNA, and screening libraries are time consuming and difficult. Protein-coding regions can also be obtained from genomic sequences produced by genome sequencing projects. The first challenge is finding the gene among the 98% of a typical genome that does not encode translated protein sequences. The beginning and end of a gene are located by looking for the promoter and other regulatory signals, such as CpG islands and the signals that direct addition of a poly(A) tail. The second challenge is that, unlike cDNA, which is produced from mRNA after the intervening sequences (introns) have been removed during RNA processing, genomic sequences still contain introns, so the genomic sequence cannot be translated directly to look for open reading frames. Once a putative gene has been found, programs look for sequences that are conserved in intron/exon splice sites, and the intron sequences are removed, leaving a single sequence of spliced exons. This putative coding sequence is then analyzed for the longest open reading frame. Most databases contain both the genomic DNA sequence of identified genes and the putative coding sequence.

Cross-References
▶ Predictions from Sequence
▶ Primary Structure

References
Edman P, Högfeldt E, Sillén LG, Kinell P-O (1950) Method for determination of the amino acid sequence in peptides. Acta Chem Scand 4:283–293

Sequence Information to Assess Evolutionary Relationships Scott Cooper and Anton Sanderfoot Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis Using biological information as “traits” for the construction of phylogenetic trees is now a common occurrence in the study of evolutionary biology. Many tools have been developed for measuring or estimating the relatedness of proteins or for use in finding a better understanding of the potential functions for a protein of interest. These tools are powerful, but are prone to misunderstanding, and have several potential pitfalls for nonexperts. This entry will explain the basics of sequence alignment and production of phylogenetic trees.


Introduction An essential theory of biology is that all living things share a common ancestor. Through various means, an individual’s heritage can be traced through time from recent ancestors to distant relatives and beyond to larger and larger natural groups. One way to accomplish this is to use the genes carried by individuals as a measure of relatedness. This requires that genes, like the individuals they are found in, can be traced backward through recent ancestors and so on. This ability to find genes that match in different individuals is important not only for genealogy but also for understanding the genes themselves. Various means have been established to measure and understand the relationship between two sequences, in other words, to determine the sequence homology present between two biomolecules. The assumption is that sequence homology is most likely derived from shared ancestry. For very short sequences, this assumption may not be that useful, since similar sequences can appear through random factors rather than shared ancestry, but such cases can be analyzed statistically to indicate the level of significance of any alignment. Early methods used hand alignment and required ever-increasing work as sequences became longer or as more than a pair of sequences was analyzed. Modern methods use computer algorithms, which are much faster and more powerful and rest on a mathematical basis that can provide the statistical analysis needed to understand the significance. Of course, computers will simply align any two sequences if instructed to. It is always up to a person to decide whether the result is significant and whether the alignment represents a random alignment or sequence homology based on shared ancestry.
This entry will attempt to introduce the tools, the technology, and the limitations of modern alignment algorithms and help readers to use the techniques to measure and analyze the evolutionary basis of the resulting sequence homology.

Choosing the Correct Genes or Proteins for Alignment Homology between two sequences has certain implications. “Homo” means “same,” so there is generally an assumption that two proteins that share sequence homology share a common ancestor, or that the two sequences share homology by inheritance from a common ancestor. However, recent gene duplications, convergence, or simple random alignment can lead to the appearance of sequence homology. For this reason, additional terms are often used to specify what it means to be a homolog. For example, the terms “ortholog” and “paralog” are used to specify the particular kind of homolog that one sequence represents in relation to a second homologous sequence. In this context, “ortho-” means “correct,” while “para-” means “beside,” and these terms are helpful for identifying the kind of homolog a sequence represents, which can greatly affect the results of any analysis. The danger of paralogs is that these genes most often result from gene duplication and, as duplicates, often experience reduced selection to maintain a particular function. They tend to pick up mutations quickly and often have altered expression patterns and novel biological functions. Each of these can result in a model of evolution that is not the same as that of the original gene prior to duplication (i.e., the ortholog), which will result in great differences in the ability of the paralog to be aligned to orthologs from other species.

A related danger arises when a gene is not inherited by vertical descent (i.e., from parent to child) and is instead the result of horizontal gene transfer. This occurs when a gene from another organism is transferred into the genome and becomes stably inserted. Transfer of genetic material is common in bacteria and archaea (Boto 2010) and even occurs at a significant rate in eukaryotes. One common place where this is seen in eukaryotes is the movement of organelle genes from the mitochondria or plastids to the nuclear genome. Occasionally, the nuclear version of a gene can be replaced with the mitochondrial or plastid version in some lineages and not in others. Under these conditions, one organism’s gene may be more closely related to bacterial genes than to the original eukaryotic gene found in otherwise related species, resulting in peculiar alignments.

It is therefore very important when creating an alignment to seek out the orthologs when possible and to avoid paralogs or genes resulting from horizontal gene transfer. When one gene in an alignment is found to be greatly divergent from what is expected (e.g., a human protein that aligns more closely with a fungal protein than with a chimpanzee protein), one should suspect that the proper ortholog has not been used. In such a case, it would be wise to double-check the human genome to find out whether such an ortholog is available.

Sequence Alignments Alignment of biological sequences requires matching information on the two (or more) sequences for a significant length, with only limited mismatches or gaps introduced into the alignment. These alignments used to be accomplished by hand, but now various algorithms have been optimized to perform the alignment quickly and accurately. In general, different tools are used to align nucleic acid sequences than to align protein sequences, but they work on similar principles. While there are advantages and disadvantages to working with either nucleic acid or protein sequences, the most important thing is to use a tool that is optimized for the type of sequence being aligned.

Nucleic acids are polymers of four different bases: adenine, guanine, thymine, and cytosine. As such, each element of sequence has four possible states, and a four-base-pair sequence has $4^4 = 256$ possible permutations. This means that any two unrelated sequences of four base pairs each have a 1/256 chance of aligning perfectly at random. Meanwhile, any pair of 100-base-pair sequences has a less than $1/10^{60}$ chance of randomly aligning ($4^{100} \approx 1.6 \times 10^{60}$), so longer alignments are far less likely to arise by chance and are correspondingly more significant. Proteins, on the other hand, are polymers of 20 different amino acids, which tends to increase the significance of alignments (e.g., $20^4 = 160{,}000$). However, proteins are templated upon the nucleic acid sequence through translation of the genetic code, so nucleic acid sequence alignments are often preferred in many alignment procedures. Still, as long as the correct tools are used (nucleic acid-optimized tools for nucleic acid alignments, protein-optimized tools for protein alignments), a useful alignment can be accomplished with either biomolecule.

The general principle of all alignment algorithms is to maximize the number of aligned residues while minimizing the mismatches and gaps introduced by alignment. Two major approaches are used:

Global alignment methods (e.g., Needleman and Wunsch 1970) attempt to force alignment of the full length of two sequences (i.e., from beginning to end). Such an approach is useful if the two sequences are both believed to be full-length sequences that share common descent and are of similar length. The algorithms start by making a matrix of all possible alignments of the two sequences and then use a scoring matrix to identify the alignment that produces the highest score. Global alignments tend to introduce more gaps, especially if the sequences are not the same length. They also do poorly with sequences that share only limited regions of homology.

Local alignment methods (e.g., Smith and Waterman 1981) do not force alignment across the full length of the sequences and focus only on small regions of homology. The algorithm first finds a short sequence of high homology and then expands the alignment in both directions until the alignment begins to fail. Local alignments are good for finding limited homology between two sequences and are much faster than global alignment approaches. However, they can occasionally be misleading, since a high score for only a limited fragment of each sequence can lead a researcher to believe that homology exists for the entire protein when that is not the case.

Hybrid methods that mix the two approaches are now common as well. These semi-global or “glob-loc” algorithms generally start with local approaches and then force the alignment past where it would normally stop in an attempt to align the full length of the sequence pair. Each of these methods is most useful for alignment of a pair of sequences (i.e., “pairwise alignment”). Which method is best? The best method is usually decided empirically, and it is often useful to try multiple methods.

Multiple alignment algorithms are used to align more than a pair of sequences. They use general principles similar to those of the pairwise alignments above but take extra steps to accomplish the alignment of the full data set. Two major approaches are used to accomplish multiple alignments.

Progressive alignments start by first performing all possible pairwise alignments between the sequences in the data set. Then, starting with the highest-scoring pair, the algorithm uses the pairwise alignments to work progressively through to the lowest-scoring pair to produce a multiple-sequence alignment. These are the most widely used of the multiple-alignment algorithms and include CLUSTAL and T-Coffee (Higgins and Sharp 1988; Notredame et al. 2000). They tend to work well on large data sets without becoming computationally intensive. The downside is that these methods are extremely sensitive to the results of the initial alignment (the highest-scoring pair), and any errors in the initial results are propagated through to the final product.

Iterative alignments are not subject to the initial-error issues of progressive methods. They iteratively add one sequence at a time to the alignment, optimizing and re-gapping at each iteration. This can become computationally intensive, and optimization of the algorithm can lead to results that may be difficult to evaluate quantitatively for biological significance (see Edgar 2004). Examples of iterative algorithms include MUSCLE and PRRN/PRRP (Edgar 2004; Gotoh 1996).

Newer methods for multiple-sequence alignment have been developed that use hidden Markov models (that attempt to assign likelihood scores to all possible alignments; Grasso and Lee 2004) or genetic algorithms (that try to simulate real natural processes to emulate natural alignments; Notredame and Higgins 1996). Though such methods appear promising, they are not widely used at this time.
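The global and local pairwise approaches described in this section can be sketched with a single dynamic-programming routine. This is a minimal illustration, not the implementation used by any published tool: the match, mismatch, and gap scores are arbitrary assumptions, gaps carry a simple linear penalty, and the local mode fills the full scoring matrix in the manner of Smith and Waterman rather than extending outward from a short seed.

```python
def align(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return (score, aligned_a, aligned_b) for a global or local pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalized along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                if cell > best:
                    best, best_pos = cell, (i, j)
            score[i][j] = cell
    # Trace back from the last cell (global) or the best-scoring cell (local).
    i, j = best_pos if local else (rows - 1, cols - 1)
    final = score[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        sub = match if same else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(align("GATTACA", "GATACA"))                          # global, one gap needed
    print(align("CCCGATTACACCC", "TTGATTACATT", local=True))   # local, shared core only
```

Running both modes on the same pair is a quick way to see the trade-off described above: the global mode pays gap and mismatch penalties to span both sequences, while the local mode reports only the best-matching region and says nothing about the rest.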

Sequence Distance Once an alignment has been accomplished, the next step is to calculate some measure of sequence homology among the members of the alignment. By examining the alignment between two sequences, the number of changes between them can be counted. The assumption is that the number of changes is in some way proportional to the time since the sequences last shared a common ancestor. Two identical sequences (i.e., zero sequence changes) are assumed to have shared a common ancestor so recently that there has been no time to accumulate mutations. Since mutation is a fact of life, the general idea is that the longer two sequences have been evolving in isolation, the more changes they will have accumulated since the common ancestor sequence. The mutation rate for most organisms is sufficiently low that it can be assumed that, for any given site, the base or residue in one of the sequences is likely to represent the ancestral base or residue. Thus, a difference at a particular site represents one change between the two sequences.

This relatedness between two sequences is generally scored as sequence distance: identical sequences have no changes and zero distance, closely related sequences have accumulated some changes and have a nonzero distance, and less-related sequences have more changes and a greater distance. Basically, distance is proportional in some mathematical manner to the number of changes between two sequences. In a multiple-sequence alignment, which ideally represents a pairwise alignment of all the members, each sequence will have a separate number of changes relative to each other sequence in the alignment. This is most simply represented by an n-by-n matrix, where n represents the number of sequences in the alignment. Each of the elements of the matrix represents a distance between two sequences (again, proportional in some way to the number of changes), and thus the matrix is often called a distance matrix. An inspection of the distance matrix resulting from a multiple-sequence alignment can provide a measure of the relatedness of all the members of the alignment, with closely related sequences having small distances and less-related sequences having larger distances.

Sequence Information to Assess Evolutionary Relationships, Fig. 1 Visualizing “distance”

Of course, not all sequence changes are necessarily the same (transitions versus transversions in nucleic acid sequences or “conservative changes” in protein sequences). Some changes are neutral to evolution, while some changes are likely to alter function. To account for such variation in the “consequences” of a mutation, some kind of scoring matrix (e.g., PAM, BLOSUM; Henikoff and Henikoff 1992) can be used to score the changes present between the two sequences and so better quantify the evolutionary significance of each change. Each kind of scoring matrix uses different assumptions and is generally either based upon chemical characteristics (size of residue, hydrophobicity, polarity, charge, etc.) or extrapolated from real biological cases (natural mutation rate, adjacency in codon tables, redundancy of coding, etc.). Each has advantages and disadvantages, so the choice of scoring matrix should be made carefully or empirically for a given data set.
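The construction of a distance matrix from an alignment can be sketched as follows. This is a minimal example under stated assumptions: every change counts equally (no scoring matrix is applied), alignment columns containing a gap are skipped, and the three aligned sequences are hypothetical, invented so that the pairwise counts come out as 2, 5, and 5.

```python
def count_changes(seq_a, seq_b):
    """Count differing sites between two aligned sequences, skipping gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y and "-" not in (x, y))


def distance_matrix(alignment):
    """Return (names, n-by-n matrix of change counts) for a dict of aligned sequences."""
    names = list(alignment)
    matrix = [[count_changes(alignment[p], alignment[q]) for q in names] for p in names]
    return names, matrix


# Hypothetical aligned sequences (illustrative only).
EXAMPLE = {
    "A": "GCGTACGTAC",
    "B": "ATGTACGTAC",
    "C": "ACACGTGTAC",
}

if __name__ == "__main__":
    names, matrix = distance_matrix(EXAMPLE)
    for name, row in zip(names, matrix):
        print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so only the values above (or below) the diagonal carry information. Replacing the unweighted count in `count_changes` with a lookup in a substitution matrix would be the natural place to apply the weighting described above.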

Phylogenetic Trees

Most of the time, researchers want to examine the phylogeny resulting from a multiple-sequence alignment or simply would like to have a graphical representation of the implied relationships between members of the alignment. In the simplest case, two sequences will produce a distance matrix (1 × 1, a direct score either based on number of changes or modified by a scoring matrix) with a single distance score that can be drawn as a line with a length proportional to the sequence distance. The length of the line (or branch) connecting the two sequences thus reflects how far the two sequences have diverged. Extrapolating this to a multiple-sequence alignment, which produces a higher-order distance matrix (Fig. 1a), one can make a visual representation where all the sequences are connected by lines of a length proportional to their relative distances. More closely related sequences will have shorter lines connecting them, while distantly related sequences will have proportionately longer lines (Fig. 1b). In the case of this simple alignment, it is clear that sequences A and B are more closely related to each other (distance of 2) and share a more recent common ancestor relative to sequence C. To show this, connect A and B at a node that represents the common ancestor of A and B. The line connecting A and B along the path through the node must have a length of 2 (from the distance matrix), so the node can be placed a distance of 1 from both A and B. Both A and B have the same distance of 5 to C in this example, which means that C must lie a distance of 4 from the common ancestor of A and B; thus, the line connecting the A-B node to C should have a length of 4. The resulting tree (Fig. 1c) is a diagram with the appropriate distances between each pair of sequences. Such a diagram is often called a star diagram or an unrooted tree.

In reality, the distance matrix resulting from a multiple-sequence alignment is often much more complex than the example in Fig. 1, so algorithms must be used to convert the distance matrix into a phylogenetic tree. Techniques like neighbor joining (NJ; Saitou and Nei 1987) identify the shortest distance in the matrix and then fix that node before recalculating the entire data matrix. The algorithm then identifies the shortest distance in the new matrix, fixes that node, and recalculates again. Iteratively repeating this process until all the nodes are fixed produces a tree that is statistically significant regardless of the model of evolution assumed for the alignment and often results in an accurate topology. NJ algorithms also tend to be fast and are usually not computationally intensive, even with large data sets. Like all algorithms, NJ tends to perform poorly on distance matrices with very large distance scores (i.e., distantly related sequences).

Another approach based on a distance matrix, Unweighted Pair Group Method with Arithmetic Mean (UPGMA; Sokal and Michener 1958), uses sequence clustering to produce a tree. First, it creates clusters of closely related sequences, assigning each cluster a score representing the mean distance of the members. It then combines the next-most related clusters into larger clusters, which again have the mean distance of all the combined members. Eventually, all members are parts of nested clusters with distances that can be represented by the mean of each level of cluster. UPGMA trees must assume a constant rate of evolution among all the members of the alignment (i.e., the method uses the mean distance between elements for the nodes), which may not be accurate for particular alignments.
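The three-sequence example (A-B distance 2, A-C and B-C distance 5) and the UPGMA clustering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular phylogenetics package: the function names are invented here, distances are supplied as a dictionary keyed by leaf pairs, and ties are broken by input order.

```python
from itertools import combinations


def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths from A, B, and C to the single internal node of a 3-leaf tree."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


def upgma(distances):
    """Cluster leaves by UPGMA.

    distances: {(x, y): d} for every unordered pair of leaf names.
    Returns (tree, root_height), where tree is a nested tuple of leaf names.
    """
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    size = {leaf: 1 for pair in distances for leaf in pair}
    height = {leaf: 0.0 for leaf in size}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        x, y = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (x, y)
        # The new node sits halfway between the clusters (constant-rate assumption).
        height[merged] = d[frozenset((x, y))] / 2
        # Distance to every other cluster is the size-weighted mean.
        for other in size:
            if other not in (x, y):
                d[frozenset((merged, other))] = (
                    size[x] * d[frozenset((x, other))]
                    + size[y] * d[frozenset((y, other))]
                ) / (size[x] + size[y])
        size[merged] = size.pop(x) + size.pop(y)
    root = next(iter(size))
    return root, height[root]


if __name__ == "__main__":
    print(three_point_branches(2, 5, 5))
    print(upgma({("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 5}))
```

For the example distances, the three-point calculation places the node 1 unit from A and B and 4 units from C, as described in the text. UPGMA joins A and B at a height of 1 and then places the root at a height of 2.5, halfway along the mean distance of 5 between the A-B cluster and C, which shows how the constant-rate assumption fixes the root position.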
UPGMA also always produces a rooted tree, since the entire tree represents the largest cluster, and the final node will be the mean of the distances between the penultimate clusters; that is, the root will be placed at the mean of the largest distance. If a constant rate of evolution holds for a particular data set, this placement of the root is correct, but that assumption is not necessarily true.

Many other phylogenetic algorithms do not rely upon a distance matrix; instead, they use parsimony and cladistics to analyze each position in the alignment as if it were itself a matrix. Each of these methods basically creates all possible trees for the sequences present in the alignment and then uses various methods to determine which of the trees requires the fewest changes to produce that phylogeny (maximum parsimony) or which tree has the highest statistical likelihood of being accurate (maximum likelihood or Bayesian inference; Felsenstein 1981; Yang and Rannala 1997). These algorithms are generally believed to be more statistically supported and better able to produce an accurate topology. However, they can become computationally intensive (computation time scales with the number of elements) and can also be sensitive to artifacts when faced with low-homology sequences or long branch distances. The correct algorithm will vary with the particular data set and should be tested empirically to determine which result seems most accurate. Conversely, a topology supported by two different methods of analysis can be considered to have useful support for its accuracy.
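To make the parsimony criterion concrete, the sketch below counts the minimum number of changes that a given tree requires for a small alignment, using the Fitch small-parsimony procedure. Fitch's algorithm is not named in the text and is used here only as one standard way of counting changes on a fixed tree; the four-taxon alignment, the two candidate topologies, and the function names are hypothetical.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    column: {leaf name: residue at this alignment position}.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_changes = fitch(tree[0], column)
    right_states, right_changes = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The subtrees can agree on a state, so no extra change is needed here.
        return shared, left_changes + right_changes
    # The subtrees disagree, so at least one change occurs at this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


if __name__ == "__main__":
    # Hypothetical four-taxon alignment and two candidate topologies.
    alignment = {"A": "GGA", "B": "GGA", "C": "TGC", "D": "TAC"}
    print(parsimony_score((("A", "B"), ("C", "D")), alignment))
    print(parsimony_score((("A", "C"), ("B", "D")), alignment))
```

Here the topology grouping A with B and C with D needs 3 changes, while the alternative grouping needs 5, so maximum parsimony would prefer the first. A full parsimony search repeats this scoring over many candidate trees, which is where the computational cost arises.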

Rooted or Unrooted Trees?

Some phylogenetic programs will default to an unrooted tree, since this is the only fair demonstration of the data without any additional knowledge about the placement of the root. If a rooted tree is desired, some additional information must be used to determine the best location for a root. Often, programs will default to setting a root at the midpoint of the longest branch (as UPGMA must do). This may be reasonable considering the assumptions in the representations of a tree (i.e., the most divergent sequences represent the greatest time for mutations to accumulate, and the ancestor/root should be older than the oldest sequence). But, in alignments of very similar sequences (or alignments of highly divergent sequences), a rooting based on the longest branch may be misleading. Instead, information from outside the aligned sequences can be used to root the tree, for example, by using taxonomic data to choose sequences from species that are outgroups (fish or amphibians for mammals, gymnosperms for flowering plants, etc.) and including those sequences in the alignment and subsequent phylogeny. Assuming that the outgroups share a common ancestor with the group of interest that is more ancient than the common ancestor of the group itself, the root can be placed with some confidence on the branch connecting the outgroups to the “ingroups.”

Common Misperceptions About Phylogenetic Trees

It is important to always remember that a phylogenetic tree represents a hypothesis about the relationship between the members of an alignment. It is not “proof” of phylogeny, nor should it be the end of an analysis. As with any hypothesis, it must be tested. The results of an alignment and any subsequent phylogenetic tree should be examined with respect to other information known about the organisms from which the sequences are derived. The use of paralogs rather than orthologs in alignments, long-branch attraction in phylogenetic algorithms, and other mechanical issues in the computations can result in poorly supported topology. With the large number of algorithms available, parallel analysis with algorithms of different methods should be attempted. Variability among the results should be taken as a sign of low confidence in the alignment or phylogeny. Agreement among the different methods, on the other hand, should provide more confidence in the results.

Cross-References
▶ Genes and Genomes: Structure
▶ Predictions from Sequence

References
Boto L (2010) Horizontal gene transfer in evolution: facts and challenges. Proc Biol Sci 277(1683):819–827
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797
Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
Gotoh O (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264(4):823–838
Grasso C, Lee C (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20(10):1546–1556
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22):10915–10919
Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73(1):237–244
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Notredame C, Higgins DG (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res 24(8):1515–1524
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302(1):205–217
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38:1409–1438
Yang Z, Rannala B (1997) Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol 14:717–724

Sequence Information to Identify Motifs
Scott Cooper and Anton Sanderfoot
Department of Biology, University of Wisconsin – La Crosse, La Crosse, WI, USA

Synopsis

Identifying potential functions for an unknown protein sequence has become easier due to the proliferation of powerful algorithms to compare the sequence to large databases of consensus


domains and motifs. Understanding the use and the potential dangers of such algorithms can greatly speed any work with new and novel sequences. The most important thing to remember is that such programs can only create a hypothesis about the shape or function of a sequence, not prove any functions. Such programs are the first step in any study of a new protein, not the last.

Introduction

The flow of biological information from DNA to RNA to protein necessarily requires that the final shape of the protein is encoded in the DNA. This essential fact of biology is the basis for the field of bioinformatics. Bioinformatics greatly expands what researchers can learn about the potential functions of a given gene simply by examining the nucleotide or encoded-protein sequence. Another essential fact of biology, evolution, further allows knowledge about a given nucleotide sequence to be applied to other sequences with which it shares sequence homology. Combining these facts, one can identify sequences that encode a particular shape, a particular binding interface, or a particular function (i.e., motifs or domains) and then apply that knowledge to make a hypothesis about the functions of an unknown sequence with homology to the known motif or domain. Of course, a predictive algorithm is not useful until data about the function, shape, or characteristics of a known sequence have been collected. This traditionally requires experimentation and collection of structural information about a sequence that establishes the function of that motif or domain. Analysis of these experiments, combined with bioinformatics approaches, produces a consensus sequence or matrix for a given motif or domain that can be compared to unknown sequences.

Definition of Motifs and Domains

Though there can often be quite a bit of overlap, motifs and domains represent different levels of information that can be extracted from sequence homology: domains represent conserved sequences that have a common tertiary structure, while motifs represent conserved sequences that have a defined biological function. For example, enzymes that catalyze the oxidation of malate to oxaloacetate using an NAD+ cofactor (e.g., malate dehydrogenase) are found in virtually all living organisms, and the proteins with that enzymatic activity have similar structural shapes (Fig. 1). Moreover, the primary sequences of many such proteins can be aligned to produce a consensus sequence that can be considered diagnostic of a conserved functional domain that folds into a similar tertiary structure and carries the function of malate dehydrogenase (e.g., cd01337; http://www.ncbi.nlm.nih.gov/cdd). An unknown sequence that shows homology to the consensus sequence of the malate dehydrogenase domain can be hypothesized to fold into a similar tertiary structure and, assuming that it carries the active site residues for catalysis, to also encode an enzyme with malate dehydrogenase activity.

Even within a domain, it is possible to look more closely for the residues that are essential for the activities of the larger domain. For example, though the malate dehydrogenase domain is rather large, only a few amino acids directly interact with the substrate. The active site of malate dehydrogenases could be represented by a conserved sequence motif of only 13 amino acids (e.g., PS00068; http://prosite.expasy.org/) that directly binds to substrate and performs the catalysis of the reaction. The general concept is that known malate dehydrogenases carry such a motif in their active sites. However, because the motif is short, it is possible that such a sequence could arise at random. In other words, an unknown sequence with homology to this PS00068 motif could be hypothesized to encode malate dehydrogenase activity, but the hypothesis would be stronger if the sequence also carried homology to the larger malate dehydrogenase domain. Of course, the fact that a particular sequence shares homology with a given domain or motif is not proof of the activity known for that domain/motif; it can only indicate which experiments could be performed next to test for that function.

Sequence Information to Identify Motifs, Fig. 1 Alignment of the structures of a bacterial and a eukaryotic malate dehydrogenase. Escherichia coli malate dehydrogenase (1EMD, downloaded from http://www.rcsb.org/pdb and colored green) was aligned with Sus scrofa (pig) mitochondrial malate dehydrogenase (1MLD, downloaded from http://www.rcsb.org/pdb and colored red), and both were opened in PyMOL (the PyMOL Molecular Graphics System, Version 1.5.0.1, Schrödinger, LLC) and aligned to show the similar tertiary structure of the two proteins. Each protein is predicted by NCBI:CDD to contain a single cd01337 domain (MDH_glyoxysomal_mitochondrial): 1EMD with an E-value of 2.83e-164 and 1MLD with an E-value of 0e+00 (exceptionally high confidence for both)

Overlap Between Domains, Motifs, and Targeting Signals

As outlined above, motifs and domains represent different levels of information that can be extracted from homology. Of course, there is often overlap, since some motifs may also represent a common tertiary structure. For example, a “Zn-finger” motif called ZN_FYVE (PS50178; http://prosite.expasy.org/) has a consensus sequence and a conserved function of coordinating two Zn2+ ions and specifically binding phosphatidylinositol-3-phosphate. Proteins that contain FYVE motifs also fold into a conserved tertiary structure (e.g., 1HYJ, 1VFY; http://www.rcsb.org/pdb/), indicating that this motif also represents a domain.


Not all proteins are enzymes, so the concept of domains and motifs can also be extended to functions unrelated to catalysis. Some domains are known more for a structure or shape than an enzymatic function (e.g., WD40 domain, cl02567 (http://www.ncbi.nlm.nih.gov/cdd) or Clathrin_propel, PF01394 (http://pfam.sanger.ac.uk/)). Similarly, motifs can also describe the sequence of a protein that serves as the substrate for another enzyme. For example, motifs that describe phosphorylation sites for kinases or glycosylation sites for glycosyltransferases (e.g., PKC_PHOSPHO_SITE, PS00005 or ASN_GLYCOSYLATION, PS00001 (http://prosite.expasy.org/)) provide information about the potential posttranslational modifications of a protein. Other motifs describe the various targeting signals for secretion or organelles (e.g., MICROBODIES_CTER, PS00342, or NLS_BP, PS50079 (http://prosite.expasy.org/)) that are encoded in the sequence of proteins and serve as binding sites for the various cargo receptors that mediate protein targeting. Often these targeting signals are not easily predicted by straightforward sequence homology, and other characteristics of the sequence must be used to identify these potential sites. For example, targeting to the secretory pathway or to organelles like the mitochondria or chloroplasts involves N-terminal alpha helices with a characteristic hydrophobicity or amphipathicity; these act as targeting signals and can be identified by purpose-built algorithms like SignalP or TargetP (Petersen et al. 2011; Emanuelsson et al. 2007). Similar approaches can identify regions of a protein predicted to be transmembrane alpha helices (e.g., TMHMM; Krogh et al. 2001) or beta barrels (e.g., PRED-TMBB; Bagos et al. 2004) from particular characteristics of the sequence that are shared with known transmembrane proteins.
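The idea of detecting a region by its physical character rather than by homology can be illustrated with a sliding-window hydropathy scan. This is a deliberately crude stand-in, not the method used by SignalP, TargetP, TMHMM, or PRED-TMBB, which rely on trained statistical models. The sketch assumes the Kyte-Doolittle hydropathy scale, a 19-residue window, and a cutoff of 1.6, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(sequence, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    return [
        sum(KYTE_DOOLITTLE[res] for res in sequence[start:start + window]) / window
        for start in range(len(sequence) - window + 1)
    ]


def candidate_helix_starts(sequence, window=19, threshold=1.6):
    """0-based start positions of windows hydrophobic enough to flag for follow-up."""
    scores = hydropathy_windows(sequence, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


if __name__ == "__main__":
    # Hypothetical sequence: charged flanks around a 19-residue hydrophobic stretch.
    toy = "MKKDDEER" + "LIVLLAVIGLLVAILLIVA" + "KRRDEEKK"
    print(candidate_helix_starts(toy))
```

A flagged window is only a candidate; as with every prediction discussed here, it suggests an experiment rather than replacing one.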
Finally, in the “post-genomic” world, domain classification has been working in reverse (i.e., from a consensus sequence to functional domain, rather than functional domain to consensus sequence). Homologous sequences identified in multiple proteins with unknown function are given “domains of unknown function” assignments. These “DUF” families are numerous in the Pfam database (Punta et al. 2012) and have occasionally been predictive of true functional domains following further study. For example, domain of unknown function 9 was identified in many bacterial proteins. This domain was later shown to have diguanylate cyclase activity (Pei and Grishin 2001) and was renamed the GGDEF domain (PF00990; http://pfam.sanger.ac.uk/).

Algorithms Used to Identify Motifs and Domains

Many different algorithms are available for comparing an unknown sequence to a database of consensus domains and motifs. For example, CDD (NCBI), Pfam (Sanger), SMART (EMBL), and PROSITE (ExPASy) each provide searchable databases of their own consensus domains and/or motifs (see Table 1). These databases cross-reference one another and generally provide links to other sources of motif/domain information. However, because they are not identical, results can vary from database to database. Each of these sites presents a fairly standard Web interface that should be familiar to any Web-savvy researcher. A target or query sequence (the unknown sequence of your protein of interest) can be pasted into the search box in virtually any text format. Some equivalent of a “submit,” “search,” or “start” button initiates the analysis. Depending on the database, some version of fast protein alignment compares the query sequence to the consensus sequence of all the motifs/domains in the database. Should any domains or motifs be identified, a subsequent results page will provide the location and alignments of the query sequence to the identified domain or motif. The results page will also offer links to the pages for each domain or motif, which provide additional information that may be helpful in understanding these results.

Interpreting the results is an important step, however, since biology must always trump the informatics results. It is especially important to take care with the results from a PROSITE search, since that database focuses mainly on pattern motifs, which are often short and highly degenerate. Fortunately, ScanProsite by default checks the box named “Exclude motifs with a high probability of occurrence,” which hides those patterns with too little sequence to align reliably. For example, the ASN_GLYCOSYLATION motif (PS00001) is only four amino acids long: N-{P}-[ST]-{P}. This pattern signature means that the first residue must be asparagine (N), followed by any amino acid except proline (i.e., {P} means NOT P), followed by either serine or threonine ([ST] means S OR T), and ending with any residue that is not proline. Statistically, if all 20 amino acids are assumed to be equally frequent, this pattern would occur at random approximately once every 221 residues, so large proteins would be expected to contain such a match with high probability. While it is true that the protein complex responsible for N-glycosylation of proteins, N-oligosaccharyltransferase, does recognize this consensus sequence, adding the core glycan to the asparagine residue (reviewed in Schwarz and Aebi 2011), this does not mean that all occurrences of the sequence will be N-glycosylated.
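The pattern syntax and the chance-occurrence estimate can both be checked with a short Python sketch. It is an illustration under stated assumptions rather than a reimplementation of ScanProsite: the translator handles only single residues, [..] classes, {..} exclusions, and x, the function names and test sequence are hypothetical, and the probability calculation assumes that all 20 amino acids are equally frequent.

```python
import re
from fractions import Fraction


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern (no repeats or anchors) into a regex."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")                          # any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")   # any residue except these
        else:
            parts.append(element)                      # single residue or [..] class
    return "".join(parts)


def find_sites(sequence, pattern):
    """0-based start positions of all matches, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [match.start() for match in regex.finditer(sequence)]


def chance_probability(allowed_counts, alphabet=20):
    """Probability of a match at one position if residues are equally frequent."""
    probability = Fraction(1)
    for allowed in allowed_counts:
        probability *= Fraction(allowed, alphabet)
    return probability


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("MNGSAANPTLNNTS", "N-{P}-[ST]-{P}"))
    # N: 1 allowed residue; {P}: 19; [ST]: 2; {P}: 19
    p = chance_probability([1, 19, 2, 19])
    print(p, float(1 / p))
```

Under the equal-frequency assumption the per-position probability is 361/80000, or about one match per 221.6 residues, in line with the figure quoted above. Real amino acid frequencies are not uniform, so the true rate for a given proteome will differ somewhat.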

Sequence Information to Identify Motifs, Table 1 Databases of motifs and domains

Database | Source | Search mechanism | Web interface
Conserved Domains Database (CDD) | National Center for Biotechnology Information, USA | RPSBLAST | http://www.ncbi.nlm.nih.gov/cdd
Pfam | Sanger Institute, UK | HMMER3 | http://pfam.sanger.ac.uk/
Simple Modular Architecture Research Tool (SMART) | European Molecular Biology Laboratory, Germany | HMMER3 | http://smart.embl-heidelberg.de/
PROSITE | Swiss Institute of Bioinformatics, Switzerland | ScanProsite | http://prosite.expasy.org/

Importantly, the protein complex responsible for this activity is only found in the lumen of the endoplasmic reticulum (or the periplasm of archaea and some bacteria) and only has access to the sequence on surface loops of proteins exposed to the lumen of the endoplasmic reticulum (Schwarz and Aebi 2011). This motif has no relevance for cytosolic proteins, nor for occurrences of the motif that are folded into positions not accessible to the enzyme. With this in mind, one should take care when interpreting all search results unless some experimental data are available to back up such predictions.

Since interpretation of some results requires information about the subcellular localization of an unknown protein, it is often useful to also try prediction algorithms that investigate a protein for targeting signals or motifs. PROSITE contains motif patterns for some targeting signals such as C-terminal peroxisomal targeting signals (PS00342), bipartite nuclear localization signals (PS50079), and C-terminal endoplasmic reticulum retaining signals (PS00014). Such signals are very short pattern motifs and are subject to the same warnings given above for the N-glycosylation pattern, but they can be helpful when additional data support the predictions. Other signals for targeting to the secretory pathway, mitochondria, or chloroplasts can be predicted with SignalP or TargetP (Petersen et al. 2011; Emanuelsson et al. 2007). Such programs produce probabilistic results that can be used to identify the potential destination of the target proteins. In support of or in addition to such programs, investigating a protein for membrane-spanning regions can help to localize a protein. Proteins can span a biological membrane with one or more transmembrane alpha helices or by a series of beta-sheets. Because the characteristics and the orientation of membrane proteins are also vital for understanding the potential localization of a protein, various matrix-based probabilistic models trained on known transmembrane proteins have been produced (e.g., TMHMM (Krogh et al. 2001), PRED-TMBB (Bagos et al. 2004)). The output of such programs also must be interpreted carefully, since, like all programs, they are prone to false positives and false negatives.


Importantly, different algorithms and databases can be used to cross-check and support predictions from others. Predictions of a nuclear localization signal and a secretory targeting signal are mutually exclusive (since a protein cannot simultaneously be in the nucleus and outside the cell). On the other hand, a prediction of a secretory signal from SignalP, followed by a similar result from TMHMM, would be helpful; this would also support a result of potential N-glycosylation sites from PROSITE, since those sites would potentially be accessible to the oligosaccharyltransferase in the ER lumen.

It is also worth noting that the results from any prediction algorithm will include some form of confidence score that reflects the probability that the given results are significant. Each algorithm has a different method for providing this information. A common example is an E-value (used by most domain and motif algorithms discussed here), which is the number of alignments of at least that quality expected to occur by chance alone. An E-value of one represents an alignment likely to occur simply by pure random chance, while values close to zero (e.g., 3.42e-46, which is equivalent to 3.42 × 10⁻⁴⁶) represent a high probability of a significant alignment (or, equivalently, a low probability of a random alignment). Alternately, some programs use a traditional p-value to represent confidence (e.g., TargetP, SignalP, and TMHMM). In this case, a p-value near one represents a high probability of a significant result, while lower p-values represent a lower probability. It is crucial to make yourself aware of the particular confidence output of any given algorithm and to carefully consider what level of confidence you should require before acting on any of the predictions.
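The relationship between an E-value and a chance probability can be made explicit with a small calculation. The sketch below assumes the Poisson model used in BLAST-style statistics, in which E is the expected number of chance hits and the probability of seeing at least one such hit is 1 − e^(−E). Whether a given tool computes its E-values in exactly this way is an assumption to check against that tool's documentation, and the function name is hypothetical.

```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance hit, assuming hits follow a Poisson law."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e_value in (10.0, 1.0, 0.01, 3.42e-46):
        print(e_value, chance_hit_probability(e_value))
```

For small E-values the two numbers are practically identical, which is why an E-value such as 3.42e-46 can be read directly as a negligible chance probability. For E-values near or above one the two diverge: an E-value of 1 corresponds to a probability of about 0.63 of at least one chance hit.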

Conclusions

The number of predictive and informatics approaches to biology continues to increase every day. These tools bring substantial power for identifying the functions of an unknown protein (or protein-encoding nucleotide sequence) and can greatly speed progress in many studies. Nonetheless, anyone who uses the tools of bioinformatics should keep in mind that the “bio” comes first. No matter how low the E-value (or how high the p-value) may be, a prediction is not proof. All these tools should be used as the start of any experiment and only to tailor the kinds of actual experiments you will perform to investigate the protein you are studying.

Cross-References
▶ Predictions from Sequence
▶ Primary Structure
▶ Tertiary Structure Domains, Folds and Motifs

References
Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ (2004) PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins. Nucleic Acids Res 32(Webserver Issue):W400–W404
Emanuelsson O, Brunak S, von Heijne G, Nielsen H (2007) Locating proteins in the cell using TargetP, SignalP, and related tools. Nat Protoc 2:953–971
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
Pei J, Grishin NV (2001) GGDEF domain is homologous to adenylyl cyclase. Proteins 42:210–216
Petersen TN, Brunak S, von Heijne G, Nielsen H (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8:785–786
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD (2012) The Pfam protein families database. Nucleic Acids Res 40(Database Issue):D290–D301
Schwarz F, Aebi M (2011) Mechanisms and principles of N-linked protein glycosylation. Curr Opin Struct Biol 21:576–582

Sequence Motif
▶ Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications

Sequence Selectivity of DNA Damage
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition

Different reactive molecules prefer to react at different sites within strands of DNA, due to electronic and base-stacking effects.

Discussion

Different chemicals react with different bases and at different atoms on them, as discussed under ▶ “Selectivity of Chemicals for DNA Damage.” In addition, damage is usually not random, and some selectivity is seen along the DNA depending upon neighboring effects. The selectivity varies for different types of damage and may or may not predict mutational spectra. Damage by hydroxyl radicals (OH•) is very nonselective because of the very short lifetime of the species (Fig. 1). Therefore, hydroxyl radicals can be used in “footprinting” for agents that generate OH•. Dimethyl sulfate has also been used as a footprinting agent because of its nonselective reactivity. Some of the larger, more complex drugs show considerable sequence selectivity (Fig. 1; Hurley et al. 1988). Some compounds bind in the major groove of DNA and some in the minor. However, this difference does not have a pronounced effect on sequence selectivity, in that the grooves are continuous in B-DNA. Certain compounds may nevertheless target regions of DNA that adopt unusual forms. For example, chloroacetaldehyde has been reported to preferentially cross-link regions of Z-DNA. There is interest in developing drugs that will specifically bind to the G-quadruplex regions of telomeres for cancer therapy.


For the majority of carcinogens, there is modest sequence selectivity. Our experience is that there is generally 30 bp) repeats. Rad59 possibly assists Rad52 in strand annealing when the base pairing is limited. Alternatively, Rad59 functions to overcome the inhibitory effect of Rad51 on SSA. Therefore, deletion of RAD51 and RAD59 restores SSA activity close to wild-type level. Srs2, a Rad51-displacing helicase, also contributes to SSA. Successful annealing of repeat sequences forms unique recombination intermediates that contain one or two 3′ flaps (Lyndaker and Alani 2009). Cleaving 3′ flaps is a key step in SSA as it produces DNA ends with the 3′ OH, suitable for repair synthesis by DNA polymerases. An endonuclease complex, Rad1/Rad10 (XPF/ERCC1 in mammals), catalyzes the 3′ flap removal. Additionally, SSA requires proteins to stabilize the annealed intermediate, direct the binding of Rad1/Rad10 to 3′ flap intermediates, and confer cleavage specificity. For instance, the Msh2/Msh3 mismatch repair complex stabilizes annealed intermediate between shorter (5% mismatches do not support SSA efficiently. In yeast, SSA between divergent sequences is interrupted by unwinding of the annealed intermediates and strand dissociation (Lyndaker and Alani 2009). It is unknown whether similar mechanisms operate to suppress SSA in higher eukaryotes. Further studies on the SSA mechanism and its regulation will assess its contribution to chromosomal instability and uncover ways to modulate SSA events for clinical utility in humans.


Cross-References
▶ DNA Recombination, Mechanisms of

References
Ciccia A, McDonald N, West SC (2008) Structural and functional relationships of the XPF/MUS81 family of proteins. Annu Rev Biochem 77:259–287
Krogh BO, Symington LS (2004) Recombination proteins in yeast. Annu Rev Genet 38:233–271
Lyndaker AM, Alani E (2009) A tale of tails: insights into the coordination of 3′ end processing during homologous recombination. Bioessays 31:315–321
Motycka TA, Bessho T, Post SM, Sung P, Tomkinson AE (2004) Physical and functional interaction between the XPF/ERCC1 endonuclease and hRad52. J Biol Chem 279:13634–13639
Paques F, Haber JE (1999) Multiple pathways of recombination induced by double-strand breaks in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 63:349–404

Site-Specific Mutagenesis
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition

This is a process in which a single base in a piece of DNA is changed to a chemically modified version and the biological effects (e.g., mutation) are measured in living cells, either prokaryotic or eukaryotic.

Discussion

Site-specific mutagenesis is an approach to interrogating the mutagenic potential of a particular DNA lesion by placing it at a specific site in a fragment of DNA, inserting this vector into a cell, and then measuring the mutagenic events that occur (Fig. 1). (The process should not be confused with site-directed mutagenesis, in which individual amino acids of a protein are systematically replaced.) The method was first reported in 1984 by Prof. John Essigmann’s laboratory (Loechler et al. 1984), which constructed a vector containing O6-methylguanine and then demonstrated its mutagenicity in Escherichia coli. Following replication of the bacteria, the mutants can be screened using several methods, including a certain phenotype (if the mutation leads to a characteristic change in the function of a protein, e.g., lacZ and color differences), hybridization with sets of matching (labeled) oligonucleotides, restriction digest changes, and even sequence analysis (Fig. 2). Some recent approaches to analysis have included a RACE method and liquid chromatography-mass spectrometry of amplified replication products (Yuan and Wang 2008).

There are several considerations in designing site-specific mutagenesis studies. First, the adduct of interest must be stable enough to be synthesized (see entry ▶ “Synthesis of Modified Oligonucleotides”). There is a choice of using a single- or double-stranded vector. The use of the former will force cellular polymerases to copy past the adduct. If a double-stranded vector is used, then the polymerases may copy the opposite DNA strand if replication past an adduct is impeded (there are methods for distinguishing the strands to correct for this, however). Another issue is the host, be it bacteria, yeast, or mammalian cells. The outcome will depend upon which DNA polymerases are involved in replication, and today the complexity of the DNA polymerases is appreciated. One issue in the use of mammalian cells is that the inventory of DNA polymerases is seldom identified, so differences between hosts can be expected. Another issue is that almost all such studies with eukaryotic hosts take place in an “extrachromosomal” environment (which may not necessarily be relevant to events in the nucleus); this issue can only be avoided by systems involving chromosomal integration (Akasaka and Guengerich 1999), which are technically much more demanding. A further issue is whether DNA repair will occur before replication; it is possible to compare the extent of mutation in host strains with varying DNA repair capabilities to interrogate the type of DNA repair that occurs with each lesion. Overall, this approach has provided the most direct evidence that DNA adducts are mutagenic and, by extension, carcinogenic.

Site-Specific Mutagenesis, Fig. 1 A strategy for the preparation of single- and double-stranded vectors containing DNA adducts for site-specific mutagenesis. UDG uracil DNA glycosylase

Site-Specific Mutagenesis, Fig. 2 Analysis of results in site-specific mutagenesis

Cross-References ▶ Depurination ▶ DNA Damage by Endogenous Chemicals ▶ DNA Damage, Frequency of ▶ DNA Damage, Practical Screening for ▶ DNA Damage, Types of ▶ Selectivity of Chemicals for DNA Damage ▶ Sequence Selectivity of DNA Damage ▶ Synthesis of Modified Oligonucleotides

References Akasaka S, Guengerich FP (1999) Mutagenicity of site-specifically located 1,N2-ethenoguanine in Chinese hamster ovary cell chromosomal DNA. Chem Res Toxicol 12:501–507 Loechler EL, Green CL, Essigmann JM (1984) In vivo mutagenesis by O6-methylguanine built into a unique site in a viral genome. Proc Natl Acad Sci USA 81:6271–6275 Yuan B, Wang Y (2008) Mutagenic and cytotoxic properties of 6-thioguanine, S6-methylthioguanine, and guanine-S6-sulfonic acid. J Biol Chem 283:23665–23670


Site-Specific Recombination ▶ Conservative Site-Specific Recombination

Slip Mispairing ▶ Mismatch Repair

Sodium-Dependent Glucose Transport ▶ Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols

Solvent Denaturation ▶ Chemical Denaturation

Somatic Recombination ▶ V(D)J Recombination

Spectroscopy of Damaged DNA Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Definition UV, fluorescence, circular dichroism, NMR, and other spectroscopic approaches can be used to study the properties of modified DNA (usually at the oligonucleotide level).

Discussion Long pieces of DNA with modifications are generally not very amenable to spectroscopy in that the spectral properties of the unmodified DNA bases obscure those of the lesion. Most sophisticated spectroscopic studies have been done with oligonucleotides 18 bases in length and include ultraviolet (UV), fluorescence, circular dichroism (CD), and NMR studies. The synthesis of modified oligonucleotides is discussed under “Synthesis of Modified Oligonucleotides.”

Some modified bases have perturbed UV spectral properties, although the changes are usually not very dramatic (e.g., O6-alkyl guanine, N7-alkyl guanine) (Fig. 1a). Nevertheless, these can be utilized in titrations, etc. (Persmark and Guengerich 1994). C8-aryl modifications can be more dramatic (Humphreys et al. 1992). The UV spectra of individual bases and nucleosides can also be used in their identification (Marsch et al. 2001).

Several modified DNA bases are fluorescent, including O6-alkyl guanine, N2,3-ε-guanine, and N7-alkyl guanine (weak). Some adducts, such as pyrene derivatives (derived from polycyclic hydrocarbons), carry fluorescent moieties. Fluorescence measurements can be utilized in the analysis of nucleotides (or even modified bases), e.g., online with HPLC. Fluorescence measurements in oligonucleotides are more problematic in that the fluorescence is usually strongly quenched by neighboring bases. However, there are some useful strategies for using fluorescence with oligonucleotides. One approach is to attach a highly fluorescent label at the end of the oligonucleotide, e.g., 5-[N-(3′-diphenylphosphinyl-4′-methoxycarbonyl)phenylcarbonyl)aminoacetamido]fluorescein (FAM). Titrations can then be done with proteins to estimate binding affinity. Another approach is to incorporate a relatively highly fluorescent base (not corresponding to an adduct) in the middle of the oligonucleotide, e.g., 2-amino purine or 6-methylpyrrole[2,3-d]pyrimidine-2(3H)-one deoxyribonucleoside (pyrroleC). These pair well with T and G and have been utilized in reporting base-flipping kinetics, in that their fluorescence is enhanced in extra-helical configurations (Zang et al. 2005).
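The protein titrations mentioned above are commonly analyzed by fitting the fluorescence signal to a binding model. The following Python sketch is an illustration only: it assumes simple 1:1 binding with the protein in large excess over the labeled oligonucleotide, and the concentrations, signal levels, grid search, and function names are all hypothetical rather than taken from the cited studies.

```python
def fraction_bound(protein_conc, kd):
    """Fraction of labeled oligonucleotide bound, assuming 1:1 binding
    and protein in large excess (simple hyperbolic isotherm)."""
    return protein_conc / (kd + protein_conc)


def predicted_signal(protein_conc, kd, f_free, f_bound):
    """Fluorescence expected for a mixture of free and bound oligonucleotide."""
    theta = fraction_bound(protein_conc, kd)
    return f_free + (f_bound - f_free) * theta


def estimate_kd(concs, signals, f_free, f_bound, kd_grid):
    """Return the Kd on the grid that minimizes the sum of squared residuals."""
    def sse(kd):
        return sum(
            (predicted_signal(c, kd, f_free, f_bound) - s) ** 2
            for c, s in zip(concs, signals)
        )
    return min(kd_grid, key=sse)


# Hypothetical, noise-free titration (concentrations in nM, arbitrary signal units).
concs = [0, 5, 10, 25, 50, 100, 250, 500]
true_kd = 40
signals = [predicted_signal(c, true_kd, 1.0, 0.6) for c in concs]

best_kd = estimate_kd(concs, signals, 1.0, 0.6, range(1, 201))
print(best_kd)
```

With real data the signals are noisy, the free and bound signal levels are usually fitted as well, and a quadratic binding equation is needed when the oligonucleotide concentration is not negligible relative to the dissociation constant.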

Spectroscopy of Damaged DNA, Fig. 1 (a) UV spectra of some guanosine derivatives (Guengerich et al. 1999). (b) CD spectra of double-stranded oligonucleotides showing the effect of N7-guanine alkylation at varying temperatures (Kim and Guengerich 1993)

Spectroscopy of Damaged DNA, Fig. 2 Two-dimensional nuclear Overhauser effect correlated spectroscopy (NOESY) spectrum of a double-stranded oligonucleotide containing 1,N2-ε-deoxyguanosine (X) (Shanmugam et al. 2010). The lines are drawn to show the connectivity in making assignments

NMR approaches have been utilized extensively, including 1H, 13C, and 31P NMR. Two-dimensional methods can be used to “walk” through sequences and establish the origin of each signal (Fig. 2). Adducts perturb the backbone, the structure of which can be established with the use of the NMR data and molecular modeling. Another application is the direct observation of H bonds in a downfield 1H position.

Oligonucleotides have CD spectra that report the DNA structure, e.g., B, Z, A, . . . (Fig. 1b). CD spectra are useful in ascertaining the effects of certain modifications on DNA structure (Kim and Guengerich 1993). A related spectroscopic technique is linear dichroism (Shahbaz et al. 1986).

Cross-References ▶ Adducts on Tm, Effects of ▶ Base Intercalation in DNA ▶ Damaged DNA, Analysis of ▶ DNA Base Pairing, Modes of ▶ Hydrolytic, Deamination, and Rearrangement Reactions of DNA Adducts ▶ Kinetics of DNA Damage ▶ Selectivity of Chemicals for DNA Damage ▶ Sequence Selectivity of DNA Damage ▶ Synthesis of Modified Oligonucleotides ▶ Ultraviolet Light DNA Damage

References Guengerich FP, Mundkowski RG, Voehler M et al (1999) Formation and reactions of N7-aminoguanosine and derivatives. Chem Res Toxicol 12:906–916 Humphreys WG, Kadlubar FF, Guengerich FP (1992) Mechanism of C8 alkylation of guanine residues by activated arylamines: evidence for initial adduct formation at the N7 position. Proc Natl Acad Sci U S A 89:8278–8282 Kim MS, Guengerich FP (1993) Interactions of N7-guanyl methyl- and thioether-substituted d(CATGCCT) derivatives with d(AGGNATG). Chem Res Toxicol 6:900–905 Marsch GA, Mundkowski RG, Morris BJ et al (2001) Characterization of nucleoside and DNA adducts formed by S-(1-acetoxymethyl)glutathione and implications for dihalomethane-glutathione conjugates. Chem Res Toxicol 14:600–608 Persmark M, Guengerich FP (1994) Spectroscopic and thermodynamic characterization of the interaction of N7-guanyl thioether derivatives of d(TGCTG*CAAG) with potential complements. Biochemistry 33:8662–8672 Shahbaz M, Geacintov NE, Harvey RG (1986) Noncovalent intercalative complex formation and kinetic flow linear dichroism of racemic syn- and anti-benzo[a]pyrenediol epoxide-DNA solutions. Biochemistry 25:3290–3296 Shanmugam G, Kozekov ID, Guengerich FP et al (2010) Structure of the 1,N2-etheno-2′-deoxyguanosine lesion in the 3′-G(εdG)T-5′ sequence opposite a one-base deletion. Biochemistry 49:2615–2626 Zang H, Goodenough AK, Choi JY et al (2005) DNA adduct bypass polymerization by Sulfolobus solfataricus DNA polymerase Dpo4: analysis and crystal structures of multiple base pair substitution and frameshift products with the adduct 1,N2-ethenoguanine. J Biol Chem 280:29750–29764

Streptophyte Algae ▶ Mitochondrial Genomes of Green, Red and Glaucophyte Algae

Stress Granules ▶ Cytoplasmic mRNA, Regulation of

Structural Repeat ▶ Repeating Sequences in Proteins: Their Identification and Structural/Functional Implications

Sunlight Damage to DNA ▶ Ultraviolet Light DNA Damage

Symporter ▶ Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols


Synthesis of Modified Oligonucleotides Frederick Peter Guengerich Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synonyms DNA synthesis

Definition Oligonucleotides are short pieces of DNA (5–100 nucleotides long) that can be prepared with a chemical modification at a single site in order to study biological and physical properties.

Discussion In order to study either the chemical or biological consequences of DNA damage at a mechanistic level, it is necessary to prepare oligonucleotides containing a chemically defined entity relevant to the DNA damage. The lesion must correspond to the damage caused by modification (or be a reasonable facsimile) and be at a specific site. Several methods can be used for synthesis of modified oligonucleotides: (i) Reaction of a “normal” oligonucleotide with a reactive chemical, i.e., a “biomimetic” route resembling what might occur following oxidative or other enzymatic bioactivation of the chemical. The problem with this approach is that the selectivity of the reaction is generally not absolute, and thus difficult separation methods may be necessary. (ii) Alternatively, a modified nucleoside can be prepared by either biomimetic (see above) or non-biomimetic routes, functionalized at the 3′- and 5′-positions, and used in a chemical oligonucleotide synthesis (Fig. 1). A requirement is that the modified base be stable to the protection and deprotection schemes used in oligonucleotide synthesis. (iii) An alternative approach involves synthesis of a nucleoside triphosphate containing the modified base and incorporation of it into an oligonucleotide (using a scaffold template that can be removed by enzymatic reaction). This procedure is not often utilized because (a) conversion of modified nucleosides to triphosphates is not trivial (Hoard and Ott 1965) and (b) the amount of oligonucleotide that can be produced is limited.

The criteria for purity and identity of modified oligonucleotides are very strict, particularly if biological experiments are to be done (Guengerich 2006). Purification is usually done by HPLC (for shorter oligonucleotides) or preparative electrophoresis. Methods used to establish purity and identity include capillary electrophoresis and mass spectrometry (usually matrix-assisted laser desorption/ionization time-of-flight, MALDI-TOF).

Synthesis of Modified Oligonucleotides, Fig. 1 Solid-phase oligonucleotide synthesis (Courtesy of Prof. C. J. Rizzo)

Cross-References ▶ Adducts on Tm, Effects of ▶ DNA Base Pairing, Modes of ▶ DNA Damage, Types of ▶ Selectivity of Chemicals for DNA Damage ▶ Site-Specific Mutagenesis ▶ Spectroscopy of Damaged DNA

References Guengerich FP (2006) Interactions of carcinogen-bound DNA with individual DNA polymerases. Chem Rev 106:420–452 Hoard DE, Ott DG (1965) Conversion of mono- and oligodeoxyribonucleotides to 5′-triphosphates. J Am Chem Soc 87:1785–1788


Synthetic Plasmid Biology Mikkel Bentzon-Tilia (1), Søren Johannes Sørensen (2) and Lars Hestbjerg Hansen (2, 3) (1) Department of Systems Biology, The Technical University of Denmark, Lyngby, Denmark (2) Section for Microbiology, Department of Biology, University of Copenhagen, Copenhagen, Denmark (3) Department of Environmental Science, Aarhus University, Roskilde, Denmark

Synopsis Synthetic biology provides an opportunity to explore our understanding of plasmid genomes by creating new genomes incorporating putatively important elements and then testing to see if those predictions are correct. Since plasmids are useful as tools in many contexts and yet do not always turn out to have ideal properties, such creation of novel plasmids has many applications. Creation of small vectors has been achieved for some time, but making a self-transmissible plasmid encoding a full conjugative apparatus is more challenging. Design and synthesis of an artificial IncX plasmid represented a suitable first challenge since these plasmids are the smallest known conjugative plasmids and enough members have been studied to reveal the common properties that should be incorporated. Success in this project provides encouragement to create more BioBricks that represent the building blocks for plasmid genomes from which new plasmid-based tools can be assembled for the future.

Introduction DNA synthesis technologies have in recent years experienced a revitalization with the emergence of the field of “synthetic biology,” which covers the creation of novel biological systems or replica constructions mimicking already existing biological systems (Pleiss 2006). Over the past four decades, autonomously replicating DNA constructs of increasing size and complexity have been successfully synthesized, culminating in 2010 with the synthesis of a Mycoplasma mycoides chromosome capable of controlling a bacterial cell (Gibson et al. 2010). The development of synthetic biology as a separate field of research has in large part been facilitated by advances in genome sequencing, with publicly available sequences counting more than 2 × 10⁸ sequence entries in GenBank as of February 2012 (including incomplete genome sequences from whole genome shotgun projects). Furthermore, since the cost per base for gene synthesis has dropped by more than two orders of magnitude over the past decade (Carlson 2009; Wu et al. 2007), methodologies for heterologous gene expression have shifted toward de novo synthesis rather than conventional cloning.

The First Synthetic Plasmids Plasmid genomes have, due to their relatively low level of complexity and small size, served as a starting point for the creation of multi-gene synthetic DNA constructions. Thus, the first synthetic DNA construct encoding multiple genes was a 2 kb plasmid containing a pUC replicon, the β-lactamase gene bla, and a lacZ gene fragment (Mandecki et al. 1990). In contrast to the parent pUC plasmids, 50% of the inherent restriction sites were eliminated and transcription terminators were introduced downstream from the bla gene and lacZ gene fragment, thus optimizing the construct for subsequent cloning and gene expression analysis. The plasmid was maintained stably in Escherichia coli for more than 120 generations using ampicillin as a selective agent. The creation of a second pUC plasmid was reported in 1995 with the synthesis of the 2.7 kb pUC18 derivative pUC182Sfi (Stemmer et al. 1995). In contrast to the “FokI method for gene synthesis” applied by Mandecki and Bolling (1988), which relies on cloning and ligation of individual oligonucleotides, pUC182Sfi was synthesized from 134 40-nt oligonucleotides in a single reaction vessel by PCR assembly.

Since the synthesis of pUC182Sfi, synthetic replicas of autonomously replicating DNA systems have progressively increased in size. Nonetheless, the synthesis of larger constructs is still cumbersome today, as small oligonucleotides have to be synthesized, purified, and assembled into larger fragments. Hence, the main constraint on this progression has arguably been a matter of available resources. The synthesis of Mb-size DNA systems gives evidence of the potential of DNA synthesis as a means of ultimately creating novel synthetic organisms, yet the bottom-up, rational design of autonomously replicating DNA systems has in large part been constrained by a dearth of understanding of gene function and synteny. Plasmid genomes have in this respect provided a basis for the progression of synthetic biology toward a more design- than replica-oriented approach, as illustrated by the design and synthesis of the novel, conjugative IncX1 plasmid pX1.0 (Hansen et al. 2011, Fig. 1).
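The single-vessel assembly described for pUC182Sfi starts from many short oligonucleotides that together cover both strands of the plasmid. As a rough illustration only, the following Python sketch tiles a circular sequence into 40-nt top-strand oligonucleotides and 40-nt bottom-strand oligonucleotides offset by 20 nt, so that each oligonucleotide overlaps two on the opposite strand. The 20-nt offset, the random 2,680-bp toy sequence, and all function names are assumptions made for this example; they are not taken from Stemmer et al. (1995).

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_circular(seq, oligo_len=40):
    """Split a circular sequence into top- and bottom-strand oligonucleotides.

    Bottom-strand oligonucleotides are shifted by half an oligonucleotide so
    that each one bridges two neighboring top-strand oligonucleotides.
    """
    n = len(seq)
    if n % oligo_len:
        raise ValueError("sequence length must be a multiple of oligo_len")
    half = oligo_len // 2
    doubled = seq + seq  # lets slices wrap around the origin of the circle
    top = [doubled[i:i + oligo_len] for i in range(0, n, oligo_len)]
    bottom = [
        reverse_complement(doubled[i + half:i + half + oligo_len])
        for i in range(0, n, oligo_len)
    ]
    return top, bottom


# Hypothetical toy sequence; the length is chosen so the arithmetic is simple.
toy = "".join(random.Random(0).choices("ACGT", k=2680))
top, bottom = tile_circular(toy)
print(len(top) + len(bottom))
```

For a sequence of this toy length, 67 oligonucleotides per strand are produced, which shows how a plasmid of roughly 2.7 kb can be covered on both strands by on the order of 134 oligonucleotides of 40 nt.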

Synthetic Plasmid Biology, Fig. 1 Genetic map of pX1.0. ORFs are depicted as color-coded arrows indicating their direction of transcription. The ORFs are arranged into the following seven modules: gem (gene expression modulation module), tra (transfer module), mob1 (mobilization module 1), rep/stb (plasmid replication/plasmid stability module), mob2 (mobilization module 2), par (partitioning module), and res (resistance marker module). Modules are flanked by the indicated restriction sites. MCS: multiple cloning site (From Hansen et al. 2011)

A Synthetic Conjugative Plasmid The IncX1 group consists of mostly conjugative plasmids that are predominantly found in enteric bacteria. As is often the case with conjugative plasmids, the gene content of IncX1 plasmids can be divided into plasmid-selfish genes and accessory genes integrated on transposable elements. The observed accessory genes, or “the genetic load” of the IncX1 plasmids, range from series of un-conserved genes encoding hypothetical proteins to antibiotic resistance genes and genes encoding different cell adhesion factors (Norman et al. 2008; Burmølle et al. 2008). Members of the IncX1 group of plasmids without genetic loads have yet to be observed in nature, and thus the intention of designing pX1.0 was to create such an archetypical IncX1 plasmid. The pX1.0 sequence was derived as the consensus gene content of five related IncX1 plasmids, thus representing a plasmid backbone containing the putative essential plasmid-selfish genes of the IncX1 plasmid group. A total of 41 open reading frames (ORFs) were included in the construct, along with the chloramphenicol acetyltransferase-encoding gene cat. The IncX1 replicon, comprising the α, β, and γ origins of replication as well as two genes encoding the replication initiation proteins π and Bis, was included in the pX1.0 sequence. A 15.8 kb region encoding the entire type IV secretion system as well as the DNA transfer and post-conjugal replication system of the IncX1 plasmids was included as well. Genes encoding plasmid stability determinants characteristic of the IncX1 plasmids, i.e., the StbDE post-segregational killing system and the ParFG active partitioning system, were also retained in the sequence along with conserved ORFs related to gene expression modulation and 12 ORFs of as yet unknown function. The inherent modularity of conjugative plasmids facilitated the segregation of pX1.0 into seven modules flanked by unique restriction sites. These sites were chosen on the basis of a normalized frequency index of the restriction sites in the plasmid database and selected bacterial chromosomes, greatly enhancing subsequent manipulation and adding the prospect of using components of the plasmid in other contexts. pX1.0 maintained itself stably for 50 generations without selective pressure and behaved as expected compared to its nearest relatives.
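The idea of deriving a backbone as the gene content shared by several related plasmids, with everything else treated as genetic load, can be sketched as a set intersection. The Python example below is a simplified illustration: the plasmid names and gene labels are hypothetical placeholders, and genes are matched by label only, whereas a real comparison would rely on sequence similarity between ORFs.

```python
# Hypothetical gene inventories for five related plasmids (labels are placeholders).
shared = {"pir", "bis", "stbD", "stbE", "parF", "parG"}
plasmid_genes = {
    "plasmid_1": shared | {"resistance_gene_1"},
    "plasmid_2": shared | {"adhesion_factor_1"},
    "plasmid_3": shared | {"hypothetical_1", "hypothetical_2"},
    "plasmid_4": shared | {"resistance_gene_2"},
    "plasmid_5": shared | {"adhesion_factor_1", "hypothetical_3"},
}

# Backbone: genes present in every plasmid.
backbone = set.intersection(*plasmid_genes.values())

# Genetic load: genes of each plasmid that are not part of the backbone.
load = {name: genes - backbone for name, genes in plasmid_genes.items()}

print(sorted(backbone))
for name in sorted(load):
    print(name, sorted(load[name]))
```

A gene carried by only some of the plasmids, such as the placeholder `adhesion_factor_1`, is excluded from the backbone even though it occurs more than once.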

Synthetic Plasmid Biology, Fig. 2 Standard assembly of BioBrick® parts. A BioBrick® part (a) is inserted upstream from a second BioBrick® part (b), in this case both carried on the high-copy BioBrick® assembly plasmid pSB1AC3. The XbaI/SpeI scar created at the junction upon ligation of the insert and the vector is subsequently indigestible

Modularity of Plasmids Is Suited to Synthetic Biology Not only have the relative simplicity and small size of plasmid genomes facilitated the successful design of various synthetic plasmids, but the inherent modularity of plasmids has also provided the possibility of making a module-based toolbox of functional plasmid components, enabling the construction of novel DNA systems with specified properties, which may be useful in relation to the construction of novel cloning vectors. Attempts to implement standardized modules for genetic engineering in bacteria have been made previously, but the emergence of synthetic biology has enabled the creation of a system containing a registry of standardized, compatible parts known as BioBricks®. The system provides a catalogue of parts that are easily combined to create biological circuitry that elicits a specified behavior in the host cell. The assembly of BioBrick® parts is enabled by a prefix and a suffix that contain standard restriction sites (Fig. 2). The prefix consists of an EcoRI site and an XbaI site, whereas the suffix contains SpeI and PstI restriction sites. Upon part assembly, one part, termed the insert, is excised from the carrying vector by digestion either with EcoRI and SpeI to create a front insert or with XbaI and PstI to create a back insert. The terms “back” and “front” refer to where the part is inserted in relation to the second part. The fragment between the restriction sites in the prefix or suffix is removed by endonuclease digestion from the vector carrying the second BioBrick® part to create either a front vector or a back vector. Subsequently, front inserts are ligated with front vectors and back inserts are ligated with back vectors. In either case, an indigestible XbaI/SpeI ligation takes place at the part junction, rendering the fusion construct a new BioBrick® part in itself. An 8 bp scar will be present at every junction, which poses a problem in relation to ribosomal binding sites and the assembly of protein coding sequences. Alternative assembly strategies are available through BioBricks®, but entirely different assembly strategies may be preferable in some cases. For a review of different assembly techniques, see Ellis et al. (2011). The BioBricks® Registry of Standard Biological Parts has since 2003 accumulated approximately 7,000 available standard parts including 189 plasmid backbones, all of which have different properties for facilitation of part assembly under different conditions and with different host chassis.
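The formation of the indigestible XbaI/SpeI junction can be followed on the top strand with a few string operations. The Python sketch below is a simplified illustration, not a full digestion and ligation model: the prefix and suffix sequences are those commonly given for the original BioBrick standard and are an assumption here (the text above only names the restriction sites), and the two part sequences are arbitrary placeholders.

```python
# Assumed standard prefix/suffix (not given in the text above).
PREFIX = "GAATTCGCGGCCGCTTCTAGAG"  # contains EcoRI (GAATTC) and XbaI (TCTAGA)
SUFFIX = "TACTAGTAGCGGCCGCTGCAG"   # contains SpeI (ACTAGT) and PstI (CTGCAG)
XBAI = "TCTAGA"  # cuts T^CTAGA on the top strand
SPEI = "ACTAGT"  # cuts A^CTAGT on the top strand


def wrap(part):
    """Flank a part with the assumed prefix and suffix."""
    return PREFIX + part + SUFFIX


def join_parts(upstream, downstream):
    """Fuse two wrapped parts via the SpeI end of the first and the XbaI end of the second."""
    up, down = wrap(upstream), wrap(downstream)
    spe_pos = up.rindex(SPEI)    # SpeI site in the suffix of the upstream part
    xba_pos = down.index(XBAI)   # XbaI site in the prefix of the downstream part
    left = up[:spe_pos + 1]      # keep everything through the A of A^CTAGT
    right = down[xba_pos + 1:]   # keep everything after the T of T^CTAGA
    return left + right


part_a, part_b = "ATGAAA", "GGGCCC"  # arbitrary placeholder parts
fused = join_parts(part_a, part_b)
start = fused.index(part_a) + len(part_a)
scar = fused[start:fused.index(part_b, start)]
print(scar, len(scar))
```

The junction contains the mixed site ACTAGA, which matches neither the XbaI nor the SpeI recognition sequence, so the fused construct keeps exactly one XbaI site in its prefix and one SpeI site in its suffix and can be used as a part in the next round of assembly.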


Synthesis of Plasmids Known Only from Their Sequence The diverse range of known plasmids offers a range of properties of potential interest in the design of circuitry for controlling bacteria, or as components in novel cloning vectors. Examples of such properties are the establishment of more or less discriminatory cell-to-cell contact, dissemination of DNA and/or proteins to a diverse range of recipient cells, maintenance and replication of DNA systems in a broad range of hosts, segregation of DNA elements within a bacterial cell, and control of gene expression, to name a few. Many such properties may not yet be known, since the plasmids encoding them have not been isolated and characterized. The host of a given plasmid may be an un-culturable organism, or it may have become extinct at some point in time. By creating biology from bits and bytes, synthetic biology presents an opportunity to retrieve such property-encoding components. Entire genome sequences of previously unknown plasmids have been obtained from metagenomic data sets (see ▶ Metamobilomics – The Plasmid Metagenome of Natural Environments), yet only a limited insight into their biological properties can be obtained from sequence analysis alone. Synthesis of such plasmids offers an intriguing means of accessing a hitherto unavailable pool of plasmids and could thus offer a source of novel components for vector design and synthetic circuits, as well as a means of gaining insight into the properties of an uncharacterized part of the global plasmid pool.

Cross-References ▶ Conjugative Transfer Systems and Classifying Plasmid Genomes ▶ Metamobilomics – The Plasmid Metagenome of Natural Environments ▶ Plasmid Cloning Vectors ▶ Plasmid Genomes, Introduction to ▶ Plasmid Incompatibility

References Burmølle M, Bahl MI, Jensen LB, Sørensen SJ, Hansen LH (2008) Type 3 fimbriae, encoded by the conjugative plasmid pOLA52, enhance biofilm formation and transfer frequencies in Enterobacteriaceae strains. Microbiology 154:187–195 Carlson R (2009) The changing economics of DNA synthesis. Nat Biotechnol 27:1091–1094 Ellis T, Adie T, Baldwin GS (2011) DNA assembly for synthetic biology: from parts to pathways and beyond. Integr Biol 3:109–118 Gibson DG, Glass JI, Lartigue C, Noskov VN, Chuang RY et al (2010) Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329:52–56 Hansen LH, Bentzon-Tilia M, Bentzon-Tilia S, Norman A, Rafty L et al (2011) Design and synthesis of a quintessential self-transmissible IncX1 plasmid, pX1.0. PLoS One 6(5):e19912 Mandecki W, Bolling TJ (1988) FokI method for gene synthesis. Gene 68:101–107 Mandecki W, Hayden MA, Shallcross MA, Stotland E (1990) A totally synthetic plasmid for general cloning, gene expression and mutagenesis in Escherichia coli. Gene 94:103–107 Norman A, Hansen LH, She Q, Sørensen SJ (2008) Nucleotide sequence of pOLA52: A conjugative IncX1 plasmid from Escherichia coli which enables biofilm formation and multidrug efflux. Plasmid 60:59–74 Pleiss J (2006) The promise of synthetic biology. Appl Microbiol Biotechnol 73:735–739 Stemmer WP, Crameri A, Ha KD, Brennan TM, Heyneker HL (1995) Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164:49–53 Wu G, Dress L, Freeland SJ (2007) Optimal encoding rules for synthetic genes: the need for a community effort. Mol Syst Biol 3:134

Systems Biology ▶ Differential Equations and Chemical Master Equation Models for Gene Regulatory Networks


T

TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation James Marion Department of Chemistry and Biochemistry, University of California, San Diego, CA, USA

Synopsis Originally discovered as a kinase that interacted with the effector proteins TANK and TRAF2 in a ternary complex that could activate NF-κB, TANK-binding kinase 1 (TBK1) has since been characterized as a key regulator of processes ranging from cell proliferation and vesicle transport to xenophagic elimination of bacteria and antiviral immune response. Also known as NAK (NF-κB-activating kinase) or T2K (TRAF2-associated kinase), TBK1 is a ubiquitously expressed 729-amino-acid serine/threonine kinase that is a noncanonical IκB kinase family member, targeting the transcription factors IRF3 and IRF7 in the type I interferon response. TBK1 is composed of an N-terminal kinase domain (KD), which contains an activation loop between subdomains VII and VIII controlling its catalytic activity, and three C-terminal regulatory domains: a ubiquitin-like domain (ULD), which interacts with the KD rather than the known ubiquitin-binding proteins and appears to be necessary for substrate presentation and full activation of the kinase, a leucine zipper-containing dimerization domain (DD), and a small helix-loop-helix protein interaction module that has been termed the adaptor-binding (AB) motif. From TBK1 structural studies, the kinase has been shown to be a dimer with the KD, ULD, and DD forming a three-way interface to position the kinase active sites away from one another and prevent productive autophosphorylation events required for TBK1 activation. Upon recognition of an upstream signal and recruitment to signaling complexes via its AB motif, TBK1 becomes activated either through local clustering of TBK1 molecules (which mediates trans-activation by autophosphorylation) or by IKKβ phosphorylation of TBK1 serine residue 172 within its KD activation loop, both of which lead to the formation of a productive active site center. TBK1 has been shown to phosphorylate several substrates involved in a multitude of molecular events. Best understood today for its role in the induction of type I interferons, activated TBK1 has been shown to target two sites (seven serine/threonine residues) near the C-terminus of the transcription factor IRF3, which induces the production of type I interferons and facilitates an immediate inflammatory response; TBK1 also participates in vesicle transport and xenophagic elimination of bacteria. While TBK1 activation can promote host survival during infection, an unchecked TBK1 response can lead to detrimental effects ranging from cellular proliferation in transformed cells and insulin resistance to numerous autoimmune disorders, obesity, and glaucoma. For this reason, effector molecules have been sought and identified that control TBK1 activation and prevent its rampant activation from negating its beneficial effects.

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Introduction Originally discovered as a kinase that interacted with the effector proteins TANK and TRAF2 in a ternary complex that could activate NF-κB (Pomerantz and Baltimore 1999), TANK-binding kinase 1 (TBK1) has since been characterized as a key regulator of processes ranging from cell proliferation (Ou et al. 2011) and vesicle transport (Da et al. 2011) to the xenophagic elimination of bacteria (Gleason et al. 2011; Wild et al. 2011) and the antiviral immune response (Fitzgerald et al. 2003; Sharma et al. 2003; McWhirter et al. 2004; Perry et al. 2004). TBK1 (Pomerantz and Baltimore 1999), also known as NAK (NF-κB-activating kinase) (Tojima et al. 2000) or T2K (TRAF2-associated kinase) (Bonnard et al. 2000), is a 729-amino-acid serine/threonine kinase that is a ubiquitously expressed noncanonical IκB kinase family member. The canonical IκB kinases, IKKα and IKKβ, phosphorylate IκBα in the NF-κB signaling pathway (Liu et al. 2012), while TBK1 targets the transcription factors IRF3 and IRF7 in the type I interferon response (Fitzgerald et al. 2003). TBK1 is composed of an N-terminal KD (kinase domain), which contains an activation loop between subdomains VII and VIII that controls its catalytic activity, and three C-terminal regulatory domains: a ULD (ubiquitin-like domain), which does not interact with known ubiquitin-binding proteins but rather with the KD and appears to be necessary for the full activation of the kinase along with substrate presentation; a leucine zipper-containing DD (dimerization domain); and finally a small helix-loop-helix protein interaction module that has been termed the AB (adaptor-binding) motif (Fig. 1) (Ma et al. 2012; Larabi et al. 2013; Tu et al. 2013). From TBK1 structural studies, the kinase has been shown to be a dimeric entity in which the KD, ULD, and DD form a three-way interface that positions the kinase active sites, consisting of the critical activation loops, away from one another (Fig. 2) (Ma et al. 2012). This configuration is postulated to prevent the productive autophosphorylation events that, as discussed below, are required for TBK1 activation.

TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 1 Structure of TBK1. (a) Structure of the kinase domain of dimeric TBK1 with a potent small molecule inhibitor, BX795 (depicted in red spheres), bound to the active site (PDB ID: 4EUT). (b) Structure of the asymmetric unit of TBK1. The KD is depicted in red spheres, the ULD in yellow spheres, and the DD in green spheres (PDB ID: 4IM0). (c) Ribbon diagram of the domain structure of TBK1 highlighting the KD, ULD, and SDD of the kinase (PDB ID: 4IM0)

TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 2 Structural organization of the inactive TBK1 molecule. (a) Structural configuration of the KD and ULD of inactive TBK1. The activation loop of each monomer is depicted in sphere representation with the critical S172 residue highlighted in red (PDB ID: 4EUT). (b) Detailed view of a single chain of TBK1 in the inactive state. The activation loop is again represented in green spheres with the critical S172 residue in red (PDB ID: 4EUT)


TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 3 Multiple signaling pathways converge to activate TBK1

Several signaling pathways converge to activate TBK1 (Fig. 3) (Fitzgerald et al. 2003; Sharma et al. 2003; Chau et al. 2008). Upon recognition of an upstream signal, TBK1 is recruited to signaling

complexes via its AB motif (Larabi et al. 2013). Once there, TBK1 requires phosphorylation of serine residue 172 within its kinase domain activation loop to form a productive active site center


TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 4 Active conformation of TBK1 kinase domain. (a) Structural configuration of the KD for active TBK1 with a phosphorylated S172 residue. The activation loop of each monomer is depicted in sphere representation with the critical S172 residue highlighted in red (PDB ID: 4EUU). (b) Detailed view of a single chain of TBK1 in an active conformation with phosphorylated S172. The activation loop is again represented in green spheres with the critical S172 residue in red (PDB ID: 4EUU)

(Fig. 4) (Kishore et al. 2002). This event has been shown to be achieved through the kinase activity of IKKβ (Clark et al. 2011). Recent crystal structures of the KD and ULD of TBK1 have also shown that this posttranslational modification can occur through a transphosphorylation event mediated by the local clustering of TBK1 molecules upon recognition of a signal (Ma et al. 2012). In this case, neighboring TBK1 molecules interact via an activation loop-swapped conformation. These interactions supply critical structural contacts required to achieve an active kinase conformation and place the activation loop within the catalytic cleft of the adjacent TBK1 KD for phosphotransfer to S172. While kinetic analysis suggests that the initial loop-swapped phosphorylation mechanism is slow, once activated, TBK1 readily phosphorylates the activation loop sequences of the remaining unphosphorylated TBK1 molecules, leading to activation of any dormant TBK1 molecules (Ma et al. 2012). Upon autophosphorylation of S172, the TBK1 activation loop folds back onto the C-terminal lobe of its KD to complete an apparent binding site for polypeptide substrates (Larabi et al. 2013).

Investigations into the composition of this site, along with TBK1 substrate sequence alignments, have suggested that a consensus sequence for TBK1 phosphorylation exists (Ma et al. 2012). This sequence reveals that the kinase favors a hydrophobic residue immediately following the modified serine (P0), which is mirrored by several hydrophobic residues lining the P+1 position. Indeed, the TBK1 activation loop sequence, which is phosphorylated by activated TBK1, contains a leucine at the P+1 position. Similarly, a more detailed sequence analysis found that hydrophobic residues occupy the P+1 position in approximately 70% of TBK1 substrates (Soulat et al. 2008; Ma et al. 2012). Other attempts to derive a TBK1 phosphorylation consensus sequence have suggested that polar amino acids are also required at the P+3, P+5, and P+8 positions, or that specific amino acids, defined in each study, are required at the P+1, P+3, and P−2 positions, but analysis of a large panel of natural TBK1 substrates suggests that the most prominent requirement is the hydrophobic residue at the P+1 position (Fig. 5) (Soulat et al. 2008; Ma et al. 2012).
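The P+1 preference described above can be illustrated with a short Python sketch. This is only a toy scan under stated assumptions, not a validated predictor: the function name `candidate_sites`, the choice of residues counted as hydrophobic, and the example peptide are all hypothetical and are not taken from the cited studies.

```python
# Toy scan for serines followed by a hydrophobic residue at P+1.
# ASSUMPTION: the residue set below is one common, illustrative definition
# of "hydrophobic"; the cited studies may have used a different one.
HYDROPHOBIC = frozenset("AVLIMFWY")


def candidate_sites(sequence, hydrophobic=HYDROPHOBIC):
    """Return (1-based serine position, P+1 residue) for each serine
    that is immediately followed by a hydrophobic residue."""
    seq = sequence.upper()
    sites = []
    # Stop one residue early: a C-terminal serine has no P+1 position.
    for i in range(len(seq) - 1):
        if seq[i] == "S" and seq[i + 1] in hydrophobic:
            sites.append((i + 1, seq[i + 1]))
    return sites


# Hypothetical peptide (not a real protein sequence): the serine at
# position 6 is followed by leucine, the serine at position 10 by glutamate.
toy_peptide = "AAQFVSLYGSEKA"
print(candidate_sites(toy_peptide))
```

Because only about 70% of reported substrates carry a hydrophobic P+1 residue, a scan of this kind would at best nominate candidate sites for experimental follow-up.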


TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 5 Postulated optimal TBK1 phosphorylation motif. Highlighted in red is the modified serine residue (S172) of TBK1 which is required for kinase activation

While the composition of the kinase active site appears to be critical to substrate specificity, researchers were also interested in elucidating the enzymatic mechanism by which TBK1 functions. Using an IκBα peptide (amino acids 19–41), which is known to be phosphorylated by TBK1 at serine residue 36, in conjunction with ATP as a varied substrate, researchers observed almost parallel lines in double-reciprocal plots, suggesting that the kinase acted through a ping-pong mechanism (Khai Huynh et al. 2002). However, additional kinetic experiments soon revealed that TBK1 could not function through a ping-pong mechanism, as product inhibition studies showed that ADP was a noncompetitive inhibitor of TBK1 with respect to the IκBα peptide (Khai Huynh et al. 2002). If the enzyme functioned through a ping-pong mechanism, ADP would have acted as a competitive inhibitor with respect to IκBα, as both would have to access the same form of TBK1 during the reaction. With a ping-pong mechanism excluded, random sequential versus ordered sequential mechanisms were evaluated (Khai Huynh et al. 2002). The noncompetitive inhibition by ADP with respect to the IκBα peptide also excluded an ordered sequential mechanism with the IκBα peptide binding first, as ADP would then have acted as an uncompetitive inhibitor (Khai Huynh et al. 2002). To determine whether the TBK1 mechanism was therefore random sequential or ordered sequential with ATP binding first, the pattern of inhibition with an IκBα peptide inhibitor was examined (Khai Huynh et al. 2002). With ATP as the varied substrate, the double-reciprocal plots intersected at the 1/ATP axis and to the left of the 1/rate axis, indicating noncompetitive inhibition (Khai Huynh et al. 2002). This inhibition pattern, as with all IKK family members, was consistent with a random sequential mechanism for TBK1 (Burke et al. 1998; Peet and Li 1999; Khai Huynh et al. 2002). Interestingly, further kinetic experiments found that ATP decreased the affinity of IκBα for TBK1 and that the dissociation constant of IκBα for TBK1 was significantly higher than that for the IKKα/IKKβ heterodimer (also known to phosphorylate IκBα, subsequently leading to the activation of NF-κB), suggesting to researchers at the time that other substrates for TBK1 may exist (Burke et al. 1998; Peet and Li 1999; Khai Huynh et al. 2002).

Since that initial characterization, activated TBK1 has been shown to phosphorylate several substrates involved in a multitude of molecular events. Best understood today for its role in the induction of type I interferons (Chau et al. 2008), activated TBK1 has been shown to target two sites (seven serine/threonine residues) near the C-terminus of the transcription factor IRF3 (Lin et al. 1998; Fitzgerald et al. 2003; Liu et al. 2012). Phosphorylation of residues at the first site (Ser385 and Ser386) promotes the dimerization of IRF3 (Lin et al. 1998), required for its translocation to the nucleus, while phosphorylation of residues at the second site (Ser396-Ser405) permits IRF3 interaction with its coactivators p300/CBP (CREB-binding protein) (Chakravarti et al. 1996). These events, in turn, mediate the formation of a nuclear IRF3 nucleoprotein complex, at the promoter region of the IFN-β gene, which induces the


production of type I interferons and facilitates an immediate inflammatory response.

While primarily studied in the antiviral response, TBK1 has also been shown to play a role in endosomal sorting and trafficking (Da et al. 2011). VPS37C, another protein defined as a TBK1 substrate, is a subunit of ESCRT-I (endosomal sorting complex required for transport I), a complex in the class E-VPS (vacuolar protein sorting) pathway required for sorting ubiquitinated transmembrane proteins into internal vesicles of multivesicular bodies (Clark et al. 2011). Unfortunately, this pathway is subverted by retroviruses to mediate budding from host cells and facilitate the spread of infection (Bowie and Unterholzner 2008). TBK1’s phosphorylation of VPS37C negatively regulates PTAP-dependent (an HIV-1 motif) viral budding, as this posttranslational modification prevents the competent assembly of viral budding complexes (Clark et al. 2011).

In addition to the antiviral response and vesicle budding, TBK1 has also been shown to mediate the sequestration of intracellular bacteria through a number of roles in xenophagy. Infection or invasion by a bacterial pathogen tends to result in compartmentalization of the invading organism within PCVs (pathogen-containing vacuoles) (Gleason et al. 2011). This restriction event limits nutrient intake by the invading microorganism and prevents collateral host damage. In 2007, using TBK1−/− cells, a group reported that TBK1 kinase activity was required for the restriction of bacterial infection, as TBK1 was able to regulate the integrity of PCVs (Wild et al. 2011). Further experiments by the same group showed that AQP-1 (aquaporin-1), a water channel that regulates the swelling of secretory vesicles, associated with PCVs [104]. In TBK1-deficient mice, AQP-1 levels were elevated, causing PCV destabilization. This suggested to researchers that TBK1 may control AQP-1 expression, a critical regulatory point in host cell protection (Da et al. 2011).

Further investigations into TBK1’s role in xenophagy focused on ubiquitination and the tendency of polyubiquitinated proteins to accumulate around cytosolic invaders. While it is not known whether invading bacteria are themselves ubiquitinated in the cell or whether polyubiquitinated proteins surround the invading microorganisms, these studies revealed two roles for TBK1 in mediating the selective autophagy of bacteria displaying a polyubiquitinated coat (Gleason et al. 2011; Wild et al. 2011). First, TBK1 has been shown to mediate the activity of a ubiquitin-binding protein, NDP52, through recruitment of adaptor molecules that recognize bacterial ubiquitin coats (Gleason et al. 2011). This event signals cells to activate autophagy against the bacteria attempting to colonize the host cytosol (Gleason et al. 2011). Second, TBK1 has been shown to enhance the affinity of an autophagy receptor, optineurin, for ubiquitinated proteins through its phosphorylation of a residue (serine-177) known to enhance the autophagic clearance of cytosolic Salmonella (Wild et al. 2011). This, in turn, is known to promote the selective autophagy of invading intracellular bacteria through interactions with xenophagosomes.

For all of its positive attributes, however, aberrant activation of TBK1 has been shown to contribute to human disease. For example, in oncogenic transformation studies, TBK1 activation has been shown to support the suppression of a programmed cell death response to oncogene activation (Ou et al. 2011). In these investigations, biophysical studies revealed that RalB, a key component of the oncogenic Ras signaling network, once activated, promoted a direct interaction between Sec5, a critical Ral effector protein, and TBK1, which resulted in kinase activation (Ou et al. 2011). Using TBK1−/− cells, researchers then showed that TBK1 directly interacted with and activated AKT, which, in turn, is associated with tumor cell survival, oncogenic proliferation, and invasiveness (Ou et al. 2011).
Similarly, in the insulin response, TBK1 has been implicated in genetic models of diabetes (Tilly-Kiesi et al. 1996; Ou et al. 2011). In co-immunoprecipitation assays with the insulin


TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, Fig. 6 K63-linked ubiquitination of TBK1. (a) Ribbon diagram of a single TBK1 molecule. Residues in TBK1 found to be ubiquitinated by Wang et al. (K69, K154, and K372) are highlighted in red spheres. Residues found to be ubiquitinated by Tu et al. (K30 and K401) are highlighted in blue spheres (PDB ID: 4IM3). (b) Sphere representation of dimeric TBK1 with ubiquitination residues discovered by Wang et al. displayed in red spheres and ubiquitination residues discovered by Tu et al. displayed in blue spheres

receptor, researchers have found that TBK1 and the insulin receptor interact in obese Zucker rat (OZR) models (Tilly-Kiesi et al. 1996). Through further investigations, researchers showed that TBK1 was able to phosphorylate serine residue 994 in the insulin receptor, a posttranslational modification that has been shown to cause reduced insulin sensitivity in genetic models of diabetes (Tilly-Kiesi et al. 1996).

As is evident from the experiments above, while TBK1 activation can promote host survival during infection, an unchecked TBK1 response can lead to detrimental effects ranging from cellular proliferation in transformed cells and insulin resistance, as detailed above, to numerous autoimmune disorders, obesity, and glaucoma. For this reason, effector molecules have been sought and identified that control TBK1 activation and prevent its rampant activation from negating its beneficial effects.

In searching for regulators of the TBK1 response, two independent groups discovered K63-linked ubiquitination of TBK1 in response to RNA virus infection (Fig. 6): one group showed ubiquitination on residues K69, K154, and K372 (Wang 2010), while the other showed ubiquitination on residues K30 and K401 (Tu et al. 2013). While the E3 ligase responsible for the modifications on residues K30 and K401 has not been identified, the ubiquitination of residues K69, K154, and K372 has been shown to be induced by the E3 ligases mind bomb 1 and mind bomb 2 (Wang 2010), in response to RNA virus infection, or NRDP1, in response to LPS (Wang et al. 2009). Following this posttranslational modification, recruitment of the downstream adaptor NEMO, through ubiquitin-binding domains, has been shown to lead to the assembly of a NEMO/TBK1 complex (Wang 2010). This complex, in turn, is known to be able to activate TBK1 kinase activity, leading to the phosphorylation of the downstream transcription factor IRF3. While ubiquitination events have not been shown to be required for the activation of TBK1, the presence of this posttranslational modification has been shown to upregulate the type I interferon response and promote host cell defenses.

Further investigations into the regulation of TBK1 activity have identified endogenous inhibitors that prevent the TBK1-mediated downstream phosphorylation of numerous substrates. One such inhibitor, A20, a ubiquitin-editing enzyme, has been shown to disrupt the K63-linked ubiquitination events discussed previously (Parvatiyar et al. 2010). Another such inhibitor, SHP-2, a tyrosine phosphatase, has been shown to negatively regulate TBK1-mediated signal transduction (An et al. 2006). Since tyrosine phosphorylation of recombinant active TBK1 has not been detected, researchers postulate that this protein binds to the KD of the kinase and prevents the autophosphorylation of S172 in the activation loop of the KD (An et al. 2006). In addition, a yeast two-hybrid screen for interaction partners of the closely related, noncanonical IκB kinase, IKKε, identified the protein SIKE (suppressor of IKKε) as a physiological inhibitor of TBK1 through an undefined mechanism (Huang et al. 2005). As investigations into the mechanism by which SIKE inhibits TBK1 function are a major component of this work, characteristic SIKE structure and function details will be explained at length in “▶ SIKE: Discovery, Structure, and Function”.

Cross-References ▶ SIKE: Discovery, Structure, and Function

References
An H, Zhao W, Hou J et al (2006) SHP-2 phosphatase negatively regulates the TRIF adaptor protein-dependent type I interferon and proinflammatory cytokine production. Immunity 25:919–928
Bonnard M, Mirtsos C, Suzuki S et al (2000) Deficiency of T2K leads to apoptotic liver degeneration and impaired NF-kappaB-dependent gene transcription. EMBO J 19:4976–4985
Bowie AG, Unterholzner L (2008) Viral evasion and subversion of pattern-recognition receptor signalling. Nat Rev Immunol 8(12):911–922
Burke JR, Miller KR, Wood MK, Meyers CA (1998) The multisubunit IkappaB kinase complex shows random sequential kinetics and is activated by the C-terminal domain of IkappaB-alpha. J Biol Chem 273:12041–12046
Chakravarti D, LaMorte VJ, Nelson MC et al (1996) Role of CBP/P300 in nuclear receptor signalling. Nature 383:99–103
Chau TL, Gioia R, Gatot JS et al (2008) Are the IKKs and IKK-related kinases TBK1 and IKK-ε similarly activated? Trends Biochem Sci 33:171–180
Clark K, Peggie M, Plater L et al (2011) Novel cross-talk within the IKK family controls innate immunity. Biochem J 434:93–104
Da Q, Yang X, Xu Y et al (2011) TANK-binding kinase 1 attenuates PTAP-dependent retroviral budding through targeting endosomal sorting complex required for transport-I. J Immunol 186:3023–3030
Fitzgerald KA, McWhirter SM, Faia KL et al (2003) IKKepsilon and TBK1 are essential components of the IRF3 signaling pathway. Nat Immunol 4:491–496
Gleason CE, Ordureau A, Gourlay R et al (2011) Polyubiquitin binding to optineurin is required for optimal activation of TANK-binding kinase 1 and production of interferon β. J Biol Chem 286:35663–35674
Huang J, Liu T, Xu L-G et al (2005) SIKE is an IKK epsilon/TBK1-associated suppressor of TLR3- and virus-triggered IRF-3 activation pathways. EMBO J 24:4018–4028
Khai Huynh Q, Kishore N, Mathialagan S et al (2002) Kinetic mechanisms of IkappaB-related kinases (IKK) inducible IKK and TBK-1 differ from IKK-1/IKK-2 heterodimer. J Biol Chem 277:12550–12558
Kishore N, Khai Huynh Q, Mathialagan S et al (2002) IKK-i and TBK-1 are enzymatically distinct from the homologous enzyme IKK-2. Comparative analysis of recombinant human IKK-i, TBK-1, and IKK-2. J Biol Chem 277:13840–13847
Larabi A, Devos JM, Ng S-L et al (2013) Crystal structure and mechanism of activation of TANK-binding kinase 1. Cell Rep 3:734–746. https://doi.org/10.1016/j.celrep.2013.01.034
Lin R, Heylbroeck C, Pitha PM, Hiscott J (1998) Virus-dependent phosphorylation of the IRF-3 transcription factor regulates nuclear translocation, transactivation potential, and proteasome-mediated degradation. Mol Cell Biol 18:2986–2996
Liu F, Xia Y, Parker AS, Verma IM (2012) IKK biology. Immunol Rev 246:239–253
Ma X, Helgason E, Phung QT et al (2012) Molecular basis of Tank-binding kinase 1 activation by transautophosphorylation. Proc Natl Acad Sci 109:9378–9383
McWhirter SM, Fitzgerald KA, Rosains J et al (2004) IFN-regulatory factor 3-dependent gene expression is defective in Tbk1-deficient mouse embryonic fibroblasts. Proc Natl Acad Sci U S A 101:233–238
Ou YH, Torres M, Cheng T, White MA (2011) TBK1 directly engages Akt/PKB survival signaling to support oncogenic transformation. Mol Cell 41:458–470
Parvatiyar K, Barber GN, Harhaj EW (2010) TAX1BP1 and A20 inhibit antiviral signaling by targeting TBK1-IKKi kinases. J Biol Chem 285:14999–15009
Peet GW, Li J (1999) IKB Kinases alpha and beta show a random sequential kinetic mechanism and are inhibited by staurosporine and quercetin. J Biol Chem 274:32655–32661
Perry AK, Chow EK, Goodnough JB et al (2004) Differential requirement for TANK-binding kinase-1 in type I interferon responses to toll-like receptor activation and viral infection. J Exp Med 199:1651–1658
Pomerantz JL, Baltimore D (1999) NF-kappaB activation by a signaling complex containing TRAF2, TANK and TBK1, a novel IKK-related kinase. EMBO J 18:6694–6704
Sharma S, Tenoever BR, Grandvaux N et al (2003) Triggering the interferon antiviral response through an IKK-related pathway. Science 300:1148–1151
Soulat D, Bürckstümmer T, Westermayer S et al (2008) The DEAD-box helicase DDX3X is a critical component of the TANK-binding kinase 1-dependent innate immune response. EMBO J 27:2135–2146
Tilly-Kiesi M, Knudsen P, Groop L, Taskinen MR (1996) Hyperinsulinemia and insulin resistance are associated with multiple abnormalities of lipoprotein subclasses in glucose-tolerant relatives of NIDDM patients. Botnia Study Group. J Lipid Res 37:1569–1578
Tojima Y, Fujimoto A, Delhase M et al (2000) NAK is an IkappaB kinase-activating kinase. Nature 404:778–782. https://doi.org/10.1038/35008109
Tu D, Zhu Z, Zhou AY et al (2013) Structure and ubiquitination-dependent activation of TANK-binding kinase 1. Cell Rep 3:747–758
Wang L (2010) Mindbomb proteins are E3 ubiquitin ligases essential for TBK1-mediated antiviral activity. J Immunol 184:136.6
Wang C, Chen T, Zhang J et al (2009) The E3 ubiquitin ligase Nrdp1 “preferentially” promotes TLR-mediated production of type I interferon. Nat Immunol 10:744–752
Wild P, Farhan H, McEwan DG et al (2011) Phosphorylation of the autophagy receptor optineurin restricts Salmonella growth. Science 333:228–233


Target-Primed Mobilization Mechanisms
Adam R. Parks¹ and Joseph E. Peters²
¹Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
²Department of Microbiology, Cornell University, Ithaca, NY, USA

Synonyms Target-primed reverse transcription

Synopsis
Some genetic elements utilize a free 3′ OH at a new genetic location to prime DNA synthesis, resulting in a copy of the element in the target DNA. In a simple example of this mechanism, mobile elements encode homing endonucleases, which create one or more double-strand DNA breaks in a target DNA molecule. The break activates the host genome for homologous recombination, and the mobile element is copied into that DNA site; the 3′ ends of the broken DNA are used to prime synthesis of the element. Other mobile elements use a process termed target-primed reverse transcription to move via an RNA intermediate. Following transcription of the DNA element, the RNA copy is reverse transcribed into a new DNA site, using a nick in the DNA at that site to prime reverse transcription. There are two major classes of elements that move by a target-primed reverse transcription mechanism, the non-LTR retrotransposons and the group II mobile introns. Non-LTR (non-long terminal repeat) retrotransposons are copied out of their original location by a host-encoded RNA polymerase and directly copied back into a new location by an element-encoded reverse transcriptase. Group II mobile introns excise themselves from a larger mRNA molecule and use a reverse transcriptase to make a DNA copy at a new site. This process also involves the priming of DNA synthesis by a free 3′ OH at the target site.

Introduction
Some genetic elements mobilize by a mechanism that utilizes a free 3′ OH to prime DNA synthesis in such a way that the mobile element is copied into the target DNA. In some cases the 3′ OH can be generated by the endonuclease activity of a dedicated protein or protein domain. Other elements are able to use the 3′ OHs that occur at the ends of telomeres on linear chromosomes or the ends of Okazaki fragments during DNA replication to prime synthesis. Elements that function by target-primed mechanisms are able to “trick” host cells into copying the element into a new locus by either providing a nonnative template for the polymerase to copy into the new location or by providing DNA sequence that is homologous to the DNA break site, such that homologous recombination results in integration of the element. These types of elements can be found in all domains of life, including in bacteriophages and viruses.

Homing Endonucleases
Homing endonucleases (HE) might be considered the simplest of all mobile elements. Homing is the process of mobile element insertion into an available site that lacks the element (Fig. 1). In their most basic form, they encode a single endonuclease that creates a double-strand DNA break at a rare, yet highly conserved, sequence within the genome of a host that does not already contain a copy of the HE gene (Belfort and Bonocora 2014; Darmon and Leach 2014). By creating a double-strand DNA break, the host genome is activated for homologous recombination at that site. The broken DNA is repaired by homologous recombination, using the host’s own recombination and DNA replication machinery. When the DNA molecule that does contain a copy of the

Target-Primed Mobilization Mechanisms, Fig. 1 Homing endonucleases move by homologous recombination. (a) The homing endonuclease (blue circle) is produced from an open reading frame found within a highly conserved site (a–b) in the chromosome. (b) The endonuclease makes a double-strand break at the naïve a–b site. (c) The break is resected by the host’s DNA repair enzymes, such as RecBCD in E. coli. (d) The homology between the naïve a–b sites is paired with the homologous a–b site containing the HE gene. (e) Gene conversion results in the HE gene being copied into the new locus. Since the a–b site is now interrupted by the HE gene, the homing endonuclease can no longer cleave the DNA at that site

Target-Primed Mobilization Mechanisms, Fig. 2 Introns and inteins both interrupt a host gene but are removed in different ways. (A) The host gene (blue arrow) is interrupted by the mobile element (green box), which also encodes a homing endonuclease (purple box). (B) When the host gene is transcribed, the mobile element is also transcribed. Introns are spliced out before translation. Inteins remain encoded within the host open reading frame. (C) For introns, the homing endonuclease and host protein are translated separately. Inteins are translated along with the host protein, forming a fusion between the intein and the host protein. (D) Inteins splice out of the host protein, leaving the host protein intact and separating the homing endonuclease from the host protein. In both intron and intein, the host protein is ultimately expressed in its wild-type form

endonuclease gene is used as a template for homologous recombination, the gene is copied into the naïve site. Once a copy of the element is in place, the DNA sequence recognized by the HE is disrupted and is no longer cleaved by the nuclease. The sequences that are recognized by the endonucleases are typically quite long (14–40 bp) compared to those of other endonucleases and are often found in highly conserved regions of the chromosome (Belfort and Bonocora 2014). The long recognition sequence ensures that the site is very rare in the chromosome, preventing the nuclease from causing widespread damage to the chromosome. In some cases, these regions are phenotypically silent; insertion of a genetic element does not disrupt any coding sequence. However, there are examples of homing endonucleases that target conserved housekeeping genes. By targeting conserved genes, these elements are capable of maintaining themselves within the genome of a given host. This presents a problem, because highly conserved genes are often essential. Some HEs have evolved an elegant strategy to deal with this trade-off; they are able to excise themselves as an intron from the RNA and restore the original function of the gene. By parasitizing essential genes, these elements establish a selection for their continued function (Darmon and Leach 2014). If part of the element is damaged or deleted, preventing proper removal of the element, then the host organism will lose the function of that essential gene and potentially die. Elements that are propagated by homologous recombination can occasionally move to new locations by illegitimate recombination; however, this phenomenon can cause conversion of the surrounding genes as well as the new target, making widespread genomic changes (Darmon and Leach 2014).

Group I introns are a class of mobile genetic element that moves by homing endonuclease activity (Haugen et al. 2005). When the gene that contains a group I intron is transcribed, the intron is able to catalyze its own removal from the mRNA and rejoin, or splice, the ends of the mRNA back together so that it still encodes the correct, uninterrupted open reading frame (Fig. 2). This activity prevents the disruption of any essential functions that are needed by the cell. Once established within a genome, the HE coding sequence can be lost or degenerate, since its activity is not necessary for the activity of the intron (Haugen et al. 2005).
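The rarity argument made above for long (14–40 bp) recognition sequences can be made concrete with a rough calculation. The Python sketch below assumes a random genome with equal frequencies of the four bases, which real genomes do not have, so it illustrates only the scale of the effect. The function name `expected_random_sites` and the 4.6 Mb genome size (roughly that of a typical E. coli chromosome) are illustrative choices, not values from the text.

```python
def expected_random_sites(genome_bp, site_bp, both_strands=True):
    """Expected number of chance occurrences of one specific site.

    ASSUMPTION: bases are independent and equally frequent, so a given
    site of length n matches any start position with probability 1/4**n.
    """
    positions = max(genome_bp - site_bp + 1, 0)  # possible start positions
    strands = 2 if both_strands else 1           # site may lie on either strand
    return strands * positions / 4 ** site_bp


GENOME_BP = 4_600_000  # illustrative bacterial-sized genome

# Compare a short site with lengths in the homing endonuclease range.
for site_bp in (6, 14, 20, 40):
    print(site_bp, expected_random_sites(GENOME_BP, site_bp))
```

Under these assumptions a 6 bp site is expected thousands of times in such a genome, whereas a site of 14 bp or longer is expected far less than once, which is consistent with the idea that a homing endonuclease cuts essentially only at its homing site.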


In a strategy similar to that of group I mobile introns, inteins are able to excise themselves from proteins once the mRNA has been translated (Fig. 2; Barzel et al. 2011). The HE is expressed as a fusion between the host protein and the element-encoded nuclease. The ends of the endonuclease domain catalyze the excision of this foreign protein domain and rejoin the native host protein, preventing disruption of its activity.

HEs have been used extensively as biotechnological tools (Belfort and Bonocora 2014). By engineering the DNA binding region, alternative DNA sequences can be targeted. By making site-specific cuts within a gene, the gene can be disrupted, or it can be altered by also providing DNA with sequence homology on either side of the cut site. In eukaryotes, double-strand DNA breaks are often repaired by an error-prone nonhomologous end joining mechanism that leads to the disruption of the gene that has been cut, especially if all copies of the gene are cut. Like zinc finger endonucleases, TALENs, and CRISPR systems, homing endonuclease systems have been used in mammalian systems to develop genetic tools and therapeutics. These tools have also been employed in plant, animal, and bacterial models.

Target-Primed Reverse Transcription Mechanisms: When RNA Is the Template
Mobile elements that move by copying themselves via an RNA intermediate must be converted into DNA to establish a stable presence at a new location in the genome. This requirement is accomplished by the host RNA polymerases and the activity of reverse-transcriptase proteins that often have other associated activities that aid in the mobilization process. The process by which some elements are copied directly into a new target DNA molecule by a reverse-transcriptase protein is called target-primed reverse transcription (TPRT) (Curcio and Derbyshire 2003). There are two major classes of elements that move by a TPRT mechanism, the non-LTR retrotransposons and the group II mobile introns.


Non-LTR Retrotransposons Non-LTR (non-long terminal repeat) retrotransposons are transposons that move by a copyout/copy-in mechanism (Curcio and Derbyshire 2003). They are copied out of their original location by a host-encoded RNA polymerase and directly copied back into a new location by an element-encoded reverse transcriptase (Fig. 3). This contrasts with LTR retrotransposons, which are copied into cDNA molecules before being “pasted in” to a new DNA location by an integrase protein (a DDE-type recombinase). Non-LTR elements do not require the long terminal repeats that are bound by the integrase protein for insertion because they lack this step in their transposition process. The sequences at either end of non-LTR transposons are referred to as 50 - and 30 -UTRs (or untranslated regions) (Han 2010). The 50 -UTR often encodes a promoter sequence, and the 30 -UTR encodes a poly-A tail that is derived from transcription of the element and subsequent reverse transcription of the mRNA poly-A tail. Autonomous non-LTR elements also encode at least one open reading frame that produces a reversetranscriptase protein. This reverse transcriptase often has additional features that are important for the function of the element including an endonuclease activity and an RNAse H activity. Some non-LTR elements contain an additional open reading frame, encoding a protein that promotes proper arrangement of RNA and protein (a chaperone), aiding in the activity of the element (Han 2010). Element-encoded promoters may be used to initiate transcription of the non-LTR element from the host DNA; however, some elements do rely on host-encoded promoters near the 50 -UTR of the element (Han 2010). After transcription of the element, host ribosomes translate the one or two open reading frames within the element (Fig. 4). Typically these proteins bind immediately to the RNA from which they were translated, forming a ribonucleoprotein particle (RNP). 
The RNP, which includes the RNA, RT, and chaperone protein, is then able to bind to new sites in the genome. The binding of new target sites is mediated to some degree by base pairing with the element RNA but also depends on the DNA sequence preference of the RT enzyme (Curcio and

Target-Primed Mobilization Mechanisms, Fig. 3 Life cycle of non-LTR retrotransposons in eukaryotic cells. (A) Transcription of the element occurs within the nucleus by RNA polymerase I or II, depending on the element. (B) The RNA element is exported to the cytoplasm. (C) Within the cytoplasm, the element RNA is translated, expressing its two open reading frames (blue circle and red squares; some elements only have one open reading frame). (D) Ribonucleoprotein particles (RNPs) are assembled from the element RNA and the proteins encoded by its ORFs. For some elements, the RNPs may be further processed within stress granules called P-bodies (red circle). (E) RNPs are transported back into the nucleus, where they are integrated into the chromosome at a different site by target-primed reverse transcription

Target-Primed Mobilization Mechanisms, Fig. 4 Target-primed reverse transcription mediated by a reverse-transcriptase protein encoded within the element copies non-LTR retrotransposons into the target DNA. (a) The mobile element, consisting of 5′- and 3′-UTR regions and an open reading frame encoding a reverse-transcriptase gene, is transcribed by a host RNA polymerase (green oval), producing an RNA transcript of the element (red box) including a poly-A tail (commonly added to mRNA in eukaryotes, purple line). (b) The reverse-transcriptase gene is translated, and the resulting RT protein binds to the RNA transcript. (c) The RT enzyme nicks the target DNA, generating a free 3′ OH. (d) The free 3′ OH generated is used to prime the synthesis of a new DNA copy of the element at the site of the nick. The opposite DNA strand of the target DNA is also nicked by the RT enzyme. (e) The gap in the target DNA is repaired, by either a host DNA polymerase or the RT enzyme and ligase. (f) The new copy of the mobile element differs from the original copy by the addition of more A nucleotides to the 3′ end

Target-Primed Mobilization Mechanisms, Fig. 5 Non-LTR transposition can lead to either target-site duplications or target-site deletions, depending on whether plus-strand DNA is primed from a site downstream or upstream of the site that primes minus-strand DNA synthesis. (a) For both target-site duplications and deletions, DNA is nicked (bottom strand) to prime reverse transcription. (b) Reverse transcription of the element synthesizes the minus strand (red dotted arrow) of the element, using RNA as a template (green squiggle). (c) For transposition that results in target-site duplications, the top strand is nicked 3′ of the bottom strand nick, relative to the top strand 5′–3′ orientation. In target-site deletions, the top strand is nicked 5′ to the original bottom strand nick site. (d) Plus-strand DNA (blue dotted arrow) is synthesized using the newly synthesized minus strand. (e) For duplications, gaps that are derived from the same DNA sequence remain at the ends of the element, which are filled in by DNA polymerase. For deletions, flaps containing the sequence that will be deleted remain at the ends. The flaps are removed by a host endonuclease. (f) The end result is either the duplication of sequence proximal to the insert site (depicted with the duplicated “ATCG”) or the deletion of sequence at the insertion site (“ATCG” is missing)

Derbyshire 2003; Tropp 2012). The RT enzyme then uses its endonuclease activity to nick DNA at that site to free a 3′ OH for use in priming reverse transcription of the element. In some cases, the RT protein does not have an endonuclease activity (Malik et al. 1999), so free 3′ OH groups that either have resulted from DNA damage and replication or occur at the ends of chromosomes may be used to prime synthesis (Morrish et al. 2002). The RT enzyme may also nick the opposite strand of the target DNA, and the location of this nick will dictate whether the final integration of the element results in small target-site duplications or target-site deletions (Fig. 5). When the opposite strand is nicked 3′ of the original nick, the result is a target-site duplication. When the nick is 5′ of the original nick, the final insertion will result in the loss of sequence information at either end of the element, i.e., a target-site deletion (Han 2010). Once the minus strand of the non-LTR transposon is synthesized by the RT enzyme, it is looped back to base pair with the opposite strand of target DNA so that this strand may now be used to prime synthesis of the plus strand. Sometimes incomplete reverse transcription of the element can lead to 5′-truncation of the element. The RNase H domain of the RT enzyme degrades the mRNA that was originally used to produce the minus strand. Host enzymes are required to repair gaps, flaps, and/or nicks that are left at the site of insertion to complete the process.

Examples of non-LTR transposons include the L1 elements in humans (Homo sapiens), R2 elements in silkworms (Bombyx mori), and the I factor found in fruit flies (Drosophila melanogaster). In Drosophila, non-LTR elements (TART, HeT-A, TAHRE) are used in the maintenance of telomeres at the ends of chromosomes (Pardue et al. 2005). Non-LTR elements have nonautonomous derivatives that rely on active (autonomous) elements to produce the necessary RT enzyme (Curcio and Derbyshire 2003). Important examples of autonomous and nonautonomous pairs in humans are the L1 elements (autonomous) and Alu (nonautonomous). Together these elements are thought to comprise ~45% of the entire human genome, while only 80–100 L1 elements are considered active (Kazazian 2004).

Non-LTR retroelements can provide essential functions to host cells. For example, Drosophila does not encode its own telomerase enzyme; HeT-A and TART elements maintain telomeres through retrotransposition activity (Pardue et al. 2005). These elements can also enhance the diversity of their host genomes by “exonizing,” that is, converting noncoding sequence into coding sequence by the addition of new splice sites (Cordaux and Batzer 2009). SINE and MIR elements are particularly prone to exonization because, when they are inserted in the antisense direction, they provide conserved splicing elements that can be recognized by the host RNA splicing apparatus. It is estimated that humans encode ~1,800 exons that have been derived by this process (Schmitz and Brosius 2011).
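The dependence of the insertion outcome on the relative positions of the two nicks (Fig. 5) can be illustrated with a small string model. The following Python sketch is a deliberate simplification rather than a mechanistic simulation: it represents only the top strand of the target, expresses both nicks as positions on that strand, and uses hypothetical names (`integrate`, `bottom_nick`, `top_nick`) that do not come from the cited literature.

```python
def integrate(target, element, bottom_nick, top_nick):
    """Toy model of a non-LTR insertion, written on the top strand (5'->3').

    Nick positions are indices between bases of `target` (an assumed
    convention for this sketch). Returns a tuple of
    (outcome, segment_between_nicks, product_sequence).
    """
    lo, hi = sorted((bottom_nick, top_nick))
    segment = target[lo:hi]  # sequence lying between the two nicks
    if top_nick > bottom_nick:
        # Top-strand nick is 3' of the bottom-strand nick: the segment
        # between the nicks ends up on both sides of the element.
        return "duplication", segment, target[:hi] + element + target[lo:]
    if top_nick < bottom_nick:
        # Top-strand nick is 5' of the bottom-strand nick: the segment
        # between the nicks is lost from the product.
        return "deletion", segment, target[:lo] + element + target[hi:]
    # Nicks directly opposite each other: nothing duplicated or deleted.
    return "blunt", segment, target[:lo] + element + target[lo:]


print(integrate("GGATCGTT", "[element]", 2, 6))
print(integrate("GGATCGTT", "[element]", 6, 2))
```

With the staggered nicks four bases apart, the first call reports a duplication of “ATCG” on both sides of the element, and the second call reports the loss of the same four bases, mirroring panels (f) of Fig. 5.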
Group II Mobile Introns
Some RNA molecules (ribozymes) possess catalytic abilities of their own, and in the case of group II mobile introns, the catalytic activity of the RNA molecule enables the element to excise itself from a larger mRNA molecule in a process referred to as self-splicing (Curcio and Derbyshire 2003; Lambowitz and Zimmerly 2011; Tropp 2012). The secondary structure of the RNA molecule of these elements is critically important in the function of the mobile intron, as the intramolecular interactions within the RNA mediate the self-splicing reaction (Fig. 6) (Lambowitz and Zimmerly 2004). The regions of the parent mRNA molecule that reside at either side of the intron are called the 5′- and 3′-exons. The simplest group II mobile introns also encode a single open reading frame that produces a reverse-transcriptase protein that enables them to be copied into a new location.

Self-splicing of the group II mobile intron is accomplished in two transesterification steps (Curcio and Derbyshire 2003; Lambowitz and Zimmerly 2004). The first step involves base pairing between sequences within the intron (the exon-binding sites, or EBS) and complementary sequences in the 5′ exon (the intron-binding sites, or IBS). These interactions fold the secondary and tertiary structure of the RNA itself into an active center that also binds Mg2+ ions, which are essential for the transesterification reaction (Qin and Pyle 1998). The 2′ OH of a conserved adenosine is aligned with the 5′ PO4 that links the intron to the rest of the mRNA. The 2′ OH is used as the nucleophile in a transesterification reaction that forms a 2′–5′ bond, resulting in a loop in the intron, commonly referred to as a lariat structure (Lambowitz and Zimmerly 2004). In the second step, the newly exposed 3′ OH of the 5′ exon is used in a nucleophilic attack on the 5′ PO4 of the 3′ exon. This activity is also coordinated by the intron RNA and base-pairing interactions within the intron and the exon. The end result of the self-splicing reaction is the joining of the 5′ and 3′ exons and the release of the intron RNA as a lariat structure.

The reverse-transcriptase protein, translated before splicing, binds to a conserved domain of the intron as well, helping to stabilize some of the necessary interactions. Although the RT enzyme does appear to participate in the splicing complex, the reaction is entirely catalyzed by the RNA of the intron.

Target-Primed Mobilization Mechanisms, Fig. 6 Mobilization of group II mobile introns. (a) The element is copied out of DNA by a host RNA polymerase enzyme (green oval). (b) Translation of the open reading frame within the element produces a reverse-transcriptase protein (RT, purple oval). (c) Self-splicing of the intron from the mRNA proceeds by the following steps: (i) Nucleophilic attack on the splice donor site by a 2′ OH group within the intron leads to the formation of a lariat structure. (ii) The splice donor serves as a nucleophile, attacking the splice acceptor site at the 3′ end of the intron. (iii) The intron is then released as a lariat-shaped RNA molecule, with the mRNA splice donor and acceptor sites fused. (d) A new target site is selected through base pairing between the intron and the target DNA, and a reverse splicing reaction joins the intron with DNA. (e) Reverse transcriptase nicks the target DNA, generating a 3′ OH that is used in priming reverse transcription. (f) RT uses an RNase H domain to remove the RNA copy of the intron (or it is displaced) during second strand synthesis, generating a double-stranded DNA copy of the intron at the new site

Following self-splicing, the group II intron and RT complex may invade double-stranded DNA to establish the intron in a new site. Retro-homing is the process in which a site identical to the one from which the intron was spliced is recognized by binding of the RT enzyme and base pairing between the new DNA site and the intron RNA (Darmon and Leach 2014; Tropp 2012). Just as observed with homing endonucleases, target sites tend to be highly conserved DNA sequences. Since group II introns are also removed from the mRNA before translation, there is no disruption of the host open reading frame. Some introns may also invade sites that are not identical, and in this case the process is referred to as retrotransposition (Curcio and Derbyshire 2003; Tropp 2012). The RNA of the intron catalyzes a reverse splicing reaction, in which all of the transesterification steps described above are essentially reversed within the new target site, establishing the RNA within the DNA target. The free 3′ OH of the lariat structure attacks the backbone of the target DNA, generating a free 3′ OH in the target DNA backbone and covalent attachment of the 3′ end of the lariat to the target DNA (Lambowitz and Zimmerly 2004). The free 3′ OH from the target DNA attacks the 5′ PO4 where it is attached to the internal 2′ OH of the intron RNA, completing the reverse splicing reaction. A nick in the bottom strand of the target DNA generates a 3′ OH that can prime synthesis of a DNA copy of the intron. The RT enzyme utilizes an endonuclease activity to nick the target DNA molecule downstream, and the free 3′ OH is used to prime DNA synthesis, templated from the RNA of the intron. In some cases, the endonuclease activity of the RT enzyme is not necessary; the 3′ OH may be generated by another mechanism, such as the host’s own DNA replication process. Host DNA replication and repair pathways are used to remove the intron RNA, copying the newly inserted DNA copy of the element in the process.

Group II introns are typically only found in bacteria and in the organelles of eukaryotes (Lambowitz and Zimmerly 2004). Just as in other mobile element systems, there are nonautonomous introns that require the production of reverse transcriptase, and possibly RNA domains that participate in the splicing reactions, from other introns for activity. Nonautonomous introns are called twintrons. Group II introns are thought to be the progenitors of introns that interrupt genes in eukaryotes and expand the coding capacity of eukaryotic genes through alternative splicing reactions. Eukaryotic introns do not possess the complicated secondary and tertiary structures that group II introns require to form an active complex; however, snRNAs that are produced in trans perform the function that these missing domains accomplish.

Group II introns also show promise as biotechnological tools because they offer a variety of useful features (Lambowitz and Zimmerly 2004). While the overall secondary structure of the intron must be maintained, there are regions of the element that are amenable to alteration. They may be targeted to specific sequences by engineering sequence homology into the EBS region of the element, and foreign genetic material can be added to certain locations in the intron. These genetic tools have been dubbed “targetrons,” due to the ease of targeting them to specific locations. Targetrons that are based on the Ll.LtrB intron have been used to interrupt genes and insert desirable heterologous genes in both gram-positive and gram-negative bacteria. These elements can insert with very high frequency (>1%) and may not even require a selectable marker to isolate new insertions (Lambowitz and Zimmerly 2004).
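Because retargeting relies on base pairing between the intron RNA and the DNA target, the core idea can be sketched as a complement calculation. The Python snippet below is only an illustration under stated assumptions: it considers strict Watson-Crick pairing, ignores the protein contribution to target recognition and any constraints on intron folding, and the function name `complementary_rna` and the six-base example site are hypothetical. It is not a description of how published targetron design tools work.

```python
# Watson-Crick partners, DNA base -> pairing RNA base.
# Assumption for this sketch: no G-U wobble pairs are considered.
DNA_TO_PAIRING_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complementary_rna(dna_site):
    """Return the RNA (5'->3') that would base pair with `dna_site` (5'->3')."""
    # Pairing strands are antiparallel, so the site is read in reverse.
    return "".join(DNA_TO_PAIRING_RNA[base] for base in reversed(dna_site.upper()))


# Hypothetical six-base stretch of a target site
print(complementary_rna("ATGCCA"))  # UGGCAU
```

In this simplified picture, changing the target stretch changes the RNA sequence that would have to be engineered into the element, which is the sense in which the text describes engineering sequence homology into the EBS region.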


References
Barzel A, Naor A, Privman E, Kupiec M, Gophna U (2011) Homing endonucleases residing within inteins: evolutionary puzzles awaiting genetic solutions. Biochem Soc Trans 39:169–173
Belfort M, Bonocora RP (2014) Homing endonucleases: from genetic anomalies to programmable genomic clippers. Methods Mol Biol 1123:1–26
Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome evolution. Nat Rev Genet 10:691–703
Curcio MJ, Derbyshire KM (2003) The outs and ins of transposition: from mu to kangaroo. Nat Rev Mol Cell Biol 4:865–877
Darmon E, Leach DR (2014) Bacterial genome instability. Microbiol Mol Biol Rev 78:1–39
Han JS (2010) Non-long terminal repeat (non-LTR) retrotransposons: mechanisms, recent developments, and unanswered questions. Mob DNA 1:15
Haugen P, Simon DM, Bhattacharya D (2005) The natural history of group I introns. Trends Genet 21:111–119
Kazazian HH Jr (2004) Mobile elements: drivers of genome evolution. Science 303:1626–1632
Lambowitz AM, Zimmerly S (2004) Mobile group II introns. Annu Rev Genet 38:1–35
Lambowitz AM, Zimmerly S (2011) Group II introns: mobile ribozymes that invade DNA. Cold Spring Harb Perspect Biol 3:a003616
Malik HS, Burke WD, Eickbush TH (1999) The age and evolution of non-LTR retrotransposable elements. Mol Biol Evol 16:793–805
Morrish TA, Gilbert N, Myers JS, Vincent BJ, Stamato TD, Taccioli GE, Batzer MA, Moran JV (2002) DNA repair mediated by endonuclease-independent LINE-1 retrotransposition. Nat Genet 31:159–165
Pardue ML, Rashkova S, Casacuberta E, DeBaryshe PG, George JA, Traverse KL (2005) Two retrotransposons maintain telomeres in Drosophila. Chromosome Res 13:443–453
Qin PZ, Pyle AM (1998) The architectural organization and mechanistic function of group II intron structural elements. Curr Opin Struct Biol 8:301–308
Schmitz J, Brosius J (2011) Exonization of transposed elements: a challenge and opportunity for evolution. Biochimie 93:1928–1934
Tropp BE (2012) Molecular biology: genes to proteins, 4th edn. Jones & Bartlett Learning, Sudbury

Cross-References
▶ DNA Recombination, Mechanisms of
▶ DNA Repair Polymerases
▶ DNA Replication
▶ Double-Strand Break Repair
▶ Homologous Recombination in Lesion Bypass

Target-Primed Reverse Transcription ▶ Target-Primed Mobilization Mechanisms


Target-Site Selection
Adam R. Parks (1) and Joseph E. Peters (2)
(1) Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
(2) Department of Microbiology, Cornell University, Ithaca, NY, USA

Synonyms
Integration-site selection

Definition
Target-site selection refers to the process that is used by mobile genetic elements to identify a new genetic locus for insertion (Craig 1997; Wu and Burgess 2004). Most mobile elements exhibit at least some degree of target-site specificity. The specificity for a given target DNA can vary dramatically from element to element and often results from sequence preferences of the recombinase proteins. In addition to primary DNA sequence preferences, selection of a new DNA target may also be influenced by DNA accessibility, DNA structure, and cellular factors that are bound to the DNA.

Discussion
Mobile genetic elements must regulate the timing and location of mobility to reduce damage to the host cell. While many mobile elements show a low level of specificity for any particular primary sequence and therefore can insert into a wide variety of genetic loci, virtually all mobile elements display some bias for certain features of the genome (Bushman et al. 2005). Target-site selection is especially important for elements that remain long-term residents in the host cell, such as retrotransposons in yeast, and these elements must have adaptations to avoid disrupting vital host functions when they insert into a new location (Wu and Burgess 2004). For other mobile elements that exist in host cells transiently, such as retroviruses, target-site selection can be less stringent and may be a consequence of selective pressures other than host preservation. For example, human immunodeficiency virus (HIV) displays a bias for actively transcribed genes (Bushman et al. 2005), potentially ensuring that the HIV genome does not integrate into a silenced region of the genome and can therefore be transcribed to complete its life cycle. Since finding a suitable target can be a rate-limiting step in the process of mobilization, the process of target-site selection is closely linked to the frequency of mobility in many elements (Nagy and Chandler 2004). In some systems, such as the bacterial transposon Tn7, transposition does not occur at all until a suitable target site has been identified and prepared for transposition. Three major criteria affect target-site selection for mobile elements:
• Primary DNA sequence
• DNA structure and accessibility
• DNA-associated landmarks, such as DNA-binding proteins

The primary sequence of the target DNA can play a large role in the selection of the target site; however, some genetic elements show very little preference for a particular DNA sequence (Wu and Burgess 2004). In the case of conservative site-specific recombination, the DNA-binding sequence of the recombinase determines the sequence specificity of recombination; the recombinases for these elements are sequence-specific DNA-binding proteins. Transposases often have preferred DNA-binding sequences, or bind to DNA with certain structural characteristics. For IS10 and some non-LTR transposons, inherent DNA flexibility is an important feature of target-DNA molecules (Nagy and Chandler 2004). In some cases target-DNA sequences share some sequence similarity with the ends of the transposon. For some elements, such as the Ll.LtrB group II mobile intron in Lactococcus lactis and the IS200/IS605 family of DNA transposons, base pairing between donor and target molecules serves an important role in the selection of an insertion site (Nagy and Chandler 2004). Most retrovirus and retrotransposon elements have very little primary sequence preference; however, their integration profile is not random (Bushman et al. 2005).

DNA structure and accessibility can play a major role in target-site selection of mobile elements. For example, retroviral integration occurs most frequently into the major groove of DNA (Wu and Burgess 2004), and distortion of the DNA, such as at DNA sites where nucleosomes are bound, can stimulate integration (Bushman et al. 2005). DNA-binding proteins can also hinder mobile element insertion by blocking access to the DNA.

Proteins that mediate the insertion of mobile elements often interact with host factors that are associated with DNA at particular locations. This process is sometimes referred to as “tethering,” in the context of retroviral integration. The integrase protein of HIV has been shown to interact with LEDGF/p75, which influences the integration site of the provirus (Bushman et al. 2005). For the retrotransposon Ty5, integrase interacts with the yeast Sir4p protein, a regulator of silent information, to target transposition into telomeric regions and mating-type loci (Wu and Burgess 2004). Interaction with host cellular factors not only provides mobile elements with the location of specific genomic features; these interactions may also provide information regarding the metabolic status of the host cell. By linking target-site selection to metabolic indicators, mobile elements can coordinate mobilization with specific cellular events (Nagy and Chandler 2004). For example, Tn7 and related elements have dedicated target-selection proteins that bind to target DNAs and direct transposition into those sites.
Tn7 has two such proteins, TnsE and TnsD, which independently identify two different classes of targets (Li et al. 2013). These factors enable the element to determine where the transposon will insert based on multiple cellular cues from the host. Along with two other host factors, TnsD binds to a specific, highly conserved DNA sequence within the coding region of the glmS gene. This class of target selection ensures that Tn7 can find a target site, called attTn7, in almost any bacterium, without the risk of disrupting any genes. The additional host factors that interact with TnsD link target-site selection with host cellular metabolism. TnsE directs transposition into actively replicating DNA, with a strong bias toward mobile plasmids that are actively being transported into cells. TnsE identifies these targets through an interaction with specific DNA structures and an essential component of the DNA replication machinery, the β processivity factor (Li et al. 2013). Interaction with the processivity factor of DNA replication, and DNA structures associated with DNA replication, links target-site selection with cellular DNA metabolism. The ability to target two distinct DNA targets enables this transposon to take advantage of the security of a highly conserved site within the chromosome, while still maintaining the ability to target sites associated with mobile DNA molecules, particularly those that can be identified by distinctive forms of DNA replication (i.e., mobile plasmids and certain kinds of bacteriophage) (Li et al. 2013).

Transposons Tn5 and Tn10 activate transposition according to the methylation status of the DNA where they reside (Nagy and Chandler 2004); methylation of DNA is a measure of how recently DNA has been replicated. In the case of these elements, transposition is activated just after replication, which ensures that another copy of the chromosome is available after the cut-and-paste transposon leaves the site. Without another copy of the chromosome, there would be no template to facilitate repair.
Furthermore, even if repair could occur using a simple end-joining mechanism, there would be no net gain in the number of elements, whereas repair using homologous recombination from the sister chromosome ensures that the element is duplicated and that the original element remains at this location.

Tn7 and other mobile elements, such as bacteriophage Mu and Tn3, display a property known as target immunity. These elements discourage the insertion of additional, similar elements nearby. Preventing the insertion of elements nearby helps to ensure that mobile elements will not insert into themselves, creating nonfunctional elements. In addition, insertion of a second copy of the same element close to the original would provide extended regions of DNA homology that, when involved in host-mediated homologous recombination, can result in deletions or inversions of large regions of DNA. Tn7 inhibits transposition into DNA up to 190 kilobases from an existing element (Li et al. 2013). In both Mu and Tn7, target-site immunity is mediated by a regulatory protein with an ATPase activity that modulates the mobility of the mobile element, based on the ATP- or ADP-bound status of the protein. Transposon ends that already exist in a potential target-DNA molecule and are bound by transposase proteins stimulate the ATPase activity of the regulator protein, thereby deactivating it and discouraging transposition in this region. In the case of Tn7 and related elements, target immunity appears to be at least partly restricted to very closely related elements. Multiple similar but nonidentical Tn7-like elements have been detected within a single attTn7 site, demonstrating the limitations of target immunity for these elements (Li et al. 2013).
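The practical effect of target immunity on the set of available insertion sites can be sketched with a simple distance filter. The Python snippet below is a rough illustration only: it treats the 190-kilobase range quoted above as a hard cutoff on a linear coordinate system, whereas the text describes an inhibition rather than an absolute block, and it ignores chromosome circularity. The function name `permitted_sites` and the example coordinates are hypothetical.

```python
# Range over which Tn7 is reported to inhibit nearby insertion (Li et al. 2013).
IMMUNITY_RANGE_BP = 190_000


def permitted_sites(candidates, existing, immunity_range=IMMUNITY_RANGE_BP):
    """Keep candidate positions farther than `immunity_range` from every
    existing copy of the element (linear coordinates, in base pairs)."""
    return [
        site
        for site in candidates
        if all(abs(site - copy) > immunity_range for copy in existing)
    ]


# One resident element at 50 kb; three hypothetical candidate sites
print(permitted_sites([100_000, 250_000, 600_000], existing=[50_000]))
# [250000, 600000]
```

A more faithful model would replace the hard cutoff with an insertion probability that falls as a candidate site approaches a resident element, but the source does not give the shape of that dependence.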

Cross-References
▶ DNA Recombination, Mechanisms of
▶ DNA Repair
▶ Double-Strand Break Repair
▶ Homologous Recombination in Lesion Bypass
▶ Mobile DNA: Mechanisms, Utility, and Consequences
▶ Transposons

References
Bushman F, Lewinski M, Ciuffi A, Barr S, Leipzig J, Hannenhalli S, Hoffmann C (2005) Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol 3:848–858
Craig NL (1997) Target site selection in transposition. Annu Rev Biochem 66:437–474
Li Z, Craig NL, Peters JE (2013) Transposon Tn7. In: Roberts AP, Mullany P (eds) Bacterial integrative mobile genetic elements. Landes Bioscience, Austin, pp 1–32
Nagy Z, Chandler M (2004) Regulation of transposition in bacteria. Res Microbiol 155:387–398
Wu X, Burgess SM (2004) Integration target site selection for retroviruses and transposable elements. Cell Mol Life Sci 61:2588–2596

Tertiary Structure Domains, Folds and Motifs
Walter R. P. Novak
Department of Chemistry, Wabash College, Crawfordsville, IN, USA

Synopsis
Proteins are complex and irregular structures; however, proteins possess regions of regularity at a local level in the form of secondary structure elements such as alpha helices and beta sheets. Secondary structure elements can combine to form structural motifs that are found in a variety of proteins. Many different motifs may be combined to form stably folding protein domains. Protein domains are regions of a protein that can stably fold and possess a certain function. Proteins may be made up of one or many domains. Each protein domain has a fold, which refers to the arrangement of secondary structure elements in that domain. Relatively few distinct folds exist relative to the number of protein sequences; accordingly, one fold may be utilized by many different proteins to perform a range of functions.

Introduction

The overall three-dimensional structure of a protein chain, including the positions of amino acid side chains, is referred to as the tertiary structure of the protein. Knowledge of protein tertiary structure is essential to understanding how enzymes function and how to design, inhibit, and activate proteins.

Tertiary Structure Domains, Folds and Motifs, Fig. 1 Cartoon representation of common structural motifs in proteins. Alpha helices are shown in orange, beta strands in purple, and loop and turn regions in the motif are shown in light green. (a) The lambda-cro protein dimerizes and binds to DNA using a helix-turn-helix motif. Each monomer possesses the same motif (PDB ID: 6CRO). (b) The N-terminal domain of this tyrosine kinase has two beta-hairpin motifs (PDB ID: 1HUF). (c) The enzyme trypsin possesses a Greek-key motif. This particular Greek-key motif has a short alpha helix inserted between strands three and four. Note how beta-strand four interacts with beta-strand one, common to this motif (PDB ID: 2ZDK). PDB coordinates were downloaded from http://www.rcsb.org/pdb (This figure was created with UCSF Chimera (Pettersen et al. 2004))

As the tertiary structure of the protein is specific to a particular protein sequence, several additional terms describe the architecture of proteins in three-dimensional space, and these terms facilitate the comparison of different proteins on several levels. Protein structures are often composed of independently folding units within the same protein chain, called domains. Every protein domain has a fold, which refers to the arrangement of secondary structure elements in that domain. Protein domains often contain specific arrangements of a few contiguous secondary structure elements. Such groups of secondary structure elements that are repeatedly found in a variety of proteins are called motifs or supersecondary structures. Motifs are unable to fold independently and often do not perform a specific function, which distinguishes motifs from protein domains.

Levels of Protein Structure
Proteins can be described using four levels of structure: the primary structure, or amino acid sequence of the protein; the secondary structure, or the local spatial arrangement of the polypeptide backbone atoms, often organized into regular elements such as alpha helices and beta sheets; the tertiary structure of a protein, or the overall structure of a single polypeptide chain; and the quaternary structure, or the arrangement of multiple polypeptide chains. The tertiary structure describes the position of all backbone atoms as well as side-chain atoms in a given polypeptide chain. Using this definition, it is clear that each different protein sequence necessarily has a unique tertiary structure. This can be confusing, as the core arrangement of secondary structure elements, also called the fold, can be the same for proteins of different sequences. Therefore, in order to more easily compare protein structures, the folds of individual domains are often used.

Motifs
Secondary structure elements describe local regions of polypeptide backbone conformation and include the alpha helix and beta sheet. A few of these contiguous secondary structure elements can combine in specific three-dimensional arrangements to form supersecondary structures or structural motifs. These structural motifs may be associated with a function, such as the DNA-binding helix-turn-helix motif (Fig. 1a), but more often are not associated with any particular function. Other common structural motifs include the beta-hairpin (Fig. 1b) and Greek-key (Fig. 1c) motifs. Motifs are unable to fold or evolve independently and are only found as fragments of a protein. In some cases a series of motifs may combine to form a stably folding domain. The TIM barrel or (beta-alpha)8 fold is composed of a series of eight overlapping beta-alpha-beta motifs (Figs. 2b and 3a).

Tertiary Structure Domains, Folds and Motifs, Fig. 2 Cartoon and schematic representations of protein domains. (a) The glucocorticoid nuclear receptor forms a homodimer and possesses a three-domain structure. The three-dimensional structure of the A/B domain is unknown, but the DNA-binding domain (DBD, shades of blue) and the ligand-binding domain (LBD, shades of green) have been solved separately (PDB IDs: 1M2Z and 1GLU). (b) PRAI and IGPS catalyze successive reactions in the tryptophan biosynthesis pathway (PDB ID: 1PII). (c) PSD-95 is a membrane-associated guanylate kinase that possesses a series of protein-protein interaction domains (PDB IDs: 1TP3 and 1JXM). PDB coordinates were downloaded from http://www.rcsb.org/pdb (This figure was created with UCSF Chimera (Pettersen et al. 2004))

Protein Domains

Within a given protein structure, there are regions that independently form stable tertiary structures. These regions are called domains. Domains can independently fold and evolve and are associated with a specific function. Protein domains are not always composed of contiguous stretches of polypeptide and may have additional regions of polypeptide inserted into the protein. As domains can fold independently and have a specific function, they can be treated as modules that can be swapped from one protein to another or even added on to a protein. Nuclear receptor proteins have a multidomain structure, including an N-terminal hypervariable (or A/B) domain, a DNA-binding domain, and a ligand-binding domain (Fig. 2a). Studies with nuclear receptors demonstrated that swapping the DNA-binding domains of two different receptors also swaps the DNA sequence recognized by the protein (Giguere et al. 1987). Domains may be added to the N- or C-terminus of a protein. The zinc-finger domain is a small DNA-binding domain common in eukaryotes. Studies have demonstrated that multiple zinc-finger domains can be linked together to increase the region of DNA recognized by the protein, therefore greatly increasing its specificity (Urnov et al. 2010). Though a given protein domain has a given function, another protein with the same fold (see “Protein Folds” below) as this domain may or may not share this function. This is especially true of larger protein domains which can perhaps more easily evolve new functions; however, many smaller protein domains do possess conserved functions.


Tertiary Structure Domains, Folds and Motifs, Fig. 3 Cartoon representations of three protein folds. Alpha helices are shown in orange, beta strands in purple, and loop and turn regions are shown in light green for each domain fold. (a) The (beta-alpha)8 fold of triose-phosphate isomerase, also known as the TIM barrel fold (PDB ID: 1SQ7). (b) The Rossmann (beta-alpha-beta-alpha-beta)2 fold of malate dehydrogenase shown with NAD bound. The C-terminal catalytic domain is shown in gray (PDB ID: 1EMD). (c) The zinc-finger beta-beta-alpha fold shown with zinc and its ligands (PDB ID: 1A1I) (This figure was created with UCSF Chimera (Pettersen et al. 2004))

When a domain of conserved function appears repeatedly in nature, it is often given a name. For example, there is the classic DNA-binding zinc-finger domain, and the PDZ, SH3, and SH2 domains are all protein-protein interaction domains. There are many reasons multidomain proteins exist. Some multidomain proteins are composed of two domains that catalyze successive steps in a biosynthetic pathway. The bifunctional enzyme phosphoribosylanthranilate isomerase-indoleglycerolphosphate synthase (PRAI-IGPS) is one such enzyme and is composed of two (beta-alpha)8 barrel domains fused together (Fig. 2b). In this way the product of the first reaction is shuttled directly to the next reaction in the pathway (Wilmanns et al. 1992). Other multidomain proteins play an important role in cell signaling and transport. Proteins involved in signaling may have multiple PDZ and SH3 domains in order to facilitate the formation of protein signaling complexes (Fig. 2c).

Protein Folds

The fold of a protein is closely related to the domain. Each domain possesses a given fold, which refers to the core arrangement of secondary structure elements, excluding side-chain position. The fold of a protein may involve the insertion or deletion of regions of polypeptide. Many proteins with different tertiary structures and a variety of functions can utilize the same fold, and as a result most folds are not associated with a given function. The TIM barrel or (beta-alpha)8 fold (Fig. 3a) is the most common protein fold in existence and can catalyze reactions in five of the six major enzyme classification groups (Wierenga 2001). Therefore, no functional information can be gleaned with only the knowledge that a protein adopts this fold. However, the Rossmann fold (beta-alpha-beta-alpha-beta unit that is often paired with itself; Fig. 3b) is an exception to this rule, and this fold has been demonstrated to bind nucleotides, particularly NAD+ and NADP+ (Rao and Rossmann 1973).

Ambiguity of Terms

Confusion can occur when the terms motif and domain are seemingly used interchangeably. However, the term motif is also often used to describe regions of conserved sequence, and therefore, it may be appropriate to discuss both the zinc-finger motif and the zinc-finger domain. The zinc finger is easily recognized in a protein sequence through the conservation of the two cysteine and two histidine residues that coordinate the zinc ion. Additionally, a zinc-finger polypeptide can fold independently. Therefore, the conserved sequence elements are referred to as the zinc-finger motif, but the beta-beta-alpha fold (Fig. 3c) formed by this polypeptide is referred to as the zinc-finger domain. Recall that structural motifs may share little to no sequence identity, often cannot be recognized by the conservation of residues in a protein sequence, and are unable to form a stably folded structure.
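Because the zinc-finger sequence motif is defined by two conserved cysteines and two conserved histidines, it can in principle be searched for with a simple pattern. The following is a minimal sketch only: the residue spacing used here (C-x(2,4)-C-x(12)-H-x(3,5)-H), the function name, and the test sequence are assumptions made for illustration and do not come from this entry; curated motif databases use more refined patterns.

```python
import re

# Simplified C2H2 zinc-finger sequence pattern (an assumption for illustration):
# Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_zinc_finger_motifs(sequence: str) -> list[tuple[int, str]]:
    """Return (start index, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(sequence.upper())]


# Synthetic sequence (hypothetical, not a real protein) with one motif after "MA".
demo = "MA" + "C" + "PE" + "C" + "GKSFSRSDELTR" + "H" + "IRI" + "H" + "TGQ"
print(find_zinc_finger_motifs(demo))
```

A match of this kind identifies only the sequence motif; whether the matched region folds into a zinc-finger domain is a separate structural question, which is the distinction drawn above.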

References

Branden C-I, Tooze J (1999) Introduction to protein structure, 2nd edn. Garland, New York
Giguere V, Ong ES, Segui P, Evans RM (1987) Identification of a receptor for the morphogen retinoic acid. Nature 330:624–629
Petsko GA, Ringe D (2004) Protein structure and function. Primers in biology. Sinauer, Sunderland
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF Chimera – a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
Rao ST, Rossmann MG (1973) Comparison of supersecondary structures in proteins. J Mol Biol 76(2):241–256
Urnov FD, Rebar EJ, Holmes MC, Zhang HS, Gregory PD (2010) Genome editing with engineered zinc finger nucleases. Nat Rev Genet 11:636–646
Wierenga RK (2001) The TIM-barrel fold: a versatile framework for efficient enzymes. FEBS Lett 492(3):193–198
Wilmanns M, Priestle JP, Niermann T, Jansonius JN (1992) Three-dimensional structure of the bifunctional enzyme phosphoribosylanthranilate isomerase: indoleglycerolphosphate synthase from Escherichia coli refined at 2.0 Å resolution. J Mol Biol 223(2):477–507

Tertiary Structure, Forces Maintaining the Stability of

Nathan Winter
Department of Chemistry and Biochemistry, St. Cloud State University, St. Cloud, MN, USA

Synonyms

3° Structure; Intermolecular bonds; Intermolecular forces; Protein shape; Three-dimensional structure

Synopsis

The tertiary structures of proteins are stabilized by intermolecular forces; the hydrophobic effect, hydrogen bonds, ionic interactions, and London dispersion forces all contribute. The relative contribution of these forces has been quantified. More recently, the strategies employed by thermophilic proteins, which allow them to maintain their structures at extreme temperatures, have been identified. A growing appreciation of the need for some proteins to change their tertiary structure has developed, and the factors which allow them to adopt radically different folds are being identified.

Introduction

In order to understand both the stability of protein structures and their inherent plasticity, it is necessary to understand the transient forces that fold and hold the linear protein chain in a three-dimensional shape. Although these forces often act within an individual molecule, they are collectively labeled intermolecular forces, to distinguish them from the covalent bonds that link the individual amino acids together to form a protein chain. The relative strength of the intermolecular forces will determine the relative rigidity or flexibility of the protein shape.

The Intermolecular Forces

The hydrophobic effect has been recognized as the principal contributing force for maintaining tertiary structure since the late 1930s (Tanford 1997). Stated in its simplest form, like dissolves like, the hydrophobic effect is the force that causes hydrophobic molecules, e.g., hydrocarbons, to aggregate when in a polar solvent such as water. The hydrophobic amino acid side chains, like the indole group of tryptophan, form the core of a globular protein. The free energy of the hydrophobic effect is derived in part from changes in entropy (Chandler 2005). Water molecules tend to orient when next to a nonpolar neighbor in order to maximize the number of hydrogen bonds (Hore et al. 2008). When hydrophobic molecules aggregate and exclude water, the number of oriented water molecules is reduced, thus increasing entropy. In order to quantify the contribution that the hydrophobic effect makes toward protein stability, researchers created a series of mutations that altered the hydrophobic profile of proteins (e.g., isoleucine to valine) and measured the stability by way of circular dichroism (Pace et al. 2011). They compared their measured hydrophobic stability to calculated contributions from hydrogen bonding and disulfide linkages and concluded that on average, the hydrophobic effect accounts for 60 ± 4% of the proteins’ stability.

Other forces also contribute to protein stability. Hydrogen bonds have been estimated to contribute 40 ± 4% of a protein’s stability (Pace et al. 2011). Disulfide bonds contribute; however, because of their relatively low occurrence, their importance is less. This is also true for ionic interactions. It can be argued that London forces will not contribute to a protein’s stability because both a folded and an unfolded protein will have the same number of interatomic contacts when solvating water is included, but they cannot be discounted, as the following discussion on thermophilic proteins will demonstrate.

Thermophilic proteins: Organisms which live in environments with elevated temperatures such as geothermal springs have proteins which resist heat denaturation. The organisms that thrive at temperatures above 45 °C and their proteins are both called thermophilic. Some thermophilic proteins are stable at temperatures which exceed the normal boiling point of water. Thermophilic proteins achieve their higher stability by a number of different mechanisms, including having more stabilizing ion pairs and more hydrogen bonds. They frequently have larger hydrophobic cores with better packing and fewer voids. This efficient packing increases the London force stabilization. Other strategies include having smaller surface loops and more proline residues for a more constrained structure (Vieille and Zeikus 2001).

Proteins with more than one tertiary structure: The importance of conformational change for proper protein function has been known for decades, but there is a growing recognition that for some proteins such as lymphotactin, having conformations with radically different secondary and tertiary structures is absolutely required for them to fulfill their biological role (Tuinstra et al. 2008). The plural form, native states, is more appropriate when describing them. These proteins rapidly and reversibly change conformations depending on their environment and binding partners. Since these different conformations must be energetically accessible, such proteins have decreased stability when compared to canonical, single-fold proteins. The purpose of this fold switching is to expose new surfaces, allowing for altered function (Bryan and Orban 2010).
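The requirement that alternative conformations be "energetically accessible" can be made concrete with a two-state equilibrium. The sketch below is an illustration under stated assumptions, not a model taken from this entry: it assumes only two interconverting states at equilibrium and uses the standard Boltzmann relation; the free-energy differences passed in are hypothetical values.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def state_populations(delta_g_kj_per_mol: float,
                      temperature_k: float = 298.15) -> tuple[float, float]:
    """Equilibrium fractions of states A and B, where delta_g = G(B) - G(A).

    Assumes a simple two-state equilibrium (an illustrative simplification).
    """
    ratio_b_over_a = math.exp(-delta_g_kj_per_mol * 1000.0 / (R * temperature_k))
    frac_a = 1.0 / (1.0 + ratio_b_over_a)
    return frac_a, 1.0 - frac_a


# Hypothetical free-energy gaps between two folds, in kJ/mol.
for gap in (0.0, 5.7, 20.0):
    frac_a, frac_b = state_populations(gap)
    print(f"gap {gap:5.1f} kJ/mol -> state A {frac_a:.4f}, state B {frac_b:.4f}")
```

Under this simplification, a second fold is appreciably populated only when the free-energy gap between the two folds is small, which is consistent with the statement that fold-switching proteins are less stable than single-fold proteins.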

Cross-References

▶ Chemical Denaturation
▶ Predictions from Sequence
▶ Secondary Structure
▶ Tertiary Structure Domains, Folds and Motifs

References

Bryan PN, Orban J (2010) Proteins that switch folds. Curr Opin Struct Biol 20:482–488
Chandler D (2005) Interfaces and the driving force of hydrophobic assembly. Nature 437:640–647
Hore DK, Walker DS, Richmond GL (2008) Water at hydrophobic surfaces: when weaker is better. J Am Chem Soc 130:1800–1801
Pace NC, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, Hendricks MM, Iimura S, Gajiwala K, Scholtz JM, Grimsley GR (2011) Contribution of hydrophobic interactions to protein stability. J Mol Biol 408:514–528
Tanford C (1997) How protein chemists learned about the hydrophobic factor. Protein Sci 6:1358–1366
Tuinstra RL, Peterson FC, Kutlesa S, Elgin ES, Kron MA, Volkman BF (2008) Interconversion between two unrelated protein folds in the lymphotactin native structure. Proc Natl Acad Sci U S A 105:5057–5062
Vieille C, Zeikus G (2001) Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol Mol Biol Rev 65:1–43


Theta-Replicating Plasmids, Large

Timothy J. Johnson
Department of Veterinary and Biomedical Sciences, University of Minnesota, Saint Paul, MN, USA

Synopsis

Large bacterial plasmids are mosaic and efficient accessory elements that are capable of conferring a variety of phenotypic properties to a recipient bacterial host. The genomics era has increased understanding about the genetic structures of different plasmid types, and it is evident that there are predictable regions known as “hotspots” in which certain plasmid types are able to acquire accessory genetic material. In analyzing the genomes of these plasmids, it is necessary to think about the plasmid core or backbone that is responsible for the success or failure of the plasmid and its accessory elements, which might be yet more selfish DNA or could be genes for phenotypic properties that benefit the host and thus the plasmid. This entry will focus on self-transmissible plasmids of Gram-negative bacteria, largely Enterobacteriaceae, in order to illustrate the features of such genomes, but generally, experience has indicated that similar features will apply to plasmids from different host backgrounds.

Introduction

Plasmid genomes can be circular or linear and can vary in size from as little as 1 kb to as large as hundreds of kilobases. The best known medium to large plasmids are circular and use a form of replication called “theta replication” because of the formation of characteristic replication intermediates whose structure resembles the Greek letter θ. By definition this involves separation of the double-stranded DNA at a single point (as opposed to the way in which eukaryotic DNA replication involves multiple origins) to create a replication bubble that expands either bidirectionally or unidirectionally until the whole genome is replicated. Even if linear plasmids replicate in this way, they will not create “theta intermediates” because they are not circular. The alternative form of intermediate for circular plasmids is the sigma form, named for its resemblance to the Greek letter σ, but as discussed in the essay by del Solar and Espinosa, this replication form is much less suited to large plasmids than small ones. Therefore, apart from linear plasmids, all known large, low copy, transmissible plasmids replicate using the theta mechanism.

Since the acquisition of large plasmids carries with it a metabolic burden to the bacterial host, it is extremely important that these plasmids tightly control their copy number during replication. Once established in a naïve recipient bacterial host, some plasmids use multiple means to maintain a low copy number (at or near one copy) while at the same time ensuring their successful maintenance in a bacterial population. In analyzing the genomes of these plasmids, it is necessary to think about the plasmid core or backbone that is responsible for the success or failure of the plasmid as a whole and the cargo that might be even more selfish DNA or could be genes for phenotypic properties that benefit the host and thus the plasmid. This entry will focus on self-transmissible plasmids of Gram-negative bacteria, largely Enterobacteriaceae, in order to illustrate the features of such genomes, but generally, experience has indicated that similar features will apply to plasmids from different host backgrounds.

The general classes of control systems used by plasmids to accomplish copy number control include directly repeated sequences (iterons) that interact with replication initiation proteins, counter-transcribed RNAs (ctRNAs) that hybridize to complementary regions of essential RNA, and ctRNAs and an auxiliary protein (del Solar and Espinosa 2000).
Counter-transcribed or “antisense” RNAs act as measuring devices and regulators of plasmid replication (Phillips and Funnell 2004). As ctRNAs are constitutively transcribed, increases in plasmid copy number will result in increases in ctRNA concentrations. These increasing ctRNA concentrations in turn bind rep mRNA and inhibit translation. The opposite is also true for decreasing copy number: decreasing ctRNA concentration in turn allows Rep translation to proceed. The mechanistic process of Rep inhibition is complex and multifaceted but well studied and reviewed. Examples of classical plasmid incompatibility types with ctRNA-controlled replication include IncI1, IncFII, and IncL/M plasmid types.

The majority of large plasmids utilizing theta replication encode their own replication initiation protein, host factor binding sites, A+T-rich regions, and direct repeats known as iterons. These are typical components of the basic or minimal replicon required for autonomous replication (del Solar et al. 1998). Iterons are found near the origin of replication (ori) and serve as binding sites for the replication initiation protein (Rep) and also as binding sites for the control of plasmid replication, often via chromosomally encoded factors in the bacterial host such as DnaA, IHF, and FIS. Iterons have been identified as components of the minimal replicon in plasmids of several incompatibility groups, including IncA/C, IncF, IncN, IncP-1, and IncX. In iteron-containing plasmids, these sequences are required for ori activation. The Rep protein of iteron-containing plasmids often serves as a dual regulator, being required for the activation of plasmid replication but also for feedback inhibition of replication once at elevated levels in the cell. In models for iteron-containing plasmid replication, Rep monomers bind and activate ori replication, Rep dimers bind the operator/promoter region and negatively regulate Rep production, and “handcuffing” occurs where two plasmid iterons are bound by Rep and replication is shut down.

Large plasmids also possess an array of systems enabling their successful segregation during replication and maintenance, since they would otherwise likely be lost in the absence of selective pressure.
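The ctRNA-based negative feedback described above (more plasmid copies give more ctRNA, which lowers Rep translation and therefore replication) can be sketched as a toy simulation. This is a deliberately simplified illustration, not a published model: it assumes that ctRNA concentration is proportional to copy number, that Rep activity falls hyperbolically with ctRNA, and that copies are diluted at a constant rate by cell growth. All parameter names and values are hypothetical.

```python
def simulate_copy_number(n0: float,
                         r_max: float = 2.0,   # max replication rate per copy (hypothetical)
                         mu: float = 1.0,      # dilution rate from cell growth (hypothetical)
                         k: float = 5.0,       # ctRNA level giving half-maximal Rep activity
                         dt: float = 0.01,
                         steps: int = 5000) -> float:
    """Integrate dN/dt = N * (r_max / (1 + ctRNA / k) - mu) with forward Euler."""
    n = n0
    for _ in range(steps):
        ctrna = n                              # assumption: ctRNA proportional to copy number
        rep_activity = 1.0 / (1.0 + ctrna / k) # more ctRNA -> less Rep translation
        n += dt * n * (r_max * rep_activity - mu)
    return n


# Starting below or above the set point, the copy number relaxes to the same value.
for start in (1.0, 50.0):
    print(f"start {start:5.1f} -> final {simulate_copy_number(start):.3f}")
```

In this toy model the steady state is k * (r_max / mu - 1), so the copy number is set by the feedback parameters rather than by the starting value, which is the qualitative behavior expected of a "measuring device."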
The broad classes of systems preventing such plasmid loss include multimer resolution systems, active partitioning systems, and postsegregational killing systems (Thomas 2000).

In the absence of a universal gene such as the 16S rRNA gene of bacteria, the typing and phylogenetic analyses of plasmids have proven challenging given their mosaic structures and complex evolutionary history. Initially, plasmids were classified based upon their inability to reside together within the same bacterial cell, known as incompatibility (Carattoli et al. 2005; Couturier et al. 1988). Up to 26 different incompatibility groups have been described in the Enterobacteriaceae, and the plasmids most studied for their basic biology belong to these core groups. However, it has become evident that plasmid diversity is vast and extends far beyond this relatively small number of groups studied within Enterobacteriaceae. As a result, other typing methods have been proposed to more universally classify plasmids based upon their core structures. For large conjugative plasmids, relaxase proteins hold great promise for a more structured and universal typing approach since the relaxase is the only essential component common to both mobilizable and conjugative plasmids (Garcillan-Barcia et al. 2009). The remainder of this section will attempt to describe the plasmid genome structures of large, theta-replicating plasmids relative to both incompatibility type and MOB class. To be concise, only the plasmids that have been best studied in their basic biology and for which multiple sequences are currently available will be highlighted.

In contrast to the small, mobilizable plasmids that contain a minimal gene set required for replication (Rep) and mobilization (MOB) when in the presence of a helper plasmid, self-transmissible plasmids contain these elements and encode a type IV secretion system allowing for functional self-transfer. A common component of all conjugative plasmids is the relaxase protein, which usually contains a relaxase domain at its N-terminus and DNA primase or helicase domains at the C-terminus. The universal nature of relaxase proteins in plasmids makes them well suited for phylogenetic studies. The main families of theta-replicating, large conjugative plasmids based upon this typing approach are MOBF, MOBH, and MOBP.
An increasing number of completed plasmid sequences have become available in the last 10 years. This has afforded the opportunity to perform comparisons of multiple plasmids belonging to described incompatibility groups in an effort to refine the basic structures of plasmids within the same incompatibility group and the basic structures based on broader criteria.

The MOBF group includes plasmids belonging to incompatibility groups IncFIB, IncFII, IncN, IncP-9 (Tol), and IncW. The IncF plasmids alone include at least seven incompatibility groups, all belonging to the F12 MOB clade (Garcillan-Barcia et al. 2009). IncF plasmids are the most common plasmid type among Enterobacteriaceae and are noteworthy for their carriage of virulence factors of all pathotypes of E. coli, Salmonella enterica, and Shigella spp., among others. IncF plasmids also carry antibiotic resistance determinants, which can co-reside on virulence plasmids. The core backbone of IncF plasmids, belonging to the F12 MOB clade, is well defined due to extensive study of its basic biology and the vast number of completed F-type plasmid sequences available. IncF plasmids range in size from 34 to 223 kb with a core backbone of approximately 30–90 kb (Table 1). These plasmids have numerous sites for the integration of accessory elements and have accessory genome sizes that in many cases are larger than the core backbone itself (Fig. 1). In contrast, plasmids belonging to the F11 MOB clade (IncN, IncP-9, and IncW) have a smaller core backbone (30–35 kb) and an apparently smaller accessory genome (5–80 kb). It is worth noting that the carriage capacity of the accessory genome of IncP-9 and IncN plasmids appears to be larger than that of IncW plasmids. This could be due to the relatively recent ancestry of IncW plasmids and the lack of bona fide integration hotspots for the acquisition of genetic modules.

Theta-Replicating Plasmids, Large, Table 1 Characteristics of large, theta-replicating plasmids

MOB type  Inc type  Replication control  Stability                       Integration hotspots^a  Core backbone^b (kb)  Accessory size (kb)  Total size range (kb)
F12       FIB       Iteron               psiAB, parAB, sopAB             5                       90                    0–50                 55–184
F12       FII       ctRNA                psiAB, parAB, sopAB             4                       75                    0–150                34–223
F11       N         Iteron               stbABC                          3                       35                    10–40                44–79
F11       P-9       Iteron               parABC                          3                       35                    45–80                81–116
F11       W         Iteron               stbABC                          8                       30                    5–10                 33–39
H12       A/C       Iteron               parAB                           3                       120                   20–80                144–199
H11       H         Iteron               parAB                           3                       150                   25–125               241–324
H12       P-7       Iteron               parWABC                         3                       75                    45–90                128–200
H12       T         Iteron               parAB                           ND                      ND                    ND                   217
P12       I1        ctRNA                mck, kor, parAB, psiAB          1                       90                    0–30                 69–120
P13       L/M       ctRNA                parAB                           2                       50                    10–40                60–98
P11       P-1       Iteron               kor, incC, kle, kla, parDE-CBA  3                       45                    0–60                 41–108
P4        U         Iteron               korAB, kfrAC                    1                       35                    10–55                45–85
P3        X         Iteron               parFG                           1                       30                    0–20                 30–51

^a Sequenced IncW plasmids have eight sites where DNA has been inserted, although no apparent integration hotspots are evident
^b Core backbone, accessory size, and total size range were determined using available complete plasmid genomes

Theta-Replicating Plasmids, Large, Fig. 1 General features of large, theta-replicating plasmids. Black arrows indicate known integrative hotspots for accessory module acquisition. Blue boxes indicate conjugative transfer systems. Replicons, stability systems, and transcriptional regulators are noted. Figures are not drawn to scale. Backbone colors are as follows: MOBP, red; MOBH, blue; MOBF, orange

The MOBH group includes IncA/C, IncH, IncP-7, and IncT plasmids. IncH plasmids are the best studied members of this MOB group. This MOB group is notable because its members generally possess larger core backbones (75–120 kb) and carrying load potential (20–125 kb). IncA/C and IncH plasmids are well known for their ability to acquire antimicrobial resistance gene modules, and IncP-7 plasmids are known for carrying accessory modules involved in the biodegradation of naphthalene, salicylate, carbazole, and dioxin. These plasmids have clearly defined integrative hotspots that have been demonstrated to acquire an impressive repertoire of genetic elements, and they generally possess intermediate to broad host range.

MOBP is the broadest of the conjugative plasmid relaxase groups and includes IncI1, IncL/M, IncP-1, IncU, and IncX plasmids. IncP-1 plasmids are the best studied members of this MOB group. With the exception of IncI1, these plasmids have a smaller core backbone (30–40 kb) with integrative hotspots and an apparent carrying load approximately equal in size to the backbone (up to 55 kb). IncI1 plasmids have a larger backbone of approximately 90 kb due to their multiple conjugative systems, and a single integrative hotspot with an apparent genetic carrying load of up to at least 30 kb. These plasmid types are all capable of acquiring antimicrobial resistance-encoding modules, and extensive sequencing of the IncP-1 plasmids in particular has revealed three integrative hotspots for such acquisitions.

Plasmid host range plays an important role in distinguishing between a “specialist” of narrow host range and a “generalist” of broad host range (De Gelder et al. 2008). Key work has revealed that subtle changes in the plasmid replication initiation gene in IncP plasmids result in a shift in host range (Sota et al. 2010). Thus, plasmid replicons likely play a key role in determining the range of bacterial hosts in which a plasmid can replicate. It is also likely that other plasmid-encoded genes play a role in these functions.

It is unknown what determines the ultimate load carrying capacity of a plasmid. However, it likely is a function of plasmid stability and energy cost to the bacterial host as well as essentiality of any accessory genes carried by the plasmid in a particular niche. As mentioned above, plasmids achieve stability through their own machinery that ensures successful resolution, partitioning, and segregation. Energy cost can be theoretically related to gene expression. Broad-host-range plasmids such as IncA/C, IncF, IncH, IncP, and IncX have multiple DNA-binding proteins with similarity to bacterial host proteins such as H-NS, HU, and IHF. In the IncH and IncP-7 plasmids, the H-NS-like proteins have been shown to mitigate fitness cost and to dampen the host transcriptional response to the plasmid itself (Dillon et al. 2010). Thus, it appears that plasmids with the greatest load carrying capacity, and the best means by which to acquire genetic modules, have acquired their own regulatory genes that mitigate the energy costs to the host.

Even among closely related plasmid groups, adaptive evolution has occurred, making them distinct and unique in terms of their overall content, genetic load carrying capacity, copy number, and host range. Despite these differences, there are common core attributes between different plasmid types that seem to enable similar characteristics.
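The relationships between incompatibility group, MOB clade, and replication control summarized in Table 1 lend themselves to a simple lookup. The sketch below is only an illustration: the dictionary is transcribed from Table 1, while the function names and the handling of an optional "Inc" prefix are assumptions introduced here.

```python
# Inc group -> (MOB clade, replication control), transcribed from Table 1.
TABLE_1 = {
    "FIB": ("F12", "Iteron"), "FII": ("F12", "ctRNA"),
    "N":   ("F11", "Iteron"), "P-9": ("F11", "Iteron"), "W": ("F11", "Iteron"),
    "A/C": ("H12", "Iteron"), "H":   ("H11", "Iteron"),
    "P-7": ("H12", "Iteron"), "T":   ("H12", "Iteron"),
    "I1":  ("P12", "ctRNA"),  "L/M": ("P13", "ctRNA"),
    "P-1": ("P11", "Iteron"), "U":   ("P4", "Iteron"),  "X": ("P3", "Iteron"),
}


def classify(inc_group: str) -> tuple[str, str]:
    """Return (MOB clade, replication control) for a name such as 'IncFII' or 'FII'."""
    key = inc_group[3:] if inc_group.startswith("Inc") else inc_group
    return TABLE_1[key]


def mob_family(inc_group: str) -> str:
    """Return the relaxase family (MOBF, MOBH, or MOBP) for an Inc group."""
    return "MOB" + classify(inc_group)[0][0]


print(classify("IncFII"), mob_family("IncA/C"), mob_family("X"))
```

A lookup of this kind covers only the groups listed in Table 1; plasmids outside these groups would need relaxase-based typing as described in the text.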


References

Carattoli A, Bertini A, Villa L, Falbo V, Hopkins KL, Threlfall EJ (2005) Identification of plasmids by PCR-based replicon typing. J Microbiol Methods 63(3):219–228
Couturier M, Bex F, Bergquist PL, Maas WK (1988) Identification and classification of bacterial plasmids. Microbiol Rev 52(3):375–395
De Gelder L, Williams JJ, Ponciano JM, Sota M, Top EM (2008) Adaptive plasmid evolution results in host-range expansion of a broad-host-range plasmid. Genetics 178(4):2179–2190
del Solar G, Espinosa M (2000) Plasmid copy number control: an ever-growing story. Mol Microbiol 37(3):492–500
del Solar G, Giraldo R, Ruiz-Echevarria MJ, Espinosa M, Diaz-Orejas R (1998) Replication and control of circular bacterial plasmids. Microbiol Mol Biol Rev 62(2):434–464
Dillon SC, Cameron AD, Hokamp K, Lucchini S, Hinton JC, Dorman CJ (2010) Genome-wide analysis of the H-NS and Sfh regulatory networks in Salmonella Typhimurium identifies a plasmid-encoded transcription silencing mechanism. Mol Microbiol 76(5):1250–1265
Garcillan-Barcia MP, Francia MV, de la Cruz F (2009) The diversity of conjugative relaxases and its application in plasmid classification. FEMS Microbiol Rev 33(3):657–687
Phillips G, Funnell BE (2004) Plasmid biology. ASM Press, Washington, DC, 613 p
Sota M, Yano H, Hughes JM, Daughdrill GW, Abdo Z, Forney LJ, Top EM (2010) Shifts in the host range of a promiscuous plasmid through parallel evolution of its replication initiation protein. ISME J 4(12):1568–1580
Thomas CM (2000) The horizontal gene pool: bacterial plasmids and gene spread. Harwood Academic, Amsterdam

Three-Dimensional Structure

▶ Tertiary Structure, Forces Maintaining the Stability of

3′ End Processing

▶ Co-transcriptional mRNA Processing in Eukaryotes

3° Structure

▶ Tertiary Structure, Forces Maintaining the Stability of

Toll-Like Receptor 3: Structure and Function

James Marion
Department of Chemistry and Biochemistry, University of California, San Diego, CA, USA

Synopsis

Toll-like receptors (TLRs) are a family of innate immune recognition receptors able to recognize a variety of pathogen-associated molecular patterns (PAMPs) and induce the activation of a number of host defenses. Similar to the rest of the TLR family, TLR3 is a type I integral membrane glycoprotein with an N-terminal ligand recognition domain, a single transmembrane domain, and a C-terminal cytoplasmic signaling domain. TLR3 recognizes dsRNA, a molecular pattern associated with viral infection, and induces the activation of NF-κB and the production of type I interferons. Structurally, although the three domains of TLR3 have not been crystallized together, the N-terminal TLR3-ECD has been independently characterized by two groups; these groups showed that the TLR3-ECD displays a heavily glycosylated horseshoe-shaped solenoid structure consisting of 23 leucine-rich repeat motifs capped on the N- and C-termini by leucine-rich repeat domains (LRR-NT and LRR-CT). TLR3 ligand recognition is specific for dsRNA and dependent upon the pH and dsRNA length, causing ligand-induced dimerization of TLR3-ECDs and leading to TLR3 signal transduction. The dimeric orientation of receptor ECDs is proposed to bring the cytoplasmic TIR signaling domains of the TLR3 homodimer complex into close proximity with one another to create a signaling platform upon which the organization of downstream signaling cascades can occur. At this point, TLR3 diverges from classical TLR signaling as it acts to recruit an adaptor protein known as TRIF. This 712 amino acid adaptor protein has distinct protein-interaction motifs allowing it to recruit effector proteins, resulting in the activation of three possible signaling cascades. The first two cascades, mediated by the recruitment of TRIF to the TIR domain of TLR3, involve the subsequent recruitment of receptor-interacting protein 1 (RIP1) kinase and activate one of two downstream signaling cascades, depending upon the ubiquitination state of RIP1. A third signaling cascade, also mediated by the recruitment of TRIF to the TIR domain of TLR3, requires the recruitment of effector molecules that have been shown to include molecular bridge proteins such as tumor necrosis factor receptor-associated factor (TRAF) 1, TRAF2, TRAF3, and TRAF6. These proteins mediate signal transduction pathways by interacting with downstream protein kinases, acting as components of ubiquitin ligase machinery and other adaptor proteins. In the TLR3 signaling cascade, once recruited to the cell surface receptor domain, these bridging proteins are able to mediate the activation of TANK-binding kinase 1 (TBK1). These three defined pathways initiate proinflammatory responses that recruit additional immune cells to the site of infection, inhibit bacterial and viral replication, communicate danger signals to the surrounding cells, and induce apoptosis in cells with overwhelming infection.

Introduction

As discussed in ▶ “Toll-Like Receptors: Evolution and Structure” and ▶ “Toll-Like Receptors: Pathogen Recognition and Signaling”, the Toll-like receptors (TLRs) are a family of innate-immune recognition receptors that are able to recognize a variety of pathogen-associated molecular patterns (PAMPs) and induce the activation of a number of host defenses. Similar to the rest of the TLR family, TLR3 is a type I integral membrane glycoprotein with an N-terminal ligand recognition domain, a single transmembrane domain, and a C-terminal cytoplasmic signaling domain (Bell et al. 2005; Choe et al. 2005). TLR3 recognizes dsRNA, a molecular pattern associated with viral infection, and is known to induce the activation of NF-κB and the production of type I interferons (Keating et al. 2011; Kumar et al. 2011; Wang and Fish 2012).

Structurally, while all three domains of TLR3 have not been crystallized together, the TLR3-ECD has been characterized independently by two groups (Bell et al. 2005; Choe et al. 2005). Both groups showed that the TLR3-ECD displayed a heavily glycosylated horseshoe-shaped solenoid structure that consisted of 23 leucine-rich repeat motifs. The structure was further shown to be capped on the N- and C-termini by leucine-rich repeat domains (LRR-NT and LRR-CT) (Fig. 1). Interestingly, while both groups had elucidated similar ECD structures, they disagreed on the predicted location of the ligand binding site as well as the mechanism by which TLR3 was able to recognize viral dsRNA and initiate signaling. Choe et al. postulated that dsRNA would bind to the convex, glycan-free surface of the TLR3-ECD (Choe et al. 2005). This, they hypothesized, would enable dsRNA to bind to the positively charged residues of the TLR3-ECD surface. Bell et al., however, proposed three dsRNA binding sites (Bell et al. 2005). In their crystal structure, two sulfate ions from the crystallization medium had bound to the concave surface of the TLR3-ECD. Since the sulfate ions shared a similar atomic arrangement with phosphate groups, the researchers reasoned that the sulfate sites could represent two potential dsRNA binding sites. The third predicted site was proposed to be a binding site on the glycan-free face of the receptor.

Toll-Like Receptor 3: Structure and Function, Fig. 1 Structures of the TLR3-ECD solved independently by two groups. (a) TLR3-ECD solved by Choe et al. in 2005 (PDB ID: 1ZIW) (Choe et al. 2005). (b) TLR3-ECD solved by Bell et al. in 2005 (PDB ID: 2A0Z) (Bell et al. 2005). The two sulfate groups, which were found to have crystallized with the ECD, are depicted as red spheres

These discrepancies were not resolved until 2008, when the crystal structure of the mouse TLR3-ECD (mTLR3), bound to a 46-base-pair dsRNA molecule, was published (Fig. 2) (Liu et al. 2008). As was detailed in ▶ Toll-Like Receptors: Pathogen Recognition and Signaling, the TLR3-ECD exists as a monomer in solution, and upon ligand binding, dimerization occurs. The mTLR3-dsRNA structure showed that dsRNA interacted with basic residues at both the N- and C-terminal sites on the lateral, glycan-free side of the TLR3-ECD (Liu et al. 2008). When bound to dsRNA, direct TLR3-TLR3 interactions at the C-terminus were also observed, effectively defining three points of interaction in the dsRNA-TLR3 complex: N-terminal with dsRNA, C-terminal with dsRNA, and C-terminal receptor-receptor interactions (the last of which could contribute to the formation of a TLR3 signaling unit) (Liu et al. 2008). Interestingly, while the N- and C-terminal sites were separated by 55–60 Å in each ECD, the two N-terminal sites in the complex were separated by 110 Å. The latter distance was equivalent to the length of a 45-base-pair dsRNA molecule, effectively preventing shorter molecules of dsRNA from binding and suggesting to investigators that this was a mechanism for preventing autoreactive responses against self-dsRNA (endogenous siRNA consists of 25 base pairs or fewer) (Liu et al. 2008). Further analysis of the mTLR3-dsRNA complex revealed that dsRNA retains a typical A-form (A-DNA-like) structure in which the ribose phosphate backbone and the position of the grooves are the determinants in binding (Liu et al. 2008). The TLR3-ECD interacts not with the individual bases of dsRNA, but rather through electrostatic interactions between the phosphate groups of the dsRNA backbone and the imidazole rings of four histidine residues (three in the N-terminal site and one in the C-terminal site) in the TLR3-ECD. Because recognition does not depend on base sequence, these features have been postulated to prevent viruses from escaping detection through changes in their dsRNA sequence (Botos et al. 2011).

Toll-Like Receptor 3: Structure and Function, Fig. 2 mTLR3 bound to 46 bp dsRNA. Crystal structure solved in 2008 by Liu et al. (PDB ID: 3CIY) (Liu et al. 2008)

Further experiments done in 2008 focused on dsRNA, its requirements for binding to the TLR3-ECD, and the minimal signaling unit required for effective TLR3 signal transduction (Leonard et al. 2008). Using a multitude of biophysical techniques, Leonard et al. were able to show that TLR3 ligand recognition was specific for dsRNA and that this event was dependent upon the pH and dsRNA length (Leonard et al. 2008). Using SPR (surface plasmon resonance), SEC (size exclusion chromatography), and AUC (analytical ultracentrifugation) experiments, Leonard et al. defined the length of dsRNA that would accommodate a TLR3 dimer, the proposed TLR3 signaling unit, as 48 base pairs (Leonard et al. 2008). However, further experiments showed multiple TLR3-ECD dimers bound to long dsRNA strands (dsRNA that was 90 base pairs in length bound two TLR3-ECD dimers, while dsRNA that was 139 base pairs in length bound three TLR3-ECD dimers), suggesting a more detailed analysis was necessary (Leonard et al. 2008). Using a GFP-reporter gene assay, Leonard et al. were able to show that when TLR3 was located in an environment where the pH was approximately 6.0–6.5, such as that of the early endosome, dsRNA ligands had to be greater than 90 base pairs in length to activate TLR3 complexes (2–3 TLR3-ECD dimers) (Leonard et al. 2008). However, they further showed that when TLR3 proteins were located in environments where the pH was below 5.5, such as that of the late endosomes, dsRNA ligands had to be greater than 48 base pairs to activate TLR3 complexes (1–2 TLR3-ECD dimers) (Leonard et al. 2008). The mouse TLR3-dsRNA crystal structure confirmed these findings as the ligand interaction sites on the TLR3-ECD (two TLR3-ECD N-terminal regions) were shown to be separated by 120 Å, a length well suited to stable binding of a TLR3-ECD dimer to 40–50 base pairs of dsRNA (Liu et al. 2008). However, given the observation that the dsRNA length, location, and pH could influence the nature of TLR3-dsRNA complexes, the minimum signaling unit for TLR3 remained ambiguous.

While previous reports suggested that ligand-induced dimerization was the only necessary event for the transmission of signal to the cytoplasm and activation of an immune response, recent reports have suggested that proteolytic processing of TLR3 may also be required before TLR3 can recognize viral ligand and signal competently to downstream effector molecules (Garcia-Cattaneo et al. 2012; Qi et al. 2012). Newly synthesized TLR3 in the ER (endoplasmic reticulum) is associated with the chaperone protein Unc93b1 (Tabeta et al. 2006; Kim et al. 2008). This association has been shown to be required for, and retained throughout, TLR3’s trafficking from the ER to the Golgi and then to the endosomal compartments (Tabeta et al. 2006; Kim et al. 2008). Several groups have now shown that, once TLR3 is in the endosome, the TLR3-ECD can be cleaved by cathepsins B and H (Tabeta et al. 2006; Kim et al. 2008). This cleavage event was localized to amino acids 323–343, encompassing LRR12 and its unique loop structure (Tabeta et al. 2006; Kim et al. 2008). Yet, just as with the proteolytic processing events in TLR7/8/9, researchers have yet to agree on the importance of this event. Some groups suggest that cleavage does not, in fact, dissociate the receptor into two fragments.
Rather, they postulate that this event acts to induce stability and create a required “cleaved/associated” TLR3 that results in a functional receptor which is able to detect and react to viral dsRNA (Tabeta et al. 2006). However, other groups have shown that uncleaved TLR3 can still induce signal transduction events and postulate that this cleavage event has other, as yet undefined, roles in the cell (Kim et al. 2008). Whether proteolytic processing is required for ligand recognition by TLR3 or is involved in other facets of TLR3-ECD function is not yet known.

What researchers do agree upon, however, is that ligand-induced dimerization of TLR3-ECDs is required for TLR3 signal transduction (Schröder and Bowie 2005). The dimeric orientation of receptor ECDs is proposed to bring the cytoplasmic TIR domains of the TLR3 homodimer complex into close proximity with one another (Funami et al. 2008). This nonenzymatic event is then thought to create a signaling platform upon which the organization of downstream signaling cascades can occur (Funami et al. 2008; Imamura et al. 2009). At this point, TLR3 diverges from classical TLR signaling as it does not induce a MyD88-dependent signaling cascade. Instead, the TLR3 TIR-based signaling platform acts to recruit an adaptor protein known as TRIF. This 712 amino acid adaptor protein has distinct protein-interaction motifs that allow it to recruit effector proteins, which results in the activation of at least three possible signaling cascades (Fig. 3) (O’Neill and Bowie 2007). The first two cascades, mediated by the recruitment of TRIF to the TIR domain of TLR3, involve the subsequent recruitment of RIP1 kinase (receptor-interacting protein 1), a kinase that associates with the RIP homotypic interaction motif in the TRIF domain (O’Neill and Bowie 2007). From this point, one of two downstream signaling cascades is activated depending upon the ubiquitination state of RIP1. Lysine residue 377 of RIP1 has been shown to be an acceptor site for K63-linked polyubiquitination (O’Donnell et al. 2007).
Ubiquitination of this site in RIP1 results in the activation of the IκB kinase complex (IKK), composed of a heterodimer of the catalytic IKK-α and IKK-β subunits and a master regulatory protein termed NEMO (NF-κB essential modulator) (O’Donnell et al. 2007). Once activated, this complex phosphorylates two serine residues on the downstream protein IκB-α (inhibitor of κB-α). IκB-α maintains the NF-κB protein complex, consisting of two subunits, RelA (p65) and NF-κB1, from a five-subunit family (RelB, c-Rel, and NF-κB2 (p52 and its precursor p100) are the other members), in a dormant state until this phosphorylation event occurs (O’Donnell et al. 2007; Imamura et al. 2009). Upon phosphorylation, IκB-α is further modified through ubiquitination events that lead to its degradation by the proteasome (Imamura et al. 2009). Upon dissociation from IκB-α, NF-κB is then able to translocate to the nucleus where it can mediate the induction of genes involved in the immune response, the cell survival response, or cellular proliferation as well as upregulate the expression of its own repressor, IκB-α, which forms an inhibitory feedback loop and results in oscillating levels of NF-κB activity (O’Neill and Bowie 2007). On the other hand, if ubiquitination does not occur on lysine 377 of the RIP1 kinase, RIP1 associates with the cell death-associated protein FADD (Fas-associated protein with death domain) and caspase 8 in a complex which initiates downstream caspase 8 activation (Imtiyaz et al. 2006). Caspase 8 activation subsequently promotes cell death by triggering the receptor-mediated extrinsic apoptotic pathway (Imtiyaz et al. 2006).

Toll-Like Receptor 3: Structure and Function, Fig. 3 Three possible TRIF-dependent signaling cascades

A third signaling cascade, mediated by the recruitment of TRIF to the TIR domain of TLR3, requires the recruitment of effector molecules that have been shown to include molecular bridge proteins such as TRAF1 (tumor necrosis factor receptor-associated factor 1), TRAF2, TRAF3, and TRAF6 (Fitzgerald et al. 2003). These proteins mediate signal transduction pathways by interacting with downstream protein kinases, acting as components of ubiquitin ligase machinery and other adaptor proteins. In the TLR3 signaling cascade, once recruited to the cell surface receptor domain, these bridging proteins are able to mediate the activation of TBK1 (TANK-binding kinase 1), a kinase discussed at length in ▶ TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation for its critical roles in phosphorylating substrates involved in cell proliferation, vesicle transport, xenophagy, and the antiviral response (Fitzgerald et al. 2003). Currently, two competing theories exist regarding the molecular interactions of TBK1 and its surrounding proteins. Originally, a kinase complex was proposed that consisted of three adaptor proteins, TANK (TRAF-associated NF-κB inhibitor), NAP1 (NF-κB-activating kinase (NAK)-associated protein 1), and SINTBAD (similar to NAP1-TBK1 adaptor), and two kinases, TBK1 and IKK-ε (IκB kinase epsilon) (Fitzgerald et al. 2003; Sharma et al. 2003; McWhirter et al. 2004; Perry et al. 2004). Recent systematic affinity purification-mass spectrometry experiments have shown that mutually exclusive interactions may exist between the adaptors and the kinases, suggesting distinct alternative complexes rather than one large signalosome (Fig. 4) (Goncalves et al. 2011).

Toll-Like Receptor 3: Structure and Function, Fig. 4 Possible configurations of the proposed TBK1 kinase complex

Yet, with either a large signalosome or distinct mutually exclusive complexes, activation of TBK1 leads to the phosphorylation of an important downstream transcription factor, IRF3 (interferon regulatory factor 3) (Fitzgerald et al. 2003; Sharma et al. 2003; McWhirter et al. 2004; Perry et al. 2004). IRF3, a 427 amino acid protein, is an important transcriptional regulator of the antiviral immune response (Fitzgerald et al. 2003; Sharma et al. 2003; McWhirter et al. 2004; Perry et al. 2004). Upon phosphorylation by TBK1 (on residues 385 and 386 for dimerization and on residues 396–405 to alleviate autoinhibition and allow for interaction with co-activators), IRF3 dimerizes and translocates to the nucleus where it regulates the expression of numerous host defense genes including type I interferons (Fitzgerald et al. 2003; Sharma et al. 2003; McWhirter et al. 2004; Perry et al. 2004). These interferons, including IFN-α and IFN-β, are able to stimulate both macrophages and NK cells to elicit an antiviral response or bind to the IFN-α/β receptor (IFNAR1/2) in either an autocrine or paracrine manner to initiate a positive feedback loop that results in the activation of the JAK/STAT pathway and the expression of further antiviral genes (Durbin et al. 2000; Levy and García-Sastre 2001).

These three defined pathways initiate proinflammatory responses that recruit additional immune cells to the site of infection, inhibit bacterial and viral replication, communicate danger signals to the surrounding cells, and induce apoptosis in cells with overwhelming infection. Because of the importance of this response, our work has focused on the activation of type I interferons, the critical effector molecules produced, as discussed above, by the TLR3 signaling cascade. In particular, we have examined the activation and regulation of the key kinase in this pathway, TBK1. As will be detailed in ▶ TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation, TBK1 has emerged as a central catalytic hub and an essential modulator of the host responses mentioned above. As such, understanding the mechanisms by which the function of TBK1 is both activated and regulated provides insight into how this kinase dictates the shape of a multitude of downstream innate-immune responses.
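The ligand-length figures quoted earlier in this entry can be checked with simple arithmetic. The following Python sketch is illustrative only. The helical rise per base pair (roughly 2.6 to 2.8 Å for an A-form duplex) is a textbook approximation that is not stated in this entry, and the function names are hypothetical. The pH regimes and length thresholds are those attributed above to Leonard et al. (2008); the entry gives no threshold for pH values between 5.5 and 6.0 or above 6.5, so the sketch returns None there.

```python
from typing import Optional, Tuple

# Assumed rise per base pair for an A-form duplex, in angstroms.
# Textbook approximation; NOT a value taken from this entry.
A_FORM_RISE_MIN = 2.6
A_FORM_RISE_MAX = 2.8


def duplex_length_range(base_pairs: int) -> Tuple[float, float]:
    """Return (min, max) end-to-end length in angstroms of an A-form duplex."""
    return base_pairs * A_FORM_RISE_MIN, base_pairs * A_FORM_RISE_MAX


def length_threshold_bp(ph: float) -> Optional[int]:
    """dsRNA length (bp) that a ligand must exceed to activate TLR3.

    Encodes only the two pH regimes summarized in the text.
    """
    if ph < 5.5:              # late endosome
        return 48
    if 6.0 <= ph <= 6.5:      # early endosome
        return 90
    return None               # regime not specified in the text


def can_activate(length_bp: int, ph: float) -> Optional[bool]:
    """True/False where the text gives a threshold, otherwise None."""
    threshold = length_threshold_bp(ph)
    if threshold is None:
        return None
    return length_bp > threshold


if __name__ == "__main__":
    low, high = duplex_length_range(45)
    print(f"45 bp duplex: about {low:.0f} to {high:.0f} angstroms")
    print("60 bp at pH 5.0:", can_activate(60, 5.0))
    print("60 bp at pH 6.2:", can_activate(60, 6.2))
```

Under the assumed rise, a 45-base-pair duplex spans roughly 117 to 126 Å, which is of the same order as the 110 to 120 Å separations between ligand-binding sites quoted in this entry.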

Cross-References

▶ TANK-Binding Kinase 1 (TBK1): Structure, Function, and Regulation
▶ Toll-Like Receptors: Evolution and Structure
▶ Toll-Like Receptors: Pathogen Recognition and Signaling

References

Bell JK, Botos I, Hall PR et al (2005) The molecular structure of the Toll-like receptor 3 ligand-binding domain. Proc Natl Acad Sci U S A 102:10976–10980
Botos I, Segal DM, Davies DR (2011) The structural biology of Toll-like receptors. Structure 19:447–459. https://doi.org/10.1016/j.str.2011.02.004
Choe J, Kelker MS, Wilson IA (2005) Crystal structure of human Toll-like receptor 3 (TLR3) ectodomain. Science 309:581–585. https://doi.org/10.1126/science.1115253
Durbin JE, Fernandez-Sesma A, Lee CK et al (2000) Type I IFN modulates innate and specific antiviral immunity. J Immunol 164:4220–4228
Fitzgerald KA, McWhirter SM, Faia KL et al (2003) IKKepsilon and TBK1 are essential components of the IRF3 signaling pathway. Nat Immunol 4:491–496
Funami K, Sasai M, Oshiumi H et al (2008) Homo-oligomerization is essential for Toll/interleukin-1 receptor domain-containing adaptor molecule-1-mediated NF-kappaB and interferon regulatory factor-3 activation. J Biol Chem 283:18283–18291
Garcia-Cattaneo A, Gobert F-X, Muller M et al (2012) Cleavage of Toll-like receptor 3 by cathepsins B and H is essential for signaling. Proc Natl Acad Sci U S A 109:9053–9058
Goncalves A, Bürckstümmer T, Dixit E et al (2011) Functional dissection of the TBK1 molecular network. PLoS One 6:e23971
Imamura M, Tsutsui H, Yasuda K et al (2009) Contribution of TIR domain-containing adapter inducing IFN-β-mediated IL-18 release to LPS-induced liver injury in mice. J Hepatol 51:333–341
Imtiyaz HZ, Rosenberg S, Zhang Y et al (2006) The Fas-associated death domain protein is required in apoptosis and TLR-induced proliferative responses in B cells. J Immunol 176:6852–6861
Keating SE, Baran M, Bowie AG (2011) Cytosolic DNA sensors regulating type I interferon production. Trends Biochem Sci 32:574–581
Kim Y-M, Brinkmann MM, Paquet M-E, Ploegh HL (2008) UNC93B1 delivers nucleotide-sensing Toll-like receptors to endolysosomes. Nature 452:234–238
Kumar H, Kawai T, Akira S (2011) Pathogen recognition by the innate immune system. Int Rev Immunol 30:16–34
Leonard JN, Ghirlando R, Askins J et al (2008) The TLR3 signaling complex forms by cooperative receptor dimerization. Proc Natl Acad Sci U S A 105(1):258–263
Levy DE, García-Sastre A (2001) The virus battles: IFN induction of the antiviral state and mechanisms of viral evasion. Cytokine Growth Factor Rev 12:143–156
Liu L, Botos I, Wang Y et al (2008) Structural basis of Toll-like receptor 3 signaling with double-stranded RNA. Science 320:379–381
McWhirter SM, Fitzgerald KA, Rosains J et al (2004) IFN-regulatory factor 3-dependent gene expression is defective in Tbk1-deficient mouse embryonic fibroblasts. Proc Natl Acad Sci U S A 101:233–238
O’Donnell MA, Legarda-Addison D, Skountzos P et al (2007) Ubiquitination of RIP1 regulates an NF-κB-independent cell-death switch in TNF signaling. Curr Biol 17:418–424
O’Neill LAJ, Bowie AG (2007) The family of five: TIR-domain-containing adaptors in Toll-like receptor signalling. Nat Rev Immunol 7:353–364
Perry AK, Chow EK, Goodnough JB et al (2004) Differential requirement for TANK-binding kinase-1 in type I interferon responses to Toll-like receptor activation and viral infection. J Exp Med 199:1651–1658
Qi R, Singh D, Kao CC (2012) Proteolytic processing regulates Toll-like receptor 3 stability and endosomal localization. J Biol Chem 287:32617–32629
Schröder M, Bowie AG (2005) TLR3 in antiviral immunity: key player or bystander? Trends Immunol 26:462–468
Sharma S, Tenoever BR, Grandvaux N et al (2003) Triggering the interferon antiviral response through an IKK-related pathway. Science 300:1148–1151
Tabeta K, Hoebe K, Janssen EM et al (2006) The Unc93b1 mutation 3d disrupts exogenous antigen presentation and signaling via Toll-like receptors 3, 7 and 9. Nat Immunol 7:156–164
Wang BX, Fish EN (2012) The yin and yang of viruses and interferons. Trends Immunol 33:190–196

Toll-Like Receptors: Evolution and Structure

James Marion
Department of Chemistry and Biochemistry, University of California, San Diego, CA, USA

Synopsis

Toll-like receptors (TLRs) recognize a variety of evolutionarily conserved microbial molecules called pathogen-associated molecular patterns (PAMPs) to initiate an intracellular signaling cascade that activates the innate immune response. First identified in Drosophila, TLRs were found to be conserved in humans where they activate the NF-κB transcription factor and thus its downstream targets. To date, 10 human Toll-like receptors (13 in mice) have been characterized and recognize diverse molecules including lipopeptides, viral dsRNA, lipopolysaccharide (LPS), bacterial flagellin, viral or bacterial ssRNA, and CpG-rich unmethylated DNA. Each TLR is an evolutionarily conserved type-I integral membrane glycoprotein consisting of an N-terminal ligand recognition domain (TLR-ECD), a single transmembrane helix that contains approximately 20 uncharged, mostly hydrophobic residues, and a C-terminal cytoplasmic signaling domain, known as the TIR domain. Named for its homology with the signaling domains of IL-1R family members, TIR domains are also found in many adaptor proteins that interact with the TIR domain of TLRs. Each TLR extracellular domain is constructed of 19–25 tandem copies of a motif known as the LRR (leucine-rich repeat) motif. Typically 22–29 residues in length and containing hydrophobic residues spaced at distinct intervals, LRRs adopt a loop structure that begins with three residues in a β-strand configuration. When assembled consecutively in the TLR-ECDs, the LRR motifs form a solenoid structure with a curved configuration involved in ligand recognition and binding. As specific ligands of each receptor were identified, focus shifted to the mechanistic understanding of the resulting activated signaling pathways. The identification of players in the activated TLR signaling cascades and the elucidation of products from each of these pathways culminated with the awarding of the Nobel Prize in Physiology or Medicine to Bruce Beutler, Jules Hoffmann, and Ralph Steinman for revolutionizing our understanding of the immune system by discovering key principles for its activation.

Introduction

A crucial part of the innate immune response, Toll-like receptors (TLRs) constitute one family of the signaling class of pattern recognition receptors (PRRs). The first Toll family member, Drosophila Toll, was discovered in 1985 by the Nüsslein-Volhard group (Anderson et al. 1985). Toll was identified as a maternal effect gene; mutations in it caused a fruit fly phenotype that surprised the German researchers to the point of exclaiming, “Das ist ja toll,” which translates to “That’s great,” and subsequently gave the Toll-like receptor family its name (Anderson et al. 1985; Janeway et al. 2005). Further research has since shown that, in the fly, the Toll receptor recognizes a cleaved protein product known as Spätzle (Schneider et al. 1994). In the fly embryo, this recognition induces formation of the dorsoventral axis, while in the adult fly, recognition triggers an intracellular cascade that leads to the activation of a transcription factor termed Dorsal, a fly homologue of the transcription factor NF-κB (Schneider et al. 1994). In 1996, Jules A. Hoffmann and coworkers made the groundbreaking discovery that Toll was critical to a fly’s antifungal innate immune response (Lemaitre et al. 1996). They showed that mutations to the Toll receptor dramatically reduced survival of the host after fungal infection (Lemaitre et al. 1996).

It was not until 1997 that the first human Toll receptor was discovered by Charles Janeway (Medzhitov et al. 1997), the scientist who, a decade earlier, had hypothesized that a set of factors, which he termed PRRs, were capable of detecting pathogen-associated molecular patterns (PAMPs) in the immune response (Janeway 1989). By transfecting a constitutively active construct of human Toll into human cell lines, Janeway’s group was able to show induced activation of NF-κB and expression of the NF-κB-controlled genes for the inflammatory cytokines IL-1, IL-6, and IL-8 (Medzhitov et al. 1997). Further studies with the constitutively active human Toll showed an induced expression of the co-stimulatory molecule B7.1, required for activation of naïve T cells, and suggested an essential role for this receptor in the human immune response (Medzhitov et al. 1997).
A year later, building on Janeway’s results, Bruce Beutler, investigating LPS, which he called the most powerful microbial stimulant of the innate immune response, discovered TLR4 to be the sole LPS receptor and the gateway to the endotoxin response (Poltorak et al. 1998). Through missense mutations in the TLR4 gene of C3H/HeJ LPS response locus mice, Beutler was able to show that the mammalian TLR4 protein had been adapted to primarily recognize LPS and, as he postulated, transduce the LPS signal across the plasma membrane (Poltorak et al. 1998). Knockout mutations of the TLR4 gene further corroborated Beutler’s work as these mice were resistant to the development of Gram-negative sepsis but susceptible to infection (Hoshino et al. 1999). Subsequent studies by several groups (Akira and Takeda 2004) found additional Toll-like receptors that could similarly activate a transcriptional-based response.

Toll-Like Receptors: Evolution and Structure, Fig. 1 The 10 human Toll-like receptors and the pathogens they recognize

To date, 10 human Toll-like receptors (13 Toll-like receptors in mice) have been characterized (Fig. 1), each recognizing a variety of evolutionarily conserved PAMPs including lipopeptides (TLR2 associated with TLR1 or TLR6) (Yang et al. 1998; Ozinsky et al. 2000), viral dsRNA (TLR3) (Alexopoulou et al. 2001), LPS (TLR4) (Poltorak et al. 1998), bacterial flagellin (TLR5) (Hayashi et al. 2001), viral or bacterial ssRNA (TLR7 and TLR8) (Chuang and Ulevitch 2000; Hemmi et al. 2002), and CpG-rich unmethylated DNA (TLR9) (Bauer and Wagner 2002), among others (Kumar et al. 2009; Table 1). While these receptors seem to diverge in the ligands they recognize, clustering studies have found that human TLR1, TLR2,

TLR6, and TLR10 converge into a TLR subgroup (TLR1/2/6/10) based on sequence homology, as do TLR7, TLR8, and TLR9 (TLR7/TLR8/TLR9) (Roach et al. 2005; Matsushima et al. 2007). Further, structural studies have determined that a high degree of homology also exists in the framework of each of these receptors. Each TLR is now known to be an evolutionarily conserved type-I integral membrane glycoprotein that consists of an N-terminal ligand recognition domain (TLR-ECD), a single transmembrane helix that contains approximately 20 uncharged, mostly hydrophobic residues, and a C-terminal cytoplasmic signaling domain, known as the TIR domain (Botos et al. 2011). So named for its homology with the signaling domains of IL-1R family members, TIR domains are also found in many adaptor proteins that interact with the TIR domain of TLRs and in plant proteins that confer resistance to pathogens (Burch-Smith et al. 2007). This has suggested to numerous researchers that the TIR domain actually represents an evolutionarily conserved motif that may have had an immune function prior to the divergence of plants and animals.

Toll-Like Receptors: Evolution and Structure, Table 1 TLRs and their specific ligands

| TLRs | Localization | Exogenous ligands | Source of exogenous ligands | Endogenous ligands |
|---|---|---|---|---|
| TLR1 | Plasma membrane | Lipopeptides | Bacteria and Mycobacteria | |
| TLR2 | Plasma membrane | Lipoprotein/lipopeptides | Gram-positive bacteria, Mycoplasma, Mycobacteria, spirochetes | HSP60, HSP70, HSP96, HMGB1, hyaluronic acid |
| | | Peptidoglycan | Gram-positive bacteria | |
| | | Lipoteichoic acid | Gram-positive bacteria | |
| | | Glycolipids | Treponema maltophilum | |
| | | Outer membrane protein A | Klebsiella pneumoniae | |
| | | Glycoinositolphospholipids | Trypanosoma cruzi | |
| | | Phospholipomannan | Candida albicans | |
| | | Hemagglutinin | Measles virus | |
| | | Lipoarabinomannan | Mycobacteria | |
| | | Zymosan | Saccharomyces | |
| TLR3 | Endolysosome | Double-stranded RNA (dsRNA) | RNA viruses | mRNA |
| TLR4 | Plasma membrane | Lipopolysaccharide | Gram-negative bacteria | HSP22, HSP60, HSP70, HSP96, HMGB1, heparan sulfate, fibrinogen |
| | | HSP60 | Chlamydia pneumoniae | |
| | | Glycoinositolphospholipids | Trypanosoma cruzi | |
| TLR5 | Plasma membrane | Flagellin | Gram-positive or Gram-negative bacteria | |
| TLR6 | Plasma membrane | Diacyl lipopeptides | Mycoplasma | |
| | | Lipoteichoic acid | Gram-positive bacteria | |
| | | Zymosan | Saccharomyces | |
| TLR7 | Endolysosome | Single-stranded RNA (ssRNA) | RNA viruses | Endogenous RNA |
| TLR8 | Endolysosome | Single-stranded RNA (ssRNA) | RNA viruses | Endogenous RNA |
| TLR9 | Endolysosome | Unmethylated CpG motifs | Bacteria and viruses | Endogenous DNA |
| TLR10 | Extracellular | Unknown, may interact with TLR1 and TLR2 | | |


Toll-Like Receptors: Evolution and Structure, Fig. 2 hTLR3-ECD consisting of 23 leucine-rich repeats. hTLR3 contains 23 leucine-rich repeats, each of which follows the typical 24-residue repeat pattern of the LRR motif (Takahashi et al. 1985; Buchanan and Gay 1996), $xL_{2}xxL_{5}xL_{7}xxN_{10}xL_{12}xxL_{15}xxxxF_{20}xxL_{23}x$, where L represents conserved hydrophobic residues, N represents a conserved asparagine, F represents a conserved phenylalanine, x represents any residue, and the subscripts give the position within the repeat. Also depicted is the glycan-free side (ascending side) of the TLR3-ECD, involved in ligand binding in all TLRs (Buchanan and Gay 1996)
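The 24-residue consensus in the caption above can be turned into a simple sequence scan. The following Python sketch is a minimal illustration under stated assumptions, not a validated LRR finder. The set of residues accepted at the "L" positions (L, I, V, M, F) is an assumption, since the caption says only "conserved hydrophobic residues"; the function names are hypothetical; and the test peptide is synthetic, not taken from TLR3. Because real LRRs range from 22 to 29 residues, a fixed-length pattern of this kind would miss repeats that carry insertions or deletions.

```python
import re
from typing import List

# 24-residue consensus from the figure caption, with position labels removed.
# x = any residue, L = hydrophobic, N = asparagine, F = phenylalanine.
CONSENSUS = "xLxxLxLxxNxLxxLxxxxFxxLx"

# Assumption: residues accepted at the "L" (hydrophobic) positions.
HYDROPHOBIC = "LIVMF"


def consensus_to_regex(consensus: str, hydrophobic: str = HYDROPHOBIC) -> re.Pattern:
    """Translate the consensus string into a compiled regular expression."""
    parts = []
    for symbol in consensus:
        if symbol == "x":
            parts.append(".")                   # any residue
        elif symbol == "L":
            parts.append(f"[{hydrophobic}]")    # any accepted hydrophobic residue
        else:
            parts.append(symbol)                # literal N or F
    return re.compile("".join(parts))


def find_lrr_candidates(sequence: str, pattern: re.Pattern) -> List[int]:
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [match.start() for match in lookahead.finditer(sequence)]


if __name__ == "__main__":
    pattern = consensus_to_regex(CONSENSUS)
    # Synthetic 24-mer built to satisfy the consensus (hypothetical peptide).
    synthetic_repeat = "ALSGLTLSHNKISKLPSGAFSGLK"
    print(find_lrr_candidates("GG" + synthetic_repeat + "GG", pattern))
```

A lookahead is used so that overlapping candidate windows are all reported; in practice, profile-based methods are generally preferred over a rigid pattern for detecting irregular repeats.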

Inspection of the extracellular domain of each crystallized TLR shows that all are built from 19–25 tandem copies of the leucine-rich repeat (LRR) motif (Fig. 2; Bell et al. 2003; Matsushima et al. 2007). A single LRR motif is typically 22–29 residues in length and contains hydrophobic residues spaced at distinct intervals (Takahashi et al. 1985). In three dimensions, all LRRs adopt a loop structure that begins with three residues in a β-strand configuration (Buchanan and Gay 1996). When assembled consecutively in the TLR-ECDs, the LRR motifs form a solenoid in which all consensus hydrophobic residues point to the interior, forming a stable core, and all β-strands align to form a hydrogen-bonded parallel β-sheet (Takahashi et al. 1985; Buchanan and Gay 1996). Because the β-strands in the LRR motif are packed more closely than the rest of the structure, the solenoid is forced into a curved configuration in which the β-sheet forms the concave surface (Takahashi et al. 1985; Buchanan and Gay 1996). The result is a structure with a concave surface, a convex surface, an ascending lateral surface, and a descending lateral surface on the opposite side (Takahashi et al. 1985; Buchanan and Gay 1996).

To date, the structures of the ECDs of TLR1 (Jin et al. 2007), TLR2 (Jin et al. 2007), TLR3 (Bell et al. 2005), TLR4 (Park et al. 2009), TLR5 (Yoon et al. 2012), TLR6 (Kang et al. 2009), and TLR8 (Tanji et al. 2013) have been reported. Common to all of these structures is that the solenoid is the basis for ligand recognition: ligand binding occurs on the ascending lateral face, the only portion of the molecule that completely lacks N-linked glycan and is free to interact with ligand (Fig. 2). Interestingly, although all of the TLR-ECDs assume a typical horseshoe shape, attempts to predict these structures in modeling studies have been unsuccessful because of variations in the curvature of the structure.

As specific ligands of each receptor were identified, focus shifted to a mechanistic understanding of the signaling pathways they activate. As discussed at length in ▶ “Toll-like Receptors: Pathogen Recognition and Signaling,” the identification of the players in the activated TLR signaling cascades and the elucidation of the products of each pathway led the scientific community to reevaluate the importance of these receptors in immunity. This recognition culminated in 2011 with the award of the Nobel Prize in Physiology or Medicine to Bruce Beutler, Jules Hoffmann, and Ralph Steinman for revolutionizing our understanding of the immune system by discovering key principles for its activation.

Cross-References
▶ Toll-like Receptors: Pathogen Recognition and Signaling

References
Akira S, Takeda K (2004) Toll-like receptor signalling. Nat Rev Immunol 4:499–511
Alexopoulou L, Holt AC, Medzhitov R, Flavell RA (2001) Recognition of double-stranded RNA and activation of NF-kappaB by Toll-like receptor 3. Nature 413:732–738. https://doi.org/10.1038/35099560
Anderson KV, Bokla L, Nusslein-Volhard C (1985) Establishment of dorsal-ventral polarity in the Drosophila embryo: induction of polarity by the Toll gene product. Cell 42:791–798
Bauer S, Wagner H (2002) Bacterial CpG-DNA licenses TLR9. Curr Top Microbiol Immunol 270:145–154
Bell JK, Mullen GED, Leifer CA et al (2003) Leucine-rich repeats and pathogen recognition in Toll-like receptors. Trends Immunol 24:528–533
Bell JK, Botos I, Hall PR et al (2005) The molecular structure of the Toll-like receptor 3 ligand-binding domain. Proc Natl Acad Sci U S A 102:10976–10980
Botos I, Segal DM, Davies DR (2011) The structural biology of Toll-like receptors. Structure 19:447–459. https://doi.org/10.1016/j.str.2011.02.004
Buchanan SGSC, Gay NJ (1996) Structural and functional diversity in the leucine-rich repeat family of proteins. Prog Biophys Mol Biol 65:1–44
Burch-Smith TM, Schiff M, Caplan JL et al (2007) A novel role for the TIR domain in association with pathogen-derived elicitors. PLoS Biol 5:0501–0514
Chuang TH, Ulevitch RJ (2000) Cloning and characterization of a sub-family of human Toll-like receptors: hTLR7, hTLR8 and hTLR9. Eur Cytokine Netw 11:372–378
Hayashi F, Smith KD, Ozinsky A et al (2001) The innate immune response to bacterial flagellin is mediated by Toll-like receptor 5. Nature 410:1099–1103. https://doi.org/10.1038/35074106
Hemmi H, Kaisho T, Takeuchi O et al (2002) Small antiviral compounds activate immune cells via the TLR7 MyD88-dependent signaling pathway. Nat Immunol 3:196–200
Hoshino K, Takeuchi O, Kawai T, Sanjo H (1999) Cutting edge: Toll-like receptor 4 (TLR4)-deficient mice are hyporesponsive to lipopolysaccharide: evidence for TLR4 as the LPS gene product. J Immunol 162:3749–3752
Janeway CA (1989) Approaching the asymptote? Evolution and revolution in immunology. Cold Spring Harb Symp Quant Biol 54:1–13
Janeway CJ, Travers P, Walport M, Shlomchik MJ (2005) Immunobiology: the immune system in health and disease, 6th edn. Garland Science, New York
Jin MS, Kim SE, Heo JY et al (2007) Crystal structure of the TLR1-TLR2 heterodimer induced by binding of a tri-acylated lipopeptide. Cell 130:1071–1082
Kang JY, Nan X, Jin MS et al (2009) Recognition of lipopeptide patterns by Toll-like receptor 2-Toll-like receptor 6 heterodimer. Immunity 31:873–884
Kumar H, Kawai T, Akira S (2009) Pathogen recognition in the innate immune response. Biochem J 420:1–16
Lemaitre B, Nicolas E, Michaut L et al (1996) The dorsoventral regulatory gene cassette spätzle/Toll/cactus controls the potent antifungal response in Drosophila adults. Cell 86:973–983
Matsushima N, Tanaka T, Enkhbayar P et al (2007) Comparative sequence analysis of leucine-rich repeats (LRRs) within vertebrate Toll-like receptors. BMC Genomics 8:124
Medzhitov R, Preston-Hurlburt P, Janeway CA (1997) A human homologue of the Drosophila Toll protein signals activation of adaptive immunity. Nature 388:394–397
Ozinsky A, Underhill DM, Fontenot JD et al (2000) The repertoire for pattern recognition of pathogens by the innate immune system is defined by cooperation between Toll-like receptors. Proc Natl Acad Sci U S A 97:13766–13771
Park BS, Song DH, Kim HM et al (2009) The structural basis of lipopolysaccharide recognition by the TLR4–MD-2 complex. Nature 458:1191–1195
Poltorak A, He X, Smirnova I et al (1998) Defective LPS signaling in C3H/HeJ and C57BL/10ScCr mice: mutations in Tlr4 gene. Science 282:2085–2088
Roach JC, Glusman G, Rowen L et al (2005) The evolution of vertebrate Toll-like receptors. Proc Natl Acad Sci U S A 102:9577–9582
Schneider DS, Jin Y, Morisato D, Anderson KV (1994) A processed form of the Spatzle protein defines dorsal-ventral polarity in the Drosophila embryo. Dev Biol 120:1243–1250
Takahashi N, Takahashi Y, Putnam FW (1985) Periodicity of leucine and tandem repetition of a 24-amino acid segment in the primary structure of leucine-rich alpha 2-glycoprotein of human serum. Proc Natl Acad Sci U S A 82:1906–1910
Tanji H, Ohto U, Shibata T et al (2013) Structural reorganization of the Toll-like receptor 8 dimer induced by agonistic ligands. Science 339:1426–1429
Yang RB, Mark MR, Gray A, Huang A (1998) TLR2 mediates LPS-induced cellular signaling. Nature 395:284–288
Yoon S-I, Kurnasov O, Natarajan V et al (2012) Structural basis of TLR5-flagellin recognition and signaling. Science 335:859–864

Toll-Like Receptors: Pathogen Recognition and Signaling
James Marion
Department of Chemistry and Biochemistry, University of California, San Diego, CA, USA

Synopsis
A crucial step in understanding the function of Toll-like receptors (TLRs) and their role in innate immunity was to identify the specific pathogen-associated molecular patterns (PAMPs) recognized by each TLR. Nicholas Gay and colleagues, working in Drosophila, were able to show that the Toll-receptor-activating ligand, Spätzle, formed a heterotrimeric complex containing two molecules of Toll-receptor extracellular domain (ECD) for every one molecule of ligand. From these results, the group proposed that Toll-receptor activation required ligand-induced dimerization events leading to secondary receptor–receptor interactions, effectively stabilizing and producing an active signaling complex. Subsequent structural studies have found that PAMP binding occurs when two TLR extracellular domains form an “M-shaped” dimer and effectively trap the ligand molecule. This “sandwiching” effect brings the TLR transmembrane and cytoplasmic domains into close proximity and triggers a downstream signaling cascade. However, while ligand-induced dimerization events of all TLRs have common features, the recognition events leading to TLR-ECD dimerization are markedly different between TLR paralogs, as individual TLRs use unique subsets of leucine-rich repeat (LRR) motifs for ligand recognition. This explains how a limited number of TLRs are able to recognize a diverse array of ligands. It further suggests that TLR ligand recognition and activation is a specific, concerted, and sequential process that is critical for transmission of signal to the cytoplasm. Upon ligand-induced dimerization of the TLR-ECDs, TLRs signal a response to pathogen by dimerization of their C-terminal cytoplasmic TIR signaling domains. This event creates a new platform on which signaling complexes can be built as TIR domain-containing adapter proteins (MyD88, MAL, TRIF, and TRAM) recognize the dimerization event. The subsequent activation of a specific pathway depends on which receptors are involved in the recognition process and which TIR domain-containing adapter molecules are recruited to the TIR domain platform.

Introduction

Classifying specific Toll-like receptors (TLRs) by the pathogens they recognize was a critical first step in understanding how this type of pattern recognition receptor (PRR) aids the innate immune response. In addition, detailing the mechanism by which TLRs recognize specific pathogens, and understanding how subsequent signal transduction events translate pathogen recognition into a host response, were essential to understanding the important role TLRs have in protecting a host from invading microorganisms. This is because each TLR and TLR–pathogenic ligand complex elicits distinct downstream gene expression patterns that lead to innate immune responses tailored toward specific pathogens and to the development of antigen-specific adaptive immune responses.

It had long been established that signal transduction by type I transmembrane receptors requires dimerization of the N-terminal extracellular domain (ECD) (Lemmon 1994). The full-length Drosophila Toll ECD was originally observed to form unstable dimers in solution, which were thought to exist as preassembled, low-affinity complexes before ligand binding (Schneider et al. 1991). However, it was not until 2003, when Nicholas Gay’s group investigated the mechanism by which a protein, Spätzle, activated the Toll receptor in adult Drosophila flies, that researchers were able to propose a mechanism by which microbes induce TLR activation. Using analytical ultracentrifugation in combination with chemical cross-linking and isothermal titration calorimetry, Gay’s group showed that the Toll-receptor-activating ligand, Spätzle, formed a heterotrimeric complex containing two molecules of Toll-receptor ECD for every one molecule of ligand (Weber et al. 2003). From these results, the group proposed that Toll-receptor activation requires ligand-induced dimerization, which causes secondary receptor–receptor interactions that stabilize and produce an active signaling complex (Weber et al. 2003).
Subsequent structural studies have found that, in most TLR subfamilies, pathogen-associated molecular pattern (PAMP) binding occurs when two TLR extracellular domains form an “M-shaped” dimer and effectively trap the ligand molecule (Fig. 1) (Botos et al. 2011). This “sandwiching” effect brings the TLR transmembrane and cytoplasmic domains into close proximity and triggers a downstream signaling cascade. However, while the ligand-induced dimerization events of all TLRs have many common features, the recognition events leading to TLR-ECD dimerization are markedly different between TLR paralogs, as individual TLRs use unique subsets of LRRs for ligand recognition. This suggests a means by which a limited number of TLRs are able to recognize a diverse array of ligands.

The TLR1/2/6/10 subfamily is the best understood, because crystal structures have been determined for three of its four members (TLR10 is the only member without a known structure). It differentiates between PAMPs based on specific dimerization events. Structural studies have shown that TLR2 is able to dimerize with either TLR1 or TLR6 (Fig. 1a, b, respectively) (Jin et al. 2007; Kang et al. 2009) and that the specific heterodimeric complex that is formed determines whether these TLRs bind PAMPs containing di-acylated cysteine-rich lipopeptides (TLR2/6) or tri-acylated cysteine-rich lipopeptides (TLR1/2). Crystallographic analysis has revealed the reason for this difference: each ECD of the subfamily comprises three distinct subdomains, N-terminal, central, and C-terminal (Jin et al. 2007; Kang et al. 2009). The border between the central subdomain and C-terminal subdomain (LRRs 9–12) contains a group of hydrophobic residues that form a ligand-binding pocket on the convex side of the ECD (Jin et al. 2007; Kang et al. 2009). The size of this pocket depends on which TLR complex is formed. Complexes between TLR1 and TLR2 form a pocket that is large enough to accommodate tri-acylated cysteine-containing lipoproteins (Jin et al. 2007), while complexes between TLR2 and TLR6 form a much smaller pocket, constrained by phenylalanine residues, that effectively binds only di-acylated lipopeptides (Kang et al. 2009).
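The pairing rule described above (the heterodimer that forms determines the lipopeptide class that fits the pocket) can be written down as a small lookup. This is only a sketch of the relationship stated in the text for the two crystallized complexes; the function and variable names are hypothetical, and nothing here models pocket geometry.

```python
from typing import Optional

# Ligand class bound by each crystallized TLR2-containing heterodimer,
# as stated in the text (Jin et al. 2007; Kang et al. 2009).
HETERODIMER_LIGANDS = {
    frozenset({"TLR1", "TLR2"}): "tri-acylated lipopeptide",
    frozenset({"TLR2", "TLR6"}): "di-acylated lipopeptide",
}

# Number of acyl chains -> ligand class name used in the table above.
ACYL_CLASSES = {2: "di-acylated lipopeptide", 3: "tri-acylated lipopeptide"}


def ligand_for_dimer(first: str, second: str) -> Optional[str]:
    """Return the lipopeptide class for a receptor pair, or None if the
    pair is not one of the two heterodimers covered here."""
    return HETERODIMER_LIGANDS.get(frozenset({first, second}))


def dimer_for_acyl_chains(acyl_chains: int) -> Optional[frozenset]:
    """Return the heterodimer expected to bind a lipopeptide with the
    given number of acyl chains, or None if no pairing is listed."""
    wanted = ACYL_CLASSES.get(acyl_chains)
    for pair, ligand in HETERODIMER_LIGANDS.items():
        if ligand == wanted:
            return pair
    return None


if __name__ == "__main__":
    print(ligand_for_dimer("TLR2", "TLR1"))
    print(sorted(dimer_for_acyl_chains(2)))
```

Using an unordered `frozenset` as the key reflects that the order in which the two partners are named does not matter.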
While the fourth member of the family, TLR10, has not been structurally characterized, homology models refined through molecular dynamics simulations suggest that TLR2/TLR10 heterodimeric complexes form a ligand-binding pocket similar to that of the TLR1/TLR2 complexes, effectively recognizing tri-acylated cysteine-containing lipopeptides, while TLR1/TLR10 heterodimeric complexes and TLR10 homodimeric complexes form ligand-binding pockets similar to those of the TLR2/TLR6 complex, effectively recognizing di-acylated cysteine-containing lipopeptides (Govindaraj et al. 2010).

Toll-Like Receptors: Pathogen Recognition and Signaling, Fig. 1 Representative examples of the ligand-induced M-shaped dimer complex formed by the TLR-ECDs. (a) TLR1 (green)–TLR2 (blue) heterodimer induced by the binding of tri-acylated lipopeptide (purple) (PDB ID, 2Z7X) (Jin et al. 2007). (b) TLR2 (blue)–TLR6 (red) heterodimer induced by the binding of di-acylated lipopeptide mimetic (black) (PDB ID, 3A79) (Kang et al. 2009). (c) TLR4 (raspberry)–MD-2 (yellow) complex induced by the binding of LPS (PDB ID, 3FXI) (Kim et al. 2007)

In the TLR3 subfamily, ligand recognition and binding depend on two interactions between dsRNA and the TLR3-ECD, one near the N-terminus, encompassing LRR-NT and LRRs 1–3, and one near the C-terminus, involving LRRs 19–23 (Bell et al. 2005). These interactions correctly position four ligand binding sites in a TLR3 homodimeric complex with dsRNA (Bell et al. 2005). As TLR3 is the focus of much of the work presented in this entry, these associations and subsequent signaling events will be discussed in more detail in ▶ “Toll-Like Receptor 3: Structure and Function.”

For the TLR4 subfamily, advances on Beutler’s work (Poltorak et al. 1998), discussed earlier, have led to a crystallographic analysis of the receptor structure that shows similarities to the previously elucidated TLR2 heterodimeric structures (Kim et al. 2007). The TLR4-ECD is composed of three subdomains located in regions similar to those of the TLR2-ECD (LRR-NT to LRR5 and LRRs 8–10): N-terminal, central, and C-terminal subdomains (Kim et al. 2007). However, the β-sheet in the central subdomain of the receptor has unusually small radii and large twist angles that prevent the formation of the ligand-binding pocket seen in TLR2 heterodimeric complexes (Kim et al. 2007). Instead, a coreceptor, MD-2, binds to TLR4 on the concave surface of the N-terminal and central subdomains (Kim et al. 2007). LPS is then extracted from the bacterial membrane and transferred to TLR4–MD-2 heterodimers by two accessory proteins, LPS-binding protein and CD14 (Kim et al. 2007). LPS interacts with a large hydrophobic pocket in MD-2 and induces the formation of a receptor multimer, composed of two copies of the TLR4–MD-2–LPS complex, by forming hydrophobic interactions with conserved residues on TLR4 (Kim et al. 2007). Although it has been determined that PAMPs are not directly recognized by TLR4, LPS binding to MD-2 and subsequent interaction with TLR4 induce the formation of the classical “M-shaped” dimeric receptor complex (Fig. 1c), which is then able to transduce a signal through the membrane to the cytosol.

TLR5 is the only protein-binding (flagellin-binding) TLR that is conserved in vertebrates from fish to mammals. However, despite this long evolutionary history, the human TLR5 structure has not been determined because of technical challenges in expressing the protein in a functionally active, soluble form. Recent crystallographic studies, however, have elucidated a zebrafish TLR5 structure using a baculovirus system and C-terminal deletion variants that, along with computer-based prediction algorithms, have led to an understanding of how TLR5 interacts with bacterial flagellin (Yoon et al. 2012). This PAMP is composed of three specific domains, D1/D2/D3, which are responsible for the ability of flagellin to polymerize into a filament in the bacterial flagellum and so provide motility to the pathogen (Andersen-Nissen et al. 2005). Crystallographic analysis of the zebrafish structure shows that TLR5 interacts primarily on its lateral side (LRR-CT and LRRs 12–13) with three helices that make up the D1 domain of flagellin (Yoon et al. 2012). This interaction forms an extensive primary binding interface that defines a 1:1 heterodimer (Yoon et al. 2012). Upon formation of this complex, further ligand-induced oligomerization leads to a symmetric 2:2 complex in which TLR5 from the first heterodimer makes additional, weaker interactions with flagellin and TLR5 from the second heterodimer (Yoon et al. 2012).
Ligand-induced assembly of two TLR5 receptors juxtaposes the C-terminal tail regions of the TLR5-ECD for signaling in a manner similar to all other agonist-bound TLRs (Yoon et al. 2012).

Like TLR3, members of the TLR7/8/9 subfamily are all located in the endosome. However, distinct amino acid sequence differences and the recently elucidated TLR8-ECD structure (Tanji et al. 2013) indicate that the ligand binding mechanisms of TLR7/8/9 differ from that of their endosomal dsRNA receptor counterpart. The TLR8-ECD structure has revealed that, as in all other TLRs, ligand recognition is mediated by a dimerization interface formed by two protomers (Tanji et al. 2013). However, three-dimensional structures based on homology modeling of the LRR motifs in the TLR7/8/9 subfamily (Wei et al. 2009), together with the TLR8-ECD structure (Tanji et al. 2013), have shown that all members contain large insertions in LRR2, LRR5, and LRR8, which give rise to distinct structures that loop out from the dimerization surface of the ECDs. These extensions were postulated, and for TLR8 have been confirmed, to provide additional support to the receptors upon ligand binding (Wei et al. 2009; Tanji et al. 2013). These studies also report that all members of this subfamily contain a 40-amino acid stretch of residues between LRR14 and LRR15 that is highly variable among the paralogs and is implicated in signal transduction events, which will be discussed in the next paragraph (Wei et al. 2009; Tanji et al. 2013). Finally, studies have also suggested that residues in the insertion face of the ECD (LRR17) of TLR8 (Asp543) and TLR9 (Asp535 and Tyr537) are essential for ligand binding (Wei et al. 2009). While recent structural studies of TLR8 have confirmed the importance of Asp543 in ligand binding, a more detailed structural analysis of the entire subfamily is still required for a complete understanding of the ligand recognition mechanism used by these compartmentalized TLRs.
Interestingly, while only the TLR8-ECD structure has been solved in the TLR7/8/9 subfamily, investigations have revealed that specific proteolytic cleavage events are required for effective downstream signaling in this subfamily (Tanji et al. 2013). Studies report that lysosomal proteolysis is required for TLR7/8/9 signaling and that this processing occurs in the undefined regions between LRR14 and LRR15 of the ECDs (Tanji et al. 2013). Proteolytic processing events have also been implicated in TLR3 signaling (Garcia-Cattaneo et al. 2012; Qi et al. 2012), but these will be discussed in more detail in ▶ “Toll-Like Receptor 3: Structure and Function.” Experimental data show that TLR7, TLR8, and TLR9 are escorted from the endoplasmic reticulum by a chaperone molecule, Unc93b1, to the endolysosome, where they are cleaved in a multistep process that has created much controversy in the scientific community (Tabeta et al. 2006; Kim et al. 2008; Tanji et al. 2013). Researchers agree that primary cleavage events, performed by asparagine endopeptidase and other undefined cathepsin family members (Sepulveda et al. 2009), remove the majority of the ECD from the TLRs. However, researchers diverge on what follows this initial proteolytic processing of TLR7/8/9, as the experimental data lead to two different conclusions. Studies of the cleaved fragments of TLR7/8/9 have suggested that subsequent proteolytic trimming processes the C-terminal fragments of the TLR-ECDs so that only the processed forms are capable of recruiting adapter molecules, activating a signaling cascade, and inducing an immune response (Ewald et al. 2008). However, other studies mutated various residues in the N-terminal fragment of the cleaved TLR-ECDs and showed that these TLRs were inactive, suggesting that the N-terminal fragments are just as important as the C-terminal fragments for signal transduction (Ewald et al. 2011). To date, no conclusions have been reached regarding the nature of these cleaved fragments, but researchers agree that the initial proteolytic processing event is required for effective TLR7/8/9 signaling and the induction of an immune response.

Taken together, the events above suggest that TLR ligand recognition and activation is a specific, concerted, and sequential process that is critical for transmission of signal to the cytoplasm.
Upon ligand-induced dimerization of the TLR-ECDs, TLRs signal a response to pathogen by dimerization of their cytoplasmic TIR domains. This event creates a new platform on which signaling complexes can be built as TIR domain-containing adapter proteins (MyD88, MAL, TRIF, and TRAM) recognize the dimerization event (Kumar et al. 2009). Subsequent activation of a MyD88 (myeloid differentiation primary-response protein 88)-dependent or TRIF-dependent pathway depends on which receptors are involved in the recognition process and which TIR domain-containing adapter molecules are recruited to the TIR domain platform.

The first pathway, the MyD88-dependent pathway, is named for the MyD88 protein, which functions as an adapter molecule linking the TIR domains of every TLR (except for TLR3 and, at times, TLR4) with downstream signaling molecules so as to induce the activation of the transcription factor NF-κB (Motshwene et al. 2009). MyD88 is either recruited directly to the TIR domains of TLR5, TLR7, TLR8, and TLR9 or recruited through an adapter, the MyD88-adapter-like (MAL) protein, and functions to recruit IL-1R-associated kinase 4 (IRAK4), a kinase that is essential in signaling for NF-κB (Motshwene et al. 2009). This recruitment then leads to a pathway that involves the activation of IRAK1, TRAF6 (tumor necrosis factor receptor-associated factor 6), and TAK1 (transforming growth factor β-activated kinase 1) through their modification by the ubiquitylation factors UEV1A (ubiquitin-conjugating enzyme E2 variant 1) and UBC13 (ubiquitin-conjugating enzyme 13) (Adachi et al. 1998). Subsequent activation of the IκB kinase complex (IKKα/β) and phosphorylation of an inhibitor protein, IκBα (inhibitor of kappa light chain gene enhancer in B-cells alpha), lead to the activation of the transcription factor NF-κB (Fitzgerald et al. 2001). Its release from IκBα allows NF-κB to translocate to the nucleus, where it is able to induce the expression of inflammatory cytokines and various antiviral and antimicrobial proteins and to initiate the adaptive immune response (Fig. 2) (Li et al. 2005).
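The order of events in the MyD88-dependent pathway can be summarized as a short ordered list. The sketch below simply encodes the sequence given in the paragraph above; the function name is hypothetical, and routing every MyD88-using TLR outside the TLR5/7/8/9 group through MAL is a simplifying assumption made for the example (the text notes that TLR4 bypasses MyD88 at times).

```python
# TLRs stated in the text to recruit MyD88 directly to their TIR domains.
DIRECT_MYD88 = {"TLR5", "TLR7", "TLR8", "TLR9"}

# TLR3 does not use MyD88; it signals through TRIF instead.
NO_MYD88 = {"TLR3"}

# Events downstream of MyD88, in the order given in the text.
MYD88_CASCADE = [
    "IRAK4",
    "IRAK1",
    "TRAF6",
    "TAK1",
    "IKK complex",
    "IκBα phosphorylation",
    "NF-κB nuclear translocation",
]


def myd88_route(tlr: str) -> list:
    """Return the ordered MyD88-dependent route for a TLR, or an empty
    list if the receptor does not use MyD88.

    Assumption: any TLR not recruiting MyD88 directly uses the MAL
    bridging adapter.
    """
    if tlr in NO_MYD88:
        return []
    entry = ["MyD88"] if tlr in DIRECT_MYD88 else ["MAL", "MyD88"]
    return [tlr] + entry + MYD88_CASCADE


if __name__ == "__main__":
    print(" -> ".join(myd88_route("TLR9")))
    print(" -> ".join(myd88_route("TLR2")))
```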
Toll-Like Receptors: Pathogen Recognition and Signaling, Fig. 2 Representation of the NF-κB response following ligand-induced pathway activation. [Diagram labels: signal and receptor at the extracellular matrix; IKKγ (NEMO), IKKα, and IKKβ; ubiquitylated (Ub) IκBα and its proteasomal degradation; p65/RelA NF-κB in the cytoplasm and in the nucleus; outcomes: inflammation, immune regulation, survival, proliferation]

A second signaling pathway is defined by the recruitment of TRIF, either directly to the TIR domain of TLR3 or, through the bridging adapter TRAM (TRIF-related adapter molecule), to the TIR domain of TLR4, and leads to the activation of the transcription factors NF-κB or IRF3/7 (O’Neill and Bowie 2007). These transcription factors then translocate to the nucleus, where they induce the expression of immune response genes or trigger apoptosis through the activation of the Fas-associated death domain protein (FADD)/caspase 8 pathway (O’Neill and Bowie 2007). As much of this research and part of this entry focuses on the induction and regulation of the interferon response through the TLR3 signaling cascade, the TRIF-dependent pathway, which is used partially by TLR4 and entirely by TLR3, will be discussed in greater detail in ▶ “Toll-Like Receptor 3: Structure and Function.”
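As a compact summary of the two pathways, the sketch below records which pathway or pathways each receptor uses and which transcription factors the text associates with each. The names are hypothetical and the mapping is a deliberate simplification of the description above: TLR3 is treated as TRIF-only, TLR4 as using both pathways, and every other TLR as MyD88-only.

```python
# Transcription factors the text associates with each pathway.
PATHWAY_OUTPUTS = {
    "MyD88-dependent": {"NF-κB"},
    "TRIF-dependent": {"NF-κB", "IRF3/7"},
}

# How TRIF reaches the receptor: directly for TLR3, via TRAM for TLR4.
TRIF_RECRUITMENT = {
    "TLR3": ["TRIF"],
    "TLR4": ["TRAM", "TRIF"],
}


def pathways_for(tlr: str) -> list:
    """Return the signaling pathways used by a TLR (simplified)."""
    if tlr == "TLR3":
        return ["TRIF-dependent"]
    if tlr == "TLR4":
        return ["MyD88-dependent", "TRIF-dependent"]
    return ["MyD88-dependent"]


def transcription_factors_for(tlr: str) -> set:
    """Union of transcription factors reachable from a TLR's pathways."""
    factors = set()
    for pathway in pathways_for(tlr):
        factors |= PATHWAY_OUTPUTS[pathway]
    return factors


if __name__ == "__main__":
    for receptor in ("TLR3", "TLR4", "TLR9"):
        print(receptor, pathways_for(receptor), sorted(transcription_factors_for(receptor)))
```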

Cross-References
▶ Toll-Like Receptor 3: Structure and Function

References
Adachi O, Kawai T, Takeda K et al (1998) Targeted disruption of the MyD88 gene results in loss of IL-1- and IL-18-mediated function. Immunity 9:143–150
Andersen-Nissen E, Smith KD, Strobe KL et al (2005) Evasion of Toll-like receptor 5 by flagellated bacteria. Proc Natl Acad Sci U S A 102:9247–9252
Bell JK, Botos I, Hall PR et al (2005) The molecular structure of the Toll-like receptor 3 ligand-binding domain. Proc Natl Acad Sci U S A 102:10976–10980
Botos I, Segal DM, Davies DR (2011) The structural biology of Toll-like receptors. Structure 19:447–459. https://doi.org/10.1016/j.str.2011.02.004
Ewald SE, Lee BL, Lau L et al (2008) The ectodomain of Toll-like receptor 9 is cleaved to generate a functional receptor. Nature 456:658–662
Ewald SE, Engel A, Lee J et al (2011) Nucleic acid recognition by Toll-like receptors is coupled to stepwise processing by cathepsins and asparagine endopeptidase. J Exp Med 208:643–651
Fitzgerald KA, Palsson-McDermott EM, Bowie AG et al (2001) Mal (MyD88-adapter-like) is required for Toll-like receptor-4 signal transduction. Nature 413:78–83
Garcia-Cattaneo A, Gobert F-X, Muller M et al (2012) Cleavage of Toll-like receptor 3 by cathepsins B and H is essential for signaling. Proc Natl Acad Sci U S A 109:9053–9058
Govindaraj RG, Manavalan B, Lee G, Choi S (2010) Molecular modeling-based evaluation of hTLR10 and identification of potential ligands in toll-like receptor signaling. PLoS One 5:1–13
Jin MS, Kim SE, Heo JY et al (2007) Crystal structure of the TLR1-TLR2 heterodimer induced by binding of a tri-acylated lipopeptide. Cell 130:1071–1082
Kang JY, Nan X, Jin MS et al (2009) Recognition of lipopeptide patterns by Toll-like receptor 2-Toll-like receptor 6 heterodimer. Immunity 31:873–884
Kim HM, Park BS, Kim JI et al (2007) Crystal structure of the TLR4-MD-2 complex with bound endotoxin antagonist eritoran. Cell 130:906–917
Kim Y-M, Brinkmann MM, Paquet M-E, Ploegh HL (2008) UNC93B1 delivers nucleotide-sensing toll-like receptors to endolysosomes. Nature 452:234–238
Kumar H, Kawai T, Akira S (2009) Pathogen recognition in the innate immune response. Biochem J 420:1–16
Lemmon MA (1994) Specificity and promiscuity in membrane helix interactions. FEBS Lett 346:17–20
Li C, Zienkiewicz J, Hawiger J (2005) Interactive sites in the MyD88 Toll/interleukin (IL) 1 receptor domain responsible for coupling to the IL1B signaling pathway. J Biol Chem 280:26152–26159
Motshwene PG, Moncrieffe MC, Grossmann JG et al (2009) An oligomeric signaling platform formed by the Toll-like receptor signal transducers MyD88 and IRAK-4. J Biol Chem 284:25404–25411
O’Neill LAJ, Bowie AG (2007) The family of five: TIR-domain-containing adaptors in Toll-like receptor signalling. Nat Rev Immunol 7:353–364
Poltorak A, He X, Smirnova I et al (1998) Defective LPS signaling in C3H/HeJ and C57BL/10ScCr mice: mutations in Tlr4 gene. Science 282:2085–2088
Qi R, Singh D, Kao CC (2012) Proteolytic processing regulates Toll-like receptor 3 stability and endosomal localization. J Biol Chem 287:32617–32629
Schneider DS, Hudson KL, Lin TY, Anderson KV (1991) Dominant and recessive mutations define functional domains of Toll, a transmembrane protein required for dorsal-ventral polarity in the Drosophila embryo. Genes Dev 5:797–807
Sepulveda FE, Maschalidi S, Colisson R et al (2009) Critical role for asparagine endopeptidase in endocytic Toll-like receptor signaling in dendritic cells. Immunity 31:737–748
Tabeta K, Hoebe K, Janssen EM et al (2006) The Unc93b1 mutation 3d disrupts exogenous antigen presentation and signaling via Toll-like receptors 3, 7 and 9. Nat Immunol 7:156–164
Tanji H, Ohto U, Shibata T et al (2013) Structural reorganization of the Toll-like receptor 8 dimer induced by agonistic ligands. Science 339:1426–1429
Weber ANR, Tauszig-Delamasure S, Hoffmann JA et al (2003) Binding of the Drosophila cytokine Spätzle to Toll is direct and establishes signaling. Nat Immunol 4:794–800
Wei T, Gong J, Jamitzky F et al (2009) Homology modeling of human Toll-like receptors TLR7, 8, and 9 ligand-binding domains. Protein Sci 18:1684–1691
Yoon S-I, Kurnasov O, Natarajan V et al (2012) Structural basis of TLR5-flagellin recognition and signaling. Science 335:859–864

Topoisomerases and Cancer
Adam C. Ketron and Neil Osheroff
Department of Biochemistry and the Vanderbilt Institute of Chemical Biology, Vanderbilt University School of Medicine, Nashville, TN, USA

Synopsis
Topoisomerases are essential enzymes that regulate DNA supercoiling in cells and remove tangles and knots from the genome. However, because they generate DNA strand breaks as requisite intermediates in their catalytic reactions, they also have the capacity to fragment the genome. This potentially lethal feature of topoisomerases I, IIα, and IIβ has been exploited to treat a variety of human cancers. Anticancer drugs that target these enzymes kill cells by a unique mechanism. Rather than depriving cells of the essential functions of topoisomerases, these agents “poison” the enzymes and convert them to potent cellular toxins. Thus, they are called topoisomerase poisons to distinguish them from classic catalytic inhibitors. Topoisomerase I-targeted drugs represent an emerging class of anticancer agents, while topoisomerase II-targeted drugs include some of the most widely prescribed chemotherapeutics currently in clinical use. Together, these drugs are used to treat a variety of systemic cancers and solid tumors. Unfortunately, a small number of specific de novo and drug-induced leukemias appear to be triggered by topoisomerase II-mediated DNA cleavage. Thus, topoisomerases play significant roles in causing, as well as curing, cancer. This article will discuss the mechanistic basis for the actions of topoisomerase poisons, their use in cancer chemotherapy, and their potential role in triggering leukemic chromosomal translocations.

Introduction
As described in the chapter on DNA Topology and Topoisomerases, proliferating eukaryotic cells cannot survive without topoisomerases (Champoux and Dulbecco 1972; Wang 1991; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009a; Pommier 2009). These enzymes regulate DNA supercoiling, remove tangles and knots from the genome, and play critical roles in virtually every nucleic acid process (Champoux and Dulbecco 1972; Wang 1991; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier 2009). Topoisomerases function by generating transient single-stranded (type I enzymes) or double-stranded (type II enzymes) breaks in DNA. In order to maintain genomic integrity during this process, topoisomerases form covalent bonds between their active site tyrosyl residues and the newly generated DNA termini. These covalent enzyme-cleaved DNA complexes are called “cleavage complexes.” As discussed below, human topoisomerase I and topoisomerase II (α and β) are important targets in the treatment of cancer and in some cases are linked to the generation of the disease. The ability of these enzymes to cleave and ligate the double helix is central to both of these processes.

Topoisomerases as Cellular Toxins

Because topoisomerases generate DNA strand breaks as obligate reaction intermediates, they are intrinsically dangerous proteins (Pommier and Marchand 2005; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier 2009; Pommier et al. 2010). Thus, while necessary for cell viability, these enzymes also have the capacity to fragment the genome (Fig. 1). As a result of this dual persona, cells maintain levels of cleavage complexes in a critical balance. Although cells can survive deficiencies in topoisomerase I, if topoisomerase II cleavage (and activity) drops below threshold levels, daughter chromosomes remain entangled following replication (Wang 1991; Champoux 2001; Deweese et al. 2008; Nitiss 2009). As a result, chromosomes cannot segregate properly, and cells die as a result of catastrophic mitotic failure (Fig. 1). If the activities of topoisomerase I and topoisomerase II are diminished simultaneously, critical cellular processes that require DNA tracking systems (such as replication and transcription) slow or stop.

Increased levels of topoisomerase I- or II-DNA cleavage complexes also cause deleterious physiological effects, but for different reasons (Fig. 1) (Pommier and Marchand 2005; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier 2009; Pommier et al. 2010). When replication forks, transcription complexes, or other DNA-tracking proteins attempt to traverse covalently bound protein “roadblocks” in the genetic material, accumulated cleavage intermediates are converted to strand breaks that are no longer tethered by proteinaceous bridges. The ensuing damage induces recombination/repair pathways that can trigger mutations, chromosomal translocations, and other aberrations. If the number of DNA breaks overwhelms the repair process, it can initiate cell death pathways.

Topoisomerases and Cancer, Fig. 1 The formation of covalent DNA cleavage complexes is required for topoisomerases to perform their critical cellular functions. If the level of topoisomerase II-DNA cleavage complexes falls below threshold levels (left arrow), cells are unable to segregate their chromosomes and ultimately die of mitotic failure. If the level of cleavage complexes (generated by topoisomerase I or topoisomerase II) becomes too high (right arrow), the actions of DNA tracking systems can convert these transient complexes to permanent double-stranded breaks. The resulting DNA breaks, as well as the inhibition of essential DNA processes, initiate recombination/repair pathways and generate mutations, chromosome translocations, and other DNA aberrations. If the strand breaks overwhelm the cell, they can trigger apoptosis. This is the basis for the actions of several widely prescribed anticancer drugs that target topoisomerase I or topoisomerase II. If the increase in topoisomerase-mediated DNA strand breaks does not kill the cell, mutations or chromosomal aberrations may be present in surviving populations. In some cases, exposure to topoisomerase II-targeted agents has been associated with the formation of acute myeloid leukemias that involve the MLL (mixed-lineage leukemia) gene at chromosome band 11q23 or acute promyelocytic leukemias that feature chromosome 15 and 17 translocations that join the PML (promyelocytic leukemia) and RARA (retinoic acid receptor α) genes (lower right arrow)


Topoisomerase Poisons

Compounds that impact the catalytic activity of topoisomerases can be separated into two categories (Pommier and Marchand 2005; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier 2009; Pommier et al. 2010). Chemicals that decrease the overall activity of the enzyme are known as catalytic inhibitors. Conversely, chemicals that increase levels of topoisomerase-DNA cleavage complexes are said to “poison” these enzymes and convert them to cellular toxins that initiate the mutagenic and lethal consequences described above. Because of their actions, these latter compounds are referred to as “topoisomerase poisons” to distinguish them from catalytic inhibitors that do not increase the concentration of cleavage complexes. Topoisomerase poisons act by two non-mutually exclusive mechanisms. They can raise levels of cleavage complexes by increasing the rate of DNA cleavage or by decreasing the rate of DNA ligation. Although some topoisomerase poisons also inhibit the overall activity, the “gain of function” induced by these compounds in the cell (i.e., increased levels of cleavage complexes) is the dominant phenotype.

Different classes of topoisomerase-targeted anticancer drugs (even those targeted to the same enzyme) display considerable structural divergence (see Figs. 2 and 3). However, they share a number of common properties that link them mechanistically (Pommier and Marchand 2005; McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier 2009; Pommier et al. 2010). First, all clinically relevant topoisomerase-targeted anticancer drugs act as poisons rather than catalytic inhibitors. Thus, they kill cells by a fundamentally different mechanism than that of most protein-targeted drugs (which act by robbing the cell of an essential function). Second, clinical topoisomerase poisons act at the enzyme-DNA interface, and interactions with both components (at least within the cleavage complex) are essential. While some compounds bind tightly to either the DNA or the enzyme in the absence of the other component, others are believed to bind specifically to the topoisomerase-DNA complex. Third, clinically relevant topoisomerase-targeted drugs act primarily by inhibiting the ability of their enzyme target to ligate the cleaved double helix. Fourth, structural data indicate that drugs inhibit ligation by intercalating into the double helix at the cleaved scissile bond. Thus, they present a physical barrier to ligation and act as “molecular doorstops.”

Anticancer Drugs Targeted to Human Topoisomerase I

Topoisomerase I-targeted drugs represent an emerging class of anticancer agents (Pommier and Marchand 2005; Pommier 2006; Deweese et al. 2008; Nitiss 2009; Pommier 2009; Pommier et al. 2010). These compounds are some of the most active new drugs in the clinic and display activity against malignancies (such as non-small cell lung cancer, metastatic ovarian cancer, and colorectal cancer) that respond poorly to existing therapies.

At the present time, all of the clinically approved topoisomerase I-targeted drugs are based on camptothecin (Fig. 2), which is a natural product found in the bark of Camptotheca acuminata, also known as the “Chinese tree of joy” or “tree of life” (Pommier and Marchand 2005; Pommier 2006; Pommier 2009; Pommier et al. 2010; Beretta et al. 2012). Camptothecin was first tested for clinical efficacy in the 1970s, but the parent compound was problematic for two reasons. First, camptothecin displays strong interactions with human albumin. This significantly impacts bioavailability and results in considerable interpatient variability. Second, the drug is rapidly (but reversibly) converted from its active form (“closed” lactone ring) to its inactive form (“open” ring carboxylate) at slightly basic pH. This equilibrium leads to unpredictable pharmacokinetics. Furthermore, since the reverse reaction (in which the carboxylate is converted back to the active lactone) is favored at low pH, drug treatment often leads to severe bladder toxicity due to the slightly acidic pH of urine.
To overcome the above problems, camptothecin analogs were developed that displayed reduced interactions with human albumin and more stable lactone rings (Pommier and Marchand 2005; Pommier 2006; Pommier 2009; Pommier et al. 2010; Beretta et al. 2012). The resulting water-soluble topotecan and prodrug irinotecan (CPT-11) (Fig. 2) are approved for clinical use in the United States. To further address issues related to the camptothecin lactone, homocamptothecins, which contain a seven-membered lactone ring as opposed to the parental six-membered ring, are being developed. Diflomotecan, the lead compound in this class, is in early clinical trials. The homocamptothecin lactone has a much longer half-life in blood than that of the camptothecins. Moreover, ring opening (and subsequent drug inactivation) is essentially irreversible. Consequently, homocamptothecins display more predictable pharmacokinetics than the camptothecins, making them easier to schedule and reducing interpatient variability.

Finally, non-camptothecin topoisomerase I poisons have been described recently (Pommier and Marchand 2005; Pommier 2009; Pommier et al. 2010; Beretta et al. 2012). For example, novel phenanthridines and indenoisoquinolines (Fig. 2) currently are in clinical trials and hold promise for a new generation of topoisomerase I-targeted anticancer drugs.

Topoisomerases and Cancer, Fig. 2 Structures of selected topoisomerase I-targeted anticancer agents. The ring-closed (lactone) and ring-open (carboxylate) forms of camptothecin are shown. Other agents include the clinically approved camptothecins topotecan and irinotecan, the homocamptothecin diflomotecan, the phenanthridine topovale, and the indenoisoquinoline NSC 725776

Topoisomerases and Cancer, Fig. 3 Structures of selected topoisomerase II-targeted anticancer agents. The demethyl-epipodophyllotoxins etoposide and teniposide, the anthracyclines doxorubicin, daunorubicin, and idarubicin, and the anthracenedione mitoxantrone are approved for clinical use in the United States. The acridine amsacrine is used in some salvage regimens for acute refractory myeloid leukemias. Finally, the isoflavone genistein is a natural product found in soy that is a topoisomerase II poison and is believed to have chemopreventive properties

Anticancer Drugs Targeted to Human Type II Topoisomerases

Topoisomerase II poisons represent some of the most important and widely prescribed anticancer drugs currently in clinical use (Fig. 3) (McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier et al. 2010). These drugs encompass a diverse group of natural and synthetic compounds that are commonly used to treat a variety of human malignancies. At the present time, six topoisomerase II-targeted anticancer agents are approved for use in the United States. Additional drugs are in clinical trials, are used as “experimental agents” in salvage regimens, or are prescribed elsewhere in the world.

One of the first clinically relevant topoisomerase II-targeted anticancer drugs was etoposide, which is derived from podophyllotoxin (Baldwin and Osheroff 2005). This natural product is produced by Podophyllum peltatum, more commonly known as the mayapple or American mandrake plant. Podophyllotoxin has been used as a folk remedy for over a thousand years and is an antimitotic drug that acts by preventing microtubule formation. The clinical use of this compound as an antineoplastic agent was prevented by high toxicity, but two synthetic analogs, etoposide and teniposide, displayed increased antineoplastic activity and decreased toxicity. Further analysis revealed that these drugs did not interact with microtubules; rather, they acted as topoisomerase II poisons. Etoposide was approved for clinical use in the mid-1980s and for several years was the most widely prescribed anticancer drug in the world.

Etoposide and drugs such as doxorubicin (and its derivatives) are frontline therapy for a variety of systemic cancers and solid tumors, including leukemias, lymphomas, sarcomas, lung cancers, and germline malignancies (McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier et al. 2010). Mitoxantrone is used to treat breast cancer, relapsed acute myeloid leukemia, and non-Hodgkin’s lymphoma. Amsacrine also is used to treat relapsed acute myeloid leukemia. Ultimately, half of all anticancer regimens include topoisomerase II-targeted drugs. Beyond its use as an anticancer agent, mitoxantrone is used to treat autoimmune diseases such as multiple sclerosis.

Both isoforms of human topoisomerase II appear to be targeted by the drugs shown in Fig. 3. However, the relative contributions of topoisomerase IIα and topoisomerase IIβ to the chemotherapeutic effects of these agents have yet to be resolved (McClendon and Osheroff 2007; Deweese et al. 2008; Deweese and Osheroff 2009; Nitiss 2009; Pommier et al. 2010). Although some drugs appear to favor one isoform or the other, no truly “isoform-specific” agents have been identified. The isoform specificity of topoisomerase II-targeted anticancer drugs has potential clinical ramifications. For example, levels of topoisomerase IIα are extremely low in quiescent cells. Therefore, at least some of the off-target toxicity of topoisomerase II-targeted agents in differentiated tissues (such as cardiac cells) likely results from drug actions against the β isoform. Furthermore, since topoisomerase IIα is the predominant isoform involved in replication, it has been proposed that cleavage complexes formed with this enzyme may be preferentially converted to permanent DNA strand breaks in proliferating cells. Thus, topoisomerase IIα may be an inherently more “cytotoxic” anticancer target than topoisomerase IIβ. Finally, there is speculation that DNA cleavage complexes generated by topoisomerase IIβ are more likely to produce viable recombination events as opposed to cell death. Consequently, it has been proposed that the β isoform may play a more prominent role in triggering the topoisomerase II-associated leukemias discussed below (Cowell and Austin 2012; Pendleton et al. 2014).

Many foods consumed in the human diet also contain naturally occurring topoisomerase II poisons. The most prominent of these are bioflavonoids (i.e., phytoestrogens) (McClendon and Osheroff 2007; Deweese and Osheroff 2009; Ketron and Osheroff 2014). Bioflavonoids represent a broad group of polyphenolic compounds that are present in many fruits, vegetables, and plant leaves. They have multiple effects on human cells and function as antioxidants and inhibitors of growth factor receptor tyrosine kinases. Many bioflavonoids, especially genistein (which is prominent in soy), are potent topoisomerase II poisons. Genistein (Fig. 3) is believed to be a chemopreventive agent in adults that contributes to the low incidence of breast and colorectal cancers in the Pacific Rim. However, as discussed below, there is evidence linking genistein consumption during pregnancy to the development of infant leukemias. It is notable that the structure of genistein is remarkably similar to that of quinolone antibacterials (McClendon and Osheroff 2007; Deweese and Osheroff 2009). These latter compounds (which are discussed in an accompanying chapter) target the prokaryotic type II topoisomerases, gyrase and topoisomerase IV, and are in wide clinical use.

Topoisomerase II-Associated Leukemias

Despite the importance of topoisomerase II in the treatment of human cancers, evidence suggests that DNA strand breaks generated by the enzyme can trigger chromosomal translocations associated with specific types of leukemia (Felix et al. 2006; McClendon and Osheroff 2007; Deweese and Osheroff 2009; Joannides and Grimwade 2010; Joannides et al. 2011; Cowell and Austin 2012; Pendleton et al. 2014). To this point, 2–3% of patients who receive regimens that include topoisomerase II-targeted drugs (especially etoposide) subsequently develop acute myeloid leukemias (AMLs). Most of these leukemias are characterized by translocations with breakpoints in the MLL (mixed-lineage leukemia) gene at chromosomal band 11q23. The MLL protein is a histone methyltransferase that regulates (among other substrates) the Hox genes, which control proliferation in hematopoietic cells. Several breakpoints in MLL have been identified and are located in close proximity to topoisomerase II-DNA cleavage sites that are induced by etoposide.

In addition to treatment-related leukemias, 80% of infants with AML or acute lymphoblastic leukemia (ALL) display translocations that involve the MLL gene (Felix et al. 2006; McClendon and Osheroff 2007; Deweese and Osheroff 2009; Pendleton et al. 2014). The chromosomal translocations associated with these cancers have been observed in utero, indicating that infant leukemias are initiated during gestation. Epidemiological studies indicate that the risk of developing these infant leukemias is increased >3-fold by the maternal consumption during pregnancy of foods that are high in genistein and other naturally occurring topoisomerase II poisons.

Finally, recent studies indicate a link between topoisomerase II-targeted drugs (especially mitoxantrone) and development of acute promyelocytic leukemias (APLs) (Joannides and Grimwade 2010; Joannides et al. 2011; Pendleton et al. 2014). Patients with these leukemias display translocations between the PML (promyelocytic leukemia) gene on chromosome 15 and the RARA (retinoic acid receptor α) gene on chromosome 17. Once again, chromosomal breakpoints appear to correspond to drug-induced sites of topoisomerase II-mediated DNA cleavage.


Conclusions

Human topoisomerase I and topoisomerase II are important targets for cancer chemotherapy, and drugs that poison these enzymes represent some of the most efficacious agents in clinical use. Topoisomerase-targeted drugs are routinely prescribed to treat a variety of blood-borne malignancies and solid tumors. These drugs act in an insidious manner that is fundamentally different from that of most other protein-targeted agents. Rather than depriving cells of essential topoisomerase activities, drugs induce a “gain of function” that converts these enzymes to lethal “nucleases” that fragment the genome.
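The “gain of function” described in this article, in which poisons raise cleavage-complex levels either by increasing the rate of DNA cleavage or by decreasing the rate of DNA ligation, can be illustrated with a minimal two-state sketch. It assumes (as a simplification that is not taken from the article) that the enzyme-DNA complex interconverts between a noncovalent state and a cleavage complex with first-order rate constants, so that the steady-state fraction of cleavage complexes is k_cleave / (k_cleave + k_ligate). The function name and all rate values below are hypothetical and chosen only for illustration.

```python
def cleavage_complex_fraction(k_cleave: float, k_ligate: float) -> float:
    """Steady-state fraction of enzyme-DNA complexes in the cleaved state.

    Assumed two-state scheme (a hypothetical simplification):
        noncovalent complex <-> cleavage complex
    with first-order rate constants k_cleave (forward) and k_ligate (reverse).
    """
    if k_cleave < 0 or k_ligate < 0:
        raise ValueError("rate constants must be non-negative")
    total = k_cleave + k_ligate
    if total == 0:
        raise ValueError("at least one rate constant must be positive")
    return k_cleave / total


# Made-up rate constants in arbitrary units; these are not measured values.
baseline = cleavage_complex_fraction(k_cleave=1.0, k_ligate=99.0)
faster_cleavage = cleavage_complex_fraction(k_cleave=10.0, k_ligate=99.0)
slower_ligation = cleavage_complex_fraction(k_cleave=1.0, k_ligate=9.9)

for label, value in [
    ("no drug", baseline),
    ("poison that speeds cleavage", faster_cleavage),
    ("poison that slows ligation", slower_ligation),
]:
    print(f"{label}: {value:.3f}")
```

In this toy model either route raises the fraction of cleavage complexes, whereas a catalytic inhibitor that lowered overall activity without shifting the cleavage/ligation balance would not.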

Cross-References

▶ DNA Topology and Topoisomerases
▶ Gyrase and Topoisomerase IV as Targets for Antibacterial Drugs

References

Baldwin EL, Osheroff N (2005) Etoposide, topoisomerase II and cancer. Curr Med Chem Anticancer Agents 5:363–372
Beretta GL, Zuco V, Perego P, Zaffaroni N (2012) Targeting DNA topoisomerase I with non-camptothecin poisons. Curr Med Chem 19:1238–1257
Champoux JJ (2001) DNA topoisomerases: structure, function, and mechanism. Annu Rev Biochem 70:369–413
Champoux JJ, Dulbecco R (1972) An activity from mammalian cells that untwists superhelical DNA–a possible swivel for DNA replication (polyoma-ethidium bromide-mouse-embryo cells-dye binding assay). Proc Natl Acad Sci U S A 69:143–146
Cowell IG, Austin CA (2012) Mechanism of generation of therapy related leukemia in response to antitopoisomerase II agents. Int J Environ Res Public Health 9:2075–2091
Deweese JE, Osheroff N (2009) The DNA cleavage reaction of topoisomerase II: wolf in sheep's clothing. Nucleic Acids Res 37:738–748
Deweese JE, Osheroff MA, Osheroff N (2008) DNA topology and topoisomerases: teaching a “Knotty” subject. Biochem Mol Biol Educ 37:2–10
Felix CA, Kolaris CP, Osheroff N (2006) Topoisomerase II and the etiology of chromosomal translocations. DNA Repair 5:1093–1108
Joannides M, Grimwade D (2010) Molecular biology of therapy-related leukaemias. Clin Transl Oncol 12:8–14
Joannides M, Mays AN, Mistry AR, Hasan SK, Reiter A, Wiemels JL, Felix CA, Coco FL, Osheroff N, Solomon E, Grimwade D (2011) Molecular pathogenesis of secondary acute promyelocytic leukemia. Mediterr J Hematol Infect Dis 3:e2011045
Ketron AC, Osheroff N (2014) Phytochemicals as anticancer and chemopreventive topoisomerase II poisons. Phytochem Rev 13:19–35
McClendon AK, Osheroff N (2007) DNA topoisomerase II, genotoxicity, and cancer. Mutat Res 623:83–97
Nitiss JL (2009a) DNA topoisomerase II and its growing repertoire of biological functions. Nat Rev Cancer 9:327–337
Nitiss JL (2009b) Targeting DNA topoisomerase II in cancer chemotherapy. Nat Rev Cancer 9:338–350
Pendleton M, Lindsey RH Jr, Felix CA, Grimwade D, Osheroff N (2014) Topoisomerase II and leukemia. Ann N Y Acad Sci 1310:98–110
Pommier Y (2006) Topoisomerase I inhibitors: camptothecins and beyond. Nat Rev Cancer 6:789–802
Pommier Y (2009) DNA topoisomerase I inhibitors: chemistry, biology, and interfacial inhibition. Chem Rev 109:2894–2902
Pommier Y, Marchand C (2005) Interfacial inhibitors of protein-nucleic acid interactions. Curr Med Chem Anticancer Agents 5:421–429
Pommier Y, Leo E, Zhang H, Marchand C (2010) DNA topoisomerases and their poisoning by anticancer and antibacterial drugs. Chem Biol 17:421–433
Wang JC (1991) DNA topoisomerases: why so many? J Biol Chem 266:6659–6662

Transcription Factor Classes

April Hill and Rachel McMullan
Department of Biology, University of Richmond, Richmond, VA, USA

Synonyms

DNA-binding domain motifs

Definition

Transcription factors (TFs) are proteins that interact with promoter and enhancer regions of genes to positively or negatively regulate gene expression. Most transcription factors contain several discrete functional domains, including DNA-binding domains, protein-protein interaction domains, domains for intracellular trafficking signals, and ligand-binding domains. In particular, DNA-binding domains are highly conserved and often contain a short motif that fits into the major groove of DNA. There are roughly 12–15 unique DNA-binding domains found in eukaryotic TFs. TFs with similar structural motifs in the DNA-binding domain have been grouped into four classes: helix-turn-helix (HTH) proteins, zinc finger proteins, leucine zipper proteins, and helix-loop-helix proteins.

Discussion

The helix-turn-helix (HTH) is a conserved motif consisting of 20 amino acids in the form of an α-helix followed by a turn that is followed by another α-helix. One of the helices acts as the stabilization helix, contacting the DNA backbone, while the other acts as the recognition helix, which directly interacts with the exposed major groove of the target DNA. This fit of the recognition helix into the major groove allows the amino acids of the helix to form weak interactions with the chemical groups of the DNA base pairs without unwinding the DNA double helix. In addition, the HTH contains a glycine in the turn, which is the most conserved residue found in the motif. Transcription factors containing the HTH motif have been revealed to function as developmental switches in multicellular organisms (Papavassiliou 1995). Homeodomains, encoded by a 180-base pair homeobox sequence, bind to a specific DNA sequence, often in the promoter region of target genes. Homeodomains are normally associated with HTH motifs. One example of a protein containing the HTH motif is the lac repressor. The family of lac proteins, including the lac repressor, contains a conserved HTH region and is able to distinguish between different operator residues through the several non-conserved positions in the HTH motif (Lewis 2005). A well-studied group of genes known as the Hox genes contain a homeobox and encode transcription factors with homeodomains. Hox transcription factors are responsible for differentiation and identification of segment structures.

The zinc finger motif includes a pair of histidines and a pair of cysteines, or two pairs of cysteines, that interact with a zinc ion. The zinc ion in the DNA-binding motif is responsible for stabilizing the three-dimensional structure of the zinc finger. The polypeptide between the two pairs of zinc-binding residues loops out to create the finger configuration, and the fingers occur as tandem repeats. The amino acid sequence at the base and body of the zinc finger determines the DNA sequence specificity. The α-helix of the zinc finger recognizes DNA when inserted into the major groove. Transcription factors containing zinc fingers operate by binding as dimers to response elements upstream of the gene being regulated (Papavassiliou 1995). The zinc finger motif was originally discovered in transcription factor IIIA (TFIIIA), which contains nine tandem repeats, each with pairs of cysteines and histidines and a zinc ion necessary for DNA binding (Struhl 1989). The linker region of the zinc finger motif remains the most highly conserved region. The myelin transcription factor, Myt1, is a C2HC-type zinc finger protein that has been shown to interact with basic helix-loop-helix proteins. In addition, it functions in neurogenesis and neuronal differentiation in Xenopus (Bellefroid et al. 1996).

Leucine zipper protein motifs contain an extended α-helix with leucines at every seventh position, which interact with the partner α-helix, creating a coil or “zipper” structure (Papavassiliou 1995). Moreover, this amphipathic helix contains leucines that interact with adjacent residues of the other helix. A “y”-shaped structure is formed by two leucine zippers, with the zippers creating the stem and the basic regions forming the arms that bind DNA (Guasconi et al. 2003). Basic region leucine zippers are the most widely known leucine zipper proteins and constitute the second largest family of dimerizing transcription factors in Homo sapiens (Nikolaev et al. 2010). GCN4; jun, fos, and myc oncoproteins; and C/EBP enhancer binding proteins all contain the leucine zipper motif. The leucine zipper in these proteins is responsible for dimer formation (heterodimers and homodimers) and specific DNA binding (Struhl 1989). GCN4 and c-Jun contain leucine zipper domains that show ribonuclease activity (Nikolaev et al. 2010).

The helix-loop-helix (HLH) motifs consist of two sets of hydrophobic residues (amino acids) separated by other residues, normally prolines or glycines. Adjacent to the HLH domain, toward the N terminus, are five identical hydrophilic residues that create a basic region or domain. HLH domains form dimers (hetero- or homodimers) necessary for sequence-specific DNA binding. The basic regions of the HLH dimer interact with the major groove of DNA. There are seven different classes of HLH motifs based on their expression, dimer formation, and DNA-binding patterns (Sablitzky 2005). The HLH motifs are found in proteins that interact with immunoglobulin gene enhancers, the muscle-specific transcription factors MyoD and myogenin, and the myc genes. HLH contains an area of positively charged amino acids that interact with DNA, similar to proteins with a leucine zipper. HLH and leucine zipper transcription factor families are controlled by heterodimer formation and function in differentiation and development (Papavassiliou 1995). For example, MyoD, a protein involved in muscle cell differentiation, binds strongly to DNA when in a heterodimer.
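The defining feature of the leucine zipper noted above, a leucine at every seventh position along an extended helix, lends itself to a simple sequence scan. The following is a minimal sketch of such a scan, not an established motif-finding tool: the function name, the default threshold of four consecutive heptads, and the test sequence are all hypothetical choices made for illustration, and a real zipper cannot be confirmed from leucine spacing alone.

```python
from typing import List


def find_leucine_heptads(sequence: str, min_repeats: int = 4) -> List[int]:
    """Return 0-based start positions of runs in which leucine (L) occurs at
    every seventh residue for at least `min_repeats` consecutive heptads."""
    seq = sequence.upper()
    hits: List[int] = []
    for start, residue in enumerate(seq):
        if residue != "L":
            continue
        # Skip leucines that only continue a run starting seven residues earlier.
        if start >= 7 and seq[start - 7] == "L":
            continue
        count = 0
        pos = start
        while pos < len(seq) and seq[pos] == "L":
            count += 1
            pos += 7
        if count >= min_repeats:
            hits.append(start)
    return hits


# Synthetic, made-up peptide (not a real protein): four heptads beginning with L.
toy_sequence = "MK" + "LEDKVEE" * 4 + "G"
print(find_leucine_heptads(toy_sequence))
```

The `min_repeats` threshold is an assumption; lowering it makes the scan more permissive, and any hit would still need structural or experimental support before being called a leucine zipper.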

References

Bellefroid EJ, Bourguignon C, Hollemann T et al (1996) Cell 87:1191–1202
Guasconi V, Yahi H, Ait-Si-Ali S (2003) Transcription factors. Atlas Genet Cytogenet Oncol Haematol 7:163–170
Lewis M (2005) The lac repressor. C R Biol 328:521–548
Nikolaev Y et al (2010) The Leucine zipper domains of the transcription factors GCN4 and c-Jun have ribonuclease activity. PLoS One 5:e10765
Papavassiliou A (1995) Transcription factors. N Engl J Med 332:45–47
Sablitzky F (2005) Protein motifs: the helix-loop-helix motif. In: eLS
Struhl K (1989) Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eukaryotic transcriptional regulatory proteins. TIBS 14:137–140

Transcription Repression ▶ Long-Term Genetic Silencing at Centromere and Telomeres

Transcription Repression

response regulator protein. The phosphorylated response regulator then modulates cellular function, frequently by altering gene expression as a transcriptional regulator protein.

Introduction

Transcriptional Control ▶ Chromatin Remodeling and DNA Modification in Transcriptional Regulation, Role of

Transcriptional Regulation ▶ Cis-Regulation of Eukaryotic Transcription

Transcriptional Silencing ▶ DNA Methylation and Cancer

Transduction of Environmental Signals by Prokaryotic Two-Component Regulatory Systems Laura Runyen-Janecky Department of Biology, University of Richmond, Richmond, VA, USA

Synopsis Phosphorylation of proteins by kinases is a universal mechanism of signal transduction across all domains of life. Modular two-component regulatory systems represent a ubiquitous family of signal transduction proteins and are composed of a distinct group of histidine kinases and response regulators. After detection of a specific signal from the environment, the histidine kinase is autophosphorylated and then transfers the phosphoryl group to the cognate

Many bacterial species encounter fluctuating environmental signals during the course of their life cycles; thus, the ability to sense the current environment and alter gene expression accordingly is likely to be important for survival. There are several classes of proteins that control gene expression in response to environmental cues in bacteria, which can be loosely divided into one-component, two-component, and multicomponent systems. The ubiquitous two-component regulatory systems (TCRS), which are also called two-component systems (TCS) or two-component signal transduction systems (TCST), were initially identified on the basis of sequence similarity between regulatory proteins in several bacterial species. Specifically, Nixon et al. (Nixon et al. 1986) found a high level of similarity in the C-terminal regions of NtrB from Bradyrhizobium sp. and EnvZ, PhoR, and VirA from other bacterial species. Likewise, a high level of similarity in the N-terminal regions of NtrC from Bradyrhizobium sp. and OmpR, PhoB, and VirG from other species was noted. Nixon et al. proposed that there was a family of evolutionarily related proteins whose members consisted of two proteins that function as a unit (the two-component regulatory system) to regulate gene expression in response to environmental signals. The prototypical TCRS is composed of a sensor histidine kinase (HK) and a response regulator protein (RR). In response to a specific environmental signal, the HK is autophosphorylated at a histidine residue. This phosphoryl group is then transferred from the HK to a conserved aspartate residue in the receiver domain of the RR. This phosphorylation allosterically alters the RR so that an output domain in the RR becomes functional. The majority of the output domains in RRs are DNA-binding domains, although non-prototypical output


domains including RNA-binding domains, protein-protein interaction domains, chemotaxis domains, and enzymatic domains exist (Galperin 2010). In RRs with DNA-binding domains, the RR can bind to specific DNA sequences in the promoters of a subset of genes and alter transcription of these genes. Some TCRSs regulate a small number of genes (as few as one or two), whereas others can directly or indirectly regulate a much larger number of genes. For example, two percent of Escherichia coli genes are directly or indirectly regulated by the RR NtrC (Zimmer et al. 2000). NtrC binds directly to certain promoters of genes that encode proteins involved in nitrogen metabolism, thereby activating gene expression directly. Additionally, NtrC activates expression of the nac gene, which encodes a transcriptional activator protein that increases gene expression at other promoters. Thus, NtrC indirectly activates these Nac-regulated genes (Zimmer et al. 2000).

Distribution and Classification of TCRS Genes

TCRS genes are found in all three domains of life, but predominate in Eubacteria. One-component regulatory systems are the most likely ancestors of the TCRSs, which originated in bacteria and were later acquired by Archaea and Eukarya via horizontal gene transfer. Ninety-five percent of Eubacterial genomes, 50% of Archaeal genomes, but less than 30% of Eukaryal genomes have at least one TCRS gene (Barakat et al. 2011). More than 35,000 HK and 40,000 RR genes have been identified in sequenced genomes and metagenomes (Barakat et al. 2011). The number of TCRS genes in a single species ranges from zero in many obligate symbionts such as Mycoplasma sp. and in the eukaryotic metazoans to 155 HKs and 135 RRs in Desulfovibrio magneticus (p2cs.org). Bacterial species with broad metabolic capabilities tend to have a larger number of TCRS genes than microbes with limited metabolisms or those that live in stable, unchanging habitats. Thus, it is not surprising that TCRS genes are generally not present in obligate symbionts with reduced genomes. TCRSs can be grouped into classes based on the genomic arrangement of the genes. Paired TCRS


genes consist of two separate genes, one for the HK and one for the RR, located adjacent to one another on the chromosome; they comprise 54% of the TCRSs identified by genomics (M. Barakat, personal communication). Twenty-four of the 31 TCRSs in E. coli are paired. The prevalence of this arrangement is logical, as genes with a common function are frequently located near one another in bacteria. Orphan TCRS genes are single genes for either a HK or RR and comprise 41% of the TCRSs identified by genomics (M. Barakat, personal communication). Often, the cognate gene for the HK or RR that corresponds to the orphan gene is located elsewhere on the chromosome. Although orphan genes are less common in E. coli, and only two have unidentified cognate partners, the majority of TCRS genes in some species are orphan genes. For example, in Caulobacter crescentus over 50% of the TCRS genes are orphans. It is not clear why there is so much variation in the number of orphan genes in different species of bacteria. Species with more orphan genes might have a larger number of hybrid TCRSs (see later).

The Prototypical HK

The prototypical HK (classical or orthodox family) is composed of two domains: a sensor (input) domain and a transmitter (autokinase core) domain (Fig. 1). The PF00512 (HisKA) and PF02518 (HATPase_C) Pfam domains within the transmitter region define HKs from the TCRS family. The structure and function of TCRS HKs, which have been reviewed by Gao and Stock (2009), Krell et al. (2010), and Stock et al. (2000), are summarized below. The N-terminal end of each HK contains the variable sensor domain that detects a particular environmental stimulus. Typically, the sensor domain is extracytoplasmically located, although membrane-embedded and intracellular sensor domains also exist. Because each HK senses a particular signal, it is not surprising that this domain is relatively variable. However, there are some common motifs found in the sensor domains of multiple HKs.
For example, the PAS motif (which contributes to sensing of redox potential, oxygen, cellular energy, or light) facilitates signal detection by a variety of mechanisms, including providing a cavity for signal binding, mediating signal perception through cofactor binding, and signal-mediated modulation of PAS domain disulfide bonds.

Transduction of Environmental Signals by Prokaryotic Two-Component Regulatory Systems, Fig. 1 The prototypical two-component regulatory system (panel labels: SIGNAL; HK with INPUT/SENSOR DOMAIN and TRANSMITTER DOMAIN; inactive RR with RECEIVER DOMAIN and OUTPUT DOMAIN; ATP and ADP; phosphoryl group (P) on the HK and on the RECEIVER DOMAIN of the active RR; alteration of gene expression)

The sensor domain is connected to the cytoplasmic transmitter domain through a transmembrane helix (for membrane HKs) and a short cytoplasmic linker. The transmitter domain contains the autokinase activity and is also called the kinase core or autokinase domain in some literature. The transmitter domain is composed of a dimerization and phosphotransfer subdomain (DHp) and an ATP-binding catalytic subdomain (CA). All TCRS HKs have an invariant histidine for autophosphorylation within a short motif known as the H-box. For most HKs, the H-box is located in the DHp subdomain. The approximately 240-amino-acid CA subdomain at the C-terminal end of the HK contains an ATP-binding pocket which is structurally formed

from the N, F, G1, and G2 motifs, which are named after amino acids in each motif. Finally, some HKs have a phosphatase domain to modulate the activity of the phosphorylated RR.

The Prototypical RR

The structure and function of TCRS RRs have been reviewed by Galperin (2010) and Gao and Stock (2009) and are summarized below. The prototypical RR is composed of two domains (Fig. 1). At the N-terminal end of the protein, there is a receiver domain (PF00072/Response_reg) that catalyzes the acceptance of a phosphoryl group at an invariant aspartate residue and that defines this family of proteins. The active site in the receiver domain of the RR contains several highly conserved residues including the aspartate that is phosphorylated by the HK, two additional aspartate/glutamate residues, and one lysine. The switch site in the receiver domain of the RR contains two highly conserved Ser/Thr


and Phe/Tyr residues for signal transduction to the C-terminal domain. The output (effector) domain, which is generally a DNA-binding domain, is located in the C-terminal end of the RR. There is less conservation among the DNA-binding output domains in RRs because each RR regulates a distinct set of promoters and because there are numerous DNA-binding motifs, most of which are variants of the helix-turn-helix DNA-binding motif. Other non-prototypical output domains include RNA-binding domains, protein-protein interaction domains, chemotaxis domains, and enzymatic domains. Theoretically, any modular output domain could be fused to a receiver domain to place that output activity under control of a TCRS. Finally, some RRs have autophosphatase domains that control the length of time that the protein is in the phosphorylated state.

More Complex TCRS Proteins

In addition to the prototypical TCRS HK and RR proteins, more complex families of HKs and RRs exist (Barakat et al. 2011; Gao and Stock 2009). For example, hybrid HKs contain receiver domains at their C-terminal ends that are typically found in RRs. Members of the unorthodox HK class are hybrid HKs that also possess additional phosphotransfer domains. The most common of these phosphotransfer domains is the HPt (histidine-containing phosphotransfer) domain. These domains can transfer phosphoryl groups from one amino acid to another within the same protein and/or from one protein to another, and they frequently function in complex phospho-relay systems. Other complex TCRS HKs contain both a receiver and an output domain. Finally, some TCRSs are modulated by auxiliary proteins that modify the activities of the core HK and RR elements.

Biochemical Mechanisms of Signal Transduction

A wealth of structural and biochemical analyses has yielded a detailed picture of the common mechanisms of signal transduction in prototypical TCRSs (Fig. 1).
Much of the fundamental work described below was reviewed in 2000 by Stock


et al. and, more recently, by Gao and Stock (2009). Signal reception by the HK is still one of the least understood parts of the mechanism, partly because of the large number of predicted signals that can be detected. However, once a signal is received by the sensor domain of the HK, common structural properties mediate signaling between the sensor and transmitter domains. Signal transduction is thought to result in a change in the conformation of the dimerization domain between the two monomers of the HK, which is relayed to the transmitter domain via an allosteric alteration of the HK. Recent biophysical work suggests that there may be multiple types of alterations (e.g., sliding of dimerization helices, rotational movement of the dimerization domain) depending on the HK. Once the signal has been transduced to the transmitter domain, there is an ATP-dependent autophosphorylation event. Specifically, ATP bound in a pocket formed from conserved residues from the N, G1, F, and G2 boxes is used to phosphorylate one of two histidines in the HK homodimer, creating a high-energy phosphoryl group. Although there are a few examples of biological cross talk in which one HK can interact with more than one RR and vice versa, this is the exception rather than the rule in vivo. In fact, kinetic analyses suggest that there is specificity between cognate HK and RR pairs in physiologically relevant conditions. Non-conserved amino acids, located in the transmitter domain of the HK and the receiver domain of the RR, mediate specificity between the cognate pair. The HK-RR interaction creates an active site which catalyzes phosphoryl transfer from the HK to a conserved aspartate in the RR receiver domain.
When the RR is phosphorylated, a conserved Ser/Thr residue in the receiver switch domain interacts with the phosphoryl group, enabling a conserved Phe/Tyr residue (which was previously interacting with the output domain to inhibit function) to fill the space previously occupied by Ser/Thr, thereby eliminating inhibition of the output domain. Because output domains vary widely among TCRSs, there are likely to be numerous ways in which the signal transduction leads to an active output domain.
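The sequence of events described above (signal detection, ATP-dependent autophosphorylation of the HK, phosphotransfer to the cognate RR, and relief of output-domain inhibition) can be illustrated with a small Python sketch. This is a toy model rather than a description of any real kinetics: the class and function names are hypothetical, each protein is reduced to a single phosphorylation flag, and strict cognate-pair specificity is assumed even though limited cross talk is known to occur. The EnvZ/OmpR and PhoR/PhoB pairings are used only as familiar examples.

```python
from dataclasses import dataclass


@dataclass
class HistidineKinase:
    """Toy sensor kinase: records only whether the conserved His is phosphorylated."""
    name: str
    his_phosphorylated: bool = False

    def sense(self, signal_present: bool, atp_available: bool = True) -> None:
        # Autophosphorylation is ATP-dependent and follows signal detection.
        if signal_present and atp_available:
            self.his_phosphorylated = True


@dataclass
class ResponseRegulator:
    """Toy response regulator: records only whether the receiver Asp is phosphorylated."""
    name: str
    cognate_kinase: str  # assumption: exactly one cognate HK, no cross talk
    asp_phosphorylated: bool = False

    @property
    def output_active(self) -> bool:
        # Assumption: phosphorylation alone relieves inhibition of the output domain.
        return self.asp_phosphorylated


def phosphotransfer(hk: HistidineKinase, rr: ResponseRegulator) -> bool:
    """Move the phosphoryl group from the HK His to the RR Asp; True if it moved."""
    if hk.his_phosphorylated and rr.cognate_kinase == hk.name:
        hk.his_phosphorylated = False
        rr.asp_phosphorylated = True
        return True
    return False


envz = HistidineKinase("EnvZ")
ompr = ResponseRegulator("OmpR", cognate_kinase="EnvZ")
phob = ResponseRegulator("PhoB", cognate_kinase="PhoR")

envz.sense(signal_present=True)
print("EnvZ -> PhoB transfer:", phosphotransfer(envz, phob))  # non-cognate pair
print("EnvZ -> OmpR transfer:", phosphotransfer(envz, ompr))  # cognate pair
print("OmpR output active:", ompr.output_active)
```

A more realistic model would replace the boolean flags with rates for autophosphorylation, transfer, and dephosphorylation, which is where the kinetic specificity between cognate pairs would enter.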


One of the most common mechanisms may be facilitation of dimerization of the RR to allow productive DNA binding. Other activation mechanisms will certainly be revealed as more TCRS effector domains are studied using biophysical techniques. Upon activation, the prototypical RRs, which are DNA-binding proteins, regulate transcription of a particular subset of genes by binding to specific sequences in the promoters of these genes. Regulation of transcription can occur by a variety of mechanisms, including interaction with the sigma or alpha subunits of RNA polymerase or with other transcriptional regulators, or by bending of the DNA. Interaction with RNA polymerase can potentially facilitate recruitment and binding of the RNA polymerase, reposition RNA polymerase on the promoter, or alter the kinetics of transcriptional initiation.

Type of Signals Detected and Activities Regulated by TCRSs

Examinations of the types of signals that are detected by TCRSs in bacteria reveal an astounding variety (reviewed by Krell et al. 2010). Some examples include pH, oxygen and carbon dioxide levels, cellular redox state (via quinones), ion levels (nitrate, phosphate, sulfate, magnesium), carbon sources (citrate, hexose-6-phosphate), chemoattractants (serine, aspartate), nutrients (2-ketoglutarate, glutamine), envelope stress, osmolarity, quorum molecules, metabolic end products (formate and acetate), light, and host molecules. As there are over 30,000 different HKs, it is likely that many more signals remain to be discovered. These signals are transduced to the RRs, which then regulate expression of genes and activities that comprise a diverse collection of processes including carbon metabolism (respiration, fermentation), transport and assimilation of nutrients (phosphate, nitrogen, potassium), stress responses (oxidative, envelope, stationary phase), motility, group behavior (quorum sensing, biofilms), homeostasis (osmolarity regulation), and virulence.
TCRSs have been examined functionally in a variety of bacterial species; not surprisingly,

one of the organisms in which these systems have been studied extensively is E. coli. Zhou et al. (2003) made mutations in each E. coli TCRS and analyzed each mutant using phenotypic microarrays, which allow simultaneous analysis of 2,000 phenotypes per mutant. They found that under the conditions tested, 22 of the 37 TCRS mutants showed altered phenotypes. For the 15 mutants that did not show altered phenotypes, the growth conditions for the experiment might not have contained the signal to activate the TCRS, the phenotype might not be detectable in the phenotypic microarray, or there might have been redundant function for the particular phenotype being examined. Regardless, this study showed that a large number of phenotypes can be controlled by TCRSs in a single bacterial species.

TCRSs and Regulation of Virulence

The role of bacterial TCRSs in virulence has been investigated in numerous bacterial pathogens and was reviewed by Beier and Gross (2006). There is an interest in understanding these TCRSs because disruption of bacterial sensing of the host environment might be a good target for new antimicrobial strategies. The signals for TCRS activation in pathogens are specific to the niches within the host. In many cases, multiple signals are integrated through several TCRSs to regulate virulence. Typical signals include host body temperature, pH, and ion concentrations. Additionally, host-derived molecules can serve as signals for TCRS activation. For example, several bacterial pathogens (E. coli, Salmonella enterica, and Edwardsiella tarda) sense host adrenergic signaling molecules (epinephrine and norepinephrine) through the QseB/QseC TCRS (Clarke et al. 2006). Also, it has been suggested that in the plant pathogen Xanthomonas sp., the cognate HK for HrpG, an RR that regulates expression of virulence genes, may respond to plant-specific molecules.
TCRSs can impact virulence by regulating expression of either classical virulence factors (e.g., toxins) or factors that enhance growth of the bacterium within particular niches in the host. As an example, both situations occur in the


human pathogen Shigella flexneri. The Shigella CpxA/CpxR TCRS mediates pH-dependent activation of transcription of the virulence regulator gene virF, which is one of the master regulators of virulence genes in Shigella. The Shigella EnvZ/OmpR TCRS mediates activation of the ompC porin gene, which is required for Shigella to spread from cell to cell in the host. It is also possible that other OmpR-regulated genes are involved in Shigella virulence, especially since OmpR is predicted to regulate a large number of genes. Additionally, the NtrB/NtrC TCRS enhances the growth rate of Shigella when the bacterium is within eukaryotic epithelial cells (Runyen-Janecky, unpublished data). Some of these TCRSs have also been shown to regulate virulence in other pathogens, although the particular genes that they regulate may be different in each pathogen. For instance, CpxR regulates the icm and dot virulence genes of Legionella pneumophila, OmpR regulates the Salmonella SPI-2 virulence genes indirectly by activating expression of the ssrA/ssrB TCRS genes, and NtrC is required for normal replication of Brucella suis in spleen tissue. In some pathogens, multiple TCRSs are involved in regulating virulence, whereas in others there is just one TCRS that controls virulence. In Salmonella, there are at least eight TCRSs that coordinate regulation of virulence. Some of these TCRSs are unique to Salmonella, while others are present in nonpathogens and have been co-opted for regulation of virulence genes in Salmonella. For example, the PhoQ/PhoP TCRS, which is found in both pathogens and nonpathogens, induces expression of a subset of genes required for Salmonella to survive in the host cell phagocytic vacuole. In that niche, PhoQ responds to the levels of magnesium and calcium in the vacuole to initiate the signal transduction cascade, which alters the expression of over 40 Salmonella genes.
These genes encode magnesium metabolism proteins, another TCRS (SsrB/SsrA) that directly regulates expression of the SPI-2 encoded virulence genes, and other virulence factors. In contrast to the complex interplay among the multiple TCRSs in Salmonella, it appears that there is just one major TCRS for


virulence gene regulation in Bordetella pertussis: BvgS/BvgA. BvgS is activated at 37 °C, human body temperature, although other signals may also be physiologically relevant for activation of BvgS. Hundreds of B. pertussis genes are BvgA regulated.

Conclusions

Since their identification in the 1980s, TCRSs have been studied using genetic, biochemical, and biophysical tools. These systems are significant in that they are ubiquitous across Eubacterial species and regulate a large number of genes in response to environmental cues. Future investigations will address many outstanding questions including, but not limited to, the nature of the signal detection for the large number of HKs, the mechanism by which signals are communicated between the domains of the HKs, the role of cross talk between TCRSs, and the importance of TCRSs in nonpathogenic symbiosis and in many less studied bacterial species. These studies will add to our understanding of how organisms sense their environment and alter gene expression accordingly.

References

Barakat M, Ortet P, Whitworth DE (2011) P2CS: a database of prokaryotic two-component systems. Nucleic Acids Res 39:D771–D776
Beier D, Gross R (2006) Regulation of bacterial virulence by two-component systems. Curr Opin Microbiol 9:143–152
Clarke MB, Hughes DT, Zhu C, Boedeker EC, Sperandio V (2006) The QseC sensor kinase: a bacterial adrenergic receptor. Proc Natl Acad Sci U S A 103:10420–10425
Galperin M (2010) Diversity of structure and function of response regulator output domains. Curr Opin Microbiol 13:150–159
Gao R, Stock AM (2009) Biological insights from structures of two-component proteins. Annu Rev Microbiol 63:133–154
Krell T, Lacal J, Busch A, Silva-Jimenez H, Guazzaroni M-E, Ramos JL (2010) Bacterial sensor kinases: diversity in the recognition of environmental signals. Annu Rev Microbiol 64:539–559
Nixon BT, Ronson CW, Ausubel FM (1986) Two-component regulatory systems responsive to environmental stimuli share strongly conserved domains with the nitrogen assimilation regulatory genes ntrB and ntrC. Proc Natl Acad Sci U S A 83:7850–7854
Stock AM, Robinson VL, Goudreau PN (2000) Two-component signal transduction. Annu Rev Biochem 69:183–215
Zhou L, Lei XH, Bochner BR, Wanner BL (2003) Phenotype microarray analysis of Escherichia coli K-12 mutants with deletions of all two-component systems. J Bacteriol 185:4956–4972
Zimmer DP, Soupene E, Lee HL, Wendisch VF, Khodursky AB, Peter BJ, Bender RA, Kustu S (2000) Nitrogen regulatory protein C-controlled genes of Escherichia coli: scavenging as a defense against nitrogen limitation. Proc Natl Acad Sci U S A 97:14674–14679

Translation ▶ Cytoplasmic mRNA, Regulation of

Translation Initiation ▶ Cytoplasmic mRNA, Regulation of

Translation Repression ▶ Cytoplasmic mRNA, Regulation of

Transposable Elements and Plasmid Genomes

Jon Hobman
School of Biosciences, University of Nottingham, Sutton Bonington, UK

Synopsis

Although some plasmid genomes appear to be very stable, variants can generally be isolated by application of appropriate selective pressure, and in many cases, this variation involves acquisition or movement of transposable elements. Transposable elements are defined segments of DNA encoding one or more enzymes that allow the DNA to move from one location to another independently of homologous recombination. They thus allow plasmids to acquire new genetic determinants directly or to modify the existing determinants that they carry. Where the transposable element involved is present in multiple copies in the cell, generally because it transposes by a replicative mechanism, the element can allow homologous recombination between different elements, thus allowing plasmids to integrate with each other, with the chromosome, or with phage. Such transactions between genomes can result in large DNA segments being exchanged. Once a plasmid genome contains a transposable element, that element will often act as a hot spot for further insertions, so that insertions within insertions are a common feature. Thus, a plasmid genome can often be seen as a mosaic, with transposable elements defining many of the pieces that build up that mosaic. Plasmids can thus be considered vehicles carrying cargos of transposons, integrons, and insertion sequences that move across bacterial populations (Revilla et al. 2008).

Introduction

Plasmids are mosaics of genes consisting of a “backbone” region encoding traits essential for plasmid maintenance and replication and “accessory” regions that encode traits which are beneficial to the host. Often accessory regions of plasmids are found to contain mobile genetic elements such as transposable elements, which are discrete pieces of DNA that can move between genetic loci, using a recombination process called transposition. Transposable elements are involved in the structural evolution of plasmids and chromosomes through (1) interruption of genes, (2) translocations of genes, (3) modification of expression of adjacent genes, or (4) integration of new genes. Transposons can move from plasmid to plasmid and plasmid to chromosome and vice versa. They were first identified in bacterial plasmids during studies on the acquisition of MDR (multidrug resistance) in the 1960s and 1970s. These studies resulted in the discovery of transposable elements on different plasmids which were carrying antibiotic resistance genes (reviewed in Toleman and Walsh (2011)).

Transposable Elements and Plasmid Genomes, Fig. 1 An example of a plasmid genome that consists almost entirely of segments acquired by the action of mobile genetic elements. The E. coli IncG/Pseudomonas IncP-6 plasmid Rms149 consists of a plasmid “backbone” of 11.7 kb carrying insertion sequences and transposable elements that make up the rest of the 57 kb genome (Haines et al. 2005)

Insertion Sequences and Transposons

A detailed knowledge of transposable elements is useful in understanding the structure and behavior of plasmid genomes. Being able to piece together clues from sequence features associated with transposable elements can allow the reconstruction of how different genome segments were acquired (see Fig. 1) and the prediction of what rearrangements are likely to occur in future. DNA sequence and experimental analysis of the first identified bacterial transposable elements distinguished three classes: IS (insertion sequence) elements, class I composite transposons, and class II complex transposons (now sometimes called unit transposons). The simplest form of transposable element is the IS element. The ends of IS elements are defined by short perfect or imperfect inverted repeat DNA sequences that flank at each end one or more genes encoding a transposase. Transposition of these IS elements generates short (2–14 bp) direct repeats in the host DNA (Partridge 2011), and in complex genomes


with multiple insertions, this allows one to establish which ends were acquired at the same time (Greated et al. 2002). When a transposable element contains one or more accessory genes that encode a marker (such as antibiotic resistance, toxins, or virulence factors), it is called a transposon. A class I composite transposon is the simplest form of transposon. In a composite transposon, two IS elements flank at each end another unrelated gene. The first detected composite transposons encoded antibiotic resistance genes; examples of these found in plasmids in Gram-negative bacteria are Tn9 from pSM14 (chloramphenicol resistance) and Tn10 from R100 (tetracycline resistance) and in Gram-positive bacteria Tn4003 from Staphylococcus aureus plasmid pSK1 (trimethoprim resistance). Another form of transposon is the class II or complex transposon. Complex or unit transposons such as those belonging to the Tn3 family comprise 18–21 bp terminal inverted repeat (IR) sequences with genes encoding a transposase and a resolvase involved in excision and integration of the transposon, as well as accessory gene(s). Examples of these transposons from Gram-negative bacterial plasmids include Tn1 from RP4 (ampicillin resistance), Tn3 from plasmid R1 (ampicillin resistance), and Tn21 from R100 (mercury, sulfonamide, kanamycin, and quaternary ammonium compound resistance). Those from Gram-positive bacteria include Tn917 from Enterococcus faecalis multiresistance plasmid pAD2 (erythromycin resistance). Other transposable elements which did not fit this pattern were Tn7 and certain transposing bacteriophages such as Mu.
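Two of the sequence features described above, the inverted repeats at the ends of an element and the short direct repeats that insertion leaves in the host DNA, can be made concrete with a short Python sketch. The function names and the toy sequences are hypothetical and chosen only for illustration, and the model assumes a perfect inverted repeat and a simple duplication of the target site on each side of the element.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_inverted_repeats(element: str, ir_length: int) -> bool:
    """True if the last `ir_length` bases are the reverse complement of the first ones."""
    return element[-ir_length:] == reverse_complement(element[:ir_length])


def insert_element(host: str, element: str, position: int, tsd_length: int) -> str:
    """Insert `element` at `position`, duplicating the `tsd_length`-bp target site."""
    if not 0 <= position <= len(host) - tsd_length:
        raise ValueError("target site must lie within the host sequence")
    target_site = host[position:position + tsd_length]
    # One copy of the target site ends up on each side of the inserted element,
    # giving the flanking direct repeats.
    return (host[:position] + target_site + element
            + target_site + host[position + tsd_length:])


# Toy element: 5 bp inverted repeats around a placeholder "transposase" core.
left_ir = "GGTCA"
element = left_ir + "ATGAAATAA" + reverse_complement(left_ir)
host = "TTGACCATGCAAGG"

print(has_inverted_repeats(element, 5))
print(insert_element(host, element, position=4, tsd_length=3))
```

Because each insertion event creates its own pair of direct repeats, matching repeats on either side of an element indicate that both ends arrived together, which is the reasoning used to untangle nested insertions in complex plasmid genomes.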

Integrons

Since these initial descriptions of transposons, other genetic elements have been discovered, some of which are particularly involved in the acquisition of antibiotic resistance genes and are found within plasmids and transposons or on chromosomal resistance islands. ICEs (integrative conjugative elements) are found on chromosomes, but many can transfer to a wide range of species, being able to conjugate like plasmids; they are outside the scope of this entry. Perhaps most significant of these gene capture elements are integrons, which were first identified in the 1980s (Stokes and Hall 1989). Integrons are genetic elements that can capture genes using site-specific recombination but appear to have lost the ability to transpose (at least in some cases). One of the best known examples of an integron that has impacted on plasmid evolution is In2, which is found within Tn21, itself carried on plasmid R100 isolated in Japan in the mid-1950s (Liebert et al. 1999). In2 from R100 carries genes conferring resistances to sulfonamide, kanamycin, and quaternary ammonium compounds within the integron structure, which is inserted between the mercury resistance module and transposon gene module of Tn21. Other transposable elements have also been identified which are related to IS elements but can capture and transport resistance genes (Partridge 2011). Examples are ISCR (IS common region) elements, which are found within integrons, and ISEcp1. Both propagate by rolling circle replication. ISEcp1 moves adjacent regions of DNA by failing to recognize IR but instead recognizing a weakly related downstream sequence. ISCR elements also mobilize genes by replicating into adjacent sequences when they fail to recognize terminal IR sequences. Both ISEcp1 and ISCR are able to capture and move resistance genes via a single copy (unlike composite transposons). ISEcp1 has been found within transposon Tn2 and in plasmids, and there appears to be a role for both ISEcp1 and IS26 in the evolution of diverse clinical enterobacterial multiresistance plasmids (Smet et al. 2010).

Plasmids Are Critical for the Success of Transposable Elements

Evidence from analysis of the antimicrobial resistance profiles and incompatibility groups of plasmids from collections such as the Murray collection of clinical “pre-antibiotic” era enterobacteria indicates that the incompatibility groups of these plasmids are the same as those of “antibiotic era” plasmids. There were very low incidences of antibiotic resistance carried by these plasmids, but higher levels of antimicrobial metal ion resistances (Hughes and Datta 1983; Datta and Hughes 1982). This is consistent with the idea that transposons have been drivers of antibiotic resistance gene acquisition in commensal and pathogenic bacteria. More recent evidence for the evolution of plasmids from common backbones by insertion of transposons or acquisition of integrons can be seen in the IncW plasmids R388, pSa, R7K, pIE321, and pIE522 (Revilla et al. 2008), in IncP-1 plasmids (Sen et al. 2011), IncHI-2 plasmids (Cain and Hall 2012), IncN plasmids (Eikmeyer et al. 2012), and IncA/C plasmids (Fernández-Alarcón et al. 2011), as well as other examples. Although transposons and other mobile elements can capture and move accessory genes within an organism, for horizontal gene transfer between organisms, plasmids or other “transporters” (or naked DNA) are needed.

References

Cain AK, Hall RM (2012) Evolution of IncHI2 plasmids via acquisition of transposons carrying antibiotic resistance determinants. J Antimicrob Chemother 67:1121–1127
Datta N, Hughes VM (1982) Plasmids of the same Inc groups in enterobacteria before and after the medical use of antibiotics. Nature 306:616–617
Eikmeyer F, Hadiati A, Szczepanowski R, Wibberg D, Schneiker-Bekel S, Rogers LM, Brown CJ, Top EM, Pühler A, Schlüter A (2012) The complete genome sequences of four new IncN plasmids from wastewater treatment plant effluent provide new insights into IncN plasmid diversity and evolution. Plasmid 68:13–24
Fernández-Alarcón C, Singer RS, Johnson TJ (2011) Comparative genomics of multidrug resistance-encoding IncA/C plasmids from commensal and pathogenic Escherichia coli from multiple animal sources. PLoS One 6:e23415
Greated A, Lambertson L, Williams PA, Thomas CM (2002) Complete sequence of the IncP-9 TOL plasmid pWW0. Environ Microbiol 4:856–871
Haines AS, Jones K, Cheung M, Thomas CM (2005) The IncP-6 plasmid Rms149 consists of a small mobilizable backbone with multiple large insertions. J Bacteriol 187:4728–4738
Hughes VM, Datta N (1983) Conjugative plasmids in bacteria of the ‘pre-antibiotic’ era. Nature 302:725–726
Liebert CA, Hall RM, Summers AO (1999) Tn21-flagship of the floating genome. Microbiol Mol Biol Rev 63:507–522
Partridge S (2011) Analysis of antibiotic resistance regions in Gram-negative bacteria. FEMS Microbiol Rev 35:820–855
Revilla C, Garcillán-Barcia P, Fernández-López R, Thomson NR, Sanders M, Cheung M, Thomas CM, de la Cruz F (2008) Different pathways to acquiring resistance genes illustrated by the recent evolution of IncW plasmids. Antimicrob Agents Chemother 52:1472–1480
Sen D, Van der Auwera GA, Rogers LM, Thomas CM, Brown CJ, Top EM (2011) Broad-host-range plasmids from agricultural soils have IncP-1 backbones with diverse accessory genes. Appl Environ Microbiol 77:7975–7983
Smet A, Van Nieuwerburgh F, Vandekerckhove TTM, Martel A, Deforce D, Butaye P, Haesebrouck F (2010) Complete nucleotide sequence of CTX-M15-plasmids from clinical Escherichia coli isolates: insertional events of transposons and insertion sequences. PLoS One 5:e11202
Stokes HW, Hall RM (1989) A novel family of potentially mobile DNA elements encoding site-specific gene-integration functions: integrons. Mol Microbiol 3:1669–1683
Toleman MA, Walsh TR (2011) Combinatorial events of insertion sequences and ICE in Gram-negative bacteria. FEMS Microbiol Rev 35:912–935

Transposons

Adam R. Parks¹ and Joseph E. Peters²
¹Molecular Control and Genetics Section, Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
²Department of Microbiology, Cornell University, Ithaca, NY, USA

Synopsis

Transposons are mobile genetic elements that can move between different DNA molecules or within an individual DNA molecule. The donor and target DNA molecules do not require any sequence homology for mobilization to occur. Transposons are often characterized by the biochemical strategy used to carry out DNA breaking and joining reactions. Recombinase families include the DDE-, HUH-, and DEDD-type transposases and the serine transposases. Transposons may be removed from a donor DNA entirely by being “cut out,” or they may be left in place while a copy of the element is “copied out.” During insertion, the element may be either “pasted in,” that is, moved entirely into the recipient molecule, or “copied in.” Transposons that are copied out as RNA are referred to as retrotransposons, whereas transposons that move exclusively as a DNA molecule and do not require an RNA intermediate are simply referred to as DNA transposons. The assembly and coordination of all of the components involved in transposition is a complex and highly regulated process.

Introduction

Transposons are mobile genetic elements that can move between different DNA molecules or within an individual DNA molecule. They are distinct from other mobile elements in that they are not required to move between sites that share sequence homology; they may move to sites with no sequence similarity at all. The most intensively studied type of recombinase that carries out this function is the DDE-type transposase (also called an integrase in eukaryotes); however, there are many other biochemical strategies for recombination. Transposons are often characterized by the biochemical strategy used to carry out DNA breaking and joining reactions. These recombinase families include the DDE-, HUH-, and DEDD-type transposases and the serine transposases (Curcio and Derbyshire 2003; Siguier et al. 2014).

In addition to differences in the biochemical strategies used for DNA cleavage and strand transfer, the overall mechanisms of element movement can differ significantly. Other details of the transposition process are also useful for comparing closely related elements. These details can include the sequence and arrangement of terminal inverted-repeat sequences, target-site duplications, the arrangement of transposase and accessory genes, and target-site specificity.

A convenient way to think about the different strategies that mobile elements use for mobility is to consider how they are removed from their original, or donor, DNA molecule and how they are inserted into a new, target or recipient, DNA molecule (Curcio and Derbyshire 2003). Transposons may be removed from a donor DNA entirely by being “cut out,” or they may be left in place while a copy of the element is “copied out” (Fig. 1). Similarly, elements may be either “pasted in,” that is, moved entirely into the recipient molecule, or “copied in.” The terms describing the element’s removal and insertion are combined to describe the entire mobilization process; for example, a cut-and-paste transposon is removed entirely from the donor DNA molecule and pasted directly into a recipient DNA molecule. Transposons that are copied out as RNA are referred to as retrotransposons, whereas transposons that move exclusively as a DNA molecule and do not require an RNA intermediate are simply referred to as DNA transposons.
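The removal/insertion vocabulary above amounts to a small two-axis classification, which can be sketched in code. The following Python fragment is only an illustration: the class and field names (`Removal`, `Insertion`, `Strategy`, `EXAMPLES`) are hypothetical, and the example assignments are taken from statements made elsewhere in this entry (Tn5 and Tn10 as cut-and-paste elements, Tn3 as a replicative element, the IS3 family as copy-out and paste-in, LTR elements such as Ty1 as copy-out and paste-in, and non-LTR elements such as L1 as copy-out and copy-in).

```python
from enum import Enum
from typing import NamedTuple


class Removal(Enum):
    CUT_OUT = "cut-out"
    COPY_OUT = "copy-out"


class Insertion(Enum):
    PASTE_IN = "paste-in"
    COPY_IN = "copy-in"


class Strategy(NamedTuple):
    """How an element leaves the donor and enters the target (hypothetical model)."""
    removal: Removal
    insertion: Insertion
    rna_intermediate: bool = False

    def label(self) -> str:
        # Combine the two terms, as the text does for whole mobilization pathways.
        return f"{self.removal.value} and {self.insertion.value}"

    def element_class(self) -> str:
        # Elements copied out as RNA are retrotransposons; all others are DNA transposons.
        return "retrotransposon" if self.rna_intermediate else "DNA transposon"


# Illustrative assignments based on examples given in this entry.
EXAMPLES = {
    "Tn5": Strategy(Removal.CUT_OUT, Insertion.PASTE_IN),
    "Tn10": Strategy(Removal.CUT_OUT, Insertion.PASTE_IN),
    "Tn3": Strategy(Removal.COPY_OUT, Insertion.COPY_IN),
    "IS3": Strategy(Removal.COPY_OUT, Insertion.PASTE_IN),
    "Ty1": Strategy(Removal.COPY_OUT, Insertion.PASTE_IN, rna_intermediate=True),
    "L1": Strategy(Removal.COPY_OUT, Insertion.COPY_IN, rna_intermediate=True),
}

for name, strategy in EXAMPLES.items():
    print(f"{name}: {strategy.label()} ({strategy.element_class()})")
```

Keeping removal, insertion, and the nature of the intermediate as separate fields mirrors the text: IS3 and Ty1 share the same copy-out and paste-in label but differ in whether the copy is made as DNA or as RNA.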

Transposons, Fig. 1 Examples of mobile elements that illustrate the cut-out, copy-out, paste-in, and copy-in terms. (a) Cut-out: elements such as Tn10 and Tn5 use a cut-out strategy to remove the element from donor DNA. The transposase (green circles) “cuts out” the element from the donor DNA by joining the top and bottom strands at either end of the element, leaving behind a double-strand DNA break. (b) Copy-out: retroelements such as Ty1 and L1 are “copied out” of donor DNA as RNA transcripts. (c) Paste-in: mobile elements that are “pasted in,” such as Tn10 and Tn5, undergo a strand exchange reaction in which the DNA that encodes the element is joined with the target DNA, as opposed to a polymerase enzyme making a copy of the element at the new site (see d, copy-in). (d) Copy-in: elements that are “copied in” require the activity of a DNA polymerase to make a new copy of the element at the target site, using the target-site DNA as a primer for synthesis and the DNA or RNA of the element as a template. Tn3 is an example of an element that uses this strategy

Anatomy of a Transposon

In their most basic form, transposons are composed of terminal inverted-repeat sequences flanking a single open reading frame encoding a transposase. These simple elements are called insertion sequences (IS) (Siguier et al. 2014). The inverted-repeat (IR) sequences at the ends of the element contain the binding site or sites for the transposase protein (Fig. 2). If an end contains multiple binding sites, the most terminal site marks where the transposase carries out strand breakage and joining. In simple IS elements, identical IRs flanking the element, each with a single transposase binding site, may be all that is required; however, more complex elements often have multiple transposase binding sites at their ends. Some elements, such as Tn7, have different numbers of transposase binding sites in the right and left ends, allowing the two ends to be distinguished (Craig 2002). For Tn7, this characteristic has implications for the regulation of transposition. IRs often contain additional information, such as a promoter for the transposase gene.

Some mobile elements can be composed simply of inverted-repeat sequences, such as miniature inverted-repeat transposable elements (MITEs); however, these elements are considered to be degenerate products of fully functional transposons and do not move on their own (Tropp 2012). Elements that do not encode a transposase of their own depend on the expression of transposase proteins from elsewhere in the genome. The transposons originally discovered by Barbara McClintock, the Ac and Ds elements, form an autonomous and nonautonomous pair (McClintock 1950); Ac elements encode a transposase gene, flanked by inverted repeats, and Ds elements contain inverted repeats that rely on the presence of Ac-encoded transposase proteins for their mobility (Lazarow et al. 2013).

The directionality of the IR sequences dictates the orientation of transposase binding and is therefore important in the formation of nucleoprotein complexes, called transpososomes, that provide the precise arrangement of all components in the transposition reaction (Fig. 2; Gueguen et al. 2005). The DNA ends of the IRs are used as substrates in the chemical reactions catalyzed by the transposase.

Transposons, Fig. 2 The key elements of the simplest form of a transposon (an insertion sequence, or IS). Labeled features: transposase gene (missing in MITEs), transposase binding sites (inverted repeats), break/join site, and target-site duplication. ISs encode a single transposase gene; however, more complicated transposons can encode much more. The DNA ends of the transposon include inverted repeats that act as binding sites for the transposase. They are oriented in inverse directions to direct the activity of the transposase to the 3′ end of the transposon DNA, where it carries out breaking and joining reactions. There may be more than one binding site, where additional transposase proteins aid in forming the appropriate structures for catalysis. Transposons are flanked by direct repeats that do not participate in the transposition reaction, but are a consequence of the transposition process that brought the element to its current site

In more complex transposable elements, additional factors may accompany the transposase gene between the transposon ends. Some of these factors may be directly involved in the transposition process; however, some genes may simply be “along for the ride,” providing some benefit to the element and/or host (Fig. 3). Factors that do not participate in transposition can include antibiotic resistance genes, components of metabolic pathways, DNA restriction and modification enzymes, toxin/antitoxin systems, and many other classes of genes (Craig 2002; Darmon and Leach 2014). Any genetic information found between the IRs of the transposon will be transported along with the rest of the element. In some cases, host genes may become mobilized by transposition when two copies of the same IS element insert near one another, flanking host genes (Reznikoff 1993). Such elements are referred to as composite transposons. Tn10 is a composite transposon with IS10 ends (Chalmers and Kleckner 1996). The innermost ends of the IS10 elements have lost their functionality, so the outside ends must be used for transposition, mobilizing all the genetic information found between them.

Some transposons may require additional noncoding sequences within the element that help in the transposition process in some way. For example, bacteriophage Mu requires a sequence that enhances gyrase activity, which promotes formation of the paired-end complex (see below) (Gueguen et al. 2005). Tn3 also contains a cis-acting res site that is required for resolving transposition intermediates.
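The terminal inverted repeats described at the start of this section have a simple sequence signature: read 5′ to 3′ on one strand, the left end of the element matches the reverse complement of the right end. The sketch below, in Python, measures the length of a perfect terminal inverted repeat. It is a minimal illustration only: the function names are hypothetical, the test sequence is an invented toy rather than a real IS element, and real inverted repeats are frequently imperfect, which this check does not tolerate.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the longest perfect inverted repeat at the two ends of `element`.

    Position n of the left end is compared with the complement of position n
    counted inward from the right end; scanning stops at the first mismatch.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


# Toy element (invented for illustration): 9-nt left IR, internal region, 9-nt right IR.
LEFT_IR = "CTGACTCTT"
INTERNAL = "ATGCCCGGGTTTTAA"
toy_element = LEFT_IR + INTERNAL + reverse_complement(LEFT_IR)

print(terminal_inverted_repeat_length(toy_element))
```

A practical tool would allow mismatches and would also look for the short direct repeats (target-site duplications) immediately outside the two ends.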

Transposons, Fig. 3 Mobile genetic elements exist at an array of different levels of complexity. Labeled features: inverted repeats, disabled repeat, transposase gene, antibiotic resistance or other fitness-related gene, resolution site, and site-specific recombinase. (a) MITEs: the simplest DNA elements are composed of inverted repeats that may be recognized by a transposase encoded elsewhere in the genome. (b) Insertion sequences are the simplest autonomous mobile elements, encoding the transposase that is necessary to carry out strand exchange reactions, and they possess the required inverted repeats that are recognized by the transposase proteins. (c) Compound transposons are elements that are composed of two insertion sequences flanking genes that were not previously associated with the insertion sequences. In these elements (such as Tn10) at least one of the inside inverted repeats is often disabled, so only the outermost inverted repeats can be used for successful transposition events. (d) Complex transposons: there are many other transposons (such as Tn7, Tn3, and Tn21) that have more complicated arrangements that may also include other recombination systems that are either required for additional processing of the transposon after it has been mobilized (e.g., resolution systems for replicative transposons) or are simply mobilized along with the rest of the transposon (e.g., integron cassette systems)

Formation of the Transpososome: The Nucleoprotein Complex Required for Transposition

The assembly and coordination of all of the components involved in transposition is a complex and highly regulated process (Gueguen et al. 2005). The nucleoprotein complex that mediates transposition is referred to as the transpososome (Fig. 4). This complex establishes the precise arrangement of all of the substrates in the transposition reaction, including transposon ends, transposase enzymes, water molecules (in the case of DDE-type transposases), and target DNA. The transpososome is a dynamic and flexible complex that undergoes several conformational changes to accomplish the DNA transactions that lead to movement of the element and also to usher in host proteins needed for repair afterward. Formation of the transpososome is often assisted by host-encoded DNA bending and bridging proteins, such as HU, IHF, H-NS, and Fis (Craig 2002). In some cases, these accessory proteins are not absolutely required, but stimulate transposition by enhancing the formation or stability of the transpososome. Transposition proceeds in an ordered fashion through the following steps:

1. Transposase binds to the transposon ends.
2. Transposon ends are brought together to form a paired-end complex.
3. Rearrangement of transpososome components leads to the formation of a stable synaptic complex.
4. The donor DNA is cleaved.
5. Further rearrangement of DNA substrates leads to formation of a strand transfer complex, joining the transposon ends to the target DNA molecule.
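The strictly ordered progression listed above can be pictured as a one-way state machine in which each stage may only be entered from the one immediately before it. The Python sketch below is a conceptual aid only; the stage names paraphrase the five steps, and the `Transpososome` class and its `advance` method are hypothetical constructs, not a model of any particular element.

```python
from enum import IntEnum


class Stage(IntEnum):
    FREE = 0                      # no transposase bound yet
    ENDS_BOUND = 1                # step 1: transposase binds the transposon ends
    PAIRED_END_COMPLEX = 2        # step 2: ends brought together
    STABLE_SYNAPTIC_COMPLEX = 3   # step 3: rearrangement into a stable synapse
    DONOR_CLEAVED = 4             # step 4: donor DNA cleaved
    STRAND_TRANSFER_COMPLEX = 5   # step 5: ends joined to target DNA


class Transpososome:
    """Toy model: stages must be visited in order, with no skipping or reversal."""

    def __init__(self) -> None:
        self.stage = Stage.FREE

    def advance(self, to: Stage) -> None:
        if to != self.stage + 1:
            raise ValueError(f"cannot go from {self.stage.name} to {to.name}")
        self.stage = to


complex_ = Transpososome()
for next_stage in list(Stage)[1:]:
    complex_.advance(next_stage)
print(complex_.stage.name)
```

Forbidding backward transitions is a deliberate simplification that reflects the statement that each step increases the stability of the complex.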

Each of the steps increases the stability of the complex, driving reactions to completion. In some cases, such as in Mu, the final complex is so stable that it must be removed by the ATP-dependent activity of ClpX, a protein-unfolding function produced by Mu’s host organism, E. coli.

The DDE-Type Transposase

The proteins that carry out the biochemical reactions involved in DNA strand cleavage and joining are called transposases. A major class of these proteins, the DDE-type transposases, contains domains that carry two aspartic acid residues and a single glutamic acid residue that are essential for the coordination of two divalent magnesium ions, similar to the arrangement used by RNase H enzymes (Fig. 5). Some elements use slight variations on this theme, such as DDD, DDN, or DDH, but function in the same way (Dyda et al. 2012; Siguier et al. 2014). The magnesium ions stabilize the transition state that is required both for cleavage of the DNA backbone in the donor DNA molecule and for mediating the strand exchange reaction that joins this newly cleaved DNA with the target DNA. In strand cleavage, a single water molecule is used as a nucleophile in a reaction that generates a free 3′-OH group at the end of the transposon. This 3′-OH is, in turn, used as the nucleophile in joining the transposon end to the target DNA molecule. This reaction is referred to as a transesterification reaction because it results in the exchange of phosphodiester bonds between the sugar–phosphate backbone of donor and target DNA molecules. The result is one strand of DNA, from the end of the transposon, joined within a new target DNA. The target DNA now has a free 3′-OH group where target DNA has been exchanged with the transposon end. The remaining 3′-OH group is then used by the host DNA polymerase and ligase enzymes to restore the target DNA molecule to a fully double-stranded DNA form.

Transposons, Fig. 5 DDE-type transposase proteins contain a highly conserved aspartate (D), aspartate (D), and glutamate (E) motif that allows them to coordinate two divalent magnesium cations. This arrangement is characteristic of RNase H-like domains. The magnesium ions stabilize the transition state of the DNA hydrolysis and transesterification reactions involved in DNA strand exchange. The sequence of reactions is shown in the inset, where black lines represent donor DNA, and thick green lines represent target DNA

The activities of the transposases must be coordinated at both ends of the element to ensure that the entire transposition reaction goes to completion (Nagy and Chandler 2004). For this reason, both ends of the transposon, bound by the transposase proteins, are drawn together in a nucleoprotein complex that sometimes includes the target DNA molecule as well. The assembly of this nucleoprotein complex ensures that the breaking and joining reactions can be coordinated at both ends of the element, prevents insertion of one end of the element into a site that is internal to the transposon, and enables both ends of the transposon to insert at the same site in the target DNA, as opposed to sites that are distant from one another, which would potentially result in large deletions (Darmon and Leach 2014).

Transposons, Fig. 4 Formation of the transpososome complex is a key step in the transposition process and is heavily regulated. The example shown here is the formation of the Tn5 transpososome. (a) Transposase proteins (pink half circles) are expressed from the transposase gene in the DNA element. (b) Transposase proteins bind to the ends of the transposon and form a paired-end complex. (c) Donor DNA is cleaved by the transposase proteins. In the case of Tn5, the element is removed from donor DNA before a target DNA has been identified. This is accomplished by the formation of DNA hairpins at the ends of the element. (d) The transpososome captures a target DNA molecule and cleaves the hairpins at the ends of the element to generate free 3′-OH groups. The target DNA is cleaved, and the 3′ ends of the transposon are used in a transesterification reaction that joins the transposon DNA to the target DNA. (e) The transpososome disassembles, sometimes requiring proteolysis of the transposase proteins

Since both transposon ends, along with the transposase proteins, cannot occupy the same physical space on the DNA, the sites of strand exchange are staggered, typically by 2–9 nt. This constraint gives rise to one of the hallmarks of the transposition process, target-site duplications. The duplications arise because the top strand of the target is joined to one end of the transposon and the bottom strand is joined to the opposite end. Following repair of the gap, DNA sequence derived from the top strand of the target will be present at one end of the element, and identical DNA sequence information derived from the bottom strand of the target site will be present at the other end. The target-site duplications are generally 2–9 nt in length, but can sometimes be as long as 250 nt, depending on the distance between top- and bottom-strand joining.

In its simplest form, joining both ends of the transposon to the target DNA is enough to mobilize the element (Fig. 6). If strand transfer at the ends of the element is not accompanied by cleavage of the strand that has not been transferred, the so-called second strand, then the target DNA and the donor DNA will remain joined in a cointegrated DNA molecule (Tropp 2012). As previously mentioned, the 3′-OH groups remaining at both ends flanking the element after cleavage of the target DNA molecule are used to prime DNA synthesis from both ends, proceeding through the DNA element and ending where the opposite end of the element was joined. Since DNA replication is involved in inserting the DNA element at the new site, this mechanism is called replicative transposition. The element is both copied out of the donor DNA molecule and copied into the target DNA molecule, so it is considered a copy-out and copy-in mechanism. In some cases, such as Tn3 and Tn917, the cointegration of donor DNA and target DNA is resolved by a separate site-specific recombination system. In other elements, such as bacteriophage Mu, the cointegrate is left unresolved.

In some transposition systems, the second strand is also cleaved such that the transposable element is removed entirely from the donor DNA (Fig. 6) (Turlan and Chandler 2000). These elements are commonly called cut-and-paste elements, since they are cut out of one DNA and joined directly into another or “pasted in.” There are several ways in which the second strand is cleaved:

• Hairpin formation at the ends of the element
   • Tn10 (IS10), Tn5 (IS50), hAT
• Endonuclease activity from a second transposon-encoded protein
   • Tn7
• Cleavage of both strands by the transposase
   • Tc/mariner
• Asymmetric cleavage and same-side joining (forming a circular ssDNA intermediate)
   • IS1, IS3, IS110
• Reverse transcription of an RNA intermediate
   • Ty1, HIV Int

Transposons, Fig. 6 Second strand cleavage (at the 5′ end of the element) can be carried out by (a) formation of DNA hairpin structures at the ends of the transposon DNA (e.g., Tn5 and Tn10). (b) Cleavage by a second protein that is dedicated only to nicking the DNA at the 5′ end (e.g., TnsA of Tn7). (c) Cleavage in an event separate from the transesterification reaction, mediated by the transposase protein (e.g., Mos1). (d) Asymmetrical cleavage of one DNA strand, generating a circular intermediate that must be resolved by DNA replication. (e) Transcription and subsequent reverse transcription of the element (e.g., Ty1, HIV Int). This mechanism still requires 3′ processing by the integrase to generate a new 3′-OH

In hairpin formation, the free 3′-OH formed at the DNA end is joined with the strand opposite the cleavage site, severing the element from the donor molecule. The ends remain bound to the transposase proteins after excision of the element, and the process is reversed once an appropriate target DNA molecule is bound. The ends are, once again, cleaved and joined, this time to the target DNA molecule. Since the element is no longer connected to the donor DNA, when DNA polymerase extends the 3′-OH left in the target DNA molecule, it only synthesizes the short stretch of DNA that is left by the staggered strand joining; there is no additional DNA to use as a template as seen in replicative transposition. Elements that use this strategy of second strand cleavage include Tn5 and Tn10. Elements within the hAT superfamily, such as hobo, Hermes, and Ac, employ a mechanism in which a hairpin is formed at the donor DNA ends, releasing the double-stranded DNA element. The V(D)J recombination process that generates diversity in immunoglobulin genes in mammals also functions by a similar mechanism, and a hAT-like transposon is thought to be an evolutionary antecedent of V(D)J recombination (Curcio and Derbyshire 2003).

Tn7 and related transposons encode an additional protein, separate from the transposase, that specifically cleaves the second strand (Craig 2002). This activity enables the transposon to cut the genetic element out of the donor DNA without forming hairpins at the ends, but it also means that the transposase and endonuclease activities must be coordinated. In Tn7, excision from the donor DNA does not occur until a target DNA molecule has been identified and bound within a nucleoprotein complex.

Tc/mariner elements use the same transposase that cleaves the transferred strand to cleave the non-transferred strand. Since the same transposase subunit cleaves both strands, a large conformational change is required in the transpososome to move the transferred strand out of the active site and the non-transferred strand in.

Elements in the IS3 family employ an interesting mechanism of excision that involves an asymmetrical cleavage event of a single DNA strand at one end of the element (Curcio and Derbyshire 2003). The 3′-OH generated by this cleavage event is transferred to the backbone of the same DNA strand upstream of the cleavage site (Fig. 6d). This generates a closed circular single-stranded DNA molecule that is then made double-stranded by DNA replication machinery using an RNA transcript as a primer. Formation of the circular DNA molecule establishes a strong promoter for transcription by joining −35 and −10 regions that were previously on opposite ends of the element. This mechanism helps to ensure that transposition goes to completion once the initial ssDNA circle is formed. The circular dsDNA intermediate is then free to insert elsewhere. This seemingly complicated mechanism is among the most common and is often referred to as a copy-out and paste-in strategy.
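The promoter assembly described above for IS3-family circles can be illustrated with a toy sequence: a −35 element near one end and a −10 element near the other do not form a promoter in the linear element, but they do once the ends are joined. The Python sketch below makes several assumptions that are not taken from this entry: it uses the E. coli σ70 consensus hexamers (TTGACA and TATAAT) with a 16–18 nt spacer as the definition of a promoter, it places the −35 hexamer at the right end and the −10 hexamer at the left end, and the sequence itself is invented. Function names are hypothetical.

```python
import re

# Assumed consensus promoter: -35 hexamer, 16-18 nt spacer, -10 hexamer.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"
PROMOTER = re.compile(f"{MINUS_35}[ACGT]{{16,18}}{MINUS_10}")


def has_promoter(seq: str) -> bool:
    """True if the sequence contains a -35/-10 pair with consensus spacing."""
    return PROMOTER.search(seq.upper()) is not None


def junction_sequence(element: str, window: int = 40) -> str:
    """Sequence spanning the joined ends of the circularized element.

    The right end of the linear element is placed directly upstream of the
    left end, as happens when the two ends are joined into a circle.
    """
    return element[-window:] + element[:window]


# Invented toy element: -10 hexamer 9 nt inside the left end,
# -35 hexamer 8 nt inside the right end (9 + 8 = 17 nt spacer after joining).
LEFT_END = "GGCATCGCA" + MINUS_10
INTERNAL = "GC" * 30
RIGHT_END = MINUS_35 + "CCGTAGCT"
toy_element = LEFT_END + INTERNAL + RIGHT_END

print("linear element has promoter:", has_promoter(toy_element))
print("circle junction has promoter:", has_promoter(junction_sequence(toy_element)))
```

Because the promoter exists only in the circular form, strong transposase expression in this picture is coupled to the presence of the transposition intermediate itself, consistent with the statement that the mechanism helps drive transposition to completion.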

When RNA Is an Intermediate: Retrotransposons

Retroelements are copied out of the donor DNA molecule as an RNA by a host-encoded RNA polymerase (Fig. 6e). In order to be inserted back into a new DNA molecule, the RNA molecule must be converted to DNA; host genetic systems do not tolerate a mix of RNA and DNA within their genomes. This activity is accomplished by a reverse transcriptase that is typically encoded by the element itself. In some cases elements may use reverse-transcriptase proteins that are encoded by other mobile elements, or in the case of plants, these reverse transcriptases may actually be encoded within the host genome.

The retrotransposons comprise two major groups, the LTR (long terminal repeat) transposons and the non-LTR transposons (Craig 2002). Their names refer to differences in the sequence composition of the elements at their ends; however, the differences in the mechanisms by which these elements are mobilized are much more profound. LTR elements move by a copy-out and paste-in mechanism, while non-LTR elements move in a copy-out and copy-in manner.

LTR transposons contain long repeated sequences that delineate the ends of the element and will eventually serve as a binding site for proteins that are involved in the element’s insertion into a new DNA molecule. These elements are evolutionarily related to retroviruses, and many (e.g., the yeast Ty1, Ty3, and Tf1) also contain functional proteins that enable them to make “viruslike particles” (VLPs) outside the nucleus, where they are processed by the reverse-transcriptase protein (often referred to as “Pol”) (Curcio and Derbyshire 2003; Tropp 2012). The similarity between retrotransposons and retroviruses makes them useful systems for the study of key functions that affect retroviral replication. These elements are typically transcribed by the host’s RNA polymerase II enzyme, which is responsible for transcription of most cellular mRNAs. After transcription the elements are converted from RNA to DNA by the activity of the reverse-transcriptase protein.

Transposons, Fig. 7 HUH transposases generate a covalent 5′ phosphotyrosine intermediate (Y) to cleave DNA. The conserved histidine residues (H) in these transposases are used to help coordinate a metal ion that is used in the activation and coordination of the DNA nicking reaction. In this example, glutamine (Q) is also involved in coordinating the metal ion. Red arrows indicate the transfer of electrons in the process. The process is reversed to rejoin DNA ends

A DDE-type recombinase protein (often called “integrase” with these elements) binds to the LTR regions at the ends of the element, activating them for transposition by exposing free 3′-OH groups at the 3′ end of both strands. The protein–DNA complex then identifies an appropriate target site, and the integrase mediates a strand exchange reaction in which the activated 3′-OH ends are used as nucleophiles to attack the 5′ PO4 within the backbone of the recipient DNA molecule. This process happens simultaneously at both ends, but at staggered sites on the recipient DNA molecule. Since the integration of the ends is staggered, a duplication of the target-site sequence results, one copy just outside each end of the element.
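The way staggered joining produces a target-site duplication can be followed on a single strand of sequence. In the Python sketch below, the two ends of the element are joined to the target at sites offset by `stagger` nucleotides; after gap repair, the nucleotides between the two join sites appear once on each side of the element. The function name, the coordinate convention, and the example sequences are hypothetical, and only the top strand of the repaired product is represented.

```python
def insert_with_stagger(target: str, element: str, position: int, stagger: int) -> str:
    """Top strand of the target after staggered insertion and gap repair.

    `position` is the join site on one strand and `position + stagger` the
    join site on the other. The `stagger` nucleotides between the two sites
    are single-stranded gaps after strand transfer; filling them in copies
    that stretch onto both sides of the element.
    """
    if stagger < 0 or position < 0 or position + stagger > len(target):
        raise ValueError("join sites must lie within the target")
    upstream = target[:position]
    duplicated = target[position:position + stagger]
    downstream = target[position + stagger:]
    return upstream + duplicated + element + duplicated + downstream


# Toy example: a 5-nt stagger duplicates "ACAGG" on both sides of the element.
target = "GATTACAGGCTA"
product = insert_with_stagger(target, "NNNN", position=4, stagger=5)
print(product)
```

With a stagger of zero the function returns a simple insertion with no duplication, which corresponds to the elements described in this entry that do not generate target-site duplications.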

DEDD Transposases

Some elements use transposases that resemble RuvC-like enzymes, which resolve Holliday junctions, the intermediates of homologous recombination (Siguier et al. 2014). Very little is known about how DEDD-type transposases function in transposition, but given their similarity to RuvC, it is likely that this transposition mechanism also involves Holliday junction formation. RuvC, as well as DEDD-type transposases, contains a conserved RNase H fold, just as DDE-type transposases do (Buchner et al. 2005). These recombinases do not generate target-site duplications (Prosseda et al. 2006). IS110 is an example of a transposon that encodes a DEDD-type transposase.

HUH Transposases

Transposons that employ HUH-type transposases function by a mechanism that is very different from that of DDE- and DEDD-type transposases. HUH transposases form covalent phosphotyrosine intermediates during the process of transposition (Fig. 7; Chandler et al. 2013; Dyda et al. 2012). Superficially, this mechanism is reminiscent of tyrosine recombinase reactions used in some site-specific recombination systems; however, HUH transposases and tyrosine recombinases are unrelated. This class of transposition does not generate target-site duplications. HUH transposases leave behind a free 3′-OH in the donor DNA, following strand cleavage, which can be extended by DNA replication machinery, whereas tyrosine site-specific recombinases leave behind a 5′-OH, which cannot be extended. This distinction is significant in that DNA replication plays an important role in Y2 transposons that are part of the HUH family. As in other transposition systems, HUH transposons do not require specific binding sites for mobilization. This class of element is mechanistically related to relaxases or Rep proteins that are involved in the transfer of conjugal plasmids and replication of rolling-circle plasmids, respectively.

HUH transposases receive their name from the conserved histidine (H)–bulky hydrophobic residue (U)–histidine (H) motif that is important for their function. These residues form a motif that helps coordinate a divalent metal ion, usually Mg2+ or Mn2+, which is used in cleavage of the donor DNA backbone, an arrangement similar to that seen in DDE-type transposases. An additional key feature of these transposases is at least one conserved tyrosine, or serine, that is used to form a covalent protein–DNA bond. Transposons that attach to DNA by a single tyrosine are called Y1 transposons. Those that involve phosphoserine linkage are called S1 transposons. Yet another class uses covalent linkage with two separate tyrosine residues; these elements are referred to as Y2 transposons.

IS200/IS605 elements serve as models for the activity of the Y1-type transposons (Chandler et al. 2013). The transposases of the IS200/IS605 family function as a dimer. After cleavage of the donor DNA backbone, a conformational change occurs, and the 3′-OH attacks the phosphotyrosine bond at the opposite end, joining the ends of the donor molecule. A similar reaction occurs between the ends of the element, forming a closed circular ssDNA composed entirely of the element DNA. A reversal of the reaction is used to insert the element DNA into a single-stranded target DNA.


Rather than inverted-repeat sequences, IS200/IS605 transposons are flanked by DNA sequences that form conserved secondary structures when the element is single stranded. The structures are recognized and bound by the transposase proteins only when the element is in a single-stranded form, enforcing a dependence on active DNA replication for transposition (Fig. 8). Discontinuous replication of the donor DNA, such as that seen on the lagging strand during DNA replication, results in transient single-stranded DNA molecules that are competent for transposition. In addition to the use of the secondary DNA structure for end recognition, the transposase proteins interact with short (~5 bp) sequences at the 5′ end of the element, which direct where the transposase cleaves the DNA and guide target-site identification through base-pair interactions. Interactions with the guide DNA help to stabilize the structure of the transpososome. Target DNAs must also be single stranded for insertion to occur. The dependence on discontinuous DNA replication constrains the timing of both excision and insertion of this class of transposon to events such as wide-scale DNA damage, when large ssDNA gaps are frequent. ISDra2 is a HUH-type transposon of Deinococcus radiodurans that is strongly activated following doses of ionizing radiation.

Like DDE-type transposons, IS200/IS605 elements have nonautonomous derivatives. Small IS elements that lack transposase genes have been identified and are referred to as bacterial interspersed mosaic elements (BIMEs). These elements can alter gene regulation and higher-order DNA structure and can serve as specific target sites for certain IS elements.

Y2 transposons include the IS91 family and the helitrons of eukaryotes (Curcio and Derbyshire 2003). Few detailed descriptions of these elements are available; however, they are presumed to mobilize by a mechanism that resembles rolling-circle replication of some plasmids.
Two tyrosine residues within the recombinase protein are essential for transposition, hence the name Y2. Both ssDNA and dsDNA closed circle intermediates have been detected. As previously mentioned, donor strand cleavage generates a free 30 -OH that is used to prime DNA replication. Leading-strand DNA replication is thought to

T

1234

Transposons

a

a

b

g

c

d

f b

c

d Y Y

c

a

Y

b

Y

e

d

cd

ab

Transposons, Fig. 8 HUH transposons move by a singlestranded DNA intermediate. (a) The transposon DNA is shown in red, between the two arbitrary DNA sites (marked a and b to distinguish the sites). (b) The lagging strand during DNA replication (dotted green arrow) is transiently in a single-stranded form, allowing the formation of secondary structures (squiggled red lines) that are necessary for recognition of the element’s ends by the transposase (blue circles). (c) The ends are brought together, and the DNA is cleaved, with a transposase linked to the donor DNA at the 30 end of the element and

a transposase linked to the transposon DNA at the 50 end. The donor DNA is rejoined, bringing sites a and b together, by the free 30 end near a attacking the phosphotyrosine bond near site b. A similar reaction circularizes the transposon DNA. (d) The donor DNA is made double-stranded again by DNA replication. (e) A new target DNA is identified (c, d). The new target must also be single-stranded DNA. (f) The process that liberated the transposon DNA is reversed at the new site. (g) The newly integrated transposon is made double-stranded by DNA replication

displace the element. As a consequence of the inefficient termination of replication in these elements, ~1% of the mobilized elements also transfer genetic information from the donor molecule that lies adjacent to the typical end of the Y2 element. IS91 elements do contain inverted repeats, which are involved in end recognition.
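The notion of inverted repeats at the ends of an element, as noted above for IS91, can be made concrete with a short sequence check. The Python sketch below is illustrative only: the function names and the toy sequence are assumptions introduced here and are not taken from any real IS91 element. It also assumes perfect repeats, whereas terminal inverted repeats in real elements are often imperfect.

```python
# Illustrative sketch: measure a perfect terminal inverted repeat.
# Function names and the example sequence are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the longest perfect inverted repeat shared by the two ends.

    The left end, read 5' to 3' on the top strand, is compared with the
    right end, read 5' to 3' on the bottom strand (that is, with the start
    of the reverse complement of the whole element).
    """
    element = element.upper()
    bottom = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == bottom[n]:
        n += 1
    return n


# Toy element: left end GGCAT, right end ATGCC (its reverse complement).
toy_element = "GGCATTTTCCATGCC"
repeat_length = terminal_inverted_repeat_length(toy_element)
```

For the toy sequence the two ends match over five bases. A sequence such as AAAA, whose ends are not reverse complements of each other, gives a length of zero.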

Serine Transposases

Serine transposases function in much the same way that serine recombinases of site-specific recombination systems do and have been found in both bacteria and eukaryotes. IS607 is an example of this class of transposon. The transposase of IS607 generates 5′ phosphoserine intermediates and appears to operate as a tetramer, just as its serine site-specific recombinase cousins do (Fig. 9). Unlike the resolvase/integrases and large serine recombinases, the domain structure of IS607 transposases is switched, with the catalytic domain residing in the C-terminus and the DNA binding domain in the N-terminus.

Transposons, Fig. 9 Serine transposases function very similarly to serine site-specific recombinases. (a) Recombinases (yellow and blue circles) form tetramers and bind to specific sites within a circularized transposon DNA molecule and bind to a new target DNA. (b) The individual recombinase proteins form phosphoserine intermediates. (c) A 180° rotation with respect to their partner dimers aligns the new DNA partners. (d) The free 3′ ends (arrowheads) attack the phosphoserine intermediates, rejoining the DNA

A model for the mechanism of IS607 has been proposed involving transposase tetramers initially forming on a single DNA molecule in an arrangement that is prepared to receive a target DNA molecule (Fig. 10; Boocock and Rice 2013). Transposase dimer binding by sequence-specific interaction with the ends of the element triggers a conformational change that exposes a protein–protein interaction interface, allowing binding of an additional dimer. The assembly of the tetramer positions the DNA binding domains of the newly arrived dimer, making them ready to bind target DNA. This preassembly step may increase the affinity of the complex for random DNA sequence, as opposed to the stepwise assembly of inactive dimers seen in site-specific recombination. This model assumes that the random DNA binding ability of each monomer is weak, but the overall complex is of sufficient stability to coordinate the DNA interaction needed in the reaction.

Transposons, Fig. 10 IS607 is an example of a serine transposon. (a) Transposition occurs through a circular double-stranded DNA intermediate. The transposase proteins have an N-terminal DNA binding domain (inset, blue rectangle) and a C-terminal catalytic domain (inset, blue circle). (b) A dimer of transposase proteins (blue) binds a specific site on the circularized transposon DNA molecule. This dimer recruits an additional dimer to make a tetramer. (c) Since the DNA binding domains are preassembled, the nonspecific DNA binding ability of the tetramer is much greater than that of the dimer, so it can now bind to target DNA nonspecifically. (d) Strand exchange reactions join transposon and target DNA molecules

The assembled IS607 transposase tetramers then bridge donor and target DNAs and cleave DNA in trans: the DNA binding domain of a monomer binds the target DNA, whereas the catalytic domain of the same monomer cleaves the donor DNA, and vice versa, much as many DDE-type transposases operate. Just as seen in other serine recombinases, the transposase forms a phosphoserine covalent bond. After cleavage, a 180° rotation of the two transposase dimers repositions target and donor DNA ends, and the reverse reaction is carried out. The ends of the transposon are then joined with the target DNA. Further study of IS607-like elements is required to confirm this model.
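The net outcome of the cleavage, rotation, and rejoining steps described above can be illustrated with a toy string model. The Python sketch below is a deliberate simplification under stated assumptions: DNA is represented as a single strand of letters, the staggered cuts and the protein complex are ignored, and the function name, cut positions, and sequences are hypothetical choices made for illustration rather than features of IS607.

```python
# Toy model of conservative integration of a circular transposon
# intermediate into a linear target. All names and sequences are
# hypothetical; staggered cuts and protein chemistry are not modeled.


def integrate_circle(circle: str, circle_cut: int,
                     target: str, target_cut: int) -> str:
    """Join a circular DNA into a linear target at the given cut positions.

    `circle` is written linearly, with its last base understood to be
    joined to its first. Cutting it at `circle_cut` opens the circle.
    """
    # Cleavage opens the circle at the transposon junction.
    opened = circle[circle_cut:] + circle[:circle_cut]
    # Cleavage of the target yields a left arm and a right arm.
    left_arm, right_arm = target[:target_cut], target[target_cut:]
    # After rotation, each target arm is rejoined to one transposon end.
    return left_arm + opened + right_arm


# Uppercase marks transposon DNA, lowercase marks target DNA.
product = integrate_circle("GGTTAA", 2, "aaaccc", 3)
```

In this example the product is aaaTTAAGGccc. No bases are gained or lost, which reflects the conservative nature of the exchange: every broken end is rejoined to a new partner.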

Cross-References
▶ DNA Recombination, Mechanisms of
▶ DNA Repair Polymerases
▶ DNA Replication
▶ Double-strand Break Repair
▶ Homologous Recombination in Lesion Bypass
▶ V(D)J Recombination

References
Boocock MR, Rice PA (2013) A proposed mechanism for IS607-family serine transposases. Mob DNA 4:24
Buchner JM, Robertson AE, Poynter DJ, Denniston SS, Karls AC (2005) Piv site-specific invertase requires a DEDD motif analogous to the catalytic center of the RuvC Holliday junction resolvases. J Bacteriol 187:3431–3437
Chalmers RM, Kleckner N (1996) IS10/Tn10 transposition efficiently accommodates diverse transposon end configurations. EMBO J 15:5112–5122
Chandler M, de la Cruz F, Dyda F, Hickman AB, Moncalian G, Ton-Hoang B (2013) Breaking and joining single-stranded DNA: the HUH endonuclease superfamily. Nat Rev Microbiol 11:525–538
Craig NL (2002) Mobile DNA II. ASM Press, Washington, DC
Curcio MJ, Derbyshire KM (2003) The outs and ins of transposition: from mu to kangaroo. Nat Rev Mol Cell Biol 4:865–877
Darmon E, Leach DR (2014) Bacterial genome instability. Microbiol Mol Biol Rev 78:1–39
Dyda F, Chandler M, Hickman AB (2012) The emerging diversity of transpososome architectures. Q Rev Biophys 45:493–521
Gueguen E, Rousseau P, Duval-Valentin G, Chandler M (2005) The transpososome: control of transposition at the level of catalysis. Trends Microbiol 13:543–549
Lazarow K, Doll ML, Kunze R (2013) Molecular biology of maize Ac/Ds elements: an overview. Methods Mol Biol 1057:59–82
McClintock B (1950) The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A 36:344–355
Nagy Z, Chandler M (2004) Regulation of transposition in bacteria. Res Microbiol 155:387–398
Prosseda G, Latella MC, Casalino M, Nicoletti M, Michienzi S, Colonna B (2006) Plasticity of the Pjunc promoter of ISEc11, a new insertion sequence of the IS1111 family. J Bacteriol 188:4681–4689
Reznikoff WS (1993) The Tn5 transposon. Annu Rev Microbiol 47:945–963
Siguier P, Gourbeyre E, Chandler M (2014) Bacterial insertion sequences: their genomic impact and diversity. FEMS Microbiol Rev https://doi.org/10.1111/1574-6976.12067. [Epub ahead of print] 1–28
Tropp BE (2012) Molecular biology: genes to proteins, 4th edn. Jones & Bartlett Learning, Sudbury
Turlan C, Chandler M (2000) Playing second fiddle: second-strand processing and liberation of transposable elements from donor DNA. Trends Microbiol 8:268–274

TROSY, Transverse Relaxation-Optimized Spectroscopy ▶ NMR Approaches to Determine Protein Structure

Type 2 Diabetes ▶ Mammalian Solute Carrier Families SLC2 and SLC5: Facilitative and Active Transport of Hexoses and Polyols

U

Ultraviolet Light DNA Damage
Frederick Peter Guengerich
Department of Biochemistry and Center in Molecular Toxicology, Vanderbilt University School of Medicine, Nashville, TN, USA

© Springer Science+Business Media, LLC 2018 R.D. Wells (et al.), Molecular Life Sciences, https://doi.org/10.1007/978-1-4614-1531-2

Synonyms
Sunlight damage to DNA; UV damage to DNA

Synopsis
Sunlight is healthy and necessary in vitamin D activation, but it is also a major cause of skin cancer. UV light causes the formation of several DNA adducts, of which the most serious ones are cyclobutane dimers and pyrimidine-pyrimidone (6–4) photoproducts. Skin cancer can be understood in the context of these adducts, and a deficiency in DNA polymerase η, the translesion DNA polymerase that copies past the major adduct, gives rise to a form of the disease xeroderma pigmentosum.

Introduction
Ultraviolet (UV) light is a major issue in human cancer because of skin cancer, which is a very significant problem. Aside from some chemicals that cause skin cancer (e.g., arsenic, which affects the skin in areas not exposed to light), skin cancer can be understood primarily in the context of DNA damage. Further, individuals deficient in certain DNA repair genes, or in DNA polymerases required for efficient bypass of DNA damage (e.g., DNA polymerase η), show extreme sensitivity to damage and high risks of cancer (e.g., xeroderma pigmentosum).

Basis and Relevance

Different wavelengths (energies) of light produce different types of damage. UV-A light exposure (320–400 nm) of the skin produces cyclobutane pyrimidine dimers (CPDs) (Fig. 1). UV-B radiation (290–320 nm) results in the formation of pyrimidine-pyrimidone (6–4) photoproducts. UV-C light (