The OMICs: applications in neuroscience 9780199389803, 0199389802, 9780199855452, 0199855455


English; ix, 374 pages : illustrations (black and white); 2014


Table of contents :
Medical DNA sequencing in neuroscience / Karola Rehnström, Arvid Suls, and Aarno Palotie --
Epigenomics : an overview / Kevin Huang and Guoping Fan --
The role of epigenomics in genetically identical individuals / Zachary A. Kaminsky --
Transcriptomics / T. Grant Belgard and Daniel H. Geschwind --
Decoding alternative mRNAs in the "omics" age / Yuan Yuan and Donny D. Licatalosi --
Transcriptomics : from differential expression to coexpression / Michael C. Oldham --
High-throughput RNA interference as a tool for discovery in neuroscience / Lisa P. Elia and Steven Finkbeiner --
The genetics of gene expression : multiple layers and multiple players / Amanda J. Myers --
Proteomics / Jonathan C. Trinidad, Ralf Schoepfer and A.L. Burlingame --
Focused plasma proteomics for the study of brain aging and neurodegeneration / Philipp A. Jaeger, Saul A. Villeda, Daniela Berdnik, Markus Britschgi, and Tony Wyss-Coray --
Cellomics : characterization of neuronal subtypes by high-throughput methods and transgenic mouse models / Joseph Dougherty --
Neuroscience and metabolomics / Reza Salek --
Brain connectomics in man and mouse / Arthur W. Toga, Kristi Clark, Hong Wei Dong, Houri Hintiryan, Paul M. Thompson, and John D. Van Horn --
Optogenetics / Richie E. Kohman, Hua-an Tseng, and Xue Han --
Characterizing the gut microbiome : role in brain-gut function / Gerard Clarke, Paul W. O'Toole, John F. Cryan, and Timothy G. Dinan --
OMICs in drug discovery : from small molecule leads to clinical candidates / B. Michael Silber --
Pharmacogenomics / Steven P. Hamilton --
Multidimensional databases of model organisms / Khyobeni Mozhui and Robert W. Williams --
Network biology and molecular medicine in the post genomic era : the systems pathobiology of network medicine / Stephen Y. Chan and Joseph Loscalzo.

THE OMICs

Applications in Neuroscience

EDITED BY

GIOVANNI COPPOLA

Director, Center for Informatics and Personalized Genomics Semel Institute for Neuroscience and Human Behavior Departments of Psychiatry & Neurology David Geffen School of Medicine University of California, Los Angeles

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide.

Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in
Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016

© Oxford University Press 2014 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer. Library of Congress Cataloging–in–Publication Data OMICS (2014) The OMICs : applications in neuroscience / edited by Giovanni Coppola. p. ; cm. Includes bibliographical references. ISBN 978–0–19–985545–2 (alk. paper) I. Coppola, Giovanni, 1971– editor of compilation. II. Title. [DNLM: 1. Brain—physiology. 2. Genomics—methods. 3. Drug Discovery. 4. Translational Medical Research. QU 460] QP376 612.8′2—dc23 2013024509

9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper

CONTENTS

Contributors

PART ONE: DNA

1. Medical DNA Sequencing in Neuroscience
   Karola Rehnström, Arvid Suls, and Aarno Palotie
2. Epigenomics: An Overview
   Kevin Huang and Guoping Fan
3. The Role of Epigenomics in Genetically Identical Individuals
   Zachary A. Kaminsky

PART TWO: RNA

4. Transcriptomics
   T. Grant Belgard and Daniel H. Geschwind
5. Decoding Alternative mRNAs in the “Omics” Age
   Yuan Yuan and Donny D. Licatalosi
6. Transcriptomics: From Differential Expression to Coexpression
   Michael C. Oldham
7. High-Throughput RNA Interference as a Tool for Discovery in Neuroscience
   Lisa P. Elia and Steven Finkbeiner
8. The Genetics of Gene Expression: Multiple Layers and Multiple Players
   Amanda J. Myers

PART THREE: PROTEIN

9. Proteomics
   Jonathan C. Trinidad, Ralf Schoepfer, and A. L. Burlingame
10. Focused Plasma Proteomics for the Study of Brain Aging and Neurodegeneration
   Philipp A. Jaeger, Saul A. Villeda, Daniela Berdnik, Markus Britschgi, and Tony Wyss-Coray

PART FOUR: CELLS AND CONNECTIONS

11. Cellomics: Characterization of Neural Subtypes by High-Throughput Methods and Transgenic Mouse Models
   Joseph Dougherty
12. Neuroscience and Metabolomics
   Reza M. Salek
13. Brain Connectomics in Man and Mouse
   Arthur W. Toga, Kristi Clark, Hong Wei Dong, Houri Hintiryan, Paul M. Thompson, and John D. Van Horn
14. Optogenetics
   Richie E. Kohman, Hua-an Tseng, and Xue Han
15. Characterizing the Gut Microbiome: Role in Brain–Gut Function
   Gerard Clarke, Paul W. O’Toole, John F. Cryan, and Timothy G. Dinan

PART FIVE: THERAPEUTICS

16. OMICs in Drug Discovery: From Small Molecule Leads to Clinical Candidates
   B. Michael Silber
17. Pharmacogenomics
   Steven P. Hamilton

PART SIX: OMICsOME: INTEGRATION OF OMICs DATA

18. Multidimensional Databases of Model Organisms
   Khyobeni Mozhui and Robert W. Williams
19. Network Biology and Molecular Medicine in the Postgenomic Era: The Systems Pathobiology of Network Medicine
   Stephen Y. Chan and Joseph Loscalzo

Links to Helpful Resources

Index

CONTRIBUTORS

T. Grant Belgard
Department of Psychiatry, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA

Daniela Berdnik
Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA

Markus Britschgi
F. Hoffmann-La Roche AG, pRED, Pharma Research & Early Development, DTA Neuroscience, Basel, Switzerland

A. L. Burlingame
Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA

Stephen Y. Chan
Division of Cardiovascular Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA

Kristi Clark
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Gerard Clarke
Department of Psychiatry and Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland

John F. Cryan
Department of Anatomy and Neuroscience & Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland

Timothy G. Dinan
Department of Psychiatry and Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland

Hong Wei Dong
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Joseph Dougherty
Department of Genetics & Department of Psychiatry, Washington University School of Medicine in St. Louis, St. Louis, MO

Lisa P. Elia
Gladstone Institute of Neurological Disease and Taube-Koret Center for Huntington’s Disease Research, University of California, San Francisco, San Francisco, CA

Guoping Fan
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA

Steven Finkbeiner
Gladstone Institute of Neurological Disease and Departments of Neurology and Physiology, University of California, San Francisco, San Francisco, CA

Daniel H. Geschwind
Departments of Psychiatry, Neurology and Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA

Steven P. Hamilton
Department of Psychiatry and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA

Xue Han, Ph.D.
Assistant Professor, Biomedical Engineering Department, and Joint Professor, Department of Pharmacology and Experimental Therapeutics; Member, Photonics Center, Boston University, Boston, MA

Houri Hintiryan
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Kevin Huang
Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA

Philipp A. Jaeger
Departments of Bioengineering and Medicine, University of California San Diego, La Jolla, CA

Zachary A. Kaminsky
Johns Hopkins University School of Medicine, Department of Psychiatry, Baltimore, MD

Richie E. Kohman
Biomedical Engineering Department, Boston University, Boston, MA

Donny D. Licatalosi
Center for RNA Molecular Biology, Case Western Reserve University, Cleveland, OH

Joseph Loscalzo
Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA

Khyobeni Mozhui
Department of Preventive Medicine, University of Tennessee Health, Memphis, TN

Amanda J. Myers
Laboratory of Functional Neurogenomics, Department of Psychiatry & Behavioral Sciences, Program in Neuroscience, Interdepartmental Program in Human Genetics and Genomics, Center on Aging, University of Miami Miller School of Medicine, Miami, FL

Michael C. Oldham
Department of Neurology, The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, San Francisco, CA

Paul W. O’Toole
School of Microbiology and Alimentary Pharmabiotic Centre, University College Cork, Cork, Ireland

Aarno Palotie
Wellcome Trust Sanger Institute, Hinxton, United Kingdom; and Institute for Molecular Medicine, University of Helsinki, Helsinki, Finland; and Program for Human and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA

Karola Rehnström
Wellcome Trust Sanger Institute, Hinxton, United Kingdom; and Institute for Molecular Medicine, University of Helsinki, Helsinki, Finland

Reza M. Salek
Department of Biochemistry and Cambridge Systems Biology Centre, University of Cambridge, Cambridge CB2 1GA, UK

Ralf Schoepfer
Laboratory for Molecular Pharmacology, NPP (Pharmacology), University College London, London, United Kingdom

B. Michael Silber
Department of Bioengineering and Therapeutic Sciences, Schools of Medicine and Pharmacy, University of California, San Francisco, San Francisco, CA

Arvid Suls
VIB-Department of Molecular Genetics and University of Antwerp, Antwerpen, Belgium

Paul M. Thompson
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Arthur W. Toga
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Jonathan C. Trinidad
Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA; and Department of Chemistry, Indiana University, Bloomington, IN

Hua-an Tseng
Biomedical Engineering Department, Boston University, Boston, MA

John D. Van Horn
The Institute for Neuroimaging and Informatics (INI) and Laboratory of Neuro Imaging (LONI), Keck School of Medicine of USC, University of Southern California, Los Angeles, CA

Saul A. Villeda
Department of Anatomy and the Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, San Francisco, CA

Robert W. Williams
Department of Anatomy & Neurobiology, Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN

Tony Wyss-Coray
Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA; and VA Palo Alto Health Care System, Palo Alto, CA

Yuan Yuan
Laboratory of Molecular Neuro-Oncology, The Rockefeller University, New York, NY

PART I DNA

1 Medical DNA Sequencing in Neuroscience

KAROLA REHNSTRÖM, ARVID SULS, AND AARNO PALOTIE

INTRODUCTION

The aim of medical genetic studies is to identify genetic variants associated with a disorder or trait of interest. A hypothesis-free way to conduct gene mapping studies has been available ever since genetic variants, usually referred to as genetic markers, were identified. The first genetic markers used in gene mapping studies were a small number of blood antigens; later, microsatellites were used. The human reference genome and the HapMap project identified millions of single nucleotide polymorphisms (SNPs) spread across the genome, which provided a much denser map of genetic markers. Today high-throughput sequencing technology has made it possible to decode every base pair in the human genome, enabling the identification not only of sites that are polymorphic in a population but also of private mutations, which are present in only one individual. Despite the feasibility of producing enormous datasets for medical genetic studies, the path from generating the data to identifying the variants involved in the disease, and further converting this to an understanding of biological mechanisms, is still in its early stages.

THE HISTORY OF GENE MAPPING STUDIES

Traditionally, human genetic disorders have been divided into monogenic and complex types. This somewhat simplified division reflects the underlying genetic architecture. Monogenic (or Mendelian) disorders are caused by mutations in one gene. These mutations are highly penetrant and rare in the population (Figure 1.1). Depending on the mode of inheritance, loss of one or two copies is required for the disease to manifest. More than 3,000 such disorders are listed in the Online Mendelian Inheritance in Man (OMIM, www.ncbi.nlm.nih.gov/omim) database, and the causative genes have been identified in one third to half of these (Bamshad, Ng, et al. 2011). Although many disorders, particularly monogenic recessive disorders, are clearly caused by mutations in a single gene, there are likely other genes that can modify the phenotypic features. This could prove particularly true for dominant disorders, because they often display reduced penetrance and the phenotype can be highly variable, even within a family where the primary genetic lesion is shared by all affected individuals.

Genetic mapping of monogenic disorders has been successful. Linkage analysis and subsequent sequence analysis in a small number of families has often resulted in identification of the causative gene. An excellent example of the power of these approaches, and of genetic homogeneity in isolated populations, is the successful mapping of genes for monogenic, often recessive disorders in population isolates such as the Finns or the Hutterites (Boycott, Parboosingh, et al. 2008; Norio 2003). Although linkage studies have identified genes for many monogenic disorders, there are still numerous disorders for which the causative gene or genes are not known. These include disorders where families are too small to provide a linkage signal and cases where genetic heterogeneity between families is very high and traditional methods have not been able to identify the disease genes.

Complex disorders are caused by a combined load of a large number of genetic variants, each of which confers a very small increase in risk (Figure 1.1). These variants are relatively common in the population. The genetic background of complex disorders has been extensively characterized during the last decade using genome-wide association studies (GWAS). In these studies, very large cohorts of samples are genotyped at loci known to be polymorphic in the population. Statistical tests are then performed to determine if a genetic marker is more common in cases than controls. The combination of large-scale SNP identification projects, which allow dense coverage of the whole genome, with technological advances in high-throughput genotyping, which enable the genotyping of tens of thousands of samples, has resulted in the association of thousands of SNP markers with hundreds of diseases and traits (http://www.genome.gov/gwastudies/). However, in most cases the GWAS loci explain only a small to moderate part of the heritability of the traits. For complex disorders, the environment is also likely to play a much larger role than for monogenic disorders and will probably prove to be the main susceptibility factor for some of them. In addition to common variation, rare variants with large effect sizes have also been found to play a role in several complex disorders. GWAS technologies have been poorly equipped to identify such risk variants, whereas large-scale sequencing studies are better equipped to identify them.

[Figure 1.1: plot of effect size (low, modest, high) against allele frequency (rare, low frequency, common). Labels on the plot:
High-penetrance rare mutations cause monogenic (Mendelian) disorders, detectable by linkage or next-generation sequencing in families or affected individuals with shared phenotypic features.
Common, highly penetrant, and deleterious variants are unlikely to exist because of selection.
Low-frequency variants increase risk for oligogenic disorders, detectable by GWAS, linkage, or next-generation sequencing in families or case-control cohorts.
Common variants increase risk for complex disorders, detectable by association studies (GWAS or next-generation sequencing data) in large cohorts.
Rare variants with low effect size are hard to detect, although next-generation sequencing studies in large cohorts could provide limited power to detect some of them.]

FIGURE 1.1: The genetic architecture of diseases and traits ranges from disorders caused by only one highly disruptive and fully penetrant variant to those caused by the additive effects of numerous genetic variants of very small effect, often in combination with environmental factors. Highly disruptive variants (i.e., variants with a large effect size) are rare in the population as they are subject to strong negative selection, whereas variants with lower effect sizes can become more common in the population as one variant alone is insufficient to cause the disorder. Currently available technologies and analysis methods for the identification of these variants have their limitations; choice of the most efficient approach for gene mapping studies depends on the genetic architecture of the trait.

Many disorders cannot be distinguished as being either monogenic or complex, since there are numerous complex disorders that also have monogenic, very severe, and often early-onset forms. For example, meta-analyses of tens of thousands of individuals have revealed dozens of common susceptibility variants for both type 1 and type 2 diabetes (Bradfield, Qu, et al. 2011; Saxena, Elbers, et al. 2012). At the same time, rare mutations in GCK (Froguel, Vaxillaire, et al. 1992) and HNF1A (Yamagata, Furuta, et al. 1996) cause maturity-onset diabetes of the young (MODY), and mutations in KCNJ11 (Gloyn, Pearson, et al. 2004) and ABCC8 (Babenko, Polak, et al. 2006) cause neonatal diabetes, two monogenic forms of diabetes. Similarly, GWAS analyses of blood lipid levels have revealed significant overlap between genes with common susceptibility variants and previously identified genes in familial forms of dyslipidemia (Teslovich, Musunuru, et al. 2010). For many disorders where the molecular etiology is not known, it is not possible to differentiate between monogenic and complex forms of the disorder based on the phenotype alone; therefore several complementary gene mapping efforts are needed to further our understanding of the genetic architecture of genetic disorders and traits.
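The case-control comparison at the heart of a GWAS, testing whether a marker allele is more common in cases than in controls, can be illustrated with a single-marker allelic test. The following is a minimal sketch, not the method of any particular study: the function name and the allele counts are hypothetical, and it assumes a simple Pearson chi-square test on a 2x2 table of allele counts with one degree of freedom.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were equal in cases and controls.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability of the chi-square
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles each.
chi2, p = allelic_chi_square(case_alt=300, case_ref=1700,
                             control_alt=200, control_ref=1800)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a genome-wide setting this test would be repeated for every genotyped marker, so the resulting p-values need correction for the very large number of tests performed.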

CURRENT STATUS

The development of genotyping and sequencing technologies, along with a good partnership between academia and industry, has been essential in changing the landscape of how human disease genomics research is done. During the past 10 years genotyping studies have moved from linkage panels based on 400 microsatellites to genotyping up to a million markers for GWAS and lately to sequencing the complete genome in each study sample. As summarized above, gene mapping technologies have successfully identified genes for monogenic as well as more complex disorders. However, there are many cases where neither approach has been successful. Traditional automated Sanger sequencing is very costly and laborious if large linkage intervals must be sequenced, and GWAS are limited in their power to identify susceptibility factors with a very low allele frequency.

Next-Generation Sequencing Technology

The initial draft of the human genome was produced using automated Sanger sequencing, a technology where modified fluorescent bases are incorporated into a strand of DNA using polymerase chain reaction (PCR) and then separated by gel electrophoresis (Lander, Linton, et al. 2001). However, the completion of the draft sequence took a large consortium of 20 collaborating research groups a decade and cost $3 billion. Clearly technological advances were required to enable large-scale DNA sequencing projects. The term next-generation sequencing (NGS) is used for the high-throughput technologies that have been developed to complement and ultimately replace Sanger sequencing. These methods have been available since 2004 (Margulies, Egholm, et al. 2005) and have brought with them an immense drop in sequencing cost. Until 2007 the reduction in sequencing cost was well modeled by Moore’s law (which describes a long-term trend in the computer hardware industry that involves the doubling of “compute power” every two years and is often used as a standard to assess whether technological development is being successful). Since the beginning of 2008 the drop in sequencing cost has been much faster than predicted by Moore’s law, allowing for the generation of ever-growing datasets (Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts).

NGS has been successfully applied to several areas of genetic and epigenetic research, including but not limited to medical genetic studies, population genetics, evolutionary studies, transcriptomics, and epigenomics. Currently two main approaches are used to generate large-scale resequencing data for medical genetic studies: selective capture of specific genomic regions and whole-genome sequencing (WGS). Capture of selected genomic regions is suitable for projects that target specific genomic regions, such as loci identified in GWAS, or predefined sets of genes (such as synaptically expressed genes). The benefit of targeted sequencing is that, because a limited amount of data is generated per sample, several samples can be pooled together in one run on the sequencing instrument; thus a large number of samples can be included in the study. WGS generates a huge amount of data and requires much more sequencing capacity and storage space per sample. Furthermore, the additional data volume results in analytical and interpretational challenges. On the other hand, WGS is entirely hypothesis-free, as it allows the assessment of all variation present in an individual’s genome.
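The Moore's law benchmark used in this section reduces to simple arithmetic. The sketch below assumes that "doubling of compute power every two years" translates into a halving of cost per unit over the same period; the function name is hypothetical, and the output is a relative cost (starting cost set to 1), not actual sequencing cost data.

```python
def relative_cost_under_moores_law(years_elapsed, halving_period_years=2.0):
    """Cost relative to the starting cost if cost halves every `halving_period_years`."""
    return 0.5 ** (years_elapsed / halving_period_years)


# Relative cost at two-year intervals under the assumed halving schedule.
for years in (0, 2, 4, 6):
    print(years, relative_cost_under_moores_law(years))
```

A cost series that falls below this curve, as the text reports for sequencing since the beginning of 2008, is improving faster than the Moore's law benchmark.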
An often used compromise between the two extremes is whole-exome sequencing (WES), a form of selective capture where all known protein coding regions (exons) are sequenced. The genetic variants causing monogenic disorders usually affect protein structure and function and are therefore located in exons (Kryukov, Pennacchio, et al. 2007; Stenson, Ball, et al. 2009). Therefore focusing sequencing efforts on the exome will likely reveal variants with large effect sizes that act by disrupting or altering protein function. However, the basic assumption that all disorders are probably caused by coding variants is likely untrue. It is possible that the majority of identified variants are exonic because gene identification efforts have been concentrated on exons. In addition, prediction of the consequence of a coding variant on protein function is somewhat easier than prediction of the consequence of noncoding variants. WGS is likely to provide unbiased information about the true genetic architecture of traits.

Currently it is widely accepted that WES is well powered to detect variants involved in human disease. WES has so far identified genes for over 100 monogenic disorders (Rabbani, Mahdieh, et al. 2012). The same approach has also been applied to complex disorders, although with more modest success. In addition to the successes, the challenges of this approach have also become evident. Interpretation of the sequence data and identification of functional disease-causing mutations from the multitude of variants in each exome sample is not a trivial task. The statistical framework to guide the interpretation of WES data is still being developed; firm guidelines will help in interpreting the sequence data.
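To make the interpretation problem concrete, a first-pass filter over the variants found in one exome might keep only those that are both rare in the population and predicted to alter a protein, the two properties the text associates with monogenic disease mutations. This is a sketch under stated assumptions: the `Variant` fields, the consequence labels, the 1% frequency cutoff, and the gene names are all hypothetical and are not taken from any specific annotation tool or study.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str                     # placeholder gene label
    consequence: str              # predicted effect on the transcript
    population_frequency: float   # allele frequency in a reference population


# Hypothetical set of consequence labels treated as protein altering.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "splice_site"}


def candidate_variants(variants, max_frequency=0.01):
    """Keep variants that are rare and predicted to alter the protein."""
    return [
        v for v in variants
        if v.consequence in PROTEIN_ALTERING
        and v.population_frequency <= max_frequency
    ]


# Made-up example records for one exome.
exome = [
    Variant("GENE_A", "nonsense", 0.0001),
    Variant("GENE_B", "missense", 0.20),      # protein altering but common
    Variant("GENE_C", "synonymous", 0.0005),  # rare but not protein altering
]
print([v.gene for v in candidate_variants(exome)])
```

Such a filter only narrows the list; as the text notes, deciding which remaining variant is actually disease-causing still requires statistical and functional evidence.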

Sample Preparation and Targeted Sequence Capture

NGS instruments sequence every molecule of DNA in the template library loaded onto the instrument. If sequencing is to be limited to specific regions of interest, enrichment of these regions from the entire genome must be performed before the sample is sequenced. In traditional automated Sanger sequencing this was primarily achieved by PCR amplification of regions of interest, and PCR-based methods have also been used for NGS (Meuzelaar, Lancaster, et al. 2007; Varley and Mitra 2008). Today, however, enrichment of regions of interest is primarily achieved by targeted hybrid capture methods. Hybrid capture can be used to enrich for any regions of interest, such as a subset of genes (Figure 1.2). One of the most common applications, however, is to capture all protein coding regions of the genome. The protein coding exome comprises only 1.2% of the human genome (Dunham, Kundaje, et al. 2012). However, what today is called exome capture is actually an enrichment not only for protein coding regions but also for other possibly functional regions of the genome, such as microRNAs (miRNAs) and noncoding exons. In practice, different manufacturers have slightly different content in their exome capture reagents. Comparisons of the most popular products available suggest that certain kits cover a slightly larger number of protein coding and miRNA genes, but none of the kits covers all Consensus Coding Sequence (CCDS) exons (Asan, Xu, et al. 2011; Coffey, Kokocinski, et al. 2011; Sulonen, Ellonen, et al. 2011). Analogous to GWAS chips, the exome capture assays are updated as new annotation information becomes available, to include as much of the coding sequence and other functional regions as possible. Usually the baits included in the exome capture assays are based on information from several different databases and annotation resources, such as genes from the CCDS project (Pruitt, Harrow, et al. 2009), RefSeq (Pruitt, Tatusova, et al. 2012), GENCODE/ENCODE (Harrow, Frankish, et al. 2012), and miRBase (Kozomara and Griffiths-Jones 2011) or other miRNA databases.

It is highly likely that WES is a temporary compromise that is currently employed for convenience, to limit data generation and ease the interpretation of results. It will probably be replaced by WGS as prices drop, sequencing capacity increases, and better annotation workflows become available. Therefore, in the future, many of the problems and pitfalls associated with WES will be overcome. Although the limited amount of data produced by WES can simplify interpretation of results, it limits variant detection to a small part of the genome. Sample preparation using pull-down reagents also increases the cost per base pair sequenced compared with WGS. On the other hand, the small size of the target DNA allows for cost-efficient sequencing of samples at relatively high coverage (usually 30- to 60-fold coverage), increasing the power to detect rare variants compared with lower-coverage WGS.

Despite the improvement of exome capture assays, coverage within an individual exome is still highly variable, even in high-coverage data: a fraction (up to 0.5%) of the target regions is not captured at all or is captured only at very low coverage (Asan, Xu, et al. 2011). WGS often produces a more even coverage of the genome, as no bias is introduced by hybrid capture. The uneven distribution of sequence depth in WES data makes the detection of copy number variants (CNVs) more challenging than for WGS data.
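The coverage figures quoted above follow from simple arithmetic: mean coverage is the number of sequenced bases divided by the size of the target. The sketch below is an idealized calculation. It assumes a haploid genome size of roughly 3.1 billion base pairs (an approximation not given in the text), uses the 1.2% exome fraction cited above, and assumes every read maps on target, which real capture experiments do not achieve.

```python
# Approximate haploid human genome size in base pairs (an assumption, not from the text).
GENOME_SIZE_BP = 3.1e9
# Share of the genome that is protein coding, as cited in the text (1.2%).
EXOME_FRACTION = 0.012


def mean_coverage(n_reads, read_length_bp, target_size_bp):
    """Average number of times each target base is read, assuming all reads map on target."""
    return n_reads * read_length_bp / target_size_bp


def bases_required(target_size_bp, coverage):
    """Total sequenced bases needed to reach a given mean coverage."""
    return target_size_bp * coverage


exome_size_bp = GENOME_SIZE_BP * EXOME_FRACTION
wes_bases = bases_required(exome_size_bp, 30)
wgs_bases = bases_required(GENOME_SIZE_BP, 30)
print(f"30x exome: {wes_bases:.2e} bases; 30x genome: {wgs_bases:.2e} bases")
print(f"ratio: {wes_bases / wgs_bases:.3f}")
```

Under these assumptions a 30-fold exome needs about 1.2% of the sequence output of a 30-fold genome, which is why many exome samples can share one instrument run. Because a mean says nothing about evenness, it does not capture the poorly covered target regions described above.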

[Figure 1.2: flow diagram with the steps: DNA extraction and fragmentation; adapter ligation; hybridization with biotinylated baits; pull-down of biotinylated baits with streptavidin; wash; target enrichment; amplification; sequencing of fragments; image analysis and base calling (example output: ATGCGATCACCGCCTG, TGCAGCGGAACCTCAT).]

FIGURE 1.2: The main steps of next-generation sequencing: First DNA is extracted and fragmented and adapters that serve as PCR primers are added to the ends of the DNA fragments. If DNA from several samples is sequenced in the same lane of the sequencing instrument, oligonucleotides that serve as barcodes for each individual sample are also added to the fragments (not shown). If only a subset of the genome is to be sequenced, DNA or RNA baits are used to enrich for the desired genomic regions and a biotin-streptavidin-based pull-down reaction is used to obtain the desired DNA fragments. These are then amplified and sequenced and the images produced by the sequencing instrument are processed to extract the DNA sequence for each amplified DNA fragment.

The workflow for WES consists of three basic steps (template preparation, sequencing, and imaging) followed by bioinformatic analysis (Figures 1.2 and 1.3). To construct a template, a relatively large amount (several micrograms) of genomic DNA is randomly sheared to form fragments, and adapters (short oligonucleotides) are added to the fragments. Enrichment of the exonic sequence is done by hybridizing the sheared DNA with biotinylated DNA or RNA baits, and the hybridized fragments are then captured by biotin-streptavidin-based pull-down. The exome library is then massively amplified by using the adapters as primers, and the amplified DNA molecules are sequenced. As current technologies allow several samples to be sequenced in the same lane of the sequencing instrument, barcoded indexing tags are introduced at the library preparation stage so that, after sequencing, the sequences belonging to individual samples can be identified. Sample preparation for WGS is simpler, as it does not require any template selection. The sequencing library is created from sheared segments of DNA, which are attached to adapters to allow amplification of the DNA. Although most current technologies rely on amplification before sequencing, some technologies can sequence unamplified DNA (Treffer and Deckert 2010).
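The role of the barcoded indexing tags can be illustrated by the sorting step that follows a pooled run, in which each read is assigned back to its sample. The sketch below is deliberately simplified: it assumes the barcode occupies the first bases of each read and must match exactly, and the barcodes, reads, and function name are hypothetical. Real instruments often read the index separately and tolerate sequencing errors in it.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using the barcode at the start of each read."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]  # sequence with the barcode removed
        # Reads whose barcode is not recognized are kept apart, not discarded.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical 4-base barcodes for two pooled samples.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTATGCGATC", "TGCATGCAGCGG", "ACGTACCGCCTG", "GGGGAACCTCAT"]

for sample, inserts in demultiplex(reads, barcodes, 4).items():
    print(sample, inserts)
```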

Amplification and Sequencing Technology

Before the actual sequencing takes place, most currently available sequencing technologies require that the DNA library be massively amplified to provide multiple copies of each DNA fragment. Various approaches are used by the different NGS technologies for the amplification and sequencing steps (Metzker 2010). Amplification can occur by emulsion PCR (Dressman, Yan, et al. 2003), in which single-stranded DNA is attached to beads and then amplified by PCR (used by Roche/454 and Applied Biosystems/SOLiD). The conditions are optimized so that only one template molecule attaches to each bead, so that after amplification each bead carries clonal copies of the original fragment. Beads can then be cross-linked to glass surfaces or deposited in microscopic wells for sequencing. Amplification can also be performed in solid phase (Adessi, Matton, et al. 2000; Fedurco, Romieu, et al. 2006) (Illumina/HiSeq). The DNA
with the attached adapters is immobilized onto a two-dimensional surface with oligonucleotides that are complementary to the adapters. PCR is then performed, using primers designed to target the adapters of the DNA fragments, until clusters of about a million copies of the original DNA molecule are formed. After amplification, the actual sequencing reaction is performed, which involves the steps of base determination, imaging, and initial image processing to decode the order of bases in the DNA fragment (Anderson and Schrijver 2010; Mardis 2008; Metzker 2010). Sequencing can be performed either by synthesis or by ligation. Sequencing by synthesis can be further divided into cyclic reversible termination, single-nucleotide addition, and real-time sequencing. Cyclic reversible termination involves the addition of either one or all four nucleotides, which bind in a template-defined manner and are added by a mutant DNA polymerase that can incorporate the modified nucleotides. The nucleotides are capped to prevent additional extension reactions and carry a fluorescent label. Following incorporation, the unincorporated nucleotides are washed away and imaging by lasers is performed to determine the identity of the incorporated nucleotide. Subsequently, the terminating group and fluorescent label are cleaved to allow for another round of template-directed extension. This method, with all four bases added in each cycle, is used by the Illumina/HiSeq, whereas the Helicos BioSciences single-molecule sequencing technology uses cyclic reversible termination with only one base added in each cycle (Braslavsky, Hebert, et al. 2003). Pyrosequencing (Ronaghi, Uhlen, et al. 1998), used by the Roche/454 (Margulies, Egholm, et al. 2005), is also a DNA polymerase-driven method; it detects the bioluminescence generated by the release of inorganic pyrophosphate when the DNA strand is extended by a complementary nucleotide. The order and intensity of the bioluminescence are recorded by the charge-coupled device (CCD) camera in the instrument. The signal strength is proportional to the number of nucleotides incorporated; for example, homopolymer stretches generate a greater signal than single nucleotides. Sequencing by ligation is also a cyclic method but uses a DNA ligase instead of a DNA polymerase (Tomkinson, Vijayakumar, et al. 2006). The process uses either one-base-encoded probes or two-base-encoded probes. A fluorescently
labeled probe hybridizes to the target in a template-guided manner and a DNA ligase is added to join the probe with the primer. After nonincorporated probes are washed away, fluorescence detection determines which nucleotide has been incorporated. Again, the fluorescent dye is then removed and another set of probes is added. The Life/SOLiD technology uses two-base-encoded probes, which yield a sequence every five base pairs because of three degenerate bases on each dinucleotide probe (Shendure, Porreca, et al. 2005; Valouev, Ichikawa, et al. 2008). After the first round of ligation is finished, the template is stripped and another primer is used, this time starting at the (n-1) position relative to the first round. This way, after five rounds of elongation, the whole sequence will have been covered twice by template-specific interrogation bases. Data from the sequencing run are stored in image files, which are processed to determine the base-pair composition of each fragment that has been sequenced. The manufacturers supply algorithms for base calling, but other base-calling algorithms have been developed that improve on the manufacturer-developed methods at the cost of higher computational intensity (Kao, Stevens, et al. 2009; Kircher, Stenzel, et al. 2009; Quinlan, Stewart, et al. 2008; Wu, Irizarry, et al. 2010). The different NGS platforms introduce different biases depending on the strengths and weaknesses of the technology used. For example, the 454 has increased error rates in homopolymer reads owing to the wide variance in the observed signal intensity for a homopolymer of a specific length. For Illumina data, the rate of error increases toward the end of the reads as the synthesis process becomes desynchronized between different copies of the DNA template in the clusters. The SOLiD technology suffers from errors due to biases in fluorescence intensities that appear in later cycles. All of these biases must be accounted for in image processing and subsequent analysis steps to produce a reliable dataset.
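The link between pyrosequencing signal strength and homopolymer length, and the resulting source of error, can be illustrated with a toy decoder. The Python sketch below rests on simplifying assumptions: the nucleotide flow order (`TACG`), the normalized signal values, and the function name `decode_flowgram` are illustrative only, and real base callers model signal noise far more carefully.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalized per-flow light signals into a base sequence.

    Each flow offers one nucleotide; a signal near n is taken to mean that
    n identical bases were incorporated, so the signal is rounded to the
    nearest integer.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = int(signal + 0.5)  # round half up; signals assumed non-negative
        bases.append(nucleotide * count)
    return "".join(bases)

# Six flows (T, A, C, G, T, A) with a two-base and a three-base homopolymer.
sequence = decode_flowgram([1.02, 0.05, 2.1, 0.9, 0.0, 3.2])
```

Because the signal is rounded, a noisy value of 2.5 for a longer run is ambiguous between two and three bases, which is the kind of homopolymer error described above.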

Bioinformatic Analyses

Multiple steps of bioinformatic analysis are required to transform the base-call data obtained from the next-generation sequencers into variant lists that can be used in medical genetic studies (Figure 1.3). The first step is to align the sequence data to a known reference sequence to determine the most likely location in the sequenced genome for each of the individual reads (Flicek and Birney 2009; Li and Homer 2010). If a reference genome is not available, in some cases alignment can be performed using the assembled genome of a closely related species. In some instances sequence data can also be assembled de novo (i.e., without using a reference). De novo assembly is more challenging and requires more computational resources. However, the increase in sequence read length
as well as advances in algorithm development have made de novo assembly possible even for large genomes, and over 20 different de novo assemblers are available (Lin, Li, et  al.; Zhang, Chen, et al. 2011). Each NGS platform produces a per-base quality score by using noise estimates from image analysis. After assembly or alignment, quality scores are usually recalibrated to better reflect the true base-calling error rates. After initial alignment, realignment is often

[Figure 1.3 flowchart, in order: Image analysis and base calling; De novo assembly or Alignment to reference; Realignment, removal of duplicate reads, quality score recalibration; Multisample genotype calling, Single-sample genotype calling, Imputation; Variant annotation (allele frequency, consequence, conservation); De novo mutations, Sharing between cases, Variants in linkage regions; Functional assays]

FIGURE  1.3: Basic workflow of bioinformatic analyses applied to the DNA sequence data obtained from the sequencing instrument. Raw DNA sequence reads must be assembled or aligned to a reference to determine their location in the genome before sites that differ from the reference (or between samples) can be identified. These variant sites are then annotated with information that will be useful in subsequent analysis, such as allele frequencies of the variants in control databases, predicted consequence on protein function, conservation of the site between species, or other information that could help to identify disease-associated variants. The analytical steps needed to identify the disease-associated variant depend on the study design and the genetic architecture of the trait. In some cases, variants that are not inherited from the parents (i.e., de novo in the affected patient) could be causative. In other cases, sharing of variants between multiple related or unrelated cases can help to identify the causative variants. Usually replication in large datasets as well as functional proof of the effect of the variant are needed to lend further support to the role of the identified variant in the trait of interest.


performed around known insertion/deletion polymorphisms (indels), such as those identified in the 1,000 Genomes Project (Abecasis, Auton, et al. 2012), to decrease mapping errors and improve variant call accuracy. Following alignment, a genotyping step is performed. This can be done either for one sample at a time or, as is more common, across multiple samples. Genotyping is split into two steps: SNP (or variant) calling followed by genotype calling. In the first step, the aim is to determine the positions at which there is at least one nonreference allele. Genotype calling is then performed only for sites where nonreference alleles are observed, to determine the genotype of each sample at the site (Nielsen, Paul, et al. 2011). Early SNP-calling methods simply compared the number of reads carrying an alternative allele with the number carrying the reference allele in a set of high-confidence bases and called SNPs based on fixed cutoffs. However, simple counting methods are not suitable for low-coverage data, as fixed cutoffs result in undercalling of heterozygous genotypes and simple filtering on quality score leads to loss of information regarding individual read qualities. Therefore current SNP callers use probabilistic methods (DePristo, Banks, et al. 2011; Le and Durbin 2010; Li, Handsaker, et al. 2009; Li, Yu, et al. 2009), which lead to genotype calls of higher accuracy. In addition, they provide a measure of the statistical uncertainty (in the form of a posterior probability) for each genotype and can incorporate information regarding allele frequencies and linkage disequilibrium (LD) patterns. For single-sample calls, priors may be chosen to assign equal probability to all genotypes, or information from dbSNP or other collections of known variant sites can be used to determine priors. For multiple-sample calls, the priors can be derived from jointly analyzing multiple individuals by using allele frequencies or genotype frequencies. Once allele frequencies are estimated, genotype probabilities can be calculated using the Hardy-Weinberg equilibrium assumption, and uncertainty in the allele frequency estimates can be incorporated by assigning a prior to the allele frequency itself. Imputation-based methods can also include information on the pattern of LD at nearby sites to improve genotype calls, which leads to a significant improvement in genotype-calling accuracy for common and moderate-frequency SNPs (Nielsen, Paul, et al. 2011).
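The probabilistic approach can be made concrete with a small worked example. The Python sketch below is not the algorithm of any of the cited callers; it assumes a biallelic site, independent reads, base qualities on the Phred scale, an error model in which a miscalled base is equally likely to be any of the other three bases, and a Hardy-Weinberg prior computed from a given alternative-allele frequency. All function names are illustrative.

```python
import math

def phred_to_error(q):
    """Convert a Phred-scaled base quality to an error probability."""
    return 10 ** (-q / 10)

def genotype_likelihoods(bases, quals, ref, alt):
    """P(reads | genotype) for genotypes with 0, 1, or 2 copies of alt."""
    likes = []
    for n_alt in (0, 1, 2):
        frac_alt = n_alt / 2
        log_like = 0.0
        for base, q in zip(bases, quals):
            e = phred_to_error(q)
            # Probability of observing this base if the read came from a
            # reference or an alternative chromosome, respectively.
            p_if_ref = (1 - e) if base == ref else e / 3
            p_if_alt = (1 - e) if base == alt else e / 3
            log_like += math.log((1 - frac_alt) * p_if_ref + frac_alt * p_if_alt)
        likes.append(math.exp(log_like))
    return likes

def hwe_prior(alt_freq):
    """Hardy-Weinberg genotype frequencies for 0, 1, 2 alt alleles."""
    p, q = 1 - alt_freq, alt_freq
    return [p * p, 2 * p * q, q * q]

def genotype_posteriors(bases, quals, ref, alt, alt_freq):
    """Posterior probability of each genotype given the reads and prior."""
    likes = genotype_likelihoods(bases, quals, ref, alt)
    joint = [l * pr for l, pr in zip(likes, hwe_prior(alt_freq))]
    total = sum(joint)
    return [j / total for j in joint]

# Three reference and three alternative reads, all at base quality 30.
post = genotype_posteriors(list("AAAGGG"), [30] * 6, "A", "G", alt_freq=0.1)
```

With an even split of reads the heterozygous genotype receives almost all of the posterior mass, whereas a fixed read-count cutoff would have to be tuned separately for every coverage level.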

Alignments are most commonly stored in BAM files, which are binary versions of Sequence Alignment/Map (SAM) files (Li, Handsaker, et al. 2009). These files can efficiently store information from the large number of reads produced in NGS runs, and only the parts of the alignment that are of interest need to be accessed, without reading in the whole alignment file for analysis. The called variants, such as single nucleotide variants (SNVs) and indels, are commonly stored in variant call format (VCF) files (Danecek, Auton, et al. 2011). In addition to the genotypes, VCF files contain information on call quality, read depth, and other necessary quality parameters of the variants. The VCF file also includes a large header containing metainformation about the analytical steps that were taken during the genotype calling as well as information about the fields that were added during annotation of the variants. VCF files can be compressed and are indexable, allowing for quick analysis of the variants, such as retrieval of variants from regions of interest. The final step of data generation usually involves annotation of variants. The type of annotation depends on the needs of downstream analyses. Commonly added annotation includes the frequency of the variant in control databases. Another useful annotation is the predicted consequence of the variant on protein structure. Predicting such consequences is often problematic, although several methods such as PolyPhen (Adzhubei, Schmidt, et al. 2010) and SIFT (Ng and Henikoff 2003) are available. The annotation information can then be used in downstream analysis to aid in the identification of disease-causing variants. Errors in the annotations can have severe consequences in downstream analyses, whether variants are erroneously annotated as conferring loss-of-function (LoF) effects or, conversely, a true LoF variant is not annotated as such. Annotation of WGS data is even more problematic than that of WES data, as very little is known about the functionality of noncoding variants.
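To show how the genotype, quality, and annotation fields sit together in a VCF record, the following Python sketch parses a single tab-separated data line using only the standard library. The example record, including its identifier, its INFO keys, and the sample names, is invented for illustration; production work would normally rely on an established VCF library rather than hand-written parsing.

```python
def parse_vcf_line(line, sample_names):
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    for entry in info.split(";"):
        if entry == ".":
            continue
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True
    record = {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt, "info": info_dict, "samples": {},
    }
    # Per-sample columns follow the FORMAT column and are colon-separated.
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record

# Invented record: a heterozygous call in case1, homozygous reference in case2.
example = "1\t12345\trs0000\tG\tA\t50.0\tPASS\tDP=30;AF=0.5;DB\tGT:DP\t0/1:14\t0/0:16"
record = parse_vcf_line(example, ["case1", "case2"])
```

Annotation tools typically add their output as further INFO keys, which is why the header describing those keys matters for downstream filtering.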

Identification of Disease-Associated Variants

NGS and subsequent data processing steps produce a list of loci where the sequenced sample differs from a reference genome. The 1,000 Genomes Project reported 36.7 million autosomal SNPs and 1.38 million autosomal indels
in 1,092 low-coverage WGS samples from 14 populations. The average autosomal number of variant SNP sites per individual was around 3.6 million. WES data consisting of the autosomal GENCODE regions contained almost 500,000 SNPs and 1,800 indels in the same number of samples. Individual exomes contained on average 24,000 variant SNP sites and 440 indels (Abecasis, Auton, et al. 2012). The large number of variants identified in every sample included in an NGS study presents a challenge for gene identification, and various analytical approaches must be employed to identify which individual variants are associated with a phenotype. For most published studies to date, the assumption has been that disease-associated variants are highly penetrant and not found in dbSNP or other control datasets. This reduces the number of possible disease-causing variants to 1% to 2% of the original list. Ideally, if there are several cases sharing the same disease mutation, only a handful of variants or even a single variant will remain after filtering on control frequency and sharing between all samples. However, often the reality is that after filtering, no variants remain at all. Alternatively, filtering will not reduce the candidate variant list sufficiently, or the remaining variants will not overlap between cases. Many published studies have used the 1,000 Genomes Project, dbSNP, and the NHLBI GO Exome Sequencing Project as controls. These datasets are useful because they are large; the 1,000 Genomes Project, particularly, includes individuals from a large number of populations. On the other hand, no phenotypic data are supplied for the 1,000 Genomes samples, and variant annotation in dbSNP is poor on phenotypic information. Therefore it is possible that individuals affected with the disorder being studied are included in these reference datasets. Also, particularly for recessive disorders, it is possible that carriers of disease variants are present in the general population. A specific problem with dbSNP is that it contains poorly validated variants. However, filtering on a low control-frequency cutoff (such as 1% in the general population for recessive disorders and 0.1% for dominant disorders) rather than on mere presence in controls could decrease the risk of missing true variants owing to disease allele carriers in the control data while still remaining powerful (Bamshad, Ng, et al. 2011). As in GWASs, the controls should be from the same population as the cases to minimize
the risk of false-positive variants due to population stratification. It is tempting to assume that any LoF variant identified in an individual would be a strong candidate for being associated with the disorder. However, studies in healthy reference populations have shown that each person carries, on average, 100 LoF variants. Further, each person has on average 20 genes with two deleterious variants, resulting in complete inactivation of these genes (MacArthur, Balasubramanian, et al. 2012). Large population-based studies have shown that over half of the variants identified in WGS or WES of large population samples are novel (Abecasis, Auton, et al. 2012; Tennessen, Bigham, et al. 2012); that is, they are not found in reference databases. Each individual carries hundreds of private or very rare variants. Again, assuming a correlation between the absence of a variant from control databases and association with a disease is not necessarily correct. The majority of protein-coding variation is evolutionarily recent, rare, and enriched for deleterious alleles, so that analysis of WES in itself enriches for this type of variation. Extra care is needed to link this type of variation to phenotypes (Tennessen, Bigham, et al. 2012). As control databases grow to include more individuals, the probability of seeing multiple copies of very rare variants becomes higher and the risk of mistaking very rare benign variants for disease-causing ones decreases.
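The filtering strategy discussed in this section (keep variants that are rare in controls and shared by all cases) can be sketched as follows. This is a minimal Python illustration, assuming that variants are represented by opaque identifiers and that control allele frequencies are available in a lookup table; the function name, the data layout, and the example identifiers are hypothetical. The thresholds are the 1% and 0.1% cutoffs mentioned above.

```python
# Maximum control allele frequency by inheritance model, following the
# thresholds mentioned in the text (1% recessive, 0.1% dominant).
MAX_CONTROL_FREQ = {"recessive": 0.01, "dominant": 0.001}

def filter_candidates(case_variants, control_freq, model):
    """Return variants that are rare in controls and shared by all cases.

    case_variants: dict mapping case ID -> set of variant identifiers.
    control_freq:  dict mapping variant identifier -> allele frequency in
                   controls; variants absent from the dict count as 0.
    """
    if not case_variants:
        return set()
    threshold = MAX_CONTROL_FREQ[model]
    shared = set.intersection(*case_variants.values())
    return {v for v in shared if control_freq.get(v, 0.0) <= threshold}

# Invented example: two cases sharing v1 and v2; v2 is too common for a
# dominant model but acceptable under a recessive model.
cases = {"case1": {"v1", "v2", "v3"}, "case2": {"v1", "v2", "v4"}}
freqs = {"v1": 0.0005, "v2": 0.005}
dominant_hits = filter_candidates(cases, freqs, "dominant")
recessive_hits = filter_candidates(cases, freqs, "recessive")
```

Requiring sharing across every case is the strictest choice and fails under genetic heterogeneity, which is one reason filtering can leave no variants at all.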

Study Designs

The choice of study design is guided by the expected frequency and effect size of the underlying variant and the nature of the disease (prevalence, age of onset, etc.). In the case of monogenic traits, where individual variants have a very high impact on the trait, relatively small sample sizes can be sufficient to demonstrate disease causality of a variant. However, because sequencing studies identify a very large number of potential variants, the number of tests will inevitably be large. Thus identifying which variants or mutations are disease causing is not always trivial, even in the case of monogenic traits. For monogenic traits the generally accepted criteria developed for positional cloning studies provide a good reference base. In positional cloning studies the chromosomal location was typically first pinpointed by linkage, applying generally agreed significance
thresholds. If a variant in the linked region was not seen in a control population, the same or different variants in the same gene had to be replicated in several pedigrees with the same phenotype. Further, at least some functional data had to be presented to convince the field and the reviewers that this variation/mutation was associated with the phenotype. Similar rigor should be applied in WES-based variant identification.

Family-Based Studies

Family studies are the default in monogenic traits but have been expanded to more complex traits as well. The hypothesis is that disease susceptibility variants are clustered and more frequent in families with a specific disease than in the control population. Only a few family members need to be fully sequenced, whereas the remaining relatives can be more sparsely genotyped and the full genome variation imputed. The segregation of a disease-associated haplotype can then be followed in the full pedigree (Figure 1.4). Yet the optimal statistical family-based analysis in complex traits is not fully worked out. So far, publications that would provide a good understanding of the power and limitations of this approach are lacking. Unpublished work suggests that, with this strategy, one cannot hope to capture low-hanging fruit. It is also likely that family-based analyses will need large sample sizes to achieve statistically robust results.

De Novo Mutations

Spontaneous mutations that arise in parental germ cells are frequent causes of some diseases (Figure 1.5). These mutations are not observed in the parents, only in the offspring. A classic example is achondroplasia (Bellus, Hefferon, et al. 1995; Shiang, Thompson, et al. 1994) and, in CNS disorders, Dravet syndrome (also known as severe myoclonic epilepsy of infancy, or SMEI) (Claes, Del-Favero, et al. 2001). WES is especially advantageous for the identification of de novo mutations (DNMs). When both parents and the proband are sequenced at high coverage, identifying inherited variants is relatively easy, leaving a short list of DNMs. Yet because, on average, each individual carries about 0.8 to 1.3 DNMs in his or her exome, the causality of a DNM still needs verification. To be convincing, deleterious variants in the same gene must be identified in several individuals.
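The trio logic behind DNM detection can be sketched as a comparison of the child's alleles with those of both parents. The Python example below is a simplified illustration, assuming that genotype calls and read depths are already available per site; the data layout, the function names, and the minimum depth of 20 reads are assumptions made for this sketch and are not values taken from the studies cited.

```python
def is_de_novo(child_gt, mother_gt, father_gt):
    """True if the child carries an allele seen in neither parent.

    Genotypes are tuples of allele codes, e.g. (0, 1) for a heterozygote,
    with 0 denoting the reference allele.
    """
    parental_alleles = set(mother_gt) | set(father_gt)
    return any(allele not in parental_alleles for allele in child_gt)

def find_de_novo(trio_calls, min_depth=20):
    """Return sites with a candidate de novo allele in the child.

    trio_calls maps site -> {"child": (gt, depth), "mother": ..., "father": ...}.
    Sites where any family member falls below min_depth are skipped,
    because a missed heterozygous call in a parent mimics a de novo event.
    """
    candidates = []
    for site, calls in trio_calls.items():
        if any(depth < min_depth for _, depth in calls.values()):
            continue
        gts = {member: gt for member, (gt, _) in calls.items()}
        if is_de_novo(gts["child"], gts["mother"], gts["father"]):
            candidates.append(site)
    return candidates

# Invented calls: one true candidate, one inherited variant, one low-depth site.
calls = {
    ("1", 100): {"child": ((0, 1), 40), "mother": ((0, 0), 35), "father": ((0, 0), 38)},
    ("1", 200): {"child": ((0, 1), 40), "mother": ((0, 1), 35), "father": ((0, 0), 38)},
    ("2", 300): {"child": ((0, 1), 40), "mother": ((0, 0), 8), "father": ((0, 0), 38)},
}
candidates = find_de_novo(calls)
```

The depth requirement reflects the point made above that high coverage in all three family members is what keeps the candidate list short.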

Case Control Studies

In case control studies, the sequences of sporadic cases and healthy or population controls are compared. The case control setting is the classic study design in complex traits. Currently, this study design aims to identify rare or low-frequency variants contributing to a complex trait. When WGS or WES becomes more cost-effective than chip genotyping, sequencing might also be used to identify common variants. As the typical effect sizes (odds ratios) of variants associated with complex traits range between 1.1 and 1.5, typical sample sizes in GWASs range from a few thousand to tens of thousands of samples. In searching for low-frequency and rare variants using sequencing, we can foresee a need for sample sizes that are even bigger than in GWASs. Because sequencing is still quite costly if applied in large sample sets, new, more focused low-cost genotyping chips are being developed. These chips (e.g., the exome chip) are based on low-frequency-variant catalogues developed in large sequencing studies, such as the 1,000 Genomes Project. This makes it realistic to genotype large enough samples to enable statistically robust low-frequency association studies.

Population Isolates

It is hypothesized that population isolates provide a middle ground between family and case control studies. This is seen as an extension of the "megapedigree" concept. Because of bottleneck effects, genetic drift, and population expansion, some rare alleles are enriched in population isolates (Figure 1.6). The hypothesis is that some of these alleles, which are extremely rare elsewhere (population frequency < 0.1%, as seen, for example, in most European populations), have been enriched to frequencies between 1% and 5% in a population isolate such as Finland. Further, selection has not had time to act in recently founded isolates, enabling a higher population frequency of harmful variants. Some of the enriched variants could possibly contribute to common diseases. Even though they could have been neutral in the environment in which the founder population was established more than a thousand years ago, they could contribute to diseases in populations sharing the modern lifestyle and environment. An enrichment of low-frequency alleles in the study population should boost the power significantly compared with more mixed populations. Therefore the expectation is that smaller discovery sets will be needed to achieve significant


[Figure 1.4 panels: (1) Use microsatellites or SNPs to identify regions of shared IBD between affected individuals. (2) Sequence one affected individual. (3) Identify all variants in the linked regions from sequence data. (4) Identify most likely candidate variants based on consequence on protein structure, conservation, and absence from controls. The illustrated example is a c.235 G>A variant, labeled with the amino acids arginine and glycine, shown with a protein alignment across human, chimpanzee, mouse, zebrafish, and fruit fly and with control allele frequencies of G 100% and A 0%.]

FIGURE  1.4: When large families with multiple affected individuals are available, sequencing is required for only a small subset of affected individuals. Microsatellites or SNPs can be used to identify regions shared identical-by-descent (IBD) by all cases; thus sequencing of one index case is enough to survey the full variation of these regions. Candidate variants are identified based on predefined criteria, such as effects on protein function, absence or low frequency of the variant in control databases, or evolutionary conservation. If a large number of meioses separates the individuals in the family, a small number of candidate regions and thus a small number of candidate variants remain after the analysis. Sanger sequencing may be needed to verify the cosegregation of the variant in the pedigree. Replication of the finding in other pedigrees is usually needed to separate benign but extremely rare variants from true disease-associated mutations.

association in population isolates. The success of this strategy has best been demonstrated in the Icelandic population (Holm, Gudbjartsson, et al. 2011; Jonsson, Atwal, et al. 2012; Sulem, Gudbjartsson, et al. 2011). Also, by sequencing a small subset of the study population and using
SNP genotyping in the large majority, efficient imputation of whole-genome sequences can be enabled. Although imputation-based studies are possible in any population, a smaller number of individuals need to be sequenced to capture the majority of all genetic variation in isolated


[Figure 1.5 panels: example sequence reads aligned to AACTGAACCGTCGAATT for the two parents and the child, labeled "Paternal variants", "Maternal variants", and "Small number of variants present in the child but in neither of the parents"; some of the child's reads carry an A (AACTGAACAGTCGAATT) where all parental reads carry a C.]

FIGURE 1.5: Disorders such as autism spectrum disorders, which are subject to strong negative selection in the population, have been shown in some cases to be caused by de novo mutations (i.e., mutations present in the affected individual but in neither of the parents). These can arise in either the paternal or maternal gamete or during early embryogenesis. De novo mutations are relatively easy to identify if both parents and the affected child are sequenced at high coverage. Only a handful of possible candidate variants remain if the data are of high quality and appropriate filtering is used in the analysis. Proving causality of individual variants can be hard, particularly if the disorder has a large mutational target, as a large number of families are needed to identify another family with a de novo mutation in the same gene.

populations, as the number of founder chromosomes is lower than in admixed populations.

Current Review of Results

So far, over 100 mutations in monogenic disorders have been identified by NGS, mainly by WES (Rabbani, Mahdieh, et al. 2012). There is an obvious publication bias toward successful studies, so the success rate of gene identification by NGS still remains unclear and will also be strongly dependent on the genetic architecture and availability of samples. One estimate suggests that WES identified the major disease gene in at least 50% of projects focused on rare but clinically well-defined monogenic diseases (Gilissen, Hoischen, et al. 2011). Experience has
shown that for most disorders this is an overly positive prediction; realistically, much smaller yields are often to be expected. Large-scale resequencing can be applied to several different study designs and can identify several different types of risk variants, as summarized above. Several of these approaches have been used to identify genes for intellectual disability (ID), and are described in more detail in the following paragraphs to provide an overview of the different analytical approaches that can be used depending on the expected genetic architecture of the trait being investigated. A large number of monogenic traits with neurological and neurodevelopmental symptoms have been subjected to WGS and WES.


[Figure 1.6 panels: Identify recessive variants; Sequence small number of individuals; Imputation of large genotyped cohorts. The panels show sparsely genotyped individuals, with unknown positions marked "?", alongside a few fully sequenced individuals.]

FIGURE 1.6: Population isolates can be powerful in the identification of disease-associated variants. The founding bottleneck has reduced the genetic diversity in the population and drift can enrich disease-associated variants. This is particularly true for recessive disorders, as negative selection is not acting on the asymptomatic disease carriers. Because of an enrichment of the disease allele in the population due to the bottleneck, it is more likely that two individuals are distantly related and carry the same recessive disease mutation, resulting in the risk of having affected offspring. The reduced genetic diversity also makes imputation studies particularly feasible. The founding bottleneck has reduced the number of founding chromosomes, so only a small number of individuals need to be sequenced to capture them. When all founding chromosomes have been sequenced, imputation is efficient and highly accurate. Large numbers of individuals can be genotyped using cheap SNP chips and then imputed using the reference panel generated from the sequenced individuals, enabling large case-control association studies.

This list is constantly growing. Thus we do not aim to provide a comprehensive list of these diseases but have rather selected a few examples of more complex and challenging phenotypes.

Intellectual Disability

ID often has a genetic basis, and positional cloning has shown that at least a subset of ID is caused by monogenic, fully penetrant mutations. ID can present together with other clinical symptoms such as metabolic or structural abnormalities. These syndromic forms of ID make it possible to identify patients with
similar phenotypes, often revealing an underlying shared genetic etiology. However, ID can also present as the only observable phenotype, referred to as nonsyndromic ID (NSID). In these cases, it is impossible a priori to identify cases with a shared genetic etiology. Substantial genetic heterogeneity underlies NSID, since numerous genes have been identified. In fact, most identified genes account for only a very small fraction of cases, and over 100 genes have already been implicated in ID (Ropers 2010). Over 90 of these are located on the X chromosome. It is probable that this bias in


identification is largely due to the ease of gene identification in large X-linked pedigrees; exome and WGS studies will give a less biased estimate of the proportion of X-linked versus autosomal ID genes. It has been estimated that genes on the X chromosome account for 10% to 20% of ID in males (Ropers and Hamel 2005). In addition to the high level of genetic heterogeneity, ID is known to be caused by many types of genetic abnormalities, ranging from duplication or deletion of large chromosomal segments to small indels and SNVs. Further, many different inheritance patterns have been observed, ranging from autosomal dominant, autosomal recessive, and X-linked to DNMs.

Family Studies and Population Isolates

One of the most successful approaches to the identification of ID genes has been to use large families with an X-linked pattern of inheritance. Today there is a large collection of these families, which have been thoroughly studied (http://goldstudy.cimr.cam.ac.uk, http://www.euromrx.com). Traditionally, microsatellite markers have been used to identify regions of maximum linkage in these families, and Sanger sequencing of all genes in the linkage regions has resulted in the identification of numerous ID genes (Ropers and Hamel 2005). However, in many families, the linkage intervals have been too large to allow for the sequencing of all genes using traditional Sanger sequencing. The first large-scale resequencing study of ID genes was published in 2009; in it, all known exons of genes on the X chromosome were sequenced in 208 families with X-linked ID (Tarpey, Smith, et al. 2009). Nine genes were deemed to be associated with ID. However, the authors also discuss extensively the difficulty of identifying true disease-associated variants. More than half of the gene-truncating variants did not segregate with the disorder in the families or were found in controls; the authors therefore caution against concluding that truncating mutations in genes are sufficient on their own to be considered causative. In particular, 8 of the 19 genes with truncating variants that did not segregate with the phenotype or were found in controls have only a single exon. This suggests that some of these genes might be retrotransposed copies without important function that therefore tolerate LoF mutations. The authors also note that although
they screened most of the protein coding exons on the X chromosome, the likely genetic basis for ID was established in only 25% of families. Variants could be missed owing to low coverage, unannotated genes, the presence of copy number changes large enough to go undetected by the sequence data, nonexonic variants, and the presence of autosomal variants despite the appearance of X-linked inheritance in the families. Also, only LoF variants were considered, although it is highly likely that in some families the causative mutation is a missense or even noncoding variant. To identify autosomal ID genes, studies have been performed in consanguineous families or founder populations. Traditionally, microsatellite markers or SNPs have been used to identify regions of homozygosity in affected relatives, and genes in these regions have been resequenced to identify disease-causing mutations. Today, WES allows for a shortcut directly to the causative variants. Still, the identification of homozygous regions either from SNP data or from the exome data themselves is useful to limit the amount of variation that could be considered to be pathogenic. A  novel autosomal ID gene, TECR, was identified by linkage mapping followed by WES in a large consanguineous family with 5 of 13 children affected with ID (Caliskan, Chong, et al. 2011). Linkage analysis first identified a gene-rich region on 19p13 that cosegregated with the phenotype in the family; an SNP array was used to narrow the region to a 2-Mb homozygous segment with over 30 genes. WES was performed for both parents and filtering performed for novel disruptive variants heterozygous in the parents. Only a single variant fulfilling these criteria was identified, and follow-up genotyping in the rest of the family showed cosegregation with ID in the family. Further, homozygosity for the variant was not observed in over 1,000 individuals from the same population. 
In another study, sequencing was performed in 136 consanguineous families from Iran with autosomal recessive ID. Mutations were identified in 23 previously known ID genes as well as 50 novel candidate genes (Najmabadi, Hu, et  al. 2011). In this study, the targeted genes were not the whole exome but genes from previously identified regions of homozygosity, thus significantly reducing the number of genes sequenced. Another example of how WES has been used to identify autosomal recessive genes is in

the population isolate from the Ashkenazi Jews. Two individuals (the affected offspring and the mother) from a family with Joubert syndrome, an autosomal recessive ID syndrome, were exome sequenced (Edvardson, Shaag, et al. 2010). The search was concentrated on a linkage region that had previously been identified by homozygosity analysis of a larger pedigree from the same population. Seven variants homozygous in the child and heterozygous in the parent were identified, of which two remained after filtering on dbSNP; only one of the remaining variants (in TMEM216) was nonsynonymous. The added benefit of the WES data was to show that no other disruptive mutations existed in the previously identified region of homozygosity. The mutations segregated with the phenotype in the larger pedigree. Analysis of non-consanguineous families must take into account that the mutation is likely to be compound heterozygous (i.e., a different mutation is inherited from each parent). This can in some cases prove to be more challenging, but as long as the inheritance model is known to be recessive, gene identification can often be successful. In a family with three affected offspring with hyperphosphatasia mental retardation syndrome (Mabry syndrome), WES was performed for all three affected offspring (Krawitz, Schweiger, et al. 2010). First, all common variants and variants not found in all affected persons were excluded, leaving 14 candidate genes. A hidden Markov model was used to infer all loci where the offspring shared both alleles identical by descent (IBD = 2), reducing the number of candidate genes to two, PIGV and SLC9A1, located in a 13-Mb homozygous block. Mutation screening of PIGV revealed homozygous and compound heterozygous rare variants in other families with the same phenotype, identifying PIGV as the causative gene.
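The two recessive filtering strategies described above (homozygous in the affected child and heterozygous in the sequenced parents, versus compound heterozygous in non-consanguineous families) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the pipeline of any of the cited studies: the genotype encoding (0, 1, or 2 copies of the alternate allele), the variant and gene identifiers, and both function names are hypothetical.

```python
from collections import defaultdict

# Genotypes are coded as the number of alternate alleles: 0, 1 or 2.
# Each sample is a dict mapping a variant ID to that count (hypothetical encoding).

def homozygous_recessive_candidates(child, parents, known_variants):
    """Variants homozygous in the affected child, heterozygous in every
    sequenced parent, and absent from a set of known variants."""
    return sorted(
        v for v, gt in child.items()
        if gt == 2
        and all(p.get(v, 0) == 1 for p in parents)
        and v not in known_variants
    )

def compound_het_candidate_genes(child, mother, father, gene_of):
    """Genes in which the child carries at least one heterozygous variant
    inherited only from the mother and at least one inherited only from the father."""
    from_mother, from_father = defaultdict(set), defaultdict(set)
    for v, gt in child.items():
        if gt != 1:
            continue
        m, f = mother.get(v, 0), father.get(v, 0)
        if m >= 1 and f == 0:
            from_mother[gene_of[v]].add(v)
        elif f >= 1 and m == 0:
            from_father[gene_of[v]].add(v)
    return sorted(set(from_mother) & set(from_father))

# Toy data: all identifiers are placeholders.
child = {"v1": 2, "v2": 2, "v3": 1, "v4": 1, "v5": 1}
mother = {"v1": 1, "v2": 1, "v3": 1, "v5": 1}
father = {"v1": 1, "v2": 2, "v4": 1, "v5": 1}
gene_of = {"v1": "GENE_A", "v2": "GENE_B", "v3": "GENE_C",
           "v4": "GENE_C", "v5": "GENE_D"}

hom_hits = homozygous_recessive_candidates(child, [mother, father], known_variants={"v9"})
comp_het_hits = compound_het_candidate_genes(child, mother, father, gene_of)
print(hom_hits, comp_het_hits)
```

In practice such filters would be restricted to the linkage or homozygosity region first, as in the studies above, which is what keeps the candidate list short.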

Exome Sequencing in Unrelated Cases Sharing Syndromic Forms of ID

WES has successfully been applied to several syndromic forms of ID. Here, patients with similar phenotypes are sequenced; after filtering for variants shared by all or the majority of affected individuals but which are not found in control datasets such as the 1000 Genomes Project or dbSNP, only a small handful of variants remain. Schinzel-Giedion syndrome is characterized by ID, distinctive facial features, multiple congenital abnormalities, and a high prevalence of tumors. Hoischen and colleagues (Hoischen, van Bon, et al. 2010) performed WES for four of these patients. After filtering known variants (dbSNP and variants observed in other WES projects from the same laboratory), only two genes were identified where all four affected individuals carried a mutation. One of these variants was of low quality, leading to the identification of SETBP1 as the causative gene. Targeted resequencing of this gene in nine additional cases identified a variant in SETBP1 in eight of these patients. Although in general the identification of genes for syndromic forms of ID is easier, as larger patient groups can be collected for study, it can sometimes prove challenging, since genetic heterogeneity can lead to false findings, particularly for disorders with a dominant inheritance pattern. A WES study of Kabuki syndrome, characterized by ID and distinctive facial features (Ng, Bigham, et al. 2010), identified only one gene, MUC16, with novel variants in all 10 unrelated patients included in the study. However, this gene was deemed an unlikely candidate because of its function and expression pattern. In addition, MUC16 is one of the largest genes of the genome and would be expected to show numerous variants based on random chance. Because the only gene carrying mutations in all affected individuals was an unlikely candidate, the authors then focused on nonsynonymous variants present in most but not all of the cases. A truncating mutation in MLL2 was identified in 7 of 10 patients. In two of the three remaining patients, a small indel, missed by the WES but identified from CGH arrays, was detected in MLL2, strongly implicating this gene in the etiology of Kabuki syndrome. Sanger sequencing of MLL2 identified mutations in 26 of 43 additional patients, and among the subset of samples (n = 12) with both parental DNAs available, all mutations were de novo.
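The filtering strategy used in these syndromic studies (discard variants present in control resources, then look for genes hit in all, or most, affected individuals) can be sketched as follows. This is a minimal illustration, not the code of the cited studies; the data layout, the function name, and the gene and variant identifiers are hypothetical placeholders.

```python
from collections import Counter

def genes_hit_in_cases(case_variants, known_variants, min_cases):
    """Return genes carrying a novel variant in at least `min_cases` patients.

    case_variants: one set of (gene, variant_id) pairs per patient.
    known_variants: variant IDs seen in control resources (to be excluded).
    """
    counts = Counter()
    for variants in case_variants:
        # Count each gene once per patient, after dropping known variants.
        genes = {gene for gene, vid in variants if vid not in known_variants}
        counts.update(genes)
    return sorted(g for g, n in counts.items() if n >= min_cases)

# Toy cohort of four unrelated patients; all names are placeholders.
cases = [
    {("GENE_X", "x1"), ("GENE_BIG", "b1"), ("GENE_Y", "y1")},
    {("GENE_X", "x2"), ("GENE_BIG", "b2")},
    {("GENE_BIG", "b3"), ("GENE_Y", "y1")},
    {("GENE_X", "x3"), ("GENE_BIG", "b4")},
]
known = {"y1"}  # e.g. a variant already present in a control database

in_all = genes_hit_in_cases(cases, known, min_cases=len(cases))
in_most = genes_hit_in_cases(cases, known, min_cases=3)
print(in_all, in_most)
```

Relaxing `min_cases` from "all" to "most" mirrors the Kabuki syndrome analysis: a very large gene can be hit in every patient by chance, while the true causal gene may be missed in a few patients because of undetected variants or genetic heterogeneity.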

De Novo Variants in ID The identification of mutations underlying nonsyndromic ID without any family history has so far been challenging. However, with access to DNA from both parents and the affected child, WES enables relatively straightforward identification of DNMs present in children but not parents. These variants are in general more deleterious than variants segregating in


the population because they have not been subjected to evolutionary selection, making them excellent candidates for causing sporadic severe disorders (Eyre-Walker and Keightley 2007). The first study of DNM rates in humans that was based on WGS suggested that on average 74 germline SNVs occur de novo in one individual’s genome but also that there is huge variability in the DNM rate between trios. However, these conclusions were drawn from only two trios and should be interpreted with caution (Conrad, Keebler, et al. 2011). Later studies have tackled this issue using only WES, and consensus estimates in larger datasets suggest that, on average, 0.82 to 1.3 DNMs are observed per exome (Neale, Kou, et al. 2012; O’Roak, Vives, et al. 2012). However, these estimates have been derived from trios where the offspring are affected with autism spectrum disorders (ASDs), and although the consensus seems to be that the rate of DNMs is no higher in affected cases than in controls, this caveat should be kept in mind in interpreting these results. Because of the slightly different GC content of the exome compared with the whole genome, the DNM rate for exomes is expected to be higher than for whole genomes. One estimate placed the genome-wide DNM rate at 1.2 × 10⁻⁸ per base per generation, whereas the exome mutation rate has been estimated at 1.5 × 10⁻⁸ (Neale, Kou, et al. 2012). Although estimates of the exomic DNM rate differ between studies, they are broadly of the same magnitude, around 1.2 to 2.2 × 10⁻⁸ per base per generation. Larger studies of population trios will undoubtedly narrow the confidence interval of these estimates. One very consistent finding from multiple studies is that paternal de novo SNVs are much more common than maternal ones, and there is a striking correlation between de novo SNV rate and paternal age (Neale, Kou, et al. 2012; O’Roak, Vives, et al. 2012), but there seems to be a large variation between trios (Conrad, Keebler, et al. 2011). Despite the overall consensus on the DNM rate of SNVs, the calling of small indels and CNVs still needs to be significantly improved to shed light on the rates of DNM in these classes of genetic variation. Previous studies have implied that de novo CNVs are causal in about 15% of cases of ID (Cook and Scherer 2008; de Vries, Pfundt, et al. 2005). DNMs should be particularly common in disorders that have a relatively high prevalence

in the population despite a strong reproductive disadvantage, because the early onset of the disorder will preclude transmission of the disease to subsequent generations. It is likely that DNMs will account for both very rare as well as more common phenotypes, depending on the size of the mutational target. Diseases caused by DNMs in just one gene will be rare, but if the mutational target is large enough, DNMs can cause even common disorders, such as ID, ASD, and schizophrenia. So far, only a small proof-of-concept WES study has been published assessing the role of DNMs in NSID, although bigger studies are ongoing (Vissers, de Ligt, et al. 2010). WES was performed in 10 trios with no family history of ID, no clear syndromic features, no evidence of Fragile X syndrome, and no de novo CNVs detected using CGH, in order to enrich for families with de novo SNVs. The study identified six likely causative variants in six different genes, of which two had previously been implicated in ID. Further work is still required to validate the causative roles of these variants, but the study provided proof of principle that WES is an appropriate tool to screen for DNMs. The same problem of establishing causality that affects inherited variants also applies to DNMs. Variants in the same gene in multiple cases need to be identified before any claims can be made about causality; particularly for disorders with large mutational targets, very large sample sizes are likely required to obtain sufficient power for replication. Often biological function is used to assess the significance of findings for DNM analyses, but for disorders with large mutational targets this can enrich for false-positive findings, such as assuming that all DNMs in brain-expressed genes are likely to be causative of ID. So far, mutational type seems to be the best predictor, with LoF variants the most likely to be causative, particularly if several LoF DNMs are observed in the same gene (Sanders, Murtha, et al. 2012).
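The trio-based search for DNMs, and the relationship between the per-base mutation rates and the per-exome counts quoted earlier in this section, can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions: the data layout, the function names, and the parental depth threshold are hypothetical, and the exome and genome sizes (about 30 Mb and 3.0 Gb per haploid copy) are round-number assumptions rather than values from the cited studies.

```python
def de_novo_candidates(child, mother, father, min_parent_depth=10):
    """Variants called in the child but absent from both parents.

    Each sample maps variant ID -> (alt_allele_count, read_depth).
    A parent with low depth at the site cannot rule out inheritance,
    so such sites are skipped (the threshold is an illustrative assumption).
    """
    hits = []
    for v, (gt, _depth) in child.items():
        if gt == 0:
            continue
        m_gt, m_dp = mother.get(v, (0, 0))
        f_gt, f_dp = father.get(v, (0, 0))
        if m_gt == 0 and f_gt == 0 and min(m_dp, f_dp) >= min_parent_depth:
            hits.append(v)
    return sorted(hits)

def expected_dnms(rate_per_base, haploid_bases):
    """Expected DNMs per child: two inherited copies, each mutating independently."""
    return 2 * rate_per_base * haploid_bases

# Toy trio; identifiers are placeholders.
child = {"v1": (1, 40), "v2": (1, 35), "v3": (1, 50), "v4": (2, 30)}
mother = {"v1": (0, 30), "v2": (1, 28), "v3": (0, 4), "v4": (1, 33)}
father = {"v1": (0, 25), "v2": (0, 31), "v3": (0, 29), "v4": (1, 30)}
dnm = de_novo_candidates(child, mother, father)

# Rates from the text; target sizes are assumed round numbers.
per_exome = expected_dnms(1.5e-8, 30_000_000)
per_genome = expected_dnms(1.2e-8, 3_000_000_000)
print(dnm, per_exome, per_genome)
```

Under these assumed target sizes, the expected counts come out at roughly 0.9 DNMs per exome and roughly 72 per genome, in line with the ranges quoted above. The depth check on the parents reflects the fact that an apparent DNM is often an inherited variant that was simply missed in a poorly covered parent.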

Autism Spectrum Disorders and Schizophrenia ASD and schizophrenia are among the most heritable neuropsychiatric disorders, but specific susceptibility genes remain elusive. Several monogenic forms of ASDs are known (Abrahams and Geschwind 2008), whereas no monogenic forms of schizophrenia have been reported. The role of CNVs as susceptibility

factors for ASDs and schizophrenia is well established. A substantial number of CNVs are de novo (Gilman, Iossifov, et al. 2011; Xu, Roos, et al. 2008). This has prompted several WES studies evaluating the role of DNMs, of which the first generation of studies has recently been published. Based on four studies (Iossifov, Ronemus, et al. 2012; Neale, Kou, et al. 2012; O’Roak, Vives, et al. 2012; Sanders, Murtha, et al. 2012) encompassing over 900 trios or quads (trios with one unaffected sibling sequenced), the overall rate of DNMs in individuals with ASDs is no higher than that in controls. As the number of sequenced trios keeps increasing, the probability of hitting the same genes in several studies also increases. Simulation experiments taking into account the distribution of gene sizes and GC content across the genome suggest that focus should be on the severe LoF variants, since two or more nonsense and/or splice-site DNMs are highly unlikely to occur in the same gene by chance. This conclusion remains robust to sample size and estimates of locus heterogeneity (Sanders, Murtha, et al. 2012), whereas if nonsynonymous sites are also included for sample sizes of 1,000 trios or more, at least four hits in one gene are needed to establish causality. These estimates vary strongly depending on the genetic model used and are not nearly as stable as the estimates for LoF variants. So far a total of five genes, CHD8, DYRK1A, KATNAL2, SCN2A, and POGZ, have two LoF de novo hits in the 900 published ASD trios. Further, these studies have reported that the proteins encoded by genes with DNMs are more closely linked by protein-protein interaction networks than similarly sized sets of random genes. Especially intriguing is the finding, from a study of over 300 ASD trios, that many of the genes with DNMs are linked with FMRP, the protein encoded by FMR1, which is very robustly linked to ASDs and involved in synaptic plasticity.
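The intuition behind the recurrence argument above, that two LoF DNMs in the same gene are unlikely to arise by chance, can be illustrated with a simple Poisson calculation. This is a deliberately simplified stand-in for the simulations described in the text: it assumes one uniform per-gene LoF DNM rate, whereas the cited analyses accounted for gene size and GC content. The per-gene rate used here is a placeholder and not a published value, and the function names are hypothetical.

```python
import math

def prob_at_least_k_hits(expected_hits, k):
    """P(X >= k) for X ~ Poisson(expected_hits)."""
    below_k = sum(
        math.exp(-expected_hits) * expected_hits**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below_k

def expected_hits_in_gene(n_trios, per_gene_rate):
    """Expected number of LoF DNMs in one gene across a cohort of trios."""
    return n_trios * per_gene_rate

# Assumed, illustrative inputs: 900 trios and a per-gene LoF DNM
# probability of 5e-6 per child (placeholder value).
lam = expected_hits_in_gene(n_trios=900, per_gene_rate=5e-6)
p_two_or_more = prob_at_least_k_hits(lam, k=2)
print(lam, p_two_or_more)
```

Under this toy model the chance of seeing two or more LoF DNMs in any one given gene is very small, which is why recurrent LoF hits carry so much weight. A real analysis would give each gene its own expected rate, since large genes accumulate chance hits far more readily than small ones.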
To date, the studies of DNMs in schizophrenia are not as extensive as the data for ASDs, although several large-scale studies are under way. One study of 14 trios identified 15 DNMs (Girard, Gauthier, et  al. 2011). These included four nonsense variants and eleven missense variants. Unsurprisingly for such a limited dataset, no gene was hit twice, and none of the genes had previously been implicated in schizophrenia etiology. The DNM rate was reported to be significantly higher than any of the DNM


rates reported in population studies, which led the authors to conclude that there is a DNM burden in schizophrenia. However, the conclusions were drawn on a very limited number of trios; larger replication studies are needed to validate this observation. The second study involved 53 schizophrenia trios (Xu, Roos, et al. 2011) and identified a total of 40 DNMs in as many genes in 27 individuals. The authors also concluded that DNMs play an important role in schizophrenia and estimated that the mutational target is large, which would explain the high incidence of the disorder worldwide. The third study was larger and included WES of 231 trios with schizophrenia (Xu, Ionita-Laza, et al. 2012). The authors reported an excess of both nonsynonymous and LoF variants in cases, but the control group consisted of only 34 trios. One nonsynonymous and one LoF DNM were identified in each of four genes (LAMA2, DPYD, TRRAP, and VPS39). No gene with two LoF variants was identified. Interestingly, five genes (DGCR2, TOP3B, CIT, STAG1, and SMAP2) were identified where a missense DNM and a de novo CNV were present in the same individual. It seems possible that DNMs, in the form of CNVs, small indels, and SNVs, play a role in ASDs and schizophrenia. It seems likely that the risk is not conferred by an overall increase in mutation rate but by the severe interruption of genes involved in brain development and function. The next question that needs to be addressed is what proportion of these disorders can be explained by these highly penetrant variants. Current studies have found likely DNMs in only a small fraction of the studied individuals, but they are likely to suffer severely from lack of power and false negatives, since most studies have assumed missense variants to be benign. However, several ID genes with missense variants in conserved positions have been identified, so it seems likely that this will also be the case for ASDs and schizophrenia as well as other complex disorders.
Further, it seems probable that many variants will be noncoding regulatory variants, which are beyond the scope of WES studies. More data are also needed to determine whether these variants are truly monogenic risk variants (i.e., fully penetrant and sufficient to develop the disease). Interestingly, simulations have been reported where models assuming a large number (such as 100)  of rare, fully penetrant monogenic


genes are inconsistent with the observed data, whereas models in which functional mutations in hundreds of genes increase the risk of the disease by 10- or 20-fold fit the observed data much better (Neale, Kou, et al. 2012). This could suggest that although DNMs play a role in ASDs (and possibly other complex diseases also), they are not necessarily sufficient for disease. There is also evidence suggesting that common variants confer susceptibility to ASDs (Klei, Sanders, et al. 2012), although all GWASs so far have failed to identify genome-wide hits, most likely owing to small sample sizes (Ma, Salyakina, et al. 2009; Wang, Zhang, et al. 2009; Weiss, Arking, et al. 2009). In schizophrenia, a large case-control GWAS consortium has identified dozens of robustly associated loci in work that is as yet unpublished. This suggests that several different study designs are needed to identify all possible risk factors for these diseases. A recent WES study assessed the role of rare variation in 166 individuals with schizophrenia and subsequently genotyped 2,617 individuals with schizophrenia and 1,800 controls. The results suggested that schizophrenia susceptibility is unlikely to be significantly affected by low-frequency variants that are just outside the range of detectability using GWAS (Need, McEvoy, et al. 2012). The study did, however, detect several variants that were identified in a small number of cases and no controls. These variants could possibly play a role in disease etiology. It will also be interesting to see if DNMs in the same genes cause disease all across the neuropsychiatric spectrum. It is generally accepted that CNVs in the same genes can cause susceptibility for ASDs, schizophrenia, and ID (Mefford, Batshaw, et al. 2012). However, large sets of tens of thousands of cases have been genotyped on comparative genomic hybridization arrays, making these comparisons statistically powerful.
It will take time to accumulate exome data from such large datasets to make comparison possible across disorders. Despite a significant overlap between rare variants, a recent study could not detect significant overlap of common variation between ASDs and schizophrenia (Vorstman, Anney, et al. 2012). The contribution of DNMs to late-onset disorders, such as Alzheimer’s disease (AD), will be harder to evaluate, since the analysis requires DNA from both parents and, for late-onset disorders, the parents are usually no longer available for study. However, late-onset

disorders might not be under such strict selective pressure as early-onset ones, allowing for inherited variants to play a larger role in disease etiology.

Monogenic Epilepsies

Epileptic seizures are a part of many syndromic developmental diseases but can also be the main or only symptom, thus being nonsyndromic. A small percentage of these genetic epilepsy syndromes, known as the rare epilepsy syndromes (RESs), are monogenic. By studying these Mendelian disorders, mainly via parametric linkage analysis and positional candidate gene sequencing in large multiplex families, the main concepts of the genetic architecture of epilepsies we have today were unraveled. Many of the known genes implicated in the development of Mendelian forms of epilepsies encode subunits of ion channels, although it is becoming more and more evident that risk variants are not limited to this class of genes. Next to these familial RESs, the availability of genome-wide sequencing technologies has finally made it possible to study the interesting group of epileptic encephalopathies (EEs) genetically in a systematic manner. EEs are severe disorders with early onset, often within the first year of life. They present as distinct epilepsy syndromes, often in combination with dysfunctions in the brain, such as ID and spasticity. These disorders severely interfere with reproductive fitness, and evolution strongly selects against the transmission of mutations. Most EE patients present as isolated cases owing to heterozygous de novo dominant mutations. The concept that de novo dominant mutations underlie EE was firmly proved by our observation that de novo LoF mutations in the SCN1A gene result in Dravet syndrome, the prototypical EE (Claes, Del-Favero, et al. 2001). To date, several distinct EEs are known to be caused by DNMs in genes such as STXBP1, KCNQ2, and many others (Saitsu, Kato, et al. 2008; Weckhuysen, Mandelstam, et al.). Many studies sequencing EE patient-parent trios to identify novel genes harboring causal DNMs are in progress, with the aim of gaining more insight into the missing heritability of different EEs.
On the other hand, the more common genetic epilepsy syndromes are usually considered to be complex genetic traits. Recently two large-scale studies (Klassen, Davis, et al. 2011; Heinzen, Depondt, et al. 2012) reported the sequencing of over a hundred sporadic idiopathic epilepsy patients. Klassen et al. focused on ion channel genes, whereas Heinzen and colleagues used WES. The Heinzen group tried to replicate almost 4,000 identified candidate epilepsy-susceptibility variants in 878 cases. Both studies failed to convincingly identify any disease-associated variant. Both studies were small, but they suggest a similar picture as in many other complex traits: much larger study samples are needed to shed light on the potential contribution of low-frequency variants to epilepsy. Such studies are in progress, and we expect to have results from them in the next two years.

Alzheimer’s Disease

AD is the most common form of dementia in the elderly. It is known that low-frequency and rare variants can contribute to the risk of AD, especially for early-onset forms of the disease (Goate, Chartier-Harlin, et al. 1991; Raux, Guyant-Marechal, et al. 2005; Sherrington, Rogaev, et al. 1995). Less is known about the genetics of late-onset AD, the more common form of the disorder. A WES study of 14 individuals with early-onset AD revealed nonsense or missense mutations in SORL1 in 5 individuals (Pottier, Hannequin, et al. 2012). The mutations were identified by using a simple filtering strategy where all variants were filtered against dbSNP, 1000 Genomes, HapMap, and an in-house database of 72 WES samples. After validation of variants in genes where multiple individuals were carrying a missense or LoF variant, SORL1 was the gene with the largest number of variants. The protein encoded by this gene binds APP, which was previously known to confer risk for AD. Analysis of 1,500 controls confirmed that the SORL1 variants were not present in the control population. One of the sequenced individuals also had an affected mother, who also had the mutation. SORL1 was sequenced in another 15 index cases, and two more mutations were identified. This study shows that genes can be identified even in sample sets with genetic heterogeneity if several individuals share a mutation in the same gene. The most informative studies of late-onset AD have been reported in the Icelandic population. Iceland is a population isolate with a well-known genetic history (Helgason, Yngvadottir, et al. 2005). Much of the population has been genotyped using SNP chips,


and Iceland has proved to be a treasure chest for GWASs. The extensive genetic information available, combined with good genealogical records, has also proven to be a useful resource for NGS studies. Some members of this population have undergone WGS, followed by imputation into essentially the entire Icelandic population. The genetic information combined with easily accessible phenotypic data has led to the identification of numerous susceptibility variants, both common and rare, for complex disorders. This approach identified a variant in TREM2 associated with late-onset AD (Jonsson, Stefansson, et al. 2012). WGS of 2,261 Icelanders was used to identify over 34 million variants, 190,000 of them functional. These variants were imputed into 3,550 patients with AD and 1,236 controls who were over 85 years of age and had no symptoms of AD. When a case-control study was performed, only one marker in addition to the already known APOE locus, a substitution of histidine for arginine in TREM2, reached genome-wide significance. This rare variant, with a population allele frequency of 0.63%, confers significant risk for AD with an odds ratio (OR) of 2.92. The finding was replicated in 2,000 cases from other populations with a combined OR of 2.83. Interestingly, compared with noncarriers, carriers of the variant between 80 and 100 years of age with no diagnosis of AD also show worse cognitive function. Another study in the Icelandic population identified a low-frequency variant in the APP gene (population frequency < 0.5%) that is protective against AD (Jonsson, Atwal, et al. 2012). Variants in APP have previously been linked to early-onset monogenic forms of AD (Alonso Vilatela, Lopez-Lopez, et al. 2012). The variant was identified from WGS of 1,795 Icelanders and was then chip genotyped in 71,743 individuals and subsequently imputed into 296,496 relatives. These two reports demonstrate the power of imputation in a well-organized cohort in a population isolate.
The latter also demonstrates that variants within the same gene can be either predisposing or protective. Family-based studies have not been quite as successful in AD. A study of an individual with AD from a Turkish consanguineous family with a complex history of neurological and immunological disorders identified a nonsynonymous mutation in NOTCH3, which has previously been linked with cerebral autosomal dominant arteriopathy with subcortical infarcts and


leukoencephalopathy (CADASIL) (Guerreiro, Lohmann, et  al. 2012). However, the mutation did not cosegregate with the neurological phenotype in the family, leaving the results of the study inconclusive.
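The odds ratios quoted above for the TREM2 variant summarize how much more common carriers are among cases than among controls. A minimal sketch of the basic calculation from a 2 × 2 table is given below. The carrier counts are hypothetical and chosen only to illustrate the arithmetic; the cited study worked with imputed genotypes and more elaborate association statistics, so this is not a reconstruction of its analysis.

```python
import math

def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds ratio with a 95% Wald confidence interval from a 2x2 table."""
    or_ = (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log = math.sqrt(
        1 / case_carriers + 1 / case_noncarriers
        + 1 / control_carriers + 1 / control_noncarriers
    )
    low = math.exp(math.log(or_) - 1.96 * se_log)
    high = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, low, high

# Hypothetical counts chosen only to illustrate the calculation.
or_, ci_low, ci_high = odds_ratio(30, 970, 10, 990)
print(round(or_, 2), round(ci_low, 2), round(ci_high, 2))
```

For rare variants the carrier counts are small, so the confidence interval is wide even when the point estimate is large; this is one reason why imputation into very large cohorts, as in the Icelandic studies, is so valuable.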

FUTURE DIRECTIONS

Currently the true success in NGS studies has been achieved for monogenic diseases. However, the rapid reduction in sequencing costs makes larger studies possible, promising hope also for the identification of variants conferring risk for more complex disorders. It does not seem too implausible to predict that the course of sequencing studies will closely resemble that of GWASs a few years ago. In the beginning, the small GWASs identified only risk variants conferring relatively high risk. When more data were produced and pooled, robust associations were also identified for variants with very small effect sizes. Increase in sample size is only one avenue of increasing power to detect genetic variants associated with traits and disorders. Sequencing technology keeps improving, and current data already allow for relatively robust genotype determination for SNVs. However, better data and genotyping methods are needed to reliably identify other types of variation from NGS data, such as indels and copy number variants. Improvements in data quality include increased read lengths as well as improvement of the sequencing chemistries. In particular, the “third generation” of sequencing technologies offers great promise for single-molecule sequencing. This would not only reduce the amount of template required but also significantly reduce the problems of variable read depth caused by the capture and amplification steps. To be able to identify rare disease-associated variants, good-quality datasets with low false-positive and false-negative rates are needed. It seems very likely that data of very high quality from a technical viewpoint can soon be achieved. However, at present the bigger challenge is our limited knowledge of the functionality of the genome. Much work is still needed when it comes to the annotation of identified variants. Improved methods of predicting the consequences of variants on protein structure are needed.
Increased understanding of the expression patterns (both spatial and temporal) of transcripts can help determine which variants could possibly be involved in the pathogenesis of genetic disorders. Improved

predictions of the pathogenicity of missense and splice variants would decrease the need for downstream in vitro assays to determine the true functional consequences of variants. Current knowledge has only scratched the surface of consequence prediction for noncoding variants, although large consortia such as the ENCODE project are starting to shed light on those parts of the genome about which we currently understand very little. It is easy to get stuck on the technical aspects and problems of NGS and not see the forest for the trees. Current evidence seems quite convincing that rare variation will play a role in many neurological and neuropsychiatric phenotypes. Until the advent of NGS, this type of variation could be accessed only for very small targeted regions. Now large patient cohorts can be sequenced, and it will be possible to assess the role of rare variation in these phenotypes. This also poses a challenge to improve phenotyping, as the subgrouping of patients based on endophenotypes could identify groups who share a genetic etiology, thus improving the probability of identifying disease-associated variants, as has been demonstrated by studies of syndromic ID. For many neuropsychiatric disorders, this is extremely challenging, as there are no biomarkers for diagnosis; that is, the definition of the phenotype relies entirely on observational data. The analysis of rare variants will also require ever-increasing collaboration among researchers. If mutations are identified in one individual or family and genetic heterogeneity in the disorder is high, as has been seen for ID, large replication cohorts will be needed to identify other patients sharing the same genetic etiology. Good examples have already been set up, such as the DECIPHER online repository containing genotype and phenotype information of research subjects with developmental disorders (Firth, Richards, et al. 2009).
Researchers all over the world can access the database and search for other patients with the same genetic variants. This has led to the identification of a number of new ID syndrome genes (Firth, Richards, et al. 2009). NGS is already used in a clinical setting for gene identification in Mendelian diseases, but there are many unanswered questions (Anderson and Schrijver 2010). As in the use of NGS in a research setting, one of the main challenges remaining is the interpretation of results. Proponents of NGS in a clinical setting argue

that once NGS data have been generated, such data could be a useful resource for the individual all through his or her life. Sequence data could be useful for other purposes besides the identification of disease genes, such as personal pharmacogenomics. The many pros of using NGS in the clinical setting are weighed down by a large number of problems and unsolved questions. How can quality control of such large datasets be guaranteed to the same level as current genetic tests, which usually produce little data that can be visually inspected? How are results to be validated? How are incidental findings handled? What will be the impact on relatives who might not wish to know about possible genetic susceptibility factors for diseases? There is no reason to assume that NGS cannot be included as part of clinical testing in cases where this procedure provides added clinical utility compared with targeted tests, and the problems identified now should not be thought of as reasons to forever ban the clinical use of NGS; however, such problems must be solved before the widespread use of NGS in medical care can become a reality.

SUMMARY

Current technology allows for the interrogation of every base pair in an individual’s genome. Many successful reports of gene identification using NGS have already been published, mostly for monogenic disorders. For several neurological and neuropsychiatric disorders, such as ID, autism, and schizophrenia, the technology has been applied successfully to identify genes. Technological advances provide ever-improving data quality, but analytic approaches must keep up with the technological development to be able to make use of the data and convert the raw sequence to biological understanding. Currently, NGS offers a promise that is starting to be realized in a research setting, whereas the routine use of NGS in a clinical setting still faces several challenges.

REFERENCES

Abecasis, G. R., Auton, A., et al. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422), 56–65.
Abrahams, B. S., & Geschwind, D. H. (2008). Advances in autism genetics: on the threshold of a new neurobiology. Nat Rev Genet 9(5), 341–355.
Adessi, C., Matton, G., et al. (2000). Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Res 28(20), E87.
Adzhubei, I. A., Schmidt, S., et al. (2010). A method and server for predicting damaging missense mutations. Nat Methods 7(4), 248–249.
Alonso Vilatela, M. E., Lopez-Lopez, M., et al. (2012). Genetics of Alzheimer’s disease. Arch Med Res 43(8), 622–631.
Anderson, M. W., & Schrijver, I. (2010). Next generation DNA sequencing and the future of genomic medicine. Genes 1, 38–69.
Asan, Xu, Y., et al. (2011). Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol 12(9), R95.
Babenko, A. P., Polak, M., et al. (2006). Activating mutations in the ABCC8 gene in neonatal diabetes mellitus. N Engl J Med 355(5), 456–466.
Bamshad, M. J., Ng, S. B., et al. (2011). Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 12(11), 745–755.
Bellus, G. A., Hefferon, T. W., et al. (1995). Achondroplasia is defined by recurrent G380R mutations of FGFR3. Am J Hum Genet 56(2), 368–373.
Boycott, K. M., Parboosingh, J. S., et al. (2008). Clinical genetics and the Hutterite population: a review of Mendelian disorders. Am J Med Genet A 146A(8), 1088–1098.
Bradfield, J. P., Qu, H. Q., et al. (2011). A genome-wide meta-analysis of six type 1 diabetes cohorts identifies multiple associated loci. PLoS Genet 7(9), e1002293.
Braslavsky, I., Hebert, B., et al. (2003). Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 100(7), 3960–3964.
Caliskan, M., Chong, J. X., et al. (2011). Exome sequencing reveals a novel mutation for autosomal recessive non-syndromic mental retardation in the TECR gene on chromosome 19p13. Hum Mol Genet 20(7), 1285–1289.
Claes, L., Del-Favero, J., et al. (2001). De novo mutations in the sodium-channel gene SCN1A cause severe myoclonic epilepsy of infancy. Am J Hum Genet 68(6), 1327–1332.
Coffey, A. J., Kokocinski, F., et al. (2011). The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet 19(7), 827–831.
Conrad, D. F., Keebler, J. E., et al. (2011). Variation in genome-wide mutation rates within and between human families. Nat Genet 43(7), 712–714.
Cook, E. H., Jr., & Scherer, S. W. (2008). Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215), 919–923.
Danecek, P., Auton, A., et al. (2011). The variant call format and VCF tools. Bioinformatics 27(15), 2156–2158.
de Vries, B. B., Pfundt, R., et al. (2005). Diagnostic genome profiling in mental retardation. Am J Hum Genet 77(4), 606–616.
DePristo, M. A., Banks, E., et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43(5), 491–498.
Dressman, D., Yan, H., et al. (2003). Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci U S A 100(15), 8817–8822.
Dunham, I., Kundaje, A., et al. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74.
Edvardson, S., Shaag, A., et al. (2010). Joubert syndrome 2 (JBTS2) in Ashkenazi Jews is associated with a TMEM216 mutation. Am J Hum Genet 86(1), 93–97.
Eyre-Walker, A., & Keightley, P. D. (2007). The distribution of fitness effects of new mutations. Nat Rev Genet 8(8), 610–618.
Fedurco, M., Romieu, A., et al. (2006). BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res 34(3), e22.
Firth, H. V., Richards, S. M., et al. (2009). DECIPHER: database of chromosomal imbalance and phenotype in humans using Ensembl resources. Am J Hum Genet 84(4), 524–533.
Flicek, P., & Birney, E. (2009). Sense from sequence reads: methods for alignment and assembly. Nat Methods 6(11 Suppl), S6–S12.
Froguel, P., Vaxillaire, M., et al. (1992). Close linkage of glucokinase locus on chromosome 7p to early-onset non-insulin-dependent diabetes mellitus. Nature 356(6365), 162–164.
Gilissen, C., Hoischen, A., et al. (2011). Unlocking Mendelian disease using exome sequencing. Genome Biol 12(9), 228.
Gilman, S. R., Iossifov, I., et al. (2011). Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron 70(5), 898–907.
Girard, S. L., Gauthier, J., et al. (2011). Increased exonic de novo mutation rate in individuals with schizophrenia. Nat Genet 43(9), 860–863.
Gloyn, A. L., Pearson, E. R., et al. (2004). Activating mutations in the gene encoding the ATP-sensitive potassium-channel subunit Kir6.2 and permanent neonatal diabetes. N Engl J Med 350(18), 1838–1849.
Goate, A., Chartier-Harlin, M. C., et al. (1991). Segregation of a missense mutation in the amyloid precursor protein gene with familial Alzheimer’s disease. Nature 349(6311), 704–706.
Guerreiro, R. J., Lohmann, E., et al. (2012). Exome sequencing reveals an unexpected genetic cause of disease: NOTCH3 mutation in a Turkish family with Alzheimer’s disease. Neurobiol Aging 33(5), 1008 e17–e23.
Harrow, J., Frankish, A., et al. (2012). GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res 22(9), 1760–1774.
Heinzen, E. L., Depondt, C., et al. (2012). Exome sequencing followed by large-scale genotyping fails to identify single rare variants of large effect in idiopathic generalized epilepsy. Am J Hum Genet 91(2), 293–302.
Helgason, A., Yngvadottir, B., et al. (2005). An Icelandic example of the impact of population structure on association studies. Nat Genet 37(1), 90–95.
Hoischen, A., van Bon, B. W., et al. (2010). De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat Genet 42(6), 483–485.
Holm, H., Gudbjartsson, D. F., et al. (2011). A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat Genet 43(4), 316–320.
Iossifov, I., Ronemus, M., et al. (2012). De novo gene disruptions in children on the autistic spectrum. Neuron 74(2), 285–299.
Jonsson, T., Atwal, J. K., et al. (2012). A mutation in APP protects against Alzheimer’s disease and age-related cognitive decline. Nature 488(7409), 96–99.
Jonsson, T., Stefansson, H., et al. (2012). Variant of TREM2 associated with the risk of Alzheimer’s disease. N Engl J Med 368(2), 107–116.
Kao, W. C., Stevens, K., et al. (2009). BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res 19(10), 1884–1895.
Kircher, M., Stenzel, U., et al. (2009). Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10(8), R83.
Klassen, T., Davis, C., et al. (2011). Exome sequencing of ion channel genes reveals complex profiles confounding personal risk assessment in epilepsy. Cell 145(7), 1036–1048.
Klei, L., Sanders, S. J., et al. (2012). Common genetic variants, acting additively, are a major source of risk for autism. Mol Autism 3(1), 9.
Kozomara, A., & Griffiths-Jones, S. (2011). miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39(Database issue), D152–D157.
Krawitz, P. M., Schweiger, M. R., et al. (2010). Identity-by-descent filtering of exome sequence data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat Genet 42(10), 827–829.
Kryukov, G. V., Pennacchio, L. A., et al. (2007). Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am J Hum Genet 80(4), 727–739.
Lander, E. S., Linton, L. M., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409(6822), 860–921.
Le, S. Q., & Durbin, R. (2010). SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res 21(6), 952–960.
Li, H., Handsaker, B., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16), 2078–2079.
Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5), 473–483.
Li, R., Yu, C., et al. (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966–1967.
Lin, Y., Li, J., et al. (2011). Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15), 2031–2037.
Ma, D., Salyakina, D., et al. (2009). A genome-wide association study of autism reveals a common novel risk locus at 5p14.1. Ann Hum Genet 73(Pt 3), 263–273.
MacArthur, D. G., Balasubramanian, S., et al. (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science 335(6070), 823–828.
Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387–402.
Margulies, M., Egholm, M., et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376–380.
Mefford, H. C., Batshaw, M. L., et al. (2012). Genomics, intellectual disability, and autism. N Engl J Med 366(8), 733–743.
Metzker, M. L. (2010). Sequencing technologies—the next generation. Nat Rev Genet 11(1), 31–46.
Meuzelaar, L. S., Lancaster, O., et al. (2007). MegaPlex PCR: a strategy for multiplex amplification. Nat Methods 4(10), 835–837.
Najmabadi, H., Hu, H., et al. (2011). Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature 478(7367), 57–63.
Neale, B. M., Kou, Y., et al. (2012). Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485(7397), 242–245.
Need, A. C., McEvoy, J. P., et al. (2012). Exome sequencing followed by large-scale genotyping suggests a limited role for moderately rare risk factors of strong effect in schizophrenia. Am J Hum Genet 91(2), 303–312.
Ng, P. C., & Henikoff, S. (2003). SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31(13), 3812–3814.
Ng, S. B., Bigham, A. W., et al. (2010). Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42(9), 790–793.
Nielsen, R., Paul, J. S., et al. (2011). Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 12(6), 443–451.
Norio, R. (2003). The Finnish Disease Heritage III: the individual diseases. Hum Genet 112(5-6), 470–526.
O’Roak, B. J., Vives, L., et al. (2012). Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 485(7397), 246–250.
Pottier, C., Hannequin, D., et al. (2012). High frequency of potentially pathogenic SORL1 mutations in autosomal dominant early-onset Alzheimer disease. Mol Psychiatry 17(9), 875–879.
Pruitt, K. D., Harrow, J., et al. (2009). The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 19(7), 1316–1323.
Pruitt, K. D., Tatusova, T., et al. (2012). NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res 40(Database issue), D130–D135.
Quinlan, A. R., Stewart, D. A., et al. (2008). Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5(2), 179–181.
Rabbani, B., Mahdieh, N., et al. (2012). Next-generation sequencing: impact of exome sequencing in characterizing Mendelian disorders. J Hum Genet 57(10), 621–632.
Raux, G., Guyant-Marechal, L., et al. (2005). Molecular diagnosis of autosomal dominant early onset Alzheimer’s disease: an update. J Med Genet 42(10), 793–795.
Ronaghi, M., Uhlen, M., et al. (1998). A sequencing method based on real-time pyrophosphate. Science 281(5375), 363, 365.
Ropers, H. H. (2010). Genetics of early onset cognitive impairment. Annu Rev Genomics Hum Genet 11, 161–187.
Ropers, H. H., & Hamel, B. C. (2005). X-linked mental retardation. Nat Rev Genet 6(1), 46–57.
Saitsu, H., Kato, M., et al. (2008). De novo mutations in the gene encoding STXBP1 (MUNC18-1) cause early infantile epileptic encephalopathy. Nat Genet 40(6), 782–788.
Sanders, S. J., Murtha, M. T., et al. (2012). De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 485(7397), 237–241.
Saxena, R., Elbers, C. C., et al. (2012). Large-scale gene-centric meta-analysis across 39 studies identifies type 2 diabetes loci. Am J Hum Genet 90(3), 410–425.
Shendure, J., Porreca, G. J., et al. (2005). Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309(5741), 1728–1732.
Sherrington, R., Rogaev, E. I., et al. (1995). Cloning of a gene bearing missense mutations in early-onset familial Alzheimer’s disease. Nature 375(6534), 754–760.
Shiang, R., Thompson, L. M., et al. (1994). Mutations in the transmembrane domain of FGFR3 cause the most common genetic form of dwarfism, achondroplasia. Cell 78(2), 335–342.
Stenson, P. D., Ball, E. V., et al. (2009). The Human Gene Mutation Database: providing a comprehensive central mutation database for molecular diagnostics and personalized genomics. Hum Genomics 4(2), 69–72.
Sulem, P., Gudbjartsson, D. F., et al. (2011). Identification of low-frequency variants associated with gout and serum uric acid levels. Nat Genet 43(11), 1127–1130.
Sulonen, A. M., Ellonen, P., et al. (2011). Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol 12(9), R94.
Tarpey, P. S., Smith, R., et al. (2009). A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation. Nat Genet 41(5), 535–543.
Tennessen, J. A., Bigham, A. W., et al. (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337(6090), 64–69.
Teslovich, T. M., Musunuru, K., et al. (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466(7307), 707–713.
Tomkinson, A. E., Vijayakumar, S., et al. (2006). DNA ligases: structure, reaction mechanism, and function. Chem Rev 106(2), 687–699.
Treffer, R., & Deckert, V. (2010). Recent advances in single-molecule sequencing. Curr Opin Biotechnol 21(1), 4–11.
Valouev, A., Ichikawa, J., et al. (2008). A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18(7), 1051–1063.
Varley, K. E., & Mitra, R. D. (2008). Nested Patch PCR enables highly multiplexed mutation discovery in candidate genes. Genome Res 18(11), 1844–1850.
Vissers, L. E., de Ligt, J., et al. (2010). A de novo paradigm for mental retardation. Nat Genet 42(12), 1109–1112.
Vorstman, J. A., Anney, R. J., et al. (2012). No evidence that common genetic risk variation is shared between schizophrenia and autism. Am J Med Genet B Neuropsychiatr Genet.
Wang, K., Zhang, H., et al. (2009). Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature 459(7246), 528–533.
Weckhuysen, S., Mandelstam, S., et al. (2012). KCNQ2 encephalopathy: emerging phenotype of a neonatal epileptic encephalopathy. Ann Neurol 71(1), 15–25.
Weiss, L. A., Arking, D. E., et al. (2009). A genome-wide linkage and association scan reveals novel loci for autism. Nature 461(7265), 802–808.
Wu, H., Irizarry, R. A., et al. (2010). Intensity normalization improves color calling in SOLiD sequencing. Nat Methods 7(5), 336–337.
Xu, B., Ionita-Laza, I., et al. (2012). De novo gene mutations highlight patterns of genetic and neural complexity in schizophrenia. Nat Genet 44(12), 1365–1369.
Xu, B., Roos, J. L., et al. (2011). Exome sequencing supports a de novo mutational paradigm for schizophrenia. Nat Genet 43(9), 864–868.
Xu, B., Roos, J. L., et al. (2008). Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet 40(7), 880–885.
Yamagata, K., Furuta, H., et al. (1996). Mutations in the hepatocyte nuclear factor-4alpha gene in maturity-onset diabetes of the young (MODY1). Nature 384(6608), 458–460.
Zhang, W., Chen, J., et al. (2011). A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One 6(3), e17915.

2 Epigenomics: An Overview
KEVIN HUANG AND GUOPING FAN

INTRODUCTION

Epigenetics is the study of mechanisms that can alter gene expression without changing the underlying DNA sequence. Under this broad term, many mechanisms have been considered epigenetic, including DNA methylation, histone modifications, and noncoding RNA. Often these epigenetic mechanisms work in concert to influence both gene expression and each other. Epigenetic landscapes are extremely complex, with a vast spectrum of variations that are used to fine-tune gene expression. In order to fully understand the regulatory domains in the genome, all epigenetic regulatory forces must be considered. Recent advances in high-throughput technology have afforded the opportunity to survey epigenetic features across entire genomes, bringing forth a vibrant field of “epigenomics”-based research. This chapter focuses on how epigenetic mechanisms shape the transcriptome, the tools we use to study these pathways on a genome scale, and the insights we have gained from these epigenomic-driven studies. Throughout the text, we highlight the impact and relevance of epigenomic studies on illuminating novelties in neurobiology.

HIGH-THROUGHPUT TECHNOLOGIES PAVING THE WAY FOR EPIGENOMICS

Microarrays

The advent of microarray technology provided a powerful method of measuring multiple events in a single experiment. Expanding on classical complementary hybridization-based detection methods, the microarray platform relies on hybridization of fluorescently labeled DNA to predefined probes that uniquely represent portions of the genome (Heller, 2002; Schulze & Downward, 2001; Young, 2000). DNA probes are usually evenly spaced and attached to a solid surface commonly referred to as a chip. Because of the limited size of the chip, only a finite number of probes can be placed onto a single chip. For experimental designs that attempt to exhaustively represent the genome by placing probes every few kilobases across the genome (so-called tiling arrays), multiple chips are required. Other experimental designs that focus on promoters alone may require fewer chips to fully represent all mammalian promoters. Microarrays had clear advantages compared with previous approaches, in particular the ability to sample large portions of the genome in a cost-effective and less time-consuming manner.
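The trade-off between probe spacing and chip count described above is simple arithmetic, and a minimal sketch may make it concrete. The function name and every number in the usage lines (genome size, spacing, probes per chip) are hypothetical values chosen for illustration, not specifications of any real array platform.

```python
def chips_needed(genome_size_bp: int, probe_spacing_bp: int, probes_per_chip: int) -> int:
    """Estimate how many chips a tiling design needs (illustrative only)."""
    if min(genome_size_bp, probe_spacing_bp, probes_per_chip) <= 0:
        raise ValueError("all arguments must be positive")
    # One probe per spacing interval, rounded up (integer ceiling division).
    total_probes = -(-genome_size_bp // probe_spacing_bp)
    # Probes are packed onto chips of fixed capacity, again rounded up.
    return -(-total_probes // probes_per_chip)


# Hypothetical whole-genome tiling design: 3 Gb genome, one probe every 2 kb,
# and an assumed capacity of 400,000 probes per chip.
whole_genome = chips_needed(3_000_000_000, 2_000, 400_000)

# Hypothetical promoter-only design: 25,000 promoters of 2 kb, one probe every 200 bp.
promoters_only = chips_needed(25_000 * 2_000, 200, 400_000)
```

Under these assumed numbers the whole-genome design needs several chips while the promoter-only design fits on one, which is the qualitative point made in the text.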

Next-Generation Sequencing

In more recent years, high-throughput DNA sequencing has supplanted most microarray technologies for many reasons, including higher throughput, sensitivity, and accuracy (Metzker, 2010; Shendure & Ji, 2008). However, it is worth mentioning that many laboratories continue to use microarray-based platforms, primarily because the analytical tools are mature (Allison et al., 2006; Gentleman et al., 2004; Li, 2008) and because costs are still lower for studies with sample sizes in the hundreds or thousands. Sequencing offers many advantages over the microarray platform, including base-pair resolution and unbiased surveying of the genome. In general, all library construction protocols share fundamental commonalities (Metzker, 2005, 2010; Shendure & Ji, 2008). Since the goal is to generate short reads, the majority of protocols include DNA or RNA fragmentation to a desired size distribution, adapter ligation, and PCR amplification. The PCR step is necessary because most library construction methods yield small amounts of DNA that may not be easily detected on the sequencer. On the other hand, the PCR step also remains one of the banes of library construction because PCR amplification introduces a variety of biases that confound quantitative analyses. Indeed, several groups are working on methods for circumventing the PCR step in library construction, which will simplify library construction and data analysis in the future.
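One widely used downstream mitigation for PCR bias, which the chapter itself does not describe, is to collapse reads that align to the same position and strand, on the assumption that they are copies of one original fragment. The sketch below illustrates that idea only; the `Read` record and both function names are hypothetical, and real pipelines work on aligned read files with dedicated tools rather than on tuples like these.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record used only for this sketch."""
    chrom: str
    start: int   # leftmost aligned position
    strand: str  # "+" or "-"


def collapse_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read seen at each (chrom, start, strand) combination."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def duplication_rate(reads: Iterable[Read]) -> float:
    """Fraction of reads removed by collapsing positional duplicates."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1 - len(collapse_duplicates(reads)) / len(reads)


example = [
    Read("chr1", 100, "+"),
    Read("chr1", 100, "+"),  # same position and strand: treated as a PCR duplicate
    Read("chr1", 100, "-"),  # opposite strand: kept
    Read("chr2", 5, "+"),
]
unique_reads = collapse_duplicates(example)
```

Position-based collapsing is a heuristic: in highly enriched regions, independent fragments can legitimately share a start site, so the duplication rate is an upper bound on true PCR duplication rather than an exact measure.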

DNA METHYLATION

Background

DNA methylation is one of the best-studied epigenetic mechanisms and involves the covalent attachment of a methyl group to the 5 carbon position of cytosine (Bird, 1986; Reik, 2007). In mammals, this action is catalyzed by a family of DNA methyltransferases (Dnmts), including Dnmt1, Dnmt3a, and Dnmt3b. Loss of any of these enzymes during embryogenesis is lethal, indicating an essential role for DNA methylation during development (Li et al., 1992; Okano et al., 1999). The prevailing hypothesis on the mechanism of action for DNA methylation involves repression via its presence on the proximal promoter (Miranda & Jones, 2007). It is thought that DNA methylation suppresses gene activity either by acting as part of a signaling pathway that recruits repressor complexes or by sterically hindering transcription factor binding (Huang & Fan, 2010; Moore et al., 2012). Nevertheless, with some exceptions, global mapping of gene promoters indicates a negative correlation between promoter methylation and gene activity (Suzuki & Bird, 2008). Epigenomic studies using mouse embryonic stem cells (ESCs) revealed that promoters can be subclassified based on their CpG content (Fouse et al., 2008; Meissner et al., 2008; Mikkelsen et al., 2007; Mohn et al., 2008). For example, proximal promoters with a high density of CpG dinucleotides tend to be hypomethylated, whereas promoters with a low density of CpGs are hypermethylated. However, the absence of DNA methylation does not necessarily predict gene activity; many gene promoters that lack DNA methylation can also be transcriptionally inactive (Fouse et al., 2008; Lister et al., 2009; Meissner et al., 2008; Mohn et al., 2008; Weber et al., 2007). Furthermore, DNA methylation patterns differ in various cell types. We now know that different cells have their own unique DNA methylation signature, and these characteristics are important for governing cell identity. For example, neural genes are repressed in nonneural tissues by promoter DNA methylation but are unmethylated in neural cells, indicating a direct role for DNA methylation (Meissner et al., 2008; Mohn et al., 2008). These types of studies have revealed an immense amount about the methylation status of gene promoters in regulating gene expression and cellular differentiation.
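The subclassification of promoters by CpG content mentioned in this section can be illustrated with the standard observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G). The sketch below computes that ratio and the GC fraction for a whole sequence; the function names and the cutoffs in `classify_promoter` are illustrative assumptions, not the criteria used in the cited studies, which typically scan sliding windows.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence is empty")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def classify_promoter(seq: str, high_oe: float = 0.75, high_gc: float = 0.55,
                      low_oe: float = 0.48) -> str:
    """Label a promoter as high, intermediate, or low CpG (assumed cutoffs)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    if obs_exp > high_oe and gc_fraction > high_gc:
        return "high"
    if obs_exp < low_oe:
        return "low"
    return "intermediate"


cpg_rich = classify_promoter("CGCGCGCGCG")
cpg_poor = classify_promoter("ATATATCATG")
```

In the terms used above, sequences labeled "high" correspond to the CpG-dense promoters that tend to be hypomethylated, and those labeled "low" to the CpG-poor promoters that tend to be hypermethylated.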

Gene Body Methylation

Global DNA methylation mapping has revealed many novel facets of DNA methylation beyond the classical model of gene regulation. For example, outside of gene promoters, DNA methylation appears to be highly enriched within the gene body (the transcribed portion of the gene). In many species, gene body methylation appears to have both repressive and enhancing roles (Zemach et al., 2010). Recent methylome studies across phyla found that gene body methylation is mostly enriched for genes with moderate expression. In other words, genes that are expressed either highly or lowly are depleted of gene body methylation. However, mammals do not seem to share this trait. Gene expression in both humans and mice does not correlate tightly with CG methylation in the gene body of protein-coding genes (Feng et al., 2010b; Lister et al., 2009). Furthermore, there is still no conclusive evidence to indicate that gene body methylation plays a role in regulating gene expression. On the other hand, gene body methylation for repetitive elements (such as retrotransposons) seems to be widely conserved across a diverse array of species (Feng et al., 2010b; Zemach et al., 2010). In many cases, heavy methylation across the repeat gene body results in stable silencing. Indeed, experiments that artificially remove DNA methylation within an organism result in a dramatic induction of repeat elements and may lead to cell death (Chen et al., 2007; Fan et al., 2005; Hutnick et al., 2009). It is thought that DNA methylation has evolved as a defense mechanism to silence foreign DNA such as that from viruses, which are capable of invading a host cell for viral replication (Zemach et al., 2010). So although foreign viral DNAs have successfully integrated into host genomes over time, the cell has used DNA methylation as a way of providing genomic stability. By the same token, it has also been proposed that repeat elements are an important evolutionary driving force (Bourque et al., 2008; Deininger et al., 2003). Retrotransposable elements are capable of “jumping” from their original loci and reintegrating at a random position in the genome; moreover, a retrotransposon that randomly integrates into a gene’s regulatory element can directly affect that gene’s expression and may change the fitness of the cell. Indeed, mouse cells that are devoid of DNA methylation show increased expression of certain repeat elements, some of which can be found within transcripts (e.g., chimeric transcripts), reinforcing the idea that DNA methylation plays a role in silencing repetitive elements in order to promote genomic stability (Karimi et al., 2011). Overall, DNA methylation within the gene bodies of repetitive sequences seems to be functionally important and conserved across phyla.

Non-CG Methylation

For a long time it was thought that CG methylation was the predominant motif in mammalian cells. Indeed, this appears to be the case in most somatic cell types but not in select cell types. For example, non-CG methylation accounts for about 25% of all methylation in human embryonic stem cells (hESCs) (Lister et al., 2009, 2011). However, not much is known about the functional relevance of non-CG methylation. Non-CG methylation is nonrandomly distributed across the genome and appears to be particularly enriched in gene bodies (Lister et al., 2009, 2011). In fact, non-CG methylation in the gene body appears to correlate better with gene expression than CG methylation does. In hESCs, non-CG methylation has been suggested to enhance gene expression, since the highly expressed genes tend to have more non-CG enrichment (Lister et al., 2009). Evidence suggests that non-CG methylation is primarily catalyzed by the de novo class of DNA methylation enzymes, such as Dnmt3a and Dnmt3b (Dodge et al., 2002; Ramsahoye et al., 2000; Suetake et al., 2003). However, no study has teased apart the exact contributions from either enzyme. In summary, although non-CG methylation contributes approximately 25% of global DNA methylation in hESCs, its relevance to gene regulation is not well understood.

5-Hydroxymethyl-Cytosine

Recent studies have identified a stable derivative of DNA methylation that is beginning to show evidence of regulatory capacity (Branco et al., 2012; Tan & Shi, 2012; Wu & Zhang, 2011). This derivative, 5-hydroxymethylcytosine (5hmC), is generated by the oxidation of 5-methylcytosine (5mC) by a family of dioxygenases, the ten-eleven translocation (Tet) enzymes. Although the existence of 5hmC has been known for decades, it was only recently brought into the spotlight when it was found to be enriched in select cell types such as Purkinje neurons and embryonic stem cells (Kriaucionis & Heintz, 2009; Tahiliani et al., 2009). Since then, there has been considerable interest in elucidating the potential roles for 5hmC and Tet enzymes. Interestingly, recent knockout studies in the mouse demonstrated that Tet1-deficient mice are viable, indicating that Tet1 is dispensable for embryogenesis (Dawlaty et al., 2011). It has been postulated that Tet enzymes are involved in the demethylation pathway by labeling 5mC for downstream base-excision repair, which eventually replaces the modified cytosine with an unmodified cytosine (Wu & Zhang, 2010). Support for this hypothesis comes from genome-wide Tet1 localization studies, which have found that Tet1 is localized to hypomethylated regions of high CpG density and that Tet1 deficiency leads to increased global levels of DNA methylation (Ficz et al., 2011; Williams et al., 2011). Furthermore, loss of TDG (thymine DNA glycosylase), a major base-excision repair enzyme, is embryonically lethal and shows a more dramatic increase of global DNA methylation levels in embryonic stem cells (Cortazar et al., 2011), suggesting that DNA methylation dynamics are tightly regulated during development.

DNA Methylation in Neuroscience

From genetic studies we know that DNA methylation is important for neural development and neuronal survival. Dnmt deficiencies in the developing central nervous system (CNS) disrupt the vital function of neural control of respiration (Fan et al., 2001) and lead to defects in neural differentiation (Fan et al., 2005), neuronal survival (Wu et al., 2010, 2012), and adult neurogenesis (Wu et al., 2010). In the adult brain, most cells are nondividing, yet high levels of Dnmt1 and Dnmt3a can still be detected, suggesting that DNA methylation patterns must be actively maintained even in postmitotic neurons (Feng et al., 2005). Dnmt1 and Dnmt3a share some redundant function, as loss of either enzyme does not result in any significant genomic changes, whereas loss of both leads to deficits in synaptic plasticity as well as in learning and memory (Feng et al., 2010a). This study suggested that neuronal activity and DNA methylation were somehow linked. Indeed, genome-wide DNA methylation patterns measured before and after neuronal stimulation reveal that some cytosines undergo either de novo methylation or demethylation, suggesting that neuronal activity changes the methylation landscape and that DNA methylation is dynamically regulated in postmitotic neurons (Guo et al., 2011a). In addition, our knowledge of the neural methylomes has benefited from pioneering studies done in ESCs (Meissner et al., 2008; Mohn et al., 2008). For example, recent studies have shown that mouse neural cells are also heavily enriched with non-CG methylation, similar to levels found in ESCs (Xie et al., 2012). However, studies of non-CG methylation in mouse brain suggest an effect opposite to that proposed for ESCs. That is, non-CG methylation is inversely correlated with gene expression, resembling the regulatory roles of CpG methylation (Xie et al., 2012). This conclusion has been confirmed in a more recent study involving both human and mouse brains (Lister et al., 2013). Notably, the study also found that non-CG methylation appears to be the dominant form of DNA methylation in adult neurons, accounting for ~53% of the methylated cytosines. Comparison of neurons and glial cells identified regions of differential non-CG methylation that seem to have a role in gene repression. For example, genes that are non-CG hypermethylated in glial cells are hypomethylated and expressed in neurons, and also tend to be enriched for genes involved in neuronal function (Lister et al., 2013). Thus, non-CG methylation appears to play a different role in neural cells compared to embryonic stem cells.
These conflicts in conclusions may be explained by differences in Dnmt3a/Dnmt3b expression or species-specific regulation. Of the de novo enzymes, it has been found that embryonic stem cells highly express both Dnmt3a and Dnmt3b (Okano et al., 1999), whereas neuronal cells have high expression of Dnmt3a only (Feng et al., 2005). Furthermore, recent whole-genome methylome studies have tried exhaustively to identify all mouse brain imprinting control regions regulated by allele-specific methylation (Xie et al., 2012).

Surprisingly, the authors report a relatively small number of novel DMDs, confirming their rarity. Whether this is true for all cell types or cell lineages is yet to be determined. Finally, although 5-hydroxymethylation was initially rediscovered in Purkinje neurons and ESCs (Kriaucionis & Heintz, 2009; Tahiliani et al., 2009), the first studies were carried out primarily in ESCs. More recently, additional 5hmC maps in neural cells have been emerging (Guo et al., 2011b; Jin et al., 2011; Lister et al., 2013; Szulwach et al., 2011), and many of the conclusions are consistent with those found in ESCs. Of note, Tet1 and 5hmC have revealed a novel pathway for active demethylation in postmitotic neurons, thus providing an important missing piece of the puzzle for changing DNA methylation dynamics in nondividing cells.

Methods

Because DNA methylation is essential for embryonic and neural development, one important question in the field is how the process occurs. To that end, scientists have been trying to estimate DNA methylation at global levels and at unique regions in the genome so as to understand how its presence or absence regulates cell function. However, traditional methods have been limited by the number of genomic regions that could be detected at a reasonable resolution. Because DNA methylation is one of the best-studied epigenetic mechanisms, a wealth of approaches has been developed to assay 5-methylcytosine (5mC). These methods range from well-defined chemical reactions to deductive enzymatic methods. Various methods have been used to estimate global levels of DNA methylation.

MeDIP or MeDIP-Seq

One commonly used method to enrich for methylated regions of the genome involves an immunoprecipitation, analogous to chromatin immunoprecipitation (ChIP), using an antibody targeting 5mC (Shen et al., 2009). This approach selects for methylated DNA fragments in an unbiased manner and is followed by either hybridization onto a microarray chip or DNA library construction. It allows for quick assessment of methylated regions in the genome. As a control, researchers typically use input DNA (without antibody pull-down) and determine enrichment or depletion based on the difference in signal relative to input. Although both MeDIP-Seq and MeDIP-chip give qualitative yes/no answers as to whether regions are methylated or unmethylated, the underlying estimate is much more quantitative for MeDIP-Seq. Important features that MeDIP-Seq can capture and that a microarray chip may not be able to detect are methylation in gene bodies and in repeat regions, two domains that we know to be heavily methylated.
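The comparison of pull-down signal against input DNA described above can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the genome is divided into fixed bins, read counts per bin are already available, and a log2 ratio threshold of 1 with a pseudocount of 1 is chosen arbitrarily for illustration. The function names and example counts are hypothetical and are not taken from any published MeDIP pipeline.

```python
import math


def log2_enrichment(ip_count, input_count, ip_total, input_total, pseudocount=1.0):
    """Log2 ratio of depth-normalized pull-down signal over input for one bin."""
    ip_rpm = (ip_count + pseudocount) / ip_total * 1e6
    input_rpm = (input_count + pseudocount) / input_total * 1e6
    return math.log2(ip_rpm / input_rpm)


def call_bins(ip_counts, input_counts, threshold=1.0):
    """Label each bin as enriched, depleted, or unchanged relative to input."""
    ip_total = sum(ip_counts)
    input_total = sum(input_counts)
    labels = []
    for ip, inp in zip(ip_counts, input_counts):
        score = log2_enrichment(ip, inp, ip_total, input_total)
        if score >= threshold:
            labels.append("enriched")
        elif score <= -threshold:
            labels.append("depleted")
        else:
            labels.append("unchanged")
    return labels


# Hypothetical read counts for three genomic bins (assumed values).
ip_counts = [70, 2, 28]
input_counts = [30, 35, 35]
labels = call_bins(ip_counts, input_counts)
print(labels)
```

Normalizing each library to its own depth before taking the ratio keeps a deeper pull-down library from looking uniformly enriched. Real analyses also have to account for CpG density, which this sketch ignores.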

Illumina Beadchip and SequenomARRAY

Some researchers using microarray-based approaches have tried to use bisulfite methods. However, because bisulfite conversion changes the sequence to which the probes must hybridize, probe design is less straightforward. To address this limitation, probes are designed to complement either fully methylated or unmethylated DNA after bisulfite treatment. The ratio of signal intensities between the probes for converted and unconverted sequence then provides a quantitative measure of methylation level. One pitfall of this approach is that a set of probes essentially measures the status of a single CpG site. The logic behind this approach is that a single CpG site may be representative of the entire state of a small locus. However, this reasoning is questionable in terms of statistical sampling; that is, the approach is limited by the number of cytosine sites it can measure. Recent studies comparing the results of this approach with bisulfite sequencing data suggest that the estimates obtained by the microarray approach are about 80% concordant, indicating that, by and large, it agrees with the “gold standard” bisulfite sequencing method. This is likely due to the bimodal, all-or-nothing methylation status at a particular site. Most of the disagreement appears to be in quantitation between 25% and 75% methylation. To this day, microarrays remain in use because they are more cost-effective for very large sample sizes.

Methylation-Sensitive Restriction Enzymes (MSREs)

Methylation-sensitive or methylation-insensitive enzymes can be used to generate libraries with base-pair resolution of individual CG sites. One popular method uses both HpaII and MspI (the HELP assay, originally designed for customized microarrays; the version for sequencing platforms is called methylation-sensitive cut counting [MSCC]). It relies on the activity of HpaII toward unmethylated C*CGG (where * is the site of the cut when the CG motif is unmethylated) but not toward the methylated CG motif (Khulan et al., 2006). The quantitation for each CG site compares the number of reads that start or end at the CCGG site with the number of reads in which the same CCGG is found in the middle of the read (in other words, where it was not cleaved). Often a separate experiment using MspI digestion, which cuts CCGG motifs regardless of their methylation status, is used as a control for copy number and PCR amplification biases. In practical terms, this technique can usually cover about 200,000 or more CpGs with high confidence.
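The read-counting logic described for HpaII-based libraries can be written down as a short sketch. It assumes that reads are given as half-open (start, end) coordinates on one chromosome and that the cut coordinate of each CCGG site is already known. The function name and the example reads are hypothetical, and the MspI control library is left out for brevity.

```python
def mscc_methylation(reads, cut_pos):
    """Estimate methylation at one CCGG site from HpaII-digested reads.

    Reads that start or end exactly at the cut position indicate cleavage
    (unmethylated site); reads that span the position indicate that the
    site was protected from cleavage (methylated site).
    """
    cut = sum(1 for start, end in reads if start == cut_pos or end == cut_pos)
    uncut = sum(1 for start, end in reads if start < cut_pos < end)
    total = cut + uncut
    if total == 0:
        return None  # no informative reads at this site
    return uncut / total


# Hypothetical reads around a CCGG site cut at coordinate 100.
reads = [(100, 150), (60, 100), (80, 130), (90, 140), (100, 160)]
level = mscc_methylation(reads, 100)
print(level)
```

Here three reads begin or end at the site and two run through it, so the estimate is two uncut reads out of five informative reads.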

Bisulfite Sequencing

Although the approaches described above are capable of profiling genome-wide levels of DNA methylation, they are still limited in resolution and quantification. Bisulfite sequencing remains one of the gold standards for assaying DNA methylation (Feng et al., 2011; Pomraning et al., 2009). It has the power to examine DNA methylation at the highest resolution possible, the single-nucleotide level. In addition, the methylation level measured per site is among the most accurate because each read gives a binary methylated or unmethylated call. Thus the methylation level at each site is simply the number of methylated calls divided by the sum of methylated and unmethylated calls. In practice this method does not require any normalization, although it may generate false positives through incomplete bisulfite conversion of genomic DNA. Whole-genome bisulfite sequencing is the most comprehensive approach to examining methylation levels across the genome, but it is extremely costly and not practical for multiple samples. To overcome this, researchers have developed several targeted approaches to survey portions of the genome, such as through MspI restriction enzyme digestion. This approach, also called reduced representation bisulfite sequencing (RRBS), starts with digesting the genome with MspI, a methylation-insensitive enzyme that cuts at all CCGG motifs throughout the genome (Meissner et al., 2005; Smith et al., 2009). The fragments are size-selected between 40 and 220 bp and then undergo bisulfite treatment. RRBS covers approximately 1% of the genome but enriches for CpG regions and can cover nearly 10% of the mammalian CpG sites. RRBS covers mostly CpG-rich regions including CG islands, which are found at many gene promoters (Gu et al., 2010, 2011; Smith et al., 2009). Thus most studies that have used RRBS can report on the CpG status at promoters. However, one pitfall for RRBS (and also HELP or MSCC, described earlier) is that the efficient cutting of the enzymes results in reads that always start in the same position. Because of this, it is impossible to distinguish whether two reads that start at the same position originate from different cells or from a PCR amplification artifact.
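The per-site calculation described for bisulfite sequencing, methylated calls divided by all calls, can be sketched as follows. The sketch assumes forward-strand reads that are already aligned without gaps, and it treats a read C at a reference C as methylated and a read T as unmethylated (converted). The function names, the toy reference, and the reads are hypothetical.

```python
from collections import defaultdict


def count_calls(reference, aligned_reads):
    """Tally methylated/unmethylated calls at each reference cytosine.

    aligned_reads is a list of (offset, sequence) pairs on the forward strand.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, unmethylated]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if pos >= len(reference) or reference[pos] != "C":
                continue
            if base == "C":      # protected from conversion: methylated call
                counts[pos][0] += 1
            elif base == "T":    # converted by bisulfite: unmethylated call
                counts[pos][1] += 1
    return dict(counts)


def methylation_levels(counts):
    """Methylation level per site = methylated / (methylated + unmethylated)."""
    return {pos: m / (m + u) for pos, (m, u) in sorted(counts.items())}


# Toy reference with cytosines at positions 1 and 4 (assumed example data).
reference = "ACGTCGA"
reads = [(0, "ACGTTGA"), (0, "ATGTTGA"), (1, "CGTCG")]
levels = methylation_levels(count_calls(reference, reads))
print(levels)
```

Because every read contributes a yes/no call, no cross-sample normalization is needed, although incomplete conversion would inflate the methylated count in exactly this tally.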

Comparisons Among Methods

There has been considerable debate as to which method is best. The consensus is that each method has its pros and cons and the method of choice will depend on the experimental design. In brief, MeDIP is a good technique for enriching regions of the genome that are highly methylated; these will tend to be promoters of low or intermediate CpG densities (high-CG-density promoters are unmethylated). By contrast, RRBS and MSCC enrich for CG-rich regions and are suitable for surveying CpG islands, which often coincide with a large portion of gene promoters. Two independent studies comparing various DNA methylation sequencing techniques have been covered elsewhere in quite some detail (Bock et al., 2010; Harris et al., 2010).

Future Developments

In summary, DNA methylation is one of the epigenetic mechanisms with the richest set of tools for examining the localization of this “5th base.” Using microarray approaches, DNA methylation status can be cost-effectively surveyed in multiple samples. This is probably most relevant to clinical samples, which can be involved in cohort studies involving hundreds or thousands of patients. By contrast, sequencing approaches allow for high-resolution detail on DNA methylation maps. At the moment, sequencing costs may still be prohibitive for very large studies. In addition, some regions of the genome are difficult to sequence, such as repeat regions, where it is not easy to find unique sequence within repeat-dense domains. As technology allows longer reads, mappability will improve, making it possible to align uniquely within repeat regions. In addition, as sequencing costs are reduced, more methylome maps can be generated.

5-hydroxymethylcytosine (5hmC) has recently been receiving increasing attention as a novel epigenetic regulator. Unfortunately bisulfite sequencing cannot distinguish between 5mC and 5hmC; this casts doubt on years of bisulfite sequencing data gathered in the past, which may be confounded by some 5hmC. However, it should be noted that 5hmC is an order of magnitude less abundant than 5mC, and its regulatory significance is still unclear (Le et al., 2011). So far, the primary approach to examining 5hmC has been to use antibodies specific for 5hmC but not 5mC. Although this has helped us to understand where 5hmC appears to localize, it is still uncertain whether it is functionally important. Finally, two recent studies published breakthrough methods for examining 5hmC at base-pair resolution (oxBS-Seq, TAB-Seq), both of which rely on deductive quantification (Booth et al., 2012; Yu et al., 2012). Currently, both methods require preparing two libraries. For example, in one method, genomic DNA is first treated with an oxidizing reagent, which converts all 5hmC to 5fC (5-formylcytosine); this is then followed by bisulfite treatment. Unlike 5hmC, 5fC can be deaminated in bisulfite conditions and will be read as thymine. Then, by comparing the result with nonoxidized bisulfite sequencing, we can deduce how much 5hmC was at a given site. In the future, the best approach would be a method that distinguishes 5hmC and 5mC in one simultaneous run, reducing both artifacts and costs. Currently, one method can achieve this: the SMRT sequencer, which uses a unique sequencing technology based on DNA polymerase kinetics, or “wobble” (Flusberg et al., 2010). However, this technology is far from high-throughput and is not yet suitable for mammalian genomes.
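The deductive step in the oxidation-based method can be made concrete with a small calculation. As a sketch only, it assumes that the conventional bisulfite library reports 5mC plus 5hmC at a site, that the oxidized library reports 5mC alone, and that negative differences caused by sampling noise are simply clipped to zero. The function name and the example levels are hypothetical.

```python
def deduce_5hmc(bs_level, oxbs_level):
    """Split a bisulfite methylation level into 5mC and 5hmC components.

    bs_level:   fraction read as C in the conventional library (5mC + 5hmC)
    oxbs_level: fraction read as C in the oxidized library (5mC only)
    """
    five_mc = oxbs_level
    five_hmc = max(0.0, bs_level - oxbs_level)  # clip noise-driven negatives
    return five_mc, five_hmc


# Hypothetical site: 80% protected in BS-Seq, 65% protected after oxidation.
five_mc, five_hmc = deduce_5hmc(0.80, 0.65)
print(five_mc, round(five_hmc, 2))
```

Because the 5hmC estimate is a difference between two independently sampled libraries, its uncertainty combines the noise of both, which is one reason a single-run method would be preferable.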

HISTONE MODIFICATIONS

Background

In the nucleus, genomic DNA is frequently found wrapped around proteins that help to fold and compact DNA. These core proteins, or histones, contain long N-terminal tails that are liable to undergo extensive posttranslational modification (Bannister & Kouzarides, 2011; Muers, 2011). Some of the best-characterized modifications involve methylation of lysine or arginine residues and acetylation of lysine residues. Because these residues carry a positive charge in aqueous environments, they are attracted to the negatively charged phosphate backbone of DNA; acetylation is thought to neutralize the positive charge of lysine. This effectively reduces interactions between histone and DNA and is thought to reduce compaction and facilitate transcription by allowing an open environment for recruiting RNA polymerase and other transcriptional machinery. Currently, over 50 histone modifications are known, but only a few have been well characterized (Muers, 2011). Early global histone occupancy maps were huge undertakings that have provided essential insight into the DNA chromatin environment and gene regulation (Barski et al., 2007; Ernst et al., 2011; Hawkins et al., 2010; Heintzman et al., 2009; Wang et al., 2008). However, some histone modifications are rare and localized only in highly select regions of the genome, making it difficult to assess their occupancy and function.

Bivalency Domains

Intriguingly, multiple modifications may be found within a single histone tail, sometimes with contradictory roles. One of the best examples of this phenomenon is the colocalization of histone H3 lysine 4 trimethylation (a mark of active genes) and histone H3 lysine 27 trimethylation (a mark of repressed genes). This feature was first identified in embryonic stem cells; since then, these so-called bivalent domains have been proposed to be “poised” for quick activation or repression upon differentiation. Indeed, many bivalent genes in embryonic stem cells resolve to monovalency after differentiation (Meissner et al., 2008; Mikkelsen et al., 2007). More recently, bivalent domains have been identified in other progenitor stem cells (Cui et al., 2009; Palacios et al., 2010), suggesting that bivalency may be unique to stem cells for fine-tuning the differentiation process.

Cross Talk with DNA Methylation

In addition to co-occupancy of multiple histone modifications, other epigenetic marks are also often found localized together, indicating a clear cross talk between multiple epigenetic mechanisms (Cedar & Bergman, 2009; Meissner et al., 2008). For example, histone H3 lysine 36 trimethylation is often associated with high levels of gene body methylation, whereas H3K4me3 and H3K27me3 are often negatively correlated with DNA methylation. In neural stem cells, the latter relationship has been postulated to be mediated by Dnmt3a, whose DNA methylation activity inhibits recruitment of PRC2 complexes that catalyze the trimethylation of H3K27 (Wu et al., 2010). Thus DNA methylation and histone modifications act together to regulate gene activity.

Histone Modifications in Neuroscience

The contribution of histone modifications has been extensively studied in neurobiology. Although many different histone modifications have been studied, we highlight here some of the better-studied mechanisms as case examples. For example, the regulation of histone acetylation has been shown to be involved in maintaining neural progenitor cells. If deacetylation is inhibited by valproic acid (VPA), a known HDAC inhibitor, cells begin to undergo neural differentiation (Hsieh et al., 2004). Mouse neural stem cells treated with trichostatin A (TSA), another HDAC inhibitor, also showed increased neuronal differentiation but reduced astrocyte differentiation (Balasubramaniyan et al., 2006). Interestingly, in vivo studies of developing neuronal precursors deficient in both HDAC1 and HDAC2 showed severe abnormalities in brain formation owing to cell death and reduced differentiation potential (Montgomery et al., 2009). Together, these studies indicate important roles for histone acetylation and deacetylation in carefully orchestrating the dynamic process of neural development. However, the genome-wide occupancies of these acetylation marks have not been thoroughly examined in neural cells.

Besides acetylation, histone methylation is another well-studied mark that also contributes to neural development. For example, the H3K27 demethylase Jmjd3 is upregulated during neural differentiation and has been shown to play an important role in neural commitment (Burgold et al., 2008). Overexpression of Jmjd3 promotes expression of neuronal genes, whereas knockdown leads to suppressed neural lineage differentiation. On the other hand, H3K27 methylation is catalyzed by the Polycomb repressive complex (PRC). Interestingly, whereas PRC proteins and H3K27me3 repress neural-specific genes in ES cells, and these genes need to be demethylated for neural commitment, PRC components such as Ezh2 are also needed to promote NPCs to more terminally differentiated states (Boyer et al., 2006; Hirabayashi et al., 2009). Genome-wide mapping of PRC proteins and H3K27me3 reveals that they exert this effect via direct repression of neural-specific genes in ES cells, which is alleviated during differentiation (Boyer et al., 2006). Further mapping of other histone methylation marks has revealed that promoter states reflect lineage commitment (Mikkelsen et al., 2007). As mentioned before, we now know that ES cells have unique bivalent domains (marked by both H3K4me3 and H3K27me3), which eventually resolve to either H3K4me3 (if the gene is active) or H3K27me3 (if the gene is repressed) in neural cells. Neurobiology has benefited from the genome-wide chromatin maps acquired with in vitro neural differentiation models. Thus, by understanding the chromatin state, the researcher gains an improved understanding of how neural-related genes may be regulated.
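The resolution of bivalent promoters described here can be illustrated by intersecting peak calls for the two marks with promoter regions. This is a minimal sketch, assuming a single chromosome, half-open (start, end) intervals, and peaks that have already been called. The gene names, coordinates, and function names are hypothetical and do not come from the cited studies.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]


def classify_promoters(promoters, k4_peaks, k27_peaks):
    """Assign each promoter a chromatin state from H3K4me3 and H3K27me3 peaks."""
    states = {}
    for gene, region in promoters.items():
        has_k4 = any(overlaps(region, peak) for peak in k4_peaks)
        has_k27 = any(overlaps(region, peak) for peak in k27_peaks)
        if has_k4 and has_k27:
            states[gene] = "bivalent"
        elif has_k4:
            states[gene] = "H3K4me3 only"
        elif has_k27:
            states[gene] = "H3K27me3 only"
        else:
            states[gene] = "unmarked"
    return states


# Hypothetical promoters and peak calls in ES cells and in neural cells.
promoters = {"geneA": (1000, 2000), "geneB": (5000, 6000),
             "geneC": (9000, 10000), "geneD": (12000, 13000)}
es_states = classify_promoters(
    promoters,
    k4_peaks=[(900, 1500), (5200, 5800), (9100, 9500)],
    k27_peaks=[(1200, 2100), (9400, 9900)],
)
neural_states = classify_promoters(
    promoters,
    k4_peaks=[(900, 1500), (5200, 5800)],
    k27_peaks=[(9400, 9900)],
)
print(es_states)
print(neural_states)
```

In this toy example geneA and geneC are bivalent in ES cells and resolve to H3K4me3 only and H3K27me3 only, respectively, in neural cells. A population-level overlap of this kind cannot by itself show that both marks sit on the same nucleosome.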

Methods

ChIP-Chip/ChIP-Seq

The most powerful method for examining the localization of histone modifications uses antibodies that target specific modifications to pull down chromatin, followed by either microarray hybridization or high-throughput sequencing. Two types of controls have been used: one employs the conventional input DNA and the other uses antibodies that target H3. The advantage of the latter is that it can determine whether depletion of a modification at a particular locus is due to absence of a histone in general or absence of the modification (Pellegrini & Ferrari, 2012). However, the chromatin immunoprecipitation strategy still suffers from relatively low resolution; a method to map the locations of histone modifications precisely at base-pair resolution has yet to be identified. Current resolution is roughly 100 to 200 bp, about the average size of fragments after chromatin sonication.

BisChIP

One of the more innovative ways of using histone-ChIP has been to couple it with bisulfite sequencing (Brinkman et al., 2012; Statham et al., 2012). By doing so, researchers can simultaneously assess histone occupancy and DNA methylation status. This method works by performing a protocol for bisulfite sequencing on ChIP DNA. Thus reads that map to the genome presumably derive from DNA associated with the target modification and have the added bonus of carrying methylation information as well. The two landmark studies that employed this method found clear cross talk between H3K27me3 and DNA methylation, suggesting that these two pathways somehow regulate each other. However, one drawback of this approach is that it is difficult to examine differential methylation at sites with differential protein binding. The reason is that if the DNA-binding protein is absent, no DNA is pulled down from that site and there is therefore no DNA methylation information for it, which precludes any possibility of differential methylation analysis.

Future Developments

Unlike the binary presence or absence of DNA methylation, one of the major challenges in fully understanding the histone code is the wide range of modifications. Although we have covered only the well-studied modifications, other modifications (such as phosphorylation, ubiquitination, and sumoylation) are possible on histone tails. In addition, there are many histone variants, and finding out how all these factors work together will be a massive undertaking. Recently there has been considerable debate regarding whether bivalent domains are valid or could be artifacts of the limited resolution of ChIP. Mass spectrometry (MS) may help to resolve some of these debates, because this technique can sequence entire peptides and definitively characterize the modifications of each residue on a peptide (Baker, 2012; Sidoli et al., 2012). However, the application of MS to entire histone tails has yet to fully mature. Overall, one of the major challenges for studying histone environments is the sheer number of modifications and the unique chromatin state of every cell type.

NONCODING RNAS

Background

From a historical perspective, noncoding RNAs did not take center stage because of the protein-centric emphasis on gene regulation, including the roles of transcription factors, kinases, and other regulatory enzymes. However, in recent years there has been dramatically increased interest in studying noncoding RNA. In part, this is due to advances in technology that allow facile assaying of noncoding RNAs and also to increasing evidence that noncoding RNAs play important roles in gene regulation. Noncoding RNAs are a class of RNAs that do not encode proteins but appear to have some regulatory function (Mattick & Makunin, 2006). Conceptually, noncoding RNAs work by complementary binding to signal an increase or decrease in downstream expression. Two broad classes appear to exist: one comprises long intergenic noncoding RNAs (lincRNAs) and the other the small RNAs (microRNAs and piRNAs).

MiRNAs

MicroRNAs (miRNAs) are short 18- to 22-nt RNAs that appear to play primarily a repressive role (Winter et al., 2009). It has been postulated that miRNAs can act either by promoting mRNA degradation or by inhibiting translation. More recently, it has been shown that miRNAs appear to act primarily at the posttranscriptional level, through complementary binding to intronic or 3' UTR regions, to mediate transcript degradation (Guo et al., 2010). MiRNAs are initially transcribed as a longer primary transcript (pri-miRNA), which is processed by Drosha into a precursor (pre-miRNA) and then by Dicer to produce the mature functional miRNA. Dicer and Drosha are essential for miRNA biogenesis; the loss of either of these enzymes has severe effects on the cells (Winter et al., 2009). However, it is debated how loss of these enzymes affects cell survival and cell differentiation. On one hand, some have argued that deregulated miRNA networks cause the phenotypes, whereas others have argued that accumulation of pre- and pri-miRNAs leads to cellular toxicity (Fineberg et al., 2009). Regardless of these arguments, the loss of miRNAs directly affects global gene expression, suggesting that miRNAs make a substantial contribution to shaping the transcriptome. Because miRNAs are short sequences, each miRNA can complement many similar sequences across the genome and therefore can potentially regulate thousands of genes (Bartel, 2009). It is becoming increasingly apparent that each cell type has a unique repertoire of signature miRNAs (Marson et al., 2008), and these miRNAs contribute to cellular identity.

piRNAs

More recently, a class of Piwi-interacting RNAs (piRNAs) has also been shown to be important for regulating gene expression (Thomson & Lin, 2009). In mammals, piRNAs are most abundant in germ cells, but they have recently also been identified in embryonic stem cells and in neural cells. Within the genome, piRNAs are found in large clusters within intergenic heterochromatin regions; they are thought to be transcribed as one long precursor and then processed to smaller 28- to 30-nt RNAs. Unlike the case for miRNAs, the exact mechanism by which piRNAs are generated is still unclear, but the favored mechanism entails a “ping-pong” action, suggesting that piRNA biogenesis utilizes a feed-forward loop (Olovnikov et al., 2012). In addition, piRNAs appear to be transcribed in a strand-specific manner, possibly indicating a necessary complementary action for silencing repeat sequences also found within heterochromatin.

LincRNAs

LincRNAs, like messenger RNAs, are transcribed in a precursor form, which also undergoes processing such as splicing and polyadenylation (Mercer et al., 2009). LincRNAs range in size between 1 and 10 kb. Not much is known about lincRNAs except that some are well conserved across species and appear to contain regulatory promoters that follow conventional rules of transcription start sites. In fact, a wide repertoire of lincRNAs was originally predicted by examining chromatin marks that did not correspond to any annotated protein-coding regions but showed corresponding RNA transcripts (Cabili et al., 2011; Guttman et al., 2009). Much of our understanding of long noncoding RNA comes from studying the X-inactivation center, a cluster of noncoding RNAs that has important regulatory roles in X-inactivation (Lee, 2011). From these studies we have learned a lot about the regulatory capacity of noncoding RNAs, such as the use of base complementarity to facilitate X-chromosome pairing and the inhibition of neighboring transcription. More recently, a comprehensive study that systematically knocked down individual lincRNAs in mouse embryonic stem cells showed that a majority of lincRNAs are key regulatory components of the pluripotency network (Guttman et al., 2011). Together, these results demonstrate that lincRNAs have important regulatory capacities on par with transcription factors, which are also implicated in the pluripotency program.

Noncoding RNAs in Neurobiology

Many miRNAs have been identified that are important for neural development and maturity, including miR-124, miR-132, and let-7 (Fineberg et al., 2009). This is supported by conditional Dicer knockouts in developing mouse CNS, although, as mentioned above, the exact cause of the severe phenotypes is unclear. Genome-wide studies of miRNAs in neurons reveal that each region of the brain has a unique set of miRNAs that are required for proper development (Fineberg et al., 2009), further supporting the notion of cell-specific miRNAs even between different neural subtypes. Other small RNAs such as piRNAs have been less well studied in the brain or neural tissue. However, piRNAs have been shown to be necessary for the de novo methylation of a single imprinted gene, Rasgrf1, in the mouse brain (Watanabe et al., 2011), suggesting cross talk between piRNAs and other epigenetic mechanisms. In addition, piRNAs have been shown to be highly expressed in the Aplysia central nervous system and to mediate DNA methylation at the CREB2 site, thereby facilitating a mechanism for long-term synaptic plasticity and persistent memory (Rajasethupathy et al., 2012).

Long noncoding RNAs are only beginning to show evidence of functional importance in neural development. Systematic analysis has identified some lincRNAs that appear to be enriched in cells of the neuroectoderm (Guttman et al., 2011). These lincRNAs have been postulated to drive differentiation and also to repress the pluripotency network. Functional evidence for lincRNAs has come from zebrafish models in which morpholino knockdown of select lincRNAs led to defects in neural development, including failure of neural tube closure and improper brain and eye formation (Ulitsky et al., 2011). More recently, long noncoding RNAs have been comprehensively characterized in the subventricular zone of the adult brain (Ramos et al., 2013). Using knockdown of candidate lncRNAs, these authors showed that lncRNAs have potential roles in regulating neural differentiation. Thus, lncRNAs may have important regulatory roles during adult neurogenesis.

Methods

Many of the high-throughput tools were easily adapted to examining genome-wide expression of noncoding RNAs. Prior to high-throughput sequencing, many of the studies used Northern blot analysis to assay for expression; however, these approaches did not provide a comprehensive examination of the noncoding RNAs of interest.

Oligo-d(T) Purification Followed by Microarray or Sequencing

A large fraction of lincRNAs have polyadenylated tails, which allow for detection using conventional RNA purification via oligo-d(T) affinity. Therefore tools like microarrays and RNA-Seq can be easily adapted to the examination of lincRNAs. Currently there are custom microarray designs that specifically target a large fraction of predicted lincRNAs. In addition, conventional RNA-Seq datasets should also readily be able to detect loci of predicted lincRNAs.

Microarrays and Taqman

Microarrays designed to hybridize to known species of miRNAs have been used to assess genome-wide expression levels of miRNAs. However, as we have discussed, microarray signals do not always produce the most quantitative interpretation. To circumvent this, researchers have applied the Taqman-based PCR system to quantify miRNA expression. The principle behind Taqman relies on the exonuclease activity of Taq DNA polymerase to activate a fluorescently labeled probe that has annealed to the sequence of interest. In this case, the Taqman probe anneals to miRNA sequences and amplification via Taq polymerase releases a fluorescence signal that can be accurately quantified. Because the number of known miRNAs is relatively small, large plates of ~300 to 400 wells can be used to quantify all of them in one experiment. However, the Taqman system is costly and limited to known miRNA species.

Small-RNA Seq

Small-RNA Seq essentially enriches for RNAs ~20 to 30 nt long; thus it covers the wide spectrum of both miRNAs and piRNAs. The protocol for small-RNA sequencing can start either with total RNA or with purified small RNAs. In the latter case, a specialized column is used to enrich for small RNAs and remove larger RNAs. Mature small RNAs are not polyadenylated and therefore cannot be detected by mRNA-Seq, which usually includes a polyA purification step. Some miRNA precursors (pri-miRNAs) that are indeed polyadenylated may be detected through polyA purification, but these intermediates are not always stable and cannot be used to directly infer or quantify the total abundance of the processed mature miRNAs. The advantage of the small RNA-Seq library is unbiased detection of all species of small RNAs, including those we have not discussed (e.g., snoRNAs and tRNAs), in a single experiment. In addition, small RNA-Seq is considered fairly quantitative and allows for comparisons of abundance both between small RNA species and within a species.
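As a rough illustration of how a small-RNA library spans several RNA classes, reads can be binned by length using the size ranges quoted in this chapter (18 to 22 nt for miRNAs, 28 to 30 nt for piRNAs). This is only a sketch: length alone does not identify an RNA class, and real analyses align reads to annotated small-RNA databases. The function names and the example reads are hypothetical.

```python
from collections import Counter


def length_class(read):
    """Bin a trimmed small-RNA read by length, using the ranges quoted in the text."""
    n = len(read)
    if 18 <= n <= 22:
        return "miRNA-sized"
    if 28 <= n <= 30:
        return "piRNA-sized"
    return "other"


def summarize(reads):
    """Count how many reads fall into each length class."""
    return Counter(length_class(read) for read in reads)


# Hypothetical adapter-trimmed reads of 21, 22, 29, and 25 nt.
reads = ["A" * 21, "C" * 22, "G" * 29, "T" * 25]
summary = summarize(reads)
print(summary)
```

A length histogram of this kind is often a useful first sanity check on a small-RNA library before any alignment is attempted.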

Ribosome RNA-Depletion Seq

One attractive solution is to create libraries from total RNA and examine both coding and noncoding RNA within a single library. However, ribosomal RNA (rRNA) constitutes over 95% of total RNA, which means that less than 5% of total RNA will yield useful information. To circumvent this, methods have been developed to remove rRNA from total RNA, usually by passing the sample through a column carrying probes complementary to rRNA, so that the eluate should be rRNA-free. However, because of the overwhelming abundance of rRNA, most available datasets generated with the ribodepletion method still contain a large amount of rRNA; this can skew quantitation of mRNAs and reduce the cost-efficiency of RNA-Seq. Currently the field appears to favor performing mRNA-Seq and small RNA-Seq separately rather than using a single rRNA-depleted sequencing library, suggesting that the latter method is not yet mature.

Future Developments

Ideally, rRNA-depleted libraries should be the favored method in the future, since a single experiment can survey the entire RNA landscape. In addition, other species of RNAs, including those originating from repeat or telomeric regions, may also be assessed. Interestingly, it has been estimated that protein-coding RNAs constitute only a small fraction of the nonribosomal RNA; many RNAs therefore have yet to be identified and examined. One of the major limitations in studying noncoding RNAs is that their function is not well understood. Bioinformatics studies have been pivotal for identifying classes of noncoding RNAs, but we have yet to understand the relevance of these small RNAs to neurobiology and biology in general. Although many studies have provided evidence for some functional role of noncoding RNAs in particular cell types, there are still orders of magnitude more unique noncoding RNAs than available studies. Therefore we are far from having a complete perspective of all the actions of noncoding RNAs. To illustrate this point, computational predictions suggest that a single miRNA can potentially target hundreds to thousands of genes. However, different prediction algorithms give dramatically different results. For example, a recent study compared the three major miRNA prediction algorithms and found minimal overlap between them, indicating that there is still no robust method for determining the precise targets of a given miRNA (Sumazin et al., 2011). Although we use miRNA as a prime example, this problem also extends to lincRNAs and piRNAs, whose true targets throughout the genome remain unknown. Much like the histone code, the sheer number of different noncoding RNA species makes it difficult to obtain a comprehensive picture. Future systematic analyses of different noncoding RNA species will help bring us closer to a complete epigenomic view of the cell.
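The lack of agreement between miRNA target prediction algorithms described above can be quantified with a simple set-overlap measure. The following is a minimal sketch, not the method used in the cited study: the three tool names and their predicted target sets are hypothetical placeholders, and the Jaccard index is just one of several reasonable overlap measures.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: shared targets divided by all targets named by either tool."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical predicted targets of one miRNA from three unnamed tools
# (illustrative gene labels only, not real predictions).
predictions = {
    "tool_A": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "tool_B": {"GENE3", "GENE4", "GENE5", "GENE6"},
    "tool_C": {"GENE4", "GENE7", "GENE8"},
}

# Pairwise overlap between every pair of tools.
pairwise = {
    (x, y): jaccard(predictions[x], predictions[y])
    for x, y in combinations(sorted(predictions), 2)
}

# Targets on which all three tools agree.
consensus = set.intersection(*predictions.values())

for pair, score in pairwise.items():
    print(pair, round(score, 3))
print("consensus targets:", sorted(consensus))
```

A low pairwise index combined with a small consensus set is the pattern the text describes: each tool nominates many targets, but few are supported by all of them.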

CONCLUDING REMARKS

The field of epigenomics has been made possible by high-throughput technologies such as microarrays and next-generation sequencing. These technologies have allowed us to procure genomic maps of the epigenetic landscape and in the process have opened a Pandora’s Box of complexities that will probably take many years to fully decipher. Making sense of this deluge of data will ultimately depend on having enough analysts to examine it in an integrative and intuitive manner and so uncover meaningful biology. Nevertheless, the authors surmise that both technology and computation will follow a “Moore’s law” type of growth, which should make studying epigenomics orders of magnitude more affordable and easier. Many more waves of epigenetic breakthroughs are likely to follow, and fields such as neurobiology stand to benefit from this wealth of epigenetic information.

REFERENCES

Allison, D. B., Cui, X., Page, G. P., & Sabripour, M. (2006). Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics 7, 55–65. Baker, M. (2012). Mass spectrometry for chromatin biology. Nature Methods 9, 649–652. Balasubramaniyan, V., Boddeke, E., Bakels, R., Kust, B., Kooistra, S., Veneman, A., & Copray, S. (2006). Effects of histone deacetylation inhibition on


neuronal differentiation of embryonic mouse neural stem cells. Neuroscience 143, 939–951. Bannister, A. J., & Kouzarides, T. (2011). Regulation of chromatin by histone modifications. Cell Research 21, 381–395. Barski, A., Cuddapah, S., Cui, K., Roh, T. Y., Schones, D. E., Wang, Z., . . . Zhao, K. (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837. Bartel, D. P. (2009). MicroRNAs:  target recognition and regulatory functions. Cell 136, 215–233. Bird, A. P. (1986). CpG-rich islands and the function of DNA methylation. Nature 321, 209–213. Bock, C., Tomazou, E. M., Brinkman, A. B., Muller, F., Simmer, F., Gu, H., . . . Meissner, A. (2010). Quantitative comparison of genome-wide DNA methylation mapping technologies. Nature Biotechnology 28, 1106–1114. Booth, M. J., Branco, M. R., Ficz, G., Oxley, D., Krueger, F., Reik, W., & Balasubramanian, S. (2012). Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 336, 934–937. Bourque, G., Leong, B., Vega, V. B., Chen, X., Lee, Y. L., Srinivasan, K. G., . . . et  al. (2008). Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Research 18, 1752–1762. Boyer, L. A., Plath, K., Zeitlinger, J., Brambrink, T., Medeiros, L. A., Lee, T. I., . . . et  al. (2006). Polycomb complexes repress developmental regulators in murine embryonic stem cells. Nature 441, 349–353. Branco, M. R., Ficz, G., & Reik, W. (2012). Uncovering the role of 5-hydroxymethylcytosine in the epigenome. Nature Reviews Genetics 13, 7–13. Brinkman, A. B., Gu, H., Bartels, S. J., Zhang, Y., Matarese, F., Simmer, F., . . . et  al. (2012). Sequential ChIP-bisulfite sequencing enables direct genome-scale investigation of chromatin and DNA methylation cross-talk. Genome Research 22, 1128–1138. Burgold, T., Spreafico, F., De Santa, F., Totaro, M.G., Prosperini, E., Natoli, G., & Testa, G.  (2008). 
The histone H3 lysine 27-specific demethylase Jmjd3 is required for neural commitment. PloS one 3, e3034. Cabili, M. N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., & Rinn, J. L. (2011). Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & Development 25, 1915–1927. Cedar, H., & Bergman, Y. (2009). Linking DNA methylation and histone modification:  patterns and paradigms. Nature Reviews Genetics 10, 295–304. Chen, T., Hevi, S., Gay, F., Tsujimoto, N., He, T., Zhang, B., Ueda, Y., & Li, E. (2007). Complete inactivation

of DNMT1 leads to mitotic catastrophe in human cancer cells. Nature Genetics 39, 391–396. Cortazar, D., Kunz, C., Selfridge, J., Lettieri, T., Saito, Y., MacDougall, E., . . . et al. (2011). Embryonic lethal phenotype reveals a function of TDG in maintaining epigenetic stability. Nature 470, 419–423. Cui, K., Zang, C., Roh, T. Y., Schones, D. E., Childs, R. W., Peng, W., & Zhao, K. (2009). Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell 4, 80–93. Dawlaty, M. M., Ganz, K., Powell, B. E., Hu, Y. C., Markoulaki, S., Cheng, A. W., . . . et al. (2011). Tet1 is dispensable for maintaining pluripotency and its loss is compatible with embryonic and postnatal development. Cell Stem Cell 9, 166–175. Deininger, P. L., Moran, J. V., Batzer, M. A., & Kazazian, H. H., Jr. (2003). Mobile elements and mammalian genome evolution. Current Opinion in Genetics & Development 13, 651–658. Dodge, J. E., Ramsahoye, B. H., Wo, Z. G., Okano, M., & Li, E. (2002). De novo methylation of MMLV provirus in embryonic stem cells:  CpG versus non-CpG methylation. Gene 289, 41–48. Ernst, J., Kheradpour, P., Mikkelsen, T. S., Shoresh, N., Ward, L. D., Epstein, C. B., . . . et  al. (2011). Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49. Fan, G., Beard, C., Chen, R. Z., Csankovszki, G., Sun, Y., Siniaia, M., . . . et al. (2001). DNA hypomethylation perturbs the function and survival of CNS neurons in postnatal animals. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience 21, 788–797. Fan, G., Martinowich, K., Chin, M. H., He, F., Fouse, S. D., Hutnick, L., . . . et al. (2005). DNA methylation controls the timing of astrogliogenesis through regulation of JAK-STAT signaling. Development 132, 3345–3356. Feng, J., Chang, H., Li, E., & Fan, G. (2005). 
Dynamic expression of de novo DNA methyltransferases Dnmt3a and Dnmt3b in the central nervous system. Journal of Neuroscience Research 79, 734–746. Feng, J., Zhou, Y., Campbell, S. L., Le, T., Li, E., Sweatt, J. D., . . . Fan, G. (2010a). Dnmt1 and Dnmt3a maintain DNA methylation and regulate synaptic function in adult forebrain neurons. Nature Neuroscience 13, 423–430. Feng, S., Cokus, S. J., Zhang, X., Chen, P. Y., Bostick, M., Goll, M. G., . . . et  al. (2010b). Conservation and divergence of methylation patterning in plants and animals. Proceedings of the National Academy of Sciences of the United States of America 107, 8689–8694. Feng, S., Rubbi, L., Jacobsen, S. E., & Pellegrini, M. (2011). Determining DNA methylation profiles

using sequencing. Methods in Molecular Biology 733, 223–238. Ficz, G., Branco, M. R., Seisenberger, S., Santos, F., Krueger, F., Hore, T. A., . . . Reik, W. (2011). Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells and during differentiation. Nature 473, 398–402. Fineberg, S. K., Kosik, K. S., & Davidson, B. L. (2009). MicroRNAs potentiate neural development. Neuron 64, 303–309. Flusberg, B. A., Webster, D. R., Lee, J. H., Travers, K. J., Olivares, E. C., Clark, T. A., . . . Turner, S. W. (2010). Direct detection of DNA methylation during single-molecule, real-time sequencing. Nature Methods 7, 461–465. Fouse, S. D., Shen, Y., Pellegrini, M., Cole, S., Meissner, A., Van Neste, L., . . . Fan, G. (2008). Promoter CpG methylation contributes to ES cell gene regulation in parallel with Oct4/Nanog, PcG complex, and histone H3 K4/K27 trimethylation. Cell Stem Cell 2, 160–169. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., . . . et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5, R80. Gu, H., Bock, C., Mikkelsen, T. S., Jager, N., Smith, Z. D., Tomazou, E., . . . Meissner, A. (2010). Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature Methods 7, 133–136. Gu, H., Smith, Z. D., Bock, C., Boyle, P., Gnirke, A., & Meissner, A. (2011). Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nature Protocols 6, 468–481. Guo, H., Ingolia, N. T., Weissman, J. S., & Bartel, D. P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835–840. Guo, J. U., Ma, D. K., Mo, H., Ball, M. P., Jang, M. H., Bonaguidi, M. A., . . . et al. (2011a). Neuronal activity modifies the DNA methylation landscape in the adult brain. Nature Neuroscience 14, 1345–1351. Guo, J. U., Su, Y., Zhong, C., Ming, G. L., & Song, H.
(2011b). Hydroxylation of 5-methylcytosine by TET1 promotes active DNA demethylation in the adult brain. Cell 145, 423–434. Guttman, M., Amit, I., Garber, M., French, C., Lin, M. F., Feldser, D., . . . et al. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227. Guttman, M., Donaghey, J., Carey, B. W., Garber, M., Grenier, J. K., Munson, G., . . . et  al. (2011). lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature 477, 295–300.


Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., . . . et  al. (2010). Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature Biotechnology 28, 1097–1105. Hawkins, R. D., Hon, G. C., Lee, L. K., Ngo, Q., Lister, R., Pelizzola, M., . . . et al. (2010). Distinct epigenomic landscapes of pluripotent and lineage-committed human cells. Cell Stem Cell 6, 479–491. Heintzman, N. D., Hon, G. C., Hawkins, R. D., Kheradpour, P., Stark, A., Harp, L. F., . . . et  al. (2009). Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108–112. Heller, M. J. (2002). DNA microarray technology:  devices, systems, and applications. Annual Review of Biomedical Engineering 4, 129–153. Hirabayashi, Y., Suzki, N., Tsuboi, M., Endo, T. A., Toyoda, T., Shinga, J., . . . Gotoh, Y. (2009). Polycomb limits the neurogenic competence of neural precursor cells to promote astrogenic fate transition. Neuron 63, 600–613. Hsieh, J., Nakashima, K., Kuwabara, T., Mejia, E., & Gage, F. H. (2004). Histone deacetylase inhibition-mediated neuronal differentiation of multipotent adult neural progenitor cells. Proceedings of the National Academy of Sciences of the United States of America 101, 16659–16664. Huang, K., & Fan, G. (2010). DNA methylation in cell differentiation and reprogramming:  an emerging systematic view. Regenerative Medicine 5, 531–544. Hutnick, L., Golshi, P., Namihira, M., Xue, Z., Matynia, A., Yang, W., . . . Fan, G. (2009). DNA hypomethylation restricted to the murine forebrain induces cortical degeneration and impairs postnatal neuronal maturation. Human Molecular Genetics 18(15):2875–2888. Jin, S. G., Wu, X., Li, A. X., & Pfeifer, G. P. (2011). Genomic mapping of 5-hydroxymethylcytosine in the human brain. Nucleic Acids Research 39, 5015–5024. Karimi, M. M., Goyal, P., Maksakova, I. A., Bilenky, M., Leung, D., Tang, J. X., . . . 
et  al. (2011). DNA methylation and SETDB1/H3K9me3 regulate predominantly distinct sets of genes, retroelements, and chimeric transcripts in mESCs. Cell Stem Cell 8, 676–687. Khulan, B., Thompson, R. F., Ye, K., Fazzari, M. J., Suzuki, M., Stasiek, E., . . . et  al. (2006). Comparative isoschizomer profiling of cytosine methylation:  the HELP assay. Genome Research 16, 1046–1055. Kriaucionis, S., & Heintz, N. (2009). The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324, 929–930.


Le, T., Kim, K. P., Fan, G., & Faull, K. F. (2011). A sensitive mass spectrometry method for simultaneous quantification of DNA methylation and hydroxymethylation levels in biological samples. Analytical Biochemistry 412, 203–209. Lee, J. T. (2011). Gracefully ageing at 50, X-chromosome inactivation becomes a paradigm for RNA and chromatin control. Nature Reviews Molecular Cell Biology 12, 815–826. Li, C. (2008). Automating dChip:  toward reproducible sharing of microarray data analysis. BMC Bioinformatics 9, 231. Li, E., Bestor, T. H., & Jaenisch, R. (1992). Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69, 915–926. Lister, R., Mukamel, E.  A., Nery, J.  R., Urich, M., Puddifoot, C. A., Johnson, N. D., Lucero, J., Huang, Y., Dwork, A. J., Schultz, M. D., et al. (2013). Global epigenomic reconfiguration during Mammalian brain development. Science 341, 1237905. Lister, R., Pelizzola, M., Dowen, R. H., Hawkins, R. D., Hon, G., Tonti-Filippini, J., . . . et  al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322. Lister, R., Pelizzola, M., Kida, Y. S., Hawkins, R. D., Nery, J. R., Hon, G., . . . et  al. (2011). Hotspots of aberrant epigenomic reprogramming in human induced pluripotent stem cells. Nature 471, 68–73. Marson, A., Levine, S.  S., Cole, M.  F., Frampton, G.  M., Brambrink, T., Johnstone, S., Guenther, M. G., Johnston, W. K., Wernig, M., Newman, J., et al. (2008). Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134, 521–533. Mattick, J. S., & Makunin, I. V. (2006). Non-coding RNA. Human molecular genetics 15 Spec No 1, R17–R29. Meissner, A., Gnirke, A., Bell, G. W., Ramsahoye, B., Lander, E. S., & Jaenisch, R. (2005). Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Research 33, 5868–5877. Meissner, A., Mikkelsen, T. 
S., Gu, H., Wernig, M., Hanna, J., Sivachenko, A., . . . et  al. (2008). Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454, 766–770. Mercer, T. R., Dinger, M. E., & Mattick, J. S. (2009). Long non-coding RNAs:  insights into functions. Nature Reviews Genetics 10, 155–159. Metzker, M. L. (2005). Emerging technologies in DNA sequencing. Genome Research 15, 1767–1776. Metzker, M. L. (2010). Sequencing technologies— the next generation. Nature Reviews Genetics 11, 31–46. Mikkelsen, T. S., Ku, M., Jaffe, D. B., Issac, B., Lieberman, E., Giannoukos, G., . . . et al. (2007). Genome-wide

maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553–560. Miranda, T. B., & Jones, P. A. (2007). DNA methylation:  the nuts and bolts of repression. Journal of Cellular Physiology 213, 384–390. Mohn, F., Weber, M., Rebhan, M., Roloff, T. C., Richter, J., Stadler, M. B., . . . Schubeler, D. (2008). Lineage-specific polycomb targets and de novo DNA methylation define restriction and potential of neuronal progenitors. Molecular Cell 30, 755–766. Montgomery, R. L., Hsieh, J., Barbosa, A. C., Richardson, J. A., & Olson, E. N. (2009). Histone deacetylases 1 and 2 control the progression of neural precursors to neurons during brain development. Proceedings of the National Academy of Sciences of the United States of America 106, 7876–7881. Moore, L. D., Le, T., & Fan, G. (2012). DNA methylation and its basic function. 38(1):23–38. Muers, M. (2011). Chromatin: a haul of new histone modifications. Nature Reviews Genetics 12, 744. Okano, M., Bell, D. W., Haber, D. A., & Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257. Olovnikov, I., Aravin, A. A., & Fejes Toth, K. (2012). Small RNA in the nucleus: the RNA-chromatin ping-pong. Current Opinion in Genetics & Development 22, 164–171. Palacios, D., Mozzetta, C., Consalvi, S., Caretti, G., Saccone, V., Proserpio, V.,...et al. (2010). TNF/p38alpha/polycomb signaling to Pax7 locus in satellite cells links inflammation to the epigenetic control of muscle regeneration. Cell Stem Cell 7, 455–469. Pellegrini, M., & Ferrari, R. (2012). Epigenetic analysis: ChIP-chip and ChIP-seq. Methods in Molecular Biology 802, 377–387. Pomraning, K. R., Smith, K. M., & Freitag, M. (2009). Genome-wide high throughput analysis of DNA methylation in eukaryotes. Methods 47, 142–150. Rajasethupathy, P., Antonov, I., Sheridan, R., Frey, S., Sander, C., Tuschl, T., & Kandel, E.R. (2012). 
A role for neuronal piRNAs in the epigenetic control of memory-related synaptic plasticity. Cell 149, 693–707. Ramos, A. D., Diaz, A., Nellore, A., Delgado, R. N., Park, K. Y., . . . Lim, D. A. (2013) Integration of genome-wide approaches identifies lncrnas of adult neural stem cells and their progeny in vivo. Cell Stem Cell 12, 616–628. Ramsahoye, B. H., Biniszkiewicz, D., Lyko, F., Clark, V., Bird, A.P., & Jaenisch, R. (2000). Non-CpG methylation is prevalent in embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proceedings of the National Academy of Sciences of the United States of America 97, 5237–5242.

Reik, W. (2007). Stability and flexibility of epigenetic gene regulation in mammalian development. Nature 447, 425–432. Schulze, A., & Downward, J. (2001). Navigating gene expression using microarrays: a technology review. Nature Cell Biology 3, E190–E195. Shen, Y., Fouse, S. D., & Fan, G. (2009). Genome-wide DNA methylation profiling: the mDIP-chip technology. Methods in Molecular Biology 568, 203–216. Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology 26, 1135–1145. Sidoli, S., Cheng, L., & Jensen, O. N. (2012). Proteomics in chromatin biology and epigenetics: Elucidation of post-translational modifications of histone proteins by mass spectrometry. Journal of Proteomics 75, 3419–3433. Smith, Z. D., Gu, H., Bock, C., Gnirke, A., & Meissner, A. (2009). High-throughput bisulfite sequencing in mammalian genomes. Methods 48, 226–232. Statham, A. L., Robinson, M. D., Song, J. Z., Coolen, M. W., Stirzaker, C., & Clark, S. J. (2012). Bisulfite sequencing of chromatin immunoprecipitated DNA (BisChIP-seq) directly informs methylation status of histone-modified DNA. Genome Research 22, 1120–1127. Suetake, I., Miyazaki, J., Murakami, C., Takeshima, H., & Tajima, S. (2003). Distinct enzymatic properties of recombinant mouse DNA methyltransferases Dnmt3a and Dnmt3b. Journal of Biochemistry 133, 737–744. Sumazin, P., Yang, X., Chiu, H. S., Chung, W. J., Iyer, A., Llobet-Navas, D., . . . et al. (2011). An extensive microRNA-mediated network of RNA-RNA interactions regulates established oncogenic pathways in glioblastoma. Cell 147, 370–381. Suzuki, M. M., & Bird, A. (2008). DNA methylation landscapes: provocative insights from epigenomics. Nature Reviews Genetics 9, 465–476. Szulwach, K. E., Li, X., Li, Y., Song, C. X., Wu, H., Dai, Q., . . . et al. (2011). 5-hmC-mediated epigenetic dynamics during postnatal neurodevelopment and aging. Nature Neuroscience 14, 1607–1616. Tahiliani, M., Koh, K. P., Shen, Y., Pastor, W.
A., Bandukwala, H., Brudno, Y., . . . et  al. (2009). Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 324, 930–935. Tan, L., & Shi, Y.G. (2012). Tet family proteins and 5-hydroxymethylcytosine in development and disease. Development 139, 1895–1902. Thomson, T., & Lin, H. (2009). The biogenesis and function of PIWI proteins and piRNAs:  progress and prospect. Annual Review of Cell and Developmental Biology 25, 355–376. Ulitsky, I., Shkumatava, A., Jan, C. H., Sive, H., & Bartel, D. P. (2011). Conserved function of lincRNAs in


vertebrate embryonic development despite rapid sequence evolution. Cell 147, 1537–1550. Wang, Z., Zang, C., Rosenfeld, J. A., Schones, D. E., Barski, A., Cuddapah, S., . . . et  al. (2008). Combinatorial patterns of histone acetylations and methylations in the human genome. Nature Genetics 40, 897–903. Watanabe, T., Tomizawa, S., Mitsuya, K., Totoki, Y., Yamamoto, Y., Kuramochi-Miyagawa, S., . . . et  al. (2011). Role for piRNAs and noncoding RNA in de novo DNA methylation of the imprinted mouse Rasgrf1 locus. Science 332, 848–852. Weber, M., Hellmann, I., Stadler, M. B., Ramos, L., Paabo, S., Rebhan, M., & Schubeler, D. (2007). Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nature Genetics 39, 457–466. Williams, K., Christensen, J., Pedersen, M. T., Johansen, J. V., Cloos, P. A., Rappsilber, J., & Helin, K. (2011). TET1 and hydroxymethylcytosine in transcription and DNA methylation fidelity. Nature 473, 343–348. Winter, J., Jung, S., Keller, S., Gregory, R. I., & Diederichs, S. (2009). Many roads to maturity:  microRNA biogenesis pathways and their regulation. Nature Cell Biology 11, 228–234. Wu, H., Coskun, V., Tao, J., Xie, W., Ge, W., Yoshikawa, K., . . . Sun, Y. E. (2010). Dnmt3a-dependent nonpromoter DNA methylation facilitates transcription of neurogenic genes. Science 329, 444–448. Wu, H., & Zhang, Y. (2011). Mechanisms and functions of Tet protein-mediated 5-methylcytosine oxidation. Genes & Development 25, 2436–2452. Wu, S. C., & Zhang, Y. (2010). Active DNA demethylation: many roads lead to Rome. Nature Reviews Molecular Cell Biology 11, 607–620. Wu, Z., Huang, K., Yu, J., Le, T., Namihira, M., Liu, Y., . . . Fan, G. (2012). Dnmt3a regulates both proliferation and differentiation of mouse neural stem cells. Journal of Neuroscience Research 90, 1883–1891. Xie, W., Barr, C. L., Kim, A., Yue, F., Lee, A. Y., Eubanks, J., . . . Ren, B. (2012). 
Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 148, 816–831. Young, R. A. (2000). Biomedical discovery with DNA arrays. Cell 102, 9–15. Yu, M., Hon, G. C., Szulwach, K. E., Song, C. X., Zhang, L., Kim, A., . . . et al. (2012). Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell 149, 1368–1380. Zemach, A., McDaniel, I. E., Silva, P., & Zilberman, D. (2010). Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328, 916–919.

3 The Role of Epigenomics in Genetically Identical Individuals

ZACHARY A. KAMINSKY

INTRODUCTION

The application of genomic techniques in any context allows for a global view of the behavior of specific factors across the genome. While human biology will always contain exceptions to the rules, genome-wide techniques have greatly helped to define what the rules are for the broad behavior of specific molecular marks. In the burgeoning field of epigenetics, there is a great need to understand the behavior of these factors because they are increasingly recognized as important for normal phenotypic variation as well as for the pathophysiological variation related to many complex diseases. This chapter presents work in which genomic techniques have provided a snapshot of the behavior of molecular epigenetic signals across the genome in the classical twin design, which includes genetically identical monozygotic (MZ) and fraternal dizygotic (DZ) twins. The application of epigenomic techniques to the classical twin design, in its simple elegance, enables a number of questions to be asked, not only about the behavior of epigenetic patterns in an experimental system free of the confounding effects of genetic heterogeneity but also about several of the key mysteries inherent in the study of phenotypic traits. In short, twins can help us understand epigenetics, which, in turn, can help us better understand the nature and nurture of human phenotypic variation. This chapter discusses what epigenomic twin studies have taught us so far and elaborates on the various strengths and weaknesses of the studies performed to date. Topics covered include the epigenomics of twin discordance, the roles of chorionicity and tissue of origin in the interpretation of epigenomic variation in the twin design, and evidence for the heritability of epigenetic signals.

EPIGENETICS AND THE CLASSICAL TWIN DESIGN

The comparison of phenotypic concordance in MZ and DZ twins has long been one of the most elegant systems for inferring the influence of inherited factors on a trait, since the degree of both genetic and environmental variation occurring in both groups is known. MZ twinning occurs when a single fertilized egg produces two embryos, while DZ twins result when two eggs are fertilized by separate sperm [1]. Therefore MZ twins share approximately 100% DNA sequence identity, whereas DZ twins, on average, share 50% of all segregating DNA polymorphisms [2–4]. Both MZ and DZ twins essentially share the same environment; thus, in the traditional model, if the DZ twin group is more variable for a trait than the MZ group, the trait is said to be under the influence of heritable factors. In the classical model, “heritability” (H) [3,5] has been attributed solely to genetic factors and has been defined by the following equation:

$$H = 2 \times (\text{DZ difference} - \text{MZ difference})$$

For quantitative traits, the correlation between co-twins within each twin group serves as the most common measure of co-twin similarity or, inversely, co-twin difference; it is often represented by the intraclass correlation coefficient (ICC) [6]. In using ICCs, the equation for heritability becomes:

$$H = 2 \times (\mathrm{ICC}_{\mathrm{MZ}} - \mathrm{ICC}_{\mathrm{DZ}})$$

Based on such models, and assuming that only genetic and environmental variations are at play, the proportion of genetic and environmental influences contributing to a phenotype can be calculated; this approach has directed research efforts to elucidate the specific causes implicated in such studies. For example, classical twin studies of major depressive disorder, one of the most common psychiatric disorders, have determined a heritability ranging from 0.36 to 0.7 [7–12]. An evaluation of 36 MZ and 53 DZ twins found panic disorder to be twice as frequent in MZ twins as in DZ twins [13], suggesting a heritability of 1, that is, that the disease is caused purely by genetic factors. Calculation of heritability based on probandwise concordance for bipolar disorder from 55 MZ and 54 DZ twins from the Danish twin registry yielded a value of 0.94 [14], with recent studies in other populations reaching the same conclusions [15]. An analysis of five twin studies in schizophrenia reported heritability values ranging from 0.80 to 0.85 [16]. The field of behavioral genetics has employed the classical twin design to fuel debate regarding the influence of “nature” and “nurture” on human behavior [17]. Some recently investigated traits include children’s food preferences [18], beverage intake [19], antisocial behavior [20], and sexual excitation in men [21]. The results and interpretation of such twin studies have fueled the search for the implicated genetic and environmental factors; however, a closer inspection of the assumptions behind the classical twin design shows that it may be less elegantly simple than it first seemed.
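The ICC-based heritability equation given earlier in this section (often called Falconer's formula) can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: it uses a one-way random-effects ICC for pairs, which is one common choice among several ICC variants, and the twin measurements are invented toy values rather than data from any study cited here.

```python
def icc_oneway(pairs):
    """One-way random-effects ICC for twin pairs (k = 2 members per pair)."""
    n = len(pairs)
    k = 2
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    pair_means = [(a + b) / 2 for a, b in pairs]
    # Between-pair mean square: how much pair averages differ from each other.
    msb = k * sum((m - grand_mean) ** 2 for m in pair_means) / (n - 1)
    # Within-pair mean square: how much co-twins differ from their own pair mean.
    msw = sum(
        (a - m) ** 2 + (b - m) ** 2 for (a, b), m in zip(pairs, pair_means)
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def heritability(icc_mz, icc_dz):
    """H = 2 * (ICC_MZ - ICC_DZ), as in the equation in the text."""
    return 2 * (icc_mz - icc_dz)


# Toy quantitative trait values for co-twin pairs (hypothetical numbers).
mz_pairs = [(10, 11), (14, 13), (20, 21), (25, 24)]
dz_pairs = [(10, 14), (14, 11), (20, 16), (25, 21)]

icc_mz = icc_oneway(mz_pairs)
icc_dz = icc_oneway(dz_pairs)
print("ICC_MZ:", round(icc_mz, 3))
print("ICC_DZ:", round(icc_dz, 3))
print("H:", round(heritability(icc_mz, icc_dz), 3))
```

Because the estimate is twice a difference of two correlations, small sampling errors in either ICC are doubled, which is one reason heritability estimates from small twin samples vary so widely.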

MZ Twin Discordance

The first problem with the classical twin design involves the concept of MZ twin discordance, which is a hallmark of complex non-Mendelian disease. Probandwise MZ concordance is 31% for male and 48% for female MZ twins in major depression [22], 62% to 79% in bipolar disorder [14], and 41% to 65% in schizophrenia [16]. Traditionally the levels of discordance are attributed to differences in environmental effects between twins. As these environmental influences must differ between the co-twins, this influence has been termed nonshared environment [23]; however, experimental evidence does not support a strong role for nonshared environment in phenotypic outcome. A number of studies have taken advantage of adoption registries and measured phenotypic similarity in MZ twins who were reared together as compared with those reared apart. The most famous of these studies, the Minnesota study of twins reared apart, tracked more than 200 twin pairs longitudinally from 1970 to 1990 and measured a number of behavioral characteristics, temperament, leisure time, occupation, and social interests [24]. Irrespective of the environmental influences over the course of these 20 years, the correlations for these measures were remarkably similar in the MZ co-twins of both groups [25]. While perhaps the largest of its kind, the Minnesota study is not the only adoption study to address the issue of nonshared environment. A review of studies from 1920 to 1987 demonstrated that MZ twin concordance rates for schizophrenia were ~46%, independent of whether twins were reared together or apart [26]. Twin discordance in atopic disease did not vary in measures of asthma, rhinitis, skin-test response, and serum IgE levels between twins reared together and apart [27]. Levels of type A–like behavior were evaluated in a large Swedish adoption/twin study cohort including 229 and 160 MZ twins reared apart and together, respectively, and no evidence for nonshared environment was found [28]. Beyond humans, a considerable degree of phenotypic variability has been observed in inbred and cloned animals, which contain minimal genetic variation, even after strict control of environmental variation [29–32]. It is apparent from these examples that for a number of phenotypes, environmental influence alone is insufficient to explain the observed levels of discordance between genetically identical organisms. This observation calls for the incorporation of additional factors capable of contributing to twin discordance, the leading candidates among which are epigenetic marks.
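The probandwise concordance figures quoted in this section differ from the simpler pairwise concordance, and the distinction affects how discordance is read. The sketch below assumes complete ascertainment (every affected twin is counted as a proband), under which probandwise concordance reduces to 2C / (2C + D) for C concordant and D discordant pairs. The pair counts are hypothetical and are not taken from the cited studies.

```python
def pairwise_concordance(concordant, discordant):
    """Fraction of affected pairs in which both twins are affected."""
    return concordant / (concordant + discordant)


def probandwise_concordance(concordant, discordant):
    """Fraction of affected twins (probands) whose co-twin is also affected.

    Assumes complete ascertainment, so each concordant pair contributes
    two probands and each discordant pair contributes one.
    """
    return 2 * concordant / (2 * concordant + discordant)


# Hypothetical counts: 20 concordant and 30 discordant MZ pairs.
c_pairs, d_pairs = 20, 30
print("pairwise:", round(pairwise_concordance(c_pairs, d_pairs), 3))
print("probandwise:", round(probandwise_concordance(c_pairs, d_pairs), 3))
```

For the same set of pairs the probandwise value is never lower than the pairwise value, so a probandwise concordance well below 100% implies that a substantial share of MZ pairs are discordant.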

Missing Heritability

The second problem with the interpretation of the classical twin design relates to the identification of the implicated heritable factors. The lack of replicated genetic associations with complex traits over the past ~30 years, together with the underwhelming results of genome-wide association studies (GWASs) in psychiatric disease since their inception, has led to the increasingly discussed issue of “missing heritability” [33]. In fact, it is not the heritability that is missing but the identification of the underlying factors responsible for it. There is little doubt that complex diseases such as schizophrenia and bipolar disorder are heritable, but why, then, have so few genetic variants been discovered? One possibility is that the genetic technologies used thus far have not had the resolution necessary to identify rare genetic variants undetectable by GWAS, a prospect that has fueled a recent surge of next-generation sequencing studies in complex disease. However, an alternative possibility is that the molecular factors implicated by heritability studies may go beyond genetic sequence variants and encompass epigenetic variation.

Epigenetics

The term epigenetics refers to the regulation of various genomic functions by partially stable modifications of DNA and histones [34]. This information is encoded in two types of synergistically acting covalent modifications: DNA methylation and chromatin protein modification [35]. In mammals, DNA methylation occurs most commonly on cytosines that are directly followed by a guanine, forming what is known as a CpG dinucleotide. Clusters of CpG dinucleotides are referred to as CpG islands [36]. 5-Methylcytosine is often referred to as the fifth base of the genetic code; however, its function is related to transcriptional control as opposed to DNA sequence–based coding. There are numerous examples demonstrating that the binding affinity of transcription factors is directly limited by the presence of methylation at their binding sites [37,38]. The density of 5-methylcytosine in a gene regulatory region also contributes to gene activity, with a large number of genes exhibiting an inverse correlation between the degree of methylation and the level of gene expression [39,40]. DNA methylation regulates genomic functioning not only through the expression of genes but also through the suppression of repetitive DNA sequences [41] and the formation of architecturally functional chromatin structures such as centromeric regions [42]. The methylation of DNA is mediated by proteins called DNA methyltransferases (DNMTs) [43]. The most widely studied of these are DNMT1 and DNMT3a/DNMT3b, representing the primary maintenance and de novo methyltransferases, respectively. Until recently, DNA methylation was believed to be a relatively permanent mark denoting silenced genomic regions; however, since the discovery of 5-hydroxymethylcytosine and its role in active DNA demethylation, this view is being called into question [44,45]. DNA modification acts in concert with the alterations in chromatin structure that occur through the acetylation, methylation, phosphorylation, ubiquitination, and sumoylation of various histone amino acid residues, including lysine, arginine, and serine [46–51]. A majority of epigenomic twin studies performed to date measure DNA methylation, since the chromatin immunoprecipitation assays used to investigate histone modifications require large amounts of tissue and are not always feasible in a clinical population. Epigenetic signals are necessary for the proper regulation and functioning of the genome [34], and epigenetic mutations, or epimutations, have the potential to be as harmful to an organism as genetic mutations. Knockout mice with homozygous deletions of DNA methyltransferase 1 (DNMT1) exhibit embryonic lethality. In addition to the regulation of gene activity [37,52–56], epigenetic factors may affect DNA mutability [57] and genetic recombination [58]. Epigenetic patterns are established in a tissue-specific manner and are believed to be responsible for establishing and maintaining the cellular identity of the more than 200 cell types in the human body [59–62].

Epigenetic Metastability The epigenetic status of genes and genomes is far more dynamic than the DNA sequence and is subject to changes under the influence of developmental programs, in the presence of internal or external environmental epigenetic modifiers, or simply as a result of stochastic processes relating to maintenance of epigenetic factors. Numerous lines of evidence suggest that DNA methylation undergoes a stochastic rearrangement referred to as metastability. Cell culture models of higher eukaryotic systems have demonstrated that metastability can result from the relatively low fidelity of the DNA methylation maintenance enzymes, such as DNMT1, as compared with that of the DNA repair machinery.63 Experiments in tissue culture and mice have found that maintenance DNA methylation fidelity ranges from 95% to 99.9%64 along with an additional fluctuation of 3% to 5% per mitosis in the form of de novo methylation.65 This translates to a mitotic fidelity of epigenetic patterns roughly three orders of magnitude lower than that of the DNA sequence (error rates of roughly 10^-6 and 10^-3 for DNA sequence and DNA modification, respectively).65 It is clear that some DNA methylation signals will be lost or gained through thousands of replication events, resulting in a DNA sequence-independent drift of epigenetic signals.
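The arithmetic behind this drift can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the cited studies: every CpG site is independent, a methylated site keeps its mark at each division with probability `fidelity`, and an unmethylated site gains a mark with probability `de_novo`. The function names and default values are hypothetical and chosen only for illustration.

```python
import random


def expected_retention(fidelity, n_divisions):
    """Probability that one methylated CpG is still methylated after
    n_divisions, if each division copies the mark with probability fidelity."""
    return fidelity ** n_divisions


def simulate_drift(n_sites=10_000, n_divisions=50, fidelity=0.99,
                   de_novo=0.0, seed=0):
    """Follow one cell lineage and return, for each division, the fraction
    of sites whose state differs from the starting pattern (toy model)."""
    rng = random.Random(seed)
    start = [rng.random() < 0.5 for _ in range(n_sites)]  # True = methylated
    state = list(start)
    history = []
    for _ in range(n_divisions):
        for i, methylated in enumerate(state):
            if methylated:
                # Maintenance fails with probability 1 - fidelity.
                if rng.random() > fidelity:
                    state[i] = False
            elif rng.random() < de_novo:
                # De novo methylation of a previously unmethylated site.
                state[i] = True
        changed = sum(a != b for a, b in zip(start, state)) / n_sites
        history.append(changed)
    return history
```

Under these assumptions, `expected_retention(0.999, 50)` is about 0.95, whereas `expected_retention(0.95, 50)` falls below 0.08, which shows how strongly the reported fidelity range affects the amount of drift accumulated over even a modest number of divisions.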

The Role of Epigenomics in Genetically Identical Individuals Returning to the classical twin design, the metastable nature of epigenetic signals on one hand and their primary role in determining various phenotypes on the other makes them ideal candidates to account for phenotypic differences in genetically identical organisms, including identical twins. A number of studies have identified epigenetic differences in MZ twins. In an investigation of skin fibroblasts from five female MZ twin pairs discordant for Beckwith-Wiedemann syndrome, an imprinting defect was identified at the KCNQ1OT1 gene of 11p15 only in the affected cases.66 The authors suggested that this locus may be vulnerable to aberrations in DNA methylation maintenance occurring during preimplantation and that the epimutation itself may predispose to the twinning event.66 A large-scale investigation of more than 300 twins demonstrated substantial variation of DNA methylation levels at a differentially methylated region (DMR) associated with the H19/IGF2 locus.67 DNA methylation patterns of pairs of MZ twins discordant for schizophrenia investigated in the promoter region of the DRD2 gene demonstrated that disease-affected individuals were more epigenetically similar to each other than to unaffected co-twins.68 After failing to identify sequence differences between a pair of MZ twins discordant for caudal duplication, a sodium bisulfite modification–based investigation revealed differential methylation of a CpG island functioning as a promoter in the AXIN1 gene; it was believed to be responsible for discordance.69 This methylation discordance was subsequently identified in affected singletons. A methylation-sensitive restriction screening method was applied to a pair of MZ twins discordant for bipolar disorder. 
Sodium bisulfite modification–based analysis through pyrosequencing revealed an altered DNA methylation pattern at the peptidylprolyl isomerase E-like (PPIEL) gene, which correlated with the expression differences between the twins.70 Using an HPLC-based detection method, Fraga et al. performed the first large-scale investigation of epigenetic differences in twins, identifying 35% of twins (N = 40) displaying significant global DNA methylation differences and acetylation differences involving histones H3 and H4. Large global epigenetic differences between twins corresponded to global gene transcription differences as measured by mRNA hybridization to Affymetrix Human U133 Plus 2.0 gene chips. Using a restriction enzyme–based


enrichment technique, the authors enriched the methylated fraction of genomic DNA, cloned it into plasmid vectors, and sequenced it, identifying similar percentages of DNA methylation difference corresponding to the twin pairs identified by global results. Hybridization of these enriched products from pairs of variable co-twins to metaphase chromatin demonstrated a higher density of epigenetic variation in the telomeres and several gene-rich regions.71 A  metastable drift of epigenetic factors could account for the observations of twin discordance in complex traits. These initial studies paved the way to identifying the presence of epigenetic differences in genetically identical organisms; however, the transition to genomic techniques as outlined in the following section has allowed deeper insight into the nature and nurture of epigenetic differences in twins.

THE EPIGENOMICS OF TWINS Epigenomic Technologies The past two decades have seen a dramatic increase in the available technologies for epigenetic profiling, both at individual loci and at the genome-wide level; these have led to a promising beginning to the profiling of the epigenome. Many of the new methods reflect an assimilation of existing high-throughput genome-scanning technologies such as microarrays, following a refitting to meet the complexities of epigenetic studies. The primary epigenetic technologies can be broken down into two categories: sodium bisulfite–based and enrichment-based technologies. The first set of methods is specific to DNA methylation and involves chemical treatment with sodium bisulfite, which deaminates all unmethylated cytosines to uracil while 5-methyl and 5-hydroxymethyl cytosine remain protected. This procedure produces sequence polymorphisms that, through PCR amplification, can be detected with a variety of techniques including cloning and sequencing, mass spectrometry,72 single-nucleotide extension techniques,73–77 and pyrosequencing.78 Until recently, these techniques have been confined to a site-specific quantification of DNA methylation percentages; however, the application of Illumina bead microarray technology to the measurement of these chemically induced


sequence polymorphisms represents a shift toward measuring DNA methylation using sodium bisulfite modification on the genomic scale. The first iteration of this assay was the Illumina GoldenGate assay,79 capable of interrogating DNA methylation status at 1,536 CpGs. Successive iterations resulted in ~27,000 and ~480,000 CpGs interrogated and were referred to as the HM27 and HM450 platforms, respectively. Next-generation sequencing techniques can also be applied to sodium bisulfite–modified DNA; however, because bisulfite modification reduces the four-base DNA sequence code to a largely three-base code, performing accurate alignment of sequence reads is challenging, making it a very expensive venture at this point in time. Importantly, it is the single-bp resolution of post–sodium bisulfite modification techniques that makes them the “gold standard” method of DNA methylation quantification. Despite the high degree of accuracy of these techniques, sodium bisulfite modification does not distinguish between 5-methyl cytosine and 5-hydroxymethyl cytosine, such that the detected percentage of unconverted cytosines at the end of the assay represents the cumulative total of these modifications per genomic coordinate. The second set of methods for epigenetic analysis relies on the segregation of the desired components of the genome, either with antibodies specific to the epigenetic mark or through selective cutting by methylation-sensitive restriction enzymes that will cut only at specific unmethylated consensus sequences. Enrichment of the DNA sequences incorporating the desired components is followed by identification of the isolated sequences through hybridization to microarrays or sequencing techniques. 
While these techniques can be used to interrogate DNA methylation, they are not confined to the 5-methyl cytosine modification and thus represent the primary methods used to investigate histone modifications and 5-hydroxymethylation; however, they are confined to the precision of the enrichment technology and therefore often perform at a much lower resolution than sodium bisulfite modification techniques. The epigenomic technologies most heavily used in twin studies involve the interrogation of tens of thousands of genomic regions with microarrays. The microarray-based studies are limited to the array platform employed;

a number of these with varying resolutions are available, such as spotted oligo arrays ranging from 12,000 loci of about 1 kb in length, such as the human CpG island microarray,80 to tiling arrays with ~40 million probes spaced at regular intervals, often below 100 bp. Microarray probes are often designed to highlight key functional areas, like CpG islands, gene promoters, exons, and 3′ UTRs, among others. In this way, these platforms fulfill a form of candidate gene approach, but over tens or hundreds of thousands of regions.
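The logic of the sodium bisulfite–based methods described above can be sketched in a few lines. This is an illustrative toy only: the sequence, read set, and function names are hypothetical, and the sketch assumes complete conversion and error-free reads. Unprotected cytosines are deaminated to uracil and read as T after PCR, while protected cytosines (5-methyl or 5-hydroxymethyl, which the chemistry cannot tell apart) still read as C.

```python
def bisulfite_convert(seq, protected_positions):
    """Return the sequence as read after bisulfite treatment and PCR.

    Unprotected C reads as T; a protected C (5mC or 5hmC) still reads as C.
    """
    protected = set(protected_positions)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


def methylation_percent(reads, position):
    """Percentage of reads showing an unconverted C at the given position."""
    calls = [read[position] for read in reads if read[position] in "CT"]
    if not calls:
        return None  # no informative reads at this coordinate
    return 100.0 * calls.count("C") / len(calls)


# Hypothetical locus with cytosines at positions 1 and 4.
reference = "ACGTCGA"
reads = [bisulfite_convert(reference, {1})] * 3 + [bisulfite_convert(reference, set())]
print(methylation_percent(reads, 1))  # 75.0
print(methylation_percent(reads, 4))  # 0.0
```

The percentage returned at each coordinate is the cumulative signal of 5-methyl and 5-hydroxymethyl cytosine, as noted in the text. The sketch also hints at why read alignment becomes harder: after conversion most reads contain very few C residues, so they carry less information for placing them on the reference.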

Epigenomic Discordance in Identical Twins The application of genomic techniques to the study of epigenetic variation in the classical twin design has led to a dramatic increase in our understanding of how these important molecular signatures behave as a function of age, tissue, and differing genomic features such as gene promoters and CpG islands (CGIs). These studies pave the way for an investigation of epigenetic factors as etiological factors in twin discordant phenotypes, as in complex disease. While the study by Fraga et al.71 was the first to identify large-scale epigenetic differences between twins, the global techniques employed lack resolution and the ability to characterize specific genomic regions subject to high or low epigenetic drift. In 2009, Kaminsky et al. published the first microarray-based DNA methylation profiling study using the classical twin design.81 In 20 pairs of MZ twins, the authors used a DNA methylation–sensitive enzyme technique82 to enrich for the unmethylated fraction of genomic DNA from peripheral blood, buccal swab, and gut biopsy samples. These were hybridized onto a CpG island microarray containing 12,192 probes located primarily in CpG-rich regions across the genome.80 The blood and buccal tissue originated from adolescent twins ranging between 12 and 16 years of age; the gut biopsy age range covered a larger spectrum. As this was the first epigenomic twin study performed using a novel genome-wide epigenetic technology, it was first critically important to prove the existence of detectable biological epigenetic differences between MZ twins over levels of technical variation. To accomplish this, the authors compared the degree of epigenetic variation between MZ twin and MZ co-twin hybridizations to that of MZ twin and self-hybridizations across four twin pairs (Figure 3.1).81 Using a nonparametric

FIGURE 3.1: Biological vs. technical variation. Volcano plots of four MZ twin vs. co-twin WBC DNA methylome comparisons (black) overlaid with four matched twin DNAs vs. self comparisons (gray) for each set of MZ twins. The x-axis represents the mean fold change across the four replicates; the y-axis represents the –log10 of the P value from a paired t-test. Higher significance denotes a higher consistency between replicates. Significant variation in the spread of detected biological difference exists between twin pairs (Kruskal-Wallis χ² = 16.3, df = 3, P = 0.001) with a symmetrical large (A and B), symmetrical small (C), and asymmetrical (D) variation of the DNA methylome between co-twins. For each twin pair, a nonparametric Ansari-Bradley test demonstrated that levels of variance (σ²) in the MZ twin vs. co-twin comparison were significantly larger than σ² in the self-self comparisons (twin set A: variance ratio = 2.91, P = 1.4 × 10^-238; set B: 2.14, P = 1.1 × 10^-202; set C: 1.12, P = 2.1 × 10^-7; set D: 2.63, P = 2.6 × 10^-39). Levels of technical variation were not significantly different between groups (Kruskal-Wallis χ² = 1.81, df = 3, P = 0.62).

[Figure: panels (A) to (D); x-axis, Fold Change (–0.4 to 0.4); y-axis, –Log (P value) (0 to 4).]

Source: Reproduced from Kaminsky et al., Nature Genetics, 2009.81

Ansari-Bradley variance test, a higher degree of DNA methylation variation was identified in the co-twin hybridizations as compared to self-self hybridizations across all pairs, while the levels of technical variation were not different between groups. This represents the first experiment to demonstrate on a genome-wide scale that differences in DNA methylation exist between genetically identical twin pairs. It was apparent from the findings that epigenetic differences existed at numerous genomic regions and that the degree of epigenetic drift varied per sibling pair. The authors were next interested in categorizing regions of epigenetic change per tissue. Using the ICC statistic across the MZ twin cohorts, representing the degree of epigenetic similarity per sib pair, Kaminsky et al. found that the degree of epigenetic drift within gene promoter regions and CGIs was significantly smaller relative to other regions of the genome in both blood and buccal tissue. This change was consistent in the gut sample but did not survive correction for multiple testing. In order to take advantage of the higher resolution of this technique compared with earlier methods, gene ontology analysis was performed on the top and bottom fifth percentile of epigenetically variable regions per tissue. Thus it was found that epigenetic drift was minimized at genes seemingly relevant to the function of

the tissue of origin, whereas higher epigenetic drift appeared to be occurring at genes involved in cell division. Since the time of this initial report, a growing number of studies have investigated the degree of epigenetic variation in MZ twins using genome-wide approaches. A  large majority of these studies have either replicated these initial observations or added to our understanding of the behavior of epigenetic changes in genetically identical individuals. Gervin et al. isolated CD4+ cells from 49 MZ and 40 DZ pairs, cultured them, and performed sodium bisulfite sequencing of the major histocompatibility complex (MHC) region, resulting in quantification of DNA methylation status of 1,760 individual CpGs.83 A strength of this study is that the isolation of a specific population of cells will eliminate spurious findings due to cellular heterogeneity; however, a potential concern is that these cells were cultured prior to DNA methylation analysis. Many recent genome-wide DNA methylation studies have investigated the effects of culturing on epigenetic patterns and identified a random degradation of DNA methylation patterns, which increases with increased cell culture passage.84–86 To control for such effects, Gervin et  al. performed a series of control experiments to demonstrate their ability to detect biological signals over technical artifacts, suggesting that the degree of culture-induced degradation of the DNA methylation pattern


may have been very low. Consistent with the results presented by Kaminsky et al., MZ twin ICC values were higher at the 5′ regions of genes and within CGIs as compared with conserved noncoding regions (CNCs) and randomly selected sequences. In an investigation of 23 MZ and 23 DZ matched twin pairs and 96 singletons, Boks et al. used the Illumina GoldenGate assay to investigate the association of DNA methylation with age at 280 CpGs that they deemed to be of good quality in their sample.87 The authors identified 56 loci associated with age in the twin group that were replicated in the alternative cohort of singletons, with the ages of the two cohorts ranging from the early twenties to the mid- to late fifties.87 In a similar study, Bocklandt et al. performed epigenomic profiling using the Illumina HM27 microarray in a set of MZ twins ranging in age from 21 to 55 years; they identified DNA methylation variation correlated with age at 80 genes.88 The top epigenetic differences from the twin study were replicated in an alternative sampling of the general population and were found to be predictive of age within 5 years. Although the number of genes identified in these two studies appears to be relatively small compared with the ~25,000 genes in the human genome, both studies attempted to identify age-associated DNA methylation replicable across multiple cohorts. Taken together, these results suggest that factors influencing the divergence of epigenetic patterns with age may not be confined purely to stochastic epigenetic drift, as one would not expect to see age-associated loci consistent across multiple populations in a purely random model. Although the analysis of X-chromosome inactivation is not performed on the genomic scale, skewed X-chromosome inactivation stands as a metric of stochastic epigenetic drift and thus fits into the discussion of metastable epigenetic changes during development. 
Two studies to date have investigated the degree of skewed X-chromosome inactivation in MZ and DZ twins over varying ages. In a prospective sample of MZ and DZ twins, Wong et al. identified a relatively stable rate of skewed X-chromosome inactivation between 5 and 10 years within the same individual.89 When the skewing rates were compared between the two time points in the same individuals, 75% of the sample demonstrated a change in skewing rate less than 10%, while the remaining 25%

demonstrated a 20% to 30% change over the 5-year period. In an earlier report investigating 118 twin pairs ages 18 to 53 and 82 twin pairs ages 55 to 95, skewed X-chromosome inactivation rates were demonstrated at 15% and 35%, respectively, in the two populations.90 Together these studies paint the picture of relatively stable rates of stochastic change earlier in life followed by a progression of epigenetic change with time. Upon close examination, it appears that periods of major hormonal rearrangement, as at puberty and menopause, may contribute to the observed levels of stochastic change. For example, in the study by Wong et al., a portion of the sample appears to increase the skewing rate to ~20%, with extreme cases around 30%. Could the individuals showing this trend have hit puberty earlier than the others? In the study by Kristiansen et al., the age groupings and divergent rates of X-chromosome inactivation reported are separated around the mid-fifties, a period likely to correspond with menopause. The prospect of specific periods of epigenomic instability and rearrangement coinciding with periods of hormonal change such as puberty and menopause may add credence to the suggested involvement of epigenetic changes in psychiatric diseases, which exhibit increased incidence at these time points. All of the above studies investigate epigenetic discordance during postnatal life; however, one study has made use of samples taken at birth from the Peri/Postnatal Epigenetic Twin Study to investigate prenatal epigenetic changes. Gordon et al. evaluated gene expression at birth as a proxy for epigenetic changes using cord blood mononuclear cells (CBMCs) that included T and B cells and CD31+ umbilical cord vascular endothelial cells (HUVECs) in 12 and 10 MZ pairs, respectively. 
The authors identified a significant degree of variation in gene expression in both tissue types.91 The results of these studies demonstrate that epigenetic factors diverge within genetically identical individuals, most likely because of the metastable epigenetic drift highlighted above. As these individuals have essentially shared the same environment, the conclusion derived from such studies is that, barring the existence of additional factors differing between genetically identical and environmentally similar individuals, a divergence of epigenetic factors represents a promising candidate to account for phenotypic discordance observed in twin studies. That being said, there are added complexities to this

conclusion that must be factored into the interpretation of twin studies related to phenotype.

Factors Influencing Discordance The first factor influencing twin discordance relates to the timing of the twinning event in MZ twins, which is responsible for the generation and subclassification of monochorionic (MC) and dichorionic (DC) twins. DC twinning occurs prior to four days postfertilization in approximately 25% to 30% of twinning events and ultimately results in MZ twins with separate chorions and amniotic sacs.92 MC twinning represents a majority of twinning events and occurs approximately four days after fertilization.92 At this point in developmental timing, the formation of the chorion has already begun and results in both twins sharing a placenta. In the latter MC twinning scenario, an unequal distribution of placental connections, or anastomoses, per twin may essentially result in a different nutritional environment in utero. In the literature, higher rates of phenotypic discordance are generally observed in MC over DC MZ twins, most prominently in birth weight discordance and gestational outcomes,93,94 which is often attributed to an unequal distribution of nutritional and placental resources reaching the two developing embryos.95,96 Extreme forms of unequal blood flow to MC MZ twins result in a phenotype called the twin-to-twin transfusion syndrome (TTTS) and are associated with possible premature death of one twin and an increased risk for neurological and other complications later in life for surviving siblings.97 Consistent with the interpretation of intrauterine environmental influences on epigenetic status in the offspring, Kaminsky et al. evaluated the DNA methylation of 10 pairs of MC MZ twins and 10 pairs of DC MZ twins in DNA derived from buccal tissue; they identified a markedly higher degree of epigenetic variation in the MC MZ twins.81 These results are in opposition to a recent microarray study investigating gene expression discordance between MZ twins, reviewed above.91 Unlike the results of the Kaminsky study, Gordon et al. 
identified a greater discordance in DC MZ twins as compared with MC MZ twins in CBMCs and HUVECs.91 It seems unlikely that the use of gene expression as a proxy for epigenetic programming can account for these differences, as the behavior of prenatal gene expression and DNA methylation have been observed to


follow similar/anticorrelated trajectories, suggesting that expression should be a good proxy of genome-scale epigenetic variation at this developmental time point.98 This question will be answered soon, as Gordon et al. are purportedly now analyzing genome-wide DNA methylation in this sample.99 Despite the disparate appearance of these results at first glance, a careful consideration of the tissue of origin of these findings may hold the key to understanding the nature of the difference between studies and lead us into another important consideration in interpreting the epigenomics of twin studies. Another consequence of sharing placental anastomoses is that the hematopoietic stem cells from which the organism’s blood derives can be shared between MC MZ twins, a phenomenon commonly referred to as chimerism. In this way, one twin’s blood becomes a chimera of the other’s. The epigenetic status between MC MZ twins may become chimeric primarily in the blood and appear similar compared to that of DC MZ twins, while in other tissues the epigenome may be diverging more than in DC MZ twins due to the cumulative influences of stochastic epigenetic drift as well as nonshared intrauterine environmental influences. In fact, epigenetic studies of the imprinting control region (ICR) dysregulated in Beckwith-Wiedemann syndrome suggest that chimerism may also result from the shared placenta originating from only one of the two MC MZ twins.100 Chimerism of the blood-derived DNA methylation status appears consistent with the observations to date. Kaminsky et al. did not evaluate peripheral blood DNA methylation in MC MZ twins in order to avoid issues of hematopoietic stem cell sharing in the interpretation of heritability (discussed below); however, they were able to analyze this in DNA derived from buccal swabs. A non–genome-wide study by Ollikainen et al. 
investigated the degree of DNA methylation discordance at four differentially methylated regions (DMRs) surrounding the IGF2 locus in the sample of CBMCs and HUVECs referenced above but also buccal swab and granulocyte-derived DNA.98 Consistent with the proposed model, the authors observed a higher DNA methylation discordance at their interrogated loci in DC MZ twins in the hematopoietic stem cell–derived CBMCs but a lower degree of discordance in the DC MZ group in the HUVEC, buccal, and granulocyte DNAs compared with the MC MZ group.98 Returning


to the expression study of Gordon et  al., the authors reference the possibility of hematopoietic stem cell sharing in their sample but point out that one MC MZ twin pair diagnosed with TTTS did not demonstrate the highest or lowest degree of epigenetic discordance. In accordance with the theories presented above related to discordant intrauterine environment, the authors suggest that unequal blood sharing may have affected the observed levels of discordance.91 This final observation seems to tie together the two critical factors influencing the interpretation of epigenomic studies in twin blood as opposed to tissues of nonhematopoietic origin (Figure  3.2), namely that an increased concordance due to hematopoietic stem cell sharing may be potentially attenuated by an increased discordance resulting from differing intrauterine environments. Bearing in mind that a majority of MZ twins (~70%) in the population will be monochorionic, what does this mean for the interpretation of epigenomic studies investigating epigenetic factors in the blood of discordant twins?


Implications of Epigenomic Twin Findings One of the underlying implications of this work is that epigenetic drift independent of genetic factors may contribute to phenotypic differences, which may in turn help to explain features of complex non-Mendelian disease such as twin discordance. In fact, there is a growing literature investigating epigenetic factors in MZ twins discordant for complex disease on the genomic scale. In a recent study

performed by Dempster et  al., peripheral blood DNA methylation was assessed using the Illumina HM27 bead array platform in a sample of twins discordant for bipolar disorder and schizophrenia.101 The authors identified over 100 CpG dinucleotides differentially methylated between affected and unaffected MZ twins in a combined bipolar disorder and schizophrenia analysis. Pathway analysis identified enrichment of DNA methylation changes associated with genes related to “psychological disorders,” “dopamine receptor signaling,” and “nervous system development and function.” In an investigation of 105 MZ twin pairs discordant for psoriasis, Gervin et  al. used the Illumina HM27 platform to search for DNA methylation differences in CD4+ and CD8+ cultured lymphocytes.102 Initially, no DNA methylation differences associated with psoriasis were detected; however, the authors subsequently evaluated if MZ co-twin DNA methylation differences correlated with gene expression array measures for these twins. Pathway analysis identified enrichment for “immune response” and “cytokine” pathways previously implicated in the psoriasis phenotype.102 A  genome-wide scan of 30 MZ twins discordant for systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), and dermatomyositis (DM) was performed using Illumina custom bead arrays to evaluate 807 gene promoter regions.103 In the group of five SLE-discordant twin pairs, 49 genes exhibited significant DNA methylation differences and were enriched for gene ontology terms related to immune function, as would be expected for an autoimmune disease.
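The within-pair logic of these discordant-twin designs can be made concrete with a small sketch. Because co-twins are genetically matched, each CpG can be tested on the difference between the affected and unaffected twin of every pair. The paired t statistic below is one common choice and is an assumption here, since the statistics used by the cited studies are not described in this chapter; the methylation values are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(affected, unaffected):
    """Paired t statistic for one CpG across discordant MZ twin pairs.

    affected[i] and unaffected[i] are methylation fractions (0 to 1)
    for the two members of pair i.
    """
    diffs = [a - u for a, u in zip(affected, unaffected)]
    if len(diffs) < 2:
        raise ValueError("at least two twin pairs are required")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("within-pair differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical methylation fractions at a single CpG in four discordant pairs.
affected = [0.60, 0.70, 0.65, 0.80]
unaffected = [0.50, 0.55, 0.60, 0.60]
t_stat = paired_t(affected, unaffected)
```

In a genome-wide setting this statistic would be computed for every interrogated CpG, so any resulting P values would need correction for multiple testing before loci are reported as differentially methylated.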

FIGURE 3.2: Epigenetic drift in MC and DC MZ twins. A hypothetical model highlighting influences on the degree of epigenetic discordance measured at birth and in adolescence as a function of tissue type and chorionicity. The respective discordance-inducing and discordance-reducing effects of unequal blood flow and hematopoietic stem cell sharing (chimerism) on the MC MZ group are depicted by dashed arrows. Note that the depicted degrees and rates of epigenetic change at each given time point are not to scale.

[Figure: x-axis, Age (marked at DC twinning, MC twinning, and birth); y-axis, Epigenetic Discordance; curves for DC MZ blood, MC MZ blood, DC MZ nonblood, and MC MZ nonblood; arrows labeled Unequal blood flow and Chimerism.]

Based on the model depicted in Figure 3.2, it is tempting to make inferences related to the developmental origin of epigenetic differences identified in discordant twin populations and the power that differing chorionicity and tissue groups would have to detect these. In peripheral blood, twin studies to date suggest that MC MZ twins show fewer differences than DC MZ twins at birth. If an epimutation was acquired during prenatal development, chimerism might be expected to reduce the observable effect in MC MZ twins as compared with DC MZ twins, reducing the power to identify epimutations of this developmental origin in population-based twin studies comprising mostly MC MZ twins. Conversely, the epigenetic modifying effects of postnatal environmental influence or stochastic epigenetic drift may be equally powered in both chorionicity types. Therefore a multitissue study taking into account chorionicity may represent the most powerful way to identify epimutations arising during different developmental stages. For example, DC MZ blood and nonblood samples may be ideal for identifying epimutations resulting from stochastic epigenetic drift and postnatal nonshared environmental influences. A comparison of these findings with those derived from MC MZ nonblood samples may be ideal for pinpointing epimutations likely to result from intrauterine environmental differences, while comparing DC MZ twin epigenetic profiles with MC MZ blood-derived profiles may help to segregate epimutations resultant from postnatal nonshared environmental effects. Unfortunately, as the rate of DC MZ twinning is more than two times lower than that of MC MZ twinning, the prospect of obtaining well-categorized tissues from discordant MZ pairs seems like a lofty goal at this time. Heritability: As mentioned above, one of the primary strengths of performing twin studies is to ascertain the degree of heritability of a given trait. 
A myriad of twin studies have fueled the search for genetic factors contributing to phenotypes ranging from personality and behavior to complex psychiatric disease, but the results have not been overly fruitful. Could it be that a portion of the heritability implicated by twin studies results not from DNA sequence–based factors but from a passage of epigenetic information through the germline? With the exception of a number of well-characterized loci in mice,104,105 it is conventionally believed that in mammals there is no passage of epigenetic information from parent to offspring generations owing to the massive epigenetic rearrangements occurring during gametogenesis and at fertilization,51,106–109 including a global demethylation of DNA accompanied by massive histone modification rearrangement.110 In mice, however, there are clear exceptions to this rule, and until recently no studies had investigated evidence for a heritable component to DNA methylation on the genomic scale. Is it possible that humans may also exhibit some form of epigenetic inheritance? The application of genome-wide DNA methylation profiling in the classical twin design allows for the assessment of the heritability of epigenetic factors and represents the first step to understanding epigenetic inheritance.

In their 2009 study published in Nature Genetics, Kaminsky et al. performed the first genome-wide evaluation of epigenetic heritability in a human population. In peripheral blood, they compared 20 pairs of DC MZ twins with 20 matched DZ twin pairs; in buccal swab samples, they compared a mixed cohort of 10 MC MZ and 10 DC MZ twins with matched DZ twins.81 To avoid the influence of chimerism or unequal blood flow, the authors excluded MC MZ twins from the peripheral blood analysis. The authors biased the sample against falsely identifying evidence for a heritable influence on DNA methylation by matching MZ twin pairs with DZ pairs based on major cell counts derived from blood hematology reports such that MZ co-twin cellular heterogeneity was higher than that in the DZ group. A comparison of ICCs generated on the 12K CpG island microarray showed DZ twins to be more epigenetically variable than MZ twins, a result representing the first evidence for a heritable component to DNA methylation in both the blood and buccal samples on the genomic scale.
The observed effect was significantly higher in the buccal tissue; however, as discussed above, the MC MZ twin group displayed higher degrees of epigenetic variation, which reduced the observed heritability in this group. A number of subsequent studies have replicated these findings. Many of the studies initially investigating MZ twins also evaluated matched cohorts of DZ twins in attempts to assess the heritability of epigenetic marks. The results of these studies must be evaluated with consideration of the available MZ twin chorionicity information.
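The logic of comparing MZ and DZ similarity can be made concrete with a small sketch. It assumes a one-way intraclass correlation computed at a single locus across twin pairs, and it uses Falconer's approximation, h2 ≈ 2(ICC_MZ − ICC_DZ), a standard classical twin design estimate that is not necessarily the statistic used in the studies cited above. The function names and the simulated methylation values are hypothetical.

```python
import numpy as np


def pair_icc(pairs):
    """One-way random-effects ICC for twin pairs at a single locus.

    pairs: array-like of shape (n_pairs, 2), one methylation value per co-twin.
    """
    pairs = np.asarray(pairs, dtype=float)
    n_pairs, k = pairs.shape
    pair_means = pairs.mean(axis=1)
    grand_mean = pairs.mean()
    # Between-pair and within-pair mean squares from a one-way ANOVA.
    ms_between = k * np.sum((pair_means - grand_mean) ** 2) / (n_pairs - 1)
    ms_within = np.sum((pairs - pair_means[:, None]) ** 2) / (n_pairs * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def falconer_h2(icc_mz, icc_dz):
    """Falconer's approximation: twice the gap between MZ and DZ correlations."""
    return 2.0 * (icc_mz - icc_dz)


def simulate_pairs(rng, n_pairs, shared_sd, unique_sd):
    """Hypothetical data: pair-level shared effect plus individual-level noise."""
    shared = rng.normal(0.0, shared_sd, size=(n_pairs, 1))
    unique = rng.normal(0.0, unique_sd, size=(n_pairs, 2))
    return 0.5 + shared + unique


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed parameters: MZ co-twins share more of their variance than DZ co-twins.
    mz = simulate_pairs(rng, 20, shared_sd=0.10, unique_sd=0.05)
    dz = simulate_pairs(rng, 20, shared_sd=0.07, unique_sd=0.09)
    icc_mz, icc_dz = pair_icc(mz), pair_icc(dz)
    print(f"ICC MZ = {icc_mz:.2f}, ICC DZ = {icc_dz:.2f}")
    print(f"Falconer h2 = {falconer_h2(icc_mz, icc_dz):.2f}")
```

With only 20 pairs per group, single-locus estimates of this kind are noisy; a genome-wide analysis would more plausibly compare the distribution of ICCs across many loci between the MZ and DZ groups.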


In the study by Gervin et al., sodium bisulfite sequencing data of the MHC region of 49 MZ twin pairs were compared with data from 40 DZ twin pairs; the investigators reported a modest heritability ranging from 2% to 16% across random sequences, CGIs, CNCs, and 5’ ends of genes.83 While heritability appeared modest, gene promoters demonstrated a higher degree of heritability than nonpromoter regions. Similarly, Boks et al. compared 23 MZ twin pairs consisting of both MC and DC MZ twins with 23 DZ twin pairs using the Illumina GoldenGate assay and obtained significant heritability scores at 23% of their 280 CpGs assayed.87 A portion of these were found to be associated with genetic variations in cis. A critical difference between the Kaminsky et al. study and the studies performed by Gervin et al. and Boks et al. is the latter studies' use of DNA originating from peripheral blood cells of both MC and DC MZ twins. It remains possible that the heritability estimates reported by Gervin et al. and Boks et al. may be affected by the prenatal influences of shared chorionicity on twin discordance described above. Cumulatively, and despite the various complexities associated with MZ twin chorionicity, it appears that all epigenomic twin studies performed to date are detecting at least some degree of a heritable influence on DNA methylation patterns across tissues in humans.

What then might be the substrate for this heritability? Two explanations for these findings are possible: genetic and nongenetic inheritance. First, the DZ twin group is genetically nonidentical, so the larger degree of epigenetic variation in this group could be due to the influence of genetic variation on epigenetic patterns. While we defined epigenetic patterns as sequence-independent factors, in fact there is an extensive and growing literature demonstrating an influence of genetic variation on epigenetic signals.
This should not be surprising, as functional genetic variation within epigenetically modifying genes or the binding motifs for epigenetic modifying complexes would be expected to result in epigenetic variation. Earlier studies have demonstrated that SNPs influencing transcription factor binding may gradually alter epigenetic signatures in the region by altering “methylation encroachment” of CpG islands by highly methylated CGI shores,111 which has recently been observed to occur in various human leukemias.112 Somatic mutations in the EZH2 gene have been shown to increase the enzyme’s ability to create histone 3 lysine 27 (H3K27) trimethylation marks in follicular and B-cell type lymphoma.113 Genome-wide studies have shown that a majority of sequence-associated epigenetic change occurs in cis.98,114–117 In an elegant set of experiments, Kerkel et al. performed the first genome-wide screen for allele-specific DNA methylation (ASM) using methylation-sensitive and methylation-insensitive restriction enzyme treatments on Affymetrix genotyping microarrays to identify alleles of SNPs associated with differing levels of DNA methylation.118 Soon afterward, a number of additional studies confirmed and further documented the extent of ASM in alternate populations and tissues.115,117 In a recent report, reduced representation bisulfite sequencing was used to evaluate genotype and DNA methylation in six members of a three-generation family; it identified epigenetic variation under genetic influence in cis that was enriched in gene coding and intergenic regions and underrepresented in CGIs.114 It is important to consider that ASM-associated genetic markers may either influence the establishment of epigenetic variation or may merely exist as partially correlated markers, as in the case of polymorphisms at imprinted loci. Taking all these observations together, it stands to reason that a higher degree of genetic variation in the DZ twin groups studied above will result in a higher epigenetic variation and thus evidence for epigenetic heritability controlled by genetic factors. This interpretation is consistent with the current understanding of phenotypic inheritance but does nothing to address the aforementioned quandaries related to the missing heritability in complex disease.
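The idea that a SNP allele can be associated with local methylation levels can be illustrated with a minimal sketch. It regresses the methylation fraction at one CpG site on genotype dosage (0, 1, or 2 copies of an allele) and reports the slope and r². This is an assumed, simplified way of quantifying such an association, not the restriction enzyme or bisulfite sequencing pipelines of the cited studies; the function name and toy data are hypothetical.

```python
import numpy as np


def cis_methylation_association(dosage, methylation):
    """Least-squares fit of methylation fraction on allele dosage.

    dosage: allele counts (0, 1, or 2) per individual at one SNP.
    methylation: methylation fraction (0 to 1) per individual at a nearby CpG.
    Returns (slope, r_squared): change in methylation per allele copy and
    the share of methylation variance explained by genotype.
    """
    dosage = np.asarray(dosage, dtype=float)
    methylation = np.asarray(methylation, dtype=float)
    slope, _intercept = np.polyfit(dosage, methylation, 1)
    r = np.corrcoef(dosage, methylation)[0, 1]
    return slope, r ** 2


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    dosage = [0, 0, 1, 1, 2, 2, 1, 0]
    methylation = [0.21, 0.25, 0.38, 0.41, 0.58, 0.61, 0.36, 0.19]
    slope, r2 = cis_methylation_association(dosage, methylation)
    print(f"slope per allele = {slope:.3f}, r^2 = {r2:.3f}")
```

A strong association of this kind does not by itself show that the variant establishes the methylation difference; as noted above, the marker may only be correlated with it.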
At this point, it is useful to highlight an elegant experiment performed by Gartner and Baunack that suggests the existence of DNA sequence-independent factors influencing inherited phenotype.30,119 By splitting murine blastocysts, the authors were able to create inbred MZ twin mice that originated from the same germ cells, which they could then compare with inbred polyzygotic mice fertilized by separate germ cells. This scenario is akin to the means by which MZ and DZ twins, respectively, arise. After strictly controlling for environment and employing a classical twin design analysis, approximately 75% of the variance in body weight was determined to result from an as yet unidentified “third component,” independent of genetic and environmental factors. This experiment created a system where the only difference between the zygosity groups was the status of contributing germ cells.30,119 Therefore the third component identified in these experiments is an inherited factor independent of the DNA sequence and environment. These results suggest that perhaps the majority of classical twin designs are pointing to molecular factors beyond the DNA sequence and environment that are influencing phenotypic outcome, namely epigenetic factors.

Returning to the results implicating a heritable component to DNA methylation, an alternative possibility is that these twin studies are detecting a carryover of variation resulting from the initial germ cells from which the MZ and DZ twins arose. DNA methylation exhibits interindividual variation,120 which, if passed to the offspring, would result in evidence for epigenetic heritability, as observed in the above twin studies. Since MZ twins arise from a single sperm and egg pairing while DZ twins result from separate fertilizations, epigenetic differences between the separately fertilizing germ cells in the DZ pair may result in a larger degree of epigenetic variation in the DZ twin cohort if these signals are not completely erased during fertilization. Most likely, a degree of the heritability detected in the above twin studies does arise from genetic variation in the DZ twin group; however, the results of a subsequent analysis performed by Kaminsky et al. suggest that genetic variation may not be the only factor contributing to these results.
Through the analysis of DNA methylation profiling at approximately 3,000 genomic loci in the brains of two groups of genetically identical and genetically nonidentical mice, the authors attempted to quantify the degree to which genetic variation could influence epigenetic variation on the genomic scale.81 If genetic differences alone could account for the observed heritability, the authors expected the genetically nonidentical outbred mouse group to exhibit higher epigenetic variability, much like the DZ twin group. However, just like DZ twins, both genetically identical and nonidentical animals arise from different germ cells, so in essence this experiment represents a model of genetically identical and nonidentical DZ twinning. In fact, the authors did not observe a higher degree of epigenetic variation in the genetically nonidentical animals, suggesting that the higher DZ twin epigenetic variation detected may be due not to genetic variation alone but also to the germ cell of origin. One consideration in the interpretation of the experiment is that the degree of genetic variation in genetically nonidentical outbred mice may be less than what would be expected in a population of DZ twins; however, it would still remain markedly above that of the inbred lines. While a genetic sequence–dependent epigenetic heritability most likely occurs to some degree, no studies have conclusively shown that a passage of DNA sequence-independent epigenetic information across generations does not occur. Taken together, the human twin studies and animal studies suggest that the door is not completely closed on the possibility of a passage of epigenetic information from parent to offspring.

Importantly, this phenomenon must be distinguished from transgenerational epigenetic inheritance, in that it may result only from epigenetic variation occurring during meiotic reprogramming and thus may be reset every generation. Epigenetic reprogramming of the germline cells means that passage of nongenetic information in mammals is distinct from traditional neo-Lamarckian inheritance, which postulates that any adaptive changes acquired during the life of the organism are transmitted to the offspring.121 Such a neo-Lamarckian scenario is more similar to the inheritance of epigenetic factors in plants, as plant germline cells are derived from the somatic tissues of the mature organisms and erasure of epigenetic information in these cells is less extensive.121 In mammals, any adaptive or “soft” inheritance of this sort is contingent on an environmental influence that affects the epigenetic status of the germline.
For example, in mice, exposure to the pesticide vinclozolin alters DNA methylation in the sperm, which is passed on transgenerationally, resulting in altered phenotype in the offspring.122,123 However, even a passage of epimutations occurring in the germline spontaneously from stochastic rearrangements has the potential to influence phenotype and subsequently the results of classical twin studies. It was mentioned above that traditionally, epigenetic inheritance is dismissed owing to the massive epigenomic rearrangements occurring during gametogenesis and fertilization; however, it is now becoming accepted that a portion of nucleosomal information in the male germline is passed to the developing zygote and that this information, in turn, may influence the subsequent DNA methylation reprogramming events. While a majority of the histone proteins in sperm are converted to protamines to enable packaging of the genome, approximately 4% of the genome retains nucleosomes with histone 3 lysine 4 (H3K4) trimethylation and H3K27 trimethylation marks.124,125 These epigenetic modifications occur at gene promoters and are involved in the proper developmental expression of housekeeping genes in the early zygote.125 This retained epigenetic information is mediated by GC sequence content, is enriched in CGIs, and likely plays a role in protecting the DNA methylation reprogramming in these regions in the early embryo.125,126 The passage of nucleosomal histone code information, potentially influencing subsequent DNA methylation profiles at gene promoters, represents an attractive potential mechanism for the passage of epigenetic information across generations. Consistent with this interpretation, a higher degree of epigenetic similarity in MZ twins, and thus heritability, in either CGIs or the 5’ gene promoter regions is identified in a number of the above twin studies. The downstream variation in DNA methylation observed in epigenomic twin studies may be only a lasting imprint of this initial variation in the establishment as well as maintenance of these male germline nucleosomal marks. Emerging work in Caenorhabditis elegans shows that induced variation at histone lysine demethylase 1 (KDM1) can allow for a failure to erase developmentally important epigenetic marks in primordial germ cells.127 Could natural variation in systems such as this predispose individuals toward incomplete resetting of epigenetic status and open the door for epigenetic inheritance? The answers to these questions must await a further understanding of the mechanisms occurring during these developmental time periods, a prospect that is becoming increasingly within reach as new technologies enabling epigenetic analyses on smaller cell populations emerge.

THE FUTURE OF EPIGENOMIC STUDIES IN TWINS

The findings of the last 10 years in epigenetic twin studies have addressed key issues in human biology, including the DNA sequence-independent drift of epigenetic signals and an inheritance of DNA methylation. Importantly, these effects have been shown to be influenced by the tissue studied, chorionicity, and in some cases, genetic factors. While the genomic study of genetically identical organisms has advanced our understanding of the behavior of molecular epigenetic factors over the course of development, a myriad of new questions arise that will direct the future of epigenomic research in twins. It is becoming clear that intrauterine environmental influence is increasingly important for epigenetically mediated phenotypic consequences in the offspring. Future studies must better characterize specific environmental factors influencing epigenetic variability during this time period. Novel strategies will be required to separate the effects of DNA sequence-dependent and -independent influences on epigenetic heritability. These may include epigenomic profiling of twinning animal models, such as those produced by Gartner and Baunack,30,119 or the incorporation and analysis of genetic information in conjunction with epigenetic information in human DZ twins. The application of the classical twin design to the study of epigenetic signals in the brain is very difficult, since additional sources of epigenetic variation may arise from accumulated age and environmental differences affecting the living twin after one twin has died. Tissues peripheral to the nervous system must be accessed and inferences made as to the behavior of epigenetic signatures in the context of the brain. A growing field of research is identifying the brain as a diverse region of epigenetic variation and change. All of the studies highlighted above interrogated percentages of 5-methyl cytosine; however, novel modifications such as 5-hydroxymethyl cytosine are the subject of intense research related to nervous system–specific epigenetic modulation.
As our understanding of the behavior of epigenetic patterns in different tissues changes, the interpretation of the results of epigenetic twin studies to date and their implications must be reevaluated in the context of these new insights. Finally, the techniques used to date to study genetically identical organisms provide only a relatively low-resolution snapshot of the behavior of epigenetic changes across the genome. Although these studies represent important first steps, higher-resolution approaches such as the next-generation sequencing of epigenetic profiles will greatly improve our understanding of the molecular behavior of epigenetic patterns.

CONCLUSIONS

The study of epigenetics in genetically identical organisms represents an ideal system in which to elucidate the behavior of epigenetic patterns over the course of development. However, the door of knowledge swings both ways, as the interpretation of decades of twin study–based results may be aided by the incorporation of the current heightened understanding of epigenetic patterns. Twin discordance in disease may be influenced by a divergence of phenotypically important epigenetic variation. This divergence appears to occur early during prenatal development and continue throughout the course of life. The complex intrauterine environmental influences of hematopoietic stem cell sharing and unequal blood flow through vascular connections in MC twins can affect the degree of MZ twin discordance that is observed and may influence the findings of classical twin studies. DNA methylation itself appears to be under the influence of heritable factors. These factors undoubtedly include the influence of DNA sequence variation on epigenetic patterning, but they may also include the passage of epigenetic information from one generation to the next through the germline. Taken together, these findings suggest that the investigation of epigenetic factors of etiological significance to complex diseases is warranted. Henceforth epigenomic twin studies will most likely focus on the next-generation sequencing-based categorization of epigenetic drift and heritability not only of DNA methylation but also of novel epigenetic modifications important to neurological phenotypes, such as 5-hydroxymethyl cytosine.

REFERENCES

1. Gringras, P., & Chen, W. Mechanisms for differences in monozygous twins. Early Hum Dev 2001; 64(2): 105–117. 2. Wong, A. H., Gottesman, I. I., & Petronis, A. Phenotypic differences in genetically identical organisms: the epigenetic perspective. Hum Mol Genet 2005; 14 Spec No 1: R11–R18. 3.
Boomsma, D., Busjahn, A., & Peltonen, L. Classical twin studies and beyond. Nat Rev Genet 2002; 3(11): 872–882. 4. Martin, N., Boomsma, D., & Machin, G. A twin-pronged attack on complex traits. Nat Genet 1997; 17(4): 387–392. 5. Trumbetta, S. L. & Gottesman, I. I. Twin studies and the genetics of mental disorders in the genomic age. In G. Adelman & B. Smith (Ed.). Neuroscience encyclopedia (3rd ed.), 2004. Elsevier Science: Amsterdam.


6. Jinks, J. L., & Fulker, D. W. Comparison of the biometrical genetical, MAVA, and classical approaches to the analysis of human behavior. Psychol Bull 1970; 73(5): 311–349. 7. Bierut, L. J., Heath, A. C., Bucholz, K. K., Dinwiddie, S. H., Madden, P. A., Statham, D. J., et al. Major depressive disorder in a community-based twin sample: are there different genetic and environmental contributions for men and women? Arch Gen Psychiatry 1999; 56(6): 557–563. 8. Mcguffin, P., Katz, R., & Rutherford, J. Nature, nurture and depression: a twin study. Psychol Med 1991; 21(2): 329–335. 9. Sullivan, P. F., Neale, M. C., & Kendler, K. S. Genetic epidemiology of major depression: review and meta-analysis. Am J Psychiatry 2000; 157(10): 1552–1562. 10. Torgersen, S. Genetic factors in moderately severe and mild affective disorders. Arch Gen Psychiatry 1986; 43(3): 222–226. 11. Kendler, K. S., Neale, M. C., Kessler, R. C., Heath, A. C., & Eaves, L. J. The lifetime history of major depression in women: reliability of diagnosis and heritability. Arch Gen Psychiatry 1993; 50(11): 863–870. 12. Middeldorp, C. M., Birley, A. J., Cath, D. C., Gillespie, N. A., Willemsen, G., Statham, D. J., et al. Familial clustering of major depression and anxiety disorders in Australian and Dutch twins and siblings. Twin Res Hum Genet 2005; 8(6): 609–615. 13. Torgersen, S. Genetic factors in anxiety disorders. Arch Gen Psychiatry 1983; 40(10): 1085–1089. 14. Bertelsen, A., Harvald, B., & Hauge, M. A Danish twin study of manic-depressive disorders. Br J Psychiatry 1977; 130: 330–351. 15. Kieseppa, T., Partonen, T., Haukka, J., Kaprio, J., & Lonnqvist, J. High concordance of bipolar I disorder in a nationwide sample of twins. Am J Psychiatry 2004; 161(10): 1814–1821. 16. Cardno, A. G., & Gottesman, I. I. Twin studies of schizophrenia: from bow-and-arrow concordances to Star Wars Mx and functional genomics. Am J Med Genet 2000; 97(1): 12–17. 17. Turkheimer, E.
Three laws of behaviour genetics and what they mean. Curr Dir Psychol Sci 2000; 9(5): 160–164. 18. Wardle, J., & Cooke, L. Genetic and environmental determinants of children’s food preferences. Br J Nutr 2008; 99 Suppl 1: S15–S21. 19. Faith, M. S., Rhea, S. A., Corley, R. P., & Hewitt, J. K. Genetic and shared environmental influences on children’s 24-h food and beverage intake: sex differences at age 7 y. Am J Clin Nutr 2008; 87(4): 903–911. 20. Burt, S. A., & Mikolajewski, A. J. Preliminary evidence that specific candidate genes are associated with adolescent-onset antisocial behavior. Aggress Behav 2008; 34(4): 437–445. 21. Varjonen, M., Santtila, P., Hoglund, M., Jern, P., Johansson, A., Wager, I., et al. Genetic and environmental effects on sexual excitation and sexual inhibition in men. J Sex Res 2007; 44(4): 359–369. 22. Kendler, K. S., & Prescott, C. A. A population-based twin study of lifetime major depression in men and women. Arch Gen Psychiatry 1999; 56(1): 39–44. 23. Turkheimer, E., & Waldron, M. Nonshared environment: a theoretical, methodological, and quantitative review. Psychol Bull 2000; 126(1): 78–108. 24. Bouchard, T. J., Jr., Heston, L., Eckert, E., Keyes, M., & Resnick, S. The Minnesota study of twins reared apart: project description and sample results in the developmental domain. Prog Clin Biol Res 1981; 69 Pt B: 227–233. 25. Bouchard, T. J., Jr., Lykken, D. T., Mcgue, M., Segal, N. L., & Tellegen, A. Sources of human psychological differences: the Minnesota Study of Twins Reared Apart. Science 1990; 250(4978): 223–228. 26. Moldin, S. Sponsoring initiatives in the molecular genetics of mental disorders. In Genetics and Mental Disorders 1998. Bethesda, MD: National Institutes of Health. 27. Hanson, B., Mcgue, M., Roitman-Johnson, B., Segal, N. L., Bouchard, T. J., Jr., & Blumenthal, M. N. Atopic disease and immunoglobulin E in twins reared apart and together. Am J Hum Genet 1991; 48(5): 873–879. 28. Pedersen, N. L., Lichtenstein, P., Plomin, R., Defaire, U., Mcclearn, G. E., & Matthews, K. A. Genetic and environmental influences for type A-like measures and related traits: a study of twins reared apart and twins reared together. Psychosom Med 1989; 51(4): 428–440. 29. Edwards, J. L., Schrick, F. N., Mccracken, M. D., Van Amstel, S. R., Hopkins, F. M., Welborn, M. G., et al. Cloning adult farm animals: a review of the possibilities and problems associated with somatic cell nuclear transfer. Am J Reprod Immunol 2003; 50(2): 113–123. 30. Gartner, K., & Baunack, E. Is the similarity of monozygotic twins due to genetic factors alone? Nature 1981; 292(5824): 646–647. 31. Rhind, S. M., King, T. J., Harkness, L. M., Bellamy, C., Wallace, W., Desousa, P., et al. Cloned lambs—lessons from pathology. Nat Biotechnol 2003; 21(7): 744–745. 32. Yanagimachi, R. Cloning: experience from the mouse and other animals. Mol Cell Endocrinol 2002; 187(1–2): 241–248. 33. Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., et al. Finding the missing heritability of complex diseases. Nature 2009; 461(7265): 747–753.

34. Henikoff, S., & Matzke, M. A. Exploring and explaining epigenetic effects. Trends Genet 1997; 13(8): 293–295. 35. Jenuwein, T., & Allis, C. D. Translating the histone code. Science 2001; 293(5532): 1074–1080. 36. Takai, D., & Jones, P. A. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci U S A 2002; 99(6): 3740–3745. 37. Ehrlich, M., & Ehrlich, K. Effect of DNA methylation and the binding of vertebrate and plant proteins to DNA. In J. Jost & P. Saluz (Eds.). DNA methylation: molecular biology and biological significance (pp. 145–168). Basel: Birkhauser Verlag; 1993. 38. Riggs, A., Xiong, Z., Wang, L., & Lebon, J. Methylation dynamics, epigenetic fidelity and X chromosome structure. In A. Wolffe (Ed.). Epigenetics (pp. 214–227). Chichester, UK: John Wiley & Sons; 1998. 39. Yeivin, A., & Razin, A. Gene methylation patterns and expression. In J. Jost & H. Saluz (Eds.). DNA methylation: molecular biology and biological significance (pp. 523–568). Basel: Birkhauser Verlag; 1993. 40. Holliday, R., Ho, T., & Paulin, R. Gene silencing in mammalian cells. In R. Martienssen, VEA Russo, & A. D. Riggs (Eds.). Epigenetic mechanisms of gene regulation (pp. 47–59). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1996. 41. Druker, R., & Whitelaw, E. Retrotransposon-derived elements in the mammalian genome: a potential source of disease. J Inherit Metab Dis 2004; 27(3): 319–330. 42. Ekwall, K. The roles of histone modifications and small RNA in centromere function. Chromosome Res 2004; 12(6): 535–542. 43. Jeltsch, A. Molecular enzymology of mammalian DNA methyltransferases. Curr Top Microbiol Immunol 2006; 301: 203–225. 44. Zhang, P., Su, L., Wang, Z., Zhang, S., Guan, J., Chen, Y., et al. The involvement of 5-hydroxymethylcytosine in active DNA demethylation in mice. Biol Reprod 2012; 86(4): 104. 45. Bhutani, N., Burns, D. M., & Blau, H. M. DNA demethylation dynamics. Cell 2011; 146(6): 866–872. 46.
Vaquero, A., Loyola, A., & Reinberg, D. The constantly changing face of chromatin. Sci Aging Knowledge Environ 2003; 2003(14): RE4. 47. Schotta, G., Lachner, M., Peters, A. H., & Jenuwein, T. The indexing potential of histone lysine methylation. Novartis Found Symp 2004; 259:  22–37; discussion 37–47, 163–169. 48. Wang, Y., Fischle, W., Cheung, W., Jacobs, S., Khorasanizadeh, S., & Allis, C. D.  Beyond the double helix:  writing and reading the histone code. Novartis Found Symp 2004; 259:  3–17; discussion 17–21, 163–169.

49. Liu, H., Heath, S. C., Sobin, C., Roos, J. L., Galke, B. L., Blundell, M. L., et al. Genetic variation at the 22q11 PRODH2/DGCR6 locus presents an unusual pattern and increases susceptibility to schizophrenia. Proc Natl Acad Sci U S A 2002; 99(6): 3717–3722. 50. Geiman, T. M., & Robertson, K. D. Chromatin remodeling, histone modifications, and DNA methylation: how does it all fit together? J Cell Biochem 2002; 87(2): 117–125. 51. Li, E. Chromatin modification and epigenetic reprogramming in mammalian development. Nat Rev Genet 2002; 3(9): 662–673. 52. Riggs, A., & Porter, T. Overview of epigenetic mechanisms. In R. Martienssen, VEA Russo, & A. D. Riggs (Eds.). Epigenetic mechanisms of gene regulation (pp. 29–45). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1996. 53. Constancia, M., Pickard, B., Kelsey, G., & Reik, W. Imprinting mechanisms. Genome Res 1998; 8(9): 881–900. 54. Nan, X., Ng, H. H., Johnson, C. A., Laherty, C. D., Turner, B. M., Eisenman, R. N., et al. Transcriptional repression by the methyl-CpG-binding protein MeCP2 involves a histone deacetylase complex. Nature 1998; 393(6683): 386–389. 55. Jones, P. L., Veenstra, G. J., Wade, P. A., Vermaak, D., Kass, S. U., Landsberger, N., et al. Methylated DNA and MeCP2 recruit histone deacetylase to repress transcription. Nat Genet 1998; 19(2): 187–191. 56. Razin, A., & Shemer, R. Epigenetic control of gene expression. Results Probl Cell Differ 1999; 25(2): 189–204. 57. Yang As, J. P., & Shibata A. The mutational burden of 5-methylcytosine. In R. Martienssen, VEA Russo, & A. D. Riggs (Eds.). Epigenetic mechanisms of gene regulation (pp. 77–94). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1996. 58. Petronis, A., Bassett, A. S., Honer, W. G., Vincent, J. B., Tatuch, Y., Sasaki, T., et al. Search for unstable DNA in schizophrenia families with evidence for genetic anticipation.
Am J Hum Genet 1996; 59(4): 905–911. 59. Ohgane, J., Yagi, S., & Shiota, K. Epigenetics: The DNA methylation profile of tissue-dependent and differentially methylated regions in cells. Placenta 2008; 29S: 29–35. 60. Nagase, H., & Ghosh, S. Epigenetics: differential DNA methylation in mammalian somatic tissues. FEBS J 2008; 275(8): 1617–1623. 61. Sakamoto, H., Kogo, Y., Ohgane, J., Hattori, N., Yagi, S., Tanaka, S., et al. Sequential changes in genome-wide DNA methylation status during adipocyte differentiation. Biochem Biophys Res Commun 2008; 366(2): 360–366. 62. Suzuki, M., Sato, S., Arai, Y., Shinohara, T., Tanaka, S., Greally, J. M., et al. A new class of tissue-specifically methylated regions involving entire CpG islands in the mouse. Genes Cells 2007; 12(12): 1305–1314. 63. Ooi, S. K., & Bestor, T. H. Cytosine methylation: remaining faithful. Curr Biol 2008; 18(4): R174–R176. 64. Vilkaitis, G., Suetake, I., Klimasauskas, S., & Tajima, S. Processive methylation of hemimethylated CpG sites by mouse Dnmt1 DNA methyltransferase. J Biol Chem 2005; 280(1): 64–72. 65. Riggs, A. D., Xiong, Z., Wang, L., & Lebon, J. M. Methylation dynamics, epigenetic fidelity and X chromosome structure. Novartis Found Symp 1998; 214: 214–225; discussion 225–232. 66. Weksberg, R., Shuman, C., Caluseriu, O., Smith, A. C., Fei, Y. L., Nishikawa, J., et al. Discordant KCNQ1OT1 imprinting in sets of monozygotic twins discordant for Beckwith-Wiedemann syndrome. Hum Mol Genet 2002; 11(11): 1317–1325. 67. Heijmans, B. T., Kremer, D., Tobi, E. W., Boomsma, D. I., & Slagboom, P. E. Heritable rather than age-related environmental and stochastic factors dominate variation in DNA methylation of the human IGF2/H19 locus. Hum Mol Genet 2007; 16(5): 547–554. 68. Petronis, A., Gottesman, I. I., Kan, P., Kennedy, J. L., Basile, V. S., Paterson, A. D., et al. Monozygotic twins exhibit numerous epigenetic differences: clues to twin discordance? Schizophr Bull 2003; 29(1): 169–178. 69. Oates, N. A., Van Vliet, J., Duffy, D. L., Kroes, H. Y., Martin, N. G., Boomsma, D. I., et al. Increased DNA methylation at the AXIN1 gene in a monozygotic twin from a pair discordant for a caudal duplication anomaly. Am J Hum Genet 2006; 79(1): 155–162. 70. Kuratomi, G., Iwamoto, K., Bundo, M., Kusumi, I., Kato, N., Iwata, N., et al. Aberrant DNA methylation associated with bipolar disorder identified from discordant monozygotic twins. Mol Psychiatry 2008; 13(4): 429–441. 71. Fraga, M. F., Ballestar, E., Paz, M. F., Ropero, S., Setien, F., Ballestar, M. L., et al. Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci U S A 2005; 102(30): 10604–10609. 72. Tost, J., Schatz, P., Schuster, M., Berlin, K., & Gut, I. G. Analysis and accurate quantification of CpG methylation by MALDI mass spectrometry. Nucleic Acids Res 2003; 31(9): e50. 73. Kaminsky, Z. A., Assadzadeh, A., Flanagan, J., & Petronis, A. Single nucleotide extension technology for quantitative site-specific evaluation of metC/C in GC-rich regions. Nucleic Acids Res 2005; 33(10): e95. 74. Tost J., Shatz P., Schuster M., Berlin K., & Gut I. G. Analysis and accurate quantification of CpG methylation by MALDI mass spectrometry. Nucleic Acids Research 2003; 31(9): e50. 75. Gonzalgo, M. L., & Jones, P. A. Rapid quantitation of methylation differences at specific sites using methylation-sensitive single nucleotide primer extension (Ms-SNuPE). Nucleic Acids Res 1997; 25(12): 2529–2531. 76. Nguyen, T. T., Mohrbacher, A. F., Tsai, Y. C., Groffen, J., Heisterkamp, N., Nichols, P. W., et al. Quantitative measure of c-abl and p15 methylation in chronic myelogenous leukemia: biological implications. Blood 2000; 95(9): 2990–2992. 77. Mark L., & Gonzalgo, P. A. J. Quantitative methylation analysis using methylation-sensitive single-nucleotide primer extension (Ms-SnuPE). Methods 2002; 27: 128–133. 78. Tost, J., Dunker, J., & Gut, I. G. Analysis and quantification of multiple methylation variable positions in CpG islands by Pyrosequencing. Biotechniques 2003; 35(1): 152–156. 79. Bibikova, M., Lin, Z., Zhou, L., Chudin, E., Garcia, E. W., Wu, B., et al. High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006; 16(3): 383–393. 80. Heisler, L. E., Torti, D., Boutros, P. C., Watson, J., Chan, C., Winegarden, N., et al. CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genome. Nucleic Acids Res 2005; 33(9): 2952–2961. 81. Kaminsky, Z. A., Tang, T., Wang, S. C., Ptak, C., Oh, G. H., Wong, A. H., et al. DNA methylation profiles in monozygotic and dizygotic twins. Nat Genet 2009; 41(2): 240–245. 82. Schumacher, A., Kapranov, P., Kaminsky, Z., Flanagan, J., Assadzadeh, A., Yau, P., et al. Microarray-based DNA methylation profiling: technology and applications. Nucleic Acids Res 2006; 34(2): 528–542. 83. Gervin, K., Hammero, M., Akselsen, H. E., Moe, R., Nygard, H., Brandt, I., et al. Extensive variation and low heritability of DNA methylation identified in a twin study. Genome Res 2011; 21(11): 1813–1821. 84. Grafodatskaya, D., Choufani, S., Ferreira, J. C., Butcher, D.
T., Lou, Y., Zhao, C., et al. EBV transformation and cell culturing destabilizes DNA methylation in human lymphoblastoid cell lines. Genomics 2010; 95(2): 73–83. Saferali, A., Grundberg, E., Berlivet, S., Beauchemin, H., Morcos, L., Polychronakos, C., et al. Cell culture-induced aberrant methylation of the imprinted IG DMR in human lymphoblastoid cell lines. Epigenetics 2010; 5(1): 50–60. Brennan, E. P., Ehrich, M., Brazil, D. P., Crean, J. K., Murphy, M., Sadlier, D. M., et al. Comparative

87.

88.

89.

90.

91.

92. 93.

94.

95.

96.

97.

98.

99.

analysis of DNA methylation profiles in peripheral blood leukocytes versus lymphoblastoid cell lines. Epigenetics 2009; 4(3): 159–164. Boks, M. P., Derks, E. M., Weisenberger, D. J., Strengman, E., Janson, E., Sommer, I. E., et al. The relationship of DNA methylation with age, gender and genotype in twins and healthy controls. PLoS One 2009; 4(8): e6767. Bocklandt, S., Lin, W., Sehl, M. E., Sanchez, F. J., Sinsheimer, J. S., Horvath, S., et al. Epigenetic predictor of age. PLoS One 2011; 6(6): e14821. Wong, C. C., Caspi, A., Williams, B., Houts, R., Craig, I. W., & Mill, J. A longitudinal twin study of skewed X chromosome-inactivation. PLoS One 2011; 6(3): e17873. Kristiansen, M., Knudsen, G. P., Bathum, L., Naumova, A. K., Sorensen, T. I., Brix, T. H., et al. Twin study of genetic and aging effects on X chromosome inactivation. Eur J Hum Genet 2005; 13(5): 599–606. Gordon, L., Joo, J. H., Andronikos, R., Ollikainen, M., Wallace, E. M., Umstad, M. P., et al. Expression discordance of monozygotic twins at birth: effect of intrauterine environment and a possible mechanism for fetal programming. Epigenetics 2011; 6(5): 579–592. Hall, J. G.  Twinning. Lancet 2003; 362(9385): 735–743. Gul, A., Cebeci, A., Aslan, H., Polat, I., Sozen, I., & Ceylan, Y. Perinatal outcomes of twin pregnancies discordant for major fetal anomalies. Fetal Diagn Ther 2005; 20(4): 244–248. Race, J. P., Townsend, G. C., & Hughes, T. E.  Chorion type, birthweight discordance and tooth-size variability in Australian monozygotic twins. Twin Res Hum Genet 2006; 9(2): 285–291. Blickstein, I., Mincha, S., Goldman, R., Machin, G., G Keith, L. The Northwestern twin chorionicity study: testing the “placental crowding” hypothesis. J Perinat Med 2006; 34(2): 158–161. Machin, G., Still, K., & Lalani, T. Correlations of placental vascular anatomy and clinical outcomes in 69 monochorionic twin pregnancies. Am J Med Genet 1996; 61(3): 229–236. Lopriore, E., Oepkes, D., & Walther, F.J. 
Neonatal morbidity in twin-twin transfusion syndrome. Early Hum Dev 2011; 87(9): 595–599. Numata, S., Ye, T., Hyde, T. M., Guitart-Navarro, X., Tao, R., Wininger, M., et  al. DNA methylation signatures in development and aging of the human prefrontal cortex. Am J Hum Genet 2012; 90(2): 260–272. Saffery, R., Morley, R., Carlin, J. B., Joo, J. H., Ollikainen, M., Novakovic, B., et  al. Cohort Profile:  The Peri/post-natal Epigenetic Twins Study. Int J Epidemiol 2012; 41(1): 55–61.

The Role of Epigenomics in Genetically Identical Individuals 100. Bliek, J., Alders, M., Maas, S. M., Oostra, R. J., Mackay, D. M., Van Der Lip, K., et  al. Lessons from BWS twins: complex maternal and paternal hypomethylation and a common source of haematopoietic stem cells. Eur J Hum Genet 2009; 17(12): 1625–1634. 101. Dempster, E. L., Pidsley, R., Schalkwyk, L. C., Owens, S., Georgiades, A., Kane, F., et  al. Disease-associated epigenetic changes in monozygotic twins discordant for schizophrenia and bipolar disorder. Hum Mol Genet 2011; 20(24): 4786–4796. 102. Gervin, K., Vigeland, M. D., Mattingsdal, M., Hammero, M., Nygard, H., Olsen, A. O., et  al. DNA methylation and gene expression changes in monozygotic twins discordant for psoriasis: identification of epigenetically dysregulated genes. PLoS Genet 2012; 8(1): e1002454. 103. Javierre, B. M., Fernandez, A. F., Richter, J., Al-Shahrour, F., Martin-Subero, J. I., RodriguezUbreva, J., et al. Changes in the pattern of DNA methylation associate with twin discordance in systemic lupus erythematosus. Genome Res 2010; 20(2): 170–179. 104. Rakyan, V. K., Chong, S., Champ, M. E., Cuthbert, P. C., Morgan, H. D., Luu, K. V., et al. Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proc Natl Acad Sci U S A 2003; 100(5): 2538–2543. 105. Morgan, H. D., Sutherland, H. G., Martin, D. I., & Whitelaw, E. Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 1999; 23(3): 314–318. 106. Allegrucci, C., Thurston, A., Lucas, E., & Young, L. Epigenetics and the germline. Reproduction 2005; 129(2): 137–149. 107. Santos, F., & Dean, W.Epigenetic reprogramming during early development in mammals. Reproduction 2004; 127(6): 643–651. 108. Santos, F., Peters, A. H., Otte, A. P., Reik, W., & Dean, W. Dynamic chromatin modifications characterise the first cell cycle in mouse embryos. Dev Biol 2005; 280(1): 225–236. 109. Morgan, H. 
D., Santos, F., Green, K., Dean, W., & Reik, W. Epigenetic reprogramming in mammals. Hum Mol Genet 2005; 14 Spec No 1: R47–R58. 110. Hajkova, P., Ancelin, K., Waldmann, T., Lacoste, N., Lange, U. C., Cesari, F., et  al.Chromatin dynamics during epigenetic reprogramming in the mouse germ line. Nature 2008; 452(7189): 877–881. 111. Mummaneni, P., Yates, P., Simpson, J., Rose, J., and Turker, M. S.  The primary function of a redundant Sp1 binding site in the mouse aprt gene

112.

113.

114.

115.

116.

117.

118.

119.

120.

121.

122.

123.

59

promoter is to block epigenetic gene inactivation. Nucleic Acids Res 1998; 26(22): 5163–5169. Boumber, Y.A., Kondo, Y., Chen, X., Shen, L., Guo, Y., Tellez, C., et  al. An Sp1/Sp3 binding polymorphism confers methylation protection. PLoS Genet 2008; 4(8): e1000162. Yap, D. B., Chu, J., Berg, T., Schapira, M., Cheng, S. W., Moradian, A., et al. Somatic mutations at EZH2 Y641 act dominantly through a mechanism of selectively altered PRC2 catalytic activity, to increase H3K27 trimethylation. Blood 2010; 117(8):2451–9. Gertz, J., Varley, K. E., Reddy, T. E., Bowling, K. M., Pauli, F., Parker, S. L., et al. Analysis of DNA methylation in a three-generation family reveals widespread genetic influence on epigenetic regulation. PLoS Genet 2011; 7(8): e1002228. Tycko, B. Allele-specific DNA methylation: beyond imprinting. Hum Mol Genet 2010; 19(R2): R210–R220. Bell, C. G., Finer, S., Lindgren, C. M., Wilson, G. A., Rakyan, V. K., Teschendorff, A. E., et  al. Integrated genetic and epigenetic analysis identifies haplotype-specific methylation in the FTO type 2 diabetes and obesity susceptibility locus. PLoS One 2010; 5(11): e14040. Schalkwyk, L. C., Meaburn, E. L., Smith, R., Dempster, E. L., Jeffries, A. R., Davies, M. N., et  al. Allelic skewing of DNA methylation is widespread across the genome. Am J Hum Genet 2010; 86(2): 196–212. Kerkel, K., Spadola, A., Yuan, E., Kosek, J., Jiang, L., Hod, E., et  al.Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nat Genet 2008; 40(7):904–8. Gartner, K. A third component causing random variability beside environment and genotype. A reason for the limited success of a 30 year long effort to standardize laboratory animals? Lab Anim 1990; 24(1): 71–77. Flanagan, J. M., Popendikyte, V., Pozdniakovaite, N., Sobolev, M., Assadzadeh, A., Schumacher, A., et al. Intra- and interindividual epigenetic variation in human germ cells. Am J Hum Genet 2006; 79(1): 67–84. Richards, E. 
J.  Inherited epigenetic variation— revisiting soft inheritance. Nat Rev Genet 2006; 7(5): 395–401. Guerrero-Bosagna, C., Settles, M., Lucker, B., & Skinner, M. K. Epigenetic transgenerational actions of vinclozolin on promoter regions of the sperm epigenome. PLoS One 2010; 5(9): pii: e13100. Anway, M. D., Cupp, A. S., Uzumcu, M., & Skinner, M. K.  Epigenetic transgenerational actions of endocrine disruptors and male fertility. Science 2005; 308(5727): 1466–1469.

60

the OMICs

124. Hammoud, S. S., Nix, D. A., Zhang, H., Purwar, J., Carrell, D. T., & Cairns, B. R.  Distinctive chromatin in human sperm packages genes for embryo development. Nature 2009; 460(7254): 473–478. 125. Vavouri, T., & Lehner, B. Chromatin organization in sperm may be the major functional consequence of base composition variation in the human genome. PLoS Genet 2011; 7(4): e1002036.

126. Ooi, S. K., Qiu, C., Bernstein, E., Li, K., Jia, D., Yang, Z., et al. DNMT3L connects unmethylated lysine 4 of histone H3 to de novo methylation of DNA. Nature 2007; 448(7154): 714–717. 127. Katz, D. J., Edwards, T. M., Reinke, V., & Kelly, W. G. A C. elegans LSD1 demethylase contributes to germline immortality by reprogramming epigenetic memory. Cell 2009; 137(2): 308–320.

PART II RNA

4 Transcriptomics

T. GRANT BELGARD AND DANIEL H. GESCHWIND

Parts of this chapter are adapted from T. Grant Belgard’s DPhil thesis “Comparative neurotranscriptomics in mammals and birds,” deposited at the University of Oxford.

INTRODUCTION

Studies of RNA, both as a topic of fundamental concern and as a proxy for protein abundance, have been an especially active area of “omics” innovation in neuroscience in the last decade. Advances in microarray and sequencing technologies now allow high-throughput measurements on an unprecedented scale. Despite challenges in algorithmic development and requirements for serious computational infrastructure, methodological breakthroughs continue to move the boundaries of neuroscience knowledge.

BACKGROUND

The ability to measure a significant cross section of the messenger RNAs (mRNAs) in a cell or tissue by microarray-based transcriptional profiling has led to a revolution in molecular biology and biomedicine (DeRisi, Iyer, & Brown 1997; Lockhart et al. 1996). This approach took advantage of the physical properties of nucleic acids, which, in contrast to proteins, permit generic approaches to the study of RNA en masse. Initially many neuroscientists considered mRNA profiling to be a practical if imperfect substitute for the study of actual protein products, whose functional understanding was always the end goal, since mRNA makes protein. In retrospect, this narrow view was a misconception due to our relative ignorance of the genome. Recent work has shown that, in addition to 21,000 protein coding genes, the human genome contains approximately 8,800 small noncoding RNA genes, 11,000 pseudogenes, and 10,000 long noncoding RNAs. Most of these have unknown functions. Therefore, for reasons both practical and fundamental, transcriptional profiling remains a crucial means to understand the function of biological systems.

RNA plays a central role in molecular biology as the intermediary between the primary repository of heritable information, the DNA, and the primary (but not sole) heavy lifters in cell biology, the proteins. RNA also shares characteristics of both DNA and proteins. The genetic information encoded in RNA can occasionally be inserted back into DNA. Indeed, much of the human genome was likely born in this way (Cordaux & Batzer 2009). Moreover, by forming critical secondary and tertiary structures, RNAs can fulfill enzymatic roles more typically played by proteins. Indeed, the key reaction of the ribosome, the RNA-protein enzymatic complex (a “ribozyme”) that translates triplet codons in an RNA molecule to the primary sequence of amino acids in a protein, is mediated primarily by its RNA components. RNA’s dual capacities, the ability to self-replicate in vitro and its centrality in the key reaction shared by all life, suggest that the ancestor of all life on Earth was a self-replicating RNA molecule (the “RNA world hypothesis”) (Gilbert 1986).

The most appreciated function of RNA is to encode proteins. There are approximately 21,000 protein coding genes that produce over 120,000 human protein coding transcripts (Harrow et al. 2012). However, RNAs have an astounding variety of functions that extend far beyond simply encoding proteins. Whereas only about 2% of the human genome encodes protein, a full 75% is transcribed (Djebali et al. 2012), and at least 30,000 noncoding but transcribed RNA loci have been identified (Harrow et al. 2012). Some of these noncoding RNAs join with proteins in macromolecular complexes. For example, ribosomal RNAs (rRNAs) are critical elements of the ribosomal complex; signal recognition particle RNA (SRP RNA) is part of a ribonucleoprotein critical for protein trafficking and secretion (Leung & Brown 2010); and telomerase RNA component (TERC) is part of the telomerase ribonucleoprotein involved in telomere extension (Zhang, Kim, & Feigon 2011). Transfer RNAs (tRNAs) are adapters that translate trinucleotide mRNA “codons” to amino acids. RNAs known as ribozymes can catalyze chemical reactions. For example, ribonuclease P (RNase P) is a ribozyme that catalyzes RNA cleavage (Jarrous & Gopalan 2010). Some small nuclear RNAs (snRNAs) join small nuclear ribonucleoprotein complexes (snRNPs) involved in RNA splicing (Douglas & Wood 2011). Small nucleolar RNAs (snoRNAs) mediate nucleoside modifications in rRNA (Gerbi et al. 2001). MicroRNAs (miRNAs) and small interfering RNAs (siRNAs) are both involved in the RNA interference (RNAi) pathway, crucial for posttranscriptional gene silencing (Liu & Paroo 2010). Other RNA classes, such as the products of endogenous retroviruses (ERVs) and retrotransposons, are often neutral or deleterious to the host organism; numerous mechanisms have therefore evolved to suppress the activity of these elements (Maksakova, Mager, & Reiss 2008). Among these, piwi-interacting RNAs (piRNAs) form complexes with piwi proteins that suppress retrotransposons in germline cells (Senti & Brennecke 2010).

Long intergenic noncoding RNAs (lincRNAs) have the following characteristics: they are (1) not translated, (2) longer than 200 nucleotides (nt), (3) not included in another recognized class of noncoding RNA (such as rRNA), and (4) transcribed from loci that do not overlap protein coding loci (on either strand) (Ponting & Belgard 2010). LincRNAs are a subset of a broader class of long noncoding RNAs (lncRNAs) for which the last constraint is relaxed.
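The four lincRNA criteria, and the relaxation of the last one for lncRNAs, can be written as a small decision rule. The following is a minimal sketch only: the `Transcript` record and its field names are hypothetical and do not correspond to any particular annotation format, and in practice each field would itself come from a nontrivial analysis (for example, coding potential prediction).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    """Hypothetical transcript record; field names are illustrative only."""
    name: str
    length_nt: int
    is_translated: bool
    other_ncrna_class: Optional[str]      # e.g. "rRNA"; None if in no other class
    overlaps_protein_coding_locus: bool   # overlap on either strand


def classify_long_noncoding(t: Transcript) -> str:
    """Apply the four criteria from the text; return 'lincRNA', 'lncRNA', or 'other'."""
    if t.is_translated:                   # criterion 1: not translated
        return "other"
    if t.length_nt <= 200:                # criterion 2: longer than 200 nt
        return "other"
    if t.other_ncrna_class is not None:   # criterion 3: not in another ncRNA class
        return "other"
    if t.overlaps_protein_coding_locus:   # criterion 4 relaxed: still an lncRNA
        return "lncRNA"
    return "lincRNA"


# Made-up examples, not real annotations
examples = [
    Transcript("intergenic_long", 1500, False, None, False),
    Transcript("antisense_long", 900, False, None, True),
    Transcript("ribosomal", 1800, False, "rRNA", False),
    Transcript("short_rna", 22, False, None, False),
]
labels = [classify_long_noncoding(t) for t in examples]
print(labels)
```

Ordering the checks this way makes explicit that every lincRNA is also an lncRNA under the definitions above, while the reverse does not hold.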
As with proteins encoded by mRNAs and the other classes of noncoding RNAs described above, lncRNAs have a large variety of functions, often associated with regulating the expression of protein coding genes. LincRNAs are evolutionarily conserved as a group, but their exonic sequence is generally less conserved than that of protein coding genes (Guttman et al. 2009). Further evidence of functionality is provided by the conservation of their promoters and dinucleotide splicing motifs (Chodroff et al. 2010; Ponjavic, Ponting, & Lunter 2007) and their enrichment in predicted secondary structures (Marques & Ponting 2009; Ponjavic et al. 2009).

The preceding is not an exhaustive list of functions or RNA classes, and the list continues to grow. It does, however, provide some key examples of functions, some of which have been appreciated only in recent years.

What Are Transcriptomes?

While the “-omics” suffix is admittedly overused, understanding the concept of a transcriptome is key to understanding modern biology. This is true even for twentieth-century methods in molecular biology, since virtually every one of these involves some sort of quantitative or semiquantitative normalization. In this chapter, the transcriptome is defined as a comprehensive set of RNAs found in a cell or grouping of cells at a given time. This set need not include all RNAs; it can be a well-defined subset.

The transcriptome can be studied at many levels. For example, the transcriptome of a tissue will include RNAs from all of its constituent cell types. Not all of these cells will express all of the transcripts, and not all cells will necessarily contain the same amount of total RNA. Likewise, many RNAs are preferentially localized within a cell. Beyond high-level organization (nuclear, cytoplasmic, synaptic, membrane-bound, and more), they can also form macromolecular complexes. Experimentally, these factors affect how well one RNA molecule can be isolated relative to others. Some RNAs may derive from ancient or very recent viral insertions, while others may derive from pathogens such as the bacterium Mycobacterium tuberculosis or the brain-colonizing protozoan parasite Toxoplasma gondii.

Finally, the very word transcriptome is a misnomer, as the steady-state snapshot reflects many factors beyond transcription, including processing, RNA editing, and degradation. In measuring the transcriptome, as with the proteome, one typically measures RNA abundance rather than transcription per se. The abundance of an mRNA is moderately correlated with that of its corresponding protein (Maier, Guell, & Serrano 2009). This correlation should not be overstated, as there are numerous levels of regulation of protein abundance beyond the transcriptome, including transcript sequestration, rate of translation, and protein longevity. From our perspective, transcriptomic data represent the current state of the cell or tissue and should be interpreted as such.

Transcriptomes are stochastic. The production, processing, trafficking, and breakdown of RNAs depend upon inherently stochastic molecular interactions.

Transcriptomes are temporally regulated. This regulation can be periodic, as with circadian (with a roughly 24-hour period, such as the human sleep-wake cycle), infradian (with a longer period, such as a menstrual cycle), and ultradian (with a shorter period, such as human appetite) rhythms (Bustos et al. 2011; Kawasaki et al. 2009; McMaster et al. 2011), or it can follow the stages of the cell cycle (Lenz & Sogaard-Andersen 2011). It can also be driven by various developmental factors. These factors can be signaled exogenously or endogenously and are processed in the context of the epigenetic substrate of that particular cell’s lineage (such as DNA methylation and chromatin modifications). This results in some degree of reproducible transcriptome states at equivalent anatomical positions and developmental milestones (Kang et al. 2011). Transcriptomes can respond to affect, physiological insult, and pathological conditions. Even controlling for anatomical location, rhythmic effects, developmental time point, and environmental history, and allowing for stochastic changes, transcriptomes will not be equivalent for all members of an outbred species, either in sequence or in transcript abundance, because of natural variation.

Transcriptomes are also spatially organized within a cell. Some transcripts are primarily nuclear and others cytoplasmic. Some are trafficked to distant reaches of the cell (Mikl et al. 2010), while others form macromolecular complexes. In neuroscience, the synaptic transcriptome is an excellent example of a critical functional localization.
Rates and modes of transcription and degradation can also be related to RNA function and are areas of active study; a transcript’s abundance in the transcriptome is not necessarily proportional to its rate of transcription from the genome, since transcripts are degraded at different rates (Hayles, Yellaboina, & Wang 2010).
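The point that abundance is not proportional to transcription rate can be illustrated with a toy calculation. This sketch assumes a simple model that is not taken from the chapter: constant synthesis and first-order decay, so that steady-state abundance equals the synthesis rate divided by the decay constant. The rates and half-lives are made-up values.

```python
import math


def steady_state_abundance(synthesis_per_hour: float, half_life_hours: float) -> float:
    """Steady-state copies per cell under constant synthesis and first-order decay.

    Model (an assumption for illustration): dX/dt = k_syn - k_deg * X,
    which gives X_ss = k_syn / k_deg, with k_deg = ln(2) / half-life.
    """
    k_deg = math.log(2) / half_life_hours
    return synthesis_per_hour / k_deg


# Two hypothetical transcripts transcribed at the same rate
stable = steady_state_abundance(synthesis_per_hour=10.0, half_life_hours=8.0)
labile = steady_state_abundance(synthesis_per_hour=10.0, half_life_hours=0.5)

print(f"stable transcript: {stable:.1f} copies")
print(f"labile transcript: {labile:.1f} copies")
print(f"abundance ratio:   {stable / labile:.1f}")
```

Under these assumptions, two transcripts with identical transcription rates differ in abundance by the ratio of their half-lives, which is why abundance alone cannot be read as a transcription rate.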

Why Study Transcriptomes?

Transcriptomes are usually studied for one of two reasons. Either the fundamental question can be answered only at the level of the transcriptome, or the transcriptome is being used as a proxy for some other level of biology that cannot itself be assessed as easily, accurately, or cheaply. The two are distinguished in this section.

The simplicity of the static genome relative to the dynamic transcriptome is simultaneously an advantage and a disadvantage. On the one hand, genetic studies have a clear direction of causality. On the other hand, without information about where, when, and how a gene is used, it is impossible to understand how a causal genetic variant acts. A causal genetic variant without a proposed mechanism has little value for drug development. The dynamism of transcriptomes, epigenomes, and proteomes facilitates discovery of a genetic variant’s myriad downstream effects, which can then be investigated with hypothesis-driven experiments.

Differences in expression level (i.e., abundance), by far the most widely studied aspect of a transcript, are generally thought to reflect differences in transcript activity between conditions. For example, more mRNA often corresponds to greater protein abundance. Any particular case, however, must be confirmed, as the mean correlation between transcript and protein abundance ranges from 0.50 to 0.66 (Ghaemmaghami et al. 2003; Greenbaum et al. 2003; Ishihama et al. 2005; Nie, Wu, & Zhang 2006; Schmidt et al. 2007). Some of this discrepancy stems from differences in translational efficiency, but protein stability is also subject to many posttranslational influences. Technologies to measure protein abundance currently lag far behind those used to measure transcript abundance, which can be accomplished with higher sensitivity and dynamic range. The same is true for detecting the products of alternative splicing of protein coding genes. Nanopore technologies capable of determining the amino acid sequence of denatured proteins are promising in this regard. However, a future in which sequence determination for proteins is as easy as it is for nucleic acids would not obviate the use of sequencing protein coding transcriptomes.
While a primary goal is often to determine differences in protein abundance, a secondary goal is then to understand the mechanism. For example, does it occur at the level of the transcriptome or the proteome? Can it be attributed to a straightforward genomic effect such as an eQTL, or is it more complicated? Protein isoform abundance is crucial, as different isoforms often have different functions. However, the provenance of these isoform abundance differences must be determined. Is one protein isoform differentially degraded, or is the transcript differentially spliced? Thus, while its relative importance may diminish as proteomics technologies advance, there will always be an important role for sequencing protein coding transcriptomes.

This is especially true given the growing appreciation of the biological roles of non-protein coding transcripts. It will be just as important to understand the noncoding regulators of transcript abundance and translational efficiency as it is to understand protein levels. Furthermore, the known mechanisms of many RNAs are unrelated to the regulation of protein abundance or splicing.
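To put the transcript-protein correlations of 0.50 to 0.66 quoted earlier in this section into perspective, one can square them. The sketch below assumes the reported values behave like Pearson correlation coefficients under a linear relationship, which the text does not state, so the result is only a rough guide.

```python
def variance_explained(r: float) -> float:
    """Fraction of variance in one variable accounted for by a linear
    relationship with the other, given a Pearson-type correlation r."""
    return r * r


low_r, high_r = 0.50, 0.66   # range of mean correlations quoted in the text
low_r2 = variance_explained(low_r)
high_r2 = variance_explained(high_r)

print(f"variance explained: {low_r2:.2f} to {high_r2:.2f}")
```

Under that assumption, mRNA abundance accounts for well under half of the variation in protein abundance, which is consistent with the caution that any particular case must be confirmed.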

Transcriptomics: Yesterday and Today

The “-omics” methods in this chapter, microarrays and “next-generation sequencing” (also sometimes called second-generation sequencing), have many lower-throughput predecessors. These approaches, including Northern blotting, RNase protection assays, and RT-PCR, were based on the notion that measuring mRNA transcripts is of great value, and some of these (especially RT-PCR) are still in wide use today. A family of higher-throughput methods, including serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS), uses sequenced molecular tags rather than predefined molecular probes to determine RNA abundance. These early next-generation techniques have since been overtaken by newer, more powerful technologies.

Other sequencing methods were more frequently used to determine the sequence of transcripts rather than their precise abundance. For example, expressed sequence tags (ESTs) are derived from short, single-pass shotgun Sanger sequencing (also known as chain termination sequencing) of cloned cDNA fragments. Another method uses the 5′ cap to capture and clone full-length cDNAs that are subsequently Sanger sequenced (Haas et al. 2002). Both of these methods were instrumental for early gene definitions but have since fallen by the wayside in favor of the next-generation sequencing methods discussed in this chapter, which produce orders of magnitude higher throughput in the form of shorter sequence reads. Sanger sequencing still has a niche role in confirming the most important variants found with newer methods.

While the quantification of RNA abundance has been revolutionized by methods such as qRT-PCR, these low-throughput approaches require a preexisting implicit hypothesis. This creates a major scientific bottleneck owing to the sheer number of plausible hypotheses surrounding most biological or biomedical problems. The advent of genome-scale knowledge and technologies facilitates virtually complete measurement of select aspects of a biological system, permitting the testing of thousands of hypotheses in parallel; these developments constitute a considerable advance in efficiency over previous approaches (Geschwind & Konopka 2009). Two families of technologies, microarrays and next-generation sequencing, have enabled such large-scale, systematic “discovery” assays. Such approaches, being both relatively unbiased and efficient, are especially important in neuroscience, where there is a poor understanding of all but the most basic phenomena.

Microarray Technologies

For over a decade, microarrays have provided a powerful tool for transcriptional profiling in the neurosciences. Early microarrays measured the degree of complementary hybridization of labeled cDNAs from a sample of interest to cDNA probes that were systematically spotted onto a surface. In contrast, modern arrays use oligonucleotide probes, as these are easier to synthesize and more specific. In comparison to the newer sequencing technologies, microarrays are inexpensive, relatively easy, and fast to analyze and interpret; moreover, they require only a modest computational infrastructure. Thus at this time they remain a powerful tool for the modern neuroscientist. However, they are less sensitive and have a lower dynamic range in comparison to sequencing technologies and are limited by what is explicitly queried on the array. These factors limit the usefulness of microarrays for more sophisticated analyses such as splicing and cross-species comparisons (Preuss et al. 2004). Nonetheless, microarrays were pivotal in introducing “omics” to neuroscience (Geschwind & Konopka 2009; Geschwind 2003).

Platforms Used for Transcriptome Sequencing

Several technologies fall under the umbrella of next-generation sequencing. Some support paired-end sequencing, in which both ends of a cDNA fragment are sequenced. All produce “reads” of contiguous sequence that can be “mapped” to an appropriate position in the reference genome or transcriptome to determine their likely origin. The chromosomal locations of these mapped reads are then used to compute relative expression levels of genes and transcripts, sometimes with some form of correction for biases in sequence priming or GC content. Less frequently, reads are “assembled” or stitched together to build their originating transcripts from scratch. Studies using differential expression or network analysis often include functional enrichment analyses. In RNA sequencing (RNA-seq), these analyses should be performed carefully and interpreted with caution, given the very large difference in accuracy for highly expressed versus lowly expressed transcripts (Young et al. 2010).

Despite its name, next-generation sequencing has now been in use for years. It is often called “second-generation sequencing,” though this too is misleading, as there were many preceding generations of sequencing technologies. The major platforms in widespread use are discussed below.

Pyrosequencing, commercialized by 454 Life Sciences (currently a subsidiary of Roche), was one of the earliest widely used next-generation sequencing platforms. This platform produces longer reads than other second-generation methods (on the order of 1,000 bp). The equipment is also relatively inexpensive and the runs are relatively fast, requiring only a couple of hours. However, these runs are also relatively expensive per base pair of sequence produced and, though generally highly accurate, the technology is poor at determining the correct number of bases in a homopolymer run (i.e., where several contiguous identical bases appear in the sequence).

Sequencing by synthesis, commercialized by Solexa (now a subsidiary of Illumina), is currently the most common next-generation sequencing platform.
In Illumina sequencing, the synthesis of complementary strands is visualized with fluorescent-tagged bases within clusters of identical cDNA fragments. Reagent cost per base is several orders of magnitude lower than for Sanger sequencing, making this one of the cheapest next-generation sequencers. Unfortunately, the initial capital investment for a standard Illumina sequencing platform can be large, but there are less expensive versions available for smaller-scale projects.

Sequencing by ligation, commercialized as the SOLiD platform by Applied Biosystems (now a subsidiary of Life Technologies), also has a particularly low per-base cost. By effectively sequencing each base twice with a degenerate color space, SOLiD sequencers are known for their high quality. However, SOLiD is also one of the slower platforms and is limited to shorter reads. This platform appears to be falling out of favor.¹

Ion semiconductor sequencing, commercialized by Ion Torrent (now a subsidiary of Life Technologies), eschews the usual optical detection in favor of semiconductor-based detection of the proton by-products of the polymerization reaction. This approach is fast and the equipment relatively inexpensive, but (like the 454 platform) it sequences homopolymer runs inaccurately.

Several single-molecule methods (sometimes called “third-generation sequencing”) are also commercially available. These platforms avoid the PCR amplification of clusters of identical fragments used in second-generation sequencing methods, and some permit the sequencing of exceptionally long reads. Despite these obvious advantages, third-generation techniques currently lag behind the most developed second-generation platforms in adoption due to higher per-base costs and error rates. The major currently available third-generation platforms employ single-molecule real-time sequencing (SMRT) technology from Pacific Biosciences and HeliScope single-molecule sequencing from Helicos BioSciences.

These second- and third-generation sequencing technologies are enormously sensitive: many modern RNA-seq studies have a dynamic range of six orders of magnitude and can detect transcripts at a level lower than one transcript per cell.

¹ http://www.genomeweb.com/sites/default/files/pdfs/IS_survey_2012_supplement.pdf
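The link between sequencing depth and the ability to detect rare transcripts can be made concrete with a rough calculation. This sketch rests on assumptions that are not in the chapter: reads are treated as independent draws from the library, so the read count for a transcript is approximately Poisson with mean equal to depth times the transcript's share of the library, and effects such as transcript length, mapping ambiguity, and amplification bias are ignored. The depth and fractions are made-up values.

```python
import math


def expected_reads(depth: int, library_fraction: float) -> float:
    """Mean number of reads from a transcript making up `library_fraction` of the library."""
    return depth * library_fraction


def detection_probability(depth: int, library_fraction: float) -> float:
    """P(at least one read) under the Poisson sampling assumption."""
    return 1.0 - math.exp(-expected_reads(depth, library_fraction))


depth = 100_000_000  # hypothetical run of 100 million reads

# Library fractions spanning six orders of magnitude, from 1e-2 down to 1e-8
for exponent in range(2, 9):
    fraction = 10.0 ** -exponent
    print(
        f"fraction 1e-{exponent}: "
        f"expected reads = {expected_reads(depth, fraction):>11.1f}, "
        f"P(detect) = {detection_probability(depth, fraction):.3f}"
    )
```

Under these simplifications, detection becomes unreliable only once the expected read count approaches one, which is why deeper runs extend the usable dynamic range toward rarer transcripts.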
RNA-seq is also immensely powerful, permitting the detection of even small differences in the expression level of genes expressed at a moderate to high level. Moreover, these sequencing technologies allow analyses of expression levels, splicing, and novel transcript discovery with the same data (Belgard et al. 2011). While sequencing technologies have greatly accelerated the omics revolution begun by microarrays (Table 4.1), they come at the cost of greatly increased analytic complexity. For example, the algorithms used for read mapping and assembly, expression level quantification, and tests of differential expression are in a constant state of flux. These algorithms generally have specific strengths and weaknesses that should be understood in order to avoid mistaking methodological bias for biological fact (see below). Few systematic comparisons among different technologies and analytic methods have been reported, but there are sometimes substantial differences. For example, even differential expression programs that assume a negative binomial distribution of read counts often have only a modest overlap in their reported genes. The huge variety of parameters and custom scripts used in RNA-seq analysis can complicate straightforward replication. Consequently, it is critical that publications using such data make available the results at several stages of analysis so that the data can be reconsidered using alternative methods.

TABLE 4.1. SOME TRANSCRIPTOMIC APPLICATIONS OF SEQUENCING IN NEUROSCIENCE

Example Analysis                                       Citation
Single neuron transcriptome profiling                  Qiu et al. (2012)
Gene expression evolution                              Brawand et al. (2011)
RNA editing                                            Danecek et al. (2012)
Sequence variation and gene regulation                 Keane et al. (2011)
Effects of transposable elements on gene expression    Nellaker et al. (2012)
Novel long noncoding RNA discovery                     Belgard et al. (2011)
MicroRNA profiling                                     Shao et al. (2010)
Imprinted gene expression                              DeVeale, van der Kooy, and Babak (2012)
Novel isoform discovery                                Roberts et al. (2011)
Comparative neurotranscriptomics                       Belgard et al. (2013)
Splicing dysregulation in a neurological disorder      Voineagu et al. (2011)
Allele-specific expression                             Shen et al. (2012)
Anatomical and temporal effects on gene expression     http://www.brainspan.org
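As a rough illustration of how the modest overlap among differential expression programs mentioned above might be quantified, the following Python sketch compares two lists of significant genes using the Jaccard index. The gene lists, the tool labels, and the function name are hypothetical placeholders chosen for illustration; they are not output from any real program or data from the studies cited here.

```python
def jaccard_overlap(genes_a, genes_b):
    """Return |A intersect B| / |A union B| for two collections of gene identifiers."""
    set_a, set_b = set(genes_a), set(genes_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: define the overlap as zero
    return len(set_a & set_b) / len(union)


# Hypothetical lists of genes called significant by two different programs
tool_a_hits = ["Bdnf", "Fos", "Arc", "Npas4", "Egr1"]
tool_b_hits = ["Fos", "Arc", "Homer1", "Nr4a1"]

shared = sorted(set(tool_a_hits) & set(tool_b_hits))
overlap = jaccard_overlap(tool_a_hits, tool_b_hits)

print("Genes reported by both programs:", shared)
print("Jaccard overlap:", round(overlap, 3))
```

Reporting a summary of this kind alongside the intermediate results makes it easier for others to judge how much a conclusion depends on the choice of method.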

Reproducibility and the Rise of a New Technology

Investigators must be careful to avoid drawing the wrong conclusions from their transcriptomic studies. Unfortunately, this is all too common owing to poor or inappropriate study design, tissue or RNA quality, bioinformatics, or interpretative framework. Such issues must be thoroughly considered before a project is begun. Remember: “garbage in, garbage out.”

Study design is the first crucial consideration. Can the question be properly answered by the study as designed? Does a transcriptomic approach even make sense in this context? What are possible outcomes, and how would they be interpreted? Statistical tests and stratification should be determined in advance to avoid understating the likelihood of type I errors. Potential confounders, such as age and sex, should be measured and ideally matched or stratified. In some cases this is difficult or impossible, as with the confounding effects of medications on postmortem expression studies of brains from people with schizophrenia. Such caveats should be noted. If regions are being compared, could apparent expression differences be attributed to gross dissections? (For example, gray matter may be easier to selectively isolate from some areas than from others.) If cases and controls are derived from different sources, could apparent changes in expression be due to different banking procedures or dissections? How would one test this?

Even a well-designed study can encounter problems if tissue or RNA quality varies. Postmortem Interval (PMI), pH, and RNA Integrity Number (RIN) can all affect gene expression and should be accounted for in any transcriptomic analysis. Often poor-quality samples must simply be discarded, as transcript degradation may not be linear in response to these variables, which complicates regression analyses that aim to remove these confounding effects.
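To make the idea of accounting for a quality variable concrete, here is a minimal Python sketch that removes the linear effect of a single covariate (RIN is used as the example) from a gene's expression values by ordinary least squares. The function name and all numbers are hypothetical, and the linear model is itself an assumption: as the text notes, degradation may not be linear, so this is a sketch of the simplest case rather than a recommended pipeline.

```python
def residualize(expression, covariate):
    """Remove the linear effect of one covariate from expression values.

    Fits expression = intercept + slope * covariate by ordinary least
    squares and returns the residuals, one per sample.
    """
    n = len(expression)
    if n != len(covariate) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]


# Hypothetical samples: RIN values and log2 expression of one gene.
# Here expression is exactly linear in RIN, so the residuals are ~0.
rin = [9.0, 8.0, 7.0, 6.0]
log_expr = [10.0, 9.5, 9.0, 8.5]
residuals = residualize(log_expr, rin)
print([round(r, 6) for r in residuals])
```

If the residuals still show structure when plotted against the covariate, the linear assumption is probably inadequate, and discarding the poorest-quality samples may be the safer choice.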

Informatics is not a trivial support function. New algorithmic innovations are being reported every week and, although today’s studies must be done using the methods available today, one must be vigilant as to how known and potential biases may affect one’s present analysis. Consider, for example, a paper that used RNA-seq data to conclude that RNA-DNA differences (and presumably RNA editing events) were far more pervasive than previously realized (Li et al. 2011). It was later demonstrated that these results were accidentally inflated by an order of magnitude owing to a failure to properly account for incorrect mappings (individually rare but collectively common), genetic variation, and sequencing errors (Kleinman & Majewski 2012; Lin et al. 2012; Pickrell, Gilad, & Pritchard 2012). A summary of all the numerous analytic hurdles is beyond the purview of this chapter, but a thorough current review of the relevant bioinformatic literature should be conducted before an experiment is designed.

Finally, there should be a reasonable theoretical framework to interpret changes in gene expression. Of course mechanistic details need not be known in advance, but one should bear in mind the possible reasons for an apparent change in steady-state gene expression. Could the change be attributed to differences in cell type composition or to expression differences within a cell type? What implications would this have? One should consider the distinction between causal and reactive changes as well as when the distinction does and does not matter. Perhaps a reactive difference is aggravating the pathology started elsewhere and may have a more druggable critical node.
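The question of whether an apparent expression change reflects cell type composition or regulation within a cell type can be made concrete with a small numerical sketch. The Python example below assumes a simple linear mixing model, in which the bulk level of a gene is the fraction-weighted sum of its per-cell levels. The cell types, fractions, and expression values are hypothetical and chosen only for illustration.

```python
def bulk_expression(fractions, per_cell_expression):
    """Expected bulk level of one gene under a linear mixing model:
    the sum over cell types of (fraction of cells) x (level per cell)."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("cell type fractions must sum to 1")
    return sum(fractions[ct] * per_cell_expression[ct] for ct in fractions)


# Hypothetical per-cell expression, identical in both groups
per_cell = {"neuron": 100.0, "glia": 10.0}

# Hypothetical tissue composition: the case group has fewer neurons
control = {"neuron": 0.5, "glia": 0.5}
case = {"neuron": 0.3, "glia": 0.7}

control_level = bulk_expression(control, per_cell)
case_level = bulk_expression(case, per_cell)
fold_change = case_level / control_level

print("control:", round(control_level, 2))
print("case:", round(case_level, 2))
print("apparent fold change:", round(fold_change, 3))
```

Under these assumed numbers, the bulk signal falls by about a third even though per-cell expression is identical in both groups, which is why composition should be considered before regulation is inferred.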

THE FUTURE OF TRANSCRIPTOMICS: LONGER READS, HIGHER THROUGHPUT

As costs continue to fall and throughput continues to skyrocket, RNA-seq will likely eventually overtake arrays for studies of gene expression. There are several challenges that must be met before this occurs. For one, RNA-seq informatics is still in a state of constant change, with few widely accepted standards. In contrast, methods to get expression levels from microarrays are relatively standardized. Consequently, the average time required to analyze information-dense RNA-seq data is considerably greater than that needed with array data. Likewise, RNA-seq analysis requires a more sophisticated IT infrastructure than microarrays owing to its greater computational demands. This may be a challenge for individual research groups going forward. While the cloud is an option, much of the cost of cloud computing is driven by the volume of data, which is large for RNA-seq. Nonetheless, RNA-seq has already overtaken arrays in niche areas of transcriptomics, as in studies of nontraditional model organisms or of alternative splicing.

Ultimately, longer reads and greater depth are needed for better isoform deconvolution, which will allow for more accurate estimates of both gene and transcript expression levels. Single-molecule sequencing promises these longer reads at greater read depths. These may be provided by descendants of the existing third-generation sequencing platforms, but it remains possible that nanopore sequencing will leapfrog those existing technologies. This family of technologies, in development by Oxford Nanopore, promises extremely long reads of directly sequenced single-stranded RNAs (Ayub & Bayley 2012). Regardless of who will deliver them, higher-throughput long reads are the way of the future.

Similarly, analytic methods will be refined. Network analysis and methods to integrate data from multiple biological levels will be key to attaining maximal mechanistic biological insight (Geschwind & Konopka 2009). Some current frontiers are in allele-specific expression, predicting effects of rare genomic variants, alternative splicing, and the inference of causal regulatory models.

SUMMARY AND CONCLUSIONS

Transcriptomics is already revolutionizing molecular studies in neuroscience, allowing one to start from a discovery stance and then develop testable hypotheses. Current technologies are at different stages of maturity. Microarray analysis is already well developed, with stable analytic protocols. Despite its (deserved) soaring popularity, RNA-seq analysis remains in a state of constant flux, and the analytic demands and idiosyncrasies have not yet been well worked out. High-throughput, single-molecule, and long-read sequencing technologies will probably overtake current transcriptomic methods within the next decade. The methods to enable this are in development, and the analytic demands of such raw data are high. Current transcriptomic technologies, coupled with other new emerging technologies, will permit efficient single-cell analyses, enabling a systematic understanding of cell and circuit diversity and function in the nervous system.

Similarly, analytic methods will be refined. One must move beyond identifying lists of the most changing genes to understanding transcriptomic organization at a higher level (Oldham et al. 2008; Winden et al. 2009; Konopka et al. 2009; Miller et al. 2010). Network analysis and methods to integrate data from multiple biological levels will be key to attaining maximal mechanistic biological insight (Geschwind & Konopka 2009). A particularly salient example where network methods have led to disease insight is in autism, where transcriptomics has recently defined a molecular pathology of autism (Voineagu et al. 2011). Other current frontiers are in allele-specific expression (DeVeale, van der Kooy, & Babak 2012), predicting effects of rare genomic variants (Luo et al. 2012), alternative splicing (Mazin et al. 2013), and the inference of causal regulatory models (Wexler et al. 2011).

REFERENCES

Ayub, M., & H. Bayley. 2012. Individual RNA base recognition in immobilized oligonucleotides using a protein nanopore. Nano Letters no. 12 (11):5637–5643. doi: 10.1021/nl3027873.
Belgard, T. G., A. C. Marques, P. L. Oliver, H. O. Abaan, T. M. Sirey, A. Hoerder-Suabedissen, . . . C. P. Ponting. 2011. A transcriptomic atlas of mouse neocortical layers. Neuron no. 71 (4):605–616. doi: 10.1016/j.neuron.2011.06.039.
Belgard, T. G., J. F. Montiel, W. Z. Wang, F. García-Moreno, E. H. Margulies, C. P. Ponting, & Z. Molnár. 2013. Adult pallium transcriptomes surprise in not reflecting predicted homologies across diverse chicken and mouse pallial sectors. Proceedings of the National Academy of Sciences of the United States of America. Published online before print July 22, 2013. doi: 10.1073/pnas.1307444110.
Brawand, D., M. Soumillon, A. Necsulea, P. Julien, G. Csardi, P. Harrigan, . . . H. Kaessmann. 2011. The evolution of gene expression levels in mammalian organs. Nature no. 478 (7369):343–348. doi: 10.1038/nature10532.
Bustos, D. M., M. J. Bailey, D. Sugden, D. A. Carter, M. F. Rath, M. Moller, . . . D. C. Klein. 2011. Global daily dynamics of the pineal transcriptome. Cell and Tissue Research no. 344 (1):1–11. doi: 10.1007/s00441-010-1094-1.
Chodroff, R. A., L. Goodstadt, T. M. Sirey, P. L. Oliver, K. E. Davies, E. D. Green, . . . C. P. Ponting. 2010. Long noncoding RNA genes: conservation of sequence and brain expression among diverse amniotes. Genome Biology no. 11 (7):R72. doi: 10.1186/gb-2010-11-7-r72.
Cordaux, R., & M. A. Batzer. 2009. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics no. 10 (10):691–703. doi: 10.1038/nrg2640.
Danecek, P., C. Nellaker, R. E. McIntyre, J. E. Buendia-Buendia, S. Bumpstead, C. P. Ponting, . . . D. J. Adams. 2012. High levels of RNA-editing site conservation amongst 15 laboratory mouse strains. Genome Biology no. 13 (4):26. doi: 10.1186/gb-2012-13-4-r26.
DeRisi, J. L., V. R. Iyer, & P. O. Brown. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science no. 278 (5338):680–686.
DeVeale, B., D. van der Kooy, & T. Babak. 2012. Critical evaluation of imprinted gene expression by RNA-Seq: a new perspective. PLoS Genetics no. 8 (3):e1002600. doi: 10.1371/journal.pgen.1002600.
Djebali, S., C. A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, . . . T. R. Gingeras. 2012. Landscape of transcription in human cells. Nature no. 489 (7414):101–108. doi: 10.1038/nature11233.
Douglas, A. G., & M. J. Wood. 2011. RNA splicing: disease and therapy. Briefings in Functional Genomics no. 10 (3):151–164. doi: 10.1093/bfgp/elr020.
Gerbi, S. A., A. V. Borovjagin, M. Ezrokhi, & T. S. Lange. 2001. Ribosome biogenesis: role of small nucleolar RNA in maturation of eukaryotic rRNA. Cold Spring Harbor Symposia on Quantitative Biology no. 66:575–590.
Geschwind, D. H. 2003. DNA microarrays: translation of the genome from laboratory to clinic. Lancet Neurology no. 2 (5):275–282.
Geschwind, D. H., & G. Konopka. 2009. Neuroscience in the era of functional genomics and systems biology. Nature no. 461 (7266):908–915. doi: 10.1038/nature08537.
Ghaemmaghami, S., W. K. Huh, K. Bower, R. W. Howson, A. Belle, N. Dephoure, . . . J. S. Weissman. 2003. Global analysis of protein expression in yeast. Nature no. 425 (6959):737–741. doi: 10.1038/nature02046.
Gilbert, W. 1986. Origin of Life: The RNA World. Nature no. 319:618.
Greenbaum, D., C. Colangelo, K. Williams, & M. Gerstein. 2003. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biology no. 4 (9):117. doi: 10.1186/gb-2003-4-9-117.
Guttman, M., I. Amit, M. Garber, C. French, M. F. Lin, D. Feldser, . . . E. S. Lander. 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature no. 458 (7235):223–227. doi: 10.1038/nature07672.
Haas, B. J., N. Volfovsky, C. D. Town, M. Troukhan, N. Alexandrov, K. A. Feldmann, . . . S. L. Salzberg. 2002. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biology no. 3 (6):RESEARCH0029.
Harrow, J., A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, . . . T. J. Hubbard. 2012. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research no. 22 (9):1760–1774. doi: 10.1101/gr.135350.111.
Hayles, B., S. Yellaboina, & D. Wang. 2010. Comparing transcription rate and mRNA abundance as parameters for biochemical pathway and network analysis. PLoS One no. 5 (3):e9908. doi: 10.1371/journal.pone.0009908.
Ishihama, Y., Y. Oda, T. Tabata, T. Sato, T. Nagasu, J. Rappsilber, & M. Mann. 2005. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Molecular & Cellular Proteomics no. 4 (9):1265–1272. doi: 10.1074/mcp.M500061-MCP200.
Jarrous, N., & V. Gopalan. 2010. Archaeal/eukaryal RNase P: subunits, functions and RNA diversification. Nucleic Acids Research no. 38 (22):7885–7894. doi: 10.1093/nar/gkq701.
Kang, H. J., Y. I. Kawasawa, F. Cheng, Y. Zhu, X. Xu, M. Li, . . . N. Sestan. 2011. Spatio-temporal transcriptome of the human brain. Nature no. 478 (7370):483–489. doi: 10.1038/nature10523.
Kawasaki, M., I. Sekigawa, K. Nozawa, H. Kaneko, Y. Takasaki, K. Takamori, & H. Ogawa. 2009. Changes in the gene expression of peripheral blood mononuclear cells during the menstrual cycle of females is associated with a gender bias in the incidence of systemic lupus erythematosus. Clinical and Experimental Rheumatology no. 27 (2):260–266.
Keane, T. M., L. Goodstadt, P. Danecek, M. A. White, K. Wong, B. Yalcin, . . . D. J. Adams. 2011. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature no. 477 (7364):289–294. doi: 10.1038/nature10413.
Kleinman, C. L., & J. Majewski. 2012. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome.” Science no. 335 (6074):1302; author reply 1302. doi: 10.1126/science.1209658.

Konopka, G., J. M. Bomar, K. Winden, G. Coppola, Z. O. Jonsson, F. Gao, . . . D. H. Geschwind. 2009. Human-specific transcriptional regulation of CNS development genes by FOXP2. Nature no. 462 (7270):213–217. doi: 10.1038/nature08549.
Lenz, P., & L. Sogaard-Andersen. 2011. Temporal and spatial oscillations in bacteria. Nature Reviews Microbiology no. 9 (8):565–577. doi: 10.1038/nrmicro2612.
Leung, E., & J. D. Brown. 2010. Biogenesis of the signal recognition particle. Biochemical Society Transactions no. 38 (4):1093–1098. doi: 10.1042/BST0381093.
Li, M., I. X. Wang, Y. Li, A. Bruzel, A. L. Richards, J. M. Toung, & V. G. Cheung. 2011. Widespread RNA and DNA sequence differences in the human transcriptome. Science no. 333 (6038):53–58. doi: 10.1126/science.1207018.
Lin, W., R. Piskol, M. H. Tan, & J. B. Li. 2012. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome.” Science no. 335 (6074):1302; author reply 1302. doi: 10.1126/science.1210624.
Liu, Q., & Z. Paroo. 2010. Biochemical principles of small RNA pathways. Annual Review of Biochemistry no. 79:295–319. doi: 10.1146/annurev.biochem.052208.151733.
Lockhart, D. J., H. Dong, M. C. Byrne, M. T. Follettie, M. V. Gallo, M. S. Chee, . . . E. L. Brown. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology no. 14 (13):1675–1680. doi: 10.1038/nbt1296-1675.
Luo, R., S. J. Sanders, Y. Tian, I. Voineagu, N. Huang, S. H. Chu, . . . D. H. Geschwind. 2012. Genome-wide transcriptome profiling reveals the functional impact of rare de novo and recurrent CNVs in autism spectrum disorders. American Journal of Human Genetics no. 91 (1):38–55. doi: 10.1016/j.ajhg.2012.05.011.
Maier, T., M. Guell, & L. Serrano. 2009. Correlation of mRNA and protein in complex biological samples. FEBS Letters no. 583 (24):3966–3973. doi: 10.1016/j.febslet.2009.10.036.
Maksakova, I. A., D. L. Mager, & D. Reiss. 2008. Keeping active endogenous retroviral-like elements in check: the epigenetic perspective. Cellular and Molecular Life Sciences no. 65 (21):3329–3347. doi: 10.1007/s00018-008-8494-3.
Marques, A. C., & C. P. Ponting. 2009. Catalogues of mammalian long noncoding RNAs: modest conservation and incompleteness. Genome Biology no. 10 (11):R124. doi: 10.1186/gb-2009-10-11-r124.
Mazin, P., J. Xiong, X. Liu, Z. Yan, X. Zhang, M. Li, . . . P. Khaitovich. 2013. Widespread splicing changes in human brain development and aging. Molecular Systems Biology no. 9:633. doi: 10.1038/msb.2012.67.
McMaster, A., M. Jangani, P. Sommer, N. Han, A. Brass, S. Beesley, . . . D. W. Ray. 2011. Ultradian cortisol pulsatility encodes a distinct, biologically important signal. PLoS One no. 6 (1):e15766. doi: 10.1371/journal.pone.0015766.
Mikl, M., G. Vendra, M. Doyle, & M. A. Kiebler. 2010. RNA localization in neurite morphogenesis and synaptic regulation: current evidence and novel approaches. Journal of Comparative Physiology A no. 196 (5):321–334. doi: 10.1007/s00359-010-0520-x.
Miller, J. A., S. Horvath, & D. H. Geschwind. 2010. Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proceedings of the National Academy of Sciences of the United States of America no. 107 (28):12698–12703. doi: 10.1073/pnas.0914257107.
Nellaker, C., T. M. Keane, B. Yalcin, K. Wong, A. Agam, T. G. Belgard, . . . C. P. Ponting. 2012. The genomic landscape shaped by selection on transposable elements across 18 mouse strains. Genome Biology no. 13 (6):R45. doi: 10.1186/gb-2012-13-6-r45.
Nie, L., G. Wu, & W. Zhang. 2006. Correlation between mRNA and protein abundance in Desulfovibrio vulgaris: a multiple regression to identify sources of variations. Biochemical and Biophysical Research Communications no. 339 (2):603–610. doi: 10.1016/j.bbrc.2005.11.055.
Oldham, M. C., G. Konopka, K. Iwamoto, P. Langfelder, T. Kato, S. Horvath, & D. H. Geschwind. 2008. Functional organization of the transcriptome in human brain. Nature Neuroscience no. 11 (11):1271–1282. doi: 10.1038/nn.2207.
Pickrell, J. K., Y. Gilad, & J. K. Pritchard. 2012. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome.” Science no. 335 (6074):1302; author reply 1302. doi: 10.1126/science.1210484.
Ponjavic, J., P. L. Oliver, G. Lunter, & C. P. Ponting. 2009. Genomic and transcriptional co-localization of protein-coding and long non-coding RNA pairs in the developing brain. PLoS Genetics no. 5 (8):e1000617. doi: 10.1371/journal.pgen.1000617.
Ponjavic, J., C. P. Ponting, & G. Lunter. 2007. Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs. Genome Research no. 17 (5):556–565. doi: 10.1101/gr.6036807.
Ponting, C. P., & T. G. Belgard. 2010. Transcribed dark matter: meaning or myth? Human Molecular Genetics no. 19 (R2):R162–R168. doi: 10.1093/hmg/ddq362.
Preuss, T. M., M. Caceres, M. C. Oldham, & D. H. Geschwind. 2004. Human brain evolution: insights from microarrays. Nature Reviews Genetics no. 5 (11):850–860. doi: 10.1038/nrg1469.

Qiu, S., S. Luo, O. Evgrafov, R. Li, G. P. Schroth, P. Levitt, . . . K. Wang. 2012. Single-neuron RNA-Seq: technical feasibility and reproducibility. Frontiers in Genetics no. 3:124. doi: 10.3389/fgene.2012.00124.
Roberts, A., H. Pimentel, C. Trapnell, & L. Pachter. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics no. 27 (17):2325–2329. doi: 10.1093/bioinformatics/btr355.
Schmidt, M. W., A. Houseman, A. R. Ivanov, & D. A. Wolf. 2007. Comparative proteomic and transcriptomic profiling of the fission yeast Schizosaccharomyces pombe. Molecular Systems Biology no. 3:79. doi: 10.1038/msb4100117.
Senti, K. A., & J. Brennecke. 2010. The piRNA pathway: a fly’s perspective on the guardian of the genome. Trends in Genetics no. 26 (12):499–509. doi: 10.1016/j.tig.2010.08.007.
Shao, N. Y., H. Y. Hu, Z. Yan, Y. Xu, H. Hu, C. Menzel, . . . P. Khaitovich. 2010. Comprehensive survey of human brain microRNA by deep sequencing. BMC Genomics no. 11:409. doi: 10.1186/1471-2164-11-409.
Shen, Y., T. Garcia, V. Pabuwal, M. Boswell, A. Pasquali, I. Beldorth, . . . R. B. Walter. 2012. Alternative strategies for development of a reference transcriptome for quantification of allele specific expression in organisms having sparse genomic resources. Comparative Biochemistry and Physiology Part D Genomics Proteomics no. 8 (1):11–16. doi: 10.1016/j.cbd.2012.10.006.
Voineagu, I., X. Wang, P. Johnston, J. K. Lowe, Y. Tian, S. Horvath, . . . D. H. Geschwind. 2011. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature no. 474 (7351):380–384. doi: 10.1038/nature10110.
Wexler, E. M., E. Rosen, D. Lu, G. E. Osborn, E. Martin, H. Raybould, & D. H. Geschwind. 2011. Genome-wide analysis of a Wnt1-regulated transcriptional network implicates neurodegenerative pathways. Science Signaling no. 4 (193):ra65. doi: 10.1126/scisignal.2002282.
Winden, K. D., M. C. Oldham, K. Mirnics, P. J. Ebert, C. H. Swan, P. Levitt, . . . D. H. Geschwind. 2009. The organization of the transcriptional network in specific neuronal classes. Molecular Systems Biology no. 5:291. doi: 10.1038/msb.2009.46.
Young, M. D., M. J. Wakefield, G. K. Smyth, & A. Oshlack. 2010. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology no. 11 (2):R14. doi: 10.1186/gb-2010-11-2-r14.
Zhang, Q., N. K. Kim, & J. Feigon. 2011. Architecture of human telomerase RNA. Proceedings of the National Academy of Sciences of the United States of America. doi: 10.1073/pnas.1100279108.

5 Decoding Alternative mRNAs in the “Omics” Age

YUAN YUAN AND DONNY D. LICATALOSI

INTRODUCTION

Major goals in neuroscience research are to identify genes with important roles in neuronal specialization and function and to understand how they are regulated. What genes are required for the development of the vast array of morphologically and functionally distinct cell types that comprise the mammalian nervous system? How are cell type‒specific genes regulated in different neuronal cell types? What gene-regulatory events drive cells through differentiation and maturation processes critical for proper cell identity and function? Finally, how do defects in the regulation of specific genes contribute to neurologic disease?

Advances in genome profiling technologies have yielded new insights into the molecular mechanisms of gene regulation at an unprecedented level of detail, revealing widespread roles for posttranscriptional control of gene expression via mRNA regulation (Blencowe et al. 2009; Licatalosi and Darnell, 2010). mRNAs are subject to extensive tissue- and cell-specific regulation controlled by trans-acting regulatory factors such as micro-RNAs and RNA-binding proteins (RBPs). It is generally believed that each mRNA associates with overlapping but distinct combinations of factors that will define how each pre-mRNA will be processed into mRNA and how those transcripts will be regulated at the level of mRNA localization, translation, and decay.

This chapter highlights the importance of alternative mRNA regulation in the nervous system, beginning with a review of different ways in which alternative RNA regulation can “fine tune” gene expression. In addition, we describe recent technical and methodological advances that have ushered in a new era of molecular neuroscience research, allowing the molecular landscape of the cell and its regulation to be explored at an unprecedented level of coverage and detail. These methods have not only provided new insights into mechanisms of gene expression control but are revealing genome-wide networks of alternative mRNAs, how alternative RNA processing events impact cellular events, and connections between mRNA misregulation and neurologic disease.

ALTERNATIVE mRNA REGULATION IN THE NERVOUS SYSTEM

An overview of mRNA alternative splicing and polyadenylation is presented below, along with a discussion of their impact on a wide array of processes in the nervous system. In addition, traditional approaches to the study of alternative mRNA regulation are described.

Alternative mRNA Processing

Most protein coding genes have a modular organization whereby the protein coding segments (exons) are interrupted by significantly longer noncoding sequences (introns). The processing of precursor-mRNA transcripts into mRNA involves the addition of a caplike structure to the 5ʹ end of the transcript, a polyadenylate (polyA) tail to the 3ʹ end, and the removal of introns and precise joining (or “splicing”) of exons to define the amino acid coding sequence (Figure 5.1). Soon after the identification of genes “in pieces” and exon splicing, it was hypothesized that regulation of splicing could be a means to posttranscriptionally modulate gene expression (Crick, 1979; Gilbert, 1978). Today we know that regulation of exon splicing is a major mechanism of posttranscriptional gene regulation that affects the expression of over 90% of all human protein coding genes (Pan et al. 2008; Wang et al. 2008). While the majority of exons are constitutively spliced (included in the mature mRNA), splicing of some exons is highly regulated in different tissues and stages of development. In most cases the functional consequence of alternative splicing is the production of mRNA isoforms that encode polypeptides with distinct amino acid sequences and therefore distinct biochemical properties. This is possible since the length of most alternative exons is divisible by three; thus their alternative splicing does not result in a coding frameshift (Resch et al. 2004). Recent bioinformatic analyses indicate that a common functional consequence of alternative splicing is the inclusion or exclusion of codons for amino acids that can be differentially phosphorylated (Merkin et al. 2012). Thus, cell type‒specific differences in the relative levels of exon-spliced to exon-skipped isoforms can alter the activity of signaling pathways in a cell type‒specific manner.

FIGURE 5.1: A single gene gives rise to identical pre-mRNAs that are alternatively spliced in one of two ways. In splicing pattern 1, all exons are spliced in the mRNA, resulting in a template for translation of a hypothetical protein that functions as an “activator”. In splicing pattern 2, alternative exon “b” is skipped and the resulting mRNA codes for a repressor.

A second widely regulated alternative mRNA processing event that can significantly impact gene expression is alternative polyadenylation (Di Giammartino et al. 2011). The 3ʹ ends of nearly all protein coding mRNAs are generated by endonucleolytic cleavage followed by the addition of a polyadenylate (polyA) tail. Most genes yield pre-mRNAs that possess multiple alternative polyA sites used in a tissue- or developmental stage‒specific manner (Figure 5.2) (Wang et al. 2008). Alternative polyadenylation can produce mRNAs with distinct 3ʹ untranslated regions (3ʹUTRs) where regulatory elements reside that can impact mRNA abundance, subcellular location, and translation. Thus alternative polyadenylation of pre-mRNA in the nucleus can affect downstream mRNA regulatory events in the cytoplasm.

FIGURE 5.2: Alternative polyadenylation coupled to posttranscriptional regulation. In this example, alternative polyadenylation at two sites can yield mRNAs with either a short (1) or long (2) 3ʹUTR. mRNAs with the long 3ʹUTR are targets of signaling cascades that stimulate new protein synthesis.

Tissue-specific regulation of alternatively spliced and polyadenylated mRNAs is thought to be largely dependent on cell-specific combinations of ubiquitous and tissue-restricted RNA binding proteins (RBPs) (Barash et al. 2010; Licatalosi & Darnell 2010; Wang & Burge 2008). Each newly synthesized pre-mRNA is bound by RBPs, some of which are “core” RNA processing factors that interact with essential cis-acting sequences to demarcate exons and introns or potential polyA sites, while some RBPs function as trans-acting auxiliary factors that regulate processing. Some RBPs are known to act in a positive manner to enhance processing of an exon or polyA site, while others act as repressors. Interestingly, some RBPs can act as either positive or negative factors depending on their position (upstream, downstream, or within) relative to the alternative exon or polyA site. Transcripts that have similar combinations of cis elements will interact with similar sets of regulatory factors (miRNAs and RBPs) and therefore be coordinately regulated across different tissues and cell types (Matlin et al. 2005).
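The point made above, that an alternative exon whose length is divisible by three can be included or skipped without shifting the downstream reading frame, can be checked with a few lines of Python. This is a minimal sketch: the function names and the example exon lengths are hypothetical, and it assumes the exon lies entirely within the coding sequence. It also ignores other consequences of splicing, such as the introduction of a stop codon or a changed codon at the new exon junction.

```python
def preserves_reading_frame(exon_length_nt):
    """True if including or skipping an exon of this length (in nucleotides)
    leaves the downstream reading frame unchanged."""
    if exon_length_nt <= 0:
        raise ValueError("exon length must be a positive number of nucleotides")
    return exon_length_nt % 3 == 0


def codons_contributed(exon_length_nt):
    """Number of codons added or removed by a frame-preserving exon."""
    if not preserves_reading_frame(exon_length_nt):
        raise ValueError("exon length is not a multiple of three")
    return exon_length_nt // 3


# Hypothetical alternative exon lengths, in nucleotides
for length in (99, 100):
    if preserves_reading_frame(length):
        print(length, "nt: frame preserved,", codons_contributed(length), "codons")
    else:
        print(length, "nt: skipping or inclusion shifts the reading frame")
```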

mRNA Regulation and Neurobiology

Multiple comparative studies have revealed extensive tissue-specific expression of alternative mRNAs and ranked the brain as the top mammalian tissue with respect to the number and complexity of gene products expressed (Clark et al. 2007; Yeo et al. 2004). In addition to expressing more genes than most tissues, the nervous system relies on extensive posttranscriptional mRNA control to modulate gene expression and generate a remarkable number of distinct proteins from a significantly smaller number of genes. In the nervous system, alternative mRNA processing affects several essential processes, from neuronal precursor maintenance and cell migration to modulation of membrane receptor properties (for detailed reviews, see Grabowski & Black 2001 and Li et al. 2007). Multiple studies have also shown that alternative mRNA processing events can be altered in response to neuronal activity. For example, a recent study (Iijima et al. 2011) demonstrated that activation of specific signaling pathways in neuronal cultures can lead to phosphorylation of the RBP Sam68, which in turn affects the ability of Sam68 to regulate the splicing of alternative exons in target RNAs. It is not clear whether changes in alternative exon splicing in response to different stimuli are widespread or restricted to specific networks of mRNAs.

Alternative polyadenylation has not been studied as extensively as alternative splicing. However, in recent years increasing evidence has emerged to suggest that alternative polyadenylation has important roles in nervous system gene regulation. Computational analyses first showed that mRNAs expressed in nervous system tissue contain longer 3ʹUTRs than the same mRNAs expressed in other tissues (Zhang et al. 2005). The functional importance of long 3ʹUTRs due to alternative polyadenylation in the nervous system is not clear but may reflect the need for additional layers of gene regulation via 3ʹUTR sequences as compared with other tissues. In one study (Lau et al. 2010), activity-induced accumulation of BDNF protein was shown to be the result of increased translation from one of two alternatively polyadenylated BDNF mRNA variants. BDNF mRNAs bearing the extended 3ʹUTR were translated in response to cell stimuli, while the translation of the shorter 3ʹUTR isoform was unchanged. Thus alternatively polyadenylated 3ʹUTR variants can be differentially “primed” for increased mRNA translation in response to specific stimuli. Cell type‒specific differences in polyA site processing could therefore allow different cell types to control the quantity of proteins that are made in response to specific stimuli (Figure 5.2). A second example of a link between alternative polyadenylation and neuronal activity was the demonstration that neuronal activity can promote the production of alternatively polyadenylated mRNA variants, indicating that the cell can respond to stimuli by changing the relative levels of specific alternative 3ʹUTR isoforms (Flavell et al. 2008) (Figure 5.3).
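The change in isoform usage described above can be expressed as the fraction of transcripts processed at the distal polyA site. The following Python sketch is illustrative only: the function name `long_utr_fraction` and the read counts are hypothetical, chosen to mirror the 50% to 100% change depicted in Figure 5.3, and the sketch assumes that every read can be assigned unambiguously to the proximal or the distal site.

```python
def long_utr_fraction(proximal_reads: int, distal_reads: int) -> float:
    """Fraction of transcripts that end at the distal polyA site (long 3'UTR)."""
    total = proximal_reads + distal_reads
    if total == 0:
        raise ValueError("no reads at either polyA site")
    return distal_reads / total


# Hypothetical read counts before and after stimulation.
resting = long_utr_fraction(proximal_reads=200, distal_reads=200)
stimulated = long_utr_fraction(proximal_reads=0, distal_reads=400)
shift = stimulated - resting

print(f"resting={resting:.2f} stimulated={stimulated:.2f} shift={shift:+.2f}")
```

With these made-up counts the long-isoform fraction rises from 0.50 to 1.00. Real counts would come from reads mapped to each polyA site and would need replicates before a shift could be interpreted.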

FIGURE 5.3: Activity-induced changes in alternative polyadenylation. In this example, activation of a specific signaling cascade leads to increased production (from 50% to 100%) of alternative polyA variants with a long 3ʹUTR.

RNA Misregulation of Neurological Disease

Given the prevalence of alternative processing and its ability to alter gene output, it is not surprising that defects in alternative processing have been identified in a wide range of human diseases (for reviews, see Licatalosi & Darnell 2006, Danckwardt et al. 2008, and Cooper et al. 2009). Several direct links have been established between neurologic disease and either (1) cis-acting mutations in pre-mRNA sequences that are necessary for efficient control of exon splicing or (2) trans-acting mutations, in which splicing is misregulated as a result of aberrant levels of a critical trans-acting splicing regulatory RBP. Examples of diseases with causative splicing defects include spinal muscular atrophy, neurofibromatosis, myotonic dystrophy, and frontotemporal dementia with Parkinsonism linked to chromosome 17 (FTDP-17) (for additional examples, see Licatalosi & Darnell 2006). In other diseases, altered patterns of alternative mRNA expression have been described; however, the direct causes of these changes and their contribution to disease progression and pathology have not been established.

Traditional Tools for the Study of Alternative mRNAs

Much of our understanding of the molecular controls and functional significance of alternative processing comes from studies in which genes and their protein products were interrogated in isolation (one at a time) and often in different cellular contexts or in vitro conditions. Two widely used approaches to assess the functions of specific cis and trans factors important for mRNA processing are in vitro processing assays and minigene reporter assays in cell culture. In both systems, a processing event is monitored in a reporter RNA bearing an alternative exon or polyA site under different conditions, such as in the presence or absence of a candidate regulatory RBP or when putative regulatory sequences are mutated. While such approaches have been instrumental in uncovering basic strategies of alternative splicing and polyadenylation regulation (Chen & Manley 2009), they are often not well suited to recapitulate complex processing reactions that occur in vivo. In addition, both in vitro processing assays and minigene reporter assays are labor-intensive and need to be optimized for each individual RNA substrate, making them unsuitable for investigating large numbers of processing events at one time. As a result, mechanistic details have been obtained for only a small number of mRNA substrates. What are the global networks of alternative mRNA processing events necessary for proper nervous system development and function, and how do defects in mRNA processing contribute to and/or drive neurological disease? In the next section, powerful new transcriptome profiling methods are described that, when combined with genetic and biochemical tools, are beginning to provide new insights into the roles of alternative mRNA processing in neurobiology and mechanisms of mRNA regulation.

GLOBAL ANALYSES OF ALTERNATIVE mRNA REGULATION

Recent technological advancements have made it possible to characterize whole transcriptomes or global RNA-protein interactions with nucleotide resolution in physiological contexts. Applying these methods to the study of the nervous system is yielding new transcriptome-wide and high-resolution views of the complexity of mRNAs expressed in healthy and diseased tissue and providing new insights into underlying molecular mechanisms (Licatalosi & Darnell 2010). These methods, and representative examples of how they have advanced knowledge of gene regulatory mechanisms in the nervous system, are described below.

Microarray Analysis of Alternative mRNAs

Microarray technology has played a remarkable role in shaping our understanding of transcriptome regulation and complexity. The development of “exon arrays” (Clark et al. 2007) and “exon-junction arrays” (Johnson et al. 2003; Pan et al. 2004) has extended the application of this technology to the alternative processing field. Exon and exon-junction arrays differ significantly from the prototype 3ʹ expression microarrays in the number and placement of oligonucleotide probes (Figure 5.4). For example, on the widely used Affymetrix Human Exon Array 1.0 ST, there are 1.4 million probesets corresponding to all annotated and putative human exons, thus making it possible to globally quantify the expression of various mRNA splice isoforms (Clark et al. 2007). In some cases, multiple probesets are available in different regions of 3ʹUTRs to allow analyses of alternatively polyadenylated variants (Licatalosi et al. 2008; Sandberg et al. 2008). Although powerful in detecting exon- and 3ʹUTR-level changes, exon arrays fail to provide sufficient information to elucidate exon-exon connectivity. This limitation was partly overcome through the development of exon-junction arrays that include probesets complementary to the sequences formed when specific exons are skipped or spliced. Custom
exon-junction arrays have facilitated the analysis of tissue-specific splicing in mammals (Johnson et al. 2003; Pan et al. 2004). Comparative analyses using microarray datasets have confirmed earlier bioinformatic studies (of limited numbers of ESTs) in reporting that the brain expresses the greatest number of alternative mRNA variants relative to other tissues (Clark et al. 2007; Fagnani et al. 2007; Modrek et al. 2001; Yeo et al. 2004, 2005). In addition, exon array analyses have uncovered unexpected similarities in mRNA splicing patterns observed in human medulloblastoma and the developing mouse cerebellum, suggesting links to developmental signaling pathways (Menghi et al. 2011). While microarrays have provided insight into alternative processing patterns and their control, both exon and exon-junction arrays require prior knowledge of gene structure and therefore are limited to the interrogation of previously annotated or predicted alternative processing events. In addition, microarray technologies suffer from the common limitations associated with hybridization-based nucleic acid quantification, including hybridization noise, limited probe specificity, and a narrow dynamic range of signal intensities. These limitations can be addressed by a newer technology, high-throughput RNA sequencing.

FIGURE 5.4: Comparison of microarray platforms and detection of alternatively processed mRNAs. Probeset features for each microarray type are mapped onto a representative transcript to indicate which common and alternative mRNA sequences can be interrogated by probesets present on different microarray platforms.
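One simple way to use exon-level probesets is to normalize each exon signal to the signal of its gene and compare the result between two samples, a quantity often called a splicing index. The sketch below is an illustration under stated assumptions, not a method described in this chapter: the function name and intensity values are hypothetical, and intensities are assumed to be background-corrected and on a linear scale.

```python
import math


def splicing_index(exon_a: float, gene_a: float,
                   exon_b: float, gene_b: float) -> float:
    """log2 ratio of gene-normalized exon signal in sample A versus sample B."""
    if min(exon_a, gene_a, exon_b, gene_b) <= 0:
        raise ValueError("intensities must be positive")
    return math.log2((exon_a / gene_a) / (exon_b / gene_b))


# Hypothetical intensities: the gene is equally expressed in both samples,
# but the exon signal is four times higher in sample A.
si = splicing_index(exon_a=512.0, gene_a=1024.0, exon_b=128.0, gene_b=1024.0)
print(f"splicing index = {si:.1f}")
```

A value near zero would indicate that the exon tracks overall gene expression. Values far from zero flag candidate alternative exons, which would still require validation given the hybridization limitations noted above.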

Next-Generation RNA Sequencing (RNA-seq)

RNA-seq takes advantage of the recently developed high-throughput DNA sequencing technologies to map and quantify transcriptomes (for a review, see Wang et al. 2009). In general, total RNA or a subpopulation of RNA in a sample is converted to cDNA fragments with adapters attached to both ends. Millions of these fragments are simultaneously amplified by PCR and sequenced in a high-throughput manner to generate sequence reads corresponding to each cDNA fragment. In contrast to microarrays, RNA-seq eliminates the need for prior knowledge of genome sequence and annotation and generates a relatively unbiased and high-resolution digital readout of the transcriptome. Known or novel transcript structures, such as the usage of cryptic alternative splice or polyadenylation sites, can be readily detected in RNA-seq datasets (Pan et al. 2008; Wang et al. 2008) (Figure 5.5). In addition, RNA-seq allows for absolute measurement of RNA copy numbers, making it possible to compare transcript levels across separate experiments as well as various tissues. Numerous studies have demonstrated that RNA-seq provides highly reproducible and quantitative measurements of transcript abundance (Wang et al. 2009). Side-by-side comparisons between RNA-seq and microarray datasets have also illustrated that RNA-seq can reliably measure known exon inclusion rates with the added power to detect previously unannotated splice junctions (Liu et al. 2011; Pan et al. 2008).

Consistent with other methods, RNA-seq analysis identified nervous system tissue as having the most complex transcriptome, characterized by high levels of transcription and alternative RNA processing that are regulated in a temporal and spatial manner (Pan et al. 2008; Wang et al. 2008). The importance of this detailed molecular specification for normal brain function was underscored in a recent study, in which Voineagu and colleagues performed RNA-seq on three cortical regions in autistic versus control brains and uncovered a link between autism and loss of brain region-specific transcription (Voineagu et al. 2011). Further analysis of this RNA-seq dataset also revealed two co-expression gene modules in autistic brains, one of which centered around the RBP RBFOX1, a known autism-susceptibility factor and regulator of alternative mRNA processing. Interestingly, misregulated splicing of RBFOX1-target mRNAs was observed in autistic brains, suggesting that accurate control of alternative RNA processing may support higher brain functions.

FIGURE 5.5: RNA-seq analysis of alternative processing. In this example, exon-body and exon-spanning sequencing reads (black boxes and separated black boxes, respectively) are aligned to the gene. Relative differences in the density of reads in different regions of the gene indicate that exon b is spliced in a subset of mRNAs and that the majority of the mRNAs are processed at the first of two polyA sites in the 3ʹUTR.
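The exon-spanning reads illustrated in Figure 5.5 can be summarized as an exon inclusion level, commonly reported as percent spliced in (PSI). The sketch below assumes hypothetical junction read counts for a cassette exon b flanked by exons a and c. The function name and the choice to average the two inclusion junctions are illustrative assumptions rather than the procedure used in the cited studies.

```python
def percent_spliced_in(incl_upstream: int, incl_downstream: int,
                       skipping: int) -> float:
    """Estimate exon inclusion (0 to 1) from exon-spanning read counts.

    incl_upstream   reads spanning the a-b junction
    incl_downstream reads spanning the b-c junction
    skipping        reads spanning the a-c junction (exon b skipped)
    """
    # Inclusion is supported by two junctions and skipping by one, so the
    # inclusion counts are averaged to keep the two isoforms comparable.
    inclusion = (incl_upstream + incl_downstream) / 2
    total = inclusion + skipping
    if total == 0:
        raise ValueError("no junction reads for this exon")
    return inclusion / total


# Hypothetical counts: exon b is included in a minority of transcripts.
psi = percent_spliced_in(incl_upstream=30, incl_downstream=30, skipping=90)
print(f"PSI = {psi:.2f}")
```

Estimates based on few junction reads are noisy, which is one reason sequencing depth matters for detecting alternative processing events.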

Global Maps of RBP-RNA Interactions in Vivo

The development of HITS-CLIP strategies (high-throughput sequencing following crosslinking and immunoprecipitation) created a breakthrough in mapping global RNA-protein interactions in vivo (Darnell 2010). In HITS-CLIP, whole tissues or intact cells are treated with UV-C radiation, which covalently “freezes” RNA and proteins that are in direct contact. These covalent bonds allow RNA to be coimmunoprecipitated with bound protein under very stringent conditions. During this process, RNA is intentionally fragmented to allow recovery of only small RNA regions (30 to 50 bases) where the protein of interest directly binds. The immunoprecipitated RNA-protein complexes are size-resolved by electrophoresis, and the RNA is purified and converted to a cDNA library suitable for high-throughput sequencing. The resulting sequence reads are mapped to the genome to reveal transcriptome-wide interaction sites for the protein of interest (Darnell 2012) (Figure 5.6). To date more than 20 genome-wide RBP-RNA interaction maps have been generated using HITS-CLIP (also known as CLIP-seq), including those of Ptbp1 and 2, Fus, TDP-43, Mbnl, and Rbfox proteins (Darnell 2012). While the biochemical maps derived from CLIP provide unprecedented global views of RBP-RNA interactions, additional analyses are necessary to identify interactions that are functional and not merely opportunistic. As described in the next section, integrative analysis of CLIP datasets with datasets derived from microarray and/or RNA-seq presents a powerful approach to understanding the direct sites of action of specific RBPs and to identifying general rules associated with alternative mRNA control in vivo.
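After mapping, overlapping CLIP reads can be grouped into clusters that mark candidate binding sites. The following is a minimal sketch of that grouping step, assuming reads on a single chromosome and strand given as hypothetical half-open coordinate pairs. The function `call_clusters` and its `min_reads` threshold are illustrative and do not reproduce any published HITS-CLIP pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]      # half-open [start, end) genomic coordinates
Cluster = Tuple[int, int, int]  # (start, end, number of reads)


def call_clusters(reads: List[Interval], min_reads: int = 2) -> List[Cluster]:
    """Merge overlapping reads and keep clusters with enough support."""
    clusters: List[Cluster] = []
    for start, end in sorted(reads):
        if clusters and start < clusters[-1][1]:
            # Read overlaps the current cluster: extend it and count the read.
            c_start, c_end, n = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), n + 1)
        else:
            clusters.append((start, end, 1))
    return [c for c in clusters if c[2] >= min_reads]


# Hypothetical mapped reads, each 35 to 40 bases long.
reads = [(100, 140), (110, 150), (125, 160),
         (500, 540),
         (900, 935), (920, 960)]
clusters = call_clusters(reads)
print(clusters)
```

Here the isolated read at position 500 is discarded and two clusters remain. A real analysis would also need to handle duplicate reads, multiple chromosomes and strands, and some assessment of whether a cluster exceeds background.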

FIGURE 5.6: Overview of the HITS-CLIP assay. In this hypothetical example, HITS-CLIP analysis reveals that a specific RBP (black circle) binds only one of two possible mRNAs and that this interaction occurs in the 3ʹUTR.

Integrative Analyses and New Insights

Each of the high-throughput approaches described above yields expansive datasets; thus there is a pressing need for data analysis and integration through computational biology. Bioinformatic
analyses can range from relatively straightforward tasks, such as estimating gene expression levels from RNA-seq or microarray data, to more advanced applications, such as Bayesian network (Zhang et al. 2010) and machine learning tools that predict the global RNA targets of a specific RBP or that characterize sequence features (or “codes”) associated with tissue-specific alternative processing events (Barash et al. 2010). Bioinformatic integration of datasets generated using a combination of omic methods is a powerful approach to gain new insights into the roles of specific RBPs and mechanisms of mRNA regulation in vivo. For example, intersecting CLIP datasets corresponding to the neuron-specific RBP Nova2 with RNA profiling (microarray) data from wild-type and Nova2-knockout mouse brain showed that the position of Nova2 binding determines whether it will be a positive or negative regulator of alternative splicing and alternative polyadenylation (Licatalosi et al. 2008). Subsequent studies have revealed that many RBPs have position-dependent effects on alternative mRNA processing (Charizanis et al. 2012; Ince-Dunn et al. 2012; Licatalosi et al. 2012; Xue et al. 2009; Yeo et al. 2009). Additional analyses have indicated that tissue-restricted RBPs can bind and regulate networks of alternative mRNAs that encode functionally related proteins (Huang et al. 2005; Licatalosi et al. 2012; Ule et al. 2005).

This section focuses on recent reports that highlight the power of integrating different RNA-omics approaches to study the RBPs TDP-43 and FUS and their roles in the misregulation of RNA processing in amyotrophic lateral sclerosis (ALS). ALS is the most common adult-onset motor neuron disease. Mutations leading to ALS are highly heterogeneous, which contributes to the lack of a direct diagnostic method for the disease. Landmark discoveries in 2006 revealed that an RNA-binding protein, TDP-43, was a major component of the cytoplasmic ubiquitinated inclusions found in affected neurons in all sporadic ALS patients (Arai et al. 2006; Neumann et al. 2006). Subsequently, mutations in TDP-43 were linked to a subset of ALS, supporting a causative role for TDP-43 in ALS pathogenesis (Gitcho et al. 2008; Kabashi et al. 2008). In healthy neurons, TDP-43 is predominantly nuclear, while in affected neurons, ubiquitinated wild-type or mutant TDP-43 aggregates and is excluded from the nucleus (Giordana et al. 2010; Neumann et al. 2006; Nonaka et al. 2009). The identification of TDP-43 as a major player in ALS pathogenesis was followed by the discovery that a second RNA-binding protein, FUS, is also pathogenetically linked to ALS (Kwiatkowski et al. 2009; Vance et al. 2009). Like TDP-43, FUS is localized mainly in the nuclei of healthy neurons. However, ALS-linked FUS mutations predominantly disrupt the nuclear localization sequence (NLS), resulting in cytoplasmic FUS accumulation and aggregate formation (Kwiatkowski et al. 2009; Vance et al. 2009). Intriguingly, the degree of nuclear import impairment caused by different ALS-linked FUS mutations is inversely correlated with the age of ALS onset (Dormann et al. 2010). The mislocalization of both TDP-43 and FUS from the nucleus to the cytoplasm in affected neurons raises the possibility that loss of their normal nuclear function and/or gain of toxicity in the cytoplasm contributes to ALS pathogenesis.

CLIP was used to identify the genome-wide RNA targets for TDP-43 and FUS in mouse and human brain. TDP-43 was found to preferentially bind intronic clusters of GU-rich sequences in at least 30% of protein-coding transcripts in the mouse transcriptome (Polymenidou et al. 2011; Tollervey et al. 2011). To identify mRNAs whose steady-state levels and/or processing patterns are dependent on TDP-43, RNA-seq or microarray analyses were performed on mouse brain or human neuronal cell culture following TDP-43 knockdown. Combining these datasets with CLIP revealed that TDP-43 bound and sustained the levels of long intron-containing transcripts, including some that are crucial for synaptic functions and neuronal survival (Polymenidou et al. 2011). In addition, depletion of TDP-43 led to alternative splicing changes in a variety of human and mouse transcripts, including its own transcript (Polymenidou et al. 2011; Tollervey et al. 2011). Interestingly, TDP-43 autoregulates its transcript level by binding to its 3ʹUTR and promoting a splice isoform subjected to nonsense-mediated mRNA decay (NMD) (Polymenidou et al. 2011). This autoregulatory loop may participate in the formation of cytoplasmic TDP-43 aggregates and subsequent neuronal death in ALS patients. One possible scenario is that an initial trapping of TDP-43 in the cytoplasm (for example, by stress granules) acts as a “sink” and sets up a feed-forward cycle that results in unchecked TDP-43 production and inclusion formation, which may threaten the survival of neuronal cells.

CLIP revealed a much more widespread RNA-protein interaction map for FUS, with FUS binding along the whole length of most nascent RNAs with very limited sequence specificity (Ishigaki et al. 2012). FUS binding is globally enriched around alternative exons and, not surprisingly, FUS depletion resulted in alternative splicing changes in a few hundred genes, including MAPT, which is involved in a number of neurodegenerative diseases such as Alzheimer’s disease and frontotemporal lobar degeneration (Ishigaki et al. 2012). Gene ontology analysis has revealed that FUS-regulated exons are highly enriched in genes involved in cell-cell adhesion, apoptosis inhibition, and neuronal development and projection, suggesting a potential role for FUS in promoting neuronal survival and communication (Ishigaki et al. 2012). Depletion of FUS also led to reduced abundance of transcripts containing exceptionally long introns, which interestingly mirrors the effect of TDP-43 depletion (Ishigaki et al. 2012). This similarity converged on a small number of long pre-mRNAs whose levels were sustained by both TDP-43 and FUS. Intriguingly, levels of the proteins encoded by these genes were markedly reduced in sporadic ALS neurons with TDP-43 inclusions, supporting the hypothesis that FUS and TDP-43 loss of function contributes to ALS pathogenesis (Ishigaki et al. 2012). This result also provides a molecular connection between two distinct subclasses of ALS, making it tempting to test whether these common TDP-43 and FUS targets could serve as biomarkers for an extended range of ALS.

The studies of TDP-43 and FUS represent excellent examples of how bioinformatic integration of datasets from new RNA-centric high-throughput approaches can shed new light on molecular mechanisms of gene regulation in vivo and on how mRNA misregulation contributes to disease. Other successful examples include the identification of alternative splicing events dependent on Elav-like and Muscleblind-like 2 proteins and their implication in paraneoplastic neurological disorders and myotonic dystrophy, respectively (Charizanis et al. 2012; Ince-Dunn et al. 2012).
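The integrative step described in this section, intersecting CLIP binding sites with splicing changes observed after loss of an RBP, can be sketched as a small table of binding position versus direction of change. Everything in this example is hypothetical (the exon coordinates, the binding sites, the 200-nucleotide window, and the 0.1 threshold), and it assumes plus-strand coordinates on one chromosome. It illustrates the bookkeeping only and is not the analysis used in the cited studies.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Exon:
    name: str
    start: int        # first base of the exon (plus strand)
    end: int          # one past the last base of the exon
    delta_psi: float  # inclusion in knockout minus inclusion in wild type


def binding_region(exon: Exon, site: int, window: int = 200) -> Optional[str]:
    """Classify a binding site relative to an alternative exon."""
    if exon.start <= site < exon.end:
        return "within"
    if exon.start - window <= site < exon.start:
        return "upstream"
    if exon.end <= site < exon.end + window:
        return "downstream"
    return None  # too far away to be assigned to this exon


def position_map(exons: Iterable[Exon], sites: Iterable[int],
                 min_change: float = 0.1) -> Counter:
    """Count (binding region, inferred RBP effect) pairs for changed exons."""
    sites = list(sites)
    table: Counter = Counter()
    for exon in exons:
        if abs(exon.delta_psi) < min_change:
            continue  # exon inclusion does not detectably depend on the RBP
        # More inclusion without the RBP implies the RBP normally represses.
        effect = "represses" if exon.delta_psi > 0 else "enhances"
        regions = {binding_region(exon, s) for s in sites} - {None}
        for region in regions:
            table[(region, effect)] += 1
    return table


exons = [Exon("e1", 1000, 1100, -0.4),
         Exon("e2", 5000, 5120, 0.3),
         Exon("e3", 9000, 9080, 0.02)]
sites = [1150, 4950, 9010]
table = position_map(exons, sites)
for key, count in sorted(table.items()):
    print(key, count)
```

With real data, a consistent association between binding position and direction of change across many exons is the kind of pattern that supports a position-dependent model. Bound exons that do not change (such as the hypothetical e3) illustrate why binding alone does not establish function.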

ALTERNATIVE mRNA-OMICS IN THE COMING DECADE

Current omics approaches, such as RNA-seq and HITS-CLIP, rely heavily on high-throughput nucleic acid sequencing technologies, a field undergoing rapid evolution and development. In fact, so many innovations have been made in sequencing technologies that the cost of DNA sequencing has decreased at a rate that exceeds, several times over, that predicted by Moore’s Law. Currently, over 100 million reads of 30 to 100 nucleotides or over one million longer reads (up to 1 kb) can be generated in one run at a cost of a few cents per megabase. Even at this extremely low per-base rate, comprehensive transcriptome profiling remains cost-prohibitive, because an estimated 700 million reads are necessary to detect alternative RNA processing events in more than 95% of expressed transcripts (Blencowe et al. 2009). With continued lowering of sequencing cost, it is reasonable to predict that affordable methods will be available to generate enough sequence depth for comprehensive transcriptome profiling in the near future. Advances in direct single-molecule sequencing methods will also improve the quality of sequencing datasets by avoiding library preparation steps, such as adapter ligation and PCR, that can introduce bias and reduce library complexity. Applying these technologies to RNA-seq and HITS-CLIP will greatly boost the visibility of rare transcripts and low-abundance alternative mRNA variants, thus providing more comprehensive pictures of transcriptomes and protein-RNA interactomes.

More powerful and cheaper sequencing technologies will allow widespread clinical application of disease-related transcriptome profiling, which will start a new paradigm in pathogenesis research. Identifying signature pathways affected by combinations of mutations in a common disease may provide vital clues to disease pathogenesis as well as potential therapeutic targets. As mentioned in the previous section, transcriptome profiling on a collection of autistic brain samples revealed that RBFOX1 lies in the center of a cohort of affected pathways in autism. This marks the beginning of utilizing a pathway-based framework to assess the enrichment of disease-affected genes. Broader use of omic approaches in the next decade will generate an extensive catalog of transcriptomes and protein-RNA interaction maps in a variety of tissues, developmental stages, organisms, and diseases. These datasets will enable the building of more inclusive and quantitative models for predicting alternative mRNA processing profiles and their functional roles and consequences in different biologic contexts. With the integration of personal genomics, it can be envisioned that disease-related splicing and/or polyadenylation misregulation can be predicted from patients’ genomic information.
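The sequencing-depth argument in this section can be made concrete with a back-of-envelope calculation. The target of 700 million reads and the figure of 100 million reads per run come from the text. The read length of 100 nucleotides (the upper end of the stated range) and the price of 5 cents per megabase (one reading of “a few cents”) are assumptions made here purely for illustration.

```python
# Back-of-envelope sequencing cost for one comprehensive transcriptome.
READS_NEEDED = 700_000_000   # from the text (Blencowe et al. 2009)
READS_PER_RUN = 100_000_000  # from the text: over 100 million reads per run
READ_LENGTH_NT = 100         # assumption: upper end of the 30 to 100 nt range
CENTS_PER_MEGABASE = 5       # assumption: one reading of "a few cents"

megabases = READS_NEEDED * READ_LENGTH_NT // 1_000_000
cost_usd = megabases * CENTS_PER_MEGABASE / 100
runs = -(-READS_NEEDED // READS_PER_RUN)  # ceiling division

print(f"{megabases:,} Mb, {runs} runs, about ${cost_usd:,.0f} per sample")
```

Under these assumptions a single sample requires about 70,000 Mb of sequence across seven runs, at roughly $3,500 for sequencing alone. Multiplied over replicates, tissues, and conditions, this is the scale at which a low per-base price still adds up.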

CONCLUSION AND SUMMARY

Our understanding of alternative mRNA processing has come a long way since the early 1980s. Classical biochemistry and molecular biology methods have been irreplaceable in illuminating the underlying mechanisms in a case-by-case manner, while novel “omics” technologies have brought to the surface a vast number of genome-scale discoveries beyond the reach of traditional methods. The extent to which new transcriptome and protein-RNA interaction profiling technologies developed in the last 10 years have revolutionized the field of RNA research can rarely be matched in the modern history of science. As a result, more complete descriptions of the multitude of alternative RNA products expressed in the nervous system and new insights into mechanisms of their regulation have emerged. Modern biology is undergoing a transition from descriptive and qualitative disciplines to quantitative and predictive ones, where “omics” approaches combined with bioinformatics are the driving force. The application of these approaches to a wider range of factors, cell types, and diseases in the near future will yield a more comprehensive view of the mechanism and dynamics of alternative mRNA processing and transform our understanding of mRNA-regulatory networks and how they affect neurobiology.

REFERENCES

Arai T, Hasegawa M, Akiyama H, Ikeda K, Nonaka T, Mori H, Mann D, . . . Oda T (2006). TDP-43 is a component of ubiquitin-positive tau-negative inclusions in frontotemporal lobar degeneration and amyotrophic lateral sclerosis. Biochem Biophys Res Commun 351, 602–611.
Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, . . . Frey BJ (2010). Deciphering the splicing code. Nature 465, 53–59.
Blencowe BJ, Ahmad S, Lee LJ (2009). Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev 23, 1379–1386.
Charizanis K, Lee KY, Batra R, Goodwin M, Zhang C, Yuan Y, . . . Swanson MS (2012). Muscleblind-like 2-mediated alternative splicing in the developing brain and dysregulation in myotonic dystrophy. Neuron 75, 437–450.
Chen M, Manley JL (2009). Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat Rev Mol Cell Biol 10, 741–754.
Clark TA, Schweitzer AC, Chen TX, Staples MK, Lu G, Wang H, . . . Blume JE (2007). Discovery of tissue-specific exons using comprehensive human exon microarrays. Genome Biol 8, R64.
Cooper TA, Wan L, Dreyfuss G (2009). RNA and disease. Cell 136, 777–793.
Crick F (1979). Split genes and RNA splicing. Science 204, 264–271.
Danckwardt S, Hentze MW, Kulozik AE (2008). 3ʹ end mRNA processing: molecular mechanisms and implications for health and disease. EMBO J 27, 482–498.
Darnell R (2012). CLIP (cross-linking and immunoprecipitation) identification of RNAs bound by a specific protein. Cold Spring Harb Protoc 2012, 1146–1160.
Darnell RB (2010). HITS-CLIP: panoramic views of protein-RNA regulation in living cells. Wiley Interdiscip Rev RNA 1, 266–286.
Di Giammartino DC, Nishida K, Manley JL (2011). Mechanisms and consequences of alternative polyadenylation. Mol Cell 43, 853–866.
Dormann D, Rodde R, Edbauer D, Bentmann E, Fischer I, Hruscha A, . . . Haass C (2010). ALS-associated fused in sarcoma (FUS) mutations disrupt Transportin-mediated nuclear import. EMBO J 29, 2841–2857.
Fagnani M, Barash Y, Ip JY, Misquitta C, Pan Q, Saltzman AL, . . . Blencowe BJ (2007). Functional coordination of alternative splicing in the mammalian central nervous system. Genome Biol 8, R108.
Flavell SW, Kim TK, Gray JM, Harmin DA, Hemberg M, Hong EJ, . . . Greenberg ME (2008). Genome-wide analysis of MEF2 transcriptional program reveals synaptic target genes and neuronal activity-dependent polyadenylation site selection. Neuron 60, 1022–1038.
Gilbert W (1978). Why genes in pieces? Nature 271, 501.
Giordana MT, Piccinini M, Grifoni S, De Marco G, Vercellino M, Magistrello M, . . . Rinaudo MT (2010). TDP-43 redistribution is an early event in sporadic amyotrophic lateral sclerosis. Brain Pathol 20, 351–360.
Gitcho MA, Baloh RH, Chakraverty S, Mayo K, Norton JB, Levitch D, . . . Cairns NJ (2008). TDP-43 A315T mutation in familial motor neuron disease. Ann Neurol 63, 535–538.
Grabowski PJ, Black DL (2001). Alternative RNA splicing in the nervous system. Prog Neurobiol 65, 289–308.
Huang CS, Shi SH, Ule J, Ruggiu M, Barker LA, Darnell RB, . . . Jan LY (2005). Common molecular pathways mediate long-term potentiation of synaptic excitation and slow synaptic inhibition. Cell 123, 105–118.
Iijima T, Wu K, Witte H, Hanno-Iijima Y, Glatter T, Richard S, Scheiffele P (2011). SAM68 regulates neuronal activity-dependent alternative splicing of neurexin-1. Cell 147, 1601–1614.
Ince-Dunn G, Okano HJ, Jensen KB, Park WY, Zhong R, Ule J, . . . Darnell RB (2012). Neuronal Elav-like (Hu) proteins regulate RNA splicing and abundance to control glutamate levels and neuronal excitability. Neuron 75, 1067–1080.
Ishigaki S, Masuda A, Fujioka Y, Iguchi Y, Katsuno M, Shibata A, . . . Ohno K (2012). Position-dependent FUS-RNA interactions regulate alternative splicing events and transcriptions. Sci Rep 2, 529.
Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, . . . Shoemaker DD (2003). Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144.
Kabashi E, Valdmanis PN, Dion P, Spiegelman D, McConkey BJ, Vande Velde C, . . . Rouleau GA (2008). TARDBP mutations in individuals with sporadic and familial amyotrophic lateral sclerosis. Nat Genet 40, 572–574.
Kwiatkowski TJJ, Bosco DA, Leclerc AL, Tamrazian E, Vanderburg CR, Russ C, . . . Brown RHJ (2009). Mutations in the FUS/TLS gene on chromosome 16 cause familial amyotrophic lateral sclerosis. Science 323, 1205–1208.
Lau AG, Irier HA, Gu J, Tian D, Ku L, Liu G, . . . Feng Y (2010). Distinct 3ʹUTRs differentially regulate activity-dependent translation of brain-derived neurotrophic factor (BDNF). Proc Natl Acad Sci U S A 107, 15945–15950.
Li Q, Lee JA, Black DL (2007). Neuronal regulation of alternative pre-mRNA splicing. Nat Rev Neurosci 8, 819–831.
Licatalosi DD, Darnell RB (2006). Splicing regulation in neurologic disease. Neuron 52, 93–101.
Licatalosi DD, Darnell RB (2010). RNA processing and its regulation, global insights into biological networks. Nat Rev Genet 11, 75–87.
Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, . . . Darnell RB (2008). HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456, 464–469. Licatalosi DD, Yano M, Fak JJ, Mele A, Grabinski SE, Zhang C, Darnell RB (2012). Ptbp2 represses adult-specific splicing to regulate the generation of neuronal precursors in the embryonic brain. Genes Dev 26, 1626–1642. Liu S, Lin L, Jiang P, Wang D, Xing Y (2011). A comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species. Nucleic Acids Res 39, 578–588. Matlin AJ, Clark F, Smith CW (2005). Understanding alternative splicing:  towards a cellular code. Nat Rev Mol Cell Biol 6, 386–398. Menghi F, Jacques TS, Barenco M, Schwalbe EC, Clifford SC, Hubank M, Ham J (2011). Genome-wide analysis of alternative splicing in medulloblastoma identifies splicing patterns characteristic of normal cerebellar development. Cancer Res 71, 2045–2055. Merkin J, Russell C, Chen P, Burge CB (2012). Evolutionary dynamics of gene and isoform regulation in Mammalian tissues. Science 338, 1593–1599. Modrek B, Resch A, Grasso C, Lee C (2001). Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res 29, 2850–2859. Neumann M, Sampathu DM, Kwong LK, Truax AC, Micsenyi MC, Chou TT, . . . Lee VM (2006). Ubiquitinated TDP-43 in frontotemporal lobar degeneration and amyotrophic lateral sclerosis. Science 314, 130–133. Nonaka T, Kametani F, Arai T, Akiyama H, Hasegawa M (2009). Truncation and pathogenic mutations facilitate the formation of intracellular aggregates of TDP-43. Hum Mol Genet 18, 3353–3364. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40, 1413–1415. Pan Q, Shai O, Misquitta C, Zhang W, Saltzman AL, Mohammad N, . . . Blencowe BJ (2004). 
Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol Cell 16, 929–941. Polymenidou M, Lagier-Tourenne C, Hutt KR, Huelga SC, Moran J, Liang TY, . . . Cleveland DW (2011). Long pre-mRNA depletion and RNA missplicing contribute to neuronal vulnerability from loss of TDP-43. Nat Neurosci 14, 459–468.


Resch A, Xing Y, Alekseyenko A, Modrek B, Lee C (2004). Evidence for a subpopulation of conserved alternative splicing events under selection pressure for protein reading frame preservation. Nucleic Acids Res 32, 1261–1269. Sandberg R, Neilson JR, Sarma A, Sharp PA, Burge CB (2008). Proliferating cells express mRNAs with shortened 3’ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647. Tollervey JR, Curk T, Rogelj B, Briese M, Cereda M, Kayikci M, . . . Ule J (2011). Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. Nat Neurosci 14, 452–458. Ule J, Ule A, Spencer J, Williams A, Hu JS, Cline M, . . . Darnell RB (2005). Nova regulates brain-specific splicing to shape the synapse. Nat Genet 37, 844–852. Vance C, Rogelj B, Hortobagyi T, De Vos KJ, Nishimura AL, Sreedharan J, . . . Shaw CE (2009). Mutations in FUS, an RNA processing protein, cause familial amyotrophic lateral sclerosis type 6. Science 323, 1208–1211. Voineagu I, Wang X, Johnston P, Lowe JK, Tian Y, Horvath S, . . . Geschwind DH (2011). Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380–384. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, . . . Burge CB (2008). Alternative isoform

regulation in human tissue transcriptomes. Nature 456, 470–476. Wang Z, Burge CB (2008). Splicing regulation:  from a parts list of regulatory elements to an integrated splicing code. RNA 14, 802–813. Wang Z, Gerstein M, Snyder M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57–63. Xue Y, Zhou Y, Wu T, Zhu T, Ji X, Kwon YS, . . . Zhang Y (2009). Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Mol Cell 36, 996–1006. Yeo G, Holste D, Kreiman G, Burge CB (2004). Variation in alternative splicing across human tissues. Genome Biol 5, R74. Yeo GW, Coufal NG, Liang TY, Peng GE, Fu XD, Gage FH (2009). An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat Struct Mol Biol 16, 130–137. Zhang C, Frias MA, Mele A, Ruggiu M, Eom T, Marney CB, . . . Darnell RB (2010). Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. Science 329, 439–443. Zhang H, Lee JY, Tian B (2005). Biased alternative polyadenylation in human tissues. Genome Biol 6, R100.

6 Transcriptomics: From Differential Expression to Coexpression

MICHAEL C. OLDHAM

INTRODUCTION

It is fair to say that every “-omics” approach for biological inquiry has its root in the emergence of new technologies, and “transcriptomics” is no exception. The birth of transcriptomics was the inevitable result of efforts to quantify the expression levels of many genes in parallel, which took off in earnest with the commercialization of microarray technology in the 1990s. Although technologies for measuring gene expression levels have evolved from univariate to multivariate data acquisition, a legacy of univariate thinking has persisted in the analysis of transcriptomic data. It is only recently that this situation has begun to change, as an increasing number of biologists begin to utilize multivariate analysis techniques to extract the enormous amount of information that is captured by microarray (and now, RNA-seq) experiments.

In this chapter I discuss the emergence of gene coexpression analysis as a technique that can (and should) be used to increase our understanding of the molecular and cellular organization of the nervous system. I argue that by considering gene activity only in terms of differential expression (i.e., whether a gene is expressed, on average, significantly higher or lower in one sample group versus another), an enormous amount of information that is captured by transcriptomic experiments remains invisible. To access this information, biologists must adopt a new mindset for transcriptomic data analysis. Instead of asking which genes are differentially expressed between sample cohorts in a particular biological system, coexpression analysis asks: “What are the most salient patterns of gene activity in my biological system, what do they mean, and how are they organized?”

The use of coexpression analysis holds promise for many areas of neuroscientific research. By leveraging the inherent variability in gene expression that exists among neurobiological replicate samples, expression levels for tens of thousands of transcripts can be collapsed into a much smaller number of coexpression “modules.” Characterization of these modules can offer insights into disparate functional processes that are encoded in the genome, including transcriptional programs that are associated with distinct cell types. Furthermore, the robust and reproducible nature of coexpression modules in neurobiological tissues provides a natural framework for annotating gene function and studying perturbations in gene activity caused by pathological conditions.

This chapter is organized into three parts. The first part begins with a historical perspective on the emergence of modern technologies (namely, microarrays and RNA-seq) for measuring gene activity. I then discuss the emergence and persistence of differential expression analysis as the dominant paradigm for interpreting the results of transcriptomic experiments, despite the assumptions and challenges that are inherent to this approach. These assumptions and challenges are then discussed in detail, particularly in the context of neuroscientific studies. The second part provides a brief introduction to some of the multivariate techniques that can be used to analyze transcriptomic datasets, with an emphasis on the use of network methods for analyzing gene coexpression relationships. I describe the advantages of integrating analysis of differential expression with coexpression, followed by examination of alternative experimental designs and the types of biological questions that are well suited for coexpression studies. I then discuss some of the pioneering studies that have analyzed gene coexpression relationships in neurobiological systems along with specific examples of insights obtained using this approach. The third part enumerates the current challenges and limitations of gene coexpression analysis, followed by consideration of possible trends and directions for neuroscientific applications of this technique in the coming decade.

Although the technique of coexpression analysis is still in its infancy, its application has already revealed that substantially more biological information exists than has previously been appreciated in transcriptomic datasets generated from neurobiological systems. Thus, while analysis of differential expression provides a simple point of entry for transcriptomics in neuroscience, it no longer provides the only point of entry. By analyzing gene coexpression relationships in neurobiological systems, investigators are poised to discern fundamental insights about the nature of cellular identity, the extent of cellular heterogeneity, and the spectrum of functional processes that are at work in the nervous system.

UNDERSTANDING GENE EXPRESSION DATA: A HISTORICAL PERSPECTIVE

The current state of transcriptomic data analysis is best understood with a historical perspective. After Watson and Crick’s discovery of the double-helical structure of DNA (Watson & Crick 1953), scientists began to explore the hybridization properties of nucleic acids. One result of these efforts was the insight that molecular hybridization using a labeled (known) nucleic acid could distinguish specific DNA sequences in situ (Gall & Pardue 1969). It was subsequently determined that nucleic acids of different sizes could be separated by gel electrophoresis, transferred to a porous membrane, and “probed” with a labeled nucleic acid; the hybridization reaction that ensued could reveal the approximate size and abundance of a DNA (Southern blot) or RNA (Northern blot) fragment of interest (Alwine et al. 1977; Southern 1975). In all of these cases, the results were univariate and qualitative in nature, providing the investigator with information regarding the approximate size, abundance, and (for in situ hybridization experiments) location of the nucleic acid under study.

These principles were further extended to enable multivariate measurements of nucleic acid abundance with the introduction of the dot blot (Kafatos et  al. 1979). In a dot-blot experiment, mixtures of nucleic acids (targets) derived from biological samples of interest were applied directly to a porous membrane as a series of dots, which could then be assayed in parallel using a labeled nucleic acid (probe). Because the target nucleic acids were not separated by size via gel electrophoresis, dot blots provided only qualitative measures of nucleic acid abundance; however, they provided such information for many samples in parallel. The opposite approach, which involved affixing multiple oligonucleotide probes to a membrane to detect a labeled target, followed and became known as the reverse dot-blot (Saiki et al. 1989). Subsequently it was shown that oligonucleotide probes could be synthesized on an impermeable substrate (e.g., glass) that offered a variety of advantages over the porous membranes used in dot-blot experiments (Maskos & Southern 1992), and the earliest microarrays were born. Microarray technology quickly diversified into subtypes that differed in the type of probe (cDNA vs. oligonucleotide) and the method (two-channel vs. one-channel) that was used to quantify target abundance. cDNA microarrays were produced mostly in academic settings by investigators who desired customizable platforms for quantifying gene expression. These investigators would generate a library of cDNA clones of interest and then “spot” the clones on a glass slide using a robotic arrayer (Schena et  al. 1995, 1996). For cDNA arrays, gene expression levels were typically quantified as the ratio of target abundance between two samples (e.g., wild-type vs. mutant), which were labeled using distinct fluorescent dyes; these were called “two-channel” or “two-color” experiments. 
Although cDNA arrays were flexible and cost-effective for many labs, their popularity was diminished by concerns about the reproducibility of in-house spotted arrays (Hess et al. 2001; Yauk et al. 2004), as well as the emergence of commercial alternatives. Companies such as Affymetrix and Agilent developed new, industrial-scale methods for synthesizing oligonucleotide probes in situ, created standardized array platforms, and through competition drove down costs to enable widespread usage in academic labs. While some commercial providers (notably Agilent) continued to enable two-channel experiments, the majority shifted to single-channel (one-color) designs in which probes provided intensity data that measured the extent of hybridization with a single labeled target. In general, single-channel arrays made it easier to compare results across experiments, further hastening their adoption.

Most commercial microarray platforms available today are extremely precise: when measured over all probes, the correlation between technical replicates (RNA derived from the same sample and split into separate batches) is typically greater than 0.99. Microarray accuracy has also improved considerably, reflecting upgrades in the quality of genome sequences and algorithms for probe design. Nevertheless, reannotation studies have shown that up to one in four probes on some versions of Affymetrix and Illumina microarrays may be non-specific or mis-targeted (Barbosa-Morais et al. 2010; Zhang et al. 2005). It has also been shown that secondary structures in target nucleic acids can affect the hybridization of oligonucleotide probes (Kierzek 2009). It is therefore important to recognize that while the majority of microarray probes are expected to provide accurate data for their intended targets, there are legitimate reasons why individual probes may fail.

First-generation oligonucleotide microarray probes were generally designed to target specific genes without consideration of transcript isoforms. With the growing recognition of the scale and importance of alternative splicing (Modrek et al. 2001), particularly in the brain (Dredge et al. 2001; Lipscombe 2005), this situation has changed dramatically. In 2005 Affymetrix engineered a complete redesign of their microarray technology to enable measurements of abundance for individual exons.
Their human “exon array” contains over 6 million probes that collectively target over a million exons from known and predicted transcribed regions of the human genome, enabling unprecedented coverage and quantification of alternative splicing events. However, this remarkable achievement may represent the apogee of microarray technology for transcriptomics.

In recent years, next-generation sequencing technologies, such as the Genome Analyzer (Illumina), SOLiD (ABI), and 454 (454 Life Sciences) platforms (among others), have dramatically reduced the time and cost to sequence nucleic acids as compared with traditional Sanger sequencing. The ability to sequence whole transcriptomes, or “RNA-seq,” has opened a window onto the vast amount of intergenic transcription that appears to be the rule rather than the exception in biology (Djebali et al. 2012; Sultan et al. 2008), but has been invisible to microarrays, which have predominantly targeted the small fraction of the transcriptome that encodes proteins. In light of the recognition of the many critical functions performed by various noncoding species of RNA (Aalto & Pasquinelli 2012; Bartel 2009; Kugel & Goodrich 2012), there is growing interest in routinely quantifying the abundance of these species alongside their messenger RNA (mRNA) cousins. This biological reality, along with the rapidly diminishing cost to generate large amounts of sequence data, is already promoting rapid adoption of RNA-seq for transcriptomic studies in neuroscience. However, it is important to note that there are likely to be sources of technical variation associated with these technologies that are not yet fully appreciated by a majority of investigators (Benjamini & Speed 2012; Risso et al. 2011; Roberts et al. 2011; Zheng et al. 2011). As with microarrays, it will take some time and effort before these problems are identified and best practices emerge.

DIFFERENTIAL EXPRESSION AND THE PERSISTENCE OF UNIVARIATE THINKING

Although technologies for quantifying gene expression have evolved considerably and now enable genome-wide data acquisition in a single experiment, most investigators still retain a univariate mindset toward data analysis. This mindset reflects the reductionist thinking that continues to be favored by a majority of biologists. Microarrays and RNA-seq are often still used by investigators to answer the same types of questions that biologists have asked for the past 50 years; for example: “Is gene X expressed at higher or lower levels in condition A or condition B? What other genes are expressed at higher or lower levels in condition A or condition B?” (Griffin et al. 2003). As a result, early efforts in transcriptomic data analysis predominantly sought to provide more precise answers for these types of questions.

Biologists initially used “foldchange” to describe the relative abundance of a particular transcript between two conditions. The foldchange, which is simply the ratio of the expression level of a gene between condition A and condition B, emerged naturally from two-channel microarray experiments and was subsequently applied to compare the results of single-channel microarray experiments. While simple and intuitive, foldchange does not take into account the variance of measured gene activity. Furthermore, naive foldchanges often depend on absolute intensity levels (Yang et al. 2002), indicating a need for data normalization.

The increased use of commercial microarray platforms allowed statisticians to begin developing and comparing rigorous algorithms for preprocessing and analyzing microarray data with the goal of finding “differentially expressed” (DE) genes. A gene is considered DE between two or more sample cohorts if it is expressed, on average, significantly higher or lower in one cohort compared to the other. Over the past 20 years innumerable methods have been proposed for preprocessing microarray data and identifying DE genes. In general, preprocessing of microarray (or RNA-seq) data seeks to (1) identify and correct for technical (nonbiological) sources of variation via data normalization, (2) summarize repeated measurements of gene expression levels, and (3) identify and remove outlying samples. Once the data have been preprocessed, many methods exist to identify DE genes and assess their statistical significance. After identifying DE genes, additional methods have been proposed to control the type I (false-positive) error rate by accounting for the many comparisons that are made in a typical DE study. It is beyond the scope of the present chapter to provide a comprehensive review of these methods, which are discussed extensively elsewhere (McLachlan et al. 2004; Owzar et al. 2011; Simon et al. 2005; Speed 2003).
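As an illustration of why foldchange alone is incomplete, the following minimal Python sketch compares two hypothetical genes that share the same foldchange but differ in variance. It is not a method taken from the studies cited above: the expression values are made up, Welch's t statistic is assumed as one common choice of test statistic, the Benjamini-Hochberg procedure is assumed as one widely used multiple-comparison adjustment (it controls the false-discovery rate), and all function names are illustrative.

```python
import math
from statistics import mean, variance

def log2_fold_change(a, b):
    """log2 ratio of mean expression in condition A over condition B."""
    return math.log2(mean(a) / mean(b))

def welch_t(a, b):
    """Welch's t statistic: difference in means scaled by its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical expression values (arbitrary units) for two genes.
tight_a, tight_b = [200.0, 205.0, 195.0, 200.0], [100.0, 102.0, 98.0, 100.0]
noisy_a, noisy_b = [50.0, 350.0, 120.0, 280.0], [20.0, 180.0, 40.0, 160.0]

for name, a, b in [("tight", tight_a, tight_b), ("noisy", noisy_a, noisy_b)]:
    print(name, log2_fold_change(a, b), round(welch_t(a, b), 2))

# Adjusting a small list of hypothetical raw P values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Both hypothetical genes have a log2 foldchange of 1, yet the t statistic of the low-variance gene is far larger; this is the information that foldchange by itself discards.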
Differences in microarray platform design, probe design, preprocessing algorithms, and statistical methods for identifying DE genes combined to produce a conundrum that, in retrospect, may have been predictable: early microarray studies were difficult to replicate, with poor concordance seen among lists of DE genes produced by experiments that purported to study the same biological phenomena. These concerns created doubts about the reliability of microarrays and led to the creation of the MicroArray Quality Control (MAQC) project, a multicenter study of microarray performance led by the FDA and designed to directly address this perceived interoperability problem. The MAQC project clearly demonstrated that with careful experimental design and appropriate data preprocessing, there was substantial agreement in the results of microarray experiments performed on different platforms and in different laboratories; observed discordance was mostly attributed to differences in probe behavior with respect to alternatively spliced transcripts or nonspecific cross-hybridization (Shi et al. 2006). These findings helped quell doubts about the incipient field of transcriptomics but also highlighted the care with which microarray experiments must be designed and analyzed. (Importantly, the MAQC project has now extended its purview via a new phase dubbed the SEQC project, which aims to establish the reproducibility of deep sequencing technologies for measuring gene expression.)

Microarray studies of neurobiological samples caught on slowly at first owing to the perceived challenge of tissue complexity (i.e., cellular heterogeneity). Some of the earliest microarray studies in the neurosciences involved disease comparisons, including glioblastoma (Sehgal et al. 1998), multiple sclerosis (Whitney et al. 1999), schizophrenia (Mirnics et al. 2000), and Alzheimer’s disease (Loring et al. 2001). Since then, there have been hundreds of published neuroscientific studies that have used microarrays or RNA-seq to identify DE genes in many types of comparisons (e.g., disease, injury, pharmacological, physiological, anatomical, interspecies, temporal, etc.). The majority of these studies analyzed relatively small samples, used defensible methods for data preprocessing and detection of DE genes, succeeded in identifying DE genes between sample cohorts, and validated select DE findings using independent techniques.
It is not the purpose of the following section to review this large and impressive body of work (see Coppola & Geschwind 2006; Mirnics & Pevsner 2004), but rather, to highlight the assumptions and challenges of DE studies in general and neuroscientific DE studies in particular.

ASSUMPTIONS AND CHALLENGES OF DIFFERENTIAL EXPRESSION STUDIES

A typical DE study conducts tens of thousands of univariate tests to assess the significance of DE for each measured transcript. An implicit assumption of DE studies is that measured gene expression levels are independent of one another. However, this assumption is both intuitively and demonstrably false. Intuitively, biologists know that genes do not function in isolation; rather, they encode elements of cellular compartments, signaling pathways, metabolic reactions, and protein complexes that require careful attention to the stoichiometry of their constituents, which is at least partially maintained at the level of transcription. Furthermore, it is easy to show that in almost any given microarray or RNA-seq dataset of sufficient size, many genes are significantly correlated with one another (Figure 6.1). Widespread correlations among measured gene expression levels can exert adverse effects on statistical algorithms that explicitly assume independence or weak dependence in the same setting. For example, gene correlations can have a substantial impact on multiple testing procedures, in particular on estimates of false-discovery proportions made using many popular techniques (Efron 2007; Jung & Jang 2006; Kim & van de Wiel 2008). Practically speaking, by ignoring the correlation structure that is present in gene expression data, estimates of false-discovery proportions may themselves be false (Efron 2007).

Another implicit assumption of DE studies is that statistical significance corresponds with biological significance. It is now relatively straightforward and inexpensive to conduct a

[Figure 6.1 image omitted. Panels: (A) density of Pearson correlations for ACC, DLPFC, and Random; (B) significant genes (%) for ACC, DLPFC, and Random; (C) ACC and (D) DLPFC, expression level by sample for YWHAZ, TPD52, ANK2, GLRB, SCAMP1, SYNJ1, G3BP2, PAFAH1B1, PREPL, and PRNP.]

FIGURE 6.1: Many genes expressed in the human brain are highly correlated. (A) Shown here are density plots of Pearson correlations among genes expressed in human anterior cingulate cortex (ACC) (Li et al. 2007), human dorsolateral prefrontal cortex (DLPFC) (Li et al. 2007), and normally distributed random numbers (Random). All possible pairwise Pearson correlations were calculated for 18,631 microarray probes across 71 samples in each brain region (or an equivalent-sized matrix of random numbers). There is an excess of strong positive and negative correlations in ACC and DLPFC compared with Random. Vertical lines indicate a stringent threshold for significant (P < 2.88e–10) correlations based on a Bonferroni correction for the number of pairwise comparisons. (B) Approximately 19% of unique genes in ACC and DLPFC possess at least one significant correlation (compared with 0% for Random). As illustrated in (C) and (D), many significant correlations among genes are conserved between ACC and DLPFC, suggesting that expression levels of these genes depend on shared factors that are present in both datasets.
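The threshold quoted in the caption of Figure 6.1 can be reproduced with a short calculation. The minimal Python sketch below assumes a significance level of 0.05 divided by the number of unique probe pairs; the caption does not state the level, so 0.05 is an assumption that happens to recover the reported value. The second half uses simulated numbers only (not the ACC or DLPFC data) to show how a hypothetical shared factor induces correlation between two "genes"; all names are illustrative.

```python
import math
import random

def n_pairs(n_features):
    """Number of unique pairwise comparisons among n_features variables."""
    return n_features * (n_features - 1) // 2

def bonferroni_threshold(n_features, alpha=0.05):
    """Per-comparison P-value cutoff after Bonferroni correction (alpha is assumed)."""
    return alpha / n_pairs(n_features)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(n_pairs(18631))               # 173547765 unique probe pairs
print(bonferroni_threshold(18631))  # approximately 2.88e-10

# Simulated data only: 71 samples, matching the sample count in the caption.
rng = random.Random(0)
n_samples = 71
shared = [rng.gauss(0, 1) for _ in range(n_samples)]   # hypothetical shared factor
gene_a = [s + rng.gauss(0, 0.3) for s in shared]       # depends on the shared factor
gene_b = [s + rng.gauss(0, 0.3) for s in shared]       # depends on the shared factor
noise_a = [rng.gauss(0, 1) for _ in range(n_samples)]  # independent random numbers
noise_b = [rng.gauss(0, 1) for _ in range(n_samples)]

print(pearson(gene_a, gene_b))    # strong positive correlation expected
print(pearson(noise_a, noise_b))  # correlation near zero expected
```

With roughly 174 million comparisons, only very strong correlations survive the correction, which is why the vertical lines in panel (A) sit far out in the tails.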


genome-wide screen to identify DE genes, validate DE for a subset of genes using an independent technique (e.g., quantitative RT-PCR), and speculate as to what DE genes might have in common with one another (and therefore why they are in fact DE). While this approach is perfectly defensible, it is important to acknowledge the tenuous connection that exists for most genes between “a change in expression of X%” and a change in biological function. Most functional studies that have sought to knock down the expression of individual genes have done so through deletion or complete inactivation of the resulting protein. Conversely, many functional studies that have sought to overexpress individual genes have done so at multiples of physiological expression levels. In most experimental systems, it remains quite difficult to replicate the physiologically relevant foldchanges that are often deemed statistically significant in DE studies. Furthermore, the relationship between foldchange and biological significance is unlikely to be the same for all genes. Rather, different genes, and perhaps different categories of genes, are likely to experience different levels of functional constraint on their dynamic ranges of expression. Another challenge facing DE studies is determining how DE genes (potentially making up a large list) relate to one another. Traditional DE studies depend on post hoc analyses using external classification systems, such as gene ontology (Ashburner et  al. 2000), to identify meaningful themes among lists of up- or down-regulated genes. If specific categories or pathways are not obviously enriched among DE genes, the investigator may be left with an unsatisfying and pointillistic impression of gene expression differences between sample cohorts. Furthermore, external classification systems currently do not provide adequate context with respect to the tissue and cellular specificity of gene expression patterns in the nervous system. 
A related problem is knowing which genes from DE studies to prioritize for potentially time-consuming and costly follow-up experiments. As noted above, biological significance does not track linearly with statistical significance, so simply choosing the gene with the most significant P value is unlikely to be the best strategy.

Hidden biases in DE studies may also have a substantial effect on both the number and nature of DE genes. Such biases can include technical and biological factors that are confounded with the investigator’s biological comparison of interest. For example, it is now widely appreciated that processing subsets of microarrays separately within a single experiment can induce pronounced technical (nonbiological) variation in measured gene activity, or “batch effects.” Batch effects can be introduced at various stages of a microarray experiment, including RNA extraction, amplification, labeling, hybridization, or scanning, among others; if unrecognized or uncorrected, they may constitute the dominant source of variation in a microarray dataset. While most investigators today are aware of this situation and take steps to mitigate and correct for batch effects using available tools (Chen et al. 2011; Johnson et al. 2007; Luo et al. 2010), this was not the case for much of the first decade of microarray studies. Consequently there are likely to be many examples of published studies in which the identification of DE genes was confounded to varying degrees by the presence of batch effects. This experience should serve as a cautionary tale for biologists who are currently designing RNA-seq experiments: it is critical to understand the major sources of technical variation within a given experimental system so that they do not impinge upon one’s results.

In addition to technical factors, biological factors may also present as hidden biases in DE studies. For example, it has been shown that tissue pH, which can be influenced by stress conditions at the time of death, can have a substantial impact on gene expression levels (Li et al. 2004; Mexal et al. 2006), especially for categories of genes involved in energy metabolism and mitochondrial function (Vawter et al. 2006). Such categories have been overrepresented among DE genes in a number of neuroscientific studies (Iwamoto et al. 2005; Prabakaran et al. 2004; Sun et al.
2006), and it has been proposed that these findings may partially reflect differences in tissue pH among sample cohorts, which may in turn reflect unbalanced stress conditions immediately preceding death (Vawter et  al. 2006). Similarly, the time of death itself can influence expression levels for thousands of transcripts whose abundance oscillates in a circadian fashion (Cirelli et al. 2004; Yang et al. 2007). Biases such as these may be difficult to recognize if they are not explicitly measured and related to gene expression data; they also highlight the importance of validating DE genes

using biological specimens from independent sample cohorts whenever possible.

Most importantly for neuroscientific DE studies, hidden biases may result from subtle differences in tissue dissections that alter the cellular composition of analyzed tissue samples. This problem is particularly acute in the brain, which is composed of an inordinate number of cell types, many of which remain poorly characterized. For the vast majority of neuroscientific DE studies, which have analyzed whole tissue samples, the measured expression “level” of any particular transcript is actually a weighted average of the cell type-specific expression levels of that transcript, with each cell type weighted by its representation in the tissue sample. It is therefore easy to imagine how unbalanced dissection artifacts (for example, inadvertently including more white matter in some samples) could drive the apparent DE of many genes. Indeed, a number of neuroscientific DE studies have identified myelin-related genes as among the most significantly DE between sample cohorts (Albertson et al. 2004; Aston et al. 2005; Hakak et al. 2001; Lewohl et al. 2000). Without explicitly controlling for the number of oligodendrocytes present in each sample, measurements of gene activity alone cannot discern among the competing possibilities of true DE within oligodendrocytes, differential abundance of oligodendrocytes driven by technical (dissection) factors, differential abundance of oligodendrocytes driven by biological factors, or some combination of the above. Of course this difficulty extends to all cell types.
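The effect of cellular composition on bulk measurements can be sketched numerically. The minimal Python example below treats the measured level as the sum, over cell types, of the cell type-specific level times that cell type's proportion in the sample. All numbers, cell-type labels, and the function name are hypothetical and chosen only to illustrate the argument; they are not taken from any of the cited studies.

```python
def bulk_expression(cell_type_expression, proportions):
    """Weighted average of cell type-specific levels; weights are cell-type proportions."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("cell-type proportions must sum to 1")
    return sum(cell_type_expression[ct] * p for ct, p in proportions.items())

# Hypothetical myelin-related gene: per-cell expression is identical in both samples.
myelin_gene = {"neuron": 1.0, "astrocyte": 1.0, "oligodendrocyte": 50.0}

# Hypothetical dissections: the second sample includes more white matter.
sample_gray = {"neuron": 0.60, "astrocyte": 0.30, "oligodendrocyte": 0.10}
sample_more_white = {"neuron": 0.50, "astrocyte": 0.30, "oligodendrocyte": 0.20}

level_gray = bulk_expression(myelin_gene, sample_gray)
level_white = bulk_expression(myelin_gene, sample_more_white)

print(level_gray, level_white, level_white / level_gray)
```

In this toy case the bulk level nearly doubles between samples even though no cell type changed its expression of the gene, so the apparent DE is entirely attributable to composition.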


The knee-jerk response to the problem of cellular heterogeneity in neurobiological tissues has been to adopt a univariate mindset with respect to cell types. In other words, the thinking goes, by isolating “pure” populations of cells and measuring their gene activity, analysis of DE can clearly assign cellular agency to observed gene expression changes. A number of methods for isolating homogeneous populations of cells from nervous system (and other) tissues have been proposed (Table 6.1) (Okaty et al. 2011b). These include laser-capture microdissection (LCM) (Simone et al. 1998), manual cell sorting (Sugino et al. 2006), fluorescence-activated cell sorting (FACS) (Arlotta et al. 2005; Lobo et al. 2006), immunopanning (Cahoy et al. 2008), and translating ribosome affinity purification (TRAP) (Heiman et al. 2008). These techniques have already contributed to our understanding of the nature and extent of cellular heterogeneity in the nervous system and will undoubtedly continue to provide critical insights. However, it is important to note that with the exception of LCM, all of these techniques are likely to be infeasible for use with adult human brain samples (Table 6.1); for LCM, background contamination with “off-target” cell types may be unavoidable (Okaty et al. 2011a). Furthermore, all of these methods share a critical strategic limitation, which is that some knowledge of the specific properties of a particular cell type must exist in order for that cell type to be isolated and studied. This circularity limits the appeal of such methods for describing the full extent of cellular diversity in general and in tissues from the human nervous system in particular.

TABLE 6.1. TECHNIQUES FOR ISOLATING SPECIFIC CELL TYPES FROM NEUROBIOLOGICAL TISSUES

Technique | Reference | Pros | Cons | Practical for adult human brain
LCM | Chung et al. (2005) | OK for post-mortem tissue | Background contamination | Yes
Manual | Sugino et al. (2006) | Good for rare cell types | Low RNA yield | No
FACS | Arlotta et al. (2005) | High RNA yield | Difficult for adult samples | No
PAN | Cahoy et al. (2008) | No labeling required | Cell-surface antigen needed | No
TRAP | Doyle et al. (2008) | Translating mRNAs | Transgenic mice required | No

Source: Modified from Okaty et al. (2011b). FACS, fluorescence-activated cell sorting; LCM, laser-capture microdissection; Manual, manual cell sorting; PAN, immunopanning; TRAP, translating ribosome affinity purification.


FROM DIFFERENTIAL EXPRESSION TO COEXPRESSION: ADOPTING A MULTIVARIATE PERSPECTIVE FOR TRANSCRIPTOMICS

At present, an increasing number of neuroscientific investigators are utilizing multivariate analysis techniques to extract additional biological information from transcriptomic datasets. In contrast to DE studies, which implicitly assume that gene expression levels are independent, multivariate techniques embrace the notion of dependence and seek to describe its structure explicitly. More specifically, multivariate techniques aim to identify patterns of gene activity and/or “modules” of coexpressed genes. These efforts have been motivated not only by the many challenges facing neuroscientific DE studies described above but also by the recognition that in almost any biological system some genes are likely to share common sources of variation in gene expression, which are biologically interesting. Furthermore, gene coexpression relationships can vary dramatically between sample cohorts in the absence of DE (Figure 6.2), suggesting that subtle perturbations in gene expression may exist between groups that are not detectable by DE analysis.

FIGURE 6.2: Gene coexpression relationships can vary significantly between sample cohorts in the absence of differential expression. Shown here are distributions of gene expression levels for GJA1 (A) and APOE (B) in human and chimpanzee cerebral cortex (Cáceres et al. 2003; Enard et al. 2002; Fraser et al. 2005; Iwamoto et al. 2004; Khaitovich et al. 2004; Lu et al. 2004). Because the mean expression levels for these genes do not differ significantly between the species, they are seen as uninteresting through the univariate lens of differential expression analysis. However, when the same genes are examined from a coexpression perspective, it is evident that their expression levels are significantly correlated in humans (C) but not in chimpanzees (D).
[Panel annotations: (A) GJA1, P = 0.32; (B) APOE, P = 0.43; (C) Human, P = 3.5e-10; (D) Chimp, P = 0.08. Panels (A) and (B) plot expression level for chimp and human samples; panels (C) and (D) plot expression levels of GJA1 and APOE by sample.]

There are many multivariate techniques available for analyzing high-dimensional gene expression datasets (Everitt et al. 2011; Hastie et al. 2009; Parmigiani et al. 2003; Speed, 2003; Yakovlev et al. 2013). These techniques can be broadly grouped into projection methods, which primarily seek to reduce dimensionality while accounting for the main sources of variance in the data, and clustering methods, which primarily seek to find groups of related features within the data. Examples of the former include principal component analysis, singular value decomposition, and multidimensional scaling; examples of the latter include hierarchical clustering, k-means clustering, self-organizing maps, and network methods. All of these techniques can be used to perform supervised analysis of gene expression data (in which distinctions among samples are already known), but their true strength lies in unsupervised analysis of gene expression data (in which distinctions among samples are unknown).

Recently there has been a great deal of interest in the use of network methods to understand biological systems (Barabási & Oltvai 2004; Carter et al. 2004; Hartwell et al. 1999; Horvath et al. 2006; Huang et al. 2007; Ihmels et al. 2005; Jeong et al. 2001; Jordan et al. 2004; Lee et al. 2004; Snel et al. 2004; Stuart et al. 2003; van Noort et al. 2004; Yook et al. 2004; Zhang & Horvath 2005). Many biological systems exhibit properties of complex networks, such as scale-free topology, in which the probability that a given node (a gene or its product) in the network has a certain number of connections follows an inverse power law (Jeong et al. 2000; Jordan et al. 2004; Ravasz et al. 2002; Yook et al. 2004). This property predicts that certain “hub” nodes will have many connections and serve as convergence points in the network, whereas most nodes will have very few connections (Jeong et al. 2000). It has been shown that this type of topology is robust to random perturbations (Albert et al. 2000), a property that would appear to be highly desirable for biological systems. Among network methods, weighted gene coexpression network analysis (Horvath, 2011; Langfelder & Horvath 2008; Zhang & Horvath 2005) (WGCNA; Table 6.2) has emerged as a particularly popular approach for analyzing gene coexpression relationships in neurobiological systems (Table 6.3). An advantage of WGCNA (or network methods in general) is that it permits the specification of well-established network concepts (Dong & Horvath 2007), which provide succinct and quantitative descriptions of network topologies.

While many multivariate techniques can be used to analyze neurobiological transcriptomes, for the remainder of this chapter I emphasize the advantages, implications, and findings of techniques such as WGCNA that primarily seek to identify modules of coexpressed genes. As discussed in the following text, the use of coexpression methods in conjunction with DE can at least partially address the challenges that are inherent to DE studies of neurobiological systems. Furthermore, by analyzing gene coexpression relationships within individual sample cohorts, it is possible to obtain an entirely different perspective on transcriptomic data analysis.
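The situation depicted in Figure 6.2, in which two genes are correlated in one cohort but not in another despite similar mean expression, can be examined with a test for a difference between two correlation coefficients. The sketch below uses Fisher's z-transformation, which is one standard way to do this and is not necessarily the test used for the figure; the data are simulated, and the function name `compare_correlations` is hypothetical.

```python
import math
import numpy as np

def compare_correlations(x1, y1, x2, y2):
    """Two-sided Fisher z-test that corr(x1, y1) equals corr(x2, y2).

    Assumes the two cohorts are independent and roughly bivariate normal.
    """
    r1 = np.corrcoef(x1, y1)[0, 1]
    r2 = np.corrcoef(x2, y2)[0, 1]
    se = math.sqrt(1.0 / (len(x1) - 3) + 1.0 / (len(x2) - 3))
    z = (np.arctanh(r1) - np.arctanh(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return r1, r2, z, p

# Simulated expression for two genes in two cohorts (values are invented).
rng = np.random.default_rng(1)
n = 40
shared = rng.normal(0, 1, n)                 # common source of variation
a1 = 9 + shared + rng.normal(0, 0.3, n)      # cohort 1: genes share variation
b1 = 11 + shared + rng.normal(0, 0.3, n)
a2 = 9 + rng.normal(0, 1.04, n)              # cohort 2: same means and spread,
b2 = 11 + rng.normal(0, 1.04, n)             # but independent variation

r1, r2, z, p = compare_correlations(a1, b1, a2, b2)
print(f"cohort 1 r = {r1:.2f}, cohort 2 r = {r2:.2f}, p = {p:.2g}")
```

Because the simulated means and variances match across cohorts, a gene-by-gene comparison of means would find little, while the difference in correlation is pronounced.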

ADVANTAGES OF INTEGRATING COEXPRESSION ANALYSIS WITH DIFFERENTIAL EXPRESSION

A major advantage of integrating coexpression analysis with DE is dimensionality reduction. In many transcriptomic datasets, a relatively small number of patterns can explain a large fraction of the overall variation in gene expression (Oldham et al. 2008; Podtelezhnikov et al.

TABLE  6.2. WGCNA

WGCNA (Horvath, 2011; Langfelder and Horvath, 2008; Zhang and Horvath, 2005) is a popular approach for analyzing gene coexpression relationships in transcriptomic datasets. WGCNA draws upon principles from graph theory to construct undirected graphs from gene expression datasets in which nodes correspond to genes (or transcripts) and edges reflect the strength of coexpression. More specifically, WGCNA consists of four basic steps. First, a similarity matrix is constructed for all genes of interest (typically using the Pearson correlation coefficient, although other measures of similarity may be used). Second, the similarity matrix is converted into an adjacency matrix by "weighting" the similarities using a power function, which deemphasizes weak and potentially spurious relationships. Third, the “topological overlap” (TO) (Ravasz et al. 2002; Zhang & Horvath, 2005) for each pair of genes is determined by comparing their adjacencies with all other genes. Fourth, densely interconnected groups of genes, or "modules", are identified via hierarchical clustering using 1‒TO as a distance measure (Langfelder et al. 2008). WGCNA has been used to study the organization of transcriptomes generated from a number of neurobiological systems (Table 6.3). The WGCNA approach has also been extended to study patterns of protein abundance in the human brain (Shirasaki et al. 2012), gene methylation in aging (Bocklandt et al. 2011), and voxel activation in resting-state fMRI brain imaging data (Mumford et al. 2010).
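The four steps described above can be sketched in a few lines of Python. This is a simplified illustration and not the WGCNA R package: the soft-thresholding power `beta` is fixed by hand, modules are obtained by cutting the tree into a preset number of clusters (`n_modules`, an assumption made here for brevity), and the data are simulated. The topological overlap formula follows the form given by Zhang & Horvath (2005).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coexpression_modules(expr, beta=6, n_modules=2):
    """expr: genes x samples array. Returns (module labels, TO matrix)."""
    # Step 1: similarity matrix (absolute Pearson correlation between genes).
    sim = np.abs(np.corrcoef(expr))
    # Step 2: adjacency matrix, weighting similarities with a power function.
    adj = sim ** beta
    np.fill_diagonal(adj, 0.0)
    # Step 3: topological overlap,
    # TO_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij).
    k = adj.sum(axis=1)
    shared = adj @ adj
    to = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(to, 1.0)
    # Step 4: average-linkage hierarchical clustering on 1 - TO.
    dist = 1.0 - to
    dist = (dist + dist.T) / 2.0     # guard against rounding asymmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_modules, criterion="maxclust")
    return labels, to

# Simulated data: 20 genes x 30 samples, two groups of 10 genes that each
# follow their own latent pattern plus noise.
rng = np.random.default_rng(2)
factors = rng.normal(size=(2, 30))
expr = np.repeat(factors, 10, axis=0) + 0.3 * rng.normal(size=(20, 30))

labels, to = coexpression_modules(expr)
print(labels)
```

In practice the power is usually chosen from the data and modules are defined by a more adaptive tree-cutting procedure, so the fixed values here should be read as placeholders.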

TABLE 6.3. MULTIVARIATE STUDIES OF GENE EXPRESSION IN NEUROBIOLOGICAL SYSTEMS

First author | Year | PubMed ID | Species | Tissue(s) | No. samples | Platform | Pathology | Cell type modules | Method
Horvath | 2006 | 17090670 | Human | Brain | 120 | Affymetrix U133A | Glioblastoma | NI | WGCNA
Oldham | 2006 | 17101986 | Human/chimp | Various CTX, CN, CB | 36 | Affymetrix U95Av2 | None | NI | WGCNA
Miller | 2008 | 18256261 | Human | CA1, frontal lobe | 61 | Affymetrix U133A/U95A | AD | NI | WGCNA
Oldham | 2008 | 18849986 | Human | CTX, CN, CB | 160 | Affymetrix U133A | None | Yes | WGCNA
Johnson | 2009 | 19477152 | Human | Fetal CB, THM, STR, HIP, various CTX | 190 | Affymetrix Exon 1.0 ST | None | NI | WGCNA
Winden | 2009 | 19638972 | Mouse | Various LCM neuronal populations | 42 | Affymetrix MOE430A | None | NA | WGCNA
Saris | 2009 | 19712483 | Human | Whole blood | 246 | Illumina Ref-8 | ALS | NI | WGCNA
Hawrylycz | 2010 | 19800006 | Mouse | CTX | 2518 (voxels) | In situ hybridization | None | NI | Unsupervised hierarchical clustering using correlation
Ponomarev | 2010 | 20147889 | Rat | AMY | 56 | Illumina Ref-12 | PTSD | Yes | k-means
Torkamani | 2010 | 20197298 | Human | CTX | 101 | Affymetrix U95A/U133Plus2 | SZ | Yes | Unsupervised hierarchical clustering using mutual information
Dabrowski | 2010 | 20565733 | Rat | CTX | 15 | Affymetrix U34A | Stroke | NI | SVD & Bayesian network analysis
Miller | 2010 | 20616000 | Human/mouse | Various | 1066 | Affymetrix (various) | Various | Yes | WGCNA
Ray | 2010 | 20925940 | Human | EC, HIP, MTG, PCG | 96 | Affymetrix U133Plus2 | AD | NI | CoExp (sparse correlation network)
Iancu | 2010 | 20959017 | Mouse | STR | 54 | Illumina WG-6 | None | Yes | WGCNA
Cai | 2010 | 20961428 | Human | Whole blood; lymphocytes; CTX, CN, CB | 1623 | Affymetrix U133A; Illumina HT-12/WG-6 | None | NI | WGCNA
Ivliev | 2011 | 21159630 | Human | Brain | 790 | Affymetrix U133A/U133Plus2 | Glioma | Yes | WGCNA
Mulligan | 2011 | 21223303 | Mouse | Various | 600 | In-house cDNA arrays | Alcoholism | Yes | PCA & WGCNA
Park | 2011 | 21410935 | Mouse | HIP, STR | 194 | Illumina Ref-8 | Fear conditioning | NI | WGCNA
Voineagu | 2011 | 21614001 | Human | FC, TC, CB | 36 | Illumina Ref-8 | Autism | Yes | WGCNA
Winden | 2011 | 21695113 | Rat | DG | 64 | Agilent & Codelink | Epilepsy model | Yes | WGCNA
Hawrylycz | 2011 | 21764550 | Mouse | Whole brain | 2518 (voxels) | In situ hybridization | None | NI | Correlation based search of genes (Neuroblast) or voxels (AGEA)
Rosen | 2011 | 21943601 | Human | CB, HIP, FC | 52 | Affymetrix U133A | FTD | NI | WGCNA
Podtelezhnikov | 2011 | 22216330 | Human | PFC, PVC, CB | >1800 | Rosetta/Merck 44k 1.1 | AD, HD, CTRL | Yes | PCA and supervised hierarchical clustering
Ponomarev | 2012 | 22302827 | Human | AMY, FC | 32 | Illumina HT-12 | Alcoholism | Yes | WGCNA
Hilliard | 2012 | 22325205 | Zebra finch | Various | 54 | Agilent zebra finch | None | Yes | WGCNA
Ben-David | 2012 | 22412387 | Human | Whole brain (macrodissected regions) | 1340 | Agilent | None | Yes | WGCNA
Bernard | 2012 | 22445337 | Macaque | CTX | 225 | Affymetrix macaque | None | Yes | WGCNA

AD, Alzheimer’s disease; ALS, amyotrophic lateral sclerosis; AMY, amygdala; CA1, CA1 region of hippocampus; CB, cerebellum; CN, caudate nucleus; CTX, cortex; DG, dentate gyrus; EC, entorhinal cortex; FC, frontal cortex; FTD, fronto-temporal dementia; HD, Huntington’s disease; HIP, hippocampus; k-means, k-means clustering; LCM, laser-capture microdissection; MTG, middle temporal gyrus; NA, not applicable; NI, not investigated; PCA, principal components analysis; PCG, postcentral gyrus; PFC, prefrontal cortex; PTSD, post-traumatic stress disorder; PVC, primary visual cortex; STR, striatum; SVD, singular value decomposition; SZ, schizophrenia; TC, temporal cortex; THM, thalamus; WGCNA, weighted gene coexpression network analysis.


2011). By shifting the focus from individual gene expression levels to shared gene expression patterns, the dimensionality of large gene expression datasets can often be reduced by orders of magnitude. Reducing the dimensionality of transcriptomic datasets can be desirable, since analysis of DE at the level of coexpression modules (as opposed to individual genes) greatly mitigates the problem of multiple comparisons. Coexpression modules that have been identified in a neurobiological system of interest also provide a natural biological framework in which to organize DE genes. By interpreting DE genes within the context of coexpression modules, obvious themes may emerge (e.g., DE of astrocyte-expressed genes) that might remain invisible to other methods. Coexpression analysis can also reveal subgroups of DE genes that exhibit distinct patterns of variation within sample conditions (Figure  6.3). In contrast, by averaging gene expression levels, DE analysis essentially ignores patterns of variation that may exist within sample cohorts. Coexpression analysis can also help distinguish between hidden biases and bona fide DE. Because coexpression analysis will detect the most salient patterns of gene activity in a given dataset, this approach is exquisitely sensitive to batch effects and can help to reveal their presence. While it is possible for other methods (e.g., linear regression) to reveal biases, such revelations depend on a priori knowledge of sample covariates such as the hybridization batch, RIN scores, pH, and so on. In contrast, coexpression analysis can reveal hidden biases in the absence of pertinent sample covariates. For example, consider a DE study of whole tissue samples that identified genes involved in myelination as upregulated in the experimental group, leading investigators to conclude that this activity was somehow related to the experimental condition. 
By analyzing gene coexpression relationships, it would be possible to determine whether expression levels for these genes were uniformly higher in the experimental condition (suggesting true DE and/or differential abundance of oligodendrocytes) or higher in only a small subset of samples (suggesting possible dissection artifacts). The latter possibility could be further explored by histological examination of the specific sample(s) in question.
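The myelination example above can be sketched as a simple check on a per-sample summary of a module. The code below is a hedged illustration with simulated data: the module is summarized by the mean of standardized gene profiles (a stand-in for the module eigengene), and the helper names `module_summary` and `shift_profile` are hypothetical. Two scenarios are compared: a uniform elevation across all experimental samples, and an elevation confined to two samples.

```python
import numpy as np

def module_summary(expr):
    """Per-sample module summary: mean of z-scored gene profiles (genes x samples)."""
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    return z.mean(axis=0)

def shift_profile(summary, is_case):
    """Describe how the module summary differs between case and control samples."""
    case, ctrl = summary[is_case], summary[~is_case]
    return {
        "mean_diff": case.mean() - ctrl.mean(),
        "median_diff": np.median(case) - np.median(ctrl),
        "n_case_above_ctrl_max": int((case > ctrl.max()).sum()),
    }

def simulate(extra_abundance, seed):
    """20 myelin-like genes tracking a per-sample abundance term (invented values)."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = 20, 24
    abundance = rng.normal(0, 0.1, n_samples) + extra_abundance
    loadings = rng.uniform(0.8, 1.2, n_genes)
    noise = rng.normal(0, 0.1, (n_genes, n_samples))
    return 8 + np.outer(loadings, abundance) + noise

is_case = np.arange(24) >= 12                 # 12 controls, then 12 cases
uniform_shift = np.where(is_case, 0.5, 0.0)   # every case sample elevated
subset_shift = np.zeros(24)
subset_shift[[12, 13]] = 3.0                  # only two case samples elevated

uniform = shift_profile(module_summary(simulate(uniform_shift, 3)), is_case)
subset = shift_profile(module_summary(simulate(subset_shift, 3)), is_case)
print("uniform:", uniform)
print("subset: ", subset)
```

Both scenarios raise the average of the experimental group, but in the second the median barely moves and only a couple of samples exceed the control range, which is the pattern that would prompt a closer look at those specific samples.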

FIGURE 6.3: Coexpression analysis reveals subgroups of differentially expressed genes with distinct patterns of variation. Shown here (A-D) are four modules of coexpressed genes in the developing human neocortex (Fietz et al. 2012). Top: heat maps depict relative expression levels for coexpressed genes (rows) across samples (columns). Bottom: the first principal component, or module eigengene (Horvath & Dong 2008; Oldham et al. 2006), obtained by singular value decomposition of each coexpression module is depicted here; sample labels are shown below (VZ, ventricular zone; ISVZ, inner subventricular zone; OSVZ, outer subventricular zone; CP, cortical plate; w13, 13th week postconception, etc.). The left modules (A and C) consist of genes that are upregulated in VZ samples, while the right modules (B and D) consist of genes that are upregulated in CP samples. Although differential expression analysis would likely identify all of the genes in these modules as significantly upregulated in VZ or CP as compared with the other neocortical regions, the resulting list of P values would not segregate these genes into the obvious subgroups that are revealed by coexpression analysis; specifically, those genes that are expressed uniformly from w13 to w16 (A and B) and those that are expressed substantially higher in w16 (C and D).
[Panels (A)-(D). Each x-axis lists the same 24 samples: w13_VZ, w14_1_VZ, w14_2_VZ, w15_VZ, w16_1_VZ, w16_2_VZ, followed by the corresponding ISVZ, OSVZ, and CP samples.]

A NEW EXPERIMENTAL PARADIGM FOR TRANSCRIPTOMIC STUDIES OF NEUROBIOLOGICAL SYSTEMS

In addition to aiding the interpretation of DE genes, coexpression analysis offers a new paradigm for transcriptomics by enabling analysis of gene expression within a single sample cohort. For example, consider a microarray dataset consisting of 50 postmortem control samples from a particular cortical area that were matched for age, sex, and tissue pH. In the absence of any distinguishing sample characteristics, there are no “groups” of samples in this dataset to compare for DE analysis. However, this dataset is perfectly amenable to gene coexpression analysis (Oldham et al. 2008). DE studies pose the question, “Which genes are expressed, on average, higher or lower between my sample cohorts?” However, coexpression analysis of a single sample cohort asks: “What are the most salient patterns of gene activity in my biological system, what do they mean, and how are they organized?” This experimental paradigm offers insights that are qualitatively and quantitatively different than those obtained by DE studies, as described in the following paragraphs.

Importantly, coexpression analysis of a single sample cohort provides a novel approach for deconvolving the cellular heterogeneity of neurobiological specimens. This ability follows logically from two uncontroversial premises: (1) cell types are distinguished by the genes that they express and (2) the absolute number of each cell type will vary from sample to sample. Therefore the genes that are the most specifically and consistently expressed in the same cell type should in principle appear highly correlated with one another in microarray or RNA-seq data derived from whole tissue homogenates (Figure 6.4). Unlike the techniques listed in Table 6.1, the ability to discern molecular signatures of discrete cell types in this fashion does not require prospective identification and isolation of individual cell types. Instead, a coexpression approach to the problem of cellular heterogeneity seeks to analyze all cell types that are present in the tissue simultaneously and let the patterns of gene activity speak for themselves. The data-driven nature of coexpression analysis makes it an ideal approach for studying the full extent of cellular heterogeneity in biological systems and particularly systems in which the techniques listed in Table 6.1 are impractical, such as the human brain.

The identification of cell type-specific coexpression modules also offers a natural vehicle for biomarker discovery. There is a pressing need to discover new and more accurate biomarkers for neurobiological cell types. An ideal biomarker should exhibit high sensitivity and specificity for a particular cell type; in terms of gene expression, this means that the mRNA should be consistently expressed at detectable levels in the cell type of interest but not in other cell types. Logically, genes that possess such characteristics should appear highly correlated in transcriptomic datasets derived from multicellular tissues (Figure 6.4). By calculating the strength of a gene’s association with a cell type-specific coexpression module, it is possible to predict how good of a marker for that cell type the gene is likely to be (Oldham et al. 2008).

It is a leap, but not much of a leap, to posit that a gene that is specifically and consistently expressed in a particular cell type is likely to be doing something important in that cell type. Similarly, one might hypothesize that a gene that is coexpressed with a group of ribosomal genes is likely to play an important role in ribosomal biogenesis or function, or that a gene coexpressed with mitochondrial genes is likely to be involved, directly or indirectly, with some aspect of energy metabolism, and so on. Seen through this lens, the quantitative assessment of module membership coupled with accurate functional characterization of coexpression modules provides a new approach for annotating gene function in neurobiological tissues through the principle of “guilt by association” (Oldham et al. 2008). An important corollary to this proposition is that the strength of module membership is itself a measure of functional significance; as such, it provides a new way to prioritize genes for further studies. There is growing evidence for these claims, as discussed further in the next section.

FIGURE 6.4: Cell-type biomarkers are highly correlated in transcriptomic data derived from heterogeneous tissue samples. (A) Thought experiment. Imagine 10 samples from human cerebellum in which the number of Purkinje neurons in each sample is known. Next, imagine a gene that is expressed constitutively and specifically by Purkinje neurons (i.e., it is expressed with perfect fidelity). Now imagine two more genes that are expressed by Purkinje neurons with very good fidelity or good fidelity. All three genes are highly correlated because they share the same primary source of variation: the relative abundance of Purkinje neurons in each sample, assuming linear relationships, for which empirical evidence exists (Shen-Orr et al. 2010). (B) A real Purkinje neuron coexpression module identified in microarray data generated from whole tissue samples of adult human cerebellum (Hodges et al. 2006; Oldham et al. 2008). This module is significantly enriched (P = 1.2e-05) with genes that cause abnormal Purkinje cell morphology when knocked out in mice (Zhang et al. 2010). Depicted are the top 15 genes (probes) on the microarray based on their correlation with the module eigengene. (C-F) Shown are in situ hybridization data from the Mouse Brain Atlas (Lein et al. 2007) for four genes from the module. CALB1 (C) and CA8 (D) are known markers of Purkinje cells in mouse cerebellum, while many of the other top module genes have been shown to be exclusively or predominantly expressed by Purkinje neurons in cerebellum, including ITPR1, PVALB, PCP4 (also known as Purkinje cell protein 4), SLC1A6, and LRP8. PLXDC1 (E) and CHST2 (F), which are not known Purkinje cell markers, exhibit very “clean” expression in this cell type, as predicted based on their coexpression relationships.
[Panel (A) plots amount by sample; panel (B) plots expression level by sample. Probes labeled in (B): CALB1, CA8, ITPR1, CALB1, ITPR1, PVALB, ITPR1, PCP4, PLXDC1, CHST2, LPL, CEP76, LARGE, SLC1A6, LRP8.]
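The thought experiment in Figure 6.4(A) can be reproduced numerically. In the sketch below, the Purkinje neuron counts, the per-cell expression value, and the amounts of unrelated variation assigned to each fidelity level are all invented, and a linear relationship between cell number and bulk signal is assumed.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples = 10

# Hypothetical known number of Purkinje neurons in each of 10 samples.
purkinje_count = rng.uniform(50, 150, n_samples)

def marker(per_cell, background_sd):
    """Bulk signal = per-cell expression x cell count + unrelated variation."""
    return per_cell * purkinje_count + rng.normal(0, background_sd, n_samples)

genes = {
    "perfect_fidelity": marker(1.0, 0.0),
    "very_good_fidelity": marker(1.0, 5.0),
    "good_fidelity": marker(1.0, 15.0),
    "unrelated_gene": rng.normal(100, 30, n_samples),
}

# Correlation of each gene with the (normally unobserved) cell abundance.
r = {name: np.corrcoef(purkinje_count, x)[0, 1] for name, x in genes.items()}
for name, value in r.items():
    print(f"{name}: r = {value:.2f}")
```

In real data the cell counts are not observed, but genes that track the same hidden abundance will still correlate with one another, which is what allows a coexpression module to stand in for the cell type.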

COEXPRESSION APPLICATIONS IN THE NEUROSCIENCES

Multivariate techniques have been used to analyze microarray data generated from neurobiological samples for over a decade. However, until recently, these techniques were almost always used to determine relationships among samples (as opposed to relationships among genes), primarily via hierarchical clustering or projection methods (for example, by Khaitovich et al. 2004 and Roth et al. 2006). Many of the earliest applications of multivariate techniques to study relationships among gene expression levels analyzed samples from many species and tissues (Ge et al. 2005; Jordan et al. 2004; Lee et al. 2004; Miki et al. 2001; Stuart et al. 2003). Since neurobiological samples constituted a very small fraction of the total number of samples analyzed in these studies, the identified patterns represented only the broadest brushstrokes of gene activity (e.g., up or down in whole brain compared to nonneural tissues). Nevertheless, these studies highlighted the need to carefully consider biological context when interpreting gene coexpression relationships and set the stage for similar efforts focused purely on neurobiological samples.

The first unsupervised analysis of gene coexpression relationships in primate brains used WGCNA to analyze microarray data from matched human and chimpanzee brain regions (Oldham et al. 2006). This analysis found large gene coexpression modules that corresponded to functionally relevant brain anatomy, including cerebellum, caudate nucleus, cerebral cortex (multiple areas), and primary visual cortex. One particularly interesting result from this study was the identification of a coexpression module that spanned multiple brain regions and was enriched with genes involved in myelination. This observation suggested that the gene coexpression “pattern” captured by this module was related to the number of oligodendrocytes present in each sample; it also suggested that gene coexpression analyses in larger, more homogeneous microarray datasets (e.g., datasets comprising a single brain region) might identify additional modules of coexpressed genes corresponding to specific cell types in the brain.

This hypothesis was formally tested through analysis of gene coexpression relationships in microarray data generated from adult human cerebral cortex, caudate nucleus, and cerebellum with WGCNA (Oldham et al. 2008). In contrast to the study described above (Oldham et al. 2006), which analyzed gene coexpression relationships across human brain regions, this study was the first to analyze gene coexpression relationships within specific human brain regions. The authors determined that many gene coexpression modules were highly preserved between brain regions (Figure 6.5). In addition, a majority of cortical modules were observed in independent datasets produced from unrelated individuals using different microarray platforms. Specific modules were found to be highly enriched with experimentally validated markers of neurobiological cell types, including oligodendrocytes, astrocytes, and neurons (Figure 6.5); other modules possessed characteristics of microglia, neuronal subtypes (including excitatory neurons, parvalbumin+ interneurons, and Purkinje neurons), ribosomes, mitochondria, synaptic function, hypoxic response, and sex differences (Oldham et al. 2008). Thus, this work revealed a fundamental organization to the transcriptomes of human brain regions that had not been previously recognized and reflects the underlying cellular composition of brain tissue.

FIGURE 6.5: Gene coexpression modules in human brain microarray data are highly conserved and enriched for markers of major cell types (adapted from Oldham et al. 2008). (A) Network organization of gene coexpression in human cerebral cortex (CTX, CTX95), caudate nucleus (CN), and cerebellum (CB). CTX95 data were generated using Affymetrix U95Av2 microarrays; all other brain regions were generated using Affymetrix U133A microarrays. Modules of coexpressed genes with significant overlap were assigned the same number (e.g., M9); only select modules are labeled. (B) Markers of oligodendrocytes, astrocytes, and neurons in adult mouse brain (Cahoy et al. 2008) were significantly enriched in M9, M15, and M16, respectively, in all brain regions (Fisher’s exact test; NS = not significant). Expression levels of the top 10 genes ranked by average module membership strength across all brain regions (Oldham et al. 2008) are shown in CTX (C) and CTX95 (F) for M9, CTX (D) and CN (G) for M15, and CTX (E) and CB (H) for M16.
[Panel (A) labels modules M9A, M15A, M16A (CTX); M9B, M15B, M16B (CTX95); M9C, M15C, M16C (CN); M9D, M15D, M16D (CB). Panels (C)-(H) plot expression level by sample for CTX (M9A), CTX (M15A), CTX (M16A), CTX95 (M9B), CN (M15C), and CB (M16D), respectively. Panel (B) values:]

Network | Module | Oligodendrocytes | Astrocytes | Neurons
CTX | M9A | p=8.1e-71 | NS | NS
CTX95 | M9B | p=4.3e-37 | NS | NS
CN | M9C | p=8.8e-66 | NS | NS
CB | M9D | p=5.6e-38 | NS | NS
CTX | M15A | NS | p=2.0e-122 | NS
CTX95 | M15B | NS | p=5.3e-79 | NS
CN | M15C | NS | p=2.5e-61 | NS
CB | M15D | NS | p=6.1e-69 | NS
CTX | M16A | NS | NS | p=8.4e-30
CTX95 | M16B | NS | NS | p=4.0e-15
CN | M16C | NS | NS | p=1.7e-14
CB | M16D | NS | NS | p=5.4e-05

Importantly, the authors of this study determined the strength of module membership for every measured transcript. This quantity, called kME, is defined as the Pearson correlation between the expression pattern of a transcript and a module eigengene (ME), which is defined as the first principal component obtained by singular value decomposition of a coexpression module (Horvath & Dong 2008; Oldham et al. 2008). kME is therefore a natural summary of the extent to which a gene conforms to the characteristic expression pattern of a module. For modules that were present in multiple datasets (including modules enriched with markers of major cell types), this quantity was extremely reproducible (Oldham et al. 2008). For example, the gene C11orf9 was ranked within the top ~0.15% in terms of kME for each of three oligodendrocyte coexpression modules found in adult human cerebral cortex, caudate nucleus, and cerebellum, respectively. As its name (chromosome 11 open reading frame 9) implies, almost nothing was known about this gene. Shortly thereafter, another study identified myelin-gene regulatory factor (MRF), which is the mouse ortholog of C11orf9, as a critical transcriptional regulator required for CNS myelination (Emery et al. 2009). In the absence of MRF in mice, oligodendrocytes failed to myelinate and the mice died of seizures in the third postnatal week (Emery et al. 2009). These results indicate that the expression fidelity (kME) of C11orf9 in oligodendrocytes was predictive of its functional significance in this cell type.
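The definitions of the module eigengene and kME given above translate directly into code. The sketch below is a minimal illustration on simulated data, not the implementation used in the cited studies; the function names are hypothetical, and the sign convention for the eigengene (aligned with the average module profile) is an assumption, since the sign of a singular vector is arbitrary.

```python
import numpy as np

def module_eigengene(module_expr):
    """First principal component (via SVD) of a standardized genes x samples module."""
    z = module_expr - module_expr.mean(axis=1, keepdims=True)
    z /= z.std(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    me = vt[0]
    # Orient the eigengene so that it follows the average module profile.
    if np.corrcoef(me, z.mean(axis=0))[0, 1] < 0:
        me = -me
    return me

def kme(expr, me):
    """Pearson correlation of every transcript (row of expr) with the eigengene."""
    ze = expr - expr.mean(axis=1, keepdims=True)
    zm = me - me.mean()
    return (ze @ zm) / (np.linalg.norm(ze, axis=1) * np.linalg.norm(zm))

# Simulated data: 15 module genes that track a hidden abundance term with
# increasing amounts of noise, plus 5 unrelated genes (40 samples).
rng = np.random.default_rng(5)
abundance = rng.normal(size=40)
noise_sd = np.linspace(0.2, 0.8, 15)
module = abundance + noise_sd[:, None] * rng.normal(size=(15, 40))
unrelated = rng.normal(size=(5, 40))
expr = np.vstack([module, unrelated])

me = module_eigengene(module)
k = kme(expr, me)
print(np.round(k, 2))
```

Ranking transcripts by kME in this way gives the ordering that the text describes for C11orf9, with the least noisy module genes at the top and unrelated genes near zero.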
Collectively, these findings invalidated the commonly held belief that cellular heterogeneity precludes the recovery of cell type‒specific information in microarray data generated from whole brain tissue while simultaneously providing an initial description of the transcriptional programs that distinguish the major cell classes of the human brain. More broadly, these efforts revealed that there are consistent underlying sources of variation in microarray data generated from whole brain tissue. These sources of variation, most notably the relative abundance of different cell types, produce “currents” in the data that are easily recognized by analysis of gene coexpression relationships but invisible to standard analysis of DE. The past several years have produced a burst of studies predicated on analyzing gene

101

coexpression relationships in neurobiological systems (Table  6.3). These studies have confirmed the presence of gene coexpression modules corresponding to major cells classes in brain tissue in mice (Iancu et  al. 2010; Miller et  al. 2010; Mulligan et  al. 2011), rats (Ponomarev et  al. 2010; Winden et  al. 2011), zebra finches (Hilliard et  al. 2012), macaques (Bernard et  al. 2012), and humans (Ben-David & Shifman 2012; Ivliev et  al. 2010; Miller et  al. 2010; Podtelezhnikov et  al. 2011; Ponomarev et  al. 2012; Torkamani et  al. 2010; Voineagu et  al. 2011). Importantly, these studies used a variety of technology platforms and analysis techniques (Table  6.3), highlighting the robust nature of gene coexpression relationships in transcriptomic data derived from whole tissue samples. In addition, many of these studies analyzed gene coexpression relationships in samples derived from human pathological specimens, including neurodegenerative conditions (Miller et  al. 2008; Podtelezhnikov et  al. 2011; Ray & Zhang 2010; Rosen et  al. 2011; Saris et  al. 2009), schizophrenia (Torkamani et  al. 2010), autism (Voineagu et  al. 2011), cancer (Ivliev et  al. 2010), and alcoholism (Ponomarev et al. 2012), along with animal models of stroke (Dabrowski et  al. 2010), posttraumatic stress disorder (Ponomarev et  al. 2010), alcoholism (Mulligan et  al. 2011), and epilepsy (Winden et  al. 2011). These studies are discussed further in the following paragraphs. Miller and colleagues (2008) used WGCNA to study the effects of Alzheimer’s disease (AD) on transcriptome organization in the CA1 region of the hippocampus (Blalock et  al. 2004). The authors identified a number of gene coexpression modules that correlated negatively with AD progression and one that correlated positively with AD progression. By comparing these modules with those identified in human frontal cortex during the course of normal aging (Lu et  al. 
2004), the authors identified coexpression modules related to energy metabolism and synaptic plasticity that were shared between both conditions, along with hub genes that were central to these modules in both aging and AD. Podtelezhnikov and associates (2011) used principal components analysis and supervised hierarchical clustering to analyze patterns of gene activity in samples of prefrontal cortex, visual cortex, and cerebellum from over 600 individuals with AD, Huntington’s disease, or no pathology. The authors identified

four “metagenes” that explained a substantial portion of gene expression variation in these subjects, which related to biological age, AD, inflammation, and neurodegenerative stress, respectively. Ray and coworkers (2010) reanalyzed microarray data generated from LCM neurons from individuals with and without AD (Liang et  al. 2008). The authors used a method called CoExp (Ruan et  al. 2010)  to construct gene coexpression networks in neurons of the entorhinal cortex, hippocampal CA1 region, posterior cingulate cortex, and medial temporal gyrus of individuals with AD. By comparing the resulting coexpression networks, the authors determined that the medial temporal gyrus was less affected by AD pathology than the other analyzed brain regions. Rosen and colleagues (2011) used WGCNA to reveal an unexpected role for the Wnt signaling pathway in an in vitro model of frontotemporal dementia. Given the inaccessibility of human brain tissue, there is growing interest in attempting to discern molecular signatures of neuropathological conditions using peripheral tissues such as blood (Cai et  al. 2010; Coppola et  al. 2011, 2008; Saris et  al. 2009). Saris and colleagues (2009) used WGCNA to analyze transcriptome organization in whole blood of patients with amyotrophic lateral sclerosis (ALS) and healthy controls. The authors identified two large coexpression modules that were associated with ALS and replicated this finding in three independent datasets, suggesting that, at least for ALS, peripheral biomarkers may aid monitoring of disease progression. However, it is not at all obvious how the transcriptomes of such disparate tissues as brain and blood should relate a priori. To address this question, Cai and colleagues (2010) used WGCNA to analyze and compare gene coexpression networks from three human brain datasets with two human blood datasets. This analysis revealed that transcriptome organization was, in general, poorly preserved between the two tissues. 
Nevertheless, several brain coexpression modules did exhibit strong preservation in blood; the authors proposed that this subset of genes might serve as informative peripheral biomarkers for neuropathological conditions. Network methods have also been used to assess potential perturbations in transcriptome organization that may result from neurodevelopmental conditions. Torkamani and coworkers (2010) used a variant of WGCNA with mutual

information as a distance metric to compare gene coexpression networks from prefrontal cortex in individuals with schizophrenia and control subjects. They determined that DE genes tended to localize in neuronal coexpression modules and that modules enriched with genes involved in CNS developmental processes failed to show an age-related decrease in expression in individuals with schizophrenia. Voineagu and associates (2011) used WGCNA to analyze transcriptome organization in frontal and temporal cortices of individuals with autism and control subjects. They identified two coexpression modules associated with autism: one enriched with neuronal markers and the other enriched with micro- and macroglial markers. Of these, the neuronal module was also significantly enriched with gene variants that had been genetically associated with autism via genome-wide association studies (GWAS), providing evidence for convergent molecular pathologies associated with autism spectrum disorders. Ben-David and associates (2012) also used WGCNA to reanalyze 1,340 microarray samples generated by the Allen Brain Institute from two adult human brains. The authors used the resulting coexpression network as a framework in which to localize candidate autism genes identified by GWAS. They identified three coexpression modules that were enriched with autism risk genes, two of which showed substantial overlap with the module identified by Voineagu and colleagues (2011).

Horvath and coworkers (2006) used WGCNA to analyze patterns of gene activity in glioblastoma. They identified a mitosis/cell cycle module that was present in two independent glioblastoma microarray datasets as well as a breast cancer microarray dataset. This module was found to be reminiscent of a molecular signature that was present in fetal human brain but not adult human brain. Within this module, the authors identified genes with the highest connectivity (hub genes) that had not previously been implicated as cancer targets.
One such gene, ASPM, was subsequently validated as a novel potential drug target for glioblastoma via siRNA knockdown in glioblastoma cell lines, which dramatically reduced cell proliferation. Ivliev and colleagues (2010) also used WGCNA to analyze gene coexpression relationships in 790 glioma samples from five independent datasets. The authors found 20 coexpression modules that were present in all five datasets

and enriched with distinct functional categories of genes. Among these, the authors identified a novel “proastrocytic” signature linked to a particular pattern of glioma tumor differentiation.

Ponomarev and coworkers (2012) used WGCNA to compare transcriptome organization in frontal cortex, the central nucleus of the amygdala, and the basolateral nucleus of the amygdala between alcoholics and control subjects. The authors identified distinct effects of alcoholism on cell type‒specific coexpression modules that varied by brain region. Specifically, they observed a downregulation of coexpressed neuronal genes and an upregulation of coexpressed microglial genes in the amygdalar regions of alcoholics. In another study of alcoholism using a mouse model, Mulligan and colleagues (2011) used WGCNA in conjunction with cDNA microarrays to identify alcohol-responsive coexpression modules in a variety of brain regions. Multivariate techniques have also been used to analyze gene coexpression relationships in animal models of stroke (Dabrowski et al. 2010), posttraumatic stress disorder (Ponomarev et al. 2010), fear conditioning (Park et al. 2011), and epilepsy (Winden et al. 2011).

In addition to the studies of neuropathology described in the preceding discussion, a number of neuroscientific studies have analyzed gene coexpression relationships in nonpathological specimens using a variety of data types and techniques. Hawrylycz and associates (2010) performed unsupervised hierarchical clustering of quantified, genome-wide in situ hybridization (ISH) data from the Allen Brain Institute (Lein et al. 2007), identifying groups of genes with similar patterns of areal and laminar enrichment in the adult mouse brain. Later, Hawrylycz and colleagues (2011) introduced two methods to enable correlation-based searches for genes (“Neuroblast”) or voxels (“AGEA”) in the adult mouse brain using the same ISH dataset.
Bernard and associates (2012) applied WGCNA to microarray data to identify genes with similar patterns of areal and laminar enrichment in Rhesus macaque neocortex. WGCNA has also been used to study gene coexpression relationships across a variety of brain regions during human fetal development (Johnson et al. 2009), to compare transcriptome organization in specific neuronal classes from the adult mouse brain (Winden et  al. 2009), to analyze the effects of genetic diversity on gene

expression in the mouse striatum (Iancu et  al. 2010), to compare transcriptome organization between human and mouse brains (Miller et al. 2010), and to identify gene coexpression modules activated by singing in the brains of zebra finches (Hilliard et al. 2012). The studies that have been described here point to an important conclusion:  transcriptomic experiments capture an enormous amount of biological information, most of which remains invisible when seen only through the lens of DE. In light of its myriad advantages, it seems reasonable to expect that analysis of gene coexpression relationships will soon become a standard approach for studying neurobiological transcriptomes. However, as with all techniques, there are caveats. In the next section I discuss important challenges and limitations of coexpression analysis and consider what the future may hold for studies of neurobiological transcriptomes over the coming decade.

CHALLENGES AND LIMITATIONS OF COEXPRESSION ANALYSIS

It is prudent for any investigator conducting an -OMICs study to bear in mind the famous computer science acronym GIGO: garbage in, garbage out. Coexpression analysis can be performed on any transcriptomic dataset of reasonable size; if patterns exist, regardless of the reason, they are likely to be found. Thus it is critical to make every effort to identify and remove nonbiological sources of variation so that they do not confound the interpretation of coexpression modules. In practice, this requires careful attention to data preprocessing, including sample outlier detection and removal, data normalization, and correction for batch effects. Toward this end, Oldham and associates (2012) recently proposed a novel approach to standardizing and streamlining transcriptomic data preprocessing that uses network methods to enable detailed exploration of sample relationships. An advantage of this approach is that it generates a battery of quantitative measures for summarizing the consistency and integrity of sample relationships, which can subsequently be compared across disparate studies, technology platforms, and biological systems (Oldham et al. 2012). Such measures can provide reassurance that the investigator is on solid ground before embarking on analysis of gene coexpression relationships.
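
As a minimal sketch of the sample outlier detection step mentioned above, the following Python code flags samples whose mean correlation with all other samples is unusually low. It is not the procedure of Oldham et al. (2012); the function names, the simulated data, and the cutoff of 2 standard deviations are assumptions made purely for illustration.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_outlier_samples(samples, z_cutoff=-2.0):
    """Return indices of samples with unusually low mean intersample correlation.

    `samples` holds one expression profile (a list of gene values) per sample.
    A sample is flagged when its mean correlation with the other samples,
    standardized across all samples, falls below `z_cutoff` (an illustrative
    threshold, not a recommended value).
    """
    n = len(samples)
    mean_r = []
    for i in range(n):
        rs = [pearson(samples[i], samples[j]) for j in range(n) if j != i]
        mean_r.append(statistics.fmean(rs))
    mu, sd = statistics.fmean(mean_r), statistics.stdev(mean_r)
    return [i for i, r in enumerate(mean_r) if (r - mu) / sd < z_cutoff]


# Hypothetical data: 20 samples share one underlying gene profile plus noise;
# sample 7 is replaced by unrelated values to mimic a failed hybridization.
rng = random.Random(0)
profile = [rng.gauss(8.0, 2.0) for _ in range(200)]
samples = [[g + rng.gauss(0.0, 0.5) for g in profile] for _ in range(20)]
samples[7] = [rng.gauss(8.0, 2.0) for _ in range(200)]

outliers = flag_outlier_samples(samples)
print("Flagged samples:", outliers)
```

In real data the cutoff would need to be chosen with care, and flagged samples should be inspected rather than removed automatically, since a low intersample correlation can also reflect genuine biology.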

What is the appropriate experimental design for a coexpression study? The answer is not straightforward, but there are specific questions to consider. Before designing any coexpression study, the investigator should attempt to determine “What are likely to be the largest sources of variation in gene expression among my samples, and how do these relate to the biological question(s) in which I  am most interested?” It is critical to balance potential technical sources of variation with biological effects of interest so that the two are not confounded. For microarray experiments, the most common technical sources of variation stem from multiple hybridization batches and, for arrays that analyze multiple samples simultaneously (such as Illumina’s), the arrays themselves (Kitchen et al. 2011). For RNA-seq experiments, the technical sources of variation are less well understood, but a number of recent studies have identified a dependence between read coverage and GC content, which is thought to reflect biases inherent to PCR (Aird et  al. 2011; Benjamini & Speed 2012; Risso et  al. 2011; Roberts et  al. 2011; Zheng et al. 2011). Another important question relates to sample size. As the number of samples increases, the likelihood of observing strong correlations (or anticorrelations) that are spurious decreases (Figure 6.6). In general, it is inadvisable to perform gene coexpression analysis on datasets containing fewer than ~20 samples; if this situation is unavoidable, more stringent parameters (e.g., higher similarity thresholds or larger minimum module sizes) should be used to identify coexpression modules. Most transcriptomic datasets that were generated for the purpose of identifying DE genes are not good candidates for gene coexpression analysis, primarily owing to small sample sizes. In addition to the number of samples, the number of sample cohorts will impact the design of coexpression studies. 
The simplest case, involving only one sample cohort, requires construction of only one coexpression network. However, if there are multiple sample cohorts (e.g., control vs. disease), the investigator must decide whether to analyze gene coexpression relationships across all sample cohorts or within each sample cohort separately. In general, the former approach is simpler in that it requires only one instance of data preprocessing and coexpression analysis. As illustrated in Figure  6.3, this “one-network” design can

[Figure 6.6. Density plot: x-axis, Pearson correlation (–1.0 to 1.0); y-axis, density (0 to 4); one curve each for 10, 25, 50, and 100 samples.]

FIGURE  6.6: The likelihood of observing spurious correlations by chance decreases with increasing sample size. Shown here are density plots of Pearson correlations for simulated data (normally distributed random values) consisting of 2,000  “genes” and varying numbers of “samples.” This example illustrates the importance of large sample sizes for analyzing gene coexpression relationships.
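
The effect shown in Figure 6.6 can be reproduced with a short simulation. The sketch below uses only the Python standard library and, to keep the runtime short, draws random pairs of simulated "genes" rather than computing all pairwise correlations among 2,000 genes. The cutoff of 0.5 for a "strong" correlation, the number of pairs, and the random seed are arbitrary assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fraction_strong(n_samples, n_pairs=2000, cutoff=0.5, seed=1):
    """Fraction of independent random gene pairs with |r| >= cutoff.

    Both "genes" in each pair are independent standard normal draws, so
    every strong correlation counted here is spurious by construction.
    """
    rng = random.Random(seed)
    strong = 0
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if abs(pearson(x, y)) >= cutoff:
            strong += 1
    return strong / n_pairs


results = {n: fraction_strong(n) for n in (10, 25, 50, 100)}
for n, frac in results.items():
    print(f"{n:>3} samples: fraction of pairs with |r| >= 0.5 = {frac:.4f}")
```

With few samples a noticeable share of unrelated pairs clears the cutoff, whereas with many samples almost none do, which is the narrowing of the density curves in the figure.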

complement DE analysis by revealing subgroups of DE genes that exhibit distinct patterns of variation within sample cohorts. Alternatively, a “multinetwork” design involves creating separate coexpression networks for each sample cohort. This design may be preferable if the primary objective of the study is to analyze and compare gene coexpression relationships between heterogeneous sample cohorts or if the sample cohorts come from different studies or comprise different technical batches, thereby rendering a one-network design infeasible.

A major challenge for coexpression studies is deciding exactly how to define coexpression modules. Unfortunately there is no universally agreed upon “best” algorithm for defining modules or clusters of coexpressed genes. Indeed, the investigator is confronted with many options, both for defining a measure of similarity among genes, such as various measures of correlation, topological overlap (Ravasz et al. 2002; Zhang & Horvath 2005), mutual information (Meyer et al. 2008; Sales & Romualdi 2011), and the maximal information coefficient (Reshef et al. 2011), and for detecting coexpression modules or clusters, such as k-means clustering (Lloyd 1982), WGCNA (Zhang & Horvath 2005), and CoExp (Ruan et al. 2010). Furthermore, most module

detection algorithms have parameters that must be supplied by the user, and the choice of parameters can have a profound effect on the number of coexpression modules that are found. To establish reproducibility, coexpression modules detected in one dataset should be validated in an independent dataset generated from the same biological system, with the same technology platform, and using the same similarity measure, module detection algorithm, and parameters. By altering algorithmic parameters in a progressive fashion, a single dataset can be analyzed at various levels of coexpression “resolution,” not unlike an in silico microscope for studying molecular relationships. Just as there is no “best” magnification for a tissue section, there is unlikely to be a “best” coexpression network for a given dataset. Rather, the desired resolution of coexpression analysis should reflect the aims of the study. Is the goal to provide an overview of the most salient patterns of gene activity in a biological system? Or is the goal to search for the molecular signature of a rare cell type? The former aim would suggest relatively coarse resolution, while the latter would require probing the finer structure of coexpression relationships. At the same time, it is worth noting that many methods have been proposed to assess the optimal number and quality of modules or clusters in a dataset (Dudoit & Fridlyand 2002; Kapp & Tibshirani 2007; Langfelder et al. 2011; McShane et al. 2002; Newman 2006; Rousseeuw 1987; Tibshirani & Walther 2005). These methods tend to assess the density (strength of connections in a module), separability (distinctness between modules), and/or stability (robustness of modules to artificial noise) of modules (Langfelder et al. 2011). Prominent examples of such methods include the silhouette width (Rousseeuw 1987) and modularity (Newman 2006).
While these methods can be useful in certain contexts, their general applicability to coexpression modules defined by disparate algorithms is questionable. For example, module detection algorithms that identify small, dense groups of interconnected genes (e.g., clique-finding algorithms) can produce “optimal” networks as measured by modularity that consist of a trivially small number of large modules (Fortunato & Barthelemy 2007). While future studies may yield more generalizable module quality statistics, it is likely that the

ultimate arbiter of module quality will remain biology itself:  to the extent that an investigator can attribute real biological meaning to a coexpression module (particularly when it has been observed more than once), it is likely to be valid. Generally speaking, the detection of gene coexpression modules is now a relatively straightforward analytical problem (albeit with many solutions). But if module detection has become the easy problem, module characterization has become the hard problem. Attributing meaning to coexpression modules is fraught with challenge and uncertainty. To answer the simple question, “Why are these genes coexpressed?” often requires a combination of dry and wet lab approaches, including enrichment analysis, exemplar analysis, and histology. Enrichment analysis refers to the use of statistical tests (such as Fisher’s exact test) to determine whether biologically meaningful sets of genes are significantly overrepresented in coexpression modules (e.g., Figure 6.5B). Effectively, any group of genes that have been linked in a biologically meaningful fashion can be used to screen coexpression modules for significant enrichment. Enrichment analyses can be very informative, but their utility depends on the quality of the gene sets and their relevance to the biological system under investigation. Exemplar analysis refers to the use of specific genes as biological “tags” for coexpression modules. Typically such genes have expression profiles and functional traits that are well or at least partially understood. If such a gene exhibits strong membership for a particular module, it is likely that the module relates in some way to the expression profile or functional properties of the exemplar gene. 
For example, by screening modules in a cortical coexpression network for markers of neurobiological cell types (e.g., PVALB, GFAP, CNP, and so on), one can quickly assess whether any modules are likely to be predominantly associated with expression in PVALB+ interneurons, astrocytes, or oligodendrocytes, respectively. For studies of multicellular biological systems, the most obvious point of entry for characterizing coexpression modules is determining which cell type(s) express the genes in the module. While enrichment analyses may reveal which coexpression modules are associated with specific cell types, appropriate gene sets may not be available for all cell types.

Furthermore, many modules may consist of genes that are expressed in multiple cell types or all cell types. Histology therefore remains a critical ingredient, and often a bottleneck, for characterizing coexpression modules. In the absence of reliable antibodies for immunohistochemistry, in situ hybridization (ISH) is the most commonly used technique toward this end. While ISH can in principle detect the expression pattern of any transcript, it is not well suited to high-throughput and quantitative analysis in most laboratories. In this respect, community resources such as the Allen Institute for Brain Science’s Mouse Brain Atlas (Lein et al. 2007), which contains genome-wide ISH data for serial sections through the entire adult mouse brain, can provide critical tools for module characterization. The ability of coexpression analysis of transcriptomic data to discern a molecular signature for a given cell type is likely to depend on many factors, including (1)  the abundance of the cell type, (2) the number of genes that are uniquely or predominantly expressed in that cell type, (3) the ability to reliably detect and quantify the expression of those genes using a given technology platform, (4) the stoichiometry of different cell types, (5) the number of samples, and (6)  the algorithm for identifying coexpression modules. Nevertheless, there are already some indications of the sensitivity of this approach. For example, Oldham and coworkers (2008) used Affymetrix U133A microarrays to identify a coexpression module in human cerebral cortex (n = 67 samples) consisting of genes that are primarily expressed in PVALB+ interneurons, which may comprise ~5% of all cells in adult human cerebral cortex. 
In the same study, the authors analyzed a much smaller dataset consisting of 24 samples from adult human cerebellum and identified a coexpression module corresponding to Purkinje neurons, which are dwarfed by the number of granule neurons and may represent less than 1% of all cells in the cerebellum. The interpretation of gene coexpression relationships is also challenged by the fact that they are undirected (i.e., they lack causality). While some methods for inferring causality from undirected networks have been proposed (Aten et  al. 2008; Opgen-Rhein & Strimmer 2007; Schafer & Strimmer 2005), these methods must ultimately be validated with additional experimental evidence. Such evidence may emerge

from targeted perturbations of gene expression in experimental systems, comparisons of gene expression across discrete time points, or integration with additional types of biological information, such as genetic or protein interaction data (Aten et al. 2008).
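
The enrichment analysis described earlier in this section can be illustrated with a one-sided Fisher's exact test, which for a 2 × 2 table is equivalent to the upper tail of the hypergeometric distribution. The sketch below is a minimal standard-library implementation; the function name is hypothetical, and the gene counts in the usage example are invented for illustration and do not come from any study cited here.

```python
from math import comb


def enrichment_p_value(n_universe, n_module, n_gene_set, n_overlap):
    """One-sided Fisher's exact test p-value (hypergeometric upper tail).

    Probability of observing at least `n_overlap` gene-set members in a
    module of `n_module` genes drawn at random, without replacement, from
    a universe of `n_universe` genes containing `n_gene_set` members.
    """
    max_overlap = min(n_module, n_gene_set)
    total = comb(n_universe, n_module)
    tail = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_module - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 10,000 genes on the platform, a 200-gene module,
# and a 100-gene set of cell type markers. About 2 shared genes are
# expected by chance (200 * 100 / 10,000).
p_enriched = enrichment_p_value(10_000, 200, 100, 15)  # 15 shared genes
p_chance = enrichment_p_value(10_000, 200, 100, 2)     # 2 shared genes

print(f"15 shared genes: p = {p_enriched:.3g}")
print(f" 2 shared genes: p = {p_chance:.3g}")
```

When many gene sets are screened against many modules, the resulting p-values would also need correction for multiple comparisons, which this sketch omits.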

THE NEXT DECADE

In recent years several microarray and RNA-seq studies of neurobiological transcriptomes with very large sample sizes have appeared. Two of these studies (each consisting of more than 1,000 samples) analyzed the spatiotemporal profile of gene expression in the human brain (Colantuoni et al. 2011; Kang et al. 2011) while a third analyzed laser-microdissected layers of macaque neocortex (Bernard et al. 2012). As the cost of RNA-seq continues to decline, it is likely that the trend toward the production of large transcriptomic datasets will continue. The increasing availability of large, high-quality microarray and RNA-seq datasets in well-annotated public repositories such as Gene Expression Omnibus and ArrayExpress will continue to facilitate the analysis and comparison of gene coexpression relationships among various neurobiological conditions.

What trends are likely to emerge as gene coexpression analysis of neurobiological samples becomes routine? In the near term, it is possible that the proliferation of coexpression studies will induce a state of “module overload,” which will in turn precipitate efforts to adopt standardized guidelines and nomenclature for experimental design, module detection, and module characterization. To the extent that coexpression modules can be reproducibly identified and grounded in biological reality, they will present an opportunity to forge a new molecular taxonomy for describing cellular heterogeneity and functional processes in neurobiological systems. Because this taxonomy will be data-driven and multivariate by nature, it will facilitate molecular descriptions of neurobiological systems that are inherently quantitative and robust. Furthermore, by defining gene coexpression module taxonomy in nonpathological human specimens, investigators will be able to establish baselines for comparing coexpression relationships in other sample cohorts.
The comparison of gene coexpression relationships between two or more sample cohorts, sometimes referred to as “differential coexpression analysis” or “differential network analysis,”

constitutes a promising new direction for neuroscientific research. Recently a number of new methods have been proposed to facilitate detection of differentially coexpressed genes (Choi et al. 2005; Oldham et al. 2006; Tesson et al. 2010; Watson 2006; Yu et al. 2011). Compared with DE analysis, differential coexpression analysis exhibits greater sensitivity to outliers, thereby facilitating detection of subtle perturbations in gene expression that may only be present in a small number of samples. More generally, differential coexpression analysis can identify dysregulated gene expression within modules, which may implicate specific neurobiological cell types or functional processes in connection with disease (Ben-David & Shifman 2012).

It is often assumed that coexpressed genes exhibit similar patterns of activity across samples because they are coregulated. However, it is important to emphasize that coexpression and coregulation are not synonymous. Ultimately, coexpression reflects correlated measurements of transcript abundance, which may result from many factors that include but are not limited to transcriptional coregulation. Other factors, such as cellular or subcellular transcript sequestration, mRNA stability, or the rate of translation, will also affect transcript abundance and contribute to observed patterns of coexpression. In the future, as the neuroscientific community begins to coalesce around gene coexpression modules that are routinely identified in neurobiological systems, it will become increasingly important to attempt to discern among these possibilities. One approach toward this end involves searching for short regulatory motifs that are significantly enriched in the DNA sequences of coexpressed genes.
Such motifs are most easily identified in promoter regions (corresponding to binding sites for transcription factors) and the 3′ untranslated regions of mRNAs (corresponding to binding sites for microRNAs or RNA-binding proteins), although they may also be present in intronic sequence or enhancer regions. A  variety of algorithms have been created to identify enrichment of short linear motifs in DNA sequences (for example, Bailey et  al. 2009 and Elemento et  al. 2007). To the extent that robust coexpression modules are repeatedly associated with identical or highly similar candidate regulatory motifs, there is a greater possibility that coexpression in these

modules will predominantly reflect transcriptional coregulation by shared regulatory factors. Although this chapter has focused on the use of multivariate techniques for analyzing gene coexpression relationships in transcriptomic datasets, it is worth noting that similar techniques are also gaining popularity in other realms of neuroscience. In particular, there are many parallels between the analysis of gene expression and brain imaging data suggesting that these two fields may inform one another and find common ground in network language and concepts (Hagmann et  al. 2008; Mumford et  al. 2010; Power et  al. 2011). The brain, after all, is a vast network of connected cells; it seems natural that graph theory, which is the mathematical language of networks, should provide a helpful tool kit for deciphering its organization.
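
To make the idea of differential coexpression discussed earlier in this section concrete, the sketch below compares the Pearson correlation of a single gene pair between two sample cohorts using Fisher's z-transformation. This is a classical textbook test chosen for brevity; it is not one of the differential coexpression methods cited above, and the simulated expression values, cohort sizes, and function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def differential_correlation(r1, n1, r2, n2):
    """Compare two independent correlations via Fisher's z-transformation.

    Returns (z, p), where p is the two-sided p-value under a normal
    approximation. Requires n1 > 3, n2 > 3, and |r1|, |r2| < 1.
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


rng = random.Random(42)
n = 50  # samples per cohort (assumed)

# Simulated control cohort: gene B tracks gene A closely.
ctrl_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
ctrl_b = [a + rng.gauss(0.0, 0.5) for a in ctrl_a]

# Simulated disease cohort: the two genes vary independently, but their
# average levels are unchanged, so a DE test on means would see nothing.
dis_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
dis_b = [rng.gauss(0.0, 1.0) for _ in range(n)]

r_ctrl = pearson(ctrl_a, ctrl_b)
r_dis = pearson(dis_a, dis_b)
z, p = differential_correlation(r_ctrl, n, r_dis, n)
print(f"r (control) = {r_ctrl:.2f}, r (disease) = {r_dis:.2f}, z = {z:.2f}, p = {p:.2g}")
```

A genome-wide analysis would repeat such a comparison over many gene pairs or whole modules and would therefore require correction for multiple testing, which is one reason dedicated methods have been developed.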

CONCLUSIONS AND SUMMARY

It is now quite obvious that substantially more biological information exists than has previously been appreciated in transcriptomic datasets generated from neurobiological systems. In this chapter I argue that this information may be recovered, at least in part, by systematically analyzing gene coexpression relationships. Although elucidating the functional significance of coexpression modules will require a great deal of effort from biologists and bioinformaticians, the importance of modules rests not only in their functional interpretation but also in their reproducibility. Since transcriptome organization in a given biological system is highly reproducible, coexpression modules provide a natural framework for comparisons between species, tissues, and diseased conditions. This framework can reduce dimensionality while simultaneously placing identified gene expression differences within specific cellular and functional contexts. Gene coexpression modules themselves are simply summaries of interdependencies that are already present in the data; in light of the overwhelming evidence that robust and reproducible interdependencies exist, it is no longer defensible to simply analyze each gene in isolation while ignoring this rich tapestry of information.

REFERENCES

Aalto, A.P., & Pasquinelli, A.E. (2012). Small non-coding RNAs mount a silent revolution in gene expression. Curr Opin Cell Biol 24, 333–340.

Aird, D., Ross, M.G., Chen, W.S., Danielsson, M., Fennell, T., Russ, C., . . . Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12, R18.
Albert, R., Jeong, H., & Barabasi, A.L. (2000). Error and attack tolerance of complex networks. Nature 406, 378–382.
Albertson, D.N., Pruetz, B., Schmidt, C.J., Kuhn, D.M., Kapatos, G., & Bannon, M.J. (2004). Gene expression profile of the nucleus accumbens of human cocaine abusers: evidence for dysregulation of myelin. J Neurochem 88, 1211–1219.
Alwine, J.C., Kemp, D.J., & Stark, G.R. (1977). Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes. Proc Natl Acad Sci U S A 74, 5350–5354.
Arlotta, P., Molyneaux, B.J., Chen, J., Inoue, J., Kominami, R., & Macklis, J.D. (2005). Neuronal subtype-specific genes that control corticospinal motor neuron development in vivo. Neuron 45, 207–221.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., . . . et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–29.
Aston, C., Jiang, L., & Sokolov, B.P. (2005). Transcriptional profiling reveals evidence for signaling and oligodendroglial abnormalities in the temporal cortex from patients with major depressive disorder. Mol Psychiatry 10, 309–322.
Aten, J.E., Fuller, T.F., Lusis, A.J., & Horvath, S. (2008). Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Syst Biol 2, 34.
Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., . . . Noble, W.S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37, W202–W208.
Barabási, A., & Oltvai, Z. (2004). Network biology: understanding the cell's functional organization. Nat Rev Genet 5(2), 101–113.
Barbosa-Morais, N.L., Dunning, M.J., Samarajiwa, S.A., Darot, J.F., Ritchie, M.E., Lynch, A.G., & Tavare, S. (2010). A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res 38, e17.
Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215–233.
Ben-David, E., & Shifman, S. (2012). Networks of neuronal genes affected by common and rare variants in autism spectrum disorders. PLoS Genet 8, e1002556.
Benjamini, Y., & Speed, T.P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 40, e72.

Bernard, A., Lubbers, L.S., Tanis, K.Q., Luo, R., Podtelezhnikov, A.A., Finney, E.M., . . . et al. (2012). Transcriptional architecture of the primate neocortex. Neuron 73, 1083–1099.
Blalock, E.M., Geddes, J.W., Chen, K.C., Porter, N.M., Markesbery, W.R., & Landfield, P.W. (2004). Incipient Alzheimer’s disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A 101, 2173–2178.
Bocklandt, S., Lin, W., Sehl, M.E., Sanchez, F.J., Sinsheimer, J.S., Horvath, S., & Vilain, E. (2011). Epigenetic predictor of age. PLoS One 6, e14821.
Cáceres, M., Lachuer, J., Zapala, M.A., Redmond, J.C., Kudo, L., Geschwind, D.H., . . . Barlow, C. (2003). Elevated gene expression levels distinguish human from non-human primate brains. Proc Natl Acad Sci U S A 100, 13030–13035.
Cahoy, J.D., Emery, B., Kaushal, A., Foo, L.C., Zamanian, J.L., Christopherson, K.S., . . . et al. (2008). A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J Neurosci 28, 264–278.
Cai, C., Langfelder, P., Fuller, T.F., Oldham, M.C., Luo, R., van den Berg, L.H., . . . Horvath, S. (2010). Is human blood a good surrogate for brain tissue in transcriptional studies? BMC Genomics 11, 589.
Carter, S., Brechbuhler, C., Griffin, M., & Bond, A. (2004). Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics 20(14), 2242–2250.
Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., & Liu, C. (2011). Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6, e17238.
Choi, J.K., Yu, U., Yoo, O.J., & Kim, S. (2005). Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics 21, 4348–4355.
Cirelli, C., Gutierrez, C.M., & Tononi, G. (2004). Extensive and divergent effects of sleep and wakefulness on brain gene expression. Neuron 41, 35–43.
Colantuoni, C., Lipska, B.K., Ye, T., Hyde, T.M., Tao, R., Leek, J.T., . . . et al. (2011). Temporal dynamics and genetic control of transcription in the human prefrontal cortex. Nature 478, 519–523.
Coppola, G., Burnett, R., Perlman, S., Versano, R., Gao, F., Plasterer, H., . . . et al. (2011). A gene expression phenotype in lymphocytes from Friedreich ataxia patients. Ann Neurol 70, 790–804.
Coppola, G., & Geschwind, D.H. (2006). Technology Insight: querying the genome with microarrays—progress and hope for neurological disease. Nat Clin Pract Neurol 2, 147–158.

Transcriptomics: From Differential Expression to Coexpression Coppola, G., Karydas, A., Rademakers, R., Wang, Q., Baker, M., Hutton, M., . . . Geschwind, D.H. (2008). Gene expression study on peripheral blood identifies progranulin mutations. Ann Neurol 64, 92–96. Dabrowski, M., Dojer, N., Zawadzka, M., Mieczkowski, J., & Kaminska, B. (2010). Comparative analysis of cis-regulation following stroke and seizures in subspaces of conserved eigensystems. BMC Syst Biol 4, 86. Djebali, S., Davis, C.A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., . . .et  al. (2012). Landscape of transcription in human cells. Nature 489, 101–108. Dong, J., & Horvath, S. (2007). Understanding network concepts in modules. BMC Syst Biol 1, 24. Dredge, B.K., Polydorides, A.D., & Darnell, R.B. (2001). The splice of life: alternative splicing and neurological disease. Nat Rev Neurosci 2, 43–50. Dudoit, S., & Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3, RESEARCH0036. Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J Am Stat Assoc 102, 93–103. Elemento, O., Slonim, N., & Tavazoie, S. (2007). A universal framework for regulatory element discovery across all genomes and data types. Mol Cell 28, 337–350. Emery, B., Agalliu, D., Cahoy, J.D., Watkins, T.A., Dugas, J.C., Mulinyawe, S.B., . . . Barres, B.A. (2009). Myelin gene regulatory factor is a critical transcriptional regulator required for CNS myelination. Cell 138, 172–185. Enard, W., Khaitovich, P., Klose, J., Zollner, S., Heissig, F., Giavalisco, P., . . . et al. (2002). Intra- and interspecific variation in primate gene expression patterns. Science 296, 340–343. Everitt, B.S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis.West Sussex,UK:Wiley. Fietz, S.A., Lachmann, R., Brandl, H., Kircher, M., Samusik, N., Schroder, R., . . . Riehn, A., et  al. (2012). 
Transcriptomes of germinal zones of human and mouse fetal neocortex suggest a role of extracellular matrix in progenitor self-renewal. Proc Natl Acad Sci U S A 109, 11836–11841. Fortunato, S., & Barthelemy, M. (2007). Resolution limit in community detection. Proc Natl Acad Sci U S A 104, 36–41. Fraser, H.B., Khaitovich, P., Plotkin, J.B., Paabo, S., & Eisen, M.B. (2005). Aging and gene expression in the primate brain. PLoS Biol 3, e274. Gall, J.G., & Pardue, M.L. (1969). Formation and detection of RNA-DNA hybrid molecules in cytological preparations. Proc Natl Acad Sci U S A  63, 378–383.

109

Ge, X., Yamamoto, S., Tsutsumi, S., Midorikawa, Y., Ihara, S., Wang, S.M., & Aburatani, H. (2005). Interpreting expression profiles of cancers by genome-wide survey of breadth of expression in normal tissues. Genomics 86, 127–141. Griffin, R.S., Mills, C.D., Costigan, M., & Woolf, C.J. (2003). Exploiting microarrays to reveal differential gene expression in the nervous system. Genome Biol 4, 105. Hagmann, P., Cammoun, L., Gigandet, X., Meuli, R., Honey, C.J., Wedeen, V.J., & Sporns, O. (2008). Mapping the structural core of human cerebral cortex. PLoS Biol 6, e159. Hakak, Y., Walker, J.R., Li, C., Wong, W.H., Davis, K.L., Buxbaum, J.D., . . . Fienberg, A.A. (2001). Genome-wide expression analysis reveals dysregulation of myelination-related genes in chronic schizophrenia. Proc Natl Acad Sci U S A  98, 4746–4751. Hartwell, L., Hopfield, J., Leibler, S., & Murray, A. (1999). From molecular to modular cell biology. Nature 402(6761 Suppl), C47–C52. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning.New York:Springer. Heiman, M., Schaefer, A., Gong, S., Peterson, J.D., Day, M., Ramsey, K.E., . . .et al. (2008). A translational profiling approach for the molecular characterization of CNS cell types. Cell 135, 738–748. Hess, K.R., Zhang, W., Baggerly, K.A., Stivers, D.N., & Coombes, K.R. (2001). Microarrays: handling the deluge of data and extracting reliable information. Trends Biotechnol 19, 463–468. Hilliard, A.T., Miller, J.E., Fraley, E.R., Horvath, S., & White, S.A. (2012). Molecular microcircuitry underlies functional specification in a basal ganglia circuit dedicated to vocal learning. Neuron 73, 537–552. Hodges, A., Strand, A.D., Aragaki, A.K., Kuhn, A., Sengstag, T., Hughes, G., . . .et al. (2006). Regional and cellular gene expression changes in human Huntington’s disease brain. Hum Mol Genet 15, 965–977. Horvath, S., Zhang, B., Carlson, M., Lu, K., Zhu, S., Felciano.,... Mischel, P. (2006). 
Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci U S A. Horvath, S. (2011). Weighted network analysis. Applications in genomics and systems biology. New York: Springer. Horvath, S., & Dong, J. (2008). Geometric interpretation of gene coexpression network analysis. PLoS Comput Biol 4, e1000117. Huang, Y., Li, H., Hu, H., Yan, X., Waterman, M., Huang, H., & Zhou. X. (2007). Systematic discovery of functional modules and context-specific functional

110

the OMICs

annotation of human genome. Bioinformatics 23(13), i222–i229. Iancu, O.D., Darakjian, P., Walter, N.A., Malmanger, B., Oberbeck, D., Belknap, J., . . . Hitzemann, R. (2010). Genetic diversity and striatal gene networks: focus on the heterogeneous stock-collaborative cross (HS-CC) mouse. BMC Genomics 11, 585. Ihmels, J., Bergmann, S., Berman, J., & Barkai, N. (2005). Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet 1(3), e39. Ivliev, A.E., 'tHoen, P.A., & Sergeeva, M.G. (2010). Coexpression network analysis identifies transcriptional modules related to proastrocytic differentiation and sprouty signaling in glioma. Cancer Res 70, 10060–10070. Iwamoto, K., Bundo, M., & Kato, T. (2005). Altered expression of mitochondria-related genes in postmortem brains of patients with bipolar disorder or schizophrenia, as revealed by large-scale DNA microarray analysis. Hum Mol Genet 14, 241–253. Iwamoto, K., Kakiuchi, C., Bundo, M., Ikeda, K., & Kato, T. (2004). Molecular characterization of bipolar disorder by comparing gene expression profiles of postmortem brains of major mental disorders. Mol Psychiatry 9, 406–416. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., & Barabasi, A.L. (2000). The large-scale organization of metabolic networks. Nature 407, 651–654. Jeong, H., Mason, S., Barabasi, A., Oltvai, Z. (2001). Lethality and centrality in protein networks. Nature 411(6833), 41–42. Johnson, M.B., Kawasawa, Y.I., Mason, C.E., Krsnik, Z., Coppola, G., Bogdanovic, . . . Sestan, N. (2009). Functional and evolutionary insights into human brain development through global transcriptome analysis. Neuron 62, 494–509. Johnson, W.E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127. Jordan, I.K., Marino-Ramirez, L., Wolf, Y.I., & Koonin, E.V. (2004). 
Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21, 2058–2070. Jung, S.H., & Jang, W. (2006). How accurately can we control the FDR in analyzing microarray data?Bioinformatics 22, 1730–1736. Kafatos, F.C., Jones, C.W., & Efstratiadis, A. (1979). Determination of nucleic acid sequence homologies and relative concentrations by a dot hybridization procedure. Nucleic Acids Res 7, 1541–1552. Kang, H.J., Kawasawa, Y.I., Cheng, F., Zhu, Y., Xu, X., Li, M., . . .et  al. (2011). Spatio-temporal

transcriptome of the human brain. Nature 478, 483–489. Kapp, A.V., & Tibshirani, R. (2007). Are clusters found in one dataset present in another dataset?Biostatistics 8, 9–31. Khaitovich, P., Muetzel, B., She, X., Lachmann, M., Hellmann, I., Dietzsch, J., . . .et al. (2004). Regional patterns of gene expression in human and chimpanzee brains. Genome Res 14, 1462–1473. Kierzek, E. (2009). Binding of short oligonucleotides to RNA: studies of the binding of common RNA structural motifs to isoenergetic microarrays. Biochemistry 48, 11344–11356. Kim, K.I., & van de Wiel, M.A. (2008). Effects of dependence in high-dimensional multiple testing problems. BMC Bioinformatics 9, 114. Kitchen, R.R., Sabine, V.S., Simen, A.A., Dixon, J.M., Bartlett, J.M., & Sims, A.H. (2011). Relative impact of key sources of systematic noise in Affymetrix and Illumina gene-expression microarray experiments. BMC Genomics 12, 589. Kugel, J.F., & Goodrich, J.A. (2012). Non-coding RNAs:  key regulators of mammalian transcription. Trends Biochem Sci 37, 144–151. Langfelder, P., & Horvath, S. (2008). WGCNA:  an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559. Langfelder, P., Luo, R., Oldham, M.C., & Horvath, S. (2011). Is my network module preserved and reproducible?PLoS Comput Biol 7, e1001057. Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree:  the Dynamic Tree Cut package for R. Bioinformatics 24, 719–720. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J., & Pavlidis, P. (2004). Coexpression analysis of human genes across many microarray data sets. Genome Res 14, 1085–1094. Lein, E.S., Hawrylycz, M.J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., . . .et  al.(2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176. Lewohl, J.M., Wang, L., Miles, M.F., Zhang, L., Dodd, P.R., & Harris, R.A. (2000). Gene expression in human alcoholism: microarray analysis of frontal cortex. 
Alcohol Clin Exp Res 24, 1873–1882. Li, J.Z., Meng, F., Tsavaler, L., Evans, S.J., Choudary, P.V., Tomita, H., . . .et al. (2007). Sample matching by inferred agonal stress in gene expression analyses of the brain. BMC Genomics 8, 336. Li, J.Z., Vawter, M.P., Walsh, D.M., Tomita, H., Evans, S.J., Choudary, P.V., . . .et  al. (2004). Systematic changes in gene expression in postmortem human brains associated with tissue pH and terminal medical conditions. Hum Mol Genet 13, 609–616.

Transcriptomics: From Differential Expression to Coexpression Liang, W.S., Dunckley, T., Beach, T.G., Grover, A., Mastroeni, D., Ramsey, K., . . .et al. (2008). Altered neuronal gene expression in brain regions differentially affected by Alzheimer’s disease:  a reference data set. Physiol Genomics 33, 240–256. Lipscombe, D. (2005). Neuronal proteins custom designed by alternative splicing. Curr Opin Neurobiol 15, 358–363. Lloyd, S.P. (1982). Least squares optimization in PCM. IEEE Transactions on Information Theory 28, 129–137. Lobo, M.K., Karsten, S.L., Gray, M., Geschwind, D.H., & Yang, X.W. (2006). FACS-array profiling of striatal projection neuron subtypes in juvenile and adult mouse brains. Nat Neurosci 9, 443–452. Loring, J.F., Wen, X., Lee, J.M., Seilhamer, J., & Somogyi, R. (2001). A gene expression profile of Alzheimer’s disease. DNA Cell Biol 20, 683–695. Lu, T., Pan, Y., Kao, S.Y., Li, C., Kohane, I., Chan, J., & Yankner, B.A. (2004). Gene regulation and DNA damage in the ageing human brain. Nature 429, 883–891. Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., . . .et  al. (2010). A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10, 278–291. Maskos, U., & Southern, E.M. (1992). Oligonucleotide hybridizations on glass supports:  a novel linker for oligonucleotide synthesis and hybridization properties of oligonucleotides synthesised in situ. Nucleic Acids Res 20, 1679–1684. McLachlan, G., Do, K., & Ambroise, C. (2004). Analyzing microarray gene expression data. Hoboken, NJ: Wiley-Interscience. McShane, L.M., Radmacher, M.D., Freidlin, B., Yu, R., Li, M.C., & Simon, R. (2002). Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18, 1462–1469. Mexal, S., Berger, R., Adams, C.E., Ross, R.G., Freedman, R., & Leonard, S. (2006). 
Brain pH has a significant impact on human postmortem hippocampal gene expression profiles. Brain Res1106, 1–11. Meyer, P.E., Lafitte, F., & Bontempi, G. (2008). minet:  A  R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 9, 461. Miki, R., Kadota, K., Bono, H., Mizuno, Y., Tomaru, Y., Carninci, P., . . .et al. (2001). Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci U S A 98, 2199–2204.

111

Miller, J.A., Horvath, S., & Geschwind, D.H. (2010). Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proc Natl Acad Sci U S A 107, 12698–12703. Miller, J.A., Oldham, M.C., & Geschwind, D.H. (2008). A systems level analysis of transcriptional changes in Alzheimer’s disease and normal aging. J Neurosci 28, 1410–1420. Mirnics, K., Middleton, F.A., Marquez, A., Lewis, D.A., & Levitt, P. (2000). Molecular characterization of schizophrenia viewed by microarray analysis of gene expression in prefrontal cortex. Neuron 28, 53–67. Mirnics, K., & Pevsner, J. (2004). Progress in the use of microarray technology to study the neurobiology of disease. Nat Neurosci 7, 434–439. Modrek, B., Resch, A., Grasso, C., & Lee, C. (2001). Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res 29, 2850–2859. Mulligan, M.K., Rhodes, J.S., Crabbe, J.C., Mayfield, R.D., Adron Harris, R., & Ponomarev, I. (2011). Molecular profiles of drinking alcohol to intoxication in C57BL/6J mice. Alcohol Clin Exp Res 35, 659–670. Mumford, J.A., Horvath, S., Oldham, M.C., Langfelder, P., Geschwind, D.H., & Poldrack, R.A. (2010). Detecting network modules in fMRI time series: a weighted network analysis approach. Neuroimage 52, 1465–1476. Newman, M.E. (2006). Modularity and community structure in networks. Proc Natl Acad Sci U S A 103, 8577–8582. Okaty, B.W., Sugino, K., & Nelson, S.B. (2011a). A  quantitative comparison of cell-type-specific microarray gene expression profiling methods in the mouse brain. PLoS One 6, e16493. Okaty, B.W., Sugino, K., & Nelson, S.B. (2011b). Cell type-specific transcriptomics in the brain. J Neurosci 31, 6939–6943. Oldham, M.C., Horvath, S., & Geschwind, D.H. (2006). Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A  103, 17973–17978. 
Oldham, M.C., Konopka, G., Iwamoto, K., Langfelder, P., Kato, T., Horvath, S., & Geschwind, D.H. (2008). Functional organization of the transcriptome in human brain. Nat Neurosci 11, 1271–1282. Oldham, M.C., Langfelder, P., & Horvath, S. (2012). Network methods for describing sample relationships in genomic datasets:  application to Huntington’s disease. BMC Syst Biol 6, 63. Opgen-Rhein, R., & Strimmer, K. (2007). From correlation to causation networks: a simple approximate learning algorithm and its application to

112

the OMICs

high-dimensional plant gene expression data. BMC Syst Biol 1, 37. Owzar, K., Barry, W.T., & Jung, S.H. (2011). Statistical considerations for analysis of microarray experiments. Clin Transl Sci 4, 466–477. Park, C.C., Gale, G.D., de Jong, S., Ghazalpour, A., Bennett, B.J., Farber, C.R., . . .et  al.(2011). Gene networks associated with conditional fear in mice identified using a systems genetics approach. BMC Syst Biol 5, 43. Parmigiani, G., Garrett, E.S., Irizarry, R.A., & Zeger, S.L. (2003). The analysis of gene expression data. New York:Springer-Verlag. Podtelezhnikov, A.A., Tanis, K.Q., Nebozhyn, M., Ray, W.J., Stone, D.J., & Loboda, A.P. (2011). Molecular insights into the pathogenesis of Alzheimer’s disease and its relationship to normal aging. PLoS One 6, e29610. Ponomarev, I., Rau, V., Eger, E.I., Harris, R.A.,  & Fanselow, M.S. (2010). Amygdala transcriptome and cellular mechanisms underlying stress-enhanced fear learning in a rat model of posttraumatic stress disorder. Neuropsychopharmacology 35, 1402–1411. Ponomarev, I., Wang, S., Zhang, L., Harris, R.A., & Mayfield, R.D. (2012). Gene coexpression networks in human brain identify epigenetic modifications in alcohol dependence. J Neurosci 32, 1884–1897. Power, J.D., Cohen, A.L., Nelson, S.M., Wig, G.S., Barnes, K.A., Church, J.A., . . .et  al. (2011). Functional network organization of the human brain. Neuron 72, 665–678. Prabakaran, S., Swatton, J.E., Ryan, M.M., Huffaker, S.J., Huang, J.T., Griffin, J.L., . . .et  al. (2004). Mitochondrial dysfunction in schizophrenia: evidence for compromised brain metabolism and oxidative stress. Mol Psychiatry 9, 684–697, 643. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., & Barabasi, A.L. (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555. Ray, M., & Zhang, W. (2010). Analysis of Alzheimer’s disease severity across brain regions by topological analysis of gene co-expression networks. BMC Syst Biol 4, 136. 
Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., . . . Sabeti, P.C. (2011). Detecting novel associations in large data sets. Science 334, 1518–1524. Risso, D., Schwartz, K., Sherlock, G., & Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L., & Pachter, L. (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol 12, R22.

Rosen, E.Y., Wexler, E.M., Versano, R., Coppola, G., Gao, F., Winden, K.D., . . .et al. (2011). Functional genomic analyses identify pathways dysregulated by progranulin deficiency, implicating Wnt signaling. Neuron 71, 1030–1042. Roth, R.B., Hevezi, P., Lee, J., Willhite, D., Lechner, S.M., Foster, A.C., & Zlotnik, A. (2006). Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, 67–80. Rousseeuw, P. (1987). Silhouettes:  a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20, 53–56. Ruan, J., Dean, A.K., & Zhang, W. (2010). A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Syst Biol 4, 8. Saiki, R.K., Walsh, P.S., Levenson, C.H., & Erlich, H.A. (1989). Genetic analysis of amplified DNA with immobilized sequence-specific oligonucleotide probes. Proc Natl Acad Sci U S A 86, 6230–6234. Sales, G., & Romualdi, C. (2011). Parmigene—a parallel R package for mutual information estimation and gene network reconstruction. Bioinformatics 27, 1876–1877. Saris, C.G., Horvath, S., van Vught, P.W., van Es, M.A., Blauw, H.M., Fuller, T.F., . . .et  al. (2009). Weighted gene co-expression network analysis of the peripheral blood from Amyotrophic Lateral Sclerosis patients. BMC Genomics 10, 405. Schafer, J., & Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21, 754–764. Schena, M., Shalon, D., Davis, R.W., & Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470. Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P.O., & Davis, R.W. (1996). Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci U S A 93, 10614–10619. Sehgal, A., Boynton, A.L., Young, R.F., Vermeulen, S.S., Yonemura, K.S., Kohler, E.P., . . . Murphy, G.P. (1998). 
Application of the differential hybridization of Atlas Human expression arrays technique in the identification of differentially expressed genes in human glioblastoma multiforme tumor tissue. J Surg Oncol 67, 234–241. Shen-Orr, S.S., Tibshirani, R., Khatri, P., Bodian, D.L., Staedtler, F., Perry, N.M., . . . Butte, A.J. (2010). Cell type-specific gene expression differences in complex tissues. Nat Methods 7, 287–289. Shi, L., Reid, L.H., Jones, W.D., Shippy, R., Warrington, J.A., Baker, S.C., . . .et  al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151–1161.

Transcriptomics: From Differential Expression to Coexpression Shirasaki, D.I., Greiner, E.R., Al-Ramahi, I., Gray, M., Boontheung, P., Geschwind, D.H., . . .et  al.(2012). Network organization of the huntingtin proteomic interactome in mammalian brain. Neuron 75, 41–57. Simon, R., Korn, E.L., McShane, L.M., Radmacher, M.D., Wright, G.W., & Zhao, Y. (2005). Design and analysis of DNA microarray investigations. New York, Springer-Verlag. Simone, N.L., Bonner, R.F., Gillespie, J.W., Emmert-Buck, M.R., & Liotta, L.A. (1998). Laser-capture microdissection:  opening the microscopic frontier to molecular analysis. Trends Genet 14, 272–276. Southern, E.M. (1975). Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 98, 503–517. Speed, T.P. (2003). Statistical analysis of gene expression microarray data.Boca Raton, FL:  Chapman and Hall/CRC. Stuart, J.M., Segal, E., Koller, D., & Kim, S.K. (2003). A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249–255. Sugino, K., Hempel, C.M., Miller, M.N., Hattox, A.M., Shapiro, P., Wu, C., . . . Nelson, S.B. (2006). Molecular taxonomy of major neuronal classes in the adult mouse forebrain. Nat Neurosci 9, 99–107. Sultan, M., Schulz, M.H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., . . .et  al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960. Sun, X., Wang, J.F., Tseng, M., & Young, L.T. (2006). Downregulation in components of the mitochondrial electron transport chain in the postmortem frontal cortex of subjects with bipolar disorder. J Psychiatry Neurosci 31, 189–196. Tesson, B.M., Breitling, R., & Jansen, R.C. (2010). DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics 11, 497. Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. J Comput Graph Stat 14, 511–528. 
Torkamani, A., Dean, B., Schork, N.J., & Thomas, E.A. (2010). Coexpression network analysis of neural tissue reveals perturbations in developmental processes in schizophrenia. Genome Res 20, 403–412. Vawter, M.P., Tomita, H., Meng, F., Bolstad, B., Li, J., Evans, S., . . .et  al. (2006). Mitochondrial-related gene expression changes are sensitive to agonal-pH state: implications for brain disorders. Mol Psychiatry 11, 615, 663–679. Voineagu, I., Wang, X., Johnston, P., Lowe, J.K., Tian, Y., Horvath, S., . . . Geschwind, D.H. (2011). Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380–384.

113

Watson, J.D., & Crick, F.H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737–738. Watson, M. (2006). CoXpress: differential coexpression in gene expression data. BMC Bioinformatics 7, 509. Whitney, L.W., Becker, K.G., Tresser, N.J., Caballero-Ramos, C.I., Munson, P.J., Prabhu, V.V., . . . Biddison, W.E. (1999). Analysis of gene expression in mutiple sclerosis lesions using cDNA microarrays. Ann Neurol 46, 425–428. Winden, K.D., Karsten, S.L., Bragin, A., Kudo, L.C., Gehman, L., Ruidera, J., . . . Engel, J., Jr. (2011). A systems level, functional genomics analysis of chronic epilepsy. PLoS One 6, e20763. Winden, K.D., Oldham, M.C., Mirnics, K., Ebert, P.J., Swan, C.H., Levitt, P., . . . Geschwind, D.H. (2009). The organization of the transcriptional network in specific neuronal classes. Mol Syst Biol 5, 291. Yakovlev, A.Y., Klebanov, L.B., & Gaile, D.P. (2013). Statistical methods for microarray data analysis, Vol. 972.New York:Springer. Yang, S., Wang, K., Valladares, O., Hannenhalli, S., & Bucan, M. (2007). Genome-wide expression profiling and bioinformatics analysis of diurnally regulated genes in the mouse prefrontal cortex. Genome Biol 8, R247. Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., & Speed, T.P. (2002). Normalization for cDNA microarray data:  a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30, e15. Yauk, C.L., Berndt, M.L., Williams, A., & Douglas, G.R. (2004). Comprehensive comparison of six microarray technologies. Nucleic Acids Res 32, e124. Yu, H., Liu, B.H., Ye, Z.Q., Li, C., Li, Y.X., & Li, Y.Y. (2011). Link-based quantitative methods to identify differentially coexpressed genes and gene pairs. BMC Bioinformatics 12, 315. Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4, Article 17. 
Zhang, J., Finney, R.P., Clifford, R.J., Derr, L.K., & Buetow, K.H. (2005). Detecting false expression signals in high-density oligonucleotide arrays by an in silico approach. Genomics 85, 297–308. Zhang, Y., De, S., Garner, J.R., Smith, K., Wang, S.A., & Becker, K.G. (2010). Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC Med Genomics 3, 1. Zheng, W., Chung, L.M., & Zhao, H. (2011). Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics 12, 290.

7 High-Throughput RNA Interference as a Tool for Discovery in Neuroscience
LISA P. ELIA AND STEVEN FINKBEINER

INTRODUCTION

The discovery of RNA interference (RNAi), a potent endogenous posttranscriptional gene-silencing mechanism induced by double-stranded RNAs (dsRNA) and mediated by the small noncoding RNAs (sRNA) derived from them, has paved the way for a powerful and robust technology that allows scientists to causally link gene function to cellular processes. RNAi pathways occur in a wide range of organisms, from plants (Hamilton et al. 1999), fungi (Romano et al. 1992), nematodes (Fire et al. 1998; Montgomery et al. 1998), insects (Elbashir et al. 2001; Pal-Bhadra et al. 1997), and rodents (Okamura et al. 2008) to humans (Cullen 2004; Elbashir et al. 2001). RNAi pathways help to define endogenous programs that operate during the development of many organisms, or they provide host defense by repressing unwanted gene expression introduced via exogenous genomes. The field of RNAi has grown rapidly over the past decade, and a number of excellent reviews discuss the multiple sRNA molecules that have been identified, their associated proteins, the mechanisms leading to RNAi, and the pathways it affects (Czech & Hannon 2011; Ketting 2011; Pais et al. 2011).

These sRNAs share common features: a length of ~20 to 30 nucleotides (nt), 5´ monophosphate and 2-nt 3´ overhang terminal structures, and the ability to associate with members of the Argonaute protein family. They differ, however, in their functional outcomes. MicroRNAs (miRNAs), which lead to silencing via mRNA destabilization or translation inhibition, each regulate large sets of genes and help to define developmental gene expression patterns. The piwi-interacting RNAs (piRNAs), which silence genes via target RNA cleavage or chromatin alterations, are involved in germ cell specification and differentiation as well as transposon silencing. The small inhibitory RNAs (siRNAs) and short hairpin RNAs (shRNAs), which also lead to gene silencing via target RNA cleavage, play roles in gene expression patterning and virus resistance. siRNAs and shRNAs can be generated in vitro, introduced into cells via transfection or infection, and used as RNAi inducers for small loss-of-function studies as well as large-scale genomic screens. The focus of this chapter is on the use of RNAi in large-scale high-throughput screens for discovery in neuroscience.

THE MECHANISM OF RNAI

Early studies in Caenorhabditis elegans demonstrated that only a few dsRNAs introduced into cells can induce the specific degradation of a cognate population of target mRNAs and prevent their cytoplasmic accumulation via a then-unknown catalytic mechanism (Fire 1999; Montgomery et al. 1998). Subsequent studies performed in C. elegans, Drosophila, and mammalian model systems revealed that dsRNA-induced gene silencing occurs via a multistep process involving sRNA biogenesis, strand selection, loading into the RNA-induced silencing complex (RISC) via Argonaute proteins, target recognition, and effector function (Carthew & Sontheimer 2009; Czech & Hannon 2011; Ketting 2011; Kim et al. 2009).

Most of our understanding of the conversion of dsRNA into siRNAs is based on studies performed in the Drosophila system. Recent work identified endogenous sources of dsRNA in Drosophila that include transcripts with extensive hairpin structure, convergent transcription units, or molecules formed from the pairing and annealing of sense and antisense RNAs from unlinked loci (Czech et al. 2008; Okamura et al. 2008; reviewed in Czech & Hannon 2011). There is some evidence that similar endogenous dsRNA sources, such as transcripts that form hairpin structures or long transcripts arising from pseudogenes, exist in some mammalian cells (mouse oocytes, mouse embryonic stem [ES] cells) and give rise to siRNAs (Babiarz et al. 2008; Tam et al. 2008; Watanabe et al. 2008).

Biogenesis of siRNAs within a cell begins with the processing of a long dsRNA into small RNA duplexes (siRNAs) 21 to 23 nucleotides in length by a protein known as Dicer (Carmell & Hannon 2004). Dicer family proteins are dsRNA-specific ribonuclease III (RNase III) enzymes found in all organisms exhibiting RNAi, and these proteins are also involved in the biogenesis of miRNAs. Some organisms, such as Drosophila or plants, have multiple forms of Dicer, whereas others, such as C. elegans and mammals, have only one form (Bernstein et al. 2003; Filipowicz et al. 2005; Lee et al. 2004; Rossi 2005). After Dicer processing, the resulting siRNAs are unwound by an ATP-dependent helicase, and one strand, the guide strand (antisense), is selected for loading into the RISC via binding to Argonaute proteins, while the other strand, the passenger strand (sense), is cleaved and destroyed (Matranga et al. 2005; Rand et al. 2005). It is important that the guide strand, rather than the passenger strand, is selectively stabilized with the appropriate Argonaute protein: a misloaded passenger strand can direct the destruction of incorrect target mRNAs, which contributes to off-target effects in experiments. The thermodynamic properties of the siRNA guide strand appear to be the main force driving correct and selective loading into RISC (Czech & Hannon 2011; Khvorova et al. 2003; Schwarz et al. 2003). Once bound to the cognate guide strand in RISC, the target mRNA undergoes endonucleolytic cleavage at a site 10 nucleotides upstream of the 5´ end residue of the siRNA-target mRNA duplex and is destroyed, resulting in cognate gene silencing (Elbashir et al. 2001; Hannon 2002; Ketting 2011; Rossi 2005).
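The strand relationships described above can be made concrete with a small sketch. It assumes a hypothetical 21-nt target site that pairs with the guide strand along its full length, and it places the cut 10 nucleotides from the end of the target site that pairs with the 5´ end of the guide. The function names and the example sequence are illustrative only and are not taken from the chapter or from any RNAi design tool.

```python
# Illustrative sketch only: the function names and the example sequence
# are hypothetical, and full-length perfect pairing is assumed.

_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence, written 5'->3'."""
    return rna.upper().translate(_COMPLEMENT)[::-1]


def guide_strand(target_site: str) -> str:
    """Guide (antisense) strand that is perfectly complementary to the site."""
    return reverse_complement(target_site)


def cleave_target(target_site: str) -> tuple[str, str]:
    """Split the target site at the assumed RISC cleavage position.

    The 5' end of the guide pairs with the 3' end of the target site, so
    the cut is placed 10 nucleotides in from the 3' end of the site.
    """
    if len(target_site) <= 10:
        raise ValueError("target site must be longer than 10 nt")
    cut = len(target_site) - 10
    return target_site[:cut], target_site[cut:]


target = "AUGGCUAGCUAGGCUAACGUA"  # hypothetical 21-nt site in an mRNA
guide = guide_strand(target)
five_prime_fragment, three_prime_fragment = cleave_target(target)

print("guide strand (5'->3'):", guide)
print("5' cleavage fragment: ", five_prime_fragment)
print("3' cleavage fragment: ", three_prime_fragment)
```

The sketch leaves out Dicer processing, the 2-nt 3´ overhangs, and the thermodynamic asymmetry that biases strand selection. It shows only that the guide is the reverse complement of its target and where the cut falls relative to the 5´ end of the guide.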


FROM MECHANISM TO APPLICATION: RNAI-BASED HIGH-THROUGHPUT SCREENING (RNAI HTS) IN NEUROSCIENCE

The first large-scale genomic screen was performed in C. elegans. Fraser and colleagues systematically assessed loss-of-function phenotypes for all predicted genes along chromosome I using a library of dsRNA-expressing bacteria that could be fed to the worms; the ingested dsRNAs were then processed into siRNAs to achieve RNA-induced gene silencing as described above (Fraser et al. 2000). The same group went on to perform a large-scale RNAi screen of the full C. elegans genome (Kamath et al. 2003). Interestingly, they found that neuronal genes of C. elegans were somewhat resistant to RNAi: only a fraction of the known neuronal genes gave detectable RNAi phenotypes (see the following paragraphs). An impressive number of RNAi HTSs have been performed in the 12 years since then, ranging from in vivo whole-animal screens in C. elegans and Drosophila to screens performed in cultured cells from both Drosophila and mammals. These screens have covered a diverse range of basic and specialized cellular processes, such as cell growth, survival, morphology, adhesion, signal transduction, responses to specific pathogens or toxic agents and stressors, mitochondrial function, RNA biology, and cancer biology; the list continues to grow. We direct the reader to excellent reviews that provide a more in-depth discussion of past screens encompassing many of the processes listed above (Mohr et al. 2010; Moffat et al. 2006).

We now focus our discussion on screens involving neuroscience. A growing list of published neuroscience-oriented screens describes invertebrate and mammalian neuronal model systems. These studies identified new genes and new roles for known genes involved in various aspects of nervous system development, function, and disease (e.g., neuronal specification, dendritic arborization, neurite outgrowth, synapse development and function, nociception, neurodegeneration). Many of these screens are listed in Table 7.1; we highlight a few of them here.

TABLE 7.1. NEUROBIOLOGY-ORIENTED RNAI SCREENS IN INVERTEBRATE AND VERTEBRATE MODEL SYSTEMS

Neurological Issue Addressed | Species/Cell Type | Screen Assay | RNAi-Inducer/Library | Primary Hits | Secondary Hits | Validation Strategy | References
Neural stem cell self-renewal | Drosophila/neuroblasts in vivo | Neuroblast lethality/imaging | Transgenic dsRNA fly lines; library of 595 genes selected from transcriptional profiling of neuroblasts | 195/595 | 84/195 | (a) Independent dsRNAs rescreened (b) Related assay | Carney et al. 2012
Embryonic nervous system development | Drosophila/embryo nervous system ventral nerve cord (VNC) and peripheral nervous system (PNS) in vivo | Imaging | Injected dsRNA; library of >3,000 genes representing ~25% of whole genome | Not reported | 43/3,314; 22/3,998 | (a) Independent screen replicates (b) Independent dsRNAs rescreened | Ivanov et al. 2004; Koizumi et al. 2007
Neuronal specification | C. elegans/ASE neurons | Imaging | dsRNA/library of 15,395 gene-specific, dsRNA-expressing feeding bacteria strains | 748/15,395 | 245/748 | (a) Independent screen replicates (b) Related tissue- and cell-specific assays | Poole et al. 2011
Dendrite morphogenesis | Drosophila/embryos/class I sensory neurons in vivo | Imaging | Injected dsRNA; library covering 730 transcriptional regulators | Not reported | 76/730 | (a) Independent screen replicates (b) 27 of 32 candidates phenocopy loss-of-function mutant alleles | Parrish et al. 2006
Neurite outgrowth | Drosophila/cultured primary neural cells | Live-cell imaging; digital image analysis tools to quantify morphological RNAi phenotypes | dsRNA; whole genome library covering 21,300 annotated genes | 336/21,300 strong phenotypes; 2,106/21,300 weak or moderate phenotypes | 104/125 | (a) Independent screen replicates (b) One candidate phenocopies loss-of-function mutant allele (c) Mamm. ortholog phenocopies in embryonic mouse brains | Sepp et al. 2008
Neurite outgrowth, neurite retraction, growth cone collapse | Human/SH-SY5Y neuronal cell line | High-content automated imaging (Cellomics Kinetic Scan Reader) | siRNA/Ambion arrayed library of 750 human kinases (3 indep. siRNAs per gene) | Not reported | 59/750 shortened neurite length; 66/750 abnormally long neurite length; 79/750 inhibit LPA-induced growth cone collapse/neurite retraction | (a) Independent screen replicates (b) Candidate orthologous siRNAs tested in rat primary neuron cultures, or in a retinal degeneration Drosophila model | Loh et al. 2008
Neurite outgrowth | Human/SH-SY5Y neuronal cell line | Imaging via microarray scanner | siRNA/siArray (Dharmacon) library of 85 human tyrosine kinases; four subarrays for each siRNA screened | Not reported | 9/85 alter neurite extension; follow-up on 1 target gene, twinfilin-2 (Twf2) | (a) Correlated RNAi phenotype and protein level knockdown (b) cDNA overexpression in neuronal cultures had opposing phenotype | Yamada et al. 2007
Integrin-dependent neurite outgrowth | Human/SH-SY5Y neuronal cell line | Imaging; FACS analysis | siRNA/library of 43,000 siRNAs targeting 8,500 human genes cloned into an FIV-based pSIF1-H1,GFP vector | Not reported | 39/8,500 genes | (a) Independent screen replicates (b) Bioinformatics analysis | Ossovskaya et al. 2009
Axon outgrowth | C. elegans/motorneurons and interneurons of the motor circuit | Imaging (fluorescent-reporter-labeled neurons) | dsRNA/library of Chromosome I and III genes, expressed in 4,577 bacterial feeding strains | Not reported | 93/4,577 | (a) Independent screen replicates (b) Related assay | Schmitz et al. 2007
Glutamatergic and GABAergic synapse development | Rat/cultured primary hippocampal neurons | Imaging (confocal microscopy) | Diced siRNAs/600 genes selected after transcriptional profiling for transcripts that change during synapse development in rodent hippocampus or in response to activity; screened in pools targeting 1 to 4 genes | 160/600 genes (dsiRNAs reduced synapse density) | 4/160 genes | (a) Independent screen replicates (b) Deconvoluted siRNA pools using individual shRNAs (c) Rescue expt. for 1 gene using an shRNA-resistant cDNA (d) RNAi phenotypes in related synapse assays | Paradis et al. 2007
NMJ synapse development | Drosophila/postmitotic neurons of the neuromuscular junction (NMJ) | Imaging (confocal microscopy) | dsRNA/library consisting of 2092 inducible, transgenic lines (1,970 genes), all with vertebrate orthologs | 304/2092 lines | 158/304 NMJ synapse phenotypes | (a) Independent screen replicates (b) One gene (NudE) validated using traditional loss-of-function genetic mutant fly line | Valakh et al. 2012
NMJ synapse development | C. elegans/whole embryos; further characterization performed for DA motor neurons of the NMJ | Plate assay for resistance to aldicarb-induced paralysis | dsRNA/library of 2,072 gene-specific dsRNA-expressing feeding bacterial strains | Not reported | 185/2,071 | (a) Two parallel screens run in different strains with varying sensitivity to aldicarb (b) Validation with alternate loss-of-function mutant allele | Sieburth et al. 2005
GABA synaptic transmission | C. elegans/young adults | Plate assay/hypersensitivity to aldicarb-induced paralysis | dsRNA/library of 2,072 gene-specific, dsRNA-expressing feeding bacteria strains (Sieburth et al. 2005) | 129/2072 | 90/129 | (a) Rescreened with more potent 2-Gen RNAi (b) Alternate loss-of-function strategy (mutant allele or drug) phenocopies RNAi phenotype | Vashlishan et al. 2008
Nociception | Drosophila/young adults | Behavioral assay/impaired noxious temperature avoidance | dsRNA/transgenic fly lines expressing dsRNAs; 11,664 different genes screened | Not reported | 580/11,664 | (a) Independent screen replicates (b) Rescreening with other indep. dsRNAs (c) Generated and analyzed a knockout mouse for one ortholog | Neely et al. 2010
Spinal muscular atrophy (SMA) | C. elegans/larva | Growth/body size assay; secondary assay for NMJ pharyngeal muscle function | Ahringer bacterial clone feeding library consisting of 16,500 C. elegans genes | 31/15,600 | 4/15,600 | (a) Independent screen replicates (b) Other NMJ functional assays (c) Mutant alleles phenocopy RNAi (d) Ortholog mutant alleles analyzed | Dimitriadi et al. 2010
Amyotrophic lateral sclerosis (ALS) | C. elegans/transgenic worms expressing human SOD1(G85R)-YFP mutant allele | Visual fluorescence assay looking for increased number and intensity of fluorescent-labeled mutant protein (G85R-YFP) inclusions | RNAi feeding library of 16,757 bacterial clones | 88/16,757 enhanced inclusion formation | 11/12 | (a) Mutant alleles crossed to the parent mutant SOD1(G85R)-YFP strain phenocopy the RNAi enhanced aggregation | Wang et al. 2009
Alzheimer’s disease (AD) | Human/H4 neuroglioma cell line overexpressing four repeat tau protein (4R0N) | High content immunofl. assay to quantitate changes in phosphorylated tau (12E8) vs. total tau levels | Qiagen kinome library consisting of 572 human kinases | 17/572 reduced phospho-tau but not total tau levels | 2/17 retested | (a) Repeat with 2 siRNAs/target kinase performed in triplicate (b) Western blot assay to analyze phospho-tau levels | Azorsa et al. 2010
Tauopathy/frontal temporal lobar degeneration, Ch. 17-linked (FTLD-17) | C. elegans/transgenic lines expressing normal or FTLD-17 mutant human tau alleles | Behavioral assay to find modulators of tau-induced uncoordinated (Unc) locomotion | RNAi feeding library of 16,757 bacterial clones | 1,217/16,757 enhanced severity of Unc phenotype | 75/1,217 enhance Unc phenotype; 60/75 enhance tau-induced Unc phenotype | (a) Secondary assays for Unc phenotypes (b) Mutant alleles phenocopied the RNAi phenotype | Kraemer et al. 2006
Parkinson’s disease (PD) | C. elegans/transgenic strain overexpressing human α-synuclein | Visual screen for modifiers that enhanced age-dependent aggregation of α-syn::GFP in the body wall muscles | RNAi feeding library consisting of 868 candidate genes, with mammalian homologs, associated with familial PD genes and protein degradation pathways | 111/868 were lethal and excluded; 125/757 enhanced aggregation of α-syn | 20/757 enhanced aggregation at an early age; 5/7 analyzed for neuro-protection when cDNAs coexpressed with α-syn over-expression | (a) Primary 125 were confirmed in screen replicates (b) Secondary screen for early age phenotypes (c) cDNAs showed neuro-protection | Hamamichi et al. 2008
Huntington’s disease (HD) | Drosophila/S2 cell lines stably expressing copper-inducible huntingtin exon1-46Q-eGFP | Automated microscope-based screen for aggregate formation and automated quantification of cell number and aggregates with MetaMorph analytic software | Genome-wide library of 24,000 dsRNAs | 463/24,000 | 126/463 | (a) Re-synthesized and retested dsRNAs in aggregation assay (b) Alternate luciferase reporter assay to exclude nonspecific effects on the metallothionein promoter (c) Second set of dsRNAs targeting other regions tested in aggregation assay | Zhang et al. 2010

Neuronal Differentiation, Neurite Outgrowth, and Synapse Development

Several RNAi screens have been used to identify genes involved in different aspects of neurodevelopment, from the early stages of nervous system development, neuronal stem cell maintenance, and neuronal differentiation to how connectivity is established. Screens examining changes in neurite outgrowth and dendritic arborization have been performed in invertebrate whole animal and cultured cell systems and extended to mammalian cultured cells. In one study published by the Jan lab, a sublibrary of dsRNAs targeting transcription factors (TFs) was screened for dendritic morphology phenotypes in a confocal microscope assay using live embryos engineered to express green fluorescent protein (GFP) specifically and highly in Drosophila class I dendrite arborization (da) neurons (Parrish et al. 2006). Three functional RNAi phenotype classes that affected various aspects of dendrite development were identified. Group A genes had concerted effects on both primary dendrite outgrowth and branching, with RNAi against 19 genes leading to reduced arborization and against 20 genes leading to increased arborization. Some of the genes within the group (such as Su(z)12, esc, E(z), Rpd3, nervy, and Caf1) have known functional links to hox genes. Group B genes had opposing effects on primary dendrite outgrowth and branching, with several dsRNAs resulting in overextension of primary dendrites and reduced or absent branching, suggesting that these two processes partially antagonize one another. The third group, group C, identified 10 genes that function to correctly target the da neuron primary dendrites toward the dorsal midline; RNAi of these 10 TFs disrupted routing and produced misplaced or disoriented dendritic arbors.
Validation of many of the gene candidates from the screen was performed using available loss-of-function (LoF) alleles and identifying on-target candidate genes as those with LoF alleles that phenocopied the RNAi phenotypes. Many of the genes identified in the screen have not been implicated in dendritic development and many of the genes have mammalian homologs, which opens the possibility of further studies in mouse mutant models. More recently, a larger-scale genome-wide RNAi screen was performed in cultured


primary Drosophila neural cells to identify genes important for axon outgrowth. Using automated live-cell high-content imaging partnered with quantitative image analysis with specific algorithms designed to assess statistically significant morphological defects, Sepp and associates identified phenotypic classes of neurons with excessive axonal branching, defasciculation, blebbing, reduced outgrowth, and finally cell loss (Sepp et al. 2008). New roles for genes falling within a variety of functional categories were revealed, including genes encoding known proteins involved in protein and vesicle trafficking, cytoskeletal function, signal transduction, ion channels, and receptors. One gene, Sec61α, was further validated through in vivo analysis of LoF hypomorphic alleles; phenotypes of axon defasciculation, and excessive branching observed in mutant flies phenocopied the RNAi-induced defects. Another gene, RanGTPase, was further validated using a cross-species analysis. Ran shRNA constructs were electroporated into the lateral ventricles of E14 embryonic mouse brains and transfected cortices were dissected later as explants or dissociated cultures. Imaging showed that the Ran RNAi constructs recapitulated the axonal blebbing and increased arborization phenotypes observed in the original Drosophila screen. Neurite outgrowth screens performed in mammalian cells primarily have been restricted to neuronal cell lines, such as SH-SY5Y cells. One study performed two independent parallel genome-wide functional genetic screens with the same assay platform (Ossovskaya et al. 2009). One screen was performed with a retroviral library of genetic suppressor elements (GSEs) and the second with a lentiviral siRNA library to identify repressors of β1 integrin‒ dependent neurite outgrowth. 
Interestingly, the two screens complemented each other as they independently identified components of the transcriptional repressor MTG8 (ETO)/MTGR1 protein complex, suggesting that this complex is a significant regulator of β1 integrin‒dependent neurite outgrowth. In the study by Yamada and coworkers, a tyrosine kinase siRNA library was arrayed onto glass microscope slides to screen in SH-SY5Y cells for genes regulating neurite extension (Yamada et  al. 2007). Fluorescent images were collected by an array scanner and analyzed using an area-based algorithm to quantify neurite outgrowth. Validation was performed for one of the genes identified,


twinfilin-2 (Twf2), by showing that cDNA overexpression in either differentiated PC12 cells or cultured primary cortical neurons promotes neurite extension, the opposite of the RNAi screen phenotype, in which knockdown of Twf2 reduced neurite outgrowth. One of the first successful large-scale screens in neuroscience was performed by Sieburth and colleagues to identify genes required for the development or function of a model synapse, the neuromuscular junction (NMJ) (Sieburth et al. 2005). Previously, RNAi resistance had been observed in the worm nervous system, possibly because of a limitation in one or more components of the RNAi machinery or dsRNA instability in the worm nervous system (Timmons et al. 2001). To improve the effectiveness of RNAi in worm neurons, the group first identified a C. elegans mutant strain (eri-1; lin-15B) in which neurons were partially reverted to a more germline-like state via the lin-15B mutation. Germline cells are more responsive to RNAi, and the eri-1 mutation suppresses siRNA degradation. The screen was designed to identify synaptic components of the NMJ, which were assessed by acquired resistance to aldicarb due to decreased acetylcholine (ACh) neurotransmission. Aldicarb is an acetylcholinesterase inhibitor that normally causes accumulation of ACh at NMJs, leading to paralysis and death in the worm. The screen identified more than 100 genes important for different aspects of synaptic structure and function, such as genes involved in synaptic vesicle recycling, neurotransmitter release, and synaptic active zone formation and structure. One of the first RNAi screens performed in mammalian cultured primary hippocampal neurons also sought to identify components important for synapse formation (Paradis et al. 2007).
The group prepared a library of diced siRNAs targeting 600 genes that were first identified by transcriptional analysis as showing changes in expression during the time period of synapse development in the rodent brain. The diced siRNAs were generated from recombinant Dicer-based in vitro digestion of dsRNAs derived from mRNAs isolated from embryonic and early postnatal rodent cortex and hippocampus. The library was screened in a pooled format (one to four target genes/pool), with siRNAs transfected into dissociated hippocampal neurons that had been grown for four days in vitro (DIV) and imaged at 14 DIV by

immunostaining for the synaptic components synapsin and PSD95 to visualize and quantify synaptic density. Positive hits were identified as pools that caused a significant change in synaptic density compared to control, and these were subsequently retested in multiple replicates. Hits were deconvolved by transfecting neurons with multiple shRNAs against each of the target genes in the pools and identifying the genes that replicated the original RNAi phenotype of reduced synaptic density. Genes that had known roles in cell adhesion or pathfinding as well as activity-regulated genes were identified. Rescue experiments were performed with RNAi-resistant cDNAs for two of the genes, cadherin-11 and cadherin-13. Both genes were confirmed to play roles as positive regulators of glutamatergic synapse development.
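The pool deconvolution step described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the procedure used by Paradis and colleagues: the gene names, the retest outcomes, and the `min_confirming` threshold are all hypothetical, chosen to show how a hit pool is resolved into individual genes.

```python
def deconvolve_pool(pool_genes, shrna_outcomes, min_confirming=2):
    """Return the genes in a hit pool whose individual shRNAs reproduce the phenotype.

    pool_genes     -- genes targeted by one pooled well (one to four per pool in the text)
    shrna_outcomes -- hypothetical mapping: gene -> list of booleans, one per
                      independent shRNA, True if it reproduced the pool phenotype
    min_confirming -- assumed threshold, not a value taken from the source
    """
    confirmed = []
    for gene in pool_genes:
        outcomes = shrna_outcomes.get(gene, [])
        # Count how many independent shRNAs phenocopied the original pool.
        if sum(outcomes) >= min_confirming:
            confirmed.append(gene)
    return confirmed


# Hypothetical pool of three genes, each retested with three shRNAs.
hit_pool = ["geneA", "geneB", "geneC"]
retest = {
    "geneA": [True, True, False],
    "geneB": [False, False, True],
    "geneC": [False, False, False],
}
print(deconvolve_pool(hit_pool, retest))
```

Requiring more than one independent shRNA to agree is a common guard against off-target effects; the appropriate threshold depends on the library and assay.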

Neurodegenerative Disease

High-throughput RNAi screening was recently used to identify genetic modifiers of pathogenesis in models of Alzheimer’s disease (AD), Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS), spinal muscular atrophy (SMA), and polyglutamine (polyQ)-expansion diseases, such as Huntington’s disease (HD) (Table 7.1). The majority of these screens to date have been performed in either C. elegans or Drosophila models; very few have been performed in mammalian systems. Kraemer and associates used a transgenic C. elegans strain expressing normal or FTDP-17-associated mutant human tau alleles (301L or 337M) to identify genes that modulated the uncoordinated (Unc) locomotion behavioral phenotype induced by these mutations (Kraemer et al. 2006). They identified 60 genes that specifically enhanced the tau mutation-induced Unc phenotype, 38 of which have known human homologs. Six of the genes (WNT2, TTBK2, GSK3β, MARKK, CSTE, and CHRNA7) have been implicated in tau pathology in human or animal models of disease. An RNAi screen performed by Azorsa and associates in an H4 neuroglioma cell line overexpressing four-repeat tau (4R0N) looked specifically for kinases that lead to phosphorylation of tau at Ser262, a site that is hyperphosphorylated in AD, resulting in microtubule instability and the formation of neurofibrillary tangles (Azorsa et al. 2010). They screened a commercial library targeting 572 human kinases using a high-throughput immunofluorescence assay designed to detect and quantify changes in the ratio of phosphorylated tau vs. total tau. Seventeen of the 572 kinases were identified as significantly changing phosphorylated tau levels but not total tau levels. An siRNA against one kinase, MARK2, which was used as a positive control for tau phosphorylation, was reisolated from the library, providing validation that the screen works. Two additional kinases, AKAP13 and DYRK1A, were validated by a western blot assay, demonstrating modulation of tau phosphorylation levels. RNAi screens for HD, PD, ALS, and SMA have all been performed in invertebrate model systems. For example, Lejeune and colleagues used a C. elegans transgenic strain expressing mutant huntingtin exon1-128Q protein fused to CFP (httex1-128Q-CFP) in touch receptor neurons to model polyQ-dependent neurodegeneration. Expression of this construct results in a dramatic loss of touch response and axonal degeneration (Lejeune et al. 2012). The investigators screened a library of 6,260 dsRNAs corresponding to 6,034 genes, and isolated 662 genes that, when knocked down, led to either enhanced or reduced neuronal dysfunction in multiple replicates of a touch assay. Vertebrate homologs of some of these genes, including Pah, which acts upstream of dopamine synthesis, were upregulated in the striatum in two mouse models of HD, the CHL2 knockin mice and the R6/2 transgenic mice.

DESIGNING A SCREEN

The design and performance of a large-scale RNAi screen is not a trivial undertaking, as multiple steps must be implemented to ensure success (Figure 7.1). Choices must be made at every step, beginning with the biological question of interest, the phenotypic assay to address the question, the choice of RNAi-reagent library and screening format, optimization of the screen format using small-scale pilot screens, and identification and validation of on-target true hits.

Designing a Phenotypic Assay

With careful and thorough design, almost any phenotype that can be qualitatively observed or quantitatively measured can serve as the basis for a screen assay. Changes in protein levels, posttranslational modifications, or localization can be analyzed using luminescent- or fluorescent-based reporter plate assays or correlated with survival or morphological


phenotypes using automated high-content image-based multiparametric data acquisition and analysis (high-content screening (HCS)). We recently designed and are currently running a genome-wide siRNA screen to identify modulators of secreted progranulin levels in Neuro2A cells in an enzyme-linked immunosorbent assay (ELISA) platform for screening (Elia & Finkbeiner, unpublished work in progress, see further on). Dominant mutations in progranulin have been identified in a number of cases of frontal temporal dementia (FTD) leading to haploinsufficiency and thus reduced levels of secreted progranulin in patients’ brains (Baker et al. 2006). HCS is well suited for neuroscience-based screens. Advantages of this method, which is based on multichannel fluorescence microscopy, are that it allows for standardized and objective image acquisition that can be automated and optimized for speed and accuracy. In addition, imaging can be multiplexed for measuring different cellular parameters—such as dendrite and synapse development, axon degeneration, or disease protein aggregation or mislocalization—within one dataset for either fixed cells or in live cells tracked over a defined period of time (Dragunow 2008). A  multiparametric approach provides ample flexibility to develop novel assays of diverse biology. Challenges associated with HCS include automation of image acquisition, the development of accurate and automated analysis algorithms to extract meaningful phenotypes, and data management pipelines that can handle the enormous amounts of generated data. 
Our laboratory developed an automated fluorescent microscopy system and statistical analysis software that allows us to collect images from cohorts of live transfected primary neurons tracked over time and to assess how different parameters, such as survival, inclusion body formation, disease protein expression level, or localization, are interrelated with or influence one another in the context of long-term neuronal fate (Arrasate et  al. 2004, 2005; Barmada et  al. 2010; Daub et  al. 2009; Sharma et  al. 2012). In fact, we have established several primary neuron models of neurodegenerative diseases—including HD, AD, PD, ALS, and FTD—and are adapting these models for HCS, including siRNA-based library screens, using our automated imaging system and cell analysis algorithms (Sharma et al. 2012).

FIGURE 7.1: Flowchart for RNAi screening. The boxes read, in order: (1) RNAi library (genome-wide or class-specific, e.g., kinases, phosphatases, GPCRs, apoptosis) and cell type (cell line or primary); (2) phenotypic assay of biological process, with single-parameter readouts that increase throughput (e.g., plate reader, FACS) or multiparameter readouts that give enriched datasets (e.g., automated fluorescent microscopy); (3) optimization to achieve high-quality assay performance and minimize false hits (e.g., RNAi delivery, toxicity, screening time, controls, miniaturization); (4) screen; (5) hit confirmation (e.g., repeat RNAi assay with deconvoluted RNAi species or other multiple, unique RNAi species; rescue with RNAi-resistant cDNA); (6) detailed functional analyses.

Assays should be developed and optimized using the appropriate positive and negative controls in replicate, so that the resulting system is robust, reproducible, and sensitive, with low enough variability to identify true hits and minimize false positives and negatives. One standard assessment of assay quality that uses data generated from multiple replicates performed with positive and negative controls is the Z´ factor (Zhang et al. 1999). This is a statistical parameter that determines the suitability of an assay for screening and takes into account the variability of data measured and the dynamic range of the assay. It is most applicable for assays measuring one phenotypic output, such as secreted levels of a protein; for HCS, Z´ factors must be measured individually for each parameter to be assessed in the screen. In addition, a large set of nontargeting control siRNAs should be analyzed to determine the range of nonspecific off-target effects on the phenotype to be screened.
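As defined by Zhang et al. (1999), the Z´ factor is one minus three times the sum of the standard deviations of the positive and negative controls, divided by the absolute difference of their means. A minimal Python sketch follows; the well readouts are invented numbers for illustration only, and the function name `z_prime` is our own.

```python
from statistics import mean, stdev


def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(pos) - mean(neg))
    if separation == 0:
        raise ValueError("controls have identical means; Z' is undefined")
    # stdev() is the sample standard deviation of the replicate wells.
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / separation


# Hypothetical plate-reader values (arbitrary units) from control wells.
positive_controls = [0.20, 0.22, 0.18, 0.21, 0.19]
negative_controls = [1.00, 1.05, 0.95, 1.02, 0.98]
print(round(z_prime(positive_controls, negative_controls), 3))
```

By convention, a Z´ of 1 is the theoretical ideal and values above roughly 0.5 are regarded as indicating good separation between controls; for a multiparametric screen the same calculation would be repeated for each measured parameter.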

Choice of Organism or Cell Type

Many successful screens have been done in invertebrate model systems, both at the cell-culture level, such as in Drosophila primary neurons, and at the level of the whole animal for both C. elegans and Drosophila, allowing for behavioral phenotype-based screening (see Table 7.1). For mammalian systems, we are limited to cell-based assays with cultured cells. The choice of cell type used in a screen, whether primary neuronal cells or a neuronal cell line, is an important parameter to establish (Daub et al. 2009). Primary neurons are desirable as a more physiologically relevant model system than cell lines, but they have a number of inherent properties that increase heterogeneity within the cell population and between independent preps; thus they may introduce significant variability between experiments. Primary neurons are difficult to transfect and must be cultured fresh at the start of each experiment or screen run. They are sensitive to handling, which could trigger nonspecific biological responses, and may not be easily adapted to automated plating using liquid handlers. Immortalized cell lines have been used more consistently in high-throughput RNAi screens as they are easy to generate and transfect, and they can be adapted to automated plating formats. Cell lines also have their drawbacks: they can acquire changes in cell biology or potentially deleterious genetic alterations that accumulate over time with sequential passages, and they may display some degree of population heterogeneity if a differentiation program is leaky. Neuronal cell lines differ significantly from primary neurons in key aspects of cellular physiology, as cell lines do not form structures such as synapses, making some phenotypic screens (e.g., synaptogenesis) intractable in this cell type. Optimization efforts must address sources of variability, such as nonspecific fluctuations in cell numbers due to changes in viability, growth rates, or resistance to the method of RNAi library delivery. Adhering to a consistent method of preparation and handling of primary cells or passaging of cell lines is an important quality control mechanism for achieving reproducible data and reducing variability between experiments for the duration of an RNAi-HTS.
One recent publication advocated the preparation of a screen’s worth of cells from a cell line of choice that can be aliquoted, frozen, and used throughout an entire screen, thus decreasing the time and costs involved in maintenance, and minimizing variability (Swearingen et al. 2010).

RNAi Library

The choices of RNAi library and delivery into cells are important ones that are guided by the cell type (dividing, nondividing, or primary cells) or organism being used for screening, cost, and genome coverage (whole genome vs. functional subsets, such as the kinome or druggable genome). Libraries suitable for whole-animal


screens in C. elegans consist either of dsRNAs in solution that can be directly applied to the culture medium or of plasmid-expressed dsRNAs packaged into bacteria that can be fed to the worms. For Drosophila-based screens, researchers have the choice of using inducible transgenic fly strains for whole-animal screens or arrayed dsRNAs that can be transfected into cultured cells (Echeverri & Perrimon 2006). Libraries developed for mammalian cell-based screens consist of collections of chemically synthesized siRNAs, RNAseIII/Dicer-generated siRNAs (esiRNAs; diced RNAs), plasmid-based shRNAs, or lentiviral-, adenoviral-, or retroviral-based shRNAs (Clark & Ding 2006; Elbashir et al. 2001; Kittler et al. 2007; Schlabach et al. 2008; Silva et al. 2008; Theis & Buchholz 2011). siRNA-based libraries are often arrayed in a single-gene target per well format in multiwell plates or spotted in arrays onto glass microscope slides to provide an even greater capacity for high-throughput screening. An arrayed format allows for ease of determining the intended gene targets of hits and for straightforward delivery into cells via lipid-based forward or reverse transfection strategies; several unique siRNAs per gene target can be screened separately or pooled in a well to improve throughput. siRNAs can be quite potent at silencing, reaching levels of target knockdown in excess of 70%, and they are better suited for a relatively short time course of knockdown owing to their eventual degradation, with an effective window of 24 to 96 hours. The disadvantages of using siRNAs include the following: (1) a significant rate of off-target effects, often due to potential matches between the seed region of the siRNA (nucleotide positions 2 to 7) and the 3´UTR of a nontargeted gene, which may lead to a high level of false positive hits (Birmingham et al. 2006; Lin et al. 2005); (2) dilution of potent siRNAs within a pooled mixture or due to degradation or cell division within the transfected culture, which can reduce the overall effectiveness of target gene knockdown, leading to a high level of false negative hits; and (3) poor delivery into cells that are relatively intractable to transfection, such as primary cultured neurons, which may not reach high enough levels of transfection to achieve potent knockdown or may exhibit toxicity at the doses required. Recent strategies to improve the quality and effectiveness of siRNA-based libraries include the following: (1) using bioinformatic sequence alignment programs to identify seed-sequence


matches within off-targeted transcripts and to incorporate the information into siRNA design algorithms to choose unique siRNA sequences against the target gene (Sigoillot et al. 2012), (2) using pools of three or more highly potent validated siRNAs per target gene at low concentrations (less than 100 nM), or using a library of higher complexity where several (at least 4) unique and potent siRNAs per target gene are screened individually and those target genes for which three or more siRNAs reproduce the specific phenotype are chosen, and (3) chemically modifying either the sense strand to prevent entry into the RISC or the antisense strand to suppress key nucleotide positions that promote off-target mismatches (reviewed in Jackson et al. 2010; Konig et al. 2007). Delivery of siRNAs into cells (via lipid-based complexes, Ca2+-phosphate coprecipitation, or electroporation) needs to be established early during the assay development phase with the goal of minimizing cytotoxicity and maximizing delivery efficiency. Our lab found that Lipofectamine 2000 provides a workable balance between transfection efficiency, ease of use in manual and automated reverse or forward transfection formats, and low toxicity. We recently used Lipofectamine 2000 in a reverse transfection format to introduce a mouse siRNA library into Neuro2A cells for screening purposes (for progranulin modifier identification as described above). In addition, we routinely use this reagent for primary neuron culture transfections and are adapting it further for use in siRNA-based high-content screens with our automated microscope platform (Sharma et al. 2012). Establishing the lowest effective concentration of siRNAs in the functional screen assay through dose-response testing is key to reducing nonspecific effects.
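The seed-match scan mentioned in strategy (1) can be illustrated with a short Python sketch. It takes the seed as guide-strand positions 2 to 7, as stated above, and looks for perfectly complementary sites in a 3´UTR. The sequences are invented, the function names are ours, and real design tools (such as those cited) weigh many more features than an exact hexamer match.

```python
# RNA base-pairing table (Watson-Crick only; G:U wobble pairs are ignored here).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(guide_rna, start=2, end=7):
    """Return the mRNA sequence (5' to 3') that pairs with guide positions start..end."""
    seed = guide_rna.upper().replace("T", "U")[start - 1:end]
    # Reverse complement, because the two strands pair antiparallel.
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(guide_rna, utr_rna):
    """Return 0-based start positions of perfect seed-complementary sites in the UTR."""
    site = seed_site(guide_rna)
    utr = utr_rna.upper().replace("T", "U")
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


# Hypothetical guide strand and hypothetical 3'UTR fragment of a nontargeted gene.
guide = "UUCGAAGUACUCAGCGUAAGU"
utr = "GGAACUUCGAUUACUUCGAC"
print(seed_site(guide), find_seed_matches(guide, utr))
```

A transcript with one or more such sites in its 3´UTR would be flagged as a candidate off-target, and an siRNA whose seed hits many transcripts could be deprioritized during library design.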
An siRNA concentration of 100 nM (a common level used in transfections) can produce significant modulatory effects on the expression of nontargeted transcripts that may contain partial complementary sequences to the siRNAs (Jackson et  al. 2003; Persengiev et al. 2004). shRNA-based libraries can also be screened in array-based formats, though for virus-based shRNA libraries it may be technically challenging to generate consistent high titers of each shRNA-containing virus per individual well. However, shRNA-based libraries can be screened in pooled formats with shRNAs against a small number (one to four) of gene

targets per well, as the shRNAs can be constructed with unique barcode sequences that can be recovered by PCR and identified by hybridization to microarrays. The number of virus particles can be controlled so that one shRNA-containing virus infects and integrates the shRNA-expressing DNA into one individual cell at a time (Brummelkamp & Bernards 2003). Viral-based delivery of shRNA libraries is better tolerated by primary neurons and may be a better option for this cell type. Also, the time course of knockdown with shRNA is on the order of weeks to months, so interesting studies where one might wish to track long-term survival or cumulative neurodegeneration can be executed. A number of large-scale RNAi libraries have been developed, validated, and made available to the public either through commercial sources or academic labs and consortia. The Vienna Drosophila RNAi Center (VDRC) has a collection of 22,247 inducible dsRNA transgenic Drosophila strains representing 12,251 genes (~88% genome coverage) (Dietzl et  al. 2007; Valakh et  al. 2012). The Ahringer C.  elegans RNAi feeding library consists of bacterial strains containing dsRNA constructs corresponding to ~19,000 predicted genes (~86% genome coverage) (Kamath et  al. 2003; Qu et  al. 2011; Sieburth et  al. 2005). The C.  elegans ORFeome-RNAi library generated by Rual and colleagues contains 15,395 RNAi bacterial feeding strains (~81% genome coverage) (Kim et  al. 2005; Poole et  al. 2011; Rual et al. 2004). A lentiviral shRNA library available through the RNAi Consortium at the Broad Institute includes 180,000 clones evenly divided between ~18,000 genes each for the mouse and human genomes (Blow 2008). Several commercial libraries are available for use in mammalian and human cells, including collections of siRNAs or viral-based, integrating shRNAs (listed in Blow 2008; Echeverri et  al. 2006; Moffat et al. 2006). 
One recent version of a viral-based shRNA library, originally developed by Stephen Elledge and George Hannon and available through Open Biosystems, consists of over 100,000 unique barcoded shRNAs spanning the entire human genome. These shRNAs are expressed in the context of a miRNA cassette to generate shRNA-miRs that can work at the level of single-copy integration, greatly facilitating pooled screening formats (Blow 2008; Silva et al. 2008).
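How low the viral dose must be for most infected cells to carry a single shRNA can be estimated with a Poisson model of infection. The model is a common simplifying assumption rather than something stated in this chapter: it treats integrations as independent events with a mean of `moi` (multiplicity of infection) per cell. The function names below are illustrative.

```python
import math


def poisson_pmf(k: int, moi: float) -> float:
    """Probability that a cell receives exactly k integrating viruses."""
    return math.exp(-moi) * moi ** k / math.factorial(k)


def single_integrant_fraction(moi: float) -> float:
    """Among infected cells (k >= 1), the fraction carrying exactly one virus."""
    infected = 1.0 - poisson_pmf(0, moi)
    return poisson_pmf(1, moi) / infected


# Compare a low and a high multiplicity of infection
low_moi = single_integrant_fraction(0.3)
high_moi = single_integrant_fraction(1.0)
```

Under this model, an MOI of 0.3 leaves roughly 86% of infected cells with a single integrant, whereas an MOI of 1.0 leaves only about 58%, which is why pooled screens are usually run at low MOI at the cost of infecting fewer cells.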


For our progranulin modifier screen, we are working with a commercial genome-wide RNAi library of siRNAs that targets ~16,900 mouse genes. The library is arrayed in a 96-well plate format with four unique siRNAs against one target gene in each well; this is designed to induce specific and potent knockdown levels of more than 70% in target cells (siGENOME SMARTpool mouse genomic siRNA library, Thermo-Scientific). The library is subdivided into kinase, G protein‒coupled receptor (GPCR), druggable, and remaining genome categories to allow for either genome-wide or focused screens. Custom libraries of desired genes can also be constructed depending on the requirements of the screen; they can be drawn, for example, from lists generated from transcriptional profiling or from proteomic or bioinformatic analyses of a particular pathway or process of interest.

Validation: Distinguishing On-Target Hits From Off-Target Effects
Off-target effects often encountered with RNAi screening can be addressed through intelligent screen design and optimization (as discussed previously) and implementation of rigorous follow-up approaches based on repetition and rescue. Confidence levels in screen hits can be increased by retesting multiple, individual siRNAs or shRNAs of different sequences against the target gene, especially if the primary screen used pooled siRNA populations. The higher the number of individual target gene-specific siRNAs or shRNAs that can reproduce the screen phenotype, the more likely it is that the hit will be on target. Another approach to increase the confidence of on-target effects is to show direct correlation of target gene knockdown at both the transcript and protein levels with RNAi phenotype. Demonstration that traditional LoF approaches using either mutant alleles (mouse knockout, fly, or worm mutant strains) of the target gene or pharmacological inhibitors phenocopy the RNAi-induced phenotype provides yet another level of confidence that a hit is true. Finally, one of the better approaches is to demonstrate phenotypic rescue using an RNAi-resistant version of the target gene (Pulverer 2003; Sigoillot & King 2011).

CONCLUSION AND FUTURE PERSPECTIVES
A well-designed, comprehensive RNAi screen provides a powerful tool to identify new components of biological pathways. Recent applications in the area of neuroscience have opened the door to providing a tractable approach to better understand the complexities of the nervous system and, hopefully soon, to provide an efficient path forward to identify therapeutic targets for neurodegenerative diseases. A recent screen performed by cancer researchers beautifully illustrates the power of this strategy to identify a potential treatment for acute myeloid leukemia (AML). Zuber and colleagues used an innovative approach to screen a library of shRNAs in vivo in a genetically defined AML mouse model to identify epigenetic factors required for leukemia stem cell self-renewal, an underlying process key to the maintenance and propagation of the disease (Zuber et al. 2011). They identified a protein, Brd4, whose inhibition with shRNA led to potent antileukemic effects, including cell-cycle arrest and death of leukemic cells, a delay in disease progression, and a significant extension of survival of the diseased animals. They further showed that a small molecule Brd4 inhibitor, JQ1, recapitulated the knockdown results.

It may be possible to identify new components and therapeutic targets of neurodegenerative diseases by combining unbiased RNAi screening with neuronal cells derived from induced pluripotent stem cells (iPSCs). Neuronal cells derived from iPSCs generated from fibroblasts isolated from patients with neurodegenerative diseases, such as HD or ALS, may soon provide a platform for such screens (Bilican et al. 2012; the HD iPSC Consortium 2012). For instance, functional motor neurons have been derived from iPSCs obtained from patient fibroblasts harboring an ALS-linked mutation in TDP-43 (TDP-43 M337V). These neurons recapitulated disease phenotypes, including cytoplasmic misaccumulation of this predominantly nuclear protein, which is mechanistically linked with neurodegeneration (Bilican et al. 2012). These neurons would be attractive candidate cells to use in an RNAi screen to identify relevant modifiers of the disease phenotype, such as suppression of the toxic cytoplasmic accumulation, and potential therapeutic targets that extend neuron survival and block disease progression.

ACKNOWLEDGMENTS
We wish to thank Gary Howard and members of the Finkbeiner lab for discussions and comments on the manuscript. This work was supported by grants from the Consortium for Frontotemporal Dementia Research and the National Institute for Neurological Disease and Stroke (2 R01 NS45491, NS39074), the National Institute on Aging (2 P01 AG022074), and by the Taube-Koret Center and the Hellman Family Foundation Program for Alzheimer’s Disease Research.

REFERENCES
Arrasate, M., Mitra, M., Schweitzer, E. S., Segal, M. R., & Finkbeiner, S. (2004). Inclusion body formation reduces levels of mutant huntingtin and the risk of neuronal death. Nature 431, 805–810. Arrasate, M., & Finkbeiner, S. (2005). Automated microscope system for determining factors that predict neuronal fate. PNAS 102, 3840–3845. Azorsa, D. O., Robeson, R. H., Frost, D., Meechoovet, B., Brautigam, G. R., Dickey, C., . . . Dunckley, T. (2010). High-content siRNA screening of the kinome identifies kinases involved in Alzheimer’s disease-related tau hyperphosphorylation. BMC Genomics 11, 25–35. Babiarz, J. E., Ruby, J. G., Wang, Y., Bartel, D. P., & Blelloch, R. (2008). Mouse ES cells express endogenous shRNAs, siRNAs, and other Microprocessor-independent, Dicer-dependent small RNAs. Genes Dev 22, 2773–2785. Baker, M., Mackenzie, I. R., Pickering-Brown, S. M., Gass, J., Rademakers, R., Lindholm, C., . . . Hutton, M. (2006). Mutations in progranulin cause tau-negative frontotemporal dementia linked to chromosome 17. Nature 442, 916–919. Barmada, S. J., Skibinski, G., Korb, E., Rao, E. J., Wu, J. Y., & Finkbeiner, S. (2010). Cytoplasmic mislocalization of TDP-43 is toxic to neurons and enhanced by a mutation associated with familial amyotrophic lateral sclerosis. J Neurosci 30, 639–649. Bernstein, E., Kim, S. Y., Carmell, M. A., Murchison, E. P., Alcorn, H., Li, M. Z., . . . Hannon, G. J. (2003). Dicer is essential for mouse development. Nat Genet 35, 215–217. Bilican, B., Serio, A., Barmada, S. J., Nishimura, A. L., Sullivan, G. J., Carrasco, M., . . . Chandran, S. (2012). Mutant induced pluripotent stem cell lines recapitulate aspects of TDP-43 proteinopathies and reveal cell-specific vulnerability. PNAS 109, 5803–5808. Birmingham, A., Anderson, E. M., Reynolds, A., Ilsley-Tyree, D., Leake, D., Fedorov, Y., . . . Khvorova, A. (2006). 3’UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat Methods 3, 199–204. Blow, N. (2008). RNAi technologies: a screen whose time has arrived. Nat Methods 5, 361–368.

Brummelkamp, T. R., & Bernards, R. (2003). New tools for functional mammalian cancer genetics. Nat Rev Cancer 3, 781–789. Carmell, M. A., & Hannon, G. J. (2004). RNAseIII enzymes and the initiation of gene silencing. Nat Struct Molec Biol 11(3), 214–218. Carney, T. D., Miller, M. R., Robinson, K. J., Bayraktar, O. A., Osterhout, J. A., & Doe, C. Q. (2012). Functional genomics identifies neural stem cell sub-type expression profiles and genes regulating neuroblast homeostasis. Dev Biol 361, 137–146. Carthew, R. W., & Sontheimer, E. J. (2009). Origins and mechanisms of miRNAs and siRNAs. Cell 136, 642–655. Clark, J., & Ding, S. (2006). Generation of RNAi libraries for high-throughput screens. J Biomed Biotech 45716, 1–7. Cullen, B. R. (2004). Transcription and processing of human microRNA precursors. Mol Cell 16(6), 861–865. Czech, B., & Hannon, G. J. (2011). Small RNA sorting: matchmaking for Argonautes. Nat Rev Genet 12, 19–31. Czech, B., Malone, C. D., Zhou, R., Stark, A., Schlingeheyde, C., Dus, M., . . . Brennecke, J. (2008). An endogenous small interfering RNA pathway in Drosophila. Nature 453, 798–802. Daub, A., Sharma, P., & Finkbeiner, S. (2009). High-content screening of primary neurons: ready for prime time. Curr Opin Neurobiol 19, 537–543. Dietzl, G., Chen, D., Schnorrer, F., Su, K-C., Barinova, Y., Fellner, M., . . . Dickson, B. J. (2007). A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature 448, 151–156. Dimitriadi, M., Sleigh, J. N., Walker, A., Chang, H. C., Sen, A., Kalloo, G., . . . Hart, A. C. (2010). Conserved genes act as modifiers of invertebrate SMN loss of function defects. PLOS Genet 6, e1001172. Dragunow, M. (2008). High-content analysis in neuroscience. Nat Rev Neurosci 9, 779–788. Echeverri, C. J., & Perrimon, N. (2006). High-throughput RNAi screening in cultured cells: a user’s guide. Nat Rev Genet 7, 373–384. Elbashir, S. M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K., & Tuschl, T. (2001). Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494–498. Elbashir, S. M., Lendeckel, W., & Tuschl, T. (2001). RNA interference is mediated by 21- and 22-nucleotide RNAs. Genes Dev 15, 188–200. Elbashir, S. M., Martinez, J., Patkaniowska, A., Lendeckel, W., & Tuschl, T. (2001). Functional anatomy of siRNAs for mediating efficient RNAi in Drosophila melanogaster embryo lysate. EMBO J 20, 6877–6888.

Filipowicz, W., Jaskiewicz, L., Kolb, F. A., & Pillai, R. S. (2005). Post-transcriptional gene silencing by siRNAs and miRNAs. Curr Opin Struct Biol 15, 331–341. Fire, A. (1999). RNA-triggered gene silencing. TIG 15(9), 358–363. Fire, A., Xu, S., Montgomery, M. K., Kostas, S. A., Driver, S. E., & Mello, C. C. (1998). Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806–811. Fraser, A. G., Kamath, R. S., Zipperlen, P., Martinez-Campos, M., Sohrmann, M., & Ahringer, J. (2000). Functional genomic analysis of C. elegans chromosome I by systematic RNA interference. Nature 408, 325–330. Hamamichi, S., Rivas, R. N., Knight, A. L., Cao, S., Caldwell, K. A., & Caldwell, G. A. (2008). Hypothesis-based RNAi screening identifies neuroprotective genes in a Parkinson’s disease model. PNAS 105, 728–733. Hamilton, A. J., & Baulcombe, D. C. (1999). A species of small antisense RNA in posttranscriptional gene silencing in plants. Science 286, 950–952. Hannon, G. J. (2002). RNA interference. Nature 418, 244–251. Ivanov, A. I., Rovescalli, A. C., Pozzi, P., Yoo, S., Mozer, B., Li, H-P., . . . Nirenberg, M. (2004). Genes required for Drosophila nervous system development identified by RNA interference. PNAS 101, 16216–16221. Jackson, A. L., Bartz, S. R., Schelter, J., Kobayashi, S. V., Burchard, J., Mao, M., . . . Linsley, P. S. (2003). Expression profiling reveals off-target gene regulation by RNAi. Nat Biotechnol 21, 635–637. Jackson, A. L., & Linsley, P. S. (2010). Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. Nat Rev Drug Dis 9, 57–67. Kamath, R. S., Fraser, A. G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., . . . Ahringer, J. (2003). Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421, 231–237. Ketting, R. F. (2011). The many faces of RNAi. Dev Cell 20, 148–161.
Khvorova, A., Reynolds, A., & Jayasena, S.D. (2003). Functional siRNAs and miRNAs exhibit strand bias. Cell 115, 209–216. Kim, J. K., Gabel, H. W., Kamath, R. S., Tewari, M., Pasquinelli, A., Rual, J., . . . Ruvkun, G. (2005). Functional genomic analysis of RNA interference in C. elegans. Science 308, 1164–1167. Kim, V. N., Han J., & Siomi M. C. (2009). Biogenesis of small RNAs in animals. Nat Rev Mol Cell Biol 10, 126–139. Kittler, R., Surendranath, V., Heninger, A. H., Slabicki, M. J., Theis, M., Putz, G., . . . Buchholz,

F. (2007). Genome-wide resources of endoribonuclease-prepared short interfering RNAs for specific loss-of-function studies. Nat Meth 4, 337–344. Koizumi, K., Higashida, H., Yoo, S., Islam, M. S., Ivanov, A. I., Guo, V., . . . Nirenberg, M. (2007). RNA interference screen to identify genes required for Drosophila embryonic nervous system development. PNAS 104, 5626–5631. Konig, R., Chiang, C., Tu, B. P., Yan, S. F., DeJesus, P. D., Romero, A., . . . Chanda, S.K. (2007). A probability-based approach for the analysis of large-scale RNAi screens. Nat Meth 4, 847–849. Kraemer, B. C., Burgess, J. K., Chen, J. H., Thomas, J. H., & Schellenberg, G.D. (2006). Molecular pathways that influence human tau-induced pathology in Caenorhabditis elegans. Hum Mol Genet 15, 1483–1496. Lee, Y. S., Nakahara, K., Pham, J. W., Kim, K., He, Z., Sontheimer, E. J., & Carthew, R.W. (2004). Distinct roles for Drosophila Dicer-1 and Dicer-2 in the siRNA/miRNA silencing pathways. Cell 117, 69–81. Lejeune, F. X., Mesrob, L., Parmentier, F., Bicep, C., Vazquez-Manrique, R. P., Parker, J. A., . . . Neri, C. (2012). Large-scale functional RNAi screen in C.elegans identifies genes that regulate the dysfunction of mutant polyglutamine neurons. BMC Genomics 13, 91. Lin, X., Ruan, X., Anderson, M. G., McDowell, J. A., Kroeger, P. E., Fesik, S. W., & Shen, Y. (2005). siRNA-mediated off-target gene silencing triggered by a 7nt complementation. Nucleic Acids Res 33, 4527–4535. Loh, S., Francescut, F., Lingor, P., Bahr, M., & Nicotera, P. (2008). Identification of new kinase clusters required for neurite outgrowth and retraction by a loss-of-function RNA interference screen. Cell Death Diff 15, 283–298. Matranga, C., Tomari, Y., Shin, C., Bartel, D. P., & Zamore, P. D. (2005). Passenger-strand cleavage facilitates assembly of siRNA into Ago 2-containing RNAi enzyme complexes. Cell 123, 607–620. Moffat, J., & Sabatini, D. M. (2006). Building mammalian signaling pathways with RNAi screens. 
Nat Rev Mol Cell Biol 7, 177–187. Mohr, S., Bakal, C., & Perrimon, N. (2010). Genomic screening with RNAi:  results and challenges. Annu Rev Biochem 79, 37–64. Montgomery, M. K., Xu, S., & Fire, A. (1998). RNA as a target of double-stranded RNA-mediated genetic interference in Caenorhabditis elegans. PNAS 95:15502–15507. Neely, G. G., Hess, A., Costigan, M., Keene, A. C., Goulas, S., Langeslag, M., . . . Penninger, J. M. (2010). A genome-wide Drosophila screen for


heat nociception identifies α2δ3 as an evolutionarily conserved pain gene. Cell 143, 628–638. Okamura, K., Chung, W. J., Ruby, J. G., Guo, H., Bartel, D. P., & Lai, E. C. (2008). The Drosophila hairpin RNA pathway generates endogenous short interfering RNAs. Nature 453, 803–806. Okamura, K., & Lai, E. C. (2008). Endogenous small interfering RNAs in animals. Nat Rev Mol Cell Biol 9(9), 673–678. Ossovskaya, V. S., Dolganov, G., & Basbaum, A. I. (2009). Loss of function genetic screens reveal MTGR1 as an intracellular repressor of β1 integrin-dependent neurite outgrowth. J Neurosci Methods 177, 322–333. Pais, H., Moxon, S., Dalmay, T., & Moulton, V. (2011). Small RNA discovery and characterization in eukaryotes using high-throughput approaches. In L. J. Collins (Ed.). RNA infrastructure and networks (Vol. 722, pp. 239–254). New York: Springer. Pal-Bhadra, M., Bhadra, U., & Birchler, J. A. (1997). Cosuppression in Drosophila: gene silencing of Alcohol dehydrogenase by white-Adh transgenes is Polycomb dependent. Cell 90, 479. Paradis, S., Harrar, D. B., Lin, Y., Koon, A. C., Hauser, J. L., Griffith, E. C., . . . Greenberg, M. E. (2007). An RNAi-based approach identifies molecules required for glutamatergic and GABAergic synapse development. Neuron 53, 217–232. Parrish, J. Z., Kim, M. D., Jan, L. Y., & Jan, Y. N. (2006). Genome-wide analyses identify transcription factors required for proper morphogenesis of Drosophila sensory neuron dendrites. Genes Dev 20, 820–835. Persengiev, S. P., Zhu, X., & Green, M. R. (2004). Nonspecific, concentration-dependent stimulation and repression of mammalian gene expression by small interfering RNAs (siRNAs). RNA 10, 12–18. Poole, R. J., Bashllari, E., Cochella, L., Flowers, E. B., & Hobert, O. (2011). A genome-wide RNAi screen for factors involved in neuronal specification in Caenorhabditis elegans. PLOS Genet 7, e1002109. Pulverer, B. (2003). Editorial: Whither RNAi? Nat Cell Biol 5, 489–490.
Qu, W., Ren, C., Li, Y., Shi, J., Zhang, J., Wang, X., . . . Zhang, C. (2011). Reliability analysis of the Ahringer Caenorhabditis elegans RNAi feeding library:  a guide for genome-wide screens. BMC Genomics 12, 170. Rand, T.A., Petersen, S., Du, F., & Wang, X. (2005). Argonaute2 cleaves the anti-guide strand of siRNA during RISC activation. Cell 123, 621–629. Romano, N., & Macino, G. (1992). Quelling: transient inactivation of gene expression in Neurospora crassa by transformation with homologous sequences. Mol Microbiol 6, 3343–3353.

Rossi, J. J. (2005). Mammalian Dicer finds a partner. EMBO Rep 6, 927–929. Rual, J. F., Ceron, J., Koreth, J., Hao, T., Nicot, A. S., Hirozane-Kishikawa, T., . . . Vidal, M. (2004). Toward improving Caenorhabditis elegans phenome mapping with an ORFeome-based RNAi library. Genome Res 14, 2162–2168. Schlabach, M. R., Luo, J., Solimini, N. L., Hu, G., Xu, Q., Li, M. Z., . . . Elledge, S. J. (2008). Cancer proliferation gene discovery through functional genomics. Science 319, 620–624. Schmitz, C., Kinge, P., & Hutter, H. (2007). Axon guidance genes identified in a large-scale RNAi screen using the RNAi-hypersensitive Caenorhabditis elegans strain nre-1(hd20) lin-15b(hd126). PNAS 104, 834–839. Schwarz, D. S., Hutvagner, G., Du, T., Xu, Z., Aronin, N., & Zamore, P. D. (2003). Asymmetry in the assembly of the RNAi enzyme complex. Cell 115, 199–208. Sepp, K. J., Hong, P., Lizarraga, S. B., Liu, J. S., Mejia, L. A., Walsh, C. A., & Perrimon, N. (2008). Identification of neural outgrowth genes using genome-wide RNAi. PLOS Genet 4, e1000111. Sharma, P., Ando, D. M., Daub, A., Kaye, J., & Finkbeiner, S. (2012). High-throughput screening in primary neurons. Methods Enzymol 506, 331–360. Sieburth, D., Ch’ng, Q., Dybbs, M., Tavazoie, M., Kennedy, S., Wang, D., . . . Kaplan, J.M. (2005). Systematic analysis of genes required for synapse structure and function. Nature 436, 510–517. Sigoillot, F. D., Lyman, S., Huckins, J. F., Adamson, B., Chung, E., Quattrochi, B., & King, R. W. (2012). A bioinformatics method identifies prominent off-targeted transcripts in RNAi screens. Nat Methods 9, 363–366. Silva, J. M., Marran, K., Parker, J. S., Silva, J., Golding, M., Schlabach, M. R., . . . Chang, K. (2008). Profiling essential genes in human mammary cells by multiplex RNAi screening. Science 319, 617–620. Swearingen, E. A., Fajardo, F., Wang, X., Watson, J.E.V., Quon, K. C., & Kassner, P. D. (2010). 
Use of cryopreserved cell aliquots in the high-throughput screening of small interfering RNA libraries. J Biomol Screening 15, 469–477. Tam, O. H., Aravin, A. A., Stein, P., Girard, A., Murchison, E. P., Cheloufi, S., . . . Hannon, G.J. (2008). Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocyte. Nature 453, 534–538. Theis, M., & Buchholz, F. (2011). High-throughput RNAi screening in mammalian cells with esiRNAs. Methods 53, 424–429. Timmons, L., Court, D. L., & Fire, A. (2001). Ingestion of bacterially expressed dsRNAs can produce

specific and potent genetic interference in Caenorhabditis elegans. Gene 263, 103–112. Valakh, V., Naylor, S. A., Berns, D. S., & DiAntonio, A. (2012). A large-scale RNAi screen identifies functional classes of genes shaping synaptic development and maintenance. Dev Biol. http://dx.doi.org/10.1016/j.ydbio.2012.04.008. Vashlishan, A. B., Madison, J. M., Dybbs, M., Bai, J., Sieburth, D., Ch’ng, Q., . . . Kaplan, J. M. (2008). An RNAi screen identifies genes that regulate GABA synapses. Neuron 58, 346–361. Wang, J., Farr, G. W., Hall, D. H., Li, F., Furtak, K., Dreier, L., & Horwich, A. L. (2009). An ALS-linked mutant SOD1 produces a locomotor defect associated with aggregation and synaptic dysfunction when expressed in neurons of Caenorhabditis elegans. PLOS Genet 5, e1000350. Watanabe, T., Totoki, Y., Toyoda, A., Kaneda, M., Kuramochi-Miyagawa, S., Obata, Y., . . . Sasaki, H. (2008). Endogenous siRNAs from naturally

formed dsRNAs regulate transcripts in mouse oocytes. Nature 453, 539–543. Yamada, S., Uchimura, E., Ueda, T., Nomura, T., Fujita, S., Matsumoto, K., . . . Miyake, J. (2007). Identification of twinfilin-2 as a factor involved in neurite outgrowth by RNAi-based screen. Biochem Biophys Res Commun 363, 926–930. Zhang, S., Binari, R., Zhou, R., & Perrimon, N. (2010). A genomewide RNA interference screen for modifiers of aggregates formation by mutant huntingtin in Drosophila. Genetics 184, 1165–1179. Zhang, J. H., Chung, T. D., & Oldenburg, K. R. (1999). A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J Biomol Screen 4, 67–73. Zuber, J., Shi, J., Wang, E., Rappaport, A. R., Herrmann, H., Sison, E. A., . . . Vakoc, C.R. (2011). RNAi screen identifies Brd4 as a therapeutic target in acute myeloid leukaemia. Nature 478, 524–528.

8 The Genetics of Gene Expression: Multiple Layers and Multiple Players
AMANDA J. MYERS

INTRODUCTION
After the human genome was sequenced there were two main unexpected findings. First, the human genome was 99% identical to that of the chimp, which is surprising if the assumption is that DNA base-pair changes are what differentiate Homo sapiens as a species. Second, the human genome contained ~3 billion base pairs, but these only encoded ~20,000 genes, which was surprising given that ~98% of the human genome appeared to be “junk.” Thus initially, the Human Genome Project was a collection of base pairs without any key to decipher the meaning of ~98% of those base pairs. In terms of a well-used analogy: the human genome sequence is a library in an ancient language with no Rosetta stone. But there is hope; both new technologies and the completion of large-scale genome and transcriptome mapping projects such as ENCODE are forging the way towards developing the cipher to those base-pair changes. It is now well known that the non-coding “junk genome” is vital to determining who we are as a species as well as the outcomes for our health and disease. This chapter will focus on the links between DNA alterations and their downstream effects, highlighting both the layers and the players in this complex system. First, a brief review of the technologies of interest is given, since these have driven the outcomes in the field. Second, a background on the gene regions of interest (ROIs) will help to define the knowledge gained from many fields on the areas of interest for human genetic regulation. Finally, there is a review of the current knowledge about human genetic regulation, which is divided into regulatory actions mapping close to the target of interest (cis effects) and those mapping more distally (trans effects).

TECHNOLOGIES
The study of DNA and its downstream effects is very much a technology-driven process. Most of the first screens looking at DNA variation involved looking at segregation in families because there were no reasonable technologies at the time to assess the entire genome in unrelated individuals.1 Assessing segregation in multigenerational pedigrees is a powerful approach that allows for assessment of the heritability of DNA variability, and it is still the gold standard for the assessment of rare Mendelian forms of genetic disease. However, many diseases cannot be mapped in this manner because it is impossible to collect multiple generations. Now that the human genome has been sequenced, it has become possible to map effects in unrelated cohorts. Much of the later work on mapping has been done using microarrays. These allow for the assessment of hundreds of thousands of single-nucleotide polymorphisms (SNPs) in one hybridization, thus generating finer mapping of the human genome. Newer technologies, such as Next Generation Sequencing (NGS), are now being employed that do not rely on hybridization. NGS allows for the parallel collection of many short sequence reads at the same time, thus reducing costs considerably over conventional Sanger methodologies. NGS will truly allow for the development of a Human Genome Project v2.0 (http://www.1000genomes.org/) because these technologies reduce the cost of sequencing to enable the expansion of the Human Genome Project outcomes to larger cohort sizes. The application of these technologies to the study of the genetic regulation of gene expression is reviewed in references 2 and 3.

REGIONS OF INTEREST
Much as in real estate, location is a crucial component to understanding genetic regulation of gene expression. An impressive amount of work has been performed in this area to delineate genomic areas close to genes that act as switches to modulate gene function. Cis elements including the 5′ and 3′ untranslated regions (UTRs) along with intergenic introns are located close to the genes they act upon and are thought to have a direct effect. More recently, the analysis of epigenetic regulation has also demonstrated the importance of CpG islands in cis gene regulation.4 Trans elements that are located some distance from the genes they act upon were considered junk DNA for many years; however, recent studies have demonstrated pervasive transcription throughout the genome.5 Further work has demonstrated the crucial roles of noncoding RNAs (ncRNA) in regulating coding gene transcription6 and underlying the complexity of higher organisms.7 The unique components of each of these ROIs are detailed in the following sections.

Promoter
Transcription initiation involves a highly coordinated process of several components binding upstream of gene coding sequence. The core promoter generally spans approximately 80 bp around the transcription start site (TSS). For mammals there are two main promoter classes: narrow TATA box‒enriched and variable CpG-rich.
High-CpG-content promoters are associated with strong expression across a broad range of cell types and low-CpG-content promoters are associated with weaker but tissue-specific expression.8 TATA box promoters initiate at a single TSS, while CpG promoters have multiple initiation sites.9 Alternate promoter usage is a widespread phenomenon in humans10 and alternative promoters are overrepresented among genes involved in transcriptional regulation and development.11 In contrast, single-promoter genes are active in a broad range of tissues and are more likely to be involved in general cellular processes, such as RNA processing, DNA repair, and protein biosynthesis.12 Alternate promoters can generate products from the same gene with opposing functions. For example, the lymphoid enhancer-binding factor 1 (LEF1) gene is transcribed from two promoters: one creates a full-length form of the gene product that activates downstream targets, whereas the other produces a shorter isoform that represses downstream targets.13 Further adding to the complexity of promoters, recent work mapping promoter sequences in plants, flies, humans, and rice has expanded the number of promoter types to 10 generic classes.14 Additionally, it has been shown that human genes can be expressed via bidirectional promoters, that is, TSSs of gene pairs that are separated by less than 1,000 bp. Some of these bidirectional gene pairs are coexpressed, whereas the expression of other pairs is negatively correlated.15 About 11% of human genes are bidirectionally paired.16 Within and around the core promoter region there are many different types of sequence motifs that are involved in further regulation of gene transcription (Table 8.1). In general, sequence regions lying −300 to −50 bp upstream of the TSS have positive effects on promoter activity, while sequences that are further upstream (−1,000 to −500 bp) were shown to inhibit transcription in 55% of genes tested.10 Enhancers and silencers bind to these sequences and act to selectively regulate the levels of gene transcription. Many of these factors are cell type‒specific and act to help determine decisions regarding cell fate and cell-type maintenance.
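The definition of a bidirectional gene pair given above lends itself to a small worked example. The sketch below is illustrative only and rests on assumptions that go beyond the text: it requires the two genes to lie on opposite strands in a head-to-head (divergent) arrangement, meaning the minus-strand TSS sits at a lower coordinate than the plus-strand TSS, and it applies the less-than-1,000-bp cutoff to that distance. The `Gene` record, the function name, and the gene names and coordinates are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int      # transcription start site coordinate
    strand: str   # "+" or "-"


def bidirectional_pairs(genes: List[Gene], max_gap: int = 1000) -> List[Tuple[str, str]]:
    """Return (minus-strand gene, plus-strand gene) pairs whose TSSs are
    arranged head to head and lie less than `max_gap` bp apart."""
    minus_genes = [g for g in genes if g.strand == "-"]
    plus_genes = [g for g in genes if g.strand == "+"]
    pairs = []
    for m in minus_genes:
        for p in plus_genes:
            # Divergent arrangement: minus-strand TSS upstream of plus-strand TSS
            if m.chrom == p.chrom and 0 < p.tss - m.tss < max_gap:
                pairs.append((m.name, p.name))
    return pairs


# Made-up annotation records for illustration
annotation = [
    Gene("GENE_A", "chr1", 10_500, "-"),
    Gene("GENE_B", "chr1", 11_000, "+"),
    Gene("GENE_C", "chr1", 50_000, "+"),
    Gene("GENE_D", "chr2", 10_800, "+"),
]
pairs = bidirectional_pairs(annotation)
```

The nested loop is quadratic in the number of genes, which is acceptable for a toy list; a genome-scale annotation would more sensibly be sorted by chromosome and coordinate first.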

5′UTR
The 5′UTR is the region that extends past the promoter and is at the 5′ end of all genes. It is transcribed into mRNA but not translated into protein. In general, sequences within the 5′UTR surrounding the promoter can affect the stability or translation efficiency of RNA products, thus altering temporal or spatial distributions. Specific regulatory elements in the 5′UTR include the 5′ cap structure, the 5′UTR secondary structure, internal ribosome entry sites (IRES), and upstream open reading frames (uORFs). The 5′ cap structure is a 7-methylguanosine attached to the 5′ end of precursor mRNA and is essential for efficient translation. It serves as the binding site for various eukaryotic initiation factors (eIFs) and promotes binding of 40S ribosomal subunits and other proteins that make up the 43S preinitiation complex (PIC). If the full set of eIFs is not present, the 5′ cap can block mRNA recruitment to the PIC.15 Thus, the 5′ cap can serve as both a positive and a negative regulator of translation through eIF binding. Other mechanisms


TABLE 8.1. PROMOTER ELEMENTS INVOLVED IN GENE REGULATION

Motif | Name | Binding Element | Consensus Sequence | Position | Action
BRE | Beta-Recognition Elements | TFIIB | SSRCGCC | −37 to −32 | Located close to transcription start site; transcription initiation
TATA | TATA box | TFIID complex | TATAWAAR | −31 to −26 | Binds to the TATA-binding protein (TBP) subunit of TFIID complex
INR | Initiator Elements | TFIID complex | YYANWYY | −2 to +4 | Initiation of transcription
MTE | Motif Ten Element | TFIID complex | CSARCSSAACGS | +18 to +27 | Promotes transcription by RNA polymerase II when it is located at positions +18 to +27 relative to TSS
DPE | Downstream Promoter Element | Transcription factors | RGWYV(T) | +28 to +32 | Acts with the core promoter to facilitate transcription initiation

Table 8.1 gives a brief list of known promoter motifs, their binding elements as well as their action on transcription levels. For additional elements and resources see http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=home

of regulation at this region include decapping, which can initiate mRNA decay from the 5′ end of transcripts. Secondary structure is sequence-dependent and can result in marked differences in the amount of downstream products. In silico predictions have shown that short 5′UTRs with low GC content and no upstream AUG codons result in high levels of protein production, whereas longer 5′UTRs with higher GC levels have more secondary structure and produce lower levels of protein5; however, there are known cases that break this general rule. One example is the human L1 bicistronic mRNA, which contains a long 900-nucleotide 5′UTR with high GC content yet is translated very efficiently.17 IRES are internal mRNA regulatory sites close to the translation start site that bind ribosomes. IRES allow ribosomal binding whether mRNAs are 5′ capped or uncapped, which allows for the translation of crucial proteins required for cell survival in conditions where cap-dependent translation is inhibited, as during cell stress or apoptosis. The efficiency of IRES-mediated translation is reliant on trans-acting protein factors, and altered levels of these factors can establish different cell-specific fates.18 IRES are very diverse and there is no universally conserved sequence that can delineate function at these regions. uORFs are present in approximately 50% of human 5′UTRs. For the most part, these regions reduce protein expression by between 30% and 80% by altering the efficiency of translation or initiation at the main ORF.19 Additionally, uORFs themselves can initiate translation, and there is evidence that protein products can be produced from these sites alone.20 All of these 5′UTR structures highlight the complex role and subtle regulatory potential of sequences located in the 5′UTR. Regulation mediated by 5′UTRs relies heavily on the secondary structure and accessibility of protein binding sites. Further identification of sequences and their roles in regulation is ongoing.
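The three sequence features named above for 5′UTRs (length, GC content, and upstream AUG codons) are simple to compute. The following is a minimal sketch in Python; the function name `utr5_features` and the example sequence are hypothetical and chosen only for illustration, and the sketch makes no attempt to predict secondary structure or protein output.

```python
def utr5_features(utr: str) -> dict:
    """Summarise simple sequence features of a 5'UTR (RNA or DNA string).

    Illustrative helper only: it reports the features discussed in the text
    and does not model folding or translation efficiency.
    """
    seq = utr.upper().replace("T", "U")  # accept DNA or RNA alphabets
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")
    # Count every AUG in any frame as a potential upstream start codon.
    upstream_aug = sum(1 for i in range(length - 2) if seq[i:i + 3] == "AUG")
    return {
        "length": length,
        "gc_fraction": gc_count / length if length else 0.0,
        "upstream_aug": upstream_aug,
    }


# Hypothetical toy sequence, not a real 5'UTR.
features = utr5_features("GGCAUGCCGAUGAAACC")
print(features)
```

A short sequence with a low `gc_fraction` and `upstream_aug` equal to zero would fall into the class that the in silico predictions associate with high protein production, although, as the L1 example shows, such rules have exceptions.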

Introns

Introns are genomic regions that are transcribed and then removed by posttranscriptional processing to create smaller mature mRNAs. Introns are present in all studied eukaryotes and allow for increased evolutionary diversity by potentiating recombination within coding regions as well as reducing selective pressure at noncoding introns.21 Intron number and length vary greatly between different species and different genes. The average human gene contains between five and six introns with an average length of 2,100 nucleotides.22 In general, intron length is inversely correlated with transcript levels. Of note, the first intron of coding sequences is on average 40% longer than later introns.22 Additionally, first introns, particularly within the first 100 bp, tend to be enriched for G-rich sequences that have the potential to form G-quadruplex structures (G4s).23 G4s are guanine-rich sequences that can fold into tetrahelical structures, which are very stable and tend to strongly repress translation.24 G4 structures are also found within the 5′UTR, where they can significantly repress translation. Within the first intron, these structures are thought to regulate transcription or RNA processing.
Approximately 35% of human genes have introns within the 5′UTR.25 Shorter 5′UTR introns are more commonly found in highly expressed genes, an association thought to arise through increases in both transcription and mRNA stability.25 In contrast, only 5% of 3′UTRs contain introns.26 Introns within the 3′UTR act to downregulate gene expression through nonsense-mediated decay (NMD).27 Additionally, introns within 3′UTRs can contain miRNA binding sites, which can further act to lower transcript levels.28 Introns in other gene regions are also known to regulate gene expression, and many genes with intact promoters are not expressed if they do not contain introns.27 Introns potentiate transcription by containing binding sites for enhancers as well as increasing the processivity of the transcription machinery at the elongation stage.27 Besides altering transcript levels, introns are crucial to altering transcript type through alternative splicing. One extreme example is the Drosophila Dscam gene, where over 38,000 different isoforms can be generated through the splicing of four cassette exon clusters.29 Alternative splicing has been found in up to 95% of human genes.30,31 Crucial sequence regions for splicing include alternative 5′ sites, alternative 3′ sites, cassette exons, retained exons, and intronic secondary structure. The 5′ splice site has the canonical sequence AGGURAGU, where splicing occurs between the two G nucleotides. The 3′ splice site region contains a branch-point sequence followed by a polypyrimidine tract and an AG dinucleotide at the 3′ end of the intron, where splicing actually occurs. Splicing occurs when members of the spliceosome complex bind these sites in pre-mRNA. U1 snRNP binds to the 5′ splice site, U2 snRNP binds to the branch point, and U2 auxiliary factor (U2AF) binds to the polypyrimidine tract and the AG at the 3′ site. Changes in the 5′ or 3′ splice sites can alter intron length or inclusion. Cassette exons can shuffle in and out of transcripts and alter their composition. Intronic secondary structure is crucial; pre-mRNAs fold into complex secondary and tertiary structures in vivo and these structures can alter the binding of splicing regulatory elements and proteins.32,33 Local pre-mRNA structure can prevent spliceosomal recognition of the 5′ splice site, 3′ splice site, or branch-point elements. Genome-wide studies of pre-mRNA structure have shown conservation of secondary structures at alternative splice sites, suggesting that there are canonical pre-mRNA formations that act to enrich splicing.34 Alterations in secondary structure can either sequester sequence elements or bring them together, leading either to repression or enhancement of alternative exons. Finally, alterations in splicing can either lead to functional processed transcripts or, approximately one third of the time, lead to untranslated transcripts that are degraded by nonsense-mediated decay (NMD).35
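The canonical 5′ splice site sequence AGGURAGU given above uses the IUPAC code R for a purine (A or G). As an illustration, a consensus of this kind can be turned into a regular expression and used to scan a pre-mRNA for candidate sites. This is a minimal sketch: the function names and the toy sequence are hypothetical, and an exact consensus match is far stricter than real splice site recognition, which tolerates mismatches and depends on sequence context and secondary structure.

```python
import re

# IUPAC nucleotide codes written for the RNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "N": "[ACGU]",
}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Build a regex for an IUPAC consensus; the lookahead reports overlapping hits."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_5prime_splice_sites(pre_mrna: str, consensus: str = "AGGURAGU") -> list:
    """Return 0-based indices of the first intronic nucleotide for each exact match.

    The exon/intron boundary lies between the two G's of the consensus,
    so the intron starts two nucleotides into each match.
    """
    seq = pre_mrna.upper().replace("T", "U")
    pattern = consensus_to_regex(consensus)
    return [match.start() + 2 for match in pattern.finditer(seq)]


# Hypothetical toy pre-mRNA containing two exact consensus matches.
sites = find_5prime_splice_sites("CCAGGUAAGUCCCAGGUGAGUAA")
print(sites)
```

The same `consensus_to_regex` helper could be applied to other IUPAC-coded motifs, provided the codes they use are present in the lookup table.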

3′UTR

The 3′ untranslated region is located just downstream of the protein-coding sequence of a gene. The main function of the 3′UTR is to regulate gene expression at the posttranscriptional level.36 Specifically, the 3′UTR is involved in transcript cleavage, stability and polyadenylation, translation, and mRNA localization. Notably, the average 3′UTR is twice as long in humans as in other mammals, indicating that there are significant regulatory elements located in these regions.37 Probably the most important sequence at the 3′ end of transcripts is the poly(A) tail, a stretch of adenosine bases added to the 3′ end of RNA molecules. Poly(A) binding proteins (PABP) bind in this region and have roles in the regulation of gene expression, mRNA export, mRNA stability, mRNA decay, and translation.38,39 In mammalian cells poly(A) tails are synthesized at a standard length of ~250 nucleotides, which then may be shortened within the cytoplasm to repress translation as required.40 Additionally, it is estimated that ~50% of human genes have alternative poly(A) sites, resulting in different expression levels or spatial organization.41 Another regulatory sequence located in the 3′UTR is the microRNA (miRNA) response element (MRE). One survey found that over 50% of 3′UTR motifs are associated with miRNAs.42 miRNAs bind to MRE sequences through partial complementarity to their mRNA partners, and this binding results in an inhibition of translation. Individual miRNAs have the ability to regulate entire networks of genes because the miRNA block relies only on binding to the MRE seed region, not on perfect base pairing. Studies on MRE sequence prediction have shown that ~50% of targets contain multiple MRE sequences.43 Another function of the 3′UTR is transcript stability. Alterations in transcript stability allow for expression to be rapidly controlled without altering translation rates and are an important component of processes such as cell growth and differentiation as well as response to environmental stimuli.44 The best-studied RNA stability sequence elements are the AU-rich elements (AREs). AREs range in size from 50 to 150 nucleotides and usually contain multiple copies of an AUUUA sequence.45 AREs bind proteins that for the most part promote the decay of mRNA in response to a variety of intra- and extracellular signals.
ARE secondary structure is crucial for protein binding, and different proteins can compete for the same binding site.46 For the most part, ARE binding results in transcript destabilization; however, there are rare cases where binding to AREs activates translation.47 Finally, as with the 5′UTR, 3′UTR secondary structure plays a crucial role in gene regulation. 3′UTR folding is an important determinant of translation efficiency. Factors that bind to the 3′UTR can change the spatial configuration by disrupting mRNA folding or by interacting with other factors and causing looping out of intervening mRNA sequence.48
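Because miRNA targeting relies on the seed region rather than on perfect base pairing, candidate MREs can be located by searching a 3′UTR for the reverse complement of the miRNA seed. The sketch below assumes the common convention that the seed is miRNA nucleotides 2 to 8; that convention, the function names, and the toy sequences are assumptions made for illustration and are not taken from the studies cited above. Real target prediction tools weigh many additional features.

```python
# Watson-Crick complement table for the RNA alphabet.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna: str) -> str:
    """Return the mRNA site (5'->3') complementary to miRNA nucleotides 2-8.

    Assumption: the seed is taken as positions 2-8 of the miRNA.
    """
    seed = mirna.upper().replace("T", "U")[1:8]
    return seed.translate(COMPLEMENT)[::-1]  # reverse complement


def find_seed_matches(utr3: str, mirna: str) -> list:
    """Return 0-based start positions of every seed match in a 3'UTR."""
    utr = utr3.upper().replace("T", "U")
    site = seed_site(mirna)
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)  # allow overlapping matches
    return hits


# Hypothetical toy sequences used only to exercise the functions.
toy_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCGGGCUACCUCAA"
print(seed_site(toy_mirna), find_seed_matches(toy_utr, toy_mirna))
```

A 3′UTR that returns more than one position would be an example of a target carrying multiple MRE sequences.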

CpG Islands

For the most part, genomic variability in the regions discussed previously occurs in the germline and is relatively stable. However, there are regions of interest in the human genome that are subject to epigenetic effects, which can both arise within a single generation and be transmitted through the germline. In terms of genomic sequence, the regions that are most likely to be affected by epigenetic changes are CpG islands (CGIs). CGIs are short interspersed DNA sequences that are on average between 300 and 3,000 bp long, are GC-rich, and are in their native state predominantly nonmethylated. They are characterized by a CpG dinucleotide content enrichment of at least 60%, whereas the rest of the genome has a much lower CpG frequency (approximately 1% enrichment). Nonmethylated CGIs are also associated with a transcriptionally permissive chromatin state, and approximately 40% of CGIs map to the sites of transcription initiation49; however, many orphan CGIs have been mapped, including thousands that are not located near a typical promoter sequence.50,51 CGIs not associated with typical TSSs can be located both within gene bodies (intragenic CGIs) and between genes (intergenic CGIs).50 There is evidence for transcription initiation at approximately 40% of these nonpromoter-associated sites.50 Methylation of CpG sequences can alter gene expression by inducing histone modifications that inhibit access of the transcriptional machinery and lower levels of expression51; however, there are some exceptions to the rule that increased methylation is repressive and hypomethylation permissive to transcription.
For example, Gius and colleagues found that chemical hypomethylation of approximately 50% of genes they examined resulted in a transcriptional block, rather than upregulation as expected.52 CpG islands are relatively rare; about 50,000 have been mapped in the human genome53 and the number of known imprinted genes is less than 1% of the entire human genome.54 Yet, they are an important part of genetic regulation, providing a link between environmental changes that could alter methylation status and underlying heritable DNA signals.
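CpG enrichment of the kind described above is commonly summarised as an observed-to-expected CpG ratio alongside the GC fraction of a sequence window. The sketch below uses the widely used formula (number of CpG × window length) / (number of C × number of G). That formula is an assumption brought in for illustration rather than something specified in this chapter, and the function name and test sequences are hypothetical.

```python
def cpg_stats(seq: str) -> dict:
    """Compute GC fraction and observed/expected CpG ratio for a DNA window.

    Assumption: expected CpG count is (#C * #G) / N, so
    observed/expected = (#CpG * N) / (#C * #G).
    """
    s = seq.upper()
    n = len(s)
    c_count = s.count("C")
    g_count = s.count("G")
    cpg_count = s.count("CG")  # CG cannot overlap itself, so count() is exact
    gc_fraction = (c_count + g_count) / n if n else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return {"gc_fraction": gc_fraction, "cpg_obs_exp": obs_exp}


# Hypothetical toy windows: one CpG-rich, one CpG-free.
print(cpg_stats("CGCGCGATAT"))
print(cpg_stats("CATGCATGCA"))
```

In practice such statistics are computed over sliding windows of a few hundred base pairs, and the thresholds applied to them vary between annotation methods.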

Trans Factors

There are a considerable number of trans-acting factors that bind cis sequence and act to enhance or repress gene transcription. These include long noncoding RNAs (lncRNAs), microRNAs (miRNAs), and transcription factors (TFs). Details of how each of these elements contributes to the landscape of gene regulation are given further on in the section titled Transgenetic Regulation.

Conclusion

There is a tremendous diversity of regions within the human genome that act to regulate gene transcription. It is important to note that most of these sites can interact, creating subtle changes to maintain gene level homeostasis. For example, there are known interactions between the 5′UTR cap structure and the 3′ poly(A) tail that result in circularization of the mRNA and promote both translation initiation and efficient ribosome recycling.55 Another example is ncRNA competition with miRNA for MRE sites. Specifically, it has been shown that pseudogene partners of tumor suppressor genes can act as potent regulators of their coding pair by binding to MRE sites and blocking miRNA binding, thus resulting in increased expression of the coding tumor suppressor genes and growth suppression of cells.56 Thus it is crucial to understand not only the specific genomic regions playing a role in expression but also the multiple layers of players involved and their interactions.

CIS GENETIC REGULATION

Eukaryotic RNA polymerases cannot initiate transcription by themselves. Instead, combinations of short sequence elements in the immediate vicinity of a gene act as recognition signals for transcription factors to bind to the DNA in order to guide and activate the polymerase. Cis genetic regulation involves regulatory elements located close to the genes of effect. Cis regulation can occur upstream or downstream of a gene. Additionally, there are cases of intragenic elements changing transcript profiles; this is especially true of splicing modifiers.
Cis-regulatory elements are often binding sites for trans-regulatory factors, that is, regulatory elements that are located far away from the gene of effect and are typically diffusible proteins such as transcription factors (see next section). This section reviews cis regulation of expression, splicing, epigenetic cis modifications, and noncoding RNA.

Expression (eQTL)

Many DNA variants mapped through genome-wide association study (GWAS) approaches are noncoding polymorphisms falling into so-called junk regions of the genome. From this result, two hypotheses can be generated: (1) there are rare coding variants, not assayed in the existing screens, that can account for these effects and (2) these new LOAD risk variants are acting in ways that do not require changes to protein coding sequence. While the first hypothesis is valid, studies have shown that association results discovered in genome-wide association studies are significantly more likely to be caused by DNA variation changing transcript expression levels than by rare polymorphisms in linkage disequilibrium with the original findings.57 This emphasizes the importance of genetic variability in defining both expression states and disease traits and lends credence to the second hypothesis. The second hypothesis is the focus of several groups that are performing analyses of risk variation that include additional datasets beyond DNA sequence changes. These types of screens, called expression quantitative trait locus (eQTL) screens, involve measuring molecules downstream of DNA variation to map DNA risk variation in the context of function. eQTL studies are similar to traditional genetic association studies, but instead of correlating genetic markers with qualitative traits such as disease status, they associate genetic markers with quantitative gene expression levels. It has been suggested that this may be a more powerful approach to detecting loci of true effect as opposed to those that are just markers of disease,58 and further data have shown that between 10% and 15% of top hits in genome-wide association studies affect a known eQTL.59 The outputs from eQTL screens are SNP-transcript pairs that are correlated in a dose-dependent fashion, as shown in Figure 8.1.
eQTLs can either be located in cis to a gene, meaning that the SNP maps relatively close to the transcription start site (TSS) and is thought to exert its expression effects by direct action on the gene, or in trans, meaning that


[Figure 8.1. (A) Box plot of normalized expression profile (y-axis, −2.0 to 2.0) against allele dose (x-axis: 0, 1, 2). (B) Schematic linking DNA variants (T/C) to RNA for gene X, showing cis effects, trans effects, and network effects.]

FIGURE 8.1: Correlating genetic variation with expression changes. (A) To perform eQTL analysis, the variation profiles of DNA are mapped according to the copy number of the minor allele (0, 1, 2; x-axis of the box plot) and for each allele group the expression profile is determined (quantitative; y-axis of the box plot). The analysis involves assessing (1) whether there is a linear relationship between allele dose and expression (lines on box plot) and (2) whether that linear relationship changes with disease status (arrows on box plot). (B) The outputs from the regression tests are then mapped in terms of gene location. eQTLs can either be in cis (top), meaning they map closely to the gene they affect, or in trans (middle), meaning they map further away from the locus they are affecting and can even map to other chromosomes. Outputs can also be assessed for group effects, that is, how transcript/protein profiles map to other transcript/protein profiles (bottom).

the SNP maps far away from the TSS, possibly even on another chromosome. Trans effects are thought to be indirect, resulting from perturbations in pathways or microRNA effects that are not assayed in traditional microarray eQTL screens; these are discussed in the next section. Screens can be performed using unclassified or clinically normal individuals; however, greater power comes from assessing both normal and disease states. The increase in power arises because while risk factors can be seen in normal tissues and thus mapped, having the data to evaluate how and if those changes are affected by disease status can pick up additional relationships, as can be seen in our own work on human brain.60,61 To date the majority of studies of eQTLs in humans have focused on the relationship between DNA variation and downstream transcript expression. The first studies on human genome and transcriptome-wide eQTL analysis used linkage-based analyses in human EBV-transformed immortalized B-cell lines (ECLs) to study the SNP-transcript relationship

in clinically uncharacterized individuals.62‒64 This work was then extended to using ECLs to study diseases such as asthma, cardiovascular disease, hypertension, and obesity.65‒67 Because the immortalization procedures used in creating ECLs could alter expression profiles and SNP-transcript relationships, work has now progressed to include human tissues such as adipose tissue, liver, and human brain.60,61,66‒70 More recent work has employed NGS instead of microarrays to more fully capture the human transcriptome.71‒73 These approaches have all shown that genetic variability is an important component in the regulation of gene expression, with between 10% and 20% of the transcriptome being regulated by DNA variation.
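The core test in an eQTL screen, as described above and in Figure 8.1, is whether expression of a transcript changes linearly with the number of copies of the minor allele (0, 1, or 2). The following minimal sketch fits that line by ordinary least squares using only the Python standard library. The function name and the expression values are hypothetical; a real screen would add covariates, test for statistical significance, and correct for the very large number of SNP-transcript pairs tested.

```python
def allele_dose_regression(doses, expression):
    """Ordinary least-squares fit of expression on minor-allele dose.

    Returns (slope, intercept, r_squared). Illustrative only: no covariates,
    no p-values, and no multiple-testing correction.
    """
    n = len(doses)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(doses) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in doses)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doses, expression))
    if sxx == 0:
        raise ValueError("all samples share one genotype; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared


# Hypothetical normalized expression values for six individuals.
doses = [0, 0, 1, 1, 2, 2]
expression = [-1.0, -0.8, 0.0, 0.2, 1.0, 1.2]
slope, intercept, r_squared = allele_dose_regression(doses, expression)
print(slope, intercept, r_squared)
```

Fitting the same model separately in cases and controls, and comparing the slopes, is one simple way to ask the second question posed in Figure 8.1, namely whether the allele-dose relationship changes with disease status.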

Splicing (sQTL)

Alternative splicing has been found in up to 95% of human genes30,31; thus it is important to understand whether genetic variability contributes to alterations in splicing. Just as genetic control of transcript abundance can be mapped by performing eQTL studies, genetic control of transcript type can also be assessed. These studies map genetic variability to changes in transcript splicing (splicing quantitative trait loci, or sQTLs). sQTL studies map quantitative changes in transcript isoforms and also qualitative changes resulting in unique isoforms. While some of the conventional expression microarray chips used in eQTL studies do have probes to capture different isoforms, for the most part probes are designed against the 3′ end of transcript sequence; thus different techniques are used to fully assay sQTLs. One of the first studies to examine splicing used the same set of Centre d'Etude du Polymorphisme Humain (CEPH) lymphoblastoid cell lines (LCLs) as the first eQTL studies.74,75 The investigators used exon-tiling microarrays, which have probes designed to assay each predicted exon from several databases including RefSeq, GenBank, and dbEST. The first study74 found that approximately 5% of probes examined showed potential splicing events and ~3% of those were different between the two individuals studied. In the second study additional samples were employed (n = 57). This study found that 39% of the transcripts assayed demonstrated changes in whole-gene expression whereas 55% of transcripts had isoform changes.75 Most of the captured isoform changes resulted in alternative splicing (26%), with other captured effects resulting in changes in termination (18%) and initiation (11%). In both studies, a number of different types of splicing events were identified, including exon exclusion, intron retention, and the use of cryptic splice sites. Additional work in the same CEPH samples has used NGS to assess both expression and splicing.72,73 NGS is an improvement on exon tiling arrays because quantification and isoform identification are based on short sequence reads rather than on probe-based hybridization.
Thus de novo events can be more readily recovered using RNA-seq NGS methods and profiling can occur over the entire transcript. The first study72 used two unrelated HapMap CEU lymphoblast cell lines, running poly(A)-enriched RNA on an Illumina GAIIx platform. On comparing expression differences with those from previous exon array experiments, the investigators found a high level of correlation (Pearson r² = 0.56).72 Additionally, 20% to 25% of observed changes in expression were due to changes in exons alone, suggesting a considerable contribution of splicing to transcriptomic variation. The researchers also tested whether transcripts under genomic control were more likely to have different splicing and found a twofold enrichment of splicing in eQTL targets compared with non-eQTL transcripts; however, overall only 10% of eQTL targets were alternatively spliced, suggesting that total expression makes a greater contribution to transcript variance than splicing in considering genomic regulation. The second NGS study73 looked at nine different tissue types as well as several cancerous cell lines. Again, the Illumina platform was used for sequencing. These investigators found that alternative splicing was for the most part universal in human multiexon genes, with a portion of those events being controlled by genomic sequence changes. Much of the alternative splicing found in this study was tissue-specific, with ~22,000 events found. Between 47% and 74% of splicing events varied between tissues. Considerably less variability was accounted for by interindividual genomic changes. Between ~10% and 30% of alternative splicing events showed individual-specific variation, which is in line with prior studies showing that ~21% of alternatively spliced transcripts are regulated by DNA variation.75

Epigenetic Regulation: DNA Methylation

Sequence methylation and histone modifications can have a profound effect on RNA transcription. Indeed, recent mapping has demonstrated that nucleosome occupancy and histone states are integrally tied to transcription factor binding, and that epigenetic regulation and transcription cannot be fully disentangled.76 Epigenetic regulation of transcription can occur either by direct changes to the DNA via the addition of methyl groups or by modifications to histone proteins, which act as organizational and support structures for DNA. For the most part, methylation occurs in cis and histone modifications act more distally; therefore histone modifications are discussed in the next section, on transgenetic regulation. Methylation occurs primarily at cytosine residues via DNA methyltransferases (DNMTs) and stably alters the expression of genes in cells as cells divide and replicate. There are several DNMTs. DNA methyltransferase 1 (DNMT1) is primarily involved in the maintenance of methylation after DNA replication. DNMT2 is for the most part an RNA methyltransferase. DNMT3a and DNMT3b are crucial for de novo methylation. DNMT-mediated methylation at the 5 position of cytosine acts to reduce gene expression and occurs at CG sites in nongamete cells.77 DNA methylation inhibits transcription either by physically impeding the binding of transcriptional proteins78 or by the binding of methyl-CpG-binding domain proteins (MBDs), which can recruit additional histone-modifying proteins such as histone deacetylases, resulting in a modification of the overall chromatin structure to a closed configuration.79 Between 60% and 90% of CpG dinucleotides in mammals are methylated.80,81 Methylation in non-CG contexts has been found in undifferentiated embryonic cell lines, suggesting that methylation control might be important for cell differentiation.76 Alterations in methylation are also crucial in the development of disease states, such as cancer,82 and in response to early life stress.83

Noncoding RNAs

While most of the genome in complex organisms is transcribed, only ~1.5% of the human genome codes for protein.84 Noncoding RNAs (ncRNAs) are transcribed into fully functional RNAs that are never translated. Instead, these molecules function as regulatory sequences for other RNAs. There are many different types of ncRNA, including ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs), miRNAs, short interfering RNAs (siRNAs), piwi-interacting RNAs (piRNAs), small nuclear RNAs (snRNAs), natural antisense transcripts (NATs), and long noncoding RNAs (lncRNAs). The GENCODE project has recently annotated close to 10,000 ncRNAs.85 ncRNAs can be transcribed from either the sense or antisense strand. Often they are interspersed with multiple coding and noncoding transcripts; this fact has shifted the understanding of genomic organization from a linear model to a modular model whereby there are cassettes of transcription occurring in both directions. Genome-wide scans indicate that a significant number of ncRNAs have functional roles.86 For the most part ncRNA is concentrated around gene promoters, enhancers, and 3′UTRs.87 ncRNAs can regulate epigenetic chromatin modifications,88 splicing,89 transcription,90 and translation.90 ncRNA can regulate gene function either in cis or in trans. In terms of cis regulation, NATs are probably the most specific cis regulator of individual transcripts. NATs are RNAs containing sequences that are complementary to other RNAs. NATs are involved in numerous cellular processes, including translation regulation and stability, RNA export, alternative splicing, genomic imprinting, X inactivation, DNA methylation, and modification of histones.91 NATs can regulate their sense partners through three main mechanisms: transcriptional interference, RNA masking, and double-stranded RNA-dependent mechanisms.92 Transcriptional interference can occur when there is overlap between two transcriptional units such that there is steric inhibition of transcription initiation protein binding. This reduces steady-state mRNA levels through inhibition of transcription elongation.93 RNA masking involves the formation of RNA duplexes between sense and antisense strands at specific regulatory regions that are normally bound by elements involved in mRNA splicing, transport, polyadenylation, translation, and degradation. Blocking of these sites through RNA masking results in changes in downstream transcript content. Double-stranded RNA-dependent mechanisms also occur through complementary binding, resulting in RNA hybrids that either stabilize transcripts or send them down decay pathways. Stabilization can occur through masking of miRNA sites, as has been found with the ncRNA linc-MD1, which masks miRNA sites and regulates the expression of MAML1 and MEF2C, which are muscle-specific transcription factors involved in myogenesis.94

Conclusions

None of these cis regulatory effects exists in a vacuum; it is highly probable that many of these cis genetic regulatory switches occur in tandem and that there is both competition and cooperation among effects to provide final outcomes of both RNA type and amount. Indeed, there have been specific instances of splicing and expression being coregulated. Additionally, ncRNA can act on splice sites in a coordinated fashion. Other work has found that SNPs located within open chromatin, as marked by DNaseI hypersensitivity, are approximately four times more likely to be associated with variation in gene expression levels than SNPs outside these regions, suggesting coordinated regulation of cis eQTL and epigenetic marks. Thus there are multiple layers of cis players coordinating and competing to create the framework of gene expression that either allows for homeostatic balance and maintenance of health or leads down the pathways to disease.

TRANS GENETIC REGULATION

In contrast to cis effects, trans regulation of gene expression occurs distal to the gene of effect. Distance cutoffs for determining effects are somewhat arbitrary, but for the most part trans effects refer to gene expression regulation that is interchromosomal or involves some intermediary diffusible factor for regulatory control. One of the first examples of a trans effect was mapped in Drosophila in 1954.95 Transvection is a trans alteration of gene expression resulting from disruptions of somatic or meiotic chromosomal pairing. Since transvection was first discovered, many other instances of trans alterations in gene expression have been identified. In this section, trans alterations in histones, DNA structure, ncRNAs, and DNA binding proteins (DBPs) are reviewed.

Epigenetic Regulation: Histone Modifications

Histones are globular proteins that associate with DNA to form nucleosomes, which are then grouped into chromatin fibers and finally chromosomes. Nucleosomes are formed of histone octamers that consist of two copies of each of four core histone proteins (H2A, H2B, H3, and H4). Histone modifications can serve to recruit other proteins by specific recognition of modifications, and this can act to alter chromatin structure and silence or promote transcription. There are two major chromatin conformations. In the closed chromatin conformation, called heterochromatin, DNA is packaged around histone cores and is not accessible to the transcription machinery. In the open chromatin conformation, called euchromatin, DNA is unwound, and this allows for active transcription. There are many modifications to the four core histones that can help to establish the different states. These can include methylation, acetylation, ADP-ribosylation, ubiquitination, citrullination, and phosphorylation.96 Major modifications, histones, and enzymes are listed in Table 8.2.
There are myriad possible combinations of these modifications; however, acetylation and methylation mapping found a common signature of 17 modifications within ~3,000 genes examined, suggesting some consistency of process.97 Modifications can induce or repress transcription. For example, regions of open chromatin are associated with H3K4me1 signals; these regions predict enhancer elements. Additionally, H3K4me3 marks sites of open chromatin and indicates transcription start sites, and H3K36me3 is a transcription elongation signal.98 In contrast, H3K27me3 polycomb-enriched regions are repressed, with little to no transcription signal. H3K9me also acts as a silencing mark via the binding of the chromo domain of heterochromatin-associated protein 1 (HP-1).99 While acetylation and phosphorylation modifications are reversible, methylation on histone tails can persist through cell division, thus potentially serving as a permanent “off” switch for transcription.100 Other modifications include the monomethylations of H4K20, H3K79, and H2BK5 (gene activation) and trimethylation of H3K79 (repression).101 Beyond the open and closed configurations, recent mapping has determined that there are seven different chromatin states.102 Open chromatin states that are transcriptionally permissive include (1) CTCF-enriched elements that lack histone modifications, (2) enhancers associated with H3K4me1, (3) predicted promoter flanking regions, (4) regions overlapping the TSS that are enriched for H3K4me3, (5) predicted transcribed regions that overlap gene bodies and include the H3K36me3 elongation signal, and finally (6) weak enhancers.102 The final chromatin state comprises repressed regions containing the H3K27me3 signal.102
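The mark-to-state associations named above can be written down as a small lookup. The following Python sketch is only an illustration: the function name `annotate_region`, the state labels, and the rule of falling back to a CTCF call when no histone mark is present are assumptions made here, not the segmentation method used in the cited mapping work, which relies on statistical models over many data tracks.

```python
# Toy lookup of the histone-mark associations named in the text.
# State labels and the fallback rule are illustrative assumptions.
MARK_TO_STATE = {
    "H3K4me1": "enhancer",
    "H3K4me3": "TSS / promoter",
    "H3K36me3": "transcribed (elongation)",
    "H3K27me3": "repressed",
}


def annotate_region(marks, ctcf_bound=False):
    """Return the sorted candidate states for a region given its marks."""
    states = {MARK_TO_STATE[m] for m in marks if m in MARK_TO_STATE}
    # CTCF-enriched elements are described as lacking histone modifications.
    if not states and ctcf_bound:
        states.add("CTCF-enriched")
    return sorted(states) or ["unclassified"]


print(annotate_region({"H3K4me3"}))
print(annotate_region(set(), ctcf_bound=True))
```

A region carrying several marks returns several candidate states, which reflects the combinatorial nature of the modifications rather than resolving it.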

Epigenetic Regulation: Structural Factors
The human genome is approximately 3 billion bp in length and is organized into chromosomes, which are tightly wound segments of genomic DNA and DNA-associated proteins. This organizational packaging results in a particular three-dimensional spatial orientation of DNA within the cell nucleus that can affect the genomic regulation of transcription. Studies have found a nonrandom distribution of interchromosomal organization, with small chromosomes being located closer to the center of the nucleus.103 Additional work examining X-inactivation has demonstrated that epigenetic silencing is dependent on chromosome localization, supporting a role for higher-order spatial arrangements in transcription regulation.104,105 Other spatial modifications involved in gene regulation include looping out of intervening genome sequence, which can trigger gene expression. For example, upon activation with retinoic acid, the location of the Hoxb gene cluster is altered, such that certain genes are looped away from the tightly wound chromosome territories; this results in induction of transcription.106 Other studies have shown that loci that are distant from each other (including gene loci on separate chromosomes) colocalize when actively transcribed to shared transcription “factories.”107 Finally, further mapping of genome-wide spatial organization has demonstrated that loci are organized into two main compartments. Greater interaction occurs between genomic loci within the same compartment as opposed to across compartments, and transcriptional activity occurs for the most part in the

ACETYLATION

TABLE  8.2. MAJOR HISTONE MODIFICATIONS*

Histone Site

Species

Enzymes

Proposed Function

H2A

Lys4 Lys5

S. cerevisiae mammals

Transcriptional activation Transcriptional activation

Lys7

S. cerevisiae

Esa1 Tip60, p300/CBP p300/CBP Hat1 Esa1

Unknown Transcriptional activation

Lys5 Lys11 Lys12 Lys15 Lys16 Lys20

S. cerevisiae mammals mammals S. cerevisiae

p300, ATF2 Gcn5 p300/CBP, ATF2 p300/CBP, ATF2 Gcn5, Esa1 p300

Transcriptional Transcriptional Transcriptional Transcriptional Transcriptional Transcriptional

Lys4

S. cerevisiae

Esa1 Hpa2 unknown Gcn5, SRC-1 unknown Gcn5, PCAF Esa1, Tip60 SRC-1 Elp3 Hpa2 hTFIIIC90 TAF1 Sas2 Sas3 p300 Gcn5 p300/CBP

Transcriptional activation Unknown Histone deposition Transcriptional activation Histone deposition Transcriptional activation Transcriptional activation, DNA repair Transcriptional activation Transcriptional activation (elongation) Unknown RNA polymerase III transcription RNA polymerase II transcription Euchromatin Transcriptional activation (elongation) Transcriptional activation Transcriptional activation, DNA repair DNA replication, transcriptional activation Histone deposition Transcriptional activation, DNA repair Transcriptional activation (elongation) Transcriptional activation Transcriptional activation Transcriptional activation, DNA repair

H2B

H3

Lys9 Lys14

Lys18

Lys23

Lys27 Lys56

S. cerevisiae

unknown Gcn5 Sas3 p300/CBP Gcn5 Spt10

activation activation activation activation activation activation


TABLE 8.2. CONTINUED

H4

Lys5

S. cerevisiae

Hat1 Esa1, Tip60 ATF2 Hpa2 p300 Gcn5, PCAF Esa1, Tip60 ATF2 Elp3 p300 Hat1 Esa1, Tip60 Hpa2 p300 Gcn5 MOF Esa1, Tip60 ATF2 Sas2 Hat1/Hat2

Species

Enzymes

Proposed Function

Ezh2

transcriptional silencing

Lys8

Lys12

Lys16 D. melanogaster

Lys91

METHYLATION

Histone Site H1

Lys26

H3

Lys4

S. cerevisiae vertebrates D. melanogaster

Arg8 Lys9

N. crassa, A. thaliana D. melanogaster Arg17 Lys27

Lys36 Lys79

H4

Arg3 Lys20 D. melanogaster S. pombe Lys59

Histone deposition Transcriptional activation, DNA repair Transcriptional activation Unknown Transcriptional activation Transcriptional activation Transcriptional activation, DNA repair Transcriptional activation Transcriptional activation (elongation) Transcriptional activation Histone deposition, telomeric silencing Transcriptional activation, DNA repair Unknown Transcriptional activation Transcriptional activation Transcriptional activation Transcriptional activation, DNA repair Transcriptional activation Euchromatin Chromatin assembly

Set1 Set7/9 MLL, ALL-1 Ash1 PRMT5 Suv39h, Clr4 G9a

permissive euchromatin (di-Me) transcriptional activation (tri-Me) transcriptional activation transcriptional activation transcriptional repression transcriptional silencing (tri-Me) transcriptional repression genomic imprinting SETDB1 transcriptional repression (tri-Me) Dim-5, Kryptonite DNA methylation (tri-Me) Ash1 transcriptional activation CARM1 transcriptional activation Ezh2 transcriptional silencing X inactivation (tri-Me) G9a transcriptional silencing Set2 transcriptional activation (elongation) Dot1 euchromatin transcriptional activation (elongation) checkpoint response

PRMT1 PRMT5 PR-Set7 Suv4-20h Ash1 Set9 unknown

transcriptional activation transcriptional repression transcriptional silencing (mono-Me) heterochromatin (tri-Me) transcriptional activation checkpoint response transcriptional silencing

(continued)


PHOSPHORYLATION

TABLE 8.2. CONTINUED

H1

Ser27

Unknown

transcriptional activation, chromatin decondensation

H2A

Ser1

Unknown MSK1 NHK1 Unknown Mec1, Tel1 ATR, ATM, DNA-PK

mitosis, chromatin assembly transcriptional repression mitosis DNA repair DNA repair DNA repair

Ste20 Mst1 unknown TAF1

apoptosis apoptosis DNA repair transcriptional activation

Haspin/Gsg2 Aurora-B kinase MSK1, MSK2 IKK-α Snf1 Dlk/Zip Aurora-B kinase MSK1, MSK2

mitosis mitosis, meiosis immediate-early gene activation transcriptional activation transcriptional activation mitosis mitosis immediate-early activation

unknown CK2

mitosis, chromatin assembly DNA repair

H2B

H3

Thr119 Ser122 Ser129 Ser139

D. melanogaster S. cerevisiae S. cerevisiae mammalian H2A

Ser10 Ser14

S. cerevisiae vertebrates

Ser33

D. melanogaster

Thr3 Ser10

Thr11 Ser28 H4

mammals mammals

Ser1

*Listed are the histones, sites and enzymes modified by acetylation, methylation and phosphorylation of the genome. Table modified from http://www.cellsignal.com/reference/pathway/histone modification.html

compartment associated with accessible chromatin and histone modifications.108 All of these findings point to the conclusion that structural relationships are crucial for gene regulation.

Noncoding RNAs
As discussed previously, ncRNA can act in cis through the actions of NATs to eliminate or potentiate transcript stabilization; however, other forms of ncRNA can act in trans. Probably the best known of these trans ncRNA factors are the microRNAs (miRNAs). miRNAs are short (~22 nucleotides long) noncoding RNAs that hybridize with mRNA targets posttranscriptionally. They were first described by Lee and colleagues, who showed that the main function of miRNAs is to downregulate gene expression.109 Importantly, the miRNA described in this report affected only gene translation and not gene transcription, presenting a novel form of gene regulation at the time. Most of the characterized miRNAs are located antisense to neighboring genes in intergenic regions and are transcribed independently from mRNAs.110 miRNAs constitute approximately 1% of the human genome and regulate protein production for ~10% of all human genes.111 A single miRNA can downregulate the expression of hundreds of targets, and a typical conserved miRNA has 200 target mRNAs with conserved binding sites. miRNAs downregulate genes by inhibiting translation of mRNA after assembly of the miRNA into the RNA-induced silencing complex (RISC).112 An additional mechanism whereby miRNAs downregulate genes is the promotion of RNA cleavage; however, this is less common than inhibition of translation.113
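As a rough illustration of how hybridization between a miRNA and its mRNA targets can be searched for computationally, the Python sketch below scans a 3' UTR for sites complementary to the miRNA "seed" (nucleotides 2-8). The seed-matching rule is a common simplification assumed here and is not described in this chapter; the sequences and the function names `reverse_complement` and `seed_sites` are hypothetical, and real target prediction also weighs conservation and site context.

```python
# Minimal seed-match scan. Sequences are written 5'->3' in RNA letters.
# The miRNA and UTR below are hypothetical examples, not curated targets.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, utr):
    """Return 0-based UTR positions that pair with miRNA nucleotides 2-8."""
    seed = mirna[1:8]                 # nucleotides 2-8 (7 nt)
    site = reverse_complement(seed)   # sequence expected in the mRNA
    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)     # allow overlapping matches
    return positions


mirna = "UGAGGUAGUAGGUUGUAUAGUU"      # hypothetical 22-nt miRNA
utr = "AAACUACCUCAGGGCUACCUCUU"       # hypothetical 3' UTR fragment
print(seed_sites(mirna, utr))
```

Because a 7-nt site is short, such matches are expected to occur by chance in many transcripts, which is one reason a single miRNA can have hundreds of candidate targets.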

DNA-Binding Proteins (DBPs)
Another major player in trans genomic regulation is the binding of trans-acting factors, such as DNA-binding proteins (DBPs), to cis regulatory sequences. DBPs include transcription factors (TFs), polymerases, nucleases, and histones. The TFs are one of the best-studied sets of DBPs. They bind to specific cis sequences and can activate or inhibit downstream transcription. Regulation can occur either by binding the RNA polymerase responsible for transcription or by binding enzymes that modify the histones at the promoter, changing the DNA conformation to permit or block transcription. As with miRNA, single TFs can have broad-ranging effects on thousands of genes.114 There are at least 1,400 DNA-binding transcription factors in the human genome.115 These include general transcription factors (GTFs), promoter-specific activator proteins, and coactivators. GTFs help form the preinitiation complex (PIC) through their assembly on the core promoter. These factors are required for the transcription of almost all human genes. Promoter-specific activator proteins bind to specific promoter sequences and modulate transcription. This class, along with the coactivators, helps to establish intercell, interindividual, and interspecies differences in gene expression. Coactivators typically lack intrinsic sequence-specific DNA binding and act to link activators to the general transcription machinery. The mechanisms by which these types of TFs regulate transcription are varied and include (1) stabilization or blocking of RNA polymerase binding, (2) acetylation or deacetylation of histones, and (3) recruitment of coactivators.116 While TFs regulate the transcription of other genes, they themselves are regulated in multiple ways. Most transcription factors act as their own repressors in a negative feedback loop. This mechanism is important for maintaining low levels of active TFs within cells, and misregulation of this process has been implicated in tumor formation.117
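The effect of a TF repressing its own production can be illustrated with a toy rate model. The Python sketch below compares constant production with self-repressed production; the Hill-type repression term, the parameter values, and the function names are all assumptions chosen for illustration and are not taken from the cited work.

```python
# Toy model: dx/dt = production(x) - ALPHA * x, integrated by forward Euler.
# BETA, ALPHA, K, and N are illustrative values, not measured parameters.
BETA, ALPHA, K, N = 10.0, 1.0, 1.0, 2


def simulate(production, x0=0.0, dt=0.01, steps=2000):
    """Integrate the TF level x forward in time and return its final value."""
    x = x0
    for _ in range(steps):
        x += dt * (production(x) - ALPHA * x)
    return x


def unregulated(x):
    """Constant production, independent of the TF's own level."""
    return BETA


def self_repressed(x):
    """Assumed Hill-type repression: production falls as the TF accumulates."""
    return BETA / (1.0 + (x / K) ** N)


x_open = simulate(unregulated)
x_loop = simulate(self_repressed)
print(round(x_open, 3), round(x_loop, 3))
```

With these values the unregulated level settles near BETA / ALPHA = 10, whereas the self-repressed level settles near 2 (the root of x^3 + x = 10), showing how negative feedback holds the active TF at a lower level.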

Conclusions
As with cis factors, there are multiple layers and players involved in trans genetic regulation, and all factors can act in concert to determine final outputs of gene expression. It is interesting to note that while there appear to be more regulators acting in trans,118 other work has found that differences in interindividual expression profiles were mostly the result of many cis-acting changes spread throughout the genome as opposed to a few global trans effects (i.e., master regulators).119 While it remains to be seen whether this effect occurs universally, this result has implications for the development of molecules to restore gene expression homeostasis, in that pursuing cis targets will require different modifiers than altering single trans activators with multiple downstream effects.
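In expression-mapping analyses, the cis/trans distinction used throughout this section is usually made operational with a positional rule. The Python sketch below shows one such rule; the 1 Mb window, the function name `classify_eqtl`, and the coordinates in the example are assumptions for illustration, since, as noted earlier, distance cutoffs are somewhat arbitrary.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  cis_window=1_000_000):
    """Label a variant-gene pair as 'cis' or 'trans' by genomic position.

    cis_window is an assumed cutoff in base pairs; studies differ on it.
    """
    if snp_chrom != gene_chrom:
        return "trans"                      # interchromosomal
    if gene_start <= snp_pos <= gene_end:
        distance = 0                        # variant lies inside the gene
    else:
        distance = min(abs(snp_pos - gene_start), abs(snp_pos - gene_end))
    return "cis" if distance <= cis_window else "trans"


# Hypothetical coordinates, for illustration only.
print(classify_eqtl("chr1", 1_250_000, "chr1", 1_000_000, 1_100_000))
print(classify_eqtl("chr2", 1_250_000, "chr1", 1_000_000, 1_100_000))
```

A positional rule of this kind cannot tell whether a distant variant on the same chromosome acts through a diffusible intermediary, so the label describes location and does not establish mechanism.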


ENCODE
Finally, any discussion of genomic regulation of gene expression would be remiss without a mention of the ENCyclopedia Of DNA Elements (ENCODE) project (www.nature.com/encode/#/threads).102 This project was launched in September 2003 with the main goal of delving into the “DNA deserts,” previously unannotated genomic areas not easily described by the HGP. In September 2012, some 30 papers describing the first full-genome pass of the ENCODE project were published in three different journals. To identify all functional elements in the human genome sequence, more than 1,600 experiments were performed in 147 different cell types. Functional elements were defined as discrete genome segments that encoded a defined product or displayed a reproducible biochemical signature. The targets of these experiments included DNA methylation, open chromatin regions, RNA binding sites, human protein coding regions, noncoding RNA, pseudogenes, histone modifications, and transcription factors. ENCODE assigned function to approximately 80% of the human genome. Experimental methods ranged from capturing epigenetic modifications to mapping transcription factor binding. Two major classes of annotations were completed. First, genes as defined by the HGP were validated and their RNA transcripts mapped. Second, transcriptional regulatory regions were annotated. ENCODE found that genes constituted approximately 1% of the genome, that ~95% of the human genome lies within 8 kb of a DNA-protein interaction, and that ~99% of the human genome is within 1.7 kb of at least one biochemical event as measured by ENCODE. Four major classes of functional elements were identified. First, ENCODE identified various types of RNA, which accounted for 62% of the genome and included both coding and noncoding RNA. Most of the RNA delineated by ENCODE resided within introns or near genes. The second most enriched class of elements was histone modifications, accounting for ~56% of the human genome. Open chromatin accounted for ~15% of genomic content, and finally transcription factor binding sites accounted for ~8% of genomic content. Overall, the mapped fraction of the genome responsible for gene regulation was considerably higher than the portion of the genome responsible for protein coding (~20% for regulation compared with ~1% for protein coding), further suggesting that gene regulation is a crucial component of the human genome and perhaps what helps distinguish the human genome from the genomes of other species.

DISCUSSION
Genomic control of gene expression helps to establish and ensure cell health. Genomic control also establishes cell fates and most likely accounts for what separates Homo sapiens from other species. Regulation of the human transcriptome occurs through many multilayered combinatorial processes (Figure 8.2), none of which exists in isolation. Final output of this transcriptome “code” helps to modulate the subtle control of protein coding genes such that spatial and temporal distribution of proteins is tightly orchestrated to maintain health. Yet proteins are just the beginning of the story. Newer data suggest a crucial role for non-protein-coding “junk” areas of the genome as regulators of this complex process, including but not limited to epigenetic effects and direct actions of noncoding RNA. In eukaryotes, there is a clear correlation between organismal complexity and the amount of noncoding genomic sequence; the genome of Homo sapiens contains one of the highest ratios of noncoding content to total genome size.120 This finding is logical in that greater organismal complexity can be achieved through subtle variations in the activity of regulatory components as opposed to simple increases in coding gene counts. It is also desirable for higher organisms to have a greater ratio of regulatory noncoding to coding genome content because noncoding regulatory sequences are not subject to the evolutionary constraints imposed on protein coding genes, many of which are crucial to cell survival. Thus, this ratio skew can increase potential interspecies diversity.121 While these points are salient, the idea that the human genome contains more regulatory content than protein coding content was not the original hypothesis for how the genome was organized. It was long assumed that genomic information would lie within the protein “code” and that the rest of the genome was merely junk; however, these assumptions were invalidated by the vast amounts of data collected through the HGP (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml), the FANTOM consortium (http://fantom.gsc.riken.jp/), and the ENCODE project (http://www.genome.gov/10005107).

[Figure 8.2 labels. Cis factors: 5’ UTR/promoter elements, eQTL; TSS; CpG methylation block; introns/splicing components, sQTL; exon; NATs; MRE; sequence labels (C/T, G/A, G/C, AG). Trans factors: histone modifications; transcription factors; miRNA; structural effects.]

FIGURE 8.2: The many layers of genomic control of transcriptome output. Shown in the figure are some of the processes discussed in this chapter, including both cis (top of figure) and trans (bottom of figure) effects.

Currently there is a constant influx of information that modifies existing assumptions, such as the assumption that humans would contain the largest number of protein-coding genes among all species. This data-gathering process is seemingly impossibly complex, and the more we know the less we know. Yet, much as with the human genome sequencing project, we are still in a data-gathering/hypothesis-generation phase. Some have suggested that this necessitates a change in funding priorities.122 Indeed, limiting science to truncated hypothesis-testing projects rather than descriptive hypothesis-generating projects would have resulted in little new knowledge in the field of genomics and genomic regulation, since most of the core assumptions were incorrect. Observation is as important as testing ideas. The field of genomic regulation is exploding with rich resources, novel ideas, and new data, and our understanding of the human genome and its regulation is slowly becoming more tractable ~10 years after the first draft of the raw sequence was completed. However, there are still discoveries to be made and, as Winston Churchill famously opined: “Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

ACKNOWLEDGEMENTS
AJM is supported by an R01 award from the National Institute on Aging (AG041232).

REFERENCES
1. Schafer AJ, & Hawkins JR. DNA variation and the future of human genetics. Nat Biotechnol 1998;16(1):33–39.
2. Myers AJ. The age of the “ome”: genome, transcriptome and proteome dataset collection and analysis. Brain Res Bull 2012;88(4):294–301.
3. Myers AJ. AD gene 3-D: moving past single layer genetic information to map novel loci involved in Alzheimer’s disease. J Alzheimers Dis 2013;33 Suppl 1:S15–22.
4. Deaton AM, & Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–1022. Review.
5. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, . . . et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007;447:799–816.
6. Mattick JS. Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep 2001;2:986–991.
7. Levine M, & Tjian R. Transcription regulation and animal diversity. Nature 2003;424:147–151.
8. Landolin JM, Johnson DS, Trinklein ND, Aldred SF, Medina C, Shulha H, . . . Myers RM. Sequence features that drive human promoter function and tissue specificity. Genome Res 2010;20:890–898.
9. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, . . . Hayashizaki Y. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006;38:626–635.
10. Cooper SJ, Trinklein ND, Anton ED, Nguyen L, & Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res 2006;16:1–10.
11. Baek D, Davis C, Ewing B, Gordon D, & Green P. Characterization and predictive discovery of evolutionarily conserved mammalian alternative promoters. Genome Res 2007;17:145–155.
12. Arce L, Yokoyama NN, & Waterman ML. Diversity of LEF/TCF action in development and disease. Oncogene 2006;25:7492–7504.
13. Li TW, Ting JH, Yokoyama NN, Bernstein A, van de Wetering M, & Waterman ML. Wnt activation and alternative promoter repression of LEF1 in colon cancer. Mol Cell Biol 2006;26(14):5284–5299.
14. Gagniuc P, & Ionescu-Tirgoviste C. Eukaryotic genomes may exhibit up to 10 generic classes of gene promoters. BMC Genomics 2012;13:512.
15. Mitchell SF, Walker SE, Algire MA, Park EH, Hinnebusch AG, & Lorsch JR. The 5’-7-methylguanosine cap on eukaryotic mRNAs serves both to stimulate canonical translation initiation and to block an alternative pathway. Mol Cell 2010;39:950–962.
16. Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, & Myers RM. An abundance of bidirectional promoters in the human genome. Genome Res 2004;14(1):62–66.
17. Dmitriev SE, Andreev DE, Terenin IM, Olovnikov IA, Prassolov VS, Merrick WC, & Shatsky IN. Efficient translation initiation directed by the 900-nucleotide-long and GC-rich 5’ untranslated region of the human retrotransposon LINE-1 mRNA is strictly cap dependent rather than internal ribosome entry site mediated. Mol Cell Biol 2007;27:4685–4697.
18. Pickering BM, & Willis AE. The implications of structured 5’ untranslated regions on translation and disease. Semin Cell Dev Biol 2005;16:39–47.
19. Calvo SE, Pagliarini DJ, & Mootha VK. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc Natl Acad Sci USA 2009;106:7507–7512.


20. Oyama M, Itagaki C, Hata H, Suzuki Y, Izumi T, Natsume T, . . . Sugano S. Analysis of small human proteins reveals the translation of upstream open reading frames of mRNAs. Genome Res 2004;14:2048–2052.
21. Fedorova L, & Fedorov A. Introns in gene evolution. Genetica 2003;118:123–131.
22. Bradnam KR, & Korf I. Longer first introns are a general property of eukaryotic gene structure. PLoS ONE 2008;3:e3093.
23. Eddy J, & Maizels N. Conserved elements with potential to form polymorphic G-quadruplex structures in the first intron of human genes. Nucleic Acids Res 2008;36:1321–1333.
24. Beaudoin JD, & Perreault JP. 5’-UTR G-quadruplex structures acting as translational repressors. Nucleic Acids Res 2010;38:7022–7036.
25. Cenik C, Derti A, Mellor JC, Berriz GF, & Roth FP. Genome-wide functional analysis of human 5’ untranslated region introns. Genome Biol 2010;11:R29.
26. Fablet M, Bueno M, Potrzebowski L, & Kaessmann H. Evolutionary origin and functions of retrogene introns. Mol Biol Evol 2009;26:2147–2156.
27. Rose AB. Intron-mediated regulation of gene expression. Curr Top Microbiol Immunol 2008;326:277–290.
28. Tan S, Guo J, Huang Q, Chen X, Li-Ling J, Li Q, & Ma F. Retained introns increase putative microRNA targets within 3’ UTRs of human mRNA. FEBS Lett 2007;581:1081–1086.
29. Schmucker D, Clemens JC, Shu H, Worby CA, Xiao J, Muda M, . . . Zipursky SL. Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity. Cell 2000;101:671–684.
30. Pan Q, Shai O, Lee LJ, Frey BJ, & Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008;40:1413–1415.
31. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, . . . Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature 2008;456:470–476.
32. Buratti E, & Baralle FE. Influence of RNA secondary structure on the pre-mRNA splicing process. Mol Cell Biol 2004;24:10505–10514.
33. Warf MB, & Berglund JA. Role of RNA structure in regulating pre-mRNA splicing. Trends Biochem Sci 2010;35:169–178.
34. Shepard PJ, & Hertel KJ. Conserved RNA secondary structures promote alternative splicing. RNA 2008;14:1463–1469.
35. Lewis BP, Green RE, & Brenner SE. Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc Natl Acad Sci U S A 2003;100:189–192.
36. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, . . . Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034–1050.
37. Pesole G, Mignone F, Gissi C, Grillo G, Licciulli F, & Liuni S. Structural and functional features of eukaryotic mRNA untranslated regions. Gene 2001;276:73–81.
38. Gorgoni B, & Gray NK. The roles of cytoplasmic poly(A)-binding proteins in regulating gene expression: a developmental perspective. Brief Funct Gen Proteom 2004;3:125–141.
39. Mangus DA, Evans MC, & Jacobson A. Poly(A)-binding proteins: multifunctional scaffolds for the post-transcriptional control of gene expression. Genome Biol 2003;4:223.
40. Kuhn U, Gundel M, Knoth A, Kerwitz Y, Rudel S, & Wahle E. Poly(A) tail length is controlled by the nuclear poly(A)-binding protein regulating the interaction between poly(A) polymerase and the cleavage and polyadenylation specificity factor. J Biol Chem 2009;284:22803–22814.
41. Dickson AM, & Wilusz J. Polyadenylation: alternative lifestyles of the A-rich (and famous?). EMBO J 2010;29:1473–1474.
42. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, . . . Kellis M. Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature 2005;434:338–345.
43. Stark A, Brennecke J, Bushati N, Russell RB, & Cohen SM. Animal MicroRNAs confer robustness to gene expression and have a significant impact on 3’UTR evolution. Cell 2005;123:1133–1146.
44. Eberhardt W, Doller A, Akool el-S, & Pfeilschifter J. Modulation of mRNA stability as a novel therapeutic approach. Pharmacol Ther 2007;114:56–73.
45. Chen CY, & Shyu AB. AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem Sci 1995;20:465–470.
46. Meisner NC, Hackermuller J, Uhl V, Aszodi A, Jaritz M, & Auer M. mRNA openers and closers: modulating AU-rich element-controlled mRNA stability by a molecular switch in mRNA secondary structure. ChemBioChem 2004;5:1432–1447.
47. Vasudevan S, & Steitz JA. AU-rich-element-mediated upregulation of translation by FXR1 and Argonaute 2. Cell 2007;128:1105–1118.

48. Eberle AB, Stalder L, Mathys H, Orozco RZ, & Muhlemann O. Posttranscriptional gene regulation by spatial rearrangement of the 3’ untranslated region. PLoS Biol 2008;6:e92.
49. Fatemi M, Pao MM, Jeong S, Gal-Yam EN, Egger G, Weisenberger DJ, & Jones PA. Footprinting of mammalian promoters: use of a CpG DNA methyltransferase revealing nucleosome positions at a single molecule level. Nucleic Acids Res 2005;33(20):e176.
50. Illingworth RS, Gruenewald-Schneider U, Webb S, Kerr ARW, James KD, Turner DJ, . . . Bird AP. Orphan CpG islands identify numerous conserved promoters in the mammalian genome. PLoS Genet 2010;6:e1001134.
51. Bird AP, & Wolffe AP. Methylation-induced repression--belts, braces, and chromatin. Cell 1999;99:451–454.
52. Gius D, Cui H, Bradbury CM, Cook J, Smart DK, Zhao S, . . . Feinberg AP. Distinct effects on gene expression of chemical and genetic manipulation of the cancer epigenome revealed by a multimodality approach. Cancer Cell 2004;6:361–371.
53. Bandyopadhyay D, & Medrano EE. The emerging role of epigenetics in cellular and organismal aging. Exp Gerontol 2003;38:1299–1307.
54. Weaver JR, Susiarjo M, & Bartolomei MS. Imprinting and epigenetic changes in the early embryo. Mamm Genome 2009;20:532–543.
55. Mazumder B, Seshadri V, & Fox PL. Translational control by the 3’-UTR: the ends specify the means. Trends Biochem Sci 2003;28:91–98.
56. Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, & Pandolfi PP. A coding-independent function of gene and pseudogene mRNAs regulates tumor biology. Nature 2010;465:1033–1038.
57. Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, & Cox NJ. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 2010;6(4):e1000888.
58. Schadt EE. Exploiting naturally occurring DNA variation and molecular profiling data to dissect disease and drug response traits. Curr Opin Biotechnol 2005;16(6):647–654.
59. Cookson W, Liang L, Abecasis G, Moffatt M, & Lathrop M. Mapping complex disease traits with global gene expression. Nat Rev Genet 2009;10(3):184–194.
60. Myers AJ, Gibbs JR, Webster JA, Rohrer KC, Zhao AS, Marlowe L, . . . Nath P, et al. A survey of genetic human cortical gene expression. Nat Genet 2007;39(12):1494–1499.
61. Webster JA, Gibbs JR, Clarke J, Ray M, Zhang W, Holmans P, . . . Myers AJ. Genetic control of human brain transcript expression in Alzheimer’s disease. Am J Hum Genet 2009;84(4):445–458.
62. Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, Edwards S, . . . Schadt EE. Genetic inheritance of gene expression in human cell lines. Am J Hum Genet 2004;75(6):1094–1105.
63. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, & Cheung VG. Genetic analysis of genome-wide variation in human gene expression. Nature 2004;430(7001):743–747.
64. Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, & Burdick JT. Mapping determinants of human gene expression by regional and genome-wide association. Nature 2005;437(7063):1365–1369.
65. Dixon AL, Liang L, Moffatt MF, Chen W, Heath S, Wong KC, . . . et al. A genome-wide association study of global gene expression. Nat Genet 2007;39(10):1202–1207.
66. Göring HH, Curran JE, Johnson MP, Dyer TD, Charlesworth J, Cole SA, . . . et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet 2007;39(10):1208–1216.
67. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, . . . et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 2007;448(7152):470–473.
68. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, . . . et al. Genetics of gene expression and its effect on disease. Nature 2008;452(7186):423–428.
69. Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, . . . et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol 2008;6(5):e107.
70. Kleinman JE, Law AJ, Lipska BK, Hyde TM, Ellis JK, Harrison PJ, & Weinberger DR. Genetic neuropathology of schizophrenia: new approaches to an old question and new uses for postmortem human brains. Biol Psychiatry 2011;69(2):140–145.
71. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, . . . Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 2010;464(7289):768–772.
72. Lalonde E, Ha KC, Wang Z, Bemmo A, Kleinman CL, Kwan T, . . . Majewski J. RNA sequencing reveals the role of splicing polymorphisms in regulating human gene expression. Genome Res 2011;21(4):545–554.
73. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, . . . Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature 2008;456(7221):470–476.
74. Kwan T, Benovoy D, Dias C, Gurd S, Serre D, Zuzan H, . . . Majewski J. Heritability of alternative splicing in the human genome. Genome Res 2007;17(8):1210–1218.
75. Kwan T, Benovoy D, Dias C, Gurd S, Provencher C, Beaulieu P, . . . Majewski J. Genome-wide analysis of transcript isoform variation in humans. Nat Genet 2008;40(2):225–231.
76. Rosenfeld JA, Wang Z, Schones D, Zhao K, DeSalle R, & Zhang MQ. Determination of enriched histone modifications in non-genic portions of the human genome. BMC Genomics 2009;10:143.
77. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, . . . Ecker JR. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–322.
78. Choy MK, Movassagh M, Goh HG, Bennett M, Down T, & Foo R. Genome-wide conserved consensus transcription factor binding motifs are hyper-methylated. BMC Genomics 2010;11:519.
79. Hung MS, & Shen CKJ. Eukaryotic methyl-CpG-binding domain proteins and chromatin modification. Eukaryotic Cell 2003;2(5):841.
80. Ehrlich M, Gama-Sosa MA, Huang LH, Midgett RM, Kuo KC, McCune RA, & Gehrke C. Amount and distribution of 5-methylcytosine in human DNA from different types of tissues of cells. Nucleic Acids Res 1982;10(8):2709–2721.
81. Tucker KL. Methylated cytosine and the brain: a new base for neuroscience. Neuron 2001;30(3):649–652.
82. Jones PA, & Baylin SB. The epigenomics of cancer. Cell 2007;128:683–692.
83. Murgatroyd C, Patchev AV, Wu Y, Micale V, Bockmühl Y, Fischer D, . . . Spengler D. Dynamic DNA methylation programs persistent adverse effects of early-life stress. Nat Neurosci 2009;12:1559–1566.
84. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431(7011):931–945.
85. Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, . . . Guigó R. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2012;22(9):1775–1789.
86. Mercer TR, Dinger ME, & Mattick JS. Long non-coding RNAs: insights into functions. Nat Rev Genet 2009;10:155–159.
87. He Y, Vogelstein B, Velculescu VE, Papadopoulos N, & Kinzler KW. The antisense transcriptomes of human cells. Science 2008;322:1855–1857.

88.

89.

90.

91.

92.

93.

94.

95.

96.

97.

98.

99.

100.

101.

102.

Hirota K, Miyoshi T, Kazuto Kugou K, Hoffman CS, Shibata T, & Ohta K. Stepwise chromatin remodelling by a cascade of transcription initiation of non-coding RNAs. Nature 2008;456:130–134. Ankö ML, & Neugebauer KM. Long noncoding RNAs add another layer to Pre-mRNA splicing regulation. Mol Cell 2010;39(6):833–834. Szymański M, Barciszewska MZ, Zywicki M, & Barciszewski J. Noncoding RNA transcripts. J Appl Genet 2003;44(1):1–19. Su WY, Xiong H, & Fang JY. Natural antisense transcripts regulate gene expression in an epigenetic manner. Biochem Biophys Res Commun 2010;396:177–181. Lavorgna G, Dahary D, Lehner B, Sorek R, Sanderson CM, & Casari G. In search of antisense. Trends Biochem Sci 2004;29(2):88–94. Prescott EM, & Proudfoot NJ. Transcriptional collision between convergent genes in budding yeast. Proc Natl Acad Sci U S A 2002;99:8796–8801. Cesana M, Cacchiarelli D, Legnini I, Santini T, Sthandier O, Chinappi M, . . . Bozzoni I. (2011) A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA. Cell 2011;147:358–369. Lewis EB. The theory and application of a new method of detecting chromosomal rearrangements in Drosophila melanogaster. Am Nat 1954;88(841): 225–239. Strahl B, & Allis C. The language of covalent histone modifications. Nature 2000;403(6765):41–45. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, . . . Zhao K. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet 2008;40(7): 897–903. Santos-Rosa H, Schneider R, Bannister AJ, Sherriff J, Bernstein BE, Emre NCT, . . . Kouzarides T. Active genes are tri-methylated at K4 of histone H3. Nature 2002;419(6905):407–411. Vermaak D, Ahmad K, & Henikoff S. Maintenance of chromatin states:  an open-and-shut case. Curr Opin Cell Biol 2003;15(3):266–274. Lachner M, & Jenuwein T. The many faces of histone lysine methylation. Curr Opin Cell Biol ;14(3):286–298. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, . . . 
Zhao K. High-resolution profiling of histone methylations in the human genome. Cell 2007;129(4):823–837. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.

The Genetics of Gene Expression 103. Bolzer A, Kreth G, Solovei I, Koehler D, Saracoglu K, Fauth C, . . . Cremer T. Three-dimensional maps of all chromosomes in human male fibroblast nuclei and prometaphase rosettes. PLoS Biol 2005;3(5):e157. 104. Xu N, Tsai C L, & Lee JT. Transient homologous chromosome pairing marks the onset of X inactivation. Science 2006;311(5764):1149–1152. 105. Bacher CP, Guggiari M, Brors B, Augui S, Clerc P, Avner P, . . . Heard E. Transient colocalization of X-inactivation centres accompanies the initiation of X inactivation. Nat Cell Biol 2006;8(3):293–299. 106. Chambeyron S, & Bickmore WA. Chromatin decondensation and nuclear reorganization of the HoxB locus upon induction of transcription. Genes Dev 2004;18(10):1119–1130. 107. Osborne CS, Chakalova L, Brown KE, Carter D, Horton A, Debrand E, . . . Fraser P. Active genes dynamically colocalize to shared sites of ongoing transcription. Nat Genet 2004;36(10):1065–1071. 108. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, . . . Dekker J. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326(5950):289–293. 109. Lee RC, Feinbaum RL, & Ambros V. The C.  elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 1993;75(5):843–854. 110. Lau NC, Lim LP, Weinstein EG, & Bartel DP. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 2001;294(5543):858–862. 111. John B, Enright AJ, Aravin A, Tuschl T, Sander C, & Marks DS. Human microRNA targets. PLoS Biol 2004;2(11):e363.

151

112. Valencia-Sanchez MA, Liu J, Hannon GJ, &Parker R. Control of translation and mRNA degradation by miRNAs and siRNAs. Genes Dev 2006;20(5):515–524. 113. Yekta S, Shih IH, &Bartel DP. MicroRNAdirected cleavage of HOXB8 mRNA. Science 2004;304:594–596. 114. Li Z, Van Calcar S, Qu C, Cavenee W, Zhang M, & Ren B. A global transcriptional regulatory role for c-Myc in Burkitt’s lymphoma cells. Proc Natl Acad Sci USA 2000;14:8164–8169. 115. Vaquerizas JM, Kummerfeld SK, Teichmann SA, & Luscombe NM. A census of human transcription factors: function, expression and evolution. Nat Rev Genet 2009;10(4):252–563. 116. Gill G. Regulation of the initiation of eukaryotic transcription. Essays Biochem 2001;37:33–43. 117. Khan A, Shover W, & Goodliffe JM. Su(z)2 antagonizes auto-repression of Myc in Drosophila, increasing Myc levels and subsequent trans-activation. PLoS One 2009;4(3):e5076. 118. Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, & Spielman RS. Polymorphic cis- and trans-regulation of human gene expression. PLoS Biol 2010;8(9). 119. Wittkopp PJ, & Haerum BK, Clark AG. Evolutionary changes in cis and trans gene regulation. Nature 2004;430(6995):85–88. 120. Taft RJ, Pheasant M, & Mattick JS. The relationship between non-protein-coding DNA and eukaryotic complexity. Bioessays 2007;29(3):288–299. 121. Lander ES. Initial impact of the sequencing of the human genome. Nature 2011; 470(7333):187–197. 122. Geschwind DH, & Konopka G. Neuroscience in the era of functional genomics and systems biology. Nature 2009;461(7266):908–915.

PART III PROTEIN

9 Proteomics

JONATHAN C. TRINIDAD, RALF SCHOEPFER, AND A. L. BURLINGAME

Neurons are cells with highly specialized subcellular compartments. They are at the core of information processing and storage in the central nervous system. Signaling events can occur within a synaptic compartment (on the order of a micron) or involve retrograde transport to the nucleus (sometimes on the order of a meter), and they range in timescale from milliseconds to decades. Multiple neuronal subtypes exist, each of which is endowed with a distinct set of signaling pathways. These processes have been investigated most extensively at electrochemical synapses and axons. Pathways within a compartment are highly interconnected, which provides some degree of redundancy with respect to the biological endpoint of a given pathway. Thus far neuroproteomic approaches have contributed significantly to defining the protein compositions of protein complexes, machines, and subcellular entities and to demonstrating how posttranslational processes modulate their dynamics. In doing so, these approaches have revealed unanticipated levels of complexity at the protein level. In practice, detailed molecular information is obtained that continues to provide a comprehensive and reliable quantitative determination of the constituents of specific subcellular compartments and how they vary in response to particular perturbations of the system. An additional major challenge is the identification of site-specific posttranslational modifications (PTMs) in the brain and of how they are modulated, with particular emphasis on how these PTMs mediate processes that are unique to the nervous system. Recent progress in proteomic investigations has been driven by the commercial development of increasingly powerful instrumentation able to characterize complex biological samples with higher sensitivity and mass measurement accuracy. Despite the lack of amplification technologies such as PCR for protein analyses, continued development of ever more powerful instrumentation will enable more sensitive and comprehensive analytical strategies to emerge and will thus remain central to proteomic investigations. Currently proteomics is undergoing a paradigm shift from qualitative studies to quantitative measurements that will facilitate the investigation of neuronal system dynamics. The future will continue to see the development of improved sample enrichment strategies that allow global investigation of the comparative compositions and dynamics of several posttranslational modifications simultaneously in a single experiment. A recent example is the integrated investigation of both phosphorylation and O-GlcNAcylation of the protein machine of synaptosomes, that is, within the same biological organelle.

BACKGROUND AND CHALLENGES

Scope of Proteomics: Molecules

By analogy to genomics, proteomics is the large-scale analysis of proteins expressed in a given biological system.1,2 Current proteomic studies can be categorized into three broad areas: qualitative and quantitative analysis of protein expression, either at the whole organ/cell level or within subcellular structures and organelles; analysis of the protein posttranslational modification landscape; and investigations of protein-protein interactions and the architecture of protein complexes.


Several morphologically distinct structures exist that are restricted to the nervous system or to a limited number of additional cell types. These include chemical and electrical synapses, dendritic and axonal arbors, and the myelin sheath surrounding axons. Other organelles, such as mitochondria, are found throughout the body, but impairment of their function will play a role in nervous system diseases as well.3,4 Recent advances in mass spectrometry-based proteomics have revealed that posttranslational modifications (PTMs) of proteins are far more widespread than previously thought. Proteomics is uniquely suited to detect and delineate the wide range of site-specific PTMs that exist, since such species cannot be characterized structurally by genomic and transcriptomic methodologies. Investigations aimed at a comprehensive knowledge of interactions between proteins, both pairwise and in the context of protein assemblies, will benefit from the rigorous application of mass spectrometry-based strategies. Mass spectrometry has the attractive advantage of being able to identify in vivo interactors of purified targets without prior assumptions about the identity of yet-to-be-discovered participant proteins (as would be necessary using Western blot approaches).

Scope of Proteomics: Neuroproteomics

There are numerous areas of investigation for which neuroproteomics will continue to be particularly informative. These include investigating the molecular underpinnings of plasticity in the nervous system; providing new neurophysiological insights from studies of the molecular and PTM-level changes induced by cell damage and repair; revealing the stratification of protein-level defects that define disease-associated changes in the nervous system; gaining new knowledge of how pharmacological manipulations affect the brain; and investigating distinctions in signaling and signal transport processes between the peripheral and central nervous systems. Other areas include defining and profiling changes in the composition of synapses in general as well as defining the molecular composition of synapses unique to individual neuronal subtypes. The neuronal subtypes known at present are mostly based upon morphological and physiological characteristics; further delineation of these and new subtypes will require expression profiling of RNAs and differential knowledge of their proteomes, with particular emphasis on synaptic compartments and components.5

It is known that only a small subset of brain diseases results from single gene mutations.6,7 More commonly, prevalent diseases such as schizophrenia, bipolar disorders, and autism spectrum disorders have an etiology involving the interaction of an unknown number of genes and their products. It is an open question whether just one or a few dysfunctional pathways could converge to eventually define an individual disease phenotype. In this context, the potential strength of proteomics lies in its ability to define and decipher the molecular machinery and networks involved in either a particular signaling pathway or the entire network. The primary targets of most current neuroactive drugs are known. However, what often remains unclear is how drug-induced alterations in target function result in the broader molecular changes that affect biological activity and phenotype.

There are three areas of brain research in which mass spectrometry is heavily involved but that are not the focus of this review: studies of neuropeptide hormones, natural venom-containing neurotoxins (channel blockers), and tissue surface imaging mass spectrometry. For reviews on the use of mass spectrometry for the study of neuropeptides, refer to references 8 and 9; for neurotoxins, refer to references 10 and 11. Imaging mass spectrometry involves the direct analysis of intact proteins, lipids, and metabolites from tissue sections. For a review of this technique and how it can be applied (for example, to characterize tumor pathology), see reference 12. In addition, proteomic investigations of blood plasma biomarkers are covered in Chapter 10.

Separation and Identification of Complex Mixtures of Proteins: The Emergence of Proteomics

Proteomic analysis requires that an individual protein or protein fragment be sufficiently purified or separated from other components to allow its precise identification. Since its invention in the mid-1970s, two-dimensional gel electrophoresis has provided a means to separate proteins in complex mixtures such as cell lysates.13,14 More than 1,500 spots can be resolved in a 2D experiment. Separation in the first dimension is based on isoelectric focusing, which has specific requirements for protein solubility. These requirements preclude the analysis of membrane-bound proteins, a class of proteins of particular interest to the neuroscience community. While the advent of spot visualization via fluorescent labeling of the sample has increased the range and improved relative quantification of spots, the relatively limited dynamic range of gel systems remains a major limitation. Both chemical and posttranslational modification of any given protein can shift the position of the protein spot in an unpredictable fashion, thereby adding to the limitations of 2D gel systems. Prior to the development of suitable methods of tandem mass spectrometry, chemical Edman degradation was the primary means of protein identification.15 Antibodies are an alternative means of spot identification but may suffer from uncertain specificity, limited dynamic range, and limited availability. Common to all spot-identification strategies is the problem that one visually identified spot may contain more than one protein.

Modern proteomic workflows (Figure 9.1) handle these problems much better. They take advantage of the high resolution of LC-based separation of peptides derived from proteolytic processing of protein samples and the high mass accuracy of mass spectrometers. Fragmentation of individual components in the mass spectrometer yields MS/MS spectra, which enable identification of the peptide sequence plus precise site determination of any PTMs that may be present on the peptide. These workflows are “universal” from the digest (pool of peptides) onward and are amenable to automation. The sample may be as complex as whole tissue lysate. More commonly, however, enriched or purified subcellular compartments, such as the postsynaptic density (PSD), or molecular complexes, such as the AMPA-R with its associated proteins, are analyzed, since they are more tractable and thus tend to be more informative (see the section headed General Issues With Respect to Proteomics of Complex Samples, further on, regarding complexity of the sample). These initial purification steps will necessarily be tailored to the individual problem and may be crucial for determining the overall quality and scientific significance of the proteomic investigation.

Mass Spectrometry in the Age of Genomics and Transcriptomics

The advances in DNA sequence databases over the last 10 years have established mass spectrometry as the workhorse driving current proteomics. For large-scale experiments, interpretation of the raw mass spectra relies on interrogation of databases of protein sequences deduced from actual or predicted transcripts. In this respect, spectral analysis is not error tolerant. Identification of a given protein is in general robust, since it can be identified from any one of the many peptides resulting from proteolytic digestion (see also the section headed Poor Repetitive Identification of the Same Peptide, further on). Allelic variation, however, is not explicitly captured in the current databases, which precludes matching an MS/MS spectrum to the resulting variant peptide sequence. This immediately highlights the crucial importance of correct database entries and limits the ability to employ databases from related species when the target organism’s proteome has not been determined from genomic analysis. With particular respect to studies on humans, information regarding allelic variation in protein sequences is not generally incorporated into proteomic analysis but will need attention in light of the increasing amount of DNA sequence information. Equally, splice variants are not yet taken into account in most large-scale proteomic investigations, although the ability to do so has been demonstrated.16

CURRENT APPLICATIONS OF NEUROPROTEOMICS

General Issues With Respect to Proteomics of Complex Samples

Protein abundance spans up to six orders of magnitude within cells17 and potentially eight orders of magnitude within a tissue owing to cell-type heterogeneity. Physiologically relevant levels of site-specific PTM occupancies and their stoichiometries may span an additional two orders of magnitude.
This very wide range represents a significant experimental challenge not only for mass spectrometric analysis but for all types of studies on proteins from complex samples. Indeed, the most commonly used discovery-type LC-MS/MS analyses are biased toward detection of higher-abundance components; therefore analysis of less abundant proteins or signaling components requires special attention. One obvious way to reduce these problems is to limit the complexity of any given sample destined for mass spectral characterization (see Figure 9.1).

[Figure 9.1 diagram: Biological sample (tissue, cells, etc.) → (A) cell type-specific isolation, subcellular fractionation, and/or affinity purification of protein complex → Sample → (B) reduce, alkylate, digest → pool of digested peptides → (C) LC-MS/MS: basic workflow; (D) orthogonal LC, then multiple LC-MS/MS: in-depth sample characterization; (E) affinity enrichment for PTMs, then LC-MS/MS: analysis of modified peptides.]

FIGURE 9.1: Illustration of a proteomic workflow. The initial biological material can be prepared from a variety of sources, such as whole brain or cultured neurons. A: Samples can be analyzed in this state but are typically subjected to a variety of enrichment steps, including isolation of specific neuronal subtypes; subcellular fractionation (e.g., using sucrose density gradients); and affinity purification (one or two rounds, the latter often as TAP) to isolate protein complexes. B: To facilitate proteolytic digestion of the resulting proteins, disulfide bonds are reduced and the resulting free sulfhydryl groups are alkylated. Proteins are then digested to peptides using a protease such as trypsin (Table 9.1). This digest or pool of peptides can then be analyzed in one of three ways. C: A direct analysis via LC-MS/MS will identify the most abundant peptides. D: Multidimensional fractionation typically generates 5 to 30 fractions, which are then sequentially analyzed via LC-MS/MS. Such an approach is much more effective at identifying lower-level components in complex mixtures. E: Posttranslationally modified peptides are identified at very low rates when samples are analyzed that have not undergone enrichment for the PTM of interest. Therefore a specific affinity enrichment step for the modification of interest must be utilized prior to single or multidimensional LC-MS/MS. The PTM-enriched fraction can be analyzed directly by LC-MS/MS or can itself be subjected to additional multidimensional fractionation.

In a discovery experiment, peptides resulting from a proteolytic digestion are separated by reverse phase liquid chromatography that is coupled directly to the mass spectrometer. In such a configuration, a mass spectrometer can successfully sequence several thousand unique peptides in a two-hour capillary UPLC-MS/MS analysis. However, a proteolytic digestion of a whole cell lysate will generate over 500,000 unique peptides (which does not even take into account posttranslational modifications, different cell types within a tissue, or the fact that digestion of the sample does not proceed to completion). To address this sheer complexity, some kind of fractionation prior to the UPLC-MS/MS step is required. Thus multidimensional chromatography is employed.18 In this approach, an additional chromatographic separation is used, either “online” (i.e., coupled directly to the initial reverse phase separation) or “offline.” This expands the dynamic range and sequence coverage, allowing identification of on the order of 100,000 peptides. Nevertheless, components from the proteins of lowest relative abundance, although they may still be detected, are sampled less efficiently. Perhaps more importantly, PTM-modified peptides are usually substoichiometric in site occupancy and hence are generally not identified in large or representative numbers using such approaches. Therefore targeted PTM-specific enrichment strategies need to be adopted in order to optimize global identification of any particular class of PTM-modified peptides (Figure 9.1).

In recent years, studies employing mass spectrometry have revealed a degree of PTM complexity not previously anticipated by the biological research community. It is now appreciated that many proteins can occur in multiple modified versions. Assuming that these PTMs can occur in all possible combinations on a given protein molecule, the number of distinct protein species that may exist for a given transcribed mRNA can number in the hundreds to thousands.19 It will remain a major challenge to determine the physiological consequences of distinct PTM patterns and the degree to which different PTMs may cross talk with each other on individual proteins to yield a suite of isoforms of the same protein bearing distinct functions. The characterized human phosphoproteome continues to grow, and there are currently over 100,000 sites curated at PhosphoSite (www.phosphosite.org). It remains to be established whether all of these sites are physiologically relevant or whether many of them result from off-target basal kinase activity.
Experimental manipulation of individual sites or groups of sites followed by analysis of the resulting phenotype is currently the gold standard to verify physiological relevance. However, this will be possible only for a limited subset of sites. Bioinformatic analyses of large-scale PTM studies have begun to provide a broad understanding of the principles governing PTM regulation and possible relationships between distinct PTM pathways.20,21
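The estimate of hundreds to thousands of protein species per mRNA follows from simple counting. The sketch below is an illustration only: it assumes that every modifiable site varies independently of the others, and the function name and the example site counts are hypothetical, not taken from the cited studies.

```python
import math


def count_modified_forms(states_per_site):
    """Count distinct modified forms of one protein.

    states_per_site: list of ints, the number of states each site can adopt
    (2 for a site that is either unmodified or carries a single PTM).
    Assumes the sites are modified independently of one another.
    """
    return math.prod(states_per_site)


# Hypothetical protein with 10 sites that are each either modified or not.
print(count_modified_forms([2] * 10))

# Hypothetical protein with 5 binary sites and 2 sites with three states each.
print(count_modified_forms([2] * 5 + [3] * 2))
```

Under these assumptions only eight to ten modifiable sites are needed to reach the range quoted above, which is why site-by-site functional validation is feasible for only a limited subset.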


Current Mass Spectrometry Workflows and Caveats

Current mass spectrometry approaches involve the analysis of peptides generated from the tryptic digestion of proteins or protein mixtures. The existence of specific proteins is inferred from the identification of individual peptides or sets of peptides mapping to a given protein. Peptides longer than six amino acid residues will in general map to unique gene entries in a given database, except for stretches of fully conserved sequence within gene families or protein domains. From an experimental point of view, it is more tractable for the mass spectrometer to sequence small polypeptides (between 6 and 30 amino acid residues), whose stable isotopic profiles are constrained, than intact proteins. However, in particular cases it is highly desirable to measure the molecular weight profile of an intact protein and even its entire sequence scaffold (via electron capture dissociation [ECD] or electron transfer dissociation [ETD]), especially for a highly posttranslationally modified protein, to establish that the sequences and PTM assignments obtained from a detailed digest map correctly onto the intact protein isoforms in question.22 The analysis of mixtures of intact protein molecules has made substantial progress in recent years but still lags seriously behind the corresponding analysis of digests. For a review, see reference 23.

Prior to mass spectrometric analysis, protein samples are generally denatured, their disulfide bonds are reduced and alkylated, and they are then proteolytically digested (Figure 9.1). The protease most widely used is trypsin, which cleaves at the carboxy-terminal side of arginine and lysine residues. Other common proteases include chymotrypsin, AspN, LysC, LysN, and pepsin (see Table 9.1). Owing to a variety of factors, including length and hydrophobicity, not all of the resulting peptides will be easily detected by mass spectrometry.
This is not an issue where the focus is at the level of the overall protein, but it may be a problem when the PTM status of a particular residue is of interest, as the site in question may not be covered by the specific set of peptides detected. As alluded to above, a further limitation of digestion lies in the unlinking of putatively functionally coupled PTMs occurring on a single protein molecule.24 In the case of quantitative MS analysis, incomplete digestion is a potential source of error (see reference 24); it is possible that neighboring PTM-modified residues may impair proteolytic cleavage efficiency.

TABLE 9.1. A LIST OF COMMON PROTEOLYTIC ENZYMES AND THEIR CLEAVAGE SPECIFICITY

Proteomic analysis typically uses trypsin as the preferred protease. Because it cleaves carboxy-terminal to arginine and lysine residues, the resulting peptides have a positive charge at this end. This generally results in a favorable series of sequence ions during MS/MS. Many alternative proteases exist and are useful, particularly in light of the fact that not all tryptic peptides possess optimal biochemical properties for mass spectrometric analysis.

Enzyme          Cleavage Specificity
Arg-C           C-terminal to R
Asp-N           N-terminal to D
Chymotrypsin    C-terminal to FYW and to a lesser extent ML (not before P)
LysC            C-terminal to K
Pepsin          Broad specificity, somewhat protein dependent
Trypsin         C-terminal to K and R (not before P)
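The trypsin rule in Table 9.1 is simple enough to apply in silico. The following is a minimal sketch, not the digestion model of any particular search engine: it assumes complete digestion with no missed cleavages, and the function name, the length window (taken from the 6 to 30 residue range mentioned in the text), and the example sequence are illustrative assumptions.

```python
def tryptic_digest(sequence, min_len=6, max_len=30):
    """Cleave C-terminal to K or R, except when the next residue is P.

    Returns only peptides whose length falls within [min_len, max_len],
    the range described in the text as tractable for sequencing.
    Assumes complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide, which need not end in K or R.
        peptides.append(sequence[start:])
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical sequence: the K followed by P is not cleaved,
# and the two-residue peptide "GR" is discarded as too short.
print(tryptic_digest("GRACDEFGKPLMNRSTVWYAAK"))
```

The length filter makes the coverage problem concrete: any modification site that falls on a discarded peptide is invisible to the analysis, which is one reason alternative proteases are useful.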

Matching MS/MS Spectra to Peptide Sequence

MS/MS spectra acquired by the mass spectrometer are “interpreted” using database sequence search algorithms. These algorithms require specification of multiple parameters, including the organism; a list of potential PTMs and experimentally introduced modifications; and the precursor and MS/MS fragment mass measurement accuracy. The database search algorithm then creates theoretical fragmentation spectra for all peptides that could arise from proteins in the database as a result of the specified proteolysis. Each acquired MS/MS spectrum is then compared with the theoretical spectra of those database peptides whose mass matches the measured precursor mass within the given mass accuracy. For a review, see reference 25. Publicly available search algorithms exist, such as Protein Prospector (prospector.ucsf.edu) and Mascot (matrixscience.com), which allow the investigator to interpret lists of MS/MS data.
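The first step of such a search, selecting candidate peptides whose mass agrees with the measured precursor, can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of Protein Prospector or Mascot: it uses standard monoisotopic residue masses, considers only unmodified peptides, and omits fragment-ion scoring entirely. The function names and the 10 ppm default tolerance are hypothetical choices.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum
PROTON = 1.007276   # mass of the charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def precursor_candidates(observed_mz, charge, peptides, tol_ppm=10.0):
    """Return (peptide, ppm_error) pairs within tol_ppm of the precursor."""
    observed_mass = (observed_mz - PROTON) * charge
    hits = []
    for pep in peptides:
        theoretical = peptide_mass(pep)
        ppm_error = (observed_mass - theoretical) / theoretical * 1e6
        if abs(ppm_error) <= tol_ppm:
            hits.append((pep, ppm_error))
    return hits


# Hypothetical doubly charged precursor checked against two candidate peptides.
print(precursor_candidates(400.68725, 2, ["PEPTIDE", "PEPTIDEK"]))
```

Note that leucine and isoleucine have identical masses, so a precursor filter of this kind cannot distinguish peptides that differ only in those residues.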

In many respects, proteomics is still a rapidly developing field. As such, there remain a number of issues with respect to data analysis and interpretation for which the field has not settled on a consensus solution. The first issue deals with the application of appropriate criteria to assess false peptide identifications, particularly in large-scale analyses. At present this is best carried out by searching a database that contains proteins from the species of interest as well as a size-matched decoy set (containing protein sequences that are randomized or reversed versions of the proteins of interest). A second issue deals with PTM site assignments within a given peptide sequence. For example, search engines can identify that a given peptide is phosphorylated and provide a reliability measure of this assignment. While they will identify which residue is most likely the site of modification, most of them will not calculate a probability that this is the case. Finally, as most MS/MS data interpretation algorithms rely on a protein (or translated DNA) database from which to find potential matching sequences, any issues regarding accuracy or completeness of the database will impact negatively on the proteomic search results.
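The target-decoy idea described above can be sketched in a few lines. The example assumes reversed sequences as decoys and uses one simple, commonly used estimator (decoy matches divided by target matches above a score threshold); other estimators exist, and the function names, the "DECOY_" prefix, and the scores are hypothetical.

```python
def reversed_decoys(proteins):
    """Build a size-matched decoy set by reversing each target sequence.

    proteins: dict mapping protein name to amino acid sequence.
    """
    return {"DECOY_" + name: seq[::-1] for name, seq in proteins.items()}


def estimate_fdr(matches, threshold):
    """Estimate the false discovery rate at a given score threshold.

    matches: list of (score, is_decoy) tuples, one per peptide-spectrum match
    from a search against the combined target and decoy database.
    Assumes false matches hit targets and decoys equally often, so the
    number of decoy matches approximates the number of false target matches.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical search results: (score, matched a decoy sequence?)
matches = [(50, False), (45, False), (40, True),
           (35, False), (30, False), (20, True)]
print(estimate_fdr(matches, threshold=30))
print(estimate_fdr(matches, threshold=45))
```

In practice the threshold is raised until the estimated rate falls below an acceptable level, trading the number of identifications against their reliability.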

Quantitative Proteomic Strategies Quantification using mass spectrometry is not as straightforward as with other analytical techniques because the signal observed in the MS is not solely a function of analyte concentration. The presence of additional compounds can create a competitive ionization situation during electrospray and/or result in space-charging effects in the mass analyzer, thereby affecting relative signal intensities. Nevertheless, the relative peak areas of individual peptides or the integrated signal from all peptides from a given protein can be used as a rough indication of comparative abundances of proteins across sample replicates ( “label-free quantification”).26 Alternatively, stable isotopically labeled analogs using carbon, nitrogen, oxygen, and hydrogen can be utilized for more accurate quantification. For a review of quantitative approaches to MS, see reference 27. Standards can be prepared that are isotopically labeled with heavy or light versions of these atoms using metabolic, chemical, or enzymatic techniques. Subsequent mixing and simultaneous analysis in the MS enables

Proteomics relative quantification of protein abundances between samples. Metabolic incorporation of isotopes is most commonly carried out in cell culture using heavy and light analogs of arginine and lysine. Digestion with trypsin, which cleaves after these two residues, yields a sample in which essentially all peptides may be isotopically labeled. By growing cells over five or six population doublings in isotopic media, it is possible to obtain over 95% incorporation of isotopic arginine and lysine. Primary neuronal cultures, which are non-dividing, represent a challenge for this approach with respect to degree of isotopic incorporation. In this case, isotopic incorporation results only from protein turnover rather than new protein synthesis related to cell division. Nevertheless, neurons cultured  for two weeks will have labeled amino acid incorporation levels of over 80%; this approach has been used to investigate BDNF signaling.28 Mathematical models can also be used to correct for incomplete labeling of proteins.29 For in vivo isotopic labeling, see the following section. A variety of chemical labeling approaches exist, which typically utilize reactive functional groups on peptides, such as primary amines (on lysines or peptide amino termini), cysteine side chains, or carboxyl functions. The isotopically encoded chemical labeling reagents—iTRAQ and TMT—enable multiplexing of samples so that between four and eight samples can be quantified in a single MS experiment.30,31 These reagents have been used to quantitatively profile differences in synaptic components as a function of brain region32 and relative PSD composition following synaptic stimulation,33 compare the proteome of glutamatergic versus GABAergic synaptic vesicles,34 and investigate axonal retrograde transport in response to nerve injury.35 One caveat with the use of iTRAQ and TMT is that the observed changes in peptide levels are generally less than the actual changes in peptide levels. 
This occurs in part because other peptides with the same LC elution time and a similar m/z value are co-isolated with the target peptide and contribute to the background signal. The effect can be reduced by narrowing the precursor isolation window but otherwise increases with sample complexity.36
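Two of the quantitative points above lend themselves to a short numerical illustration. The following Python sketch is not taken from the cited studies; it assumes (a) that label incorporation in dividing cells arises purely from dilution of pre-existing unlabeled protein at each doubling, with no turnover and no residual light amino acid, and (b) that co-isolated peptides add the same background signal to both reporter channels. The function names are hypothetical.

```python
def labeled_fraction(doublings: float) -> float:
    """Fraction of protein carrying the heavy label after a number of
    population doublings, assuming dilution by growth only (no turnover)."""
    return 1.0 - 0.5 ** doublings


def observed_ratio(true_ratio: float, interference: float) -> float:
    """Reporter-ion ratio when an equal background signal is added to both
    channels. `interference` is the background expressed relative to the
    signal of the reference (denominator) channel. Simplifying assumption."""
    return (true_ratio + interference) / (1.0 + interference)


if __name__ == "__main__":
    for n in (1, 3, 5, 6):
        print(f"{n} doublings: {labeled_fraction(n):.3f} labeled")
    for b in (0.0, 0.25, 1.0):
        print(f"background {b}: true 4.0 observed as {observed_ratio(4.0, b):.2f}")
```

Under these assumptions, five doublings already give just under 97% labeling, in line with the figure of over 95% quoted above, and a true fourfold change is reported as 2.5-fold when the background equals the reference signal.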

Protein Turnover in Whole Brain Lysates

The entire proteome of mice or rats has been isotopically labeled in vivo using two similar strategies. The first is by feeding rodents with a food source in which all the nitrogen is 15N.37 The mass shift observed for a given peptide will depend on the number of nitrogen atoms it contains. This approach has been termed SILAM, for stable isotope labeling in mammals, and has been used to quantify changes in synaptosome proteins during development. Of the 1,138 proteins that were quantified, 106 were developmentally regulated. This approach also allows for comparisons of proteins and PTMs between organs or disease states.38 A similar strategy has been employed to determine the turnover half-lives of some 1,700 proteins in brain and liver tissues by measuring the rate of 15N incorporation into different tissues over the course of a month.39 Development of a data processing pipeline was required to process the mass spectrometry data and obtain the turnover rate constants.40 The second labeling approach uses a diet containing an isotopic version of lysine; this diet results in full incorporation of labeled lysine by the F2 generation.41 Breeding isotopically labeled rodents is a significant effort that may not be appropriate for all laboratories. An alternative approach employs a neuronal cell line grown in SILAC media to generate an isotopic reference sample.42 Normalizing to the signal of cultured cells (which can be completely labeled with SILAC media) allows comparisons between samples of interest. Further refinement of these approaches will enable their use in a range of applications, including studies of neuronal regeneration and disease, and will facilitate quantitative analysis of postmortem human brain.43
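The dependence of the 15N mass shift on peptide composition, and the link between label incorporation and half-life, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the data processing pipeline of reference 40: peptides are taken to be unmodified and built from the 20 standard amino acids, and turnover is modeled as a single first-order process that ignores the delay in enrichment of the precursor amino acid pool. All function names are hypothetical.

```python
import math

# Side-chain nitrogens beyond the single backbone nitrogen of each residue.
EXTRA_SIDE_CHAIN_N = {"R": 3, "H": 2, "K": 1, "N": 1, "Q": 1, "W": 1}
DELTA_15N = 0.997035  # mass difference between 15N and 14N, in Da


def nitrogen_count(peptide: str) -> int:
    """Number of nitrogen atoms in an unmodified peptide."""
    return sum(1 + EXTRA_SIDE_CHAIN_N.get(aa, 0) for aa in peptide)


def mass_shift_15n(peptide: str, enrichment: float = 1.0) -> float:
    """Average mass shift (Da) at a given fractional 15N enrichment."""
    return nitrogen_count(peptide) * DELTA_15N * enrichment


def half_life_days(fraction_labeled: float, days: float) -> float:
    """Half-life from one time point, assuming f(t) = 1 - exp(-k t)."""
    k = -math.log(1.0 - fraction_labeled) / days
    return math.log(2.0) / k


if __name__ == "__main__":
    for pep in ("AAVVTSPPPTTAPHK", "EHALAQAELLKR"):
        print(pep, nitrogen_count(pep), round(mass_shift_15n(pep), 3))
    print(round(half_life_days(0.75, 20.0), 2))
```

Two peptides of similar length can therefore shift by different amounts, which is why 15N data cannot be analyzed with a fixed heavy/light mass offset as in lysine- or arginine-based labeling.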

Analysis of Subcellular Components

Synaptic Analyses

Proteomic characterization of synaptic compartments typically relies on biochemical purification of synaptosomes or postsynaptic densities. These compartments are enriched from whole brain (or dissected brain region) using discontinuous sucrose density gradient centrifugation. A key issue with such approaches and all other enrichment strategies is the degree to which proteins identified in these preparations are bona fide components. These problems with adventitious binding are more obvious in MS-based analysis because its dynamic range of up to four orders of magnitude44 exceeds the capacity of more traditional methods. Biochemical isolation does not yield infinite enrichment in any system. As a consequence, there will always be some quantity of nonsynaptic proteins in these preparations. As mass spectrometers and workflows become increasingly sensitive, these “contaminants” will be identified in increasing numbers. As is the case for protein-protein interaction analysis, there is a trade-off: increasing the purity of a given preparation comes with the loss of less stable/more transient real components. A bona fide list of components can be generated by orthogonal means, such as component-by-component confirmation using traditional cell biological approaches (e.g., confocal microscopic localization45). With such a list, the issues with impurity are largely mitigated, as one can always disregard the known copurifying components. A more subtle issue for quantitative experiments will be those bona fide PSD components that also exist extrasynaptically, since some portion of the extrasynaptic population copurifies with the PSD. Mass spectrometry was first systematically applied to characterize purified PSD preparations by the Kennedy laboratory in the year 2000; these initial studies identified 31 components from a one-dimensional SDS PAGE gel.46 As mass spectrometers have become more sensitive and multidimensional fractionation has been applied, these numbers have grown from several hundred to just under 2,000.32,47,48 Fernández and coworkers created knock-in mice expressing PSD-95 fused to a tandem affinity enrichment tag49 that may have facilitated sample preparation and purity. This set of investigations on PSD preparations highlights the experimental advances of proteomics and the related problem of minor components.
Using stable isotopically labeled standards, a number of key PSD proteins have been quantified, allowing the relative stoichiometry of these members to be determined.50,51 A study designed to probe the structural organization of several synaptic proteins has been reported using a technique known as nanodepth tagging.52 In this study, a series of variable-length chemical cross-linkers were used to determine the distance individual proteins were localized relative to the surface of the PSD. Proteomics has been used to characterize individual synaptic types. This approach requires isolation of specific cell types. In certain instances, this can be accomplished via dissection, as has been achieved for the ribbon synapse from hair cells, which were isolated from chicken cochleas.53 These studies demonstrated that Ribeye and Rab3 were major components, while certain regulatory proteins such as the complexins were not found in cochlear ribbon synapses. Because of the codistribution of multiple neuronal subtypes within brain regions, microdissection from subcortical regions will not yield highly homogeneous synaptic populations. It is possible to genetically encode tagged proteins to be expressed in individual neuronal subpopulations, and these tags can be employed as handles to purify specific synaptic types. GluD2 has been fused to GFP and expressed in the parallel fiber to Purkinje cell synapse, which allowed for the characterization of the molecular composition of these synapses, identifying MRCKγ as a novel component.54 Proteomic level studies of the PSD have provided new insight into the underlying biology. Core synaptic proteins have been shown to be more evolutionarily conserved than proteins generally expressed in the brain, possibly owing to the highly interconnected nature of proteins at the synapse.47 The composition of PSD proteins and specifically those that interact with PDZ motifs was compared between Drosophila and mice.55 While the number of synaptic components was similar, a higher percentage of mouse proteins were involved in signaling or structural control of the synapse. Over 150 synaptic phosphorylation sites have been shown to change in stoichiometry in response to neuronal activity.56 In vitro kinase assays on peptide arrays allowed the authors to develop a map linking phosphorylation sites with their likely kinases. This advance has allowed for the formulation of an initial signaling network map; however, many of its aspects await confirmation in an in vivo setting. The quantitative comparison of PSDs under different phenotypic conditions is an active area of study. Various activity/plasticity paradigms have been applied.
Isolated synaptosomes can be stimulated in vitro with extracellular potassium to induce synaptic kinase activity. This procedure has allowed quantification of the activity-dependent phosphorylation of GluR1.57 We have recently quantified activity-dependent changes in synaptic composition that result from pilocarpine-induced neuronal activity.33 This study revealed that many proteins that directly interact at the synapse displayed similar dynamics of synaptic protein localization. In addition, we demonstrated the existence of a functionally defined core of proteins that includes, for example, key neurotransmitter receptors and their associated scaffolding proteins (Figure 9.2). Changes in the synaptic membrane proteome of the visual cortex have been investigated as a function of development and visual experience.58 Monocular deprivation increased levels of kinases as well as regulators of the actin cytoskeleton and endocytosis, indicating that dark rearing may specifically increase signaling pathways that promote synaptic plasticity. Induction of synaptic activity in cultured neurons has been shown to result in accumulation of RNA binding proteins.59 The biological relevance of this finding was investigated via knockdown of hnRNP-M and -G. Alterations in synaptic spine density were observed as a result, emphasizing the role of localized protein synthesis in dendrite and axon physiology.

Neuroproteomics has been applied to understand drug-induced changes in synaptic composition. PSDs from the nucleus accumbens were examined for changes resulting from extinction training following cocaine self-administration in rats.60 Forty-two proteins were differentially regulated, with a notable increase in AKAP79/150. Subsequent treatment of the nucleus accumbens with a peptide that disrupted AKAP function impaired the reinstatement of cocaine seeking. Synaptosomes were analyzed following chronic morphine administration, demonstrating downregulation of multiple presynaptic proteins involved in G-protein signaling, vesicle trafficking, and cell adhesion.61 In another study, rats were trained to self-administer heroin and then underwent long-term abstinence. Subsequent exposure to heroin-associated audiovisual cues resulted in reinduction of heroin-seeking behavior.62 Synaptosomes were isolated from the medial prefrontal cortex of these rats, and iTRAQ quantification showed significant decreases in GluR2, GluR3, NR2B, β catenin, and Atp2b1. This was interpreted to suggest that synaptic depression resulting from GluR endocytosis is crucial for cue-induced relapse behavior. Subsequent microinjection of a GluR2 endocytosis inhibitor into the medial prefrontal cortex resulted in attenuation of cue-induced relapse behavior.

Mouse disease models have also been targets of proteomic analysis. In a fragile-X model, altered protein expression was shown to be primarily localized to presynaptic axon terminals.63 Quantitative analysis of PSDs from a mouse model of Down syndrome determined that proteins were essentially unaltered.64 In these mice, future experiments may clarify whether altered protein expression was presynaptic in nature or whether altered synaptic physiology becomes more evident when activity-dependent changes are quantified.

FIGURE 9.2: Network analysis of postsynaptic density dynamics. The temporal dynamics of proteins at synapses from PSD fractions isolated from mouse forebrain were quantified in response to chemical mass-stimulation activity induced by pilocarpine. Changes in protein abundance were mostly due to trafficking and/or diffusion into and out of the PSD. Edges (lines) were drawn between pairs of proteins that displayed a high degree of correlated temporal activity. Individual proteins were represented as circles, whose size was proportional to the number of other proteins with which they were highly correlated. These circles were color-coded as a function of the number of core PSD proteins with which they were highly correlated. A cluster of proteins containing key glutamate receptors and scaffolding proteins is evident as well as a cluster highly enriched in mitochondrial proteins. These data further indicate that changes in levels of major kinases and phosphatases do not correlate with either of the two main clusters. These data highlight the power of quantitative proteomic data as compared with traditional qualitative studies, which merely registered the presence or absence of specific proteins.33
[Figure: protein correlation network. Legend scales: correlations with core PSD proteins (0 to 34); total correlations with other proteins (0 to 71). Labeled nodes include Camk2a, Camk2b, Camk2d, Ppp3cb, and Mapk1.]
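The construction described in the legend of Figure 9.2 (edges between proteins with highly correlated temporal profiles, node size given by the number of such partners) can be sketched as follows. This is a minimal Python illustration, not the analysis used in reference 33: the protein names and time courses are invented, Pearson correlation and the 0.9 threshold are assumptions, and only positive correlations are treated as edges.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def correlation_network(profiles, threshold=0.9):
    """Return (edges, degree): edges join proteins whose profiles correlate
    at or above the threshold; degree counts edges per protein."""
    edges = [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if pearson(profiles[p], profiles[q]) >= threshold
    ]
    degree = {name: 0 for name in profiles}
    for p, q in edges:
        degree[p] += 1
        degree[q] += 1
    return edges, degree


# Hypothetical relative abundances at four time points after stimulation.
PROFILES = {
    "receptor_1": [1.0, 1.4, 1.9, 1.5],
    "scaffold_1": [1.0, 1.5, 2.0, 1.6],
    "mito_1": [1.0, 0.8, 0.6, 0.7],
    "mito_2": [1.0, 0.7, 0.5, 0.6],
    "kinase_1": [1.0, 1.2, 0.9, 1.3],
}

if __name__ == "__main__":
    edges, degree = correlation_network(PROFILES)
    print(edges)
    print(degree)
```

With these invented profiles the receptor and scaffold form one pair, the two mitochondrial proteins another, and the kinase is left unconnected, mirroring in miniature the pattern the legend describes. Real datasets would need many more time points and a correction for multiple comparisons before such edges could be trusted.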

Vesicular Compartments

Subcellular fractionation enables isolation of highly purified synaptic vesicles containing neurotransmitters and the associated docking machinery. These vesicles have been analyzed for their protein composition, initially identifying 72 distinct proteins.65 More recent analyses have significantly extended these studies. The copy number per vesicle for major SV components has been measured. These data were combined with analysis of lipid composition and physical characterization of SVs to measure such aspects as density, diameter, and mass. This allowed the development of a molecular model of an average composite synaptic vesicle.66 Transmembrane domains represent approximately 20% of the surface of vesicles. Vesicle proteins are in general present at multiple copy number, with the exception of V-ATPase, which appears to exist at only one or two molecules per SV. Major lipids include phosphatidylcholine, phosphatidylethanolamine, phosphatidylserine, and phosphatidylinositol. VGLUT1 and 2 were both found at approximately 10 copies per vesicle. A quantitative comparison between glutamatergic and GABAergic synaptic vesicles has been reported.34 Of the roughly 450 proteins identified, 25 and 27 were enriched in VGLUT-1 or VGAT vesicles, respectively. VGLUT-1-enriched proteins include ZnT3, SV2B, SV31, synaptophysin, synaptotagmins, and syntaxin 1a. The protein MAL2 was identified as a novel VGLUT-1 vesicle protein. SV2C was mainly associated with VGAT. The authors conclude that the vesicle transporter proteins themselves are the primary determinants that define the phenotype of a synaptic vesicle. The iTRAQ reagent was used to quantitatively analyze the distribution of Rab G protein family members.67 By quantifying the enrichment in synaptic vesicles relative to selected subcellular fractions (brain homogenate, synaptic cytosol, and crude synaptic vesicles), Rabs were characterized as to whether they were strongly, moderately, or weakly enriched in synaptic vesicles. The molecular interactors of the major synaptic vesicle protein SV2 have been investigated.68 SV2-Flag fusion proteins, with or without a mutation in the endocytosis motif, were used to identify vesicle-associated and non-vesicle-associated interacting proteins. Many proteins were found to associate with SV2 in a WT-specific manner, particularly those involved in initial stages of endocytosis. These included AP-1 and AP-2 complex subunits, EPS15, and amphiphysin. Calsyntenin-1 is a cargo docking protein involved in kinesin-mediated axonal transport.69 Immunoaffinity purification has been used to characterize the proteome of calsyntenin-1-positive vesicles, revealing that such vesicles are enriched in endosomal trafficking machinery.70 Coimmunofluorescence with endosomal markers subsequently demonstrated that calsyntenin-1 occurs on at least two distinct vesicle populations.

Axons

Myelin

Myelin can be purified from brain homogenate using sequential ultracentrifugation.71 This has allowed proteomics to characterize over 300 proteins present in myelin sheaths (for a review see reference 72). The myelination process has been shown to involve interactions between the extracellular matrix receptor integrin α6β1 and the Ig family adhesion molecule contactin.73 Growth cones in the developing nervous system can be isolated using sucrose density ultracentrifugation. From this preparation, mass spectrometry identified 945 protein components.74 Quantitative immunostaining was used for approximately 100 of these to demonstrate that they were highly enriched in growth cone preparations. It is necessary to isolate/purify axons or axoplasm to specifically analyze their proteomes. Axons can be microdissected from the lateral olfactory tract of embryonic mice. This approach was used to characterize changes in proteome composition during maturation, which demonstrated that multiple classes of calcium-dependent membrane-binding proteins become upregulated as axons develop.75 Cargo is specifically transported along axons in both the retrograde and anterograde directions. Axons in the peripheral nervous system are experimentally accessible to ligature, resulting in the accumulation of transport machinery and its cargo at this site of ligation. This approach was employed to characterize the collection of proteins transported in response to nerve crush in rat sciatic nerve, identifying approximately 150 proteins that were transported in either direction.35 These results were extended to examine changes in protein phosphorylation in response to injury.76 These data were combined with transcriptomics to demonstrate that multiple signal transduction networks regulate the transcription of injury-response proteins.

Protein-Protein Interactions

Mass spectrometry is the method of choice to characterize novel protein-protein interactions, since, unlike Western blotting, it does not require making assumptions about possible interactors. A number of related techniques exist to biochemically isolate potential interacting proteins. Antibodies can be used to purify target proteins and interactors by immunoprecipitation. Proteins can be genetically encoded with a single affinity tag, such as GFP or a FLAG epitope, or with a tandem affinity tag.77 Alternatively, immunoaffinity isolation can be used. In this approach, the protein of interest is recombinantly expressed along with an enrichment tag in a heterologous cell system or transgenic animal. The purified protein is immobilized on an affinity resin (generally via the enrichment tag itself). The resin is packed in a column format and cell lysate from the system of interest is passed over the column. With any of these approaches, trade-offs will be required between achieving specificity of the enrichment (to minimize false positives) and yield of interactors (to minimize false negatives, particularly with respect to detecting weak or transient interactors). Once a set of candidate interactors has been identified, traditional cell biological confirmation studies can of course be conducted to validate interactions. Detecting transient interactors is particularly challenging, as their relatively high off rates necessitate fast/mild washing procedures that also increase background.78–80 With the advent of ever more sensitive mass spectrometers, tens to hundreds of proteins can be identified from a co-IP experiment. This does not necessarily indicate a poor level of biochemical enrichment but nevertheless presents a challenge to experimenters in deciding on which targets to focus their efforts. Stable isotope labeling can be used to compare the purification of interest with a negative control sample (e.g., a sample not expressing the tagged protein, or where the endogenous protein has been knocked out/down, or using control IgG for the purification). If the enrichments are carried out in a quantitative manner, the level of enrichment for each candidate interactor can be measured to give some indication as to whether or not the protein was specifically enriched.81 A number of statistical approaches have been developed, primarily based on medium- to large-scale purification datasets, to determine the likelihood that individual components will be specific to a particular IP or whether they are common contaminants.82–84 Such systematic analysis of protein complexes can yield data of much higher quality,84 although such major undertakings may be outside the scope of many laboratories. There will always be a trade-off between false positives and false negatives. Stringent washing approaches will give very low false-positive identifications but at the cost that low-affinity interactors will not be identified.
The degree of effort required to deal with false positives in the subsequent validation steps would need to be evaluated on an experiment-by-experiment basis. The use of reciprocal pull downs can provide additional confirmation with respect to the specificity of a given interaction.85 Distinguishing pairwise protein-protein interactions from protein complexes in these types of studies can be complicated. For example, showing that A interacts with B and C, that B interacts with A and C, and that C interacts with A and B is not sufficient evidence to conclude that a trimeric complex exists. However, the use of native blue gels, analytical ultracentrifugation, sizing columns, or mass spectrometry of the intact complex can further define the composition and relative stoichiometry of components of a given complex.86,87

Protein Complexes

Neuroproteomic research into protein complexes has generally focused on those complexes containing neurotransmitter receptors and their principal scaffolding proteins. Early work in this field examined the complement of proteins that could be immunoprecipitated along with the NMDA receptor.88 A multiprotein complex containing 77 proteins was identified, including kinases, phosphatases, GAPs, and Ras proteins. This knowledge set the stage for thinking about large assemblies of ion channels. Mice have been generated with a TAP-tagged version of the key synaptic scaffolding protein PSD-95.49 Purification of PSD-95 complexes from whole brain lysate permitted the identification of 118 proteins. The mGluR5 signaling complex has been isolated from rat brain.89 A total of eight previously known interactors were found, including Homer 3, GTP-binding protein alpha q and o, calmodulin, and Shank 1a. Novel interactors identified in this study include MAP 1A and 2 and several isoforms of 14-3-3. The β2 subunit of nAChR has been immunoprecipitated from mouse brain and 21 interacting proteins were identified.90 The use of β2-knockout mice helped to address the issue of specificity of these targets, which overall appear to regulate both signaling by and trafficking of nAChRs. At the presynaptic side, a combination of affinity chromatography and immunoprecipitation with Kir2.1, 2.2, and 2.3 revealed the existence of a trafficking complex that contained SAP97, CASK, Veli, and Mint1.91 In this case, SAP97, CASK, Veli, and Mint1 were shown to exist in a complex with Kir2.2 rather than in multiple binary potassium channel interactions. This complex was mediated by the C-terminal PDZ binding motif on Kir2.2. Müller and colleagues92 used a large-scale approach to characterize proteins interacting with the alpha and beta subunits of the voltage-gated calcium channel CaV2. A total of 14 different antibodies against α and β subunits were used.
Plasma membranes were prepared from mice and rats as well as mice deficient in different CaV2 subunits (CaV2.1, 2.2, 2.3, β2, β3, β4). Three different detergents of varying stringency were employed. The proteins detected were filtered by their ability to be reproducibly identified by several antibodies, overall MS signal from the identified proteins, and a greater than 10-fold increase in WT versus the respective knockout mice. This led to the identification of 207 proteins defined as true interactors (albeit potentially indirect binders). The authors observe that the majority of proteins appeared to be involved with regulating the intracellular Ca2+ concentration. A recent study by Schwenk and associates87 illustrates the challenges of interaction proteomics and demonstrates how a series of approaches can be integrated to yield a high-quality dataset. The authors probed brain lysates with a series of 10 antibodies against the four GluA subunits. The relative levels of coprecipitated proteins were compared to IPs using preimmunization IgGs as well as immunoprecipitation using material from knockout control mice (for GluA1 and GluA2). Of the 1,711 proteins detected, 34 passed the quantitative validation criteria. The authors then used an antibody-free approach by which GluR protein complexes were separated using native blue gel electrophoresis. GluRs ran as a complex between 0.6 and 1.0 MDa. The gel was sectioned into 81 slices and the relative amounts of GluRs as well as the previously characterized interactors were quantified using stable isotope standards synthesized using a QConCAT approach.93 This study demonstrated the existence of multiple distinct AMPAR complexes that have different stability profiles.
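The kind of multi-criterion filtering described for the CaV2 study (reproducible detection by several antibodies, sufficient overall MS signal, and more than 10-fold enrichment in wild type over knockout) can be sketched in Python. Only the 10-fold criterion comes from the text; the minimum number of antibodies, the signal cutoff, the floor used when a protein is undetected in the knockout, and all protein names and intensities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    protein: str
    antibodies_detected: int  # antibodies whose IP contained the protein
    wt_signal: float          # summed MS intensity, wild-type purification
    ko_signal: float          # summed MS intensity, knockout control


def fold_enrichment(c: Candidate, floor: float = 1.0) -> float:
    """WT/KO ratio; `floor` avoids division by zero when KO signal is absent."""
    return c.wt_signal / max(c.ko_signal, floor)


def passes(c: Candidate, min_antibodies: int = 2,
           min_signal: float = 1.0e6, min_fold: float = 10.0) -> bool:
    """Apply all three criteria; thresholds other than min_fold are assumed."""
    return (c.antibodies_detected >= min_antibodies
            and c.wt_signal >= min_signal
            and fold_enrichment(c) > min_fold)


CANDIDATES = [
    Candidate("prot_A", 5, 5.0e6, 1.0e5),  # 50-fold, robust signal
    Candidate("prot_B", 4, 2.0e6, 1.0e6),  # only 2-fold over knockout
    Candidate("prot_C", 1, 8.0e6, 0.0),    # seen with a single antibody
    Candidate("prot_D", 3, 4.0e6, 0.0),    # absent from knockout control
    Candidate("prot_E", 3, 5.0e4, 1.0e3),  # enriched but very weak signal
]

if __name__ == "__main__":
    print([c.protein for c in CANDIDATES if passes(c)])
```

Tightening any one threshold trades false positives for false negatives, which is the same compromise discussed above for washing stringency.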

Individual Protein-Protein Interactions

Mass spectrometry continues to play a key role in the characterization of protein-protein interactions. As examples in this section illustrate, MS in these cases is used to enable initial identification of potential protein partners. Subsequent biological follow-up experiments are then used to characterize these interactions and the phenotypic consequences of their disruption. Rather than exhaustively profile these studies, we have chosen to highlight several recent examples. The use of antibodies to immunoprecipitate proteins and their interacting partners has been widely utilized to analyze neural tissue. By using antibodies against spinophilin, 125 potential interactors were identified.94 In this case, the authors used immunoprecipitates from spinophilin-deficient tissue to control for nonspecific interactors. Immunoprecipitation with antibodies against GluR6/7 from rat cerebella identified NETO2 as an interactor of kainate receptors.95 NETO2 was shown to modulate the channel properties of kainate receptors without directly affecting their trafficking. Tagged proteins have been expressed in cell culture or in vivo to examine many aspects of neuronal biology. A TAP-tagged version of amyloid precursor protein was expressed in vivo to identify interacting proteins in mice. Eight previously reported interactors as well as 36 additional proteins were identified.96 Coexpression of the novel interactor NEEP21 influenced proteolytic processing of APP. A TAP-tagged version of GluR2 was expressed in mouse and led to confirmation of TARPs as AMPA receptor interactors as well as a novel interaction with Bip/Grp78.97 In vitro tagging with GFP of multiple CaMK cascade kinases identified a known interaction with 14-3-3 proteins as well as a novel interaction with the Rac1/Cdc42 GEF βPIX.98 CaMKK, CaMKI, βPIX, and GIT1 were subsequently shown to form a complex that is localized to spines and whose activity-dependent signaling promoted synapse formation. Dysbindin was FLAG tagged in vitro, and in this case a SILAC strategy was employed to compare proteins recovered in the anti-FLAG IP with those from a control IP done in the presence of excess FLAG peptide.99 A total of 24 proteins showed significant enrichment, including multiple members of the BLOC-1 complex. Immunoaffinity chromatography techniques involve the expression of a protein or protein domain in a heterologous expression system, which is then purified and immobilized on a resin. Protein lysate from the cells of interest is passed over the column and the interacting proteins that are retained are identified by MS.
This approach identified Plk2 interactors involved in endocytosis, including AP1, AP2, and NSF.100 It was then demonstrated that Plk2 binding to NSF decreases NSF interactions with GluA2, resulting in decreased surface-associated GluA2. Affinity chromatography has been used to investigate receptor-ligand interactions. Ko and associates identified neurexin as an interactor of LRRTM.101 Neuroligins and LRRTM2 were shown to bind different splice isoforms of neurexins. Recombinant latrophilin, LPHN3-Fc, was used to probe for interactors in rat synaptosomes, resulting in the identification of FLRT2 and 3.102 This interaction was shown to be direct, and shRNA reduction of FLRT3 decreased dendritic spine number in cultured dentate granule cells.

PTMs and PTM-Modifying Enzymes

The analysis of posttranslational modifications has obvious importance, as these chemical moieties affect almost every aspect of protein function.19 A complete understanding of a protein’s PTM landscape and how it changes in response to physiological state is critical for understanding the discrete biological role(s) of that protein. Data from large-scale analyses can be used in a more basic sense to refine predictions regarding protein structure. In the case of the synaptic protein densin-180, the detection of multiple phosphorylation sites in a putative extracellular domain provided new insight that necessitated a revision of its membrane topology.103 MS-based proteomics has discovered a plethora of PTMs that has far exceeded initial estimations. At least 60% of proteins in synaptosome preparations have been found to be phosphorylated, at an average ratio of six sites per protein.21 In order to achieve a global level of characterization, each PTM requires use of specific enrichment strategies. To date, the field of phosphoproteomics is the most mature, as there are established techniques to enrich phosphorylated peptides from the total pool of peptides that result from proteolytic digestion of a sample. These include IMAC/TiO2 enrichment,21 antiphosphotyrosine antibodies,104 and motif-specific phosphoserine/threonine antibodies.105 Antibodies and genetic tagging have been used to investigate ubiquitination (for review, see reference 106). Other PTMs have been enriched using antibodies, chemical tagging,107 PTM-specific binding proteins,108 or weak affinity chromatography.21,109 Representative examples of MS/MS spectra from PTM-modified peptides are shown in Figure 9.3. Peptide MS/MS search algorithms are in general designed to identify a given peptide sequence from a database and to determine whether it exists in a (PTM) modified state but not necessarily for a precise assignment of the modified site.
For example, a search algorithm can match a MS/MS spectrum to a

the OMICs

200

400 m/z

600

1000

y18-SOCH4+2 y15+2 y16+2 y17+2 y18+2 y19+2

y4

50

200

1200

LTD E E VD E MI R

a2

75 25

c11 z11

z7

800 m/z

y3

b8

z10

c10

z9

z6 c9 z5

600

b2

y5

1800

z3

Intensity

50 40 30 20 10

100

800

b16

y13 b14

y12

MH-H3PO4+2

y10 y11

y6 y7

1500

(F) H V M TN L G E K(Acetyl)

R

y4

25

1200

900

G L A G P T T V P AT (GlcNAc) K

400

1000

b7

50

b6

b2

75

b3

y1

100

y10

800

E H A L A Q A E L L K(GlyGly)

y3 b y44 y5

y9 y8 y9

y7b8

y6

y5

b4

600 m/z

L N A E AI R

m/z

(D)

y6

(E)

400

600

300

1200

y4

168 186 204

50 40 30 20 10

200

Intensity

900

G L A G P T T VP A T K +(GlcNAc)

b3

Intensity

(C)

600 m/z

50 40 30 20 10

y1

300

Intensity

10

b12-H PO 3 4 y10 b12 y11

20

y3

Intensity

30

(B) S Q N IITD S S S (Phospho)

P P P T TA PHK

Intensity

A AV V T S(Phospho)

y2 b4 b5

(A)

FIGURE 9.3: Characterization of posttranslationally modified peptides via MS/MS (panels A–F; fragment ion spectra plotted against m/z). Tandem (i.e., MS/MS) mass spectra can allow one both to determine the amino acid sequence of a given peptide and to determine the site of modification within this sequence. The precision and scope of the interpretation of an MS/MS spectrum depend critically on its spectral information content. A: The peptide AAVVTSPPPTTAPHK is phosphorylated on the serine residue in the sixth position, as determined by MS/MS using collision-induced dissociation in a linear ion trap. B: This MS/MS spectrum demonstrates that the peptide SQNIITDSSSLNAEAIR is phosphorylated. However, there is insufficient spectral information to determine whether the site of phosphorylation is serine residue eight, nine, or ten. In other words, no fragment ions (such as a phosphorylated version of y8) were observed that would unambiguously localize the PTM to a single serine or threonine residue. C: MS/MS in a quadrupole instrument was used to sequence the peptide GLAGPTTVPATK; the mass of the precursor ion indicates that this peptide is O-GlcNAcylated. However, the GlcNAc moiety is labile during MS/MS and undergoes neutral loss, so the resulting spectrum is essentially identical to that of the nonmodified version of the peptide. Therefore the GlcNAc moiety cannot be assigned to a specific residue. D: Electron transfer dissociation (ETD) of this peptide allows fragmentation with retention of the GlcNAc moiety. This spectrum clearly indicates that the site of modification is the threonine residue in the 11th position. These examples highlight a recent major advance in MS instrumentation for the investigation of labile modifications. E: The peptide EHALAQAELLKR is ubiquitinated on the lysine side chain. Tryptic digestion of ubiquitinated proteins generates peptides containing a GlyGly ubiquitin remnant on the side chain of modified lysines. This occurs because ubiquitin contains a tryptic cleavage site immediately amino-terminal to the diglycine motif. F: The peptide HVMTNLGEKLTDEEVDEMIR is acetylated on the lysine residue in the ninth position. An antibody against acetylated lysine was used to enrich modified peptides. For an explanation of MS/MS fragmentation, see references 135 and 166.

phosphorylated peptide sequence and calculate the probability that the match is the correct assignment. However, such algorithms do not generally provide a statistic measuring the likelihood that a given residue in the peptide is the site of modification relative to the other possibilities. As such, care must be taken in using data derived from large-scale studies. These datasets are currently incorporated into various online repositories, and in many cases it is still not clear what criteria were originally used to confidently assign the site of a given PTM. We recommend that any laboratory interested in conducting follow-up experiments on a given site consult a mass spectrometry expert to assess the quality of the original spectrum used for site assignment. Data supporting the identification of posttranslational modifications are now generally required to be provided in the supplementary materials.110 More recently, algorithms such as AScore and SLIP have been developed that provide a statistical estimate of the reliability of residue-level site assignments within a peptide.111–114
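The site-localization problem described above can be made concrete with a small calculation. The following is a minimal sketch, not the AScore or SLIP algorithm: it only lists which singly charged b and y ions differ in mass between two candidate phosphorylation sites, using the peptide from Figure 9.3B. The function names and the 0.01 Da tolerance are assumptions made for illustration; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the amino acids needed in this example.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "H": 137.05891, "R": 156.10111,
}
PHOSPHO = 79.96633  # HPO3 added to the modified residue
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide, site):
    """Singly charged b and y ion m/z values for a peptide phosphorylated
    at the given 1-based residue position (hypothetical helper)."""
    masses = [
        RESIDUE[aa] + (PHOSPHO if i == site else 0.0)
        for i, aa in enumerate(peptide, start=1)
    ]
    n = len(peptide)
    ions = {}
    for k in range(1, n):
        ions[f"b{k}"] = sum(masses[:k]) + PROTON
        ions[f"y{k}"] = sum(masses[n - k:]) + WATER + PROTON
    return ions


def site_determining_ions(peptide, site_a, site_b, tol=0.01):
    """Names of the fragment ions whose m/z differs between two candidate sites."""
    a = fragment_ions(peptide, site_a)
    b = fragment_ions(peptide, site_b)
    differing = [name for name in a if abs(a[name] - b[name]) > tol]
    return sorted(differing, key=lambda s: (s[0], int(s[1:])))


peptide = "SQNIITDSSSLNAEAIR"
# Only these ions can distinguish serine 9 from serine 10.
print(site_determining_ions(peptide, 9, 10))  # ['b9', 'y8']
```

Under these assumptions, only two ions separate phosphorylation at serine nine from serine ten, which is why the absence of a phosphorylated y8 ion leaves the assignment ambiguous.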

PTM Analysis of Single Proteins

Mass spectrometry has been used extensively to characterize posttranslationally modified forms of individual proteins. In such an analysis it is generally sufficient to obtain a relatively pure sample of the target protein; subsequent PTM enrichment is usually not required because the pool of peptides in such a digest has limited complexity. Proteolysis of a purified protein produces a limited number of peptides, so in a typical current LC-MS/MS analysis the mass spectrometer has sufficient time to attempt to sequence all of the peptides, including any modified peptides. This does not mean that all actual PTMs will be identified, because not all peptides (PTM-modified or not) possess the proper biophysical properties to be sequenced in a single experiment (e.g., they may be too short, too long, or poorly resolved chromatographically). In addition, PTMs present at very low stoichiometry in a given preparation may be below the absolute detection limit of the mass spectrometer. Some of these limitations can be addressed by using multiple proteases (either separately or in combination) to yield different sets of peptides.

The acetylation status of Huntingtin has been examined using myc-tagged protein expressed in HEK cells.115 Three novel and two previously characterized sites were identified, and acetylation-specific antibodies were generated against the novel sites as reagents for future studies. As an alternative to examining endogenously occurring sites of modification, an acetyltransferase can be used to modify substrates in vitro. Such an approach was used to identify 23 putative sites of lysine acetylation on tau.116 To demonstrate physiological relevance, it was shown that inhibition of the acetyltransferase p300 in primary cultures reduced acetylation levels and caused elimination of phosphorylated tau. As a general rule, in vitro modification studies require careful follow-up experiments to determine both whether the PTMs are present in vivo and whether they are functionally relevant.

Ion channels have been shown to be extensively phosphorylated. The alpha subunit of voltage-gated sodium channels contains at least 15 sites of phosphorylation at endogenous levels in rat brain.117 The β2 subunit of voltage-gated potassium channels is phosphorylated at two residues.118 Modification of one of these sites by Cdk negatively regulates the interaction between EB1 and the potassium channel, which is important in the subcellular regulation of these channels.

CaMKIIβ was found localized to the centrosome independently from CaMKIIα. Cdc20-APC was known to have several phosphorylation sites with a CaMKIIβ, and in vitro kinase assays were performed in conjunction with MS to demonstrate that CaMKIIβ phosphorylates Cdc20 at Ser51, Ser84, and Ser86.

Analysis of MeCP2 purified from rat brain nuclear extracts and HEK cells identified three phosphorylation sites.119 Phosphorylation at S421 has been shown to play a critical role in the MeCP2-mediated regulation of dendritic patterning, spine morphogenesis, and induction of Bdnf transcription.

Cyclin-dependent kinase 5 has been shown to positively regulate the interaction between the ubiquitin ligase Mdm2 and PSD-95.120 Of the five ubiquitination sites identified on PSD-95, none resulted in decreased protein levels, suggesting that these modifications play a nondegradative role.

Phosphorylation and Kinases

As a result of recent advances in the biochemical isolation of phosphorylated peptides using immobilized metal affinity chromatography (IMAC) or TiO2 beads, phosphoproteomics has advanced markedly over the last five years. The number of phosphorylation sites identified from brain samples has increased significantly. Early studies identified on the order of 100 sites from synaptic preparations of various organisms.48,121,122 Combining phosphopeptide enrichment with multidimensional chromatography (Figure 9.1) increased these numbers to several hundred.123,124 Large-scale analyses now commonly identify over 10,000 phosphorylated components from complex samples. Approaches such as hydrophilic interaction chromatography can sequentially enrich phosphorylated and glycosylated peptides from the same pool of peptides, but such approaches are less well developed.125 We recently analyzed the phosphoproteome of synaptosomes derived from mouse tissue and identified over 16,000 phosphorylation sites.21 As noted earlier, this report indicates that approximately 60% of mouse brain proteins are phosphorylated and that on average there are approximately six sites of phosphorylation per protein.

With the technical capability to more completely characterize the extent of phosphorylation, a key challenge remains the dissection of kinase pathways at the molecular level. One approach to this problem is the use of analog-sensitive kinases.126,127 This has recently been applied to the identification of substrates for NDR1/2.128 As a result, AAK1, Rabin8, PI4Kβ, Pannexin-2, and Rab11fip5 were identified as putative substrates for NDR1/2. All of these substrates were phosphorylated in an HXXRXXS/T motif. Purified NDR was subsequently shown to phosphorylate purified AAK1 and Rabin8 in vitro and to regulate both dendrite length and spine development. Dendrite growth appears to be mediated by phosphorylation of AAK1, while spine development appears to be mediated by phosphorylation of Rabin8.

Motif-specific antibodies can be used to broadly profile kinases with related substrate motifs. Using an antibody against a MAPK motif such as PX(pS/pT)P, potential substrates were immunoprecipitated.129 A total of 449 potential MAPK substrate proteins were identified, many of which appeared to be dynamically regulated by activity in neuronal culture. Because the immunoprecipitation was conducted at the protein level, not all sites of phosphorylation were identified, but the authors were able to map 82 phosphorylation sites. Targeted follow-up on Ser-447 of delta-catenin validated that it is modified in an activity-dependent manner by the MAPK JNK and that this phosphorylation was correlated with substrate degradation. These approaches do not uniquely identify the kinases responsible for a given phosphorylation, but they represent a potentially powerful method for targeting a subset of the phosphoproteome for focused analyses.

One approach to understanding the different facets of RTK signaling involves the use of chimeric RTKs.130 The authors expressed receptors containing the extracellular domain of PDGFR fused to the intracellular domain of TrkA in PC12 cells (which lack endogenous PDGFR). Application of PDGF could then selectively activate the chimeric receptors while leaving endogenous WT TrkA unaffected. SILAC quantification was used to examine phosphorylation sites regulated by RTK activation. Motif analysis showed that the kinase activation profile upon TrkA activation was very similar to the previously characterized profile of EGFR stimulation. This approach could potentially be used to examine how individual phosphorylated docking sites in the intracellular domain of TrkA recruit distinct complexes to initiate signaling via multiple pathways.

Immobilized kinases themselves can be used as affinity chromatography reagents to enrich substrates from complex mixtures. The catalytic domain of Rho-kinase was used to purify 313 putative substrates from rat brain cytosol and membrane preparations.131 The specificity of the enrichment was controlled by parallel enrichment using a GST-alone affinity column. The overall false-discovery rates of such an approach remain to be investigated, but this could represent an orthogonal approach to the identification of kinase substrates.
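The two substrate motifs mentioned in this section, HXXRXXS/T and PX(S/T)P, can be located in a protein sequence with a simple pattern scan. The sketch below is illustrative only: the sequence is a made-up example, the function and dictionary names are hypothetical, and a motif match marks a candidate site rather than demonstrating that a particular kinase modifies it.

```python
import re

# Motif name -> (compiled pattern, 0-based offset of the Ser/Thr acceptor
# within the match). Lookahead groups allow overlapping matches to be found.
MOTIFS = {
    "NDR1/2 (HXXRXXS/T)": (re.compile(r"(?=(H..R..[ST]))"), 6),
    "MAPK (PX[S/T]P)": (re.compile(r"(?=(P.[ST]P))"), 2),
}


def find_motif_sites(sequence):
    """Return (motif name, 1-based acceptor position, residue) for each match."""
    hits = []
    for name, (pattern, offset) in MOTIFS.items():
        for match in pattern.finditer(sequence):
            position = match.start() + offset + 1  # convert to 1-based numbering
            hits.append((name, position, sequence[position - 1]))
    return hits


# Hypothetical sequence used only to exercise the scan.
example = "MAHKLRGESPPTSPLK"
for hit in find_motif_sites(example):
    print(hit)
```

In this toy sequence the scan reports serine 9 for the HXXRXXS/T motif and serine 13 for the PX(S/T)P motif. Because many kinases share related motifs, such a scan narrows the candidates but does not assign a site to a single enzyme.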

O-GlcNAc and OGT

O-GlcNAcylation is the addition of a single sugar (β-N-acetylglucosamine) to serine and threonine residues on intracellular domains of proteins. Like phosphorylation, it is a dynamic modification: its addition is catalyzed by an enzyme known as O-GlcNAc-transferase (OGT) and its removal is catalyzed by O-GlcNAcase (OGA).132–134 The resulting cycling of O-GlcNAc has the potential to broadly regulate protein function, possibly in concert with phosphorylation. Levels of this modification are highest in the liver, pancreas, and brain (where it is abundant at nerve terminals).

Analysis of O-GlcNAc by mass spectrometry has been challenging using standard peptide fragmentation approaches, because the sugar moiety is very labile and undergoes neutral loss during collision-induced dissociation (CID), precluding definitive localization of the site of modification within a modified peptide. More recently developed fragmentation techniques, ECD and ETD, cleave the peptide bonds on a time scale that minimizes energy randomization. Thus it becomes possible to fragment modified peptides while retaining labile modifications on side chains.135 This has proven to be much more effective for the characterization of O-GlcNAc.21,136

Two biochemical approaches have been employed successfully to enrich O-GlcNAcylated peptides. The first is a chemoenzymatic approach in which a modified version of a GalNAc-transferase is used to attach a tagged GalNAc sugar to O-GlcNAcylated peptides.137,138 By isotopically labeling peptides with formaldehyde, this approach has been used to examine changes in O-GlcNAcylation in rat brain following stimulation with kainic acid.139 A total of 20 O-GlcNAcylated peptides were identified, of which eight appeared to be regulated by stimulation. A novel application of this approach is the addition of PEGylated versions of GalNAc. This large mass addition allows resolution of nonmodified and PEGylated versions of a protein on an SDS-PAGE gel.140 Proteins carrying increasing numbers of PEG tags can be resolved from each other as well. This enables the estimation of modification stoichiometries, which were shown to range from 2% to 100%. As a general principle, the addition of large tags to peptides decreases the efficiency with which they are identified by mass spectrometry. To address this concern, a cleavable version of the GalNAc tag has been developed in which the biotin enrichment tag can be cleaved after enrichment but prior to MS analysis.107 This approach has recently been applied to profiling the mouse brain GlcNAc-proteome, resulting in the identification of 458 sites of modification.141

The second approach, developed in our lab, is a chromatography-based enrichment in which GlcNAcylated peptides are directly purified using immobilized lectins, which specifically bind certain classes of carbohydrates.109 The lectin wheat germ agglutinin (WGA) is selective for O-GlcNAc.142 Chromatographic conditions have to be tailored to account for the lower affinity of WGA for peptides carrying a simple sugar residue compared with its high affinity for proteins with complex sugars. We have recently applied this approach to characterize both the GlcNAcylation and phosphorylation landscape of mouse synaptosomes, and identified 1,750 and 16,500 sites of O-GlcNAcylation and phosphorylation, respectively.21 A bioinformatic comparison demonstrated a complex relationship between the distributions of phosphorylation and O-GlcNAcylation on protein substrates (Figure 9.4). Proteins that were highly O-GlcNAcylated were generally phosphorylated to an equal or greater degree. However, proteins could be extensively phosphorylated with little to no O-GlcNAcylation. Therefore OGT seems to target only a subset of proteins that are also kinase substrates. Structural analysis revealed that, within individual proteins, phosphorylation sites clustered together, as did O-GlcNAcylation sites. However, no relationship was observed between the exact sites of phosphorylation and those of O-GlcNAcylation.
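The clustering comparison summarized above rests on simple distance calculations along the linear protein sequence. The following is a minimal sketch under stated assumptions: it uses distances between adjacent sites as the "observed" measure and distances between adjacent serine/threonine residues as the "expected" background, which may differ from the exact procedure used in the cited study. The sequence, site positions, and function names are hypothetical.

```python
def adjacent_distances(positions):
    """Distances (in residues) between neighboring positions along the sequence."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def ser_thr_positions(sequence):
    """1-based positions of all serine and threonine residues."""
    return [i for i, aa in enumerate(sequence, start=1) if aa in "ST"]


def nearest_distance(site, other_sites):
    """Distance from one site to the closest site of another modification type."""
    return min(abs(site - other) for other in other_sites)


# Toy protein: Ser/Thr at positions 1, 2, 3, 14, and 22.
sequence = "SST" + "A" * 10 + "S" + "A" * 7 + "T"
phospho_sites = [1, 3]    # hypothetical phosphorylation sites
glcnac_sites = [14, 22]   # hypothetical O-GlcNAcylation sites

observed = adjacent_distances(phospho_sites)                 # site-to-site spacing
expected = adjacent_distances(ser_thr_positions(sequence))   # background S/T spacing
cross = [nearest_distance(s, phospho_sites) for s in glcnac_sites]

print("observed phospho-phospho:", observed)
print("expected from all S/T:   ", expected)
print("GlcNAc to nearest phospho:", cross)
```

Pooling such distances over many proteins and comparing the observed histogram with the background histogram is one way to ask whether modified sites lie closer together than the spacing of serines and threonines alone would predict.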

Other Posttranslational Modifications

Proteomics has additionally been employed to study a range of other posttranslational modifications. A general requirement for such studies is a biochemical handle that enables enrichment and subsequent detection of PTM-modified peptides. Although cAMP binding is not a covalent PTM, cAMP-binding proteins have been specifically examined using a targeted affinity reagent.143 This compound contained cAMP linked to a photoactivatable cross-linker and a biotin enrichment tag and was used to identify 18 cAMP-binding proteins from rat synaptosomes. A chemical enrichment strategy has been used to identify several hundred palmitoylated proteins from rat brain.144 To identify palmitoylated proteins, unmodified cysteine thiols were blocked with N-ethylmaleimide. Palmitoyl thioesters were then cleaved with hydroxylamine, and the newly exposed cysteinyl thiols were labeled with a biotinylation reagent. Quantitative application of this approach demonstrated that the palmitoylation states of proteins changed broadly as a function of activity, which suggests that this PTM may have a far-reaching role in the regulation of synaptic function. Other PTMs that have been studied from neuronal tissue include O-linked N-acetylglucosamine phosphorylation,145 fucose-α(1-2)-galactose-modified proteins,146 and carbonylation as a function of aging in mice.147

Targeted proteolysis and protein degradation have been studied using mass spectrometry. One approach is based on peptidase-specific, proteome-derived peptide libraries.148 Alternatively, a differential 2D gel approach has identified targets of caspase-6 in human neurons.149 By comparing tissue lysates treated or untreated with recombinant caspase-6, the authors identified 24 potential substrates largely related to the cytoskeleton and cytoskeletal remodeling. Although the technology has yet to be applied to neuronal samples, caspase substrates can be identified at large scale via enrichment following labeling with subtiligase.150 To examine targets of ubiquitination, a Drosophila line expressing the BirA biotinylation sequence fused to Ub6 was generated, allowing ubiquitinated proteins to be specifically enriched.151 As a consequence of tryptic digestion, ubiquitination sites are cleaved to yield a GlyGly remnant on previously modified lysine residues. An antibody has been generated that recognizes this remnant, allowing for enrichment of these peptides,134 although it has not yet been widely applied to brain samples.152 Similarly, proteomic approaches have been developed to investigate ubiquitin-like modifiers such as SUMO;153 however, they have yet to be widely applied to the analysis of neural tissue.

FIGURE 9.4: Bioinformatic analysis of phosphorylation and O-GlcNAcylation site distributions. Each panel plots occurrences against the distance between sites in residues, for observed and expected distributions. A: Over 15,000 sites of phosphorylation on brain proteins were mapped onto the linear structure of their respective proteins. For multiphosphorylated proteins, a comparison of the distance between sites of modification (black) relative to the distances between serines and threonines (dark gray) demonstrates that phosphorylation sites occur in clusters within a protein. B: Over 1,500 sites of O-GlcNAcylation on brain proteins were mapped onto the linear structure of their respective proteins. For multi-O-GlcNAcylated proteins, a comparison of the distance between sites of modification (black) relative to the distances between serines and threonines (dark gray) demonstrates that O-GlcNAcylation sites occur in clusters within a protein. C: For the subset of proteins used in the above analysis that were modified by both phosphorylation and O-GlcNAcylation, the distance from each site of O-GlcNAcylation to the nearest site of phosphorylation was calculated. The distribution of phosphorylation sites bears little relationship to the distribution of O-GlcNAcylation sites, indicating that these two PTMs do not cocluster.21

Using the WGA enrichment described in the previous section, we have also enriched for peptides bearing a range of complex carbohydrates from synaptic membranes.154 Glycopeptides were fragmented using ETD. This allowed sequencing of the peptides and determination of glycan mass but gave no information regarding the sugar linkages. Overall, this approach allowed for the identification of over 2,500 unique glycopeptides on 453 proteins, demonstrating the breadth of glycosylation in the central nervous system.
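The GlyGly remnant on ubiquitinated lysines mentioned earlier in this section follows directly from where trypsin cuts ubiquitin. As a minimal sketch, the code below digests the human ubiquitin sequence in silico using a simplified trypsin rule (cleave after K or R unless the next residue is P) and computes the mass the remnant adds to a lysine. The function name is hypothetical, and missed cleavages are ignored.

```python
import re

# Human ubiquitin (76 residues); the C-terminal glycine is the one that is
# attached to the substrate lysine.
UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ"
    "QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)
GLY_RESIDUE = 57.02146  # monoisotopic mass of one glycine residue (Da)


def tryptic_peptides(sequence):
    """Split after K or R, except when the next residue is proline."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


c_terminal_peptide = tryptic_peptides(UBIQUITIN)[-1]
remnant_mass = len(c_terminal_peptide) * GLY_RESIDUE

print(c_terminal_peptide)      # GG
print(round(remnant_mass, 4))  # 114.0429
```

The final tryptic fragment of ubiquitin is the diglycine that stays attached to the substrate lysine, so modified peptides carry a mass addition of about 114.04 Da at that residue.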

FUTURE DIRECTIONS AND CHALLENGES FOR PROTEOMICS

Neuroproteomics has made remarkable progress during the last few years, with no sign of slowing down. A number of challenges have become apparent, with possible solutions emerging at various stages of development.

Proteomics Is Evolving Rapidly

Large-scale mass spectrometry analysis of complex tissue (be it of protein levels or PTMs) is still very labor-intensive. This is due to the continuing rapid evolution of strategies and analytical pipelines on all fronts: sample handling methodology, instrument innovation and performance, and bioinformatic tools. Proteomics at any level requires methodologies that can deal effectively with the unprecedented span of physicochemical properties of bio-macromolecules and their subcellular partitioning or localization. Progress in this field should therefore not be measured against the advances and relative technological ease of the nucleic acid sequencing field.

Poor Repetitive Identification of the Same Peptide

Typical MS-based workflows also suffer from poor reproducibility with respect to repeated analysis of the same peptides. This is due to technical issues with how most mass spectrometers select peptides for identification. When one wants to quantify multiple biological replicates, ideally the same set of compounds will be measured in each analysis. This is less of an issue at the protein level, since a given protein can be identified and quantified across samples even if a different set of peptides is used to do so. With respect to PTMs, however, exactly the same modified peptide must be measured each time to obtain data on that species. The nonrepetitive nature of peptide identification should in theory diminish as mass spectrometers become able to identify increasing numbers of compounds in a single analysis. However, in the case of phosphorylation (for example), there are likely on the order of 100,000 sites of phosphorylation in the mammalian cortex, and we are far from being able to identify even the majority of them.

Two related mass spectrometry approaches have the potential to address the issue of repetitive identification and quantification. The first is selected reaction monitoring (SRM), and the second is known as SWATH or MSE, depending on its exact implementation. SRM analysis involves targeted quantification assays on a predefined set of compounds and is implemented on a triple quadrupole mass spectrometer.155 For each peptide, it is necessary to know the peptide m/z value as well as the m/z values of the most abundant MS/MS fragment ions. For a single SRM assay, the mass spectrometer is set up to use the first quadrupole (Q1) as a mass filter, allowing transmission of only compounds with the targeted m/z value. The second quadrupole acts as an MS/MS collision cell, generating peptide-specific fragment ions. The third quadrupole (Q3) also acts as a mass filter, in this case allowing sequential transmission of only fragment ions with the appropriate m/z values. These mass spectrometers can iteratively cycle through sets of Q1/Q3 mass values on the order of milliseconds per pair, allowing a large set of such transitions to be measured during an LC-MS run. These instruments can quantify on the order of 1,000 peptides per analysis.

The second approach to repetitively quantifying the same set of peptides involves taking an MS scan to detect precursor ions and then either fragmenting all of the peptides simultaneously throughout the run (MSE) or sequentially fragmenting 50- to 100-Da windows throughout the run (SWATH). In such an approach, the fragment ions are measured with high mass accuracy, and the temporal profile of the fragment ions can be correlated with the temporal profile of the precursor ions to determine which fragment ions came from which precursor peptides. In this way almost every identifiable peptide will be sequenced by the mass spectrometer, and this can be done repetitively across replicate analyses.
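The Q1/Q3 pairs used in SRM can be computed directly from a peptide sequence. The sketch below is a simplified illustration rather than an instrument method: it builds transitions for the unmodified form of the peptide GLAGPTTVPATK from Figure 9.3 and shows the arithmetic behind cycle time. The choice of y ions, the doubly charged precursor, the 10 ms dwell time, and all function names are assumptions made for the example.

```python
# Monoisotopic residue masses (Da) for the amino acids in the example peptide.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(peptide, charge=2):
    """m/z of the intact peptide at the given charge state (the Q1 value)."""
    neutral_mass = sum(RESIDUE[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge


def y_ion_mz(peptide, k):
    """m/z of the singly charged y ion containing the last k residues (a Q3 value)."""
    return sum(RESIDUE[aa] for aa in peptide[-k:]) + WATER + PROTON


def srm_transitions(peptide, y_ions=(4, 5, 6), charge=2):
    """List of (Q1 m/z, Q3 m/z, fragment label) tuples for one peptide."""
    q1 = precursor_mz(peptide, charge)
    return [(round(q1, 4), round(y_ion_mz(peptide, k), 4), f"y{k}") for k in y_ions]


def cycle_time_ms(n_transitions, dwell_ms):
    """Time needed to step once through every transition."""
    return n_transitions * dwell_ms


for transition in srm_transitions("GLAGPTTVPATK"):
    print(transition)

# 1,000 peptides with three transitions each at an assumed 10 ms per transition.
print(cycle_time_ms(1000 * 3, 10), "ms per full cycle")
```

Under these assumed numbers, one pass through all 3,000 transitions would take 30 seconds, which illustrates why the time spent per Q1/Q3 pair limits how many peptides can be followed across a chromatographic peak.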

Sequence Isoforms of Proteins

A significant shortcoming in the proteomics field is the inability to deal effectively with splice isoforms and point mutations. In large part this derives from the experimental design, in which proteins are digested with trypsin to yield more readily sequenceable peptides. Assuming the database contains the relevant isoforms, peptides specific to a given exon can be matched, as can those that span two exons.156 However, the overall pattern of exon sequences for a given protein isoform cannot be unambiguously reconstructed if more than one isoform is likely to be present in the sample. The potential combinatorial complexity is highlighted by neuregulin1, which has at least 15 different alternative splice forms from multiple promoters.157 The existence of point mutations is less of a problem for most of the inbred model organisms in which neuroproteomics has been used to date, but it is potentially a significant issue for human samples. MS search engines attempt to match only those sequences present in a given database, so point mutations present in the sample population will not be positively identified. Customized databases in conjunction with updated search engines will contribute to more comprehensive coverage of the sequence isoforms of proteins. Parallel analysis of the same samples by RNA-Seq would address this shortcoming and is a compelling reason for the continued integration of proteomic and transcriptomic analysis.
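The reason a database search misses point mutations can be shown with an in silico digest. In the minimal sketch below, the reference sequence, the substitution, and the function names are all hypothetical; the trypsin rule is simplified (cleave after K or R unless followed by P), and peptides shorter than six residues are discarded as an assumed practical cutoff.

```python
import re


def tryptic_peptides(sequence, min_length=6):
    """Set of tryptic peptides of at least min_length residues."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_length}


def apply_substitution(sequence, position, old, new):
    """Return the sequence with a single residue replaced (1-based position)."""
    assert sequence[position - 1] == old, "reference residue mismatch"
    return sequence[:position - 1] + new + sequence[position:]


# Hypothetical reference protein and a D10N variant present in the sample.
reference = "MSEQLKAVTDLGSEERFNPLLDGAVRTYWQK"
variant = apply_substitution(reference, 10, "D", "N")

database = tryptic_peptides(reference)
sample = tryptic_peptides(variant)

# Peptides present in the sample that a reference-only search cannot match.
print(sample - database)  # {'AVTNLGSEER'}

# Adding the variant sequence (e.g., predicted from RNA-Seq) closes the gap.
custom_database = database | tryptic_peptides(variant)
print(sample - custom_database)  # set()
```

The variant peptide is absent from the reference-derived peptide set and so cannot be matched until the customized database includes the variant sequence.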


Relevance of Newly Detected PTM Sites

Improvements in the ability to enrich and characterize posttranslational modifications will provide further insights into nervous system function. Comparatively speaking, phosphoproteomics is a relatively mature field. While the number of PTMs in the mammalian central nervous system likely exceeds 100,000, it is unclear what the vast majority of these might be doing. With the realization that the number of phosphorylation sites in mammalian systems is so large, questions have arisen regarding whether all these sites have biological relevance. Perhaps certain sites represent “biochemical noise” as a consequence of kinases modifying proteins indiscriminately at low levels. This would not be surprising given that evolutionary processes are dynamic and ongoing. Even a phosphorylation site with an understood physiological function likely arose from a system in which the site lacked a well-defined function when it first became a kinase target. In this sense a large percentage of current phosphorylation sites may be in the process of either evolving functions or being selected out as sites of modification. Detailed follow-up experiments can be conducted for novel sites identified during proteomic studies, as was the case for serine-295 on PSD-95, where such experiments demonstrated that this site regulates the synaptic accumulation of PSD-95.158 Most likely such detailed investigations will be possible for only a small subset of sites. So far, investigations into these PTMs are typically limited to “present above detection threshold.” Quantitative proteomics approaches can be used to understand how each of these sites varies under a range of phenotypic conditions. Bioinformatic analysis of quantitative data will certainly lead to testable models.

The Relationships Between Protein-Modifying Enzymes and Their Substrates

Despite the ability to identify and quantify tens of thousands of sites of posttranslational modification, it is very difficult to determine the specific enzymes responsible for the addition and removal of individual PTMs. This task is especially complicated because of the redundancy with which enzymes target their substrates, particularly with respect to phosphorylation; it is a general problem in cell biology.

One can knock out, knock down, or inhibit a given kinase and quantify how individual phosphorylation sites are affected, but phosphorylation changes in substrates of the kinase may be compensated for by other kinases that also modify those substrates. In addition, decreases in phosphorylation at individual sites may not result directly from decreased activity of the kinase of interest but may instead occur because such targets are indirect and lie downstream in a signaling network. Chemical genetic approaches, such as the use of bulky ATP-γS analogs with analog-sensitive kinases, will allow researchers to directly link kinases with their specific substrates. These engineered proteins have a mutation in the “gatekeeper residue” that allows them to accommodate either an inhibitor or a bulky version of ATP (which has a chemical group attached to the nucleotide structure), neither of which can be utilized by WT kinases.127,128,159 In vitro kinase assays between a recombinantly expressed kinase and a target protein can be used to demonstrate that a given protein can potentially act as an in vivo substrate; however, it is difficult to demonstrate the specificity of this interaction. Recently mass spectrometry has been used to conduct large-scale quantitative measurements of the relative catalytic efficiencies of recombinant caspases acting on whole cell lysates.160 The premise of such an approach is that substrates with higher relative rates are more likely to be biologically relevant substrates of an enzyme.

Cell Type- or Circuit-Specific Neuroproteomics

Characterization of cell type-specific protein expression, as well as synapse-type protein expression, remains a major challenge with special relevance to neuroscience. Recent advances in genetic engineering have enabled targeted protein expression at a subset of synapses by cell type or activity (see also Chapter 11).161–163 A tagged version of the glutamate receptor GluD2 was expressed at the parallel fiber to Purkinje cell synapses in the cerebellum, allowing for purification of this distinct fraction of synapses.54 Expression of a tagged version of the AMPA receptor GluR1 under control of the c-fos promoter has allowed localization of newly synthesized receptors to active synapses.164 Future proteomic work will make use of such engineered animals, in which the expression of tagged molecules is restricted to the cell type or circuit of interest. The isolation of cell type-specific synaptic subpopulations would allow for a level of molecular resolution not possible with today’s approaches. A critical issue for such a series of experiments would be sample yield and the ability of mass spectrometers to characterize such samples.

Human Neuroproteomics

The availability of high-quality tissue samples will be the main limiting factor in conducting proteomic investigations directly on the normal and diseased human nervous system. Usually the investigation of an animal model will precede the analysis of human tissue and provide a well-founded working hypothesis. Certain proteins and PTMs will be more affected than others by the conditions surrounding tissue collection from the human nervous system, and only limited information is currently available on this topic. Tissue collection and storage conditions for human tissue samples have to be evaluated and optimized for subsequent proteomic analysis. In this regard, see the clinical proteomics guidelines at http://mcponline.org/content/7/11/2071 and references 164 and 165.

CONCLUSION AND SUMMARY

In recent years the application of mass spectrometry and proteomics to the study of the nervous system has resulted in significant insight into its function; these represent crucial enabling technologies for future investigations. These approaches are the main tool for studying many aspects of molecular function, including posttranslational modifications, quantitative protein expression, and the molecular composition of subcellular compartments. We will continue to see major advances both in the analytical capabilities of the mass spectrometers themselves and in the biochemical processing steps immediately prior to MS analysis. These improvements will push back the boundaries with respect to the types of experiments that become technically feasible. Specifically, large-scale analysis of multiple PTMs will become more common, as will the ability to profile essentially every protein in a sample using increasingly smaller amounts of sample.

ACKNOWLEDGMENTS

This work was supported by the Biotechnology and Biological Sciences Research Council (to R.S.) and by NIH NIGMS 8P41GM103481 and the Adelson Program in Neural Repair and Rehabilitation (to A.L.B.). J.C.T. was additionally supported by P50 GM081879 (to A.L.B., co-PI).

REFERENCES 1. Blackstock WP, & Weir MP. Proteomics:  quantitative and physical mapping of cellular proteins. Trends Biotechnol 1999;17(3):121–127. PMID: 10189717. 2. Anderson NL, & Anderson NG. Proteome and proteomics:  new technologies, new concepts, and new words. Electrophoresis 1998;19(11): 1853–1861. PMID: 9740045. 3. Schapira AH, Cooper JM, Dexter D, Clark JB, Jenner P, & Marsden CD. Mitochondrial complex I deficiency in Parkinson’s disease. J Neurochem 1990;54(3):823–827. PMID: 2154550. 4. Rahman S. Mitochondrial disease and epilepsy. Dev Med Child Neurol 2012;54(5):397–406. PMID: 22283595. 5. Sugino K, Hempel CM, Miller MN, Hattox AM, Shapiro P, Wu C, . . . Nelson SB. Molecular taxonomy of major neuronal classes in the adult mouse forebrain. Nat Neurosci 2006;9(1):99–107. PMID: 16369481. 6. Walker FO. Huntington’s disease. Lancet 2007;369(9557):218–228. PMID: 17240289. 7. Paulson HL, & Igo I. Genetics of dementia. Semin Neurol 2011;31(5):449–460. PMID: 22266883. 8. Fricker LD. Analysis of mouse brain peptides using mass spectrometry-based peptidomics:  implications for novel functions ranging from non-classical neuropeptides to microproteins. Mol Biosyst 2010;6(8):1355–1365. PMID: 20428524. 9. Van Eeckhaut A, Maes K, Aourz N, Smolders I, & Michotte Y. The absolute quantification of endogenous levels of brain neuropeptides in vivo using LC-MS/MS. Bioanalysis 2011;3(11): 1271–1285. PMID: 21649502. 10. Vetter I, Davis JL, Rash LD, Anangi R, Mobli M, Alewood PF, Lewis RJ, & King GF. Venomics: a new paradigm for natural products-based drug discovery. Amino Acids 2011;40(1):15–28. PMID: 20177945. 11. Aráoz R, Molgó J, & Tandeau de Marsac N. Neurotoxic cyanobacterial toxins. Toxicon 2010;56(5):813–828. PMID: 19660486. 12. Seeley EH, Schwamborn K, & Caprioli RM. Imaging of intact tissue sections: moving beyond the microscope. J Biol Chem 2011;286(29): 25459–25466. PMID: 21632549. 13. O’Farrell PH. High resolution two-dimensional electrophoresis of proteins. 
J Biol Chem 1975;250(10):4007–4021. PMID: 236308.


14. Klose J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues. A  novel approach to testing for induced point mutations in mammals. Humangenetik 1975;26(3):231–243. PMID: 1093965. 15. Hall SC, Smith DM, Masiarz FR, Soo VW, Tran HM, Epstein LB, & Burlingame AL. Mass spectrometric and Edman sequencing of lipocortin I  isolated by two-dimensional SDS/PAGE of human melanoma lysates. Proc Natl Acad Sci U S A 1993;90(5):1927–1931. PMID: 8446611. 16. Menon R, Omenn GS. Proteomic characterization of novel alternative splice variant proteins in human epidermal growth factor receptor 2/neu-induced breast cancers. Cancer Res 2010;70(9):3440–3449. PMID: 20388783. 17. Ghaemmaghami S, Huh W-K, Bower K, Howson RW, Belle A, Dephoure N, . . . Weissman JS. Global analysis of protein expression in yeast. Nature 2003;425(6959):737–741. PMID: 14562106. 18. Horvatovich P, Hoekman B, Govorukhina N, & Bischoff R. Multidimensional chromatography coupled to mass spectrometry in analysing complex proteomics samples. J Sep Sci 2010;33(10):1421–1437. PMID: 20486207. 19. Walsh CT. Posttranslational modification of proteins:  expanding nature’s inventory. Greenwood Village, CO: Roberts, 2006. 20. Hunter T. The age of crosstalk:  phosphorylation, ubiquitination, and beyond. Mol Cell 2007;28(5):730–738. PMID: 18082598. 21. Trinidad JC, Barkan DT, Gulledge BF, Thalhammer A, Sali A, Schoepfer R, & Burlingame AL. Global identification and characterization of both O-GlcNAcylation and phosphorylation at the murine synapse. Mol Cell Proteom 2012;11(8):215–229. PMID: 22645316. 22. Medzihradszky KF, Zhang X, Chalkley RJ, Guan S, McFarland MA, Chalmers MJ, . . . Burlingame AL. Characterization of Tetrahymena histone H2B variants and posttranslational populations by electron capture dissociation (ECD) Fourier transform ion cyclotron mass spectrometry (FT-ICR MS). Mol Cell Proteom 2004;3(9): 872–886. PMID: 15199121. 23. 
Tipton JD, Tran JC, Catherman AD, Ahlf DR, Durbin KR, & Kelleher NL. Analysis of intact protein isoforms by mass spectrometry. J Biol Chem 2011;286(29):25451–25458. PMID: 21632550. 24. Siuti N, Kelleher NL. Decoding protein modifications using top-down mass spectrometry. Nat Methods 2007;4(10):817–821. PMID: 17901871. 25. Eng JK, Searle BC, Clauser KR, & Tabb DL. A face in the crowd:  recognizing peptides through database search. Mol Cell Proteom 2011;10(11):R111.009522. PMID: 21876205.

26. Higgs RE, Knierman MD, Gelfanova V, Butler JP, & Hale JE. Comprehensive label-free method for the relative quantification of proteins from biological samples. J Proteom Res 2005;4(4):1442–1450. PMID: 16083298. 27. Cox J, & Mann M. Quantitative, high-resolution proteomics for data-driven systems biology. Annu Rev Biochem 2011;80:273–299. PMID: 21548781. 28. Spellman DS, Deinhardt K, Darie CC, Chao MV, & Neubert TA. Stable isotopic labeling by amino acids in cultured primary neurons: application to brain-derived neurotrophic factor-dependent phosphotyrosine-associated signaling. Mol Cell Proteom 2008;7(6):1067–1076. PMID: 18256212. 29. Liao L, Park SK, Xu T, Vanderklish P, & Yates JR III. Quantitative proteomic analysis of primary neurons reveals diverse changes in synaptic protein content in fmr1 knockout mice. Proc Natl Acad Sci U S A 2008;105(40):15281–15286. PMID: 18829439. 30. Thompson A, Schäfer J, Kuhn K, Kienle S, Schwarz J, Schmidt G, . . . Hamon C. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Anal Chem 2003;75(8):1895–1904. PMID: 12713048. 31. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, . . . Pappin DJ. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteom 2004;3(12):1154–1169. PMID: 15385600. 32. Trinidad JC, Thalhammer A, Specht CG, Lynn AJ, Baker PR, Schoepfer R, & Burlingame AL. Quantitative analysis of synaptic phosphorylation and protein expression. Mol Cell Proteom 2008;7(4):684–696. PMID: 18056256. 33. Trinidad JC, Thalhammer A, Burlingame AL, & Schoepfer R. Activity-dependent protein dynamics define interconnected cores of co-regulated postsynaptic proteins. Mol Cell Proteom 2013;12(1):29–41. PMID: 23035237. 34. Grønborg M, Pavlos NJ, Brunk I, Chua JJE, Münster-Wandowski A, Riedel D, . . . Jahn R.
Quantitative comparison of glutamatergic and GABAergic synaptic vesicles unveils selectivity for few proteins including MAL2, a novel synaptic vesicle protein. J Neurosci 2010;30(1):2–12. PMID: 20053882. 35. Michaelevski I, Medzihradszky KF, Lynn A, Burlingame AL, & Fainzilber M. Axonal transport proteomics reveals mobilization of translation machinery to the lesion site in injured sciatic nerve. Mol Cell Proteom 2010;9(5):976–987. PMID: 19955087.

36. Karp NA, Huber W, Sadowski PG, Charles PD, Hester SV, & Lilley KS. Addressing accuracy and precision issues in iTRAQ quantitation. Mol Cell Proteom 2010;9(9):1885–1897. PMID: 20382981. 37. Wu CC, MacCoss MJ, Howell KE, Matthews DE, & Yates JR III. Metabolic labeling of mammalian organisms with stable isotopes for quantitative proteomic analysis. Anal Chem 2004;76(17):4951–4959. PMID: 15373428. 38. McClatchy DB, Liao L, Park SK, Xu T, Lu B, & Yates JR III. Differential proteomic analysis of mammalian tissues using SILAM. PLoS ONE 2011;6(1):e16039. PMID: 21283754. 39. Price JC, Guan S, Burlingame A, Prusiner SB, & Ghaemmaghami S. Analysis of proteome dynamics in the mouse brain. Proc Natl Acad Sci U S A 2010;107(32):14508–14513. PMID: 20699386. 40. Guan S, Price JC, Prusiner SB, Ghaemmaghami S, & Burlingame AL. A data processing pipeline for mammalian proteome dynamics studies using stable isotope metabolic labeling. Mol Cell Proteom 2011;10(12):M111.010728. PMID: 21937731. 41. Krüger M, Moser M, Ussar S, Thievessen I, Luber CA, Forner F, . . . Mann M. SILAC mouse for quantitative proteomics uncovers kindlin-3 as an essential factor for red blood cell function. Cell 2008;134(2):353–364. PMID: 18662549. 42. Ishihama Y, Sato T, Tabata T, Miyamoto N, Sagane K, Nagasu T, & Oda Y. Quantitative mouse brain proteomics using culture-derived isotope tags as internal standards. Nat Biotechnol 2005;23(5):617–621. PMID: 15834404. 43. Seyfried NT, Gozal YM, Donovan LE, Herskowitz JH, Dammer EB, Xia Q, . . . Peng J. Quantitative analysis of the detergent-insoluble brain proteome in frontotemporal lobar degeneration using SILAC internal standards. J Proteom Res 2012;11(5):2721–2738. PMID: 22416763. 44. Makarov A, Denisov E, Lange O, & Horning S. Dynamic range of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J Am Soc Mass Spectrom 2006;17(7):977–982. PMID: 16750636. 45. Jordan BA, Fernholz BD, Boussac M, Xu C, Grigorean G, Ziff EB, & Neubert TA.
Identification and verification of novel rodent postsynaptic density proteins. Mol Cell Proteom 2004;3(9):857–871. PMID: 15169875. 46. Walikonis RS, Jensen ON, Mann M, Provance DW Jr, Mercer JA, & Kennedy MB. Identification of proteins in the postsynaptic density fraction by mass spectrometry. J Neurosci 2000;20(11):4069–4080. PMID: 10818142. 47. Bayés A, van de Lagemaat LN, Collins MO, Croning MDR, Whittle IR, Choudhary JS, & Grant SGN. Characterization of the proteome, diseases and evolution of the human postsynaptic density. Nat Neurosci 2011;14(1):19–21. PMID: 21170055. 48. Trinidad JC, Thalhammer A, Specht CG, Schoepfer R, & Burlingame AL. Phosphorylation state of postsynaptic density proteins. J Neurochem 2005;92(6):1306–1316. PMID: 15748150. 49. Fernández E, Collins MO, Uren RT, Kopanitsa MV, Komiyama NH, Croning MDR, . . . Grant SGN. Targeted tandem affinity purification of PSD-95 recovers core postsynaptic complexes and schizophrenia susceptibility proteins. Mol Syst Biol 2009;5:269. PMID: 19455133. 50. Peng J, Kim MJ, Cheng D, Duong DM, Gygi SP, & Sheng M. Semiquantitative proteomic analysis of rat forebrain postsynaptic density fractions by mass spectrometry. J Biol Chem 2004;279(20):21003–21011. PMID: 15020595. 51. Cheng D, Hoogenraad CC, Rush J, Ramm E, Schlager MA, Duong DM, . . . Peng J. Relative and absolute quantification of postsynaptic density proteome isolated from rat forebrain and cerebellum. Mol Cell Proteom 2006;5(6):1158–1170. PMID: 16507876. 52. Yun-Hong Y, Chih-Fan C, Chia-Wei C, & Yen-Chung C. A study of the spatial protein organization of the postsynaptic density isolated from porcine cerebral cortex and cerebellum. Mol Cell Proteom 2011;10(10):M110.007138. PMID: 21715321. 53. Uthaiah RC, & Hudspeth AJ. Molecular anatomy of the hair cell’s ribbon synapse. J Neurosci 2010;30(37):12387–12399. PMID: 20844134. 54. Selimi F, Cristea IM, Heller E, Chait BT, & Heintz N. Proteomic studies of a single CNS synapse type: the parallel fiber/purkinje cell synapse. PLoS Biol 2009;7(4):e83. PMID: 19402746. 55. Emes RD, Pocklington AJ, Anderson CNG, Bayes A, Collins MO, Vickers CA, . . . Grant SGN. Evolutionary expansion and anatomical specialization of synapse proteome complexity. Nat Neurosci 2008;11(7):799–806. PMID: 18536710. 56. Coba MP, Pocklington AJ, Collins MO, Kopanitsa MV, Uren RT, Swamy S, . . . Grant SGN. Neurotransmitters drive combinatorial multistate postsynaptic density networks. Sci Signal 2009;2(68):ra19. PMID: 19401593. 57. Munton RP, Tweedie-Cullen R, Livingstone-Zatchej M, Weinandy F, Waidelich M, Longo D, . . . Mansuy IM. Qualitative and quantitative analyses of protein phosphorylation in naive and stimulated mouse synaptosomal preparations. Mol Cell Proteom 2007;6(2):283–293. PMID: 17114649. 58. Dahlhaus M, Li KW, van der Schors RC, Saiepour MH, van Nierop P, Heimel JA, . . . Levelt CN. The synaptic proteome during development and plasticity of the mouse visual cortex. Mol Cell Proteom 2011;10(5):M110.005413. PMID: 21398567. 59. Zhang G, Neubert TA, & Jordan BA. RNA Binding proteins accumulate at the postsynaptic density with synaptic activity. J Neurosci 2012;32(2):599–609. 60. Reissner KJ, Uys JD, Schwacke JH, Comte-Walters S, Rutherford-Bethard JL, Dunn TE, . . . Kalivas PW. AKAP signaling in reinstated cocaine seeking revealed by iTRAQ proteomic analysis. J Neurosci 2011;31(15):5648–5658. PMID: 21490206. 61. Abul-Husn NS, Annangudi SP, Ma’ayan A, Ramos-Ortolaza DL, Stockton SD Jr, Gomes I, . . . Devi LA. Chronic morphine alters the presynaptic protein profile: identification of novel molecular targets using proteomics and network analysis. PLoS ONE 2011;6(10):e25535. PMID: 22043286. 62. Van den Oever MC, Goriounova NA, Li KW, Van der Schors RC, Binnekade R, Schoffelmeer ANM, . . . De Vries TJ. Prefrontal cortex AMPA receptor plasticity is crucial for cue-induced relapse to heroin-seeking. Nat Neurosci 2008;11(9):1053–1058. PMID: 19160503. 63. Klemmer P, Meredith RM, Holmgren CD, Klychnikov OI, Stahl-Zeng J, Loos M, . . . Li KW. Proteomics, ultrastructure, and physiology of hippocampal synapses in a fragile X syndrome mouse model reveal presynaptic phenotype. J Biol Chem 2011;286(29):25495–25504. PMID: 21596744. 64. Fernandez F, Trinidad JC, Blank M, Feng D-D, Burlingame AL, & Garner CC. Normal protein composition of synapses in Ts65Dn mice: a mouse model of Down syndrome. J Neurochem 2009;110(1):157–169. PMID: 19453946. 65. Morciano M, Burré J, Corvey C, Karas M, Zimmermann H, & Volknandt W. Immunoisolation of two synaptic vesicle pools from synaptosomes: a proteomics analysis. J Neurochem 2005;95(6):1732–1745. PMID: 16269012. 66. Takamori S, Holt M, Stenius K, Lemke EA, Grønborg M, Riedel D, . . . Jahn R. Molecular Anatomy of a Trafficking Organelle. Cell 2006;127(4):831–846. 67. Pavlos NJ, Grønborg M, Riedel D, Chua JJE, Boyken J, Kloepper TH, . . . Jahn R. Quantitative analysis of synaptic vesicle Rabs uncovers distinct yet overlapping roles for Rab3a and Rab27b in Ca2+-triggered exocytosis. J Neurosci 2010;30(40):13441–13453. 68. Yao J, Nowack A, Kensel-Hammes P, Gardner RG, & Bajjalieh SM. Cotrafficking of SV2 and synaptotagmin at the synapse. J Neurosci 2010;30(16):5569–5578.

69. Hintsch G, Zurlinden A, Meskenaite V, Steuble M, Fink-Widmer K, Kinter J, & Sonderegger P. The calsyntenins—a family of postsynaptic membrane proteins with distinct neuronal expression patterns. Mol Cell Neurosci 2002;21(3):393–409. PMID: 12498782. 70. Steuble M, Gerrits B, Ludwig A, Mateos JM, Diep T-M, Tagaya M, . . . Sonderegger P. Molecular characterization of a trafficking organelle:  dissecting the axonal paths of calsyntenin-1 transport vesicles. Proteomics 2010;10(21):3775–3788. PMID: 20925061. 71. Larocca JN, Norton WT. Isolation of myelin. Curr Protoc Cell Biol 2007 Jan; chapter  3:Unit 3.25. PMID: 18228513. 72. Jahn O, Tenzer S, & Werner HB. Myelin proteomics:  molecular anatomy of an insulating sheath. Mol Neurobiol 2009;40(1):55–72. PMID: 19452287. 73. Laursen LS, Chan CW, & ffrench-Constant C. An integrin–contactin complex regulates CNS myelination by differential Fyn phosphorylation. J Neurosci 2009;29(29):9174–9185. 74. Nozumi M, Togano T, Takahashi-Niki K, Lu J, Honda A, Taoka M, . . . Igarashi M. Identification of functional marker proteins in the mammalian growth cone. Proc Natl Acad Sci U S A 2009;106(40):17211–17216. PMID: 19805073. 75. Yamatani H, Kawasaki T, Mita S, Inagaki N, & Hirata T. Proteomics analysis of the temporal changes in axonal proteins during maturation. Dev Neurobiol 2010;70(7):523–537. PMID: 20225247. 76. Michaelevski I, Segal-Ruder Y, Rozenbaum M, Medzihradszky KF, Shalem O, Coppola G, . . . Fainzilber M. Signaling to transcription networks in the neuronal retrograde injury response. Sci Signal 2010;3(130):ra53. PMID: 20628157. 77. Gingras A-C, Aebersold R, & Raught B. Advances in protein complex analysis using mass spectrometry. J Physiol (Lond) 2005;563(Pt 1):11–21. PMID: 15611014. 78. Fang L, Kaake RM, Patel VR, Yang Y, Baldi P, & Huang L. Mapping the protein interaction network of the human COP9 signalosome complex using a label-free QTAX strategy. Mol Cell Proteom 2012;11(5):138–147. PMID: 22474085. 79. 
Chen GI, & Gingras A-C. Affinity-purification mass spectrometry (AP-MS) of serine/threonine phosphatases. Methods 2007;42(3):298–305. PMID: 17532517. 80. Chen GI, Tisayakorn S, Jorgensen C, D’Ambrosio LM, Goudreault M, & Gingras A-C. PP4R4/KIAA1622 forms a novel stable cytosolic complex with phosphoprotein phosphatase 4. J Biol Chem 2008;283(43):29273–29284. PMID: 18715871. 81. Tackett AJ, DeGrasse JA, Sekedat MD, Oeffinger M, Rout MP, & Chait BT. I-DIRT, a general method for distinguishing between specific and nonspecific protein interactions. J Proteom Res 2005;4(5):1752–1756. PMID: 16212429. 82. Sowa ME, Bennett EJ, Gygi SP, & Harper JW. Defining the human deubiquitinating enzyme interaction landscape. Cell 2009;138(2):389–403. PMID: 19615732. 83. Choi H, Larsen B, Lin Z-Y, Breitkreutz A, Mellacheruvu D, Fermin D, . . . Nesvizhskii AI. SAINT: probabilistic scoring of affinity purification-mass spectrometry data. Nat Methods 2011;8(1):70–73. PMID: 21131968. 84. Jäger S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, . . . Krogan NJ. Global landscape of HIV-human protein complexes. Nature 2012;481(7381):365–370. PMID: 22190034. 85. Gavin A-C, Bösche M, Krause R, Grandi P, Marzioch M, Bauer A, . . . Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415(6868):141–147. PMID: 11805826. 86. Ruotolo BT, Benesch JLP, Sandercock AM, Hyung S-J, & Robinson CV. Ion mobility-mass spectrometry analysis of large protein complexes. Nat Protoc 2008;3(7):1139–1152. PMID: 18600219. 87. Schwenk J, Harmel N, Brechet A, Zolles G, Berkefeld H, Müller CS, . . . Fakler B. High-resolution proteomics unravel architecture and molecular diversity of native AMPA receptor complexes. Neuron 2012;74(4):621–633. PMID: 22632720. 88. Husi H, Ward MA, Choudhary JS, Blackstock WP, & Grant SG. Proteomic analysis of NMDA receptor-adhesion protein signaling complexes. Nat Neurosci 2000;3(7):661–669. PMID: 10862698. 89. Farr CD, Gafken PR, Norbeck AD, Doneanu CE, Stapels MD, Barofsky DF, . . . Saugstad JA. Proteomic analysis of native metabotropic glutamate receptor 5 protein complexes reveals novel molecular constituents. J Neurochem 2004;91(2):438–450. PMID: 15447677. 90. Kabbani N, Woll MP, Levenson R, Lindstrom JM, & Changeux J-P. Intracellular complexes of the beta2 subunit of the nicotinic acetylcholine receptor in brain identified by proteomics. Proc Natl Acad Sci U S A 2007;104(51):20570–20575. PMID: 18077321. 91. Leonoudakis D, Conti LR, Radeke CM, McGuire LMM, & Vandenberg CA. A multiprotein trafficking complex composed of SAP97, CASK, Veli, and Mint1 is associated with inward rectifier Kir2 potassium channels. J Biol Chem 2004;279(18):19051–19063. PMID: 14960569. 92. Muller CS, Haupt A, Bildl W, Schindler J, Knaus HG, Meissner M, . . . Schulte U. Quantitative proteomics of the Cav2 channel nano-environments in the mammalian brain. Proc Natl Acad Sci U S A 2010;107:14950–14957.


93. Pratt JM, Simpson DM, Doherty MK, Rivers J, Gaskell SJ, & Beynon RJ. Multiplexed absolute quantification for proteomics using concatenated signature peptides encoded by QconCAT genes. Nat Protoc 2006;1(2):1029–1043. PMID: 17406340. 94. Baucum AJ II, Jalan-Sakrikar N, Jiao Y, Gustin RM, Carmody LC, Tabb DL, . . . Colbran RJ. Identification and validation of novel spinophilin-associated proteins in rodent striatum using an enhanced ex vivo shotgun proteomics approach. Mol Cell Proteom 2010;9(6):1243–1259. PMID: 20124353. 95. Zhang W, St-Gelais F, Grabner CP, Trinidad JC, Sumioka A, Morimoto-Tomita M, . . . Tomita S. A transmembrane accessory subunit that modulates kainate-type glutamate receptors. Neuron 2009;61(3):385–396. PMID: 19217376. 96. Norstrom EM, Zhang C, Tanzi R, & Sisodia SS. Identification of NEEP21 as a β-amyloid precursor protein-interacting protein in vivo that modulates amyloidogenic processing in vitro. J Neurosci 2010;30(46):15677–15685. 97. Fukata Y, Tzingounis AV, Trinidad JC, Fukata M, Burlingame AL, Nicoll RA, & Bredt DS. Molecular constituents of neuronal AMPA receptors. J Cell Biol 2005;169(3):399–404. PMID: 15883194. 98. Saneyoshi T, Wayman G, Fortin D, Davare M, Hoshi N, Nozaki N, . . . Soderling TR. Activity-dependent synaptogenesis: regulation by a CaM-kinase kinase/CaM-kinase I/betaPIX signaling complex. Neuron 2008;57(1):94–107. PMID: 18184567. 99. Gokhale A, Larimore J, Werner E, So L, Moreno-De-Luca A, Lese-Martin C, . . . Faundez V. Quantitative proteomic and genetic analyses of the schizophrenia susceptibility factor dysbindin identify novel roles of the biogenesis of lysosome-related organelles complex 1. J Neurosci 2012;32(11):3697–3711. 100. Evers DM, Matta JA, Hoe H-S, Zarkowsky D, Lee SH, Isaac JT, & Pak DTS. Plk2 attachment to NSF induces homeostatic removal of GluA2 during chronic overexcitation. Nat Neurosci 2010;13(10):1199–1207. PMID: 20802490. 101. Ko J, Fuccillo MV, Malenka RC, & Südhof TC.
LRRTM2 functions as a neurexin ligand in promoting excitatory synapse formation. Neuron 2009;64(6):791–798. PMID: 20064387. 102. O’Sullivan ML, de Wit J, Savas JN, Comoletti D, Otto-Hitt S, Yates JR III, & Ghosh A. FLRT proteins are endogenous latrophilin ligands and regulate excitatory synapse development. Neuron 2012;73(5):903–910. PMID: 22405201. 103. Thalhammer A, Trinidad JC, Burlingame AL, & Schoepfer R. Densin-180: revised membrane topology, domain structure and phosphorylation status. J Neurochem 2009;109(2):297–302. PMID: 19187442. 104. Salomon AR, Ficarro SB, Brill LM, Brinker A, Phung QT, Ericson C, . . . Peters EC. Profiling of tyrosine phosphorylation pathways in human cells using mass spectrometry. Proc Natl Acad Sci U S A 2003;100(2):443–448. PMID: 12522270. 105. Stokes MP, Rush J, Macneill J, Ren JM, Sprott K, Nardone J, . . . Comb MJ. Profiling of UV-induced ATM/ATR signaling pathways. Proc Natl Acad Sci U S A 2007;104(50):19855–19860. PMID: 18077418. 106. Bustos D, Bakalarski CE, Yang Y, Peng J, & Kirkpatrick DS. Characterizing ubiquitination sites by peptide based immunoaffinity enrichment. Mol Cell Proteom 2012;11(12):1529–1540. PMID: 22729469. 107. Wang Z, Udeshi ND, O’Malley M, Shabanowitz J, Hunt DF, & Hart GW. Enrichment and site mapping of O-linked N-acetylglucosamine by a combination of chemical/enzymatic tagging, photochemical cleavage, and electron transfer dissociation mass spectrometry. Mol Cell Proteom 2010;9(1):153–160. 108. Blagoev B, Kratchmarova I, Ong S-E, Nielsen M, Foster LJ, & Mann M. A proteomics strategy to elucidate functional protein-protein interactions applied to EGF signaling. Nat Biotechnol 2003;21(3):315–318. PMID: 12577067. 109. Vosseller K, Trinidad JC, Chalkley RJ, Specht CG, Thalhammer A, Lynn AJ, . . . Burlingame AL. O-linked N-acetylglucosamine proteomics of postsynaptic density preparations using lectin weak affinity chromatography and mass spectrometry. Mol Cell Proteom 2006;5(5):923–934. PMID: 16452088. 110. Bradshaw RA, Burlingame AL, Carr S, & Aebersold R. Reporting protein identification data: the next generation of guidelines. Mol Cell Proteom 2006;5(5):787–788. PMID: 16670253. 111. Beausoleil SA, Villén J, Gerber SA, Rush J, & Gygi SP. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol 2006;24(10):1285–1292. PMID: 16964243. 112. Baker PR, Trinidad JC, & Chalkley RJ. Modification site localization scoring integrated into a search engine. Mol Cell Proteom 2011;10(7):M111.008078. PMID: 21490164. 113. Savitski MM, Lemeer S, Boesche M, Lang M, Mathieson T, Bantscheff M, & Kuster B. Confident phosphorylation site localization using the Mascot Delta Score. Mol Cell Proteom 2011;10(2):M110.003830. PMID: 21057138.

114. Ruttenberg BE, Pisitkun T, Knepper MA, & Hoffert JD. PhosphoScore: an open-source phosphorylation site assignment tool for MSn data. J Proteom Res 2008;7(7):3054–3059. PMID: 18543960. 115. Cong X, Held JM, DeGiacomo F, Bonner A, Chen JM, Schilling B, . . . Ellerby LM. Mass spectrometric identification of novel lysine acetylation sites in huntingtin. Mol Cell Proteom 2011;10(10):M111.009829. PMID: 21685499. 116. Min S-W, Cho S-H, Zhou Y, Schroeder S, Haroutunian V, Seeley WW, . . . Gan L. Acetylation of tau inhibits its degradation and contributes to tauopathy. Neuron 2010;67(6):953–966. PMID: 20869593. 117. Berendt FJ, Park K-S, & Trimmer JS. Multisite phosphorylation of voltage-gated sodium channel α subunits from rat brain. J Proteom Res 2010;9(4):1976–1984. 118. Vacher H, Yang J-W, Cerda O, Autillo-Touati A, Dargent B, & Trimmer JS. Cdk-mediated phosphorylation of the Kvβ2 auxiliary subunit regulates Kv1 channel axonal targeting. J Cell Biol 2011;192(5):813–824. PMID: 21357749. 119. Zhou Z, Hong EJ, Cohen S, Zhao W-N, Ho H-YH, Schmidt L, . . . Greenberg ME. Brain-specific phosphorylation of MeCP2 regulates activity-dependent Bdnf transcription, dendritic growth, and spine maturation. Neuron 2006;52(2):255–269. PMID: 17046689. 120. Bianchetta MJ, Lam TT, Jones SN, & Morabito MA. Cyclin-dependent kinase 5 regulates PSD-95 ubiquitination in neurons. J Neurosci 2011;31(33):12029–12035. 121. Jaffe H, Vinade L, & Dosemeci A. Identification of novel phosphorylation sites on postsynaptic density proteins. Biochem Biophys Res Commun 2004;321(1):210–218. PMID: 15358237. 122. DeGiorgis JA, Jaffe H, Moreira JE, Carlotti CG Jr, Leite JP, Pant HC, & Dosemeci A. Phosphoproteomic analysis of synaptosomes from human cerebral cortex. J Proteom Res 2005;4(2):306–315. PMID: 15822905. 123. Tweedie-Cullen RY, Reck JM, & Mansuy IM. Comprehensive mapping of post-translational modifications on synaptic, nuclear, and histone proteins in the adult mouse brain.
J Proteom Res 2009;8(11):4966–4982. PMID: 19737024. 124. Trinidad JC, Specht CG, Thalhammer A, Schoepfer R, & Burlingame AL. Comprehensive identification of phosphorylation sites in postsynaptic density preparations. Mol Cell Proteom 2006;5(5):914–922. PMID: 16452087. 125. Zhang H, Guo T, Li X, Datta A, Park JE, Yang J, . . . Sze SK. Simultaneous characterization of glyco- and phosphoproteomes of mouse brain membrane proteome with electrostatic repulsion hydrophilic interaction chromatography. Mol Cell Proteom 2010;9(4):635–647. PMID: 20047950. 126. Shah K, Liu Y, Deirmengian C, & Shokat KM. Engineering unnatural nucleotide specificity for Rous sarcoma virus tyrosine kinase to uniquely label its direct substrates. Proc Natl Acad Sci U S A 1997;94(8):3565–3570. PMID: 9108016. 127. Hertz NT, Wang BT, Allen JJ, Zhang C, Dar AC, Burlingame AL, & Shokat KM. Chemical genetic approach for kinase-substrate mapping by covalent capture of thiophosphopeptides and analysis by mass spectrometry. Current protocols in chemical biology. Hoboken, NJ: Wiley; 2010;2(1):15–26. 128. Ultanir SK, Hertz NT, Li G, Ge W-P, Burlingame AL, Pleasure SJ, . . . Jan Y-N. Chemical genetic identification of NDR1/2 kinase substrates AAK1 and Rabin8 uncovers their roles in dendrite arborization and spine development. Neuron 2012;73(6):1127–1142. PMID: 22445341. 129. Edbauer D, Cheng D, Batterton MN, Wang C-F, Duong DM, Yaffe MB, . . . Sheng M. Identification and characterization of neuronal mitogen-activated protein kinase substrates using a specific phosphomotif antibody. Mol Cell Proteom 2009;8(4):681–695. PMID: 19054758. 130. Biarc J, Chalkley RJ, Burlingame AL, & Bradshaw RA. The induction of serine/threonine protein phosphorylations by a PDGFR/TrkA chimera in stably transfected PC12 cells. Mol Cell Proteom 2012;11(5):15–30. PMID: 22027198. 131. Amano M, Tsumura Y, Taki K, Harada H, Mori K, Nishioka T, . . . Kaibuchi K. A proteomic approach for comprehensively screening substrates of protein kinases such as Rho-kinase. PLoS ONE 2010;5(1):e8704. PMID: 20090853. 132. Hanover JA, Krause MW, & Love DC. The hexosamine signaling pathway: O-GlcNAc cycling in feast or famine. Biochim Biophys Acta 2010;1800(2):80–95. PMID: 19647043. 133. Hart GW, Slawson C, Ramirez-Correa G, & Lagerlof O. Cross talk between O-GlcNAcylation and phosphorylation: roles in signaling, transcription, and chronic disease. Annu Rev Biochem 2011;80(1):825–858. 134. Hanover JA, Krause MW, & Love DC. Bittersweet memories: linking metabolism to epigenetics through O-GlcNAcylation. Nat Rev Mol Cell Biol 2012;13(5):312–321. PMID: 22522719.


135. Syka JEP, Coon JJ, Schroeder MJ, Shabanowitz J, & Hunt DF. Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc Natl Acad Sci U S A 2004;101(26):9528–9533. PMID: 15210983. 136. Chalkley RJ, Thalhammer A, Schoepfer R, & Burlingame AL. Identification of protein O-GlcNAcylation sites using electron transfer dissociation mass spectrometry on native peptides. Proc Natl Acad Sci U S A 2009;106(22):8894–8899. PMID: 19458039. 137. Khidekel N, Arndt S, Lamarre-Vincent N, Lippert A, Poulin-Kerstien KG, Ramakrishnan B, . . . Hsieh-Wilson LC. A chemoenzymatic approach toward the rapid and sensitive detection of O-GlcNAc posttranslational modifications. J Am Chem Soc 2003;125(52):16162–16163. PMID: 14692737. 138. Khidekel N, Ficarro SB, Peters EC, & Hsieh-Wilson LC. Exploring the O-GlcNAc proteome: direct identification of O-GlcNAc-modified proteins from the brain. Proc Natl Acad Sci U S A 2004;101(36):13132–13137. PMID: 15340146. 139. Khidekel N, Ficarro SB, Clark PM, Bryan MC, Swaney DL, Rexach JE, . . . Hsieh-Wilson LC. Probing the dynamics of O-GlcNAc glycosylation in the brain using quantitative proteomics. Nat Chem Biol 2007;3(6):339–348. PMID: 17496889. 140. Rexach JE, Rogers CJ, Yu S-H, Tao J, Sun YE, & Hsieh-Wilson LC. Quantification of O-glycosylation stoichiometry and dynamics using resolvable mass tags. Nat Chem Biol 2010;6(9):645–651. PMID: 20657584. 141. Alfaro JF, Gong C-X, Monroe ME, Aldrich JT, Clauss TRW, Purvine SO, . . . Smith RD. Tandem mass spectrometry identifies many mouse brain O-GlcNAcylated proteins including EGF domain-specific O-GlcNAc transferase targets. Proc Natl Acad Sci U S A 2012;109(19):7280–7285. PMID: 22517741. 142. Nagata Y, & Burger MM. Wheat germ agglutinin. Molecular characteristics and specificity for sugar binding. J Biol Chem 1974;249(10):3116–3122. PMID: 4830237. 143. Luo Y, Blex C, Baessler O, Glinski M, Dreger M, Sefkow M, & Köster H.
The cAMP capture compound mass spectrometry as a novel tool for targeting cAMP-binding proteins: from protein kinase A to potassium/sodium hyperpolarization-activated cyclic nucleotide-gated channels. Mol Cell Proteom 2009;8(12):2843–2856. PMID: 19741253. 144. Kang R, Wan J, Arstikaitis P, Takahashi H, Huang K, Bailey AO, . . . El-Husseini A. Neural palmitoyl-proteomics reveals dynamic synaptic palmitoylation. Nature 2008;456(7224):904–909. PMID: 19092927. 145. Graham ME, Thaysen-Andersen M, Bache N, Craft GE, Larsen MR, Packer NH, & Robinson PJ. A Novel Post-translational modification in nerve terminals: O-linked N-acetylglucosamine phosphorylation. J Proteom Res 2011;10(6):2725–2733. 146. Murrey HE, Ficarro SB, Krishnamurthy C, Domino SE, Peters EC, & Hsieh-Wilson LC. Identification of the plasticity-relevant fucose-alpha(1-2)-galactose proteome from the mouse olfactory bulb. Biochemistry 2009;48(30):7261–7270. PMID: 19527073. 147. Poon HF, Vaishnav RA, Getchell TV, Getchell ML, & Butterfield DA. Quantitative proteomics analysis of differential protein expression and oxidative modification of specific proteins in the brains of old mice. Neurobiol Aging 2006;27(7):1010–1019. PMID: 15979213. 148. Schilling O, auf dem Keller U, & Overall CM. Protease specificity profiling by tandem mass spectrometry using proteome-derived peptide libraries. Methods Mol Biol 2011;753:257–272. PMID: 21604128. 149. Klaiman G, Petzke TL, Hammond J, & Leblanc AC. Targets of caspase-6 activity in human neurons and Alzheimer disease. Mol Cell Proteom 2008;7(8):1541–1555. PMID: 18487604. 150. Mahrus S, Trinidad JC, Barkan DT, Sali A, Burlingame AL, & Wells JA. Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini. Cell 2008;134(5):866–876. PMID: 18722006. 151. Franco M, Seyfried NT, Brand AH, Peng J, & Mayor U. A novel strategy to isolate ubiquitin conjugates reveals wide role for ubiquitination during neural development. Mol Cell Proteom 2011;10(5):M110.002188. PMID: 20861518. 152. Xu G, Paige JS, & Jaffrey SR. Global analysis of lysine ubiquitination by ubiquitin remnant immunoaffinity profiling. Nat Biotechnol 2010;28(8):868–873. PMID: 20639865. 153. Matic I, van Hagen M, Schimmel J, Macek B, Ogg SC, Tatham MH, . . . Vertegaal ACO. In vivo identification of human small ubiquitin-like modifier polymerization sites by high accuracy mass spectrometry and an in vitro to in vivo strategy. Mol Cell Proteom 2008;7(1):132–144. PMID: 17938407. 154. Trinidad JC, Schoepfer R, Burlingame AL, Medzihradszky KF. N- and O-glycosylation in the murine synaptosome. Mol Cell Proteom 2013. PMID: 23816992.

155. Lange V, Picotti P, Domon B, & Aebersold R. Selected reaction monitoring for quantitative proteomics: a tutorial. Mol Syst Biol 2008;4:222. PMID: 18854821.
156. Power KA, McRedmond JP, de Stefani A, Gallagher WM, & Gaora PO. High-throughput proteomics detection of novel splice isoforms in human platelets. PLoS ONE 2009;4(3):e5001. PMID: 19308253.
157. Falls DL. Neuregulins: functions, forms, and signaling strategies. Exp Cell Res 2003;284(1):14–30. PMID: 12648463.
158. Kim MJ, Futai K, Jo J, Hayashi Y, Cho K, & Sheng M. Synaptic accumulation of PSD-95 and synaptic function regulated by phosphorylation of serine-295 of PSD-95. Neuron 2007;56(3):488–502. PMID: 17988632.
159. Blethrow JD, Glavy JS, Morgan DO, & Shokat KM. Covalent capture of kinase-specific phosphopeptides reveals Cdk1-cyclin B substrates. Proc Natl Acad Sci U S A 2008;105(5):1442–1447. PMID: 18234856.
160. Agard NJ, Mahrus S, Trinidad JC, Lynn A, Burlingame AL, Wells JA. Global kinetic analysis of proteolysis via quantitative targeted proteomics. Proc Natl Acad Sci U S A 2012;109(6):1913–1918. PMID: 22308409.
161. Gong S, Doughty M, Harbaugh CR, Cummins A, Hatten ME, Heintz N, & Gerfen CR. Targeting Cre recombinase to specific neuron populations with bacterial artificial chromosome constructs. J Neurosci 2007;27(37):9817–9823. PMID: 17855595.
162. Filosa A, Paixão S, Honsek SD, Carmona MA, Becker L, Feddersen B, . . . Klein R. Neuron-glia communication via EphA4/ephrin-A3 modulates LTP through glial glutamate transport. Nat Neurosci 2009;12(10):1285–1292. PMID: 19734893.
163. Reijmers L, & Mayford M. Genetic control of active neural circuits. Front Mol Neurosci 2009;2:27. PMID: 20057936.
164. Matsuo N, Reijmers L, & Mayford M. Spine-type-specific recruitment of newly synthesized AMPA receptors with learning. Science 2008;319(5866):1104–1107. PMID: 18292343.
165. Alzate OO (Ed). Neuroproteomics. Boca Raton, FL: CRC Press; 2010. PMID: 21882445.
166. Oka T, Tagawa K, Ito H, & Okazawa H. Dynamic changes of the phosphoproteome in postmortem mouse brains. PLoS One 2011;6(6):e21405. PMID: 21731734.
167. Medzihradszky KF. Peptide sequence analysis. Meth Enzymol 2005;402:209–244. PMID: 16401511.

10 Focused Plasma Proteomics for the Study of Brain Aging and Neurodegeneration
PHILIPP A. JAEGER, SAUL A. VILLEDA, DANIELA BERDNIK, MARKUS BRITSCHGI, AND TONY WYSS-CORAY

INTRODUCTION

In the past, the adult brain was considered an immune-privileged organ protected by a tight blood-brain barrier. Now, a flurry of evidence supports a more sophisticated interaction between the central nervous system (CNS) and the systemic environment, including the immune system and the blood. Systemic immune cells and secreted signaling proteins communicate with the CNS and have been associated not only with neuroinflammation but also with neurodegenerative processes in general (Britschgi & Wyss-Coray 2007; Czirr & Wyss-Coray 2012). As a remarkable example, physical exercise, even during midlife, can delay or alleviate age-related cognitive impairments or symptoms associated with neurodegeneration in humans (Aberg et al. 2012; Lista & Sorrentino 2009; Sofi et al. 2010), and similar benefits have been observed in rodents (Lista & Sorrentino 2009; van Praag et al. 1999). A large meta-analysis of prospective studies including over 150,000 subjects concluded that physical exercise is inversely correlated with dementia (Hamer & Chida 2009). In a Swedish study of more than a million military conscripts, higher physical fitness at age 18 was associated with enhanced cognition in general as well as with reduced depression (a major risk factor for dementia) later in life (Aberg et al. 2012). Likewise, changes made to the systemic milieu through calorie restriction or diet exert beneficial effects in various animal models of neurological disease and aging (Maalouf et al. 2009).

While some of these interactions between the systemic environment and the CNS may involve cells entering the nervous tissue, it is likely that many more are mediated by soluble signaling molecules. These molecules may communicate with the adult brain through the vasculature, or they may enter the brain via endothelial transport mechanisms, specialized openings (e.g., circumventricular organs), or an injured or aged blood-brain barrier. Changes in the levels of systemic secreted signaling molecules associated with CNS disease may thus be derived from neural tissue itself or from the CNS vasculature, or they may result from systemic sources. Many of the molecular mechanisms underlying this communication between CNS and periphery remain to be characterized.

Using heterochronic parabiosis, we showed recently that blood-borne factors present in the systemic milieu can inhibit or promote adult neurogenesis in an age-dependent fashion in mice (Villeda et al. 2011). Accordingly, exposing a young mouse to an old systemic environment or to plasma from old mice decreased synaptic plasticity and impaired contextual fear conditioning as well as spatial learning and memory. Conversely, a young circulatory environment increased neurogenesis in old mice. In support of a beneficial role of the young systemic environment in brain injury, Ruckh and colleagues showed recently that heterochronic parabiosis can restore regeneration of the aging brain following experimental demyelination (Ruckh et al. 2012). Together, these studies support the concept that factors in the systemic environment are not only indicative of the state of the brain but can modulate it as well. We describe here proteomic approaches that take advantage of this observation to identify proteins indicative of and/or involved in brain aging and neurodegeneration.


PLASMA PROTEOMICS FOR THE STUDY OF BRAIN DISEASES

Brain diseases typically cannot be studied at the molecular level in living individuals owing to the inaccessibility of CNS tissue. This makes it particularly difficult to understand sporadic psychiatric or neurodegenerative diseases with no strong genetic component or etiology. To overcome this challenge, scientists have tried to discover molecular or cellular changes in blood associated with such diseases (reviewed in Lista et al. 2012). Plasma, the soluble fraction of blood, is a highly complex mixture of proteins and lipids, and various proteomic methods have been used to identify blood-based biomarkers.

The simplest approaches use antibody-based assays to measure individual soluble proteins in hypothesis-driven experiments. Typically such proteins have been cytokines, trophic factors, or other factors we would consider part of the communicome. For example, levels of brain-derived neurotrophic factor (BDNF) in blood were measured in schizophrenia and depression and found to be reduced with disease, to correlate with hippocampal volume, and to predict clinical outcome following drug treatment (Green et al. 2011; Kurita et al. 2012; Lee et al. 2007; Martinotti et al. 2012). In a progressively less biased fashion, multiplex ELISAs that detect many secreted signaling proteins at once (up to 200 such communication factors in some studies) have shown promise in identifying potential biomarkers of disease. In this fashion, multiplex antibody assays have detected significant changes in plasma cytokine and chemokine levels in patients with presymptomatic Huntington’s disease (HD) and in mouse models of the disease (Björkqvist et al. 2008; Wild et al. 2011). Of these, CCL11 plasma levels were reported to correlate positively with disease progression in close to 200 patients and healthy controls (Wild et al. 2011), supporting the role of this chemokine in CNS dysfunction reported by our group (Villeda et al. 2011). Likewise, panels of proteins have been found to be associated with depressive symptoms in older adults (Domenici et al. 2010). Multiplex studies in plasma from patients with Alzheimer’s disease (AD) described protein signatures that may be specific to prodromal stages of the disease (Hu et al. 2012) or that characterize patients who progress from a prodromal stage to AD (Ray et al. 2007). Other signatures appear to correlate with the APOE genotype (Soares et al. 2012) or with pathological changes such as Aβ and tau protein levels in the CSF of AD patients (Britschgi et al. 2011). Close to 200 communicome proteins were measured in plasma samples from patients participating in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), yielding protein signatures that correlated with conversion from mild cognitive impairment to AD (Johnstone et al. 2012). A possibly more mechanistic study used the same patient samples and protein measurements to identify proteins associated with amyloid burden in the brains of these patients (Kiddle et al. 2012). While there is some general overlap between these studies (e.g., apoE, complement), most signatures have not been validated and their biological significance is unclear (see the section titled Challenges and Opportunities further on).

More sophisticated and unbiased methods to study the plasma proteome use mass spectrometry, often in combination with initial fractionation or selection steps such as 2D gel electrophoresis, chromatography, or antibodies. For instance, analysis of proteins punched out from 2D gels by matrix-assisted laser desorption ionization mass spectrometry (MALDI-MS) found reduced levels of plasma apolipoprotein A-I (ApoA-I) to be associated with treatment-resistant schizophrenia (La et al. 2007). A similar approach found complement factor H, alpha2-macroglobulin, and other proteins to be associated with AD (Hye et al. 2006). The combination of MS-based “shotgun” approaches and sequence database searching has become a frequent method of choice for the identification of peptides and the mapping of proteomes in large sample sets. For instance, surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) MS identified proteolytic fragments of complement factor C3 in serum as potential biomarkers of autism (Momeni et al. 2012a). Subsequently, complement factor I, which regulates C3 cleavage, was found to be increased in the plasma of autistic children (Momeni et al. 2012b). However, mass spectrometry has its limitations with regard to maintaining integrity and detectability for all proteins in the sample, and it requires substantial computational power (Nesvizhskii & Aebersold 2005; Shteynberg et al. 2011).

THE COMMUNICOME: A REDUCTIONIST APPROACH TO THE STUDY OF BRAIN AGING AND DISEASE

Concept

While the idea of measuring soluble immune molecules such as cytokines in plasma or other body fluids from patients is not new, we extended this concept to try to measure all proteins that serve as communication factors between cells. We dubbed this subgroup of the proteome the “communicome” (Ray et al. 2007) and proposed that it captures the essence of cellular communication between tissues in physiological and pathophysiological states (Figure 10.1). With this reductionist approach, rather than interrogating the entire transcriptome or proteome, the communicome focuses on hundreds of plasma proteins for which specific antibodies are commercially available. Whereas transcriptomics tries to understand how the cell responds to environmental stimuli by studying the expression levels of every gene, we propose that it may be sufficient, or perhaps even more informative, to know how the cell integrates these signals. The biocomputational output of the cell is manifested by changes in morphology, movements, and, to a large extent, by the secretion of defined messenger molecules that help the cell communicate with its environment. These communication factors include any secreted proteins that bind to cellular receptors as well as secreted receptors or soluble binding proteins that regulate ligand binding. Plasma is easily and repeatedly accessible in living individuals over time, and the plasma communicome may thus be highly useful and informative in monitoring and understanding complex diseases. Others have made a convincing case for studying plasma as a window onto every tissue in an organism; indeed, a significant number of clinical biomarkers take advantage of this fact (Anderson 2010). As we discuss in the subsequent examples, changes in the plasma communicome seem to differ between normal aging and disease and do not merely reflect a generalized and common inflammatory process.

Example 1: Normal Aging

As mentioned earlier, changes made to the aging systemic milieu through exercise, caloric restriction, or parabiosis have proven potent approaches to enhance the regenerative potential of the adult brain, benefit cognition, and delay or revert dementia (Lista & Sorrentino 2009; Maalouf et al. 2009; Ruckh et al. 2012; Villeda et al. 2011). Such findings demonstrate the relevance that molecular changes in the periphery may have for aging and degeneration in the CNS. Together, they led us to formulate the following hypothesis: Molecular changes in the periphery, specifically in blood plasma, can help us identify novel signaling pathways that contribute to brain aging. To test this hypothesis and identify critical signaling pathways involved in brain aging, we measured changes in secreted intercellular signaling proteins in plasma and correlated them with cellular and molecular changes in the brain (Villeda et al. 2011). Using multiplexed antibody-based assays, we measured levels of close to 70 secreted signaling proteins including cytokines, chemokines, growth factors, complement proteins, and other proteins involved in intercellular communication in normal aging and heterochronic parabiosis between 2- and 18-month-old mice. Neurogenesis decreased precipitously with normal aging, and heterochronic parabiosis for 5 weeks reduced neurogenesis in young parabionts exposed to an old systemic environment. These studies identified a set of six signaling proteins (Figure 10.1C) strongly correlated with the loss of neurogenesis in normal aging and sufficient to inhibit neurogenesis in young healthy mice. Furthermore, we identified the chemokine CCL11/eotaxin as a key age-related systemic mediator associated with reduced neurogenesis and cognitive impairment. Systemic administration of CCL11 was sufficient to mimic these age-related changes, and injection of a CCL11 neutralizing antibody abrogated the effect in mice. Interestingly, CCL11 is increased not only with aging but also with obesity, an important risk factor for cognitive impairment, and it decreases in obese patients following exercise (Choi et al. 2007; Kim et al. 2011).
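The overlap between the two screens in Figure 10.1C amounts to a set intersection, which can be sketched in a few lines of Python. This is an illustration only: the assignment of each factor to the aging-only, shared, or parabiosis-only group is read from the panel labels of Figure 10.1C and is an assumption about the figure layout, and the variable names are hypothetical ("B2M" and "IL-1a" stand in for β2M and IL-1α).

```python
# Factors read from Figure 10.1C. The grouping below is an assumption
# based on the figure layout, not a table published by the authors.
aging_only = {"CXCL10", "SGOT", "Osteopontin", "CXCL2", "vWF", "CCL7",
              "CCL9", "TIMP-1", "XCL1", "Leptin", "CCL22"}
shared = {"CCL2", "CCL11", "CCL12", "CCL19", "Haptoglobin", "B2M"}
parabiosis_only = {"CXCL6", "CXCL1", "IL-11", "IL-1a", "IL-5", "IL-7",
                   "CCL4", "Myoglobin", "MPO"}

# Hits of each screen: factors that rose with normal aging, and factors
# that rose in young heterochronic versus young isochronic parabionts.
aging_hits = aging_only | shared
parabiosis_hits = parabiosis_only | shared

# Candidate mediators are the factors found in both screens.
candidates = aging_hits & parabiosis_hits

print(len(aging_hits), len(parabiosis_hits), sorted(candidates))
```

Under this grouping the set sizes agree with the figure legend: 17 factors for normal aging, 15 for parabiosis, and 6 in the intersection, including CCL11.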

FIGURE 10.1: The cellular communicome. (A) Cells in their environment are exposed to a multitude of local environmental stimuli, many of which trigger cell surface receptors. The cell’s response to these stimuli is reflected in a transcriptional program that may contain thousands of mRNA species and an order of magnitude more protein species. As a major part of the response to its environment, the cell secretes signaling proteins with high information content. The relatively small number of these specialized proteins represents the key mode of communication between cells, and we named it the communicome. (B) Plasma communication factors (double arrows) carry information between peripheral tissues and the CNS via blood. (C) Venn diagram outlining the results from the normal aging and parabiosis proteomic screens. The 17 blood-borne factors whose levels increased with aging and correlated most strongly with the age-related decline in neurogenesis are shown on the left; the 15 blood-borne factors that increased between young isochronic and young heterochronic parabionts are shown on the right. The intersection shows six factors that might have a role in the age-related decline in neurogenesis and cognitive function.
[Figure 10.1 panel labels. (A) Environmental stimuli; cell; transcriptome (10^4–10^5); proteome (10^5–10^6); signaling proteins; communicome (10^2–10^3). (B) Communication factors linking blood cells, adipose tissue, nervous tissue, liver, endothelium, endocrine tissue, and kidney. (C) Normal aging only: CXCL10, SGOT, osteopontin, CXCL2, vWF, CCL7, CCL9, TIMP-1, XCL1, leptin, CCL22. Intersection: CCL2, CCL11, CCL12, CCL19, haptoglobin, β2M. Parabiosis only: CXCL6, CXCL1, IL-11, IL-1α, IL-5, IL-7, CCL4, myoglobin, MPO.]

Example 2: Alzheimer’s Disease

Earlier studies from our lab show that expression levels of cellular signaling proteins in Alzheimer’s disease (AD) plasma are distinctly different from those of controls (Ray et al. 2007). We measured 120 cytokines, chemokines, growth factors, and related communication factors using filter-based arrayed sandwich ELISAs in plasma from patients with mild to moderate AD and from age-matched nondemented controls. Statistical analysis led to the identification of 18 proteins that classified a blinded set of samples with high accuracy and predicted conversion from presymptomatic state to disease. Unfortunately, translating the findings to a clinically useful platform has been challenging, and while some of the findings were replicated (Britschgi et al. 2011), other groups failed to observe similar differences (see discussion further on). Possible reasons include but are not limited to variability between study centers, a difference in age between cases and controls, and the use of an early experimental platform.

Nevertheless, the set of 18 markers we derived through our unbiased analysis included multiple molecules that were previously associated with AD or that have since been shown to have a possible role in the disease. Proteins that modulate AD-like disease in mice include TNFα, CSF3 (G-CSF), and CSF1 (M-CSF); several others are closely related to these or other proteins implicated in AD (e.g., MCP-3, ICAM1, IL-1α, and so on). Furthermore, in a separate study using an independent set of plasma and CSF samples, we measured 90 communicome proteins with a Luminex platform at a commercial contract lab and used a novel bioinformatics approach to predict pathological parameters in AD using the CSF or plasma markers as variables. Only 6 of the 18 proteins that were part of the signature of Ray and colleagues were detectable with the Luminex platform; intriguingly, however, all six proteins were selected to model AD pathology. Of these, CSF1 may be particularly interesting, as it is reduced in AD plasma, and treatment of AD model mice with CSF1 by the Rivest group (Boissonneault et al. 2008) and our lab (Luo et al. 2013) resulted in prevention or partial reversal of disease. We were able to show that CSF1 receptors are expressed not only in microglia in the brain but also in injured neurons and that deletion of the neuronal CSF1R renders mice more susceptible to neurodegeneration and death (Luo et al. 2013). Together, these examples illustrate how changes in the plasma communicome with aging or neurodegeneration can yield new insight into biological and pathophysiological processes relevant to disease. Notably, in these studies we measured only a small fraction of the proteins that make up the communicome. More advanced arrays described in the following text should further improve the utility of this approach.

METHODS FOR MEASURING THE COMMUNICOME

The communicome has not been measured in its entirety, but various tools exist to measure parts of it: commercially available platforms currently detect up to 200 secreted signaling proteins with multiplexed bead-based ELISAs (Luminex), or around 500 such proteins with printed microarrays of antibodies. Multiplexed proximity ligation assays using antibodies and sophisticated amplification techniques have recently been described to measure 74 communicome proteins with high sensitivity in as little as 1 μL of human plasma (Lundberg et al. 2011).

We have collected more than 600 antibodies directed against different human secreted signaling proteins, including cytokines, chemokines, growth factors, complement proteins, and other potential communicome factors (Figure 10.2A). We spot five replicates per antibody as well as several positive and negative controls in a stereotyped print pattern onto SuperEpoxy glass slides (Figure 10.2B) using a NanoPrint LM210 array printer fitted with 16 SMP4B pins (Arrayit, Sunnyvale, CA). In a test and validation experiment for our array technology, we used 96 normal human plasma samples (Figure 10.2C). These samples were diluted and dialyzed (96-well Dispodialyzer/5kDa, Harvard Apparatus, Holliston, MA), biotinylated on primary amino groups (NHS SulfoBiotin, Thermo Scientific, Rockford, IL), and incubated with blocked antibody arrays. After multiple washing steps, antibody-bound proteins were detected using Alexa555-conjugated streptavidin (Invitrogen, Carlsbad, CA) and a GenePix4400A scanner coupled to an automated GenePixSL50 slide loader (Molecular Devices, Sunnyvale, CA).

Individual array spots were background subtracted locally, and the mean-intensity raw data were calculated from the five replicates for each antibody. Negative spot intensities were set to the half-minimal detection limit or 1, whichever was greater, and flagged as “at detection limit.” Antibodies with more than 55% of spots at the detection limit were removed from the dataset, yielding a total of 582 antibodies and excluding only about 3% of our collection. For general array quality control, we generate heat maps of mean background-subtracted raw data and also plot the coefficient of variation (CV), expressed as a percentage, for this dataset. Log2 transformation and iterative row- and column-wise mean centering and normalization are performed and, after Z-scoring, the data can be analyzed using a variety of statistical tools (Figure 10.2D‒G). Ongoing studies in the lab are currently testing the utility of these arrays for the detection of disease-relevant changes in AD and other dementias, and preliminary results are indeed promising.
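The processing steps just described can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' software. The function names (`preprocess`, `split_half_correlation`), the array layout (antibodies × samples × replicates), the number of centering iterations, and the reading of "half-minimal detection limit" as half of the smallest positive intensity are all hypothetical. Zero intensities are floored together with negative ones so that the log2 transform is defined. The split-half correlation mirrors the reproducibility check on means of random subsets of samples. All input data below are simulated.

```python
import numpy as np

def preprocess(raw, max_fraction_at_limit=0.55, n_iter=10):
    """raw: background-subtracted intensities, shape (antibodies, samples, replicates)."""
    raw = np.asarray(raw, dtype=float)
    # Flag non-positive spots as "at detection limit" and floor them.
    # Assumption: the floor is half the smallest positive intensity, or 1 if greater.
    at_limit = raw <= 0
    floor = max(0.5 * raw[~at_limit].min(), 1.0)
    spots = np.where(at_limit, floor, raw)
    # Remove antibodies with too many spots at the detection limit.
    keep = at_limit.mean(axis=(1, 2)) <= max_fraction_at_limit
    spots = spots[keep]
    # Mean and coefficient of variation (in percent) across replicate spots.
    mean = spots.mean(axis=2)
    cv = 100.0 * spots.std(axis=2, ddof=1) / mean
    log_means = np.log2(mean)
    # Iterative row-wise (antibody) and column-wise (array) centering and scaling.
    x = log_means.copy()
    for _ in range(n_iter):
        x -= x.mean(axis=1, keepdims=True)
        x -= x.mean(axis=0, keepdims=True)
        x /= x.std(axis=1, ddof=1, keepdims=True)
        x /= x.std(axis=0, ddof=1, keepdims=True)
    # Final Z-score per antibody.
    z = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)
    return z, log_means, cv, keep

def split_half_correlation(log_means, rng):
    """Pearson correlation of per-antibody means between two random halves of the samples."""
    n = log_means.shape[1]
    order = rng.permutation(n)
    a = log_means[:, order[: n // 2]].mean(axis=1)
    b = log_means[:, order[n // 2:]].mean(axis=1)
    return np.corrcoef(a, b)[0, 1]

# Simulated example: 50 antibodies, 20 samples, 5 replicate spots each.
rng = np.random.default_rng(0)
baseline = rng.lognormal(mean=7.0, sigma=1.0, size=(50, 1, 1))
raw = baseline * rng.normal(loc=1.0, scale=0.05, size=(50, 20, 5))
raw[0] = -1.0          # antibody with no signal: every spot at the detection limit
raw[1, :, :2] = -5.0   # antibody with 40% of spots at the limit: retained

z, log_means, cv, keep = preprocess(raw)
rho = split_half_correlation(log_means, rng)
print(z.shape, int(keep.sum()), round(float(np.median(cv)), 2), round(float(rho), 3))
```

In this simulation the first antibody is dropped by the 55% rule and the second is kept with its floored spots. On real arrays the filtering threshold and the number of normalization iterations would need to be chosen and checked against the quality-control plots.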

FIGURE 10.2: Assessment of array raw data quality. In an array validation study we tested 83 human plasma samples on an array with 582 different antibodies. (A) Example of a raw data image from our antibody array. Features are printed in five replicates each. More than 1,000 antibodies can be spotted onto a single epoxy slide. (B) Magnification of (A) illustrating spotting precision. Raw data were extracted from images, background subtracted, mean and standard deviation were calculated across replicate spots, and the data were log2 transformed (C). (D) Histogram of the log2 raw data showed minimal skewness of the distribution (arrow) or technical artifacts (arrowhead). Coefficient of variation (CV) analysis highlights the consistency of the printing process (E). Median CV was 1.7% and 90th percentile CV was 7.4%. Data reproducibility from slide to slide was high, with a Pearson correlation of 0.972 for means of random subsets of the data (F). Log2 transformed data were then centered (to reduce array and antibody intensity fluctuations), normalized (to adjust array and antibody variances), and Z-scored to yield a normally distributed data set ready for analysis (G).
[Figure 10.2 panel titles and axes. (C) Mean log2 transformed raw array data: antibodies by samples, color scale intensity. (D) Raw data distribution: frequency versus intensity, raw data with normal fit. (E) CV distribution: frequency versus CV (%), q50 = 1.7, q90 = 7.4. (F) Data correlation: samples #1–41 versus samples #42–83, rho = 0.972. (G) Z-scored data distribution: frequency versus Z-score, Z-scored data with normal fit.]

CHALLENGES AND OPPORTUNITIES

Neurodegenerative diseases such as AD develop over years or decades before clinical symptoms manifest. Molecular biomarkers that correlate with the disease process and are detectable in blood would be a desirable detection and screening tool if validated, but attempts to discover such markers have met numerous challenges. These include but are not limited to small sample sizes, age and sex differences in case-control studies, differences in sample collection across centers, unstable experimental detection platforms, and overfitting of the data to derive statistical models (Anderson et al. 2012; Lista et al. 2012; Mitchell 2010). Another critical problem with past studies of AD was their reliance on patients who already had the disease. Such patients frequently have many comorbidities, they are medicated with current AD and other symptomatic treatments, and their disease may have progressed to a stage where massive neuronal loss and neuroinflammation can obscure more causative or disease-specific changes. New molecular imaging tools that allow for the detection of amyloid in living individuals and measurements of β-amyloid and tau protein in CSF are good predictors of the risk of developing AD (reviewed in Fagan et al. 2009; Jack 2012; Mori et al. 2012). The most powerful experimental setup in the search for disease-relevant systemic protein changes would thus be to analyze samples from longitudinal studies that track healthy individuals as they accumulate amyloid and develop AD. In this direction, a recent study taking advantage of patient data collected for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) described 13 proteins out of 190 plasma proteins measured (mostly communicome proteins) to be associated with brain amyloid levels in AD patients (Kiddle et al. 2012). Some of these proteins have previously been implicated in AD, and it will be interesting to see whether this signature will be replicated.

To solve the technical challenges of detecting communicome or other disease-relevant proteins in plasma more reliably, blood collection and plasma preparation procedures need to be optimized and standardized, sample numbers need to be increased, and findings need to be validated using independent methods (Lista et al. 2012). In this respect the use of antibody arrays provides an additional challenge, as antibodies from different vendors are frequently not characterized carefully: there are batch-to-batch variations, and many vendors do not share the source and specificity of antibodies used in multiplex panels. Because most secreted factors are posttranslationally processed and exist in different isoforms, antibodies against the same protein may show increased or decreased levels in disease depending on which isoform is detected. Careful annotation of antibody specificities may help reconcile apparently discordant findings in the literature and could help the field achieve reproducible findings. Biological validation of protein hits by linking their function to the disease process could also help to increase the likelihood that findings from unbiased discovery experiments will be replicated in independent studies.
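The overfitting problem listed among the challenges can be made concrete with a small simulation. The sketch below is a hypothetical illustration, not an analysis from any of the cited studies. It uses pure-noise data (40 samples and 120 "proteins", with 18 selected; these numbers are chosen only for illustration) and a simple nearest-centroid classifier, and all function names are invented for this example. It compares two leave-one-out estimates of accuracy. In the first, the proteins are selected once on all samples. In the second, selection is repeated inside each cross-validation fold.

```python
import numpy as np

def top_features(X, y, k):
    """Indices of the k features with the largest absolute standardized mean difference."""
    a, b = X[y == 0], X[y == 1]
    diff = a.mean(axis=0) - b.mean(axis=0)
    pooled_sd = np.sqrt(0.5 * (a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)))
    return np.argsort(-np.abs(diff / pooled_sd))[:k]

def nearest_centroid_predict(X_train, y_train, x):
    """Assign x to the class whose training centroid is closer (squared Euclidean distance)."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.sum((x - c1) ** 2) < np.sum((x - c0) ** 2))

def loocv_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy, selecting features inside each fold or once on all data."""
    n = len(y)
    global_features = top_features(X, y, k)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i
        if select_inside:
            features = top_features(X[train], y[train], k)
        else:
            features = global_features  # leaks information from the held-out sample
        pred = nearest_centroid_predict(X[train][:, features], y[train], X[i, features])
        correct += int(pred == y[i])
    return correct / n

# Pure noise: no protein carries any information about the class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 120))
y = np.repeat([0, 1], 20)

leaky = loocv_accuracy(X, y, k=18, select_inside=False)
nested = loocv_accuracy(X, y, k=18, select_inside=True)
print(f"selection on all samples: {leaky:.2f}; selection inside each fold: {nested:.2f}")
```

Because the data contain no signal, an honest estimate should sit near chance. Selecting features on all samples before cross-validation typically yields a much higher apparent accuracy, which is one reason independent validation cohorts are needed.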


CONCLUSION

A growing number of experimental and epidemiological studies on aging and neurological diseases support the concept that changes in plasma proteins are associated with physiological and pathological changes in the CNS. Whether such changes will have enough sensitivity and specificity to enable discrimination among different neurological diseases or even to serve as prognostic tools remains to be seen. Numerous challenges related to sample number, specimen preparation, and assay reliability need to be overcome. We believe a reductionist approach that focuses on the proteome of key biological communication factors will be useful in identifying novel disease pathways and possibly in the development of future disease biomarkers.

REFERENCES

Aberg, M.A.I., Waern, M., Nyberg, J., Pedersen, N. L., Bergh, Y., Aberg, N. D., . . . Torén, K. (2012). Cardiovascular fitness in males at age 18 and risk of serious depression in adulthood: Swedish prospective population-based study. Br J Psychiatry 201(5), 352–359.
Anderson, N. L. (2010). The clinical plasma proteome: a survey of clinical assays for proteins in plasma and serum. Clin Chem 56, 177–185.
Anderson, N. L., Ptolemy, A. S., & Rifai, N. (2012). The riddle of protein diagnostics: future bleak or bright? Clin Chem 59(1), 194–197.
Björkqvist, M., Wild, E. J., Thiele, J., Silvestroni, A., Andre, R., Lahiri, N., . . . et al. (2008). A novel pathogenic pathway of immune activation detectable before clinical onset in Huntington’s disease. J Exp Med 205, 1869–1877.
Boissonneault, V., Filali, M., Lessard, M., Relton, J., Wong, G., & Rivest, S. (2008). Powerful beneficial effects of macrophage colony-stimulating factor on β-amyloid deposition and cognitive impairment in Alzheimer’s disease. Brain 132, 1078–1092.
Britschgi, M., & Wyss-Coray, T. (2007). Systemic and acquired immune responses in Alzheimer’s disease. Int Rev Neurobiol 82, 205–233.
Britschgi, M., Rufibach, K., Huang, S.L.B., Clark, C. M., Kaye, J. A., Li, G., . . . Wyss-Coray, T. (2011). Modeling of pathological traits in Alzheimer’s disease based on systemic extracellular signaling proteome. Mol Cell Proteom 10, M111.008862.
Choi, K. M., Kim, J. H., Cho, G. J., Baik, S. H., Park, H. S., & Kim, S. M. (2007). Effect of exercise training on plasma visfatin and eotaxin levels. Eur J Endocrinol 157, 437–442.
Czirr, E., & Wyss-Coray, T. (2012). The immunology of neurodegeneration. J Clin Invest 122, 1156–1163.


Domenici, E., Wille, D. R., Tozzi, F., Prokopenko, I., Miller, S., McKeown, A., . . . et al. (2010). Plasma protein biomarkers for depression and schizophrenia by multi analyte profiling of case-control collections. PLoS ONE 5, e9166.
Fagan, A. M., Mintun, M. A., Shah, A. R., Aldea, P., Roe, C. M., Mach, R. H., . . . Holtzman, D.M. (2009). Cerebrospinal fluid tau and ptau(181) increase with cortical amyloid deposition in cognitively normal individuals: implications for future clinical trials of Alzheimer’s disease. EMBO Mol Med 1, 371–380.
Green, M. J., Matheson, S. L., Shepherd, A., Weickert, C. S., & Carr, V. J. (2011). Brain-derived neurotrophic factor levels in schizophrenia: a systematic review with meta-analysis. Mol Psychiatry 16, 960–972.
Hamer, M., & Chida, Y. (2009). Physical activity and risk of neurodegenerative disease: a systematic review of prospective evidence. Psychol Med 39, 3–11.
Hu, W. T., Holtzman, D. M., Fagan, A. M., Shaw, L. M., Perrin, R., Arnold, S. E., . . . et al. (2012). Plasma multianalyte profiling in mild cognitive impairment and Alzheimer disease. Neurology 79, 897–905.
Hye, A., Lynham, S., Thambisetty, M., Causevic, M., Campbell, J., Byers, H. L., . . . et al. (2006). Proteome-based plasma biomarkers for Alzheimer’s disease. Brain 129, 3042–3050.
Jack, C. R. (2012). Alzheimer disease: new concepts on its neurobiology and the clinical role imaging will play. Radiology 263, 344–361.
Johnstone, D., Milward, E. A., Berretta, R., & Moscato, P., for the Alzheimer’s Disease Neuroimaging Initiative (2012). Multivariate protein signatures of pre-clinical Alzheimer’s disease in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) plasma proteome dataset. PLoS ONE 7, e34341.
Kiddle, S. J., Thambisetty, M., Simmons, A., Riddoch-Contreras, J., Hye, A., Westman, E., . . . et al. (2012). Plasma based markers of [11C] PiB-PET brain amyloid burden. PLoS ONE 7, e44260.
Kim, H.-J., Kim, C.-H., Lee, D.-H., Han, M.-W., Kim, M.-Y., . . . Do, M.-S. (2011). Expression of eotaxin in 3T3-L1 adipocytes and the effects of weight loss in high-fat diet induced obese mice. Nutr Res Pract 5, 11–19.
Kurita, M., Nishino, S., Kato, M., Numata, Y., & Sato, T. (2012). Plasma brain-derived neurotrophic factor levels predict the clinical outcome of depression treatment in a naturalistic study. PLoS ONE 7, e39212.
La, Y. J., Wan, C. L., Zhu, H., Yang, Y. F., Chen, Y. S., Pan, Y. X., Feng, G. Y., & He, L. (2007). Decreased levels of apolipoprotein A-I in plasma of schizophrenic patients. J Neural Transm 114, 657–663.
Lee, B. H., Kim, H., Park, S. H., & Kim, Y. K. (2007). Decreased plasma BDNF level in depressive patients. J Affect Disord 101, 239–244.
Lista, I., & Sorrentino, G. (2009). Biological mechanisms of physical activity in preventing cognitive decline. Cell Mol Neurobiol 30, 493–503.
Lista, S., Faltraco, F., & Hampel, H. (2012). Biological and methodical challenges of blood-based proteomics in the field of neurological research. Prog Neurobiol 1–17.
Lundberg, M., Thorsen, S. B., Assarsson, E., Villablanca, A., Tran, B., Gee, N., . . . et al. (2011). Multiplexed homogeneous proximity ligation assays for high-throughput protein biomarker research in serological material. Mol Cell Proteom 10, M110.004978.
Luo, J., Elwood, F., Villeda, S., Zhang, H., Ding, Z., Liyinn, Z., . . . et al. (2013). Colony-stimulating factor 1 receptor (CSF1R) signaling in injured neurons facilitates protection and survival. J Exp Med 210(1), 157–172. doi: 10.1084/jem.20120412. Epub 2013 Jan 7.
Maalouf, M., Rho, J. M., & Mattson, M. P. (2009). The neuroprotective properties of calorie restriction, the ketogenic diet, and ketone bodies. Brain Res Rev 59, 293–315.
Martinotti, G., Di Iorio, G., Marini, S., Ricci, V., De Berardis, D., & Di Giannantonio, M. (2012). Nerve growth factor and brain-derived neurotrophic factor concentrations in schizophrenia: a review. J Biol Regul Homeost Agents 26, 347–356.
Mitchell, P. (2010). Proteomics retrenches. Nat Biotechnol 28, 665–670.
Momeni, N., Bergquist, J., Brudin, L., Behnia, F., Sivberg, B., Joghataei, M. T., & Persson, B. L. (2012a). A novel blood-based biomarker for detection of autism spectrum disorders. Transl Psychiatry 2, e91.
Momeni, N., Brudin, L., Behnia, F., Nordstrom, B., Yosefi-Oudarji, A., Sivberg, B., . . . Persson, B.L. (2012b). High complement factor I activity in the plasma of children with autism spectrum disorders. Autism Res Treat 2012, 868576.
Mori, T., Maeda, J., Shimada, H., Higuchi, M., Shinotoh, H., Ueno, S.-I., & Suhara, T. (2012). Molecular imaging of dementia. Psychogeriatrics 12, 106–114.
Nesvizhskii, A. I., & Aebersold, R. (2005). Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteom 4, 1419–1440.
Ray, S., Britschgi, M., Herbert, C., Takeda-Uchimura, Y., Boxer, A., Blennow, K., Friedman, L. F., . . . et al. (2007). Classification and prediction of clinical Alzheimer’s diagnosis based on plasma signaling proteins. Nat Med 13, 1359–1362.
Ruckh, J. M., Zhao, J.-W., Shadrach, J.L., van Wijngaarden, P., Rao, T.N., Wagers, A.J., & Franklin, R.J.M. (2012). Rejuvenation of regeneration in the aging central nervous system. Cell Stem Cell 10, 96–103.
Shteynberg, D., Deutsch, E. W., Lam, H., Eng, J. K., Sun, Z., Tasman, N., . . . Nesvizhskii, A. I. (2011). iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol Cell Proteom 10, M111.007690.
Soares, H. D., Potter, W. Z., Pickering, E., Kuhn, M., Immermann, F. W., Shera, D. M., . . . et al. (2012). Plasma biomarkers associated with the apolipoprotein E genotype and Alzheimer disease. Arch Neurol 69(10), 1310–1317.

191

Sofi, F., Valecchi, D., Bacci, D., Abbate, R., Gensini, G. F., Casini, A., & Macchi, C. (2010). Physical activity and risk of cognitive decline:  a meta-analysis of prospective studies. J Intern Med 269, 107–117. van Praag, H., Kempermann, G., & Gage, F. H. (1999). Running increases cell proliferation and neurogenesis in the adult mouse dentate gyrus. Nat Neurosci 2, 266–270. Villeda, S. A., Luo, J., Mosher, K. I., Zou, B., Britschgi, M., Bieri, G., . . . et al. (2011). The ageing systemic milieu negatively regulates neurogenesis and cognitive function. Nature 477, 90–94. Wild, E., Magnusson, A., Lahiri, N., Krus, U., Orth, M., Tabrizi, S. J., & Björkqvist, M. (2011). Abnormal peripheral chemokine profile in Huntington’s disease. PLoS Curr 3, RRN1231.

PART IV CELLS AND CONNECTIONS

11 Cellomics: Characterization of Neural Subtypes by High-Throughput Methods and Transgenic Mouse Models

JOSEPH DOUGHERTY

INTRODUCTION

Since the birth of neuroanatomy, it has been recognized that the nervous system exhibits a most remarkable cellular diversity (Ramón y Cajal et al. 1899; Sotelo, 2003). Although most cells can be broadly categorized as neurons or glia, there is a dizzying array of diverse cellular morphologies within these categories, particularly for neurons. The classic Golgi technique for sparse labeling of random neurons shows that even within a relatively homogeneous structure such as the cerebellum, there exist at least a dozen different neuronal cell types distinguishable by cell size, location, and the length and complexity of dendritic arbors or axonal projections (Figure 11.1). Many of the fundamental questions of neuroscience are centered on this diversity of form, and presumably function, within the nervous system. What is the purpose of this remarkable diversity? How does it arise, evolutionarily and ontogenetically? How does this contribute to the computational capacity of the brain? Which cell types are essential for which behaviors? And, importantly, what changes, and in which cell types, lead to neurological disorder?

The purpose of this chapter is to discuss current methods with which to study these cell types within the natural context of the brain in comprehensive and high-throughput manners. The chapter first addresses the definition of cell types and then the approaches for targeting them genetically. Then it touches on the “omics”-level approaches available to investigate cell types, ending with a particular focus on the molecular characterization of cell type by transcriptome profiling.

DEFINITIONS OF CELL TYPE

Historically, neuroscientists have focused on several classical approaches for defining cell types: morphological, physiological, functional, and molecular (Box 11.1). In an ideal taxonomy, each of these four levels of definition would be redundant and thus mutually interchangeable. For example, every Purkinje neuron would express exactly the same suite of genes (morphological = molecular), have the same peak firing rate and membrane capacitance (morphological = physiological), and have the same impact on the circuit when deleted (morphological = functional). We know that this is not always the case. For example, there are two classes of unipolar brush cells that are thus far morphologically indistinguishable yet can be distinguished by molecular markers (Nunzi et al. 2002). There are subsets of Purkinje neurons that express zebrin (Leclerc et al. 1992) or tyrosine hydroxylase (Takada et al. 1993). And the relationship between physiological classes of cortical interneurons and available molecular markers is particularly complex (Ascoli et al. 2008; Yuste, 2005). However, this likely reflects our own imperfect characterization of these cells across multiple levels of investigation.

The Centrality of Gene Expression to Definitions of Cell Type

To quote the American architect Louis Henry Sullivan, “it is the pervading law of all things organic and inorganic . . . that form ever follows function.” This maxim, usually paraphrased as “Form follows function,” states that buildings should be structured to suit the roles for which they are intended. This maxim can be applied


FIGURE 11.1: Illustration of the morphologically diverse cell types present in a single CNS structure: the cerebellar folia. Reprinted with permission. Santiago Ramón y Cajal. Legado Cajal. Instituto Cajal (CSIC). Madrid (Spain).

to cell types as well: The form and physiology of each cell type have been carefully tuned to its particular function in the nervous system. However, in biology, adaptation is mediated not by an architect but by gene expression. Thus, to a biologist, the phrase would be: “Physiology, function, and form follow gene expression.”

The brain is a biological structure, tuned by evolution to input sensory information and output behavior. The embodiment of evolution is the cell, but the substrate of evolution is the inherited material, the DNA. To evolve a new cell type, DNA must change to generate new genes, or alter the regulation of existing genes, to provide a set of coherent specifications for the new cell type. While every cell has an essentially identical copy of the genome, each cell type has to be tuned to express the set of genes required for its particular functions at appropriate levels and times. A neuron with a large, elaborate dendritic arbor, like a Purkinje neuron, will need to generate more postsynaptic proteins, dendritic microtubules, and ribosomes. A fast-spiking neuron will need to express more proteins for rapid repolarization of the membrane or the buffering of intracellular calcium, or a unique set of channels. Neurons that use particular transmitters need to generate enzymes for transmitter synthesis and loading into vesicles. And all cell types will need to express a unique set of receptors to respond to the local and distal cues important for their particular functional roles in the calculus of the brain. Therefore, while cell types may be operationally defined by morphology, physiology, or function, all of those unique features must be mediated by the gene expression of that cell. Thus the molecular definition of cell type (Box 11.1), when comprehensive enough to be genomic, will allow a final taxonomy of cells, even before the relationship between particular genes and morphology or physiology is understood. However, the classic molecular methods described in Box 11.1 are not comprehensive


BOX 11.1

CLASSICAL APPROACHES TO THE DEFINITION OF CELL TYPES

Morphological/Anatomical

The earliest definitions of cell type were strictly morphological or anatomical (Ramón y Cajal et al. 1899). Cells were defined by location and shape of the cell body, orientation and elaboration of the dendritic arbor, and the projection pattern of the axons. These early descriptive studies were fundamental to understanding the basic texture of the nervous system, and early inferences from morphology regarding development, information flow, and computational processes of the central nervous system were surprisingly insightful. However, rigorous, comprehensive classification schemes and robust quantification were not the focus of these early studies. Many clear cell types are apparent from these early investigations. For example, Purkinje neurons are unambiguously distinct from other cell types. Yet other gradations made by Cajal and compatriots, such as the distinction between cerebellar Golgi neurons of long axon and Golgi neurons of short axon, are less obviously distinct classifications.

Functional

Perhaps the most pragmatic manner to define a cell type is functional. A cell type serves as a particular computational node within a circuit, with the ultimate purpose of all circuits in the brain being to mediate behaviors. Thus, a cell type could be defined by its computational role in a circuit and its ultimate influence on an animal’s capacity for a certain behavior. Early investigations involved the physical ablation (Lashley 1930) or stimulation (Penfield & Rasmussen 1950) of entire regions followed by behavioral analysis, permitting connections between anatomical and functional levels of analysis. With time and technical development, the resolution of these approaches gradually progressed. Stereotactic targeting of smaller and smaller structures became practical. Excitotoxic agents permitted the ablation of cell bodies rather than traversing fibers within a region. Pharmacological agents permitted the selective manipulation of cells expressing just a particular receptor, linking functional and molecular levels of analysis to some extent. Finally, in the modern era, genetic approaches permit cell-specific targeting of specific transgenic constructs for functional manipulation of cell types (Boyden et al. 2005), allowing for definitions of cells important for behavioral functions in an intact animal, ranging from sleep and memory consolidation (Rolls et al. 2011) and feeding (Domingos et al. 2011) to fear conditioning (Haubensak et al. 2010; Letzkus et al. 2011).

Physiological

With the development of electrophysiological techniques it became possible to define a cell type by a coherent and consistent set of physiological properties, such as firing rates and patterns, membrane capacitance, afterhyperpolarization, and electrical response to pharmacological agents. Data collection in the early studies was typically blind to any morphological or molecular features, and there were sometimes quite surprising divergences between these different levels of definition. In the modern era, improvements in microscopy, particularly the advent of calcium dyes and two-photon imaging, allow some combination of morphological, physiological, and molecular investigation. The development of transgenic mouse lines expressing GFP to molecularly define cell types has also been a boon to prospective physiological analysis of cell types, especially those that are present at low frequency in a tissue.

Molecular

Finally, the development of early fluorescent microscopic techniques and of molecular methods such as immunohistochemistry and in situ hybridization led to the recognition that even morphologically identical neurons may have somewhat distinct molecular composition (Coons et al. 1941, 1950; Hyden & McEwen, 1966); methods were thus developed that permit neurons to be classified by neurotransmitter phenotype (Hillarp et al. 1966; Saito et al. 1974), receptors for drugs or neurochemicals (Roth & Barlow 1961), or expression of particular calcium-binding proteins (Celio & Heizmann 1981; Celio & Norman 1985; Hyden & McEwen 1966). Again, these molecular features sometimes showed direct correspondence with morphological or physiological criteria and sometimes did not, most notoriously in the case of interneurons, where the relationship between particular markers and physiology is quite complicated (Yuste 2005). It is also clear that some molecular markers are more indicative of a particular state of a neuron than of the presumably more stable trait of cell type. The classic example is the expression of an immediate early gene, such as cFos, thought to correspond to a recent bolus of activity in a neuron. Finally, although we refer to molecular approaches rather broadly here, with current technology the most scalable and comprehensive will be those based on nucleic acid, such as measurement of transcript levels in particular cell types.

and can be conducted only postmortem. Approaching a true molecular taxonomy of cell type requires prospective and comprehensive methods of analysis.

TARGETING CELL TYPES FOR PROSPECTIVE ANALYSIS

While early methods for studying specific molecules in a cell, such as immunohistochemistry, collectively allowed the study of specific cell types post hoc, they did not directly permit in vivo observations of the cell types of interest, nor did they provide access for experimental manipulation. However, all cellular macromolecules are either direct or indirect products of genes. Thus the cell-specific localization of a molecule often indicates the cell-specific expression of certain genes. Therefore a particular gene’s genomic regulatory information can be coopted to drive the expression of foreign transgenes in specific cell types. A variety of methods now exist for doing just this, as summarized in Box 11.2. Many of these methods are applicable across a variety of species, although this chapter focuses on applications in mice, the mammalian model organism most amenable to genetic manipulation. Although mouse genetics is not as high-throughput as readily scalable molecular methodologies such as sequencing or microarrays, several projects have systematically generated and characterized a large number of mouse lines targeting a variety of transgenes to specific cell types (Gong et al. 2007; Heintz 2004; Madisen et al. 2010; Portales-Casamar et al. 2010; Smedley et al. 2011; Taniguchi et al. 2011). These various transgenes (Box 11.3) facilitate the direct observation and, in some cases, manipulation of the targeted cell types. These techniques have been a boon to the parallel physiological, functional, and molecular characterization of the same cells. There now exist relatively efficient means of generating targeting constructs for transgenesis that are scalable for high-throughput approaches and potentially automation (Gong et al. 2010; Poser et al. 2008). However, the rate-limiting step will remain the actual generation and husbandry of the mouse lines. For some reagents, such as BAC transgenics, there is some potential for multiplexing constructs in vivo into a single locus, stably inherited, with independent regulation (Figure 11.2), although this could improve efficiency to only a limited extent.

The Limitations of Genomic Information as a Method for Targeting Cell Types

Unfortunately for the neuroscientist, nature did not necessarily provide us with a single uniquely expressed gene for every cell type. Even long-established drivers—such as Pcp2


BOX 11.2

GENETIC TOOLS TO TARGET SPECIFIC CELL TYPES

Genomes contain regulatory sequences, such as promoters, enhancers, and repressors, that collectively direct the expression of the surrounding genes, often with exquisite temporal and spatial selectivity, or in response to particular cellular states such as sustained neuronal depolarization. Promoters and enhancers are both thought to be elements that increase the expression of the adjacent gene in appropriate contexts, while repressors are elements that suppress the expression of the gene in inappropriate contexts. Promoters are found at the start site of transcription, while enhancers can be found nearly anywhere, including in introns, or even many kilobases from the closest exons. The current model is that the activity of these regulatory sequences is mediated by the presence of transcription factor binding sites and corresponding epigenetic modifications. While there are clear examples of genes in which these elements have been studied in great detail (Johansson et al. 2002; Oberdick et al. 1990), for the vast majority of genes these features are inferred from sequence or epigenetic marks (Muers, 2011; Myers et al. 2011) but are functionally uncharacterized. Regardless, these genomic regions can be coopted to drive the expression of particular transgenes.

Transgenesis

Transgenesis is the insertion of a distant or even exogenous gene into a new genomic context. In many species, including mouse, this can be accomplished by the injection of DNA fragments containing the gene into a fertilized oocyte (Gordon et al. 1980; Jaenisch 1976; Palmiter & Brinster 1986). With some low but experimentally tractable frequency, this DNA will be integrated into the oocyte’s genomic DNA at one or more random locations. The fragments typically integrate as tandem repeated copies, often hundreds of copies in length (Chandler et al. 2007; Palmiter & Brinster 1986). If the DNA integrates early enough in development to contribute to the germline, mice derived from these oocytes can transmit this transgenic locus to their progeny, establishing a transgenic line. The random locus of integration of the transgene is both a strength and a weakness. If the transgenes, by chance, land in a locus containing a cell-specific promoter, these transgenes will demonstrate specific and heritable cell type-specific expression patterns. However, most frequently the injected fragment of DNA will contain a small (0.5- to 5-kb) promoter or enhancer upstream of the transgene. Depending on the locus of integration, this small regulatory element may be sufficient to direct cell-specific expression in some mouse lines, such as in the examples of the PCP2 promoter (Oberdick et al. 1990) and Nestin enhancers (Johansson et al. 2002). In other cases, the locus of integration interacts with the promoter to generate experimentally useful (but irreproducible) patterns of expression. For example, the small and fairly ubiquitous neuronal promoter Thy-1 was utilized to generate a range of transgenic lines, each having a heritable pattern of expression in selective subsets of neurons (Feng et al. 2000).

Pros:

• Faster and cheaper than knockins
• Less sophisticated molecular skills needed to generate construct than BAC transgenesis
• Higher copy number can lead to high transgene expression

Cons:

• Locus of integration effects more common than BACs
• Retargeting the same cell type difficult in some cases
• Transgenerational silencing
• Unintended disruption of a random gene at the locus of integration

BAC TRANSGENESIS

There are two primary limitations of small promoters in transgenesis. First, for many genes (and therefore many cell types), the appropriate promoter and enhancer regions for directing cell-specific expression have not been characterized. Second, those promoters that are characterized are often strongly influenced by locus of integration effects. BAC transgenesis was developed to circumvent these difficulties (Yang et al. 1997). Bacterial Artificial Chromosomes (BACs) are large 100- to 200-kb fragments of mouse or human genomic DNA, maintained in bacteria. As these originally served as a mechanism to fragment genomes into manageable sizes for genome sequencing projects, large libraries of BACs exist, tiling essentially the entire mouse genome. These fragments are thought to be large enough to contain most if not all of the enhancer, repressor, and promoter elements that direct cell-specific expression of a gene, even if the individual elements are undefined. These BACs can be readily modified in bacteria, utilizing recombination-based methodologies (Gong et al. 2002; Hollenback et al. 2011; Poser et al. 2008; Yang et al. 1997), to insert transgenes at the translation start site of cell-specific “driver” genes. The modified BAC is then utilized for transgenesis, creating a novel genomic context for these transgenes. Performance is dependent on the BAC utilized, but overall, BACs have been shown to accurately and reproducibly target a wide range of cell types in the CNS (Heintz 2004).

Pros:

• Multiple copies can insert (higher potential expression level than knockins).
• Greater accuracy (fewer locus of integration effects than small promoters).
• Speed (more rapid to generate than knockins).
• Little to no transgenerational silencing.
• BACs are large enough to provide sufficient insulation from adjacent sequence, whether that is genomic sequence or even other BACs. Thus it is possible to multiplex BACs in a manner not possible in other techniques (Figure 11.2).

Cons:

• Copy number not as high as small constructs (but higher than a knockin).
• Some locus of integration effects possible (fewer than with small transgenes).
• BAC constructs require more expertise to generate.
• BACs may carry extra copies of unmodified genes adjacent to the driver gene and thus result in increased gene dose for these genes. Often these genes are expressed in unrelated cell types (or even tissues) and thus may have little consequence on the phenotypes of interest in the brain, or they may be buffered by biological mechanisms that keep RNA levels constant even when extra copies of the gene are present. But this may need to be checked by qPCR for some experiments.

KNOCKINS

There are genes in the genome that exceed even the size of a BAC as well as known examples of enhancers that are hundreds of kilobases distant from their target genes. Therefore even BACs may not contain all of the regulatory information necessary to recapitulate the endogenous expression pattern of a gene. Thus, in many ways, inserting a transgene directly into a particular locus by targeted homologous recombination provides the ultimate opportunity to coopt genomic regulatory information to control the expression of an exogenous transgene. While there may be gains in accuracy for some genes, there are several subtleties to the strategy that should be mentioned. First, knockins are slower and more labor-intensive than transgenics: homologous recombination requires many extra steps and screening in embryonic stem cells and additional rounds of breeding to screen chimeras. Second, in many strategies, knocking in the transgene simultaneously knocks out one copy of the endogenous gene, meaning that all mice will be haploinsufficient for the driver, a gene that may be of particular importance to the cell type of interest. (In some designs this can be overcome by careful use of bicistronic sequences such as IRES or viral 2a sequences) (Taniguchi et al. 2011). Third, compared with transgenics, knockins may tend to have lower expression of the transgene, as they will be present in only a single copy rather than tandem arrays. Finally, there are several reasons that even with the nearly perfect genomic context of a knockin, the transgene may differ in observed expression from the endogenous gene, including differences in transcript stability, protein stability, and translation efficiency between the gene and the transgene.

Pros:

• Accuracy and reproducibility (no unintended locus of integration effects).
• New technologies (TALENs and CRISPR/Cas9) improving efficiency and applicability beyond mouse.

Cons:

• Slower and more expensive than transgenesis.*
• Limited to at most one or two copies of the transgene.
• Can create haploinsufficiency of the driver gene.

COMBINATIONS AND INTEGRATIONS

It is worth noting that variations of these techniques exist for inserting transgenes with small promoters (Portales-Casamar et al. 2010) or BACs (Heaney et al. 2004) into specific loci (as is the case in knockins), thus balancing some of the pros and cons of knockins. Overall, the selection of the technique depends largely on the influence of the relevant pros and cons on a particular individual project as well as the experience and expertise of the investigator. Also, it is worth noting that many genes in the genome are heavily regulated at the level of splicing and selection of alternative transcription start sites. Thus, the selection of one transcription start site over another in targeting by BAC or knockin could clearly influence the expression of a transgene regardless of which method is chosen for targeting. This level of complexity is largely ignored by the field currently.

*While this chapter was in press, new CRISPR technologies have alleviated this issue.
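The trade-offs listed in Box 11.2 can be restated as a small lookup table. The following Python sketch is illustrative only: the class, field, and function names are hypothetical, and the boolean flags simply restate the pros and cons listed in the box rather than any quantitative benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TargetingMethod:
    """One row of the Box 11.2 comparison (all field names are hypothetical)."""
    name: str
    multi_copy: bool                 # can integrate in more than one or two copies
    position_effects: bool           # locus-of-integration effects are possible
    silencing_listed: bool           # transgenerational silencing listed as a con
    driver_haploinsufficiency: bool  # may knock out one copy of the driver gene


# Flags restate the qualitative pros and cons from the box; they are not measurements.
METHODS = (
    TargetingMethod("small-promoter transgene", True, True, True, False),
    TargetingMethod("BAC transgene", True, True, False, False),
    TargetingMethod("knockin", False, False, False, True),
)


def shortlist(need_multi_copy: bool, avoid_position_effects: bool) -> List[str]:
    """Return the names of methods compatible with two project constraints."""
    keep = []
    for method in METHODS:
        if need_multi_copy and not method.multi_copy:
            continue
        if avoid_position_effects and method.position_effects:
            continue
        keep.append(method.name)
    return keep


print(shortlist(need_multi_copy=True, avoid_position_effects=False))
print(shortlist(need_multi_copy=False, avoid_position_effects=True))
print(shortlist(need_multi_copy=True, avoid_position_effects=True))
```

In this simplified table, no single method satisfies both constraints at once, so the last call returns an empty list. That gap is the one the combined strategies described above (small promoters or BACs inserted into specific loci) are meant to narrow.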

for Purkinje neurons, Nestin for neural stem cells, and emerging drivers such as Aldh1L1 for astrocytes—which each have apparently good specificity in the brain, are often expressed robustly in other tissues as well (Anthony & Heintz 2007; Cahoy et al. 2008; Day et al. 2007; Doyle et al. 2008; Dubois et al. 2006; Foo & Dougherty, 2013; Zhang et al. 2005). It is also a common observation in the characterization of Cre lines that many drivers are expressed in unexpected populations during development, or even in the egg, resulting in widespread early recombination. This highlights the need for strategies that permit additional layers of experimental control for the temporal or spatial expression of genes.

Temporal Control of Transgenic Expression

Experimentally, it is often important to control the time of transgene expression to prevent widespread recombination, particularly for studies of development (lineage tracing) or cell

BOX 11.3

TRANSGENES OF NOTE

FLUOROPHORES

The most common transgene providing anatomical access to particular cell types is the green fluorescent protein (GFP), derived from Aequorea victoria, or its variants. Over a decade of engineering and directed evolution has resulted in variants covering a range of fluorescence wavelengths, of which CFP, YFP, and GFP have been shown to have robust fluorescent properties and a lack of overt toxicity in vivo. The other major family of fluorescent proteins in use in laboratories comprises the variants of the red fluorescent DsRed gene, originally isolated from Discosoma. Though frequently very bright and stable, several of these, particularly the monomeric forms, have been reported to have toxicity or to aggregate in vivo (Dougherty et al. 2012b; Strack et al. 2008). Tandem dimerized variations of this protein (tdTomato) apparently do not display these detrimental properties and have been gaining popularity both as transgenes and as the new standard reporter lines for Cre recombinase (Madisen et al. 2010). Finally, this is an area of very active research, and new proteins with novel properties are continuously being discovered, designed, or evolved. In addition to continual modifications to adjust wavelengths and increase maturation, stability, and brightness, there also exist variants with photoswitchable wavelengths (Andresen et al. 2008), new far-red variants (Dieguez-Hurtado et al. 2011; Shcherbo et al. 2009), variants whose wavelengths change over time (Terskikh et al. 2000; Yanushevich et al. 2003), as well as proteins that alter fluorescent properties in response to changes in intracellular calcium (Looger & Griesbeck, 2012; Tian et al. 2012).

RECOMBINASES

Recombinases are proteins that recognize specific sequences in DNA and rearrange or recombine the DNA in particular predictable manners. By far the most commonly used recombinase is the Cre recombinase, derived from bacteriophage P1. Cre canonically recognizes specific DNA sequences called LoxP sites. Depending on the relative orientation of the sites, Cre can be used to either excise or invert DNA flanked by LoxP sites (“floxed”), insert large fragments of DNA into single sites, or even mediate engineered chromosomal rearrangements (Mills & Bradley, 2001; Nakatani et al. 2009; van der Weyden & Bradley, 2006). Since the introduction of the technique, the most important adaptations have included modifications to optimize codon usage for mammalian systems (Shimshek et al. 2002), the fusion with the estrogen receptor to permit tamoxifen-inducible nuclear translocation (and thus recombination) (Indra et al. 1999), the identification of alternative variations of LoxP sites (Siegel et al. 2001), and the development of split Cre reagents for recombination mediated by the intersection of expression of two separate loci (Hirrlinger et al. 2009). A distant second in frequency of use is Flpe, an optimized version of the Saccharomyces cerevisiae Flp-1 recombinase. This recombinase recognizes Frt sites rather than LoxP. It has also now been optimized for mammalian codon usage and thermostability (Kranz et al. 2010; Rodriguez et al. 2000) and has inducible variations available (Hunter et al. 2005). In addition to providing flexibility of experimental strategies, the development of a second recombination system has opened the door to strategies that respond to particular combinations of gene expression. For example, there are now reporter mice that express GFP only after recombination by both Cre and Flpe (Farago et al. 2006), allowing even more specific molecular definitions of cell type.
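The logic of these recombinase strategies can be written out as a few Boolean rules. The Python sketch below is a toy model under stated assumptions, not a description of any particular mouse line: the class, field, and function names are hypothetical, and the mapping of site orientation to outcome (same orientation gives excision, opposite orientation gives inversion) is general knowledge about LoxP sites rather than something spelled out in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Toy description of a cell's recombinase status (all fields hypothetical)."""
    expresses_cre: bool = False
    expresses_flpe: bool = False
    cre_fused_to_er: bool = False    # tamoxifen-inducible Cre fusion
    tamoxifen_present: bool = False


def cre_active(cell: Cell) -> bool:
    """Cre can recombine only if expressed and, for ER fusions, induced."""
    if not cell.expresses_cre:
        return False
    if cell.cre_fused_to_er:
        # The ER fusion reaches the nucleus only when tamoxifen is present.
        return cell.tamoxifen_present
    return True


def floxed_outcome(sites_same_orientation: bool) -> str:
    """Outcome for DNA flanked by two LoxP sites, by their relative orientation."""
    return "excision" if sites_same_orientation else "inversion"


def dual_reporter_on(cell: Cell) -> bool:
    """Intersectional reporter: requires recombination by both Cre and Flpe."""
    return cre_active(cell) and cell.expresses_flpe


# A cell carrying an inducible Cre plus Flpe reports only after tamoxifen.
uninduced = Cell(expresses_cre=True, expresses_flpe=True, cre_fused_to_er=True)
induced = Cell(expresses_cre=True, expresses_flpe=True,
               cre_fused_to_er=True, tamoxifen_present=True)
print(dual_reporter_on(uninduced), dual_reporter_on(induced))
```

The point of the sketch is that each added requirement (a second recombinase, an inducing drug) acts as a logical AND, which narrows the set of labeled cells beyond what a single driver gene allows.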

REAGENTS FOR ACTIVATING AND SILENCING CELL TYPES

For neuroscientists interested in the physiological properties and functional roles of particular cell types, the development of light-activatable ion channels has been an extraordinary advance. These channels combine the millisecond temporal resolution required for sophisticated and naturalistic manipulation of neurons with the cell-type specificity afforded by genetically encoded tools. The basic tool for activation, Channelrhodopsin, derived from Chlamydomonas reinhardtii, is a channel that passes cations upon stimulation by blue wavelength light. This allows depolarization sufficient to trigger action potentials in neurons. For inhibition, the basic tool is Halorhodopsin, a yellow light-driven chloride pump derived from Halobacterium, which can be utilized to hyperpolarize neurons and prevent them from firing. Both of these tools, as well as related proteins derived from other species, have now been extensively modified to adapt them to mammalian systems, increase stability, maturation, and membrane localization, and provide a wider variety of wavelengths. The most current variants of each have recently been reviewed (Chow et al. 2012; Lin, 2011). These tools have been shown to be effective when inserted in a cell-specific manner with virus (Haubensak et al. 2010; Letzkus et al. 2011), under the control of specific BACs (Zhao et al. 2011a), or with Cre reporters (Madisen et al. 2010). Experimentally activatable G protein-coupled receptors are also now available. Examples include chimeras of vertebrate rhodopsins with the intracellular loop from a β2-adrenergic receptor allowing light activation of second messenger signaling (OptoXRs) (Airan et al. 2009), as well as a variety of receptors lacking endogenous mammalian ligands, such as DREADDs, RASSLs, and allatostatin receptors (Masseck et al. 2011). These tools will permit the investigation of a variety of neuromodulatory and second messenger signaling pathways within genetically targeted cell types. Finally, it is worth noting that a variety of tools for relatively permanently silencing neurons, including toxic transgenes (Garcia et al. 2004; Hara et al. 2001) and tethered toxins (Auer et al. 2010), now exist. While these tools lack the exquisite temporal resolution of their light-activated analogs, they are advantageous for studies requiring long-term silencing to study the behavioral consequences of the functional ablation of particular cell types.

TAGS AND FUSIONS (CHIP, CLIP, TRAP, AND SUBCELLULAR PROTEOMICS)

A variety of strategies have been developed to adapt biochemical purification techniques to cell-specific profiling, most notably affinity purification of RNA, DNA, protein complexes, and specific subcellular organelles. One example of this is the TRAP methodology for translating ribosome affinity purification (Heiman et al. 2008). In this method a transgene that is a fusion protein of eGFP and the ribosomal protein L10a (Rpl10a) is targeted to specific cell types. This is typically done using BACs, although any of the methods in Box 11.2 could be applicable and Cre-responsive lines now exist. Because of the Rpl10a moiety, the protein is incorporated into ribosomes and is thus associated with mRNA that is undergoing translation only in the targeted cell type. The eGFP moiety then serves both as a fluorescent tag for anatomical studies and as an affinity tag for biochemical purification: mouse brains are rapidly homogenized, and the GFP-tagged ribosomes (and affiliated mRNAs) are captured with anti-GFP antibodies coupled to magnetic beads. The method has been shown to be effective across a range of cell types in the nervous system (Doyle et al. 2008) and, when coupled to microarray or RNAseq, allows a genome-wide snapshot of mRNAs in use in a particular cell. Although it has not yet been validated with any genomic assays, the very similar RiboTag strategy utilizes an HA tag and a Cre-responsive knockin into the Rpl22 locus to capture ribosomes (Sanz et al. 2009).

Similar strategies have been applied to study the interaction of mRNAs and microRNAs with particular RNA-binding proteins using the more generic CLIP (cross-linking and immunoprecipitation) approach, utilizing either tagged versions of the protein or antibodies against the endogenous protein (Chi et al. 2009; He et al. 2012; Jensen & Darnell 2008). Likewise, any protein of interest can be expressed in a tagged form in specific cell populations, or under its endogenous promoter, and be utilized for affinity purification followed by proteomic strategies to study protein-protein interactions and protein modifications in particular cell types (Bateup et al. 2008; Zhong et al. 2009). Tagged DNA-binding proteins can also be expressed in a cell-specific manner to permit cell-specific epigenetic approaches, such as studying the interaction of transcription factors with DNA by ChIP (chromatin immunoprecipitation) (Zhang et al. 2008). This can be helpful both for studying the differential binding patterns of the same protein in distinct cell types and as an alternative strategy when antibodies to the endogenous protein are unavailable or ineffective. Finally, it has been demonstrated that tagging the correct protein can permit the cell-specific purification of entire organelles in a manner sufficient for proteomic analysis (Heller et al. 2012; Selimi et al. 2009). This approach opens completely new avenues of investigation for cell biologists interested in the nervous system.

function at discrete time points, such as during acquisition of a new behavior in a learning assay or after early development. Currently there exist several methods for integrating additional temporal regulation into genetic modifications. All have some utility, although there are particular advantages and disadvantages to each. First, utilization of inducible recombinases, such as a fusion protein of Cre or Flpe together with an estrogen receptor (Cre-Ert2), can limit recombination events to a particular time window when a drug (tamoxifen) is added, avoiding recombination in early development (Danielian et al. 1998; Hunter et al. 2005). Although Cre-Ert2 can be less efficient than normal Cre under the same driver (S. Gong, personal communication) (Gong et al. 2007), this approach is becoming more widespread. Likewise, there exists a smaller set of lines utilizing tetracycline-responsive promoter and repressor elements coopted from Escherichia coli (Passman & Fishman 1994; Saez et al. 1997; Zhou et al. 2009). The advantage over Cre is reversibility: the drug can be added and later removed to turn the gene on and back off. The disadvantages are reported leakiness of the system in some lines, toxicity of the tTA transgene at high levels, and the lack of lines targeting a wide range of cell types compared with the number of available Cre drivers.

Second, utilizing combinations of recombinases (Farago et al. 2006), or split recombinases (Hirrlinger et al. 2009), one can design logic gates such as “and” or “or” that would require the concurrent or sequential expression of two genes in a cell type for recombination to occur (Dymecki & Kim 2007). This would also permit the manipulation of cell types for which no single driver gene exists but which are unique in their expression of a particular combination of genes, with a particular timing. The drawbacks to this are the “allele problem” of combining three or more alleles into a single animal and the sometimes imperfect efficiency of recombination. The allele problem follows from simple Mendelian rules: in attempting to combine three or more alleles utilizing heterozygous breedings, only a small fraction of the progeny generated will have the correct combination of alleles to be experimentally useful. Thus these approaches may be particularly amenable to multiplexing transgenic approaches (Figure 11.2) (Dougherty et al. 2012b). Still, even multiplexing may be stymied by imperfect efficiency of recombination: it is a common observation that even with accurate expression of recombinase, recombination may not occur in all cells. This is likely to be a function, in part, of the level of recombinase expression and, in part, of the accessibility and structure of the targeted region in the genome. Regardless, these inefficiencies would be estimated to be multiplicative when dealing with strategies requiring combinations of recombinases.

FIGURE 11.2: Multiplexed genetic targeting of cell types in the mouse central nervous system. (a) Three different BAC promoter constructs, with drivers for neurons (Snap25), astrocytes (Aldh1L1), and oligodendrocytes (Mobp), are modified in parallel with three spectrally distinct fluorophores, then co-injected into fertilized mouse eggs. Schematic steps: three different fluorescent transgenes (Cerulean-Myc, Yfp-HA, and mCherry) are cloned into BAC modification shuttle vectors; the shuttle vectors are used to modify three BACs, placing each transgene at the translation start site (ATG) of its driver gene; fertilized mouse eggs are then injected. The resulting lines demonstrate distinct labeling of each cell type. (b) Aldh1L1 BAC drives Cerulean Fluorescent Protein specifically in astroglia of a 13-day-old mouse cerebellum. (c) Snap25 BAC drives mCherry fluorophore specifically in neurons. (d) Mobp BAC drives Yellow Fluorescent Protein specifically in oligodendrocytes. (e) DAPI counterstain shows nuclei. (f) Overlay demonstrates mutually exclusive expression of the three fluorophores in the same mouse.

Finally, there is the powerful combination of Cre reagents with lentiviral and adenoviral constructs. There are two variations of this approach: a virus with a floxed transgene can be injected into a mouse line with cell-specific Cre expression, or a virus expressing Cre can be injected into a mouse line with a floxed gene or transgene in the genome. There are several advantages to these approaches. First, the time of injection permits temporal control of transgene expression, thus circumventing difficulties with Cre lines with early expression. Second, the locus of injection allows for a level of anatomical selectivity, or the intersection of region and cell type, in a manner that genomic drivers alone may not permit. Third, the strong promoters available in the virus may permit more robust transgene expression than most endogenous promoters, whether knockin or BAC. This is a particularly important benefit for transgenes such as channelrhodopsins that require high levels of expression for efficacy.
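The allele problem and the multiplicative inefficiency of combined recombinases described above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes each allele is carried heterozygously by one parent and segregates independently (unlinked loci), and the per-recombinase efficiencies of 80% and 70% are hypothetical values, not measurements from the studies cited.

```python
from fractions import Fraction


def fraction_with_all_alleles(n_alleles, p_transmit=Fraction(1, 2)):
    """Expected fraction of progeny inheriting all n_alleles.

    Assumes each allele is heterozygous in one parent and segregates
    independently, so each is transmitted with probability p_transmit.
    """
    return p_transmit ** n_alleles


def combined_recombination_efficiency(efficiencies):
    """Fraction of target cells recombined by every recombinase,
    assuming the individual recombination events are independent."""
    combined = 1.0
    for efficiency in efficiencies:
        combined *= efficiency
    return combined


# Three transgenes combined through heterozygous breedings.
useful_fraction = fraction_with_all_alleles(3)   # Fraction(1, 8)
pups_per_useful_animal = 1 / useful_fraction     # 8 pups per useful animal, on average

# Hypothetical per-recombinase efficiencies of 80% and 70%.
both_recombined = combined_recombination_efficiency([0.8, 0.7])  # about 0.56
```

Under these assumptions only one pup in eight carries all three alleles, and two reasonably efficient recombinases together reach little more than half of the intended cells, which is why multiplexed transgenesis and viral delivery are attractive alternatives.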

Nongenetic Methods of Targeting Cell Types

Overall, most of these methods rely on genomic information for targeting and thus depend on a molecular definition of cell types. However, alternatives do exist. For example, injecting retrogradely labeling fluorescent dyes into the spinal cord can permit the sorting and profiling of corticospinal upper motor neurons (Arlotta et al. 2005), thus using an anatomical definition of cell type to harvest molecular information. As long as perfect drivers do not exist for all cell types, clever use of retrogradely integrating reagents, including an interesting set of engineered rabies viruses (Wall et al. 2010), could also permit the expression of transgenes in a cell-specific manner based on projection patterns and anatomical connections.

CELLOMICS: HIGH-THROUGHPUT AND COMPREHENSIVE CHARACTERIZATION OF CELL TYPES

While transgenesis and animal husbandry remain low-throughput, once cell types are targeted there now exist several related methods for high-throughput characterization of the molecular properties of specific cell types.

These -omics and -omics-like approaches can be organized roughly by the tradition of cell-type definition (Box 11.1) that they best inform. Of these, the molecular approaches are the most readily scalable, although several technologies currently enhancing the throughput of other approaches should be mentioned.

Morphological “Anat-omics”

Over the last decade, several complementary projects have undertaken efforts to characterize the expression of hundreds or thousands of genes in the genome within the mouse brain and embryo (Easterday et al. 2003; Geschwind et al. 2001; Gray et al. 2004; Heintz 2004; Jones et al. 2009; Lein et al. 2007; Magdaleno et al. 2006; Shimogori et al. 2010; Visel et al. 2004). Although, as a general rule, quality is often inversely proportional to throughput, these resources have transformed the manner in which we analyze expression studies and demonstrated the potential of automation and informatics even when applied to techniques with components that are not as inherently scalable as sequencing or microarrays. These efforts collectively have generated a new encyclopedia of knowledge about gene expression, and we cannot overemphasize the importance of this work. However, there are several important caveats to consider in examining these sources of data (Jones et al. 2009). First, null results for a given gene (lack of any expression in the brain) should be viewed with extreme caution, as these assays are typically executed with little optimization for any given gene and each technique has a different threshold of detection and dynamic range. RNA for genes with no measurable expression by in situ hybridization can be detected by other methods (Dougherty et al. 2010; Lee et al. 2008). Second, in most cases only a single isoform per gene has been assayed; thus our knowledge is likely incomplete for the other products of a given gene. Finally, these approaches are not inherently quantitative, and time of development for the enzymatic reactions may be variable. With these techniques, comparison across different experiments regarding the relative strength of gene expression can be misleading. Nonetheless, these resources provide an essential baseline analysis for the anatomy of gene expression in the nervous system. It is also worth noting that there are methodological advances that may address some of these shortcomings.

One important advancement that may serve to increase the throughput of anatomical studies using genetically encoded fluorophores is the direct coupling of tissue sectioning with automation of fluorescent microscopy and data acquisition (Ragan et al. 2012). Scanning the brain in real time as it is sectioned removes many intermediate steps that may result in experimental variability (such as development with enzymatic reagents) and provides data that are more readily quantifiable, with the exquisite morphological detail available from high-resolution two-photon imaging and the capacity for three-dimensional reconstruction. If the eventual cost of these systems permits it, future genome-wide efforts to characterize fluorescent molecule expression may bypass expensive, time-consuming, and nonquantitative enzymatic detection and amplification steps. Likewise, advances in imaging may remove the need for sectioning altogether (Chung & Deisseroth 2013). Another interesting advance has been the combination of multiple fluorophores in single animals, both in targeted strategies (Dougherty et al. 2012b; Feng et al. 2000; Shuen et al. 2008) and in the “Brainbow” approach, a fluorescent variation of high-throughput Golgi-like labeling (Weissman et al. 2011). These approaches, particularly in combination with digitization efforts, have the potential to permit some parallelization and automation of the analysis of morphological studies. Finally, one emerging technology proposes to adapt the throughput of sequencing technology to study connectivity (Oyibo et al. 2011). Retrogradely transported viruses, capable of genomic integration, carry unique DNA “barcodes” from the genome of one cell into the genomes of synaptically connected cells. Potentially, sequencing of these tags from the genomes of each cell might permit the construction of an all vs. all map of connectivity of a mouse nervous system.
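The barcode idea in the preceding paragraph can be sketched as a simple lookup. This is a hypothetical data model, not the published method: it assumes each cell starts with one known barcode and that sequencing returns the set of all barcodes found in each cell. The function name, cell labels, and barcode strings are invented for illustration, and the sketch records only which cell donated a barcode to which other cells, without modeling the direction of synaptic transmission.

```python
def connectivity_map(own_barcode, observed_barcodes):
    """Return {donor_cell: set(recipient_cells)}.

    own_barcode: dict mapping each cell to the barcode it was tagged with.
    observed_barcodes: dict mapping each cell to the set of barcodes
        sequenced from its genome.
    A foreign barcode found in a cell is read as evidence of a connection
    with the cell that originally carried that barcode.
    """
    owner = {barcode: cell for cell, barcode in own_barcode.items()}
    edges = {cell: set() for cell in own_barcode}
    for cell, barcodes in observed_barcodes.items():
        for barcode in barcodes:
            donor = owner.get(barcode)
            # Skip a cell's own barcode and barcodes with no known owner.
            if donor is not None and donor != cell:
                edges[donor].add(cell)
    return edges


# Hypothetical three-cell example.
own = {"A": "AAC", "B": "GGT", "C": "TTA"}
observed = {"A": {"AAC"}, "B": {"GGT", "AAC"}, "C": {"TTA", "GGT", "NNN"}}
edges = connectivity_map(own, observed)
```

In this toy example cell A's barcode appears in cell B and cell B's barcode appears in cell C, so the recovered map links A to B and B to C; the unrecognized barcode is ignored.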

Physiological and Functional Cellomics

The mature high-throughput correlate for physiological studies would be the implantable multielectrode array. Developed over the last two decades, these arrays have permitted the simultaneous physiological assessment of tens to hundreds of individual neurons within a region. This advance has provided the opportunity to understand the behavior of cells in situ and in a relatively unbiased manner. A variety of important conceptual advances have been made possible by this technology. The understanding of the importance of neural synchrony to perceptual processing within and across regions has been strongly advanced by these studies (Miller & Wilson 2008), as have the hypotheses regarding memory consolidation during sleep (Sutherland & McNaughton 2000). However, aside from some recent examples utilizing these tools in clever combination with genetically encoded constructs for silencing or activating neurons (Haubensak et al. 2010; Letzkus et al. 2011), there is little about the technique that lends itself to analysis of prospectively targeted cell types. This niche is being filled now by the development of genetic indicators of neural activity, especially genetically encoded calcium indicators (Looger & Griesbeck 2012; Tian et al. 2012). Although early indicators have existed since the 1980s, only recently have proteins been developed with properties sufficient to merit their widespread adoption. The parallel development of enhanced capacity for in vivo imaging, notably multiphoton microscopy and derivatives, has facilitated the adoption of these technologies. Especially now, with the advent of multicolor genetically encoded dyes (Zhao et al. 2011b), it is becoming possible to study the physiological interactions between genetically targeted cell types in awake and behaving animals. In the future, this will permit descriptive correlational studies between the behavior of neurons and the behavior of animals that will allow some inferences as to the final functional role of particular cell types. However, by their very nature, behavioral analyses are going to remain relatively low-throughput methods as compared with molecular approaches.

Molecular Cellomics

Epigenomics

Between the DNA and the transcription of RNA lies the epigenome. Although, in classical genetics, it had a somewhat different definition (Waddington 2012), epigenetics has come to be defined as the suite of modifications occurring to the DNA molecule itself (methylation, hydroxymethylation) or associated histones (methylation, acetylation, etc.), or the presence of other DNA-binding proteins that appear to correspond to changes in gene expression (Goldberg et al. 2007). Whether these changes should be considered causes or corollaries of gene expression is currently unclear. Their accurate assessment and interpretation is an area of very active investigation (Maunakea et al. 2010). Current technologies for assessing epigenetic state have been moving from hybridization-based assays (ChIP-chip) to sequencing-based approaches (ChIP-Seq, MRE-Seq, Methyl-Seq) (Laird 2010). These technologies share the common feature of using an affinity reagent, an enzymatic reaction, or a chemical conversion (bisulfite sequencing) to capture or mark epigenetically modified DNA, or epigenetic marks on proteins cross-linked to DNA, followed by sequencing to identify the approximate or exact sites of modifications (Harris et al. 2010). To date, several projects, notably ENCODE/modENCODE, have conducted surveys across tissues to examine tissue-specific patterns of epigenetic modification (Muers 2011; Myers et al. 2011). Despite the fact that these methods are DNA-based and thus amenable to working with small amounts of starting material, surprisingly little of the work has focused at the level of individual cell types, with the notable exception of the work that discovered DNA hydroxymethylation in mammalian cells via examination of Purkinje cell nuclei in the brain (Kriaucionis & Heintz 2009). Because the epigenome is an intermediary between the DNA and transcription, many of the changes apparent from transcriptomics ought to be reflected in any epigenetic investigation of cell type, and direct comparisons of transcription to the epigenome may help deconvolute the meaning of many of the epigenetic marks in real in vivo contexts. This will likely be an area of active investigation in the next few years: technology exists for purification of nuclei of specific cell types (Kriaucionis & Heintz 2009), and many epigenetic assays are scalable and applicable to small amounts of material.

Proteomics/Metabolomics

If epigenomics is the study of what comes between the RNA and the DNA, then proteomics is the study of what comes after RNA. Nearly any substance present in a cell is in some manner the consequence of the expression of a particular gene. Most proteins will be coded for directly by a particular RNA, while many small molecules (metabolites) and modifications of larger proteins are the consequence of the expression of particular enzymes.

RNA expression, while necessary for the generation of many proteins, is not sufficient. The RNA must be translated, a process regulated extensively and in a transcript-dependent manner. By the best estimates we have available, RNA levels will correlate with the expression of their corresponding proteins at about 0.6 or better (Hegde et al. 2003; Kislinger et al. 2006). Thus one could argue that knowledge of protein levels directly is more important to the prediction of cell behavior and function than RNA levels alone. Proteomics also gives additional information about the post-translational modifications upon the products of genes. Many, if not all, proteins are regulated by extensive post-translational modifications such as phosphorylation, methylation, myristoylation, ubiquitination, acetylation, and many others that can profoundly alter their activity and function, and these modifications are impossible to predict from RNA sequence alone. There are many technical challenges, however, in performing global proteomic profiling. Given their size, proteins are typically digested and analyzed by mass spectrometry. To achieve profiling coverage of the hundreds of thousands of peptides generated, multidimensional separation strategies are used. Although effective, such strategies usually require 10 to 20 hours of analysis time and therefore impose significant limitations on throughput. Moreover, the global profiling approach generally does not screen for post-translational modifications or for protein-protein interactions that may strongly influence protein activity. To more directly examine protein activity, activity-based proteomic strategies have been developed that use chemical probes engineered to react with enzymes of interest (Kam et al. 1993). Alternatively, as another mechanism to monitor protein activity, the substrates and products of enzymes may be measured. This approach, known as metabolomics, aims to simultaneously measure small molecules involved in cellular pathways such as glycolysis and the Krebs cycle. By using state-of-the-art mass spectrometry technology, tens of thousands of small molecules (metabolites) can be detected within cells (Patti et al. 2012; Scalbert et al. 2009). While theoretically knowledge of the levels of all proteins present in a cell would predict the presence of particular metabolites, too little is known about most of these molecules, and no such comprehensive relationships can yet be built. Only the high-level integration of data at multiple -omics levels of investigation will permit the generation of these sorts of predictive models. Here, we have lumped together the assessment of proteins and their products largely because the technologies available for their assessment are similar. As with epigenetic investigations, little effort has been focused on assays of protein or metabolite levels in specific cell types. Here the barriers are technological: unlike with nucleic acids, there are no readily apparent methods for amplifying proteins and metabolites, and as sensitive as mass spectrometers have become, collecting sufficient material of a particular type will still be limiting for many cell types in the nervous system. Furthermore, unlike nucleic acids, individual proteins vary widely in their biochemical properties; therefore these assays are much more sensitive to handling and extraction conditions than work with DNA and RNA. Nonetheless, there are at least two interesting advancements to be highlighted in the application of proteomics to the nervous system: organellomics and in situ metabolomics.
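To put the RNA-protein correlation cited earlier in this section in perspective, the short sketch below computes a correlation and the share of variance it accounts for. It assumes the reported value of about 0.6 is a Pearson correlation coefficient, which the text does not state explicitly; the helper names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between paired measurements,
    for example RNA and protein abundance for the same genes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(r):
    """Fraction of variance in one variable accounted for by a linear
    relationship with the other (the coefficient of determination)."""
    return r ** 2


explained = variance_explained(0.6)   # about 0.36
unexplained = 1 - explained           # about 0.64
```

Read this way, a correlation of 0.6 leaves roughly two thirds of the variation in protein abundance unaccounted for by RNA levels, which is the quantitative core of the argument for measuring proteins directly.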

Organellomics

Two limitations of proteomics are the need to isolate large amounts of material from specific cell types and the difficulty in detecting low-abundance proteins. One possible approach to tackling these limitations is to focus on specific subcellular compartments in specific cell types. The focus on just a particular organelle or subcellular compartment (such as the synapse) should increase the relative abundance of important proteins that may be of relatively low abundance cell wide. The use of transgenic technologies can permit adding tags for affinity purification of specific organelles. For example, Selimi and colleagues tagged just the parallel fiber synapse on Purkinje neurons by expressing a fusion of GFP to the GluRδ2 protein specifically in Purkinje neurons. Affinity purification of these specific synapses from biochemical preparations of the mouse brain permitted proteomics of this specific synapse in a specific cell type, leading to important insights into signaling within this structure in these cells (Selimi et al. 2009). One could imagine parallel approaches for purification of other organelles from specific cell types. Even existing mouse lines with GFP-tagged ribosomes could be used to study differences in protein composition of ribosomes from particular cell types (Doyle et al. 2008). The drawback to these approaches is that they may depend on de novo transgenesis for each cell type and biochemical optimization for each organelle of interest, but they permit more comprehensive information than would be available by any other method.

In Situ Metabolomics

One major challenge of the application of proteomics and metabolomics to the nervous system is the requirement to isolate a sufficient number of cells in a manner that is not too disruptive to the very profile to be studied. Any technique that involves physical dissociation of living cells from the nervous system is likely to have some effect on the profile. Therefore, new technologies that combine classic anatomical preparations with mass spectrometry are of great interest for studying the in situ profile of particular cell types. The nanostructure initiator mass spectrometry (NIMS) imaging technique rasters a laser across a slice of tissue on a special substrate and measures by mass spectrometry the ionized particles as they come off the tissue (Patti et al. 2010). Currently the technique is limited in that only the most abundant products can be measured, given ion suppression effects. Nonetheless, resolution is on the level of individual cells, and combination with genetically labeled tissue may permit proteomic and metabolomic investigation of specific cell types in situ.

Transcriptome/Translatomics

Measurement of relative RNA abundances is the most inherently scalable and accessible approach to the molecular characterization of cell types. RNA, once converted to cDNA, can be amplified essentially limitlessly. Compared with genomic DNA (for epigenetic applications), most RNAs are present in more than two copies per gene per cell. In addition, a variety of robust platforms exist for profiling RNA pools, ranging from hybridization-based microarrays to emergent RNA-sequencing technologies. Because of these features, inputs as low as single cells or tens of cells have successfully been utilized for measurement of gene expression by qPCR, sequencing, and microarrays (Burgemeister et al. 2007; Dixon et al. 2000; Hempel et al. 2007; Islam et al. 2011; Mary et al. 2011). These studies have shown a remarkable diversity of gene expression across cell types in the nervous system. Comparison of even very closely related cell types, such as Drd1 and Drd2 medium spiny neurons (Heiman et al. 2008; Lobo et al. 2006) or corticospinal neurons and corticostriatal neurons (Arlotta et al. 2005; Schmidt et al. 2012), has identified hundreds of transcripts that are differentially regulated. For more distantly related cell types, such as comparing a Purkinje cell to an astrocyte or to a cholinergic neuron of the medial habenula, there are thousands of differentially regulated genes (Doyle et al. 2008), with differences so robust they violate some of the assumptions normally utilized in microarray analysis (Dougherty et al. 2010). Indeed, one of the most striking findings of these studies was that different types of neurons were as specialized from one another as neurons were from glia. Genes that were particularly prone to high magnitudes of variation across cell types (high entropy) coded for classes of molecules typically found at the cell surface, such as receptors and channels, that mediate the response of particular cell types to the environment, as well as some transcription factors and calcium-binding proteins (Doyle et al. 2008). These -omics-level approaches to gene expression in particular cell types provide a wealth of information beyond what individual laboratories have the capacity to pursue. These data can broadly inform both our understanding of the cells in the normal brain and their importance to pathological states. Continuing to capture and distribute this information is going to be an important part of any -omics-level study. Unfortunately, the actual measurements are so platform-dependent that comparisons across studies are difficult. Nonetheless, analysis within a study can still provide novel biological insights regarding these cells in health and disease.
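One way to make the notion of a "high entropy" gene concrete is to bin a gene's expression values across cell types and compute the Shannon entropy of the resulting histogram. The sketch below is an assumption about how such a score could be calculated and is not taken from the cited studies; the bin width, gene labels, and log2 expression values are hypothetical.

```python
import math
from collections import Counter


def binned_entropy(values, bin_width=1.0):
    """Shannon entropy (in bits) of the histogram of a gene's expression
    values across cell types. Values falling in the same bin are treated
    as the same expression level."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return sum((count / n) * math.log2(n / count) for count in bins.values())


# Hypothetical log2 expression of two genes in four cell types.
housekeeping_like = [10.1, 10.3, 10.2, 10.4]   # similar level everywhere
receptor_like = [4.2, 12.7, 6.1, 9.5]          # very different levels

low = binned_entropy(housekeeping_like)    # 0.0 bits: all values share one bin
high = binned_entropy(receptor_like)       # 2.0 bits: four distinct bins
```

Under this definition a gene expressed at the same level everywhere scores zero, while a gene whose level differs in every cell type reaches the maximum of log2 of the number of cell types.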

CURRENT AND EMERGING USES OF TRANSCRIPTOMIC APPROACHES AND DATA

Finding Novel Molecules Important for the Function of Particular Cell Types

Studies profiling particular cell types have been conducted by laboratories interested in identifying novel features of these cells. The common finale of these transcriptomic studies of a particular cell type is a functional assay with one or a small number of the enriched transcripts discovered in the screen, often transcription factors (Arlotta et al. 2005; Dougherty et al. 2012a; Lai et al. 2008; Lobo et al. 2006; Molyneaux et al. 2005). For example, Arlotta et al. compared gene expression during development of corticospinal neurons and callosal projection neurons and identified a gene expressed specifically in corticospinal neurons, Ctip2, that was then found to be necessary for normal development of these cells in functional assays (Arlotta et al. 2005; Dougherty & Geschwind 2005). While this selection of a single enriched transcript is necessary when pursuing labor-intensive functional studies, these same profiles could easily be mined to identify many more transcripts key to cellular function.

Prediction of the Physiological Properties of the Cell  Types A related approach is to utilize the profiles as a means to predict the physiological properties of the cell type of interest. This is of particular interest in the identification of drug targets, such as receptors, that may have cell-specific expression and thus present unique therapeutic opportunities to alter the behavior of particular cell types in vivo. An example of this work was the identification by Heiman et  al. of a novel G protein‒coupled receptor enriched in a subclass of medium spiny neurons (Heiman et  al. 2008). A  separate group identified from microarray data particular potassium channels that likely mediate the development of fast spiking behavior in some cortical interneurons (Okaty et  al. 2009). Another example was the characterization of a new cell type that appears necessary to mediate much of the response to antidepressants (Schmidt et  al. 2012). In the future, profiling a particular, medically relevant cell type may be a powerful method to identify new drug targets for psychiatric or neurological disorders where the relevant cell types are known (Bartfai et al. 2012). Characterization of Cell Types in Disease Models and Other Manipulations As the reproducibility of cell-specific profiling has improved, it has become possible to study not just the profile of these cells in the normal

Cellomics

211

Recent work in this direction for autism is very promising (Voineagu et al. 2011).

state but also to conduct comparative studies of pathological conditions, such as injuries, or genetic models of human diseases. The ability to study the response of particular cell types to stimuli as varied as drug exposure (Heiman et  al. 2008; Schmidt et  al. 2012), tumorigenesis (Dougherty et  al. 2012a; Fomchenko et  al. 2011), or gene knockout and overexpression (Warner-Schmidt et  al. 2012)  gives us the potential to learn a great deal of new neurobiology about each of these processes at a resolution and throughput previously unavailable. With its high reproducibility (Doyle et  al. 2008), TRAP is particularly amenable to this approach, and these studies represent a large fraction of the TRAP studies currently under way.

Interpretation of Human Genetic Data from a Cellular Perspective Philosophically similar is the interpretation of human genetic data from a cellular perspective. Ongoing high-throughput genetic association studies utilizing single nucleotide polymorphisms (SNPs), copy number variations, or resequencing of exomes and genomes are identifying catalogs of common and rare variants that contribute to risk of developing various diseases of the nervous system. Cell-specific transcript profiling can contribute to these studies in at least two major ways.

Interpretation of Human Gene Expression Data from a Cellular Perspective Finally, even for those investigators not conducting transcriptome profiling in their own labs, these resources have proven very useful datasets to aid in the interpretation of other gene expression data. One recent example of this work is the comparative analysis of different subclasses of glioblastoma tumors with the expression profiles of the major cell types of the brain. This work hinted that different subtypes of glioblastoma likely had emerged from different classes of normal cell types in the brain (Cahoy et  al. 2008; Verhaak et  al. 2010). Likewise, work profiling whole human brains across regions, time, and species has suggested that many, and perhaps most, gene expression differences seen in these studies are really driven by differences in cellular composition across the samples (Kang et  al. 2011; Oldham et  al. 2006, 2008). Emerging approaches are attempting to explicitly incorporate cellular information into analytical models for human data (Kuhn et  al. 2011; Shen-Orr et  al. 2010, Xu & Dougherty, 2013). Future approaches more explicitly utilizing the cellular profiles available from model organisms may lead to further improvements in the analysis and interpretation of human gene expression data. This is particularly important in psychiatric diseases, such as autism, where the relevant cell types are not known. If the human expression data can guide us to consistent cellular alterations, even in the context of distinct genetic or environmental causes across different individuals, then treatments could be tailored to address the common cellular deficits.

1. If a particular cell type is already implicated in a disorder (such as dopaminergic neurons in Parkinson’s disease or hypocretin neurons in narcolepsy), then cell-specific transcriptional profiling can be utilized to identify candidate genes for genetic analysis in human populations. As an example, it is well documented that a subset of individuals with autism have hyperserotonemia (Lam et al. 2006), suggesting that there may also be fundamental differences in the regulation of serotonin in their brains. We profiled serotonin-producing cells in the brain and identified a set of transcripts enriched in these cells. Polymorphisms in two of these genes are associated with autism in humans, and mutations in one of them altered serotonin levels in mouse brains and resulted in behaviors reminiscent of autism (Dougherty et al. 2013). This candidate-gene-list approach, while having merit, is going to be eclipsed in human genetics by the rapidly falling cost of genome-wide sequencing studies. However, cell-specific transcript profiling can still make important contributions to the interpretation of these studies.
2. It is clear from the current studies that there are many, perhaps hundreds, of genetic routes to manifesting a complex psychiatric disorder like schizophrenia (Lee et al. 2012) or autism (Bill & Geschwind 2009). Therefore treatment strategies focused on specific genetic

defects will have limited applicability. However, if these defects converge at the level of particular cells, then, as noted above, the cell becomes the target for treatment. One manner in which a diverse set of genes may converge on a particular cell type is through gene expression. There are two ways to conduct these analyses. First, statistically: if the set of genes implicated in a human psychopathology is expressed more often than expected by chance in a particular cell type, the suggestion is that this cell type is important to the disease process. Second, biologically: if even a single strongly implicated gene is expressed in only one cell type in the brain, then that provides extremely robust evidence for that cell type in the disorder. For example, an apparent autosomal dominant form of Tourette’s syndrome was recently associated with a strongly deleterious mutation in the HDC gene in a large family (Ercan-Sencicek et al. 2010). As Hdc is expressed almost uniquely in histaminergic neurons in the brain, this strongly suggests that dysregulation of CNS histamine neurons can cause Tourette’s syndrome. Therefore drugs that influence these cells may be of use in the treatment of Tourette’s syndrome, at least in this family if not more broadly (Fernandez et al. 2012).
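The statistical route described above can be illustrated with an overrepresentation test. The chapter does not specify a test, so the following is a sketch that assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), which is one common choice. All gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, cell_type_genes, disease_genes, overlap):
    """Probability of seeing at least `overlap` shared genes by chance.

    total_genes: size of the background gene set.
    cell_type_genes: number of genes enriched in the cell type of interest.
    disease_genes: number of genes implicated in the disorder.
    overlap: number of disease genes that are also cell-type enriched.
    """
    max_overlap = min(cell_type_genes, disease_genes)
    denominator = comb(total_genes, disease_genes)
    # Sum the hypergeometric probabilities for k = overlap .. max_overlap.
    tail = sum(
        comb(cell_type_genes, k)
        * comb(total_genes - cell_type_genes, disease_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical numbers: 20,000 background genes, 500 enriched in one cell type,
# 100 disease-implicated genes, 10 of which fall in the enriched set
# (about 2.5 would be expected by chance).
p_value = hypergeom_enrichment_p(20_000, 500, 100, 10)
print(p_value)
```

When many cell types are tested against the same disease gene set, the resulting p-values would need correction for multiple comparisons.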

Finally, it is worth noting that the same cell-specific profiles that give the information for these analyses also simultaneously provide a list of potentially druggable molecules (receptors, kinases) that are enriched or uniquely expressed in the candidate cell types. This information could be essential to the design of treatments targeting these specific cell types (Bartfai et  al. 2012; Doyle et  al. 2008; Nelson et al. 2006).

The Grand Correlation: Assigning Putative Functions to Novel Genes

There are roughly twenty thousand protein-coding genes in the genome of the mouse and approximately the same number in humans. Of these, only a subset has even been named, and even fewer are the focus of at least one publication. Thus the majority of the genome remains essentially unstudied. There is great, as yet untapped, potential in these cell-specific profiling data for the putative functional categorization of novel genes. Given enough cell types and good systematic phenotypic data regarding them, it may be possible from a grand correlation to infer rough functions for these unstudied or understudied genes. For example, imagine that one were interested in identifying genes involved in the maintenance of dendrites. Currently we have measured the RNA profiles of over twenty different types of neural cells (Dalal et al. 2013; Dougherty et al. 2012a; Dougherty et al. 2013; Doyle et al. 2008; Heiman et al. 2008; Schmidt et al. 2012). If one were to carefully measure a phenotype, such as the average dendritic area, from each of these cell types, one could then look for genes whose expression is positively correlated with the phenotype as an in silico screen for genes involved in particular processes. The same approach could be taken for axon length, firing rate, nuclear size, fos expression in response to agonist, dendritic branching, or density of mitochondrial labeling. Much as webQTL permits in silico genetic investigations by only the phenotypic profiling of a standard set of strains (Wang et al. 2003), these profiling data provide an additional opportunity to leverage existing information to screen for novel contributors to a phenotype of interest. We have taken initial steps in this direction by providing a browsable interface for published bacTRAP data (http://java.bactrap.org/bactrap/index.jsp) (Dougherty et al. 2010).
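The in silico screen described above can be sketched in a few lines. This is an illustration only: the gene names, expression values, and dendritic areas are hypothetical, and Pearson correlation is assumed as the measure of association because the text does not name one.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_genes_by_phenotype(expression, phenotype):
    """Rank genes by correlation between expression and a per-cell-type phenotype.

    expression: dict mapping gene name -> expression values, one per cell type.
    phenotype: phenotype values in the same cell-type order.
    Returns (gene, correlation) pairs, most positively correlated first.
    """
    scores = {gene: pearson(values, phenotype)
              for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data for five cell types.
dendritic_area = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = {
    "GeneA": [2.0, 4.1, 5.9, 8.2, 9.8],  # rises with dendritic area
    "GeneB": [5.0, 5.1, 4.9, 5.0, 5.2],  # roughly flat
    "GeneC": [9.0, 7.0, 5.5, 3.0, 1.0],  # falls with dendritic area
}

ranked = rank_genes_by_phenotype(expression, dendritic_area)
for gene, r in ranked:
    print(gene, round(r, 3))
```

With roughly twenty cell types and roughly twenty thousand genes, many strong correlations will arise by chance, so candidates from such a screen would still require statistical correction and experimental follow-up.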
Likewise, if parallel measurements can be conducted on the metabolic, epigenetic, and transcriptomic levels of a sufficiently large number of cell types, then the possibility exists for a truly grand correlation—a matrix that may permit the prediction of which epigenetic marks correspond to the production of which transcripts, and which transcripts indicate the presence of particular metabolites, and are thus related to the pathways that generate them. This combination of approaches—this overlapping of -omics with -omics—has the alluring potential to unlock many of the puzzles emerging from these intertwined fields.

The End Game of Molecular Analysis

Of the molecular -omics methods, the most readily scalable are those based around nucleic acids (RNA and DNA). A final, comprehensive taxonomy of cell types within the brain may await the moment when every cell type in the brain can be individually and comprehensively assayed for RNA expression, a goal that is perhaps not as unimaginably distant as it might seem. If the rate of decrease in sequencing costs continues at the exponential pace of the last 4 years (NHGRI, 2012), within 13 years it will be feasible to conduct RNA-seq on each of the ~100 billion neurons of a human brain, with 30 million 100-bp reads per neuron, for less than $500,000. By 15 years, it would cost $20,000. This all assumes that every neuron would need to be assayed, rather than just a sufficiently large subset, to identify all extant types. Comprehensive clustering by gene expression would then permit a final molecular categorization of all cell types, down to the level of individual cells.
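The arithmetic behind this projection can be checked with a short sketch. It uses only the figures quoted above and treats the two dollar amounts as point estimates (the text gives the first as an upper bound); the function names are illustrative, and the underlying per-base costs from NHGRI are not reproduced here.

```python
NEURONS = 100_000_000_000      # ~100 billion neurons, as stated in the text
READS_PER_NEURON = 30_000_000  # 30 million reads per neuron
READ_LENGTH_BP = 100           # 100-bp reads

# Total sequencing required to profile every neuron once.
total_bases = NEURONS * READS_PER_NEURON * READ_LENGTH_BP


def implied_annual_fold_decrease(cost_early, cost_late, years_apart):
    """Constant yearly fold-decrease that links two costs under exponential decline."""
    return (cost_early / cost_late) ** (1 / years_apart)


def projected_cost(cost_ref, year_ref, year, annual_fold):
    """Cost in `year`, given a reference cost and a constant yearly fold-decrease."""
    return cost_ref / annual_fold ** (year - year_ref)


# The two quoted figures ($500,000 at 13 years, $20,000 at 15 years)
# imply the yearly rate of decline assumed by the projection.
fold = implied_annual_fold_decrease(500_000, 20_000, 2)

# Cost per base that the 13-year figure would require.
cost_per_base_year_13 = 500_000 / total_bases

print(total_bases)             # bases needed in total
print(fold)                    # implied fold-decrease per year
print(cost_per_base_year_13)   # dollars per base at year 13
print(projected_cost(500_000, 13, 15, fold))
```

The two quoted costs differ by a factor of 25 over 2 years, so the projection assumes sequencing costs fall about fivefold per year, and the conclusion is correspondingly sensitive to whether that pace is sustained.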

THE FINAL WORD

It is important to keep in mind, in this era of heady scientific acceleration, that although data generation may be high-throughput, good data interpretation is low-throughput. Careful tool development and even more careful thought will always be needed to cope with this deluge of data.

ACKNOWLEDGMENTS

The author would like to thank Dr. S. Maloney for helpful comments on this manuscript. JDD is supported by the NIH (5R00NS067239-03, 2R21MH099798, 1R21NS083052) and the Mallinckrodt Foundation.

REFERENCES

Airan, R. D., Thompson, K. R., Fenno, L. E., Bernstein, H., & Deisseroth, K. (2009). Temporally precise in vivo control of intracellular signalling. Nature 458, 1025–1029. Andresen, M., Stiel, A. C., Folling, J., Wenzel, D., Schonle, A., Egner, A., . . . Jakobs, S. (2008). Photoswitchable fluorescent proteins enable monochromatic multilabel imaging and dual color fluorescence nanoscopy. Nat Biotechnol 26, 1035–1040. Anthony, T. E., & Heintz, N. (2007). The folate metabolic enzyme ALDH1L1 is restricted to the midline of the early CNS, suggesting a role in human neural tube defects. J Comp Neurol 500, 368–383. Arlotta, P., Molyneaux, B. J., Chen, J., Inoue, J., Kominami, R., & Macklis, J. D. (2005). Neuronal subtype-specific genes that control corticospinal

motor neuron development in vivo. Neuron 45, 207–221. Ascoli, G. A., Alonso-Nanclares, L., Anderson, S. A., Barrionuevo, G., Benavides-Piccione, R., Burkhalter, A.,. . . et al. (2008). Petilla terminology: nomenclature of features of GABAergic interneurons of the cerebral cortex. Nat Rev Neurosci 9, 557–568. Auer, S., Sturzebecher, A. S., Juttner, R., Santos-Torres, J., Hanack, C., Frahm, S., . . . Ibanez-Tallon, I. (2010). Silencing neurotransmission with membrane-tethered toxins. Nat Methods 7, 229–236. Bartfai, T., Buckley, P. T., & Eberwine, J. (2012). Drug targets:  single-cell transcriptomics hastens unbiased discovery. Trends Pharmacol Sci 33, 9–16. Bateup, H. S., Svenningsson, P., Kuroiwa, M., Gong, S., Nishi, A., Heintz, N., & Greengard, P. (2008). Cell type-specific regulation of DARPP-32 phosphorylation by psychostimulant and antipsychotic drugs. Nat Neurosci 11, 932–939. Bill, B. R., & Geschwind, D. H. (2009). Genetic advances in autism:  heterogeneity and convergence on shared pathways. Curr Opin Genet Dev 19, 271–278. Boyden, E. S., Zhang, F., Bamberg, E., Nagel, G., & Deisseroth, K. (2005). Millisecond-time scale, genetically targeted optical control of neural activity. Nat Neurosci 8, 1263–1268. Burgemeister, R., Friedemann, G., Schlieben, S., & Hitzler, H. (2007). Laser microdissection:  gene expression analysis at the single-cell level. Nat Methods An24–An25. Cahoy, J. D., Emery, B., Kaushal, A., Foo, L. C., Zamanian, J. L., Christopherson, K. S., . . . et  al. (2008). A transcriptome database for astrocytes, neurons, and oligodendrocytes:  a new resource for understanding brain development and function. J Neurosci (the official journal of the Society for Neuroscience) 28, 264–278. Celio, M. R., & Heizmann, C. W. (1981). Calcium-binding protein parvalbumin as a neuronal marker. Nature 293, 300–302. Celio, M. R., & Norman, A. W. (1985). Nucleus basalis Meynert neurons contain the vitamin D-induced calcium-binding protein (Calbindin-D 28k). 
Anat Embryol (Berl) 173, 143–148. Chandler, K. J., Chandler, R. L., Broeckelmann, E. M., Hou, Y., Southard-Smith, E. M., & Mortlock, D.P. (2007). Relevance of BAC transgene copy number in mice:  transgene copy number variation across multiple transgenic lines and correlations with transgene integrity and expression. Mamm Genome 18, 693–708. Chi, S. W., Zang, J. B., Mele, A., & Darnell, R. B. (2009). Argonaute HITS-CLIP decodes

microRNA-mRNA interaction maps. Nature 460, 479–486. Chow, B. Y., Han, X., & Boyden, E. S. (2012). Genetically encoded molecular tools for light-driven silencing of targeted neurons. Prog Brain Res 196, 49–61. Chung, K., & Deisseroth, K. (2013). CLARITY for mapping the nervous system. Nat Methods 10, 508–513. Coons, A. H., Creech, H. J., & Jones, R. N. (1941). Immunological properties of an antibody containing a fluorescent group. Proc Soc Exp Biol Med 47, 200–202. Coons, A. H., Snyder, J. C., et al. (1950). Localization of antigen in tissue cells; antigens of rickettsiae and mumps virus. J Exp Med 91, 31–38. Dalal, J., Roh, J. H., Maloney, S. E., Akuffo, A., Shah, S., Yuan, H., Wamsley, B., . . . Dougherty, J. D. (2013). Translational profiling of hypocretin neurons identifies candidate molecules for sleep regulation. Genes Dev 27, 565–578. Danielian, P. S., Muccino, D., Rowitch, D. H., Michael, S. K., & McMahon, A. P. (1998). Modification of gene activity in mouse embryos in utero by a tamoxifen-inducible form of Cre recombinase. Curr Biol 8, 1323–1326. Day, K., Shefer, G., Richardson, J. B., Enikolopov, G., & Yablonka-Reuveni, Z. (2007). Nestin-GFP reporter expression defines the quiescent state of skeletal muscle satellite cells. Dev Biol 304, 246–259. Dieguez-Hurtado, R., Martin, J., Martinez-Corral, I., Martinez, M. D., Megias, D., Olmeda, D., & Ortega, S. (2011). A Cre-reporter transgenic mouse expressing the far-red fluorescent protein Katushka. Genesis 49, 36–45. Dixon, A. K., Richardson, P. J., Pinnock, R. D., & Lee, K. (2000). Gene-expression analysis at the single-cell level. Trends Pharmacol Sci 21, 65–70. Domingos, A. I., Vaynshteyn, J., Voss, H. U., Ren, X., Gradinaru, V., Zang, F., . . . Friedman, J. (2011). Leptin regulates the reward value of nutrient. Nat Neurosci 14, 1562–1568. Dougherty, J. D., Fomchenko, E. I., Akuffo, A. A., Schmidt, E., Helmy, K. Y., Bazzoli, E., . . . Milosevic, A. (2012a).
Candidate pathways for promoting differentiation and quiescence of oligodendrocyte progenitor-like cells in glioblastoma. Cancer Res 72, 4856–4868. Dougherty, J. D., & Geschwind, D. H. (2005). Progress in realizing the promise of microarrays in systems neurobiology. Neuron 45, 183–185. Dougherty, J. D., Maloney, S. E., Wozniak, D. F., Rieger, M. A., Sonnenblick, L., Coppola, G., . . . et al. (2013). The disruption of Celf6, a gene identified by translational profiling of serotonergic neurons, results in autism-related behaviors. J Neurosci 33, 2732–2753.

Dougherty, J. D., Schmidt, E. F., Nakajima, M., & Heintz, N. (2010). Analytical approaches to RNA profiling data for the identification of genes enriched in specific cells. Nucleic Acids Res 38, 4218–4230. Dougherty, J. D., Zhang, J., Feng, H., Gong, S., & Heintz, N. (2012b). Mouse transgenesis in a single locus with independent regulation for multiple fluorophores. Plos One Accepted. Doyle, J. P., Dougherty, J. D., Heiman, M., Schmidt, E. F., Stevens, T. R., Ma, G., . . . et  al. (2008). Application of a translational profiling approach for the comparative analysis of CNS cell types. Cell 135, 749–762. Dubois, N. C., Hofmann, D., Kaloulis, K., Bishop, J. M., & Trumpp, A. (2006). Nestin-Cre transgenic mouse line Nes-Cre1 mediates highly efficient Cre/loxP mediated recombination in the nervous system, kidney, and somite-derived tissues. Genesis 44, 355–360. Dymecki, S. M., & Kim, J. C. (2007). Molecular neuroanatomy’s “three Gs”: A primer. Neuron 54, 17–34. Easterday, M. C., Dougherty, J. D., Jackson, R. L., Ou, J., Nakano, I., Paucar, A. A., . . . et al. (2003). Neural progenitor genes. Germinal zone expression and analysis of genetic overlap in stem cell populations. Dev Biol 264, 309–322. Ercan-Sencicek, A. G., Stillman, A. A., Ghosh, A. K., Bilguvar, K., O’Roak, B. J., . . . et  al. (2010). L-histidine decarboxylase and Tourette’s syndrome. N Engl J Med 362, 1901–1908. Farago, A.F., Awatramani, R. B., & Dymecki, S. M. (2006). Assembly of the brainstem cochlear nuclear complex is revealed by intersectional and subtractive genetic fate maps. Neuron 50, 205–218. Feng, G., Mellor, R. H., Bernstein, M., Keller-Peck, C., Nguyen, Q. T., Wallace, M., . . . Sanes, J. R. (2000). Imaging neuronal subsets in transgenic mice expressing multiple spectral variants of GFP. Neuron 28, 41–51. Fernandez, T. V., Sanders, S. J., Yurkiewicz, I. R., Ercan-Sencicek, A. G., Kim, Y. S., Fishman, D. O., . . . et al. (2012). 
Rare copy number variants in Tourette syndrome disrupt genes in histaminergic pathways and overlap with autism. Biol Psychiatry 71, 392–402. Fomchenko, E. I., Dougherty, J. D., Helmy, K. Y., Katz, A. M., Pietras, A., Brennan, C., . . . Holland, E. C. (2011). Recruited cells can become transformed and overtake PDGF-induced murine gliomas in vivo during tumor progression. PLoS One 6, e20605. Foo, LC, and Dougherty, J.D. (2013). Aldh1L1 is expressed in postnatal neural stem cells in vivo. Glia, 61, 1533–1541.

Garcia, A. D., Doan, N. B., Imura, T., Bush, T. G., & Sofroniew, M. V. (2004). GFAP-expressing progenitors are the principal source of constitutive neurogenesis in adult mouse forebrain. Nat Neurosci 7, 1233–1241. Geschwind, D. H., Ou, J., Easterday, M. C., Dougherty, J. D., Jackson, R. L., Chen, Z. G., . . . et al. (2001). A genetic analysis of neural progenitor differentiation. Neuron 29, 325–339. Goldberg, A. D., Allis, C. D., & Bernstein, E. (2007). Epigenetics: A landscape takes shape. Cell 128, 635–638. Gong, S., Doughty, M., Harbaugh, C. R., Cummins, A., Hatten, M. E., Heintz, N., & Gerfen, C. R. (2007). Targeting Cre recombinase to specific neuron populations with bacterial artificial chromosome constructs. J Neurosci 27, 9817–9823. Gong, S., Kus, L., & Heintz, N. (2010). Rapid bacterial artificial chromosome modification for large-scale mouse transgenesis. Nat Protoc 5, 1678–1696. Gong, S., Yang, X. W., Li, C., & Heintz, N. (2002). Highly efficient modification of bacterial artificial chromosomes (BACs) using novel shuttle vectors containing the R6Kgamma origin of replication. Genome Res 12, 1992–1998. Gordon, J. W., Scangos, G. A., Plotkin, D. J., Barbosa, J. A., & Ruddle, F. H. (1980). Genetic transformation of mouse embryos by microinjection of purified DNA. Proc Natl Acad Sci U S A 77, 7380–7384. Gray, P. A., Fu, H., Luo, P., Zhao, Q., Yu, J., Ferrari, A., . . . et al. (2004). Mouse brain organization revealed through direct genome-scale TF expression analysis. Science 306, 2255–2257. Hara, J., Beuckmann, C. T., Nambu, T., Willie, J. T., Chemelli, R. M., Sinton, C. M., . . . et al. (2001). Genetic ablation of orexin neurons in mice results in narcolepsy, hypophagia, and obesity. Neuron 30, 345–354. Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., . . . et al. (2010).
Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 28, 1097–1105. Haubensak, W., Kunwar, P. S., Cai, H. J., Ciocchi, S., Wall, N. R., Ponnusamy, R., . . . et  al. (2010). Genetic dissection of an amygdala microcircuit that gates conditioned fear. Nature 468, 270–U230. He, M., Liu, Y., Wang, X., Zhang, M. Q., Hannon, G. J., & Huang, Z. J. (2012). Cell-type-based analysis of microRNA profiles in the mouse brain. Neuron 73, 35–48. Heaney, J. D., Rettew, A. N., & Bronson, S. K. (2004). Tissue-specific expression of a BAC transgene

targeted to the Hprt locus in mouse embryonic stem cells. Genomics 83, 1072–1082. Hegde, P. S., White, I. R., & Debouck, C. (2003). Interplay of transcriptomics and proteomics. Curr Opin Biotechnol 14, 647–651. Heiman, M., Schaefer, A., Gong, S., Peterson, J. D., Day, M., Ramsey, K. E., . . . et al. (2008). A translational profiling approach for the molecular characterization of CNS cell types. Cell 135, 738–748. Heintz, N. (2004). Gene expression nervous system atlas (GENSAT). Nature 7, 483–483. Heller, E. A., Zhang, W., Selimi, F., Earnheart, J. C., Ślimak, M. A., Santos-Torres, J., ... Heintz, N. (2012). The biochemical anatomy of cortical inhibitory synapses. PLoS One 7, e39572. Hempel, C. M., Sugino, K., & Nelson, S. B. (2007). A manual method for the purification of fluorescently labeled neurons from the mammalian brain. Nat Protoc 2, 2924–2929. Hillarp, N. A., Fuxe, K., & Dahlstrom, A. (1966). Demonstration and mapping of central neurons containing dopamine, noradrenaline, and 5-hydroxytryptamine and their reactions to psychopharmaca. Pharmacol Rev 18, 727–741. Hirrlinger, J., Scheller, A., Hirrlinger, P. G., Kellert, B., Tang, W., Wehr, M. C., . . . et  al. (2009). Split-cre complementation indicates coincident activity of different genes in vivo. PLoS One 4, e4286. Hollenback, S.M., Lyman, S., & Cheng, J. (2011). Recombineering-based procedure for creating BAC transgene constructs for animals and cell lines. Curr Protoc Mol Biol Chapter  23, Unit 23 14. Hunter, N. L., Awatramani, R. B., Farley, F. W., & Dymecki, S.M. (2005). Ligand-activated flpe for temporally regulated gene modifications. Genesis 41, 99–109. Hyden, H., & McEwen, B. (1966). A glial protein specific for the nervous system. Proc Natl Acad Sci U S A 55, 354–358. Indra, A. K., Warot, X., Brocard, J., Bornert, J. M., Xiao, J. H., Chambon, P., & Metzger, D. (1999). 
Temporally-controlled site-specific mutagenesis in the basal layer of the epidermis: comparison of the recombinase activity of the tamoxifen-inducible Cre-ERT and Cre-ERT2 recombinases. Nucleic Acids Res 27, 4324–4327. Islam, S., Kjallquist, U., Moliner, A., Zajac, P., Fan, J. B., Lonnerberg, P., & Linnarsson, S. (2011). Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res 21, 1160–1167. Jaenisch, R. (1976). Germ line integration and Mendelian transmission of the exogenous Moloney leukemia virus. Proc Natl Acad Sci U S A 73, 1260–1264.

Jensen, K. B., & Darnell, R. B. (2008). CLIP: crosslinking and immunoprecipitation of in vivo RNA targets of RNA-binding proteins. Methods Mol Biol 488, 85–98. Johansson, C. B., Lothian, C., Molin, M., Okano, H., & Lendahl, U. (2002). Nestin enhancer requirements for expression in normal and injured adult CNS. J Res 69, 784–794. Jones, A. R., Overly, C. C., & Sunkin, S.M. (2009). The Allen Brain Atlas:  5  years and beyond. Nat Rev Neurosci 10, 821–828. Kam, C. M., Abuelyaman, A. S., Li, Z. Z., Hudig, D., & Powers, J. C. (1993). Biotinylated isocoumarins, new inhibitors and reagents for detection, localization, and isolation of serine proteases. Bioconjugate Chem 4, 560–567. Kang, H. J., Kawasawa, Y. I., Cheng, F., Zhu, Y., Xu, X., Li, M., . . . et al. (2011). Spatio-temporal transcriptome of the human brain. Nature 478, 483–489. Kislinger, T., Cox, B., Kannan, A., Chung, C., Hu, P., Ignatchenko, A., . . . et  al. (2006). Global survey of organ and organelle protein expression in mouse:  combined proteomic and transcriptomic profiling. Cell 125, 173–186. Kranz, A., Fu, J., Duerschke, K., Weidlich, S., Naumann, R., Stewart, A.F., & Anastassiadis, K. (2010). An improved Flp deleter mouse in C57Bl/6 based on Flpo recombinase. Genesis 48, 512–520. Kriaucionis, S., & Heintz, N. (2009). The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 324, 929–930. Kuhn, A., Thu, D., Waldvogel, H. J., Faull, R. L., & Luthi-Carter, R. (2011). Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain. Nat Methods 8, 945–947. Lai, T., Jabaudon, D., Molyneaux, B. J., Azim, E., Arlotta, P., Menezes, J.R.L., & Macklis, J. D. (2008). SOX5 controls the sequential generation of distinct corticofugal neuron subtypes. Neuron 57, 232–247. Laird, P.W. (2010). Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet 11, 191–203. Lam, K. S., Aman, M. G., & Arnold, L. E. (2006). 
Neurochemical correlates of autistic disorder:  a review of the literature. Res Dev Disabil 27, 254–289. Lashley, K. S. (1930). Basic neural mechanisms in behavior. Psychol Rev 37, 1–24. Leclerc, N., Schwarting, G. A., Herrup, K., Hawkes, R., & Yamamoto, M. (1992). Compartmentation in mammalian cerebellum:  Zebrin II and P-path antibodies define three classes of sagittally

organized bands of Purkinje cells. Proc Natl Acad Sci U S A 89, 5006–5010. Lee, C. K., Sunkin, S. M., Kuan, C., Thompson, C. L., Pathak, S., Ng, L., . . . et  al. (2008). Quantitative methods for genome-scale analysis of in situ hybridization and correlation with microarray data. Genome Biol 9, R23. Lee, S. H., DeCandia, T. R., Ripke, S., Yang, J., Sullivan, P. F., Goddard, M. E., . . . et  al. (2012). Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nature Genet 44, 247–U235. Lein, E. S., Hawrylycz, M. J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., Boe, A. F., . . . et  al. (2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176. Letzkus, J. J., Wolff, S.B.E., Meyer, E.M.M., Tovote, P., Courtin, J., Herry, C., & Luthi, A. (2011). A disinhibitory microcircuit for associative fear learning in the auditory cortex. Nature 480, 331–U376. Lin, J. Y. (2011). A user’s guide to channelrhodopsin variants: features, limitations and future developments. Exp Physiol 96, 19–25. Lobo, M. K., Karsten, S. L., Gray, M., Geschwind, D. H., & Yang, X. W. (2006). FACS-array profiling of striatal projection neuron subtypes in juvenile and adult mouse brains. Nat Neurosci 9, 443–452. Looger, L. L., & Griesbeck, O. (2012). Genetically encoded neural activity indicators. Curr Opin Neurobiol 22, 18–23. Madisen, L., Zwingman, T. A., Sunkin, S. M., Oh, S. W., Zariwala, H. A., Gu, H., . . . et al. (2010). A robust and high-throughput Cre reporting and characterization system for the whole mouse brain. Nat Neurosci 13, 133–U311. Magdaleno, S., Jensen, P., Brumwell, C. L., Seal, A., Lehman, K., Asbury, A., . . . et al. (2006). BGEM: an in situ hybridization database of gene expression in the embryonic and adult mouse nervous system. PLoS Biol 4, e86. Mary, P., Dauphinot, L., Bois, N., Potier, M. C., Studer, V., & Tabeling, P. (2011). 
Analysis of gene expression at the single-cell level using microdroplet-based microfluidic technology. Biomicrofluidics 5. Masseck, O. A., Rubelowski, J. M., Spoida, K., & Herlitze, S. (2011). Light- and drug-activated G-protein-coupled receptors to control intracellular signalling. Exp Physiol 96, 51–56. Maunakea, A. K., Nagarajan, R. P., Bilenky, M., Ballinger, T. J., D’Souza, C., Fouse, S.D., . . . et  al. (2010). Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 466, 253–U131.

Miller, E. K., & Wilson, M. A. (2008). All my circuits: using multiple electrodes to understand functioning neural networks. Neuron 60, 483–488. Mills, A. A., & Bradley, A. (2001). From mouse to man: generating megabase chromosome rearrangements. Trends Genet 17, 331–339. Molyneaux, B. J., Arlotta, P., Hirata, T., Hibi, M., & Macklis, J. D. (2005). Fezl is required for the birth and specification of corticospinal motor neurons. Neuron 47, 817–831. Muers, M. (2011). Functional genomics: The modENCODE guide to the genome. Nat Rev Genet 12. Nrg2942. Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., . . . et al. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 9, e1001046. Nakatani, J., Tamada, K., Hatanaka, F., Ise, S., Ohta, H., Inoue, K., . . . et al. (2009). Abnormal behavior in a chromosome-engineered mouse model for human 15q11-13 duplication seen in autism. Cell 137, 1235–1246. Nelson, S. B., Sugino, K., & Hempel, C. M. (2006). The problem of neuronal cell types: a physiological genomics approach. Trends Neurosci 29, 339–345. NHGRI (2012). http://www.genome.gov/sequencingcosts/. Nunzi, M. G., Shigemoto, R., & Mugnaini, E. (2002). Differential expression of calretinin and metabotropic glutamate receptor mGluR1alpha defines subsets of unipolar brush cells in mouse cerebellum. J Comp Neurol 451, 189–199. Oberdick, J., Smeyne, R. J., Mann, J. R., Zackson, S., & Morgan, J. I. (1990). A promoter that drives transgene expression in cerebellar Purkinje and retinal bipolar neurons. Science 248, 223–226. Okaty, B. W., Miller, M. N., Sugino, K., Hempel, C. M., & Nelson, S. B. (2009). Transcriptional and electrophysiological maturation of neocortical fast-spiking GABAergic interneurons. J Neurosci 29, 7040–7052. Oldham, M. C., Horvath, S., & Geschwind, D. H. (2006).
Conservation and evolution of gene coexpression networks in human and chimpanzee brains. Proc Natl Acad Sci U S A 103, 17973–17978. Oldham, M.C., Konopka, G., Iwamoto, K., Langfelder, P., Kato, T., Horvath, S., and Geschwind, D.H. (2008). Functional organization of the transcriptome in human brain. Nat Neurosci 11, 1271–1282. Oyibo, H., Cao, G., Zhan, H., Koulakov, A., Enquist, L., Dubnau, J., & Zador, A. (2011). Probing the connectivity of neural circuits at single-neuron resolution using high-throughput DNA sequencing. Paper presented at the Computational and

Systems Neuroscience Meeting (Cosyne) (Salt Lake City, Utah, Nature Proceedings). Palmiter, R. D., & Brinster, R. L. (1986). Germ-line transformation of mice. Annu Rev Genet 20, 465–499. Passman, R. S., & Fishman, G. I. (1994). Regulated expression of foreign genes in vivo after germline transfer. J Clin Invest 94, 2421–2425. Patti, G. J., Woo, H. K., Yanes, O., Shriver, L., Thomas, D., Uritboonthai, W., . . . Siuzdak, G. (2010). Detection of carbohydrates and steroids by cation-enhanced nanostructure-initiator mass spectrometry (NIMS) for biofluid analysis and tissue imaging. Anal Chem 82, 121–128. Patti, G. J., Yanes, O., & Siuzdak, G. (2012). Innovation:  Metabolomics:  the apogee of the omics trilogy. Nat Rev Mol Cell Biol 13, 263–269. Penfield, W., & Rasmussen, T. (1950). The cerebral cortex of man; a clinical study of localization of function. Oxford, UK: Macmillan. Portales-Casamar, E., Swanson, D. J., Liu, L., de Leeuw, C. N., Banks, K. G., Ho Sui, S. J., . . . et al. (2010). A regulatory toolbox of MiniPromoters to drive selective expression in the brain. Proc Natl Acad Sci U S A 107, 16589–16594. Poser, I., Sarov, M., Hutchins, J. R., Heriche, J. K., Toyoda, Y., Pozniakovsky, A., . . . et al. (2008). BAC TransgeneOmics:  a high-throughput method for exploration of protein function in mammals. Nat Methods 5, 409–415. Ragan, T., Kadiri, L. R., Venkataraju, K. U., Bahlmann, K., Sutin, J., Taranda, J., . . . Osten, P. (2012). Serial two-photon tomography for automated ex vivo mouse brain imaging. Nat Methods 9, 255–258. Ramón y Cajal, S., Pasik, P., & Pasik, T. (1899). Texture of the nervous system of man and the vertebrates (pp. v, 1‒3).Vienna and New York: Springer. Rodriguez, C. I., Buchholz, F., Galloway, J., Sequerra, R., Kasper, J., Ayala, R., . . . Dymecki, S.M. (2000). High-efficiency deleter mice show that FLPe is an alternative to Cre-loxP. Nat Genet 25, 139–140. Rolls, A., Colas, D., Adamantidis, A., Carter, M., Lanre-Amos, T., Heller, H. 
C., & de Lecea, L. (2011). Optogenetic disruption of sleep continuity impairs memory consolidation. Proc Natl Acad Sci U S A 108, 13305–13310. Roth, L. J., & Barlow, C. F. (1961). Drugs in the brain. Science 134, 22–31. Saez, E., No, D., West, A., & Evans, R.M. (1997). Inducible gene expression in mammalian cells and transgenic mice. Curr Opin Biotechnol 8, 608–616. Saito, K., Barber, R., Wu, J. Y., Matsuda, T., Roberts, E., & Vaughn, J. E. (1974). Immunohistochemical Localization of Glutamate Decarboxylase in Rat Cerebellum. Proc Natl Acad Sci U S A 71, 269–273.

Sanz, E., Yang, L., Su, T., Morris, D. R., McKnight, G. S., & Amieux, P. S. (2009). Cell-type-specific isolation of ribosome-associated mRNA from complex tissues. Proc Natl Acad Sci U S A 106, 13939–13944. Scalbert, A., Brennan, L., Fiehn, O., Hankemeier, T., Kristal, B. S., van Ommen, B., . . . Wopereis, S. (2009). Mass-spectrometry-based metabolomics:  limitations and recommendations for future progress with particular focus on nutrition research. Metabolomics 5, 435–458. Schmidt, E. F., Warner-Schmidt, J. L., Otopalik, B. G., Pickett, S. B., Greengard, P., & Heintz, N. (2012). Identification of the Cortical Neurons that Mediate Antidepressant Responses. Cell 149, 1152–1163. Selimi, F., Cristea, I. M., Heller, E., Chait, B. T., & Heintz, N. (2009). Proteomic studies of a single CNS synapse type: the parallel fiber/purkinje cell synapse. PLoS Biol 7, e83. Shcherbo, D., Murphy, C. S., Ermakova, G. V., Solovieva, E. A., Chepurnykh, T. V., Shcheglov, A. S., . . . et al. (2009). Far-red fluorescent tags for protein imaging in living tissues. Biochem J 418, 567–574. Shen-Orr, S. S., Tibshirani, R., Khatri, P., Bodian, D. L., Staedtler, F., Perry, N. M., . . . Butte, A.J. (2010). Cell type-specific gene expression differences in complex tissues. Nat Methods 7, 287–289. Shimogori, T., Lee, D. A., Miranda-Angulo, A., Yang, Y., Wang, H., Jiang, L., . . . et al. (2010). A genomic atlas of mouse hypothalamic development. Nat Neurosci 13, 767–775. Shimshek, D. R., Kim, J., Hubner, M. R., Spergel, D. J., Buchholz, F., Casanova, E., . . . Sprengel, R. (2002). Codon-improved Cre recombinase (iCre) expression in the mouse. Genesis 32, 19–26. Shuen, J. A., Chen, M., Gloss, B., & Calakos, N. (2008). Drd1a-tdTomato BAC transgenic mice for simultaneous visualization of medium spiny neurons in the direct and indirect pathways of the basal ganglia. J Neurosci 28, 2681–2685. Siegel, R. W., Jain, R., & Bradbury, A. (2001). 
Using an in vivo phagemid system to identify non-compatible loxP sequences (vol. 499, p. 147). FEBS Lett 505, 466–473. Smedley, D., Salimova, E., & Rosenthal, N. (2011). Cre recombinase resources for conditional mouse mutagenesis. Methods 53, 411–416. Sotelo, C. (2003). Viewing the brain through the master hand of Ramon y Cajal. Nat Rev Neurosci 4, 71–77. Strack, R. L., Strongin, D. E., Bhattacharyya, D., Tao, W., Berman, A., Broxmeyer, H. E., . . . Glick, B.S. (2008). A noncytotoxic DsRed variant for whole-cell labeling. Nat Methods 5, 955–957.

Sutherland, G. R., & McNaughton, B. (2000). Memory trace reactivation in hippocampal and neocortical neuronal ensembles. Curr Opin Neurobiol 10, 180–186. Takada, M., Sugimoto, T., & Hattori, T. (1993). Tyrosine hydroxylase immunoreactivity in cerebellar Purkinje cells of the rat. Neurosci Lett 150, 61–64. Taniguchi, H., He, M., Wu, P., Kim, S., Paik, R., Sugino, K., . . . et al. (2011). A resource of Cre driver lines for genetic targeting of GABAergic neurons in cerebral cortex. Neuron 71, 995–1013. Terskikh, A., Fradkov, A., Ermakova, G., Zaraisky, A., Tan, P., Kajava, A. V., . . . et al. (2000). “Fluorescent timer”:  protein that changes color with time. Science 290, 1585–1588. Tian, L., Akerboom, J., Schreiter, E.R., & Looger, L. L. (2012). Neural activity imaging with genetically encoded calcium indicators. Prog Brain Res 196, 79–94. van der Weyden, L., & Bradley, A. (2006). Mouse chromosome engineering for modeling human disease. Annu Rev Genom Hum Genet 7, 247–276. Verhaak, R. G., Hoadley, K. A., Purdom, E., Wang, V., Qi, Y., Wilkerson, M. D., . . . et al. (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110. Visel, A., Thaller, C., & Eichele, G. (2004). GenePaint. org:  an atlas of gene expression patterns in the mouse embryo. Nucleic Acids Res 32, D552–556. Voineagu, I., Wang, X., Johnston, P., Lowe, J. K., Tian, Y., Horvath, S., . . . Geschwind, D.H. (2011). Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature 474, 380–384. Waddington, C. H. (2012). The epigenotype. 1942. Int J Epidemiol 41, 10–13. Wall, N. R., Wickersham, I. R., Cetin, A., De La Parra, M., & Callaway, E. M. (2010). Monosynaptic circuit tracing in vivo through Cre-dependent targeting and complementation of modified rabies virus. Proc Natl Acad Sci USA 107, 21848–21853. Wang, J., Williams, R. W., & Manly, K. F. (2003). 
WebQTL:  web-based complex trait analysis. Neuroinformatics 1, 299–308. Warner-Schmidt, J. L., Schmidt, E. F., Marshall, J. J., Rubin, A. J., Arango-Lievano, M., Kaplitt, M. G., . . . Greengard, P. (2012). Cholinergic interneurons in the nucleus accumbens regulate depression-like behavior. Proc Natl Acad Sci U S A 109, 11360–11365. Weissman, T. A., Sanes, J. R., Lichtman, J. W., & Livet, J. (2011). Generating and imaging multicolor Brainbow mice. Cold Spring Harb Protoc 2011, 763–769.

Xu, X., & Dougherty, J. D. (2013). Cell Type Specific Analysis of Human Brain Transcriptome Data to Predict Alterations in Cellular Composition. Systems Biomedicine (accepted). Yang, X. W., Model, P., & Heintz, N. (1997). Homologous recombination based modification in Escherichia coli and germline transmission in transgenic mice of a bacterial artificial chromosome. Nat Biotechnol 15, 859–865. Yanushevich, Y. G., Gurskaya, N. G., Staroverov, D. B., Lukyanov, S. A., & Lukyanov, K. A. (2003). A natural fluorescent protein that changes its fluorescence color during maturation. Russ J Bioorg Chem 29, 325–329. Yuste, R. (2005). Origin and classification of neocortical interneurons. Neuron 48, 524–527. Zhang, X., Guo, C., Chen, Y., Shulha, H. P., Schnetz, M. P., LaFramboise, T., . . . et al. (2008). Epitope tagging of endogenous proteins for genome-wide ChIP-chip studies. Nat Methods 5, 163–165. Zhang, X. M., Chen, B. Y., Ng, A. H., Tanner, J. A., Tay, D., So, K. F., . . . Huang, J.D. (2005). Transgenic mice expressing Cre-recombinase specifically in retinal rod bipolar neurons. Invest Ophthalmol Vis Sci 46, 3515–3520. Zhao, S., Ting, J. T., Atallah, H. E., Qiu, L., Tan, J., Gloss, B., . . . et al. (2011a). Cell type-specific channelrhodopsin-2 transgenic mice for optogenetic dissection of neural circuitry function. Nat Methods 8, 745–752. Zhao, Y., Araki, S., Wu, J., Teramoto, T., Chang, Y. F., Nakano, M., . . . et al. (2011b). An expanded palette of genetically encoded Ca(2)(+) indicators. Science 333, 1888–1891. Zhong, Y., Wang, Q. J., Li, X., Yan, Y., Backer, J. M., Chait, B. T., . . . Yue, Z. (2009). Distinct regulation of autophagic activity by Atg14L and Rubicon associated with Beclin 1-phosphatidylinositol-3-kinase complex. Nat Cell Biol 11, 468–476. Zhou, H., Huang, C., Yang, M., Landel, C. P., Xia, P. Y., Liu, Y.J., & Xia, X. G. (2009). Developing tTA transgenic rats for inducible and reversible gene expression. Int J Biol Sci 5, 171–181.

12 Neuroscience and Metabolomics
REZA M. SALEK

INTRODUCTION: WHAT IS METABOLOMICS?

Metabolomics, also known as metabonomics or metabolic profiling, is an emerging field that studies metabolic changes, which represent a snapshot of the metabolic dynamics in a living cell or organ. Metabolic changes are the response of living systems to pathophysiological stimuli, to the surrounding environment, or to a genetic modification, giving metabolomics a unique perspective as a diagnostic tool.1–3 Metabolomics is defined as the study of “the complete set of metabolites/low-molecular-weight intermediates that are context-dependent, varying according to the physiological, developmental, or pathological state of the cell, tissue, organ or organism.”4 Jeremy Nicholson has suggested an alternative definition for metabolomics as “the measurement of metabolite concentrations and fluxes and secretion in cells and tissues in which there is a direct connection between the genetic activity, protein activity and the metabolic activity itself,”5 while metabonomics is “the quantitative measurement of the multivariate metabolic responses of multicellular systems to pathophysiological stimuli or genetic modification.”1,5 In the rest of this chapter, only the term metabolomics is used.

The metabolome represents the collection of all the metabolites in a biological organism; these represent the end products of gene expression and protein synthesis.6 Metabolomics provides a unique insight into the state of health and metabolic functioning of an organism at a particular moment in time. This approach also has the potential to yield critical biomarkers for predicting individual susceptibility to adverse side effects or to successful treatment responses to therapeutics (investigation into which is known as pharmacometabolomics) and could eventually lead to more personalized medicine. Several metabolomics studies have been carried out in the field of neuroscience; this chapter summarizes some of the most important research.

Metabolomics and Brain Disorders

Metabolomics techniques have been successfully applied in profiling, biomarker discovery, and metabophenotyping in various tissues, including brain tissues and biofluids, for various disorders in both humans and animal models. The use of biofluids in the diagnosis of brain disorders remains challenging owing to the blood-brain barrier, which encapsulates the central nervous system and selectively restricts the movement of various metabolites. Despite this challenge, there have been reports of using blood plasma to detect certain metabolites or proteins as markers for neurological disorders, both in cases where the blood-brain barrier is intact and where it is affected by the disorder. For more on this, see reference 7. Alternatively, biofluids such as cerebrospinal fluid (CSF) have been extensively used to discover markers for neurological disorders.8 CSF has long been perceived as a source of potential biomarkers of neurological disorders since it is in direct contact with the extracellular space of the brain. The CSF metabolites have been captured in the CSF metabolome database (http://www.csfmetabolome.ca/), which contains detailed information about small-molecule metabolites found in human CSF.8 At the time of writing in 2012, the database contains about 470 detectable small-molecule metabolites from human CSF, with 1,650 concentration values associated with different conditions and disorders. However, updated numbers of CSF metabolites have been reported in version 3.0 of the Human Metabolome Database (http://www.hmdb.ca/), increasing the number of metabolites in CSF to 1,131 out of a total of 40,214 metabolites detected in humans, including both water- and lipid-soluble metabolites.9 The metabolites detected range from relatively high concentrations (1 μM and above) to low concentrations (less than 1 nM). The CSF metabolome database includes literature and experimentally derived chemical data, clinical data, and molecular/biochemistry data. All metabolite-related information is captured in what has been termed the MetaboCard entry, containing more than 110 data fields; many of these are connected to other relevant databases.9

Brain tissues can also be used directly in profiling various brain disorders and in looking for potential biomarkers. However, besides ethical considerations, a number of challenges are associated with the use of human tissue. First, the tissue samples are rarely frozen rapidly after death, resulting in some inevitable degradation of metabolites or change in metabolites due to cell death. This is especially important in dealing with brain tissue, which is among the most metabolically active tissues (see reference 10). Second, it can be difficult to distinguish differences between control and disease samples that are not directly pathologically driven but are due to other factors, such as sample collection, age differences, gender, genetic background, and lifestyle. In addition, human patients with the disease are likely to have undergone prolonged treatment with various medications (for example, dopamine agonists and levodopa therapy for Parkinson’s patients), which strongly contribute to observed differences between control and disease groups.
One way to overcome these challenges is to use animal models of neurological disorders to obtain better control over parameters such as genetic variation, mode of death, and environmental factors as well as sample collection and preparation in order to avoid any change or loss of metabolites due to sample handling and degradation. This is particularly important for time-sensitive and degrading metabolites in an active tissue, where degradation leads mostly to loss of metabolites involved in energy regulation, such as phosphates, glycogen, and glucose.11,12 For example, conversion of glucose downstream to lactate or of N-acetylaspartate to acetate is indicative of brain degradation over time and is well documented.13 Several animal models are available; for example, transgenic mice have been developed on the basis of genes identified as linked to neurological disorders or neurodegeneration. However, most models capture only the aspects of the disease associated with a particular gene, or reproduce only some aspects of the disease. Hence a metabolomics study in such a model would reflect this gene/metabolome relationship rather than the disease itself. It could also be the case that the true disease mechanism is not yet fully known or is the result of several genes interacting over a lifetime. The transgenic animal models of human neurodegenerative diseases, including Alzheimer’s disease (AD), amyotrophic lateral sclerosis (ALS), Huntington’s disease (HD), and Parkinson’s disease (PD), and their associated loci and genes are reviewed in greater detail in references 14 and 15.

Metabolomics Tools Used in Brain Disorders

Several different analytical tools have been used for metabolomic studies of neurological disorders of the brain. Examples include capillary electrophoresis mass spectrometry,16 liquid chromatography coupled with electrochemical coulometric array detection,17 high-performance liquid chromatography coupled with electrospray ionization time-of-flight mass spectrometry,18 gas chromatography coupled to mass spectrometry,19 nuclear magnetic resonance (NMR) spectroscopy,20 high-resolution magic angle spinning NMR,21 and many more. The most commonly used analytical techniques fall broadly into two classes: mass spectrometry (MS) and NMR spectroscopy. Both NMR and MS methods can be automated (both in terms of acquiring data and processing the results) with high-throughput sample handling. In terms of sensitivity, NMR is less sensitive than MS and can detect metabolites at mM or μM concentrations, while MS techniques are far more sensitive, working in the μM-to-nM range. However, the more sensitive the technique, the more challenging it is to identify the compounds and metabolites; to date, a large number of metabolites remain unidentified (the number of ionized atoms and molecules detected could reach tens of thousands). To identify metabolites efficiently, MS often requires an additional separation step, such as gas chromatography or liquid chromatography22 coupled with the MS.
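The sensitivity ranges quoted above can be turned into a rough lookup that suggests which platform is likely to see a metabolite at a given concentration. The following is a minimal sketch only: the numeric bounds in `DETECTION_RANGES` and the function name `platforms_for` are illustrative assumptions derived from the approximate ranges in the text, not instrument specifications.

```python
# Assumed detection windows in mol/L, loosely following the text:
# NMR covers roughly the micromolar-to-millimolar range, MS reaches down to nanomolar.
# The exact bounds are hypothetical and chosen only for illustration.
DETECTION_RANGES = {
    "NMR": (1e-6, 1e-1),
    "MS": (1e-9, 1e-1),
}


def platforms_for(concentration_molar):
    """Return the platforms whose assumed window covers the given concentration."""
    return sorted(
        name
        for name, (low, high) in DETECTION_RANGES.items()
        if low <= concentration_molar <= high
    )


if __name__ == "__main__":
    for conc in (5e-4, 5e-9):
        print(conc, platforms_for(conc))
```

In practice detection limits depend on the instrument, the matrix, and the metabolite, so any such table would need to be calibrated for a specific setup.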

NMR Spectroscopy

NMR, in contrast to MS, is highly reproducible, noninvasive, and nondestructive to metabolites; it can potentially be used to detect stereoisomers and to perform chemical exchange analysis. Sample preparation for NMR spectroscopy is minimal compared with MS, and the cost per sample is relatively low. Low sensitivity is an inherent disadvantage of NMR spectroscopy: it detects metabolites in the mM and high μM range, at which a maximum of around 100 to 300 metabolites (reference 23) can be detected in urine samples and even fewer in serum and intact tissue samples.24 On the other hand, if an NMR study can determine a biomarker through the analysis of a relatively small number of the most abundant metabolites, this may be considered an advantage. Recent progress, such as increased magnetic field strengths and the use of cryogenically cooled probes and microprobes, has substantially improved the sensitivity of NMR spectroscopy.25,26 In addition, unlike MS, magnetic resonance spectroscopy (MRS) or HR-MAS can be used for in vivo measurements and studies on intact (not extracted) brain tissues. Any metabolite that is detected in vitro by NMR spectroscopy could in theory be followed up in vivo by MRS. Using NMR spectroscopy, investigators can also use isotopically labeled nuclei, such as 13C, to investigate the flow of metabolites through metabolic pathways.27

Mass Spectrometry

Compared with NMR spectroscopy, MS is a highly sample-destructive method and has lower experimental reproducibility. However, MS is inherently far more sensitive than NMR and can be used not only as a global profiling tool but also for targeted assays looking at specific compounds (e.g., amino acids or lipids). The targeted approach requires sample preparation in advance to select a group of metabolites (e.g., aqueous or lipid) or to tailor the analytes for the detection of certain chemical groups.
Types of sample preparation include liquid-liquid extraction, solid-phase extraction, and membrane-assisted extraction methods, such as dialysis. Some biofluids, for example CSF, can be injected directly into the MS with minimal sample preparation, as in direct-injection electrospray ionization mass spectrometry (DI-MS); ideally, in this case fewer metabolites are lost.28 A separation step, such as liquid or gas chromatography or capillary electrophoresis, is commonly used to separate the metabolites prior to mass analysis (e.g., LC-ESI-MS). Multidimensional separation techniques such as two-dimensional GC and LC (GC x GC and LC x LC) have further enabled the separation of complex biological mixtures. Each methodology has its own advantages and disadvantages; these would require a comprehensive review in themselves, which is beyond the scope of this chapter. For a complete review, see reference 29. Multiple steps of MS can also be used in what is known as tandem mass spectrometry, or MS/MS, or MS2, with some fragmentation occurring at different stages and resulting in the extra separation of ions. For further information on this, see reference 30.

Each analytical methodology can detect certain groups or types of metabolites from its biological matrix. One way of increasing the number of metabolites detected is to combine different analytical instruments to yield a better representation of the metabolome. David Wishart’s group used a combination of NMR, GC-MS, LC-MS, direct flow injection mass spectrometry (DFI-MS/MS), and inductively coupled plasma-mass spectrometry (ICP-MS) to quantify metabolites in human CSF samples. Using the first three platforms, 70 CSF metabolites were identified, with DFI-MS/MS adding a further 78 metabolites, while ICP-MS provided another 33 metal ions in CSF.31 In total, together with literature data mining, 57 more metabolites were identified, reaching a total of 476 compounds in the human CSF database.31
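The idea of combining platforms to widen metabolite coverage can be sketched with simple set operations. This is an illustrative sketch only: the function name `combine_platforms` and the metabolite lists are hypothetical and are not taken from the CSF study described above.

```python
def combine_platforms(detections):
    """Merge per-platform detections.

    detections maps a platform name to the set of metabolites it detected.
    Returns (all metabolites seen on any platform,
             metabolites seen on exactly one platform, keyed by that platform).
    """
    combined = set().union(*detections.values())
    unique = {}
    for platform, found in detections.items():
        # Everything detected by the other platforms.
        others = set().union(
            *(s for p, s in detections.items() if p != platform)
        )
        unique[platform] = found - others
    return combined, unique


# Hypothetical detections, for illustration only.
example = {
    "NMR": {"lactate", "glucose", "glutamine"},
    "GC-MS": {"lactate", "glycine", "glutamine"},
    "ICP-MS": {"zinc", "copper"},
}

if __name__ == "__main__":
    combined, unique = combine_platforms(example)
    print(len(combined), "metabolites in total")
    for platform, only_here in unique.items():
        print(platform, "adds", sorted(only_here))
```

Reporting what each platform contributes uniquely makes it easier to judge whether adding an instrument is worth its cost for a given sample type.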

Metabolomics of Animal Models

Animal models are important tools in experimental medicine to better understand the pathogenesis of human diseases. Much more control is possible in the study design of such experiments than in research involving humans. Animal models, in contrast to human samples, can be treated in a uniform way for both disease and control groups while having the same genetic background, thus eliminating any differences due to genetic predisposition, age, and medications. Several transgenic animal models ranging in biological complexity from yeast to fruit fly, mice, and larger mammals such as dogs, sheep, and monkeys have been used to study various neurodegenerative disorders and metabolomics. Large animal models provide a valuable complement to murine studies with regard to neurological disorders, since they have a longer life span, allowing for prolonged longitudinal studies, and their larger brain size, similar to a neonate’s, makes it easier to examine intraorgan variation. As an example, we have carried out a metabolomics investigation of CLN6 neuronal ceroid lipofuscinosis in affected South Hampshire sheep as an animal model of human Batten disease, using brain tissue and CSF samples.32 We used 1H-NMR and GC-MS to analyze our samples. The majority of metabolites detected in sheep models were similar to those previously reported in mouse models33 and showed similar metabolic cycles, with the advantage of longer longitudinal studies in the sheep model.

As a model of fundamental cellular processes and metabolic pathways in the human, yeast is also accelerating our understanding of disease and gene function, and it has numerous advantages. First, complete genomic sequences are available, along with genome-wide deletion mutant initiatives set up for Saccharomyces cerevisiae and Schizosaccharomyces pombe34 (see, for example, http://pombe.bioneer.co.kr/). Second, shuttle expression vectors harboring various selectable markers can be used with ease to introduce genes of interest into yeast cells. Third, the metabolomics-based technique FANCY (functional analysis by co-responses in yeast) has been developed to assign functions to these genes.2 For example, a metabolic fingerprinting of an existing yeast model of Batten disease looking for common pathways underlying disease progression is given in reference 35. Metabolomics is ideally placed as a phenotyping tool in the exploration of naturally occurring and transgenic animal models of neurodegeneration.
The refinement of knockout and knockin strategies, combined with increasing masses of sequence data, has accelerated the generation of accurate disease models. Moreover, large-scale mouse mutagenesis programs have been set up, producing thousands of mutants in need of analysis.36 Metabolomics can play a key role in resolving different metabophenotypes, especially when the result of mutation is not phenotypically apparent.


METABOLOMICS IN NEUROSCIENCE

Neurodegenerative Disease

Around the world, particularly in developed countries with increasing numbers of aging individuals, more people are being diagnosed with neurodegenerative diseases such as PD, AD, HD, Batten disease, and other forms of neurodegeneration. Battling neurodegenerative disease requires faster, more effective, and more reliable diagnosis in order to combat disease progression and implement effective treatment before extensive neuronal damage occurs. Unfortunately, to date, our understanding of such disorders is limited despite the progress in functional genomics that has determined gene functions related to the pathological mechanisms of disease.37 By the time most neurodegenerative diseases are diagnosed, significant neuronal damage has occurred, most of which is irreversible. In addition, the vast majority of existing therapies act only to slow disease progression, increasing the patient’s life span without reversing the neuropathological hallmarks. Neurodegenerative diseases are temporal diseases whereby aging is the single most significant risk factor. The pathogenic mechanisms associated with most neurological disorders add to the complexity of disease diagnosis and treatment. In terms of diagnosis, direct access to the affected brain region or tissue is virtually impossible during the patient’s life; examination of the brain cannot occur until the patient is deceased, thus limiting any histological classification and examination to postmortem. As a result, the discovery of biomarkers that can potentially assist with disease diagnosis or be used in monitoring treatment is vital with regard to neurological disorders.38,39 Metabolomic techniques can assist in addressing some of these issues.

Alzheimer’s Disease

Early diagnosis of AD is essential to ensure the best treatment outcome. Although there are no preventative treatments or known cures, a number of interventions have been shown to slow disease progression, and metabolomics promises to deliver enhanced methods for early diagnosis. For example, Kork and colleagues compared the proton NMR spectra of CSF from AD patients with those from healthy control subjects40 and found that specific multiplets were observed at 2.15 and 2.45 ppm in the disease group, an observation that was not made in most of the members of the healthy control group. The authors concluded that proton NMR spectroscopy is a promising tool for detecting signals that could serve as biomarkers for the early diagnosis of AD.40 In addition, Kaddurah-Daouk and colleagues used samples of postmortem ventricular CSF from a confirmed AD group and compared them with samples from a nondemented control group. Using liquid chromatography followed by coulometric array detection and quantitative analysis, they were able to identify alterations in the quantities of tyrosine, tryptophan, purine, and tocopherol in patients with AD. Reductions in norepinephrine and its related metabolites were also seen.41

Mild cognitive impairment (MCI) is considered to be a transition phase between normal aging and AD, indicating an increased risk of developing AD. In a prospective study, Orešič and associates sought to determine the serum metabolomic profiles associated with progression from MCI to AD. At baseline, the subjects enrolled in the study were classified into a healthy group, an MCI group, and an AD group. Subsequently, a substantial portion of the MCI group progressed to AD, and these subjects were distinguished in the analysis of the follow-up study.39 The authors found that at baseline the AD group was characterized by diminished phospholipids, phosphatidylcholines, sphingomyelins, and sterols. Based on the follow-up study, a molecular signature was identified that was predictive of progression to AD. The major contributor to the predictive model was 2,4-dihydroxybutanoic acid, which was upregulated in AD progressors, suggesting the involvement of hypoxia in early AD pathogenesis.39
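Comparing signals at particular chemical shifts, such as the multiplets at 2.15 and 2.45 ppm mentioned earlier in this section, typically starts by integrating the spectrum over a small window around each shift. The following is a minimal sketch of that step using a plain trapezoidal rule; the function name `integrate_region`, the default window half-width, and the toy spectrum are assumptions for illustration and do not reproduce the processing used in the cited study.

```python
def integrate_region(ppm, intensity, center, half_width=0.05):
    """Trapezoidal area of the spectrum within center +/- half_width (in ppm).

    ppm and intensity are equal-length sequences; the points need not be sorted.
    The default half_width is an arbitrary illustrative choice.
    """
    points = sorted(
        (x, y)
        for x, y in zip(ppm, intensity)
        if center - half_width <= x <= center + half_width
    )
    area = 0.0
    # Sum trapezoids between consecutive points inside the window.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


if __name__ == "__main__":
    # Toy spectrum (hypothetical values), not real CSF data.
    ppm = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]
    intensity = [0.0, 1.0, 1.0, 0.0, 0.5, 0.5, 0.0]
    for center in (2.15, 2.45):
        print(center, integrate_region(ppm, intensity, center, half_width=0.1))
```

Region integrals computed this way for each subject can then be compared between disease and control groups, usually after normalization to a reference signal or to total spectral area.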

Amyotrophic Lateral Sclerosis

ALS is a fatal neurodegenerative motor neuron disease caused by the degeneration of neurons in the ventral horn of the spinal cord, with most affected patients dying of respiratory compromise and pneumonia two to three years after disease onset. Little is known about the pathophysiological mechanisms involved in the development of ALS, and no reliable markers are available for patient evaluation. Visualization by 1H-MRS of the corresponding metabolic changes in the brains of patients with ALS may provide surrogate markers for early disease detection, monitoring progression, and evaluating response to treatment. In order to identify biomarkers that could be used in the early stages of ALS, Blasco and associates employed NMR-based metabolomics to compare the CSF of patients with ALS at the time of diagnosis with that of patients free of neurodegenerative diseases.42 The results showed a decreased concentration of acetate, while the concentrations of acetone, pyruvate, and ascorbate tended to be higher. The results suggest that proton NMR could be a useful and simple tool to improve the early diagnosis of this condition.42 These results were later confirmed using two-dimensional 1H-NMR spectroscopy to analyze samples of CSF from patients with ALS,42 again finding reduced levels of acetate with increased levels of acetone, pyruvate, and ascorbate.

Wuolikainen and coworkers used metabolomics based on gas chromatography coupled with time-of-flight mass spectrometry (GC/TOFMS) to analyze the CSF of patients with different subtypes of ALS, characterizing about 120 different small metabolites.19 They reported different metabolic profiles for different subgroups of patients. Patients with sporadic amyotrophic lateral sclerosis (SALS) had a heterogeneous metabolic profile in their CSF, some similar to the control group, while patients with familial amyotrophic lateral sclerosis (FALS) without the superoxide dismutase-1 gene (SOD1) mutation were less heterogeneous. Those with the SOD1 gene mutation formed a separate homogeneous group.19 The investigators also reported that glutamate and glutamine were reduced in patients with a familial predisposition. The difference in metabolite profile between patients with FALS, patients with SALS, and those carrying a mutation in the SOD1 gene is suggestive of subtypes of ALS involving different neurodegenerative processes.
Overall, the use of high-resolution 1H-MRS has been shown to be a sensitive spatial and temporal metabolite profiling tool in the presymptomatic phase of ALS, even before significant neuronal cell loss occurs.

Huntington’s Disease

HD is a genetic neurodegenerative disorder that leads to cognitive decline, psychiatric symptoms, and problems with muscular coordination. Genetic testing is available, so metabolic biomarkers are not needed for diagnosis. However, metabolomics can contribute to developing a greater understanding of the mechanism of the disease than is currently available and to the search for pharmaceutical treatments to slow or halt its progression. Tsang and colleagues used 1H-NMR spectroscopy to explore the effects of 3-nitropropionic acid (3-NP) on several brain regions in a rat model of the disease. They observed dose-dependent increases in succinate levels resulting from the 3-NP-induced inhibition of succinate dehydrogenase. In addition, taurine and gamma-aminobutyric acid (GABA) were decreased in the majority of brain regions, whereas altered lipid profiles were observed only in the globus pallidus and dorsal striatum.43 Many of the metabolic changes reported in the 3-NP-induced model animals, including reduced levels of phosphatidylcholine and elevated glycerol levels, were similar to those reported in HD, highlighting potential mechanisms of pathology of the disease.43

Spinocerebellar Ataxia

Spinocerebellar ataxia (SCA) is a genetic progressive degenerative disease with multiple different subtypes. Many of the SCAs are caused by expansions of CAG trinucleotide repeats, which encode abnormal stretches of polyglutamine. SCA3, or Machado-Joseph disease (MJD), is the commonest dominant SCA. Griffin and coworkers studied a transgenic mouse model of SCA3 that had previously been observed to show a mild progressive cerebellar deficit. NMR-based metabolomics was used in conjunction with multivariate pattern recognition to detect a number of metabolic perturbations in the mice, including a consistent increase in glutamine concentration in the cerebellum and the cerebrum. Both brain regions additionally demonstrated decreases in GABA, choline, phosphocholine, and lactate. The study emphasizes that high-resolution 1H-NMR spectroscopy coupled with pattern recognition may provide a rapid method of assessing the phenotype of animal models of human disease.44

Parkinson’s Disease

PD, a progressive bradykinetic disorder, is believed to be among the most common disorders of the elderly, affecting some 6 million people worldwide.45 PD is characterized by severe cell loss in the substantia nigra pars compacta and the accumulation of α-synuclein in the brainstem, spinal cord, and cortical regions. PD can be accurately diagnosed at a later stage of its progression, but many individuals with PD or related tremors remain undiagnosed or misdiagnosed in the early stages. Work with human tissue in the investigation of PD poses several challenges owing to the slow onset of the disease, its age dependence, and the fact that individuals are often on long-term medications, which makes it difficult to separate the effects of treatment from those of disease. However, several metabolomics studies have been carried out on animal models of PD. In one study that did use human samples, Johanson and colleagues investigated plasma samples from PD patients with the LRRK2 mutation (one of the most common known genetic causes of PD).46 The authors compared the metabolomic profile of the plasma of patients with PD caused by the G2019S LRRK2 mutation with that of asymptomatic family members (with or without the G2019S LRRK2 mutation) and that of patients with idiopathic PD. They found that both the idiopathic PD and the LRRK2 PD subjects could readily be separated from the control subjects. Both the LRRK2 and idiopathic PD patients showed significantly reduced levels of uric acid, hypoxanthine, and major metabolites of the purine pathway.46 The authors concluded that metabolomic profiling shows potential for predicting whether LRRK2 carriers will eventually develop PD.

1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP) is a neurotoxin commonly used to generate animal models of PD. MPTP is metabolized to 1-methyl-4-phenylpyridinium (MPP(+)), a mitochondrial toxicant of central dopamine (DA) neurons. Lehner and coworkers used LC/MS to measure MPTP and MPP(+) in the brains of 8-week-old mice, correlating the results with changes in DA measurements.
MPP(+) was detected in the nucleus accumbens (NA) and the striatum (ST), with the levels in the NA being three times as high as those in the ST.47 This approach has the advantage of allowing concurrent measurement of striatal DA, therefore enabling a direct correlation between accumulation of tissue MPP(+) and depletion of DA concentrations in discrete regions of the brain and yielding insight into the mechanisms of pathways known to be involved in PD.47 S100B is a protein involved in the maintenance and stimulation of neurons and glia cells, and increased levels of S100B have been reported in the serum of PD patients.48 Furthermore,

increased levels have also been detected in the substantia nigra and striatum of mice, correlated with the decrease in DA after the injection of MPTP. Liu and colleagues investigated the effects of S100B on the development of PD using high-performance liquid chromatography coupled with electrospray ionization time-of-flight mass spectrometry (HPLC/MS-ESI-TOF). They aimed to investigate the effects of S100B protein on mouse mesencephalon metabolite profiles. Twelve metabolites in S100B transgenic mice were identified as potential biomarkers. Of these, glutamic acid, GABA, tryptophan, phenylalanine, and histidine were related to the metabolic pathway of neurotransmitters in the CNS of mice and thus were believed to be candidates as markers for PD.18

Neuropsychiatric Disease

Many significantly debilitating mental disorders, including depression, schizophrenia, and bipolar disorder, are strongly correlated with alterations in neuronal and brain systems and structures and are known to be associated with certain genetic abnormalities. Furthermore, research has suggested the existence of correlations between various neuropsychiatric disorders and the metabolic syndrome, implicating common underlying pathogenic pathways.49 Many CNS disorders are linked with disturbances in metabolic pathways related to neurotransmitter systems, such as dopamine, serotonin, GABA, and glutamate. Also implicated are fatty acids (such as the arachidonic acid cascade), which are involved in oxidative stress and mitochondrial function.50 Metabolomics tools are enabling the mapping of perturbations in biochemical pathways and disease pathological processes, facilitating the development of disease-specific biomarkers.

Schizophrenia

Schizophrenia is one of the most common mental disorders, with symptoms ranging from auditory hallucinations and paranoia to the disintegration of thought processes and emotional responsiveness, all leading to increased physical health problems. At present, no biological test for disease diagnosis exists; therefore diagnosis is based on the observed behavior of the patient and the experiences he or she reports. A multi-omics study combining 1H-HR-MAS and 1H-NMR spectroscopy with transcriptomics and proteomics was used to investigate the

association of schizophrenia with metabolic deficits, including increased glycolytic flux compared with healthy controls and individuals with bipolar disorder.51 In a follow-up study, NMR-based metabolomics was used to investigate the metabolic profiles of CSF samples from patients with first-onset paranoid schizophrenia and healthy controls.52 Pattern recognition analysis showed a clear separation of the profiles of the patients with first-onset schizophrenia compared with those of the healthy controls.52 A metabolomics study using plasma samples was recently reported in which 103 metabolites were quantified and compared between healthy controls and schizophrenic patients; 25% of the latter were not taking antipsychotic medication.53 It is important to note that the effects of long-term treatment on metabolic profiles are significant; therefore they must be distinguished and differentiated from the metabolic profile typical of the disease. Of the 103 metabolites quantified, 5 were found to be significantly altered in schizophrenic patients and in neuroleptics-free probands respectively. These metabolites, forming candidate biomarkers for schizophrenia, include four amino acids (arginine, glutamine, histidine, and ornithine) and one lipid (PC ae C38:6).53 The authors further constructed a molecular network connecting these 5 aberrant metabolites with 13 schizophrenia risk genes, finding implicated risk factors in biosynthetic pathways linked to glutamine and arginine metabolism and associated signaling pathways. This finding may contribute to a better understanding of the underlying mechanisms of schizophrenic pathology and the associated memory deficits. In an effort to study these underlying pathological processes, Orešič and colleagues performed ultraperformance liquid chromatography coupled to time-of-flight mass spectrometry to obtain lipidomic profiles on serum samples from twin pairs discordant for schizophrenia as well as healthy twin pairs. 
Neurocognitive assessment data and gray matter density measurements taken from high-resolution magnetic resonance images were also obtained. Patients were found to have elevated triglyceride levels as compared with their healthy twins; they were also more insulin resistant and had diminished lysophosphatidylcholine levels, which was associated with decreased cognitive speed.54 This result may imply pathophysiological pathways, in that lysophosphatidylcholines are preferred carriers of polyunsaturated fatty acids across the

blood-brain barrier. Their association with cognitive speed supports the view that the altered neurotransmission patterns in schizophrenia may be partly mediated by reactive lipids such as prostaglandins.54

Bipolar Disorder

Bipolar disorder (BD) is a debilitating psychiatric condition in which patients suffer from alternating mood states of mania and depression accompanied in some cases by psychotic symptoms. Like other psychiatric disorders, the pathophysiology of BD and the mechanism of action of the drugs used to treat it are not yet well understood. Lan and associates used a 1H-NMR spectroscopy-based metabolomic analysis to identify molecular changes in postmortem brain tissue from the dorsolateral prefrontal cortex of patients who had a history of BD.55 They also used an animal model to determine the molecular signatures of lithium and valproate, two drugs commonly used to treat BD. Several metabolites showed alterations in both the animal and human samples. The human samples showed increased glutamate levels, while the glutamate/glutamine ratio was decreased in the animal models following valproate treatment.55 On the other hand, GABA levels were increased after lithium treatment. These results are concordant with the body of evidence suggesting that neurotransmission pathways are central to the disorder.

In another study, blood serum samples from patients with BD were subjected to metabolic profiling in order to search for molecular changes related to the disorder. Sussulini and associates used 1H-NMR spectroscopy followed by a chemometric analysis of serum samples from patients with BD and healthy controls, distinguishing the BD group further based on whether they were taking lithium or other medications.56 The authors report that they were able to distinguish the different groups based on their measured metabolic profiles, with differential metabolites including some lipids, lipid metabolism-related molecules (acetate, choline, and myo-inositol), and some key amino acids (glutamate, glutamine). The results suggest that some of the differential metabolites identified may be due to medication-induced metabolic changes while others may be directly related to the disorder.

Other Neurological Conditions

Rett syndrome (RS) is a leading cause of genetic mental retardation in girls, caused by mutations

in the Mecp2 gene. A metabolomics study in mice used brain extracts from Mecp2-deleted ("Mecp2-null") mice with high-resolution NMR to quantify individual water-soluble metabolites and phospholipids without prior selection for specific metabolic pathways.57 The study showed decreased levels of the astrocyte marker myo-inositol in the transgenic samples as compared with wild-type mice. It also showed reduced choline phospholipid turnover, implying a diminished potential of cells to grow, paralleled by globally reduced brain size and perturbed osmoregulation. Alterations of the cycle of platelet-activating factor, a bioactive lipid acting on neuronal growth, glutamate exocytosis, and other processes, were observed. Finally, changes in glutamine/glutamate ratios were observed, potentially indicating altered neurotransmitter recycling. These results establish metabolic fingerprints for perturbed brain growth, osmoregulation, and neurotransmission in a mouse model of RS, which may help to develop mechanistic links between genotype and phenotype for this important disorder.

Fragile X syndrome (FXS) is a genetically inherited intellectual disability, caused by the silencing of the X-linked fragile X mental retardation 1 (Fmr1) gene encoding the RNA-binding protein FMRP. A metabolomics study on brain compartments of a mouse model of the disease was performed using 1H-HR-MAS NMR spectroscopy.21 The authors showed that Fmr1 gene inactivation has a profound effect on mouse brain metabolism, leading to alterations in neurotransmitter levels and in metabolites involved in osmoregulation, energy metabolism, and the response to oxidative stress. The metabolomics study was combined with a functional interaction network, compiled from existing knowledge in the literature and databases, to yield a systems-wide model that functionally connected Fmr1 deficiency to its metabolic biomarkers. With this model, the authors showed that the metabolic response was initiated by various mRNA targets and proteins interacting with FMRP and was subsequently relayed by regulatory proteins. This "integrated metabolome and interactome mapping" has the promise of unifying novel metabolic findings with existing knowledge in order to highlight the role of specific cellular pathways in the pathophysiology of FXS, contributing to the identification of novel targets for therapeutic intervention.21

Adult-onset hypothyroidism (AOH) has been connected to alterations in neural activity leading to mental dysfunction. However, the underlying changes in brain metabolism and neural pathways are not well understood. To investigate this question, Constantinou and coworkers used a GC-MS metabolomics protocol with an induced Balb/cJ mouse model of AOH.58 Their analysis implicated multiple metabolic phenomena, some of which had previously been linked to AOH while others appeared to implicate novel pathways. These phenomena included an overall decline in the metabolic activity of the hypothyroid compared with the euthyroid cerebellum, characteristically manifested as altered energy metabolism, glutamate/glutamine metabolism, osmolytic/antioxidant capacity, and protein/lipid synthesis. These alterations indicate that the mammalian cerebellum is metabolically responsive to AOH. Since the cerebellum is involved in many core brain functions including neurocognition, the results provide mechanistic insights into the known phenotypic manifestations of AOH.58

FUTURE OUTLOOK

Results of metabolomics experiments, including those relevant to neuroscience, are eventually published or stored in relevant databases. One such database is the MMMDB1.2 (Mouse Multiple tissue Metabolome DataBase, http://mmmdb.iab.keio.ac.jp/), a metabolomic database containing a collection of metabolites measured from multiple tissues, including cerebra and cerebella, from single mice.59 The datasets, which contain both identified and unknown metabolites, were collected using capillary electrophoresis time-of-flight mass spectrometry (CE-TOFMS).16 A database for human CSF metabolites has already been mentioned, capturing the CSF metabolome (http://www.csfmetabolome.ca/); it contains detailed information about small-molecule metabolites found in human CSF.8 A more general example is MetaboLights (http://www.ebi.ac.uk/metabolights), a general-purpose open-access repository for metabolomics studies, their raw experimental data, and associated metadata.60,62 The results of investigations of metabolic perturbations in brain disorders involving biofluids or tissues from humans or animal models are stored in such databases. These data comprise sets of key metabolites or candidate metabolic biomarkers that can be used to

classify brain disorders. They can also be used as prognostic markers for early diagnosis or to suggest mechanisms of the disease by implicating affected pathways. In addition, metabolic changes in neurodegenerative or other neurological disorders have the potential to be used for in vivo monitoring of disease progression using MRS. The effectiveness of this depends on the availability of public open data across a broad range of experimental methods and conditions. As the volume of such data increases, so too will the accuracy of the results that can be obtained therewith and our understanding of metabolic perturbation linked to neurological disorders.

CONCLUSION AND SUMMARY

Metabolomics-based research in neuroscience is growing steadily; however, in comparison with other -omics or even other areas researched by metabolomics (for example, diabetes), it has far fewer publications. The progress observed within each generation of analytical instruments is considerable and resembles an arms race among instrument manufacturers. All new instruments claim significant improvements in their sensitivity, reliability, accuracy, and reproducibility of the results within their field of analytical assay. On the other hand, new animal models of particular diseases are constantly being introduced, claiming closer and better resemblance to the actual human ailment. Data handling and statistical analysis methodologies are also progressing rapidly. All of this can lead to the achievement of better and more accurate statistical/computational models of disease and bring us a step closer to more complete representations of neurological disorders. We can also improve our understanding of a particular disorder by combining different -omics, using the predictive power of each technique to combine and interpret the results at biological and statistical levels. These methods are also known as data fusion or systems biology approaches, promising greater understanding of neurological disorders. One such example is the investigation of mitochondrial dysfunction in schizophrenia by Prabakaran and coworkers. Using metabolomics, proteomics, and transcriptomics in parallel and highlighting genes and metabolites related to energy metabolism and oxidative stress,

they were able to differentiate almost 90% of schizophrenic patients from the control group.51 Also, Davidovic and colleagues proposed a novel systems biology approach using "integrated metabolome and interactome mapping" to unify their metabolic findings with previously unrelated knowledge in order to highlight the contribution of novel cellular pathways to the pathophysiology of FXS.21 Another approach would be to apply metabolomics techniques using data fusion methods to several biofluid samples collected from the same patient.61 As the cost of carrying out a multi-omics study is falling steadily, and the techniques are becoming more readily available, it can be predicted that more such studies will be carried out in the future, providing a more complete overview of disease progression.

However, there remain challenges ahead. One example is differentiating the effect of long-term drug treatment on brain metabolism from that of disease progression. Access to human brain biobanks is also a limiting factor, as samples can be collected only after the patient is deceased, and accumulating large numbers of samples can take quite a long time. Sample collection itself is another issue: it requires an accurate assessment of sample degradation, which in turn depends on the type of surgical technique used to collect the sample, the age of the sample, and its postmortem handling and storage. These difficulties can constitute a large source of variation within the samples, eventually leading to discrepancies in the statistical/computational models used to predict metabolic profiles. Similar challenges also exist for other biofluids collected from patients, such as blood plasma and CSF, including the effects of diet, treatment, and lifestyle.

Nevertheless, overcoming these challenges would give us the opportunity to better understand the mechanism of disease onset and progression from a metabolomics point of view and its relationship to upstream processes such as the proteome and transcriptome (systems biology), leading to novel biomarkers of neurological disorders or helping with a particular treatment regimen, thus bringing the discipline a step closer to personalized medicine.

REFERENCES

1. Nicholson, J., J. Lindon, & E. Holmes. Metabonomics: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica, 1999. 29: 1181–1189.
2. Raamsdonk, L. M., et al. A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nature Biotechnology, 2001. 19(1): 45–50.
3. Goodacre, R., et al. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology, 2004. 22(5): 245–252.
4. Oliver, S. Functional genomics: all the king's horses and all the king's men can put humpty together again. Molecular Cell, 2003. 12(6): 1343–1344.
5. Nicholson, J. K., & I. D. Wilson. Understanding 'global' systems biology: metabonomics and the continuum of metabolism. Nature Reviews Drug Discovery, 2003. 2(8): 668–676.
6. Oliver, S. G., et al. Systematic functional analysis of the yeast genome. Trends in Biotechnology, 1998. 16(9): 373–378.
7. Griffin, J. L., & R. M. Salek. Metabolomic applications to neuroscience: more challenges than chances? Expert Reviews in Proteomics, 2007. 4(4): 435–437.
8. Wishart, D. S., et al. The human cerebrospinal fluid metabolome. Journal of Chromatography. B, Analytical Technologies in the Biomedical and Life Sciences, 2008. 871(2): 164–173.
9. Wishart, D. S., et al. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Research, 2009. 37(Database issue): D603–D610.
10. Griffin, J. L., & O. Corcoran. High-resolution magic-angle spinning 13C NMR spectroscopy of cerebral tissue. Magma, 2005. 18(1): 51–56.
11. McIlwain, H., & H. S. Bachelard. Biochemistry and the central nervous systems. volume 14, issue 1, page 46, 1985.
12. Scheurer, E., et al. Statistical evaluation of time-dependent metabolite concentrations: estimation of post-mortem intervals based on in situ 1H-MRS of the brain. NMR Biomedicine, 2005. 18(3): 163–172.
13. Tzika, A. A., et al. Combination of high-resolution magic angle spinning proton magnetic resonance spectroscopy and microscale genomics to type brain tumor biopsies. International Journal of Molecular Medicine, 2007. 20(2): 199–208.
14. Chesselet, M. F., & S. T. Carmichael. Animal models of neurological disorders. Neurotherapeutics, 2012. 9(2): 241–244.
15. Morsink, M. C., & D. F. Dukers. Teaching neurophysiology, neuropharmacology, and experimental design using animal models of psychiatric and neurological disorders. Advances in Physiology Education, 2009. 33(1): 46–52.
16. Soga, T., et al. Quantitative metabolome analysis using capillary electrophoresis mass spectrometry. Journal of Proteome Research, 2003. 2(5): 488–494.
17. Bogdanov, M., et al. Metabolomic profiling to develop blood biomarkers for Parkinson's disease. Brain: A Journal of Neurology, 2008. 131(Pt 2): 389–396.
18. Liu, J. L., et al. Metabonomics study of brain-specific human S100B transgenic mice by using high-performance liquid chromatography coupled with quadrupole time of flight mass spectrometry. Biological & Pharmaceutical Bulletin, 2011. 34(6): 871–876.
19. Wuolikainen, A., et al. Disease-related changes in the cerebrospinal fluid metabolome in amyotrophic lateral sclerosis detected by GC/TOFMS. PloS One, 2011. 6(4): e17947.
20. Salek, R. M., et al. A metabolomic study of brain tissues from aged mice with low expression of the vesicular monoamine transporter 2 (VMAT2) gene. Neurochemical Research, 2008. 33(2): 292–300.
21. Davidovic, L., et al. A metabolomic and systems biology perspective on the brain of the fragile X syndrome mouse model. Genome Research, 2011. 21(12): 2190–202.
22. Lindon, J. C., E. Holmes, & J. K. Nicholson. Metabonomics techniques and applications to pharmaceutical research & development. Pharmaceutical Research, 2006. 23(6): 1075–1088.
23. Connor, S. C., et al. Effects of feeding and body weight loss on the 1H-NMR-based urine metabolic profiles of male Wistar Han rats: implications for biomarker discovery. Biomarkers, 2004. 9(2): 156–179.
24. Rooney, O., et al. High-resolution diffusion and relaxation-edited magic angle spinning 1H NMR spectroscopy of intact liver tissue. Magnetic Resonance in Medicine, 2003. 50(5): 925–930.
25. Gruetter, R., et al. Resolution improvements in in vivo 1H NMR spectra with increased magnetic field strength. Journal of Magnetic Resonance, 1998. 135(1): 260–264.
26. Keun, H. C., et al. Cryogenic probe 13C NMR spectroscopy of urine for metabonomic studies. Analytical Chemistry, 2002. 74(17): 4588–4593.
27. Husted, C., et al. Carbon-13 "magic-angle" sample-spinning nuclear magnetic resonance studies of human myelin, and model membrane systems. Magnetic Resonance in Medicine, 1993. 29(2): 168–178.
28. Bai, F., et al. Determination of vandetanib in human plasma and cerebrospinal fluid by liquid chromatography electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS). Journal of Chromatography. B, Analytical Technologies in the Biomedical and Life Sciences, 2011. 879(25): 2561–2566.
29. Lei, Z., D. V. Huhman, & L. W. Sumner. Mass spectrometry strategies in metabolomics. The Journal of Biological Chemistry, 2011. 286(29): 25435–25442.
30. Dettmer, K., A. Aronov, & B. D. Hammock. Mass spectrometry-based metabolomics. Mass Spectrometry Reviews, 2007. 26(1): 51–78.
31. Mandal, R., et al. Multi-platform characterization of the human cerebrospinal fluid metabolome: a comprehensive and quantitative update. Genome Medicine, 2012. 4(4): 38.
32. Pears, M. R., et al. Metabolomic investigation of CLN6 neuronal ceroid lipofuscinosis in affected South Hampshire sheep. Journal of Neuroscience Research, 2007.
33. Pears, M. R., et al. High resolution 1H NMR-based metabolomics indicates a neurotransmitter cycling deficit in cerebral tissue from a mouse model of Batten disease. Journal of Biological Chemistry, 2005. 280(52): 42508–42514.
34. Dujon, B. European Functional Analysis Network (EUROFAN) and the functional analysis of the Saccharomyces cerevisiae genome. Electrophoresis, 1998. 19(4): 617–624.
35. Pearce, D. A., & F. Sherman. BTN1, a yeast gene corresponding to the human gene responsible for Batten's disease, is not essential for viability, mitochondrial function, or degradation of mitochondrial ATP synthase. Yeast, 1997. 13(8): 691–697.
36. Hrabe de Angelis, M. H., et al. Genome-wide, large-scale production of mutant mice by ENU mutagenesis. Nature Genetics, 2000. 25(4): 444–447.
37. Forman, M. S., J. Q. Trojanowski, & V. M. Lee. Neurodegenerative diseases: a decade of discoveries paves the way for therapeutic breakthroughs. Nature Medicine, 2004. 10(10): 1055–1063.
38. Shaw, L. M., et al. Biomarkers of neurodegeneration for diagnosis and monitoring therapeutics. Nature Reviews Drug Discovery, 2007. 6(4): 295–303.
39. Oresic, M., et al. Metabolome in progression to Alzheimer's disease. Translational Psychiatry, 2011. 1: e57.
40. Kork, F., et al. A possible new diagnostic biomarker in early diagnosis of Alzheimer's disease. Current Alzheimer Research, 2009. 6(6): 519–524.
41. Kaddurah-Daouk, R., et al. Metabolomic changes in autopsy-confirmed Alzheimer's disease. Alzheimer's & Dementia: The Journal of the Alzheimer's Association, 2011. 7(3): 309–317.
42. Blasco, H., et al. 1H-NMR-based metabolomic profiling of CSF in early amyotrophic lateral sclerosis. PloS One, 2010. 5(10): e13223.
43. Tsang, T. M., J. N. Haselden, & E. Holmes. Metabonomic characterization of the 3-nitropropionic acid rat model of Huntington's disease. Neurochemical Research, 2009. 34(7): 1261–1271.
44. Griffin, J. L., C. K. Cemal, & M. A. Pook. Defining a metabolic phenotype in the brain of a transgenic mouse model of spinocerebellar ataxia 3. Physiological Genomics, 2004. 16(3): 334–340.
45. Lees, A. J., J. Hardy, & T. Revesz. Parkinson's disease. Lancet, 2009. 373(9680): 2055–2066.
46. Johansen, K. K., et al. Metabolomic profiling in LRRK2-related Parkinson's disease. PloS One, 2009. 4(10): e7551.
47. Lehner, A., et al. Liquid chromatographic-electrospray mass spectrometric determination of 1-methyl-4-phenylpyridine (MPP+) in discrete regions of murine brain. Toxicology Mechanisms and Methods, 2011. 21(3): 171–182.
48. Wilhelm, K. R., et al. Immune reactivity towards insulin, its amyloid and protein S100B in blood sera of Parkinson's disease patients. European Journal of Neurology: The Official Journal of the European Federation of Neurological Societies, 2007. 14(3): 327–334.
49. Toalson, R. Ph., et al. The metabolic syndrome in patients with severe mental illnesses. Primary Care Companion to the Journal of Clinical Psychiatry, 2004. 6(4): 152–158.
50. Quinones, M. P., & R. Kaddurah-Daouk. Metabolomics tools for identifying biomarkers for neuropsychiatric diseases. Neurobiology of Disease, 2009. 35(2): 165–176.
51. Prabakaran, S., et al. Mitochondrial dysfunction in schizophrenia: evidence for compromised brain metabolism and oxidative stress. Molecular Psychiatry, 2004. 9(7): 684–697, 643.
52. Holmes, E., T. M. Tsang, & S. J. Tabrizi. The application of NMR-based metabonomics in neurological disorders. NeuroRx, 2006. 3(3): 358–372.
53. He, Y., et al. Schizophrenia shows a unique metabolomics signature in plasma. Translational Psychiatry, 2012. 2: e149.
54. Oresic, M., et al. Phospholipids and insulin resistance in psychosis: a lipidomics study of twin pairs discordant for schizophrenia. Genome Medicine, 2012. 4(1): 1.
55. Lan, M. J., et al. Metabonomic analysis identifies molecular changes associated with the pathophysiology and drug treatment of bipolar disorder. Molecular Psychiatry, 2009. 14(3): 269–279.
56. Sussulini, A., et al. Metabolic profiling of human blood serum from treated patients with bipolar disorder employing 1H NMR spectroscopy and chemometrics. Analytical Chemistry, 2009. 81(23): 9755–963.
57. Viola, A., et al. Metabolic fingerprints of altered brain growth, osmoregulation and neurotransmission in a Rett syndrome model. PloS One, 2007. 2(1): e157.
58. Constantinou, C., et al. GC-MS metabolomic analysis reveals significant alterations in cerebellar metabolic physiology in a mouse model of adult onset hypothyroidism. Journal of Proteome Research, 2011. 10(2): 869–879.
59. Sugimoto, M., et al. MMMDB: Mouse multiple tissue metabolome database. Nucleic Acids Research, 2012. 40(Database issue): D809–D814.
60. Haug, K., et al. MetaboLights – an open-access general-purpose repository for metabolomics studies and associated meta-data. Nucleic Acids Research, 2013. 41(D1): D781–D786.
61. Smolinska, A., et al. Simultaneous analysis of plasma and CSF by NMR and hierarchical models fusion. Analytical and Bioanalytical Chemistry, 2012. 403(4): 947–959.
62. Salek, R. M., et al. The MetaboLights repository: curation challenges in metabolomics. Database: The Journal of Biological Databases and Curation, 2013: bat029.

13 Brain Connectomics in Man and Mouse

ARTHUR W. TOGA, KRISTI CLARK, HONG WEI DONG, HOURI HINTIRYAN, PAUL M. THOMPSON, AND JOHN D. VAN HORN

INTRODUCTION

Historically, neuroimaging approaches to study the human brain typically adopted a modular view of the brain (e.g., region X is responsible for function Y); however, most scientists agree that the systems underlying many brain functions are widely distributed, involving connections between multiple systems. More recently, methods have emerged to study which networks are involved in a given function or which properties of a network as a whole change with development or disease. To study brain networks, one needs to start with a basic understanding of how the brain is organized. The brain's gray matter regions contain vast numbers of neuronal cell bodies that communicate with each other via the white matter, which is composed of fascicles, or axon bundles (Figure 13.1). It has not always been possible to measure properties of these anatomical networks in the living brain, as individual fascicles are not visible in a typical T1-weighted anatomical magnetic resonance image (MRI) (Figure 13.2, left). T1-weighted acquisitions are sensitive primarily to water content, so the major source of tissue contrast is the difference in water and lipid content between the gray and white matter. On a standard anatomical MRI scan, the white matter of the brain is merely seen as a homogeneous mass (i.e., individual axonal bundles are not discernible). Previously the only way to see the individual trajectories was through time- and labor-intensive postmortem freezing and dissection (Figure 13.2, right). In the last couple of decades, however, diffusion-based MRI (dMRI) has generated a great deal of interest, as it is currently the only method that can measure the underlying

white matter structure of the living human brain (Basser et al. 1994; Beaulieu 2002; Le Bihan 2003). dMRI can be used to estimate how gray matter regions are connected to each other via white matter bundles, using two main steps: (1) an estimation of the main direction of water diffusion within each voxel and (2) an estimation of fiber trajectories across voxels by means of a tractography (tract tracing) algorithm. Several methods exist for each step, with varying strengths and weaknesses. The spatial scale of voxels is on the order of millimeters, but axons are only several microns wide; therefore dMRI is currently a macroscale approximation that views the properties of axon bundles in aggregate. Also, none of the signals in dMRI shows the direction of the flow of information; therefore any estimation of a brain network derived from dMRI is necessarily directionless. Other functional assessments are needed to infer how information is transferred or propagated throughout the structural network.
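The two steps described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not any published tractography method: the 2D grid `field`, the functions `principal_direction` and `track`, the step size, and the seed are all hypothetical, and step 1 is reduced to looking up a precomputed unit direction per voxel. The sketch uses simple fixed-step (Euler) streamline following, and it also shows why voxel directions have to be treated as sign-ambiguous axes.

```python
import math

def principal_direction(field, x, y):
    """Step 1 (stand-in): return the unit direction stored for the voxel at (x, y)."""
    i, j = int(math.floor(x)), int(math.floor(y))
    if 0 <= i < len(field) and 0 <= j < len(field[0]):
        return field[i][j]
    return None  # the point lies outside the imaged grid

def track(field, seed, step=0.5, max_steps=200):
    """Step 2: follow voxel directions from a seed point with fixed-size Euler steps."""
    x, y = seed
    path = [(x, y)]
    prev = None
    for _ in range(max_steps):
        d = principal_direction(field, x, y)
        if d is None:
            break
        dx, dy = d
        # A diffusion direction is an axis, so its sign is arbitrary:
        # flip it when it points against the previous heading.
        if prev is not None and dx * prev[0] + dy * prev[1] < 0:
            dx, dy = -dx, -dy
        x, y = x + step * dx, y + step * dy
        path.append((x, y))
        prev = (dx, dy)
    return path

# Hypothetical 10 x 10 grid: every voxel is aligned with the x axis, but the
# stored sign alternates from column to column to mimic the sign ambiguity.
field = [[(1.0, 0.0) if i % 2 == 0 else (-1.0, 0.0) for _ in range(10)]
         for i in range(10)]

path = track(field, seed=(0.5, 4.5))
print(len(path), path[0], path[-1])
```

In this toy field the streamline runs straight across the grid despite the alternating signs. Real pipelines replace the lookup with a fitted diffusion model per voxel and add stopping criteria such as anisotropy and curvature thresholds.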

HISTORY OF DIFFUSION IMAGING

For the first thirty or so years after diffusion MRI was developed, it was used primarily by physicists and chemists to calculate the diffusion coefficients of pure liquids (Carr & Purcell 1954; Haacke et al. 1999; Hahn 1950; Stadnik et al. 2003). It was not until the 1980s that biologists first began to use dMRI to measure diffusion coefficients within biological tissues (Le Bihan et al. 1986; Wesbey et al. 1984a, 1984b). After that, it was not long before scientists realized that diffusion coefficients calculated from biological tissues depended on the directions of the magnetic pulses applied; in other words,

FIGURE 13.1: Basic building blocks of the human brain. Each neuron has many dendrites for receiving information, one cell body for processing information, and one axon for transmitting information to the dendrites of other neurons via synapses. Neurons are supported by three main types of glial cells: oligodendrocytes myelinate some axons to increase the speed of transmission, astrocytes provide nutrients to the neurons from the blood capillaries, and microglial cells form the immune system of the brain. Labeled structures: blood capillary, dendrites, astrocyte, oligodendrocyte, microglial cell, cell body, neuron, synapses, myelinated axon, internode, node of Ranvier.

Sources/notes: Image by Vaughan Greer, Laboratory of Neuro Imaging, UCLA (now at USC). Adapted from Edgar, J. M., & Griffiths, I. R. (2009). White matter structure: a microscopist's view. In Johansen-Berg, H., Behrens, T. E. (Eds.), Diffusion MRI: from quantitative measurement to in-vivo neuroanatomy (pp. 75–98). London: Academic Press.

the diffusion processes in biological tissues tend to be anisotropic (Basser et  al. 1994; Moseley et  al. 1990, 1991; Thomsen et  al. 1987). It was this observation that led scientists to ask:  What is the biological correlate of the anisotropy? Experiments by Moseley, Beaulieu, and others demonstrated that in the nervous system,

faster diffusion is observed when the relative orientation of the diffusion-sensitizing gradient is parallel to axons (whether myelinated or not) than when it is perpendicular (Beaulieu & Allen 1994; Moseley et  al. 1990). These experiments led to a veritable explosion of research in the effort to develop ever more sophisticated

In a typical T1-weighted acquisition (left), the white matter is visible only as a homogenous mass. It is only through a time-consuming postmortem freezing and dissection technique (right) that the trajectories of individual fiber tracts can be resolved.

FIGURE 13.2:

Sources/notes: Klingler, J., Ludwig, E. (1956). Atlas cerebri humani. the inner structure of the brain demonstrated on the basis of macroscopical preparations. Basel: Karger, 1956.

234

the OMICs

acquisition and analysis techniques to infer the underlying trajectories of fascicles using dMRI.

THE TENSOR MODEL

The earliest mathematical model of anisotropic diffusion was developed by Basser and colleagues in 1994 to characterize the diffusion data with a second-rank tensor (Figure 13.3) (Basser et al. 1994):

$$S_k = S_0\, e^{-b\, \mathbf{g}^{T} \mathbf{D}\, \mathbf{g}} \qquad (1)$$

where $S_k$ is the signal intensity for a diffusion-weighted image, $S_0$ is the signal intensity for a non-diffusion-weighted image, $b$ is the b value, $\mathbf{g}$ and $\mathbf{g}^{T}$ are the normalized diffusion gradient vector and its transpose, and $\mathbf{D}$ is the symmetrical second-rank diffusion tensor. Imaging studies that use the tensor model are called diffusion tensor imaging (DTI) experiments. The advent of DTI has led to the calculation of many scalar indices to quantify the degree of anisotropy (Alexander et al. 2000; Bahn 1999; Basser et al. 1994; Peled et al. 1998). In addition, several methods for fiber tractography based on the tensor model have been developed (Basser et al. 2000; Conturo et al. 1999; Lazar et al. 2003; Mori et al. 1999). These methods have shown great promise both in elucidating normal neuroanatomy and in detecting clinically significant abnormalities (Catani et al. 2005; Concha et al. 2005; Newton et al. 2006).

FIGURE 13.3: The tensor model. The eigenvectors ν1, ν2, and ν3 represent the direction of the tensor, while the eigenvalues λ1, λ2, and λ3 represent the relative strength of diffusion in the three directions. In diffusion tensor imaging (DTI), the primary eigenvector is assumed to spatially align with the axonal fibers.

Sources/notes: Images by Carlos Mena, Laboratory of Neuro Imaging, UCLA (now at USC). Adapted from Reinges, M. H. et al. (2004). European Journal of Radiology 49(2):91–104.

Because the diffusion tensor is symmetrical, only six independent parameters need to be estimated in addition to the single intercept term; this leads to a minimum number of seven acquisitions (scans) that are necessary to estimate the tensor (Westin et  al. 2002). This is a strength of DTI, as it can be used in clinical populations who are unable to withstand long acquisition times. Another advantage of DTI is the speed of computation. Typically the intercept term in Equation 1 is treated as a constant and the log-transformed data are fit using a linear least squares regression. The main limitation of the tensor model is that it has an underlying assumption that there is one preferred direction of diffusion per voxel. Under this assumption, the underlying white matter (WM) tracts are assumed to align with the direction of the principal eigenvector of the diffusion tensor. The failures of the tensor model are directly related to departures from this assumption; for example, when two fiber orientations are present in the same voxel, the principal eigenvector will lie somewhere in between the two orientations. Although the limitations of the tensor model are well known, it is still useful in many studies and remains one of the most widely used models, primarily owing to its simplicity and speed of acquisition and analysis and because, in many situations, its limitations are acceptable.
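The log-linear fit described above can be sketched in a few lines. The following is a minimal illustration and not the procedure of any particular software package: it assumes noise-free synthetic signals, a hypothetical scheme of one non-diffusion-weighted image plus six gradient directions (the minimum of seven acquisitions), and illustrative diffusivity values. The function names `design_matrix`, `fit_tensor`, and `fractional_anisotropy` are invented for this sketch; fractional anisotropy is used as one example of the scalar anisotropy indices mentioned earlier.

```python
import numpy as np

def design_matrix(bvals, bvecs):
    """Linear system for ln(S_k) = ln(S_0) - b * g^T D g (one row per acquisition)."""
    gx, gy, gz = bvecs[:, 0], bvecs[:, 1], bvecs[:, 2]
    return np.column_stack([
        np.ones(len(bvals)),       # intercept term: ln(S_0)
        -bvals * gx * gx,          # Dxx
        -bvals * gy * gy,          # Dyy
        -bvals * gz * gz,          # Dzz
        -2.0 * bvals * gx * gy,    # Dxy (appears twice because D is symmetric)
        -2.0 * bvals * gx * gz,    # Dxz
        -2.0 * bvals * gy * gz,    # Dyz
    ])

def fit_tensor(signals, bvals, bvecs):
    """Least-squares fit of the log-transformed signals; returns (S_0, D)."""
    A = design_matrix(bvals, bvecs)
    coef, *_ = np.linalg.lstsq(A, np.log(signals), rcond=None)
    ln_s0, dxx, dyy, dzz, dxy, dxz, dyz = coef
    D = np.array([[dxx, dxy, dxz],
                  [dxy, dyy, dyz],
                  [dxz, dyz, dzz]])
    return np.exp(ln_s0), D

def fractional_anisotropy(D):
    """Fractional anisotropy computed from the eigenvalues of D."""
    lam = np.linalg.eigvalsh(D)
    md = lam.mean()
    return np.sqrt(1.5 * np.sum((lam - md) ** 2) / np.sum(lam ** 2))

# Hypothetical acquisition: one b = 0 image plus six directions at b = 1000 s/mm^2.
s = 1.0 / np.sqrt(2.0)
bvecs = np.array([[0, 0, 0], [s, s, 0], [s, -s, 0],
                  [s, 0, s], [s, 0, -s], [0, s, s], [0, s, -s]], dtype=float)
bvals = np.array([0.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0])

# Illustrative "fiber along x" tensor in mm^2/s (values are assumptions).
D_true = np.diag([1.7e-3, 0.3e-3, 0.3e-3])
S0_true = 100.0
signals = S0_true * np.exp(-bvals * np.einsum("ki,ij,kj->k", bvecs, D_true, bvecs))

S0_fit, D_fit = fit_tensor(signals, bvals, bvecs)
evals, evecs = np.linalg.eigh(D_fit)      # eigenvalues in ascending order
principal_direction = evecs[:, -1]        # eigenvector of the largest eigenvalue
fa = fractional_anisotropy(D_fit)
```

With real, noisy data one would normally acquire more than six directions and often use a weighted fit; the unweighted version is shown only because it matches the description in the text.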

THEORETICAL RELATIONSHIP BETWEEN MAGNETIC RESONANCE SIGNAL AND UNDERLYING WHITE MATTER STRUCTURE

In general, it is assumed that the maxima of the diffusion probability density function (PDF) correspond to the orientations of the underlying WM tracts (Alexander 2005). The tensor model is the simplest estimate of the diffusion PDF for the diffusion displacement vector, in which the PDF is modeled as a zero-mean trivariate Gaussian distribution:

$$p(\mathbf{x}) = G(\mathbf{x}; \mathbf{D}, t) = \left[(4\pi t)^{3} \det(\mathbf{D})\right]^{-0.5} \exp\!\left(\frac{-\mathbf{x}^{T} \mathbf{D}^{-1} \mathbf{x}}{4t}\right)$$

where p(x) is the PDF of particle displacements x over time t and D is the diffusion tensor (Alexander 2005).

The most theoretically accurate way to measure WM structure would be to use q-space imaging (QSI), which was developed in nuclear magnetic resonance (NMR) experiments to model the diffusion process in the presence of boundaries (Callaghan 1996; Tuch 2004). This is because biological tissues are structurally complex with many restrictions to diffusion, and diffusing molecules can belong to a large number of chemical environments, leading to a very complicated picture. The Fourier relationship between the signal intensity with respect to the wavevector q and the conditional probability density function (PDF) forms the basis of q-space imaging (Callaghan 1996; Tuch 2004):

$$S(\mathbf{q}, \Delta) = \int P_s(\mathbf{R}, \Delta)\, e^{i 2\pi \mathbf{q} \cdot \mathbf{R}}\, d\mathbf{R}$$

in which $\mathbf{q}$ is the reciprocal space vector, $S(\mathbf{q}, \Delta)$ is the signal attenuation, and $P_s(\mathbf{R}, \Delta)$ is the probability that a molecule is displaced by $\mathbf{R}$ over time $\Delta$. The reciprocal space vector is defined as

$$\mathbf{q} = \frac{\gamma\, \mathbf{g}\, \delta}{2\pi}$$

where $\gamma$ is the gyromagnetic ratio, $\mathbf{g}$ is the gradient vector, and $\delta$ is the duration of the gradient pulse. Alternatively, the Fourier relationship may be expressed as follows:

$$P_s(\mathbf{R}, \Delta) = \int S(\mathbf{q}, \Delta)\, e^{-i 2\pi \mathbf{q} \cdot \mathbf{R}}\, d\mathbf{q}$$

However, for QSI to be valid, the duration of the gradient pulse must be infinitely short because of the underlying assumption that the spins do not move during the gradient pulse (Callaghan 1996). Given the current state of MR hardware and safety issues, this condition cannot be met with a living human subject (Basser 2002). In addition, a very high sampling density in Cartesian space is necessary to ensure numerical stability in the Fourier transform; the requirement for such a large number of acquisitions makes the collection of QSI data impractical (Tuch 2004). In one study, a single slice took approximately 25 minutes to acquire (Wedeen et al. 2005). With such long acquisition times, motion artifacts become significant. The acquisition of QSI requires that some of the images be collected with a very high diffusion weighting; there is very little SNR in these images and low tissue contrast, which causes problems for image registration algorithms (Tuch 2004). For these reasons, QSI is not yet possible in routine in vivo human studies, although QSI has been demonstrated to faithfully reconstruct the optic tracts in postmortem rat brains (Lin et al. 2003).
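As a worked illustration of how the q-space formalism relates to the tensor model, consider the special case in which the displacement PDF is the zero-mean Gaussian introduced earlier. The derivation below is a sketch under two assumptions that are not spelled out in the text: the Fourier kernel is taken as $e^{i 2\pi \mathbf{q}\cdot\mathbf{R}}$ (consistent with the factor $2\pi$ in the definition of $\mathbf{q}$), and the gradient pulses are treated as infinitely narrow. The unit vector $\hat{\mathbf{g}} = \mathbf{g}/|\mathbf{g}|$ is introduced here to match the normalized gradient vector of the tensor model.

$$S(\mathbf{q}, \Delta) = \int G(\mathbf{R}; \mathbf{D}, \Delta)\, e^{i 2\pi \mathbf{q}\cdot\mathbf{R}}\, d\mathbf{R} = \exp\!\left(-4\pi^{2} \Delta\, \mathbf{q}^{T} \mathbf{D}\, \mathbf{q}\right)$$

This follows because $G$ is a Gaussian with covariance $2\mathbf{D}\Delta$, whose Fourier transform is again a Gaussian. Substituting $\mathbf{q} = \gamma \mathbf{g} \delta / 2\pi$ gives

$$S(\mathbf{q}, \Delta) = \exp\!\left(-\gamma^{2} \delta^{2} |\mathbf{g}|^{2} \Delta\; \hat{\mathbf{g}}^{T} \mathbf{D}\, \hat{\mathbf{g}}\right) = \exp\!\left(-b\, \hat{\mathbf{g}}^{T} \mathbf{D}\, \hat{\mathbf{g}}\right), \qquad b = \gamma^{2} \delta^{2} |\mathbf{g}|^{2} \Delta$$

which has the same form as the tensor signal equation once the attenuation is written as $S_k / S_0$. In this sense the tensor model can be read as the Gaussian special case of the q-space relationship; for finite pulse durations the expression for $b$ acquires a correction term that is omitted in this narrow-pulse sketch.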

PRACTICAL ALTERNATIVES TO QSI

Since practical issues prohibit the use of QSI in a typical research experiment, multiple alternatives have been proposed. The tensor model itself is a very simple alternative to QSI. Diffusion spectrum imaging (DSI) (Wedeen et al. 2005) is one of the most powerful and general approaches to recover the underlying fiber geometry that gives rise to the observed diffusion characteristics because it requires few assumptions about the underlying diffusion geometry. In DSI, the Fourier relationship between the diffusion signal attenuation (as sampled in q-space) and the diffusion propagator, p(r), follows directly from the Markov chain model of the diffusion process:

$$p(\mathbf{r}) = \mathcal{F}^{-1}\left\{ \frac{S(\mathbf{q}_i)}{S_0} \right\}$$

DSI recovers the full diffusion propagator in six dimensions, unlike more common five-dimensional Q-ball imaging (QBI) reconstruction schemes, which project the diffusion signal and cannot distinguish fast versus slow diffusion components observable at different diffusion sensitization levels (b-values) (Assaf & Pasternak 2008). Either DSI or QBI can be used to construct an orientation distribution function, or ODF, by radial projection (Aganj et al. 2009). This encodes the orientational information required for tractography.

Other methods exist to recover ODFs from fewer q-space samples. Q-ball imaging (Tuch 2004) samples diffusion signals on a single shell in q-space and uses the Funk-Radon transform to estimate the ODF directly, without first generating p(r). The persistent angular structure (PAS) function uses the principle of maximum entropy to extract the angular structure in the propagator from a single shell of q-space samples (Jansons & Alexander 2003). Spherical deconvolution methods propose to estimate the fiber orientation distribution (FOD), which represents the underlying fiber geometry (the object most crucial for tractography) rather than the spin diffusion probabilities (Descoteaux et al. 2007; Patel et al. 2010; Tournier et al. 2004). However, most existing spherical deconvolution methods generate FODs that are locally negative, and negative values do not correspond to a physically possible diffusion process (Descoteaux et al. 2007).

Multiple studies have fit a mixture of two tensors to each voxel, in which it is assumed that the principal eigenvector of each tensor aligns with underlying WM fiber tracts (Alexander et al. 2001; Anderson 2005; Frank 2002; Inglis et al. 2001; Parker & Alexander 2003; Tuch et al. 2002). However, these models are numerically unstable, so assumptions must be made, such as constraining the eigenvalues to specific values. Alternatively, tensor distribution functions may be fitted to the scan data (Leow et al. 2009), and these can theoretically estimate the contributions of multiple crossing fibers to each voxel. Currently a great deal of work is devoted to sampling the diffusion q-space efficiently; fast methods, such as hybrid diffusion imaging, offer high angular resolution and, using staggered sampling schemes, are sensitive to both fast and slow diffusion (Zhan et al. 2011).

ACROSS-VOXEL MODELING: TRACTOGRAPHY

Tractography realizes the potential of dMRI, namely, to identify and quantify the structural connectivity of the in vivo human brain (Figure 13.4). The simplest form of fiber tractography is the deterministic streamline fiber tracking algorithm (Basser et al. 2000; Mori et al. 1999; Xue et al. 1999). In this method, the within-voxel FOD, typically the primary eigenvector from DTI, is used to track major WM tracts in the brain in a point-by-point fashion. While this method is computationally simple and intuitive, there are several known disadvantages, such as decreased accuracy with distance from the region of interest (ROI), dependence on fractional anisotropy (FA) and turning angle thresholds, and errors in voxels with partial volume effects. A further disadvantage of deterministic tractography methods is that there is no inherent indication of the confidence, or uncertainty, of each reconstructed trajectory, although this can be remedied by using bootstrapping techniques (Jones et al. 2005).

Newer tractography algorithms are probabilistic in nature and show promise at overcoming some of the disadvantages of deterministic methods (e.g., Bayesian methods, fast-marching algorithms, front propagation methods, and PDE-based algorithms based on first-order heat equations and Navier-Stokes fluid equations) (Behrens et al. 2003; Hageman et al. 2009; Jbabdi et al. 2007; O’Donnell et al. 2002; Parker & Alexander 2003; Tournier et al. 2003). However, probabilistic algorithms have different limitations; for example, the rules governing probabilistic algorithms are not based on known neuroanatomy (Jones & Pierpaoli 2005). Tractography algorithms are largely independent of the underlying within-voxel modeling (e.g., the streamline algorithm may be applied after any reconstruction method that generates an FOD).

To thoroughly estimate the structural connectivity profile of the human brain, both large and small tracts must be identified. Manual protocols have been developed to identify large WM tracts (Wakana et al. 2007). While manual protocols are perhaps the gold standard, owing to their high accuracy, they are subjective; their accuracy depends heavily on the knowledge of the rater. Additionally, manual protocols are time-consuming, which limits the number of subjects that can be studied. An alternative to manual protocols is to use one of many existing atlases, such as the LPBA40 atlas (Shattuck et al. 2008), the AAL atlas (Tzourio-Mazoyer et al. 2002), the Jülich histological atlases (Eickhoff et al. 2005), or the Johns Hopkins DTI-based atlases (Hua et al. 2008), to form seed points. These methods have the advantage of being objective and automatic; however, manual intervention is still recommended to prevent the identification of false tracts due to the interindividual variability in neuroanatomy.
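The point-by-point logic of deterministic streamline tracking, including the FA and turning-angle thresholds mentioned above, can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of any cited paper: it assumes a precomputed array of principal directions (`peak_dirs`) and an FA map (`fa`), uses nearest-neighbor lookup instead of interpolation, takes fixed Euler steps, and tracks in one direction only. The function name and all default threshold values are hypothetical.

```python
import numpy as np

def track_streamline(seed, peak_dirs, fa, step=0.5, fa_min=0.2,
                     max_angle_deg=45.0, max_steps=1000):
    """Follow the per-voxel principal direction from `seed` (voxel coordinates)."""
    cos_min = np.cos(np.deg2rad(max_angle_deg))
    shape = np.array(fa.shape)
    point = np.asarray(seed, dtype=float)
    prev_dir = None
    path = [point.copy()]
    for _ in range(max_steps):
        idx = np.round(point).astype(int)          # nearest-voxel lookup
        if np.any(idx < 0) or np.any(idx >= shape):
            break                                  # left the image volume
        if fa[tuple(idx)] < fa_min:
            break                                  # anisotropy too low to trust
        d = peak_dirs[tuple(idx)]
        if prev_dir is not None:
            if np.dot(d, prev_dir) < 0:
                d = -d                             # eigenvectors carry no sign
            if np.dot(d, prev_dir) < cos_min:
                break                              # turn sharper than allowed
        point = point + step * d
        path.append(point.copy())
        prev_dir = d
    return np.array(path)

# Synthetic field: all directions along x; FA drops below threshold for x >= 7.
nx, ny, nz = 10, 10, 1
peak_dirs = np.zeros((nx, ny, nz, 3))
peak_dirs[..., 0] = 1.0
fa = np.full((nx, ny, nz), 0.8)
fa[7:, :, :] = 0.1

path = track_streamline(seed=(0.0, 5.0, 0.0), peak_dirs=peak_dirs, fa=fa)
```

In this toy field the streamline runs straight along x and terminates once it enters the low-FA region. Because the sign of an eigenvector is arbitrary, practical implementations usually launch a second streamline from the same seed in the opposite direction and join the two halves.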
FIGURE 13.4: Fiber tracking with diffusion spectrum imaging (DSI). The centrum semiovale is an area where callosal (left to right), arcuate (anterior to posterior), and corona radiata (superior to inferior) fibers are known to cross. Where tensor models fail, DSI techniques can be used with manual intervention to track this highly complex region.

Sources/notes: Fernandez-Miranda, J. C., Pathak, S., Engh, J., Jarbo, K., Verstynen, T., Yeh, F. C., Wang, Y., Mintz, A., Boada, F., Schneider, W., & Friedlander, R. (2012). High-definition fiber tractography of the human brain: neuroanatomical validation and neurosurgical applications. Neurosurgery 71:430–453.

Several clustering algorithms have been proposed to identify small tracts (Gerig et al. 2004; Li et al. 2009; Maddah et al. 2008; O’Donnell & Westin 2007; Xia et al. 2005). Given a set of fiber tracts, the spectral clustering method first projects them into an embedding space and then performs standard clustering in this space (Li et al. 2009; O’Donnell & Westin 2007). Compared with other fiber clustering algorithms (Gerig et al. 2004; Maddah et al. 2008; Xia et al. 2005), spectral clustering captures the intrinsic relation of fiber bundles and is robust to pose variations (cf. the spectral modeling of intrinsic geometry in three-dimensional shape analysis) (Qiu et al. 2006; Reuter et al. 2006; Shi et al. 2008, 2009a,b). Spectral clustering techniques are easy to implement as they involve only matrix-based computations.
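To make the "embed, then cluster" idea concrete, the following sketch separates a set of synthetic fibers into two bundles using only matrix operations. It is an illustration under several assumptions and not the method of the cited papers: fibers are assumed to be resampled to the same number of points, the fiber-to-fiber distance is a simple mean point-wise distance chosen for brevity, the affinity is a Gaussian kernel with a hypothetical width `sigma`, and the "standard clustering" step is reduced to thresholding a one-dimensional embedding (the second eigenvector of the normalized graph Laplacian), which only handles the two-cluster case.

```python
import numpy as np

def fiber_distance(a, b):
    """Mean point-wise distance between two equally sampled fibers.

    The smaller of the direct and reversed comparisons is used so that the
    result does not depend on the order in which a fiber's points are stored.
    """
    direct = np.linalg.norm(a - b, axis=1).mean()
    flipped = np.linalg.norm(a - b[::-1], axis=1).mean()
    return min(direct, flipped)

def spectral_bipartition(fibers, sigma=2.0):
    """Split fibers into two groups via the normalized graph Laplacian."""
    n = len(fibers)
    dist = np.array([[fiber_distance(fibers[i], fibers[j]) for j in range(n)]
                     for i in range(n)])
    W = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))   # pairwise affinities
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                 # ascending eigenvalues
    embedding = vecs[:, 1] * d_inv_sqrt             # one-dimensional embedding
    return (embedding > 0).astype(int)

# Two synthetic bundles of straight fibers, 10 units apart; some are stored reversed.
t = np.linspace(0.0, 20.0, 15)
offsets = [0.0, 0.3, 0.6, 0.9, 1.2]
bundle_a = [np.column_stack([t, np.full_like(t, o), np.zeros_like(t)]) for o in offsets]
bundle_b = [np.column_stack([t, np.full_like(t, 10.0 + o), np.zeros_like(t)]) for o in offsets]
bundle_a[1] = bundle_a[1][::-1]
bundle_b[3] = bundle_b[3][::-1]

labels = spectral_bipartition(bundle_a + bundle_b)
```

For more than two bundles, one would typically keep several eigenvectors and run a standard algorithm such as k-means on the resulting coordinates.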

APPLYING GRAPH THEORY CONSTRUCTS TO STRUCTURAL CONNECTIVITY ESTIMATES

Once the full-brain structural connectivity pattern has been established, it becomes possible to study the behavior of given networks by adapting some of the core constructs from graph theory. In graph theory, a graph is any representation of a network in terms of a set of nodes and connections through which the nodes interact. In the human brain, gray matter regions are treated as nodes and the axonal bundles, which can be estimated from dMRI, form the connections of the graph. Networks are typically classified according to two metrics: a clustering coefficient (C), which is a measure of how densely the neighbors of each node are connected to one another, and the characteristic path length (λ), which is a measure of the average distance between nodes, where λ is inversely related to network efficiency. Until 1998, networks were classified as either regular (high C and high λ) or random (low C and low λ). In 1998, it was observed that many nonbiological networks, such as film actors and power grids, did not fit well into either of these categories; therefore a new class of network, called the “small world” network, was introduced as having a high C, like regular networks, and a low λ, like random networks (Watts & Strogatz 1998).

Recently this small world network model has been applied to the study of structural and functional connectivity of the human brain because it seems to capture many relevant features, such as high local clustering (cortical regions) combined with long-distance connections between clusters (large WM tracts) (Bullmore & Sporns 2009). These methods have been used to compute connectivity matrices and the small world network properties of cat and macaque cortex; the underlying hypothesis is that connectivity matrices can be used as a measure of the capacity of the cortex to process information (Sporns & Zwi 2004). Recently two studies have demonstrated that the efficiency of the human cortex is correlated with intelligence quotient (IQ), such that higher IQ is associated with a more efficient cortical network, that is, a shorter λ (Li et al. 2009; van den Heuvel et al. 2009). These methods have also been applied to large populations to demonstrate that the overall efficiency of the brain decreases over the life span (Gong et al. 2009). Additionally, graph theory constructs have been used to reveal changes in the neuroanatomical networks of early blind subjects (Shu et al. 2009) and changes in the functional networks of subjects with attention-deficit hyperactivity disorder (ADHD) (Wang et al. 2009).
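The two metrics used to classify networks can be computed directly from an adjacency list. The sketch below is a minimal illustration using only the Python standard library; it assumes an undirected, unweighted, connected graph, and the function names and the toy networks are invented for this example. A regular ring lattice is compared with the same lattice after two long-range "shortcut" connections are added, which is a hand-built stand-in for the rewiring idea behind small world networks.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average, over nodes, of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for node, neighbors in adj.items():
        nbrs = list(neighbors)
        k = len(nbrs)
        if k < 2:
            continue                      # nodes with < 2 neighbors contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length (in edges) over all reachable node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def ring_lattice(n, k):
    """Regular network: each node is linked to its k nearest neighbors on a ring."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

regular = ring_lattice(20, 4)
C_regular = clustering_coefficient(regular)
L_regular = characteristic_path_length(regular)

# Add two long-distance connections to the same lattice.
shortcut = ring_lattice(20, 4)
for a, b in [(0, 10), (5, 15)]:
    shortcut[a].add(b)
    shortcut[b].add(a)
C_shortcut = clustering_coefficient(shortcut)
L_shortcut = characteristic_path_length(shortcut)
```

In this toy case the shortcuts reduce the characteristic path length while leaving the clustering coefficient close to that of the regular lattice, which is the qualitative signature of a small world network. Brain networks derived from dMRI are often weighted, in which case weighted variants of both metrics would be needed.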

THE IMPACT OF ERRORS IN TRACTOGRAPHY ESTIMATES ON GRAPH THEORY METRICS

Whether one is designing a study, interpreting the results of a study, or comparing results from multiple studies, it is important to bear in mind how errors in tractography can impact graph theory metrics, such as path length. Specifically, false-positive connections can artificially inflate measurements of clustering coefficients while deflating estimates of the path length. Therefore if methods include false-positive connections, the overall connectivity of the brain will be overestimated. Conversely, false-negative connections deflate estimates of C and inflate estimates of λ because the overall connectivity of the brain is underestimated. Methods that are overly simplistic, such as DTI and the streamline algorithm, are more prone to produce false-negative results because such methods are sensitive only to large WM tracts. However, sometimes even this very simple model can produce false positives as well, particularly in regions where there is not much physical space between different tracts. For example, estimates of the corticospinal tract commonly “jump” to include callosal fibers because there is little physical space between the two (see Figure 13.5). More mathematically complex models can introduce both false positives and false negatives. Given the current state of hardware, there will always be false negatives in in vivo studies because the spatial resolution of dMRI is so much coarser than the spatial resolution of the underlying axons. In reality, some axons are efferent, transferring information away from a gray matter region, while others are afferent, transferring information toward a gray matter region. Unless some other type of information is available besides dMRI, the observed axons must be treated as bidirectional, which results in an overestimation of the connectivity.

Because there are so many ways in which acquisition and analysis methods can impact graph theory metrics, meaningful comparisons can be made only among populations or studies in which the acquisition and analysis methods are more or less the same. Thus it would not be informative to compare the path length estimated from population A, using DTI and a streamline tractography algorithm, with that from population B, using DSI and a Navier-Stokes tractography algorithm. Recently Zhan and coworkers (2011) found that measures of anatomical connectivity depend on the scan parameters used, such as the spatial resolution, the magnetic field strength of the MRI scanner, and the number of gradients used to estimate directional diffusion. However, as long as the methods are kept constant, populations and studies should be comparable, and differences in path length or clustering coefficient can then be attributed to biological differences among populations. For example, people with schizophrenia have been shown to have more disorganized connectivity patterns than people without schizophrenia (Bassett et al. 2008). In addition, clinical studies are beginning to emerge focusing on characteristic changes in brain networks in several brain diseases and disorders, including Alzheimer’s disease (Nir et al. 2012), genetic risk for schizophrenia (Braskie et al. 2012), HIV/AIDS (Jahanshad et al. 2012), bipolar illness (Leow et al. 2012), and genetic risk for autism (Dennis et al. 2011).
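The direction of the bias that tractography errors introduce into the path length can be checked on a toy network. The sketch below is only an illustration under stated assumptions: the "true" network is a hypothetical ring of 12 regions, a false positive is modeled as one spurious connection, a false negative as one missed connection, and shortest paths are computed with a plain Floyd-Warshall update in NumPy. The function name `mean_path_length` is invented for this example.

```python
import numpy as np

def mean_path_length(A):
    """Characteristic path length of an unweighted graph (Floyd-Warshall)."""
    n = len(A)
    d = np.where(A > 0, 1.0, np.inf)       # direct connections cost one step
    np.fill_diagonal(d, 0.0)
    for k in range(n):                      # allow paths through node k
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    off_diagonal = ~np.eye(n, dtype=bool)
    return d[off_diagonal].mean()

n = 12
true_net = np.zeros((n, n))
for i in range(n):                          # ring: each region linked to the next
    j = (i + 1) % n
    true_net[i, j] = true_net[j, i] = 1.0

false_positive = true_net.copy()
false_positive[0, 6] = false_positive[6, 0] = 1.0   # spurious long-range tract

false_negative = true_net.copy()
false_negative[0, 1] = false_negative[1, 0] = 0.0   # missed tract

lam_true = mean_path_length(true_net)
lam_fp = mean_path_length(false_positive)
lam_fn = mean_path_length(false_negative)
```

Here the spurious connection shortens the estimated path length and the missed connection lengthens it, in line with the argument above. The effect on the clustering coefficient depends on where in the network the error falls, so it is not demonstrated in this small example.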


FIGURE 13.5: Because the physical space between the corticospinal tract (dark gray) and the callosal fibers (lighter gray) is small, attempts to automatically identify one will often include the other.

Sources/notes: Image by David Shattuck, Laboratory of Neuro Imaging, UCLA (LONI is now at USC). Adapted from Thomason, M. E., & Thompson, P. M. (2011). Diffusion imaging, white matter, and psychopathology. Annual Review of Clinical Psychology 7:63–85.

POPULATION STUDIES OF BRAIN NETWORKS

With the increased availability of dMRI protocols on most 1.5 and 3T scanning platforms, DTI has become nearly ubiquitous in small, large, and multisite brain mapping studies. This has evolved from the examination of individual patterns of WM integrity and tractography to an interest in population-level mapping. As noted above, cortical network architecture has mostly been investigated using graph theoretical representations; however, whole-brain connectivity is often more complex than the connectivity of other networks (e.g., power grids) and thus requires a more complex visualization than the tree diagrams that are typically used to represent graphs. For example, it is not just the ways in which the various nodes are connected that matter but also the strength of the connections, the identification of subnetworks and hubs, and sometimes even how the networks change dynamically (e.g., in response to disease) that are of interest. Figure 13.6 demonstrates two examples of how investigators represent these relationships (Fair et al. 2009; Holten 2006; Irimia et al. 2012a; Jovicich et al. 2009; Modha & Singh 2010). These approaches have been applied to the examination of connectomic damage in mild to severe brain trauma (Irimia et al. 2012b) and have formed the population-level basis for characterizing the extent of WM loss in neurological injury (Van Horn et al. 2012).

FIGURE 13.6: Visualization methods. Left: This type of circular representation demonstrates the hierarchical nature of connections as well as the basic governing principles of structural and connectomic brain organization: several highly interconnected subnetworks for specialization that communicate through a few strong long-distance connections. Right: This form of “spring-embedded” representation shows how brain networks develop from a local to distributed organization.

Sources/notes: Left: Image based on Irimia, A., Chambers, M. C., Torgerson, C. M., & Van Horn, J. D. (2012). Circular representation of human cortical networks for subject and population-level connectomic visualization. NeuroImage 60(2):1340–1351. Right: Fair, D. A., Cohen, A. L., Power, J. D., Dosenbach, N. U., Church, J. A., Miezin, F. M., Schlaggar, B. L., & Petersen, S. E. (2009). Functional brain networks develop from a “local to distributed” organization. PLoS Comput Biol 5:e1000381. DOI: 10.1371/journal.pcbi.1000381.

Indeed, the rapid collection of dMRI data from large populations has made it possible to discover general factors that affect connectivity patterns in the brain, such as changes that occur with disease or as the brain changes during development (Dennis et al. 2012). These patterns may not be observable in any one individual scan but can be detected by compiling connectivity matrices from very large numbers of subjects and performing statistical analysis to discover factors (such as age, sex, or specific genetic variants) that affect the density of connections, their integrity, or even their network organization. One of the largest studies to date (Dennis et al. 2012) examined changes in anatomical brain connectivity between ages 12 and 30 by computing connectivity metrics from DTI scans of 484 adolescents and adults. An earlier study (Chiang et al. 2010) charted the maturational trajectory of fiber integrity in 705 twins using DTI and found that the changes were under strong genetic control. These connectivity analyses have led to the development of age-related statistical norms for several key anatomical connectivity parameters, including the most widely used measures of network topology and efficiency. This work is making it feasible to begin to detect deviations from normal brain development and to assess the maturity of the brain’s neural networks as a child develops.

This interest in population-based connectivity mapping has given rise to the Human Connectome Project (HCP), an NIH-funded multicenter initiative aiming to generate a whole-brain coverage connectivity map of the human brain using noninvasive in vivo neuroimaging dMRI techniques combined with computational tractography (http://www.humanconnectomeproject.org/) (Toga et al. 2012; Van Essen et al. 2012). It is anticipated that this comprehensive human connectome will provide valuable diagnostic references and insight into many neurological and neuropsychiatric diseases, such as Alzheimer’s disease, Huntington’s disease, autism, and schizophrenia. These devastating ailments affect systems of interconnected neural regions through macrocircuitry in both anterograde and retrograde directions and are often termed “disconnection syndromes” (Catani & ffytche 2005; Molko et al. 2002), a concept originally introduced well before the connectomic era by Norman Geschwind in the 1960s (Geschwind 1965). As such, a comprehensive population-level human connectomic map will be invaluable for understanding the extent of these connectopathies and the exact mechanisms underlying them.

CURRENT AND FUTURE DIRECTIONS IN THE STUDY OF BRAIN NETWORKS

One new direction in our understanding of brain connectivity involves the search for common genetic variants (individual differences in our DNA) that contribute to and affect the quality of the brain’s wiring. It has only recently become feasible to search for these genes by mining large databases of connectomic and genetic data. Large international consortia, such as ENIGMA (http://enigma.loni.ucla.edu) (Kochunov et al. 2012; Stein et al. 2012), are engaged in the statistical analysis of connectivity data from thousands of subjects with the goal of discovering single nucleotide differences in the genome that contribute to the brain’s organization. In one approach, Chiang and colleagues (2012) discovered an entire network of genetic variants that affect brain integrity using genomic scans and a large database of high angular resolution diffusion imaging (HARDI) scans. In related work, Jahanshad and associates (2012) discovered that the integrity of the brain’s connections is influenced by a common genetic variant in the HFE gene, which influences iron transport in the body and brain. Dennis and coworkers (2011) also found that people who carry a very common variant in the “autism risk” gene CNTNAP2 have altered brain connectivity, affecting the relative “isolation” of various nodes in their anatomical networks. As genetic factors that affect the brain’s wiring are beginning to be discovered, Kohannim and colleagues (2012) developed a genetic test to predict a person’s brain integrity on a HARDI scan, based on a genetic profile of 7 common variants that can be genotyped in a cheek-swab sample of saliva. In ongoing work, Jahanshad and associates (2013) studied the genetics of path lengths in brain connectivity networks, based on 457 adults scanned with HARDI. They developed a method to perform genome-wide scanning of all the connections in the human connectome and discovered previously unknown genes that affect the risk for brain diseases such as Alzheimer’s disease.

Connectomic analyses are also leading to new kinds of biomarkers of brain diseases such as Alzheimer’s disease and mild cognitive impairment. Daianu and colleagues (2012), for example, applied a novel network metric, called the “k-core,” to recover the “structural backbone” of the fiber connectivity network in Alzheimer’s disease patients and controls. They showed how parts of this network start to break down as a person develops Alzheimer’s disease, as the brain becomes progressively more structurally and functionally disconnected. Nir and associates (2012) also reported that measures of “small-world” network properties can be used to predict imminent brain tissue loss in people with Alzheimer’s disease.

Several mathematical innovations have been stimulated by the wide availability of data on brain networks. Duarte-Carvajalino and coworkers (2012), for example, showed that they could very efficiently detect and classify gender and kinship (family relationships) from brain networks based on graph-theoretical analyses of patterns of connections. In parallel with these developments in genetics research, there is a growing consensus in the neuroimaging field that diffusion MRI measures, including measures of network organization, will advance research in Alzheimer’s disease, HIV/AIDS, neurological development, and neuropsychiatric disorders such as schizophrenia (clinical applications of DTI are reviewed in Thomason & Thompson 2011).
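For readers unfamiliar with the k-core metric mentioned above, the general graph-theoretical definition can be sketched as follows: the k-core of a network is the largest subnetwork in which every node keeps at least k connections to other nodes of that subnetwork, and it can be found by repeatedly pruning nodes with fewer than k remaining connections. The code below illustrates this generic definition on an invented toy graph; it is not the analysis pipeline of the cited study, and the function name `k_core` is hypothetical.

```python
def k_core(adj, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    alive = set(adj)
    to_prune = [node for node in alive if degree[node] < k]
    while to_prune:
        node = to_prune.pop()
        if node not in alive:
            continue                       # already removed earlier
        alive.remove(node)
        for neighbor in adj[node]:
            if neighbor in alive:
                degree[neighbor] -= 1      # neighbor lost one connection
                if degree[neighbor] < k:
                    to_prune.append(neighbor)
    return alive

# Toy network: a densely connected group (0-3), a chain (3-4-5), and a leaf (6).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (0, 6)]
adj = {i: set() for i in range(7)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

core_1 = k_core(adj, 1)
core_2 = k_core(adj, 2)
core_3 = k_core(adj, 3)
```

In the toy network the peripheral chain and leaf are pruned away as k increases, leaving the densely interconnected group as the "backbone." Loss of nodes from a high-order core between groups is the kind of change such a metric is designed to expose.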

NEURAL CONNECTOMICS IN THE MOUSE

Although constructing a human connectome with the help of sophisticated imaging like DTI is both promising and exciting, there are several limitations that need to be considered. First, the relationship between tensor fields, tractography, and classically defined neuroanatomical fiber tracts remains controversial, and the vast majority of nonmyelinated axons are in gray matter, which cannot be reliably represented by DTI tractography. Second, all myelinated axons traveling through WM (fiber tracts) arise from and terminate in neurons in gray matter; therefore DTI tractography represents only one segment of a given circuitry. Third, DTI is unable to detect synaptic connectivity, which is formed by axonal terminals ending at other neuronal elements (mostly somas or dendrites). Thus, although it is anticipated that DTI will facilitate the characterization of earlier stages of neurological/neuropsychiatric diseases (e.g., mild cognitive impairment), a direct correlation with cellular and molecular pathology of connectopathies will be difficult to establish. The cellular and molecular etiology of these diseases can be addressed by using more invasive techniques in animal models.

A more direct approach for generating a comprehensive connectome at the cellular resolution that will complement the HCP is to use a different species. The laboratory mouse (Mus musculus) has naturally become the primary animal of choice for several reasons. More than 20,000 genes have been mapped systematically throughout the mouse brain using large-scale in situ hybridization (Lein et al. 2007) and BAC-transgenic methods (Gong et al. 2003). Further, mouse genes can be altered efficiently and precisely using powerful gene-targeted mutagenesis technology to create many mouse models of human diseases. Also, combinatory efforts have recently been proposed to systematically produce and phenotype knockouts for all mouse genes (see NIH KOMP website: http://www.nih.gov/science/models/mouse/index.html). Together, these substantial efforts will rapidly expand our knowledge of in vivo gene function, potentially improving our understanding of human diseases and advancing treatments for them. Therefore a systematic and comprehensive effort to generate a mouse brain connectome with the most reliable and practical methods available will be of enormous benefit to the research community. A mouse connectome will propagate hypotheses regarding the functional roles of these genes and allow interpretation of their phenotypes (Bohland et al. 2009; Bota et al. 2012; Swanson & Bota 2010; Osten & Margrie 2013). Several different approaches have been adopted to create this mouse connectome, each of which offers unique advantages.

Macro-Scale Mouse Connectomics

Tens of millions of neurons in the mouse brain aggregate into roughly 500 to 800 anatomically and functionally distinct gray matter regions (i.e., nuclei, areas) based on their topographic locations and cytoarchitectural properties (Dong 2007). Thus the first feasible step in establishing the global wiring framework for the mouse brain is to systematically characterize the macropathways that interconnect these brain regions using sensitive, well-proven, modern tract tracing methods. Unlike the "tracts" revealed by in vivo DTI, which have ambiguous "start" and "end" points, anterograde circuit tracers (e.g., Phaseolus vulgaris leucoagglutinin [PHAL] and biotinylated dextran amine [BDA]) label individual axons that arise from the injection site (efferent pathways) and travel through both gray matter and WM. Potential synaptic connectivity within their targeted regions is indicated by labeled varicosities, axonal terminal boutons, and terminal fields. Meanwhile, retrograde tracers such as Fluorogold (FG) or cholera toxin b subunit (CTb) are used to reveal neuronal inputs (afferent pathways) to the injection sites. Brain regions containing retrogradely labeled neurons send direct monosynaptic inputs to the injection sites. The reliability and sensitivity of these tracers have been well established over the past two decades.

To generate a high-resolution macro-scale connectome with whole-brain coverage in a relatively short time, complementary approaches have been adopted at USC (www.MouseConnectome.org), the Allen Institute for Brain Science (AIBS, http://connectivity.brain-map.org/), and Cold Spring Harbor Laboratory (CSHL, http://brainarchitecture.org/mouse/about). The various neural tract tracing strategies used in these individual projects are illustrated on their websites. The USC iConnectome group uses a double coinjection strategy (Hintiryan et al. 2012), which was initially reported for tracing neural circuits in rats (Thompson & Swanson 2003). Within one animal, two nonoverlapping coinjections are made, each consisting of one anterograde and one retrograde tracer. PHAL (anterograde) is coinjected with CTb (retrograde), while BDA (anterograde) is coinjected with FG (retrograde) (Figure 13.7a, e). These double coinjections allow concurrent examination of input and output pathways from each injection (Figure 13.7b) and yield four times the amount of data collected from classic single tracer injections, reducing cost, time, and the number of animals used.

Importantly, two injections are paired for the purposes of (1) directly exposing topographically distinct connectional patterns associated with different brain regions within the same brain (Figure 13.7f) and (2) potentially revealing regions of interaction between injection sites via multisynaptic pathways (Figure 13.7d, i), which allows connectivity to be studied at the network level. This method is powerful when mapping the topographic connectivity of different regions of larger brain structures such as the olfactory bulb, cerebral cortical areas, the hippocampus, and the caudoputamen (Hintiryan et al. 2012). Figure 13.7 shows a typical experimental case with two confined, nonoverlapping coinjections placed into two cortical areas (Figure 13.7e). One injection, a mixture of


PHAL/CTb, is in the secondary motor cortex (MOs), while the other, BDA/FG, is in the ventral anterior cingulate cortex (ACAv). Most injection sites are about 350 to 500 μm in diameter, although smaller injections are also used. Potential synaptic connectivity is revealed by the colocalization of an anterograde tracer with a retrograde tracer. This colocalization can reveal (1) reciprocity between an injection site and a brain region (PHAL overlapping with CTb, or BDA with FG; Figure 13.7c, h–i) and/or (2) an interaction station between injection sites (PHAL overlapping with FG-labeled neurons; Figure 13.7h–i). As illustrated here, MOs and ACAv connections display distinct, topographically arranged labeling in the cortex (Figure 13.7f) and in the thalamic central lateral (CL), lateral mediodorsal (MDl), and posterior (PO) nuclei (Figure 13.7g–i). Neurons that are colabeled with both retrograde tracers, CTb and FG, like those in the ventral medial nucleus (VM) in Figure 13.7h, provide input to both injection sites, which suggests a divergent connectivity pattern for those neurons. A region that contains intermixed PHAL- and BDA-labeled axons receives convergent inputs from the two injection sites. Finally, multisynaptic connections and integration nodes between the MOs and ACAv networks are also revealed. MOs PHAL fibers innervate FG back-labeled neurons from ACAv in thalamic VM (Figure 13.7h, white arrow), MDl (Figure 13.7i, white arrow), and CL (Figure 13.7i, red arrow), suggesting an MOs → VM, MDl, CL → ACAv connectivity chain.

The Mouse Brain Architecture Project at Cold Spring Harbor Laboratory and the Mouse Connectivity Project of the Allen Institute for Brain Science use a strategy of one tracer injection per brain, with either classic tracers (BDA or CTb) or genetically modified viral vectors (recombinant adeno-associated viruses [rAAV]).
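The colocalization rules described for the double coinjection case can be written down as a small set-based sketch. The code below is an illustration only, not the iConnectome group's software: the region lists are a simplified, partly hypothetical stand-in for the MOs/ACAv example (the BDA entry in particular is assumed), and the function name `classify_regions` is invented here. Region-level overlap of tracers only indicates potential connectivity; confirming it requires inspecting the labeled axons and neurons themselves.

```python
from itertools import permutations

# Two coinjections per animal, each pairing one anterograde and one retrograde tracer.
INJECTIONS = {
    "MOs": {"anterograde": "PHAL", "retrograde": "CTb"},
    "ACAv": {"anterograde": "BDA", "retrograde": "FG"},
}

# Illustrative (not measured) sets of thalamic regions containing each tracer.
LABELING = {
    "PHAL": {"VM", "PCN", "MDl", "CL"},  # axons arising from MOs
    "CTb": {"VM", "PCN"},                # neurons projecting to MOs
    "BDA": {"MDl"},                      # axons arising from ACAv (assumed for the example)
    "FG": {"VM", "MDl", "CL"},           # neurons projecting to ACAv
}

# Each injection labels one output and one input pathway: 2 injections x 2 tracers.
PATHWAYS_PER_ANIMAL = sum(len(tracers) for tracers in INJECTIONS.values())


def classify_regions(labeling, injections):
    """Tag each region with the relationships suggested by tracer overlap."""
    findings = {}
    for region in sorted(set().union(*labeling.values())):
        tags = []
        # Anterograde + retrograde tracer from the same injection: reciprocity.
        for site, tracers in injections.items():
            if (region in labeling[tracers["anterograde"]]
                    and region in labeling[tracers["retrograde"]]):
                tags.append(f"reciprocal with {site}")
        # Anterograde tracer from one site + retrograde tracer from the other:
        # a candidate relay (interaction station) between the two sites.
        for (src, src_tr), (dst, dst_tr) in permutations(injections.items(), 2):
            if (region in labeling[src_tr["anterograde"]]
                    and region in labeling[dst_tr["retrograde"]]):
                tags.append(f"relay {src} -> {region} -> {dst}")
        # Both retrograde tracers present: possible divergent output.
        if all(region in labeling[t["retrograde"]] for t in injections.values()):
            tags.append("possible divergent output to both sites")
        # Both anterograde tracers present: convergent input.
        if all(region in labeling[t["anterograde"]] for t in injections.values()):
            tags.append("convergent input from both sites")
        findings[region] = tags
    return findings


findings = classify_regions(LABELING, INJECTIONS)
for region, tags in findings.items():
    print(region, tags)
```

In this toy data set, VM, MDl, and CL are each tagged as candidate relays from MOs to ACAv, which mirrors the MOs → VM, MDl, CL → ACAv chain described above, and `PATHWAYS_PER_ANIMAL` evaluates to 4, matching the fourfold data yield relative to a single tracer injection.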
The rAAV is currently the most reliable viral vector used as an anterograde tracer that maps axonal projections and their terminals with high sensitivity (Chamberlin et  al. 1998; Harris et  al. 2012). However, unlike the putative anterograde tracer PHAL, which is transported solely unidirectionally, BDA and rAAV also can be retrogradely transported to label some upstream neurons and their axons. This is important in analyzing connectivity data because minor projections labeled by such “incorrect” transport could lead to erroneous


FIGURE 13.7. USC iConnectome double coinjection tract tracing strategy and online interactive visualization tool (www.MouseConnectome.org). (a) Schematic illustration of PHAL/CTb and BDA/FG double coinjections in two independent brain regions, which label both the input to and output of each injection (b) and detect reciprocity (c) and multisynaptic interactions (d). (e) A typical experimental case with two confined, nonoverlapping coinjections placed into two cortical areas. One injection, a mixture of PHAL/CTb, is in the secondary motor cortical area (MOs), while the other, BDA/FG, is in the ventral anterior cingulate cortical area (ACAv). (f) MOs (PHAL) and ACAv (BDA) display distinct, topographically arranged connectivity patterns in the cortex and thalamus (g). Thalamic nuclei, such as the ventral medial nucleus (VM) (h) and paracentral nucleus (PCN) (i), contain intermixed labeling of PHAL and CTb, indicating their reciprocal connectivity with MOs. MOs-originating PHAL fibers innervate FG back-labeled neurons from ACAv in the lateral part of the mediodorsal thalamic nucleus (MDl) (i) and central lateral nucleus (CL) (i, arrow), suggesting an MOs → MDl, CL → ACAv connectivity chain. These multifluorescent connectivity data can be viewed either on their own Nissl-stained background (e, j) or on their corresponding level of the Allen Reference Atlas (k), both of which provide anatomic references for the labeling. Scale bars, 500 μm (e–f, i); 200 μm (g–h).

Abbreviations: ACAv, ventral anterior cingulate cortical area; BDA, biotinylated dextran amine; CL, central lateral thalamic nucleus; CTb, cholera toxin subunit b; FG, fluorogold; MOs, secondary motor cortical area; MDl, lateral mediodorsal thalamic nucleus; PCN, paracentral nucleus of the thalamus; PHAL, Phaseolus vulgaris leucoagglutinin; VM, ventromedial thalamic nucleus.

conclusions. Nevertheless, the accuracy and reliability of the connectivity data generated in these large-scale mapping projects can be validated using different technologies.

Global sharing of data has had a remarkable impact on the advancement of science. One of the purposes of macro-connectome projects is to provide a framework of global neural connectivity, which can in turn be used by scientists from different backgrounds to generate hypotheses regarding brain microcircuitry, function, and disease. Therefore hundreds of neuronal pathways labeled by these three groups have been released through their websites and are accessible to both the neuroscience community and the general public worldwide (USC [www.MouseConnectome.org], AIBS [www.brain-map.org], and CSHL [http://brainarchitecture.org]). Specifically, the USC iConnectome interactive visualization tool (www.MouseConnectome.org) offers unique features that aid in the analysis and interpretation of the connectivity data. It features a searchable catalog of multifluorescent tracer injections and labeled pathways that can be viewed at high resolution within their own cytoarchitectural background (Hintiryan et al. 2012). The background is a bright-field Nissl stain of the same section that provides cytoarchitectonic detail for direct analysis (Figure 13.7e, j). In addition, all iConnectome data are registered to a standard C57BL/6J mouse brain atlas (Figure 13.7k), the Allen Reference Atlas (ARA) (Dong 2007), which not only aids in the analysis of the data but also allows correlation between connectivity and the large-scale ABA gene expression data (Lein et al. 2007), linking molecular and tracing approaches to evaluate functional circuitry.

Meso-Scale Mouse Connectomics

A single anatomically defined brain region usually contains several intermixed but distinct cell types (e.g., dopaminergic neurons and GABAergic neurons in the ventral tegmental area), which presumably possess distinct neuronal connectivity and are involved in different functions. Following the definition of Swanson and Bota (2010), projects that determine the whole-brain connectivity of these different cell types are called meso-scale connectomes. Although the definition of a particular cell type remains controversial (for example, there is no agreement on what defines a cortical cell type) (Nelson et al. 2006), advanced technologies have been developed to characterize the circuits of genetically defined neuronal populations using rAAV or other genetically modified viruses (Luo et al. 2008). This is achieved by expressing Cre recombinase in a defined cell population within the brain region targeted for viral tracer injection. The AIBS Mouse Connectivity Project (www.brain-map.org) aims to map cell type-specific axonal projections arising from neuronal populations defined by ~100 different Cre-driver lines, labeled by viral tracers and visualized using serial two-photon tomography (Harris et al. 2012). Meanwhile, various genetically modified pseudorabies viruses (Card et al. 2011) and rabies virus-based tracers (Marshel et al. 2010) have been developed to reveal inputs to genetically defined neuronal populations across multiple synapses in the retrograde direction. Even more sophisticated, Callaway's lab has developed viral tracers that genetically target and trace monosynaptic inputs to a single neuron in vitro and in vivo (Marshel et al. 2010), which can be applied to manipulate single neuronal networks throughout the brain. This new generation of markers, which may undergo either anterograde or retrograde transport, is especially valuable because it makes it possible to genetically manipulate a pathway of interest and to monitor or control neuronal activity within that particular pathway using optogenetic approaches (Luo et al. 2008; Osakada et al. 2011).

Micro-Scale Mouse Connectomics, or Synaptomics

Currently, several projects are under way to assemble an ultra-high-resolution mouse connectome at the level of a single neuron or synapse. Lichtman and colleagues at Harvard University have developed the Brainbow method (Livet et al. 2007), which permits individual neurons to be labeled with different colors (Jackson Laboratories, Strain Name[s]: B6.Cg-Tg[Thy1-Brainbow1.0-2.1]RLich/J), in the hope of mapping the morphology and connectivity of individual neurons in exquisite detail. The challenge of constructing a connectome with this approach is to untangle simultaneously labeled long-range axons arising from numerous neurons across the entire brain; however, Brainbow mice have been integral to constructing an ultra-high-resolution connectivity map of peripheral axons innervating muscles (Lu et al. 2009).

Another technique, array tomography, allows the three-dimensional assessment of morphological structures and synaptic connectivity. Array tomography is based on repeatedly staining and imaging ordered arrays of ultrathin (50 to 200 nm) resin-embedded serial sections on glass microscope slides. This allows quantitative, high-resolution, large-field volumetric imaging of numerous antigens, fluorescent proteins, and ultrastructures in individual tissue specimens, and can reveal important, previously unseen features of the brain's molecular architecture. Compared with other micro-scale connectome approaches, the most appealing advantage of array tomography is its potential to reveal the chemical properties (e.g., neurotransmitters) of pre- and postsynaptic elements. Thus far, this technique has been applied to investigate the fine molecular architecture of microcircuits in a well-preserved neuroanatomical context (Soiza-Reilly & Commons 2011) as well as in neurodegenerative disease models (Kopeikina et al. 2011).

A third approach to constructing an ultra-high-resolution connectivity map of the mouse brain utilizes volume electron microscopy, such as serial block-face scanning electron microscopy. Electron microscopy of serial sections was initially used to generate the Caenorhabditis elegans (C. elegans) connectome (Chklovskii et al. 2010; Jarrell et al. 2012) and has more recently been applied in the mouse brain (Bock et al. 2011; Briggman et al. 2011). Using sophisticated computer algorithms, axons and their synaptic connectivity within one particular region (e.g., 1 mm² of cortex; see Bock et al. 2011) or the retina (Briggman et al. 2011) can be graphically reconstructed in astonishing detail. The reconstruction of the entire mouse brain using this approach would be extraordinary, although its completion would take a long time.
Overall, these microscale connectomic or synaptomic approaches are currently practical for mapping local circuits in mice and for dissecting circuits in simpler species like Drosophila (Chklovskii et  al. 2010)  and C.  elegans (Jarrell et  al. 2012; White et  al. 1986). Assembling an ultra-high-resolution mouse brain connectome with these technologies is very attractive; however, their feasibility for such a large-scale undertaking remains to be demonstrated.
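A back-of-the-envelope calculation helps convey why scaling serial-section methods beyond local circuits is demanding. The sketch below uses only the 50 to 200 nm section thickness quoted above for array tomography; the 1 mm slab depth is a hypothetical value chosen for illustration, and the function name `sections_needed` is invented here.

```python
import math


def sections_needed(slab_thickness_um: int, section_thickness_nm: int) -> int:
    """Number of serial sections required to cut through a slab of tissue."""
    slab_thickness_nm = slab_thickness_um * 1000  # convert micrometers to nanometers
    return math.ceil(slab_thickness_nm / section_thickness_nm)


# Hypothetical 1 mm (1,000 micrometer) deep block, cut at the two extremes
# of the ultrathin section range quoted in the text.
SLAB_UM = 1000
thin = sections_needed(SLAB_UM, 50)    # thinnest sections, most sections
thick = sections_needed(SLAB_UM, 200)  # thickest sections, fewest sections
print(f"{thick:,} to {thin:,} serial sections per {SLAB_UM} um of depth")
```

Under these assumptions a single millimeter of tissue depth already corresponds to thousands of ordered sections, each of which must be collected, imaged, and aligned, which is one reason these methods are currently most practical for local circuits.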

To date, the only complete connectome that exists was generated for the C. elegans nervous system (Chklovskii et al. 2010; Jarrell et al. 2012), which is composed of approximately 300 nodes (neurons) linked by more than 6,000 chemical synapses, 900 electrical junctions, and 1,400 neuromuscular junctions (Jarrell et al. 2012; White et al. 1986; http://www.wormatlas.org/neuronalwiring.html). Despite concerted and continuous efforts since the first paper on the C. elegans connectome (White et al. 1986), construction of the whole-brain mammalian connectome, with its trillions of synapses, billions of neurons, and hundreds of thousands of macropathways, remains a daunting task, vastly more complex than sequencing the human genome. However, by employing the complementary approaches discussed in this chapter, the ultimate goal of the mammalian connectome is within reach. Macro-connectome projects like the one at USC can establish the global connectivity patterns of the entire mouse brain, while genetic tracers can parse cell type-specific circuitries within this global framework. These circuitries can then be investigated at greater levels of resolution using techniques such as Brainbow mice (Livet et al. 2007), array tomography (Micheva & Smith 2007), and high-throughput automated electron microscopy (Briggman & Denk 2006), while the functional significance of the circuitries can be examined using techniques such as optogenetics (Peron & Svoboda 2011; Yizhar et al. 2011).

CONCLUSION

Knowledge of WM tracts is likely to be crucial to a precise understanding of the functional architecture of the human brain. Previously, this knowledge was severely limited because it was difficult or impossible to visualize these structures either postmortem or with conventional imaging techniques such as T1-weighted MR images. Diffusion magnetic resonance imaging (dMRI) has recently generated a great deal of interest because of the information it provides on the WM structure of the living human brain. Multiscale tract tracing in the mouse provides a valuable complementary approach, supplying macro-, meso-, and micro-scale (synaptomic) detail in the most widely used model organism. Taken together, these approaches suggest that the science of brain connectivity will remain a rich area of research for many years to come.

ACKNOWLEDGMENTS

This work was supported by U01MH093765 (Rosen), R01MH094343, and P41EB015922 to AWT.

REFERENCES

Aganj, I., Lenglet, C., & Sapiro, G. (2009). ODF reconstruction in q-ball imaging with solid angle consideration. Sixth IEEE International Symposium on Biomedical Imaging, Boston. Alexander, A. L., Hasan, K., Kindlmann, G., Parker, D. L., & Tsuruda, J. S. (2000). A geometric analysis of diffusion tensor measurements of the human brain. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 44:283–291. Alexander, A. L., Hasan, K. M., Lazar, M., Tsuruda, J. S., & Parker, D. L. (2001). Analysis of partial volume effects in diffusion-tensor MRI. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 45:770–780. Alexander, D. C. (2005). Multiple-fiber reconstruction algorithms for diffusion MRI. Annals of the New York Academy of Sciences 1064:113–133. Anderson, A. W. (2005). Measurement of fiber orientation distributions using high angular resolution diffusion imaging. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 54:1194–1206. Assaf, Y., & Pasternak, O. (2008). Diffusion tensor imaging (DTI)-based white matter mapping in brain research: a review. Journal of Molecular Neuroscience 34:51–61. DOI: 10.1007/s12031-007-0029-0. Bahn, M. M. (1999). Comparison of scalar measures used in magnetic resonance diffusion tensor imaging. Journal of Magnetic Resonance 139:1–7. Basser, P. J. (2002). Relationships between diffusion tensor and q-space MRI. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 47:392–397. Basser, P. J., Mattiello, J., & LeBihan, D. (1994).
Estimation of the effective self-diffusion tensor from the NMR spin echo. Journal of Magnetic Resonance, Series B 103:247–254. Basser, P. J., Pajevic, S., Pierpaoli, C., Duda, J., & Aldroubi, A. (2000). In vivo fiber tractography using DT-MRI data. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 44:625–632.


Basser, P. J., Mattiello J., et  al. (1994). Estimation of the self-diffusion tensor from the NMR spin echo. Journal of Magnetic Resonance Imaging B 103:247–254. Bassett, D. S., Bullmore, E., Verchinski, B. A., Mattay, V. S., Weinberger, D. R., & Meyer-Lindenberg, A. (2008). Hierarchical organization of human cortical networks in health and schizophrenia. The Journal of Neuroscience:  The Official Journal of the Society for Neuroscience 28:9239–9248. DOI: 10.1523/JNEUROSCI.1929-08.2008. Beaulieu, C. (2002). The basis of anisotropic water diffusion in the nervous system—a technical review. NMR in Biomedicine 15:435–455. Beaulieu, C., & Allen, P. S. (1994). Determinants of anisotropic water diffusion in nerves. Magnetic Resonance in Medicine:  Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 31:394–400. Behrens, T. E., Woolrich, M. W., Jenkinson, M., Johansen-Berg, H., Nunes, R. G., Clare, S., . . . Smith, S. M. (2003). Characterization and propagation of uncertainty in diffusion-weighted MR imaging. Magnetic Resonance in Medicine:  Official Journal of the Society of Magnetic Resonance in Medicine/ Society of Magnetic Resonance in Medicine 50:1077–1088. Bock, D. D., Lee, W. C., Kerlin, A. M., Andermann, M. L., Hood, G., Wetzel A. W., . . . Reid, R. C. (2011). Network anatomy and in vivo physiology of visual cortical neurons. Nature 471:177–182. DOI: nature09802 [pii] 10.1038/nature09802. Bohland, J. W., Wu, C. Z., Barbas, H., Bokil, H., Bota, M., Breiter, H. C., . . . Mitra, P. P. (2009). A proposal for a coordinated effort for the determination of brainwide neuroanatomical connectivity in model organisms at a mesoscopic scale. PLoS Computational Biology 5. DOI:  ARTN e1000334 DOI 10.1371/journal.pcbi.1000334. Bota, M., Dong, H. W., & Swanson, L. W. (2012). Combining collation and annotation efforts toward completion of the rat and mouse connectomes in BAMS. Frontiers in Neuroinformatics 6:2. 
DOI: 10.3389/fninf.2012.00002. Braskie, M. N., Jahanshad, N., Stein, J. L., Barysheva, M., Johnson, K., McMahon, K. L., . . . Thompson, P. M. (2012). Relationship of a variant in the NTRK1 gene to white matter microstructure in young adults. Journal of Neuroscience 32:5964–5972. DOI:  32/17/5964 [pii] 10.1523/ JNEUROSCI.5561-11.2012. Briggman, K. L., & Denk, W. (2006). Towards neural circuit reconstruction with volume electron microscopy techniques. Current Opinion in Neurobiology 16:562–570. DOI:  S0959-4388(06)00114-0 [pii] 10.1016/j.conb.2006.08.010.


Briggman, K. L., Helmstaedter, M., & Denk, W. (2011). Wiring specificity in the direction-selectivity circuit of the retina. Nature 471:183–188. DOI: nature09818 [pii] 10.1038/nature09818. Bullmore, E., & Sporns, O. (2009). Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience 10:186–198. Callaghan, P. T. (1996). NMR imaging, NMR diffraction and applications of pulsed gradient spin echoes in porous media. Magnetic Resonance Imaging 14:701–709. Card, J. P., Kobiler, O., McCambridge, J., Ebdlahad, S., Shan, Z., Raizada, M. K., . . . Enquist, L. W. (2011). Microdissection of neural networks by conditional reporter expression from a Brainbow herpesvirus. Proceedings of the National Academy of Sciences of the United States of America 108:3377–3382. DOI: 1015033108 [pii] 10.1073/pnas.1015033108. Carr, H. Y., & Purcell, E. M. (1954). Effects of diffusion on free precession in nuclear magnetic resonance experiments. Physical Review 94:630–635. Catani, M., & ffytche D. H. (2005). The rises and falls of disconnection syndromes. Brain 128:2224–2239. Catani, M., Jones, D. K., & ffytche D. H. (2005). Perisylvian language networks of the human brain. Annals of Neurology 57:8–16. Chamberlin, N. L., Du, B., de Lacalle, S., & Saper, C. B. (1998). Recombinant adeno-associated virus vector:  use for transgene expression and anterograde tract tracing in the CNS. Brain Research 793:169–175. Chiang, M., Barysheva, M., McMahon, K. L., De Zubicaray, G. I., Johnson, K., Martin, N. G., . . . Thompson P.M. (2012). Gene Network Effects on Brain Microstructure and Intellectual Performance Identified in 472 Twins. Journal of Neuroscience 32:8732–8745. Chiang, M. C., McMahon, K.L., De Zubicaray, G.I., Martin, N. G., Hickie, I, Toga, A. W., . . . Thompson, P.M. (2010). Genetics of white matter development: a DTI study of 705 twins and their siblings aged 12 to 29. NeuroImage 54:2308–2317. Chklovskii, D. 
B., Vitaladevuni, S., & Scheffer L.K. (2010). Semi-automated reconstruction of neural circuits using electron microscopy. Current Opinion in Neurobiology 20:667–675. DOI:  DOI 10.1016/j.conb.2010.08.002. Concha, L., Gross, D. W., & Beaulieu, C. (2005). Diffusion tensor tractography of the limbic system. AJNR. American Journal of Neuroradiology 26:2267–2274. Conturo, T. E., Lori, N. F., Cull, T. S., Akbudak, E., Snyder, A. Z., Shimony, J. S., . . . Raichle, M. E. (1999). Tracking neuronal fiber pathways in the

living human brain. Proceedings of the National Academy of Sciences of the United States of America 96:10422–10427. Daianu, M., Jahanshad, N., Nir, T. M., Toga, A. W., Jack, C. R., Weiner, M. W., Thompson, P. M., Alzheimer’s Disease Neuroimaging Initiative (ADNI). (2012). analyzing the structural k-core of brain connectivity networks in normal aging and Alzheimer’s disease, MICCAI NIBAD. Dennis, E. L., Jahanshad, N., Toga, A. W., Johnson, K., McMahon, K. L., de Zubicaray, G. I., . . . Thompson, P. M. (2012). Changes in anatomical brain connectivity between ages 12 and 30: A HARDI study of 484 adolescents and adults, IEEE International Symposium on Biomedical Imaging (ISBI), Barcelona, Spain, 904–908. Dennis, E. L., Jahanshad, N., Rudie, J. D., Brown, J. A., . . . Thompson, P. M. (2011). Altered structural brain connectivity in healthy carriers of the autism risk gene, CNTNAP2. Brain Connectivity 6:447–459. Descoteaux, M., Lenglet, C., & Deriche, R. (2007). Diffusion tensor sharpening improves white matter tractography, Proceedings SPIE 6512, Medical Imaging 2007: Image Processing: 65121J. DOI: 10.1117/12.708988. San Diego, CA. Dong, H. W. (2007). Allen reference atlas: a digital color brain atlas of the c57BL/6J male mouse. Hoboken, NJ: John Wiley & Sons. Duarte-Carvajalino, J. M., Jahanshad, N., Lenglet, C., McMahon, K. L., De Zubicaray, G. I., Martin, N. G., . . . Sapiro, G. (2012). Hierarchical topological network analysis of anatomical human brain connectivity and differences related to sex and kinship. NeuroImage 59:3784–3804. Eickhoff, S. B., Stephan, K. E., Mohlberg, H., Grefkes, C., Fink, G. R., Amunts, K., & Zilles K. (2005). A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. NeuroImage 25:1325–1335. Fair, D. A., Cohen, A. L., Power, J. D., Dosenbach, N. U., Church, J. A., Miezin, F. M., . . . Petersen, S. E. (2009). Functional brain networks develop from a “local to distributed” organization. 
PLoS Computational Biology 5:e1000381. DOI: 10.1371/ journal.pcbi.1000381. Frank, L. R. (2002). Characterization of anisotropy in high angular resolution diffusion-weighted MRI. Magnetic Resonance in Medicine:  Official Journal of the Society of Magnetic Resonance in Medicine/ Society of Magnetic Resonance in Medicine 47:1083–1099. Gerig, G., Gouttard, S., & Corouge, I. (2004). Analysis of brain white matter via fiber tract modeling. Conf Proc IEEE Eng Med

Biol Soc 6:4421–4424. DOI: 10.1109/IEMBS.2004.1404229. Geschwind, N. (1965). Disconnection syndromes in animals and man. Brain 88:237–294. Gong, G., Rosa-Neto, P., Carbonell, F., Chen, Z. J., He, Y., & Evans, A. C. (2009). Age- and gender-related differences in the cortical anatomical network. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience 29:15684–15693. DOI: 10.1523/JNEUROSCI.2308-09.2009. Gong, S., Zheng, C., Doughty, M. L., Losos, K., Didkovsky, N., Schambra, U. B., Nowak, N. J., . . . Heintz, N. (2003). A gene expression atlas of the central nervous system based on bacterial artificial chromosomes. Nature 425:917–925. DOI: 10.1038/nature02033. Haacke, E. M., Brown, R. W., Thompson, M. R., & Venkatesan, R. (1999). Magnetic resonance imaging: physical principles and sequence design. New York, NY: John Wiley & Sons. Hageman, N. S., Toga, A. W., Narr, K. L., & Shattuck, D. W. (2009). A diffusion tensor imaging tractography algorithm based on Navier-Stokes fluid mechanics. IEEE Transactions on Medical Imaging 28:348–360. DOI: 10.1109/TMI.2008.2004403. Hahn, E. L. (1950). Spin echoes. Physical Review 80:580–594. Harris, J. A., Oh, S. W., & Zeng, H. (2012). Adeno-associated viral vectors for anterograde axonal tracing with fluorescent proteins in nontransgenic and Cre driver mice. Current Protocols in Neuroscience Chapter 1:Unit 1.20.1–18. DOI: 10.1002/0471142301.ns0120s59. Hintiryan, H., Gou, L., Zingg, B., Yamashita, S., Lyden, H. M., Song, M. Y., . . . Dong, H. W. (2012). Comprehensive connectivity of the mouse main olfactory bulb: analysis and online digital atlas. Frontiers in Neuroanatomy 6:30. DOI: 10.3389/fnana.2012.00030. Holten, D. (2006). Hierarchical edge bundles: visualization of adjacency relations in hierarchical data. IEEE Transactions on Visualization and Computer Graphics 12:741–748. DOI: 10.1109/TVCG.2006.147.
Hua, K., Zhang, J., Wakana, S., Jiang, H., Li, X., Reich, D.S., . . . Mori, S. (2008). Tract probability maps in stereotaxic spaces: Analyses of white matter anatomy and tract-specific quantification. NeuroImage 39:336–347. Inglis, B. A., Bossart, E. L., Buckley, D. L., Wirth, E. D.  III, & Mareci, T. H. (2001). Visualization of neural tissue water compartments using biexponential diffusion tensor MRI. Magnetic Resonance in Medicine:  Official Journal of the Society of


Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 45:580–587. Irimia, A., Chambers, M. C., Torgerson, C. M., & Horn, J. D. (2012a). Circular representation of human cortical networks for subject and population-level connectomic visualization. NeuroImage 60: 1340–1351. DOI: 10.1016/j.neuroimage.2012.01.107. Irimia, A., Chambers, M. C., Torgerson, C. M., Filippou, M., Hovda, D. A., Alger, J. R., . . .Van Horn, J. D. (2012b). Patient-tailored connectomics visualization for the assessment of white matter atrophy in traumatic brain injury. Frontiers in Neurology 3. DOI: 10.3389/fneur.2012.00010. Jahanshad, N., Kohannim, O., Hibar, D. P., Stein, J.L., McMahon, K. L., de Zubicaray, G. I., . . . Thompson, P.M. (2012). Brain structure in healthy adults is related to serum transferrin and the H63D polymorphism in the HFE gene. Proceedings of the National Academy of Sciences of the United States of America 109:E851–E859. DOI:  DOI 10.1073/ pnas.1105543109. Jahanshad, N., Valcour, V. G., Nir, T. M., Kohannim, O, Busovaca, E, Nicolas, K., & Thompson, P. M. (2012). Disrupted brain networks in the aging HIV+ population. Brain Connectivity 2:335–344. Jahanshad, N., Rajagopalan, P., Hua, X., Hibar, D. P., Nir, T. M., Toga, A. W., . . .Thompson, P. M. (2013). Genome-wide scan of healthy human connectome discovers SPON1 gene variant influencing dementia severity. Proc Natl Acad Sci USA 110:4768–4773. Jansons, K. M., & Alexander, D.C. (2003). Persistent angular structure:  new insights from diffusion MRI data. Dummy version. Information Processing in Medical Imaging 18:672–683. Jarrell, T. A., Wang, Y., Bloniarz, A. E., Brittin, C. A., Xu, M., Thomson, J.N., . . . Emmons S. W. (2012). The Connectome of a Decision-Making Neural Network. Science 337:437–444. DOI: DOI 10.1126/science.1221762. Jbabdi, S., Woolrich, M. W., Andersson, J.L.R.,& Behrens, T.E.J. (2007). A Bayesian framework for global tractography. NeuroImage 37:116–129. 
DOI: DOI 10.1016/j.neuroimage.2007.04.039. Jones, D. K., & Pierpaoli, C. (2005). Confidence mapping in diffusion tensor magnetic resonance imaging tractography using a bootstrap approach. Magnetic Resonance in Medicine:  Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine53:1143–1149. Jones, D. K., Travis, A. R., Eden, G., Pierpaoli, C., & Basser P. J. (2005). PASTA: pointwise assessment of streamline tractography attributes. Magnetic Resonance in Medicine:  Official Journal of the


Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine53:1462–1467. Jovicich, J., Czanner, S., Han, X., Salat, D., van der Kouwe, A., Quinn, B., . . . Fischl, B. (2009). MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: Reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. NeuroImage 46:177–192. DOI:  S1053-8119(09)00150-5 [pii] 10.1016/j.neuroimage.2009.02.010. Kochunov, P., Jahanshad, N., Sprooten, E., Thompson, P., McIntosh, A., Deary, I., . . . Glahn. D. (2012). Genome-wide association of full brain white matter integrity—from the ENIGMA DTI working group. Presented at the annual meeting for the Organization for Human Brain Mapping (OHBM), Beijing, China. Kohannim, O., Jahanshad, N., Braskie, M. N., Stein, J. L., Chiang, M., Reese, A. H., . . . Thompson, P. M. (2012). Predicting white matter integrity from common genetic variants. Neuropsychopharmacology 37:2012–2019. Kopeikina, K. J., Carlson, G. A., Pitstick, R., Ludvigson, A. E., Peters, A., Luebke, J. I., . . . Spires-Jones, T. L. (2011). Tau Accumulation Causes Mitochondrial Distribution Deficits in Neurons in a Mouse Model of Tauopathy and in Human Alzheimer’s Disease Brain. American Journal of Pathology 179: 2071–2082. DOI: DOI 10.1016/j.ajpath.2011.07.004. Lazar, M., Weinstein, D. M., Tsuruda, J. S., Hasan, K. M., Arfanakis, K., Meyerand, M. E., . . . Alexander, A. L. (2003). White matter tractography using diffusion tensor deflection. Human Brain Mapping 18:306–321. Le Bihan, D. (2003). Looking into the functional architecture of the brain with diffusion MRI. Nature reviews. Neuroscience 4:469–480. Le Bihan, D., Breton, E., Lallemand, D., Grenier, P., Cabanis, E., & Laval-Jeantet, M. (1986). MR imaging of intravoxel incoherent motions: application to diffusion and perfusion in neurologic disorders. Radiology 161:401–407. Lein, E. S., Hawrylycz, M. 
J., Ao, N., Ayres, M., Bensinger, A., Bernard, A., . . . et  al. (2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445:168–176. Leow, A. D., Zhu, S., Zhan, L., McMahon, K., de Zubicaray, G. I., Meredith, M., . . . Thompson P. M. (2009). The tensor distribution function. Magnetic Resonance Medicine 61:205–214. DOI:  10.1002/ mrm.21852. Leow, A. D., Zhan L., Ajilore, O., GadElkarim, J., Zhang, A., Arienzo, D., . . . Altshuler, L. (2012). Measuring inter-hemispheric integration in bipolar

affective disorder using brain network analyses and HARDI, ISBI, Barcelona, Spain. Li, Y., Liu, Y., Li, J., Qin, W., Li, K., Yu, C., & Jiang, T. (2009). Brain anatomical network and intelligence. PLoS Computational Biology 5:e1000395. DOI: 10.1371/journal.pcbi.1000395. Lin, C. P., Wedeen, V. J., Chen, J. H., Yao, C., & Tseng, W. Y. (2003). Validation of diffusion spectrum magnetic resonance imaging with manganese-enhanced rat optic tracts and ex vivo phantoms. NeuroImage 19:482–495. Livet, J., Weissman, T. A., Kang, H. N., Draft, R. W., Lu, J., Bennis, R. A., . . . Lichtman J. W. (2007). Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature 450:56. DOI: Doi 10.1038/Nature06293. Lu J., Tapia J. C., White O. L., & Lichtman J. W. (2009). The interscutularis muscle connectome (vol. 7, e1000032, 2009). Plos Biology 7:996–996. DOI:  ARTN e1000108 DOI 10.1371/journal. pbio.1000108. Luo, L., Callaway, E. M., & Svoboda, K. (2008). Genetic dissection of neural circuits. Neuron 57:634–660. DOI:  S0896-6273(08)00031-7 [pii] 10.1016/j. neuron.2008.01.002. Maddahm M., Grimsonm W.E.L., Warfieldm S. K., & Wellsm W. M. (2008). A unified framework for clustering and quantitative analysis of white matter fiber tracts. Medical Image Analysis 12:191–202. DOI:  DOI 10.1016/j. media.2007.10.003. Marshel, J. H., Mori, T., Nielsen, K. J., & Callaway E. M. (2010). Targeting single neuronal networks for gene expression and cell labeling in vivo. Neuron 67:562–574. DOI:  S0896-6273(10)00588-X [pii] 10.1016/j.neuron.2010.08.001. Micheva, K. D., & Smith S. J. (2007). Array tomography:  a new tool for imaging the molecular architecture and ultrastructure of neural circuits. Neuron 55:25–36. DOI:  S0896-6273(07)00441-2 [pii] 10.1016/j.neuron.2007.06.014. Modha, D. S., & Singh, R. (2010). Network architecture of the long-distance pathways in the macaque brain. 
Proceedings of the National Academy of Sciences of the United States of America 107: 13485–13490. DOI: 10.1073/pnas.1008054107. Molko, N., Cohen, L., Mangin, J. F., Chochon, F., Lehericy, S., Le Bihan, D., & Dehaene S. (2002). Visualizing the neural bases of a disconnection syndrome with diffusion tensor imaging. Journal of Cognitive Neuroscience 14:629–636. Mori S., Crain B. J., Chacko V. P., & van Zijl P.C. (1999). Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging. Annals of Neurology 45:265–269.

Moseley, M. E., Kucharczyk, J., Asgari, H. S., & Norman, D. (1991). Anisotropy in diffusion-weighted MRI. Magnetic Resonance in Medicine 19:321–326. Moseley, M. E., Cohen, Y., Kucharczyk, J., Mintorovitch, J., Asgari, H. S., Wendland, M. F., . . . Norman, D. (1990). Diffusion-weighted MR imaging of anisotropic water diffusion in cat central nervous system. Radiology 176:439–445. Nelson, S. B., Sugino, K., & Hempel, C. M. (2006). The problem of neuronal cell types: a physiological genomics approach. Trends in Neuroscience 29:339–345. DOI: 10.1016/j.tins.2006.05.004. Newton, J. M., Ward, N. S., Parker, G. J., Deichmann, R., Alexander, D. C., Friston, K. J., & Frackowiak, R. S. (2006). Non-invasive mapping of corticofugal fibres from multiple motor areas: relevance to stroke recovery. Brain: A Journal of Neurology 129:1844–1858. Nir, T., Jahanshad, N., Jack, C. R., Weiner, M. W., Toga, A. W., Thompson, P. M., Alzheimer’s Disease Neuroimaging Initiative (ADNI). (2012). Small world network measures predict white matter degeneration in patients with early-state mild cognitive impairment, ISBI, Barcelona, Spain. O’Donnell, L., Haker, S., & Westin, C. F. (2002). New approaches to estimation of white matter connectivity in diffusion tensor MRI: elliptic PDEs and geodesics in a tensor-warped space. Medical Image Computing and Computer-Assisted Intervention-MICCAI 2002, Pt 1 2488:459–466. O’Donnell, L. J., & Westin, C. F. (2007). Automatic tractography segmentation using a high-dimensional white matter atlas. IEEE Transactions on Medical Imaging 26:1562–1575. DOI: 10.1109/TMI.2007.906785. Osakada, F., Mori, T., Cetin, A. H., Marshel, J. H., Virgen, B., & Callaway, E. M. (2011). New rabies virus variants for monitoring and manipulating activity and gene expression in defined neural circuits. Neuron 71:617–631. DOI: 10.1016/j.neuron.2011.07.005. Osten, P., & Margrie, T. W. (2013).
Mapping brain circuitry with a light microscope. Nature Methods 10:515–523. Parker, G. J., & Alexander, D. C. (2003). Probabilistic Monte Carlo based mapping of cerebral connections utilising whole-brain crossing fibre information. Information Processing in Medical Imaging 18:684–695. Patel, V., Shi Y. G., Thompson, P. M., & Toga, A. W. (2010). Mesh-based spherical deconvolution:  a flexible approach to reconstruction of non-negative fiber orientation distributions.


NeuroImage 51:1071–1081. DOI:  DOI 10.1016/j. neuroimage.2010.02.060. Peled, S., Gudbjartsson, H., Westin, C. F., Kikinis, R., & Jolesz, F. A. (1998). Magnetic resonance imaging shows orientation and asymmetry of white matter fiber tracts. Brain Research 780:27–33. Peron, S., & Svoboda, K. (2011). From cudgel to scalpel: toward precise neural control with optogenetics. Nature Methods 8:30–34. DOI:  nmeth.f.325 [pii] 10.1038/nmeth.f.325. Qiu, A. Q., Bitouk, D., & Miller, M. I. (2006). Smooth functional and structural maps on the neocortex via orthonormal bases of the Laplace-Beltrami operator. IEEE Transactions on Medical Imaging 25:1296–1306. DOI:  Doi 10.1109/ Tmi.2006.882143. Reuter, M., Wolter, F.E., & Peinecke N. (2006). Laplace-Beltrami spectra as “Shape-DNA” of surfaces and solids. Computer-Aided Design 38: 342–366. DOI: DOI 10.1016/j.cad.2005.10.011. Shattuck, D. W., Mirza, M., Adisetiyo, V., Hojatkashani, C., Salamon, G., Narr, K.L., . . . Toga, A. W. (2008). Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage 39:1064–1080. Shi Y., Dinov I., Toga A.W. (2009a). Cortical shape analysis in the Laplace-Beltrami feature space. Medical Imaging Computing Computer Assisted Intervention 12:208–215. Shi, Y., Morra, J. H., Thompson, P. M., & Toga, A.W. (2009b). Inverse-consistent surface mapping with Laplace-Beltrami eigen-features. Information Processing in Medical Imaging 21:467–478. Shi, Y., Lai, R., Kern, K., Sicotte, N., Dinov, I., & Toga, A. W. (2008). Harmonic surface mapping with Laplace-Beltrami eigenmaps. Medical Imaging Computing Computer Assisted Intervention 11:147–154. Shu, N., Liu, Y., Li, J., Li, Y., Yu, C., & Jiang T. (2009). Altered anatomical network in early blindness revealed by diffusion tensor tractography. PloS One 4:e7228. Soiza-Reilly, M., & Commons, K. G. (2011). Quantitative analysis of glutamatergic innervation of the mouse dorsal raphe nucleus using array tomography. 
Journal of Comparative Neurology 519:3802–3814. DOI: Doi 10.1002/Cne.22734. Sporns, O., & Zwi, J. D. (2004). The small world of the cerebral cortex. Neuroinformatics 2:145–162. Stadnik, T. W., Demaerel, P., Luypaert, R. R., Chaskis, C., Van Rompaey, K. L., Michotte, A., & Osteaux M. J. (2003). Imaging tutorial:  differential diagnosis of bright lesions on diffusion-weighted MR images. Radiographics 23:e7. Stein, J. L., Medland, S. E., Vasquez, A. A., Hibar, D. P., Senstad, R. E., Winkler, A. M., et  al. (2012).


Identification of common variants associated with human hippocampal and intracranial volumes. Nature Genetics 44:552–561. Swanson, L. W., & Bota, M. (2010). Foundational model of structural connectivity in the nervous system with a schema for wiring diagrams, connectome, and basic plan architecture. Proceedings of the National Academy of Sciences of the United States of America 107:20610–20617. DOI:  DOI 10.1073/pnas.1015128107. Thomason, M. E., & Thompson, P. M. (2011). Diffusion imaging, white matter, and psychopathology. Annual Review of Clinical Psychology 7:63–85. DOI: 10.1146/annurev-clinpsy-032210-104507. Thompson, R. H., & Swanson L. W. (2003). Structural characterization of a hypothalamic visceromotor pattern generator network. Brain Research Reviews 41:153–202. DOI:  Doi 10.1016/ S0165-0173(02)00232-1. Thomsen C., Henriksen, O., & Ring, P. (1987). In vivo measurement of water self diffusion in the human brain by magnetic resonance imaging. Acta Radiologica 28:353–361. Toga, A. W., Clark, K. A., Thompson, P. M., Shattuck, D. W., & Van Horn, J. D. (2012). Mapping the human connectome. Neurosurgery 71:1–5. DOI:  10.1227/NEU.0b013e318258e9ff [doi] 00006123-201207000-00001 [pii]. Tournier, J. D., Calamante, F., Gadian, D. G., & Connelly, A. (2003). Diffusion-weighted magnetic resonance imaging fibre tracking using a front evolution algorithm. NeuroImage 20:276–288. DOI: Doi 10.1016/S1053-8119(03)00236-2. Tournier, J. D., Calamante, F., Gadian D. G., & Connelly A. (2004). Direct estimation of the fiber orientation density function from diffusion-weighted MRI data using spherical deconvolution. NeuroImage 23:1176–1185. DOI:  DOI 10.1016/j. neuroimage.2004.07.037. Tuch D. S. (2004). Q-ball imaging. Magnetic Resonance in Medicine:  Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 52:1358–1372. Tuch, D. S., Reese, T. G., Wiegell, M. R., Makris, N., Belliveau, J. W., & Wedeen, V. J. (2002). 
High angular resolution diffusion imaging reveals intravoxel white matter fiber heterogeneity. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/ Society of Magnetic Resonance in Medicine 48:577–582. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., . . . Joliot, M. (2002). Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15:273–289.

van den Heuvel, M. P., Stam, C. J., Kahn, R. S., & Hulshoff Pol, H. E. (2009). Efficiency of functional brain networks and intellectual performance. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience 29:7619–7624. Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E., Bucholz, R., . . . Yacoub, E. (2012). The Human Connectome Project:  A  data acquisition perspective. Neuroimage 62:2222–2231. DOI:  S1053-8119(12)00195-4 [pii] 10.1016/j. neuroimage.2012.02.018 [doi]. Van Horn, J. D., Irimia, A., Torgerson, C. M., Chambers, M. C., Kikinis, R., & Toga A. W. (2012). Mapping connectivity damage in the case of Phineas Gage. PloS One 7:e37454. DOI: 10.1371/ journal.pone.0037454. Wakana, S., Caprihan, A., Panzenboeck, M. M., Fallon, J. H., Perry, M., Gollub, R. L., . . . Mori, S. (2007). Reproducibility of quantitative tractography methods applied to cerebral white matter. NeuroImage 36:630–644. DOI:  DOI 10.1016/j. neuroimage.2007.02.049. Wang, L., Zhu, C., He, Y., Zang, Y., Cao, Q., Zhang, H. . . Wang, Y. (2009). Altered small-world brain functional networks in children with attention-deficit/ hyperactivity disorder. Human Brain Mapping 30:638–649. Watts, D. J., & Strogatz, S.H. (1998). Collective dynamics of “small-world” networks. Nature 393:440–442. Wedeen, V.J., Hagmann, P., Tseng, W. Y., Reese, T. G., & Weisskoff R. M. (2005). Mapping complex tissue architecture with diffusion spectrum magnetic resonance imaging. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 54:1377–1386. Wesbey, G. E., Moseley, M. E., & Ehman, R. L. (1984a). Translational molecular self-diffusion in magnetic resonance imaging. I. Effects on observed spin-spin relaxation. Investigative Radiology 19:484–490. Wesbey, G. E., Moseley, M. E., & Ehman, R. L. (1984b). Translational molecular self-diffusion in magnetic resonance imaging. II. 
Measurement of the self-diffusion coefficient. Investigative Radiology 19:491–498. Westin, C. F., Maier, S. E., Mamata, H., Nabavi, A., Jolesz, F. A., & Kikinis, R. (2002). Processing and visualization for diffusion tensor MRI. Medical Image Analysis 6:93–108. White, J. G., Southgate, E., Thomson, J. N., & Brenner, S. (1986). The structure of the nervous-system of the nematode Caenorhabditis-elegans. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 314:1–340. Xia, Y., Turken, U., Whitfield-Gabrieli, S.L., & Gabrieli, J. D. (2005). Knowledge-based classification of

neuronal fibers in entire brain. Medical Image Computing and Computer-Assisted Intervention-MICCAI 2005, Pt 1 3749:205–212. Xue, R., van Zijl, P. C., Crain, B. J., Solaiyappan, M., & Mori, S. (1999). In vivo three-dimensional reconstruction of rat brain axonal projections by diffusion tensor imaging. Magnetic Resonance in Medicine: Official Journal of the Society of Magnetic Resonance in Medicine/Society of Magnetic Resonance in Medicine 42:1123–1127.


Yizhar, O., Fenno, L. E., Davidson, T. J., Mogri, M., & Deisseroth, K. (2011). Optogenetics in neural systems. Neuron 71:9–34. Zhan, L., Leow, A. D., Aganj, I., Lenglet, C., Sapiro, G., Yacoub, E., . . . Thompson, P. M. (2011). Differential Information Content in Staggered Multiple Shell HARDI Measured by the Tensor Distribution Function. IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Chicago, 305–309.

14 Optogenetics

RICHIE E. KOHMAN, HUA-AN TSENG, AND XUE HAN

INTRODUCTION

Large-scale genomic and proteomic studies of the nervous system have yielded important insights into the molecular and cellular mechanisms of neuronal functions and related diseases. Integration of systems-level genomic and proteomic analyses with neural circuit-level description of the brain’s wiring diagram will ultimately lead to a mechanistic understanding of functional connectomics: the dynamics of the neural network connections that give rise to the neural codes underlying behaviors (Geschwind & Konopka 2009; Alivisatos et al. 2012). Advances have been made in reconstructing static brain wiring diagrams and in monitoring neural activity patterns at various spatiotemporal resolutions. However, the ability to precisely manipulate neural circuit activity, activating or silencing particular cells within a neural circuit at the millisecond time scale, has only recently been realized with the development of optogenetic techniques. The ability to simultaneously control neural circuits and monitor neural network responses, along with other -omic analyses, may be essential to building functional connectomics and ultimately to understanding and treating brain disorders.

Optogenetic techniques combine light with genetic methods to manipulate the activity patterns of brain cells and cellular pathways. To sensitize neurons or other brain cells to light, specific cells or anatomical pathways are genetically transduced to express light-sensitive proteins. The discovery of small and easy-to-use microbial light-sensitive proteins, the opsins from archaebacteria and green algae, was critical to the widespread adoption of

optogenetics in the neuroscience community. Upon light illumination, neurons or pathways expressing light-sensitive opsins can be reversibly activated or silenced at a desired time scale, down to less than a millisecond. The use of the optogenetic technique requires expression of opsins in the desired cells followed by exposure to the appropriate wavelength of light. Three major classes of microbial opsins have been successfully implemented: channelrhodopsins, halorhodopsins, and archaerhodopsins (Figure 14.1) (Han 2012). These microbial opsins are encoded by very small genes less than a kilobase long, similar in size to the gene for green fluorescent protein (GFP). Because of their small size, opsins are easily expressed in vivo with viral gene transduction methods. As a result, optogenetic experiments are now routinely performed in vivo in a variety of model organisms such as worms, rodents, and primates. Some success has also been achieved in optogenetically controlling human retinal explants, highlighting the translational potential of optogenetics in treating blindness or other neurological disorders.

Optogenetics is orthogonal to a number of neural activity readout methods, such as traditional electrophysiological recording techniques, functional magnetic resonance imaging (fMRI), and cellular imaging methods, and can therefore be used in conjunction with them. The ability to activate or silence particular cells while simultaneously monitoring neural network responses and behavioral outcomes can help to determine the causal role of certain neurons in neural network functions and related brain disorders. Herein is a brief overview of both the molecular and cellular aspects of optogenetics along with

FIGURE 14.1: Optogenetic molecular sensors. Panel labels: (A) Channelrhodopsins, (B) Halorhodopsins, (C) Archaerhodopsins; membrane sides labeled Extracellular and Intracellular. Upon light illumination, channelrhodopsins passively transport Na+, K+, H+, and Ca2+ down their electrochemical gradients to depolarize neurons (A); halorhodopsins actively pump Cl− into the cell to hyperpolarize neurons (B); archaerhodopsins actively pump H+ out of the cell to hyperpolarize neurons (C).

the genetic modification methods and optical technologies being used. Many other reviews and books also cover a variety of aspects of the current field (Bernstein & Boyden 2011; Chow et  al. 2012; Han 2012; Knopfel & Boyden 2012; Miesenbock 2011; Yizhar et  al. 2011; Zhang et al. 2011).

OVERVIEW OF SENSORS

General Considerations

The light-sensitive proteins that have been utilized for optogenetics are microbial type I opsins. These are seven-transmembrane proteins found in archaea, algae, bacteria, and fungi, where they provide natural functions for light sensing and photosynthesis (Spudich et al. 2000). The mechanism of action of some of these proteins has been studied for decades; however, they have been adapted only recently to control the function of neurons with light. The classes of opsins that have been shown to convert light energy into ion transport across the plasma membrane include channelrhodopsins, archaerhodopsins, and halorhodopsins. Type II animal opsins are found in higher eukaryotes and function by recruiting cellular kinases instead of inducing direct ion transport; they are therefore not as useful in providing temporally precise optogenetic control. A key structural component of the microbial opsins is the presence of all-trans retinal as the active chromophore. Retinal (vitamin A aldehyde) is bound as a protonated Schiff base to a conserved lysine residue within the seventh transmembrane helix (Spudich et al. 2000; Spudich 2006). The photo-induced isomerization of retinal

is responsible for a series of conformational changes that ultimately drive ion transport. A molecular understanding of this mechanism is important in guiding the development of opsin variants with improved properties. The main properties of interest are absorbance wavelength, photocycle kinetics, and the magnitude of the photocurrent. Most opsins have a broad absorbance spectrum at visible wavelengths, centered at blue to yellow light. Because light penetrates tissue more deeply at red wavelengths of about 630 nm or longer, there is a motivation to produce opsins that can be excited by red or infrared light. A narrower action spectrum is also desirable because it would reduce cross talk in multicolor experiments involving more than one type of opsin. Photocycle kinetics, the transition of opsin protein conformations through a series of nonconducting and conducting states, determine the efficiency and speed of light-evoked ion transport and, ultimately, an opsin’s ability to control cells. Lastly, photocurrent magnitude determines how much light must be used; excessive light irradiation can cause heating, which would damage the tissue.

Channelrhodopsins for Neural Activation

Channelrhodopsin-2 (ChR2) is a light-gated ion channel naturally found in the green alga Chlamydomonas reinhardtii, where it mediates phototaxis as a sensory rhodopsin homologous to phototaxis receptors and light-driven ion transporters in prokaryotes (Sineshchekov et al. 2009). ChR2 was the first opsin used to


control neural activity, as demonstrated by Boyden and colleagues (2005) and by a few other groups around the same time (Boyden 2011). The protein commonly used in many laboratories is the 315-amino-acid N-terminal domain of native algal ChR2. Since ChR2’s discovery, there has been much interest in elucidating its mechanism of action (Ishizuka et al. 2006). Standard methods, such as ultraviolet-visible (UV-vis) absorbance spectroscopy, Fourier transform infrared (FTIR) and Raman spectroscopy, and x-ray crystallography, have been used for this purpose. Absorbance spectroscopy (UV-vis) gives valuable information about the overall conformation of the protein, such as the protonation state of the retinal imine group, charge distribution within the conjugated chromophore, and interactions between the protein and the chromophore (Bamann et al. 2008; Ritter et al. 2008); vibrational spectroscopy (FTIR, Raman) reports structural changes in terms of amino acid protonation states, alterations in hydrogen bonding, retinal geometry changes, and protein-chromophore interactions (Radu et al. 2009); and x-ray crystallography provides structural information and can reveal the position of crucial functional groups in the protein for comparison with structures of more thoroughly examined opsins. A consensus on the molecular mechanism of ChR2 is beginning to emerge. A simplified mechanism is shown in Figure 14.2. Irradiation of dark-state ChR2 (D470, indicating peak absorption at 470 nm) isomerizes the retinal chromophore from its all-trans conformation to a 13-cis conformation to give intermediate P500. All studied opsins require photo-induced isomerization of retinal in order to function. In D470, the retinal-protein imine group is in its protonated, iminium form. Deprotonation of this iminium group in intermediate P500 leads to the formation of P390.
The proton transfer occurs toward the extracellular side of the protein by transfer to a carboxylate amino acid side chain residue and embedded water molecules. The Schiff base is then protonated from the cytoplasmic space to give intermediate P520. The net effect of this transfer is that of a light-driven proton pump; however, conformational changes occurring at this step produce a nonselective ion channel in the P520 conducting state that overrides the pumping effect (Feldbauer et  al. 2009). The channel allows protons as well as sodium,

potassium, and calcium ions to pass through (Nagel et al. 2003). Under ionic concentrations found in vivo, the end result is the light-induced depolarization of neurons. Retinal isomerization back to the all-trans conformation produces P480, which can relax back to the initial dark state D470. This relaxation process takes tens of seconds. However, the P480 intermediate is also photoactive; light-mediated ion transport can therefore be initiated from this state as well. The mechanistic studies summarized here, along with the recently solved crystal structure of the dark state of channelrhodopsin (Kato et al. 2012), will allow for the design of ChR2 variants with improved properties that will further enhance the power of optogenetics.

The success of ChR2 in driving neural spiking has motivated the development of a variety of channelrhodopsin variants with improved properties. Variants with higher photocurrent amplitude have been engineered as ChR2 point mutants, such as ChR2(H134R) and ChR2(T159C), or as chimeras of ChR1 and ChR2 (ChIEF, ChRGR) (Berndt et al. 2011; Lin et al. 2009; Nagel et al. 2005; Wang et al. 2009; Wen et al. 2010). Faster kinetics were achieved in ChETA (Gunaydin et al. 2010). Long-term depolarization was achieved with the ChR2 point mutants ChR2(D156A) and ChR2(C128A/S) (Bamann et al. 2010; Berndt et al. 2009). Red light-activated variants were found, such as VChR1 from Volvox carteri, MChR1 from Mesostigma viride, and chimeras such as C1V1 (Govorunova et al. 2011; Yizhar et al. 2011; Zhang et al. 2008). Enhanced calcium permeability was realized with the point mutation ChR2(L132C) (Kleinlogel et al. 2011). These ChR2 variants allow cells to be activated at different time scales, with a precision ranging from less than a millisecond to a few minutes.
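The photocycle described above can be summarized as a small cyclic state table. The following Python sketch is illustrative only: the intermediate names and step descriptions follow the text, while the dictionary layout and the helper name `walk_photocycle` are assumptions introduced here. For simplicity it omits the fact that P480 is itself photoactive and can re-enter the cycle without first relaxing to D470.

```python
# Simplified ChR2 photocycle as a cyclic state table (illustrative sketch).
# State names follow the text; the data layout and helper are assumptions.
CHR2_PHOTOCYCLE = {
    "D470": ("P500", "light absorption; retinal isomerizes from all-trans to 13-cis"),
    "P500": ("P390", "deprotonation of the Schiff base (iminium) group"),
    "P390": ("P520", "reprotonation of the Schiff base and conformational change"),
    "P520": ("P480", "retinal isomerizes back to all-trans"),
    "P480": ("D470", "slow structural relaxation to the dark state"),
}

# The only intermediate in which the nonselective cation channel is open.
CONDUCTING_STATE = "P520"


def walk_photocycle(start="D470"):
    """Return the states visited in one full cycle, beginning at `start`."""
    states = [start]
    current = CHR2_PHOTOCYCLE[start][0]
    while current != start:
        states.append(current)
        current = CHR2_PHOTOCYCLE[current][0]
    return states


if __name__ == "__main__":
    for state in walk_photocycle():
        next_state, step = CHR2_PHOTOCYCLE[state]
        note = " (channel open)" if state == CONDUCTING_STATE else ""
        print(f"{state}{note} -> {next_state}: {step}")
```

Keeping the conducting state as a separate constant makes it easy to see that ion flow is confined to a single step of the cycle, which is the point the mechanistic description emphasizes.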

Halorhodopsins for Neural Silencing

Shortly after the demonstration of neuron activation with ChR2, halorhodopsins were expressed in neurons to generate hyperpolarizing currents upon exposure to light (Han & Boyden 2007; Zhang et al. 2007). The first halorhodopsin used was from Natronomonas pharaonis (Halo/NpHR), a yellow light-activated chloride pump. Its initial use in vivo was hampered by poor trafficking to the plasma membrane. Newer generations of Halo with improved photocurrents

FIGURE 14.2: Simplified ChR2 photocycle. Cycle shown: D470 → (hν, retinal isomerization) → P500 → (retinal deprotonation) → P390 → (retinal protonation and conformation change) → P520 → (retinal isomerization) → P480 → (structural relaxation) → D470; membrane sides labeled Extracellular and Intracellular. Light initiates retinal isomerization, which leads to a cascade of conformational changes and proton transfers. Although structural changes happen at every step, the cation channel is produced only in intermediate P520. Intermediates are named after peaks observed in absorbance spectra. hν = light.

have been created by tagging the protein with mammalian membrane trafficking sequences (Gradinaru et al. 2008, 2010; Ma et al. 2001; Zhao et al. 2008). Although Halo was used for optogenetics only after ChR2, the structure and photochemical mechanism of the Halo protein had been investigated decades earlier, using the same methods described above for ChR2 (Bamberg et al. 1993; Essen 2002; Haupts et al. 1997; Kolbe et al. 2000; Oesterhelt et al. 1985; Schobert & Lanyi 1982). The mechanisms of halorhodopsin photocycles are similar to those of ChR2 in that they are all initiated by light absorption by the bound chromophore, all-trans retinal. The protein conformational changes, however, do not create a channel for ions to passively diffuse through; rather, they selectively pump a specific ion across the cell membrane. Figure 14.3 shows the simplified mechanism of Halo, beginning with photoisomerization of the retinal to give intermediate HR600. This intermediate is also associated with a dipole shift of the Schiff base from the extracellular to the intracellular side of the protein. In this state, as in the dark state (HR580), a chloride counterion is bound on the extracellular side of the protein channel.

Conformational changes allow the chloride ion to shift from one side of the protein to the other, following the dipole shift of the iminium group (intermediate HR520). In this state the chloride ion can be transferred into the cytosol, producing intermediate HR640. Conversion to the original conformation occurs after retinal isomerization (intermediate HR565) followed by structural relaxation and extracellular chloride uptake. This conformational relaxation restores the iminium dipole to its initial conformation.

Archaerhodopsins for Neural Silencing

The limited silencing ability of the first-generation halorhodopsins prompted the search for more effective hyperpolarizing opsins. Success was achieved with a class of light-driven outward proton pumps called archaerhodopsins. Archaerhodopsin-3 from Halorubrum sodomense (Arch) was shown to mediate rapid and reversible silencing when expressed in neurons and stimulated with green-yellow light (Chow et al. 2010). Arch traffics well to the plasma membrane and works well in vivo. Several variants have now been reported. ArchT from the Halorubrum strain TP009 possesses an improved photocurrent

FIGURE 14.3: Simplified halorhodopsin photocycle. Cycle shown: HR580 → (hν, retinal isomerization) → HR600 → (chloride translocation) → HR520 → (chloride release) → HR640 → (retinal isomerization) → HR565 → (structural relaxation and chloride reuptake) → HR580; membrane sides labeled Extracellular and Intracellular. Light initiates a mechanism involving conformational protein changes, and iminium dipole shifting leads to the pumping of chloride ions inward. This occurs because the iminium counterion, chloride, can be translocated by following the dipole shifting of the Schiff base. Intermediates are named after peaks observed in absorbance spectra. hν = light.

(Han et al. 2011), and Mac from Leptosphaeria maculans has a blue-shifted action spectrum (Chow et al. 2010). Although Arch is a proton pump, it has been shown not to alter cellular pH, because of compensatory mechanisms. No negative effect on the extracellular space of cells expressing Arch has been reported; however, this has not been thoroughly investigated. The mechanism of Arch has not been extensively investigated; however, bacteriorhodopsin, which is highly homologous to Arch, has been studied for decades (Lanyi 2004; Lanyi & Luecke 2001; Luecke 2000). The simplified mechanism for bacteriorhodopsin activity is shown in Figure 14.4. Irradiation with light isomerizes the retinal chromophore. Conformational shifts from dark state BR570 to L550 enable a proton transfer from the iminium group to a carboxylate group positioned toward the extracellular side of the protein. Upon deprotonation, the imine dipole shifts so that the nitrogen lone pair points toward the intracellular side of the protein rather than the extracellular side, giving intermediate M412. This intermediate is positioned so that the imine group can be reprotonated by a carboxylic acid on the intracellular portion of the opsin. In this state (N560), the proton that was initially bound to the chromophore iminium group is released into

the extracellular space and the cytosolic carboxylate is reprotonated from the intracellular side. The net effect is the transfer of one proton out of the cell per photocycle. This cycle is completed upon isomerization of the chromophore back to the all-trans conformation (intermediate O610) followed by conformational changes that reposition the chromophore dipole toward the extracellular space. The Bacteriorhodopsin mechanism shares many similarities with that of Halorhodopsin, but a major difference is the presence of carboxylic acid groups along the channel interior of Bacteriorhodopsin. As a result, proton transfers involving the protonated retinal Schiff base are possible. In Halo, these residues are not present to transfer protons; therefore the ionic species that is pumped is the iminium counterion chloride.
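The bacteriorhodopsin photocycle described above can be sketched as a small state machine. This is only an illustrative toy: the state names and step labels come from the text and Figure 14.4, while the function name `run_photocycles` and the assumption that every absorbed photon completes exactly one full cycle are simplifications introduced here.

```python
# Intermediates in the order the text describes them, each paired with the
# step that leads out of it (labels taken from Figure 14.4).
PHOTOCYCLE = [
    ("BR570", "retinal isomerization"),
    ("L550", "retinal deprotonation"),
    ("M412", "retinal protonation"),
    ("N560", "retinal isomerization"),
    ("O610", "structural relaxation"),
]


def run_photocycles(n_photons):
    """Walk through the cycle once per absorbed photon.

    Returns (visited_states, protons_exported). Assumes, for illustration,
    that every photon drives one complete cycle and that each cycle
    exports exactly one proton, as stated in the text.
    """
    visited = []
    protons_exported = 0
    for _ in range(n_photons):
        for state, _step in PHOTOCYCLE:
            visited.append(state)
        protons_exported += 1
    visited.append("BR570")  # the pump ends back in the dark state
    return visited, protons_exported


if __name__ == "__main__":
    states, protons = run_photocycles(2)
    print(" -> ".join(states))
    print("protons exported:", protons)
```

The halorhodopsin cycle in Figure 14.3 could be modeled the same way by swapping in its intermediates and counting one chloride ion imported per cycle.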

FIGURE 14.4: Simplified bacteriorhodopsin photocycle. Light initiates a mechanism in which conformational changes in the protein and shifting of the iminium dipole lead to the outward shuttling of protons. This is made possible by proton transfers between the Schiff base and carboxylic acid residues on both the intra- and extracellular pores. Intermediates are named after peaks observed in absorbance spectra. Diagram labels include the intermediates BR570, L550, M412, N560, and O610 and the steps retinal isomerization, retinal deprotonation, retinal protonation, and structural relaxation. hv = light.

FUTURE PERSPECTIVES ON OPSIN MOLECULAR ENGINEERING

As detailed above, an opsin’s photocurrent depends critically on its photocycle and on its proper targeting to the plasma membrane. Continued improvement of the molecular properties of the opsins will facilitate more versatile and powerful optogenetic controls. In general, light-activated ion channels such as channelrhodopsins can be readily engineered to conduct current over a wide range of time scales, because photonic energy is used only to gate the opening of the channel; the kinetics of channel closing can be engineered to produce ChR2 variants with closing time constants ranging from a few milliseconds to minutes (Yizhar et al. 2011). In contrast, for the light-activated ion pumps, halorhodopsins and archaerhodopsins, the photonic energy is consumed in actively transporting ions across the membrane; the timing of ion transport therefore parallels the presence of light, and it is more difficult to engineer ion transport across various time scales. In addition, fusing the photo-sensing domain of microbial opsins with G protein-coupled receptors (GPCRs) has produced light-activated GPCRs capable of controlling cellular signaling (Airan et al. 2009). Over the past few years, rapid progress has been made in engineering opsin variants to achieve powerful activation and silencing at various time scales. Continued opsin engineering will provide neuroscientists with more powerful and versatile optogenetic controls.
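To make the range of closing time constants concrete, the sketch below assumes that channel closing after light offset follows a single exponential, which is a common simplification rather than something stated in the chapter. The function name and the two example time constants (10 ms and 60 s) are hypothetical values chosen only to span "milliseconds to minutes".

```python
import math


def fraction_open(t_ms, tau_off_ms):
    """Fraction of channels still open t_ms after light offset.

    Assumes single-exponential closing with time constant tau_off_ms.
    """
    if tau_off_ms <= 0:
        raise ValueError("tau_off_ms must be positive")
    return math.exp(-t_ms / tau_off_ms)


if __name__ == "__main__":
    # Hypothetical fast and slow variants, compared 100 ms after light offset.
    for label, tau in [("fast variant", 10.0), ("slow variant", 60_000.0)]:
        print(f"{label}: {fraction_open(100.0, tau):.4f} of channels open")
```

Under this model a fast variant has essentially closed 100 ms after the light pulse, whereas a slow variant is still almost fully open, which is why a brief pulse can produce a prolonged change in excitability.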

METHODS OF GENETIC MODIFICATION

Successful implementation of optogenetics requires expressing opsins in the desired cell types. The ability to control specific cells is a major advantage of optogenetic technology over traditional pharmacological or electrical perturbation methods. Viral gene delivery methods can achieve high efficiency, and the whole-animal transgenic approach allows targeted expression in specific cell types. Nonviral methods, however, generally exhibit low transduction efficiency in vivo at present (Luo & Saltzman 2000) and are not discussed here.

Viral Delivery

Viruses remain the most commonly used means of introducing exogenous genes into brain cells. Viral gene delivery is well established and is used both in basic research and in human gene therapy (Han 2012; Waehler et al. 2007). The most commonly used vectors are adeno-associated virus (AAV) and lentivirus, which have been engineered to possess high transduction efficiency and low toxicity. Despite these advantages, viruses suffer from a limited payload capacity. The limit for AAV is about 4.7 kb of DNA and for lentivirus approximately 8 kb. Herpes simplex virus (HSV) can in theory package more than 30 kb but in practice can efficiently deliver only a much smaller payload (Neve & Carlezon 2002). A major drawback of viral delivery is that it is relatively nonselective in which cells it infects. This is because of intrinsic viral tropism as well as the limited ability to refine cell targeting with the small promoter sequences that a virus is capable of packaging. As a result, it remains difficult to target specific cell types in nontransgenic species.
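The payload limits above translate into a simple budgeting exercise when a construct is designed. The following sketch uses the approximate limits quoted in the text; the function name `fits_in_vector` and the element sizes in the example cassette are hypothetical placeholders, not measured values for any real construct.

```python
# Approximate packaging limits quoted in the text, in kilobases.
PAYLOAD_LIMIT_KB = {"AAV": 4.7, "lentivirus": 8.0}


def fits_in_vector(vector, element_lengths_kb):
    """Return (fits, margin_kb) for a cassette described as {element: length_kb}."""
    total = sum(element_lengths_kb.values())
    limit = PAYLOAD_LIMIT_KB[vector]
    return total <= limit, round(limit - total, 3)


if __name__ == "__main__":
    # Hypothetical cassette; every size here is illustrative only.
    cassette = {
        "promoter": 1.3,
        "opsin-reporter fusion": 1.7,
        "WPRE": 0.6,
        "polyA": 0.2,
    }
    for vector in PAYLOAD_LIMIT_KB:
        fits, margin = fits_in_vector(vector, cassette)
        print(f"{vector}: fits={fits}, margin={margin} kb")
```

A budget like this makes clear why only short promoters can be used in viral vectors: a long cell type-specific regulatory region would consume most of the available capacity before the opsin itself is added.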

Transgenics

Whole-animal transgenic approaches utilize the entire chromosome, including all promoter sequences and regulatory elements for restricted gene expression, and thus are capable of achieving high cell-type specificity. In particular, much effort has been directed toward the generation of transgenic mice utilizing Cre-LoxP recombination technologies. Cre, a recombinase, catalyzes recombination between the 34-base-pair LoxP recognition sequences flanking a gene of interest (Sauer & Henderson 1988; Tsien et al. 1996) (Figure 14.5A). For example, the National Institutes of Health and private organizations and laboratories have generated various Cre driver transgenic mice with targeted Cre expression in specific cell types. A useful database for Cre transgenic mice is GENSAT (Gene Expression Nervous System Atlas, http://www.gensat.org/CrePipeline.jsp), where one can find the status of particular transgenic mice with targeted Cre expression in a cell type of interest.

FIGURE 14.5: Cell-specific targeting. (A) Cell type-specific targeting with Cre-LoxP strategies. (Ai) Targeted expression of opsins in Cre-expressing cells through removal of the transcription/translation stop sequences that block opsin expression. (Aii) Targeted expression of opsins in Cre-expressing cells through flipping of an opsin gene that is otherwise in the noncoding orientation (adapted from (69)). (B) Cell type-specific targeting using short promoters in a virus: selective targeting of cortical excitatory neurons using a lentivirus with a CaMKII promoter (adapted from (54)). (C) Anatomical pathway-specific targeting. (Ci) Light illumination at area B selectively activates the A-to-B projection. (Cii) Light illumination at the cell bodies retrogradely labeled through axon terminals projecting to area B. In this case, all the downstream areas that receive projections from area A are modulated.

To target specific cell types, viruses carrying opsin genes flanked by LoxP sequences can be injected into specific brain regions of a particular Cre-driver mouse. Even though the virus infects all cells nonselectively within the injection area, opsin genes are expressed only in the presence of Cre. Because Cre is expressed in specific cell types in these transgenic mice, opsin expression is restricted to those same cell types. Alternatively, one can cross transgenic mice containing LoxP-regulated opsin genes with Cre-driver transgenic lines (Madisen et al. 2012). In the offspring, even though all cells contain LoxP-regulated, nonexpressing opsin genes, the opsin genes are expressed only in cells that have Cre, achieving high cell-type specificity. Cre transgenic mice, together with LoxP-Cre technology for targeted optogenetic control of specific cells, have been widely used to study the functional roles of various cell types in a number of behaviors. Similar efforts are now being directed toward generating Cre transgenic rats (Witten et al. 2011).
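The two Cre-dependent designs in Figure 14.5A can be illustrated with a toy model that treats a cassette as an ordered list of elements. Everything about the representation is an assumption made for illustration: lox sites are written "loxP>" or "loxP<" to show orientation, a gene in the noncoding orientation carries a "~" prefix, and only a single pair of lox sites is modeled (the Lox2272 pair shown in the figure is omitted).

```python
def recombine(elements, cre_present):
    """Return the cassette after optional Cre recombination.

    Same-orientation lox sites: the flanked segment is excised and one site
    remains. Opposite-orientation sites: the flanked segment is inverted.
    """
    result = list(elements)
    if not cre_present:
        return result
    sites = [i for i, e in enumerate(result) if e in ("loxP>", "loxP<")]
    if len(sites) != 2:
        return result
    first, second = sites
    inner = result[first + 1:second]
    if result[first] == result[second]:
        # Excision: drop the flanked segment and one of the two sites.
        return result[:first] + [result[first]] + result[second + 1:]
    # Inversion: reverse the segment and flip each element's orientation.
    flipped = [e[1:] if e.startswith("~") else "~" + e for e in reversed(inner)]
    return result[:first + 1] + flipped + result[second:]


def is_expressed(elements, gene):
    """True if the gene sits in sense orientation downstream of the promoter
    with no STOP element between the promoter and the gene."""
    if "promoter" not in elements or gene not in elements:
        return False
    p, g = elements.index("promoter"), elements.index(gene)
    return p < g and "STOP" not in elements[p:g]


if __name__ == "__main__":
    stop_design = ["promoter", "loxP>", "STOP", "loxP>", "ChR2-GFP"]
    flip_design = ["promoter", "loxP>", "~ChR2-GFP", "loxP<"]
    for name, design in [("stop removal", stop_design), ("gene flip", flip_design)]:
        for cre in (False, True):
            expressed = is_expressed(recombine(design, cre), "ChR2-GFP")
            print(f"{name}, Cre={cre}: expressed={expressed}")
```

In both designs the cassette can be present in every cell, yet the opsin is expressed only where Cre is also present, which is the logic that confers cell-type specificity.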

LIGHT-STIMULATION TECHNOLOGY

Once the cells of interest have been successfully labeled with a particular opsin, light of the appropriate wavelength is needed to control or perturb the function of these cells. Because an optical fiber is often comparable in size to an electrode used in vivo, implanting a fiber is similar in practice to implanting a recording electrode or an infusion cannula. Optogenetic perturbation can be combined with readout methods to monitor downstream neural circuit effects and behavioral consequences simultaneously. Optogenetics has been combined with various neural monitoring techniques, such as fMRI and cellular imaging methods, for large-scale neural network readouts.

Wavelength Considerations

Since the action spectra of opsins are rather broad, a range of wavelengths can be used to activate a specific opsin. This flexibility makes it possible to choose laser or LED colors economically for a particular experiment. For example, the activation spectrum of ChR2 peaks around 470 nm, and inexpensive lasers costing under $1,000 are available at 473 nm. In contrast, for Halo and Arch, whose activation spectra peak around 590 nm, stable yellow lasers at this wavelength are expensive. Instead, an inexpensive 525-nm green laser can be a good substitute, activating Arch at about 65% efficiency. At the moment, many companies sell analog-modulated fiber-coupled lasers that can easily be incorporated into behavioral or electrophysiological setups.
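The trade-off of using an off-peak laser can be put in rough numbers. The sketch below assumes that opsin activation scales linearly with light power in the non-saturating range, so that an off-peak wavelength with relative efficiency e needs 1/e times the power to match peak-wavelength activation. That linearity, the function name, and the 10 mW starting value are illustrative assumptions; only the 65% figure comes from the text.

```python
def required_power_mw(power_at_peak_mw, relative_efficiency):
    """Power needed at an off-peak wavelength to match activation at the peak.

    Assumes activation is proportional to power * relative_efficiency.
    """
    if not 0.0 < relative_efficiency <= 1.0:
        raise ValueError("relative_efficiency must be in (0, 1]")
    return power_at_peak_mw / relative_efficiency


if __name__ == "__main__":
    # Hypothetical: 10 mW would suffice at the ~590 nm peak for Arch.
    needed = required_power_mw(10.0, 0.65)
    print(f"Power needed at 525 nm: {needed:.1f} mW")
```

Any such increase in power has to be weighed against tissue heating at the fiber tip, so the estimate is a starting point rather than a prescription.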

Light Illumination Volume

Light propagation in brain tissue is influenced by tissue absorption and scattering, both of which are wavelength-dependent (Mobley & Vo-Dinh 2003). In the visible wavelength range, where most opsins operate, absorption by blood hemoglobin is the major consideration (Figure 14.6). It is therefore desirable to develop opsins that can be driven by red light at wavelengths greater than 630 nm, where the hemoglobin absorption coefficient is drastically reduced and light can penetrate a much larger tissue volume.

FIGURE 14.6: Absorption spectra of oxygenated (HbO2) and deoxygenated (Hb) hemoglobin, plotted as molar extinction coefficient (cm⁻¹ M⁻¹, logarithmic axis from 100 to 1,000,000) against wavelength (300 to 1,000 nm). Values are based on those compiled by Scott Prahl, http://omlc.ogi.edu/spectra. The peak excitation wavelengths for ChR2, Arch, and Halo are indicated.

The volume of penetration is best modeled with Monte Carlo simulations; in general, light intensity falls off nonlinearly, with the sharpest drop occurring within the first couple of hundred microns. Because every fiber that penetrates the brain causes some mechanical damage to the tissue, ideally one would use small fibers to illuminate as much brain volume as possible. However, heat generated at the tip of the optical fiber can also be damaging. It is important to find a proper balance between the size and number of optical fibers and the light intensity delivered by each fiber when illuminating the brain volume of interest.
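A minimal sketch of the Monte Carlo idea mentioned above follows. It is a deliberately reduced model and not the method of any cited study: photons enter a semi-infinite medium travelling straight down, take exponentially distributed steps, and at each interaction are either absorbed or scattered according to the Henyey-Greenstein phase function; only depth is tracked, and refractive-index effects at the surface are ignored. The coefficients `mu_a`, `mu_s`, and `g` in the example are placeholder values, not measured optical properties of brain tissue.

```python
import math
import random


def sample_hg_cos(g, rng):
    """Sample cos(theta) from the Henyey-Greenstein phase function."""
    u = rng.random()
    if abs(g) < 1e-6:
        return 2.0 * u - 1.0  # isotropic scattering
    frac = (1.0 - g * g) / (1.0 - g + 2.0 * g * u)
    return (1.0 + g * g - frac * frac) / (2.0 * g)


def scatter(direction, g, rng):
    """Rotate a unit direction vector by a sampled scattering angle."""
    ux, uy, uz = direction
    cos_t = sample_hg_cos(g, rng)
    sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
    phi = 2.0 * math.pi * rng.random()
    cos_p, sin_p = math.cos(phi), math.sin(phi)
    if abs(uz) > 0.99999:  # nearly vertical: avoid dividing by ~0
        return (sin_t * cos_p, sin_t * sin_p, cos_t * math.copysign(1.0, uz))
    denom = math.sqrt(1.0 - uz * uz)
    nx = sin_t * (ux * uz * cos_p - uy * sin_p) / denom + ux * cos_t
    ny = sin_t * (uy * uz * cos_p + ux * sin_p) / denom + uy * cos_t
    nz = -sin_t * cos_p * denom + uz * cos_t
    return (nx, ny, nz)


def absorption_depths(n_photons, mu_a, mu_s, g, seed=0, max_events=10000):
    """Return (depths, n_lost): depths at which photons were absorbed, and the
    number that left through the surface or exceeded max_events.

    mu_a and mu_s are absorption and scattering coefficients in 1/mm.
    """
    rng = random.Random(seed)
    mu_t = mu_a + mu_s
    depths, n_lost = [], 0
    for _ in range(n_photons):
        z, direction = 0.0, (0.0, 0.0, 1.0)
        for _ in range(max_events):
            step = -math.log(1.0 - rng.random()) / mu_t
            z += step * direction[2]
            if z < 0.0:  # photon left the tissue through the surface
                n_lost += 1
                break
            if rng.random() < mu_a / mu_t:  # absorbed at this depth
                depths.append(z)
                break
            direction = scatter(direction, g, rng)
        else:
            n_lost += 1
    return depths, n_lost


if __name__ == "__main__":
    depths, lost = absorption_depths(5000, mu_a=0.1, mu_s=10.0, g=0.9, seed=1)
    shallow = sum(z < 0.2 for z in depths)
    print(f"absorbed: {len(depths)}, lost: {lost}, absorbed above 0.2 mm: {shallow}")
```

With scattering switched off (`mu_s=0`) the model reduces to simple exponential attenuation, which is a convenient sanity check. A realistic simulation would also track lateral position, fiber geometry, and wavelength-specific tissue coefficients.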

CONCLUSION

The development of microbial opsin-based optogenetic techniques has revolutionized the study of how specific cells contribute to neural computation and behavior. These techniques now offer the ability to excite or inhibit neurons with millisecond time resolution, the time scale relevant for neural computation, and can be easily combined with a variety of neural and behavioral readout techniques. Continued effort is being directed toward developing new generations of opsin molecules with improved functions, namely channelrhodopsin families that can activate neurons and halorhodopsin and archaerhodopsin families that can inhibit neurons. Transgenic technologies allow researchers to target opsin expression to specific cell types in mice, and viral vectors allow opsin expression in other animal models and humans. These genetic techniques, combined with advances in light illumination technology, will allow researchers to study causal relationships in networks of neurons with greater specificity and to better understand the neural circuitry of brain diseases.

REFERENCES

Airan, R. D., K. R. Thompson, et al. (2009). Temporally precise in vivo control of intracellular signalling. Nature 458(7241): 1025–1029.
Alivisatos, A. P., M. Chun, et al. (2012). The brain activity map project and the challenge of functional connectomics. Neuron 74(6): 970–974.
Bamann, C., R. Gueta, et al. (2010). Structural guidance of the photocycle of channelrhodopsin-2 by an interhelical hydrogen bond. Biochemistry 49(2): 267–278.
Bamann, C., T. Kirsch, et al. (2008). Spectral characteristics of the photocycle of channelrhodopsin-2 and its implication for channel function. Journal of Molecular Biology 375(3): 686–694.
Bamberg, E., J. Tittor, et al. (1993). Light-driven proton or chloride pumping by halorhodopsin. Proc Natl Acad Sci U S A 90(2): 639–643.
Berndt, A., P. Schoenenberger, et al. (2011). High-efficiency channelrhodopsins for fast neuronal stimulation at low light levels. Proc Natl Acad Sci U S A 108(18): 7595–7600.
Berndt, A., O. Yizhar, et al. (2009). Bi-stable neural state switches. Nat Neurosci 12(2): 229–234.
Bernstein, J. G., & E. S. Boyden (2011). Optogenetic tools for analyzing the neural circuits of behavior. Trends Cogn Sci 15(12): 592–600.
Boyden, E. S. (2011). A history of optogenetics: the development of tools for controlling brain circuits with light. F1000 Biology Reports 3: 11 (doi:10.3410/B3-11).
Boyden, E. S., F. Zhang, et al. (2005). Millisecond-timescale, genetically targeted optical control of neural activity. Nat Neurosci 8(9): 1263–1268.
Chow, B. Y., X. Han, et al. (2012). Genetically encoded molecular tools for light-driven silencing of targeted neurons. Prog Brain Res 196: 49–61.
Chow, B. Y., X. Han, et al. (2010). High-performance genetically targetable optical neural silencing by light-driven proton pumps. Nature 463(7277): 98–102.
Essen, L.-O. (2002). Halorhodopsin: light-driven ion pumping made simple? Curr Opin Struct Biol 12(4): 516–522.
Feldbauer, K., D. Zimmermann, et al. (2009). Channelrhodopsin-2 is a leaky proton pump. Proc Natl Acad Sci U S A 106(30): 12317–12322.
Geschwind, D. H., & G. Konopka (2009). Neuroscience in the era of functional genomics and systems biology. Nature 461(7266): 908–915.
Govorunova, E. G., E. N. Spudich, et al. (2011). New channelrhodopsin with a red-shifted spectrum and rapid kinetics from Mesostigma viride. MBio 2(3): e00115–00111.
Gradinaru, V., K. R. Thompson, et al. (2008). eNpHR: a Natronomonas halorhodopsin enhanced for optogenetic applications. Brain Cell Biol 36(1-4): 129–139.
Gradinaru, V., F. Zhang, et al. (2010). Molecular and cellular approaches for diversifying and extending optogenetics. Cell 141(1): 154–165.
Gunaydin, L. A., O. Yizhar, et al. (2010). Ultrafast optogenetic control. Nat Neurosci 13(3): 387–392.
Han, X. (2012). In vivo application of optogenetics for neural circuit analysis. ACS Chem Neurosci 3(8): 577–584.
Han, X. (2012). Optogenetics in the nonhuman primate. Prog Brain Res 196: 215–233.
Han, X., & E. S. Boyden (2007). Multiple-color optical activation, silencing, and desynchronization of neural activity, with single-spike temporal resolution. PLoS One 2(3): e299.

Han, X., B. Y. Chow, et al. (2011). A high-light sensitivity optical neural silencer: development and application to optogenetic control of non-human primate cortex. Front Syst Neurosci 5: 18.
Haupts, U., J. Tittor, et al. (1997). General concept for ion translocation by halobacterial retinal proteins: the isomerization/switch/transfer (IST) model. Biochemistry 36(1): 2–7.
Ishizuka, T., M. Kakuda, et al. (2006). Kinetic evaluation of photosensitivity in genetically engineered neurons expressing green algae light-gated channels. Neurosci Res 54(2): 85–94.
Kato, H. E., F. Zhang, et al. (2012). Crystal structure of the channelrhodopsin light-gated cation channel. Nature 482(7385): 369–374.
Kleinlogel, S., K. Feldbauer, et al. (2011). Ultra light-sensitive and fast neuronal activation with the Ca2+-permeable channelrhodopsin CatCh. Nat Neurosci 14(4): 513–518.
Knopfel, T., & E. S. Boyden, Eds. (2012). Optogenetics: tools for controlling and monitoring neuronal activity. Prog Brain Res, vol. 196 (Elsevier).
Kolbe, M., H. Besir, et al. (2000). Structure of the light-driven chloride pump halorhodopsin at 1.8 Å resolution. Science 288(5470): 1390–1396.
Lanyi, J. K. (2004). Bacteriorhodopsin. Annu Rev Physiol 66: 665–688.
Lanyi, J. K., & H. Luecke (2001). Bacteriorhodopsin. Curr Opin Struct Biol 11(4): 415–419.
Lin, J. Y., M. Z. Lin, et al. (2009). Characterization of engineered channelrhodopsin variants with improved properties and kinetics. Biophys J 96(5): 1803–1814.
Luecke, H. (2000). Atomic resolution structures of bacteriorhodopsin photocycle intermediates: the role of discrete water molecules in the function of this light-driven ion pump. Biochim Biophys Acta (BBA) - Bioenergetics 1460(1): 133–156.
Luo, D., & W. M. Saltzman (2000). Synthetic DNA delivery systems. Nat Biotechnology 18(1): 33–37.
Ma, D., N. Zerangue, et al. (2001). Role of ER export signals in controlling surface potassium channel numbers. Science 291(5502): 316–319.
Madisen, L., T. Mao, et al. (2012). A toolbox of Cre-dependent optogenetic transgenic mice for light-induced activation and silencing. Nat Neurosci 15(5): 793–802.
Miesenbock, G. (2011). Optogenetic control of cells and circuits. Annu Rev Cell Dev Biol 27: 731–758.
Mobley, J., & T. Vo-Dinh (2003). Optical properties of tissue. Biomedical Photonics Handbook (pp. 1–72). Boca Raton, FL: CRC Press.
Nagel, G., M. Brauner, et al. (2005). Light activation of channelrhodopsin-2 in excitable cells of Caenorhabditis elegans triggers rapid behavioral responses. Curr Biol 15(24): 2279–2284.
Nagel, G., T. Szellas, et al. (2003). Channelrhodopsin-2, a directly light-gated cation-selective membrane channel. Proc Natl Acad Sci U S A 100(24): 13940–13945.
Neve, R. L., & W. A. Carlezon, Jr., Eds. (2002). Gene delivery into the brain using viral vectors. Neuropsychopharmacology: The Fifth Generation of Progress. Edited by Kenneth L. Davis, Dennis Charney, Joseph T. Coyle, and Charles Nemeroff. American College of Neuropsychopharmacology, Brentwood, TN.
Oesterhelt, D., P. Hegemann, & J. Tittor (1985). The photocycle of the chloride pump halorhodopsin. II: Quantum yields and a kinetic model. EMBO J 4(9): 2351–2356.
Radu, I., C. Bamann, et al. (2009). Conformational changes of channelrhodopsin-2. J Am Chem Soc 131(21): 7313–7319.
Ritter, E., K. Stehfest, et al. (2008). Monitoring light-induced structural changes of channelrhodopsin-2 by UV-visible and Fourier transform infrared spectroscopy. J Biol Chem 283(50): 35033–35041.
Sauer, B., & N. Henderson (1988). Site-specific DNA recombination in mammalian cells by the Cre recombinase of bacteriophage P1. Proc Natl Acad Sci U S A 85(14): 5166–5170.
Schobert, B., & J. K. Lanyi (1982). Halorhodopsin is a light-driven chloride pump. J Biol Chem 257(17): 10306–10313.
Sineshchekov, O. A., E. G. Govorunova, et al. (2009). Photosensory functions of channelrhodopsins in native algal cells. Photochem Photobiol 85(2): 556–563.
Spudich, J. L. (2006). The multitalented microbial sensory rhodopsins. Trends Microbiol 14(11): 480–487.
Spudich, J. L., C. S. Yang, et al. (2000). Retinylidene proteins: structures and functions from archaea to humans. Annu Rev Cell Dev Biol 16: 365–392.
Tsien, J. Z., D. F. Chen, et al. (1996). Subregion- and cell type-restricted gene knockout in mouse brain. Cell 87(7): 1317–1326.
Waehler, R., S. J. Russell, et al. (2007). Engineering targeted viral vectors for gene therapy. Nat Rev Genet 8(8): 573–587.
Wang, H., Y. Sugiyama, et al. (2009). Molecular determinants differentiating photocurrent properties of two channelrhodopsins from Chlamydomonas. J Biol Chem 284(9): 5685–5696.
Wen, L., H. Wang, et al. (2010). Opto-current-clamp actuation of cortical neurons using a strategically designed channelrhodopsin. PLoS One 5(9): e12893.

Witten, I. B., E. E. Steinberg, et al. (2011). Recombinase-driver rat lines: tools, techniques, and optogenetic application to dopamine-mediated reinforcement. Neuron 72(5): 721–733.
Yizhar, O., L. E. Fenno, et al. (2011a). Optogenetics in neural systems. Neuron 71(1): 9–34.
Yizhar, O., L. E. Fenno, et al. (2011b). Neocortical excitation/inhibition balance in information processing and social dysfunction. Nature 477(7363): 171–178.
Zhang, F., M. Prigge, et al. (2008). Red-shifted optogenetic excitation: a tool for fast neural control derived from Volvox carteri. Nat Neurosci 11(6): 631–633.
Zhang, F., J. Vierock, et al. (2011). The microbial opsin family of optogenetic tools. Cell 147(7): 1446–1457.
Zhang, F., L. P. Wang, et al. (2007). Multimodal fast optical interrogation of neural circuitry. Nature 446(7136): 633–639.
Zhao, S., C. Cunha, et al. (2008). Improved expression of halorhodopsin for light-induced silencing of neuronal activity. Brain Cell Biol 36(1-4): 141–154.

15 Characterizing the Gut Microbiome: Role in Brain–Gut Function

GERARD CLARKE, PAUL W. O’TOOLE, JOHN F. CRYAN, AND TIMOTHY G. DINAN

Th’ whole worl’s in a terrible state o’ chassis. SEAN O’CASEY, Juno and the Paycock

INTRODUCTION

In Sean O’Casey’s play [1], Captain Boyle laments the cruel chaos of the times he lives in, a theme that resonates with the difficulty of characterizing the gut microbiota in health and disease. Attempts to bring some order to this “chassis” form a rapidly expanding area of research that now occupies the minds of both microbiologists and neuroscientists, unusual bedfellows in historical scientific circles. Indeed, with the exception of cerebral infections and neuroAIDS, these medical disciplines have largely advanced along separate paths [2]. However, a new research narrative is emerging that posits a crucial role for the intestinal microbiota in communication along the gut-brain axis. Independent efforts from both partners have generated tremendous advances in our understanding of the dynamics and structure of this “virtual organ” [3–5], which, coupled with large-scale microbiome projects, are elucidating a role for the microbiota in health and disease. In particular, there is an increasing appreciation of the impact of the microbiota on the brain and on complex behaviors [2,6–8]. These converging trajectories have positioned the gut microbiota at the interface of microbiology and neuroscience.

The shift toward an increased understanding of the impact of the gut microbiota on CNS function has emerged in tandem with technological advances in our understanding of the composition of the microbiota itself [9]. In this chapter, we briefly chart the route that has seen the focus of microbiology switch from the individual to the collective and, in particular, the corresponding methodological advances leading to the metagenomic approaches that currently dominate this research arena. We outline how these high-throughput technologies have revolutionized our comprehension of the gut microbiome, both in health and disease, and discuss the experimental strategies that have been employed to investigate its impact on the central nervous system (CNS). Finally, we sketch the far-reaching implications of these findings and contemplate the new directions arising from this research.

CHARACTERIZING THE MICROBIOTA: A PARADIGM SHIFT

Microbiology as a discipline has traditionally relied on culture-based methods to study individual members of a given microbial community. From an academic, industrial, and medical standpoint, this strategy has undoubtedly been fruitful. Nonetheless, the fact that the vast majority of mammalian gut bacteria are not readily amenable to culture using widely applied media and isolation techniques is a serious limitation and has curtailed a broader appreciation of intestinal community composition and diversity. Indeed, it is estimated that approximately 70% of gut microbes exhibit poor laboratory culturability, which, coupled with the potential for selective growth bias toward specific organisms, results in a distorted picture of the true composition of the natural community [10,11]. This has also been referred to as the “great plate-count anomaly,” alluding to the fact that the majority of microbial cells from a complex environment such as soil or mammalian feces that are visible under the microscope cannot be readily cultured in the laboratory using the most routine techniques [12]. Aligned with this culture-based strategy and its inherent limitations, a bias emerged that favored the study of pathogens or genera with industrial applications over the troublesome scrutiny of the majority of the “benign” components of the commensal bacteria [13,14]. This bias persisted despite illustrious proponents of a microbial basis for health and disease throughout history, including Hippocrates, Antonie van Leeuwenhoek, and Elie Metchnikoff (for review and historical perspective, see references 4, 15, and 16).

A renaissance of this age-old concept was sparked by converging evidence (discussed further on) that supported a role for the community gut microbiota in critical aspects of both normal and pathophysiological states [17,18]. Once it became clear that a reliance on the culturing of bacteria could not answer the many questions arising from these studies and that a more comprehensive characterization of the entire gut microbiota was required, the armamentarium of microbiologists evolved to include increasingly sophisticated culture-independent techniques [9]. In many instances, the technological advances were facilitated by advances in distinct but parallel fields such as environmental microbiology [11] or the sequencing technology that became available in the wake of the human genome project [19].

CULTURE-INDEPENDENT TECHNIQUES FOR ASSESSING MICROBIOTA COMPOSITION

The 16S rRNA gene, which harbors highly conserved domains flanking species-specific hypervariable regions, is used as a phylogenetic marker for microbial communities [10,20]. This gene can be amplified from most bacteria, obviating the need to culture them [21]. The earliest community profiling methods, known collectively as fingerprinting techniques, neatly exploited this identity tag to assess the structure of complex communities. Examples include denaturing gradient gel electrophoresis (DGGE) and terminal restriction fragment length polymorphism (T-RFLP) (reviewed in reference 9). However, both techniques are best described as semiquantitative and suffer from limitations including the lack of phylogenetic identification, the inability to measure low-abundance organisms, and PCR bias [14]. PCR is an essential initial workflow component of both techniques, as the 16S rRNA genes present in the overall community sample must first be amplified. Moreover, quantitative or real-time PCR (qPCR) can be used in a complementary fashion to provide more detailed information on the diversity and abundance of the gut microbiota. It should be noted that this technique requires the design of primers that target either all bacterial phyla or a single phylum or species; thus the technique does not automatically facilitate the detection of unknown species [9].

PCR bias is itself an important limitation to be aware of in evaluating the literature. In particular, bacterial taxa with a high G + C content (e.g., Bifidobacterium) tend to be underestimated owing to low amplification efficiencies, leading to a preference for some sequences over others [22]. Furthermore, primer design is critical and can also influence the apparent representation of bacterial groups in an analysis [23]. PCR can also lead to base changes in clonally amplified templates, which are then erroneously detected as sequence variants [24].
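Because G + C content is the property behind the amplification bias just described, it can be useful to compute it directly from a sequence. The short sketch below does only that; the function name is hypothetical, the example sequence is made up, and ambiguous bases such as N are simply ignored.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return sum(base in "GC" for base in counted) / len(counted)


if __name__ == "__main__":
    # Made-up fragment, for illustration only.
    fragment = "ATGCGCGGCCTTAGGCCGCA"
    print(f"G + C content: {gc_content(fragment):.2f}")
```

Comparing this value across the expected amplicons of different taxa gives a first indication of which groups are likely to be underrepresented after PCR.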

SEQUENCING AND THE M E TA G E N O M I C R E V O L U T I O N Although probe hybridization techniques such as fluorescence in situ hybridization (FISH) and DNA microarrays allow for high-throughput phylogenetic identification and quantification of phylogenetic groups and species, sequencing is now the gold standard for reliable large-scale identification at this depth.9 It is in the sequencing arena that the options available to microbiologists have mushroomed, particularly in the last decade. Initial genomic approaches were based on targeted sequencing of the 16s rRNA identifier gene in community DNA extracted from feces.14 Based on Sanger sequencing, long base-pair reads were produced which, when coupled with robust bioinformatics tools, facilitated bacterial identification to genera and species resolution.10 This approach relied on the expression of cloned genes in foreign hosts25 and favored the detection of dominant organisms in a community over their less abundant colleagues.14 The cost, complexity, and protracted nature of Sanger sequencing methods was also a bottleneck in early diversity studies.26 Next-generation (high-throughput) sequencing technologies are now the most widely used techniques for assessing composition as they deliver sequencing information on a vastly superior time scale and at a relatively lower although not insignificant cost.27 This impressive suite of techniques typically bypasses the

Characterizing the Gut Microbiome need for cloned amplicons, the amplicon DNA or total community DNA being directly assessed instead.14 Targeted metagenomic approaches again use the 16S rRNA gene as a phylogenetic marker in a workflow that typically involves fragmentation of the 16S rRNA amplicon, amplification of the fragments, and subsequent sequencing, followed by organization of the sequence data (see figure 15.1). An alternative and increasingly favored strategy is microbiome shotgun sequencing, which assesses whole community DNA by massive parallel sequencing of the nonmanipulated total community microbial DNA. This provides information on both genetic diversity and potential functions of the gut microbiota.28 Metagenomics can mean different things to different people and is not easily defined. In this chapter we apply the description supplied by the National Research Council of the National Academies (www.nationalacademies.org/nrc/), which broadly defines both the field and associated set of research techniques as encompassing a “cultivation-independent genome-level characterization of communities or their members and high-throughput gene-level studies of communities.”12 This designation incorporates a metagenomic philosophy that can be either compositionally or functionally driven and with a targeted or untargeted strategy.29 We also acknowledge that stricter definitions have been applied in the literature, with, for example, the term functional metagenomics being applied only to those studies which identified genes “specifically by their function by cloning them directly from the environment and expressing genes in a surrogate host.”14 Others have more narrowly applied the metagenomic moniker only to the shotgun sequencing paradigm,21 which has also been described as “true metagenomics.”9 Our summary of the microbiome literature below benefits from the greater inclusivity of the earlier, less restrictive definition but emphasizes in particular the 
compositionally focused studies that dominate the current literature. These high-throughput approaches generate voluminous amounts of data, which require bioinformatic approaches to elucidate major functions and genes that encode clinically relevant enzymes.30 These approaches include clustering (identifying clusters of sequences that share an evolutionary origin), binning (assignment of overlapping target-organ-specific sequences or contiguous sequences into groups according to specific
genomes), gene annotation (classifying genes into different families of known predicted function), and gene prediction (identifying extent of coding sequences and assigning biological functions).12 16S rRNA gene sequences can also be extracted from shotgun reads for identification purposes, but this approach is less sensitive than direct targeting of the same gene, and an increasing number of studies pair the two strategies in a complementary fashion for maximum benefit.14,31 Sequence data is converted into gene catalogues that can be compared against databases such as the NCBI nonredundant Clusters of Orthologous Groups (COG) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) in order to shuttle gene products into pathways and processes.14,32 To overcome the limitation imposed by the fact that a substantial proportion of the metagenome does not generate hits when compared with reference genomes, an alternative approach is to identify functional clusters that might correspond to taxonomic classifications.32 From a conceptual point of view, targeting the 16S rRNA gene provides knowledge about bacterial diversity and composition but not directly about potential bacterial function, which is better catered for by shotgun methods (see figure 15.1). The consequence of this is that the former method is employed to give information about the compositional structure at the species level in health or disease (compositional dysbiosis) while the latter approach identifies specific microbial genes (functional dysbiosis).30 However, compositional information in itself is useful and metabolic networks and biochemical pathways can be indirectly inferred from the taxonomic composition.26
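
The step of shuttling gene products into pathways can be illustrated with a minimal Python sketch. It assumes that reads have already been counted per gene and that a gene-to-pathway lookup is available. The gene names, pathway names, and the function `pathway_abundance` are hypothetical placeholders, not identifiers from COG or KEGG, and splitting a gene's count evenly across its pathways is one simple convention among several.

```python
from collections import defaultdict

# Hypothetical gene-to-pathway lookup (illustrative names only). A real analysis
# would derive this mapping from a curated resource such as COG or KEGG.
GENE_TO_PATHWAYS = {
    "geneA": ["butyrate_synthesis"],
    "geneB": ["butyrate_synthesis", "folate_biosynthesis"],
    "geneC": ["folate_biosynthesis"],
}


def pathway_abundance(gene_counts, gene_to_pathways):
    """Sum per-gene read counts into per-pathway totals.

    Genes without a database hit are pooled under 'unassigned', mirroring the
    fraction of a metagenome that finds no match in reference databases.
    """
    totals = defaultdict(float)
    for gene, count in gene_counts.items():
        pathways = gene_to_pathways.get(gene)
        if not pathways:
            totals["unassigned"] += count
            continue
        share = count / len(pathways)  # split evenly across annotated pathways
        for pathway in pathways:
            totals[pathway] += share
    return dict(totals)


# Toy counts for a single sample; geneX has no entry in the lookup.
sample_counts = {"geneA": 10, "geneB": 4, "geneC": 6, "geneX": 5}
result = pathway_abundance(sample_counts, GENE_TO_PATHWAYS)
```

In this toy sample a quarter of the counts end up in the unassigned pool, which is the situation the functional-cluster approach mentioned above is meant to address.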

METAGENOMICS: METHODOLOGICAL CONSIDERATIONS

The metagenomic studies we describe below share both advantages and disadvantages, and there are a variety of considerations that need to be applied when interpreting the data. Clearly they have enriched our appreciation of the gut microbiome at both a compositional and functional level in a manner far beyond what can ever be achieved by culture-based methods.20 Moreover, the sampling depth achieved with next-generation sequencing technologies surpasses anything that could be achieved using fingerprinting techniques.33
The ability to produce vast tracts of invaluable data in such an efficient manner will soon be taken for granted, yet it was unimaginable just a decade ago.24 The strides made toward defining what constitutes a healthy microbiome and how structural changes contribute to disease lead to a vista with a myriad of potential healthcare benefits.15,34 A more general feature of this field is the abundance of instruments and platforms available and rapid technical advances, reflecting the intense competition between suppliers and service providers, something that is likely to continue for the foreseeable future. Indeed, commentators suggest that our sequencing capability is now increasing at such a rate that it has surpassed the rate of progress described by Moore’s law.27 An unintended consequence of this is that it seems likely that currently popular next-generation technologies, which include pyrosequencing, SOLiD (sequencing by oligonucleotide ligation and detection), and Illumina, will shortly be outdated.30,35 One should also be cognizant of the limitations of these technologies and concerns about the metagenomic profiling of whole-community DNA. As always, and although it may seem like a trivial point, the quality of the output data relies initially on the veracity of the sampling, storage of samples, and DNA extraction methods.23 The latter caveats are probably less of a concern than sampling strategies.
From a pragmatic point of view, immediate processing of samples is not essential; they can be frozen and processed later as long as the requisite diligence is applied during collection.22,36 Nevertheless, it is essential that samples reach the laboratory as quickly as possible to minimize any potential for environmental bacterial contamination9 and avoid prolonged storage at room temperature.10 Factors pertaining to downstream processing and DNA extraction that require careful consideration include yield, purity, fragment size, and representativeness.37 Concerns have been expressed about a potential overreliance on the analysis of fecal samples, which are essentially a proxy for more arduous (and more ethically challenging) sampling at multiple mucosal sites and of the luminal contents. This reliance does not account for the fact that the intestinal microbiota has a specific spatial organization with differences between the mucosa-associated and stool microbiota, which may also be important in the context of health and disease.4,9,10 Neither should the
proximal-to-distal gastrointestinal gradient be considered homogeneous.34 By the same token, we concede that the intestinal biopsy procedure itself could introduce some bias due to the requirement for fasting and cleansing of the colon prior to endoscopy collection procedures.10 Because the highest densities of bacteria are found in the colon, there is some justification for a fecal sampling policy even allowing for the difference in composition and abundance between mucosal and colonic microbial communities.4 Furthermore, there are practical, clinical and ethical considerations that favor fecal sampling.22 Most community genomic analyses do not distinguish between DNA emanating from cells that are active or viable, dormant, or dead.38,39 This is an important consideration in light of the fact that more than 50% of the cells in fecal samples are either nonviable or present in damaged states.40 This type of analysis also does not distinguish between community members that are autochthonous (colonizers) or allochthonous (transient).41 An additional caveat is that the information garnered is usually sufficient to identify only bacterial species but not specific strains,32 which is a limitation particularly applicable to shorter (less than 500 nt) 16S rRNA gene amplicon sequences. 
As we know from clinical and preclinical studies of beneficial bacteria (probiotics), for example, the properties and characteristics of bacteria are usually highly strain-specific.42 Targeted approaches in particular are subject to PCR biases (see the preceding discussion), which may result in over- and underestimation of community membership at the species level or above.29 One of the criticisms being leveled at the current panoply of metagenomic studies is that they illuminate the potential capability of the gut microbiota rather than what it is actively doing at particular time points.25,43 This is because all genes are detected regardless of whether they are expressed or not, and there is no guarantee that the most expressed genes equate to the most abundant ones.4,30 A move toward integration with other -omics technologies has been advocated to circumvent the limitations of in silico predictions, particularly through the use of metatranscriptomics (community mRNA), metaproteomics (collective proteins/peptides), and metametabolomics/metabonomics (collective array of metabolites) approaches.5,20 These areas are also experiencing
a development boom, and we will likely see this reflected in the literature in the not too distant future.35 Indeed, a recent study that applied the multi-omic approach to the study of a single patient undergoing antibiotic treatment for 14 days found major changes at the level of the gut microbiota metabolism and active microbial fractions in a temporal fashion.44 We should also bear in mind that the majority of reported studies are correlative in nature rather than establishing a causal role for the gut microbiota in particular disorders. Criteria that have previously been established for implicating singular microorganisms as causative agents of disease, such as Koch’s postulates (Table 15.1), are not so easily applied to compositional shifts in community structure20 or to their involvement in complex, heterogeneous mental disorders.45 In this regard, aspects of other criteria for disease causation, such as the Hill criteria46 (Table 15.1), may be more appealing in establishing causality.32,46 Of course one should not underestimate the formidable nature of this challenge, which will require prospective studies and a continuation of the efforts that are allowing us to grasp exactly what it is that constitutes a normal healthy microbiome. Metagenomic studies also pose huge data interpretation and handling challenges, which are associated with a variety of limitations.

TABLE 15.1. GUIDELINES THAT CAN BE USED TO IMPLICATE MICROORGANISMS IN DISEASE STATES*

| Koch’s Postulates20 | Hill Criteria46 |
|---|---|
| Present in disease | Strength of association |
| Absent in health | Consistency |
| Can be isolated from host | Specificity |
| Introduction to new host should cause disease | Temporal relationship |
| Can be reisolated from newly infected host | Biological gradient |
| | Biological plausibility |
| | Coherence |
| | Experimental reversibility |
| | Alternate explanations |

*A variety of criteria have been proposed to determine whether microorganisms can be causally linked to disease states. These include Koch’s postulates and the Hill criteria, with the latter being more applicable to the concept of community as pathogen.

The main issue to be aware of here is that the sequences obtained are assembled using fragments originating from the manifold species present in the community sample and that the raw data are basically provided in an incomplete format.26,35 Most of these species do not have full genome information available, meaning that the reference databases against which fragments are aligned lack matching homologies.15 In essence, the available databases are biased toward model organisms or cultivable microorganisms.33 In some cases this can lead to a situation where a sequence is associated with a particular disorder but cannot be assigned to a species, leading to classification instead as an operational taxonomic unit (OTU) that shares 97% or more sequence similarity with curated database sequences from bacteria that have not yet been cultured.4 The term OTU is analogous to phylotype, and both terms are essentially approximations of a bacterial species.30 The presence of numerous similar OTUs in metagenomic studies increases the risk of misclassification.35 High-throughput cultivation techniques, described as culturomics, may help to improve database coverage,47 as may the sequencing of genomes of individual organisms to produce a superior catalogue of reference genomes. Such efforts are frequently aligned to community genomic approaches and constitute an active focus of current research efforts.14 It is important to note that, for highly abundant microorganisms, it is possible to reconstruct complete genomes and gain valuable further insights into biological potential.
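
The 97% similarity convention for OTUs mentioned above can be made concrete with a small Python sketch of greedy centroid clustering. It is a deliberately simplified illustration: it assumes the sequences are already aligned to equal length, scores identity position by position, and uses toy sequences. The function names `identity` and `greedy_otus` are our own and do not correspond to any particular published tool.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose centroid it matches at or above
    the threshold; otherwise the sequence founds a new OTU and becomes its centroid."""
    otus = []  # list of (centroid, members) pairs
    for seq in seqs:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            otus.append((seq, [seq]))
    return otus


# Toy 100-nt sequences (assumed data): one reference, one variant with two
# mismatches (98% identity), and one variant with five mismatches (95% identity).
reference = "ACGT" * 25
close_variant = "NN" + reference[2:]
distant_variant = "NNNNN" + reference[5:]

otus = greedy_otus([reference, close_variant, distant_variant])
```

Here the 98% variant joins the reference OTU while the 95% variant founds its own. Because the outcome of a greedy scheme depends on input order, closely related OTUs can be split or merged differently between runs, which is one route to the misclassification noted above.
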
A stellar example of such an application can be seen in the isolation of a dominant bacterial species (denoted WG-1) from the wallaby microbiota following reconstruction of the bacterium’s metabolism from binned metagenomic data, which allowed successful cultivation strategies to be devised.48 The ability to reconstruct genome sequences from sequence data depends on characteristics such as genome depth, evenness of coverage, read quality, and read length.49 Initially, the reads are usually assembled into longer contiguous sequences (contigs) which, when analyzed, can generate information on open reading frames (ORF) and operons.35 Short-sequence reads, a characteristic of next-generation sequencing, usually mean the reconstruction of complete transcriptional units, not to mention the entire genome, is fraught with difficulty and prone to
error.25,35 This difficulty is further compounded by higher error rates as compared with Sanger sequencing methods.10,24 Gene prediction is hampered by these errors, which encourage frame shifts and are compounded by misassembly due to the high frequency of polymorphisms along with the presence of genetic information from viruses and lysogenic phages.26 Some genes contain multiple domains and some domains exist as multiple copies in the genome; these are assembled into repeat consensus contiguous sequences (overlapping sequence data) that cannot be unambiguously placed in the genome.27 Removal of chimeras (sequences originating from the combination of at least two fragments, possibly from different species in metagenomic studies) is also critical for accuracy of the reconstruction process18,26 and to avoid incorrect identification of new taxa.4 The shorter the read length, the greater the number of reads required for adequate coverage to ensure that all reads overlap and that each overlap is unique, making for unwieldy datasets.35 Contamination from human sequences and artifacts such as duplicate reads also need to be accounted for during data analysis.14 The fact that 86% of bacterial genomic DNA corresponds to coding ORFs should allow for an initial recognition of candidate protein coding regions in a DNA sequence.30 However, shorter contiguous sequences (a consequence of shorter initial read length) interfere with the ability to identify ORFs and operons,25 while distinguishing between true ORFs and false ones is a difficult task.26 Despite the impressive technological advances and reductions in cost driven by the race toward the $1,000 human genome,50 this area of research remains an expensive business.4 Investment in an available platform requires a clear idea of what one wants to achieve, as each approach has specific capabilities.49 For example, pyrosequencing using the Roche FLX genome sequencer is more suited to targeted 16S rRNA compositional analysis 
owing to its ability to generate longer reads,26 whereas Illumina and SOLiD are thought to be better suited to shotgun sequencing.24 However, the longer read capability of the pyrosequencing technology is accompanied by higher reagent costs. Illumina has a lower multiplexing capability, while SOLiD has longer run times.24 Each of these high-end instruments requires
a setup investment running into hundreds of thousands of dollars, although more modestly priced benchtop instruments with a reduced capability are also available. There is also a selection of ancillary equipment required relating to sample preparation, quality assessment, and high-throughput sample handling,27 while investment in staff expertise is ultimately critical to project success. Logistical concerns pertaining to the way in which the huge data output from metagenomic platforms is handled are valid, and appropriate infrastructure is required to meet this challenge.25 Commentators have argued that we are now at the stage where the costs of generating metagenomic data have been surpassed by those of handling and analyzing the datasets generated.27 In many cases the investment in capital (both human and infrastructural) is beyond the means of most laboratories; in such instances outsourcing of projects to larger sequencing centers may be the best option. This is particularly important considering the potentially short technological lifetimes of modern instruments. It is clear that there is a revolution in the technology available for assessing the composition of the microbiome and its functional consequences in health and disease. The challenges and limitations outlined above in the application of metagenomics to the study of the microbiome by no means compromise the importance of such scientific pursuits. Rather, they should serve as a series of pointers or a useful guide through sometimes challenging terrain and a primer for the discussion that follows below. While we are obliged to point to the deficiencies in correlative studies relating alterations in the microbiome to disease states, such studies pave the way for the causative studies that are likely to follow. The transformative nature of studies in this area should not be easily discounted, and the success of this venture can no longer be said to be constrained by technological limitations.
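
The relationship between read length, read number, and coverage discussed earlier in this section can be sketched with the standard idealized expectation that coverage equals the number of reads multiplied by read length and divided by genome size. The Python example below assumes a single genome sampled uniformly at random, which a mixed community with uneven abundances will not satisfy. The genome size and target coverage are illustrative values, and the function names are our own.

```python
import math


def reads_needed(genome_size, read_length, target_coverage):
    """Number of reads required for a target mean coverage C = N * L / G."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read under uniform random
    sampling (Lander-Waterman approximation): 1 - exp(-C)."""
    return 1.0 - math.exp(-coverage)


# Illustrative assumption: a 5 Mb bacterial genome sequenced to 10x mean coverage.
genome_size = 5_000_000
short_reads = reads_needed(genome_size, read_length=100, target_coverage=10)
long_reads = reads_needed(genome_size, read_length=400, target_coverage=10)
```

Under these assumptions the 100-nt reads require four times as many reads as the 400-nt reads for the same mean coverage. For a low-abundance community member the effective coverage scales with its share of the total DNA, so the required sequencing effort rises accordingly.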

METAGENOMICS AND THE MICROBIOME

The lack of a sound empirical grasp of what exactly we mean by a “healthy microbiota” is masked by a phalanx of numerical descriptions that underscore the challenge being embraced by the metagenomic revolution. For example, 60% of fecal mass is attributable to bacterial cells,3 and the gut microbiota harbors 10 times
more cells than the total number of human cells (for review, see reference 7). At the genomic level, the human gene complement is outnumbered by a factor of 150 by that of the gut microbiome.9 Moreover, it is likely that in excess of 1,000 species and 7,000 strains reside in the adult gut microbiome.51,52 Cumulatively these inhabitants exert a pivotal and essential influence on multiple aspects of normal gastrointestinal function and have been implicated in a range of protective, structural, and metabolic effects (Table 15.2).3 Initial culture-independent efforts at characterizing the gut microbiome revealed a high interindividual diversity and helped to establish associations between alterations in the microbiota and disease. For example, one of the most comprehensive initial assessments of microbial diversity was carried out on mucosal biopsy and fecal samples from just three individuals; the results of the sequencing analysis from the cloned 16S rRNA genes indicated Bacteroidetes and Firmicutes as the dominant phylotypes.53 However, it has taken large-scale metagenomic studies such as the NIH-funded Human Microbiome Project (HMP) (http://commonfund.nih.gov/hmp), the European-funded Metagenomics of the Human Intestinal Tract (MetaHIT) (http://www.metahit.eu) consortium, and the International Human Microbiome Consortium (IHMC) (http://www.human-microbiome.org) to help determine the breadth of microbial variation and function across large populations.20

TABLE 15.2. FUNCTIONS OF THE MICROBIOME*

Regulation of gastrointestinal motility
Development and maintenance of barrier function
Development of gut-associated lymphoid tissue
Development of the immune system
Support of digestion and host metabolism
Salvage of energy
Prevention of colonization by pathogens

Source: Modified from Grenham et al.7
* The gut microbiome is essential for numerous gastrointestinal functions, many of which have been divined using germ-free animals.

Using fecal samples from 124 Spanish and Danish subjects and a shotgun sequencing approach based on the Illumina technology, one such study established a core microbiota of only 75 organisms (out of a total species count of 1,150) and 294,000
genes (out of a total of 3.3  million) that were shared by more than half of the subjects.52 Interestingly, database comparisons allowed only 12% of the reference set of genes to be associated with bacterial species. Based on fecal samples from 242 American adults and a combination of targeted (16S amplicon sequencing) and shotgun sequencing technologies, the HMP confirmed this diversity and helped to establish that although there was considerable between-subject variation, an adult individual’s intestinal/fecal microbial community was relatively stable over time.54,55 Although the results verified the previously held conviction that the gut microbiota was dominated by Bacteroidetes and Firmicutes phyla, they also highlighted huge differences in abundance between individual subjects.20 The shotgun sequencing component of the study, which was carried out on a subset of the main cohort, demonstrated that there was a stable housekeeping core of functional capabilities with a high metagenomic abundance.54 The populations in these studies were of limited genetic and environmental variability and both projects were charged with addressing fundamental questions about the compositional and functional diversity of the healthy microbiome in adult populations. A  study that was more concerned with addressing the influence of geography and age on microbiome diversity characterized bacterial species in fecal samples from 510 individuals and gene content in a subset of 110 study subjects of varying ages across American, Malawian, and Amerindian locations using a combination of targeted and shotgun approaches. 
Microbial diversity, which is assumed to confer health benefits, increased with age in all populations, but the microbiota of American adults exhibited a lower diversity than those of Malawian and Amerindian adults, with the latter two groups exhibiting more compositional similarity to each other than to the American cohort.56 Of note is that this distinction was evident both in early infancy and in adulthood. Analysis of inferred gene families and enzyme classes suggested a functional capability that was also dependent on age and geography, with, for example, the infant microbiome better equipped with folate-producing enzymes than the adult microbiome.20 Microbiota stability, response to disturbance, and resilience are important for human health.17 The interest in defining a core microbiota stems
from a belief that this approach will identify the microbes that are most relevant for health over the transient gut inhabitants that vary according to environmental factors such as diet.57 Taken together, the studies discussed above are largely supportive of the belief that a core microbiota exists. However, a recent study that sampled the fecal microbiota of two individuals repeatedly, covering 396 time points over 15  months and using a targeted approach with the Illumina sequencer, calls into question whether this core microbiota exists at the abundance suggested by previous studies and highlighted a much higher adult intraindividual temporal variability.58 The earlier studies benefited from much larger study populations while the latter investigation included such an impressive number of sampling points that it may be difficult to replicate with an increased number of subjects. Reconciling the different viewpoints is not an easy task, but the likelihood that even a lower abundance core microbiome could influence health should not be discounted. 
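
One operational definition of a core, used in the study of 124 subjects described above, is the set of features shared by more than half of the subjects. A minimal Python sketch of that prevalence rule follows. The subject identifiers and detection data are invented for illustration, and the function `core_taxa` is our own naming rather than part of any published pipeline.

```python
def core_taxa(samples, min_prevalence=0.5):
    """Return taxa detected in more than `min_prevalence` of subjects.

    `samples` maps a subject identifier to the set of taxa detected in that subject.
    """
    n_subjects = len(samples)
    counts = {}
    for taxa in samples.values():
        for taxon in set(taxa):
            counts[taxon] = counts.get(taxon, 0) + 1
    return {t for t, c in counts.items() if c / n_subjects > min_prevalence}


# Toy presence data for four hypothetical subjects.
samples = {
    "subject_1": {"Bacteroides", "Faecalibacterium", "Prevotella"},
    "subject_2": {"Bacteroides", "Faecalibacterium"},
    "subject_3": {"Bacteroides", "Prevotella"},
    "subject_4": {"Bacteroides", "Faecalibacterium"},
}

core = core_taxa(samples)
```

In this toy dataset Prevotella is detected in exactly half of the subjects and so falls outside the core. Small shifts in the prevalence threshold, the detection limit, or the sampling depth can therefore change which taxa are counted as core members, consistent with the difficulty of reconciling the studies above.
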
Low-abundance community members have also been shown to contribute a disproportionately large share of the functional gene repertoire and metabolic capacity.59 Additionally, rare community members or genes with low expression are likely to perform some functions.19 A further complication is that a common core may be dependent on factors other than the number of subjects in the study, such as the depth of analysis.4,57 A superior concept may be a core microbiome with a functional repertoire such that regardless of the bacterial composition, functional capability is maintained.34 In support of this concept, a recent metagenomic analysis of samples from 43 healthy subjects taken at two or more time points during a year found a stable microbiome despite variation in species composition.60 Whether this genetic stability persists for periods longer than a year remains to be determined.61 However, as outlined above, the functional capability of the microbiome is transcriptionally regulated, and increased application of metatranscriptomic, metaproteomic, and metametabolomic technologies will also be required to substantiate this important concept. The concept of bacterial enterotypes or alternative stable states within the gut ecosystem has recently attracted much attention and is an intriguing proposition.62 These enterotypes were derived from computational evaluation of
metagenomic datasets, including the HMP and MetaHIT projects, and suggested that despite the apparent diversity in our microbiome, microbial communities can be grouped into three different assemblies robust enough to circumvent the geographical variety outlined above. The enterotypes are driven by species composition; defined by a relatively high abundance of either Bacteroides, Prevotella, or Ruminococcus species; and also independent of body mass index, age, or gender.59 This concept was further expanded on in a later study that applied targeted and shotgun metagenomics to fecal samples from 98 individuals.63 Although this study did not find a well-defined Ruminococcus enterotype, both the Bacteroides and Prevotella enterotypes were identified and found to be associated with diets enriched for protein/animal fat and carbohydrate respectively. Interestingly, although a short-term (10-day) dietary intervention was sufficient to rearrange microbial composition, the enterotypes remained stable.63 As in the infant, the adult-type microbiome has a functionality that is specialized for energy harvest from particular diets.30 An extreme example of this can be seen in the transfer of enzymes from marine bacteria to the gut microbiota of Japanese individuals whose diets contain seaweed-based products.64 However, caution is advisable, and there is much debate in the literature as to the clarity of these enterotypes and whether gradients should instead be considered more relevant.65 In support of this contention is a more recent study, based on HMP data, in which a proportion of intermediate subjects were identified, which blurred enterotype distinctions.66 Nevertheless, the enterotype concept is one worthy of further investigation, especially in relation to potentially corresponding phenotypes and their use as biological markers to be used in personalized treatment schedules.4 Developmentally speaking and although bacteria have been isolated from the meconium of healthy 
neonates,67 microbial penetration of the amniotic cavity is considered extremely rare, meaning that prior to birth the developing fetus is essentially sterile.68,69 Much of our current understanding at this extreme of life is shaped by studies predating the metagenomics revolution that has taken hold in adult populations. Collectively, these studies indicate that delivery mode is a key determinant of our initial microbial community composition,
an effect that can last for months and influence health outcomes.17 Of further interest is a recent report highlighting the fact that the microbiota of the preterm infant differs from that of healthy term infants.70 Importantly, activation of the HPA axis coincides with the early colonization period.71 The temporal variation that is characteristic prior to the attainment of the stable adult microbiome after one to three years of life is punctuated by shifts in composition that likely reflect key early life events, such as weaning or infections.17,68 Moreover, the functional capability of the microbiome is intrinsically linked to the infant host requirements. For example, infants who are exclusively breastfed exhibit an increased abundance of Bifidobacterium species equipped to utilize human milk oligosaccharides.68 Each dietary stage as the infant progresses toward adulthood is accompanied by changes in the microbiota and the enrichment of genes specialized toward microbial digestion of that diet.18 Whether an infant’s unique microbial experience during this early unstable phase confers health benefits or constitutes a risk for future disease is currently a matter of intense speculation, but such an influence seems likely. Importantly, although infants delivered by cesarean section exhibit delayed acquisition of the Firmicutes and Bacteroidetes, which dominate the adult microbiome, they do eventually catch up to their vaginally delivered counterparts.72 Given that the former infants are more likely to suffer from allergies, asthma, and diabetes later in life,73,74 early-life colonization patterns as a risk factor for disease require as much study as perturbations of the adult microbiome. This is a challenge to which metagenomics approaches are ideally suited and one that will likely be taken on in the future.
Maternal influences such as vaginal birth and breastfeeding favor a similarity between the maternal and developing infant microbiome.75 It is likely that family members and other environmental sources also play a role in defining the adult core microbiome. A study that used a Sanger sequencing approach to compare the infant’s fecal microbiota with that of the mother at 1 and 11 months after delivery suggests that early colonizers are easily replaced by externally acquired species.76,77 A pyrosequencing study that compared the fecal microbiomes of monozygotic twins, dizygotic twins, and their mothers revealed a microbiota composition that was
more similar between related than unrelated individuals. However, compositional profiles indicated both monozygotic and dizygotic twins to have broadly similar core microbiomes.78 This might suggest that familial similarities have an environmental rather than a genetic basis5 and that microbial ecologies tend to cluster in family members.72 Clearly the contribution of host genotype and environmental factors to the core microbiome remains to be fully elucidated. At the other extreme of life, metagenomic approaches have also provided key insights. A targeted pyrosequencing study that compared the fecal microbiota of an aged population to that of younger adults found that the core microbiota of elderly subjects was distinct from that previously established for younger adults, particularly in relation to an enhanced abundance of Bacteroides species and an altered pattern of Clostridium clusters.79 In addition, the between-individual variability was much larger than previously reported for younger adults. A subsequent study from the same investigators also used a targeted pyrosequencing approach in evaluating the fecal microbiota of 178 elderly individuals; it found compositional groups that correlated with residence location in the community, such as day hospital, rehabilitation, or long-term residential care. The results were also influenced by diet.80 Crucially, the loss of community-associated microbiota correlated with increased frailty. The population was geographically and ethnically homogeneous but, if confirmed in more diverse cohorts, these results could have important implications for our understanding of the interactions between diet, the microbiota, health, and aging.81 Interestingly, a recent study in aged mice has demonstrated that a diet rich in omega-6 polyunsaturated fatty acids (n-6 PUFAs), a feature of “western” diets, can influence the composition of the microbiota and intestinal inflammation.82

COMPOSITION OF THE MICROBIOTA AND THE GUT-BRAIN AXIS

Earlier we discussed the promises and pitfalls in metagenomic approaches to the study of the gut microbiome. The efforts directed toward this area of study, reflected in large-scale projects such as the NIH-funded Human Microbiome Project (http://commonfund.nih.gov/hmp),55 underline a growing appreciation that our gut microbiome has a crucial bearing on virtually
all aspects of normal physiological processes.2 That the microorganisms resident in our gastrointestinal tracts might impact brain function is a relatively new concept, which has emerged in part owing to the emphasis placed on a construct known as the gut-brain axis. This construct, which describes the complex bidirectional communication system that links the CNS and the gastrointestinal tract, is vital for maintaining homeostasis, and its dysregulation has been implicated in disease states.7,83 In addition to facilitating the central regulation of digestive function and satiety, impairments of signaling along the gut-brain axis are associated with gut inflammation, chronic abdominal pain syndromes, and eating disorders.2 Moreover, the function of the gut-brain axis and its modulation are linked with the stress response and behavior.84 More and more researchers in this area now acknowledge that the microbiome itself is a critical node in this bidirectional communication network.6‒8,85 The gut-brain axis can be considered the scaffolding through which the gut microbiota can exert a marked influence at the level of the CNS.
The fundamental elements of the axis include the CNS, the neuroendocrine and neuroimmune systems, both the sympathetic and parasympathetic limbs of the autonomic nervous system (ANS), and the enteric nervous system (ENS).7 Signaling along the axis is facilitated by afferent fibers projecting to integrative cortical CNS structures and efferent projections to the smooth muscle in the intestinal wall, which form a complex reflex network.86 Together, these components provide neural, hormonal, and immunological lines of communication through which the brain can influence the motor, sensory, and secretory functions of the gastrointestinal tract and the connections through which the gastrointestinal tract can exert its influence on brain function.8,85 Although the field of neurogastroenterology has prospered thanks to a solid understanding of the reciprocal communication between the ENS and the CNS, the role of the gut microbiota is less well delineated. This deficit can largely be attributed to a poor understanding of the gastrointestinal inhabitants and their functional repertoire and is one that the metagenomic revolution is destined to address. In tandem, efforts to demarcate the implications for the CNS are continuing apace. The discussion below outlines the overarching ideas that are uniting the fields of microbiology and neuroscience.

THE GUT MICROBIOME IN CNS-RELATED CONDITIONS
Although the microbiome is generally accepted as essential for health, it has also been implicated in a variety of CNS-related conditions. Obesity has both central and peripheral mechanisms underlying its pathogenesis. Increasing evidence suggests that obesity might have a microbial basis, given the key role played by the microbiota in supporting host digestion and metabolism.7 However, clinical studies to date (Table 15.3) have not established whether the altered microbiota precedes the development of obesity or is a consequence of dietary factors and host physiological alterations.87 Commentators have also cautioned that clinical studies using metagenomic approaches to look for correlations between indices of obesity and microbiome alterations have not always found them.39,88 All the same, data from the preclinical literature argue in favor of such a linkage (Table 15.3).
Interestingly, olanzapine, an atypical antipsychotic that causes weight gain, has been shown to alter the composition of the gut microbiota in rats assessed by pyrosequencing technology.89 On the other hand, recent studies have confirmed that high-fat diets influence the composition of the microbiome.90,91 Besides this limitation, obesity is likely multifactorial, with hypothalamic dysfunction also being implicated,92 and it is currently unclear whether CNS control of food intake falls under the remit of microbial influences at relevant brain centers.2 From a therapeutic perspective, it is encouraging to note that a recent study in which vancomycin (a glycopeptide antibiotic) was administered to mice with diet-induced obesity associated the resultant changes in Firmicutes, Bacteroidetes, and Proteobacteria with an improvement in metabolic abnormalities.93 Furthermore, a fatty acid‒based dietary intervention (using an oleic acid-derived compound) has been shown to offset dysbiosis of the gut induced by a high-fat diet and to improve weight gain measures in obese mice.94

Autism spectrum disorders (ASD) are neurodevelopmental disorders defined by deficits in social interaction and communication and the presence of limited, repetitive stereotyped interests and behaviors.2 On the basis that gastrointestinal symptoms and disturbances are commonly reported in children with ASD, a role for an abnormal composition of the gut microbiome has been suggested in this disorder.95 Application of pyrosequencing technology to probe for compositional perturbations of the microbiome in ASD has yielded conflicting results (Table 15.3). Interpretation of results from these studies is complicated by the knowledge that individuals suffering from ASD have high rates of antibiotic usage and consume diets that often differ from those of healthy populations.2 On the other hand, there are lines of evidence that lend credence to the concept. Vancomycin, a minimally absorbed antibiotic that, following oral administration, targets gram-positive anaerobes in the gut, can transiently improve symptoms in regressive-onset autism.96 Altered fecal concentrations of short-chain fatty acids, potentially neuroactive microbial metabolites, have also been reported in ASD.97 Of note is that the administration of propionic acid, a short-chain fatty acid, to animals via the intracerebroventricular route results in some autistic-like behaviors, albeit at high doses that might not reflect the clinically observed alterations.98,99 Despite these dosing concerns, these studies do suggest a potential mechanism through which an altered microbiome might induce symptoms of ASD. Clearly this is an area that warrants further investigation100 and which would benefit from a more detailed characterization at the metagenomic level.

TABLE 15.3. STUDIES THAT HAVE INVESTIGATED ALTERATIONS IN THE GUT MICROBIOME IN CNS-RELATED DISORDERS*
Disorder | Method | Study Type | Key Finding | References
Obesity | Fecal pyrosequencing (targeted and shotgun) | Clinical (lean and obese twins) | Changes in the microbiota at the phylum level; reduced bacterial diversity; skewed distribution of bacterial genes and metabolic pathways; obesity associated with variation from a core microbiome. | 78, 111, 125
Obesity | 16S rRNA sequencing (cloned amplicons) | Clinical (obese and healthy subjects) | Decreased Bacteroides in obesity; increased Bacteroides during calorie restriction. | 126
Obesity | Analysis of body fat content and insulin resistance | Preclinical (germ-free, conventional and conventionalised germ-free mice) | Microbiome important for control of body fat distribution. | 127
Obesity | Diet-induced obesity | Preclinical (germ-free and conventional mice) | Microbiome is a factor in diet-induced obesity. | 128
Obesity | PCR amplification of 16S rRNA genes from cecal contents | Preclinical (ob/ob mice) | Mice genetically predisposed to obesity have an altered Bacteroidetes:Firmicutes ratio. | 129
Obesity | Pyrosequencing | Preclinical (humanization of germ-free mice) | Adiposity transmissible via microbiome transplantation. | 111
Autism spectrum disorders | Fecal pyrosequencing | Clinical (autistic and control children) | Bacteroidetes increased in severely autistic group; Firmicutes more predominant in the control group. | 130
Autism spectrum disorders | Fecal pyrosequencing | Clinical (autistic and neurotypical controls) | No clinically meaningful differences. | 131
Irritable bowel syndrome (IBS) | Fecal pyrosequencing | Clinical (D-IBS vs healthy controls) | D-IBS associated with reduced microbial richness, a decrease in beneficial and an increase in detrimental bacterial species. | 132
Irritable bowel syndrome (IBS) | Fecal pyrosequencing | Clinical (mixed IBS vs healthy controls) | No uniform change between IBS and control; associations between microbiota and metadata (colonic transit indices; comorbid depression). | 133
Stress | Caecal content pyrosequencing | Preclinical (mice) | Social stressor reduced Bacteroides spp and increased Clostridium spp; stressor-induced elevations in circulating levels of IL-6 and MCP-1 significantly correlated with changes to three bacterial genera (Coprococcus, Pseudobutyrivibrio, and Dorea); antibiotic cocktail abolished the stressor-induced increases in circulating cytokines. | 134
Stress | Coprocultures | Preclinical (rhesus monkeys) | Maternal separation transiently altered fecal lactobacilli content. | 135
Stress | DGGE fingerprinting | Preclinical (rats) | Maternal separation induced brain-gut axis dysfunction and reduced microbiota diversity. | 103
*Clinical and preclinical studies have been used to investigate the potential contribution of the gut microbiome to a variety of CNS-related disorders including obesity, IBS, autism spectrum disorders, and multiple sclerosis.

FIGURE 15.1: Overview of metagenomic strategies to elucidate community structure and function. Investigators can take either a targeted or shotgun approach. Targeted approaches are based on sequencing of the 16S rRNA amplicon, with subsequent bioinformatics analysis leading to community composition and diversity. Alternatively, total DNA can be evaluated for genes, pathways and functional capability using a shotgun approach. These approaches are frequently combined to define what constitutes a healthy microbiome and to identify dysbiosis in disease.

Irritable bowel syndrome (IBS) is regarded by many as the prototypical stress-related disorder of the brain-gut axis.7 It is a debilitating functional gastrointestinal disorder characterized by abdominal pain, altered defecatory patterns, bloating, and the absence of reliable biological markers.101 Conceptually, the involvement of the microbiome in the generation of IBS symptoms is suggested both by the development of the condition in a significant proportion of individuals following an episode of bacteriologically confirmed gastroenteritis102 and by the presence of a chronic low-grade inflammation in IBS subjects.101 Indirectly, the finding that specific probiotic strains can be beneficial in relieving particular symptoms of IBS also supports this concept.42 This seems particularly relevant for the pain component of the disorder, which is likely due to visceral hypersensitivity. Visceral pain perception is complex and involves both central and peripheral mechanisms, which can potentially be affected by alterations in the intestinal microbiota.2 Both qualitative alterations (Table 15.3) and a temporal instability in the composition of the microbiota have been reported in IBS.101 Microbiota profiling also identified two IBS subtypes: patients with an altered microbiota showing a high Firmicutes/Bacteroidetes ratio and those with a normal microbiota but higher anxiety and depression scores.133 Alterations in microbial diversity have also been replicated in preclinical models of the disorder.103 Quantitative alterations, such as small intestinal bacterial overgrowth (SIBO), have been reported but are more controversial.104 Not all studies have found disturbances in the microbiota composition, and it is currently unclear whether the alterations that have been reported are primary or secondary in nature.105 It is anticipated that increased uptake of metagenomic approaches will help to resolve these reservations.106

The challenge associated with defining the human gut microbiome in health and disease should not be underestimated and should be clear from our previous discussion. Advances enabled by the metagenomic revolution have seen huge strides being made toward this objective. Although this was unimaginable just a few years ago, we can now foresee the monitoring of the microbiome during diagnostic and therapeutic pipelines.68 A recent study suggests the presence of metagenomic genotype signatures with relevance for diet and drug intake.60 Of course many factors remain to be unraveled before the microbiome can be confidently targeted in this manner. Furthermore, characterization of the microbiome using metagenomics cannot by itself address important questions about complex behaviors that are under its influence at the level of the CNS or the mechanism through which that influence is exerted.107 Resolution of these queries is currently being attempted in the neuroscience domain.
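Two of the summary measures that recur in this section, community diversity and the Firmicutes/Bacteroidetes ratio, can be computed directly from a table of taxon counts. The following is a minimal sketch, assuming counts have already been assigned to taxa by an upstream bioinformatics pipeline; the function names and the example counts in the usage note are hypothetical and are not taken from any of the cited studies. The Shannon index is used here as one common diversity measure among several.

```python
import math


def relative_abundance(counts):
    """Convert raw taxon counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return {taxon: n / total for taxon, n in counts.items()}


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon diversity (natural log); higher values mean a more even, richer community."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


def firmicutes_bacteroidetes_ratio(phylum_counts):
    """Ratio of Firmicutes to Bacteroidetes reads in a phylum-level count table."""
    bacteroidetes = phylum_counts.get("Bacteroidetes", 0)
    if bacteroidetes == 0:
        raise ValueError("ratio is undefined when no Bacteroidetes reads are present")
    return phylum_counts.get("Firmicutes", 0) / bacteroidetes
```

For a hypothetical sample with 60 Firmicutes reads and 30 Bacteroidetes reads, `firmicutes_bacteroidetes_ratio` returns 2.0. In practice such values depend on sequencing depth, primer choice, and the reference database, so comparisons between studies should be made with caution.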

STRATEGIES TO INVESTIGATE PERTURBATION OF THE MICROBIOME-BRAIN-GUT AXIS
Information about communication along the microbiome-gut-brain axis can be mined using a number of different strategies (Figure 15.2). The use of germ-free animals in particular has contributed greatly to our appreciation of the role of the microbiota in general health and well-being (Table 15.2)3 and is an approach that is now frequently applied to CNS-related studies.2 Taking advantage of the principle of a sterile uterine environment (see the preceding discussion), it is possible to prevent the normal postnatal colonization of the gastrointestinal tract by combining a surgical delivery mode with a germ-free rearing environment in gnotobiotic units.7 This allows for direct comparison with conventionally colonized counterparts as well as manipulations aimed at reintroducing a partial or complete gut microbiome later in life. Such studies can essentially be regarded as invaluable proof-of-principle studies, although alternative strategies that better reflect real-life scenarios are also employed and include the use of antibiotics, deliberate infections, fecal transplantation, and probiotic studies (Figure 15.2).2

IMPACT OF THE MICROBIOME ON BRAIN AND BEHAVIOR
Table 15.4 summarizes the recent advances arising from these strategies and highlights the impact of the microbiome on brain and behavior. Translating these encouraging findings to the clinic requires much clinical validation and remains a challenge. Although some mechanistic insights are available and suggest neural, immune, and endocrine effects, there is an urgent need for more detailed assessments of the mode of action involved. It is encouraging to note that a recent preliminary report on an emotional reactivity test in a healthy cohort suggested that a probiotic mixture could alter activity in brain regions (middle and posterior insula) relevant to anxiety.108

Taken together, these experimental strategies (Figure 15.2 and Table 15.4) have shed light on the range of complex behaviors that fall under the remit of the gastrointestinal microbiome. These include pain, anxiety, depression, and aspects of cognitive function, with neural, humoral, and as yet unknown mechanisms being implicated (Table 15.5).2 Combined with the metagenomic approaches that are delineating a healthy microbiome from that present in disease, our appreciation of the impact on behavior, neurophysiology, and neurochemistry is rapidly expanding. Third-generation sequencing may generate information at the strain level, assuming the bioinformatic tools keep pace with such rapid technological advances. Crucial deficits in knowledge remain, particularly in relation to mechanistic insights, which are urgently required. Efforts aimed at bridging these gaps continue apace, further appraising the full extent to which the microbiome influences the CNS and attempting to translate these efforts to the clinic. Success in this venture, which demands the continued application of both metagenomic technologies and experimental approaches, has many important implications.

FIGURE 15.2: Investigating the role of the microbiome in health and disease. The impact of the microbiome at the level of the CNS can be evaluated by assessing behavior following deliberate alterations such as germ-free studies, infection studies, probiotic studies, antibiotic studies, and fecal transplantation studies. The combined information gleaned from these studies to date implicates the microbiome in a variety of complex behaviors and functions including anxiety (using the light-dark box, the open field, and the elevated plus maze), depression (using the forced swim test), pain (using colorectal distention) and cognition (using the T-maze and novel object recognition).

TABLE 15.4. IMPACT OF THE MICROBIOME ON BRAIN AND BEHAVIOR*
Strategy | Key Findings | References
Germ-free | Exaggerated stress response that was partially normalized by colonization and fully normalized by monoassociation with Bifidobacterium infantis; decreased hippocampal BDNF; altered NMDA receptor subunit NR2A expression in cortex and hippocampus. | 136
Germ-free | Reduced anxiety-like behaviors, which were normalized by colonization; elevated hippocampal 5-HT and 5-HIAA (metabolite of 5-HT), which were resistant to colonization of germ-free animals; increased plasma concentrations of tryptophan (precursor of 5-HT), which were normalized following colonization; reduced hippocampal BDNF expression; exaggerated stress response. | 117
Germ-free | Reduced anxiety-like behaviors; increased motor activity; altered expression of PSD-95 and synaptophysin in the striatum, which was normalized following colonization. | 137
Germ-free | Reduced anxiety-like behaviors. | 138
Germ-free | Nonspatial and working memory deficits. | 139
Fecal transplantation | Colonization of germ-free BALB/c (anxiety-prone) and NIH Swiss Webster (normoanxious) mice with microbiota from conventional mice of their own strain resulted in exploratory behavior similar to their respective controls; a behavioral profile similar to the donor strain was evident following colonization of germ-free mice with microbiota from the other strain. | 140
Antibiotics | Reduction in anxiety-like behaviors and region-specific BDNF alterations in amygdala and hippocampus in mice treated with a cocktail of neomycin, bacitracin and pimaricin. | 140
Antibiotics | Increased visceral hypersensitivity; reversed by administration of L. paracasei. | 141
Infection | Trichuris muris increased anxiety-like behaviors, decreased hippocampal Bdnf mRNA, increased degradation of tryptophan along the kynurenine pathway and increased plasma proinflammatory cytokines. | 142
Infection | Citrobacter rodentium increased anxiety-like behaviors; mediated by vagus nerve. | 143
Infection | C. rodentium infection caused memory dysfunction, which was evident only following an acute stressor; increased serum corticosterone, reduced hippocampal BDNF and increased cFOS expression; daily administration of a probiotic prior to infection prevented stress-induced memory dysfunction and other physiological consequences; memory dysfunction evident without acute stressor following infection of germ-free mice. | 139
Probiotics | Alleviation of visceral pain by strains of Lactobacillus and Bifidobacterium. | 141, 144‒149
Probiotics | L. rhamnosus (JB-1) decreased anxiety- and depressive-like behaviors, reduced plasma corticosterone, altered mRNA expression of GABAa and GABAb receptors in a complex region-specific manner; vagotomy prevented both the behavioral and molecular alterations. | 150
Probiotics | L. helveticus and B. longum had anxiolytic activity in rats and reduced depressive-like behaviors evident after myocardial infarction in rats. | 151, 152
Probiotics | B. infantis reduced depressive-like behaviors, normalized peripheral proinflammatory cytokines in maternal separation model; increased tryptophan availability. | 153, 154
Probiotics | B. breve strain NCIMB 702258 but not B. breve strain DPC 6330 elevated arachidonic acid and docosahexaenoic acid, fatty acids essential for neurodevelopmental processes. | 155
Probiotics | B. longum NCC3001 but not L. rhamnosus NCC4007 reversed inflammation/colitis-induced anxiety-like behaviors and alteration in hippocampal Bdnf mRNA levels; most likely mediated via the vagus nerve. | 140, 142
*A variety of preclinical strategies have been used to investigate the impact of the gut microbiota on brain and behavior.

TABLE 15.5. PROPOSED MECHANISMS WHEREBY THE MICROBIOTA INFLUENCES THE CNS*
Alters microbiome composition
Modulates immune parameters (e.g., cytokines)
Signals via the vagus nerve
Alters tryptophan metabolism
Produces microbiome metabolites
Produces neuroactive microbiome metabolites
Effects of bacterial cell wall sugars
*CNS function can be modulated by the gut microbiota and a number of potential mechanisms have been suggested through which this can happen. For review, see Cryan and Dinan.2

IMPLICATIONS AND FUTURE PERSPECTIVES
The metagenomic revolution has provided some key insights into the composition and diversity of the gut microbiome and opened up a vista with preventative, therapeutic, and diagnostic opportunities.14 Although hinted at by earlier studies, it now seems clear that the human gut harbors a core microbiome providing a functional capability that is essential for health and which varies considerably at the extremes of life. Lagging behind this revelation is a broader appreciation of what this genomic capacity means for actual metabolic activity,39 whether it can be beneficially shifted through intentional compositional alterations, and how this might best be achieved. This is a conundrum that can be addressed only by integrating this technology with the increasing availability of reliable metatranscriptomic, metaproteomic, and metametabolomic25 tools as well as the experimental approaches outlined above.6,109 Further large-scale initiatives such as the American Gut Project (http://humanfoodproject.com/american-gut/) are also likely to be of benefit. For example, the concept of enterotypes suggests multiple stable states for the human microbiome, but the functional consequences of these enterotypes (either in terms of general health and well-being or at the level of the CNS) remain to be fully defined, especially in relation to whether each enterotype confers a unique metabolic phenotype.15 Omics approaches cannot on their own answer all these questions, and it is likely that gnotobiotic methodology, whereby germ-free animals can be colonized with either single strains or representative microbial compositions, will provide some invaluable clues.
It is important to note that microbial species are likely to be adapted to specific hosts, which may complicate the interpretation of results from human-mouse transplantation studies.110 A further caveat is that differences in intestinal scaffolding and environments between humans and mice likely contribute to a very different gut microbiota/microbiome in both species.4 In this regard, pigs may offer a more comparable model for manipulation of the gut microbiome.29 Such concerns notwithstanding, humanizing the microbiota of germ-free mice has been shown to induce phenotypes reminiscent of the donor in the recipients,111 as have transplants between different strains of the same species.6 The fate of these enterotypes in disease states also remains to be fully elucidated. The cross-sectional studies performed to date suggest alternative stable states with disease implications or that the enterotypes which have been defined can constitute a disease risk when combined with other genetic or environmental factors.30 However, it is important to reiterate that fecal samples do not conserve the spatial organization of bacterial communities along the gastrointestinal tract.112 Prospective studies aimed at elucidating a causal role for the gut microbiome in underpinning disease states are urgently required and will be a prerequisite to harnessing the diagnostic potential of an altered gut microbiome.88 Preclinical data generated to date implicate the gut microbiome as a key regulator of CNS function and strongly suggest that these studies should now include psychiatric cohorts.6 A surmountable but challenging obstacle will be the need to adequately power such studies in light of the inherent variation of the gut microbiome.10,31 The limitations on data quality imposed by the short reads (which are characteristic of the second-generation technologies) will need to be overcome before diagnostic potential can be fully harnessed.18,50 The long-term implications of gut microbiome dysbiosis (e.g., following antibiotic use) or altered development patterns (e.g., cesarean section vs. vaginal delivery, breastfeeding vs. formula feeding) early in life, particularly at key brain centers, require close examination.
Equally, the factors which contribute to detrimental health outcomes at the other extreme of life are also now very much on the agenda, especially given the worldwide trend toward an aging population.113 Chief among these, it seems, is the contribution made by diet to the reshaping of the gut microbiome and the deleterious consequences of reduced microbial diversity.11,80,88,114 From a therapeutic perspective, the implications are no less profound and the microbiome can be regarded as a druggable target, although the conditions that promote the growth of desirable species at the expense of less-sought-after members remain elusive at present.5 The ability to restore the “normal” gut microbiome after deleterious shifts does appear to offer clinical benefits.17 In particular, there may be therapeutic windows in early life during which intervention might confer health benefits,15 while narrow rather than broad-spectrum antibiotics might offer a more directed approach to the removal of harmful community members.75 The advances described have put on the agenda the possibility of microbial-based therapeutics to beneficially influence the CNS.2 This is likely still some way off, although some interesting candidates have emerged. These options range from those in more advanced stages, such as the probiotic strains that have already been assessed in some brain-gut axis disorders,42 to the more controversial fecal transplantation strategies, which have proven effective in treatment-resistant cases of infection with Clostridium difficile.115 Clearly, the more palatable option at present is the former, initially as an adjunct therapy rather than a distinct therapeutic regimen.2,116,117 Interestingly, probiotics produce a complement of neurotransmitters and have been proposed as delivery vehicles for neuroactive compounds.118 However, the transition from the use of probiotics to support health to a more targeted use in specific disease states is one that needs to be made following substantive dialogue between all relevant stakeholders. These include regulatory agencies such as the US Food and Drug Administration (FDA) and the European Food Safety Authority (EFSA), the patient groups who could accrue potential benefits, healthcare professionals, and the probiotic manufacturers, who may be wary of subjecting their products to the same rigorous evaluation that is applied to pharmaceutical drugs. Indeed, the FDA now requires that probiotics with a specific health claim be regarded as investigational new drugs (INDs).119 There is also a growing awareness that an individual’s gut microbiome might directly (from activation or inactivation) and indirectly (via changes in host gene expression) affect drug metabolism,120 with clear consequences in altered microbial states. Drug action, toxicity, and fate are likely affected by microbial community composition.121 The advent of an array of delayed-release drug formulations means that an increasing number of drugs are now exposed to bacteria in the distal gut and subject to metabolism there.122 Therapeutic mining of the gut microbiome also represents a viable option and is already in progress.3,122

CONCLUSION
In O’Casey’s play1 the benefits promised by a false inheritance lead to a corruption of values and a tragic crisis rather than the transformative changes envisioned by the Boyle family. It is worth bearing in mind that the health riches promised by the human genomic revolution have also largely not yet come to pass. This is principally because the underlying thesis assumed that once genetic defects were detectable, readily available intervention strategies would soon follow.20 However, genetic defects have proved difficult to modify123; the concept also overlooked any interaction with the bacterial microbiome and discounted the associated consequences for health and disease.3,30,50,124 Current indications are that this mistake is unlikely to be repeated and that the lessons learned bode well for the success of current ventures that aim to understand and exploit the finite but sometimes chaotic gut microbiome. Indeed, the microbiome now appears both amenable to quantitation and malleable.20 As our understanding of our relationship with our microbial self increases, the associated health benefits will no doubt increase as well.

ACKNOWLEDGMENTS
The Alimentary Pharmabiotic Centre is a research center funded by Science Foundation Ireland (SFI) through the Irish Government’s National Development Plan (NDP). The authors and their work were supported by SFI (grant nos. 02/CE/B124 and 07/CE/B1368). GC was in receipt of a research grant from the American Neurogastroenterology and Motility Society (ANMS). PWOT and TD, also as part of the NDP, are supported by way of a Department of Agriculture, Food and Marine, and Health Research Board FHRI award to the ELDERMET project. GC, JFC and TD are also funded by the Irish Health Research Board (HRB) Health Research Awards (grant no. HRA_POR/2011/23). The authors would like to thank Dr. Marcela Julio for preparing the images used in this review (http://www.facebook.com/imagenesciencia) and Dr. Sue Grenham for helpful comments and discussions.

REFERENCES
1. O’Casey, S. Juno and the Paycock. 1924. London: Macmillan.
2. Cryan, J. F., & T. G. Dinan. Mind-altering microorganisms: the impact of the gut microbiota on brain and behaviour. Nat Rev Neurosci, 2012. 13(10): 701–712.
3. O’Hara, A. M., & F. Shanahan. The gut flora as a forgotten organ. EMBO Rep, 2006. 7(7): 688–693.
4. de Vos, W. M., & E. A. de Vos. Role of the intestinal microbiome in health and disease: from correlation to causation. Nutr Rev, 2012. 70 Suppl 1: S45–S56.
5. Lozupone, C. A., et al. Diversity, stability and resilience of the human gut microbiota. Nature, 2012. 489(7415): 220–230.
6. Collins, S. M., M. Surette, & P. Bercik. The interplay between the intestinal microbiota and the brain. Nat Rev Microbiol, 2012. 10(11): 735–742.
7. Grenham, S., et al. Brain-gut-microbe communication in health and disease. Frontiers in gastrointestinal science, 2011. 2: 94.
8. Mayer, E. A. Gut feelings: the emerging biology of gut-brain communication. Nat Rev Neurosci, 2011. 12(8): 453–466.
9. Fraher, M. H., P. W. O’Toole, & E. M. Quigley. Techniques used to characterize the gut microbiota: a guide for the clinician. Nature Rev Gastroenterol Hepatol, 2012. 9(6): 312–322.
10. Maneesh, D., et al. The human gut microbiome: current knowledge, challenges, and future directions. Transl Res, 2012. 160(4): 246–257.
11. O’Toole, P. W. Changes in the intestinal microbiota from adulthood through to old age. Clin Microbiol Infect, 2012. 18 Suppl 4: 44–46.
12. Committee on Metagenomics: Challenges and Functional Applications, National Research Council. The new science of metagenomics: revealing the secrets of our microbial planet. 2007. Washington, DC: National Academies Press.
13. Hill, C. Probiotics and pharmabiotics: alternative medicine or an evidence-based alternative? Bioeng Bugs, 2010. 1(2): 79–84.
14. Weinstock, G. M. Genomic approaches to studying the human microbiota. Nature, 2012. 489(7415): 250–256.
15. Holmes, E., et al. Gut microbiota composition and activity in relation to host metabolic phenotype and disease risk. Cell Metab, 2012. 16(5): 559–564.
16. Podolsky, S. H. Metchnikoff and the microbiome. Lancet, 2012. 380(9856): 1810–1811.
17. Relman, D. A. The human microbiome: ecosystem resilience and health. Nutr Rev, 2012. 70 Suppl 1: S2–S9.
18. Ursell, L. K., et al. Defining the human microbiome. Nutr Rev, 2012. 70 Suppl 1: S38–S44.
19. Relman, D. A. Microbiology: Learning about who we are. Nature, 2012. 486(7402): 194–195.
20. Morgan, X. C., N. Segata, & C. Huttenhower. Biodiversity and functional genomics in the human microbiome. Trends Genet, 2012. 29(1): 51–58.
21. Lasken, R. S. Genomic sequencing of uncultured microorganisms from single cells. Nat Rev Microbiol, 2012. 10(9): 631–640.
22. Ventura, M., et al. Microbial diversity in the human intestine and novel insights from metagenomics. Front Biosci, 2009. 14: 3214–3221.
23. Gueimonde, M., & M. C. Collado. Metagenomics and probiotics. Clin Microbiol Infect, 2012. 18 Suppl 4: 32–34.
24. Metzker, M. L. Sequencing technologies – the next generation. Nat Rev Genet, 2010. 11(1): 31–46.
25. Suenaga, H. Targeted metagenomics: a high-resolution metagenomics approach for specific gene clusters in complex microbial communities. Environ Microbiol, 2012. 14(1): 13–22.
26. De Filippo, C., et al. Bioinformatic approaches for functional annotation and pathway inference in metagenomics data. Brief Bioinform, 2012. 13(6): 696–710.
27. Loman, N. J., et al. High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol, 2012. 10(9): 599–606.
28. Aziz, Q., et al. Gut microbiota and gastrointestinal health: current concepts and future directions. Neurogastroenterol Motil, 2013. 25(1): 4–15.
29. Tremaroli, V., & F. Backhed. Functional interactions between the gut microbiota and host metabolism. Nature, 2012. 489(7415): 242–249.
30. Lepage, P., et al. A metagenomic insight into our gut’s microbiome. Gut, 2013. 62(1): 146–158.
31. Gevers, D., et al. The human microbiome project: a community resource for the healthy human microbiome. PLoS Biol, 2012. 10(8): e1001377.
32. Cho, I., & M. J. Blaser. The human microbiome: at the interface of health and disease. Nat Rev Genet, 2012. 13(4): 260–270.
33. Simon, C., & R. Daniel. Metagenomic analyses: past and future trends. Appl Environ Microbiol, 2011. 77(4): 1153–1161.
34. Backhed, F., et al. Defining a healthy human gut microbiome: current concepts, future directions, and clinical applications. Cell Host Microbe, 2012. 12(5): 611–622.
35. Wooley, J. C., A. Godzik, & I. Friedberg. A primer on metagenomics. PLoS Comput Biol, 2010. 6(2): e1000667.
36. Carroll, I. M., et al. Characterization of the fecal microbiota using high-throughput sequencing reveals a stable microbial community during storage. PLoS One, 2012. 7(10): e46953.
37. Ekkers, D. M., et al. The great screen anomaly – a new frontier in product discovery through functional metagenomics. Appl Microbiol Biotechnol, 2012. 93(3): 1005–1020.
38. Verberkmoes, N. C., et al.
Shotgun metaproteomics of the human distal gut microbiota. ISME J, 2009. 3(2): 179–189. 39. Ottman, N., et al. The function of our microbiota: who is out there and what do they do? Front Cell Infect Microbiol, 2012. 2: 104. 40. Ben-Amor, K., et al. Genetic diversity of viable, injured, and dead fecal bacteria assessed by fluorescence-activated cell sorting and 16S rRNA gene analysis. Appl Environ Microbiol, 2005. 71(8): 4679–489.

283

41. Ventura, M., et  al. Genome-scale analyses of health-promoting bacteria: probiogenomics. Nat Rev Microbiol, 2009. 7(1): 61–71. 42. Clarke, G., et al. Review article: probiotics for the treatment of irritable bowel syndrome—focus on lactic acid bacteria. Aliment Pharmacol Ther, 2012. 35(4): 403–413. 43. Gosalbes, M. J., et  al. Metagenomics of human microbiome:  beyond 16s rDNA. Clin Microbiol Infect, 2012. 18 Suppl 4: 47–49. 44. Perez-Cobas, A.E., et al. Gut microbiota disturbance during antibiotic therapy:  a multi-omic approach. Gut, 2012. 45. Yolken, R. H., & E. F. Torrey. Are some cases of psychosis caused by microbial agents? A  review of the evidence. Mol Psychiatry, 2008. 13(5): 470–479. 46. Hill, A. B. The environment and disease:  association or causation? Proc R Soc Med, 1965. 58: 295–300. 47. Greub, G. Culturomics: a new approach to study the human microbiome. Clin Microbiol Infect, 2012. 18(12): 1157–1159. 48. Pope, B., et al. Muramidases found in the foregut microbiome of the Tammar wallaby can direct cell aggregation and biofilm formation. ISME J, 2011. 5(2): 341–350. 49. Loman, N. J., et  al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol, 2012. 30(5): 434–439. 50. Venter, J.C. Multiple personal genomes await. Nature, 2010. 464(7289): 676–677. 51. Ley, R. E., D. A. Peterson, & J. I. Gordon. Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cell, 2006. 124(4): 837–848. 52. Qin, J., et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 2010. 464(7285): 59–65. 53. Eckburg, B., et  al. Diversity of the Human Intestinal Microbial Flora. Science, 2005. 308(5728): 1635–1638. 54. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature, 2012. 486(7402): 207–214. 55. Human Microbiome Project Consortium. A framework for human microbiome research. Nature, 2012. 486(7402): 215–221. 56. 
Yatsunenko, T., et  al. Human gut microbiome viewed across age and geography. Nature, 2012. 486(7402): 222–227. 57. Salonen, A., et al. The adult intestinal core microbiota is determined by analysis depth and health status. Clin Microbiol Infect, 2012. 18 Suppl 4: 16–20.

284

the OMICs

58. Caporaso, J. G., et al. Moving pictures of the human microbiome. Genome Biol, 2011. 12(5): R50. 59. Arumugam, M., et  al. Enterotypes of the human gut microbiome. Nature, 2011. 473(7346): 174–180. 60. Schloissnig, S., et  al. Genomic variation landscape of the human gut microbiome. Nature, 2013. 493(7430): 45–50. 61. Stower, H. Metagenomics:  Personalized gut microbiome variants. Nat Rev Genet, 2012. 62. Faust, K., & J. Raes. Microbial interactions: from networks to models. Nat Rev Microbiol, 2012. 10(8): 538–550. 63. Wu, G. D., et al. Linking long-term dietary patterns with gut microbial enterotypes. Science, 2011. 334(6052): 105–108. 64. Hehemann, J. H., et  al. Transfer of carbohydrate-active enzymes from marine bacteria to Japanese gut microbiota. Nature, 2010. 464(7290): 908–912. 65. Jeffery, I. B., et  al. Categorization of the gut microbiota:  enterotypes or gradients? Nat Rev Microbiol, 2012. 10(9): 591–592. 66. Huse, S. M., et  al. A core human microbiome as viewed through 16S rRNA sequence clusters. PLoS One, 2012. 7(6): e34242. 67. Jimenez, E., et  al. Is meconium from healthy newborns actually sterile? Res Microbiol, 2008. 159(3): 187–193. 68. Costello, E. K., et  al. The application of ecological theory toward an understanding of the human microbiome. Science, 2012. 336(6086): 1255–1262. 69. Adlerberth, I., & A. E. Wold, Establishment of the gut microbiota in Western infants. Acta Paediatr, 2009. 98(2): 229–238. 70. Barrett, E., et  al. The individual-specific and diverse nature of the preterm infant microbiota. Arch Dis Child Fetal Neonatal Ed, 2013. 98(4):F334–340. 71. Nicholson, J. K., et  al. Host-gut microbiota metabolic interactions. Science, 2012. 336(6086): 1262–1267. 72. Maynard, C. L., et al. Reciprocal interactions of the intestinal microbiota and immune system. Nature, 2012. 489(7415): 231–241. 73. Cho, C. E., & M. Norman. Cesarean section and development of the immune system in the offspring. Am J Obstet Gynecol, 2013. 208(4):249–254. 
74. Romero, R., & S. J. Korzeniewski. Is there a longterm price to pay for infants not exposed to the stress of labor? How the microbiome and the immune system can affect our lives. Am J Obstet Gynecol, 2012. 75. Parfrey, L. W., & R. Knight. Spatial and temporal variability of the human microbiota. Clin Microbiol Infect, 2012. 18 Suppl 4: 8–11.

76. Valles, Y., et al. Metagenomics and development of the gut microbiota in infants. Clin Microbiol Infect, 2012. 18 Suppl 4: 21–26. 77. Vaishampayan, A., et  al. Comparative metagenomics and population dynamics of the gut microbiota in mother and infant. Genome Biol Evol, 2010. 2: 53–66. 78. Turnbaugh, J., et  al. A core gut microbiome in obese and lean twins. Nature, 2009. 457(7228): 480–484. 79. Claesson, M. J., et  al. Composition, variability, and temporal stability of the intestinal microbiota of the elderly. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: 4586–4591. 80. Claesson, M. J., et  al. Gut microbiota composition correlates with diet and health in the elderly. Nature, 2012. 488(7410): 178–184. 81. Kinross, J., & J. K. Nicholson. Gut microbiota:  dietary and social modulation of gut microbiota in the elderly. Nat Rev Gastroenterol Hepatol, 2012. 9(10): 563–564. 82. Ghosh, S., et al. Diets rich in n-6 PUFA induce intestinal microbial dysbiosis in aged mice. Br J Nutr, 2013: 1–9. 83. Cryan, J. F., & S. M. O’Mahony. The microbiome-gut-brain axis:  from bowel to behavior. Neurogastroenterol Motil, 2011. 23(3): 187–192. 84. Dinan, T. G., & J. F. Cryan. Regulation of the stress response by the gut microbiota:  Implications for psychoneuroendocrinology. Psychoneuroendocrinology, 2012. 37(9):1369–1378. 85. Rhee, S. H., C. Pothoulakis, & E. A. Mayer. Principles and clinical implications of the brain-gut-enteric microbiota axis. Nat Rev Gastroenterol Hepatol, 2009. 6(5): 306–314. 86. O’Mahony, S. M., et  al. Maternal separation as a model of brain-gut axis dysfunction. Psychopharmacology (Berl), 2011. 214(1): 71–88. 87. Flint, H. J., et al. The role of the gut microbiota in nutrition and health. Nat Rev Gastroenterol Hepatol, 2012. 9(10): 577–589. 88. Kau, A. L., et al. Human nutrition, the gut microbiome and the immune system. Nature, 2011. 474(7351): 327–336. 89. Davey, K. J., et  al. 
Gender-dependent consequences of chronic olanzapine in the rat: effects on body weight, inflammatory, metabolic and microbiota parameters. Psychopharmacology (Berl), 2012. 221(1): 155–169. 90. Hildebrandt, M. A., et al. High-fat diet determines the composition of the murine gut microbiome independently of obesity. Gastroenterology, 2009. 137(5): 1716–24 e1–2. 91. Murphy, E. F., et al. Composition and energy harvesting capacity of the gut microbiota: relationship to diet, obesity and time in mouse models. Gut, 2010. 59(12): 1635–1642.

Characterizing the Gut Microbiome 92. Schellekens, H., et  al. Ghrelin signalling and obesity: at the interface of stress, mood and food reward. Pharmacol Ther, 2012. 135(3): 316–326. 93. Murphy, E. F., et  al. Divergent metabolic outcomes arising from targeted manipulation of the gut microbiota in diet-induced obesity. Gut, 2012. 94. Mujico, J. R., et al. Changes in gut microbiota due to supplemented fatty acids in diet-induced obese mice. Br J Nutr, 2013: 1–10. 95. de Theije, C.G., et al. Pathways underlying the gut-to-brain connection in autism spectrum disorders as future targets for disease management. Eur J Pharmacol, 2011. 668 Suppl 1: S70–S80. 96. Sandler, R. H., et  al. Short-term benefit from oral vancomycin treatment of regressive-onset autism. J Child Neurol, 2000. 15(7): 429–435. 97. Wang, L., et al. Elevated fecal short chain fatty acid and ammonia concentrations in children with autism spectrum disorder. Dig Dis Sci, 2012. 98. MacFabe, D. F., et  al. Effects of the enteric bacterial metabolic product propionic acid on object-directed behavior, social behavior, cognition, and neuroinflammation in adolescent rats:  relevance to autism spectrum disorder. Behav Brain Res, 2011. 217(1): 47–54. 99. Thomas, R. H., et  al. The enteric bacterial metabolite propionic acid alters brain and plasma phospholipid molecular species:  further development of a rodent model of autism spectrum disorders. J Neuroinflam, 2012. 9(1): 153. 100. Mulle, J. G., W. G. Sharp, & J. F. Cubells. The gut microbiome: a new frontier in autism research. Curr Psychiatry Rep, 2013. 15(2): 337. 101. Clarke, G., et  al. Irritable bowel syndrome:  towards biomarker identification. Trends Mol Med, 2009. 15(10): 478–489. 102. Spiller, R. & K. Garsed. Postinfectious Irritable Bowel Syndrome. Gastroenterology, 2009. 136(6): 1979–1988. 103. O’Mahony, S. M., et  al. 
Early life stress alters behavior, immunity, and microbiota in rats: implications for irritable bowel syndrome and psychiatric illnesses. Biol Psychiatry, 2009. 65(3): 263–267. 104. Quigley, E. M. Bacterial flora in irritable bowel syndrome:  role in pathophysiology, implications for management. J Dig Dis, 2007. 8(1): 2–7. 105. Quigley, E. M. Do patients with functional gastrointestinal disorders have an altered gut flora? Ther Adv Gastroenterol, 2009. 2(4): 23–30. 106. Spiller, R., & C. Lam. An update on post-infectious irritable bowel syndrome:  role of genetics, immune activation, serotonin and altered

107.

108.

109.

110.

111.

112.

113.

114.

115.

116.

117.

118.

119.

120.

121.

285

microbiome. J Neurogastroenterol Motil, 2012. 18(3): 258–268. Ezenwa, V. O., et  al. Microbiology. Animal behavior and the microbiome. Science, 2012. 338(6104): 198–199. Tillisch, K., et al. Modulation of the brain-gut axis after 4-week intervention with a probiotic fermented dairy product. Gastroenterology, 2012. 142(5): S–115. Bercik, P., et al. The intestinal microbiota affect central levels of brain-derived neurotropic factor and behavior in mice. Gastroenterology, 2011. 141(2): 599–609.e3. Hooper, L. V., D. R. Littman, & A. J. Macpherson. Interactions between the microbiota and the immune system. Science, 2012. 336(6086): 1268–1273. Turnbaugh, J., et  al. The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med, 2009. 1(6): 6ra14. Bruls, T., & J. Weissenbach. The human metagenome:  our other genome? Hum Mol Genet, 2011. 20(R2): R142–R148. Camfield, D. A., et  al. Dairy constituents and neurocognitive health in ageing. Br J Nutr, 2011. 106(2): 159–174. Moschen, A. R., V. Wieser, & H. Tilg. Dietary factors:  major regulators of the gut’s microbiota. Gut Liver, 2012. 6(4): 411–416. Borody, T. J., & A. Khoruts. Fecal microbiota transplantation and emerging applications. Nat Rev Gastroenterol Hepatol, 2012. 9(2): 88–96. Aragon, G., et al. Probiotic therapy for irritable bowel syndrome. Gastroenterol Hepatol (NY), 2010. 6(1): 39–44. Clarke, G., et  al. The microbiome-gut-brain axis during early-life regulates the hippocampal serotonergic system in a gender-dependent manner. Mol Psychiatry, 2013. 18(6):666–673. Lyte, M. Probiotics function mechanistically as delivery vehicles for neuroactive compounds:  microbial endocrinology in the design and use of probiotics. BioEssays, 2011. 33(8): 574–581. Rauch, M., & S. V. Lynch. The potential for probiotic manipulation of the gastrointestinal microbiome. Curr Opin Biotechnol, 2012. 23(2): 192–201. Haiser, H. J., & J. Turnbaugh. 
Is it time for a metagenomic basis of therapeutics? Science, 2012. 336(6086): 1253–1255. Saad, R., M. R. Rizkallah, & R. K. Aziz. Gut pharmacomicrobiomics:  the tip of an iceberg of complex interactions between drugs and gut-associated microbes. Gut Pathog, 2012. 4(1): 16.

286

the OMICs

122. Shanahan, F. The gut microbiota—a clinical perspective on lessons learned. Nat Rev Gastroenterol Hepatol, 2012. 9(10): 609–614. 123. Hall, W. D., R. Mathews, & K. I. Morley. Being more realistic about the public health impact of genomic medicine. PLoS Med, 2010. 7(10). e1000347. 124. Jia, W., et  al. Gut microbiota:  a potential new territory for drug targeting. Nat Rev Drug Discov, 2008. 7(2): 123–129. 125. Turnbaugh, J. & J. I. Gordon. The core gut microbiome, energy balance and obesity. J Physiol, 2009. 587(Pt 17): 4153–4158. 126. Ley, R. E., et al. Microbial ecology: human gut microbes associated with obesity. Nature, 2006. 444(7122): 1022–1023. 127. Backhed, F., et  al. The gut microbiota as an environmental factor that regulates fat storage. Proc Natl Acad Sci U S A, 2004. 101(44): 15718–15723. 128. Backhed, F., et al. Mechanisms underlying the resistance to diet-induced obesity in germfree mice. Proc Natl Acad Sci U S A, 2007. 104(3): 979–984. 129. Ley, R. E., et  al. Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A, 2005. 102(31): 11070–11075. 130. Finegold, S. M., et  al. Pyrosequencing study of fecal microflora of autistic and control children. Anaerobe, 2010. 16(4): 444–453. 131. Gondalia, S. V., et  al. Molecular characterisation of gastrointestinal microbiota of children with autism (with and without gastrointestinal dysfunction) and their neurotypical siblings. Autism Res, 2012. 5(6):419–427. 132. Carroll, I. M., et  al. Alterations in composition and diversity of the intestinal microbiota in patients with diarrhea-predominant irritable bowel syndrome. Neurogastroenterol Motil, 2012. 24(6): 521–530, e248. 133. Jeffery, I. B., et  al. An irritable bowel syndrome subtype defined by species-specific alterations in fecal microbiota. Gut, 2012. 61(7): 997–1006. 134. Bailey, M. T., et al. Exposure to a social stressor alters the structure of the intestinal microbiota:  implications for stressor-induced immunomodulation. 
Brain Behav Immun, 2011. 25(3): 397–407. 135. Bailey, M. T., & C. L. Coe. Maternal separation disrupts the integrity of the intestinal microflora in infant rhesus monkeys. Dev Psychobiol, 1999. 35(2): 146–155. 136. Sudo, N., et al. Postnatal microbial colonization programs the hypothalamic-pituitary-adrenal system for stress response in mice. J Physiol, 2004. 558(Pt 1): 263–275.

137. Heijtz, R. D., et  al. Normal gut microbiota modulates brain development and behavior. Proc Natl Acad Sci U S A, 2011. 108(7): 3047–3052. 138. Neufeld, K. M., et  al. Reduced anxiety-like behavior and central neurochemical change in germ-free mice. Neurogastroenterol Motil, 2010. 23(3): 255–264, e119. 139. Gareau, M. G., et al. Bacterial infection causes stress-induced memory dysfunction in mice. Gut, 2011. 60(3): 307–317. 140. Bercik, P., et  al. The anxiolytic effect of Bifidobacterium longum NCC3001 involves vagal pathways for gut-brain communication. Neurogastroenterol Motil, 2011. 23(12): 1132–1139. 141. Verdu, E. F., et  al. Specific probiotic therapy attenuates antibiotic induced visceral hypersensitivity in mice. Gut, 2006. 55(2): 182–190. 142. Bercik, P., et al. Chronic gastrointestinal inflammation induces anxiety-like behavior and alters central nervous system biochemistry in mice. Gastroenterology, 2010. 139(6): 2102–2112 e1. 143. Lyte, M., et al. Induction of anxiety-like behavior in mice during the initial stages of infection with the agent of murine colonic hyperplasia Citrobacter rodentium. Physiol Behav, 2006. 89(3): 350–357. 144. Ait-Belgnaoui, A., et al. Prevention of gut leakiness by a probiotic treatment leads to attenuated HPA response to an acute psychological stress in rats. Psychoneuroendocrinology, 2012. 145. Gareau, M. G., et  al. Probiotic treatment of rat pups normalises corticosterone release and ameliorates colonic dysfunction induced by maternal separation. Gut, 2007. 56(11): 1522–1528. 146. Johnson, A. C., B. Greenwood-Van Meerveld, & J. McRorie. Effects of Bifidobacterium infantis 35624 on post-inflammatory visceral hypersensitivity in the rat. Dig Dis Sci, 2011. 56(11): 3179–3186. 147. McKernan, D. P., et  al. The probiotic Bifidobacterium infantis 35624 displays visceral antinociceptive effects in the rat. Neurogastroenterol Motil, 2010. 22(9): 1029–1035, e268. 148. Rousseaux, C., et al. 
Lactobacillus acidophilus modulates intestinal pain and induces opioid and cannabinoid receptors. Nat Med, 2007. 13(1): 35–37. 149. Wang, B., et  al. Lactobacillus reuteri ingestion and IK(Ca) channel blockade have similar effects on rat colon motility and myenteric neurones. Neurogastroenterol Motil, 2010. 22(1): 98–107, e33. 150. Bravo, J. A., et  al. Ingestion of Lactobacillus strain regulates emotional behavior and central

Characterizing the Gut Microbiome GABA receptor expression in a mouse via the vagus nerve. Proc Natl Acad Sci U S A, 2011. 108(38): 16050–16055. 151. Logan, A. C., & M. Katzman. Major depressive disorder:  probiotics may be an adjuvant therapy. Med Hypotheses, 2005. 64(3): 533–5338. 152. Messaoudi, M., et  al. Assessment of psychotropic-like properties of a probiotic formulation (Lactobacillus helveticus R0052 and Bifidobacterium longum R0175) in rats and human subjects. Br J Nutr, 2011. 105(5): 755–764.

287

153. Desbonnet, L., et al. The probiotic Bifidobacteria infantis: an assessment of potential antidepressant properties in the rat. J Psychiatr Res, 2008. 43(2): 164–174. 154. Desbonnet, L., et  al. Effects of the probiotic Bifidobacterium infantis in the maternal separation model of depression. Neuroscience, 2010. 170(4): 1179–1188. 155. Wall, R., et  al. Contrasting effects of Bifidobacterium breve NCIMB 702258 and Bifidobacterium breve DPC 6330 on the composition of murine brain fatty acids and gut microbiota. Am J Clin Nutr, 2012.

PART V THERAPEUTICS

16 OMICs in Drug Discovery: From Small Molecule Leads to Clinical Candidates
B. MICHAEL SILBER

INTRODUCTION
This chapter reviews how new molecular entities (NMEs) are discovered and optimized to become clinical candidates that are ultimately evaluated in the clinic and gain regulatory approval to become breakthrough medicines for neurology-related diseases; it also discusses the role of OMICs throughout this process. Table 16.1 lists many neurological diseases, along with their prevalence and an indication of how effective the available therapies are. Some conditions, like pain, involve a relatively mature understanding of the root cause at the molecular level and of the genes, proteins, and pathways involved. Fortunately, medicines are available to treat a wide range of symptoms, from mild to severe forms of acute and chronic pain, and great progress has been made in understanding the molecular mechanisms involved in the pathogenesis of pain (Basbaum et al. 2009). Typically these medicines are low-molecular-weight compounds that, after oral, parenteral, or transdermal administration, have good access to the affected fibers and nerves. Treatment of moderate to severe pain is more problematic, since it can require the use of narcotics, which can lead to addiction and to death from respiratory failure. However, there continues to be a paucity of new clinically validated targets to drive drug discovery programs aimed at developing breakthrough medicines for pain.

Epilepsy is generally well understood and treatment options are good to excellent, albeit with manageable side effects. In contrast, there are no effective and “safe” medicines available to treat addiction, spinal cord injuries, or developmental disorders. In addiction, therapies often have potentially serious adverse effects (SAE). There are treatments to prevent and treat strokes, although treatment is only partially successful and depends on intervention within about six hours following a stroke. Treatment includes older drugs like streptokinase and newer clot-busting drugs like tissue plasminogen activator. The problem in stroke is the sequelae, including tissue necrosis resulting from inadequate blood supply to tissues in the brain and leaks following the stroke, which cause inflammation, pressure, or other pathological processes.

Other conditions, such as the neurodegenerative diseases, are at best poorly understood at the molecular level: hypotheses for druggable targets are few, and none has been validated except for elevating dopamine levels in Parkinson’s disease (PD). Even there, the effects of dopamine-elevating drugs diminish over time, or the drugs are poorly tolerated because of SAE. Knowledge of disease biology and of potential targets for therapy, and the availability of effective and safe drugs, are grossly inadequate in all of the neurodegenerative diseases; as a result, there are no effective treatments. All of these conditions are progressive and ultimately fatal; they differ primarily in the time to onset and the rate of progression. The genetic basis of neurodegenerative disease is clear for some (e.g., Huntington’s disease [HD]) but unclear for the vast majority of the others. For these cases, there is no clear association between specific genetic makeup, disease penetrance, disease phenotype, and outcomes.

TABLE 16.1. NEUROLOGY-BASED DISEASES IN THE US, MODIFIED FROM THE DANA ALLIANCE FOR BRAIN INITIATIVES

Neurology-Based Disease                                        Prevalence (Million)
Pain                                                           90
Migraine                                                       35
Addiction                                                      30
Developmental disorders                                        15
Alzheimer’s disease*                                           4.0
Stroke                                                         3
Epilepsy                                                       2.5
Traumatic brain injury/chronic traumatic encephalopathy*       1.5
Multiple sclerosis*                                            0.6
Parkinson’s disease*                                           0.5
Frontotemporal dementia/primary supranuclear palsy*            0.25
Spinal Cord Injury                                             0.25
Amyotrophic lateral sclerosis*                                 0.1
Huntington’s disease*                                          0.025
Creutzfeldt-Jakob disease*                                     0.0003

Therapies^ (row alignment lost in source): ++ + • + ++ • • -

*Indicates neurodegenerative disease.
^Indicates where there is effective treatment (++), a partially beneficial treatment, especially if treatment is started very early (+), short-term symptomatic treatment (•), or no effective treatment (-).

This chapter focuses on the discovery and optimization of low-molecular-weight compounds (less than about 350 molecular weight) that are capable of crossing the blood-brain barrier (BBB) and achieving and maintaining effective drug concentrations, especially unbound concentrations, at the site of action in the brain; in this regard, the chapter also focuses on the importance of omics. These experimental drugs must have physicochemical properties that make them suitable for targeting the brain, including lower molecular weight, good aqueous solubility, low polar surface area, low lipophilicity (clogP), and a small number of hydrogen bond donors and acceptors; they must not be good substrates for the MDR1 transporter (P-glycoprotein) or other transporters found on the endothelial cells of the BBB that efflux drugs out of the brain (Table 16.2). The vast majority of medicines (more than 95%) approved by the US Food and Drug Administration (FDA), European Medicines Agency (EMA), or other regulatory authorities do not significantly accumulate in the brain because they were never designed to have good brain-penetrant physicochemical properties. This review does not discuss the discovery of peptide-, protein-, or antibody-based therapeutics or the role of omics with regard to these medicines, since such therapies raise a very different set of issues in neurology.
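The physicochemical criteria described in the text lend themselves to a simple screening check. The following is a minimal Python sketch, not a method taken from the chapter: the class names, field names, and function are hypothetical, and the only numeric limit drawn from the text is the molecular weight of about 350. Every other bound is left unset and would have to be supplied by the reader, for example from Table 16.2. Aqueous solubility, which the text treats as a property that should be high rather than low, is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Compound:
    """Descriptors for one candidate compound (field names are illustrative)."""
    name: str
    mol_weight: float          # molecular weight
    clogp: float               # calculated lipophilicity
    polar_surface_area: float  # polar surface area, in square angstroms
    h_bond_donors: int
    h_bond_acceptors: int
    pgp_substrate: bool        # True if a good MDR1 (P-glycoprotein) substrate


@dataclass
class Limits:
    """Upper bounds chosen by the caller.

    Only max_mol_weight reflects a figure stated in the text (about 350).
    A bound of None means that property is not checked.
    """
    max_mol_weight: Optional[float] = 350.0
    max_clogp: Optional[float] = None
    max_polar_surface_area: Optional[float] = None
    max_h_bond_donors: Optional[int] = None
    max_h_bond_acceptors: Optional[int] = None


def failed_criteria(c: Compound, lim: Limits) -> List[str]:
    """Return the names of the criteria that the compound fails."""
    checks = [
        ("molecular weight", c.mol_weight, lim.max_mol_weight),
        ("clogP", c.clogp, lim.max_clogp),
        ("polar surface area", c.polar_surface_area, lim.max_polar_surface_area),
        ("hydrogen bond donors", c.h_bond_donors, lim.max_h_bond_donors),
        ("hydrogen bond acceptors", c.h_bond_acceptors, lim.max_h_bond_acceptors),
    ]
    # A criterion fails only when a bound was supplied and the value exceeds it.
    failures = [label for label, value, bound in checks
                if bound is not None and value > bound]
    # Efflux by P-glycoprotein is treated as a yes/no property.
    if c.pgp_substrate:
        failures.append("P-glycoprotein substrate")
    return failures


# Hypothetical compound with made-up descriptor values, for illustration only.
example = Compound("hypothetical-1", 320.0, 2.0, 60.0, 1, 4, False)
print(failed_criteria(example, Limits()))
```

Returning the list of failed criteria, rather than a single pass/fail flag, keeps visible which property would need to be optimized for a given compound.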

BACKGROUND
Most drugs that target neurology-related brain diseases gain access to the brain via passive diffusion across the BBB. In addition to these traditional passive mechanisms, attempts have been made to use active processes that target transporters on endothelial cells at the BBB (Bahadduri et al. 2010; Cundy et al. 2004; Pardridge 2009) or to deliver drugs to the brain via nerve fibers following intranasal administration (Reger et al. 2008). While the intranasal approach for delivering macromolecules like insulin has met with only limited success, primarily because of the low and variable bioavailability of drugs administered via this route, such approaches continue to be explored, especially in diseases where effective therapies are absent and morbidity and mortality are high.

The critical steps in the drug discovery process include target identification and validation; development of biochemical, ELISA, FRET, or other cell- or non-cell-based assays used to screen chemical or fragment-based libraries of compounds to identify and confirm hits, leads, and optimized leads using structure-activity relationships (SAR); use of in vitro and in vivo screens to assess drug absorption, distribution, metabolism, excretion, and toxicity (ADMET); animal pharmacology models of disease to


TABLE 16.2. SUCCESS ATTRIBUTES FOR ORALLY ADMINISTERED DRUGS USED FOR NON-NEUROLOGY (NON-CNS) AND NEUROLOGY (CNS) INDICATIONS, WHERE THE LATTER MUST CROSS THE BLOOD-BRAIN BARRIER

Columns: Attribute | Non-CNS: Ideal, Acceptable, Concerning | CNS Drug: Ideal, Acceptable, Concerning

Attributes (rows): MW^; clogP; PSA^ (Å²); HBD^; HBA^; Number of rotatable bonds; Solubility (μg/mL)

7 >12 >10 8 >8