Functional Analysis of Long Non-Coding RNAs: Methods and Protocols [1st ed.] 9781071611579, 9781071611586

This detailed volume presents a comprehensive bioinformatic and experimental toolbox for prioritizing, annotating, and f


English Pages XIII, 351 [351] Year 2021




Author / Uploaded
Haiming Cao

Table of contents :
Front Matter ....Pages i-xiii
Bioinformatics Approaches for Functional Prediction of Long Noncoding RNAs (Fayaz Seifuddin, Mehdi Pirooznia)....Pages 1-13
Visualization of lncRNA and mRNA Structure Models Within the Integrative Genomics Viewer (Steven Busan, Kevin M. Weeks)....Pages 15-25
RNA Coding Potential Prediction Using Alignment-Free Logistic Regression Model (Ying Li, Liguo Wang)....Pages 27-39
Classification of Long Noncoding RNAs by k-mer Content (Jessime M. Kirk, Daniel Sprague, J. Mauro Calabrese)....Pages 41-60
Genome-Wide Computational Analysis and Validation of Potential Long Noncoding RNA-Mediated DNA–DNA–RNA Triplexes in the Human Genome (Saakshi Jalali, Amrita Singh, Vinod Scaria, Souvik Maiti)....Pages 61-71
Using INFERNO to Infer the Molecular Mechanisms Underlying Noncoding Genetic Associations (Alexandre Amlie-Wolf, Pavel P. Kuksa, Chien-Yueh Lee, Elisabeth Mlynarski, Yuk Yee Leung, Li-San Wang)....Pages 73-91
A Bioinformatic Pipeline to Integrate GWAS and eQTL Datasets to Identify Disease Relevant Human Long Noncoding RNAs (Yi Chen, Ping Li, Haiming Cao)....Pages 93-110
AnnoLnc: A One-Stop Portal to Systematically Annotate Novel Human Long Noncoding RNAs (De-Chang Yang, Lan Ke, Yang Ding, Ge Gao)....Pages 111-131
Annotation of Full-Length Long Noncoding RNAs with Capture Long-Read Sequencing (CLS) (Sílvia Carbonell Sala, Barbara Uszczyńska-Ratajczak, Julien Lagarde, Rory Johnson, Roderic Guigó)....Pages 133-159
Single-Cell Analysis of Long Noncoding RNAs (lncRNAs) in Mouse Brain Cells (Boyang Zhang, Wentao Xu, James Eberwine)....Pages 161-177
Detection and Characterization of Ribosome-Associated Long Noncoding RNAs (Chao Zeng, Michiaki Hamada)....Pages 179-194
Analysis of Annotated and Unannotated Long Noncoding RNAs from Exosome Subtypes Using Next-Generation RNA Sequencing (Wittaya Suwakulsiri, Maoshan Chen, David W. Greening, Rong Xu, Richard J. Simpson)....Pages 195-218
DMS-MaPseq for Genome-Wide or Targeted RNA Structure Probing In Vitro and In Vivo (Phillip Tomezsko, Harish Swaminathan, Silvi Rouskin)....Pages 219-238
Labeling and Purification of Temporally Expressed RNAs During the S-Phase of the Cell Cycle in Living Cells (Matthieu Meryet-Figuiere, Mohamad Moustafa Ali, Santhilal Subhash, Chandrasekhar Kanduri)....Pages 239-249
A Quick Immuno-FISH Protocol for Detecting RNAs, Proteins, and Chromatin Modifications (Akiyo Ogawa, Yuya Ogawa)....Pages 251-257
M(R)apping RNA–Protein Interactions (Jasmine Barra, Roberto Vendramin, Eleonora Leucci)....Pages 259-272
In Vivo Administration of Therapeutic Antisense Oligonucleotides (Luisa Statello, Mohamad Moustafa Ali, Chandrasekhar Kanduri)....Pages 273-282
CRISPR-Mediated Mutagenesis of Long Noncoding RNAs (Tomohiro Yamazaki, Tetsuro Hirose)....Pages 283-303
In Vivo CRISPR/Cas9-Based Targeted Disruption and Knockin of a Long Noncoding RNA (Xi Cheng, Samuel T. Peters, Shondra M. Pruett-Miller, Thomas L. Saunders, Bina Joe)....Pages 305-321
Genome-Scale Perturbation of Long Noncoding RNA Expression Using CRISPR Interference (S. John Liu, Max A. Horlbeck, Jonathan S. Weissman, Daniel A. Lim)....Pages 323-338
In Vivo Functional Analysis of Nonconserved Human lncRNAs Using a Humanized Mouse Model (Yonghe Ma, Cheng-Fei Jiang, Ping Li, Haiming Cao)....Pages 339-347
Back Matter ....Pages 349-351


Methods in Molecular Biology 2254

Haiming Cao Editor

Functional Analysis of Long Non-Coding RNAs Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Functional Analysis of Long Non-Coding RNAs Methods and Protocols

Edited by

Haiming Cao
Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-1157-9 ISBN 978-1-0716-1158-6 (eBook)
https://doi.org/10.1007/978-1-0716-1158-6

© Springer Science+Business Media, LLC, part of Springer Nature 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Although 2% of the human genome is sufficient to encode all protein-coding genes, the vast majority of the genome can be transcribed, giving rise to 60,000 to 100,000 long noncoding RNAs (lncRNAs). Mounting evidence supports the notion that lncRNAs play a vital role in diverse biological processes, and their dysfunctions have been implicated in a growing list of human diseases. Studying the function of lncRNAs thus has great potential to quickly fill knowledge gaps in molecular biology and to identify novel therapeutic targets for human diseases. However, it remains difficult to identify which lncRNAs are important and how they carry out their functions.

The challenges are threefold. First, compared to protein-coding genes, our current understanding of the sequence-function relationship of lncRNAs is limited, and it is difficult to deduce the function of lncRNAs based on their primary sequence. More robust bioinformatics analyses are often needed to select lncRNAs for downstream experimental interrogation. Second, the annotation of human lncRNAs is far from complete. Third, lncRNAs often exercise their function through unique mechanisms, and novel approaches and systems are required to define their functional importance and molecular mechanisms. For example, most lncRNAs studied so far exert their function through their protein-binding partners, whose identification is crucial to define a lncRNA's mechanism of action. Furthermore, most human lncRNAs are non-conserved, and new models are required to define their physiological function.

This book addresses these fundamental challenges by presenting a comprehensive bioinformatic (Chapters 1–8) and experimental (Chapters 9–21) toolbox for prioritizing, annotating, and functionally analyzing lncRNAs. It is my hope that this book will provide a timely and convenient resource to facilitate the identification and characterization of disease-associated human lncRNAs, which would shed light on their role in biology and pathophysiology and ultimately lead to novel therapeutic approaches targeting lncRNAs for the amelioration of human diseases.

Bethesda, MD, USA
Haiming Cao

Contents

Preface
Contributors

1 Bioinformatics Approaches for Functional Prediction of Long Noncoding RNAs
Fayaz Seifuddin and Mehdi Pirooznia
2 Visualization of lncRNA and mRNA Structure Models Within the Integrative Genomics Viewer
Steven Busan and Kevin M. Weeks
3 RNA Coding Potential Prediction Using Alignment-Free Logistic Regression Model
Ying Li and Liguo Wang
4 Classification of Long Noncoding RNAs by k-mer Content
Jessime M. Kirk, Daniel Sprague, and J. Mauro Calabrese
5 Genome-Wide Computational Analysis and Validation of Potential Long Noncoding RNA-Mediated DNA–DNA–RNA Triplexes in the Human Genome
Saakshi Jalali, Amrita Singh, Vinod Scaria, and Souvik Maiti
6 Using INFERNO to Infer the Molecular Mechanisms Underlying Noncoding Genetic Associations
Alexandre Amlie-Wolf, Pavel P. Kuksa, Chien-Yueh Lee, Elisabeth Mlynarski, Yuk Yee Leung, and Li-San Wang
7 A Bioinformatic Pipeline to Integrate GWAS and eQTL Datasets to Identify Disease Relevant Human Long Noncoding RNAs
Yi Chen, Ping Li, and Haiming Cao
8 AnnoLnc: A One-Stop Portal to Systematically Annotate Novel Human Long Noncoding RNAs
De-Chang Yang, Lan Ke, Yang Ding, and Ge Gao
9 Annotation of Full-Length Long Noncoding RNAs with Capture Long-Read Sequencing (CLS)
Sílvia Carbonell Sala, Barbara Uszczyńska-Ratajczak, Julien Lagarde, Rory Johnson, and Roderic Guigó
10 Single-Cell Analysis of Long Noncoding RNAs (lncRNAs) in Mouse Brain Cells
Boyang Zhang, Wentao Xu, and James Eberwine
11 Detection and Characterization of Ribosome-Associated Long Noncoding RNAs
Chao Zeng and Michiaki Hamada
12 Analysis of Annotated and Unannotated Long Noncoding RNAs from Exosome Subtypes Using Next-Generation RNA Sequencing
Wittaya Suwakulsiri, Maoshan Chen, David W. Greening, Rong Xu, and Richard J. Simpson
13 DMS-MaPseq for Genome-Wide or Targeted RNA Structure Probing In Vitro and In Vivo
Phillip Tomezsko, Harish Swaminathan, and Silvi Rouskin
14 Labeling and Purification of Temporally Expressed RNAs During the S-Phase of the Cell Cycle in Living Cells
Matthieu Meryet-Figuiere, Mohamad Moustafa Ali, Santhilal Subhash, and Chandrasekhar Kanduri
15 A Quick Immuno-FISH Protocol for Detecting RNAs, Proteins, and Chromatin Modifications
Akiyo Ogawa and Yuya Ogawa
16 M(R)apping RNA–Protein Interactions
Jasmine Barra, Roberto Vendramin, and Eleonora Leucci
17 In Vivo Administration of Therapeutic Antisense Oligonucleotides
Luisa Statello, Mohamad Moustafa Ali, and Chandrasekhar Kanduri
18 CRISPR-Mediated Mutagenesis of Long Noncoding RNAs
Tomohiro Yamazaki and Tetsuro Hirose
19 In Vivo CRISPR/Cas9-Based Targeted Disruption and Knockin of a Long Noncoding RNA
Xi Cheng, Samuel T. Peters, Shondra M. Pruett-Miller, Thomas L. Saunders, and Bina Joe
20 Genome-Scale Perturbation of Long Noncoding RNA Expression Using CRISPR Interference
S. John Liu, Max A. Horlbeck, Jonathan S. Weissman, and Daniel A. Lim
21 In Vivo Functional Analysis of Nonconserved Human lncRNAs Using a Humanized Mouse Model
Yonghe Ma, Cheng-Fei Jiang, Ping Li, and Haiming Cao

Index

Contributors

MOHAMAD MOUSTAFA ALI • Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
ALEXANDRE AMLIE-WOLF • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
JASMINE BARRA • Department of Oncology, Laboratory of RNA Cancer Biology, LKI Leuven Cancer Institute, KU Leuven-University of Leuven, Leuven, Belgium; Laboratory for Molecular Cancer Biology, Center for Cancer Biology, VIB, Leuven, Belgium
STEVEN BUSAN • Department of Chemistry, University of North Carolina, Chapel Hill, NC, USA
J. MAURO CALABRESE • Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
HAIMING CAO • Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
SÍLVIA CARBONELL SALA • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
MAOSHAN CHEN • Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science (LIMS), La Trobe University, Melbourne, VIC, Australia; Australian Centre for Blood Diseases, Alfred Hospital, Monash University, Melbourne, VIC, Australia
YI CHEN • Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
XI CHENG • Department of Physiology and Pharmacology, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
YANG DING • Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China; State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, China
JAMES EBERWINE • Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
GE GAO • Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China; State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, China
DAVID W. GREENING • Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science (LIMS), La Trobe University, Melbourne, VIC, Australia; Molecular Proteomics, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia


RODERIC GUIGÓ • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
MICHIAKI HAMADA • AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan; Faculty of Science and Engineering, Waseda University, Tokyo, Japan; Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Institute for Medical-Oriented Structural Biology, Waseda University, Tokyo, Japan; Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
TETSURO HIROSE • Graduate School of Frontier Biosciences, Osaka University, Suita, Japan
MAX A. HORLBECK • Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA, USA; Howard Hughes Medical Institute, University of California, San Francisco, CA, USA; California Institute for Quantitative Biomedical Research, University of California, San Francisco, CA, USA; Center for RNA Systems Biology, University of California, San Francisco, CA, USA
SAAKSHI JALALI • CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India; Reliance Technology Group, Reliance Industries Limited, Navi Mumbai, India
CHENG-FEI JIANG • Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
BINA JOE • Department of Physiology and Pharmacology, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
RORY JOHNSON • Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland; Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland; School of Biology and Environmental Science, University College Dublin, Dublin, Ireland; Conway Institute of Biomedical and Biomolecular Research, University College Dublin, Dublin, Ireland
CHANDRASEKHAR KANDURI • Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
LAN KE • Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China; State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, China
JESSIME M. KIRK • Invitae Corporation, San Francisco, CA, USA; Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
PAVEL P. KUKSA • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
JULIEN LAGARDE • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
CHIEN-YUEH LEE • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA


ELEONORA LEUCCI • Department of Oncology, Laboratory of RNA Cancer Biology, LKI Leuven Cancer Institute, KU Leuven-University of Leuven, Leuven, Belgium
YUK YEE LEUNG • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
PING LI • Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
YING LI • Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN, USA
DANIEL A. LIM • Flagship Pioneering, Boston, MA, USA; Department of Neurological Surgery, University of California, San Francisco, CA, USA; Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, CA, USA; San Francisco Veterans Affairs Medical Center, University of California, San Francisco, CA, USA
S. JOHN LIU • Department of Neurological Surgery, University of California, San Francisco, CA, USA; Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, CA, USA
YONGHE MA • Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
SOUVIK MAITI • CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India; Academy of Scientific and Innovative Research (AcSIR), CSIR IGIB South Campus, Delhi, India
MATTHIEU MERYET-FIGUIERE • Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden; Normandie Univ, UNICAEN, INSERM U1086 “ANTICIPE” (Interdisciplinary Research Unit for Cancers Prevention and Treatment, Axis BioTICLA “Biology and Innovative Therapeutics for Ovarian Cancers”), Caen, France; Comprehensive Cancer Centre François Baclesse, UNICANCER, Caen, France
ELISABETH MLYNARSKI • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
AKIYO OGAWA • Division of Reproductive Sciences, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
YUYA OGAWA • Division of Reproductive Sciences, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA; Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH, USA
SAMUEL T. PETERS • Center for Advanced Genome Engineering, St. Jude Children’s Research Hospital, Memphis, TN, USA
MEHDI PIROOZNIA • Bioinformatics and Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, USA
SHONDRA M. PRUETT-MILLER • Center for Advanced Genome Engineering, St. Jude Children’s Research Hospital, Memphis, TN, USA
SILVI ROUSKIN • Whitehead Institute, Cambridge, MA, USA
THOMAS L. SAUNDERS • Transgenic Animal Model Core, University of Michigan, Ann Arbor, MI, USA


VINOD SCARIA • CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India; Academy of Scientific and Innovative Research (AcSIR), CSIR IGIB South Campus, Delhi, India
FAYAZ SEIFUDDIN • Bioinformatics and Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, USA
RICHARD J. SIMPSON • Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science (LIMS), La Trobe University, Melbourne, VIC, Australia
AMRITA SINGH • CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi, India
DANIEL SPRAGUE • Flagship Pioneering, Boston, MA, USA; Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Curriculum in Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
LUISA STATELLO • Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
SANTHILAL SUBHASH • Department of Medical Biochemistry and Cell Biology, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
WITTAYA SUWAKULSIRI • Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science (LIMS), La Trobe University, Melbourne, VIC, Australia
HARISH SWAMINATHAN • Whitehead Institute, Cambridge, MA, USA
PHILLIP TOMEZSKO • Program in Virology, Harvard Medical School, Boston, MA, USA; Whitehead Institute, Cambridge, MA, USA
BARBARA USZCZYŃSKA-RATAJCZAK • Institute of Bioorganic Chemistry Polish Academy, Poznan, Poland
ROBERTO VENDRAMIN • Department of Oncology, Laboratory of RNA Cancer Biology, LKI Leuven Cancer Institute, KU Leuven-University of Leuven, Leuven, Belgium; Laboratory for Molecular Cancer Biology, Center for Cancer Biology, VIB, Leuven, Belgium
LIGUO WANG • Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic College of Medicine, Rochester, MN, USA
LI-SAN WANG • Penn Neurodegeneration Genomics Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
KEVIN M. WEEKS • Department of Chemistry, University of North Carolina, Chapel Hill, NC, USA
JONATHAN S. WEISSMAN • Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA, USA; Howard Hughes Medical Institute, University of California, San Francisco, CA, USA; California Institute for Quantitative Biomedical Research, University of California, San Francisco, CA, USA; Center for RNA Systems Biology, University of California, San Francisco, CA, USA
RONG XU • Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science (LIMS), La Trobe University, Melbourne, VIC, Australia
WENTAO XU • Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing, China


TOMOHIRO YAMAZAKI • Graduate School of Frontier Biosciences, Osaka University, Suita, Japan
DE-CHANG YANG • Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, Beijing, China; State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing, China
CHAO ZENG • AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan; Faculty of Science and Engineering, Waseda University, Tokyo, Japan
BOYANG ZHANG • Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Beijing Advanced Innovation Center for Food Nutrition and Human Health, College of Food Science and Nutritional Engineering, China Agricultural University, Beijing, China

Chapter 1

Bioinformatics Approaches for Functional Prediction of Long Noncoding RNAs

Fayaz Seifuddin and Mehdi Pirooznia

Abstract

There is accumulating evidence that long noncoding RNAs (lncRNAs) play crucial roles in biological processes and diseases. In recent years, computational models have been widely used to predict potential lncRNA–disease relations. In this chapter, we systematically describe the computational algorithms and prediction tools that have been developed to elucidate the roles of lncRNAs in diseases, to characterize their coding potential and function, and to ascertain their involvement in critical biological processes. We also provide a comprehensive summary of these applications.

Key words LncRNA functional prediction, LncRNA–disease association, LncRNA–protein correlation, LncRNA coding potential, LncRNA Bioinformatics

1

Introduction Recently, there has been evidence suggesting that long noncoding RNAs (lncRNAs) can participate in various crucial biological processes and can also be used as the most promising biomarkers for the treatment of certain diseases such as coronary artery disease and various cancers [1]. Due to costs and time complexity, the number of possible disease-related lncRNAs that can be verified by biological experiments is very limited. Therefore, in recent years, the widespread use of computational models to predict potential lncRNA–disease associations is common [2, 3]. Developing powerful computational models for potential disease-related lncRNAs identification would benefit biomarker identification and drug discovery for human disease diagnosis, treatment, prognosis, and prevention. We systematically describe various computational algorithms and prediction tools to illuminate the roles of lncRNAs in diseases, and their functional characterization in critical biological processes. In addition, we present a comprehensive summary of the tools with references and links to their GitHub or webpages for implementation. Based on our assessment, these 25 publicly

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_1, © Springer Science+Business Media, LLC, part of Springer Nature 2021


available tools can be classified into three main categories (Table 1): lncRNA–disease (n = 11), lncRNA–protein function (n = 7), and coding potential (n = 4), as well as some tools with hybrid functionality (n = 3). Our description of each tool includes a brief summary of its methods and utility. We conclude by discussing the limitations of these computational methods and future directions of lncRNA function prediction.

2 LncRNA–Disease Association Tools

Tools in this category operate under the assumption of semantic similarity between diseases and lncRNA functions. Several studies have indicated that lncRNAs are highly associated with the occurrence or progression of a wide variety of diseases [4–7]. Tools in this category, such as NBCLDA [8], PADLMHOOI [9], DCSLDA [10], SIMCLDA [11], TPGLDA [12], PMFILDA [13], LLCLPLDA [14], BPLLDA [15], Lnc2Cancer [16], TSSR [17], and ProphTools [18], use lncRNA–disease association information, disease similarity, lncRNA–disease correlation, and other similarity mechanisms to construct a lncRNA–disease heterogeneous network and implement global network similarity-based models [19] to predict potential lncRNA–disease associations (Table 1).

2.1 NBCLDA

Naïve Bayesian Classifier used to predict potential LncRNA–Disease Associations (NBCLDA) [8] is a novel probabilistic model constructed using a global tripartite network by integrating three kinds of heterogeneous networks including an lncRNA–disease association network, an miRNA–disease association network, and an miRNA–lncRNA interaction network. In addition, a quadruple global network is appended to the tripartite network developed above using a more heterogeneous gene–lncRNA interaction network, a gene–disease association network, and a gene–miRNA interaction network. NBCLDA is assembled on these two newly constructed global networks.

2.2 PADLMHOOI

PADLMHOOI [9] proposes to infer potential associations between diseases and lncRNA–miRNA pairs. It consists of four major steps. The first step involves downloading disease–lncRNA, disease–miRNA, and lncRNA–miRNA associations from various databases. Using these associations, bipartite networks are constructed for each association and eventually integrated into a tripartite network of disease, lncRNA, and miRNA. The second step involves measuring similarity between diseases, lncRNAs and miRNAs including functional similarity. The third step involves adding weights between known diseases, lncRNAs, and miRNAs to improve the prediction performance of PADLMHOOI. The fourth step is to decompose the tripartite network and produce scores of the lncRNA–miRNA pairs associated with each disease.

Table 1 Publicly available tools that have been developed to elucidate the roles of lncRNAs in diseases, coding potential/functional characterization, or ascertaining their involvement in critical biological processes

| Methods | Authors | PMID | Year | Description | URL |
|---|---|---|---|---|---|
| Sc-ncDNAPred | He et al. | 30258427 | 2018 | A sequence-based predictor for identifying non-coding DNA in Saccharomyces cerevisiae | http://server.malab.cn/Sc_ncDNAPred/index.jsp |
| FEELnc | Wucher V et al. | 28053114 | 2017 | FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames | https://github.com/tderrien/FEELnc |
| CPAT | Wang L et al. | 23335781 | 2013 | Coding potential assessment tool (CPAT), which rapidly recognizes coding and noncoding transcripts from a large pool of candidates. To this end, CPAT uses a logistic regression model built with four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic, and hexamer usage bias | http://lilab.research.bcm.edu/cpat/ |
| CPC | Lei Kong et al. | 17631615 | 2007 | A support vector machine-based classifier, named coding potential calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features | http://cpc.gao-lab.org/ |
| COME | Hu et al. | 27608726 | 2017 | Uses a supervised machine learning model trained on mRNAs and lncRNAs. It is a robust coding potential calculation tool for lncRNA characterization based on multiple features | https://github.com/lulab/COME |
| PLAIDOH | Pyfrom et al. | 30767760 | 2019 | Predicting LncRNA activity through integrative data-driven ‘omics and heuristics (PLAIDOH). PLAIDOH integrates transcriptome, subcellular localization, enhancer landscape, genome architecture, chromatin interaction, and RNA-binding (eCLIP) data and generates statistically defined output scores | https://github.com/sarahpyfrom/PLAIDOH |
| ILNCSIM | Huang et al. | 27028993 | 2016 | Improved LRLSLDA by integrating disease similarity from LNCSIM and disease Gaussian interaction profile kernel similarity | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5041953/ |
| FMLNCSIM | Chen et al. | 27322210 | 2016 | Fuzzy measure-based lncRNA functional similarity calculation model | https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/27322210/ |
| NBCLDA | Yu et al. | 29986541 | 2018 | A novel probability model for LncRNA–disease association prediction based on the Naïve Bayesian classifier | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6071012/ |
| PADLMHOOI | Xuan et al. | 31191710 | 2019 | Predicting disease-associated LncRNA-MiRNA pairs based on the higher-order orthogonal iteration | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6525924/ |
| DCSLDA | Zhao et al. | 30046354 | 2018 | Predicting disease-lncRNA associations based on the distance correlation set and information of the miRNAs | https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/30046354/ |
| SIMCLDA | Lu et al. | 29718113 | 2018 | Predicting potential lncRNA-disease associations based on inductive matrix completion | https://github.com//bioinfomaticsCSU/SIMCLDA |
| TPGLDA | Ding et al. | 29348552 | 2018 | lncRNA-disease-gene tripartite graph, which integrates gene-disease associations with lncRNA-disease associations | https://github.com/USTC-HIlab/TPGLDA |
| PMFILDA | Xuan et al. | 30744078 | 2019 | Infers potential lncRNA-disease associations based on the probability matrix decomposition | https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/30744078/ |
| LLCLPLDA | Xie G et al. | 31250107 | 2019 | Locality-constrained linear coding label propagation strategy | https://www.ncbi.nlm.nih.gov/pubmed/31250107 |
| BPLLDA | Xiao et al. | 30459803 | 2018 | Based on simple paths with limited lengths in a heterogeneous network | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6232683/ |
| Lnc2Cancer | Gao et al. | 30407549 | 2019 | Querying experimentally supported long non-coding RNAs in human cancers | http://www.bio-bigdata.net/lnc2cancer |
| TSSR | Ou-Yang et al. | 31191605 | 2019 | LncRNA-disease association prediction using two-side sparse self-representation | https://github.com/Oyl-CityU/TSSR |
| ProphTools | Navarro C et al. | 29186475 | 2017 | Based on prioritization tools for heterogeneous biological networks | https://github.com/cnluzon/prophtools |
| RWLPAP | Zhao et al. | 30182833 | 2018 | Random walk for lncRNA-protein associations prediction | https://www.ncbi.nlm.nih.gov/pubmed/30182833 |
| LPI-ETSLP | Hu et al. | 28702594 | 2017 | lncRNA-protein interaction prediction using eigenvalue transformation-based semi-supervised link prediction | https://www.ncbi.nlm.nih.gov/pubmed/28702594 |
| LPI-IBNRA | Xie et al. | 31057602 | 2019 | lncRNA-protein interaction prediction based on improved bipartite network recommender algorithm | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6482170/ |
| LncRRIsearch | Fukunaga et al. | 31191601 | 2019 | Web server for lncRNA-RNA interaction prediction integrated with tissue-specific expression and subcellular localization data | http://rtools.cbrc.jp/LncRRIsearch/ |
| LnCompare | Carlevaro-Fita J et al. | 31147707 | 2019 | Web server that tests for enrichment amongst a panel of quantitative and categorical features | http://www.rnanut.net/lncompare/ |
| ACCBN | Zhu R et al. | 30626319 | 2019 | Ant-Colony-clustering-based bipartite network method for predicting long non-coding RNA-protein interactions | https://github.com/BioMedicalBigDataMiningLabWhu/lncRNA-protein-interaction-prediction |
| SFPEL-LPI | Zhang et al. | 30533006 | 2018 | Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions | http://www.bioinfotech.cn/SFPEL-LPI/ |

2.3 DCSLDA

DCSLDA [10] is a novel model based on similarity of disease pairs and distance correlation to predict potential lncRNA–disease associations. It integrates known lncRNA–miRNA associations and known miRNA–disease associations. In addition, to optimize the prediction performance of DCSLDA, new methods to calculate the similarity of disease–disease pairs and lncRNA–lncRNA pairs are developed simultaneously.

2.4 SIMCLDA

SIMCLDA [11] frames lncRNA–disease association prediction as a recommendation-system machine learning problem. Generally, a recommendation system is an information filtering system that seeks to predict the preference that a user would give to a certain item, given only partially known preference information. The lncRNA–disease association prediction task takes the set of lncRNAs, the set of diseases, and the set of partially known associations between lncRNAs and diseases, then recommends lncRNAs for a given disease, using prior information about lncRNAs and diseases. Similar to the assumption in the user-item recommendation system that users with similar behavior share similar preferences toward items, the lncRNA–disease prediction assumes that functionally similar lncRNAs exhibit similar interaction patterns with diseases.
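The recommendation-system framing can be illustrated with a small neighborhood-based sketch. This is not SIMCLDA's algorithm (Table 1 describes it as based on inductive matrix completion); it only demonstrates the underlying assumption that functionally similar lncRNAs share disease associations. The function name, the toy associations, and the similarity values are all hypothetical.

```python
def recommend_diseases(target, associations, similarity, top_n=3):
    """Rank diseases not yet linked to `target` by similarity-weighted votes.

    associations: dict mapping lncRNA -> set of known diseases
    similarity:   dict mapping frozenset({lncRNA_a, lncRNA_b}) -> similarity in [0, 1]
    """
    votes = {}
    total_similarity = 0.0
    for other, diseases in associations.items():
        if other == target:
            continue
        # Missing pairs are treated as having zero similarity.
        s = similarity.get(frozenset((target, other)), 0.0)
        if s <= 0.0:
            continue
        total_similarity += s
        for disease in diseases:
            votes[disease] = votes.get(disease, 0.0) + s
    known = associations.get(target, set())
    # Normalize votes and drop diseases already associated with the target.
    ranked = sorted(
        ((d, v / total_similarity) for d, v in votes.items() if d not in known),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_n]


# Hypothetical toy data, for illustration only.
associations = {
    "lncA": {"disease1"},
    "lncB": {"disease1", "disease2"},
    "lncC": {"disease3"},
}
similarity = {
    frozenset(("lncA", "lncB")): 0.9,
    frozenset(("lncA", "lncC")): 0.1,
}
recommendations = recommend_diseases("lncA", associations, similarity)
```

In this toy setting, the disease associated with the most similar lncRNA is ranked first, which mirrors the user-item intuition described above.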

2.5 TPGLDA

Tripartite Graph for potential LncRNA–Disease Association identification (TPGLDA) [12] uses a Tripartite Graph to identify potential lncRNA–disease associations. It builds a lncRNA–disease–gene tripartite graph to delineate the heterogeneity of coding–noncoding genes–disease association. Prediction accuracy is improved by incorporating lncRNA expression similarity and disease semantic similarity into the network.

2.6 PMFILDA

PMFILDA [13] is a lncRNA–disease prediction model based on probability matrix decomposition. It constructs three kinds of binary association networks based on experimentally validated lncRNA–miRNA associations, miRNA–disease associations, and lncRNA–disease associations separately. A weighted lncRNA–disease association network is built using the three individual components and updated based on the semantic similarity of disease and the functional similarity of lncRNA.

2.7 LLCLPLDA

LLCLPLDA [14] is based on locality-constrained linear coding (LLC) to predict lncRNA–disease associations. LLC is leveraged to project the features of lncRNAs and diseases to local-constraint features, and then, a label propagation (LP) strategy is used to mix up the initial association matrix and the obtained features of lncRNAs and diseases.


2.8 BPLLDA

BPLLDA [15] infers lncRNA–disease associations by constructing a network comprising the known lncRNA–disease association network, the disease similarity network, and the lncRNA similarity network. It calculates the lncRNA–disease associations based on the lengths of the paths connecting them.
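A path-based score of this kind can be sketched as follows. The sketch enumerates simple paths of limited length between a lncRNA and a disease in a small weighted network and sums their contributions, with longer paths down-weighted. The decay scheme, the function names, and the toy network are assumptions made for illustration; they are not the scoring function published for BPLLDA.

```python
def add_edge(graph, a, b, weight):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight


def simple_paths(graph, start, goal, max_edges):
    """Yield simple paths (no repeated node) from start to goal with at most max_edges edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal and len(path) > 1:
            yield path
            continue
        if len(path) - 1 == max_edges:
            continue
        for neighbor in graph.get(node, {}):
            if neighbor not in path:
                stack.append((neighbor, path + [neighbor]))


def path_score(graph, lncrna, disease, max_edges=3, decay=0.5):
    """Sum over paths of (product of edge weights) * decay^(edges - 1).

    The decay factor is a hypothetical choice that penalizes longer paths.
    """
    total = 0.0
    for path in simple_paths(graph, lncrna, disease, max_edges):
        weight = 1.0
        for a, b in zip(path, path[1:]):
            weight *= graph[a][b]
        total += weight * decay ** (len(path) - 2)
    return total


# Hypothetical toy network: two lncRNAs, two diseases.
toy_graph = {}
add_edge(toy_graph, "lncRNA1", "lncRNA2", 0.8)   # lncRNA similarity
add_edge(toy_graph, "disease1", "disease2", 0.5)  # disease similarity
add_edge(toy_graph, "lncRNA1", "disease1", 1.0)   # known association
add_edge(toy_graph, "lncRNA2", "disease2", 1.0)   # known association

candidate_score = path_score(toy_graph, "lncRNA2", "disease1")
```

Here the unobserved pair (lncRNA2, disease1) receives a nonzero score because two short paths connect it through a similar lncRNA and a similar disease.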

2.9 Lnc2Cancer

Lnc2Cancer [16] is a systematically curated database that contains thousands of associations between human lncRNAs and cancer subtypes, curated by reviewing published literature. It also records associations between lncRNAs and miRNAs, transcription factors (TFs), single nucleotide polymorphisms (SNPs), and methylation. A LncRNA-Cancer_score was developed to evaluate the associations between lncRNA and cancer.

2.10 TSSR

The two-side sparse self-representation (TSSR) model [17] is designed to learn two nonnegative sparse self-representation matrices that capture the intra-similarities among lncRNAs and among diseases, respectively, based on known lncRNA–disease associations. Intra-associations among diseases and among lncRNAs derived from external information can be incorporated to estimate the representation matrices more accurately and thereby improve predictions.

2.11 ProphTools

ProphTools [18] is a lncRNA–disease prioritization algorithm. It performs prioritization on a heterogeneous network combining a Flow Propagation algorithm similar to a Random Walk with Restarts and a weighted propagation method.
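As a point of reference, a generic Random Walk with Restarts on a small network can be sketched as below. This is the textbook iteration, not the ProphTools implementation, whose Flow Propagation algorithm is only described above as similar to it. The function name, the restart probability, and the toy network are assumptions, and every node is assumed to have at least one neighbor.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walker restarting at `seeds`.

    adj:   dict mapping node -> dict of neighbor -> edge weight
           (every node is assumed to have at least one neighbor)
    seeds: set of nodes where the walker restarts
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart component.
        nxt = {n: restart * p0[n] for n in nodes}
        # Propagation component: each node splits its mass among neighbors by edge weight.
        for u in nodes:
            out_weight = sum(adj[u].values())
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / out_weight
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical chain: lncRNA - gene - disease - disease.
chain = {
    "lncA": {"geneX": 1.0},
    "geneX": {"lncA": 1.0, "diseaseY": 1.0},
    "diseaseY": {"geneX": 1.0, "diseaseZ": 1.0},
    "diseaseZ": {"diseaseY": 1.0},
}
scores = random_walk_with_restart(chain, seeds={"lncA"})
```

Candidates can then be prioritized by their scores; in this chain, nodes closer to the seed lncRNA receive higher scores than nodes farther away.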

3 LncRNA–Protein Function Prediction Tools

Tools in this category identify the interactions between proteins and lncRNAs to decipher the functional mechanisms of lncRNAs. A variety of methods have been developed to predict protein–lncRNA interactions, among them Random Walk for lncRNA–Protein Associations Prediction (RWLPAP) [20]; LPI-ETSLP [21], a prediction model based on eigenvalue transformation; LPI-IBNRA [22], which uses expression similarity; LncRRIsearch [23], an interaction prediction system built on a seed-and-extension approach called RIblast [24]; LnCompare [25]; the Ant-Colony-Clustering-Based Bipartite Network (ACCBN) [26]; and the sequence-based feature projection ensemble learning method (SFPEL-LPI) [27].

3.1 RWLPAP

Random Walk for lncRNA–Protein Associations Prediction (RWLPAP) [20] is a network algorithm built on known lncRNA–protein interactions, a lncRNA sequence similarity matrix, and a protein sequence similarity matrix. Compared to traditional machine learning methods, which require negative samples in the training data to measure specificity, RWLPAP uses semisupervised learning strategies to predict unknown associations mainly from known associations and the similarities between them.

3.2 LPI-ETSLP

LPI-ETSLP [21] is a link prediction model based on eigenvalue transformation to predict lncRNA–protein interactions. It integrates known lncRNA–protein association, lncRNA–lncRNA similarity network and protein–protein similarity network to predict potential lncRNA–protein interactions. It is also based on the similarity matrix rather than data features and thus does not require negative samples in the training data to assess the specificity.

3.3 LPI-IBNRA

LPI-IBNRA [22] uses known lncRNA–protein and protein–protein interactions, and lncRNA expression similarity, and then eliminates second-order correlations on the bipartite network appropriately to enhance the prediction accuracy.

3.4 LncRRIsearch

LncRRIsearch [23] is a web server for comprehensive prediction of human and mouse lncRNA–mRNA and lncRNA–lncRNA interactions. It applies RIblast [24], an ultrafast RNA–RNA interaction prediction system based on a seed-and-extension approach, to the human and mouse transcriptomes and provides multiple local base-pairing interactions for each lncRNA–RNA interaction. It integrates tissue-specific RNA expression and subcellular localization data of lncRNAs, which helps to strengthen the accuracy of the predicted interactions.

3.5 LnCompare

LnCompare [25] is a tool that compares lncRNA genes based on a comprehensive feature set with more than 100 attributes covering diverse aspects, including gene structure, nucleotide composition, evolutionary conservation, cell and tissue expression, subcellular localization, tissue specificity, repetitive sequence content, and phenotypic association. It comprises two main modules: (1) gene set feature comparison, which compares two sets of lncRNAs and identifies significantly different features, both categorical and quantitative; and (2) similar gene discovery, which uses user-defined gene(s)-of-interest to query for the set of most similar lncRNAs, based on one or more features.

3.6 ACCBN

The Ant-Colony-Clustering-Based Bipartite Network (ACCBN) [26] method predicts lncRNA–protein interactions using three techniques: (1) each lncRNA is represented as a feature vector and treated as a data point in the feature space; (2) the similarity is enhanced by using the Ant Colony Clustering method; and (3) effective prediction of lncRNA–protein interactions is achieved by applying a lncRNA–protein bipartite network.

3.7 SFPEL-LPI

The sequence-based feature projection ensemble learning (SFPEL-LPI) [27] method predicts lncRNA–protein interactions in two steps: (1) it extracts lncRNA sequence-based features and protein sequence-based features; (2) it calculates multiple lncRNA–lncRNA similarities and protein–protein similarities by using lncRNA sequences, protein sequences, and known lncRNA–protein interactions. It then combines the multiple similarities and multiple features within a feature projection ensemble learning framework.

4 LncRNA Coding Potential Calculators

In recent years, machine learning methods have become a great asset for assessing the protein-coding potential of lncRNAs. These methods use supervised machine learning models trained on mRNAs, with features curated from sequence information, such as open reading frame (ORF) length and coverage, nucleotide composition features, k-mer sequence motifs and codon usage, conservation scores, and secondary structure.

4.1 Sc-ncDNAPred

Sc-ncDNAPred [28] uses a support vector machine (SVM)-based computational algorithm to classify noncoding DNA (ncDNA) sequences in Saccharomyces cerevisiae (S. cerevisiae). Several features, such as mononucleotide composition (MNC), dimer nucleotide composition (DNC), trimer nucleotide composition (TNC), tetramer nucleotide composition (TrNC), pentamer nucleotide composition (PNC), and hexamer nucleotide composition (HNC), are used to identify noncoding DNA.
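The nucleotide composition features listed above are k-mer frequency vectors for k = 1 to 6. A minimal sketch of how such features could be computed is given below; the function names are hypothetical, the normalization is an assumption, and the sketch covers feature extraction only, not the feature selection or SVM training used by the published tool.

```python
from itertools import product


def kmer_composition(seq, k):
    """Return the frequency of every possible k-mer over ACGT, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Windows containing ambiguous bases (e.g., N) are skipped.
        if kmer in counts:
            counts[kmer] += 1
            windows += 1
    return [counts[m] / windows if windows else 0.0 for m in kmers]


def composition_features(seq, max_k=6):
    """Concatenate k-mer compositions for k = 1..max_k (MNC through HNC when max_k is 6)."""
    features = []
    for k in range(1, max_k + 1):
        features.extend(kmer_composition(seq, k))
    return features


# Hypothetical example sequence.
example_features = composition_features("ACGTACGTTTGACCA")
```

With max_k set to 6 the vector has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries, which is why such feature sets are usually reduced before training a classifier.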

4.2 FEELnc

FlExible Extraction of LncRNAs (FEELnc) [29] is a tool developed to annotate lncRNAs from RNA-seq assembled transcripts. It classifies/annotates and calculates the coding potential of lncRNAs. FEELnc annotates lncRNAs based on a machine learning method, Random Forest (RF), trained with general features such as multi k-mer frequencies, RNA sequence length, and open reading frame (ORF) size. It comprises three modules: (1) filter, (2) coding potential, and (3) classifier.

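4.3 CPAT

The Coding-Potential Assessment Tool (CPAT) [30] is an alignment-free program that uses logistic regression to distinguish between coding and noncoding transcripts on the basis of four sequence features: open reading frame size, open reading frame coverage, Fickett TESTCODE statistic, and hexamer usage bias (Table 1).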
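To make the idea of ORF-based features feeding a logistic model concrete, the sketch below computes two simple features (longest ORF size and ORF coverage) and passes them through a logistic function. It is not CPAT: it scans only the forward strand, it omits the other sequence features, and the coefficients are hypothetical placeholders rather than values fitted on real transcripts.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Length in nt of the longest ATG-to-stop ORF on the forward strand (stop codon included).

    ORFs without an in-frame stop codon are ignored in this simplified sketch.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_features(seq):
    """Return (ORF size, ORF coverage), where coverage = ORF size / transcript length."""
    size = longest_orf(seq)
    coverage = size / len(seq) if seq else 0.0
    return size, coverage


def coding_probability(seq, intercept=-3.0, w_size=0.01, w_coverage=4.0):
    """Logistic model on the two features; all coefficients are hypothetical."""
    size, coverage = orf_features(seq)
    z = intercept + w_size * size + w_coverage * coverage
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical transcripts: one with a full-length ORF, one without any ORF.
orf_rich = "ATG" + "GCC" * 50 + "TAA"
orf_poor = "CCCCCCCCCCCC"
```

In practice the coefficients would be estimated from labeled coding and noncoding transcripts, and a probability cutoff would be chosen on held-out data.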

4.4 CPC

Coding Potential Calculator (CPC) [31] is a support vector machine (SVM)-based classifier developed to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features.

4.5 COME

COME [32] (coding potential calculator based on multiple evidences) uses a supervised machine learning model trained on mRNAs and lncRNAs. It is a robust coding potential calculation tool for lncRNA characterization based on multiple features. It integrates multiple sequence-derived and experiment-based features using a decompose–compose method and substantially improves the consistency of prediction results over other coding potential calculators.

5 Hybrid Tools

Hybrid methods combine multiple algorithms and features in order to capture complex correlations at multiple levels. ILNCSIM [33], FMLNCSIM [34], and PLAIDOH [35] are among them, and they all use a combination of features and multiple methods in order to achieve better performance and higher accuracy. Integrating multiple features and approaches will likely reduce technical noise and may better depict the true mechanisms of lncRNA function.

5.1 ILNCSIM

Improved LNCRNA functional SIMilarity calculation model (ILNCSIM) [33] is a lncRNA–disease association predictor that integrates known lncRNA–disease associations and disease Directed Acyclic Graphs (DAGs) and calculates disease similarity by an edge-based calculation model. It computes the functional similarity of two lncRNAs based on the semantic similarity of disease groups associated with these two lncRNAs. To further evaluate the performance of ILNCSIM, the calculated lncRNA functional similarity was used to predict the lncRNA–disease associations by combining ILNCSIM with the model of LRLSLDA [36].

5.2 FMLNCSIM

Fuzzy Measure–based LNCRNA functional SIMilarity calculation model (FMLNCSIM) [34] was designed on the basis that similar diseases tend to be involved with functionally similar lncRNAs and vice versa. The first step involves grouping diseases with similar descriptions/terms, with similarity computed by combining the concepts of information content and fuzzy measure. The second step involves computing the functional similarity of two lncRNAs based on the semantic similarities of their associated disease groups. To further evaluate the performance of FMLNCSIM, the calculated lncRNA functional similarity was used to predict the lncRNA–disease associations by combining FMLNCSIM with the model of LRLSLDA [36].
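The second step, deriving lncRNA functional similarity from the semantic similarity of associated disease groups, can be sketched with a best-match-average scheme. This scheme is an assumption chosen for illustration and is not confirmed here as the exact aggregation used by FMLNCSIM or ILNCSIM; the disease similarity values are hypothetical and would, in the published models, come from the DAG-based or fuzzy-measure-based calculations described above.

```python
def disease_similarity(sim, a, b):
    """Look up a symmetric disease-disease similarity; identical diseases score 1."""
    if a == b:
        return 1.0
    return sim.get((a, b), sim.get((b, a), 0.0))


def lncrna_functional_similarity(diseases_u, diseases_v, sim):
    """Best-match average between the disease groups of two lncRNAs.

    Each disease is matched to its most similar disease in the other group,
    and the best-match scores from both directions are averaged.
    """
    if not diseases_u or not diseases_v:
        return 0.0
    best_u = sum(max(disease_similarity(sim, d, e) for e in diseases_v) for d in diseases_u)
    best_v = sum(max(disease_similarity(sim, d, e) for e in diseases_u) for d in diseases_v)
    return (best_u + best_v) / (len(diseases_u) + len(diseases_v))


# Hypothetical semantic similarity and disease groups.
semantic_sim = {("breast cancer", "ovarian cancer"): 0.6}
group_u = {"breast cancer"}
group_v = {"breast cancer", "ovarian cancer"}
functional_sim = lncrna_functional_similarity(group_u, group_v, semantic_sim)
```

Two lncRNAs with identical disease groups score 1 under this scheme, and the score decreases as their associated diseases become less semantically related.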

5.3 PLAIDOH

Predicting LncRNA Activity through Integrative Data-driven ‘Omics and Heuristics (PLAIDOH) [35] is a recently developed computational method that integrates multiomics data to rank functional connections between individual lncRNA, coding gene, and protein pairs. It combines transcriptome, subcellular localization, enhancer landscape, genome architecture, chromatin interaction, and RNA-binding data, generating statistically defined output scores for ranking lncRNAs.

6 Conclusion

lncRNAs have been described as the “dark matter” of our genome [37]. However, advances in next-generation sequencing (NGS) technology in recent years have identified thousands of lncRNAs as abnormally altered in the cancer genome, differentially expressed in different tissues or conditions, or involved in aberrant biological processes. Despite many challenges, the therapeutic potential of lncRNAs has attracted considerable interest in the past few years. The experimental technologies and approaches to uncover the function of lncRNAs are promising but currently remain expensive and limited. Computational frameworks for predicting lncRNA function have greatly advanced in recent years. While they are extremely useful for uncovering lncRNA function, the staggering complexity of the lncRNA transcriptome remains challenging in many cases and requires further investigation [38]. Here, we systematically reviewed various computational algorithms and prediction tools that have been developed to elucidate the roles of lncRNAs. We hope this review will assist researchers and scientists in understanding methodologies for deciphering lncRNA functions and provide suggestions for new directions in future development.

References

1. DiStefano JK (2018) The emerging role of long noncoding RNAs in human disease. Methods Mol Biol 1706:91–110
2. Jalali S, Kapoor S, Sivadas A, Bhartiya D, Scaria V (2015) Computational approaches towards understanding human long non-coding RNA biology. Bioinformatics 31(14):2241–2251
3. Zhang Y, Huang H, Zhang D, Qiu J, Yang J, Wang K, Zhu L, Fan J, Yang J (2017) A review on recent computational methods for predicting noncoding RNAs. Biomed Res Int 2017:9139504
4. Lalevee S, Feil R (2015) Long noncoding RNAs in human disease: emerging mechanisms and therapeutic strategies. Epigenomics 7(6):877–879
5. Wapinski O, Chang HY (2011) Long noncoding RNAs and human disease. Trends Cell Biol 21(6):354–361
6. Bao Z, Yang Z, Huang Z, Zhou Y, Cui Q, Dong D (2019) LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases. Nucleic Acids Res 47(D1):D1034–D1037
7. Chen X, Sun YZ, Guan NN, Qu J, Huang ZA, Zhu ZX, Li JQ (2019) Computational models for lncRNA function prediction and functional similarity calculation. Brief Funct Genomics 18(1):58–82
8. Yu J, Ping P, Wang L, Kuang L, Li X, Wu Z (2018) A novel probability model for LncRNA-disease association prediction based on the naive Bayesian classifier. Genes (Basel) 9(7):345
9. Xuan Z, Feng X, Yu J, Ping P, Zhao H, Zhu X, Wang L (2019) A novel method for predicting disease-associated LncRNA-MiRNA pairs based on the higher-order orthogonal iteration. Comput Math Methods Med 2019:7614850
10. Zhao H, Kuang L, Wang L, Xuan Z (2018) A novel approach for predicting disease-lncRNA associations based on the distance correlation set and information of the miRNAs. Comput Math Methods Med 2018:6747453
11. Lu C, Yang M, Luo F, Wu FX, Li M, Pan Y, Li Y, Wang J (2018) Prediction of lncRNA-disease associations based on inductive matrix completion. Bioinformatics 34(19):3357–3364
12. Ding L, Wang M, Sun D, Li A (2018) TPGLDA: novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph. Sci Rep 8(1):1065
13. Xuan Z, Li J, Yu J, Feng X, Zhao B, Wang L (2019) A probabilistic matrix factorization method for identifying lncRNA-disease associations. Genes (Basel) 10(2):126
14. Xie G, Huang S, Luo Y, Ma L, Lin Z, Sun Y (2019) LLCLPLDA: a novel model for predicting lncRNA-disease associations. Mol Gen Genomics 294(6):1477–1486
15. Xiao X, Zhu W, Liao B, Xu J, Gu C, Ji B, Yao Y, Peng L, Yang J (2018) BPLLDA: predicting lncRNA-disease associations based on simple paths with limited lengths in a heterogeneous network. Front Genet 9:411
16. Gao Y, Wang P, Wang Y, Ma X, Zhi H, Zhou D, Li X, Fang Y, Shen W, Xu Y et al (2019) Lnc2Cancer v2.0: updated database of experimentally supported long non-coding RNAs in human cancers. Nucleic Acids Res 47(D1):D1028–D1033
17. Ou-Yang L, Huang J, Zhang XF, Li YR, Sun Y, He S, Zhu Z (2019) LncRNA-disease association prediction using two-side sparse self-representation. Front Genet 10:476
18. Navarro C, Martinez V, Blanco A, Cano C (2017) ProphTools: general prioritization tools for heterogeneous biological networks. Gigascience 6(12):1–8
19. Chen X, Yan CC, Zhang X, You ZH (2017) Long non-coding RNAs and complex diseases: from experimental results to computational models. Brief Bioinform 18(4):558–576
20. Zhao Q, Liang D, Hu H, Ren G, Liu H (2018) RWLPAP: random walk for lncRNA-protein associations prediction. Protein Pept Lett 25(9):830–837
21. Hu H, Zhu C, Ai H, Zhang L, Zhao J, Zhao Q, Liu H (2017) LPI-ETSLP: lncRNA-protein interaction prediction using eigenvalue transformation-based semi-supervised link prediction. Mol BioSyst 13(9):1781–1787
22. Xie G, Wu C, Sun Y, Fan Z, Liu J (2019) LPI-IBNRA: long non-coding RNA-protein interaction prediction based on improved bipartite network recommender algorithm. Front Genet 10:343
23. Fukunaga T, Iwakiri J, Ono Y, Hamada M (2019) LncRRIsearch: a web server for lncRNA-RNA interaction prediction integrated with tissue-specific expression and subcellular localization data. Front Genet 10:462
24. Fukunaga T, Hamada M (2017) RIblast: an ultrafast RNA-RNA interaction prediction system based on a seed-and-extension approach. Bioinformatics 33(17):2666–2674
25. Carlevaro-Fita J, Liu L, Zhou Y, Zhang S, Chouvardas P, Johnson R, Li J (2019) LnCompare: gene set feature analysis for human long non-coding RNAs. Nucleic Acids Res 47(W1):W523–W529
26. Zhu R, Li G, Liu JX, Dai LY, Guo Y (2019) ACCBN: ant-colony-clustering-based bipartite network method for predicting long non-coding RNA-protein interactions. BMC Bioinformatics 20(1):16
27. Zhang W, Yue X, Tang G, Wu W, Huang F, Zhang X (2018) SFPEL-LPI: sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput Biol 14(12):e1006616
28. He W, Ju Y, Zeng X, Liu X, Zou Q (2018) Sc-ncDNAPred: a sequence-based predictor for identifying non-coding DNA in Saccharomyces cerevisiae. Front Microbiol 9:2174
29. Wucher V, Legeai F, Hedan B, Rizk G, Lagoutte L, Leeb T, Jagannathan V, Cadieu E, David A, Lohi H et al (2017) FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome. Nucleic Acids Res 45(8):e57
30. Wang L, Park HJ, Dasari S, Wang S, Kocher JP, Li W (2013) CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res 41(6):e74
31. Kong L, Zhang Y, Ye ZQ, Liu XQ, Zhao SQ, Wei L, Gao G (2007) CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 35(Web Server issue):W345–W349
32. Hu L, Xu Z, Hu B, Lu ZJ (2017) COME: a robust coding potential calculation tool for lncRNA identification and characterization based on multiple features. Nucleic Acids Res 45(1):e2
33. Huang YA, Chen X, You ZH, Huang DS, Chan KC (2016) ILNCSIM: improved lncRNA functional similarity calculation model. Oncotarget 7(18):25902–25914
34. Chen X, Huang YA, Wang XS, You ZH, Chan KC (2016) FMLNCSIM: fuzzy measure-based lncRNA functional similarity calculation model. Oncotarget 7(29):45948–45958
35. Pyfrom SC, Luo H, Payton JE (2019) PLAIDOH: a novel method for functional prediction of long non-coding RNAs identifies cancer-specific LncRNA activities. BMC Genomics 20(1):137
36. Chen X, Yan GY (2013) Novel human lncRNA-disease association inference based on lncRNA expression profiles. Bioinformatics 29(20):2617–2624
37. Hu X, Sood AK, Dang CV, Zhang L (2018) The role of long noncoding RNAs in cancer: the dark matter matters. Curr Opin Genet Dev 48:8–15
38. Cao H, Wahlestedt C, Kapranov P (2018) Strategies to annotate and characterize long noncoding RNAs: advantages and pitfalls. Trends Genet 34(9):704–721

Chapter 2

Visualization of lncRNA and mRNA Structure Models Within the Integrative Genomics Viewer

Steven Busan and Kevin M. Weeks

Abstract

Every class of RNA forms base-paired structures that impact biological functions. Chemical probing of RNA structure, especially with the advent of strategies such as SHAPE-MaP, vastly expands the scale and quantitative accuracy over which RNA structure can be examined. These methods have enabled large-scale structural studies of mRNAs and lncRNAs, but the length and complexity of these RNAs make interpretation of the data challenging. We have created modules available through the open-source Integrative Genomics Viewer (IGV) for straightforward visualization of RNA structures along with complementary experimental data. Here we present detailed and stepwise strategies for exploring and visualizing complex RNA structures in IGV. Individuals can use these instructions and supplied sample data to become adept at using IGV to visualize RNA structure models in conjunction with useful allied information.

Key words: RNA structure, lncRNA, mRNA, Integrative Genomics Viewer, SHAPE-MaP, Base pairing

1 Introduction

RNA, like DNA, forms base-paired structures, as was first conclusively demonstrated for a simple synthetic helix in 1956 [1] and for tRNA in 1974 [2]. The secondary structures of RNAs of every class (including tRNA, rRNA, sRNA, miRNA, mRNA, and lncRNA) have been implicated in the biological functions of these RNAs [3–5]. Chemical or enzymatic probing of RNA in conjunction with thermodynamically-informed structure modeling has a long and successful history of defining structure models for RNAs not amenable to crystallization and for RNAs under biologically relevant or experimentally varied solution and cellular conditions [6]. Advances in RNA structure probing technologies such as SHAPE-MaP have facilitated studies of complex mRNAs [7], lncRNAs [8], and viral RNAs [9, 10]. These RNA molecules are often thousands of nucleotides in length, presenting notable challenges in data visualization and interpretability.

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_2, © Springer Science+Business Media, LLC, part of Springer Nature 2021


To enable efficient examination of the structures of long RNAs, we have created visualization modules for the Integrative Genomics Viewer (IGV). IGV is cross-platform and open-source software that supports visualization of diverse experimental data, especially from studies using arrays and high-throughput sequencing readout strategies [11]. IGV was developed to flexibly display genomic, clinical, and experimental data with an emphasis on integrative and interactive analyses. The software allows for visual comparison of many types of experiments and is quite responsive to user interaction. We previously added several functionalities to IGV that enable exploration of RNA structure models and base-pairing probabilities, scaling easily from visualization of the entirety of long RNAs to focused examination of individual helices [12]. Base pairs are conveniently rendered as arcs. Here we present stepwise instructions and general recommendations for visualizing RNA structure models and associated data in IGV (see Note 1). We provide three example datasets derived from recent studies in the Supporting Information. Researchers can use these instructions and the sample data to quickly and efficiently become adept at interrogating RNA structural information in conjunction with a variety of complementary information. Due to the wide availability of next-generation sequencing, SHAPE-MaP probing strategies can be readily employed by diverse nonexpert laboratories. The visualization tools described here facilitate making RNA structure analysis a routine component of examining diverse biological systems.

2 Methods

2.1 Collect Files in Appropriate Formats

Download and extract the provided “busan_rna_igv_vis_2019_SI.zip” to use the sample data files with this tutorial, or prepare files from your own sources.

1. Transcript sequence (required).
   (a) Text file with an .fa extension in FASTA format.
   (b) The first line of the file should begin with the “>” (greater-than) character, followed directly by a sequence name or ID with no special characters or spaces.
   (c) Remaining lines should contain the nucleotide sequence without spaces.
2. Chemical reactivity profiles (use either of the following two file formats).
   (a) .shape file: a tab-delimited text file with two columns. The first column is the nucleotide position, starting with 1. The second column is the normalized SHAPE reactivity; no-data positions are set to -999.
   (b) .map file: same format as .shape, but with two additional columns. The third column is the standard error and the fourth column is the nucleotide sequence. ShapeMapper2 software outputs .map files containing chemical probing reactivities calculated using mutational profiling. ShapeMapper2 and associated documentation are available at https://github.com/Weeks-UNC/shapemapper2.
3. Base-pairing (secondary) structure model and/or estimated base-pairing probabilities.
   (a) A commonly used format for defining a base-pairing model has the extension .ct. These files are produced by the Fold module of RNAstructure.
   (b) Dot-bracket (.db, .dbn) files are also supported. These are most commonly used for small hand-edited structures or alongside multiple sequence alignments in other software packages.
   (c) Pairing probabilities are calculated by the partition and ProbabilityPlot modules of RNAstructure. These files have the extension .dp.
   (d) For file format reference, see https://software.broadinstitute.org/software/igv/RNAsecStructure.
   (e) Generation of RNA structure model files is not covered in detail here, since this method is focused on graphical exploration. For long RNAs, we recommend using Superfold (available at https://github.com/Weeks-UNC/Superfold), which automates the process of performing structure modeling over computationally manageable windows and merging the resulting structures. Superfold accepts a .map file as input and produces both a .ct file, containing a single predicted minimum free energy structure, and a .dp file, containing estimated base-pairing probabilities.
4. Annotations such as gene coding regions, repeat sequences, or sites of known function (optional).
   (a) The .gff3 file format is convenient for most uses and is readily hand-edited. See the included examples in “busan_rna_igv_vis_2019_SI.zip” and further documentation at https://software.broadinstitute.org/software/igv/GFF.
   (b) Important: Ensure that the names listed in the first column of the .gff3 file match the name given in the first line of the FASTA file and not the FASTA filename or other text.
5. One or more linear profiles from complementary experiments or computational analyses (this list is not exhaustive, see Note 1). These data are not required, but can substantially enrich RNA structure analyses.
   (a) Protein-binding enrichment data (e.g., CLIP or RIP data).
   (b) GC-content median over fixed windows.
   (c) SHAPE reactivity median over fixed windows.
   (d) Estimated per-nucleotide or median Shannon entropy over fixed windows.
   (e) Common file formats are .wig, .bedgraph, and .tdf (see https://software.broadinstitute.org/software/igv/FileFormats).
   (f) Important: Ensure that the sequence names listed in the first column of a .wig or .bedgraph file match the name given in the first line of the FASTA file (not including the “>” character) and not the FASTA filename.

2.2 Load and Import Files into IGV
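Before loading files, it can help to prepare a windowed profile of the kind listed in Subheading 2.1, step 5. The following is a minimal Python sketch, not part of the supplied tools: it reads the sequence name from a FASTA header, parses two-column .shape text, and writes the median SHAPE reactivity over fixed, non-overlapping windows as bedGraph lines. The function names, the window size, and the rule that values below -500 count as no-data are assumptions made for illustration.

```python
import statistics

# Assumption: any value below this cutoff (e.g. -999) marks a no-data position.
NO_DATA_CUTOFF = -500.0


def read_fasta_name(fasta_text):
    """Return the sequence name from the first FASTA line, without the '>'."""
    first_line = fasta_text.splitlines()[0]
    if not first_line.startswith(">"):
        raise ValueError("FASTA text must start with '>'")
    return first_line[1:].split()[0]


def parse_shape(shape_text):
    """Parse .shape (or .map) text into (position, reactivity or None) tuples."""
    profile = []
    for line in shape_text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        position, value = int(fields[0]), float(fields[1])
        profile.append((position, None if value < NO_DATA_CUTOFF else value))
    return profile


def windowed_median_bedgraph(seq_name, profile, window=5):
    """Return bedGraph lines holding the median reactivity of each window."""
    lines = []
    for start in range(0, len(profile), window):
        chunk = profile[start:start + window]
        values = [v for _, v in chunk if v is not None]
        if not values:
            continue  # skip windows that contain no data at all
        # bedGraph coordinates are 0-based and half-open.
        chrom_start = chunk[0][0] - 1
        chrom_end = chunk[-1][0]
        median = statistics.median(values)
        lines.append(f"{seq_name}\t{chrom_start}\t{chrom_end}\t{median:.4f}")
    return lines


if __name__ == "__main__":
    name = read_fasta_name(">rna1\nACGUACGUAC\n")
    shape = "1\t0.1\n2\t0.5\n3\t-999\n4\t0.3\n5\t0.9\n6\t0.7\n"
    print("\n".join(windowed_median_bedgraph(name, parse_shape(shape), window=3)))
```

Taking the first bedGraph column from the FASTA header, rather than typing it by hand, is one way to satisfy the name-matching requirement stated in step 5(f).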

1. Download IGV (available at https://software.broadinstitute.org/software/igv/download) and launch.
2. Load nucleotide sequence.
   (a) Click “Genomes” in the menu bar and select “Load Genome from File”.
   (b) Select the FASTA file (“E_coli/sequence.fa” if using the example dataset, or your own .fa file) and click “Open”.
3. Load SHAPE reactivity profiles, base-pairing probabilities, and annotations.
   (a) Click “File” in the menu bar and select “Load from File”.
   (b) Select “E_coli/SHAPE_reactivity.map”, “E_coli/base_pairing_probability.dp”, and “E_coli/gene_annotations.gff3” if following along with the example dataset, or select your own files, and click “Open”.
   (c) Click “Continue” in any popup dialog boxes that appear (see Note 2).

2.3 Adjust Track Display

Steps listed here are optional and specific values given are suggestions. Users should adjust settings for comfortable display for their particular screen size, platform, and dataset.

1. Set view preferences.
   (a) Click “View,” “Preferences,” “General.”
   (b) Check “Display all tracks in a single panel.”
   (c) Uncheck “Show attributes panel.”
2. Rename tracks.
   (a) Right-click track, select “Rename track,” and enter a descriptive name.
   (b) For the example dataset, replace “SHAPE_reactivity.shape.wig” with “SHAPE reactivity” and “base_pairing_probability.dp.bp” with “Pairing probability.”
3. If working with a higher resolution display, consider adjusting track name size settings.
   (a) Shift-left-click and drag track names to select all tracks. Right-click track or track names, select “Change font size,” and increase the value to 16 (see Note 3).
   (b) If track names are cut off or abbreviated, click “View,” “Set Name Panel Width” and set to a larger value.
   (c) The default font size can be changed by clicking “View,” “Preferences,” “General,” then clicking “Change” next to “Default font.” This will only affect tracks or files loaded in the future.
4. Reorder tracks by left-clicking and dragging track names.
   (a) Drag gene annotations track directly above other tracks.
   (b) Drag SHAPE reactivity profile above base-pairing probability arcs.
   (c) If zoomed in far enough to see individual nucleotide identities, it can be useful to move the sequence track directly above base-pairing arcs to visualize complementary pairs and the sequences of unpaired regions.
5. Adjust SHAPE profile track range.
   (a) Right-click reactivity profile track, and select “Set Data Range.”
   (b) Set “Min” and “Mid” values to 0 and “Max” value to 3.
6. Widen pairing probability arc track.
   (a) Right-click on the arc track, select “Change Track Height.”
   (b) Set to a larger value such as 100.


2.4 Examine Functional Sites in Example Biologically Important RNAs

2.4.1 E. coli mRNA Gene Translation Start Sites

The provided example of an E. coli transcript is notable in that it contains two nonribosomal protein genes, rimM and trmD (encoding a ribosome maturation factor and a tRNA methyltransferase, respectively), that are located between two ribosomal protein-coding genes, rpsP and rplS (encoding S16 and L19, respectively) [13] (Fig. 1a). The ribosomal proteins encoded by rpsP and rplS are translated at high levels; in contrast, the rimM and trmD gene products are translated at lower levels. In addition, the translation rates of rimM and trmD are largely uncoupled from those of the surrounding genes [14]. Examining the structures around the translation start sites of each gene provides clues to explain these differences.

1. Zoom in on the start codon regions of rplS and rimM. Use any of the following:
   (a) Click and drag to select a range in the ruler (as shown in Fig. 1a).
   (b) Click the “+” button in the upper right area of the toolbar several times and drag the track window to scroll.
   (c) Double-click on an annotation graphic several times.
   (d) Enter a gene, annotation name, or numeric range in the text box.

Note the differing structural contexts of the translation start sites of rplS and rimM. In particular, the region surrounding the AUG codon in rplS is unstructured, as evidenced by high SHAPE reactivities and a lack of highly probable base pair arcs in structure models (Fig. 1b). This lack of structure near the start codon likely makes the site highly accessible to ribosomes, allowing translation initiation in the absence of a Shine–Dalgarno sequence [15].

2.4.2 Murine LHR mRNA Sequence Motifs

In mice, ZFP36L2, a zinc finger protein, regulates expression of the luteinizing hormone receptor (LHR) mRNA during oocyte maturation [16]. ZFP36L2 is a member of a class of zinc finger–containing proteins that bind RNA targets containing the sequence motif “AUUUA,” termed adenine–uridine-rich elements (AREs) [17]. Surprisingly, gel-shift assays revealed that ZFP36L2 bound only one of the three AREs present within the LHR 3′ untranslated region [16], raising the intriguing possibility that the RNA structural context of these sequence motifs influences protein binding.

1. Examine the nucleotide sequence of a structure model (data from ref. 18).
   (a) Load the provided LHR sequence, SHAPE reactivity data, structure model, and annotations from the supporting files folder “LHR”, as in Subheading 2.2 (see Fig. 2a).
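For transcripts without a prepared annotation file, motif sites such as the “AUUUA” elements discussed above can be located with a short script and loaded as an annotation track. The following Python sketch is one hedged way to do this; the function names, the “motif_scan” source label, and the feature naming scheme are hypothetical choices, and the sequence name passed in must match the first line of the FASTA file as required in Subheading 2.1.

```python
def find_motif(sequence, motif="AUUUA"):
    """Return 1-based (start, end) positions of each motif hit, overlaps included."""
    seq = sequence.upper().replace("T", "U")  # accept DNA or RNA alphabets
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append((pos + 1, pos + len(motif)))
        pos = seq.find(motif, pos + 1)  # step by one so overlapping hits are kept
    return hits


def motif_gff3_lines(seq_name, sequence, motif="AUUUA", label="ARE"):
    """Build GFF3 lines (nine tab-separated columns) for every motif occurrence."""
    lines = ["##gff-version 3"]
    for n, (start, end) in enumerate(find_motif(sequence, motif), 1):
        columns = [
            seq_name,            # must match the FASTA sequence name
            "motif_scan",        # source (hypothetical label)
            "sequence_feature",  # feature type
            str(start),
            str(end),
            ".",                 # score
            "+",                 # strand
            ".",                 # phase
            f"Name={label}{n}",
        ]
        lines.append("\t".join(columns))
    return lines


if __name__ == "__main__":
    print("\n".join(motif_gff3_lines("example_rna", "GGAUUUAUUUACC")))
```

A sequence scan of this kind only lists candidate sites. Whether a given site is structured, and therefore possibly inaccessible to protein, still has to be judged from the SHAPE reactivity and structure model tracks.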


Fig. 1 Visualization of E. coli mRNA translation start sites. (a) Full view of an E. coli polycistronic transcript showing gene boundaries, SHAPE reactivity profile, and modeled base-pairing probabilities. Pairing probability is indicated by arc color: green, >80% probability; blue, 30–80%; yellow, 10–30%. High SHAPE reactivities in the blue shaded region are indicative of an unstructured region around the rplS translation start site. (b) Zoomed view of the unstructured region surrounding the start codon of rplS, showing highly SHAPE-reactive positions modeled as unpaired (corresponding to no arcs indicative of base pairing). (c) Zoomed view of the start codon and surrounding region of rimM, showing positions with low SHAPE reactivities and the corresponding well-determined base-pairing structure model. (Data from ref. 13)


Fig. 2 Visualization of structures around AREs in LHR mRNA. (a) Full view of the LHR transcript showing SHAPE reactivity profile and base-pairing structure model. Annotations include an open reading frame (blue bar labeled ORF) and three AREs (highlighted in green and red). (b) Zoomed view of the region of two AREs, one functional and one nonfunctional based on binding assays with ZFP36L2 protein [16]. The arcs in the Structure model track, which indicate base pairs, suggest that the upstream ARE is highly structured, whereas the downstream ARE is not. (c) ARE structure rendered as a planar graph using the StructureEditor component of RNAstructure. SHAPE reactivities are indicated by color: red, reactivity >0.85; orange, 0.4–0.84; black, […]

Fig. 3 […] 80% probability; blue, 30–80%; yellow, 10–30%. The abundance of medium- and low-probability base-pairing arcs (in blue and yellow) suggests that no single structure (including the minimum free energy structure shown with black arcs) predominates in this region and that, instead, an ensemble of RNA structural states is present. (Data from ref. 8)

For functionally important regions of an RNA, there is often a single, thermodynamically stable secondary structure. Examples of such well-defined secondary structures include ligand-bound riboswitches, the bacterial 16S rRNA, or the stable base-paired structures overlapping the rimM gene (Fig. 1; green arcs). In contrast, some RNAs instead adopt a family (or ensemble) of structures. Modeling RNA structural ensembles remains an important experimental and computational frontier, but the visualization of estimated pairing probabilities is a useful approach that begins to address variability within populations of folded RNA molecules.

1. Examine base-pairing probabilities (data from ref. 8).
   (a) Load the provided murine Xist sequence, SHAPE reactivity data, structure model, base-pairing probabilities, and repeat region annotations from the supporting files folder “Xist,” as in Subheading 2.2.
   (b) Zoom in on repeat A as in Fig. 3.

Although the minimum free energy structure model, by definition, displays a single secondary structure for Xist repeat region A (Fig. 3; black arcs), the pairing probability arcs show multiple overlapping low- and medium-probability helices (Fig. 3; blue and yellow arcs). These data support a model in which this region of the Xist RNA does not have a well-defined secondary structure overall. A possible pseudoknotted structure is evident in the “Structure model” track as overlapping arcs, highlighted with an orange arrow in Fig. 3.
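The visual judgment described above, comparing high-probability arcs with medium- and low-probability ones, can also be tallied numerically for a region of interest. The Python sketch below bins base pairs into the color classes given in the Fig. 1 caption (green, >80%; blue, 30–80%; yellow, 10–30%). It assumes that each pair is supplied as a tuple (i, j, -log10 of the pairing probability); that input layout, the function names, and the final majority rule are illustrative assumptions rather than anything IGV itself computes.

```python
def probability_bin(probability):
    """Map a pairing probability to the arc color classes of the Fig. 1 caption."""
    if probability > 0.80:
        return "green"
    if probability >= 0.30:
        return "blue"
    if probability >= 0.10:
        return "yellow"
    return None  # below 10%: ignored in this sketch


def summarize_pairs(pairs, start, end):
    """Count color classes for pairs (i, j, neg_log10_p) lying within [start, end]."""
    counts = {"green": 0, "blue": 0, "yellow": 0}
    for i, j, neg_log10_p in pairs:
        if start <= i and j <= end:
            color = probability_bin(10 ** (-neg_log10_p))
            if color:
                counts[color] += 1
    return counts


def looks_like_ensemble(counts):
    """Heuristic (an assumption): low/medium-probability pairs outnumber high ones."""
    return counts["blue"] + counts["yellow"] > counts["green"]


if __name__ == "__main__":
    # Hypothetical pairs: (5' position, 3' position, -log10 probability).
    example = [(1, 20, 0.02), (2, 19, 0.4), (3, 18, 0.8), (30, 40, 0.01)]
    region_counts = summarize_pairs(example, 1, 25)
    print(region_counts, looks_like_ensemble(region_counts))
```

Such counts are only a coarse summary; the arc display remains the better guide to which alternative helices compete within a region.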

3 Notes

1. This brief report is focused on the visualization of RNA structure probing data and structure models and is not a comprehensive guide to IGV. For general guides and documentation to IGV, see the following: https://software.broadinstitute.org/software/igv/UserGuide and https://software.broadinstitute.org/software/igv/FileFormats.
2. File conversion popup dialog.
   (a) The .ct, .map, and .shape file formats do not contain sequence name, strand, or nucleotide offset position. Therefore, upon import, IGV converts these files into file formats that contain this information, without overwriting the original input files.
   (b) The default settings in the file conversion popup dialog cover the most common RNA structure exploration scenario, that is, one transcript sequence and chemical reactivity profiles corresponding to this sequence. If examining a short gene-specific primer amplicon (see ref. 23) within a larger sequence, users may need to manually adjust the beginning and end positions of reactivity profiles or structures.
3. The IGV user interface may appear unusably small on some newer high-resolution displays. Recent Java 11 builds of IGV provide support for high-resolution displays (see https://software.broadinstitute.org/software/igv/download).
4. The IGV modules discussed here visualize secondary structures as arc diagrams. Rendering traditional RNA secondary structure figures (sometimes referred to as airport, planar, or tree diagrams) requires additional software. Commonly used packages include VARNA, Ribosketch, RNAstructure StructureEditor, and XRNA. See informal discussion at https://github.com/Weeks-UNC/shapemapper2.

References

1. Rich A, Davies DR (1956) A new two stranded helical structure: polyadenylic acid and polyuridylic acid. J Am Chem Soc 78:3548–3549. https://doi.org/10.1021/ja01595a086
2. Kim SH, Suddath FL, Quigley GJ et al (1974) Three-dimensional tertiary structure of yeast phenylalanine transfer RNA. Science 185:435–440. https://doi.org/10.1126/science.185.4149.435
3. Eddy SR (2001) Non-coding RNA genes and the modern RNA world. Nat Rev Genet 2:919–929. https://doi.org/10.1038/35103511
4. Parker BJ, Moltke I, Roth A et al (2011) New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes. Genome Res 21:1929–1943. https://doi.org/10.1101/gr.112516.110
5. Sonenberg N, Hinnebusch AG (2009) Regulation of translation initiation in eukaryotes: mechanisms and biological targets. Cell 136:731–745. https://doi.org/10.1016/j.cell.2009.01.042
6. Mailler E, Paillart J-C, Marquet R et al (2018) The evolution of RNA structural probing methods: from gels to next-generation sequencing. Wiley Interdiscip Rev RNA 10:e1518. https://doi.org/10.1002/wrna.1518
7. Corley M, Solem A, Phillips G et al (2017) An RNA structure-mediated, posttranscriptional model of human α-1-antitrypsin expression. Proc Natl Acad Sci U S A 114:E10244–E10253
8. Smola MJ, Christy TW, Inoue K et al (2016) SHAPE reveals transcript-wide interactions, complex structural domains, and protein interactions across the Xist lncRNA in living cells. Proc Natl Acad Sci U S A 113:10322–10327
9. Dethoff EA, Boerneke MA, Gokhale NS et al (2018) Pervasive tertiary structure in the dengue virus RNA genome. Proc Natl Acad Sci U S A 115:11513–11518. https://doi.org/10.1073/pnas.1716689115
10. Dadonaite B, Barilaite E, Fodor E et al (2017) The structure of the influenza A virus genome. Nat Microbiol 4(11):1781–1789. https://doi.org/10.1101/236620
11. Robinson JT, Thorvaldsdóttir H, Winckler W et al (2011) Integrative genomics viewer. Nat Biotechnol 29:24–26. https://doi.org/10.1038/nbt.1754
12. Busan S, Weeks KM (2017) Visualization of RNA structure models within the integrative genomics viewer. RNA 23:1012–1018. https://doi.org/10.1261/rna.060194.116
13. Mustoe AM, Busan S, Rice GM et al (2018) Pervasive regulatory functions of mRNA structure revealed by high-resolution SHAPE probing. Cell 173:181–195.e18. https://doi.org/10.1016/j.cell.2018.02.034
14. Wikström PM, Björk GR (1988) Noncoordinate translation-level regulation of ribosomal and nonribosomal protein genes in the Escherichia coli trmD operon. J Bacteriol 170:3025–3031. https://doi.org/10.1128/jb.170.7.3025-3031.1988
15. Scharff LB, Childs L, Walther D, Bock R (2011) Local absence of secondary structure permits translation of mRNAs that lack ribosome-binding sites. PLoS Genet 7:e1002155. https://doi.org/10.1371/journal.pgen.1002155
16. Ball CB, Rodriguez KF, Stumpo DJ et al (2014) The RNA-binding protein, ZFP36L2, influences ovulation and oocyte maturation. PLoS One 9:e97324. https://doi.org/10.1371/journal.pone.0097324
17. Lai WS, Carballo E, Thorn JM et al (2000) Interactions of CCCH zinc finger proteins with mRNA. J Biol Chem 275:17827–17837. https://doi.org/10.1074/jbc.m001696200
18. Ball CB, Solem AC, Meganck RM et al (2017) Impact of RNA structure on ZFP36L2 interaction with luteinizing hormone receptor mRNA. RNA 23:1209–1223. https://doi.org/10.1261/rna.060467.116
19. Gendrel A-V, Heard E (2014) Noncoding RNAs and epigenetic mechanisms during X-chromosome inactivation. Annu Rev Cell Dev Biol 30:561–580. https://doi.org/10.1146/annurev-cellbio-101512-122415
20. Chu C, Zhang QC, da Rocha ST et al (2015) Systematic discovery of Xist RNA binding proteins. Cell 161:404–416. https://doi.org/10.1016/j.cell.2015.03.025
21. Sunwoo H, Colognori D, Froberg JE et al (2017) Repeat E anchors Xist RNA to the inactive X chromosomal compartment through CDKN1A-interacting protein (CIZ1). Proc Natl Acad Sci U S A 114:10654–10659. https://doi.org/10.1073/pnas.1711206114
22. Sarma K, Levasseur P, Aristarkhov A, Lee JT (2010) Locked nucleic acids (LNAs) reveal sequence requirements and kinetics of Xist RNA localization to the X chromosome. Proc Natl Acad Sci U S A 107:22196–22201. https://doi.org/10.1073/pnas.1009785107
23. Smola MJ, Rice GM, Busan S et al (2015) Selective 2′-hydroxyl acylation analyzed by primer extension and mutational profiling (SHAPE-MaP) for direct, versatile and accurate RNA structure analysis. Nat Protoc 10:1643–1669

Chapter 3

RNA Coding Potential Prediction Using Alignment-Free Logistic Regression Model

Ying Li and Liguo Wang

Abstract

CPAT (Coding-Potential Assessment Tool) is a logistic regression model–based classifier that can accurately and quickly distinguish protein-coding and noncoding RNAs using pure linguistic features calculated from the RNA sequences. CPAT takes as input the nucleotide sequences or genomic coordinates of RNAs and outputs the probabilities p (0 ≤ p ≤ 1), which measure the likelihood of protein coding. Users can run CPAT online (http://lilab.research.bcm.edu/cpat/) or on their local computers after installation. CPAT provides prebuilt logistic models to recognize RNAs originating from the human (Homo sapiens), mouse (Mus musculus), zebrafish (Danio rerio), and fly (Drosophila melanogaster) genomes. Instructions on how to train models for other genomes are described on the CPAT website (http://rna-cpat.sourceforge.net/) and in this chapter.

Key words Protein coding, Prediction, Noncoding RNA, LncRNA, LincRNA, Logistic regression

1 Introduction

Deep transcriptome sequencing (RNA-seq) provides unprecedented opportunities to identify RNA transcripts that had not been discovered before because of their low abundance and transient nature. The discovery of a large number of novel transcripts calls for new methods that can rapidly and accurately distinguish coding and noncoding RNAs. To this end, we developed CPAT, a logistic regression model-based classifier that uses four sequence features, namely “open reading frame (ORF) size,” “ORF coverage,” “Fickett’s TESTCODE statistic,” and “hexamer usage bias,” to evaluate the RNA coding potential. The logistic regression model is trained from well-annotated protein-coding RNAs and noncoding RNAs, and the performance (sensitivity, specificity, accuracy, precision) of the model is evaluated using K-fold cross-validation. CPAT achieves superior prediction accuracy to other peer algorithms [1–3]. It also runs much faster, enabling it to process thousands of transcripts within seconds.

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_3, © Springer Science+Business Media, LLC, part of Springer Nature 2021


1.1 The Four Sequence Features Used by CPAT

The first feature is the ORF size. An ORF is a continuous stretch of RNA that begins with a start codon (usually AUG) and ends at a stop codon (usually UAA, UAG, or UGA). Although computationally straightforward, ORF size is arguably the best feature to predict protein-coding RNAs (i.e., messenger RNA or mRNA), because a large putative ORF is unlikely to be observed by chance in noncoding RNA sequences (see Note 1).

The second feature used by CPAT is ORF coverage, which is defined as the ratio between the “size of the longest ORF” and the “size of the whole transcript.” This ORF coverage feature also has good classification power, as it is highly complementary to, and independent of, the ORF size.

The third feature is the Fickett’s TESTCODE score (termed “Fickett score” hereafter). The Fickett score was developed by James Fickett at Los Alamos National Laboratory in 1982 [4]. It is a linguistic feature that distinguishes protein-coding RNA from noncoding RNA according to the combined effects of nucleotide positional and compositional bias of codons. It is worth noting that CPAT calculates the Fickett score from the predicted longest ORF rather than from the full RNA sequence.

The fourth feature is the hexamer usage bias (termed “hexamer score” hereafter) [5]. The prefix “hexa-” means “six”; therefore, there are a total of 4^6 = 4096 possible hexamers. Since each amino acid is coded by one or several nucleotide triplets (i.e., codons), the hexamer score measures the dependence of two adjacent amino acids in the peptide. CPAT uses the log-likelihood ratio to measure differential hexamer usage between coding and noncoding sequences. Therefore, positive hexamer values indicate coding sequences, whereas negative values indicate noncoding sequences. The combinatorial effect of ORF size, Fickett score, and hexamer score in 12,000 mouse RNAs can be visualized in Fig. 1.
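The definitions above can be made concrete with a small Python sketch. It is not CPAT's implementation: it searches only the three forward reading frames, counts the stop codon as part of the ORF, and averages the log-likelihood ratio over in-frame hexamers taken one codon apart. Those choices, the function names, and the pseudocount used for unseen hexamers are assumptions for illustration. The hexamer frequency tables are supplied by the caller.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return (start, end) of the longest ATG...stop ORF in the forward frames.

    Coordinates are 0-based with an exclusive end; (0, 0) means no ORF found.
    """
    seq = sequence.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                if i + 3 - orf_start > best[1] - best[0]:
                    best = (orf_start, i + 3)
                orf_start = None
    return best


def orf_features(sequence):
    """Return (ORF size, ORF coverage) for the longest ORF."""
    start, end = longest_orf(sequence)
    orf_size = end - start
    return orf_size, orf_size / len(sequence)


def hexamer_score(orf, coding_freq, noncoding_freq, pseudo=1e-6):
    """Mean log-likelihood ratio of in-frame hexamers (coding vs. noncoding)."""
    seq = orf.upper().replace("U", "T")
    ratios = []
    for i in range(0, len(seq) - 5, 3):
        hexamer = seq[i:i + 6]
        coding = coding_freq.get(hexamer, pseudo)
        noncoding = noncoding_freq.get(hexamer, pseudo)
        ratios.append(math.log(coding / noncoding))
    return sum(ratios) / len(ratios) if ratios else 0.0


if __name__ == "__main__":
    print(orf_features("CCAUGAAAUAGGG"))
    # Hypothetical frequency tables, for illustration only.
    coding = {"ATGAAA": 0.02, "AAATAG": 0.01}
    noncoding = {"ATGAAA": 0.01, "AAATAG": 0.01}
    print(hexamer_score("AUGAAAUAG", coding, noncoding))
```

In the toy example, the first hexamer is twice as frequent in the coding table as in the noncoding table, so the score is positive, which matches the sign convention described in the text.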

1.2 The Logistic Regression Model

Binomial logistic regression (often simply referred to as logistic regression or LR) is a natural fit for this binary classification problem since the dependent variable (protein-coding or not) is dichotomous. We chose the LR algorithm because it provides many ways to regularize the model, without worrying whether the four selected features are intercorrelated.

Fig. 1 Three-dimensional plot showing the combinatorial effects of Fickett score, hexamer score, and ORF size (axis: log2 of ORF length in nt) on 6000 coding genes (red dots) and 6000 noncoding genes (blue dots) in mouse

The performance of a machine-learning model mainly depends on the size and the quality of the training data. For example, when training the LR model for the human genome, we selected 10,000 high-quality protein-coding sequences annotated by the Consensus Coding Sequence (CCDS) database (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/) and another 10,000 noncoding transcripts from the GENCODE database (https://www.gencodegenes.org/). These 20,000 RNA sequences are used for both model training and model validation. Briefly, after calculating the four features (i.e., ORF size, ORF coverage, Fickett score, and hexamer score), all protein-coding and noncoding RNA sequences are assigned the labels “1” and “0,” respectively. The coding and noncoding sequences are then combined, shuffled, and split equally into ten subsets, each containing 2000 sequences (approximately 50% are protein-coding RNAs, and the other 50% are noncoding RNAs). Nine subsets are used to train the LR model, and the remaining subset is used for validation. Sensitivity, specificity, accuracy, precision, and the area under the receiver operating characteristic curve (AUROC) are calculated to measure the performance (i.e., the ability to recognize protein-coding and noncoding RNAs) of this model. CPAT uses a nonparametric two-graph ROC curve to select the optimal threshold that maximizes the sensitivity and specificity of prediction while minimizing misclassifications.
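The data splitting and cutoff selection described above can be sketched in plain Python. This is an illustration under stated assumptions, not CPAT's own R code: the fold assignment, the grid of candidate cutoffs, and the rule of taking the cutoff at which the sensitivity and specificity curves lie closest together (one common reading of a two-graph ROC) are choices made here, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k near-equal validation folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def sensitivity_specificity(labels, probabilities, cutoff):
    """Labels are 1 (coding) or 0 (noncoding); predict coding when p >= cutoff."""
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in pairs if y == 1 and p < cutoff)
    tn = sum(1 for y, p in pairs if y == 0 and p < cutoff)
    fp = sum(1 for y, p in pairs if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def two_graph_cutoff(labels, probabilities, steps=100):
    """Return the cutoff at which sensitivity and specificity are closest."""
    best_cutoff, best_gap = 0.0, float("inf")
    for step in range(steps + 1):
        cutoff = step / steps
        sens, spec = sensitivity_specificity(labels, probabilities, cutoff)
        gap = abs(sens - spec)
        if gap < best_gap:
            best_cutoff, best_gap = cutoff, gap
    return best_cutoff


if __name__ == "__main__":
    folds = k_fold_indices(20000, k=10)
    print([len(fold) for fold in folds])  # ten folds of 2000 sequences each
    # Hypothetical validation labels and predicted coding probabilities.
    labels = [1, 1, 1, 0, 0, 0]
    probabilities = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
    print(two_graph_cutoff(labels, probabilities))
```

In a full cross-validation, each fold would serve once as the validation subset while the other nine are used for training, and the metrics would be computed on the held-out predictions.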

2 Materials

CPAT is implemented in Python and R, and the source code is freely available via http://rna-cpat.sourceforge.net/ or https://pypi.org/project/CPAT/. The generalized linear model function in R is used to build the logistic regression model. The CPAT web server (http://lilab.research.bcm.edu/cpat/) is implemented in PHP, MySQL, and Apache, and supports all major browsers. Users can run CPAT online or on their local computers after installation. Our demonstrations were performed using version 1.2.4 of CPAT, on an Intel(R) Xeon(R) CPU E5-2698 v4 2.20 GHz machine.

3 Methods

3.1 Run CPAT Online Using Web Server

The CPAT web server (http://lilab.research.bcm.edu/cpat) is suitable for users who need to make coding potential predictions for RNAs identified from human or model organisms, including mouse, fly, and zebrafish (Fig. 2). There are three ways to submit query RNAs to the CPAT server based on the size of the data (Fig. 2). Option 1: Upload a BED or FASTA file. The BED (Browser Extensible Data) format is used to describe genomic features and annotations. The standard BED format file consists of one line per RNA transcript, each containing 12 columns of data. Detailed specifications of BED format is described in https://genome.ucsc.edu/FAQ/FAQformat.html#format1. Since a BED file only provides the genome coordinates rather than the actual RNA sequences, users must specify the reference genome and assembly version when using the BED file as input. Known and novel transcripts can be reconstructed from RNA-seq data using tools like Scripture and Cufflinks [6, 7]. While Scripture outputs standard BED file that can be provided to CPAT directly, Cufflinks only generates GTF (gene transfer format) format file, which can be converted into BED file using BEDOPS (https://github.com/ bedops/bedops). In addition to BED format, CPAT also accepts RNA sequences in FASTA format. FASTA format is a text-based format for representing sequences of nucleotides or amino acids with a name extension “.fasta” or “.fa”. Each sequence is represented by two components: the initial description line started with a “>“symbol and the actual nucleotide sequence. In a FASTA file, each mRNA sequence must be in 50 ! 30 direction. Examples of BED and FASTA files are provided from the CPAT web server. The file size needs to be less than 10 MB (regular or

Prediction of RNA Coding Potential Using Sequence Features

31

Fig. 2 The CPAT web server interface with three data uploading options and supported genome assemblies

compressed). BED format is preferred as a FASTA sequence file tends to be larger, which may cause a “time out” error. If only a FASTA file is available, compressing it using gzip or bunzip2 or save it to a web server (option 3) is highly recommended to reduce the chance of the “time out” error. Option 2: Copy and paste the BED or FASTA format data to the text area.

32

Ying Li and Liguo Wang

Table 1 An example of CPAT prediction result from the web server Data Sequence ID Name

RNA Size

ORF Size

Fickett Score

Hexamer Score

Coding Probability

Coding Label

0

NM_198317

2564

1929

1.1902

0.58012401 0.999999999 Yes

1

NM_001014980 1043

909

1.1934

0.49512451 0.999831135 Yes

2

NM_004421

2924

2013

1.0965

0.58485224 0.999999999 Yes

3

NM_032348

2273

1329

1.2568

0.59611392 0.999999184 Yes

This option is for a small dataset that contains one or several RNAs.

Option 3: Copy and paste a URL of the input file into the text area. This option is for a large dataset (>10 MB) that cannot be uploaded using Options 1 and 2. Users first need to upload the BED or FASTA file to a publicly accessible web or FTP server, and then copy and paste the URL into the text area. The supported protocols are http://, https://, and ftp://. Users can click on the “Example URL to FASTA” and “Example URL to BED” buttons to learn about the input format.

After uploading the query sequences through any of the above three options, users select the species assembly that matches the query sequences (Fig. 2) and then click the “Submit” button. The web page returned by the CPAT server is represented in Table 1 (using the query sequences in “Example sequence in BED” of Option 2). The names of the first six columns in the table are self-explanatory. The “Coding Probability” in the seventh column contains the probabilities (p) that the input RNA sequences code for proteins. One can easily convert a probability into odds. For example, if the coding probability is 0.8, the probability of noncoding is 1 − 0.8 = 0.2, and the odds of protein-coding are 0.8/0.2 = 4 (i.e., 4 to 1). The “Coding Label” in the eighth column shows whether the query RNA sequence is protein-coding (yes if p ≥ cutoff value, no if p < cutoff value). See Subheading 3.2.3 for how to determine the optimum cutoff value.

The CPAT web server has limited analysis capacity and only supports four commonly used genome assemblies: human, mouse, fly, and zebrafish. Users who need to analyze a large dataset that exceeds the capacity of the web server will need to install CPAT on their local computers. Likewise, users who need to analyze RNA sequences from species that are not available on the web server will need to install CPAT locally and also build the logistic regression model themselves (see Note 2).
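The conversion from coding probability to odds, and the labeling rule of the eighth column, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of CPAT, and the cutoff of 0.5 is a placeholder rather than a recommended value.

```python
def coding_odds(p):
    """Convert a coding probability p (0 <= p < 1) into the odds of being protein-coding."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the interval [0, 1)")
    return p / (1.0 - p)


def coding_label(p, cutoff):
    """Return 'yes' if p >= cutoff and 'no' otherwise (the rule described for the Coding Label)."""
    return "yes" if p >= cutoff else "no"


# Worked example from the text: p = 0.8 gives odds of about 4 (4 to 1).
odds = coding_odds(0.8)
print(round(odds, 6))

# The cutoff below is a placeholder, not a value taken from CPAT.
print(coding_label(0.8, cutoff=0.5))
```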

Prediction of RNA Coding Potential Using Sequence Features


3.2 Run CPAT on a Local Computer

We describe how to install CPAT on a local computer in Subheading 3.2.1 and how to run CPAT using prebuilt files in Subheading 3.2.2. In Subheadings 3.2.3 and 3.2.4, we demonstrate how to train logistic regression models for new genomes, evaluate the performance of the models, and select the optimum coding probability cutoff to maximize prediction accuracy.

3.2.1 Install CPAT to a Local Computer

Prerequisites

Software and packages required by CPAT include numpy (www.numpy.org), pysam (https://github.com/pysam-developers/pysam), and R (https://www.r-project.org/). Python 2.6 is required to run CPAT v1.2.4 or older versions. Python 3.5 is required to run CPAT v2.0.0 or later versions.

CPAT installation

Pip (https://pypi.python.org/pypi/pip) is a Python package management system. Using pip is the preferred way to install CPAT because all the Python packages that CPAT depends on are installed automatically. Before using pip, users first need to make sure that pip is already installed (see Note 3). Then open a terminal and type the command:

$ pip2 install CPAT # for version 1.2.4
$ pip3 install CPAT # for version 2.0.0

As shown above, users need pip2 to install CPAT version 1.2.4 and pip3 to install CPAT version 2.0.0. Pip knows how to select the appropriate version; for example, pip3 will automatically select CPAT version 2.0.0 (see Note 4). Using pip to upgrade CPAT to a newer version is straightforward:

$ pip3 install CPAT --upgrade

CPAT is also available from Anaconda (https://anaconda.org/bioconda). Users can use conda to install and upgrade CPAT. This is particularly useful if users want to create a virtual environment for CPAT:

$ conda create -n cpat_env # create a new virtual environment
$ conda activate cpat_env # activate the new virtual environment
$ conda install -c bioconda cpat # install CPAT from the bioconda channel
$ conda update cpat # update CPAT to a newer version


Ying Li and Liguo Wang

3.2.2 Run CPAT from the Command Line

Users can run the cpat.py script using a BED file as the input. In this case, the reference genome sequence file in FASTA format (-r) is required (see Note 5):

$ cpat.py -r mm9.fa -g Mouse_test_RefSeq_coding.bed -d Mouse_logitModel.RData -x Mouse_Hexamer.tab -o output1

Or, using a FASTA file as the input (-r is not required):

$ cpat.py -g Mouse_test_RefSeq_coding_mRNA.fa -d Mouse_logitModel.RData -x Mouse_Hexamer.tab -o output2

The detailed description of each option is given in Table 2.

3.2.3 Build Logistic Regression Model for New Genomes

When users want to use CPAT to predict the coding potential of RNAs from other genomes, they need to prepare the training data and build the logistic regression model by themselves.

Step-1: Prepare the Training Dataset

Users can select X protein-coding mRNA transcripts and Y noncoding RNA transcripts from a high-quality annotation database. Although the exact numbers of protein-coding and noncoding genes are usually unknown for a particular genome (even for the well-annotated human genome), we recommend that X and Y be as large as possible and balanced (i.e., X ≈ Y). If the genome of the species is unknown, is not well annotated, or does not have enough “coding” and “noncoding” genes, a compromise is to use data from other species that are evolutionarily close to the species of interest. If the mRNA sequence is extracted from the genome and the gene is located on the minus strand, the mRNA sequence should be reverse-complemented; in other words, the RNA sequences in the FASTA file should always be in the 5′ → 3′ direction (see Note 6).
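The strand rule above can be illustrated with a short Python sketch. The function name and the "+"/"-" strand encoding are assumptions made for this example; they are not part of CPAT, which performs this step itself when given a BED file and a reference genome.

```python
# Translation table for complementing DNA bases (upper and lower case, N kept as N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_five_to_three(genomic_seq, strand):
    """Return the transcript sequence in 5'->3' orientation.

    genomic_seq is the sequence as read from the plus strand of the genome;
    strand is '+' or '-' (the gene's strand).
    """
    if strand == "+":
        return genomic_seq
    if strand == "-":
        # Reverse-complement: complement every base, then reverse the string.
        return genomic_seq.translate(COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")


print(to_five_to_three("ATGGCC", "+"))
print(to_five_to_three("ATGGCC", "-"))
```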

Step-2: Generate Hexamer Table

Hexamer score is one of the four sequence features CPAT uses to classify RNA sequences. To compute the hexamer score, one first needs to generate the hexamer frequency table, which consists of three columns: the first column contains hexamer sequences (e.g., “CAACTA”), and the second and third columns contain hexamer frequencies calculated from the protein-coding RNAs and noncoding RNAs, respectively. Such a hexamer table can be generated using the make_hexamer_tab.py script. Please note that the input coding sequences to make_hexamer_tab.py must be the CDS sequences without UTRs (i.e., the RNA sequence between the start and the stop codon) (see Note 7). We use one of the supported mouse genome assemblies (mm9) to demonstrate how the files are generated. Users should replace “Mouse” with their species of interest when running the commands.

$ make_hexamer_tab.py -c Mouse_coding_transcripts_CDS.fa -n Mouse_noncoding_transcripts_RNA.fa > Mouse_Hexamer.tab

Table 2 CPAT program options

Option | Type | Description | Default
-g or --gene | Mandatory | RNAs either in BED or FASTA format: if this is a BED format file, “-r/--ref” must also be specified; if this is an RNA sequence file in FASTA format, ignore the “-r/--ref” option. The input BED or FASTA file can be a regular text file, a compressed file (*.gz, *.bz2), or an accessible URL (http://, https://, ftp://) | No
-o or --outfile | Mandatory | Name of output file | No
-x or --hex | Mandatory | Prebuilt hexamer frequency table (human, mouse, fly, zebrafish). Run “make_hexamer_tab.py” to make this table out of your own training dataset | No
-d or --logitModel | Mandatory | Prebuilt training model (human, mouse, fly, zebrafish). Run “make_logitModel.py” to build the logit model out of your own training dataset | No
-r or --ref | Mandatory if the input file is in BED format | Reference genome sequences in FASTA format. Ignore this option if a FASTA file was provided to “-g/--gene”. The reference genome file will be indexed automatically (producing a *.fai file alongside the original *.fa file in the same directory) if this has not already been done |
-s or --start | Optional | Start codon (DNA sequence, so use “T” instead of “U”) used to define the open reading frame (ORF) | “ATG”
-t or --stop | Optional | Stop codon (DNA sequence, so use “T” instead of “U”) used to define the open reading frame (ORF). Multiple stop codons should be separated by “,” | “TAG, TAA, TGA”
-h or --help | Optional | Show help message and exit | No
-v or --version | Optional | Show program’s version number and exit | No
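To make the three-column layout of the hexamer table described in Step-2 concrete, here is a minimal Python sketch that builds such a table from toy sequences. It is not the implementation used by make_hexamer_tab.py: the function names are hypothetical, and counting every overlapping hexamer with a step of 1 is an assumption of this example, so the numbers it produces should not be expected to match CPAT's output.

```python
from collections import Counter
from itertools import product


def hexamer_frequencies(seqs, k=6, step=1):
    """Relative frequency of each k-mer over a list of sequences (assumed counting scheme)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - k + 1, step):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def hexamer_table(coding_seqs, noncoding_seqs):
    """Rows of (hexamer, coding frequency, noncoding frequency) for all 4096 hexamers."""
    coding = hexamer_frequencies(coding_seqs)
    noncoding = hexamer_frequencies(noncoding_seqs)
    rows = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=6)):
        rows.append((kmer, coding.get(kmer, 0.0), noncoding.get(kmer, 0.0)))
    return rows


# Toy input: one "CDS" and one "noncoding" sequence (use T instead of U).
table = hexamer_table(["ATGGCCATGGCC"], ["ATATATATAT"])
for kmer, coding_freq, noncoding_freq in table:
    if coding_freq or noncoding_freq:
        print(kmer, round(coding_freq, 4), round(noncoding_freq, 4), sep="\t")
```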

Step-3: Build the Logistic Regression Model

After generating the hexamer table, users can use make_logitModel.py to build the logistic regression model. This Python script needs three input files: (1) the hexamer table, (2) the protein-coding mRNAs in FASTA (full mRNA sequence including both CDS and UTRs) or BED format, and (3) the noncoding RNAs in FASTA or BED format. For example:


# Use FASTA file as input:
$ make_logitModel.py -x Mouse_Hexamer.tab -c Mouse_coding_transcripts_mRNA.fa -n Mouse_noncoding_transcripts_RNA.fa -o Mouse

# Use BED file as input:
$ make_logitModel.py -x Mouse_Hexamer.tab -c Mouse_coding_transcripts.bed -n Mouse_noncoding_transcripts.bed -r mm9.fa -o Mouse

This program will output three files: (1) Mouse.feature.xls is a table that contains the features calculated from the training datasets. The six columns are “gene ID,” “mRNA size,” “ORF size,” “Fickett score,” “hexamer usage,” and “Label” (1: coding, 0: noncoding). (2) Mouse_logitModel.RData contains the logit model required by CPAT (see Note 8). (3) Mouse.make_logitModel.r is the R script used to build the above logit model.

3.2.4 Evaluate the Performance of the Models Using K-Fold Cross-Validation

Overfitting and underfitting are the two biggest causes of poor performance in machine learning algorithms. Overfitting happens when a model learns the details and noise of the training data to the extent that its performance on unseen data suffers. K-fold cross-validation is an effective way to detect overfitting (overfitting is indicated by a “low training error” combined with a “high testing error”). If overfitting occurs, users need to refine the existing training data, for example, by removing low-quality RNAs and adding more high-quality RNAs. Using a larger training dataset not only reduces the chance of overfitting but also increases the accuracy of the model. Underfitting occurs when the model performs poorly on both training and testing data (i.e., “high training error” and “high testing error”). If underfitting occurs, users probably need to train the logistic regression model using data from other species that are evolutionarily close and well annotated.

To perform K-fold cross-validation, users first shuffle the data randomly and then split the training dataset into K distinct, equal-sized subsets (i.e., folds). The logistic regression model is then trained on K−1 folds and evaluated on the remaining fold (Fig. 3a). It is common to set K = 10, but if the training dataset is small, users could also try K = 5 to increase the testing/training ratio. In each validation run, a confusion matrix is made, and performance measurements (i.e., sensitivity/recall, specificity, and precision) are calculated. Finally, users can summarize the performance of the models with the mean of the measurement scores, ROC (receiver operating characteristic) curves (Fig. 3b), and precision–recall curves (Fig. 3c). Users can download the example code and data from https://sourceforge.net/projects/rnacpat/files/Figure3_data/.
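The shuffle-and-split procedure and the per-fold measurements can be sketched in plain Python as follows. This is an illustrative outline only, not the example code distributed with CPAT: the function names are hypothetical, the model-fitting step is deliberately left out, and folds are only approximately equal in size when the number of samples is not a multiple of K.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle the data randomly
    folds = [indices[i::k] for i in range(k)]   # split into k folds of near-equal size
    for i, test in enumerate(folds):            # each fold serves as the test set once
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and precision from 0/1 labels (1 = coding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
    }


# Example: 10 samples, 5 folds; a model would be trained on `train` and scored on `test`.
for train, test in k_fold_splits(10, 5):
    print(sorted(test), len(train))
```

In a real run, the metrics returned for each fold would be collected and averaged, as described in the text.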

[Figure 3. Panel (a), K-fold cross-validation procedure: 1. Shuffle the dataset randomly. 2. Split the dataset into K groups. 3. For each unique group: (a) take 1 group as a hold-out or test dataset; (b) take the remaining K−1 groups as a training dataset; (c) fit a model on the training set and evaluate it on the test set; (d) retain the evaluation score and discard the model. 4. Summarize the skill of the model using the model evaluation scores. Panel (b): Sensitivity versus 1 − Specificity (AUC = 0.9863). Panel (c): Precision (PPV) versus Recall (TPR). Panel (d): Performance (Sensitivity and Specificity) versus Coding Probability.]

Fig. 3 (a) K-fold cross-validation procedure. (b) An example of the AUC (area under the curve) plot. Blue dashed lines indicate the ten rounds of cross-validations; the solid red curve shows the averaged performance. (c) An example of the precision–recall (PR) curve. The blue dashed lines indicate the ten rounds of cross-validations; the blue solid curve shows the averaged performance. (d) An example of the two-graph ROC curve. The model was trained using mouse protein-coding and noncoding RNAs. The vertical dashed line indicates the optimum p cutoff (0.44). The horizontal dashed line indicates the sensitivity and specificity achieved if the p = 0.44 cutoff were used

Step-4: Determine the Optimum Probability Cutoff

The two-graph ROC plot can be used to determine the optimum probability cutoff and to minimize the false-positive and false-negative predictions [8]. Visually, the optimum cutoff is the X-axis value of the point c(x, y) where the sensitivity and specificity curves meet (Fig. 3d). The probability cutoff value is then hardcoded into the cpat.py script to determine whether an unknown query sequence is coding or noncoding. An RNA sequence is classified as “noncoding” if the coding probability p < cutoff and as “coding” if p ≥ cutoff.
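Numerically, the crossing point of the two curves can be approximated by scanning candidate cutoffs and keeping the one at which sensitivity and specificity are closest. The sketch below assumes that predicted coding probabilities and true 0/1 labels are available from cross-validation; the function names and the grid of 1001 candidate cutoffs are choices made for this illustration and are not taken from CPAT.

```python
def sens_spec(probs, labels, cutoff):
    """Sensitivity and specificity when sequences with p >= cutoff are called coding."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both coding (1) and noncoding (0) labels are required")
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    return tp / n_pos, tn / n_neg


def crossing_cutoff(probs, labels, n_steps=1000):
    """Return (cutoff, sensitivity, specificity) where the two curves are closest."""
    best = None
    for i in range(n_steps + 1):
        cutoff = i / n_steps
        sens, spec = sens_spec(probs, labels, cutoff)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Toy data: three coding (label 1) and three noncoding (label 0) transcripts.
toy_probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(crossing_cutoff(toy_probs, toy_labels))
```

When several cutoffs give the same gap, this sketch returns the smallest one; with real cross-validation data the curves usually cross in a narrow interval, so the choice matters little.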

4 Notes

1. If the four nucleotides of a DNA sequence are independent and identically distributed, the chance of observing a stop codon is 3/64 (3 stop codons out of a total of 64 codons), which means that, scanning every nucleotide position as a possible codon start, we expect to see a stop codon about once every 21 nucleotides (64/3 ≈ 21.3).
2. Situations in which users need to install the CPAT software on their local computers: (1) users need to analyze a very large dataset; (2) users need to analyze RNAs that originate from species other than human, mouse, fly, and zebrafish; (3) users need to incorporate CPAT into a pipeline/workflow.
3. Use “pip --version” or “which pip” to check whether your computer already has pip. Pip may or may not be installed depending on the Python version you have. If you are using Python 2 ≥ 2.7.9 or Python 3.4, 3.5, or 3.6, pip is already installed. Otherwise, follow the instructions (https://pip.pypa.io/en/stable/installing/) to install pip.
4. On your local computer or cluster, you might have multiple pips such as “pip”, “pip2” (pip for Python 2.7), and “pip3” (pip for Python 3). Whether your pip points to Python 2.7 or Python 3 depends on your system. You can use “pip --version” to check. Explicitly using pip2 or pip3 is recommended.
5. The BED file (-g) and the reference genome file (-r) must be based on the same genome assembly (in this case mm9 or GRCm37). CPAT will automatically index the reference genome sequence file to enable efficient access to arbitrary regions within the reference sequences. If an unexpected error occurs during indexing, users need to index the reference genome sequence file manually using the “faidx” command in samtools (http://samtools.sourceforge.net/). For example, after indexing “mm9.fa”, an “mm9.fa.fai” file will be created.
6. Use “T” instead of “U” in all RNA sequences.
7. Use “T” instead of “U” in all CDS sequences.
8. Use the “load” function in R to load the *.RData file into R and then view its content.

References

1. Kong L, Zhang Y, Ye ZQ et al (2007) CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res 35:W345–W349
2. Lin MF, Jungreis I, Kellis M (2011) PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27:i275–i282
3. Arrial RT, Togawa RC, Brigido MM (2009) Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis. BMC Bioinformatics 10:239
4. Fickett JW (1982) Recognition of protein coding regions in DNA sequences. Nucleic Acids Res 10:5303–5318
5. Fickett JW, Tung C-S (1992) Assessment of protein coding measures. Nucleic Acids Res 20:6441–6450
6. Guttman M, Garber M, Levin JZ et al (2010) Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 28:503–510
7. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
8. Greiner M (1995) Two-graph receiver operating characteristic (TG-ROC): a Microsoft-EXCEL template for the selection of cutoff values in diagnostic tests. J Immunol Methods 185:145–146

Chapter 4

Classification of Long Noncoding RNAs by k-mer Content

Jessime M. Kirk, Daniel Sprague, and J. Mauro Calabrese

Abstract

K-mer based comparisons have emerged as powerful complements to BLAST-like alignment algorithms, particularly when the sequences being compared lack direct evolutionary relationships. In this chapter, we describe methods to compare k-mer content between groups of long noncoding RNAs (lncRNAs), to identify communities of lncRNAs with related k-mer contents, to identify the enrichment of protein-binding motifs in lncRNAs, and to scan for domains of related k-mer contents in lncRNAs. Our step-by-step instructions are complemented by Python code deposited in GitHub. Though our chapter focuses on lncRNAs, the methods we describe could be applied to any set of nucleic acid sequences.

Key words k-mer, Long noncoding RNA, LncRNA, Protein-binding motif, Domain, Sequence alignment, Networks, Louvain algorithm, Unsupervised clustering, Communities

1 Introduction

Upward of 80% of the human genome can be transcribed into RNA. Of the total number of transcribed nucleotides, approximately one half comprise pre-messenger RNAs (pre-mRNAs) that will ultimately become spliced and encode for proteins in the cytoplasm. The other half comprise long noncoding RNAs (lncRNAs), defined as RNA species that are greater than 200 nucleotides in length and have little or no potential to encode for proteins. Compared to transcripts produced from protein-coding genes, lncRNAs are, on average, less conserved, transcribed at lower levels, spliced less efficiently, and more likely to remain in the nucleus [1–6]. Nevertheless, a growing number of lncRNAs have been studied experimentally, and are now known to play important roles in health and development. Some of the most notable of these include the lncRNA XIST, which orchestrates transcriptional silencing during X-chromosome inactivation [7], the lncRNAs NEAT1 and

Jessime M. Kirk and Daniel Sprague Co-first authors. Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_4, © Springer Science+Business Media, LLC, part of Springer Nature 2021


MALAT1, which play roles in nuclear organization and have context-dependent functions in development and in cancer [8– 14], and the lncRNA NORAD, which helps to maintain genome stability by promoting DNA repair [15, 16]. LncRNAs have also been found to play important roles in developmental transitions [17–23], in the immune system [24–26], in the brain [27–33], and in the heart [34–37]. These identified roles, coupled with the large number of lncRNAs that have yet to be studied experimentally, suggest that lncRNAs with important physiological functions remain to be discovered. Still, identifying function in lncRNAs remains a major challenge. Many lncRNAs are thought to function as hubs that concentrate proteins, DNA, and possibly other biomolecules in particular regions of the cell, yet the sequence characteristics that give rise to these functions and the mechanisms through which they occur are poorly defined, even for the best studied lncRNAs [38–42]. Moreover, relative to protein-coding genes, lncRNAs are poorly conserved, evolve rapidly, and are prone to changes in gene architecture, limiting the extent to which traditional phylogenetic analyses can be employed to identify the sequence features that are important for specifying their function [43]. As an example, placental mammals express the XIST lncRNA to orchestrate gene silencing during X-Chromosome Inactivation [7], while marsupial mammals may have evolved their own lncRNA to orchestrate X-Chromosome Inactivation, termed Rsx. Remarkably, XIST and Rsx share no significant similarity by standard methods of sequence alignment [44, 45]. Thus, even though Rsx and XIST presumably function through analogous mechanisms, standard tools of sequence comparison are unable to detect the analogy. This problem extends to all lncRNAs. The sequence patterns that specify recurring functions in lncRNAs are largely unknown and difficult to detect computationally. 
Thus, to date, lncRNA functions must be determined empirically, on a case-by-case basis. Recently, we developed a method of sequence comparison based on the notion that different lncRNAs likely encode similar functions through different spatial arrangements of related sequence motifs, and that such similarities might not be detectable by traditional methods of linear sequence alignment [46]. In our method, which we termed SEEKR (sequence evaluation through k-mer representation), the sequences of any number of lncRNAs are evaluated by comparing the standardized abundance of nucleotide substrings termed “k-mers” in each lncRNA, where k specifies the length of the substring being counted, and is typically set to values of k = 4, 5, or 6. SEEKR counts k-mers independent of their position in sequences of interest, much like the “bag of words model” used by many language processing algorithms, in which sentences are classified by word abundance without regard to grammar or syntax [47]. Using SEEKR, we demonstrated that


k-mer content correlates with lncRNA subcellular localization, protein-binding, and repressive function, and that evolutionarily unrelated lncRNAs with analogous functions shared significant levels of nonlinear sequence similarity even when BLAST-like alignment algorithms could detect none [46]. Below, we walk users through five related applications of SEEKR that we have found to be useful. For each application, we enumerate step-by-step instructions. Where relevant, we include code to execute specific functions in Python. We have deposited standalone Python code to run the major applications of SEEKR in GitHub (https://github.com/CalabreseLab/seekr). For the simplest implementation of SEEKR, we refer users to a web portal (http://seekr.org). K-mer based classification schemes have been used in many biological contexts ([48–56] and others). Therefore, beyond lncRNAs, the methods that we describe should prove useful in the study of other nucleic acid sequences, such as 5′ and 3′ untranslated regions of mRNAs and DNA regulatory elements.
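The position-independent, “bag of words” nature of k-mer counting can be seen in a toy Python example. This is only an illustration of the idea and not SEEKR's implementation: the function name is hypothetical, the two short sequences are invented, and k = 2 is used purely to keep the output readable.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq, ignoring where in the sequence it occurs."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different toy sequences built from the same 2-mers in a different order.
seq_a = "ACAGA"
seq_b = "AGACA"

profile_a = kmer_counts(seq_a, 2)
profile_b = kmer_counts(seq_b, 2)

print(sorted(profile_a.items()))
print(sorted(profile_b.items()))
print(profile_a == profile_b)
```

The two sequences differ when read left to right, yet their 2-mer profiles are identical, which is the kind of nonlinear similarity that a linear alignment is not designed to score.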

2 Materials

2.1 Hardware Requirements

Personal computer, preferably with a multicore processor and at least 8GB of RAM.

2.2 Software Requirements

1. Python 3.6. The easiest way to get started with Python is by downloading the Anaconda distribution: https://www.anaconda.com/download.
2. The Python packages numpy, pandas, networkx, python-igraph, and louvain. All of these can be installed by running $ pip install [name].
3. R, which can be installed from https://www.r-project.org/.
4. The R packages amap and ctc. Amap is hosted at https://cran.r-project.org/web/packages/amap/index.html, and ctc at https://bioconductor.org/packages/release/bioc/html/ctc.html. Both can be installed by running the following:

source("http://bioconductor.org/biocLite.R")
biocLite("amap")
biocLite("ctc")

5. Java 1.8. See this page for help installing Java: https://www.java.com/en/download/help/download_options.xml
6. Java Treeview. http://jtreeview.sourceforge.net/
7. Gephi, which can be installed from https://gephi.org/users/download/.
8. SEEKR (optional). SEEKR is hosted at PyPI: https://pypi.org/project/seekr/, and can be installed by running $ pip install seekr. SEEKR works on Mac and Linux. Depending on the macOS being used, we have observed occasional bugs when installing some of the dependencies of SEEKR using Anaconda Python. As a useful workaround, for example, in macOS 10.14.x, run $ MACOSX_DEPLOYMENT_TARGET=10.14 pip install seekr. To print the documentation associated with each SEEKR command line tool, simply type the name of the tool in the UNIX terminal (e.g., $ seekr_download_gencode).

3 Methods

3.1 Comparing k-mer Contents Between a Group of lncRNAs

3.1.1 Download lncRNA Sequences

LncRNA sequences can be downloaded from https://www.gencodegenes.org/. For this analysis, we will use human v22: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_22/gencode.v22.lncRNA_transcripts.fa.gz and mouse vM5: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M5/gencode.vM5.lncRNA_transcripts.fa.gz. Unzip these files to produce gencode.v22.lncRNA_transcripts.fa and gencode.vM5.lncRNA_transcripts.fa. The following pipeline will be demonstrated using just the gencode.v22.lncRNA_transcripts.fa file. Mouse, or any other fasta file, can be substituted instead. Downloading and unzipping can be done manually. Alternatively, if SEEKR is installed locally, you can also download the files from the command line. Use “lncRNA” to specify the biotype of the transcripts file, and the “--release” flag (-r) to indicate that you want a particular version of the fasta file:

$ seekr_download_gencode lncRNA -r 22

3.1.2 Select 01 Isoform

To avoid bias that may be introduced by counting k-mers across multiple isoforms of the same transcript, we have historically selected only transcripts whose names end in 01, which, in prior versions of GENCODE, represented the canonical isoform of a gene product. Using this filter, each genomic locus is only represented once. Current versions of GENCODE have moved away from the “01” nomenclature, and our workaround is typically to use the entire set of annotations, including semi-redundant isoforms.

fasta_path = 'v22_lncRNA.fa'  # see Note 1
with open(fasta_path) as infasta:
    data = [l.strip() for l in infasta]
headers = data[::2]
seqs = data[1::2]
fasta01_path = 'v22-01.fa'
with open(fasta01_path, 'w') as outfasta:
    for header, seq in zip(headers, seqs):
        common_name = header.split('|')[4]
        if common_name.endswith('01'):  # see Note 2
            outfasta.write(header+'\n')
            outfasta.write(seq+'\n')

To accomplish the same using the seekr_canonical_gencode command line tool, pass the name of the GENCODE fasta file and a path to the newly filtered fasta file:

$ seekr_canonical_gencode v22_lncRNA.fa v22-01.fa

3.1.3 Count k-Mers

Next, we define a 2D matrix where each row represents one transcript, each column represents a k-mer, and each element is a normalized and standardized count of how many times a k-mer is found in a transcript. A single row of the matrix, then, defines a “k-mer profile” for a given lncRNA.

import pickle
import numpy as np
import pandas as pd
from collections import defaultdict
from itertools import product

# Read fasta file
fasta_path = 'v22-01.fa'
with open(fasta_path) as infasta:
    data = [l.strip() for l in infasta]
headers = data[::2]
seqs = data[1::2]

# Initialize data
k = 6
kmers = [''.join(i) for i in product('AGTC', repeat=k)]
k_map = dict(zip(kmers, range(4**k)))
counts = np.zeros([len(seqs), 4**k], dtype=np.float32)

# Do counting
for i, seq in enumerate(seqs):
    row = counts[i]
    count_dict = defaultdict(int)  # see Note 3
    length = len(seq)
    increment = 1000/length
    for c in range(length-k+1):  # see Note 4
        kmer = seq[c:c+k]
        count_dict[kmer] += increment
    for kmer, n in count_dict.items():
        if kmer in k_map:  # see Note 5
            row[k_map[kmer]] = n

# Normalize
counts -= np.mean(counts, axis=0)
counts /= np.std(counts, axis=0)
counts += abs(counts.min()) + 1  # see Note 6
counts = np.log2(counts)

# Save csv file
out_path = 'v22_6mers.csv'
seen = set()  # see Note 7
names = []
for h in headers:
    name = h.split('|')[4]
    if name in seen:
        name += 'B'
    seen.add(name)
    names.append(name)
pickle.dump(names, open('v22_names-B.pkl', 'wb'))
df = pd.DataFrame(counts, names, kmers)
df.to_csv(out_path, float_format='%.4f')

Using the command line tool:

$ seekr_kmer_counts v22-01.fa -o v22_6mers.csv

3.2 Hierarchical Clustering of lncRNAs by k-mer Content

3.2.1 Cluster with Amap

The visualization tool Java Treeview allows for interactive exploration of large hierarchical clusters. Treeview parses clusters defined by a set of three plaintext files, which describe the structure of row and column clusters: .gtr, .atr, and .cdt. These files can be conveniently produced by the R packages “amap” and “ctc,” which parse a .csv file such as v22_6mers.csv. The R script “treeview_cluster.r” will create the Treeview files: make_treeview

Mahajan.NatGenet2018b.T2Dbmiadj.European.txt \
> | sort -k1,1n -k2,2n \
> > Mahajan.NatGenet2018b.T2Dbmiadj.European.bed

*Explaining the awk code. Block one sets the output field separator (OFS) to a tab (\t). Block two prints a “#” symbol before the first line. Block three prints the chromosome field of the original file, followed by the position field minus one (because bed files are 0-indexed), the third field, the fourth field, and so on. We then pipe the results of the awk operation into a Linux sort, which sorts the bed file by chromosome (field one, numerical) and start (field two, numerical). This step took 6 min to run and 8 MB of memory, but it saves much more time in the next step.

>head -n4 Mahajan.NatGenet2018b.T2Dbmiadj.European.bed
#Chr -1 Pos EA NEA EAF Beta SE Pvalue Neff
1 79032 79033 A G 0.0014 0.2 0.2 3.1e-01 93295
1 79136 79137 A T 1 0.065 0.32 8.4e-01 92984
1 533178 533179 A G 1 0.24 0.5 6.3e-01 87066
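The coordinate shift and sort described above can also be sketched in Python for readers who prefer to see the logic outside of awk. This is an illustration on in-memory tuples, not a replacement for the command above: the function name is hypothetical, and chromosomes are assumed to be integers, mirroring the numerical sort.

```python
def gwas_to_bed(rows):
    """Convert (chrom, 1-based position, other fields...) rows into sorted BED-style rows.

    BED intervals are 0-based and half-open, so a single-base variant at
    1-based position pos becomes the interval [pos - 1, pos).
    """
    bed = [(chrom, pos - 1, pos, *rest) for chrom, pos, *rest in rows]
    # Sort by chromosome, then by start coordinate (both numerically).
    bed.sort(key=lambda r: (r[0], r[1]))
    return bed


# Toy rows in the order (chrom, pos, effect allele, other allele), deliberately unsorted.
toy_rows = [
    (1, 533179, "A", "G"),
    (1, 79033, "A", "G"),
    (1, 79137, "A", "T"),
]
for row in gwas_to_bed(toy_rows):
    print(*row, sep="\t")
```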

Integrating GWAS-eQTL Data to Define Functional lncRNAs

Using the intersect tool from the BEDtools suite, match the RSID file with the GWAS bed file:

>intersectBed \
> -sorted \
> -a All_20180423.vcf.gz \
> -b Mahajan.NatGenet2018b.T2Dbmiadj.European.bed \
> -wa -wb \
> > Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.bed

*-sorted tells the intersect command that the input files are sorted, which allows it to run a more memory- and time-efficient algorithm. -wa and -wb tell the command to output the matching entries from both of the input files. For files of this size, this step required 4.59 GB of RAM and 26 min to run on a single core, which means it is feasible on an average MacBook or Windows computer. Running it on unsorted input files would take much more RAM and much longer, which is not feasible on a personal computer and is an inefficient use of a cluster.

>head -n3 Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.bed
1 79033 rs2462495 A G,T . . RS=2462495;RSPOS=79033;RV;dbSNPBuildID=100;SSR=0;SAO=0;VP=0x050100000005000516000100;WGT=1;VC=SNV;SLO;ASP;HD;GNO;KGPhase1 1 79032 79033 A G 0.0014 0.2 0.2 3.1e-01 93295
1 79137 rs143777184 A C,T . . RS=143777184;RSPOS=79137;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050100000005110036000100;WGT=1;VC=SNV;SLO;ASP;G5;KGPhase1;KGPhase3;CAF=0.9587,.,0.04133;COMMON=1;TOPMED=0.95738563965341488,0.00000796381243628,0.04260639653414882 1 79136 79137 A T 1 0.065 0.32 8.4e-01 92984
1 533179 rs111501994 A G . . RS=111501994;RSPOS=533179;dbSNPBuildID=132;SSR=0;SAO=0;VP=0x050100000005110136000100;WGT=1;VC=SNV;SLO;ASP;G5;GNO;KGPhase1;KGPhase3;CAF=0.97,0.02995;COMMON=1;TOPMED=0.88215150356778797,0.11784849643221202 1 533178 533179 A G 0.24 0.5 6.3e-01 87066

The first eight columns (separated by tabs) are from the dbSNP file (the eighth column is metadata, which was not shown earlier), and the next ten columns are from the reformatted GWAS bed file. Matching alleles are a good sign that the intersect worked properly (A and G in the first line, A and T in the second, and A and G in the third; the comma and second letter in the first two dbSNP entries indicate other possible alternative alleles). Use Linux wc to perform a second level of quality control by checking the number of lines in each file.

>wc -l Mahajan.NatGenet2018b.T2Dbmiadj.European*.bed
21635867 Mahajan.NatGenet2018b.T2Dbmiadj.European.bed
23683993 Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.bed
45319860 total

Yi Chen et al.

Interestingly enough, there are more entries in the new file than in the old one. At first, you would expect there to be fewer after matching with RSIDs, because some SNPs in the GWAS file may not have official RSIDs. This might be true, but since we matched the SNPs by location, some SNPs may also have been matched to dbSNP entries that are indels overlapping the location rather than SNPs. Use the awk tool to check the alleles and reformat the file to make it SMR compliant.

>awk 'BEGIN{OFS="\t"; print "SNP\tA1\tA2\tfreq\tb\tse\tp\tn"} \
> {delete dict; split($5,a,","); a[$4]=$4; for (i in a) dict[a[i]]=""; \
> if($12 in dict && $13 in dict) print $3, $12, $13, $14, $15, $16, $17, $18}' \
> Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.bed \
> > Mahajan.NatGenet2018b.T2Dbmiadj.European.ma

*Explaining the code: again we set the output field separator to a single tab, and we print a header line that matches the one on the SMR website. For each record, we first empty the allele dictionary (delete dict) so that alleles from previous records are not carried over. We then split the potential alternative alleles from the dbSNP file (field five) into a list and add the reference allele from field four to that list. Then we can check whether the alleles from the GWAS file (fields four and five of the GWAS bed file, which are fields 12 and 13 of the intersect file) match the ones in the list. If so, we print the RSID from the dbSNP file and the other fields from the GWAS file. It should be noted that virtually all GWAS results will have A1, A2, b, se, and p (without which you cannot run SMR), but some might not contain frequency (freq) or n (population size). Although here we have all the fields, SMR does not use population size, so you can tell awk to print a placeholder. Frequency was only implemented in later versions of SMR as a form of quality control and can also be ignored if it is missing, in which case putting 0.5 as a placeholder also works (an additional flag needs to be added in a later step though). Your final GWAS file should look something like this:

>head -n4 Mahajan.NatGenet2018b.T2Dbmiadj.European.ma
SNP A1 A2 freq b se p n
rs2462495 A G 0.0014 0.2 0.2 3.1e-01 93295
rs143777184 A T 1 0.065 0.32 8.4e-01 92984
rs111501994 A G 1 0.24 0.5 6.3e-01 87066
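The per-record allele check can be expressed compactly in Python, which may make the intent of the awk command easier to follow. This is a sketch under stated assumptions: the function name is hypothetical, and the inputs are the dbSNP reference allele, the comma-separated dbSNP alternative alleles, and the two GWAS alleles of a single intersected record.

```python
def alleles_match(ref, alt_field, a1, a2):
    """Return True if both GWAS alleles are among the dbSNP REF/ALT alleles of this record.

    The set of known alleles is rebuilt for every record, so alleles seen in
    earlier records cannot influence the result.
    """
    known = set(alt_field.split(","))
    known.add(ref)
    return a1 in known and a2 in known


# Records taken from the example output above: (REF, ALT, GWAS A1, GWAS A2).
print(alleles_match("A", "G,T", "A", "G"))
print(alleles_match("A", "C,T", "A", "T"))
# A hypothetical mismatching record, for contrast.
print(alleles_match("A", "G", "A", "C"))
```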

3.3 Reformatting the Genotype Reference Data

SMR also requires a genotype reference panel to estimate linkage disequilibrium. For this analysis, the example genotype reference data is available from GTEx in vcf format, which is one of the most common formats. It works well because the eQTLs were also generated from GTEx data, although a reference from the 1000 Genomes Project can be used instead. The reformatting step for this is much easier than for the GWAS. For the example data, you need to first filter duplicate

Integrating GWAS-eQTL Data to Define Functional lncRNAs

SNPs by name and then convert the data into binary plink format. Removing duplicate SNPs by name does lose information, but is necessary for running the SMR program (see Note 3).

>gzcat GTEx635Pass.Hg38.rsid.final.vcf.gz | grep -v "^#" | cut -f3 | sort | uniq -d \
> > duplicates.txt
>plink --make-bed --vcf GTEx635Pass.Hg38.rsid.final.vcf.gz \
> --out GTEx635Pass.Hg38.rsid.final.noDups --exclude duplicates.txt

*gzcat decompresses the vcf file, grep retrieves all lines that do not begin with # (metadata), cut retrieves the third field, which contains variant IDs, and sort followed by uniq -d prints the variant IDs that are duplicated. The second command uses plink to make a binary file set from our vcf input file while excluding the duplicate SNP list we just generated. The bed file produced by plink is not in the same format as the UCSC bed format that we used earlier; it is a compressed binary file that cannot be viewed in a text editor. The above command creates bed, bim, fam, and nosex files, all with the same prefix. The bed, bim, and fam files are the ones required by SMR.

3.4 Reformatting the eQTL Data

The next step is to reformat your eQTL data. SMR requires the data to be in its own BESD format, but the tool includes methods to convert the output of most eQTL software, including Matrix eQTL, which is what we will use. However, we will need to update the files with extra information for SNPs and genes. The first step in this process is to remove the duplicate SNPs that we removed from the genotype reference data (see Note 3).

>grep -v -F -f duplicates.txt combined.lncRNAKB.visceral.cis.tsv \
> > combined.lncRNAKB.visceral.cis.noDups.tsv
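As a cross-check on this filtering step, the same filter can be sketched in Python with exact matching on the SNP column. This is only a sketch under stated assumptions: the SNP ID is taken to be the first tab-separated column and the file is assumed to start with a header line (as in default Matrix eQTL output), and the function name and example values are illustrative. Exact matching cannot drop rs1234 merely because rs123 is listed as a duplicate, which a fixed-string search without whole-word matching could do.

```python
def drop_duplicate_snps(lines, duplicate_ids, snp_col=0):
    """Yield eQTL lines whose SNP ID is not in duplicate_ids.

    `lines` is any iterable of tab-delimited text lines; the first line is
    treated as a header and always kept. `snp_col` is the 0-based SNP column
    (0 is an assumption matching Matrix eQTL's default output).
    """
    duplicate_ids = set(duplicate_ids)
    for i, line in enumerate(lines):
        if i == 0:
            yield line  # header line
            continue
        snp = line.rstrip("\n").split("\t")[snp_col]
        if snp not in duplicate_ids:
            yield line


if __name__ == "__main__":
    # Tiny in-memory example with made-up IDs and values.
    eqtl = [
        "SNP\tgene\tbeta\tt-stat\tp-value\tFDR\n",
        "rs123\tlnckb.1\t0.5\t4.0\t1e-4\t0.01\n",
        "rs1234\tlnckb.1\t0.4\t3.5\t5e-4\t0.02\n",
    ]
    for kept in drop_duplicate_snps(eqtl, {"rs123"}):
        print(kept, end="")
```

In practice the lines would come from an open file handle and the duplicate IDs from duplicates.txt, so the whole eQTL file never needs to be held in memory.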

*The grep command removes every line of the combined.lncRNAKB.visceral.cis.tsv eQTL file that contains any value listed in duplicates.txt. Using the -F flag (fixed-string matching) in particular reduces runtime and memory requirements when working with a lot of input values from a file.

>smr --eqtl-summary combined.lncRNAKB.visceral.cis.noDups.tsv \
> --matrix-eqtl-format \
> --make-besd \
> --out combined.lncRNAKB.visceral.cis.noDups

Fig. 1 Example epi and esi files. These files are tab delimited and do not use headers. The columns for the epi file are chromosome, probe ID, placeholder, transcription start site, gene ID (can be the same as the probe ID), and gene orientation (only for plotting the results). The columns for the esi file are chromosome, SNP ID, placeholder, position, allele 1, allele 2, and allele 1 frequency

The above command produces three files that are compatible with the SMR program: the epi, esi, and besd files. The besd file is a binary file with eQTL information. The epi file contains gene information, and the esi file contains SNP information. Examples of the epi and esi files are shown in Fig. 1. SMR can make a complete besd file from the Matrix eQTL output. However, it does not have enough information to make complete epi and esi files; the ones it generated above are placeholders that need to be updated. The esi file can be created from the plink files, whereas the epi file can be created from the original gene annotation gtf/gff file. Make the esi file from the plink bim file, starting with the plink command:

>plink --freq --bfile GTEx635Pass.Hg38.rsid.final.noDups \
> --out GTEx635Pass.Hg38.rsid.final.noDups
>head -n3 GTEx635Pass.Hg38.rsid.final.noDups.frq
CHR SNP A1 A2 MAF NCHROBS
1 rs528916756 G C 0.002521 1190
1 rs538322974 A C 0.0007911 1264

*The --freq flag tells plink to calculate A1 frequencies. Then add the frequencies to the bim file.

>paste GTEx635Pass.Hg38.rsid.final.noDups.bim \
> > GTEx635Pass.Hg38.rsid.final.esi
>head -n4 GTEx635Pass.Hg38.rsid.final.esi
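Because the esi file must follow the column order given in Fig. 1 (chromosome, SNP ID, placeholder, position, allele 1, allele 2, allele 1 frequency), the merge of the bim and frq files can also be sketched in Python. This is a minimal sketch rather than the protocol's own command: it assumes the standard plink column order (bim: chromosome, SNP, genetic distance, position, A1, A2; frq: CHR, SNP, A1, A2, MAF, NCHROBS), and the function name is illustrative.

```python
def build_esi_rows(bim_lines, frq_lines):
    """Combine plink .bim and .frq records into SMR .esi rows.

    Output columns follow Fig. 1: chromosome, SNP ID, placeholder (0),
    position, allele 1, allele 2, allele 1 frequency.
    `frq_lines` is a list whose first element is the .frq header line.
    """
    # Map SNP ID -> (A1, frequency) from the .frq file, skipping its header.
    freq = {}
    for line in frq_lines[1:]:
        chrom, snp, a1, a2, maf, nobs = line.split()
        freq[snp] = (a1, maf)

    rows = []
    for line in bim_lines:
        chrom, snp, _cm, pos, a1, a2 = line.split()
        frq_a1, maf = freq[snp]
        # plink reports the frequency of A1; make sure both files agree on A1.
        if frq_a1 != a1:
            raise ValueError(f"A1 mismatch for {snp}: {a1} vs {frq_a1}")
        rows.append([chrom, snp, "0", pos, a1, a2, maf])
    return rows
```

Joining on the SNP ID rather than pasting columns side by side means the result does not depend on the two files being in the same row order.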

*The linux code block within the

>gzcat lncRNAKB_hg38_v7.gtf.gz | grep -P '\tgene\t' \
> | sed 's/\tgene_id "/\t/' | sed 's/".*//' \
> | awk 'BEGIN{FS="\t"; OFS="\t"}{if($7 ~ /-/) {$3=$4;$4=$5;$5=$3}; \
> print $1,$9,$5,$4,$9,$7}' \
> | sed 's/chr//' > lncRNAKB_hg38_v7.epi
>head -n4 lncRNAKB_hg38_v7.epi
1 lnckb.1 0 11874 lnckb.1 +
1 lnckb.2 0 29370 lnckb.2
1 lnckb.3 0 29926 lnckb.3 +
1 lnckb.4 0 36081 lnckb.4 -

*After downloading, we decompress the file and retrieve only gene entries (removing exon, CDS, and transcript subentries). The two sed commands remove extraneous key-value information from the last field of the gtf file. They will work most of the time (see Note 4). awk checks whether the gene is on the negative strand and, if so, sets the gene start coordinate to be the end coordinate. It then outputs the fields in the order of the epi file example above, with the addition of the end coordinate in the third column (which will be useful for graphing later). The last sed command removes the "chr" prefix that chromosome names conventionally carry in genome build Hg38. This is important as SMR recognizes chromosomes in plink format, which labels chromosomes numerically. Finally, update the besd file set with the new esi and epi files.

>smr --beqtl-summary combined.lncRNAKB.visceral.cis.noDups \
> --update-esi GTEx635Pass.Hg38.rsid.final.noDups.esi
>smr --beqtl-summary combined.lncRNAKB.visceral.cis.noDups \
> --update-epi lncRNAKB_hg38_v7.epi
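The gtf-to-epi conversion used in this section relies on gene_id being the first attribute of each gtf line. For annotation files where that may not hold, the same conversion can be sketched in Python by looking gene_id up by key with a regular expression. This is a sketch under assumptions: the function name is hypothetical, and the output uses the six epi columns of Fig. 1 with 0 as the placeholder in the third column.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gtf_gene_to_epi(line):
    """Convert one gtf 'gene' line to an epi row, or return None for other lines."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 9 or f[2] != "gene":
        return None
    chrom, start, end, strand, attrs = f[0], f[3], f[4], f[6], f[8]
    match = GENE_ID.search(attrs)
    if match is None:
        return None
    gene = match.group(1)
    # Transcription start site: start on the + strand, end on the - strand.
    tss = end if strand == "-" else start
    # plink-style chromosome names carry no "chr" prefix.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return [chrom, gene, "0", tss, gene, strand]
```

Joining each returned row with tabs and writing one row per line gives a file in the layout expected by --update-epi.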

3.5 Running the SMR Program

Run the SMR analysis.

>smr --bfile GTEx635Pass.Hg38.rsid.final.noDups \
> --gwas-summary Mahajan.NatGenet2018b.T2Dbmiadj.European.ma \
> --beqtl-summary combined.lncRNAKB.visceral.cis.noDups \
> --out GTEx_lncRNAKB_cis.T2Dbmiadj_European \
> --thread-num 2 \
> --cis-wind 1000

*This is the only step that you would potentially need to run on a cluster. Using these files required a maximum of around 17 GB RAM, which exceeds that of most personal computers. Even with --thread-num 2 set, it only used 1 cpu for most of the run time, but spiked to 2/3 cores for a second or two near the end. I have not found any difference between running SMR with 1 cpu or more. In all, this step required 25 min of runtime (see Note 5 on how to reduce memory requirements). As mentioned earlier, if your GWAS file did not have SNP frequency information available, or you do not care for that quality control step, you can add the flag --diff-freq 0.9 or --diff-freq-prop 1 to skip allele frequency checking. Observe the results.

>sort -g -k19,19 GTEx_lncRNAKB_cis.T2Dbmiadj_European.smr | head -n4
probeID ProbeChr Gene Probe_bp topSNP topSNP_chr topSNP_bp A1 A2 Freq b_GWAS se_GWAS p_GWAS b_eQTL se_eQTL p_eQTL b_SMR se_SMR p_SMR p_HEIDI nsnp_HEIDI
lnckb.9123 11 lnckb.9123 65422774 rs512715 11 65423737 C 0.304897 0.047 0.0085 3.200000e-08 -0.283164 0.0440858 1.335999e-10 -0.165981 0.0396089 2.783333e-05 2.927920e-09 20
lnckb.9122 11 lnckb.9122 65423153 rs512715 11 65423737 C 0.304897 0.047 0.0085 3.200000e-08 -0.297907 0.0493926 1.625433e-09 -0.157767 0.0387082 4.585291e-05 1.441744e-09 20
lnckb.54804 8 lnckb.54804 144770461 rs66716313 8 144770380 A 0.487362 0.041 0.0075 6.700000e-08 0.122988 0.0201752 1.087562e-09 0.333366 0.0819105 4.703627e-05 4.792321e-08 20
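As a sanity check on this output, the b_SMR, se_SMR, and p_SMR columns can be recomputed from the GWAS and eQTL columns of the same row. The sketch below follows the SMR test as described by Zhu et al. (2016): the effect estimate is the ratio b_GWAS/b_eQTL, and the statistic z_GWAS² z_eQTL² / (z_GWAS² + z_eQTL²) is compared with a chi-squared distribution with one degree of freedom. The function name is illustrative, and small differences from the SMR output are expected because the printed GWAS values are rounded.

```python
import math


def smr_test(b_gwas, se_gwas, b_eqtl, se_eqtl):
    """Return (b_SMR, se_SMR, p_SMR) from GWAS and eQTL summary statistics."""
    z_gwas = b_gwas / se_gwas
    z_eqtl = b_eqtl / se_eqtl
    b_smr = b_gwas / b_eqtl
    # Approximate chi-squared (1 df) statistic of the SMR test.
    t_smr = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)
    se_smr = abs(b_smr) / math.sqrt(t_smr)
    # Upper tail of a 1-df chi-squared distribution, via the normal tail.
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))
    return b_smr, se_smr, p_smr


if __name__ == "__main__":
    # First result row of the output (lnckb.9123, top SNP rs512715).
    b, se, p = smr_test(0.047, 0.0085, -0.283164, 0.0440858)
    print(f"b_SMR={b:.6f} se_SMR={se:.6f} p_SMR={p:.3e}")
```

For the first row, the recomputed values should come out close to the reported -0.165981, 0.0396089, and 2.783333e-05. The sign of b_SMR shows the direction of the inferred effect of expression on the trait.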

Next, conduct multiple testing correction across the number of probes tested, which is 488 after removing genes with no significant eQTLs (if you did not first remove probes that did not pass the 5 × 10⁻⁸ p_eQTL threshold, SMR does this automatically and you can find the number of probes tested in the output log). There are multiple methods of multiple testing correction and many tools for doing this (both online and for the command line). For the example data, use Benjamini–Hochberg correction in R with n = 488 to correct both p_SMR and p_HEIDI.

>R
> table <- read.table("GTEx_lncRNAKB_cis.T2Dbmiadj_European.smr", header=T, row.names=1)
> table$p_SMR_adj <- p.adjust(table$p_SMR, method="BH", n=488)
> table$p_HEIDI_adj <- p.adjust(table$p_HEIDI, method="BH", n=488)
> table_trim <- subset(table, p_SMR_adj < 0.05 & p_HEIDI_adj > 0.05)
> write.table(table_trim, "GTEx_lncRNAKB_cis.T2Dbmiadj_European.adj.smr", sep="\t",
> col.names=NA, quote=F)
> q()
>head -n 4 GTEx_lncRNAKB_cis.T2Dbmiadj_European.adj.smr | cut -f 1-2,21-
 ProbeChr nsnp_HEIDI p_SMR_adj p_HEIDI_adj
lnckb.29925 2 20 0.011094802 0.290036324615385
lnckb.57043 9 20 0.011094802 0.103029819142857
lnckb.29926 2 20 0.0115484956 0.137285387393939
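For readers working outside R, the Benjamini–Hochberg adjustment with a fixed number of tests can be sketched in Python. This is a minimal sketch that mirrors what R's p.adjust does with method "BH" and an explicit n, for a list of p values without missing entries; the function name is illustrative.

```python
def bh_adjust(pvalues, n=None):
    """Benjamini-Hochberg adjusted p values.

    `n` is the total number of tests (488 in the example data); it defaults
    to len(pvalues) and must not be smaller than that.
    """
    m = len(pvalues)
    n = m if n is None else n
    if n < m:
        raise ValueError("n must be at least the number of p values")
    # Walk from the largest p value to the smallest, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Calling bh_adjust on the p_SMR column with n=488, and again on the p_HEIDI column, gives the two adjusted columns that are then filtered at 0.05.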

You now have ten lncRNA genes whose function within visceral adipose tissue is associated with Type 2 Diabetes via genetic variants. The SMR program calculates two separate test statistics, the SMR statistic and the HEIDI statistic. It is important to consider both the SMR and the HEIDI p values, as each tests for a different phenomenon. Biologically, the SMR statistic tests for an association between gene expression level and the trait of interest through an instrumental variable, that is, the genetic variant. However, due to the effects of LD, it is possible that the association calculated by SMR is caused by two different variants in strong LD, one having a true effect on lncRNA expression and one having a true effect on the trait. The HEIDI statistic specifically tests for this. If there is a true pleiotropic or causal relationship between gene expression and trait, then the ratio of the GWAS and eQTL effect sizes should be the same across all variants in LD. HEIDI tests whether these ratios differ significantly from one another. If they do (p < 0.05), we reject the null hypothesis of a single genetic variant and accept the alternative hypothesis that the association is due to linkage. We therefore seek results with an adjusted SMR p value < 0.05 and an adjusted HEIDI p value > 0.05. If you wish to be stringent with HEIDI, you could instead use a nominal HEIDI p value of 0.05 as your cutoff.

3.6 Visualizing the Data

SMR provides an R script to plot results for a specific gene, which is available at http://cnsgenomics.com/software/smr/#Download. However, this script does not generate graphs that are as clean as expected (and I sometimes run into errors that are difficult to debug), so I have written additional R code for a nicer plot. In addition, since the eQTL data used in this protocol was generated by RNA sequencing rather than microarray, each gene is its own probe, so I have chosen not to print both probe and gene IDs as the SMR script does. First, you need to obtain a gene list file for your organism and build. The plink website has a few available and describes how to make them (https://www.cog-genomics.org/plink2/resources#genelist).

>wget https://www.cog-genomics.org/static/bin/plink/glist-hg38

Then run the SMR tool again to create the data file for the graphs.

>smr_Linux --bfile GTEx635Pass.Hg38.rsid.final.noDups \
> --gwas-summary Mahajan.NatGenet2018b.T2Dbmiadj.European.ma \
> --beqtl-summary combined.lncRNAKB.visceral.cis.noDups \
> --out GTEx_lncRNAKB_cis.T2Dbmiadj_European \
> --plot \
> --probe lnckb.29925 \
> --probe-wind 500 \
> --gene-list glist-hg38

*15 min runtime; 15 GB RAM; 1 cpu

The command generates results in a folder called plot. Now download SMR's plot script and load the data for probe lnckb.29925 in R.

>wget http://cnsgenomics.com/software/smr/download/plot.zip
>unzip plot.zip
>mv plot/plot_SMR.r .
>R
> source("plot_SMR.r")
> data colnames(data$GWAS) colnames(data$eQTL) colnames(data$SNP) snps
!is.na(data$eQTL$R2))$SNP)
> snps effectdata data$eQTL$gene=='lnckb.29925'),
> subset(data$GWAS, data$GWAS$SNP %in% snps), by="SNP")
> p.eQTL effectdata$topSNP effectdata$topSNP[p.eQTL==max(p.eQTL)]
> slope effectdata$effect.x[p.eQTL==max(p.eQTL)]
>
> p ymin=effect.y-se.y, ymax=effect.y+se.y,
> xmin=effect.x-se.x, xmax=effect.x+se.x, color=R2, shape=topSNP)) +
> geom_point() +
> geom_errorbar() +
> geom_errorbarh() +
> xlab("eQTL effect sizes") +
> ylab("GWAS effect sizes") +
> geom_abline(slope=slope, intercept=0, color="red", linetype="dashed") +
> guides(shape=guide_legend(title=NULL), color=F) +
> theme_minimal()
>
> p

This produces the following effect plot for lnckb.29925 (Fig. 2). The effect plot is a visual counterpart to the HEIDI analysis: a linear relationship between GWAS effect size and eQTL effect size across variants is expected of either pleiotropic or causal relationships between the trait and lncRNA expression arising from a single genetic variant. The slope of the red line is the estimated effect of lncRNA expression on the phenotype, calculated from the top cis-eQTL SNP.

[Fig. 2 plot: GWAS effect sizes (y-axis) versus eQTL effect sizes (x-axis); the legend distinguishes SNP from top SNP]

Fig. 2 Example effect plot. We see a consistent negative trend across most SNPs in this example. That is, we form the hypothesis that increasing lnckb.29925 expression has a protective effect against Type 2 Diabetes

4 Notes

1. SMR uses a genotype reference file to calculate linkage disequilibrium (LD) between genetic variants. As it is only used for calculating general LD, it does not need to be of the same subjects as either your GWAS or eQTL datasets. However, using a reference panel of subjects with the same ancestry as your GWAS and eQTL data produces the best estimates of LD. For the example data, we use a set of 635 subjects from GTEx, since the example eQTL data was calculated using GTEx data. This data is not available directly for public download but can be obtained via authorized access from https://dbgap.ncbi.nlm.nih.gov/. The example file was obtained in vcf format, which is a tab-delimited format in which each row corresponds to a genetic variant, with information about the variant in the first few columns and genotypes for the subjects in the following columns. This is by far the most common format. However, it is possible to obtain the data in one of the many plink formats (see https://www.cog-genomics.org/plink/1.9/formats), and you can use plink to convert to and from vcf.

As mentioned above, any large consortium of genetic variants can be used as the reference. One comprehensive dataset is from the 1000 Genomes Project, which as of phase 3 contains 84.4 million variants mapped for 2504 subjects and can be obtained publicly from https://www.internationalgenome.org/data/ (see the ftp link). They also have datasets for each specific ancestry (e.g., Asian, Caucasian, African).

Most genotype reference files will contain a small percentage of insertion-deletion (indel) variants in addition to single nucleotide polymorphism (SNP) variants. Currently it is somewhat difficult to find indels mapped across both the eQTL and GWAS datasets, since they tend to be mapped much less often than SNPs. Although SMR treats indels the same as SNPs as long as they have a name, reference allele, and alternative allele that are consistent across datasets, we made no effort to find specific ways of removing or quality checking them, as indels are not the focus of this protocol.

2. When the three input files format SNP names differently, it is easiest to manipulate the data in either the GWAS or eQTL files to match the genotype reference file. This is because genotype reference files tend to be the largest of the three (on the order of 100+ GB for a genome-wide vcf file), and GWAS/eQTL files are usually available as tab- or comma-separated-value files (.tsv or .csv), which are easy to manipulate in Linux. The most common SNP formats are some variation of chromosome:basepair, chromosome_basepair_alleles_genomebuild, or RSID.

3. The most difficult step of this protocol is likely matching the SNP information across all of the different input files. This is because of how variable SNP formats can be across the three datasets, how variable SNPs and indel variants inherently are, and how much information each dataset provides. Your goal should be to retain as much information as possible while removing any potential ambiguity that could lead to false positive results. This step is particularly difficult because (1) variant information is the only information that is present in all three datasets and (2) SMR does not deal well with multiallelic sites, which are genetic variants that have more than two potential alleles, for example an SNP that could be either an A, T, or G as opposed to just A or T.

In the example, the reference genotype data and our GWAS data are somewhat ambiguous. If we look carefully, the variants are annotated with RSIDs, but some RSIDs are repeated. In the reference genotype vcf, we can see this is because multiallelic variants were split into multiple entries. For example, for variant rs2303710, the reference allele and major allele is G, and the alternative allele can be either C or A. In our data this is coded as two separate entries, one with reference allele G and alternative allele C and one with reference allele G and alternative allele A. However, both are named rs2303710. This is a problem because in the eQTL data, the alleles are not labeled. We cannot be sure whether the eQTL is for G > A, G > C, or G > A + C. We can deal with this ambiguity in several ways depending on what we believe about the variants. Since you are not sure of the effects of each alternative allele, you could remove them altogether, which is what this protocol does. This is a loss of information, but you can count the number of variants before and after removing duplicates using the wc tool.

>wc -l combined.lncRNAKB.visceral.lncRNA.cis.*tsv
1091887 combined.lncRNAKB.visceral.lncRNA.cis.noDups.tsv
1098798 combined.lncRNAKB.visceral.lncRNA.cis.tsv

The total number of eQTLs was 1,098,798, and removing the multiallelic sites removes 6911 eQTLs (leaving 1,091,887), which is only 0.6% of the total. In general, it seems to be more common to remove multiallelic sites from analyses. In some vcf files, multiallelic sites will be coded with a reference allele and then multiple alternative alleles separated by commas. Plink, like SMR, only works with biallelic variants. If there are multiple alternative alleles, by default plink will choose the more common alternative allele

and label other alternative alleles as missing instead of removing multiallelic variants altogether. If you do this, you probably want to check that the alleles match in the GWAS and eQTL data using awk or python. If you want to get rid of these variants altogether, they can be removed using plink's --biallelic-only strict flag.

Another common possibility is that variants are labeled in the format chromosome_position_refallele_alternativeallele_build (e.g., 1_102422_G_A_b38). I suggest that you annotate them with RSIDs as in Subheading 3.2 (the code will be slightly different as the order and field separators are slightly different, i.e., tab vs. underscore), and then the process will be the same as described above. If you want to use this format instead of RSIDs, you must consider two things. (1) The genome build must be consistent across the datasets; in human, most data produced recently should be in b38, but some older datasets may be in b37 or earlier. Older datasets need to be reformatted into UCSC bed format and can be moved to the new build using UCSC's liftOver tool or the CrossMap tool (chromosome and alleles usually stay the same, but the position shifts slightly). (2) The order of the alleles must be the same and in the same case. Since SMR matches variants by name, it will not recognize 1_102422_G_A_b38, 1_102422_g_a_b38, and 1_102422_A_G_b38 as the same variant, even though it does recognize variants with the same name but with reference and alternative alleles reversed in different files.

4. To generate the epi file from the gtf file, you can take advantage of gtf file convention. In short, gtf/gff files are tab-separated data files with nine fields: chromosome, source, feature, start, end, score, strand, frame, and attribute. The attribute field is made up of a number of different key-value attributes, like a dictionary. In almost all gtf files released by popular projects such as GENCODE or RefSeq, the first attribute is gene_id. Here we take advantage of this by stripping away all but the first attribute to generate the epi file. However, this might not always be the case, and then a more complicated awk command or an R package might work instead.

5. The main reason why running the SMR tool requires so much memory (and is the only memory-intensive step of this protocol) is likely that, judging from the memory usage, it loads either the GWAS or the eQTL dataset into memory. If, however, you are only using SMR for cis-eQTLs as in this protocol, one way to reduce memory usage, which would probably make this entire protocol possible on a personal computer (although with increased runtime), is to split each input file by chromosome. This is relatively easy with either linux tools or plink. For example, you can extract chromosome 1 from each input file.

In Subheading 3.2, after intersecting the vcf file and the GWAS file, use grep to extract all lines that begin with 1.

>grep -P '^1\t' Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.bed \
> > Mahajan.NatGenet2018b.T2Dbmiadj.European.intersect.chr1.bed

*Here the ^ symbol indicates the beginning of the line and the \t indicates a tab directly following the 1; without it we might also match lines that begin with 11 or 12. In Subheading 3.3, use plink's built-in functionality to extract chromosome 1.

>plink --make-bed --vcf GTEx635Pass.Hg38.rsid.final.vcf.gz \
> --out GTEx635Pass.Hg38.rsid.final.noDups.chr1 --exclude duplicates.txt \
> --chr 1

Lastly, in Subheading 3.4, filter the eQTL file for genes that are on chromosome 1.

>grep -P '^1\t' lncRNAKB_hg38_v7.epi | cut -f2 > chr1.probes.txt
>grep -F -w -f chr1.probes.txt combined.lncRNAKB.visceral.cis.noDups.tsv \
> > combined.lncRNAKB.visceral.cis.noDups.chr1.tsv

You can then run the SMR command separately for each chromosome. If you want to conduct multiple testing correction, however, I suggest you still use the number of genes tested across all chromosomes as it is more stringent for a genome-wide test.
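The per-chromosome split described in Note 5 can also be done in one pass rather than with one grep per chromosome. The following Python sketch groups the lines of any tab-delimited file by the value in its first column. It assumes that the chromosome is in that column and that the file has no header line (as in the intersected bed and epi files); the function name and the example records are illustrative.

```python
from collections import defaultdict


def split_by_chromosome(lines, chrom_col=0):
    """Group tab-delimited lines by chromosome, preserving their order."""
    groups = defaultdict(list)
    for line in lines:
        chrom = line.rstrip("\n").split("\t")[chrom_col]
        groups[chrom].append(line)
    return dict(groups)


if __name__ == "__main__":
    # Made-up epi-style records on chromosomes 1 and 11.
    records = [
        "1\tlnckb.1\t0\t11874\tlnckb.1\t+\n",
        "11\tlnckb.9123\t0\t65422774\tlnckb.9123\t+\n",
        "1\tlnckb.4\t0\t36081\tlnckb.4\t-\n",
    ]
    for chrom, chunk in split_by_chromosome(records).items():
        # In practice each chunk would be written to its own file.
        print(chrom, len(chunk))
```

Comparing the whole first field, rather than a prefix, avoids the chromosome 1 versus 11 confusion that the grep pattern guards against with the trailing tab.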

Acknowledgments

This study was funded by NHLBI Division of Intramural Research funds to HC (1ZIAHL006103, 1ZIAHL006159).

References

1. Rinn JL, Chang HY (2012) Genome regulation by long noncoding RNAs. Annu Rev Biochem 81(1):145–166
2. Seifuddin F, Singh K, Suresh A et al (2019) lncRNAKB: a comprehensive knowledgebase of long non-coding RNAs. bioRxiv:669994
3. Uszczynska-Ratajczak B, Lagarde J, Frankish A et al (2018) Towards a complete map of the human long non-coding RNA transcriptome. Nat Rev Genet 19(9):535–548
4. Barbeira AN, Dickinson SP, Bonazzola R et al (2018) Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun 9(1):1825
5. Zhu Z, Zhang F, Hu H et al (2016) Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 48:481

Chapter 8

AnnoLnc: A One-Stop Portal to Systematically Annotate Novel Human Long Noncoding RNAs

De-Chang Yang, Lan Ke, Yang Ding, and Ge Gao

Abstract

While more than a hundred thousand long noncoding RNAs (lncRNAs) have been identified in the human genome, their biological functions and regulation are largely elusive. Here we present AnnoLnc, a one-stop online annotation portal for human lncRNAs (http://annolnc1.gao-lab.org/). As the first (and the most comprehensive) Web server to provide on-the-fly annotation for novel human lncRNAs, AnnoLnc exploits more than 700 data sources to annotate input lncRNAs systematically, spanning genomic location, secondary structure, expression patterns, coexpression-based functional annotation, transcriptional regulation, miRNA interaction, protein interaction, genetic association, and evolution. Moreover, in addition to a user-friendly Web interface, AnnoLnc can also be integrated into existing pipelines by either a set of JSON-based web service APIs or a stand-alone version for Linux servers.

Key words Long noncoding RNAs, Annotation, Web server

1 Introduction

Long noncoding RNAs (lncRNAs) are generally defined as RNAs longer than 200 nucleotides that do not code for proteins [1]. As a novel type of functional RNA, lncRNAs have become increasingly intriguing to researchers. First, a vast number of human lncRNAs have been (and more will be) identified: the latest version of GENCODE (version 29) harbors 29,566 human lncRNA transcripts [2], and the well-known noncoding RNA database NONCODE (version 5.0) curates 172,216 human lncRNA transcripts [3]. Second, lncRNAs have been discovered to play a role in a vast range of biological processes. On the one hand, lncRNAs are involved in a series of cellular biological processes, including cell proliferation [4], cell migration [5], apoptosis [6], differentiation [7], and the maintenance of stemness of stem cells [8]; on the other hand, lncRNAs are capable of regulating many

De-Chang Yang and Lan Ke contributed equally to this work. Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_8, © Springer Science+Business Media, LLC, part of Springer Nature 2021

De-Chang Yang et al.

different molecular aspects of the central dogma, such as DNA synthesis [9], DNA methylation [10], histone modification [11], transcription [12], splicing [13], translation [14], posttranslational modification [15], and interaction between DNA, RNA, and protein (e.g., [16–18]). Finally, previous studies have identified several different functioning modes for lncRNAs [19], including the decoy mode [20], the guide mode [21], the scaffold mode [21], and the enhancer mode [22].

Such diverse yet complicated characteristics make it extremely difficult to study lncRNA functions and mechanisms in a scalable way. The current wet-lab approach studies lncRNAs on a case-by-case basis, which is time consuming and costly. On the other hand, in silico bioinformatics analyses (including online algorithms and databases) can handle a long list of lncRNA transcripts efficiently, yet few of them can present a comprehensive annotation for both known and novel lncRNAs. In particular, current online algorithms support annotating novel lncRNAs, but they are all specialized for a single aspect of lncRNA (e.g., ncFANs [23] and Linc2GO [24] for function annotation, LncRNADisease [25] for disease, and lncPro [26] for lncRNA-protein interaction); as for the databases (e.g., lncRNAdb [27], LNCipedia [28], NPInter [29], and deepBase [30]), they cover a broad range of annotation types but are primarily restricted to known lncRNA transcripts and cannot annotate novel lncRNAs.

Here, we address this problem by presenting AnnoLnc [31], a one-stop portal for systematically annotating novel human lncRNAs. Based on more than 700 data sources and various tool chains, AnnoLnc accepts any lncRNA sequence (whether it is known or novel) and annotates it with nine different modules: genomic location, secondary structure, expression patterns, coexpression-based functional annotation, transcriptional regulation, miRNA interaction, protein interaction, genetic association, and evolution. The AnnoLnc web portal is freely accessible from http://annolnc1.gao-lab.org, and users can also download the command-line version of AnnoLnc from ftp://ftp.cbi.pku.edu.cn/pub/Annolnc/annolnc_v1.0.tar.gz for local batch analysis.

In this chapter, we present AnnoLnc, a one-stop portal for systematically annotating any human lncRNAs with known sequences. To the best of our knowledge, AnnoLnc is the first (and the most comprehensive) Web server capable of annotating human lncRNAs with only sequences provided; the broad coverage of its nine modules, callable by a single click, greatly facilitates the detailed examination of human lncRNAs. AnnoLnc has been evolving rapidly since its initial publication in 2016. An updated version, AnnoLnc2 [32], was developed during the publishing of this chapter and can be accessed at http://annolnc.gao-lab.org, with more species, updated annotation modules, and a more responsive interface. We recognize the importance of the Web server to the scientific community and will continue our

Systematically Annotate Novel LncRNA by AnnoLnc

efforts to maintain and update it in a timely manner to better serve the community in the future.

2 Materials

To use the AnnoLnc Web server, users need only a JavaScript-enabled web browser (the latest versions of Mozilla Firefox, Google Chrome, and Safari are recommended). The AnnoLnc Web server is also mobile friendly, based on Responsive Web Design. The AnnoLnc stand-alone version requires a computer with a modern Linux operating system, with a minimal requirement of 4 CPU cores and 8 GB memory.

3 Methods

3.1 Running AnnoLnc in a Web Browser

3.1.1 Description and Scope of AnnoLnc Web Server

The AnnoLnc Web server offers an interface for comprehensively annotating human lncRNAs. It accepts human lncRNA sequences as input and generates a full spectrum of annotations ranging from expression to evolution. A concise summary covering each module is displayed at the top of the annotation result page. To help users make better use of AnnoLnc and understand its principles, the “FAQ” page (http://annolnc1.gao-lab.org/help.jsp) lists common questions and answers about AnnoLnc. Meanwhile, methodological details of each module can be found on the “Methods” page (http://annolnc1.gao-lab.org/methods.jsp). Moreover, users can download supplementary files or contact the development team of AnnoLnc with any other questions through the “About” page (http://annolnc1.gao-lab.org/about.jsp).

3.1.2 Description of Input Data

The AnnoLnc Web server accepts input sequences in FASTA format, with no more than 100 sequences per query (Fig. 1). Click the “GO” button below the input box, and AnnoLnc will automatically annotate the input sequences. Each sequence name should consist of fewer than 100 characters, drawn from lower- and upper-case letters (A to Z and a to z), digits (0–9), underscore (_), dot (.), and hyphen (-). Each sequence should be a valid nucleotide sequence (i.e., consisting of A/a, C/c, G/g, T/t, and U/u only) longer than 20 bp and shorter than 100,000 bp.
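Checking a FASTA file against these limits before submission can save a failed query. The Python sketch below encodes the constraints listed above (at most 100 sequences, names shorter than 100 characters from the allowed character set, nucleotide-only sequences longer than 20 and shorter than 100,000 bases). It is an illustrative pre-check written for this chapter's stated rules, not code from the AnnoLnc server, and the function names are hypothetical.

```python
import re

NAME_RE = re.compile(r"^[A-Za-z0-9_.\-]+$")
SEQ_RE = re.compile(r"^[ACGTUacgtu]+$")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_annolnc_input(text):
    """Return a list of problems; an empty list means the input looks acceptable."""
    records = parse_fasta(text)
    problems = []
    if not records:
        problems.append("no FASTA records found")
    if len(records) > 100:
        problems.append("more than 100 sequences in one query")
    for name, seq in records:
        if not (len(name) < 100 and NAME_RE.match(name)):
            problems.append(f"{name!r}: invalid sequence name")
        if not SEQ_RE.match(seq):
            problems.append(f"{name!r}: characters other than A, C, G, T, U")
        if not 20 < len(seq) < 100000:
            problems.append(f"{name!r}: length must be >20 and <100,000")
    return problems
```

The whole header line after ">" is treated as the sequence name, so a header containing spaces or a description is reported as invalid.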

3.1.3 Description of Output Page

AnnoLnc returns a job status page for each submission, listing the status of all input sequences altogether (Fig. 2). By clicking the links under the Status column, users can visit the result page for each sequence (see Note 1). An example of the result page is given in Fig. 3. The top of the result page is an integrated summary that concisely describes the

Fig. 1 Input interface of AnnoLnc Web server

Fig. 2 An example job output status page

annotation results of each module. Users can click the hyperlinks within the summary or the left navigation bar to visit the detailed information for each module below. Furthermore, users can access an integrated view of pretuned custom tracks in the UCSC Genome Browser [33] (see Note 2). Finally, users can download a zipped annotation result by clicking the “Export as Zip” link at the top-left navigation bar for customized downstream analyses (see http://annolnc1.gao-lab.org/help.jsp#result_readme for format description).

For each module, users can view a detailed explanation by clicking the question mark sign next to the module name. We will briefly describe the general idea and output of each module in the

Fig. 3 An example of annotation result page

next section; readers who are interested in the technical details should refer to the original AnnoLnc paper [31] and its “FAQ” web page. 3.1.4 Description of Fundamentals and Outputs of Each Module Genomic Location

This is the first module AnnoLnc runs for each input sequence: AnnoLnc identifies the genomic location of the sequence by aligning it to the human reference genome hg19 with Blat [34] and selecting all its best alignments (see Note 3). AnnoLnc then determines whether each input sequence is novel by comparing its transcript structure to a merged annotation of GENCODE v19 [2] and lncRNAdb v2.0 [27] with Cuffcompare [35]. An input sequence is considered to be similar to a known transcript if its Cuffcompare code is “=” (complete match of intron chain) or “c” (contained), and a link to this known transcript in Ensembl or lncRNAdb will be provided; otherwise, it will be considered a novel lncRNA. Finally, AnnoLnc provides a contextual view of the alignments by embedding them within the UCSC Genome Browser. An example output of the “Genomic Location” module is shown in Fig. 4. The annotation results of an example input sequence are displayed, including its best alignment(s), transcript structure, and hyperlinked known transcripts similar to this lncRNA (if any) (Fig. 4a). In addition, AnnoLnc provides a link to the genomic context of the input sequence (200 bp upstream and downstream of the aligned position) in the UCSC Genome Browser, together with a set of pretuned custom tracks (Fig. 4b): the transcript structure of the input sequence (red) and flanking known transcripts from the merged annotation above (blue), the Haplotypes and Patches to the hg19 reference genome, the RefSeq-curated gene structure (dark blue), the expression levels in 53 tissues from 8555 GTEx RNA-Seq samples, and common SNPs in dbSNP (version 151).

Fig. 4 An example output of the “Genomic Location” module. (a) The best alignment(s), transcript structure, and known similar transcripts are shown. (b) A genomic context view of the input sequence in the UCSC Genome Browser is available

Secondary Structure

The secondary structure of RNA provides important information about its function, and functional substructures are more likely to be stable and conserved [36]. To help identify potential functional region(s) in the RNA secondary structure predicted by RNAfold [37], AnnoLnc provides four methods to color bases: by entropy or by PhyloP scores in primates, mammals, or vertebrates (see Note 4). Figure 5 displays an example output of the “Secondary Structure” module. AnnoLnc visualizes the secondary structure as a connected graph, where nodes are the nucleotide bases, black edges the backbone, and blue edges the predicted hydrogen bonds between base pairs (Fig. 5, top). Users can choose how to color the bases on the right side of this page; clicking the “Get” button brings users to an interactive graphical interface supporting panning and zooming (Fig. 5, bottom).

Fig. 5 An example output of the “Secondary Structure” module

Expression

A lncRNA’s expression profile provides useful clues to its function. For example, the lncRNA CCAT2 has been reported to be expressed more in colorectal cancer tissue than in the adjacent mucosae and has been subsequently shown to promote the growth and metastasis of colorectal cancer [38]. To provide a comprehensive view of the expression profile, AnnoLnc collected 64 RNA-Seq datasets, consisting of 34 normal samples (two replicates for each of 16 adult tissues, and a human embryonic stem (ES) cell line H1) and 30 TCGA cancer samples covering ten tissues. For a given input lncRNA, if it is considered to be similar to a known transcript (see Subheading “Genomic Location”), AnnoLnc will output the precomputed expression profile of this known transcript on these samples; otherwise, AnnoLnc will use LocExpress [39] to compute its expression profile on the fly. Figure 6 shows an example output of the “Expression” module. The top bar chart shows the expression profile of the lncRNA in normal tissues and ES cells, and the bottom bar chart shows its expression profile in cancer samples. Users can check the exact FPKM value in each tissue by hovering the mouse over the bars.

Fig. 6 An example output of the “Expression” module

Co-expression and Co-expression-Based Functional Annotation

It has been reported that genes with similar expression patterns might have similar functions [40]. AnnoLnc follows this guilt-by-association heuristic and annotates lncRNAs with the functions of co-expressed protein-coding genes. Briefly, AnnoLnc first collected protein-coding genes that are likely to be expressed (sum of gene FPKM in all samples ≥ 1) without strong tissue specificity (rsgcc [41] “getsgene” score ≤ 0.85). Then, for each input lncRNA sequence, AnnoLnc identifies all protein-coding genes whose expression profiles are strongly correlated with this lncRNA in normal tissues or cancer samples and assigns to this lncRNA all the statistically significantly enriched Gene Ontology (GO) terms of these protein-coding genes using GOstats [42], with positively and negatively correlated genes analyzed separately. An example output of the “Co-expression” module is displayed in Fig. 7a, where AnnoLnc reports both positively and negatively correlated genes in either normal tissues or cancer samples, with the Ensembl ID, gene symbol, and Pearson’s correlation coefficient (r) provided for each correlated gene. Figure 7b displays an example output of the “Functional Annotation” module, where AnnoLnc outputs each enriched GO term along with its description and p-value. Only GO terms in the namespaces “Biological Process” and “Molecular Function” are displayed. Users can filter annotation results by p-value using the drop-down list and filter for genes and GO terms of interest using the “Search” box.

Fig. 7 An example output of the (a) “Co-expression” and (b) “Functional Annotation” modules

Transcriptional Regulation

Fig. 8 An example output of the “Transcriptional Regulation” module

Transcriptional regulation can be crucial to lncRNA expression [43] and potentially its function. In the “Transcriptional Regulation” module, AnnoLnc currently has integrated 498 ENCODE [44] ChIP-Seq (Chromatin ImmunoPrecipitation-Sequencing, a high-throughput profiling technique for DNA–protein interaction) datasets covering 159 transcription factors (TFs) in 45 cell lines. Then, for each input transcript, AnnoLnc will search for putative transcription factor binding sites within 5 kb upstream and 1 kb downstream of this transcript and classify all sites based on their relative position to the transcript as either “upstream transcriptional start site (TSS)”, “overlap with TSS,” “inside the lncRNA locus,” “overlap with transcriptional end site (TES),” or “downstream TES.” Figure 8 shows an example output of the “Transcriptional Regulation” module. For each TF binding site included, AnnoLnc outputs the relevant TFs, the cell type and treatment of the cell where the ChIP-Seq experiment was carried out, and the class of relative location (Up TSS: upstream transcriptional start site; Overlap TSS: overlap with TSS; inside: inside the lncRNA locus; Overlap TES: overlap with transcriptional end site; Down TES: downstream TES). In addition, users can check “Assign peaks to the closest gene” (as suggested by [45]) to exclude TF binding sites that are closer to other genes in the merged annotation (see Subheading “Genomic Location”) than to the input transcript and use the “Search” box to filter for TFs of interest by their name and cell type.
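The five position classes described above can be illustrated with a short function. This is a sketch under stated assumptions, not AnnoLnc’s actual implementation: coordinates are taken as 0-based half-open intervals, a site that spans the whole transcript is reported as “Overlap TSS”, and the function name and signature are hypothetical. The 5 kb upstream and 1 kb downstream window comes from the text.

```python
def classify_site(site_start, site_end, tx_start, tx_end, strand="+",
                  upstream=5000, downstream=1000):
    """Classify a TF binding site relative to a transcript.

    Intervals are 0-based, half-open genomic coordinates (an assumption).
    Returns None when the site lies outside the search window.
    """
    if strand == "-":
        # Mirror the coordinates so that the TSS is always the left boundary.
        site_start, site_end = -site_end, -site_start
        tx_start, tx_end = -tx_end, -tx_start
    tss, tes = tx_start, tx_end

    # Outside the window of 5 kb upstream to 1 kb downstream.
    if site_end <= tss - upstream or site_start >= tes + downstream:
        return None
    if site_end <= tss:
        return "Up TSS"
    if site_start >= tes:
        return "Down TES"
    if site_start < tss:
        # Assumption: a site covering the whole locus is labeled by its TSS overlap.
        return "Overlap TSS"
    if site_end > tes:
        return "Overlap TES"
    return "Inside"
```

For a plus-strand transcript at 1000 to 2000, a peak at 990 to 1010 is classified as “Overlap TSS”, whereas for a minus-strand transcript at the same coordinates a peak at 2100 to 2200 is “Up TSS”.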


Fig. 9 An example output of the “miRNA Interaction” module

miRNA Interaction

It is important to catalog miRNAs interacting with a given lncRNA because a number of lncRNAs have been found to bind miRNAs competitively and thereby suppress the regulation of their target mRNAs (e.g., the lncRNA PTENP1 [20, 46, 47]). In the module “miRNA Interaction,” AnnoLnc predicts miRNAs interacting with the given lncRNA in three steps: (1) it first predicts all possible candidate miRNAs and corresponding binding sites using TargetScan [48] on 87 miRNA families (covering 208 highly conserved miRNAs) from miRcode [49]; (2) it then computes the conservation score [49] for each predicted binding site; (3) finally, it uses the lncRNA’s hg19 alignment above in “Genomic Location” to examine whether each predicted binding site overlaps with (i.e., is supported by) aligned reads from a manually curated set of 61 AGO CLIP-Seq (CrossLinking-ImmunoPrecipitation followed by Sequencing; a high-throughput technique to sequence all RNAs interacting with a given protein [50]) experiments from GEO [51, 52] (see Note 5). Figure 9 shows an example output of the “miRNA Interaction” module. AnnoLnc lists all the miRNA families predicted to interact with the given lncRNA; for each predicted binding site of a miRNA family, AnnoLnc outputs its start and end coordinates, its conservation score (in primates, mammals, and vertebrates), and finally whether it is supported by CLIP-Seq datasets.
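Step (3) above is an interval-overlap test. The sketch below shows the idea under simplifying assumptions: predicted sites and CLIP-supported regions are already expressed in the same coordinate system as 0-based half-open intervals, and the dictionary keys and function names are hypothetical rather than AnnoLnc’s own data model.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def annotate_clip_support(sites, clip_regions):
    """Flag each predicted miRNA binding site that overlaps a CLIP region.

    sites: list of dicts with 'family', 'start', 'end' (hypothetical layout).
    clip_regions: list of (start, end) tuples on the same sequence.
    """
    annotated = []
    for site in sites:
        supported = any(
            overlaps(site["start"], site["end"], start, end)
            for start, end in clip_regions
        )
        annotated.append({**site, "clip_supported": supported})
    return annotated
```

A linear scan is sufficient for the handful of sites on a single transcript; an interval tree would be the usual choice if many transcripts were screened against a large CLIP collection.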

Protein Interaction

Aside from miRNAs, it is also very important to catalog lncRNA-interacting proteins, as previous studies have identified some lncRNAs that exert their functions by interacting with proteins [53]. In the module “Protein Interaction,” AnnoLnc annotates interacting proteins with two parallel approaches: (1) by scanning a manually curated set of 112 non-AGO CLIP-Seq datasets spanning 51 RNA-binding proteins (see Note 5); (2) by running the sequence-based prediction algorithm lncPro [26] (see Note 6). Figure 10 shows an example output of the “Protein Interaction” module. The left table lists all proteins predicted to interact with the given lncRNA by lncPro, including their gene symbols (linked to HGNC records), interaction scores, and p-values (see Note 6); to exclude unnecessary false positives, AnnoLnc keeps only those proteins whose interaction score is larger than 92.4 (which is the 99.9% quantile of the lncRNA’s lncPro score on shuffled protein sequences). According to lncPro [26], the larger the score, the more likely the predicted interaction is to be true. The right table lists all interacting proteins supported by CLIP-Seq, including their gene symbols (linked to NCBI Gene records), cell types, treatments, and p-values produced by PIPE-CLIP [54] (see Note 5 for more details about PIPE-CLIP).

Fig. 10 An example output of the “Protein Interaction” module

Genetic Association

In addition to evolutionary analysis, another approach to studying lncRNA functions with no mechanistic priors is to locate within them SNPs from Genome-Wide Association Studies (GWAS), which is of interest given that 88% of trait-associated GWAS SNPs occur within noncoding regions [55]. In the module “Genetic Association,” AnnoLnc first scans for all GWAS Catalog SNPs (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz; also referred to as “linked SNP”) within the given lncRNA transcript as well as the neighboring regions (5 kb upstream to 1 kb downstream), locates for these linked SNPs the haplotype regions they are within (r² > 0.5; based on ftp://ftp.ncbi.nlm.nih.gov/hapmap/ld_data/2009-04_rel27/), and finally assigns all traits/diseases associated with the tag SNPs of these haplotype regions (as compiled by the NHGRI GWAS Catalog [56]) to the lncRNA in question. Figure 11 shows an example output of the “Genetic Association” module. For each linked SNP, AnnoLnc displays (1) its corresponding tag SNP; (2) the trait the tag SNP is associated with, the p-value and (statistical) significance of this association, and the PubMed ID of the GWAS study in question; and (3) the precomputed Linkage (linkage disequilibrium, LD) value between the linked SNP and the tag SNP, along with the population(s) from which the LD value is derived (downloaded from ftp://ftp.ncbi.nlm.nih.gov/hapmap/ld_data/2009-04_rel27/). The statistical significance is “Yes” when the p-value is no more than 5e-8 and “No” otherwise. AnnoLnc uses three-letter abbreviations for populations, as 1000 Genomes does [57] (see https://www.internationalgenome.org/faq/which-populations-are-part-your-study for the explanation of these population abbreviations).

Fig. 11 An example output of the “Genetic Association” module

Evolution

Evolutionarily conserved sequences might suggest important functions; therefore, one useful mechanism-free approach to selecting candidate functional lncRNAs is to scan for those lncRNAs with conserved sequences. In the module “Evolution,” AnnoLnc annotates evolutionary features in three ways: (1) it computes the average 46-way PhyloP scores within primates, placental mammals, or vertebrates for both the exon and promoter regions (1 kb upstream of transcription start sites) of the given lncRNA; (2) it computes the average YRI population-based derived allele frequency (DAF) [58] for both the exon and promoter regions of the given lncRNA; (3) it lists all phastCons [59] conserved elements flanking the given lncRNA (5 kb upstream of transcription start sites or 1 kb downstream of transcription stop sites, and no shorter than 20 bp) from UCSC (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database). Figure 12 shows an example output of the “Evolution” module. The upper-left bar chart displays the PhyloP scores, where a value larger than 1 indicates strong conservation and a value smaller than −1 indicates accelerated evolution. The upper-right bar chart displays the average DAF values, where a value smaller than 0.1 indicates very strong purifying selection. The bottom table describes the phastCons conserved elements flanking the given lncRNA, including position, length, and the log-odds scores produced by phastCons; a large log-odds score typically indicates a strongly conserved element and vice versa.

Fig. 12 An example output of the “Evolution” module

3.1.5 Batch-Annotating Human LncRNAs via the AnnoLnc API

To programmatically process multiple sequences at once, AnnoLnc provides three JSON-based web service APIs for submitting input sequences (Upload), checking the job status (Info), and fetching the results (Fetch). Users can use the APIs via any language that can handle an HTTP request and response after getting a valid API token online (see http://annolnc1.gao-lab.org/api.jsp for details about acquiring the token). The detailed description of these APIs, along with a simple demo implemented in Python 2.7 (as shown in Fig. 13), can be found on AnnoLnc’s API web page (http://annolnc1.gao-lab.org/api.jsp). To prevent server overload, each user can handle at most two jobs at the same time.
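When more sequences must be annotated than a single query allows, the two limits stated in this chapter (100 sequences per query and two simultaneous jobs per user) suggest a simple batching pattern. The Python 3 sketch below is hypothetical scaffolding only: `run_job` stands for a user-written function that wraps the Upload, Info, and Fetch calls documented on the API page, and no endpoint or parameter of the real API is reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SEQS_PER_QUERY = 100   # per-query limit of the Web server
MAX_CONCURRENT_JOBS = 2    # per-user limit on simultaneous jobs


def chunk(records, size=MAX_SEQS_PER_QUERY):
    """Split a list of FASTA records into batches of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def annotate_in_batches(records, run_job):
    """Run `run_job` on every batch, never more than two at a time.

    `run_job` is a caller-supplied (hypothetical) function that submits one
    batch, waits for it to finish, and returns its results.
    """
    batches = chunk(records)
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_JOBS) as pool:
        # pool.map preserves the order of the input batches.
        return list(pool.map(run_job, batches))
```

Capping the worker pool at two keeps the client within the server’s concurrency limit without any explicit locking.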

3.2 Running AnnoLnc in a Terminal

Below are step-by-step instructions for installing the AnnoLnc stand-alone version based on the CentOS/RedHat distro (see Note 7).

3.2.1 Installing the AnnoLnc Stand-Alone Version

1. Download the stand-alone package “annolnc_v1.0.tar.gz” and corresponding data packages from http://annolnc1.gao-lab.org/download.jsp and organize them accordingly.

2. Install basic dependencies by running the following (note that Python 2.7 will be installed; currently, AnnoLnc does not support Python 3):

    yum install python-devel
    yum install python2-pip
    yum install java
    yum install R
    yum install BEDTools
    yum -y install libcurl libcurl-devel
    yum -y install libxml2 libxml2-devel

3. Install R packages by running the following (if the repository does not work, please use a different repo listed at https://cran.r-project.org/mirrors.html):

    R -e "install.packages('stringr', repos='https://cloud.r-project.org/')"
    R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org/"); BiocManager::install("GOstats", version = "3.8", update=FALSE)'
    R -e "install.packages('snowfall', repos='https://cloud.r-project.org/')"
    R -e "install.packages('plyr', repos='https://cloud.r-project.org/')"

4. Install Python packages by running the following:

    pip install --upgrade pip
    pip install biopython
    pip install bx-python==0.7.3 mako paste sqlalchemy routes webob mercurial bz2file webhelpers kombu h5py pysam pycrypto

Fig. 13 A Python-based demo script for programming with AnnoLnc API, which is also available at http://annolnc1.gao-lab.org/api.jsp. Note that the user should replace the value of “email” and “token” with his/her own email address and token applied before running the demo. In addition, the user should use the job_id value retrieved by r1 in subsequent requests (r2, r3, and r4)

3.2.2 Running the AnnoLnc Stand-Alone Version

AnnoLnc can be run from any working directory. Below is an example usage where the current working directory is assumed to be the uncompressed annolnc directory:

    sh annolnc.sh -i (input_sequence_file) -o (output_directory) -s (reference_genome) -m (module_name)

where -i specifies the input lncRNA fasta file to annotate (e.g., input_demo/test.fa), -o the output directory to put annotations (e.g., ./output), -s the species and genome build to use (e.g., hg19), and -m the annotation modules to use (e.g., “/expr” for calling expression profile). Additionally, run “sh annolnc.sh -h” to view the complete help page.
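For scripted runs, the command line above can be assembled and launched from Python. The following is a minimal Python 3 wrapper sketch: the function names are hypothetical, only the four options described above are used, and the wrapper itself is independent of the Python 2.7 environment that AnnoLnc requires internally.

```python
import subprocess


def build_annolnc_command(fasta, outdir, genome, module, script="annolnc.sh"):
    """Assemble the argument list for one stand-alone AnnoLnc run."""
    return ["sh", script, "-i", fasta, "-o", outdir, "-s", genome, "-m", module]


def run_annolnc(fasta, outdir, genome, module, annolnc_dir="."):
    """Run AnnoLnc from its uncompressed directory; raises on a nonzero exit."""
    cmd = build_annolnc_command(fasta, outdir, genome, module)
    return subprocess.run(cmd, cwd=annolnc_dir, check=True)
```

For instance, `run_annolnc("input_demo/test.fa", "./output", "hg19", "/expr", annolnc_dir="annolnc")` would correspond to the example values given above, with `annolnc` standing in for wherever the package was uncompressed.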


Fig. 14 An example of the organization of output directory specified by “-o”. Here, the output directory is named “result_test1” and contains the result directories for two input sequences: “H19” (“H19_trans1”) and “HOTAIR” (“HOTAIR_trans1”)

3.2.3 Results of the AnnoLnc Stand-Alone Version

The results are written to the output directory (specified by “-o” above) and are organized in a query sequence-first, module-second hierarchy as shown in Fig. 14: the result directory for a given input sequence X is named “X_trans1” and contains the results of all modules executed.
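The sequence-first, module-second layout makes it straightforward to collect results programmatically. The sketch below lists, for each query, the entries found in its “X_trans1” directory; only the “_trans1” suffix comes from the description above, while the function name and the handling of any other directory names are assumptions.

```python
from pathlib import Path

SUFFIX = "_trans1"  # result directory for input sequence X is "X_trans1"


def list_results(output_dir):
    """Map each query name to the sorted entries in its result directory."""
    results = {}
    for seq_dir in sorted(Path(output_dir).iterdir()):
        if seq_dir.is_dir() and seq_dir.name.endswith(SUFFIX):
            query = seq_dir.name[: -len(SUFFIX)]
            results[query] = sorted(p.name for p in seq_dir.iterdir())
    return results
```

For the example in Fig. 14, `list_results("result_test1")` would return a dictionary with the keys “H19” and “HOTAIR”.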

4 Notes

1. While there is no formal support for downloading multiple lncRNAs’ annotations in batch, the user can loop over each lncRNA’s “seqName” and fetch annotations via the API (see Subheading 3.1.5 for more details about how to use the API).

2. Users are encouraged to integrate AnnoLnc customized tracks with built-in ones for a more comprehensive view.

3. When an input sequence is mapped to multiple loci, AnnoLnc will run analyses for all these loci, with the result for each locus displayed independently. Meanwhile, if the sequence cannot be mapped to the reference genome build (hg19) reliably, AnnoLnc will continue running all “sequence level” analyses such as secondary structure and protein interaction.

4. It is possible that only the entropy-colored plot, or none of these plots, is available for display. Possible reasons include: (1) no phyloP scores are available for the particular genomic region; or (2) the input lncRNA is too long. To avoid unnecessary workload, the secondary structure module will refuse to work when the length of the input sequence is longer than 20,000 nt (which usually takes ~400 s).

5. The following is excerpted from AnnoLnc’s BMC Genomics paper [31]. “For a given CLIP-Seq dataset, AnnoLnc first trimmed the adapter with FASTX Clipper (http://hannonlab.cshl.edu/fastx_toolkit/), and only reads longer than 15 nt were kept and mapped to the human genome hg19 by the algorithm BWA-backtrack (v0.7.10-r789) [60] with the options “-n 1 -i 0” (i.e., allowing one alignment error). Then, only uniquely mapped reads were kept. To improve precision, AnnoLnc used stringent criteria for crosslinking site calling with PIPE-CLIP v1.0.0 [54]; the FDR cutoffs for both enriched clusters and reliable mutations were set as 0.05 (crosslinking sites in HITS-CLIP data identified by deletion, insertion and substitution were combined).”

6. The statistical significance of each lncPro hit is estimated based on the empirical null distribution generated by random shuffling [31].

7. Please also check the online guide (http://annolnc1.gao-lab.org/download.jsp) for up-to-date information. In late 2019, we provided scripts (install_centos7.sh and install_ubuntu.sh) for automatic installation (see http://annolnc1.gao-lab.org/download.jsp). Please feel free to contact us ([email protected]) if you have any questions when using these scripts.
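The shuffling-based significance estimate mentioned in Note 6 follows a general pattern that can be sketched in a few lines. This is an illustration of the generic technique, not of lncPro or of AnnoLnc’s exact procedure: `score_fn` is a placeholder for any interaction-scoring function, and the add-one correction in the p-value is a common convention chosen here as an assumption.

```python
import random


def shuffled_null(score_fn, rna, protein, n=1000, seed=0):
    """Score the RNA against `n` random shuffles of the protein sequence."""
    rng = random.Random(seed)
    residues = list(protein)
    null_scores = []
    for _ in range(n):
        rng.shuffle(residues)
        null_scores.append(score_fn(rna, "".join(residues)))
    return null_scores


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 terms (an assumption, not taken from AnnoLnc) keep the estimate
    strictly positive for a finite number of shuffles.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)
```

A score cutoff such as the 99.9% quantile described under “Protein Interaction” corresponds to taking a high quantile of such a null distribution.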

Acknowledgments

Funding: This work was supported by funds from the National Key Research and Development Program (2016YFC0901603), the China 863 Program (2015AA020108), the State Key Laboratory of Protein and Plant Gene Research and the Beijing Advanced Innovation Center for Genomics (ICG) at Peking University, as well as the Clinical Medicine Plus X - Young Scholars Project of Peking University and the Fundamental Research Funds for the Central Universities. The research of G.G. was supported in part by the National Program for Support of Top-notch Young Professionals. Part of the analysis was performed on the Computing Platform of the Center for Life Sciences of Peking University, and we thank Dr. Fangjin Chen for his help.

References

1. Quinn JJ, Chang HY (2016) Unique features of long non-coding RNA biogenesis and function. Nat Rev Genet 17:47–62. https://doi.org/10.1038/nrg.2015.10
2. Harrow J, Frankish A, Gonzalez JM et al (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22:1760–1774. https://doi.org/10.1101/gr.135350.111
3. Fang S, Zhang L, Guo J et al (2018) NONCODEV5: a comprehensive annotation database for long non-coding RNAs. Nucleic Acids Res 46:D308–D314. https://doi.org/10.1093/nar/gkx1107
4. Yang F, Xue X, Zheng L et al (2014) Long non-coding RNA GHET1 promotes gastric carcinoma cell proliferation by increasing c-Myc mRNA stability. FEBS J 281:802–813. https://doi.org/10.1111/febs.12625
5. Tano K, Mizuno R, Okada T et al (2010) MALAT-1 enhances cell motility of lung adenocarcinoma cells by influencing the expression of motility-related genes. FEBS Lett 584:4575–4580. https://doi.org/10.1016/j.febslet.2010.10.008
6. Takahashi Y, Sawada G, Kurashige J et al (2014) Amplification of PVT-1 is involved in poor prognosis via apoptosis inhibition in colorectal cancers. Br J Cancer 110:164–171. https://doi.org/10.1038/bjc.2013.698
7. Wang P, Xue Y, Han Y et al (2014) The STAT3-binding long noncoding RNA lnc-DC controls human dendritic cell differentiation. Science 344:310–313. https://doi.org/10.1126/science.1251456
8. Guttman M, Donaghey J, Carey BW et al (2011) lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature 477:295–300. https://doi.org/10.1038/nature10398
9. Zhang X, Rice K, Wang Y et al (2010) Maternally Expressed Gene 3 (MEG3) Noncoding ribonucleic acid: isoform structure, expression, and functions. Endocrinology 151:939–947. https://doi.org/10.1210/en.2009-0657
10. Zhou J, Yang L, Zhong T et al (2015) H19 lncRNA alters DNA methylation genome wide by regulating S-adenosylhomocysteine hydrolase. Nat Commun 6:10221. https://doi.org/10.1038/ncomms10221
11. Ohhata T, Matsumoto M, Leeb M et al (2015) Histone H3 lysine 36 trimethylation is established over the Xist promoter by antisense Tsix transcription and contributes to repressing Xist expression. Mol Cell Biol 35:3909–3920. https://doi.org/10.1128/MCB.00561-15
12. Petruk S, Sedkov Y, Riley KM et al (2006) Transcription of bxd noncoding RNAs promoted by Trithorax represses Ubx in cis by transcriptional interference. Cell 127:1209–1221. https://doi.org/10.1016/j.cell.2006.10.039
13. Beltran M, Puig I, Pena C et al (2008) A natural antisense transcript regulates Zeb2/Sip1 gene expression during Snail1-induced epithelial-mesenchymal transition. Genes Dev 22:756–769. https://doi.org/10.1101/gad.455708
14. Schein A, Zucchelli S, Kauppinen S et al (2016) Identification of antisense long noncoding RNAs that function as SINEUPs in human cells. Sci Rep 6:33605. https://doi.org/10.1038/srep33605
15. Ferdin J, Nishida N, Wu X et al (2013) HINCUTs in cancer: hypoxia-induced noncoding ultraconserved transcripts. Cell Death Differ 20:1675–1687. https://doi.org/10.1038/cdd.2013.119
16. Cabianca DS, Casa V, Bodega B et al (2012) A long ncRNA links copy number variation to a Polycomb/Trithorax epigenetic switch in FSHD muscular dystrophy. Cell 149:819–831. https://doi.org/10.1016/j.cell.2012.03.035
17. Yang L, Lin C, Jin C et al (2013) lncRNA-dependent mechanisms of androgen-receptor-regulated gene activation programs. Nature 500:598–602. https://doi.org/10.1038/nature12451
18. Shamovsky I, Ivannikov M, Kandel ES et al (2006) RNA-mediated response to heat shock in mammalian cells. Nature 440:556–560. https://doi.org/10.1038/nature04518


19. Rinn JL, Chang HY (2012) Genome regulation by long noncoding RNAs. Annu Rev Biochem 81:145–166. https://doi.org/10.1146/annurev-biochem-051410-092902
20. Johnsson P, Ackley A, Vidarsdottir L et al (2013) A pseudogene long-noncoding-RNA network regulates PTEN transcription and translation in human cells. Nat Struct Mol Biol 20:440–446. https://doi.org/10.1038/nsmb.2516
21. Tsai M-C, Manor O, Wan Y et al (2010) Long noncoding RNA as modular scaffold of histone modification complexes. Science 329:689–693. https://doi.org/10.1126/science.1192002
22. Trimarchi T, Bilal E, Ntziachristos P et al (2014) Genome-wide mapping and characterization of notch-regulated long noncoding RNAs in acute leukemia. Cell 158:593–606. https://doi.org/10.1016/j.cell.2014.05.049
23. Liao Q, Xiao H, Bu D et al (2011) ncFANs: a web server for functional annotation of long non-coding RNAs. Nucleic Acids Res 39:W118–W124. https://doi.org/10.1093/nar/gkr432
24. Liu K, Yan Z, Li Y, Sun Z (2013) Linc2GO: a human LincRNA function annotation resource based on ceRNA hypothesis. Bioinformatics 29:2221–2222. https://doi.org/10.1093/bioinformatics/btt361
25. Chen G, Wang Z, Wang D et al (2012) LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Res 41:D983–D986. https://doi.org/10.1093/nar/gks1099
26. Lu Q, Ren S, Lu M et al (2013) Computational prediction of associations between long non-coding RNAs and proteins. BMC Genomics 14:651. https://doi.org/10.1186/1471-2164-14-651
27. Dinger ME, Quek XC, Signal B et al (2014) lncRNAdb v2.0: expanding the reference database for functional long noncoding RNAs. Nucleic Acids Res 43:D168–D173. https://doi.org/10.1093/nar/gku988
28. Volders PJ, Anckaert J, Verheggen K et al (2019) Lncipedia 5: towards a reference set of human long non-coding RNAs. Nucleic Acids Res 47:D135–D139. https://doi.org/10.1093/nar/gky1031
29. Hao Y, Wu W, Li H et al (2016) NPInter v3.0: an upgraded database of noncoding RNA-associated interactions. Database 2016:baw057. https://doi.org/10.1093/database/baw057
30. Zheng L-L, Li J-H, Wu J et al (2016) deepBase v2.0: identification, expression, evolution and function of small RNAs, LncRNAs and circular RNAs from deep-sequencing data. Nucleic Acids Res 44:D196–D202. https://doi.org/10.1093/nar/gkv1273
31. Hou M, Tang X, Tian F et al (2016) AnnoLnc: a web server for systematically annotating novel human lncRNAs. BMC Genomics 17:931. https://doi.org/10.1186/s12864-016-3287-9
32. Ke L, Yang D-C, Wang Y, Ding Y, Gao G (2020) AnnoLnc2: the one-stop portal to systematically annotate novel lncRNAs for human and mouse. Nucleic Acids Res 48:W230–W238. https://doi.org/10.1093/nar/gkaa368
33. Kent WJ, Sugnet CW, Furey TS et al (2002) The human genome browser at UCSC. Genome Res 12:996–1006. https://doi.org/10.1101/gr.229102
34. Kent WJ (2002) BLAT---the BLAST-like alignment tool. Genome Res 12:656–664. https://doi.org/10.1101/gr.229202
35. Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515. https://doi.org/10.1038/nbt.1621
36. Smith MA, Gesell T, Stadler PF, Mattick JS (2013) Widespread purifying selection on RNA structure in mammals. Nucleic Acids Res 41:8220–8236. https://doi.org/10.1093/nar/gkt596
37. Lorenz R, Bernhart SH, Höner zu Siederdissen C et al (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 6:26. https://doi.org/10.1186/1748-7188-6-26
38. Ling H, Spizzo R, Atlasi Y et al (2013) CCAT2, a novel noncoding RNA mapping to 8q24, underlies metastatic progression and chromosomal instability in colon cancer. Genome Res 23:1446–1461. https://doi.org/10.1101/gr.152942.112
39. Hou M, Tian F, Jiang S et al (2016) LocExpress: a web server for efficiently estimating expression of novel transcripts. BMC Genomics 17:1023. https://doi.org/10.1186/s12864-016-3329-3
40. Guttman M, Amit I, Garber M et al (2009) Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458:223–227. https://doi.org/10.1038/nature07672
41. Ma C, Wang X (2012) Application of the Gini correlation coefficient to infer regulatory relationships in transcriptome analysis. Plant Physiol 160:192–203. https://doi.org/10.1104/pp.112.201962

42. Falcon S, Gentleman R (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23:257–258. https://doi.org/10.1093/bioinformatics/btl567
43. Wu Z, Liu X, Liu L et al (2014) Regulation of lncRNA expression. Cell Mol Biol Lett 19:561. https://doi.org/10.2478/s11658-014-0212-6
44. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74. https://doi.org/10.1038/nature11247
45. Sikora-Wohlfeld W, Ackermann M, Christodoulou EG et al (2013) Assessing computational methods for transcription factor target gene identification based on ChIP-seq data. PLoS Comput Biol 9:e1003342. https://doi.org/10.1371/journal.pcbi.1003342
46. Poliseno L, Salmena L, Zhang J et al (2010) A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465:1033–1038. https://doi.org/10.1038/nature09144
47. Song MS, Carracedo A, Salmena L et al (2011) Nuclear PTEN regulates the APC-CDH1 tumor-suppressive complex in a phosphatase-independent manner. Cell 144:187–199. https://doi.org/10.1016/j.cell.2010.12.020
48. Grimson A, Farh KKH, Johnston WK et al (2007) MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell 27:91–105. https://doi.org/10.1016/j.molcel.2007.06.017
49. Jeggari A, Marks DS, Larsson E (2012) miRcode: a map of putative microRNA target sites in the long non-coding transcriptome. Bioinformatics 28:2062–2063. https://doi.org/10.1093/bioinformatics/bts344
50. Hafner M, Lianoglou S, Tuschl T, Betel D (2012) Genome-wide identification of miRNA targets by PAR-CLIP. Methods 58:94–105. https://doi.org/10.1016/j.ymeth.2012.08.006
51. Edgar R, Domrachev M, Lash AE (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30:207–210
52. Barrett T, Wilhite SE, Ledoux P et al (2013) NCBI GEO: archive for functional genomics data sets - update. Nucleic Acids Res 41:D991–D995. https://doi.org/10.1093/nar/gks1193
53. Wang KC, Chang HY (2011) Molecular mechanisms of long noncoding RNAs. Mol Cell 43:904–914. https://doi.org/10.1016/j.molcel.2011.08.018
54. Chen B, Yun J, Kim MS et al (2014) PIPE-CLIP: a comprehensive online tool for CLIP-seq data analysis. Genome Biol 15:R18. https://doi.org/10.1186/gb-2014-15-1-r18
55. Hindorff LA, Sethupathy P, Junkins HA et al (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367. https://doi.org/10.1073/pnas.0903103106
56. Welter D, MacArthur J, Morales J et al (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42:D1001–D1006. https://doi.org/10.1093/nar/gkt1229
57. 1000 Genomes Project Consortium, Abecasis GR, Auton A et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65. https://doi.org/10.1038/nature11632
58. 1000 Genomes Project Consortium, Abecasis GR, Altshuler D et al (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073. https://doi.org/10.1038/nature09534
59. Siepel A, Bejerano G, Pedersen JS et al (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034–1050. https://doi.org/10.1101/gr.3715005
60. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595. https://doi.org/10.1093/bioinformatics/btp698

Chapter 9

Annotation of Full-Length Long Noncoding RNAs with Capture Long-Read Sequencing (CLS)

Sílvia Carbonell Sala, Barbara Uszczyńska-Ratajczak, Julien Lagarde, Rory Johnson, and Roderic Guigó

Abstract

Metazoan genomes produce thousands of long-noncoding RNAs (lncRNAs), of which just a small fraction have been well characterized. Understanding their biological functions requires accurate annotations, or maps of the precise location and structure of genes and transcripts in the genome. Current lncRNA annotations are limited by compromises between quality and size, with many gene models being fragmentary or uncatalogued. To overcome this, the GENCODE consortium has developed RNA capture long-read sequencing (CLS), an approach combining targeted RNA capture with third-generation long-read sequencing. CLS provides accurate annotations at high-throughput rates. It eliminates the need for noisy transcriptome assembly from short reads, and requires minimal manual curation. The full-length transcript models produced are of quality comparable to present-day manually curated annotations. Here we describe a detailed CLS protocol, from probe design through long-read sequencing to creation of final annotations.

Key words lncRNAs, Next-generation sequencing, NGS, Targeted RNA sequencing, CaptureSeq, Long-read RNA sequencing, PacBio, Nanopore, Genome annotation, GENCODE

1 Introduction

The vast majority of transcribed sequences in mammalian and vertebrate genomes do not encode proteins and are called noncoding RNAs. Long-noncoding RNAs (lncRNAs) are noncoding RNA transcripts longer than 200 nucleotides [1]. In recent years a growing number of lncRNAs have been functionally associated with various biological and pathological processes, including cancer [2–4]. However, the vast majority (~98%) of them are yet to be functionally characterized. The assignment of lncRNA functions relies on high-quality annotation of their gene structure and boundaries, which is currently lacking.

Sílvia Carbonell Sala and Barbara Uszczyńska-Ratajczak contributed equally to this work.

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_9, © Springer Science+Business Media, LLC, part of Springer Nature 2021


LncRNAs are particularly challenging to annotate compared to other gene types, due to their low expression, low evolutionary conservation, high tissue specificity, and unknown sequence–function relationship (in contrast to open reading frames in protein-coding genes) [1, 5, 6]. Consequently, present-day lncRNA annotations are far from complete. This incompleteness takes two forms: first, many lncRNA loci are entirely missing from annotations; second, lncRNA annotations are fragmentary and often represent only parts of actual gene structures. In other words, a high proportion of lncRNA annotations lack correct 5′ or 3′ ends, and their true positions may lie many thousands of base pairs away from the annotated location. The ramifications of this for functional studies are profound [7]. Just one example is the requirement for CRISPR functional screens to design targeting constructs within a window of ~150 bp of the target gene's correct 5′ end [8]. Thus, lncRNA gene annotations are a critical resource that presently suffers from serious weaknesses.

RNA-sequencing (RNA-seq), based on next-generation sequencing (NGS) and coupled to computational methods, has driven the rapid growth of lncRNA catalogues [9]. However, in the context of annotation, the conventional form of RNA-seq based on short reads of ~150 bp suffers from severe limitations. Although such methods produce hundreds of millions of reads, lowly expressed transcripts like lncRNAs are weakly sampled [10]. Moreover, such reads are much shorter than the average lncRNA transcript, and thus must be assembled computationally to build transcript models [11, 12]. Such "transcriptome assembly" is a challenging informatics problem, and the resulting transcript assemblies suffer from a variety of issues, leading to the incompleteness described above [7, 12].

The problem of transcript incompleteness is reduced in manually curated lncRNA catalogs (e.g., GENCODE or RefSeq) [13, 14]. In manual annotation, the lncRNA genes and transcript models are built by human annotators based on nonreconstructed transcriptomic and genomic evidence and according to defined protocols. This precise inspection of lncRNA models produces more confident annotations, which nevertheless suffer from many of the artifacts present in automatic annotations (for example, the omission of terminal exons), albeit at lower rates. The main weakness of manual annotation, however, is its low throughput compared to automated approaches and its requirement for long-term funding.

The advent of third-generation single molecule long-read sequencing ("TGS") can eliminate the need for transcriptome assembly. By using reads from hundreds to thousands of bases in length, such methods reveal the exon connectivity of individual RNA transcripts [15, 16]. While the problem of assembly is solved, a new challenge is introduced in terms of sensitivity: TGS methods


typically have low depth (

gene_trans.txt
$ rsem-prepare-reference --transcript-to-gene-map gene_trans.txt mRNA.lncRNA.fa mRNA.lncRNA

3.2 Distinguishing Ribosome-Associated and Ribosome-Free lncRNAs by Ribosome Density

1. Trimming adaptor sequences from Ribo-seq reads.
$ cd ~/Desktop/ribo-lncRNA/raw/
$ cutadapt -m 15 -a TGGAATTCTCGG -a AAAAAAAAAAAA -o HeLa_ribo_trimmed.fastq HeLa_ribo.fastq


Chao Zeng and Michiaki Hamada

2. Aligning to contaminant sequences.
$ bowtie2 --very-sensitive-local -x ~/Desktop/ribo-lncRNA/ref/contaminant -U HeLa_ribo_trimmed.fastq --un HeLa_ribo_trimmed_filtered.fastq > /dev/null

3. Aligning to mRNAs and lncRNAs.
$ bowtie2 --very-sensitive-local -k 100 -x ~/Desktop/ribo-lncRNA/ref/mRNA.lncRNA HeLa_rna.fastq > HeLa_rna.sam
$ bowtie2 --very-sensitive-local -k 100 --norc --rdg 99999999,99999999 --rfg 99999999,99999999 -x ~/Desktop/ribo-lncRNA/ref/mRNA.lncRNA HeLa_ribo_trimmed_filtered.fastq > HeLa_ribo.sam

4. Quantifying transcript expression by RNA-seq (see Note 3).
$ python ~/Desktop/ribo-lncRNA/scripts/Modify_SAM_for_RSEM.py HeLa_rna.sam HeLa_rna.modified.sam
$ rsem-calculate-expression --alignments HeLa_rna.modified.sam ~/Desktop/ribo-lncRNA/ref/mRNA.lncRNA HeLa_rna

5. Calculating ribosome density (see Note 3). The ribosome density is calculated as RPKM(x)/RPKM(y), where RPKM(x) represents the expression level measured by Ribo-seq in the region of interest (i.e., CDS, 3′ UTR, or lncRNA) and RPKM(y) denotes the expression level of the corresponding transcript measured by RNA-seq (see Fig. 1).
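The ratio in step 5 can be illustrated with a minimal Python sketch. It is not the authors' script: the function names and the read counts, region lengths, and library sizes below are hypothetical, and RPKM is computed with the usual definition (reads per kilobase of region per million mapped reads).

```python
def rpkm(read_count, region_length_nt, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and library size must be positive")
    return read_count * 1e9 / (region_length_nt * total_mapped_reads)


def ribosome_density(ribo_rpkm, rna_rpkm):
    """RPKM(x) / RPKM(y); returns None when there is no RNA-seq signal."""
    if rna_rpkm <= 0:
        return None
    return ribo_rpkm / rna_rpkm


# Hypothetical lncRNA of 2000 nt (illustrative values only).
ribo = rpkm(read_count=50, region_length_nt=2000, total_mapped_reads=10_000_000)
rna = rpkm(read_count=400, region_length_nt=2000, total_mapped_reads=20_000_000)
density = ribosome_density(ribo, rna)
print(ribo, rna, density)
```

Returning None for transcripts without RNA-seq signal is a design choice of this sketch; such transcripts could equally be filtered out before the ratio is taken.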

Fig. 1 Identification of ribosome-associated lncRNAs with ribosome density. We applied 3′ UTR as the negative control because this region is considered to have no ribosome attachment. The 90th percentile of the ribosome densities for 3′ UTR was used as a threshold (T) to distinguish between ribosome-associated lncRNAs (T) and ribosome-free lncRNAs (

~/Desktop/ribo-lncRNA/ref/human_ribo_noribo_lncrna.xlsx
$ cd ~/Desktop/ribo-lncRNA/scripts
$ python Generate_ribolncRNA_noribolncRNA.py
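The thresholding described in the Fig. 1 caption can be sketched as follows. This is an illustrative reading, not the content of Generate_ribolncRNA_noribolncRNA.py: the percentile interpolation rule, the strict comparison against T, and all names and values are assumptions.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("no values given")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def classify_lncrnas(lncrna_density, utr3_density, q=90.0):
    """Label each lncRNA by comparing its density with the 3' UTR threshold T."""
    threshold = percentile(list(utr3_density), q)
    labels = {}
    for name, d in lncrna_density.items():
        # Assumption: densities strictly above T count as ribosome-associated.
        labels[name] = "ribosome-associated" if d > threshold else "ribosome-free"
    return threshold, labels


# Hypothetical densities: ten 3' UTRs (negative control) and two lncRNAs.
utr3 = [0.01 * i for i in range(1, 11)]
threshold, labels = classify_lncrnas({"lncA": 0.5, "lncB": 0.02}, utr3)
print(threshold, labels)
```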

3.4 Defining Sequence Features that Potentially Affect Ribosome Associations

In addition to some basic features (splicing, GC content, etc.), we defined the other features by predefining three different ORFs of each lncRNA and by the positional relationships between sequence elements (i.e., m6A and G4) and these ORFs (see Fig. 2). The primary ORF (pORF) is the longest ORF in the lncRNA, and the first ORF (fORF) is the first ORF to appear from the 5′ end. The upstream ORF (uORF) appears upstream of the pORF and starts with a near-cognate initiation site (CTG, GTG, or TTG). See Table 1 for the complete list of sequence features and their descriptions.
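The three ORF definitions can be made concrete with a short sketch. Several details are assumptions that the text does not specify: pORF and fORF are taken to start at ATG, an ORF must end at an in-frame stop codon, ties in length are broken by the earlier start, and a uORF is any near-cognate ORF whose start lies upstream of the pORF start. The example sequence is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
NEAR_COGNATE = {"CTG", "GTG", "TTG"}


def find_orfs(seq, starts):
    """Return (start, end) pairs, end exclusive and including the stop codon."""
    seq = seq.upper()
    orfs = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in starts:
            continue
        # Walk codon by codon until the first in-frame stop.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                orfs.append((i, j + 3))
                break
    return orfs


def define_orfs(seq):
    """Return (pORF, fORF, list of uORFs) under the assumptions stated above."""
    atg_orfs = find_orfs(seq, {"ATG"})
    if not atg_orfs:
        return None, None, []
    p_orf = max(atg_orfs, key=lambda o: (o[1] - o[0], -o[0]))  # longest ORF
    f_orf = min(atg_orfs, key=lambda o: o[0])                  # closest to 5' end
    u_orfs = [o for o in find_orfs(seq, NEAR_COGNATE) if o[0] < p_orf[0]]
    return p_orf, f_orf, u_orfs


# Hypothetical transcript: CTG uORF, short ATG ORF, then a longer ATG ORF.
seq = "CTGAAATAG" + "C" + "ATGCCCTAA" + "GG" + "ATGAAACCCGGGTTTTGA" + "CC"
p_orf, f_orf, u_orfs = define_orfs(seq)
print(p_orf, f_orf, u_orfs)
```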


Fig. 2 Defining sequence features of lncRNA. (a) Primary ORF (pORF), first ORF (fORF), and upstream ORF (uORF) in lncRNA (horizontal line). (b) Distances between m6A/G4 and transcription initiation/termination site, and starts or ends of p/f/uORF were considered as features

1. Generating lncRNA sequences.
$ python Generate_lncRNA_sequences.py

2. Predicting m6A sites with SRAMP.
$ cd ../tools/sramp_simple
$ perl runsramp.pl ../../lncrna.fa ../../lncrna_m6a_sramp.txt mature
$ cd ~/Desktop/ribo-lncRNA/scripts

3. Predicting G-quadruplex sites with QGRS.
$ python Run_ParasoR_QGRS.py

4. Mapping repeat elements [24] to lncRNAs.
$ wget http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db20140131/hg19.fa.out.gz
$ gunzip hg19.fa.out.gz && mv hg19.fa.out repeatmask.out
$ python Generate_lncrna_bed.py

5. Sampling for sequence context of transcription initiation sites.
$ python Sampling_CDS.py

Detection and Characterization of Ribosome-Associated lncRNAs


Table 1 Sequence features and descriptions

No.     Feature                     Description

Basic
1       fLen                        Log10(length + 1) of the mature lncRNA
2       gc                          G + C content of the mature lncRNA

RNA splicing
3       nE                          Number of exons
4       fELen                       Log10(length + 1) of the first exon
5       minELen                     Log10(length + 1) of the shortest exon
6       maxELen                     Log10(length + 1) of the longest exon
7       avgELen                     Log10(averaged_length + 1) of exons
8       fEgc                        G + C content of the first exon
9       minEgc                      G + C content of the shortest exon
10      maxEgc                      G + C content of the longest exon
11      avgEgc                      Averaged G + C content of exons
12      fILen                       Log10(length + 1) of the first intron
13      minILen                     Log10(length + 1) of the shortest intron
14      maxILen                     Log10(length + 1) of the longest intron
15      avgILen                     Log10(averaged_length + 1) of introns
16      fIgc                        G + C content of the first intron
17      minIgc                      G + C content of the shortest intron
18      maxIgc                      G + C content of the longest intron
19      avgIgc                      Averaged G + C content of introns

Putative ORF (pORF: primary ORF; fORF: first ORF; uORF: upstream ORF)
20-22   p/f/uOrfLen                 Log10(length + 1) of ORF
23-25   p/f/uOrfCov                 Percentage of ORF length compared to that of lncRNA
26-28   p/f/uOrf5utrLen             Log10(length + 1) of the upstream region of ORF (5′ UTR)
29-31   p/f/uOrf5utrCov             Percentage of the 5′ UTR length compared to that of lncRNA
32-34   p/f/uOrf3utrLen             Log10(length + 1) of the downstream region of ORF (3′ UTR)
35-37   p/f/uOrf3utrCov             Percentage of the 3′ UTR length compared to that of lncRNA
38-40   p/f/uOrfStartContext        Context score of ORF start
41-43   p/f/uOrfSeqTrimer           Trimer score of ORF
44-46   p/f/uOrfSeqHexamer          Hexamer score of ORF

RNA secondary structure
47-49   p/f/uOrfSp                  Averaged RNA stem probability of ORF
50-52   p/f/uOrf5utrSp              Averaged RNA stem probability of 5′ UTR
53-55   p/f/uOrf5utrSpFC            Ratio of RNA stem probability of 5′ UTR to that of ORF
56-58   p/f/uOrf3utrSp              Averaged RNA stem probability of 3′ UTR
59-61   p/f/uOrf3utrSpFC            Ratio of RNA stem probability of 3′ UTR to that of ORF
62      g4NearTIS_log               Log10(minimum distance) from G4 to transcription initiation
63      g4NearTTS_log               Log10(minimum distance) from G4 to transcription termination
64-66   g4Near(p/f/u)ORFstart_log   Log10(minimum distance) from G4 to ORF start
67-69   g4Near(p/f/u)ORFend_log     Log10(minimum distance) from G4 to ORF end
70      g4NearTIS_%                 Minimum distance from G4 to TIS divided by length of lncRNA
71      g4NearTTS_%                 Minimum distance from G4 to TTS divided by length of lncRNA
72-74   g4Near(p/f/u)ORFstart_%     Minimum distance from G4 to ORF start divided by length of lncRNA
75-77   g4Near(p/f/u)ORFend_%       Minimum distance from G4 to ORF end divided by length of lncRNA

RNA modification
78      m6aNearTIS_log              Log10(minimum distance) from m6A to transcription initiation
79      m6aNearTTS_log              Log10(minimum distance) from m6A to transcription termination
80-82   m6aNear(p/f/u)ORFstart_log  Log10(minimum distance) from m6A to ORF start
83-85   m6aNear(p/f/u)ORFend_log    Log10(minimum distance) from m6A to ORF end
86      m6aNearTIS_%                Minimum distance from m6A to TIS divided by length of lncRNA
87      m6aNearTTS_%                Minimum distance from m6A to TTS divided by length of lncRNA
88-90   m6aNear(p/f/u)ORFstart_%    Minimum distance from m6A to ORF start divided by length of lncRNA
91-93   m6aNear(p/f/u)ORFend_%      Minimum distance from m6A to ORF end divided by length of lncRNA

Repeat element
94      DNA                         Containing DNA transposon or not
95      LINE                        Containing LINE element or not
96      LTR                         Containing LTR element or not
97      SINE                        Containing SINE element or not
98      Retroposon                  Containing Retroposon element or not
99      Satellite                   Containing Satellite element or not
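As an illustration of how the basic and exon-level entries of Table 1 (features 1 to 7) might be computed, the sketch below derives them from a list of exon sequences. It is not Generate_features.py: the function names are hypothetical, and expressing G + C content as a fraction rather than a percentage is an assumption.

```python
import math


def gc_content(seq):
    """Fraction of G and C bases (assumed scale: 0 to 1)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def log_len(length):
    """Log10(length + 1), the transform used throughout Table 1."""
    return math.log10(length + 1)


def basic_features(exons):
    """exons: exon sequences in transcript order (5' to 3')."""
    mature = "".join(exons)
    lengths = [len(e) for e in exons]
    return {
        "fLen": log_len(len(mature)),
        "gc": gc_content(mature),
        "nE": len(exons),
        "fELen": log_len(lengths[0]),
        "minELen": log_len(min(lengths)),
        "maxELen": log_len(max(lengths)),
        "avgELen": log_len(sum(lengths) / len(lengths)),
    }


# Hypothetical two-exon lncRNA.
features = basic_features(["GGCCATAT", "AT"])
print(features)
```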

6. Removing sequences of high similarity with BLAST.
$ makeblastdb -in ~/Desktop/ribo-lncRNA/ref/human_ribolncrna_noribolncrna.fa -out ~/Desktop/ribo-lncRNA/ref/human_ribolncrna_noribolncrna -dbtype nucl -parse_seqids
$ blastn -outfmt 6 -db ~/Desktop/ribo-lncRNA/ref/human_ribolncrna_noribolncrna -query ~/Desktop/ribo-lncRNA/ref/human_ribolncrna_noribolncrna.fa > ~/Desktop/ribo-lncRNA/ref/human_ribolncrna_noribolncrna.blast
$ python ~/Desktop/ribo-lncRNA/scripts/Process_blast_results.py > ~/Desktop/ribo-lncRNA/lncrna_align.txt
$ python Generate_features.py
$ python Remove_similarity.py

7. Merging sequence features.
$ python Generate_dataset.py
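One way the all-against-all BLAST output of step 6 could be used to discard near-duplicate sequences is sketched below. It reads the standard 12-column tabular format produced by -outfmt 6 (query ID, subject ID, percent identity, alignment length, and so on). The identity and length cutoffs, the greedy keep-first strategy, and the function names are assumptions; the behavior of Process_blast_results.py and Remove_similarity.py is not described in the text.

```python
def similar_pairs(blast_lines, min_identity=80.0, min_length=100):
    """Collect unordered ID pairs whose hit passes both (hypothetical) cutoffs."""
    pairs = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        if query == subject:
            continue  # ignore self-hits
        if pident >= min_identity and length >= min_length:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def remove_similar(ids, pairs):
    """Greedy filter: keep a sequence unless it is similar to one already kept."""
    kept = []
    for seq_id in ids:
        if any(tuple(sorted((seq_id, k))) in pairs for k in kept):
            continue
        kept.append(seq_id)
    return kept


# Hypothetical BLAST hits in outfmt 6.
hits = [
    "lnc1\tlnc1\t100.000\t500\t0\t0\t1\t500\t1\t500\t0.0\t924",
    "lnc1\tlnc2\t95.000\t300\t15\t0\t1\t300\t1\t300\t1e-120\t450",
    "lnc2\tlnc1\t95.000\t300\t15\t0\t1\t300\t1\t300\t1e-120\t450",
    "lnc3\tlnc1\t85.000\t40\t6\t0\t10\t49\t20\t59\t1e-05\t50",
]
pairs = similar_pairs(hits)
kept = remove_similar(["lnc1", "lnc2", "lnc3"], pairs)
print(pairs, kept)
```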

3.5 Feature Selection by L1-Regularized Logistic Regression

1. Removing highly redundant (|r| > 0.8) features.
$ python Remove_redundant_features.py

2. Applying L1-regularized logistic regression to obtain important features (see Fig. 3).
$ python Feature_selection.py
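The redundancy filter in step 1 can be sketched in plain Python. Only the |r| > 0.8 cutoff comes from the text; using the Pearson correlation, keeping the first-listed feature of each correlated group, and the feature values themselves are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(features, max_abs_r=0.8):
    """features: dict of name -> values. Keep a feature only if its |r| with
    every feature kept so far does not exceed the cutoff."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= max_abs_r for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns for five lncRNAs.
table = {
    "fLen": [1.0, 2.0, 3.0, 4.0, 5.0],
    "maxELen": [1.1, 2.0, 3.2, 3.9, 5.1],  # nearly collinear with fLen
    "gc": [0.5, 0.3, 0.6, 0.4, 0.5],
}
selected = drop_redundant(table)
print(selected)
```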


Fig. 3 Feature selection with L1-logistic regression. We used 80% and 20% of the data to train the model and calculate accuracy (blue, dashed line), respectively. The larger the value of C, the more complex the model (the more nonzero feature parameters), and the corresponding absolute value of the feature parameter indicates the importance of this sequence feature. The final model was chosen at the black dashed line to obtain 12 features related to the ribosomal association (top left, ranked according to importance)
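The relationship between C and model sparsity described in the Fig. 3 caption can be demonstrated with a small self-contained sketch. It fits L1-regularized logistic regression by proximal gradient descent; it is not Feature_selection.py, and the solver, learning rate, iteration count, toy data, and the scaling of the penalty as 1/(C × n), modeled on the common convention in which C is an inverse regularization strength, are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, C=1.0, lr=0.1, n_iter=500):
    """Minimize mean log-loss + ||w||_1 / (C * n) by proximal gradient descent."""
    n, p = len(X), len(X[0])
    lam = 1.0 / (C * n)
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj / n, lr * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data: feature 0 determines the label, feature 1 is balanced noise.
X = [[x0, x1] for x0 in (-2.0, -1.0, 1.0, 2.0) for x1 in (1.0, -1.0)]
y = [1 if row[0] > 0 else 0 for row in X]
w_strong, _ = fit_l1_logistic(X, y, C=0.05)  # strong penalty
w_weak, _ = fit_l1_logistic(X, y, C=10.0)    # weak penalty
print(w_strong, w_weak)
```

With the small C every coefficient is shrunk to zero, whereas the larger C lets the informative feature enter the model while the noise feature stays at zero, which mirrors the trend shown along the C axis of Fig. 3.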

4 Notes

1. The URL of the gene annotation file was changed from ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gtf.gz to ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_25/GRCh37_mapping/gencode.v25lift37.annotation.gtf.gz. The latest version can also be used.

2. By comparing several reads that are repeated in Ribo-seq with the human genome, we found that the sources of contamination are snoRNA, snRNA, and miRNA (see Table 2). Thus, we added these RNAs to the contamination list.

3. Because soft-clipping is allowed in the alignment, which makes the output incompatible with RSEM, we removed the soft-clipped alignments with the custom script (Modify_SAM_for_RSEM.py) before running RSEM. For the convenience of subsequent processing, we performed the same processing on the Ribo-seq alignments.
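The filtering described in Note 3 can be approximated by the sketch below, which drops every SAM record whose CIGAR string (the sixth tab-separated column) contains a soft-clip operation (S). This is only one plausible reading of the note: the actual behavior of Modify_SAM_for_RSEM.py is not shown in the text, and the function name and example records are hypothetical.

```python
def drop_soft_clipped(sam_lines):
    """Yield header lines and alignments whose CIGAR has no soft clipping."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed records
        if "S" in fields[5]:
            continue  # CIGAR contains a soft-clip operation
        yield line


# Hypothetical SAM records: r1 is fully aligned, r2 is soft-clipped.
sam = [
    "@SQ\tSN:lnc1\tLN:1000",
    "r1\t0\tlnc1\t10\t255\t30M\t*\t0\t0\t" + "A" * 30 + "\t" + "I" * 30,
    "r2\t0\tlnc1\t50\t255\t5S25M\t*\t0\t0\t" + "C" * 30 + "\t" + "I" * 30,
]
kept_records = list(drop_soft_clipped(sam))
print(len(kept_records))
```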

Table 2 Contaminant Ribo-seq reads derived from miRNAs, snRNAs, and snoRNAs are enriched in lncRNAs

Columns: Dataset; Length of Ribo-seq read (nt): 18, 21, 22, 23, 24, 25, 28, 29, 30, 33

Contaminants listed: mir4286, U1 snRNA, U2 snRNA, U6 snRNA, U11 snRNA, U6/U11, snoRNA

Datasets:
Brain-Gonzalez2014-normal-C
Brain-Gonzalez2014-tumor-A
Breast-Ruibo2014-control-rep1
Breast-Ruibo2014-control-rep2
Fibroblasts-Shitrit2015-control
Fibroblasts-Xu2016-wt-Dleucine
Fibroblasts-Xu2016-wt-Lleucine
HEK-Eichhorn2014-mock
HEK-Iwasaki2016-dmso-rep1
HEK-Iwasaki2016-dmso-rep2
HEK-Sidrauski2015-control-b
HEK-Subtelny2014-cyt
HeLa-Guo2010-mock12hr
HeLa-Guo2010-mock32hr
HeLa-Park2016-Mphase-rep1
HeLa-Park2016-Sphase-rep1
HeLa-Zur2016-G1phase-exp1
HeLa-Zur2016-G1phase-exp2
HeLa-Zur2016-Mphase-exp1
HeLa-Zur2016-Mphase-exp2
hES-Werner2015-control-rep1
hES-Werner2015-control-rep2
KOPT-K1-Wolfe2014-dmso-rep1
KOPT-K1-Wolfe2014-dmso-rep2
Macrophages-Su2015-mock-rep1
Macrophages-Su2015-mock-rep2
Eye-Tanenbaum2015-G1-rep1
Eye-Tanenbaum2015-G1-rep2
Eye-Tanenbaum2015-G2-rep1
Eye-Tanenbaum2015-G2-rep2
Eye-Tanenbaum2015-M-rep1
Eye-Tanenbaum2015-M-rep2

Acknowledgments

This work was supported by the Ministry of Education, Culture, Sports, Science and Technology (KAKENHI) [grant numbers JP17K20032, JP16H05879, JP16H01318, and JP16H02484 to MH].

References

1. Bu D, Yu K, Sun S et al (2012) NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res 40:D210–D215
2. Iyer MK, Niknafs YS, Malik R et al (2015) The landscape of long noncoding RNAs in the human transcriptome. Nat Genet 47:199–208
3. Hon C-C, Ramilowski JA, Harshbarger J et al (2017) An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543:199–204
4. O'Leary NA, Wright MW, Brister JR et al (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–D745
5. You B-H, Yoon S-H, Nam J-W (2017) High-confidence coding and noncoding transcriptome maps. Genome Res 27:1050–1062
6. Frankish A, Diekhans M, Ferreira A-M et al (2019) GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 47:D766–D773
7. Ingolia NT, Ghaemmaghami S, Newman JRS, Weissman JS (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218–223
8. Ingolia NT, Lareau LF, Weissman JS (2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147:789–802
9. Zhou P, Zhang Y, Ma Q et al (2013) Interrogating translational efficiency and lineage-specific transcriptomes using ribosome affinity purification. Proc Natl Acad Sci U S A 110:15395–15400
10. Aspden JL, Eyre-Walker YC, Phillips RJ et al (2014) Extensive translation of small open reading frames revealed by poly-ribo-seq. eLife 3:e03528
11. Zeng C, Fukunaga T, Hamada M (2018) Identification and analysis of ribosome-associated lncRNAs using ribosome profiling data. BMC Genomics 19:414
12. Zeng C, Hamada M (2018) Identifying sequence features that drive ribosomal association for lncRNA. BMC Genomics 19:906
13. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17:10–12
14. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359
15. Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323
16. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421
17. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842
18. Kawaguchi R, Kiryu H (2016) Parallel computation of genome-scale RNA secondary structure to detect structural constraints on human genome. BMC Bioinformatics 17:203
19. Zhou Y, Zeng P, Li Y-H et al (2016) SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features. Nucleic Acids Res 44:e91
20. Kikin O, D'Antonio L, Bagga PS (2006) QGRS Mapper: a web-based server for predicting G-quadruplexes in nucleotide sequences. Nucleic Acids Res 34:W676–W682
21. Park J-E, Yi H, Kim Y et al (2016) Regulation of poly(A) tail and translation during the somatic cell cycle. Mol Cell 62:462–471
22. Ingolia NT, Brar GA, Stern-Ginossar N et al (2014) Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell Rep 8:1365–1379
23. Guttman M, Russell P, Ingolia NT et al (2013) Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell 154:240–251
24. Smit AFA, Hubley R, Green P (1996) RepeatMasker. http://www.repeatmasker.org. Accessed 31 Mar 2019

Chapter 12

Analysis of Annotated and Unannotated Long Noncoding RNAs from Exosome Subtypes Using Next-Generation RNA Sequencing

Wittaya Suwakulsiri, Maoshan Chen, David W. Greening, Rong Xu, and Richard J. Simpson

Abstract

Long noncoding RNAs (lncRNAs) contain >200 nucleotides and act as regulatory molecules in transcription and translation processes in both normal and pathological conditions. LncRNAs have been reported to localize in nuclei, cytoplasm, and, more recently, extracellular vesicles such as exosomes. Exosomal lncRNAs have gained much attention because exosomes secreted from one cell type can transfer their cargo (e.g., protein, RNA species, and lipids) to recipient cells and mediate phenotypic changes in the recipient cell. In recent years, many exosomal lncRNAs have been discovered and annotated and are attracting much attention as potential markers for disease diagnosis and prognosis. It is expected that many exosomal lncRNAs are yet to be identified. However, characterization of unannotated exosomal RNAs with non–protein-coding sequences from massive RNA sequencing data is technically challenging. Here, we describe a method for the discovery of annotated and unannotated exosomal lncRNAs. This method includes a large-scale isolation and purification strategy for exosome subtypes, using the human colorectal cancer cell line (LIM1863) as a model. The method takes clean RNA sequencing reads as input and performs transcript assembly to identify annotated and unannotated exosomal lncRNAs. Cutoffs (length, number of exons, classification code, and human protein-coding probability) are used to identify potentially novel exosomal lncRNAs. Raw read count calculation and differential expression analysis are also introduced for downstream analysis and candidate selection. Exosomal lncRNA candidates are validated using RT-qPCR. This method provides a template for exosomal lncRNA discovery and analysis from next-generation RNA sequencing.

Key words Exosomes, Long noncoding RNAs, Transcriptomics, Next-generation RNA sequencing, Bioinformatics

1 Introduction

It is now well recognized that 200 nucleotides in length and incapable of encoding proteins [6]. The functions of lncRNAs depend on their cellular compartmentalization: for instance, chromatin modification, transcriptional control, and posttranscriptional processing occur in the nuclear compartment, while lncRNAs can inhibit protein synthesis in the cytoplasmic compartment [8]. Moreover, some lncRNAs containing small open reading frames (sORFs), such as LINC00948 [9], LINC00116 [10], and Six1 [11], have been shown to encode functional micropeptides (typically ~50 amino acids in length).

Interestingly, lncRNAs have been detected in extracellular vesicles (EVs) isolated from biofluids such as plasma [12], serum [13], breast milk [14], synovial fluid [15], and urine [16]. EVs are a heterogeneous population of lipid bilayer-membrane vesicles derived from diverse cell types [17–19]. Based upon their mechanism of biogenesis, EVs comprise two major classes: exosomes (of endosomal origin) and shed microvesicles (also referred to as microparticles and ectomeres), which originate by blebbing of the plasma membrane [19]. EVs are crucial mediators of intercellular communication and act by transferring their bioactive cargo (e.g., proteins, RNA species, lipids, and metabolites) to recipient cells [18]. Interestingly, the majority of exosomal RNAs map to intronic regions, resulting in an abundance of long intergenic noncoding RNAs (lincRNAs) when compared to parental cells [20, 21]. Exosomal lncRNAs have been implicated in the onset of cancer drug resistance (e.g., lncARSR [22], linc-ROR [23], lincVLDLR [24], and lncUCA1 [25]) and angiogenesis (e.g., lncPOU3F3 [26] and lncCCAT2 [27]).

Here, we report a method used to discover annotated and unannotated exosomal lncRNAs from next-generation RNA sequencing data, focusing on human exosomal lncRNAs (Fig. 1). The method starts from large-scale generation of exosomes secreted from the human colorectal cancer cell line LIM1863 [28], continuously cultured in Bioreactor classic flasks [21, 29]. To isolate exosomes, we next performed differential centrifugation on the cell culture medium (CM) to remove cells, cell debris, and larger EVs known as shed microvesicles [30] followed by sequential glycoprotein A33 antigen (A33) [31] and EpCAM-immunoaffinity capture exosome subtype purification

Analysis of Exosomal lncRNAs Using RNA Sequencing

[Fig. 1 workflow. Isolation, purification and characterization of exosome subtypes: cell culture; differential centrifugation; 0.1 µm filtration; A33- and EpCAM-immunocapture exosome purification; A33+ and EpCAM+ exosomes; immunoblot analysis; RNA extraction; A33+ and EpCAM+ exosome-derived RNA samples; cDNA library construction; RNA sequencing. lncRNA data analysis: raw reads; clean reads; read alignment (Hisat2, GRCh38 annotation); Stringtie (GRCh38 annotation); merged annotation; Gffcompare; coding potential calculation and filtering (coding probability < 0.364; length > 200)]

Exon Index 50%) are ideal for moving forward for embryo injections.

3.2 Embryonic Microinjection of the CRISPR/Cas9 Reagents

3.2.1 Superovulation

1. Day One: Give one intraperitoneal injection of 20 I.U. PMSG to juvenile female rats (4–5 weeks of age) at 1100 h (11 AM).
2. Day Three: Give one intraperitoneal injection of 50 I.U. HCG to PMSG-treated rats at 1100 h (11 AM).
3. Place one female with one singly housed male rat (10 weeks of age or older).
4. Day Four: Remove female rats from males and check for the presence of copulation plugs (see Note 12).

3.2.2 Rat Zygote Collection

1. Place four 75 μL drops of global medium in a 35 mm petri dish and cover the drops with mineral oil. Place in incubator and allow to equilibrate at least 30 min.
2. Place one 300 μL drop of global medium in a 35 mm petri dish and cover with mineral oil. Place in incubator and allow to equilibrate at least 30 min.
3. Place 1.0 mL M2 medium in a 35 mm petri dish on the bench.


Xi Cheng et al.

4. Place four 200 μL drops of hyaluronidase on the underside of a 100 mm dish lid. Place 300 μL of M2 in the center of the dish. Cover with the dish bottom to prevent evaporation and place on bench.
5. Euthanize egg donors with CO2 asphyxiation or other approved humane method.
6. Wash abdomen of egg donor with 70% ethanol to minimize loose hair.
7. Make a transverse cut in the skin over the stomach.
8. Pinch the skin on both sides of the cut and peel the skin toward the head and tail.
9. Make a transverse cut in the peritoneal wall near the groin.
10. Make two lateral cuts in the peritoneal wall, from the groin to the rib cage.
11. Lift the resulting flap and displace the intestines up and out of the way.
12. Grasp the uterus with dressing forceps.
13. Insert the tip of the iris scissors between the uterus and the mesometrium.
14. Run the scissors along the uterus to free it from the mesometrium.
15. Grasp the uterotubal junction with Dumont forceps.
16. Use the scissors to cut between the ovary and the oviduct.
17. Then cut through the uterus immediately below the forceps.
18. Place the dissected oviduct in a drop of hyaluronidase.
19. Remove the second oviduct and place it in a drop of hyaluronidase.
20. Place dish lid on stereomicroscope with episcopic illumination at 10× magnification.
21. Tear open the oviduct loop that contains the eggs with Dumont forceps.
22. Continue until all of the oviducts are torn open in hyaluronidase drops.
23. Increase magnification to 40× and use a transfer pipet mounted on a mouth pipette to place zygotes in the central drop of M2.
24. After all eggs are in the central M2 drop, move them into the 300 μL drop of global medium in the incubator, minimizing the volume of M2 that is carried over.

In Vivo CRISPR/Cas9-Based Targeted Disruption and Knockin of a Long. . .


3.2.3 Rat Zygote Microinjection

Assemble Microinjection Chamber

1. Treat a glass slide with Sigmacote after cleaning the slide thoroughly with 70% ethanol.
2. Apply a dab of high vacuum grease to the long side of the first Plexiglas block.
3. Press the block onto the glass slide so that the block's long axis is perpendicular to the long axis of the slide. The silicone grease causes the spacer to adhere to the glass slide.
4. Place the second block parallel to the first block, 20 mm away from the first block on the slide.
5. Apply a dab of vacuum grease to the upper sides of the blocks.
6. Place 50 μL of M2 medium on the slide between the blocks.
7. While holding a 10 mm × 20 mm coverslip in your hand, place 50 μL of M2 medium onto the coverslip.
8. Rotate the coverslip 180° so that the M2 drop hangs down from the coverslip.
9. Lower the coverslip onto the slide so that the M2 on the coverslip and the M2 on the slide fuse; the ends of the coverslip should be resting on the blocks.
10. Gently press the coverslip onto the vacuum grease on the blocks to adhere the coverslip to the blocks.
11. Aspirate 1 mL mineral oil in a pipet tip and insert the tip into the upper corner of the coverslip and spacer. Slowly eject mineral oil under the coverslip and above the glass slide until the M2 column is surrounded by mineral oil to prevent evaporation.
12. If desired, increase the diameter of the M2 medium column in the microinjection chamber by pipetting 50 μL of M2 directly into the M2 column.

Microinjection Procedure

1. Place glass slide on microscope stage.
2. Use the pipette puller to prepare microinjection needles with a tip size less than 1 μm in diameter. Long, slender, sharp needles are preferred for rat zygote microinjection.
3. Place the butt end of the needle into a microtube containing 50 μL of injection solution. The internal filament will wick the solution to the tip of the needle only.
4. Place the needle into its holder and orient it in the microinjection chamber at 40× magnification so that the tip is in the center of the field of view.
5. Place the holding pipette into its holder and orient it in the microinjection chamber so that it is in the center of the field of view.
6. The needle and the holding pipette should be directly across from each other, forming a straight line.


7. Place as many rat zygotes as can be microinjected in 30–60 min into the microinjection chamber.
8. Focus on the zygotes and then lower the holding pipet and the injection needle until they are in the same plane of focus as the zygotes.
9. Change to the 10× objective and recenter needle and holding pipette, if necessary.
10. Change to the 40× objective and recenter needle and holding pipette, if necessary.
11. Place the needle against a loose zygote and pressurize the needle at 20 psi or 1400 hPa. If the pressurized solution does not cause the zygote to move, the needle is closed. To open the needle, tap it against the holding pipet until a little glass breaks off and the needle is open.
12. Aspirate a zygote onto the holding pipet.
13. Use the microscope focus knob to bring the pronuclear membrane of the male pronucleus into focus.
14. Use the micromanipulator to raise or lower the needle until it is in the same plane of focus as the pronuclear membrane.
15. Guide the needle through the zona pellucida of the zygote and pierce the cell membrane; continue into and through the pronucleus. The membranes are very flexible and it is not unusual to be able to push the needle completely through the zygote without breaking the membranes.
16. After the needle is in the pronucleus, tap the foot pedal to inject ~3 pL of solution into the pronucleus. A visible swelling will occur as the pronucleus inflates with the solution. In general, Cas9 ribonucleoprotein is more efficient than Cas9 expressed from microinjected plasmids or Cas9 translated from microinjected mRNA. The Transgenic Core at Michigan routinely incubates phosphorothioate-modified sgRNA (30 ng/μL, Synthego.com) with enhanced specificity Cas9 protein (50 ng/μL, MilliporeSigma) for 10 min in microinjection buffer just before microinjection. When oligonucleotide, ssDNA, or dsDNA donors are used, they are added to a final concentration of 10 ng/μL.
17. In our study [1] as an example, for the disruption model, a mixture of rRffl.g4 (2.5 ng/μL) and Cas9 mRNA (5 ng/μL) was injected into one-cell stage embryos. For the knockin model, a mixture of rRffl.g4 (2.5 ng/μL), Cas9 mRNA (5 ng/μL), and the donor oligonucleotide (10 ng/μL) was injected into one-cell stage embryos.
18. Slowly withdraw the needle and move the zygote to the bottom of the field of view.


19. After all of the zygotes are microinjected, remove them with the embryo transfer pipette and wash them through the four drops of global medium (see Subheading 3.2.2, step 1), minimizing medium carryover between drops.
20. Place another group of zygotes in the microinjection chamber and inject them.
21. After all zygotes have been microinjected, transfer them to pseudopregnant female rats (see Note 13).

3.2.4 Rat Embryo Transfer to Pseudopregnant Recipients

1. Use Sprague Dawley rats as recipients of microinjected zygotes. Use Charles River Laboratory Strain Code 400 or other rat strain with excellent reproductive characteristics.

Pseudopregnant Rat Preparation

2. Day One: Give one subcutaneous injection of 40 μg of LHRHa to adult females (200–250 g).
3. Day Five: Place one LHRHa-treated female with one singly housed vasectomized male rat (10 weeks of age or older).
4. Day Six: Remove female rats and check for the presence of copulation plugs (see Note 14).

Embryo Transfer Laparotomy

1. Pick up 15 microinjected zygotes in a transfer pipette attached to a mouth pipette. Carefully lay it aside.
2. Anesthetize rats with intraperitoneal injection of ketamine/xylazine, then administer carprofen analgesia.
3. Apply ophthalmic ointment to eyes.
4. Shave the fur on the left side of the rat in the space between the last rib and leg.
5. Clean the shaved skin by alternating betadine scrub and 70% ethanol.
6. Make a small incision in the skin and body wall over the ovary.
7. Grip the ovarian fat pad with dressing forceps and exteriorize the reproductive tract.
8. Apply the bulldog vessel clamp to the ovarian fat pad to hold the ovary and oviduct in place.
9. Apply several drops of diluted epinephrine to vasoconstrict blood vessels running through the bursa membrane that covers the ovary and oviduct.
10. Gently tear the bursa open with the Dumont forceps.
11. Locate the infundibulum (the opening of the oviduct), insert the tip of the embryo transfer pipette in the infundibulum, and then blow the zygotes into the left oviduct.
12. Return the reproductive tract to the peritoneal cavity and close the skin with wound clips.


13. Repeat the procedure on the right side of the animal to place 15 zygotes into the right oviduct (see Note 15).
14. Administer 50 mg of ampicillin by intraperitoneal injection.
15. Place the animal in a cage on a 37 °C slide warmer until alert and returned to sternal recumbency. House two rats per cage until 1 week prior to birth, then separate pregnant rats into one female per cage.

3.3 Genotyping of Transgenic Rats

1. At ages between 2 and 3 weeks, apply ear tags to the transgenic rats on the right or left ear.
2. Hold the base of the tail between the forefinger and thumb of one hand while the other hand is used to remove up to 0.5 cm of the tip of the tail. Cut through the tail tip with scissors, a scalpel, or a single-edge razor.
3. Extract DNA from the tail biopsy using the Wizard® SV 96 Genomic DNA Purification System and Proteinase K Solution as per the manufacturer's instructions.
4. Using the DNA sequence of the lncRNA, design multiple sets of primers (using Primer3: http://bioinfo.ut.ee/primer3-0.4.0/) to amplify the CRISPR/Cas9 targeted region of the lncRNA. The full DNA sequence of a specific lncRNA can be found in Ensembl (useast.ensembl.org/index.html) (see Note 16).
5. Use the designed primers to PCR amplify the CRISPR/Cas9 edited region from the extracted genomic DNA of the transgenic rats using Radiant™ Red 2× Taq Mastermix as per the manufacturer's instructions.
6. Run the PCR product on an agarose gel in the electrophoresis system. For better visualization of PCR products of different sizes, prepare agarose gels with concentrations appropriate to the size of the PCR product.
7. Visualize the agarose gel using the gel imaging system for the initial genotyping analysis. The genotype in the founders of targeted disruption may be either heterozygous (double PCR bands) or homozygous (single PCR band), but at least one band should differ in size from the wild type, indicating insertions or deletions (INDELs). The genotype in the founders of knockin may be either heterozygous (double PCR bands) or homozygous (single PCR band), but at least one band should be larger than the wild type by the expected knockin size.
8. PCR amplify the genomic DNA of the homozygous founders using Platinum™ Taq DNA Polymerase. Send the PCR product for DNA sequencing (Eurofins Genomics) to confirm the genotype. DNA sequencing data are analyzed using Sequencher 4.10.1.
9. If needed, the heterozygous founders can be selectively bred to generate homozygous animals. The genotyping process should be performed as detailed above in the subsequent generations of the founders.
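The band-pattern reasoning in step 7 can be sketched in Python. This is only an illustration and not part of the published protocol: the function name `call_genotype`, the 10 bp sizing tolerance, and the example band sizes are assumptions chosen for demonstration, and any call made this way still needs confirmation by sequencing as described in step 8.

```python
def call_genotype(band_sizes_bp, wildtype_bp, tolerance_bp=10):
    """Classify a founder from the PCR band sizes read off the gel.

    tolerance_bp is a hypothetical allowance for gel sizing error;
    bands closer together than this are treated as one band.
    """
    # Merge bands that could not be told apart on the gel.
    distinct = []
    for size in sorted(band_sizes_bp):
        if not distinct or size - distinct[-1] > tolerance_bp:
            distinct.append(size)

    if not distinct:
        return "no product"
    # Bands whose size differs from the wild-type amplicon indicate an edit.
    shifted = [s for s in distinct if abs(s - wildtype_bp) > tolerance_bp]
    if not shifted:
        return "wild type or unresolved small indel"
    if len(distinct) == 1:
        return "homozygous"
    return "heterozygous"


# Hypothetical example with a 500 bp wild-type amplicon.
print(call_genotype([500, 420], wildtype_bp=500))  # wild-type band plus a shorter band
print(call_genotype([420], wildtype_bp=500))       # single shifted band
```

A single band at the wild-type size is reported as ambiguous rather than wild type, because a small indel may not be resolved on an agarose gel.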

4 Notes

1. Episcopic illumination is used for zygote collection and loading zygotes into the embryo transfer pipette during surgical transfer of zygotes to the reproductive tract. Diascopic illumination is used during surgery to visualize the opening of the oviduct. A single stereomicroscope with both types of illumination can be used in place of two stereomicroscopes; however, switching between illumination sources and magnifications may be tedious.
2. Several types of transfection methods can be used to deliver CRISPR/Cas9 reagents to many different cell types. Here, we describe our nucleofection protocol using the Lonza 4D-X Nucleofector with the small 16 strip cuvette. If using a cell type other than rat C6, Lonza's website has several optimized protocols for common cell lines. There you will find the program and the nucleofector solution to use. If Lonza does not have a protocol for your cell type, we recommend optimizing the nucleofection following the manufacturer's protocol.
3. We recommend using chemically modified sgRNAs with 2′-O-methyl 3′ phosphorothioate modifications in the first and last 3 nucleotides.
4. The amount of RNP can be decreased or increased depending on the size of the nucleofection, the type of nucleofection, and the number of reagents needed for modifications. If using a donor or more than one RNP, amounts will have to be adjusted to fit the volume of the specified transfection protocol.
5. Nucleofector solution must be supplemented prior to use. Nucleofection will give poor results, such as low transfection efficiency, if the supplement is not added.
6. We find that a two-step PBS wash helps prevent degradation of the RNA and maximize RNP cutting efficiency.
7. The nucleofection program varies depending on cell type. Check the manufacturer's protocol if using cells other than rat C6 cells.
8. We use NCBI's Primer-BLAST to design primers. The website can be found at https://www.ncbi.nlm.nih.gov/tools/primer-blast/. We use the default settings except for the database parameter and the organism parameter. We reference the genome instead of RefSeq for the database ("Genomes for selected organisms"), and we input our intended organism based on the species of interest.
9. Very concentrated DNA as well as very dilute DNA can impair PCR amplification. For very small and large DNA pellets, we recommend resuspending in 25–50 μL and 150–400 μL extraction buffer, respectively. The DNA should be easily aspirated by a pipette after extraction. If the DNA is very viscous, try diluting the DNA at a 1:5–1:10 ratio.
10. Some primer pairs and PCRs may not work robustly. Some regions and/or primers may work better with Mytaq while some work better with SuperFi. If the PCR does not work with either polymerase, we recommend checking the region's GC content. If the GC content is high or the region has large stretches of nucleotides or repeats, supplementing 5–10% DMSO into Mytaq's recipe or 5X GC enhancer with SuperFi can help produce an amplicon. If the primers still do not work, we recommend repeating the amplification with a new primer set.
11. Sequencing of only one strand is necessary for identifying indels via TIDE; however, both strands can be used for TIDE analysis in case one strand has poor sequencing results.
12. Depending on the genetic background of the rats, 5 to 8 egg donors are prepared for each microinjection day. Response to superovulation varies by genetic background; for example, Sprague Dawley rats produce more eggs than Fischer 344 rats [5].
13. It has been observed that microinjected nucleic acid preparations can be toxic to zygote development. Microinjected zygotes may lyse after overnight culture, or may be blocked from developing to the two-cell, morula, or blastocyst stages. This level of toxicity will cause in vivo developmental failure and no living rat pups will be born.
If possible, nucleic acid preparations should be microinjected into mouse zygotes and the mouse zygotes observed for development to the blastocyst stage for signs of toxicity. Mouse zygotes are suggested for this quality control procedure because they efficiently develop to blastocysts in vitro, while it is challenging to culture rat zygotes to the blastocyst stage [6].
14. Typically, 6 pseudopregnant rats are prepared for each microinjection session. If necessary, the lordosis response can be elicited in female rats with an engraving tool to increase the number of rats available for embryo transfer. Place a smooth syringe needle sheath over the tip of the tool, set the tip to the lowest setting, cover it with KY jelly, and briefly insert the tool in the vagina until lordosis is observed.
15. It is advisable to place microinjected zygotes in both oviducts because there is no evidence of transuterine migration in rodents [7].
16. The purpose of designing multiple sets of primers is to comprehensively capture the CRISPR/Cas9 editing information in the lncRNA. For example, the first set of primers is designed to amplify 150 bp of the genomic region containing the CRISPR/Cas9 targeted region. However, the CRISPR/Cas9 editing may result in a deletion of more than 150 bp, so the first set of primers cannot amplify the region because the binding sites of one or both primers are missing. Therefore, multiple sets of primers are needed to amplify the targeted region with different product sizes (150 bp, 500 bp, 1000 bp, etc.).
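The reasoning in Note 16 can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not a tool used in this chapter: the function `expected_product`, the coordinate system, the 20 bp primer sites, and the example 200 bp deletion are all hypothetical.

```python
def expected_product(fwd_site, rev_site, deletion):
    """Return the expected amplicon length in bp, or None if a primer site is lost.

    Each argument is a (start, end) half-open interval on one shared
    reference coordinate system, with fwd_site upstream of rev_site.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    # A deletion that removes any part of a binding site prevents amplification.
    if overlaps(fwd_site, deletion) or overlaps(rev_site, deletion):
        return None
    wildtype_len = rev_site[1] - fwd_site[0]
    # Only a deletion lying between the two primers shortens the product.
    inside = fwd_site[1] <= deletion[0] and deletion[1] <= rev_site[0]
    deleted = deletion[1] - deletion[0] if inside else 0
    return wildtype_len - deleted


# Hypothetical primer sets centered on a target site at position 1000,
# keyed by their wild-type product size in bp.
primer_sets = {
    150: ((925, 945), (1055, 1075)),
    500: ((750, 770), (1230, 1250)),
    1000: ((500, 520), (1480, 1500)),
}
deletion = (900, 1100)  # an assumed 200 bp deletion spanning the target site
results = {size: expected_product(fwd, rev, deletion)
           for size, (fwd, rev) in primer_sets.items()}
print(results)
```

In this example the 150 bp primer set loses a binding site and yields no product from the edited allele, whereas the 500 bp and 1000 bp sets still amplify a shortened product.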

Acknowledgments

Dr. Xi Cheng acknowledges funding support of the Dean's Postdoctoral to Faculty Fellowship from University of Toledo College of Medicine and Life Sciences and the P30 Core Center Pilot Grant from NIDA Center of Excellence in Omics, Systems Genetics, and the Addictome. Dr. Bina Joe acknowledges grant support from the National Heart, Lung, and Blood Institute.

References

1. Cheng X, Waghulde H, Mell B, Morgan EE, Pruett-Miller SM, Joe B (2017) Positional cloning of quantitative trait nucleotides for blood pressure and cardiac QT-interval by targeted CRISPR/Cas9 editing of a novel long non-coding RNA. PLoS Genet 13:e1006961
2. Stickrod G (1979) Ketamine/xylazine anesthesia in the pregnant rat. J Am Vet Med Assoc 175:952–953
3. Chen F, Pruett-Miller SM, Davis GD (2015) Gene editing using ssODNs with engineered endonucleases. Methods Mol Biol 1239:251–265
4. Brinkman EK, Chen T, Amendola M, van Steensel B (2014) Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res 42:e168
5. Filipiak WE, Saunders TL (2006) Advances in transgenic rat production. Transgenic Res 15:673–686
6. Popova E, Bader M, Krivokharchenko A (2011) Effect of culture conditions on viability of mouse and rat embryos developed in vitro. Genes (Basel) 2:332–344
7. Rulicke T, Haenggli A, Rappold K, Moehrlen U, Stallmach T (2006) No transuterine migration of fertilised ova after unilateral embryo transfer in mice. Reprod Fertil Dev 18:885–891

Chapter 20

Genome-Scale Perturbation of Long Noncoding RNA Expression Using CRISPR Interference

S. John Liu, Max A. Horlbeck, Jonathan S. Weissman, and Daniel A. Lim

Abstract

CRISPR-mediated interference (CRISPRi), a robust and specific system for programmably repressing transcription, provides a versatile tool for systematically characterizing the function of long noncoding RNAs (lncRNAs). When used with highly parallel, lentiviral pooled screening approaches, CRISPRi enables the targeted knockdown of tens of thousands of lncRNA-expressing loci in a single screen. Here we describe the use of CRISPRi to target lncRNA loci in a pooled screen, using cell growth and proliferation as an example of a phenotypic readout. Considerations for custom lncRNA-targeting libraries, alternative phenotypic readouts, and orthogonal validation approaches are also discussed.

Key words CRISPR, CRISPRi, lncRNA, Screen

1 Introduction

Long noncoding RNAs (lncRNAs) are a broadly defined class of genes that are transcribed into RNA molecules longer than 200 nucleotides that do not encode proteins. The human genome produces tens of thousands of distinct lncRNA transcripts, and it is now clear that certain lncRNAs have important biological functions. Due to the large number of annotated lncRNAs whose biological significance is not yet known, systematic genome-scale approaches for testing lncRNA function are especially important for efficient characterization of these genes. Large-scale interrogation of lncRNAs has been performed previously using RNA interference [1, 2], CRISPR/Cas9 deletion of lncRNA loci [3], CRISPR/Cas9 disruption of lncRNA splice acceptor and donor sites [4], and CRISPRi repression of lncRNA transcription [5]. Given that lncRNA genes can function via diverse (i.e., cis and/or trans) mechanisms [6], it is important to consider the molecular mechanisms by which the screening method interrogates lncRNA function [7, 8].

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_20, © Springer Science+Business Media, LLC, part of Springer Nature 2021


In CRISPRi, a catalytically inactive Cas9 protein is fused with a transcriptional repressor domain such as the KRAB repressor domain (dCas9-KRAB), and such dCas9 fusion proteins can be targeted to essentially any region of the genome by the expression of single guide RNAs (sgRNAs). sgRNA-mediated recruitment of dCas9-KRAB silences transcription through steric hindrance of RNA polymerase elongation and deposition of the heterochromatin mark H3K9me3 [9, 10]. CRISPRi exhibits maximal activity when targeted to between −50 and +300 bp relative to the transcription start site (TSS) of the target lncRNA, which minimizes disruption of neighboring cis regulatory elements and other genes [11]. This narrow targeting window makes precise TSS identification and optimized sgRNA design important considerations in designing screening libraries [12–14]. CRISPRi can be engineered to be inducible/reversible [15], and when dCas9 is fused to transcriptional activation domains, the system can also be used to achieve targeted overexpression of lncRNAs [16, 17]. The function of lncRNAs is generally not expected to be disrupted by the small insertions and deletions produced by targeting catalytically active CRISPR/Cas9 to their loci [18, 19]. Because CRISPRi instead blocks lncRNA transcription at the level of the genome, it is particularly well suited for screening lncRNA gene function. In addition, CRISPRi is not susceptible to artifacts related to genomic copy number variation, whereas Cas9 nuclease activity at amplified loci can lead to nonspecific cell death [20–22]. Here, we present a protocol for the application of CRISPRi to test the function of thousands of lncRNAs in a pooled screen format, using cellular growth and proliferation as a phenotypic readout. We describe CRISPRi cell line generation and usage for the glioblastoma cell line U87, but the protocol is applicable to other cell types [5]. The use of custom sgRNA libraries is discussed (see Note 4), but for simplicity we present the screen using the publicly available Human CRISPRi Non-Coding Libraries (CRiNCL). Variations of the protocols below can also be found online at weissmanlab.ucsf.edu (see Note 1).
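As a rough illustration of the targeting window described above, the following Python sketch tests whether a candidate sgRNA position falls between −50 and +300 bp of a TSS in a strand-aware way. It is not the library design algorithm used for CRiNCL: the function name `in_crispri_window`, the use of a single coordinate to represent each sgRNA, and the example coordinates are assumptions made for demonstration only.

```python
def in_crispri_window(sgrna_pos, tss_pos, strand, upstream=50, downstream=300):
    """Return True if an sgRNA lies within -upstream..+downstream bp of the TSS.

    sgrna_pos is a single genomic coordinate standing in for the sgRNA
    (which base is used, e.g. the PAM, is an assumption left to the user).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Positive offsets point downstream in the direction of transcription.
    offset = sgrna_pos - tss_pos if strand == "+" else tss_pos - sgrna_pos
    return -upstream <= offset <= downstream


# Hypothetical lncRNA with its TSS at position 10,000 on the minus strand.
candidates = [10040, 10060, 9700, 9690]
kept = [p for p in candidates if in_crispri_window(p, 10000, "-")]
print(kept)
```

For a minus-strand gene, downstream of the TSS corresponds to lower genomic coordinates, which is why the offset is reversed for that strand.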

2 Materials

2.1 Generation of CRISPRi Cell Line

1. U87-MG cell line (ATCC #HTB-14).
2. 293T cells (ATCC #CRL-11268).
3. DMEM (Gibco #11965-092).
4. FBS (Gibco #26140079).
5. PBS (Gibco #20012027).
6. 0.25% Trypsin-EDTA (Gibco #25200056).


7. dCas9-KRAB expression vector: UCOE-SFFV-dCas9-BFP-KRAB (Addgene: 85969), SFFV-dCas9-BFP-KRAB (Addgene: 46911), or pHR-EF1a-dCas9-HA-BFP-KRAB-NLS (Addgene: 102244).
8. Lentivirus packaging plasmids: pMD2.G (Addgene: 12259), pCMV-dR8.91 (Trono Lab).
9. TransIT-LT1 Transfection Reagent (Mirus #2300).
10. ViralBoost Reagent (Alstem #VB100).
11. 0.45 μm membrane filter (Sigma #HVHP02500).
12. 10 mL syringes (BD #309604).
13. Access to a FACS instrument.

2.2 Pooled sgRNA Library Preparation and Virus Production

1. CRiNCL sublibraries for U87 cells (Common: 86538, Cancer Common: 86539, U87 & HEK293T: 86547, U87 unique: 86542).
2. MegaX DH10B cells (Thermo-Fisher C640003).
3. SOC Outgrowth Media (NEB B9020S).
4. LB Agar Plates, Carbenicillin-100 (Teknova #L1010).
5. Carbenicillin (Sigma #C1613).
6. NucleoBond Xtra Maxi Plus (Macherey Nagel #740416.10).
7. Access to Bio-Rad Gene Pulser II or comparable electroporator.
8. Access to an Illumina sequencer.

2.3 Expansion and Maintenance of Growth Screen

1. Large format tissue culture treated plates (Corning #CLS430599).
2. Polybrene Infection Agent (Sigma #TR-1003-G).
3. Puromycin (Sigma #P9620).
4. DMSO (Sigma #D8418).
5. Cryogenic vials 2 mL (Corning #430659).

2.4 Processing of Genomic DNA for Illumina Sequencing of sgRNA Barcodes

1. NucleoSpin Blood XL (Macherey Nagel #740950.10).
2. SbfI-HF (NEB #R3642L).
3. Sub-Cell 192 Cell gel electrophoresis unit and power supply (or equivalent; Bio-Rad #1704508).
4. UV-Transparent Gel Tray 25 × 20 cm (Bio-Rad #1704523).
5. Large format gel comb (Bio-Rad #1704531).
6. Agarose (Bio-Rad #1613102).
7. TAE buffer 50× (ThermoFisher #B49).
8. 1 kb Plus DNA Ladder (ThermoFisher #10787018).
9. Gel Loading Dye, Purple (6×) (NEB #B7024S).


10. NucleoSpin Gel and PCR Clean-up (Macherey Nagel #740609.250).
11. NaOAc, 3 M (ThermoFisher #AM9740).
12. Phusion High-Fidelity DNA Polymerase (NEB #M0530L).
13. SPRIselect magnetic beads (Beckman Coulter #B23318).
14. DynaMag-2 magnetic rack (Thermo Fisher #12321D).
15. Qubit dsDNA HS Assay Kit (Thermo Fisher #Q32854).
16. Access to a Qubit Fluorometer.
17. Access to a Bioanalyzer or TapeStation (Agilent).

2.5 Analysis of Screen Sequencing Data

1. Access to a Linux or Mac workstation with Python 2.7 and the following Python libraries installed: NumPy, SciPy, Pandas, Matplotlib, and Biopython. IPython and Jupyter Notebook are recommended for interactive plotting functions.

3 Methods

3.1 Generation of CRISPRi Cell Line

The stable and efficient expression of dCas9-KRAB is critical for the success of the screen and subsequent follow-up experiments. We have found that for CRISPRi, a polyclonal population of cells expressing dCas9-KRAB is suitable as long as >95% of the cells are expressing the chimeric protein.
1. Expand U87 cells in DMEM with 10% FBS.
2. Obtain and prepare >10 μg of the dCas9-KRAB expression vector containing the ubiquitous chromatin opening element (UCOE), UCOE-SFFV-dCas9-BFP-KRAB, using standard plasmid preparation methods. Different dCas9-KRAB constructs can also be used, such as SFFV-dCas9-BFP-KRAB or pHR-EF1a-dCas9-HA-BFP-KRAB-NLS. Also prepare lentivirus packaging plasmids pCMV-dR8.91 and pMD2.G.
3. Plate 6 × 10^6 293T cells in a 10 cm plate with 10 mL of DMEM, 10% FBS the day before transfection.
4. In a 1.5 mL tube, mix 1500 μL serum-free DMEM with 45 μL Mirus TransIT LT1 and incubate for 5 min at room temperature.
5. In a separate 1.5 mL tube, mix 8 μg of pCMV-dR8.91, 1 μg of pMD2.G, and 9 μg of UCOE-SFFV-dCas9-BFP-KRAB.
6. Mix the DMEM-Mirus mixture into the plasmid mixture and vortex.
7. Incubate for 30 min at room temperature.
8. Add the mixture onto the 10 cm plate of 293T cells in a dropwise fashion.


9. Add 24 μL of ViralBoost Reagent into the 10 cm plate of 293T cells.
10. Allow virus production for 72 h.
11. At 48 h following transfection of lentivirus plasmids, seed 2 × 10^6 U87 cells onto a 10 cm plate.
12. At 72 h following 293T cell transfection, filter the virus-containing supernatant through a 0.45 μm filter.
13. Add 6 mL of filtered virus onto the U87 cells and allow 48 h for virus infection. Note: the amount of virus-containing medium used for infection may need to be adjusted. We generally aim for a ~20–40% infection rate.
14. Expand infected U87 cells for 2–3 additional days, at which point the cells are FACS sorted as follows: identify the BFP-positive population and sort for the top ~30% of cells in this BFP-positive population. Reanalyze the sorted population to confirm >95% BFP positive. If cells are under 95% pure, expand the sorted population for another 2–4 days and re-sort.
15. Expand the U87 dCas9-KRAB cell line and freeze down aliquots.
16. Verify effective expression of dCas9-KRAB (see Notes 2 and 3). Note that BFP expression levels can diminish over time although the cell line may retain robust CRISPRi activity.

3.2 Pooled sgRNA Library Preparation and Virus Production

The CRISPRi Non-Coding Library (CRiNCL) is subdivided into 13 libraries according to cell type-specific expression of lncRNAs (Addgene 86538, 86539, 86540, 86541, 86542, 86543, 86544, 86545, 86546, 86547, 86548, 86549, 86550). For targeting lncRNAs expressed in U87 cells, the Common, Cancer Common, U87 & HEK293T, and U87 unique sublibraries will be used. Custom sgRNA libraries may also be used (see Note 4).
1. Obtain CRiNCL sublibraries corresponding to U87 from Addgene (Common: 86538, Cancer Common: 86539, U87 & HEK293T: 86547, U87 unique: 86542). Reconstitute the complete library by preparing a mixture of these sublibraries proportionally to the number of sgRNAs contained in each sublibrary. There are 58,452 total sgRNAs in this reconstituted library.
2. To amplify the library for use, first add 100 ng of complete library mixture to 50 μL of MegaX DH10B cells.
3. Electroporate the bacteria/library mixture using the Bio-Rad Gene Pulser II or comparable electroporator using the recommended settings: 2.0 kV, 200 ohms, 25 μF, in a 0.1 cm chilled cuvette. Add 1 mL of SOC media and shake at 37 °C for 1.5–2 h.
4. Using 5 μL of the bacteria suspension, prepare serial dilutions and plate on LB + carbenicillin plates.


5. Add the remaining suspension to 500 mL of LB + carbenicillin (100 μg/mL working concentration) and shake overnight at 37 °C.
6. If the transformation efficiency was above 1000 colonies per sgRNA, harvest the bacteria using (multiple) Maxiprep columns.
7. Recommended: confirm the sgRNA composition of the prepared library using Illumina sequencing by amplifying 100 ng of the amplified library using the following PCR primers:
5′: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacCTTGTAgcacaaaaggaaactcaccct
3′: CAAGCAGAAGACGGCATACGAGATCGACTCGGTGCCACTTTTTC
Sequence the amplicons using standard Illumina sequencing protocols with the following sequencing primers:
5′: GTGTGTTTTGAGACTATAAGTATCCCTTGGAGAACCACCTTGTTG
3′: CCACTTTTTCAAGTTGATAACGGACTAGCCTTATTTAAACTTGCTATGCTGT
8. To begin sgRNA library virus production, seed 3 × 15 cm plates, each containing 8.5 × 10^6 293T cells in 31 mL medium (DMEM, 10% FBS).
9. Transfect the 293T cells the following day: in a 15 mL tube, mix 3900 μL DMEM with 144 μL Mirus TransIT LT1 and incubate for 5 min at room temperature.
10. In a separate tube, mix the following plasmids: 24 μg of pCMV-dR8.91, 3 μg of pMD2.G, and 24 μg of the amplified, pooled sgRNA library.
11. Mix the plasmid tube with the media + Mirus tube and incubate for 20 min at room temperature.
12. Pipette 1355 μL of transfection mixture onto each 15 cm plate of 293T cells in a dropwise fashion.
13. Allow virus production to continue for 72 h before harvesting and filtering.
14. Filter a total of 90 mL of supernatant through a 0.45 μm filter and infect cells for screening immediately.

3.3 Expansion and Maintenance of Growth Screen

We generally perform pooled growth screens in biological duplicates, either in parallel or with replicates performed in a staggered fashion. The number of cells maintained per replicate during the screen should provide at least ~1000× coverage per sgRNA in the screening library. For the U87 screen using CRiNCL, this equates to two replicates of 68 million cells, assuming that 85% of the population contains sgRNA after a brief period of selection (empirically determined). Time between sgRNA library lentivirus infection and the first time point for cell harvest (T0) should be minimized in order to capture sgRNAs with strong negative selection phenotypes, as these may otherwise quickly drop out of the screen.
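The coverage arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `cells_per_replicate` and its parameter names are hypothetical, while the library size (58,452 sgRNAs), the 1000× coverage target, and the 85% sgRNA-positive fraction are taken from the text.

```python
import math


def cells_per_replicate(n_sgrnas, coverage=1000, fraction_with_sgrna=0.85):
    """Minimum cells to maintain so that each sgRNA is represented
    `coverage` times among the sgRNA-containing cells."""
    if not 0 < fraction_with_sgrna <= 1:
        raise ValueError("fraction_with_sgrna must be in (0, 1]")
    return math.ceil(n_sgrnas * coverage / fraction_with_sgrna)


n_cells = cells_per_replicate(58452)
print(f"{n_cells / 1e6:.1f} million cells per replicate")
```

The result is approximately 68.8 million cells, in line with the figure of about 68 million cells per replicate used throughout this protocol. The same function can be reused for a custom library by changing `n_sgrnas` and the measured sgRNA-positive fraction.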


1. The day before pooled lentivirus infection, seed 2 replicates of 12 × 15 cm plates, each with 4.75 × 10^6 U87 cells stably expressing dCas9-KRAB and 25 mL of DMEM, 10% FBS media. Expect >90 million cells per replicate the following day at time of infection. Also seed a single well of a 12-well plate with 100,000 U87 dCas9-KRAB cells as an uninfected control.
2. The following day (Day 0), add 3.5 mL of freshly prepared pooled lentivirus (see above) onto each 15 cm plate of U87 cells. Add polybrene to a final concentration of 4 μg/mL into each plate as well. Leave at 37 °C overnight.
3. Day 1: Change the media of the infected U87 cells to 25 mL of DMEM, 10% FBS.
4. Day 2: Monitor cells for confluence. If the cells are nearing confluence, passage the cells such that each replicate is seeded sparsely enough to grow in monolayer, but will still attain at least 68 million cells by day 5, the first time point of the screen. In our lab we perform the following for each replicate: 6 mL PBS wash to each plate, 4 mL 0.25% Trypsin-EDTA for 3 min at 37 °C, quench with 8 mL DMEM, 10% FBS. Triturate on plate and pool the dissociated cell suspensions into a 250 mL polystyrene bottle. Mix the bottle well. Count the cells using a manual hemocytometer.
5. Seed 68 million U87 cells evenly across 12 × 15 cm plates and add puromycin to a final concentration of 0.75 μg/mL. Also seed 100,000 infected cells onto each of two wells of a 12-well plate for monitoring of infection rate. One well should have puromycin at the same concentration as the main screen and the other should be grown without puromycin.
6. Day 3: Perform flow cytometry of the uninfected, infected without puromycin, and infected with puromycin cell populations and split them 1 to 3. Continue to monitor the percentage of BFP-positive population in these small-scale samples, as they reflect the BFP% in the large screen. Note: BFP expression from the sgRNA expression vector should be much brighter than the BFP signal from dCas9-BFP-KRAB, and therefore sgRNA-infected cells should be readily distinguishable on flow cytometry.
7. Change media to puromycin-containing media with a final concentration of 0.75 μg/mL.
8. Day 4: Split and expand the U87 cells undergoing the screen by performing the following: 6 mL PBS wash to each plate, 4 mL 0.25% Trypsin-EDTA for 3 min at 37 °C, quench with 8 mL DMEM, 10% FBS. Triturate on plate and pool the dissociated cell suspensions into a 250 mL polystyrene bottle. Mix the bottle well. Count the cells using a manual hemocytometer.


9. For each replicate screen, collect 136 million cells in suspension, which is double the minimum number of cells that must be maintained during the screen, and plate evenly across 12 × 15 cm plates in regular DMEM, 10% FBS, without puromycin.
10. Perform flow cytometry on a small aliquot of the reconstituted pool of cells undergoing screening. If at least 80% of cells are sgRNA and BFP positive, do not add puromycin to the plates. Instead, allow cells to grow in normal media for one more day.
11. Day 5 (T0): After one day of recovery growth without drug selection, passage the cells as follows: 6 mL PBS wash to each plate, 4 mL 0.25% Trypsin-EDTA for 3 min at 37 °C, quench with 8 mL DMEM, 10% FBS. Triturate on plate and pool the dissociated cell suspensions into a 250 mL polystyrene bottle. Count using a manual hemocytometer and record cell numbers for determination of doubling rate. Fill these values into the "cell_doubling_measurements.xlsx" spreadsheet included in the ScreenProcessing pipeline (see Analysis protocol, below).
12. Prepare aliquots of 68 million cells (equivalent to (size of sgRNA library × 1000)/(proportion of BFP-positive cells)) and centrifuge them at 1000 × g for 5 min. Resuspend each aliquot in 2 mL of freezing media (90% FBS, 10% DMSO) and transfer them to a cryovial for liquid nitrogen storage. Each aliquot can be subsequently harvested for genomic DNA or thawed for a repeat screen.
13. For each replicate, seed 68 million live cells across 12 × 15 cm plates and allow growth for 48 h.
14. A small aliquot of the remaining cells can be analyzed for population purity using flow cytometry.
15. Day 6–17: Continue to passage cells every 2 days, maintaining a minimum cell count of 68 million cells per replicate (approximately 1:4 passage ratios for U87). Freeze down aliquots of 68 million cells after cells have undergone 5 and 10 cell doublings following T0 (Day 11 and Day 17, respectively, for U87 cells in this protocol). These will be used as intermediate and final time points for screen data analysis.

3.4 Processing of Genomic DNA for Illumina Sequencing of sgRNA Barcodes

LncRNA knockdown phenotypes in this growth-based screen are reflected in relative enrichment or depletion of sgRNA barcodes in a population of cells at the end of the screen compared to the beginning. Such changes are monitored by targeted sequencing of the integrated sgRNA barcodes following genomic DNA isolation. This protocol uses a restriction digest and large gel extraction to enrich input DNA for the sgRNA cassette, as we previously used for our large-scale screens [5]. However, for small input DNA quantities (e.g., 200 μg) the digest and gel may be entirely omitted, and all DNA may be directly used in the PCR amplification step (e.g., 200 × 100 μL reactions with 1 μg input DNA each for 200 μg total). Alternatively, recently developed high-capacity polymerases such as NEBNext Ultra II Q5 may allow for tenfold increases in input DNA concentration (10 μg per 100 μL reaction) and thus make omitting the digest and gel steps practical for all screens.
1. Isolate genomic DNA from aliquots of frozen cells at T0 and also at intermediate and end time points using the Macherey Nagel NucleoSpin Blood XL as follows. Each column of the XL kit can process ~100 million cells, which exceeds the number of cells grown per replicate of U87 cells at each time point.
2. Thaw cryovials of cells in a 37 °C water bath and transfer the suspension to 10 mL of PBS in a 15 mL conical tube, then centrifuge at 1300 × g for 5 min.
3. Follow the Macherey Nagel manufacturer's protocol using the following amounts of reagents: Proteinase K 500 μL, BQ1 10 mL, 100% ethanol 10 mL.
4. Elute genomic DNA with 800 μL EB preheated to 70 °C, incubating on the column for 5 min at room temperature before centrifugation. Repeat with another 800 μL EB to maximize yield. Expect ~1 mg of genomic DNA per sample in 1.2–1.5 mL of EB.
5. Next, enzymatically fragment and size select the genomic DNA to reduce the amount of input for sgRNA library PCR. Enzymatic digestion may be avoided if the number of input cells is relatively low (15 million or fewer; e.g., <200 μg genomic DNA). To the entire volume of purified genomic DNA, add NEB 10× CutSmart buffer to a final concentration of 1× and add 400 U of SbfI-HF per mg of genomic DNA. Incubate at 37 °C overnight.
6. Prepare a large TAE 0.8% agarose gel (400 mL worth of TAE) and use a gel loading comb that can accommodate 1.2–1.5 mL of DNA digest per well.
7. Load each DNA digest sample with a final concentration of 1× loading dye. Also run a 1 kb plus ladder. Gel electrophoresis should be performed at ~120 V for 60–90 min.
8. Excise the gel region between 350 bp and 700 bp, using the ladder as a guide. The DNA fragments that contain sgRNAs may not be evident on the gel, since the vast majority of the fragmented DNA visible on the gel corresponds to genomic DNA, so cutting based on an accurate ladder is critical.


S. John Liu et al.

9. Perform gel purification using the Macherey-Nagel NucleoSpin Gel and PCR Clean-up kit as follows: weigh the excised gel and add 2 volumes of Buffer NTI. Dissolve the agarose gel in a 56 °C water bath for 10 min, then add 1/100th volume of 3 M NaOAc, pH 5.3. Load the entire volume of dissolved gel into a single DNA binding column (maximum 100 million cells worth of DNA fragments per column) attached to a vacuum apparatus. Wash the column with 700 μL Buffer NT3 and dry the membrane according to the manufacturer’s recommendations. Elute in 20 μL of EB, preheated to 70 °C and incubated on the column for 5 min before centrifugation. Repeat the elution with another 20 μL of preheated EB.
10. PCR amplify the sgRNA cassettes using custom Illumina primers with unique indices for each sample. A combination of Illumina “Set A” and “Set B” indices should be used to maximize sequencing diversity. For instance, two replicate screens each with T0 and T12 samples should be indexed as follows:
(a) Rep 1 T0: 5′ Index 12 + Common 3′.
(b) Rep 1 T12: Common 5′ + 3′ Index 6.
(c) Rep 2 T0: 5′ Index 14 + Common 3′.
(d) Rep 2 T12: Common 5′ + 3′ Index 10.
Primer sequences are as follows:
5′ Truseq index 12: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacCTTGTAgcacaaaaggaaactcaccct
5′ Truseq index 14: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacAGTTCCgcacaaaaggaaactcaccct
5′ Truseq index 3: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacTTAGGCgcacaaaaggaaactcaccct
3′ Truseq index 6: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacGCCAATcgactcggtgccactttttc
3′ Truseq index 10: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacTAGCTTcgactcggtgccactttttc
3′ Truseq index 1: aatgatacggcgaccaccgagatctacacgatcggaagagcacacgtctgaactccagtcacATCACGcgactcggtgccactttttc
Common 3′ Primer: CAAGCAGAAGACGGCATACGAGATCGACTCGGTGCCACTTTTTC
Common 5′ Primer: CAAGCAGAAGACGGCATACGAGATGCACAAAAGGAAACTCACCCT
11. Using the primer configuration listed above, amplify the entirety of the purified DNA in multiple 100 μL reactions of 500 ng template each using Phusion DNA polymerase. Expect to set up many PCR reactions:
(a) Extracted genomic DNA, 500 ng.

Genome-Scale CRISPR Interference of lncRNAs


(b) 5× Phusion HF buffer, 20 μL.
(c) DMSO, 3 μL.
(d) Index Primer 100 μM, 0.4 μL.
(e) Common Primer 100 μM, 0.4 μL.
(f) dNTP 10 mM, 2 μL.
(g) Phusion DNA Polymerase, 1 μL.
12. PCR protocol:
(a) 98 °C, 30 s.
(b) 23 cycles:
• 98 °C, 30 s
• 56 °C, 15 s
• 72 °C, 15 s
(c) 72 °C, 10 min.
(d) 4 °C hold.
13. Pool all PCR products into a 15 mL tube, mix well, and proceed with double-sided SPRI DNA purification using 300 μL of mixed PCR product in a 1.7 mL centrifuge tube. The enriched sgRNA cassette PCR product is 274 bp. The amount of PCR product to be purified can be varied as long as the volumetric ratio of SPRI beads to initial DNA solution remains constant throughout the procedure.
14. Add 195 μL SPRI beads (0.65×), mix by pipetting up and down, and incubate for 10 min at room temperature.
15. Place the 1.7 mL tube containing DNA and SPRI beads onto a magnetic rack (e.g., DynaMag-2 Magnet, Life Technologies) for 5 min at room temperature.
16. Transfer the supernatant to a new 1.7 mL tube. DO NOT DISCARD SUPERNATANT.
17. Add 300 μL SPRI beads (1×), mix by pipetting up and down, and incubate for 10 min at room temperature.
18. Place the 1.7 mL tube containing DNA and SPRI beads onto a magnetic rack for 5 min at room temperature.
19. Remove the supernatant. DO NOT DISTURB BEADS.
20. With the tube on the magnetic rack, add 1 mL 80% EtOH (freshly prepared), incubate for 2 min at room temperature, then remove all EtOH. Repeat once for a total of 2 EtOH washes.
21. With the tube on the magnetic rack, air dry the beads for 5–15 min.
22. Remove the tube from the magnetic rack, resuspend the beads in 20 μL EB, and mix well.


23. Incubate off the magnetic rack for 1 min at room temperature.
24. Place the tube back on the magnetic rack and transfer 19.5 μL of EB containing purified DNA product into a new tube.
25. Quantify DNA concentration on a Qubit Fluorometer using the high sensitivity dsDNA assay kit. Dilute the sample to 0.4 ng/μL for sequencing.
26. Run the diluted sample on a Bioanalyzer or TapeStation using a high sensitivity DNA assay kit. The purified DNA product should be 274 bp.
27. Submit to a core sequencing facility for sequencing on an Illumina HiSeq 2500 or 4000 using the single-end 50 protocol with an index length of 1 × 6 bp. A PhiX spike-in may be required to increase diversity of the sequencing reads. Alternatively, pool with unrelated sequencing libraries to maximize diversity. Custom sequencing primers required for sequencing are as follows:
5′ Sequencing Primer: GTGTGTTTTGAGACTATAAGTATCCCTTGGAGAACCACCTTGTTG
3′ Sequencing Primer: CCACTTTTTCAAGTTGATAACGGACTAGCCTTATTTAAACTTGCTATGCTGT

3.5 Analysis of Screen Sequencing Data

After demultiplexing screen sequencing reads, sgRNA counts are quantified separately for each condition. The ScreenProcessing pipeline and an example vignette are freely available on GitHub: https://github.com/mhorlbeck/ScreenProcessing. Briefly, raw sequencing reads are aligned and sgRNA library counts are generated. sgRNA phenotype scores are then generated based on comparisons between initial and intermediate/end time points of the screen. Gene/lncRNA-level scores are then generated and tested for statistical significance.

1. Ensure that Python 2.7 and the necessary dependencies for ScreenProcessing are installed (iPython, numpy, scipy, pandas, matplotlib, biopython).
2. Download the entire directory containing the ScreenProcessing analysis suite from https://github.com/mhorlbeck/ScreenProcessing. See the tutorial and consider running the scripts on the demo files first.
3. Initiate the sgRNA quantification script “fastqgz_to_counts.py” by running “run fastqgz_to_counts.py” in the command line along with the requested input parameters (“run fastqgz_to_counts.py -h” prints a help message describing the optional and required inputs).
4. To calculate sgRNA-level, transcript-level, and gene-level phenotypes, edit the corresponding configuration file with the correct sgRNA library, sample IDs, and doubling time in days, then run “process_experiments.py”.


5. Finally, run “screen_analysis.py” to generate publication-quality plots of screen results.
6. Screen hits for validation or follow-up can be selected in several ways, including p-value, absolute value of phenotype, or “discriminant score” cut-offs. Select a cut-off that includes few negative control genes. These are genes composed of randomly sampled non-targeting sgRNAs and are labeled “pseudo” in the gene table output from the process_experiments script. Hits should be further examined to ensure that multiple independent sgRNAs contribute to the gene-level phenotype and that the TSS of the lncRNA gene does not closely neighbor other gene TSSs (see Note 5). Orthogonal validation using non-CRISPR interference methods is also encouraged (see Note 6). Additional phenotypes may also be investigated in future screens (see Note 7).
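The scoring logic summarized in this section can be illustrated with a short Python sketch. It is not the ScreenProcessing implementation: it assumes a simple definition of the growth phenotype (log2 enrichment between T0 and the end point, centered on the non-targeting controls and divided by the number of doublings), and the function names, sgRNA identifiers, and counts are hypothetical.

```python
import math
from statistics import mean, median


def sgrna_phenotypes(t0_counts, end_counts, control_ids, doublings, pseudocount=1.0):
    """Return a growth phenotype per sgRNA (assumed definition, not the pipeline's)."""
    t0_total = sum(t0_counts.values())
    end_total = sum(end_counts.values())
    enrichment = {}
    for sgrna, t0 in t0_counts.items():
        # Normalize each count to library size before comparing time points.
        f0 = (t0 + pseudocount) / t0_total
        f1 = (end_counts.get(sgrna, 0) + pseudocount) / end_total
        enrichment[sgrna] = math.log2(f1 / f0)
    # Center on the non-targeting controls so that they score ~0.
    center = median(enrichment[c] for c in control_ids)
    return {s: (e - center) / doublings for s, e in enrichment.items()}


def gene_phenotypes(sgrna_scores, sgrna_to_gene):
    """Average sgRNA phenotypes per gene and count sgRNAs agreeing in sign."""
    by_gene = {}
    for sgrna, gene in sgrna_to_gene.items():
        by_gene.setdefault(gene, []).append(sgrna_scores[sgrna])
    summary = {}
    for gene, scores in by_gene.items():
        avg = mean(scores)
        same_sign = sum(1 for s in scores if s * avg > 0)
        summary[gene] = {"phenotype": avg, "n_sgrnas_same_sign": same_sign}
    return summary


# Hypothetical toy data: two sgRNAs for one lncRNA and two non-targeting controls.
t0 = {"lnc1_sg1": 100, "lnc1_sg2": 100, "nt_1": 100, "nt_2": 100}
end = {"lnc1_sg1": 25, "lnc1_sg2": 25, "nt_1": 175, "nt_2": 175}
scores = sgrna_phenotypes(t0, end, ["nt_1", "nt_2"], doublings=10, pseudocount=0.0)
genes = gene_phenotypes(scores, {"lnc1_sg1": "lnc1", "lnc1_sg2": "lnc1"})
print(genes)
```

A negative phenotype here indicates depletion relative to the non-targeting controls, and the sign-agreement count is a crude stand-in for checking that multiple independent sgRNAs support a hit. The sketch omits the statistical testing and the “pseudo” negative control genes described above.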

4 Notes

1. Additional protocols for generation and validation of CRISPRi/a cell lines and screening methods can be found at weissmanlab.ucsf.edu.
2. Upon generation of the CRISPRi cell line, it is strongly recommended to confirm efficient CRISPRi activity by targeting positive control genes using established sgRNA sequences [5, 9, 11] before proceeding to infection with the large-scale lentivirus sgRNA library. Common methods for doing so include infecting cells with individual sgRNAs targeting essential genes and observing depletion of infected cells, or infecting with sgRNAs targeting nonessential genes and measuring RNA knockdown by RT-qPCR. Example sgRNAs and qPCR primers are available at weissmanlab.ucsf.edu.
3. Individual sgRNAs that show pronounced phenotypes in the screen may also be cloned into a lentivirus expression vector for lncRNA knockdown, and cells stably expressing dCas9-KRAB that are infected with these (fluorescently labeled) vectors may be monitored using flow cytometry in an internally controlled growth assay.
4. Custom sgRNA libraries targeting lncRNA transcription start sites may be generated by selecting sgRNAs targeting a set of lncRNAs of interest from existing libraries such as the CRiNCL library and then synthesizing them (see step (d) below). If a set of lncRNAs is not targeted by existing libraries, sgRNAs may be designed as follows:
(a) Identify lncRNA transcripts of interest using established transcriptome references such as Ensembl, GENCODE, or other custom annotations. Expression levels or other biological features of lncRNAs may be used to prioritize genes for inclusion.


(b) Extract the transcription start sites of these transcripts and, where applicable, compare them to the FANTOM cap analysis of gene expression (CAGE)-based TSS annotations as previously described [5, 11].
(c) Ten candidate sgRNAs per lncRNA TSS are then generated within −25 bp and +500 bp relative to the TSS according to the hCRISPRi-v2.1 algorithm described in [12] and are prioritized based on predicted off-target scores, restriction digest sites, and lack of redundancy.
(d) The final set of sgRNAs targeting lncRNAs and also negative control sgRNAs (weighted by the per-base nucleotide frequencies of the targeting sgRNAs in the library) are then designed with flanking cloning and PCR sites described in [11], synthesized by Agilent Technologies or an equivalent service, and cloned into the appropriate vector as described above.
5. We have observed CRISPRi activity at up to 1 kb away from the targeted locus. A reasonable initial filter for screen hits would be to exclude any gene for which the sgRNAs targeting the gene TSS are within 1 kb of an essential gene TSS as defined by a database (e.g., DepMap). Alternatively, excluding any lncRNA for which the TSS is within 5 kb of any other gene TSS would be a reasonable strict filter that yields fewer genes for follow-up. In either case, hits should be validated as described in Note 6.
6. Orthogonal validation approaches such as antisense oligonucleotides (ASOs) are invaluable for further characterization of lncRNAs of interest, since they utilize RNase H-based degradation of targeted lncRNAs without altering the genomic DNA or epigenome, and they also represent a compelling mode of targeting lncRNAs for therapeutic applications [5, 8, 23]. We use locked nucleic acid (LNA) Gapmers (Qiagen) for validation and further studies of lncRNA transcript function. Other emerging approaches of direct RNA perturbation such as the Cas13 RNA-guided, RNA-targeting nuclease [24, 25] also represent promising methods that are orthogonal to DNA-targeting CRISPR/(d)Cas9 approaches. Critically, validation methods should be chosen such that they are completely orthogonal with respect to how lncRNA function is perturbed (e.g., DNA sequence modification, transcription modulation, or RNA degradation) and to the potential artefacts associated with the method [7].
7. As with all pooled screening approaches, cell growth represents just one possible phenotypic readout. As has been done for protein-coding genes, lncRNAs involved in drug sensitivity/resistance [11, 26, 27], synthetic lethality [28], and cellular differentiation [5, 15], among other phenotypes, can be dissected using principles similar to those outlined here.
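The two proximity filters described in Note 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the lncRNA TSS is used as a proxy for the sgRNA positions, coordinates are given as (chromosome, position) pairs, and all gene names and coordinates are hypothetical rather than taken from any annotation or from DepMap.

```python
def nearest_tss_distance(hit_tss, other_tss):
    """Distance in bp from a hit TSS to the closest TSS on the same chromosome.

    Returns None when no other TSS lies on that chromosome.
    """
    chrom, pos = hit_tss
    distances = [abs(p - pos) for c, p in other_tss if c == chrom]
    return min(distances) if distances else None


def passes_filter(hit_tss, other_tss, window_bp):
    """True if no TSS in other_tss lies within window_bp of the hit TSS."""
    nearest = nearest_tss_distance(hit_tss, other_tss)
    return nearest is None or nearest > window_bp


# Hypothetical coordinates (chromosome, TSS position); not real annotations.
essential_tss = [("chr1", 10_500), ("chr2", 80_000)]
all_other_tss = essential_tss + [("chr1", 13_000)]
hits = {"lncA": ("chr1", 10_000), "lncB": ("chr1", 16_000), "lncC": ("chr3", 5_000)}

# Initial filter: drop hits within 1 kb of an essential gene TSS.
lenient = [h for h, tss in hits.items() if passes_filter(tss, essential_tss, 1_000)]
# Strict filter: drop hits within 5 kb of any other gene TSS.
strict = [h for h, tss in hits.items() if passes_filter(tss, all_other_tss, 5_000)]
print(lenient, strict)
```

In this toy example the lenient filter removes only lncA, whereas the strict filter also removes lncB, which illustrates why the strict filter yields fewer genes for follow-up.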


References

1. Guttman M, Donaghey J, Carey BW et al (2011) lincRNAs act in the circuitry controlling pluripotency and differentiation. Nature 477:295–300. https://doi.org/10.1038/nature10398
2. Lin N, Chang K-Y, Li Z et al (2014) An evolutionarily conserved long noncoding RNA TUNA controls pluripotency and neural lineage commitment. Mol Cell 53:1005–1019. https://doi.org/10.1016/j.molcel.2014.01.021
3. Zhu S, Li W, Liu J et al (2016) Genome-scale deletion screening of human long non-coding RNAs using a paired-guide RNA CRISPR-Cas9 library. Nat Biotechnol 34:1279–1286. https://doi.org/10.1038/nbt.3715
4. Liu Y, Cao Z, Wang Y et al (2018) Genome-wide screening for functional long noncoding RNAs in human cells by Cas9 targeting of splice sites. Nat Biotechnol 36:1203–1210. https://doi.org/10.1038/nbt.4283
5. Liu SJ, Horlbeck MA, Cho SW et al (2017) CRISPRi-based genome-scale identification of functional long noncoding RNA loci in human cells. Science 355:eaah7111. https://doi.org/10.1126/science.aah7111
6. Kopp F, Mendell JT (2018) Functional classification and experimental dissection of long noncoding RNAs. Cell 172:393–407. https://doi.org/10.1016/j.cell.2018.01.011
7. Bassett AR, Akhtar A, Barlow DP et al (2014) Considerations when investigating lncRNA function in vivo. eLife 3:e03058. https://doi.org/10.7554/eLife.03058
8. Liu SJ, Lim DA (2018) Modulating the expression of long non-coding RNAs for functional studies. EMBO Rep 19(12):e46955. https://doi.org/10.15252/embr.201846955
9. Gilbert LA, Larson MH, Morsut L et al (2013) CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes. Cell 154:442–451. https://doi.org/10.1016/j.cell.2013.06.044
10. Qi LS, Larson MH, Gilbert LA et al (2013) Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152:1173–1183. https://doi.org/10.1016/j.cell.2013.02.022
11. Gilbert LA, Horlbeck MA, Adamson B et al (2014) Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159:647–661. https://doi.org/10.1016/j.cell.2014.09.029
12. Horlbeck MA, Gilbert LA, Villalta JE et al (2016) Compact and highly active next-generation libraries for CRISPR-mediated gene repression and activation. eLife 5:e19760. https://doi.org/10.7554/eLife.19760
13. Horlbeck MA, Witkowsky LB, Guglielmi B et al (2016) Nucleosomes impede Cas9 access to DNA in vivo and in vitro. eLife 5:e12677. https://doi.org/10.7554/eLife.12677
14. Radzisheuskaya A, Shlyueva D, Müller I et al (2016) Optimizing sgRNA position markedly improves the efficiency of CRISPR/dCas9-mediated transcriptional repression. Nucleic Acids Res 44:e141. https://doi.org/10.1093/nar/gkw583
15. Mandegar MA, Huebsch N, Frolov EB et al (2016) CRISPR interference efficiently induces specific and reversible gene silencing in human iPSCs. Cell Stem Cell 18:541–553. https://doi.org/10.1016/j.stem.2016.01.022
16. Boettcher M, Tian R, Blau JA et al (2018) Dual gene activation and knockout screen reveals directional dependencies in genetic networks. Nat Biotechnol 36:170–178. https://doi.org/10.1038/nbt.4062
17. Cho SW, Xu J, Sun R et al (2018) Promoter of lncRNA gene PVT1 is a tumor-suppressor DNA boundary element. Cell 173:1398–1412.e22. https://doi.org/10.1016/j.cell.2018.03.068
18. Ho T-T, Zhou N, Huang J (2015) Targeting non-coding RNAs with the CRISPR/Cas9 system in human cell lines. Nucleic Acids Res 43:e17. https://doi.org/10.1093/nar/gku1198
19. Goyal A, Myacheva K, Gross M et al (2017) Challenges of CRISPR/Cas9 applications for long non-coding RNA genes. Nucleic Acids Res 45:e12. https://doi.org/10.1093/nar/gkw883
20. Wang T, Birsoy K, Hughes NW et al (2015) Identification and characterization of essential genes in the human genome. Science 350:1096–1101. https://doi.org/10.1126/science.aac7041
21. Aguirre AJ, Meyers RM, Weir BA et al (2016) Genomic copy number dictates a gene-independent cell response to CRISPR/Cas9 targeting. Cancer Discov 6:914–929. https://doi.org/10.1158/2159-8290.CD-16-0154
22. Munoz DM, Cassiani PJ, Li L et al (2016) CRISPR screens provide a comprehensive assessment of cancer vulnerabilities but generate false-positive hits for highly amplified genomic regions. Cancer Discov 6:900–913. https://doi.org/10.1158/2159-8290.CD-16-0178
23. Meng L, Ward AJ, Chun S et al (2015) Towards a therapy for Angelman syndrome by targeting a long non-coding RNA. Nature 518:409–412. https://doi.org/10.1038/nature13975
24. Yan WX, Chong S, Zhang H et al (2018) Cas13d is a compact RNA-targeting type VI CRISPR effector positively modulated by a WYL-domain-containing accessory protein. Mol Cell 70:327–339.e5. https://doi.org/10.1016/j.molcel.2018.02.028
25. Konermann S, Lotfy P, Brideau NJ et al (2018) Transcriptome engineering with RNA-targeting type VI-D CRISPR effectors. Cell 173:665–676.e14. https://doi.org/10.1016/j.cell.2018.02.033
26. Konermann S, Brigham MD, Trevino AE et al (2015) Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature 517:583–588. https://doi.org/10.1038/nature14136
27. Jost M, Chen Y, Gilbert LA, Horlbeck MA et al (2017) Combined CRISPRi/a-based chemical genetic screens reveal that Rigosertib is a microtubule-destabilizing agent. Mol Cell 68:210–223.e6. https://doi.org/10.1016/j.molcel.2017.09.012
28. Horlbeck MA, Xu A, Wang M et al (2018) Mapping the genetic landscape of human cells. Cell 174:953–967.e22. https://doi.org/10.1016/j.cell.2018.06.010

Chapter 21

In Vivo Functional Analysis of Nonconserved Human lncRNAs Using a Humanized Mouse Model

Yonghe Ma, Cheng-Fei Jiang, Ping Li, and Haiming Cao

Abstract

LncRNAs (long noncoding RNAs) are transcripts that are at least 200 nucleotides long and lack any predicted coding potential. Whereas significant progress has been made in deciphering the function of mouse lncRNAs, critical gaps remain in understanding how human lncRNAs exercise their function in a physiological context. As most human lncRNAs are currently considered nonconserved and often do not have homologs in mouse, the technical bottleneck is the lack of a suitable model in which to study their physiological function. Chimeric mice with repopulated human hepatocytes have emerged as promising tools to study human-specific, liver-enriched lncRNAs. Among all liver-specific humanized mouse models, TK-NOG is relatively easy to prepare and maintains a higher repopulation rate for a prolonged period of time. In this chapter, we describe in detail how to establish humanized TK-NOG mice for in vivo analysis of human lncRNAs.

Key words lncRNA, Nonconserved, Humanized mice, Liver, TK-NOG, Repopulation

1 Introduction

Although only ~2% of the genome sequence is sufficient to encode all ORFs in humans, the vast majority of the genome can be transcribed, giving rise to thousands of long noncoding transcripts [1]. The past decade has seen unprecedented progress in understanding the function of lncRNAs in experimental animals, especially in the mouse, in processes ranging from cell proliferation, metabolism, and differentiation to apoptosis [2, 3]. This prompts us to explore whether similar mechanisms apply to human lncRNAs, as our current understanding of the physiological function of human lncRNAs is extremely limited. Surprisingly, over 80% of human lncRNAs are nonconserved, and most have no detectable homolog in mice [4, 5]. Cultured primary human hepatocytes, although a feasible in vitro model, have proven difficult to use for recapturing human lncRNA functions, despite considerable efforts to test this possibility [6–8]. Therefore,

Haiming Cao (ed.), Functional Analysis of Long Non-Coding RNAs: Methods and Protocols, Methods in Molecular Biology, vol. 2254, https://doi.org/10.1007/978-1-0716-1158-6_21, © Springer Science+Business Media, LLC, part of Springer Nature 2021


Yonghe Ma et al.

a validated in vivo animal model that can be used to study the large number of nonconserved human lncRNAs is highly needed. Mice with humanized chimeric liver have been widely and successfully used in the study of drug metabolism, hepatotoxicity, and pathogenesis [9–12]. The humanization is based on the ability of immune-deficient mice with severe liver damage to be engrafted with human primary hepatocytes [13]. The reconstituted humanized liver serves as a well-controlled and genetically consistent condition in which to study human liver-specific functions. Several humanized models, such as uPA-SCID [14], TK-NOG [15], and FRG-NOD [16], have been reported, differing in human cell repopulation rate, engraftment duration, survival rate, and preparation complexity. In TK-NOG mice, an albumin promoter drives the liver-specific expression of an HSVtk transgene in severely immunodeficient NOG mice. Ganciclovir (GCV) is a drug that is not toxic to either human or mouse tissues per se, but becomes toxic when phosphorylated. When administered to TK-NOG mice, GCV is phosphorylated by liver-expressed HSVtk, leading to tissue-specific damage of liver parenchymal cells. When engrafted with primary human hepatocytes, the damaged liver can be reconstituted to form a new chimeric liver. Importantly, the humanized TK-NOG livers maintain human liver functions for a long period of time and require relatively little effort for generation and maintenance. Moreover, we demonstrate that the human hepatocytes engrafted in the humanized mice can faithfully reflect the hepatocytes in human liver in vivo and could serve as a useful tool to study the regulation and function of liver-enriched, nonconserved human lncRNAs.

2 Materials

2.1 TK-NOG Mice Housing and Pretreatment

1. TK-NOG mice (NOD.Cg-Prkdcscid Il2rgtm1Sug Tg(Alb-TK)7-2/ShiJic), Catalog: 12907, Taconic Biosciences.
2. Ganciclovir (GCV), NDC: 63323-315-10, APP Pharmaceuticals. Inject 10 mL sterile water into the vial (50 mg/mL). Shake the vial until a clear solution is achieved. Dilute to 5 mg/mL with 0.9% Sodium Chloride before use.
3. 0.9% Sodium Chloride Injection, USP, NDC: 0409-4888-02, Hospira.
4. Alcohol Prep Pad, NDC: 10819-3914-1, Professional Disposables International.
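The dilution of reconstituted GCV from 50 mg/mL to the 5 mg/mL working solution follows C1 × V1 = C2 × V2. The short Python sketch below works through that arithmetic; the function name and the 10 mL batch size are hypothetical and chosen only for illustration.

```python
def dilution_volumes(stock_mg_per_ml, final_mg_per_ml, final_volume_ml):
    """Return (stock volume, diluent volume) in mL using C1*V1 = C2*V2."""
    if final_mg_per_ml > stock_mg_per_ml:
        raise ValueError("final concentration cannot exceed stock concentration")
    stock_ml = final_mg_per_ml * final_volume_ml / stock_mg_per_ml
    return stock_ml, final_volume_ml - stock_ml


# 50 mg/mL reconstituted GCV diluted to 5 mg/mL; the 10 mL batch size is hypothetical.
stock_ml, saline_ml = dilution_volumes(50, 5, 10)
print(f"{stock_ml:.1f} mL GCV stock + {saline_ml:.1f} mL 0.9% sodium chloride")
```

For any batch size this is a 1:10 dilution, i.e., one part reconstituted GCV to nine parts 0.9% sodium chloride.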

2.2 Intrasplenic Primary Human Hepatocyte Transplantation

1. Precision water bath, Catalog: 2829, Thermo Fisher Scientific.
2. Glass bead germinator, Catalog: GER 5287-120V, Braintree Scientific.

Functional Analysis of Human lncRNAs Using Humanized Mice


3. High intensity illuminator, Catalog: MI-150, Dolan-Jenner.
4. Circulation water warm pad, Catalog: HTP-1500, Adroit Medical System.
5. Betadine solution swab stick, NDC: 67618-153-01, Purdue Pharma.
6. Cotton-tipped applicator, Catalog: C150053-006, Cardinal Health.
7. Absorbent Underpads, Catalog: 1155Q87, Thomas Scientific.
8. Ophthalmic ointment, NDC: 17033-211-38, Dechra.
9. Mini Arco clipper kit, Catalog: CL8787-KIT, Kent Scientific.
10. Surgical scissors.
11. Forceps.
12. Needle driver.
13. 5–0 Vicryl Plus (Antibacterial) Undyed 27″ RB-1 Taper, Catalog: VCP213H, Esuture.
14. Insulin syringe, Catalog: 328438, BD.
15. Zetamine Injection, MWI Code: 501072, VetOne.
16. AnaSed Injection, NDC: 59339-110-20, Akorn Animal Health.
17. Anesthetic is prepared by diluting 2 mL Ketamine (Zetamine Injection, 100 mg/mL) and 1 mL Xylazine (AnaSed Injection, 20 mg/mL) with 7 mL 0.9% sodium chloride.
18. Buprenorphine SR (0.5 mg/mL), ZooPharm.
19. Primary human hepatocyte, Catalog Number: HUCPI, Lonza.
20. HyClone Hank’s 1× Balanced Salt Solution, Catalog number: 16777-153, GE Healthcare.
21. Cryopreserved Hepatocyte Recovery Medium (CHRM), Catalog number: CM700, Thermo Fisher Scientific.

2.3 After-Surgery Care

1. DietGel Boost, Clear H2O.
2. STAT Critical Care Support, PRN Pharmacal. This concentrated high-calorie liquid supplement should be diluted using sterile water (1:5) before use.

3 Methods

3.1 TK-NOG Mice Housing and Pretreatment

TK-NOG mice were housed in ventilated and autoclaved cages at an ambient temperature of 21–23 °C with an automated 12:12 h light/dark cycle and access to water and commercial rodent food ad libitum. One week before surgery, TK-NOG mice aged 8–10 weeks received an intraperitoneal (i.p.) injection of GCV (5 mg/mL) at a dose of 20 mg/kg (4 μL/g).
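The per-animal injection volume follows directly from the dose and the working concentration: 20 mg/kg divided by 5 mg/mL gives 4 mL/kg, which equals 4 μL per gram of body weight. A minimal Python sketch of this calculation is given below; the function name and the example body weights are hypothetical.

```python
def injection_volume_ul(body_weight_g, dose_mg_per_kg=20.0, conc_mg_per_ml=5.0):
    """Injection volume in microliters for a given body weight in grams.

    mg/kg divided by mg/mL gives mL/kg, which is numerically equal to uL/g.
    """
    ul_per_g = dose_mg_per_kg / conc_mg_per_ml
    return ul_per_g * body_weight_g


# Hypothetical body weights in grams.
for weight_g in (20, 25, 30):
    print(f"{weight_g} g mouse: {injection_volume_ul(weight_g):.0f} uL of 5 mg/mL GCV")
```

Weighing each mouse immediately before injection keeps the delivered dose at 20 mg/kg regardless of variation between animals.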


1. Warm the prepared GCV to room temperature, since injection of a cold substance will cause discomfort and a drop in body temperature.
2. Disinfect the top of the GCV vial with an Alcohol Prep Pad.
3. Slowly draw up GCV solution (4 μL/g) into a 1 mL syringe through a 30G ½ needle. Make sure there are no bubbles in the syringe.
4. Remove the mouse from the cage and gently restrain it in the head-down position.
5. Use a new Alcohol Prep Pad to scrub the skin around the injection area. The injection area is typically in the lower left or right quadrant of the abdomen.
6. Insert the needle into the abdominal cavity with the bevel facing up, at a 30–40° angle to the horizontal. Keep the mouse head down to avoid damage to the abdominal organs. Press the plunger until the solution has been fully administered.
7. Pull the needle straight out and place the syringe into a sharps container. Use a new needle and syringe for each animal.
8. Make sure there is no bleeding at the injection site. Then place the mouse back into the cage.

3.2 Intra-Splenic Primary Human Hepatocyte Transplantation

3.2.1 Primary Human Hepatocyte Preparation

1. Determine the number of hepatocyte vials needed for the surgery. One vial of primary human hepatocytes (~5 × 10⁶ cells) is sufficient to inject five TK-NOG mice (~10⁶ cells/mouse). One tube of Cryopreserved Hepatocyte Recovery Medium (CHRM) is used per vial of hepatocytes. Warm the CHRM to 37 °C in a water bath.
2. Quickly remove the cryopreserved hepatocyte vials from the liquid nitrogen and put them on a float. Thaw cryopreserved hepatocytes in a 37 °C water bath for