Practical Bioinformatics 9780815344568, 2012017992


English Pages 397 Year 2013


Table of contents :
Preface
Acknowledgments
Contents
CHAPTER 1: Introduction to Bioinformatics and Sequence Analysis
1.1 INTRODUCTION
1.2 THE GROWTH OF GenBank
1.3 DATA, DATA, EVERYWHERE
Further examples of human genome sequencing
1.4 THE SIZE OF A GENOME
1.5 ANNOTATION
1.6 WITNESSING EVOLUTION THROUGH BIOINFORMATICS
Recent evolutionary changes to plants and animals
1.7 LARGE SOURCES OF HUMAN SEQUENCE VARIATION
1.8 RECENT EVOLUTIONARY CHANGES TO HUMAN POPULATIONS
1.9 DNA SEQUENCE IN DATABASES
Genomic DNA assembly
cDNA in databases—where does it come from?
1.10 SEQUENCE ANALYSIS AND DATA DISPLAY
1.11 SUMMARY
FURTHER READING
Internet resources
CHAPTER 2: Introduction to Internet Resources
2.1 INTRODUCTION
2.2 THE NCBI WEBSITE AND ENTREZ
2.3 PubMed
2.4 GENE NAME EVOLUTION
2.5 OMIM
2.6 RETRIEVING NUCLEOTIDE SEQUENCES
2.7 SEARCHING PATENTS
2.8 PUBLIC GRANTS DATABASE: NIH RePORTER
2.9 GENE ONTOLOGY
2.10 THE GENE DATABASE
2.11 UniGene
2.12 THE UniGene LIBRARY BROWSER
2.13 SUMMARY
EXERCISES
Williams syndrome and oxytocin: research with Internet tools
FURTHER READING
CHAPTER 3: Introduction to the BLAST Suite and BLASTN
3.1 INTRODUCTION
Why search a database?
3.2 WHAT IS BLAST?
How does BLAST work?
3.3 YOUR FIRST BLAST SEARCH
Find the query sequence in GenBank
Convert the file to another format
Performing BLASTN searches
3.4 BLAST RESULTS
Graphic
Results table
The alignments
Other BLASTN hits from this query
Simultaneous review of the graphic, table, and alignments
3.5 BLASTN ACROSS SPECIES
BLASTN of the reference sequence for human beta hemoglobin against nonhuman transcripts
Paralogs, orthologs, and homologs
3.6 BLAST OUTPUT FORMAT
3.7 SUMMARY
EXERCISES
Exercise 1: Biofilm analysis
Exercise 2: RuBisCO
FURTHER READING
Internet resources
CHAPTER 4: Protein BLAST: BLASTP
4.1 INTRODUCTION
4.2 CODONS AND THE GENETIC CODE
Memorizing the genetic code
4.3 AMINO ACIDS
Amino acid properties
4.4 BLASTP AND THE SCORING MATRIX
Building a matrix
4.5 AN EXAMPLE BLASTP SEARCH
Retrieving protein records
Running BLASTP
The results
The alignments
Distant homologies
4.6 PAIRWISE BLAST
4.7 RUNNING BLASTP AT THE ExPASy WEBSITE
Searching for pro-opiomelanocortin using a protein sequence fragment
Searching for repeated domains in alpha-1 collagen
4.8 SUMMARY
EXERCISES
Exercise 1: Typing contest
Exercise 2: How mammoths adapted to cold
Exercise 3: Longevity genes?
FURTHER READING
CHAPTER 5: Cross-Molecular Searches: BLASTX and TBLASTN
5.1 INTRODUCTION
5.2 MESSENGER RNA STRUCTURE
5.3 cDNA
Synthesis
cDNA in databases
ESTs
Normalized cDNA libraries
An EST record
5.4 BLASTX
Reading frames in nucleic acids
A simple BLASTX search
A more complex BLASTX
Using the annotation of sequence records
BLASTX alignments with the reverse strand
5.5 TBLASTN
A TBLASTN search
Metagenomics and TBLASTN
5.6 SUMMARY
EXERCISES
Exercise 1: Analyzing an unknown sequence
Exercise 2: Snake venom proteins
Exercise 3: Metagenomics
FURTHER READING
CHAPTER 6: Advanced Topics in BLAST
6.1 INTRODUCTION
6.2 RECIPROCAL BLAST: CONFIRMING IDENTITIES
Demonstration of a reciprocal BLASTP
6.3 ADJUSTING BLAST PARAMETERS
Gap cost
Compositional adjustments
6.4 EXON DETECTION
Exon detection with BLASTN
Look at the coordinates
Exon detection with TBLASTN
Orthologous exon searching with TBLASTN
6.5 REPETITIVE DNA
Simple sequences
Satellite DNA
Mini-satellites
LINEs and SINEs
Tandemly arrayed genes
6.6 INTERPRETING DISTANT RELATIONSHIPS
Name of the protein
Percentage identity
Alignment length and length similarity between query and hit
E value
Gaps
Conserved amino acids
6.7 SUMMARY
EXERCISES
Exercise 1: Simple sequences
Exercise 2: Reciprocal BLAST
Exercise 3: Exon identification with TBLASTN
Exercise 4: Identification of orthologous exons with TBLASTN
FURTHER READING
CHAPTER 7: Bioinformatics Tools for the Laboratory
7.1 INTRODUCTION
7.2 RESTRICTION MAPPING AND GENETIC ENGINEERING
Restriction enzymes
Restriction enzyme mapping: the polylinker site
NEBcutter
Generating reverse strand sequences: ReverseComplement
DNA translation: the ExPASy Translate tool
7.3 FINDING OPEN READING FRAMES
The NCBI ORF Finder
7.4 PCR AND PRIMER DESIGN TOOLS
Primer3
Primer-BLAST
7.5 MEASURING DNA AND PROTEIN COMPOSITION
DNA Stats
Composition/Molecular Weight Calculation Form
7.6 ASKING VERY SPECIFIC QUESTIONS: THE SEQUENCE RETRIEVAL SYSTEM (SRS)
7.7 DotPlot
DotPlot of alternative transcripts
DotPlots of orthologous genes
7.8 SUMMARY
EXERCISES
Spider silk: a workflow of analysis
FURTHER READING
CHAPTER 8: Protein Analysis
8.1 INTRODUCTION
8.2 FINDING FUNCTIONAL PATTERNS
A repeating pattern within a zinc finger
8.3 ANNOTATING AN UNKNOWN SEQUENCE
A zinc protease pattern
The ADAM_MEPRO profile
8.4 LOOKING AT THREE-DIMENSIONAL PROTEIN STRUCTURES
Jmol: a protein structure viewer
Exploring and understanding a structure
Jmol scripting
8.5 ProPhylER
The Interface view
The CrystalPainter view
8.6 THE IMPACT OF SEQUENCE ON STRUCTURE
8.7 BUILDING BLOCKS: A MULTIPLE DOMAIN PROTEIN
8.8 POST-TRANSLATIONAL MODIFICATION
Secretion signals
Prediction of protein glycosylation sites
8.9 TRANSMEMBRANE DOMAIN DETECTION
8.10 SUMMARY
EXERCISES
Aquaporin-5
FURTHER READING
Internet resources
CHAPTER 9: Explorations of Short Nucleotide Sequences
9.1 INTRODUCTION
9.2 TRANSCRIPTION FACTOR BINDING SITES
Transfac
Identifying other binding sites for the estrogen receptor
Predicting transcription factor binding sites
An experiment with MATCH
An experiment with PATCH
9.3 TRANSLATION INITIATION: THE KOZAK SEQUENCE
9.4 VIEWING WHOLE GENES
9.5 EXON SPLICING
Renin: a striking example of a small exon
Another striking splice: human ISG15 ubiquitin-like modifier
Alternative splicing
Human plectin: alternative splicing at the 5′ end
Consensus splice junctions, translated
9.6 POLYADENYLATION SIGNALS
9.7 SUMMARY
EXERCISES
Inhibitor of Kappa light polypeptide gene enhancer in B-cells (IKBKAP)
FURTHER READING
CHAPTER 10: MicroRNAs and Pathway Analysis
10.1 INTRODUCTION
10.2 miRNA FUNCTION
10.3 miRNA NOMENCLATURE
10.4 miRNA FAMILIES AND CONSERVATION
10.5 STRUCTURE AND PROCESSING OF miRNAs
10.6 miRBase: THE REPOSITORY FOR miRNAs
10.7 NUMBERS AND LOCATIONS
10.8 LINKING miRNA ANALYSIS TO A BIOCHEMICAL PATHWAY: GASTRIC CANCER
10.9 KEGG: BIOLOGICAL NETWORKS AT YOUR FINGERTIPS
miRNAs in the cell cycle pathway
10.10 TarBase: EXPERIMENTALLY VERIFIED miRNA INHIBITION
Verified miRNA-driven translation repression
10.11 TargetScan: miRNA TARGET SITE PREDICTION
TargetScan predictions for cell cycle transcripts
10.12 EXPANDING miRNA REGULATION OF THE CELL CYCLE USING TarBase AND TargetScan
10.13 MAKING SENSE OF miRNAs AND THEIR MANY PREDICTED TARGETS
10.14 miRNAs ASSOCIATED WITH DISEASES
10.15 SUMMARY
EXERCISES
GDF8
FURTHER READING
CHAPTER 11: Multiple Sequence Alignments
11.1 INTRODUCTION
11.2 MULTIPLE SEQUENCE ALIGNMENTS THROUGH NCBI BLAST
11.3 ClustalW FROM THE ExPASy WEBSITE
11.4 ClustalW AT THE EMBL-EBI SERVER
MARK1 kinase
MAPK15 kinase
DNA versus protein identities
11.5 MODIFYING ClustalW PARAMETERS
Gap-opening penalty
The clustering method
11.6 COMPARING ClustalW, MUSCLE, AND COBALT
11.7 ISOFORM ALIGNMENT PROBLEM: INTERNAL SPLICING
11.8 ALIGNING PARALOG DOMAINS
11.9 MANUALLY EDITING A MULTIPLE SEQUENCE ALIGNMENT
Jalview
Editing with a word processor
11.10 SUMMARY
EXERCISES
FOXP2
FURTHER READING
CHAPTER 12: Browsing the Genome
12.1 INTRODUCTION
12.2 CHROMOSOMES
Human chromosome statistics
Chromosome details and comparisons
12.3 SYNTENY
Synteny of the sex chromosomes
12.4 THE UCSC GENOME BROWSER
OPN5: a sample gene to browse
Simple view changes in the UCSC Genome Browser
Configuring the UCSC Genome Browser window
Searching genomes and adding tracks through BLAT
Viewing the Multiz alignments
Zooming out: seeing the big picture
Very large genes: dystrophin and titin
Gene density
Interspecies comparison of genomes
The beta globin locus
12.5 SUMMARY
EXERCISES
Olfactory genes
FURTHER READING
APPENDIX 1: Formatting Your Report
A1.1 INTRODUCTION
A1.2 FONT CHOICE AND PASTING ISSUES
A1.3 FIND AND REPLACE
Changing file format
A1.4 HYPERTEXT
Creating hypertext
Selecting a column of text
A1.5 SUMMARY
APPENDIX 2: Running NCBI BLAST in “batch” Mode
ABBREVIATIONS
GLOSSARY
WEB RESOURCES
INDEX
Color Versions of Selected Figures


Practical Bioinformatics Michael Agostino


Dedication

This book is dedicated to my mother, Ruth Agostino, who tolerated my smelly biology and chemistry experiments in the basement of our house, and the endless number of muddy clothes and shoes from my frequent explorations of the woods near my home. I owe my love of exploration and discovery to you.


Vice President: Denise Schanck
Senior Editor: Gina Almond
Assistant Editor: David Borrowdale
Development Editor: Mary Purton
Production Editor: Ioana Moldovan
Typesetter and Senior Production Editor: Georgina Lucas
Copy Editor: Jo Clayton
Proofreader: Sally Huish
Illustrations: Oxford Designers & Illustrators
Cover Design: Andrew Magee
Indexer: Medical Indexing Ltd

© 2013 by Garland Science, Taylor & Francis Group, LLC

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the copyright holder.

ISBN 978-0-8153-4456-8

Front cover image: Chapter 8 of this book focuses on protein analysis. One example in this chapter is the superimposition of the black swan and Atlantic cod fish lysozyme structures (see Section 8.6). This allows the viewer to see the impact, or lack thereof, of the amino acid differences on the structures of these distantly related proteins. The book cover shows the amino acid sequence of this swan lysozyme (UniProt accession number P00717), repeated many times to fill the page, and is combined with the structure of the lysozyme protein (PDB identifier 1gbs).

About the author: Michael Agostino received his PhD in Molecular Biology from Roswell Park Memorial Institute, a division of SUNY at Buffalo, New York. His thesis characterized the unusual structure and evolution of sea urchin histone genes. Postdoctoral work included the development of a molecular assay for DNA strand scission agents used in chemotherapy. In 1984, he moved to the University of North Carolina at Chapel Hill where he co-developed a vector trap for gene enhancers. Other work included the creation of a synthetic gene and an E. coli blue-white reporter gene assay for HIV protease activity. In 1991 he formally switched careers to bioinformatics by joining GlaxoSmithKline. There, he provided sequence analysis, consulting, user-support, and training for the Glaxo scientists. In 1996 he moved to Genetics Institute, where he was appointed manager of a bioinformatics department. This group was responsible for the sequence analysis and database of a high-throughput effort to identify, express, and patent the human genes that encode secreted proteins. Presently, he provides bioinformatics analysis and end-user support for multiple sites of the Pfizer Research organization. He is also an adjunct professor in the Biology Department at Merrimack College, North Andover, Massachusetts (USA).

Library of Congress Cataloging-in-Publication Data Agostino, Michael J. Practical bioinformatics / Michael Agostino. p. cm. ISBN 978-0-8153-4456-8 (alk. paper) 1. Nucleotide sequence--Data processing. 2. Bioinformatics. I. Title. QP625.N89A39 2013 572.8’6330285--dc23 2012017992

Published by Garland Science, Taylor & Francis Group, LLC, an informa business, 711 Third Avenue, 8th floor, New York, NY 10017, USA, and 3 Park Square, Milton Park, Abingdon, OX14 4RN, UK. Printed in the United States of America 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

Visit our Website at http://www.garlandscience.com

Preface

Although bioinformatics is a relatively new scientific discipline, it has become quite broad in definition. It is often described as including diverse topics such as the analysis of microarrays and the accompanying statistics, protein structure prediction, and pathway and protein interaction analysis. Of course, computer programming, database development, and even hardware design are included in the field. Practical Bioinformatics is focused on the fundamental skills of bioinformatics: the analysis of DNA, RNA, and protein sequences. The chapters take the reader through a commonly asked question, “What can I learn about this sequence?” The only requirement is access to the Internet and a web browser; no other software is required.

This book is designed as an introduction to bioinformatics sequence analysis for biology and biochemistry majors. There are many published books that teach about detailed algorithms, sophisticated programs, and advanced interpretation of data. Although these are excellent sources of information, many biologists and biochemists are not prepared for, nor do they need, the depth and detail of these texts. Instead, they need the practical knowledge and skills to analyze sequences. They are asking questions such as “Which tools do I use?” “What settings should I use?” “What database should I search?” “What do these results mean?” “How do I export this information?”

Practical Bioinformatics addresses these questions, and many more, in 12 easily read chapters. Concepts will be introduced within each chapter and then demonstrated through the analysis of problems using selected gene/protein examples. Adequate background, details, illustrations, and references will be provided to ensure that readers understand the fundamentals and can do further reading if desired. Along the way, interesting genes, phenotypes, mutations, and biology will be introduced but not discussed extensively or analyzed. These topics are purposefully left open so they may easily be turned into literature searches, analysis problems, or senior projects for the ambitious student. Just thinking about these problems and how to analyze them will instill the habit of identifying topics needing exploration.

The best way to learn this material is by “doing.” Readers of this book will learn the concepts by performing many analysis problems. To get the most out of this book, readers should perform most, if not all, of the analysis steps and recreate the figures for themselves. By the time readers finish the book, they will have significant experience in sequence analysis problems, approaches, and solutions. They should then be ready to perform many analysis steps on their own, and tackle more advanced books on the subject.

A common error when approaching a sequence analysis problem is to use powerful analysis software with little understanding of how it works or how to interpret the output. Web forms and software can completely hide the details. This text will emphasize the proper use of established analysis software and the need to evaluate new tools. There are literally hundreds of bioinformatics tools available and no book could possibly contain or instruct on all the tools that are available. However, the repeated experience of performing guided analysis problems will teach the reader to be critical of bioinformatics software and to use proper positive and negative controls when testing unfamiliar tools. When this book is finished, readers will have both the practical knowledge and experience to address their own problems, and take advantage of the mountains of genetic data being generated today.

I would like to thank the staff and associates of Garland Science for their tremendous support during the process of writing this book. Thanks to Gina Almond who believed in the project from the very beginning and was never short on enthusiasm, David Borrowdale for guiding the book through the many steps, and Mary Purton for her infinite patience during the editing process. My thanks go to Ioana Moldovan, Georgina Lucas, Jo Clayton, and Sally Huish for their tremendous attention to detail and style during the final editing. Special thanks go to Oxford Designers & Illustrators for numerous illustrations. Thanks to Josephine Modica-Napolitano who gave me my first job in teaching, and the students at Merrimack College; together they put me on the path of writing this book. My special thanks to Donald J. Mulcare, my undergraduate advisor, for advice, encouragement, and my first real taste of what it is like to be a scientist. My years in industry would not have been the same without knowing the members of the “Dream Team:” Yuchen Bai, Sreekumar Kodangattil, Ellen Murphy, Padma Reddy, and Wenyan Zhong. They are the best sequence analysts I know. Additional thanks go to Maryann Whitley and Steve Howes for providing a calm and steady leadership at Pfizer. Many thanks to my daughter Becky, who inspires me to be better every day. Finally, this book would not have been possible without the years of support, encouragement, and love from my wife, Nan. Dreams can come true.

Instructor Resources Website

Accessible from www.garlandscience.com, the Instructor Resource Site requires registration and access is available only to qualified instructors. To access the Instructor Resource Site, please contact your local sales representative or email [email protected]. The images in Practical Bioinformatics are available on the Instructor Resource Site in two convenient formats: PowerPoint® and JPEG, which have been optimized for display. The resources may be browsed by individual chapter or a search engine. Figures are searchable by figure number, figure name, or by keywords used in the figure legend from the book. Answers to end of chapter questions/exercises are available on the Instructor Resource Site. Resources available for other Garland Science titles can be accessed via the Garland Science Website.

PowerPoint is a registered trademark of Microsoft Corporation in the United States and/or other countries.


Acknowledgments

The author and publisher of Practical Bioinformatics gratefully acknowledge the contributions of the following reviewers in the development of this book:

Enrique Blanco, University of Barcelona, Spain
Ron Croy, Durham University, UK
John Ferguson, Bard College, USA
Laurie Heyer, Davidson College, USA
Torgeir Hvidsten, Umeå University, Sweden
Ian Kerr, University of Nottingham, UK
Daisuke Kihara, Purdue University, USA
Peter Kos, Biological Research Centre of the Hungarian Academy of Sciences, Hungary
Jean-Christophe Nebel, Kingston University London, UK
Samuel Rebelsky, Grinnell College, USA
Rebecca Roberts, Ursinus College, USA
Hugh Shanahan, Royal Holloway, University of London, UK
Shin-Han Shiu, Michigan State University, USA
Shaneen Singh, Brooklyn College, USA
Alan Ward, Newcastle University, UK



             


CHAPTER 1

Introduction to Bioinformatics and Sequence Analysis

Key concepts
• The scope of bioinformatics
• The origins and growth of DNA databases
• Evidence of evolution from bioinformatics
• Example sequence analysis and displays using human Factor IX

1.1 INTRODUCTION

We are witnessing a revolution in biomedical research. Although it has been clear for decades that exploring the genetics of biological systems was crucial to understanding them, it was far too expensive and complex to consider obtaining genetic sequences for that exploration. But now, acquiring genetic sequences is affordable and simple, and data are being generated at unprecedented rates. The heart of understanding all this sequence lies in bioinformatics sequence analysis, and this book serves as an introduction to this powerful study of DNA, RNA, and protein sequence.

Bioinformatics concerns the generation, visualization, analysis, storage, and retrieval of large quantities of biological information. The generation of biomedical data, including DNA sequence, in its raw form does not involve bioinformatics skills. But in order for that sequence to be usable, it must be analyzed, annotated, and reformatted to be suitable for databases. These are all bioinformatics activities. Many of these activities can be automated, but their development and support come from someone with skills or experience in bioinformatics.

Once the data have been made available, how do you analyze them? Are they text, like DNA and protein sequence files? If so, the text should be presented in a way that allows interpretation or easy input into programs for analysis. Or is there so much information that the data are represented graphically? This form of data reduction is quite powerful and without it we would be staring at pages and pages of sequence without, literally, seeing the big picture.

Some analysis is manual, ranging from looking at the individual nucleotides or amino acids, to submitting sequence to a program that transforms the sequence into another form. This could include the location of features such as functional domains, modification sites, and coding regions. Often, analysis includes the searching of databases for the purposes of comparison or discovery, and this will be the primary activity for a number of chapters. Much of the content of this book is concerned with analysis.
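
The two ends of the manual analysis described above, looking at individual nucleotides and transforming a sequence into another form, can be sketched in a few lines of Python. This is an illustration only and is not taken from the book, which relies on Web tools; the example sequence and the function names are made up for the sketch.

```python
from collections import Counter

# Made-up DNA sequence, used for illustration only.
dna = "ACAAGGGACTAGAGAAACCAAAACGAAAGGTGCAG"


def composition(seq):
    """Count how many times each nucleotide occurs."""
    return Counter(seq.upper())


def gc_fraction(seq):
    """Fraction of positions that are G or C (seq must be non-empty)."""
    counts = composition(seq)
    return (counts["G"] + counts["C"]) / len(seq)


def reverse_complement(seq):
    """Transform a sequence into the sequence of its opposite strand."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(complement)[::-1]


print(composition(dna))
print(round(gc_fraction(dna), 3))
print(reverse_complement(dna))
```

The sketch assumes an unambiguous A/C/G/T alphabet; real database records may contain other symbols (for example N), which this code passes through unchanged.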


Floppy disk databases

In the early years of GenBank, if you wanted access to the database you ordered a handful of floppy disks that were delivered in the mail.


Storage is usually not a responsibility of those who will analyze the sequences. However, the creation of properly structured databases or storage forms so data can be queried and retrieved is essential for the analysts to do their work. Sequence files and other forms of data can be decades old or just created yesterday. But unless you can retrieve them easily, the value decreases quickly. “Easily” is not just describing the speed of the computers and connections delivering the information to you, although this can be extremely important. It also includes the steps to access and query the stored data. The ideal approach is often a Web form with easily understood options, online help, and results pages rich with hypertext. Bioinformatics was one of the first areas of science to embrace the Web as a vehicle for disseminating information and we’ll be using many Web pages in this book.

Finally, bioinformatics activities often involve large quantities of data. Even if you are focusing on a single gene, you still may have mountains of data that are connected to this single sequence. With a good database or software tool, you may only be aware of the quantities yet not overwhelmed with details that don’t interest you. Still, it can’t be emphasized enough that one of the biggest challenges facing the field of bioinformatics is the absolute deluge of information and how to generate, visualize, analyze, store, and retrieve these data.
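
Querying and retrieving stored data, as described above, can also be done programmatically rather than through a Web form. The sketch below is not from the book: it builds a request URL for NCBI's E-utilities EFetch service to retrieve one nucleotide record in FASTA format. The accession NM_000133 is assumed here to be the human Factor IX mRNA reference sequence and should be checked before use; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build the URL that asks EFetch for one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the record (needs network access; not called below)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Assumed accession for illustration; verify it before relying on it.
print(efetch_url("NM_000133"))
```

Only the URL is built and printed here, so the sketch runs without a network connection; calling `fetch_fasta` would perform the actual retrieval.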

1.2 THE GROWTH OF GenBank


How much data are we talking about? One way to answer this is to describe the amount of DNA sequence data in public databases. GenBank is a huge repository run by the US National Center for Biotechnology Information (NCBI). The inset in Figure 1.1 shows the steady growth in the early years of GenBank, but the rate of growth has been rapid since then. As of early 2011, there are over 126 billion nucleotides in the standard division of GenBank from over 380,000 organisms. If this were not impressive enough, there are an additional 91 billion nucleotides in the whole genome shotgun division (named for a type of sequencing), the section of GenBank dedicated to unfinished large sequencing efforts. If a DNA sequence is considered “public information,” it is deposited in GenBank, the DNA Data Bank of Japan (DDBJ), or the database of the European Bioinformatics Institute (EBI). The contents of these three databases are synchronized. In terms of disk space, the database is over 500 Gigabytes in size.
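As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The nucleotide counts and the 500 Gigabyte figure come from the text; the cost of one byte per nucleotide is an assumption made only for illustration (it is how plain text stores a letter), and nothing here describes how GenBank actually lays out its files.

```python
# Figures quoted in the text (early 2011).
STANDARD_DIVISION = 126_000_000_000  # nucleotides, standard GenBank division
WGS_DIVISION = 91_000_000_000        # nucleotides, whole genome shotgun division
QUOTED_DISK_GB = 500                 # "over 500 Gigabytes"


def total_nucleotides():
    """Sum of the two divisions mentioned in the text."""
    return STANDARD_DIVISION + WGS_DIVISION


def raw_sequence_gigabytes(nucleotides, bytes_per_nucleotide=1.0):
    """Disk space for the bare sequence letters alone.

    bytes_per_nucleotide=1.0 is an ASSUMPTION (one text character per
    nucleotide), not a statement about GenBank's real storage format.
    """
    return nucleotides * bytes_per_nucleotide / 1e9


total = total_nucleotides()
raw_gb = raw_sequence_gigabytes(total)
print(f"Total nucleotides: {total:,}")
print(f"Raw sequence at 1 byte each: {raw_gb:.0f} GB")
print(f"Share of the quoted {QUOTED_DISK_GB} GB: {raw_gb / QUOTED_DISK_GB:.0%}")
```

Under that assumption the bare sequence accounts for less than half of the quoted disk space. This would be consistent with each record carrying annotation and other text alongside the sequence, although the text does not break the total down.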

1.3 DATA, DATA, EVERYWHERE

Where are all the data coming from? The quick answer is everywhere! In recent years there has been a dramatic drop in prices and rapid advances in both sequencing technology and computing power. What was once too time-consuming and expensive is now very possible and affordable; biological sequence generation is now commonplace. A major driver for the advances being realized today is the Human Genome Project. Even though a draft of the human genome sequence was published in 2001 and the finished sequence was announced in 2003, the analysis of the data is ongoing and will take many years. These advances had to be coupled with dramatic improvements in computers and the drop in cost for processing power, memory, and storage. Of course, the Human Genome Project and all the spin-offs are only possible because of simultaneous advances in bioinformatics. This intersection of sequencing technologies, computational power, and advances in bioinformatics has made DNA sequencing quite routine and paved the way for many bold and ambitious projects. Projects now come from scientists


Figure 1.1 GenBank growth. Plotted is the size of GenBank in nucleotides versus the years from 1982 to the first three months of 2011. The inset shows data for years 1982–1994, not visible on the larger plot. From the GenBank Release Notes of Release 184, ncbi.nlm.nih.gov.

in numerous fields of biology, medicine, agriculture, ecology, history, energy, and forensics, just to name a few. Here are some prominent examples.

● The 1000 Genomes Project (www.1000genomes.org). An effort to sequence the genomes of 1000 people to identify genetic variants that are present in at least 1% of the human population. In addition to providing insights to genetic disorders and health risks, the history of human migrations is being revealed. In recent years, people have proposed that the number of human genomes to sequence for this project grow to be 10,000 or higher.

● The 1001 Arabidopsis thaliana Genomes Project (www.1001genomes.org). Arabidopsis is a widely used plant model due to its habitat diversity, genetics, and ease of manipulation. This genome project aims to study the genomes of 1001 strains that differ in phenotype including adaptation to growth in a wide variety of conditions. Project scientists and those in the Arabidopsis community are able to grow huge numbers of genetically identical plants and can vary the environment at will to challenge and observe the underlying genetic elements which define these strains.

● The Genome 10K Project (genome10k.soe.ucsc.edu). An effort to sequence the genomes of 10,000 vertebrate species, one from every genus. Along with all the other genomes sequenced, this project will make a tremendous impact on understanding the relationship between organisms. We can only guess what will be discovered from these animals, having so much in common with us but with such diverse physiologies and phenotypes, and occupying such a wide range of habitats.

● The i5k Initiative (www.arthropodgenomes.org/wiki/i5K). An effort to sequence the genomes of 5000 insects and arthropods. Many insects are either pests, carriers of disease, or beneficial to agriculture and man. More knowledge of their biochemical pathways will surely result in new avenues of control, utilization, and fascination.

● Metagenomics. This is a broad term covering the sequencing of DNA samples from the environment as well as from biomedical sources. For example, sequencing has led to the identification of the hundreds of bacterial species inhabiting our skin, mouth, and digestive system. The populations that live on and within us vary with our health state and are clearly linked to our physiology (as we are to theirs). The NCBI lists almost 350 metagenomics projects (www.ncbi.nlm.nih.gov/genomes/lenvs.cgi) that are either at the beginning stages or completed. These projects each generate anywhere from thousands to millions of sequences.

● Cancer Genome Atlas. This is a massive project (cancergenome.nih.gov) where thousands of specimens from all the major cancer types and their matched normal controls will have their RNAs and many of their genes sequenced.

● EST generation. ESTs (expressed sequence tags) are small samples of transcribed genes and a quick avenue for discovering the genes expressed in tissues or organisms. Clones are generated and sequenced by the thousands. There are at least 72 million EST sequences in GenBank.

● The Barcode of Life (www.barcodeoflife.org). Distinguishing closely related species is often difficult, even for taxonomy experts. For example, there are approximately 11,000 species of ants. How can you easily tell them apart? The Barcode of Life project aims to identify a DNA “signature” for each species in the world using a 648 base pair sequence of the cytochrome c oxidase 1 gene. The five-year goal is to have sequences from 500,000 species. Nice examples of consumer use of this information include the identification of illegal fishing of endangered species and illegal logging activities.

The NCBI lists over 1700 eukaryotic genome sequencing projects (www.ncbi.nlm.nih.gov/genomes/leuks.cgi), over 11,000 microbial genome projects (www.ncbi.nlm.nih.gov/genomes/lproks.cgi), and over 3100 viral genomes.


Tumbling costs

According to Eric Lander, director of the Genome Biology Program at the Broad Institute, it now costs about $20 to sequence the Escherichia coli genome, sequencing each of the 4.7 million nucleotides twenty times to ensure accuracy.


Keep flossing

Human microbiome studies have sampled bacteria from skin from all over your body, the gut flora (of course), even your navel (nicknamed “bellybutton biodiversity”). According to a Nature Reviews Microbiology Editorial, dental plaque is very dense with bacteria. The number of bacteria in a single gram is equivalent to the number of people who have ever lived.

There are also private sequencing efforts where the data are not always released to the public yet the parties acquiring the data still have to cope with the huge amount of sequence generated by these projects.

● Firms such as pharmaceutical and biotechnology companies are contracting other companies to generate sequence from patients, animals, important crops, plants, cell lines, tumors, and pathogens. They are also doing deep sequencing of complementary DNA (cDNA) libraries to identify rarely expressed genes. These efforts are being used to develop products such as new drugs, crops, and diagnostic kits.

● In response to an infectious disease, the genomes of suspected pathogens are being sequenced. For example, in 2011 there was a major pathogenic Escherichia coli outbreak in Europe that eventually killed several dozen people. In 2010 there was a cholera outbreak in Haiti following the devastating earthquake there. In both cases, the genomes of the causative bacteria were sequenced to better understand the pathogens and learn how to treat the diseases. Literally tens of thousands of human immunodeficiency virus (HIV) genomes have been sequenced.

● As the price drops, medical sequencing will probably become more commonplace for diagnosis of individuals in the general population.

There are many “smaller” projects that are contributing to the public data growth. There is a division of the NCBI Website (www.ncbi.nlm.nih.gov/popset) that only contains population studies (PopSet): collections of sequences from many members of the same species. For example, there are PopSets for spiders (102), rabbits (179), squirrels (83), skunks (94), robins (114), and ants (94). Within these records you can find the sequence of a single gene from hundreds or thousands of individuals. Smaller still in size, but not importance, are the efforts to understand a single gene or gene family. This analysis, often originating in an individual laboratory or academic department, is often very detailed and associated with publications. These analysis studies are at the heart of understanding how genes function. Many automated annotation efforts absolutely depend on these manual and long-term projects to serve as reference sequences.

Further examples of human genome sequencing


A long journey

In 2011, there were surprising reports of a mountain lion being seen in Connecticut, not the current habitat of these large cats. Shortly after these reports, a 140-pound male mountain lion was struck and killed by a car on a Connecticut highway. For the preceding several years, scientists using the DNA found in scat and hair samples had been tracing its movement from South Dakota, Minnesota, and Wisconsin, making the journey to Connecticut of at least 1500 miles.

Personal genome sequencing

Families with a common last name have often cooperated to establish links by common ancestors. Now some are using sequencing from the Y chromosome (inherited from father to son), mitochondria (passed from mothers to their children), or both with the specific purpose of establishing or verifying these links. Some have already uncovered unknown connections between families that would not have been possible to identify without the DNA sequence.

Companies have formed that specialize in these kinds of sequence analysis services. They can provide partial family histories for adoptees, provide information concerning paternity, and even identify the presence of the so-called “warrior gene” (MAOA), a gene variant associated with aggressive responses to threats. There are companies that offer the sequencing of your entire genome and the accompanying analysis as a service. As the cost comes down (estimated to drop as low as $1000 per genome) and the predictive value of genes goes up, you can expect more people to have their genomes sequenced.

Paleogenetics

This is a relatively new field, made possible by vast improvements in the isolation and amplification of DNA from ancient biological specimens. Scientists are now able to ask genetic questions of ancient times in history. For example, Schuenemann and colleagues sequenced DNA from the remains of people who were fourteenth century victims of the Black Death that swept through Europe. Their work shows that Yersinia pestis, the agent most probably causing the Black Death, was present but is a different strain from the one found today.

A spectacular display of paleogenetics is the sequencing of the Neanderthal genome. The DNA was obtained from bones thousands of years old and carefully sequenced by Richard E. Green and colleagues in the laboratory of Svante Pääbo. The analysis of these data has just begun but has already yielded interesting findings about our ancient relatives. Early work examined the language gene, FOXP2, investigating the Neanderthals’ ability to speak. It was also discovered that Neanderthals had an MC1R gene variant that leads to red hair. Comparisons between our genome and that of the Neanderthal reveal that approximately 2.5% of Neanderthal DNA sequence is in our genome, indicating that our ancestors interbred. Very recently, a 41,000-year-old bone from a new human ancestor (hominin) was discovered in Siberia and the genome obtained from it indicates that this group contributed a small amount of sequence to present-day Melanesians (people of the islands northeast of Australia).

Focused medical genomic studies

Genetic testing is a well-established hospital procedure for carrier or prenatal testing, diagnostics, and newborn screening for common genetic disorders. The formation of companies that specialize in gene sequencing for establishing genetic risks has some parties concerned that testing without reason or access to qualified counseling can lead to fear or poorly informed life decisions. Testing positive does not mean that you definitely have or will develop a disorder and a negative test does not guarantee that you will not develop the disorder. Others are concerned that disclosure of a positive test to an employer or insurance company may lead to negative consequences.

There are a number of studies in which patients had their DNA sequenced and analyzed for the purpose of identifying the molecular basis of genetic disorders. In the laboratory of David Galas, genome sequences were obtained from the parents and their two children who inherited separate and different recessive genetic disorders. One child was born with Miller syndrome, which causes facial and limb development abnormalities, and the other child was born with primary ciliary dyskinesia. The latter is characterized by the malfunction of microscopic cilia in the respiratory tract. Through careful analysis of the sequences, the disorders were narrowed down to four possible genes. Another finding from this study was an accurate measurement of the mutation rate per generation. Each child in this family was born with 70 mutations (sequences different from either parent), which was lower than that estimated for human generations using other methods.

In another study, a scientist, James Lupski, used his own DNA to identify the molecular basis of the disease Charcot-Marie-Tooth neuropathy that affected him and other members of his family. By sequencing his own genomic DNA, candidates for the cause of his disease were identified. More directed sequencing of the DNA of family members confirmed the mutations responsible for his family’s disorder.

In a final example, newspaper reporters received a Pulitzer Prize for an article describing a team effort at The Medical College and Children’s Hospital of Wisconsin where the genome sequence of a sick child was determined to assist in the diagnosis of his unusual disease. The Medical College has started a program where physicians can nominate medical cases where knowing the patient’s genomic sequence may help, and at least six patients are in the queue to have their DNA sequenced.

1.4 THE SIZE OF A GENOME

How much data is generated when a genome is sequenced? The genome size and gene number generally increase with the complexity of the organism, but there are some surprises. E. coli, the object of research for decades and resident in our digestive system, has 4300 genes in 4.7 million nucleotide pairs. Saccharomyces cerevisiae, a single-celled yeast used in cooking and fermentation, has an incrementally larger genome and number of genes (Table 1.1). Multicellular organisms show an increase in these numbers. The common fruit fly, Drosophila melanogaster, has almost 14,000 genes in 169 million base pairs. The genome of the vertebrate zebrafish, Danio rerio, is almost tenfold larger, yet only contains 26,000 protein-coding genes. The human genome is approximately 3.2 billion nucleotides long and contains approximately 21,000 protein-coding genes and at least 12,000 noncoding genes. Each mammal genome sequenced in the projects listed above will generate approximately the same amount of partially processed data as seen in human genome analysis.

Plants have complex genomes, reflecting a history of genetic duplications that far surpass the number seen in vertebrates. As a result they often have large genomes and gene numbers; maize (Zea mays) has 63,000 genes in a genome of size comparable to that of mammals while rice (Oryza sativa) has over 57,000 genes in a much smaller genome.

Table 1.1 The size of genomes

Species                     Genome size (10⁶ nucleotides)   Number of genes
Escherichia coli            4.7                             4300
Saccharomyces cerevisiae    12                              6700
Drosophila melanogaster     169                             13,900
Danio rerio                 1500                            26,000
Homo sapiens                3200                            21,000
Zea mays                    3200                            63,000
Oryza sativa                488                             57,000

Source: The Ensembl Genome Browser (www.ensembl.org) April 2012.
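The “surprises” in Table 1.1 are easier to see when gene number is divided by genome size. The following is a minimal Python sketch of that arithmetic using only the values in Table 1.1; the names `GENOMES` and `genes_per_megabase` are introduced here for illustration, and the density figure is a crude average that says nothing about how genes are actually distributed along a chromosome.

```python
# (genome size in millions of nucleotides, number of genes), from Table 1.1.
GENOMES = {
    "Escherichia coli": (4.7, 4300),
    "Saccharomyces cerevisiae": (12, 6700),
    "Drosophila melanogaster": (169, 13900),
    "Danio rerio": (1500, 26000),
    "Homo sapiens": (3200, 21000),
    "Zea mays": (3200, 63000),
    "Oryza sativa": (488, 57000),
}


def genes_per_megabase(size_megabases, gene_count):
    """Average number of genes per million nucleotides."""
    return gene_count / size_megabases


# Rank the species from the most gene-dense genome to the least.
ranked = sorted(
    GENOMES.items(),
    key=lambda item: genes_per_megabase(*item[1]),
    reverse=True,
)

for species, (size, genes) in ranked:
    density = genes_per_megabase(size, genes)
    print(f"{species:<26} {density:8.1f} genes per million nucleotides")
```

Viewed this way, the bacterium packs its genes most tightly, while the human genome, despite being among the largest in the table, has the fewest genes per million nucleotides.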

1.5 ANNOTATION


Economic impact of the Human Genome Project

The Human Genome Project cost the US government $3.8 billion yet the return on that investment has been incredible. According to a report by Battelle Technology Partnership Practice, the breakthroughs in technology and information spawned the birth and growth of both companies and academic laboratories, followed by the creation of products and services. In 2010 alone, this generated $67 billion in US output, supporting 310,000 jobs and $20 billion in personal income. Since the Human Genome Project started, over $49 billion in taxes have been paid to the US government from these genomic-related activities.

Of course, if all we had were file after file of just DNA sequence, we would learn little about the object of our sequencing efforts. The true value is realized when the DNA or protein sequence is described to tell us about genetic or protein elements, structures, similarities, functions, and predictions associated with these sequences. Collectively, these details are referred to as gene annotation. Like bioinformatics, annotation is a broad term and has different meanings to different people. Here, it is used to describe details such as where a gene starts and ends; similarities to other genes and proteins based on database searches; places that are known to vary; translation start and stop sites; places where the protein is predicted to be or is modified; association with a phenotype or disorder; and ties to other analysis or publications.

Annotation efforts are a big part of any genome or gene project and, depending on the size of the project, can be either manual or automated. When a genome sequence is finished, the annotation of the hundreds or thousands of genes has to be automated. Bioinformatics experts join together a “pipeline” of software tools that systematically analyzes each region of the genome, identifies genes, and then determines the details of those genes. The fields of bioinformatics and gene analysis would be at a near standstill without these pipelines, and this form of analysis is both powerful and very accurate.

An automated process cannot perform every conceivable analysis, however. The developers of the pipeline choose the questions to be asked and this analysis provides valuable, but basic, information about these newly discovered genetic elements. Automated efforts will miss details that the software is not trained to recognize. Importantly, automated annotation is not always updated. Annotations entered in a database when a gene is newly discovered may never be updated. If other members of this gene family later appear in the database, there may be no link between the older sequence and these new, more fully described sequences. The description on the older file may be frozen in time.
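One step that such a pipeline might automate is locating candidate translation start and stop sites. The following is a minimal Python sketch of that single step, not a description of any real annotation pipeline: the function name `find_orfs`, the `min_codons` threshold, and the toy sequence are all hypothetical, and the scan covers only the forward strand, looking for an ATG followed in the same reading frame by a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return sorted (start, end) pairs for ATG...stop stretches.

    Positions are 0-based and `end` is exclusive, so sequence[start:end]
    begins with ATG and finishes with the stop codon. Only the three
    forward reading frames are scanned (a simplifying assumption).
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon seen since the last stop
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None  # resume looking for the next start codon
    return sorted(orfs)


# Hypothetical toy sequence, invented for this illustration.
toy = "CCATGGCTTGGTAACC"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

The sketch also shows the limitation described above: it answers only the one question it was written to ask. Anything it was not designed to recognize, such as genes on the opposite strand or genes interrupted by introns, goes unreported.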

1.6 WITNESSING EVOLUTION THROUGH BIOINFORMATICS

In the history of life, there have been countless times when a gene’s sequence has randomly mutated with a concomitant change in the encoded protein structure and function. Some of these new functions imparted advantages to the organism and were retained for future generations. Deleterious mutations were quickly eliminated from the population. Other changes were neutral and, because they caused no harm, may or may not have been retained. Genes have been duplicated again and again, with each copy continuing to evolve, leading to large gene families and new functions. The path from unicellular to multicellular organisms, and the development of tissue, organs, and limbs, also increased genetic complexity, visible today in higher organisms. Throughout this book there will be numerous demonstrations where the fields of genomics and bioinformatics will show these steps in evolution.

Recent evolutionary changes to plants and animals

About 10,000 years ago, humans began to change from a hunter-gatherer lifestyle to practicing agriculture. Seeds were collected and kept from consumption for planting in the ground in the vicinity of their dwellings. By selecting seeds of plants with superior characteristics, ancient varieties of plants grew taller, produced bigger seeds, produced more nuts or fruit, and resisted inclement weather or disease. We can barely recognize the ancestral plants because this selection process has transformed their appearance so dramatically. However, their DNA sequence reveals the evolution.

The same applies to domesticated animals. Recent sequencing of the dog genome reveals their origins from wolves and places the time of domestication much earlier than plant domestication. Over time, we have transformed dogs into breeds with strikingly different phenotypes. They are all clearly dogs, but the size range, accomplished by careful breeding by humans, is astonishing. An adult Chihuahua weighs no more than 6 pounds and can be as small as six inches high, while a Saint Bernard can reach 180 pounds. Just based on weight, this variation is equivalent to a small human newborn and an adult man.

Other animals have also been bred for specific traits: cows (increased milk production), horses (speed or strength), sheep (wool quantity and quality), poultry (more breast meat), and fish (speed of maturation). Through the sequencing and study of their genes, genetic screening and manipulation may prove to be a more direct route to desired phenotypes. These studies are taking place now.

1.7 LARGE SOURCES OF HUMAN SEQUENCE VARIATION

One of the contributing reasons for the sharp decline in the cost of sequencing the human genome is that the first sequence to be obtained stands as a template to guide the assembly and analysis of subsequent genomes. These newer, so-called re-sequencing efforts do not require the many weeks of assembly and problem solving seen during the first genome sequencing. However, there are still considerable differences seen between individual people.


The Japanese Warrior Crab

The Japanese Warrior Crab (Heikea japonica) has on its back an uncanny resemblance to an artistic portrait of a Samurai warrior. Over hundreds of years, any captured crabs not looking like a warrior were kept for the market (and no longer reproduced) while those resembling the warriors were returned to the sea.

First, there are single nucleotide polymorphisms (SNPs). The entire human genome is approximately 3.2 billion nucleotide pairs long, and there are approximately 3 million nucleotides that differ when you compare the genomes of two people. These common differences are each found in at least 1% of the population. Many of these differences have no apparent impact on the function of the genome, while others disrupt gene regulation, or change coding regions resulting in altered amino acid sequences. People studying the genomes of tumors find many SNPs arise within the tumor. Some of these may be responsible for the cancer state, while others accumulate independently of the biochemical changes necessary to become a cancer cell.

There are also tremendous differences between genomes due to copy number variations (CNVs). Comparing your DNA sequence to that of the human “standard” genome, there are thousands of DNA segments which range from 1000 to several million nucleotides in length, and they are either present, present in multiple copies, or absent from your genome.

Kimberly Pelak and colleagues did a fascinating study published in 2010 where they sequenced the genomes of 20 people, 10 of whom had hemophilia A. Although they were faced with the many differences between individuals, they were able to identify the mutations causing hemophilia in 6 of the 10 patients. Surprisingly, they found that “on average, each genome carries 165 homozygous protein-truncating or stop loss variants in genes representing a diverse set of pathways.” Of the 21,000 protein-coding genes, almost 0.8% of our genes are unable to be translated to full-length proteins, essentially “knocking out” many of these genes.
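The proportions quoted in this section follow from simple division, sketched here in Python. All four input numbers come from the text; the helper name `percent` is introduced for illustration, and treating each of the 165 variants as falling in a different gene is an assumption made to reproduce the “almost 0.8%” figure.

```python
GENOME_LENGTH = 3_200_000_000               # nucleotide pairs in the human genome
DIFFERENCES_BETWEEN_TWO_PEOPLE = 3_000_000  # differing nucleotides between two people
PROTEIN_CODING_GENES = 21_000
TRUNCATING_VARIANTS_PER_GENOME = 165        # Pelak and colleagues, as quoted


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100 * part / whole


snp_percent = percent(DIFFERENCES_BETWEEN_TWO_PEOPLE, GENOME_LENGTH)
spacing = GENOME_LENGTH / DIFFERENCES_BETWEEN_TWO_PEOPLE

# ASSUMPTION: each truncating variant affects a different gene.
knocked_out_percent = percent(TRUNCATING_VARIANTS_PER_GENOME, PROTEIN_CODING_GENES)

print(f"Two genomes differ at about {snp_percent:.2f}% of positions")
print(f"That is roughly one difference every {spacing:.0f} nucleotides")
print(f"Genes carrying truncating variants: about {knocked_out_percent:.2f}%")
```

So although 3 million differences sounds like a large number, two people are still identical at well over 99% of the positions compared in this way.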

1.8 RECENT EVOLUTIONARY CHANGES TO HUMAN POPULATIONS

Since the emergence from Africa, humans have migrated to all continents except Antarctica (Box 1.1).

Box 1.1 The author’s DNA

My mother’s ancestors walked out of Africa perhaps 50,000 years ago. I don’t know their exact path, or how they got across rivers or mountains, or survived winters. For countless generations they fanned out from the Middle East across Europe, displacing, and eventually driving to extinction, the Neanderthals who had inhabited these lands for several hundred thousand years. But not before the Neanderthals contributed a small amount to their gene pool, shown recently by careful analysis of both modern human and Neanderthal genomic DNA sequences. My ancestors eventually crossed Eastern Europe and settled in what is now called Lithuania.

The evidence for this narrative is contained in my DNA. A partnership between National Geographic and IBM (The Genographic Project) aims to establish the migratory paths of modern humans through the collection and sequencing of DNA from many tens of thousands of volunteers. I am one of those volunteers and had a small section of my mitochondrial DNA sequenced. This snippet of DNA tracks with my mother’s side of the family as the mitochondrial DNA is only contributed through females. My DNA sequence is seen in Figure 1. The handful of nucleotides that vary from a reference sequence, shown in a lighter shade, indicate that I am in the “T haplogroup,” which places my ancestors on the migratory path described above. With time and more analysis, perhaps more details will be filled in.

Figure 1 The author’s DNA. The sequence shown is from the author’s mitochondrial DNA. The sequence was provided by The Genographic Project, www.nationalgeographic.com/genographic.

Along the way, in response to the environment, they have changed their diet and lifestyle. Here are a few examples along with the genetic changes associated with these adaptations. Many may have occurred during the last 40,000 years, or since the more recent start of agriculture. Included are the official gene names or symbols when known.

● Skin color. Humans near the equator have retained a darker skin color to block damaging ultraviolet light. However, people closer to the Earth’s poles need and have paler skin allowing them to make enough light-induced vitamin D. Sequence variation in a number of genes, such as SLC24A5, appears to be responsible for this skin pigmentation.

● Lactose tolerance. It is estimated that as recently as 8000 years ago, goats and cattle were domesticated and their milk was consumed by humans, especially at times of poor crop yields. This practice was probably a contributing factor to preventing starvation in the historically frequent famines. Normally, the ability to digest lactose rapidly decreases after early childhood, resulting in considerable intestinal discomfort after consuming milk. But a mutation arose which resulted in the persistence of lactase expression into adulthood, allowing milk consumption without side effects. Lactose tolerance quickly spread through the European population and a sequence variation near the promoter of the lactase gene, LCT, appears to be responsible. Interestingly, a different set of promoter sequence variations arose independently in pastoral African populations, reaching 90% of the Tutsi population.

● Digestion of starch. Like the digestion of lactose, there have been selective pressures for the increased ability to survive on high-starch diets. Amylase is an enzyme found in saliva and provides the first steps in the digestion of starch. It has been found that populations that consume a lot of starch have high copy numbers of the amylase (AMY1A) gene while populations with low-starch diets have fewer amylase genes. More amylase gene copies result in higher amylase expression, especially in saliva, conferring an advantage for digesting food.

● Malaria resistance and sickle cell anemia. Malaria is a tropical disease caused by a Plasmodium blood infection. It attacks hundreds of millions of people each year and is fatal to hundreds of thousands. Sickle cell anemia is a disease where mutations in the hemoglobin B gene (HBB) lead to misshapen red blood cells. In addition to carrying less oxygen, the crescent-shaped cells cause problems of poor circulation such as pain and organ damage. However, you are more likely to survive malaria if you carry one copy of the sickle cell disease gene and, over multiple generations, mutations of the sickle cell trait have spread rapidly through the populations most at risk for malaria.

● Life at high altitude. There are several human populations that live and thrive at extremely high altitudes and have high red blood cell counts in response to the low oxygen levels. Yi and colleagues identified a transcription factor gene, EPAS1, as having a SNP present in most high-altitude Tibetans but mostly absent from Han Chinese living at low elevation. The Tibetan population split from the Han population less than 3000 years ago. Interestingly, EPAS1 expression rises in response to low oxygen levels.

1.9 DNA SEQUENCE IN DATABASES

In earlier sections, the major drivers of database growth were described. With this growth and wealth of data comes the ability to address long-term questions such as finding molecular evidence of evolution, and examples of this were also described. Genomic and cDNA sequences are chiefly responsible for the flood of information into GenBank and the basics of DNA sequence assembly and cDNA


The black blood of Uro Indians

There is a legend that the Uro Indians of Peru had “black blood” which helped them survive at the cold and high altitudes. Although the legend may not be true it is interesting that the story comes from a time of limited biochemistry knowledge yet blood color, and therefore oxygen-binding hemoglobin, is connected to this legend.

synthesis will be described here and explained in more detail in Chapter 5. The same principles of genome sequence assembly apply to both established and next-generation sequencing methods. Importantly, bioinformatics plays a key role in assembling the millions of sequences into contiguous pieces of genome. As we search databases and come across pieces of DNA sequence, it is important to appreciate the origins of those fragments, both from a scientific point of view and source of pride in human ingenuity.

Genomic DNA assembly Most genomes are millions of nucleotides long, far surpassing the length of sequence generated by current sequencing technologies. So genomic sequencing efforts have all involved breaking chromosomal DNA into pieces and then working with the smaller fragments. Once the fragments are in a suitable format and size, the DNA sequence is determined and bioinformatics software assembles the fragments into long contiguous stretches with the goal of assembling the genome sequence from end to end. Now, it is important to remember that when chromosomal DNA is isolated, you are not working with just one copy. DNA is obtained from something abundant, for example cells grown in culture, a whole organ, or even a whole organism or flask of organisms. Since you are not working with a single cell, you are isolating the DNA from many millions of cells and therefore have millions of copies of each gene. It is also important to realize that you are sequencing random pieces of DNA. The approach you took to randomly fragment the chromosomal DNA generated many different beginnings and endings. That is, you are running thousands or millions of sequencing reactions at once and they correspond to regions all over the genome. Furthermore, your sequencing reactions are not generating the entire sequence of a gene; sequencing only generates short stretches. Finally, you are starting at random places in gene A, gene B, gene C, and so on. After you are done with the sequencing reactions, you have DNA sequence from the beginning and end of the genes, and everywhere in between. Because you had multiple copies of each gene in the original sequencing reactions, you have overlapping copies of sequence. But all of this is required to get any level of accuracy of sequence. To understand the strength in the randomness described above, let’s start with an analogy. 
Imagine taking a piece of paper on which two sentences are written and, with scissors, cutting the page into those individual sentences. Consider these two adjacent sentences,

Here comes a fox. The fox jumps over the lazy dog.

and the pieces that you generated with scissors, deliberately put out of order:

The fox jumps over the lazy dog.
Here comes a fox.

If you had no knowledge of the original order of sentences and were asked to assemble these as they were before, you would be at a loss since there is no overlap between sentences. There is some hint of sentence order if you consider them individually because you understand English, but you can’t be absolutely sure about the order when you try to reassemble the sentences. But if you had multiple copies of the sentences, and many pieces of sentences cut at random places, and pieces that spanned the two sentences, you could assemble, with confidence, the order of the sentences and place a consensus assembly (built from the agreement between words) underneath the fragments:

Here comes
     comes a fox. The f
             fox. The fox jumps over the
                                over the lazy dog.
                                     the lazy dog.
Here comes a fox. The fox jumps over the lazy dog.

This analogy is close to the approach and solution to sequencing and assembling genomic DNA. The overlapping nature allows us to confidently determine the relationship among fragments. The unique words (here, comes, jumps, over, lazy, dog) are analogous to genes, scattered about. Genomic DNA has repetitive elements, much like the repeating words (fox, the), also distributed unevenly. These can be a problem unless you have adjacent unique sequences (words). Now look at a small stretch of DNA sequence taken from a rat gene:

ACAAGGGACTAGAGAAACCAAAACGAAAGGTGCAGAAGGGGAAACAGATGCAGAAAGCATCTGGAGACAA

Let’s “sequence” multiple copies, starting from random locations, in overlapping pieces, and build a consensus, shown below the fragments:

ACAAGGGACTAGAGAAACCAAAA
            AGAAACCAAAACGAAAGGTGCAGAA
                     AACGAAAGGTGCAGAAGGGGAAACAGATGCAGA
                                  GAAGGGGAAACAGATGCAGAAAGCATCT
                                                   AGAAAGCATCTGGAGACAA
ACAAGGGACTAGAGAAACCAAAACGAAAGGTGCAGAAGGGGAAACAGATGCAGAAAGCATCTGGAGACAA
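To make the idea of assembling by overlap concrete, here is a minimal Python sketch that rebuilds the rat sequence from the five fragments above. It uses a simple "greedy" strategy (repeatedly merge the two pieces with the longest overlap), which is an assumption chosen for illustration and not how production genome assemblers work. The function names and the minimum overlap of five nucleotides are hypothetical choices.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; the pieces stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# The five fragments from the text, deliberately out of order
fragments = [
    "GAAGGGGAAACAGATGCAGAAAGCATCT",
    "ACAAGGGACTAGAGAAACCAAAA",
    "AGAAAGCATCTGGAGACAA",
    "AACGAAAGGTGCAGAAGGGGAAACAGATGCAGA",
    "AGAAACCAAAACGAAAGGTGCAGAA",
]
contigs = greedy_assemble(fragments)
print(contigs)
```

The minimum overlap matters. The last fragment ends with "ACAA" and the first begins with "ACAA", so a program willing to accept a four-nucleotide overlap could join the two ends of the sequence by mistake. This is a small-scale version of the repeat problem described in the text.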

Like the two-sentence analogy, the overlapping fragments allowed you to order the pieces. Note that there are multiple occurrences of “AAA” that could confuse the assembly process, but the adjacent sequence allowed the correct assembly.

Many early genome assemblies aimed for approximately sixfold coverage (overlapping regions six sequences deep) but with the next generation of sequencing machines, 20–100-fold coverage is now commonplace. Even so, errors in assembly can and do occur, often because of scattered repetitive regions ranging from hundreds to thousands of nucleotides long. These repeats are nearly identical in sequence and can be indistinguishable from each other. For reasons that are not always clear, some regions of DNA cannot be isolated very easily, do not clone at high efficiency, and/or cannot be sequenced very accurately. This results in regions that are underrepresented in the multifold coverage or not represented at all. These “holes” can be mapped and addressed through alternative means, but nevertheless represent a hurdle in generating genomic sequence. You will often see these holes as a long string of Ns (NNNNNNNNNNNNNNNNNNNNNNNN) as placeholders where the scientists know the length of a fragment, based on the distance between flanking markers, but not its content.

Remember, DNA is double-stranded. When you sequence DNA randomly, you could be sequencing the complementary strand, which gives you twice the sequence to consider when trying to pull all your fragments together to build the consensus. In practice, sometimes the sequence of a complementary strand is more easily obtained, so having the other strand’s sequence is a benefit and simply contributes to the redundancy you need for accuracy.

Finally, you cannot forget the huge bioinformatics contribution to the assembly of genomic DNA sequence.
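Because a random fragment may come from either strand, assembly software has to compare each fragment in both orientations. The short Python sketch below shows the idea using a 13-nucleotide stretch of the rat sequence shown earlier. The function name and the example read are hypothetical and chosen only for illustration.

```python
# Each nucleotide pairs with its complement; N (unknown) stays N
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


reference = "ACAAGGGACTAGAGAAACCAAAACGAAAGGTGCAGAAGGGGAAACAGATGCAGAAAGCATCTGGAGACAA"
read = "CCCCTTCTGCACC"  # a made-up read from the complementary strand

print(read in reference)                      # False: not found as written
print(reverse_complement(read) in reference)  # True: found on the other strand
```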
That is, if you have to assemble literally millions of randomly generated DNA sequences, each fragment ranging from 25 to 900 nucleotides long, you must use a computer program to accomplish this difficult goal. Going back to the two-sentence analogy above, imagine trying to assemble the sentences of a 23-volume encyclopedia, with millions of words in each volume, and 20–100 copies of those 23 volumes. Assembling the original version of the human genome sequence required the assembly of over 27 million fragments (approximately fivefold coverage) and literally weeks of computing time on one of the largest known computers at that time. Complete understanding of the complex computer programs that accomplished this feat is beyond the scope of this book, but it is valuable to appreciate that bioinformatics was key to this historic event.

With the above description as a background, be prepared that the genomic DNA sequence in our databases may:

● be very long pieces (contiguous stretches or contigs) but often, many small fragments;
● contain regions of unknown sequence;
● contain mistakes in sequence;
● be assembled incorrectly;
● contain either strand of the double helix in the database unless it is described in more detail (like a gene);
● be represented multiple times in the database.
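The "fold coverage" figures quoted above are simple arithmetic: the total number of nucleotides sequenced divided by the size of the genome. The sketch below applies this to the 27 million fragments mentioned in the text. The average fragment length (550 nucleotides) and the genome size (3 billion nucleotides) are round-number assumptions used only for illustration, and the function name is hypothetical.

```python
def fold_coverage(n_reads, mean_read_length, genome_size):
    """Average number of times each nucleotide in the genome is sequenced."""
    return n_reads * mean_read_length / genome_size


reads = 27_000_000       # number of fragments, from the text
mean_length = 550        # assumed average fragment length, in nucleotides
genome = 3_000_000_000   # assumed genome size, in nucleotides

coverage = fold_coverage(reads, mean_length, genome)
print(f"{coverage:.2f}-fold coverage")
```

With these assumed values the result is close to the "approximately fivefold" coverage quoted for the original human genome assembly.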

cDNA in databases: where does it come from?

A huge contribution to the sequence in DNA databases is cDNA. As more thoroughly explained in Chapter 5, messenger RNA (mRNA) is quite unstable and using enzymes to convert this polymer into stable DNA is the preferred approach for cloning and sequencing of these transcripts. cDNA analysis is critical to the understanding of gene expression and function so, as a result, this form of DNA is very prominent in the analysis in this book. The basic steps in cDNA synthesis are described below.

There are approximately 21,000 human protein-encoding genes. Around 8000 are ubiquitously expressed (that is, transcribed and translated) in all tissues and have functions in common with all cells: DNA replication, energy metabolism, regulation of transcription, translation, and so on. The expression of the other 13,000 genes is thought to be somewhat specific between the cell types, tissues, developmental states, or any circumstances that make a cell type or condition unique. For example, genes important for the function of blood should only be expressed in blood cells, and genes important for liver function should only be found in liver. A direct approach to identifying liver-specific genes would be to isolate all the proteins in a sample of liver and identify them by sequencing. This is technically challenging and is beyond the expertise of many laboratories.

"Tissue-specific" expression

The transcripts of most genes can be found in more than one tissue. Rather than think of genes as tissue-specific, it may be more accurate to think that gene expression is more “selective” for one tissue or cell type. Nevertheless, it is very common to refer to genes as being tissue-specific.

Another approach is to isolate and study the mRNAs in liver. These encode the proteins found in liver and you could generate a list of genes that appear to be liver-specific. However, studying mRNA is technically challenging, as mRNA is very labile so only the most meticulous handling will prevent it from breaking down quickly. A technically easier approach to studying mRNA is to make a complementary DNA copy of the mRNA and then sequence this copy. Complementary DNA, or cDNA, is very stable, easily handled, and sequenced with little difficulty. You still have the challenge of isolating and properly handling mRNA from cells, but once you have cDNA, your success is almost guaranteed. A brief explanation of mRNA and cDNA synthesis will help you understand what you are looking at in DNA databases. This will also be covered in more detail in Chapter 5.

Can measuring mRNA allow good estimates of protein levels?

Be aware that not all transcripts are translated. We’ll learn in Chapter 10 that there are possibly thousands of genes that have RNA as their end product and are never translated. Furthermore, translation rates of transcripts vary considerably so there may not be a direct correlation between protein levels and the abundance of an mRNA.

mRNA is a long polymer with a “cap” at the beginning (5′ end, pronounced “five prime”) and tail of AAAs (poly(A) tail) at the 3′ end (Figure 1.2A). cDNA synthesis begins by mixing the mRNA with the components necessary for the synthesis of cDNA. These components include the individual nucleotides (represented here as A, T, G, and C) and an enzyme called reverse transcriptase. As the name of this enzyme might suggest, it makes cDNA out of RNA. The enzyme requires a starting point: it needs a short stretch of DNA, called a primer, already sitting on the mRNA as a place to begin cDNA synthesis. Reverse transcriptase, like other DNA-synthesizing enzymes, can only begin at the 3′ end of the primer. If you add a primer of poly(T) DNA to the reaction, it will base-pair with the poly(A) tail in the opposite orientation (Figure 1.2B). Reverse transcription will then begin at the 3′ end of the poly(T) and extend the DNA synthesis toward the 5′ end of the mRNA (Figure 1.2C). The mRNA is then removed from the reaction, leaving only single-stranded cDNA behind (Figure 1.2D). The second strand of DNA now needs to be synthesized, but what will act as the 3′ primer for this reaction? Multiple solutions have been devised but an early way to prime the reaction was to allow the cDNA to fold back on itself and self-prime (Figure 1.2E). The second strand synthesis was then completed (Figure 1.2F). Subsequent steps then insert the cDNA into vectors suitable for cloning and sequencing.

cDNA synthesis does have limitations. Synthesis of the first strand is not always efficient and the reverse transcriptase may fall off the mRNA before reaching the 5′ end of the message. Since much of the early cDNA synthesis started with poly(T) priming, cDNAs in databases are often biased toward the 3′ end. Reverse transcriptase is also error-prone so there may be mistakes in the sequence. One type of cDNA mentioned above in Section 1.3 is called an Expressed Sequence Tag or EST. ESTs are often less than 500 nucleotides long even though they were derived from mRNAs thousands of nucleotides in length.
What they lack in length is balanced out by quantity: ESTs are synthesized and sequenced in very high numbers (thousands). The result is often a very thorough sampling of the mRNAs expressed in that cell line, tissue, or organ. If nothing is done to normalize the mRNA population, the cDNA synthesized will be proportional to the abundance of the various mRNAs found in those cells. That is, abundant mRNAs will give rise to most of the ESTs, and rare mRNAs will give rise to rare ESTs or none at all. When you randomly pick ESTs to sequence, by chance you will sequence the abundant cDNAs repeatedly. So scientists will sequence ESTs by the thousands to find those rare EST sequences. ESTs will be discussed further in Chapter 5.

Figure 1.2 Synthesis of a cDNA from an mRNA. (A) mRNA. (B) A DNA primer is attached to the poly(A) tail. (C) Reverse transcriptase extends the cDNA to the 5′ end of the mRNA. (D) cDNA after removal of the mRNA. (E) Formation of a “hairpin” at the 3′ end of the cDNA acts as a primer for synthesis of the complementary strand (F).

(A) mRNA 5′-GAAUUCACGUGGGAAUUCGCAGCAAAAUGAUGCAUAGCUCGCUGAUAGCUUUGAUAAAAAAAAAAAAAA-3′

(B) mRNA 5′-GAAUUCACGUGGGAAUUCGCAGCAAAAUGAUGCAUAGCUCGCUGAUAGCUUUGAUAAAAAAAAAAAAAA-3′
                                                                    3′-TTTTTTTTTT-5′

(C) mRNA 5′-GAAUUCACGUGGGAAUUCGCAGCAAAAUGAUGCAUAGCUCGCUGAUAGCUUUGAUAAAAAAAAAAAAAA-3′
    cDNA 3′-CTTAAGTGCACCCTTAAGCGTCGTTTTACTACGTATCGAGCGACTATCGAAACTATTTTTTTTTTTTTT-5′

(D) cDNA 3′-CTTAAGTGCACCCTTAAGCGTCGTTTTACTACGTATCGAGCGACTATCGAAACTATTTTTTTTTTTTTT-5′

(E)   GT
    C    GAATTC-3′
    A    CTTAAGCGTCGTTTTACTACGTATCGAGCGACTATCGAAACTATTTTTTTTTTTTTT-5′
      CC

(F)   GT
    C    GAATTCGCAGCAAAATGATGCATAGCTCGCTGATAGCTTTGATAAAAAAAAAAAAAA-3′
    A    CTTAAGCGTCGTTTTACTACGTATCGAGCGACTATCGAAACTATTTTTTTTTTTTTT-5′
      CC
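The base-pairing logic behind Figure 1.2 can be sketched in a few lines of Python. This is only a model of the pairing rules, not of the enzymology (priming, hairpin formation, and enzyme errors are ignored). The function names are hypothetical, and the input is the first 12 nucleotides of the mRNA in Figure 1.2.

```python
# Base-pairing rules: RNA template to DNA copy, and DNA template to DNA copy
RNA_TO_DNA = str.maketrans("ACGU", "TGCA")
DNA_TO_DNA = str.maketrans("ACGT", "TGCA")


def first_strand(mrna_5to3):
    """cDNA first strand, written 3' to 5' so it lines up under the mRNA."""
    return mrna_5to3.translate(RNA_TO_DNA)


def second_strand(cdna_3to5):
    """Second DNA strand, written 5' to 3' (same sense as the mRNA)."""
    return cdna_3to5.translate(DNA_TO_DNA)


mrna = "GAAUUCACGUGG"        # 5' end of the mRNA in Figure 1.2
cdna = first_strand(mrna)
copy = second_strand(cdna)

print("mRNA 5'-" + mrna + "-3'")
print("cDNA 3'-" + cdna + "-5'")
print("DNA  5'-" + copy + "-3'")
```

The second strand has the same sequence as the mRNA with "T" in place of "U", which is why transcripts in databases are conventionally displayed in this form.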

1.10 SEQUENCE ANALYSIS AND DATA DISPLAY

The following example illustrates a very simple sequence analysis problem. As the analysis progresses, the display of data changes, demonstrating some of the variety of styles that you will see in this book. Figure 1.3 shows the mRNA transcript from a human gene called Factor IX (pronounced “factor nine”). The Factor IX gene encodes a protein critical to the cascade of proteins that respond and work together to properly clot blood. Transcript sequences are conventionally shown in databases and in many publications as cDNA sequences, using “T” instead of “U.” In this form, the sequence is mostly uninformative, not providing any details except for general impressions about the nucleotide content and length (the sequence in Figure 1.3 is 2802 nucleotides long). But what if you knew two simple rules: protein-coding regions begin with “ATG” and end with “TAA,” “TGA,” or “TAG”? Figure 1.4 shows, in bold and underlined, the

Figure 1.3 The sequence of the mRNA for human Factor IX. In GenBank and many other databases, sequence files are given unique identification numbers called “accession numbers.” This sequence is from GenBank accession number NM_000133.

ACCACTTTCACAATCTGCTAGCAAAGGTTATGCAGCGCGTGAACATGATCATGGCAGAATCACCAGGCCT CATCACCATCTGCCTTTTAGGATATCTACTCAGTGCTGAATGTACAGTTTTTCTTGATCATGAAAACGCC AACAAAATTCTGAATCGGCCAAAGAGGTATAATTCAGGTAAATTGGAAGAGTTTGTTCAAGGGAACCTTG AGAGAGAATGTATGGAAGAAAAGTGTAGTTTTGAAGAAGCACGAGAAGTTTTTGAAAACACTGAAAGAAC AACTGAATTTTGGAAGCAGTATGTTGATGGAGATCAGTGTGAGTCCAATCCATGTTTAAATGGCGGCAGT TGCAAGGATGACATTAATTCCTATGAATGTTGGTGTCCCTTTGGATTTGAAGGAAAGAACTGTGAATTAG ATGTAACATGTAACATTAAGAATGGCAGATGCGAGCAGTTTTGTAAAAATAGTGCTGATAACAAGGTGGT TTGCTCCTGTACTGAGGGATATCGACTTGCAGAAAACCAGAAGTCCTGTGAACCAGCAGTGCCATTTCCA TGTGGAAGAGTTTCTGTTTCACAAACTTCTAAGCTCACCCGTGCTGAGACTGTTTTTCCTGATGTGGACT ATGTAAATTCTACTGAAGCTGAAACCATTTTGGATAACATCACTCAAAGCACCCAATCATTTAATGACTT CACTCGGGTTGTTGGTGGAGAAGATGCCAAACCAGGTCAATTCCCTTGGCAGGTTGTTTTGAATGGTAAA GTTGATGCATTCTGTGGAGGCTCTATCGTTAATGAAAAATGGATTGTAACTGCTGCCCACTGTGTTGAAA CTGGTGTTAAAATTACAGTTGTCGCAGGTGAACATAATATTGAGGAGACAGAACATACAGAGCAAAAGCG AAATGTGATTCGAATTATTCCTCACCACAACTACAATGCAGCTATTAATAAGTACAACCATGACATTGCC CTTCTGGAACTGGACGAACCCTTAGTGCTAAACAGCTACGTTACACCTATTTGCATTGCTGACAAGGAAT ACACGAACATCTTCCTCAAATTTGGATCTGGCTATGTAAGTGGCTGGGGAAGAGTCTTCCACAAAGGGAG ATCAGCTTTAGTTCTTCAGTACCTTAGAGTTCCACTTGTTGACCGAGCCACATGTCTTCGATCTACAAAG TTCACCATCTATAACAACATGTTCTGTGCTGGCTTCCATGAAGGAGGTAGAGATTCATGTCAAGGAGATA GTGGGGGACCCCATGTTACTGAAGTGGAAGGGACCAGTTTCTTAACTGGAATTATTAGCTGGGGTGAAGA GTGTGCAATGAAAGGCAAATATGGAATATATACCAAGGTATCCCGGTATGTCAACTGGATTAAGGAAAAA ACAAAGCTCACTTAATGAAAGATGGATTTCCAAGGTTAATTCATTGGAATTGAAAATTAACAGGGCCTCT CACTAACTAATCACTTTCCCATCTTTTGTTAGATTTGAATATATACATTCTATGATCATTGCTTTTTCTC TTTACAGGGGAGAATTTCATATTTTACCTGAGCAAATTGATTAGAAAATGGAACCACTAGAGGAATATAA TGTGTTAGGAAATTACAGTCATTTCTAAGGGCCCAGCCCTTGACAAAATTGTGAAGTTAAATTCTCCACT CTGTCCATCAGATACTATGGTTCTCCACTATGGCAACTAACTCACTCAATTTTCCCTCCTTAGCAGCATT CCATCTTCCCGATCTTCTTTGCTTCTCCAACCAAAACATCAATGTTTATTAGTTCTGTATACAGTACAGG ATCTTTGGTCTACTCTATCACAAGGCCAGTACCACACTCATGAAGAAAGAACACAGGAGTAGCTGAGAGG CTAAAACTCATCAAAAACACTACTCCTTTTCCTCTACCCTATTCCTCAATCTTTTACCTTTTCCAAATCC 
CAATCCCCAAATCAGTTTTTCTCTTTCTTACTCCCTCTCTCCCTTTTACCCTCCATGGTCGTTAAAGGAG AGATGGGGAGCATCATTCTGTTATACTTCTGTACACAGTTATACATGTCTATCAAACCCAGACTTGCTTC CGTAGTGGAGACTTGCTTTTCAGAACATAGGGATGAAGTAAGGTGCCTGAAAAGTTTGGGGGAAAAGTTT CTTTCAGAGAGTTAAGTTATTTTATATATATAATATATATATAAAATATATAATATACAATATAAATATA TAGTGTGTGTGTATGCGTGTGTGTAGACACACACGCATACACACATATAATGGAAGCAATAAGCCATTCT AAGAGCTTGTATGGTTATGGAGGTCTGACTAGGCATGATTTCACGAAGGCAAGATTGGCATATCATTGTA ACTAAAAAAGCTGACATTGACCCAGACATATTGTACTCTTTCTAAAAATAATAATAATAATGCTAACAGA AAGAAGAGAACCGTTCGTTTGCAATCTACAGCTAGTAGAGACTTTGAGGAAGAATTCAACAGTGTGTCTT CAGCAGTGTTCAGAGCCAAGCAAGAAGTTGAAGTTGCCTAGACCAGAGGACATAAGTATCATGTCTCCTT TAACTAGCATACCCCGAAGTGGAGAAGGGTGCAGCAGGCTCAAAGGCATAAGTCATTCCAATCAGCCAAC TAAGTTGTCCTTTTCTGGTTTCGTGTTCACCATGGAACATTTTGATTATAGTTAATCCTTCTATCTTGAA TCTTCTAGAGAGTTGCTGACCAACTGACGTATGTTTCCCTTTGTGAATTAATAAACTGGTGTTCTGGTTC AT

“ATG” and “TAA” triplets that bound the protein-coding region. You can now see that there are regions upstream and downstream of the coding region that do not code for protein. These are the 5′ and 3′ untranslated regions (UTRs), respectively. But these aren’t perfect rules. If you closely look at the sequence, there are many other instances of “ATG” and “TAA.” However, there are some additional constraints to consider. ATG can appear multiple times in a gene sequence, but often (though not always; this is biology!) the first ATG is used to start the coding region, as in this sequence. There are no other ATGs upstream of the one indicated in Figure 1.4. The protein-coding region is read three nucleotides at a time, starting at the ATG. These groups of three are called codons. The coding sequence can now be formatted to show all the codons (Figure 1.5). Although straightforward, this grouping step is completely dependent on the sequence being accurate. If the sequence was incorrect and a single nucleotide was missing or inserted, the grouping of three would be completely wrong from that point onward. An insertion or deletion of two nucleotides would shift the grouping in the same way. However, if the insertion or deletion were a multiple of three, an event such as this would only have obvious consequences at the point of change, as all the other codons would be correct.
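The effect of a missing nucleotide on the grouping is easy to see with a short Python sketch. It uses the first nine codons of the Factor IX coding region. The two "errors" are hypothetical deletions introduced only for illustration, and the function name is an arbitrary choice.

```python
def codons(seq):
    """Split a sequence into consecutive three-nucleotide codons."""
    usable = len(seq) - len(seq) % 3  # ignore an incomplete codon at the end
    return [seq[i:i + 3] for i in range(0, usable, 3)]


coding = "ATGCAGCGCGTGAACATGATCATGGCA"   # first nine codons of the coding region

one_deleted = coding[:4] + coding[5:]    # hypothetical error: one nucleotide lost
three_deleted = coding[:3] + coding[6:]  # hypothetical error: one whole codon lost

print(" ".join(codons(coding)))
print(" ".join(codons(one_deleted)))
print(" ".join(codons(three_deleted)))
```

After the single-nucleotide deletion every codon downstream of the error changes, and in this case the shifted grouping even contains "TGA" terminator triplets. After the three-nucleotide deletion only one codon is lost and the rest are unchanged.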
ACCACTTTCACAATCTGCTAGCAAAGGTTATGCAGCGCGTGAACATGATCATGGCAGAATCACCAGGCCT CATCACCATCTGCCTTTTAGGATATCTACTCAGTGCTGAATGTACAGTTTTTCTTGATCATGAAAACGCC AACAAAATTCTGAATCGGCCAAAGAGGTATAATTCAGGTAAATTGGAAGAGTTTGTTCAAGGGAACCTTG AGAGAGAATGTATGGAAGAAAAGTGTAGTTTTGAAGAAGCACGAGAAGTTTTTGAAAACACTGAAAGAAC AACTGAATTTTGGAAGCAGTATGTTGATGGAGATCAGTGTGAGTCCAATCCATGTTTAAATGGCGGCAGT TGCAAGGATGACATTAATTCCTATGAATGTTGGTGTCCCTTTGGATTTGAAGGAAAGAACTGTGAATTAG ATGTAACATGTAACATTAAGAATGGCAGATGCGAGCAGTTTTGTAAAAATAGTGCTGATAACAAGGTGGT TTGCTCCTGTACTGAGGGATATCGACTTGCAGAAAACCAGAAGTCCTGTGAACCAGCAGTGCCATTTCCA TGTGGAAGAGTTTCTGTTTCACAAACTTCTAAGCTCACCCGTGCTGAGACTGTTTTTCCTGATGTGGACT ATGTAAATTCTACTGAAGCTGAAACCATTTTGGATAACATCACTCAAAGCACCCAATCATTTAATGACTT CACTCGGGTTGTTGGTGGAGAAGATGCCAAACCAGGTCAATTCCCTTGGCAGGTTGTTTTGAATGGTAAA GTTGATGCATTCTGTGGAGGCTCTATCGTTAATGAAAAATGGATTGTAACTGCTGCCCACTGTGTTGAAA CTGGTGTTAAAATTACAGTTGTCGCAGGTGAACATAATATTGAGGAGACAGAACATACAGAGCAAAAGCG AAATGTGATTCGAATTATTCCTCACCACAACTACAATGCAGCTATTAATAAGTACAACCATGACATTGCC CTTCTGGAACTGGACGAACCCTTAGTGCTAAACAGCTACGTTACACCTATTTGCATTGCTGACAAGGAAT ACACGAACATCTTCCTCAAATTTGGATCTGGCTATGTAAGTGGCTGGGGAAGAGTCTTCCACAAAGGGAG ATCAGCTTTAGTTCTTCAGTACCTTAGAGTTCCACTTGTTGACCGAGCCACATGTCTTCGATCTACAAAG TTCACCATCTATAACAACATGTTCTGTGCTGGCTTCCATGAAGGAGGTAGAGATTCATGTCAAGGAGATA GTGGGGGACCCCATGTTACTGAAGTGGAAGGGACCAGTTTCTTAACTGGAATTATTAGCTGGGGTGAAGA GTGTGCAATGAAAGGCAAATATGGAATATATACCAAGGTATCCCGGTATGTCAACTGGATTAAGGAAAAA ACAAAGCTCACTTAATGAAAGATGGATTTCCAAGGTTAATTCATTGGAATTGAAAATTAACAGGGCCTCT CACTAACTAATCACTTTCCCATCTTTTGTTAGATTTGAATATATACATTCTATGATCATTGCTTTTTCTC TTTACAGGGGAGAATTTCATATTTTACCTGAGCAAATTGATTAGAAAATGGAACCACTAGAGGAATATAA TGTGTTAGGAAATTACAGTCATTTCTAAGGGCCCAGCCCTTGACAAAATTGTGAAGTTAAATTCTCCACT CTGTCCATCAGATACTATGGTTCTCCACTATGGCAACTAACTCACTCAATTTTCCCTCCTTAGCAGCATT CCATCTTCCCGATCTTCTTTGCTTCTCCAACCAAAACATCAATGTTTATTAGTTCTGTATACAGTACAGG ATCTTTGGTCTACTCTATCACAAGGCCAGTACCACACTCATGAAGAAAGAACACAGGAGTAGCTGAGAGG CTAAAACTCATCAAAAACACTACTCCTTTTCCTCTACCCTATTCCTCAATCTTTTACCTTTTCCAAATCC 
CAATCCCCAAATCAGTTTTTCTCTTTCTTACTCCCTCTCTCCCTTTTACCCTCCATGGTCGTTAAAGGAG AGATGGGGAGCATCATTCTGTTATACTTCTGTACACAGTTATACATGTCTATCAAACCCAGACTTGCTTC CGTAGTGGAGACTTGCTTTTCAGAACATAGGGATGAAGTAAGGTGCCTGAAAAGTTTGGGGGAAAAGTTT CTTTCAGAGAGTTAAGTTATTTTATATATATAATATATATATAAAATATATAATATACAATATAAATATA TAGTGTGTGTGTATGCGTGTGTGTAGACACACACGCATACACACATATAATGGAAGCAATAAGCCATTCT AAGAGCTTGTATGGTTATGGAGGTCTGACTAGGCATGATTTCACGAAGGCAAGATTGGCATATCATTGTA ACTAAAAAAGCTGACATTGACCCAGACATATTGTACTCTTTCTAAAAATAATAATAATAATGCTAACAGA AAGAAGAGAACCGTTCGTTTGCAATCTACAGCTAGTAGAGACTTTGAGGAAGAATTCAACAGTGTGTCTT CAGCAGTGTTCAGAGCCAAGCAAGAAGTTGAAGTTGCCTAGACCAGAGGACATAAGTATCATGTCTCCTT TAACTAGCATACCCCGAAGTGGAGAAGGGTGCAGCAGGCTCAAAGGCATAAGTCATTCCAATCAGCCAAC TAAGTTGTCCTTTTCTGGTTTCGTGTTCACCATGGAACATTTTGATTATAGTTAATCCTTCTATCTTGAA TCTTCTAGAGAGTTGCTGACCAACTGACGTATGTTTCCCTTTGTGAATTAATAAACTGGTGTTCTGGTTC AT

Figure 1.4 Applying two rules for describing the human Factor IX mRNA sequence. Those two rules are (a) coding regions begin with “ATG” and (b) coding regions end with one of three terminator sequences, “TAA,” “TGA,” or “TAG.” Two of the many possible matches to these rules are in bold and underlined.
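The two rules, plus the constraint that the coding region is read in steps of three from the first ATG, can be written as a small Python function. This is a sketch of these simplified rules only and real gene-finding software is far more elaborate. The toy transcript below is an assumption made for illustration: it joins the real 5′ untranslated region, the first five codons, the terminator, and the first ten nucleotides of the 3′ untranslated region of Factor IX, skipping the rest of the coding region.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_coding_region(seq):
    """Return (start, end) of the region from the first ATG to the first
    in-frame terminator, or None if either is missing."""
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):  # step through codons in frame
        if seq[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


# Toy transcript (hypothetical, shortened from the Factor IX mRNA)
toy = "ACCACTTTCACAATCTGCTAGCAAAGGTT" + "ATGCAGCGCGTGAAC" + "TAA" + "TGAAAGATGG"

region = find_coding_region(toy)
print(region)
print(toy[region[0]:region[1]])
```

Note that the toy sequence contains a "TAG" upstream of the ATG and a "TGA" that straddles two codons within the coding region. Neither ends the coding region, because neither is read as a codon in the frame set by the ATG.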

ACCACTTTCACAATCTGCTAGCAAAGGTT ATG CAG CGC GTG AAC ATG ATC ATG GCA GAA TCA CCA GGC CTC ATC ACC ATC TGC CTT TTA GGA TAT CTA CTC AGT GCT GAA TGT ACA GTT TTT CTT GAT CAT GAA AAC GCC AAC AAA ATT CTG AAT CGG CCA AAG AGG TAT AAT TCA GGT AAA TTG GAA GAG TTT GTT CAA GGG AAC CTT GAG AGA GAA TGT ATG GAA GAA AAG TGT AGT TTT GAA GAA GCA CGA GAA GTT TTT GAA AAC ACT GAA AGA ACA ACT GAA TTT TGG AAG CAG TAT GTT GAT GGA GAT CAG TGT GAG TCC AAT CCA TGT TTA AAT GGC GGC AGT TGC AAG GAT GAC ATT AAT TCC TAT GAA TGT TGG TGT CCC TTT GGA TTT GAA GGA AAG AAC TGT GAA TTA GAT GTA ACA TGT AAC ATT AAG AAT GGC AGA TGC GAG CAG TTT TGT AAA AAT AGT GCT GAT AAC AAG GTG GTT TGC TCC TGT ACT GAG GGA TAT CGA CTT GCA GAA AAC CAG AAG TCC TGT GAA CCA GCA GTG CCA TTT CCA TGT GGA AGA GTT TCT GTT TCA CAA ACT TCT AAG CTC ACC CGT GCT GAG ACT GTT TTT CCT GAT GTG GAC TAT GTA AAT TCT ACT GAA GCT GAA ACC ATT TTG GAT AAC ATC ACT CAA AGC ACC CAA TCA TTT AAT GAC TTC ACT CGG GTT GTT GGT GGA GAA GAT GCC AAA CCA GGT CAA TTC CCT TGG CAG GTT GTT TTG AAT GGT AAA GTT GAT GCA TTC TGT GGA GGC TCT ATC GTT AAT GAA AAA TGG ATT GTA ACT GCT GCC CAC TGT GTT GAA ACT GGT GTT AAA ATT ACA GTT GTC GCA GGT GAA CAT AAT ATT GAG GAG ACA GAA CAT ACA GAG CAA AAG CGA AAT GTG ATT CGA ATT ATT CCT CAC CAC AAC TAC AAT GCA GCT ATT AAT AAG TAC AAC CAT GAC ATT GCC CTT CTG GAA CTG GAC GAA CCC TTA GTG CTA AAC AGC TAC GTT ACA CCT ATT TGC ATT GCT GAC AAG GAA TAC ACG AAC ATC TTC CTC AAA TTT GGA TCT GGC TAT GTA AGT GGC TGG GGA AGA GTC TTC CAC AAA GGG AGA TCA GCT TTA GTT CTT CAG TAC CTT AGA GTT CCA CTT GTT GAC CGA GCC ACA TGT CTT CGA TCT ACA AAG TTC ACC ATC TAT AAC AAC ATG TTC TGT GCT GGC TTC CAT GAA GGA GGT AGA GAT TCA TGT CAA GGA GAT AGT GGG GGA CCC CAT GTT ACT GAA GTG GAA GGG ACC AGT TTC TTA ACT GGA ATT ATT AGC TGG GGT GAA GAG TGT GCA ATG AAA GGC AAA TAT GGA ATA TAT ACC AAG GTA TCC CGG TAT GTC AAC TGG ATT AAG GAA AAA ACA AAG CTC ACT TAA TGAAAGATGGATTTCCAAGGTTAATTCATTGGAATTGAAAATTAACAGGGCCTCTCACTAACTAATCACTTTCCCATCTTTTGTTAGATTTGAATATATACA 
TTCTATGATCATTGCTTTTTCTCTTTACAGGGGAGAATTTCATATTTTACCTGAGCAAATTGATTAGAAAATGGAACCACTAGAGGAATATAATGTGTTAGG AAATTACAGTCATTTCTAAGGGCCCAGCCCTTGACAAAATTGTGAAGTTAAATTCTCCACTCTGTCCATCAGATACTATGGTTCTCCACTATGGCAACTAAC TCACTCAATTTTCCCTCCTTAGCAGCATTCCATCTTCCCGATCTTCTTTGCTTCTCCAACCAAAACATCAATGTTTATTAGTTCTGTATACAGTACAGGATC TTTGGTCTACTCTATCACAAGGCCAGTACCACACTCATGAAGAAAGAACACAGGAGTAGCTGAGAGGCTAAAACTCATCAAAAACACTACTCCTTTTCCTCT ACCCTATTCCTCAATCTTTTACCTTTTCCAAATCCCAATCCCCAAATCAGTTTTTCTCTTTCTTACTCCCTCTCTCCCTTTTACCCTCCATGGTCGTTAAAG GAGAGATGGGGAGCATCATTCTGTTATACTTCTGTACACAGTTATACATGTCTATCAAACCCAGACTTGCTTCCGTAGTGGAGACTTGCTTTTCAGAACATA GGGATGAAGTAAGGTGCCTGAAAAGTTTGGGGGAAAAGTTTCTTTCAGAGAGTTAAGTTATTTTATATATATAATATATATATAAAATATATAATATACAAT ATAAATATATAGTGTGTGTGTATGCGTGTGTGTAGACACACACGCATACACACATATAATGGAAGCAATAAGCCATTCTAAGAGCTTGTATGGTTATGGAGG TCTGACTAGGCATGATTTCACGAAGGCAAGATTGGCATATCATTGTAACTAAAAAAGCTGACATTGACCCAGACATATTGTACTCTTTCTAAAAATAATAAT AATAATGCTAACAGAAAGAAGAGAACCGTTCGTTTGCAATCTACAGCTAGTAGAGACTTTGAGGAAGAATTCAACAGTGTGTCTTCAGCAGTGTTCAGAGCC AAGCAAGAAGTTGAAGTTGCCTAGACCAGAGGACATAAGTATCATGTCTCCTTTAACTAGCATACCCCGAAGTGGAGAAGGGTGCAGCAGGCTCAAAGGCAT AAGTCATTCCAATCAGCCAACTAAGTTGTCCTTTTCTGGTTTCGTGTTCACCATGGAACATTTTGATTATAGTTAATCCTTCTATCTTGAATCTTCTAGAGA GTTGCTGACCAACTGACGTATGTTTCCCTTTGTGAATTAATAAACTGGTGTTCTGGTTCAT

Figure 1.5 Coding regions are read as triplets. The human Factor IX mRNA with the start and termination codons bold and underlined. The coding region has been divided into the three-base codons. The 5′ and 3′ untranslated regions (5′ and 3′ UTRs, respectively) appear before and after the coding region.

Scanning by eye, you can see that there are no other in-frame terminator codons (TAA, TAG, or TGA) within this coding region. But there are other ATG triplets, including two just downstream of the first one. All of these codons are translated into a polypeptide chain according to the genetic code. If you wanted to, you could count the 462 codons to deduce the length of the protein as 461 amino acids (the terminator codon does not encode an amino acid). There are many software programs to do this for you, but you should also get into the habit of examining a sequence by eye. After all, software only finds things it is designed to look for but you may notice something that is not yet described. There are programs that take DNA sequence as input and translate this into an amino acid sequence using the genetic code (to be covered later, in Chapter 4). Figure 1.6 shows this translation below each codon using the one-letter code for the amino acids. This figure is a little complex since it includes both the nucleotides, some grouped as three-letter codons, and one-letter amino acids. It has its value, though; for example, you can see that there can be multiple codons for each amino acid. Methionine (M) is always ATG, and tryptophan (W) is always TGG, but valine (V) can be GTG, GTT, GTA, or GTC. Figure 1.7, which shows just the amino acid sequence, is a much simpler figure to study. If you knew the biochemical properties of the amino acids, you might recognize regions that are hydrophilic or hydrophobic. From the sequence, you might notice regions of amino acids that tend to fold into helical structures or sheets. Or you might recognize certain groups of amino acids that often have attached sugar groups. These structural features may tell a story about

ACCACTTTCACAATCTGCTAGCAAAGGTT

ATG CAG CGC GTG AAC ATG ATC ATG GCA GAA TCA CCA GGC CTC ATC ACC ATC TGC CTT TTA GGA TAT CTA CTC AGT
 M   Q   R   V   N   M   I   M   A   E   S   P   G   L   I   T   I   C   L   L   G   Y   L   L   S
GCT GAA TGT ACA GTT TTT CTT GAT CAT GAA AAC GCC AAC AAA ATT CTG AAT CGG CCA AAG AGG TAT AAT TCA GGT
 A   E   C   T   V   F   L   D   H   E   N   A   N   K   I   L   N   R   P   K   R   Y   N   S   G
AAA TTG GAA GAG TTT GTT CAA GGG AAC CTT GAG AGA GAA TGT ATG GAA GAA AAG TGT AGT TTT GAA GAA GCA CGA
 K   L   E   E   F   V   Q   G   N   L   E   R   E   C   M   E   E   K   C   S   F   E   E   A   R
GAA GTT TTT GAA AAC ACT GAA AGA ACA ACT GAA TTT TGG AAG CAG TAT GTT GAT GGA GAT CAG TGT GAG TCC AAT
 E   V   F   E   N   T   E   R   T   T   E   F   W   K   Q   Y   V   D   G   D   Q   C   E   S   N
CCA TGT TTA AAT GGC GGC AGT TGC AAG GAT GAC ATT AAT TCC TAT GAA TGT TGG TGT CCC TTT GGA TTT GAA GGA
 P   C   L   N   G   G   S   C   K   D   D   I   N   S   Y   E   C   W   C   P   F   G   F   E   G
AAG AAC TGT GAA TTA GAT GTA ACA TGT AAC ATT AAG AAT GGC AGA TGC GAG CAG TTT TGT AAA AAT AGT GCT GAT
 K   N   C   E   L   D   V   T   C   N   I   K   N   G   R   C   E   Q   F   C   K   N   S   A   D
AAC AAG GTG GTT TGC TCC TGT ACT GAG GGA TAT CGA CTT GCA GAA AAC CAG AAG TCC TGT GAA CCA GCA GTG CCA
 N   K   V   V   C   S   C   T   E   G   Y   R   L   A   E   N   Q   K   S   C   E   P   A   V   P
TTT CCA TGT GGA AGA GTT TCT GTT TCA CAA ACT TCT AAG CTC ACC CGT GCT GAG ACT GTT TTT CCT GAT GTG GAC
 F   P   C   G   R   V   S   V   S   Q   T   S   K   L   T   R   A   E   T   V   F   P   D   V   D
TAT GTA AAT TCT ACT GAA GCT GAA ACC ATT TTG GAT AAC ATC ACT CAA AGC ACC CAA TCA TTT AAT GAC TTC ACT
 Y   V   N   S   T   E   A   E   T   I   L   D   N   I   T   Q   S   T   Q   S   F   N   D   F   T
CGG GTT GTT GGT GGA GAA GAT GCC AAA CCA GGT CAA TTC CCT TGG CAG GTT GTT TTG AAT GGT AAA GTT GAT GCA
 R   V   V   G   G   E   D   A   K   P   G   Q   F   P   W   Q   V   V   L   N   G   K   V   D   A
TTC TGT GGA GGC TCT ATC GTT AAT GAA AAA TGG ATT GTA ACT GCT GCC CAC TGT GTT GAA ACT GGT GTT AAA ATT
 F   C   G   G   S   I   V   N   E   K   W   I   V   T   A   A   H   C   V   E   T   G   V   K   I
ACA GTT GTC GCA GGT GAA CAT AAT ATT GAG GAG ACA GAA CAT ACA GAG CAA AAG CGA AAT GTG ATT CGA ATT ATT
 T   V   V   A   G   E   H   N   I   E   E   T   E   H   T   E   Q   K   R   N   V   I   R   I   I
CCT CAC CAC AAC TAC AAT GCA GCT ATT AAT AAG TAC AAC CAT GAC ATT GCC CTT CTG GAA CTG GAC GAA CCC TTA
 P   H   H   N   Y   N   A   A   I   N   K   Y   N   H   D   I   A   L   L   E   L   D   E   P   L
GTG CTA AAC AGC TAC GTT ACA CCT ATT TGC ATT GCT GAC AAG GAA TAC ACG AAC ATC TTC CTC AAA TTT GGA TCT
 V   L   N   S   Y   V   T   P   I   C   I   A   D   K   E   Y   T   N   I   F   L   K   F   G   S
GGC TAT GTA AGT GGC TGG GGA AGA GTC TTC CAC AAA GGG AGA TCA GCT TTA GTT CTT CAG TAC CTT AGA GTT CCA
 G   Y   V   S   G   W   G   R   V   F   H   K   G   R   S   A   L   V   L   Q   Y   L   R   V   P
CTT GTT GAC CGA GCC ACA TGT CTT CGA TCT ACA AAG TTC ACC ATC TAT AAC AAC ATG TTC TGT GCT GGC TTC CAT
 L   V   D   R   A   T   C   L   R   S   T   K   F   T   I   Y   N   N   M   F   C   A   G   F   H
GAA GGA GGT AGA GAT TCA TGT CAA GGA GAT AGT GGG GGA CCC CAT GTT ACT GAA GTG GAA GGG ACC AGT TTC TTA
 E   G   G   R   D   S   C   Q   G   D   S   G   G   P   H   V   T   E   V   E   G   T   S   F   L
ACT GGA ATT ATT AGC TGG GGT GAA GAG TGT GCA ATG AAA GGC AAA TAT GGA ATA TAT ACC AAG GTA TCC CGG TAT
 T   G   I   I   S   W   G   E   E   C   A   M   K   G   K   Y   G   I   Y   T   K   V   S   R   Y
GTC AAC TGG ATT AAG GAA AAA ACA AAG CTC ACT TAA
 V   N   W   I   K   E   K   T   K   L   T  Stop

TGAAAGATGGATTTCCAAGGTTAATTCATTGGAATTGAAAATTAACAGGGCCTCTCACTAACTAATCACTTTCCCATCTTTTGTTAGATTTGAATATATACA
TTCTATGATCATTGCTTTTTCTCTTTACAGGGGAGAATTTCATATTTTACCTGAGCAAATTGATTAGAAAATGGAACCACTAGAGGAATATAATGTGTTAGG
AAATTACAGTCATTTCTAAGGGCCCAGCCCTTGACAAAATTGTGAAGTTAAATTCTCCACTCTGTCCATCAGATACTATGGTTCTCCACTATGGCAACTAAC
TCACTCAATTTTCCCTCCTTAGCAGCATTCCATCTTCCCGATCTTCTTTGCTTCTCCAACCAAAACATCAATGTTTATTAGTTCTGTATACAGTACAGGATC
TTTGGTCTACTCTATCACAAGGCCAGTACCACACTCATGAAGAAAGAACACAGGAGTAGCTGAGAGGCTAAAACTCATCAAAAACACTACTCCTTTTCCTCT
ACCCTATTCCTCAATCTTTTACCTTTTCCAAATCCCAATCCCCAAATCAGTTTTTCTCTTTCTTACTCCCTCTCTCCCTTTTACCCTCCATGGTCGTTAAAG
GAGAGATGGGGAGCATCATTCTGTTATACTTCTGTACACAGTTATACATGTCTATCAAACCCAGACTTGCTTCCGTAGTGGAGACTTGCTTTTCAGAACATA
GGGATGAAGTAAGGTGCCTGAAAAGTTTGGGGGAAAAGTTTCTTTCAGAGAGTTAAGTTATTTTATATATATAATATATATATAAAATATATAATATACAAT
ATAAATATATAGTGTGTGTGTATGCGTGTGTGTAGACACACACGCATACACACATATAATGGAAGCAATAAGCCATTCTAAGAGCTTGTATGGTTATGGAGG
TCTGACTAGGCATGATTTCACGAAGGCAAGATTGGCATATCATTGTAACTAAAAAAGCTGACATTGACCCAGACATATTGTACTCTTTCTAAAAATAATAAT
AATAATGCTAACAGAAAGAAGAGAACCGTTCGTTTGCAATCTACAGCTAGTAGAGACTTTGAGGAAGAATTCAACAGTGTGTCTTCAGCAGTGTTCAGAGCC
AAGCAAGAAGTTGAAGTTGCCTAGACCAGAGGACATAAGTATCATGTCTCCTTTAACTAGCATACCCCGAAGTGGAGAAGGGTGCAGCAGGCTCAAAGGCAT
AAGTCATTCCAATCAGCCAACTAAGTTGTCCTTTTCTGGTTTCGTGTTCACCATGGAACATTTTGATTATAGTTAATCCTTCTATCTTGAATCTTCTAGAG
AGTTGCTGACCAACTGACGTATGTTTCCCTTTGTGAATTAATAAACTGGTGTTCTGGTTCAT

Figure 1.6 Coding-region triplets are translated into amino acids. Each three-base codon can be translated into an amino acid using the genetic code. The one-letter representations for the amino acids appear in this figure (for example, “M” stands for methionine, “Q” stands for glutamine, and so on).
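A translation like the one in Figure 1.6 is a simple table lookup. The Python sketch below builds the standard genetic code from a compact 64-letter string (codons ordered with the bases T, C, A, G at each position, and "*" marking the terminators) and translates the first and last rows of the figure. The variable and function names are arbitrary choices for this illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_seq):
    """Translate codon by codon, stopping at the first terminator."""
    protein = []
    for i in range(0, len(coding_seq) - 2, 3):
        aa = CODON_TABLE[coding_seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


first_row = "ATGCAGCGCGTGAACATGATCATGGCA"            # start of the coding region
last_row = "GTCAACTGGATTAAGGAAAAAACAAAGCTCACTTAATGA"  # end, through the terminator

print(translate(first_row))
print(translate(last_row))
```

The table also confirms the observation about redundancy: looking up GTG, GTT, GTA, and GTC returns "V" each time, while only ATG returns "M".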

Figure 1.7 The protein sequence of human Factor IX. Here is the translation of the coding region appearing in Figure 1.6. This protein is 461 amino acids long.

MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNL ERECMEEKCSFEEAREVFENTERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCP FGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEGYRLAENQKSCEPAVPFPCGR VSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPW QVVLNGKVDAFCGGSIVNEKWIVTAAHCVETGVKITVVAGEHNIEETEHTEQKRNVIRII PHHNYNAAINKYNHDIALLELDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVF HKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFHEGGRDSCQGDSGGPHVTEVE GTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT

the function of this protein. Luckily, we have bioinformatics programs that will tell us about these biochemical properties, structures, and modifications. Protein analysis, along with structures and their visualization, will appear in Chapter 8. This simple example illustrates how analysis of a raw sequence (Figure 1.3) can be broken down into steps and additional information can be extracted through the application of distinct rules. Knowing the intermediate steps makes you more aware of the dependencies. The next two examples show how sequences can be compared. The Factor IX transcript, analyzed above, is a product of the Factor IX gene that is over 38,000 nucleotides long. A single nucleotide mutation, changing a G to a T at coordinate 25,531, results in hemophilia B, a severe bleeding disorder (Figure 1.8). Single nucleotide changes elsewhere in the gene are quite common but tolerated, either because they do not change the protein sequence or do not disrupt splicing or any form of regulation, or do not make changes to the protein that biochemically compromise the function of Factor IX. A comparison between the human and chimpanzee Factor IX proteins (Figure 1.9) illustrates that conservative variation in at least one amino acid position can be tolerated in primates.

Figure 1.8 Hemophilia B mutation. A simple pairwise alignment between a small portion of the normal sequence and a sequence from a hemophilia B patient is shown. A vertical bar links positions where the two sequences are identical. A single nucleotide change in the 38,059 base pair human Factor IX gene can cause hemophilia B. The sequence shown in this figure spans the location of the G-to-T mutation found at gene coordinate 25,531 in GenBank record K02402.
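The comparison drawn in Figure 1.8 can be reproduced with a short Python sketch. The two strings are the normal and mutant sequences from the figure; the function names `match_line` and `mismatches` are hypothetical, and the approach assumes the two sequences are already the same length, so no gaps need to be introduced.

```python
def match_line(seq_a, seq_b):
    """Return a string with '|' wherever two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return "".join("|" if a == b else " " for a, b in zip(seq_a, seq_b))

def mismatches(seq_a, seq_b):
    """List (1-based position, base in seq_a, base in seq_b) for each difference."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]

normal = "GATGCCAAACCAGGTCAATTCCCTTGGCAGGTACTTTATACTGATGGTGTGTCAAAACTG"
mutation = "GATGCCAAACCAGGTCAATTCCCTTGGCAGTTACTTTATACTGATGGTGTGTCAAAACTG"

print(normal)
print(match_line(normal, mutation))
print(mutation)
print(mismatches(normal, mutation))  # [(31, 'G', 'T')]
```

The position reported is counted within this 60-nucleotide excerpt, not within the whole gene; the gene coordinate of the mutation is the one given in the figure legend.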

The Factor IX protein sequence can be divided into domains with different biochemical properties and functions. These domains interact with each other and with other proteins to function properly. They attain a three-dimensional structure via folding, although they are depicted in a linear form in Figure 1.10. In this figure, the domains of interest are gray boxes while other amino acids, not described here, are in white boxes.

Domain one, at the N-terminus of the protein, functions to direct the protein to the endoplasmic reticulum of liver cells, from where it is secreted into the blood. When the protein is secreted, this first domain is cleaved off by a protein called signal peptidase. Twelve glutamic acid residues in domain two (also called the “Gla” domain) are modified by the enzyme gamma-carboxylase to become gamma-carboxyglutamic acid residues. Domain three is an “epidermal growth factor (EGF)-like” domain that binds calcium. Skipping ahead, domain five is a peptidase domain that cleaves another protein in the clotting cascade, Factor X. This protease function only becomes active once the Factor IX protein is cleaved into two peptides, and this cut occurs in domain four. That is why domain four is called the “activation peptide.” This cleavage results in two polypeptides: the Factor IX light chain, consisting of domains two and three, and the Factor IX heavy chain, consisting of domain five. The heavy and light chains remain covalently linked to each other by a disulfide bond between two cysteines. To transform from precursor to functional protein, Factor IX interacts with at least four other molecules, as described above.

Finally, here are two more views of the human Factor IX gene showing the context within genomic DNA. In Figure 1.11A, the entire 38,000-nucleotide gene is

Normal    GATGCCAAACCAGGTCAATTCCCTTGGCAGGTACTTTATACTGATGGTGTGTCAAAACTG
          |||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||
Mutation  GATGCCAAACCAGGTCAATTCCCTTGGCAGTTACTTTATACTGATGGTGTGTCAAAACTG

Query  1    MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNL  60
            MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNL
Sbjct  1    MQRVNMIMAESPGLITICLLGYLLSAECTVFLDHENANKILNRPKRYNSGKLEEFVQGNL  60

Query  61   ERECMEEKCSFEEAREVFENTERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCP  120
            ERECMEEKCSFEEAREVFENTERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCP
Sbjct  61   ERECMEEKCSFEEAREVFENTERTTEFWKQYVDGDQCESNPCLNGGSCKDDINSYECWCP  120

Query  121  FGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEGYRLAENQKSCEPAVPFPCGR  180
            FGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEGYRLAENQKSCEPAVPFPCGR
Sbjct  121  FGFEGKNCELDVTCNIKNGRCEQFCKNSADNKVVCSCTEGYRLAENQKSCEPAVPFPCGR  180

Query  181  VSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPW  240
            VSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPW
Sbjct  181  VSVSQTSKLTRAETVFPDVDYVNSTEAETILDNITQSTQSFNDFTRVVGGEDAKPGQFPW  240

Query  241  QVVLNGKVDAFCGGSIVNEKWIVTAAHCVETGVKITVVAGEHNIEETEHTEQKRNVIRII  300
            QVVLNGKVDAFCGGSIVNEKWIVTAAHCV+TGVKITVVAGEHNIEETEHTEQKRNVIRII
Sbjct  241  QVVLNGKVDAFCGGSIVNEKWIVTAAHCVDTGVKITVVAGEHNIEETEHTEQKRNVIRII  300

Query  301  PHHNYNAAINKYNHDIALLELDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVF  360
            PHHNYNAAINKYNHDIALLELDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVF
Sbjct  301  PHHNYNAAINKYNHDIALLELDEPLVLNSYVTPICIADKEYTNIFLKFGSGYVSGWGRVF  360

Query  361  HKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFHEGGRDSCQGDSGGPHVTEVE  420
            HKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFHEGGRDSCQGDSGGPHVTEVE
Sbjct  361  HKGRSALVLQYLRVPLVDRATCLRSTKFTIYNNMFCAGFHEGGRDSCQGDSGGPHVTEVE  420

Query  421  GTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT  461
            GTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT
Sbjct  421  GTSFLTGIISWGEECAMKGKYGIYTKVSRYVNWIKEKTKLT  461


Figure 1.9 Alignment of human (Query) and chimpanzee (Subject (Sbjct)) Factor IX proteins. Instead of a vertical bar to signify identity, as seen in Figure 1.8, this alignment places identical amino acids between the two sequences. There is only one amino acid difference over the 461 amino acid length (see if you can spot it). This change is biochemically conservative and so the space between the E (glutamic acid) and the D (aspartic acid) is indicated by “+” rather than a blank space indicating “no identity.” The chimpanzee protein sequence is from the NCBI RefSeq sequence file NP_001129063.1 and the human sequence is from file NP_000124.1.
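The middle line of an alignment like the one in Figure 1.9 can be generated with a small Python sketch. The set of “conservative” substitutions used here is a deliberately tiny, hypothetical stand-in; alignment programs generally decide this from a substitution scoring matrix rather than a hand-written list. The two ten-residue strings are an excerpt surrounding the single human/chimpanzee difference.

```python
# Hypothetical, deliberately small set of conservative substitutions.
# Real alignment programs derive this from a scoring matrix instead.
CONSERVATIVE_PAIRS = {frozenset("DE"), frozenset("KR"), frozenset("IL")}

def midline(query, subject):
    """Build the middle line: the letter for identity, '+' for a
    conservative change, and a blank for any other difference."""
    chars = []
    for q, s in zip(query, subject):
        if q == s:
            chars.append(q)
        elif frozenset((q, s)) in CONSERVATIVE_PAIRS:
            chars.append("+")
        else:
            chars.append(" ")
    return "".join(chars)

human = "AAHCVETGVK"  # excerpt of the Query sequence
chimp = "AAHCVDTGVK"  # excerpt of the Sbjct sequence

print(human)
print(midline(human, chimp))  # AAHCV+TGVK
print(chimp)
```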


shown as the black arrow (labeled F9) near the center of the figure. Factor IX has several genetic neighbors, including steroid-5-alpha-reductase, alpha polypeptide 1 pseudogene 1 (SRD5A1P1). A pseudogene is a sequence derived from a known functional gene but which is structurally defective in some way. There are over 15,000 pseudogenes in the human genome. These may represent opportunities for new functional genes to arise, or they may be defective genes destined to accumulate mutations until they are no longer recognizable as similar to any known gene. Another genomic neighbor is MCF2, which encodes a protein capable of transforming normal tissue culture cells into a cancerous state. Note that the MCF2 gene is over 126,000 nucleotides long and has the opposite orientation to the Factor IX gene, as indicated by the direction of the arrows. Genes can be found on both DNA strands and their sizes vary tremendously.

Figure 1.11B shows the location of the Factor IX gene on the X chromosome. Each chromosome is one long piece of genomic DNA with features that allow its identification and orientation. The X chromosome is 155 million nucleotides long, with a constriction near the center called the centromere. Throughout the chromosome are regions that, upon staining, show dark and light bands. Factor IX is one of 1500 genes and pseudogenes on this chromosome.

[Figure 1.10 diagram, from N-terminus (N) to C-terminus (C): (1) signal peptide, (2) Gla domain, (3) EGF-like domain, (4) activation peptide, (5) peptidase domain.]

Figure 1.10 Factor IX protein domains. The Factor IX protein contains five major domains (gray boxes), each with specific functions in the molecule. They are depicted as boxes on a line representing the length of the protein, from N-terminus (N) to the C-terminus (C).


[Figure 1.11 panel labels: (A) Chromosome X, NC_000023.10, region labeled with coordinates 138444099 and 138802934; genes shown: SNURFL, SRD5A1P1, F9, BCYRN1P1, MCF2. (B) chrX (q26.3–q27.1).]

Figure 1.11 The position of the Factor IX gene on chromosome X. (A) Detail of the part of chromosome X that contains the Factor IX gene (the locus). The Factor IX gene is shown as a black arrow pointing to the right, 5′ to 3′. It is labeled using the official gene symbol, “F9.” Note that the genes shown differ in size and orientation. This screenshot is taken from the NCBI Gene database. (B) The human X chromosome with the banding pattern similar to what is seen with a light microscope. On the far right, the boxed area indicates the approximate region shown in (A). This screenshot is taken from the University of California at Santa Cruz Genome Browser.


1.11 SUMMARY

This chapter serves as an introduction to sequence analysis, which is, perhaps, the first specialty within bioinformatics and has become the cornerstone of interpreting the deluge of data that we are experiencing today. As this chapter described, sequence data come from animals, plants, and microbes and from research and medicine. In the coming chapters you will acquire the practical skills of using sequence analysis tools, giving you access to this wealth of information.


Internet resources

The NCBI has a collection of electronic textbooks available through their Website, which is a good place to search for explanations of the biology and technology mentioned in this chapter: www.ncbi.nlm.nih.gov/books.

GenBank release notes: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt



CHAPTER 2

Introduction to Internet Resources

Key concepts

• Searching the medical and scientific literature: PubMed, iHOP, OMIM
• Searching the patent literature
• Gene classifications: Ontology
• Sequence collections such as Gene and UniGene

2.1 INTRODUCTION

How do you first learn about an interesting topic in biology? Is it in a lecture? A book? From a newspaper article or a conversation? Or do you observe it yourself in the laboratory or during a walk in the woods? However you first learn about a topic, you may wish to learn more, and the early steps in this journey should include a search of the medical and scientific literature. Here you find a virtual mountain of past observations, experiments, discussions, comparisons, speculations, and explanations about our world. The benefits of consulting past observations are too numerous to list but include saving you time by directing your next steps toward understanding. Do you make more observations, or have they already taken place? Do you care to repeat them, or do you build on past accomplishments? You have entered the field of science because you are naturally curious and want explanations, so why not learn from the thousands who have walked before you?

This chapter introduces the tools and resources for obtaining information from published works and Internet databases. Here, and throughout the book, we will be exploring some of the best Websites in the world for genetic information. Our launch point is the Website for the National Center for Biotechnology Information (NCBI). This Website will be used extensively throughout this book, and this introduction should give you the tools to work beyond the topics covered.

The foundations of many observations are the genes that act alone or in concert to give rise to the phenotype, disease, behavior, or organism. In this chapter you will not be analyzing any sequences, but you will learn ways to find and better understand them. You will see a mix of locations and approaches with the ultimate goal of showing you a path to your objective.

2.2 THE NCBI WEBSITE AND ENTREZ

The NCBI Website is one of the major hubs of bioinformatics resources and innovation in the world, and provides a wealth of information. The NCBI home page (Figure 2.1) welcomes you with quick access to commonly used tools or databases (“Popular Resources”), and links to collections of resources (left sidebar). The “Site Map” and “All Resources,” also on the left sidebar, give you access to everything in one place and it is interesting to scroll through all that is offered here. It is important to remember that this Website is not just a collection of hyperlinks. The scientific community is trying to cope with truly massive amounts of biological information, and the scientists and computer experts at the NCBI have created and maintain many powerful solutions to the problems of data storage, retrieval, and display. Many other wonderful Websites will be introduced in this book, but the NCBI is certainly a major player in the world of bioinformatics.

A primary interface between you and this wealth of information is the Entrez (pronounced “on-tray”; it is French for “enter”) retrieval system, the NCBI’s own search engine. On most pages on the Website, there is a very simple text field along with a drop-down menu which launches your searches of any or all of the databases available through Entrez. This text field appears at the top of Figure 2.1. In this example, kibra, the name of a gene, has been entered. Hitting the return key or clicking the “Search” button will launch a very fast search of 38 NCBI databases. When the page refreshes, the majority of the page is a large section listing numerous databases, each with a brief description. The number of hits to each database is displayed to the left (Figure 2.2). Some databases have zero hits but this is understandable. Unless there were an organism named “kibra” you would not expect to find any hits in the Taxonomy database, for example. But on display are a number of choices for information, sequences, and other forms of data. Clicking on the question mark next to each brief database description provides a longer explanation.

To explore the hits in an individual database, just click on the number. Note that on the Entrez results page seen in Figure 2.2, the drop-down menu of individual databases is no longer available. But if you hit the back button or navigate to any of the individual results (for example, Nucleotide) you will find the familiar drop-down menu. Many of the individual databases will be described in detail later in this book. In this section of the chapter, we will focus on “PubMed: biomedical literature citations and abstracts.”
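This chapter works entirely through the Web forms, but the same kind of search can also be expressed as a Web address. The following Python sketch builds such an address for the NCBI E-utilities ESearch service, which is an assumption brought in for illustration and is not covered in the text above. The function name `esearch_url` is hypothetical, and the sketch only constructs the address; it does not contact the NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(database, term, retmax=20):
    """Build (but do not send) an ESearch request for one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

# The "kibra" query from Figure 2.1, restricted to PubMed.
print(esearch_url("pubmed", "kibra"))
```

Choosing the `db` value plays the same role as picking one database from the Entrez drop-down menu.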

Figure 2.1 Home page for the NCBI (www.ncbi.nlm.nih.gov). At the top of the page is a drop-down menu where you can choose a specific database to search. In this example, all the NCBI databases will be searched with the term “kibra.”


2.3 PubMed

PubMed is a very large literature database, covering all major journals of biology and medicine. When you use Entrez to search PubMed, a number of searches are performed. First, your queries are run against an index of Medical Subject Headings (MeSH), which is a controlled vocabulary that describes the contents of a published paper. Queries are also run against indices from several fields, including author. For example, searching with “white” will find papers on white blood cells or the Drosophila mutation called white, but also papers by authors named White. You can specify which fields are searched, such as author and title/abstract, to eliminate many unwanted hits.

Clicking on the PubMed hits in Entrez brings you to a page providing more details: title, authors, and journal (Figure 2.3). The search term “kibra” found 52 hits in this example, sorted by date. In the upper-right corner of this figure, notice that related searches are suggested, including “kibra memory,” “kibra lats,” “hippo kibra,” and “kibra expression.” Clicking on these links will automatically return hits found with these terms, which are subsets of the original 52 since these hits must have both search terms present. Filters for the listed results, such as date, are found on the left side of the page. The default display is 20 references to a page, which saves page-loading time when the list of hits is long. This display is also customizable. Clicking on the “Display Settings” hyperlink (upper left) opens a box (Figure 2.4). You can vary the information displayed, the citations per page, and the sorting order. Similar display settings are available on all of the Entrez results pages (for example, Nucleotide searches).

Back in the PubMed results (Figure 2.3), clicking on the hypertext title of the citation takes you to a page displaying the title, authors, and full abstract, if available (Figure 2.5).
The information on this page can be exported using the “Send to:”


Figure 2.2 NCBI Entrez results page. The search for the term “kibra” performed in Figure 2.1 returned hits in many databases, including PubMed, Gene, OMIM, and UniGene. These specific databases will be explored in this chapter while others such as the Nucleotide, EST, Protein, and Structure databases will be covered in later chapters.


Search all of them or only what you need?

There are benefits for searching all the databases even if you are just looking for a nucleotide sequence. Scanning the results page, you may find it informative to see that there are only a few PubMed references, or that there are many. The same applies to all the other databases. So glance at the other hit numbers and let those bits of information provide you a little more background on your topic of interest.


Figure 2.3 PubMed result for the search term “kibra.” By clicking on the PubMed hits in the Entrez results page, seen in Figure 2.2, you are brought to a page listing those hits individually. Here, two hits are visible, showing title, authors, and journal citation. Clicking on the title will bring you to the full citation along with an abstract, if available.

Figure 2.4 Display settings for PubMed. Each Entrez results subsection has a “Display Settings” menu, tailored for the data to be displayed. Shown here are the PubMed Display Settings where you can view the list of hits in multiple ways (Format), vary the number of citations listed per page (Items per page), and sort the results.

tool on the right, which opens a dialog box (Choose Destination). In the figure, this box partially obscures the link to the full article on the far right. This link is often the logo of a journal, in this case Science magazine. The authors’ names are hypertext (Figure 2.5), so if you want to search for all articles published by an author, just click on the name and the page refreshes, listing that author’s articles. In addition, the author’s name appears in the search window (Figure 2.6A). Along with the author’s name (Papassotiropoulos A), PubMed inserts a field label ([Author]) to direct the search to author names, ignoring other fields. You may do this manually, but PubMed also provides a form to construct these and other types of searches, so there is no need to memorize the field labels. This form is available through the “Advanced search” link visible in this and earlier figures near the top of the page. In Advanced Search (Figure 2.6B) a drop-down menu (under “Builder”) offers many choices, only some of which are visible in this screenshot. You pick the field from the menu (Title/Abstract in this example), enter text in the window next to


Figure 2.5 A single PubMed result. Clicking on any of the titles of the PubMed hits will take you to the full citation and an abstract, if available. The author names are hypertext; clicking on them will find other articles by that author. A “Send to:” menu, shown already open in this figure, allows you to export the citation in a variety of ways. There will also be a link to the full journal article (in this case, a link to Science magazine) if it is available online. Not all articles are free to view.

your choice, and the form automatically constructs your query in the text box at the top of the page. This form also offers a choice of “AND,” “OR,” and “NOT” to construct the logic for your approach (the default is “AND”). These choices are available next to the Builder drop-down menus. For example, let’s say that there were two authors with the same name, “Papassotiropoulos A,” and one has published articles about genes while the other has published papers about heart surgery. You may wish to construct “(Papassotiropoulos A[Author]) NOT surgery[Title/Abstract]” to generate a list of articles that are more specific.
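The way the query builder assembles a search string can be imitated in a few lines of Python. This is only a sketch of the string construction described above; the function names `field` and `combine` are hypothetical, and nothing here contacts PubMed.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}

def field(term, label):
    """Attach a field label in square brackets, e.g. 'surgery[Title/Abstract]'."""
    return f"{term}[{label}]"

def combine(left, operator, right):
    """Join two query parts with a Boolean operator, bracketing the left part."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return f"({left}) {operator} {right}"

query = combine(
    field("Papassotiropoulos A", "Author"),
    "NOT",
    field("surgery", "Title/Abstract"),
)
print(query)  # (Papassotiropoulos A[Author]) NOT surgery[Title/Abstract]
```

The printed string is the same query given in the example above, so it could be pasted directly into the PubMed search box.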

2.4 GENE NAME EVOLUTION

Gene names can often be considered a moving target. If you search PubMed for “kibra” you find 41 hits. But using the official gene symbol, WWC1, given to this gene some time after it was first described, you only get 28 hits. If you do an “OR” search, kibra OR WWC1, you get 41 hits, suggesting that kibra is the term to use. There are thousands of laboratories around the world and many laboratories make the same discoveries and publish their names for the same gene. Even the same laboratory could have multiple names for the same gene as their research

Figure 2.6 More specific searches in PubMed. (A) Clicking on the hypertext author name of a citation will create a more specific PubMed search. There is now a field description called “Author” in square brackets and this will direct PubMed to only display articles by this author. (B) Clicking on the “Advanced search” hypertext in the PubMed title bar takes you to a query builder where many specific fields of the database can be selected from a drop-down menu, and the Boolean terms “AND,” “OR,” and “NOT” can be selected from a separate menu (right). The “Limits” hyperlink, shown in this and earlier figures on PubMed, allows the placement of other constraints upon the search such as dates or language.


progresses. These names could be quite generic (for example, kidney and brain protein) or a clone name (for example, FLJ10865). Perhaps after more analysis, the name becomes more specific to the function or structure (for example, WW domain-containing protein 1). Finally, a committee or organization gives the gene an official name (for example, WW and C2 domain containing 1) and an official symbol (for example, WWC1). All of these names apply to kibra and are commonly referred to as gene “synonyms.” In another example, a gene first described in 1988 was called MSF. By 1994 it was referred to as CACP. By 2000 it was referred to as either lubricin or PRG4. If you search PubMed for “lubricin” you get 104 hits; “PRG4” finds 83 publications. But if you do an “OR” search for these two terms, you get 165 hits.

From the above, you might conclude that performing thorough literature searches is an art form, best performed by professionals or with professional approaches. One approach available to you is to use multiple gene synonyms as described above. Another interesting and powerful way to search the published information on a gene can be found in the “information Hyperlinked Over Proteins” Website, or iHOP (www.ihop-net.org). Importantly, iHOP uses the gene synonyms simultaneously to identify as many publications as possible and then displays your queries in the sentences of the publications. This allows you to browse the context in which your gene appears in a publication. Was it being listed along with a number of other genes, or was something specific being described about your query?

Entering the word “kibra” into the iHOP query window returns the results seen in Figure 2.7A. Notice that both human and mouse kibra genes are listed. In the “Results” column, there are four icons. Clicking on the left icon returns the information displayed in Figure 2.7B. Shown here is the human kibra record, official symbol WWC1.
Note the list of synonyms at the top of the page; perhaps now you can guess that the name kibra might be derived from another name, “Kidney

Figure 2.7 iHOP (information Hyperlinked Over Proteins) www.ihop-net.org. (A) Search results obtained by entering “kibra.” (B) Clicking on the first rectangular icon in the “Results” column displays the details. At the top is a list of protein synonyms and below that are links to various databases. Below these hyperlinks are the individual sentences taken from PubMed citations where the search term (kibra) is bold and uppercase. Hypertext links to medical subject headings (MeSH terms, for example, transactivation) and other genes (for example, HAX-1) decorate these sentences.


and brain protein.” There are links to common databases (for example, UniProt), some of which will be covered later in this book. Below these links are actual sentences taken from publications. MeSH terms and other gene names in these sentences are hyperlinked, allowing rapid navigation to topics and genes related to your query. To the right of the sentences are square icons for links to the entire abstracts of the publications, again marked with hypertext terms and names.
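The hit counts quoted in this section can be examined with a little arithmetic. The Python sketch below joins synonyms into an “OR” query and uses the inclusion-exclusion rule to estimate how many publications contain both names; the function names are hypothetical, and the numbers are simply the hit counts reported in the text, which will differ if the searches are repeated today.

```python
def or_query(synonyms):
    """Join gene synonyms into a single 'OR' search string."""
    return " OR ".join(synonyms)

def overlap(hits_a, hits_b, hits_either):
    """Inclusion-exclusion: hits with both terms = A + B - (A OR B)."""
    return hits_a + hits_b - hits_either

print(or_query(["lubricin", "PRG4"]))  # lubricin OR PRG4
print(overlap(104, 83, 165))           # 22
print(overlap(41, 28, 41))             # 28
```

Under this reading, only 22 of the 165 lubricin/PRG4 publications use both names, so searching with either name alone misses many papers. For kibra and WWC1 the overlap equals the WWC1 count, which is why the “OR” search adds nothing to the search for kibra alone.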

2.5 OMIM

Another database that can be searched by the NCBI Entrez query interface is OMIM (Online Mendelian Inheritance in Man). OMIM contains records on human genes and the many human disorders that are thought to be genetic in nature. It is manually authored and updated by staff at the Johns Hopkins University School of Medicine.

From the NCBI home page, select OMIM from the Entrez drop-down menu and enter the query term “memory.” Some of the results are displayed in Figure 2.8. Note that the “Limits” box (upper left) is checked and only the Title and Clinical Synopsis fields in the database are searched, two of the choices from this filter, which narrows the search from over 400 hits to 30. Limits remain in force for subsequent searches until this box is unchecked, so be careful not to apply Limits by mistake. Of the hits shown in Figure 2.8, one gene is listed (kibra, the first hit) but the other hits are disorders. The symbols next to the OMIM record number (for example, 610533) designate a gene record with an associated phenotype (+), diseases where the gene defect is known (#), and mapped disorders where the gene is still unknown (%). Not shown is the fourth symbol (*) for genes not associated with a phenotype.

The thousands of records in OMIM contain basic information such as the chromosomal location of the genes and disorders (if known) and how the gene was isolated and identified (Figure 2.9). In disease records, information about clinical descriptions and findings, animal models, cellular pathways, and many other topics may be available, along with references on these topics. Records on common and major disorders such as diabetes can be dozens of pages long and easily have over 100 references. OMIM records are remarkable collections of biomedical data and should be explored whenever genetic information is needed.
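The four symbols described above amount to a small lookup table, which can be sketched in Python as follows. The function name `describe` and the entry “%123456” are hypothetical and used only for illustration; the meanings are the ones given in the text.

```python
# Meanings of the symbols that can precede an OMIM record number.
OMIM_SYMBOLS = {
    "+": "gene record with an associated phenotype",
    "#": "disease where the gene defect is known",
    "%": "mapped disorder where the gene is still unknown",
    "*": "gene not associated with a phenotype",
}

def describe(entry):
    """Split an entry such as '%123456' into (record number, meaning)."""
    symbol = entry[0]
    if symbol in OMIM_SYMBOLS:
        return entry[1:], OMIM_SYMBOLS[symbol]
    return entry, "no symbol"

# '%123456' is a made-up entry, not a real OMIM record.
print(describe("%123456"))
```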


OMIM record 100820

There is a disorder called “Achoo syndrome,” characterized by sneezing in response to bright light, and it appears in the OMIM database. Here is a passage directly from the OMIM record: “Duncan (1995) pointed out public awareness of the ACHOO syndrome is much more widespread than one might guess, to the point that it has entered into the popular wisdom conveyed to preschoolers. In a best-selling children’s book by Berenstain and Berenstain (1981), Papa and Mama Bear are taking Sister Bear and Brother Bear to their pediatrician, Dr. Grizzly, for a check-up. The cubs are expressing their apprehension about the possibility of injections when Papa Bear suddenly cuts loose with an explosive sneeze. ‘Bless you!’ said Mama. ‘It’s just this bright sunlight,’ sniffed Papa. ‘I never get sick.’”

Figure 2.8 OMIM results using the search term “memory.” By clicking on the “Limits” tab (visible near the top of the figure, on the left) the search can be restricted to specific fields in the records. In this case, the search was limited to “Title” and “Clinical Synopsis.” This search resulted in 30 hits, some of which are shown in this figure.


Figure 2.9 OMIM detail of the first search result from Figure 2.8. Although the search was limited to “Title” and “Clinical Synopsis,” the term “memory” can find this record because the “Other entities represented in this entry” field is also searched, and this subtitle appears in Figure 2.8. This is an OMIM record for a gene (kibra). For disease records, there are often many fields with extensive description of clinical findings and associated genetics.

2.6 RETRIEVING NUCLEOTIDE SEQUENCES

Figure 2.10 Searching for nucleotide sequences at the NCBI Website. The Entrez drop-down menu has both nucleotide and protein sequences as choices. You can make the search quite specific using the “Advanced search” form or “Limits.” You can also quickly construct queries that may work well enough for you to find the sequences you want. In this example, “kibra AND sapiens” demonstrates Boolean logic combining two terms to search all the fields. Using “sapiens” alone can lead to many thousands of hits that are from bacterial files because the search goes beyond the organism field.

The Entrez drop-down menu used earlier in this chapter (see Figure 2.1) lists all the databases separately, including the extensive NCBI Nucleotide collection. Changing the menu from “All Databases” to “Nucleotide” will limit your search to only that database, speeding up your search (by just a bit; the NCBI search engine is very fast), saving the NCBI computing network some work, and returning a page of nucleotide hits. In the query submitted in Figure 2.10, two terms were used: “kibra AND sapiens” (as in Homo sapiens). The Boolean term “AND” specifies that both terms must be present for a record to be listed as a hit. The query “kibra sapiens” generates the same results because listed terms are “AND’ed” together (the default). This approach should return fewer and more specific hits, but not always (as explained below).

If you just use “kibra” as a query, 234 sequences are returned, while the “kibra AND sapiens” query returns 79 hits. In the first list, many sequences from other organisms are returned. In the more specific search, only a few hits are from nonhuman species, and they are listed on the right side of Figure 2.10 (“Top Organisms”). The annotation of many records contains references to other organisms, and this usually explains why hits from other organisms appear in search results.

Clicking on the hypertext line in the hit list (see Figure 2.10) takes you to that record. For example, Figure 2.11 is a record that was listed as “WO 2005116204-A/350160: Double strand polynucleotides generating RNA interference” and you can see that this text was taken from the “DEFINITION” line. By default, the Entrez search runs your query against all fields in the nucleotide records. “Kibra” is found under the “COMMENT” field and “sapiens” is found in the “SOURCE” and “ORGANISM” fields. By using the “Advanced search” as we did before, you can limit your searches to specific fields. This record was found using the terms “kibra patent.” Under “COMMENT,” “PN” stands for the patent publication number. This number (WO 2005116204-A/350160) can be used to find the patent application associated with this sequence. Nucleotide searching will be extensively covered in Chapters 3, 5, and 6.
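The Boolean queries described above can also be assembled in a script. The following is a minimal sketch, not a description of how the NCBI Website itself builds queries: it joins terms with an explicit “AND” and formats an address for NCBI’s E-utilities “esearch” service without contacting the network. The helper names build_query and esearch_url are hypothetical, and the “[Organism]” field tag is used here as one way to keep “sapiens” from matching fields other than the organism.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms, operator="AND"):
    """Join search terms with an explicit Boolean operator (hypothetical helper)."""
    return f" {operator} ".join(terms)


def esearch_url(query, db="nuccore"):
    """Format an esearch address for the query; nothing is sent over the network.

    "nuccore" is the Entrez name of the Nucleotide database.
    """
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


# The query from Figure 2.10: both terms are searched in all fields.
all_fields = build_query(["kibra", "sapiens"])

# A more specific variant that restricts the species to the organism field.
organism_only = build_query(["kibra", "Homo sapiens[Organism]"])

print(all_fields)
print(esearch_url(organism_only))
```

Restricting the species term to one field is the scripted equivalent of using “Limits” or the “Advanced search” form on the Website.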

LOCUS       FW943634                  19 bp    DNA     linear   PAT 19-APR-2011
DEFINITION  WO 2005116204-A/350160: Double strand polynucleotides generating
            RNA interference.
ACCESSION   FW943634
VERSION     FW943634.1  GI:329972699
KEYWORDS    WO 2005116204-A/350160.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 19)
  AUTHORS   Naito,Y., Fujino,M., Oguchi,S. and Natori,Y.
  TITLE     Double strand polynucleotides generating RNA interference
  JOURNAL   Patent: WO 2005116204-A 350160 08-DEC-2005;
            RNAi Co Ltd
COMMENT     OS   Homo sapiens
            PN   WO 2005116204-A/350160
            PD   08-DEC-2005
            PF   11-MAY-2005 WO 2005IB001647
            PR   11-MAY-2004 JP 04P 232811
            PI   yuki naito,masato fujino,shinobu oguchi,yukikazu natori
            CC   siRNA target sequence for KIBRA (NM_015238.1,3046-3064).
            FH   Key             Location/Qualifiers.
FEATURES             Location/Qualifiers
     source          1..19
                     /organism="Homo sapiens"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:9606"
ORIGIN
        1 gacctgcagg cgacaagaa

Figure 2.11 A nucleotide record from NCBI’s GenBank. The field names of a GenBank file appear on the far left of each line. Although there is a lot of information in this short record, it is easy to focus on the fields that interest you. For example, the first line says it is a 19 base pair (bp) linear DNA, and the PAT tag indicates that it is a sequence that appeared in a patent application. The DEFINITION lines are often descriptive and, in this case, indicate that this short sequence is used in RNA interference.

2.7 SEARCHING PATENTS

Because they are legal documents, patent applications can contain very thorough descriptions of specific subjects. In the case of genes and proteins, patent applications can often act as excellent review articles containing many references and interesting sequences. A drawback is that the legal language within a patent is not always an easy read. Nevertheless, you should not overlook patents as a source of information and sequences.

The World Intellectual Property Organization (WIPO) has a Website, www.wipo.int, which is easy to use and can be searched using words or patent numbers. The number associated with the kibra patent (see above) is the publication number. This number requires some consideration. “WO” (“World”) indicates it is an international patent. Patents from individual countries begin with two-letter designations, for example Japan (JP) and the United States (US). The next four digits in the kibra patent represent the year of publication (2005). As you encounter patents on Websites and in databases, you will see inconsistencies in how these patent numbers are displayed. You may see gaps between WO and the number, or slashes (“/”), so should a query not work, try again with a different format.
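Because publication numbers are displayed inconsistently, it can help to reduce them to one compact form before querying. The sketch below is illustrative only: it reads the “PN” tag from the COMMENT lines of the record in Figure 2.11 and splits the number into office code, year, and a space-free query string. The function names are hypothetical, and the parsing assumes the layout described in the text (a two-letter code followed by a number whose first four digits are the year); other number styles may not fit this assumption.

```python
import re

# Part of the COMMENT block of the record in Figure 2.11, one tag per line.
COMMENT = """\
OS   Homo sapiens
PN   WO 2005116204-A/350160
PD   08-DEC-2005
PF   11-MAY-2005 WO 2005IB001647
"""


def comment_tags(comment):
    """Map each two-letter tag (OS, PN, PD, ...) to its value."""
    tags = {}
    for line in comment.splitlines():
        tag, _, value = line.partition(" ")
        tags[tag] = value.strip()
    return tags


def split_publication_number(pn):
    """Split a publication number into office code, year, and query string.

    Spaces, hyphens, and slashes are removed first, so differently
    formatted versions of the same number give the same result.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]", "", pn).upper()
    match = re.match(r"([A-Z]{2})(\d{4})(\d+)", cleaned)
    if match is None:
        raise ValueError(f"unrecognized publication number: {pn!r}")
    office, year, serial = match.groups()
    return {"office": office, "year": int(year), "query": office + year + serial}


pn = comment_tags(COMMENT)["PN"]
print(split_publication_number(pn))
```

The “query” value is the compact form without spaces; if a search with it fails, the advice in the text still applies: try another format.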

Figure 2.12 The World Intellectual Property Organization (WIPO) Website (www.wipo.int). You can search for patents using patent publication numbers like the one found in Figure 2.11. (A) The WIPO search engine (www.wipo.int/patentscope) can be found by clicking on the “Patents” link on the left sidebar. Click on the PATENTSCOPE link on that new page. (B) The results page with multiple languages available. The translation is automated and so it can be imperfect. For example, instead of saying “interference” (seen in the NCBI nucleotide record in Figure 2.11), the auto-translation says “interfere” in panel (B).


From the WIPO home page, click on the “Patents” link on the left sidebar of the Website (Figure 2.12A), then click on “PATENTSCOPE search” (www.wipo.int/patentscope) to construct and launch your query. The number found in the NCBI record (WO2005116204, no space) should be entered, but just the number 2005116204 would also find the patent record. The default section of the patent to be searched is the Front Page but other sections can be chosen by clicking on the tabs. Figure 2.12B shows part of the results page; notice that multiple languages are available at the top of the Web page. Clicking on another language does an automated translation of the text, sometimes with imperfect results. Once in your language of choice, the tabs across the page provide access to the description and claims.

Still, indexes are not always complete; using “kibra” as a WIPO Website query does not find this patent application. Persistence and multiple approaches are often needed to find what you seek.

There are commercial databases of sequences from patents and these are more up to date than those available in the public domain. When patent applications are filed, they can contain anywhere from a handful to thousands of sequences. Extraction of the sequences directly from the patent is labor-intensive and prone to error, and not something to be done routinely. But armed with a patent number, you can search the nucleotide sequences at the NCBI Website and easily retrieve the sequences associated with that patent number. However, there is a delay between the patent documents appearing in databases and the time when sequences appear in GenBank.


Another major Website to use for patents is from the US Patent and Trademark Office, www.uspto.gov. There is a lot of information on this Website and you have to find the “Search” hypertext on the home page. From there, at least two kinds of patent searches can be performed: searching patent applications and searching issued patents. There can be a considerable amount of time between application and issuance, and this will be demonstrated in Figure 2.13. Figure 2.13A shows the query interface for issued patents. Note that you can search for patents that go back to the year 1790! As seen earlier at other Websites, drop-down menus allow you to search specific fields. In this example, the term “kibra” was used to search all fields of issued patents and the result is six patents shown in Figure 2.13B. These results are hypertext so you are one click away from viewing the patent. In Figure 2.13C, the search was repeated for patent applications (patents not issued yet) and 48 hits were returned, eight of which are shown in this figure. The difference in numbers makes sense in that kibra was only named in 2003 according to an iHOP search and not enough time has passed for these patent applications to be issued. Nevertheless, if you only searched issued patents, you would have missed 48 other documents. You may have to install a Web browser plugin to view patent figures or images. Trying a different browser will sometimes work too. You can also search for patents at the European Patent Office (EPO), www.epo.org.

2.8 PUBLIC GRANTS DATABASE: NIH REPORTER

Searching patents gives you access to information about genes that can be so valuable that legal steps were taken to protect discoveries. Another source of gene information, some of which is not yet published, is public grant applications. The US National Institutes of Health (NIH) has made information on issued grants available through their NIH RePORTER Website, projectreporter.nih.gov.

Figure 2.13 Searching for information at the US Patent and Trademark Office (USPTO), www.uspto.gov. Both issued patents and patent applications can be searched. (A) Using “kibra” as a search term in the patents database, only six issued patents are found (B). (C) Using the same term to search the patent applications database finds 48 patent applications, eight of which are shown here.


Figure 2.14 Query interface of the NIH RePORTER database (www.projectreporter.nih.gov).

You cannot see the entire grant but the abstract and associated publications will inform you about work that is underway which can potentially help you understand your gene of interest or identify potential collaborators. Figure 2.14 shows the query interface, which provides many avenues for searching this database.

2.9 GENE ONTOLOGY

In addition to the multiple names possible for a single gene, there is still potential for more chaos and confusion. In the human genome alone, there are over 21,000 genes that encode proteins. What do they do? Where are they located? What chemicals or proteins do they associate with? The answers cannot be revealed in their names, although some names are more descriptive than others. The approach to making sense of these thousands of proteins is called Gene Ontology, and a major Website for this effort is found at www.geneontology.org. The introduction on their home page describes it well: “The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases.” As you can imagine, this is a massive project requiring tremendous organization, data analysis, and interpretation. We can only touch upon this effort and will do so with kibra to illustrate a simple query.

From the Gene Ontology home page, entering kibra into the simple text field and clicking on the “GO” button brings you to ten results, one of which is from Homo sapiens (Figure 2.15A). This result shows that for human kibra, there are 14 known “associations,” or ontologies, where there is some evidence linking kibra to a process, cellular component, or function. Clicking on the hypertext “14 associations” shows you this list (Figure 2.15B). For each ontology (for example, biological process) there can be multiple terms. For example, for this kibra result there are six different biological processes: cell migration; hippo signaling cascade; positive regulation of MAPK cascade; regulation of hippo signaling cascade; regulation of transcription, DNA-dependent; and transcription, DNA-dependent.

Clicking on these terms takes you to a tree of larger categories that contain these terms. For example, cell migration is a form of cell motility, which is a form of locomotion, which is a form of localization, which is a biological process. Clicking on the “gene products” takes you to a listing of individual genes that have the same classification. For example, the proteins crumbs and merlin are also involved in the regulation of the hippo signaling cascade.
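The chain of categories for cell migration can be pictured as a set of child-to-parent links. The sketch below is a simplification written for illustration: it stores only the single chain named in the text in a dictionary (the name IS_A and the function lineage are hypothetical), whereas the real Gene Ontology allows a term to have more than one parent.

```python
# Child -> parent links for the one chain of terms named in the text.
# Assumption: one parent per term; the real ontology permits several.
IS_A = {
    "cell migration": "cell motility",
    "cell motility": "locomotion",
    "locomotion": "localization",
    "localization": "biological process",
}


def lineage(term, is_a=IS_A):
    """Follow parent links from a term up to the top-level category."""
    path = [term]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


print(" -> ".join(lineage("cell migration")))
```

Walking upward in this way is what clicking through the tree on the Website does one level at a time.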


So why is kibra classified in this manner? Notice that the Reference column in Figure 2.15B contains a PubMed Identifier (PMID) for a scientific reference, and there is the name of the principal source of this classification (for example, UniProtKB). UniProtKB is a fantastic database that will be covered in Chapter 4. In this example, the Gene Ontology Website drew upon the knowledge and analysis captured at this other Website to classify kibra. The Evidence column in Figure 2.15B contains three-letter codes representing the type of evidence used to classify kibra. For example, the assignment of “cell migration” was Inferred from a Direct Assay (IDA), while the regulation of the hippo signaling cascade was Inferred from a Mutant Phenotype (IMP). These and the other codes for evidence are listed in Figure 2.16. Clicking on the evidence code on the Website will take you to their extensive help section with thorough explanations of each type of evidence along with examples. The tabular results as shown in Figure 2.15 can be filtered using these codes. For example, you can display only the results based on evidence from direct assays.

The evidence types listed in Figure 2.16 illustrate the varied forms of data that support or refute scientific conclusions. Throughout this book you will be gathering evidence through sequence analysis. You literally “play detective” and build a case classifying, naming, or otherwise describing sequences. This annotation process is sometimes straightforward but when it is not, you will need to pursue multiple avenues to gather evidence. This includes the use of literature and databases where a lot of work has already been done for you. Internet Websites have many databases on specialized topics and gene families which are not covered by this book. For example, the ENZYME database (see Box 2.1) is an extensive resource on this class of proteins. Before struggling to gather the information you need for a project, search the Internet to determine if it has already been done for you.

Figure 2.15 The Gene Ontology project (www.geneontology.org). Gene Ontology uses a controlled list of terms to describe proteins, producing a very organized classification of proteins. (A) Entering “kibra” in the simple query form returns a list of proteins including the human form of kibra. (B) Clicking on the “14 associations” link shown in (A) brings up a list of terms associated with kibra function (for example, cell migration). Also displayed are Ontology (for example, cell migration is a “biological process” while protein binding is a “molecular function”), Evidence (how this was determined), and Reference (a publication that demonstrates this).

Figure 2.16 Evidence codes for the Gene Ontology Website. By clicking on the hypertext of the evidence codes found in Figure 2.15, detailed descriptions of the terms listed here can be found.

Experimental Evidence Codes
  EXP  Inferred from Experiment
  IDA  Inferred from Direct Assay
  IEP  Inferred from Expression Pattern
  IGI  Inferred from Genetic Interaction
  IMP  Inferred from Mutant Phenotype
  IPI  Inferred from Physical Interaction

Computational Analysis Evidence Codes
  IBA  Inferred from Biological aspect of Ancestor
  IBD  Inferred from Biological aspect of Descendant
  IGC  Inferred from Genomic Context
  IKR  Inferred from Key Residues
  IRD  Inferred from Rapid Divergence
  ISA  Inferred from Sequence Alignment
  ISM  Inferred from Sequence Model
  ISO  Inferred from Sequence Orthology
  ISS  Inferred from Sequence or Structural Similarity
  RCA  Inferred from Reviewed Computational Analysis

Other listings
  TAS  Traceable Author Statement
  NAS  Non-traceable Author Statement
  IC   Inferred by Curator
  ND   No biological Data available
  IEA  Inferred from Electronic Annotation
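Filtering associations by evidence code, as described in this section, amounts to keeping only the rows whose code is in a chosen set. The sketch below is an illustration and not the Website’s own mechanism: it uses just the two kibra associations whose evidence codes are stated in the text, and the function name filter_by_evidence is hypothetical.

```python
# A few evidence codes from Figure 2.16.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "IEA": "Inferred from Electronic Annotation",
}

# The two kibra associations whose evidence codes are given in the text:
# (term, ontology, evidence code).
ASSOCIATIONS = [
    ("cell migration", "biological process", "IDA"),
    ("regulation of hippo signaling cascade", "biological process", "IMP"),
]


def filter_by_evidence(rows, codes):
    """Keep only the associations supported by one of the given codes."""
    wanted = set(codes)
    return [row for row in rows if row[2] in wanted]


# Show only results based on direct assays.
for term, ontology, code in filter_by_evidence(ASSOCIATIONS, ["IDA"]):
    print(f"{term} ({ontology}): {EVIDENCE_CODES[code]}")
```

Passing several codes at once, for example all of the experimental codes, would separate results backed by laboratory work from those assigned computationally.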

Box 2.1 Enzyme database
The Gene Ontology project provides a view of all gene products. An important subset, already possessing extensive descriptions and classification, are the enzymes. For a database that organizes the naming and classification of enzymes, go to enzyme.expasy.org. Here you can search the ENZYME database for enzymes based on names, Enzyme Commission (EC) numbers, and other terms. Enter “kinase” and you are shown many pages of different kinases. Clicking on their links shows the chemical reactions they catalyze, and links to their sequences and annotation.

2.10 THE GENE DATABASE

For the remainder of this chapter, we will return to the NCBI Website and visit two wonderful collections: the Gene and UniGene databases. Both collections are often used to retrieve DNA and protein sequences and we’ll be using these databases later in the book for this purpose. Here, we will learn about their content.

From the Entrez interface (Figure 2.2), you can select a database called “Gene.” Each record within Gene compiles information from numerous sources on a single gene. There are 105 separate Gene records with at least some mention of kibra and many of these genes are versions of kibra from various organisms. There are multiple human records found with the more specific query “kibra AND sapiens.” But only one has the official gene symbol seen earlier (WWC1) and kibra listed as a synonym. The others have the word kibra somewhere in the annotation and are found by this search. Part of the human kibra Gene record is shown in Figure 2.17. While this is not the only database where gene/protein-centric information has been compiled, it can serve as a launching point for many analysis projects. Most of the text in each record is hypertext linking you to other NCBI resources.

Looking at Figure 2.17, at the top is a Summary of some basic facts about the gene, including synonyms that we saw earlier in the iHOP record. The official symbol is here (WWC1) along with the RefSeq (Reference Sequence) status; REVIEWED indicates that the sequence information in this record has been reviewed by a person, increasing the confidence about the gene structure and associated information. Many records in the Gene database were automatically generated and have not (yet) been reviewed. However, automated methods are quite powerful, accurate, and necessary when you are dealing with millions of genes. The “Genomic context” section shows the position of the kibra gene (a dark arrow labeled WWC1) in relation to its neighbors on human chromosome 5. Below that are three different kibra gene transcripts represented as exons (boxes) and introns (thin lines). The bottom of the kibra Gene Web page (not shown in the figure) contains information from PubMed, concise functional information contributed by the scientific community (GeneRIF (Gene Reference into Function)), and protein–protein interaction information.

Figure 2.17 The NCBI Gene record for human kibra. Basic information for a gene is displayed along with links to related information from NCBI data sources on the right sidebar. This record was selected from the “Gene” hit list shown in Figure 2.2. The “Summary” section includes the official name for this gene, WWC1 WW and C2 domain containing 1, along with a list of synonyms, including kibra. The “Genomic context” section shows kibra along with adjacent genes portrayed as arrows on the chromosome. The “Genomic regions, transcripts, and products” section is a map showing exons of kibra transcripts, with transcript and protein accession numbers on the left and right sides of the map, respectively. These accession numbers are unique identifiers for these molecules.

2.11 UniGene

The NCBI has another gene-centered database, UniGene. From the NCBI home page, it can be located using the Entrez interface drop-down menu at the top of most NCBI Web pages (Figure 2.1) or from the Site Map (A-Z) hyperlink (Figure 2.18A). From here, click on the “U” on the top of the page and the view will jump to the bottom of the alphabet (Figure 2.18B). There are also links to UniGene on the iHOP database (Figure 2.7B), in OMIM (Figure 2.8), and in Gene records (Figure 2.17). The lesson here is that there are many ways to get to UniGene and many other databases, both at the NCBI Website and at other Websites hosted around the world.

Figure 2.18 The NCBI Site Map. (A) The NCBI home page has a hyperlink “Site Map (A-Z)” for an alphabetized list of resources available at the NCBI Website. (B) There are two UniGene links listed: the UniGene link is the primary entry point. To view all the organisms and clusters, find the statistics page on the UniGene home page. The UniGene Library Browser allows you to find individual complementary DNA (cDNA) libraries from specific species.

Each UniGene collection or cluster is a compilation of transcripts from a given genomic locus. These transcripts come from cDNAs (complementary DNAs) and ESTs (Expressed Sequence Tags, a form of cDNA) although the difference between these two sources is sometimes blurred. Historically, cDNA was the result of efforts to isolate full-length or nearly full-length representations of mRNA transcripts. ESTs are intended to represent a short piece or “tag” of the full-length message. ESTs often originate from the 5′ and 3′ ends of the mRNAs and are synthesized inward on the message. With short mRNAs and improved EST synthesis, ESTs often span the length of the entire transcript so it is not unusual to find ESTs that are “full length.” cDNA synthesis can be inefficient so it is also common for cDNAs to be truncated (although they may be labeled “full length”). ESTs are often generated from very-high-throughput efforts with little human intervention for interpreting the sequence. At the other extreme, some cDNAs were very carefully synthesized, checked for accuracy, and fully annotated as part of a publication on that sequence. Regardless of these different possibilities, the transcripts of a chromosomal locus are in one place in the UniGene database.

From the home page of the UniGene database, click on the “UniGene Statistics” link and you navigate to a page (Figure 2.19A) displaying a long list of species, reflecting the widespread generation of ESTs. The first entry is from the cow, with over 45,000 loci. Some adjacent loci will eventually be joined because overlapping sequences will be found, linking distant exons to a single mRNA. Use of the latest technology to create and sequence cDNA is uncovering rarely transcribed loci, many of which will be shown not to encode proteins. The human entry shows over 129,000 loci, which far exceeds the 21,000 known protein-encoding genes. Guesses as to the identity of the other thousands of clusters range from regulatory RNAs to transcribed regions that have no function. The latter may represent nature’s way of creating genes and function: first get transcribed and then find a role. This is an exciting and emerging area of molecular biology and bioinformatics so non-coding transcripts should not be overlooked.

Figure 2.19 The UniGene database gathers all the transcription products of genomic loci into one place. (A) The statistics page lists the many species represented, each with thousands of UniGene “clusters.” (B) A sampling of four cow transcribed loci. Notice that there are 45,178 UniGene clusters distributed over 2259 pages containing 20 records each.


Clicking on a species name brings you to the statistics of the collection. For example, the cow has approximately 1.4 million sequences assembled into almost 46,000 clusters. There are over 16,000 transcripts that are “singletons”; only a single cDNA sequence maps to each location. Clicking on the number of UniGene entries seen in Figure 2.19A takes you to a list of the individual clusters (Figure 2.19B) and you can see the number of sequences per cluster along with an automated identification. For example, the first cluster shows strong similarity to a known human gene and the second cluster is tentatively identified as the cow equivalent to a yeast (Saccharomyces cerevisiae) gene.

A search of the UniGene database using the term “kibra” returns results such as those seen in Figure 2.20. There are a total of 12 clusters found with kibra, from 12 different species, ranging from 228 sequences (human) to 1 (cow). If you wanted to clone the cow kibra gene, this single EST out of the 1.4 million cow sequences would be your ticket to success. These results also demonstrate the range of similarities that are calculated by the automated annotation of these sequences. The sequence identity is fairly certain if the term “similar” is used. The terms “strongly” and “moderately” reflect the degree of similarity to known genes.

Figure 2.20 Searching UniGene. In this example, “kibra” is used as a query and 12 clusters were found. Some are well characterized, for example the human, mouse, and Drosophila genes. The others are automatically annotated as loci that have similarity to kibra but more analysis is needed to verify if these are kibra genes.

The top of the UniGene cluster record for human kibra is shown in Figure 2.21A. Sequences from the cluster show high identity to known proteins, ranging from 100% (the human kibra) to 67% (Danio rerio). The next section in the UniGene record covers gene expression (Figure 2.21B); the origins of the cDNA or ESTs are used to determine where these sequences can be found. For example, if kibra ESTs were found in brain cDNA libraries, brain is listed as a possible location for the normal expression of kibra. It is worth noting that this is not a perfect assumption. Certain tissues (for example, brain and testis) are known to express many genes in an apparently nonspecific manner. Nevertheless, tissues of expression are additional data points that can be gathered from these collections of cDNA in the UniGene database.

Figure 2.21 The UniGene record for human kibra. (A) A portion of the record, showing the best similarities to proteins from other organisms. This evidence supports the identity of this locus. (B) Another section of the record concerned with gene expression. The tissues of origin for the transcripts in this UniGene cluster are summarized as “cDNA Sources” and suggest widespread expression of the kibra transcripts. Clicking on “EST Profile” will give both a graphical and numerical presentation of these data.

The next section of the UniGene record is the listing of sequences in the cluster (Figure 2.22A). The mRNAs are listed first. The left-hand column contains database accession numbers, unique identifiers in the database, which are hyperlinked to these sequences. The accession numbers that start with “NM_” are reference sequences, thought to be standards that are full length and without errors. Also included are brief descriptions and, to the right, a code indicating similarity to known proteins (P) and possession of a polyadenylation signal at the 3′ end of the mRNA (A).

The list of ESTs includes hyperlinked accession numbers and brief descriptions. Those labeled “IMAGE” are from clones produced by the IMAGE consortium (Integrated Molecular Analysis of Genomes and their Expression). Members of this organization generated millions of clones, perhaps the world’s largest collection, from many different tissues and made them available to everyone. A large number of clones were sequenced both at the 5′ and 3′ ends, and some were sequenced internally. As a result, their descriptions (second column) may indicate identical clone names. The annotation of the sequence file provides more information such as which end of the clone was sequenced (for example, 3′ read). Another column lists the tissues that generated these clones. Figure 2.22B shows a sampling of the 214 ESTs in the human kibra cluster. The clone names often help to reveal more about the tissue of origin. For example, BRHIP is from the brain hippocampus, BRALZ is from the brain of an Alzheimer’s disease patient, and BRAMY is from the brain amygdala.

Figure 2.22 UniGene sequences. (A) The sequences that make up a UniGene cluster are divided to separate the longer, often annotated mRNAs from the shorter and largely uncharacterized ESTs. When there are many ESTs present, only some (here 10 of 214) are shown in the main window but all are available for browsing (click on “Show all sequences”) or download (see the button in this panel). The key to the other annotation symbols can be found on the UniGene page. (B) A number of the ESTs found in the human kibra UniGene cluster.

Clicking on the accession numbers reveals additional information. For example, clicking on accession number DA228506.1 will take you to the sequence summary (Figure 2.23). From here you can go to two additional resources to learn about this clone or this library. Clicking on the GenBank entry hyperlink DA228506.1 will take you to a record containing the sequence. The dbEST_18318 hyperlink will take you to information about the cDNA library, BRAWH3.

Figure 2.23 UniGene sequence summary. This includes a link to the full GenBank record (the accession number), the sequence length, clone name, library name, and tissue of origin. ESTs can come from multiple regions on a single clone, for example the 5′ and 3′ ends.

The top of the BRAWH3 library’s Web page contains fundamental information about the library (Figure 2.24A). It is from a normal human brain but no other information is provided as to its origin. Is it derived from the whole brain or just a portion? There are no specifics regarding the exact tissue. Under “Gene Content Analysis” we learn that the library generated over 43,000 ESTs that were collapsed into 7583 UniGene entries, listed here in descending number of ESTs. As can be seen, 1317 ESTs are from the amyloid beta A4 gene. Then there is a big jump to 388 ESTs from a gene called ermin. The EST number is also expressed as TPM (transcripts per million), which is a convenient way of normalizing the EST numbers. Considering that there was a big jump down to the ermin ESTs, you might conclude that amyloid beta A4 is hugely abundant, but the TPM number puts it at 3% of this library. This is still a very large number but the TPM value keeps it in perspective.
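The TPM normalization is simple arithmetic: the EST count for a gene is divided by the total number of ESTs in the library and scaled to one million. The sketch below works through the two counts quoted for BRAWH3. The library size of exactly 43,000 is an assumption made for illustration, since the text says only “over 43,000,” so the printed values are approximate and will differ slightly from those on the Website.

```python
def transcripts_per_million(est_count, library_size):
    """Scale an EST count to the number expected per million ESTs."""
    return est_count / library_size * 1_000_000


# Assumption: the text gives the library size only as "over 43,000".
LIBRARY_SIZE = 43_000

# EST counts quoted in the text for the BRAWH3 library.
EST_COUNTS = {"amyloid beta A4": 1317, "ermin": 388}

for gene, count in EST_COUNTS.items():
    tpm = transcripts_per_million(count, LIBRARY_SIZE)
    percent = tpm / 10_000  # 10,000 TPM corresponds to 1% of the library
    print(f"{gene}: about {tpm:,.0f} TPM ({percent:.1f}% of the library)")
```

Because TPM values share a common scale, they can be compared between libraries of different sizes, which raw EST counts cannot.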

Compare and contrast the BRAWH3 library record to that of BRHIP3 (Figure 2.24B). The BRHIP3 library has almost 33,000 ESTs that collapse to 6383 clusters, smaller than the BRAWH3 library but similar in proportions. The BRHIP3 library is from the hippocampus and the most abundant EST is again from the amyloid beta A4 gene. In fact, two of the top five EST clusters are identical between the two libraries. By comparing libraries from different brain regions, gene expression patterns could be recognized which may shed light on the roles and functions of genes.


2.12 THE UniGene LIBRARY BROWSER

The UniGene Library Browser allows access to all the cDNA libraries represented in the database. You can navigate to this page from the NCBI Site Map (Figure 2.18), or from the UniGene home page to the Library Browser seen in a number of the UniGene pages (for example, Figure 2.19A). Once on the Library Browser page, you are met with a very simple query form (Figure 2.25A). Here you pick the species and the minimum library size to view. For example, do you only want to see huge libraries that may show the rarest of transcripts? Or do you want to see as many libraries as possible, especially from rarely seen tissues that yield only a few clones? In this example, the form is set to find human libraries containing 1000 ESTs or more. Using these settings, you are shown many pages of EST and cDNA libraries. There are 977 libraries from brain tissue alone, with “only” 109 libraries above 1000 ESTs. The top of the brain library list is shown in Figure 2.25B.

Some of the information in UniGene is difficult to interpret. For example, why can you find ESTs for albumin, a gene normally expressed in the liver, in brain libraries? You can explain blood-specific genes found in the brain because of blood contamination in the brain preparation, but not liver genes. Rather than being a problem, perhaps it points to something interesting: a low level of expression of every gene in many tissues. Alternatively, perhaps many proteins have been re-purposed to serve a function in the brain.


Figure 2.24 UniGene library exploration. (A) The library BRAWH3 can be accessed by clicking on its hyperlink in the UniGene sequence summary shown in Figure 2.23. The tissue is described as “brain” and, under the “Gene Content Analysis” section, the library consists of over 43,000 sequences which collapse to over 7500 loci. The EST clusters from this library are listed, with the most abundant being amyloid beta (over 1300 transcripts in this single library). (B) The details of another library, BRHIP3, derived from the hippocampus region of the brain. The most abundant transcript from this library is also amyloid beta.


Figure 2.25 The UniGene Library Browser. (A) The simple query form where the request is to show all human libraries with a minimum number of sequences equal to 1000. (B) This query brings up a long list, alphabetically organized by tissue and then library, the top of which is shown here. Notice that BRAWH3 and BRHIP3 libraries are on this list.


The UniGene database is an incredible resource for determining where you might find transcripts. Even if only a piece of your gene of interest is found in a library, and you have been previously unsuccessful in cloning this gene, you now have a significant advantage in your next attempt.

2.13 SUMMARY
In this chapter, we learned about the databases maintained at the NCBI and elsewhere. At these Websites, we did text-based searches using accession numbers, gene names, authors, and patent numbers. We searched the medical literature and databases of DNA sequences, and we did global searches where a single text query found hits in the nucleotide, protein, literature, and other databases.

EXERCISES
Williams syndrome and oxytocin: research with Internet tools
Oxytocin-neurophysin 1 is a protein expressed in the hypothalamus of the brain. It is cleaved into two different proteins: the 9 amino acid oxytocin and the 94 amino acid neurophysin. Oxytocin causes the smooth muscle of the uterus to contract during labor and has been used for years to induce labor in both humans and animals. Oxytocin has also been shown to have effects on behavior. In fact, the popular press often refers to it as the “trust hormone” and experiments support this claim.

Williams syndrome is a disorder characterized by both physical features and developmental delays. The OMIM database entry includes a description of individuals with Williams syndrome as “empathetic, loquacious, and sociable.” The behavior also includes a tendency to trust others, and some have drawn comparisons between Williams syndrome and the effects of oxytocin.

In this series of exercises, the various tools covered in this chapter are to be employed to explore both oxytocin and Williams syndrome.
1. The bond between the domesticated dog and their owner can be quite strong. It is also believed that owning and interacting with a dog can lower blood pressure, increase longevity, and give the owner a feeling of well-being. Could it be that some of these physiological effects are mediated by the dog inducing the production of oxytocin in the owner? Search PubMed for evidence.
2. Oxytocin has been shown to induce trust, even when money is concerned. By administering an oxytocin nasal spray to experimental subjects, oxytocin has been shown to make people more willing to share money with strangers. Search PubMed for articles pertaining to this topic and others related to trust.
3. Find the human oxytocin Gene record at the NCBI. What is the official gene symbol for this gene?
4. Using the UniGene Website, what human tissues are known to express oxytocin?
5. Oxytocinase is an enzyme that degrades oxytocin. Using www.ihop-net.org, find the current official symbol for this human protein. Sorting the reference sentences by date, what are some of the earliest names used for this protein?
6. Using the uspto.gov Website, find a patent where oxytocin is used to induce labor in farm animals. In what year was this patent issued?
7. Using the OMIM Website, can you find any diseases associated with oxytocin?
8. Using the NIH RePORTER Website, are there any grant applications on Williams syndrome?
9. Using the OMIM Website, can you find any records associated with Williams syndrome? Make sure your strategy does not result in a number greater than 50 hits. What is the proper name for this syndrome?
10. There is an article in the New England Journal of Medicine on Williams syndrome written by a Dr. Barbara Pober. Construct a specific search of PubMed to find this article.
11. A characteristic of those afflicted by Williams syndrome is excessive trust and acceptance of strangers. Using any tool you were introduced to in this chapter, can you find any connection between Williams syndrome and oxytocin?

FURTHER READING
Ashburner M, Ball CA, Blake JA et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29.

Baxevanis AD (2008) Searching NCBI databases using Entrez. Curr. Protoc. Bioinformatics Chapter 1, Unit 1.3.

Grusche FA, Richardson HE & Harvey KF (2010) Upstream regulation of the hippo size control pathway. Curr. Biol. 20, R574–R582. Review article on network of gene regulation for kibra.

Hoffmann R & Valencia A (2004) A gene network for navigating the literature. Nat. Genet. 36, 664. A very short article on iHOP.

Lee HJ, Macbeth AH, Pagani JH & Young WS 3rd (2009) Oxytocin: the great facilitator of life. Prog. Neurobiol. 88, 127–151. A wonderful and thorough review article on the various roles oxytocin plays.

Lennon GG, Auffray C, Polymeropoulos M & Soares MB (1996) The I.M.A.G.E. Consortium: an integrated molecular analysis of genomes and their expression. Genomics 33, 151–152.

Lu Z (2011) PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford) (DOI: 10.1093/database/baq036). Like any other mountain of data, the medical literature is being mined by sophisticated tools. This article reviews the different approaches used to extract more information out of the medical literature.

Maglott D, Ostell J, Pruitt KD & Tatusova T (2011) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 39 (Database issue), D52–D57.

Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), 2011. World Wide Web URL: omim.org

Papassotiropoulos A, Stephan DA, Huentelman MJ et al. (2006) Common Kibra alleles are associated with human memory performance. Science 314, 475–478. Ties between kibra and memory are being actively pursued. There were four papers on kibra prior to 2006, and 36 papers in the five years that followed, including papers that suggest that the role of kibra in memory is minor or “complex.”

Wheeler DL, Church DM, Federhen S et al. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31, 28–33. General and brief reference on UniGene and other resources at the NCBI.


CHAPTER 3

Introduction to the BLAST Suite and BLASTN

Key concepts
• Why and how to search a sequence database
• An introduction to nucleotide BLAST (BLASTN)
• Interpreting BLASTN results
• Cross-species searches: paralogs, orthologs, and homologs

3.1 INTRODUCTION
In Chapter 2 we learned how to search databases with text queries. All of these were exact matches; that is, we were expecting to find the exact accession number or exactly spelled words. In this chapter, a much harder database-searching problem is introduced. How do you find matches when your query is not a short accession number or a text term, but instead a DNA sequence that is 500 nucleotides long? In addition to finding all the exact matches, can you find those sequences with mismatches, clearly related to the query but not 100% identical? For all the hits that are not exact matches, can calculations generate statistics that help evaluate which hits are significant, and which should be ignored? On top of these challenges, can a search of a database that contains millions of sequences return its results in a reasonable time? These and other questions will be answered here.

A computer program called BLAST is one of the most commonly used tools in bioinformatics and will be introduced in this chapter. The next three chapters will explore further uses of BLAST.

Why search a database?
Let’s assume that you have an unknown sequence and you use it as a query to search a bioinformatics database:
● Is the query identical to something already in the database? Is the query a known gene? If so, you can learn a lot about this gene by looking at the annotation in the sequence records.
● Is your query just a small piece of a much larger gene? If so, you may have just found a way to obtain the rest of the gene. Or based on these results, your laboratory technique may need to be improved upon.
● Is the query similar to something already in the database? Has it already been found in another organism, or is it similar to something in the same organism? Did you just find members of a gene family? These sequence similarities may tell you something about the function of your sequence.
● Is the query unique? Never been seen before? Perhaps you have discovered a new gene!

You are asking a handful of questions every time you do a sequence similarity search.

3.2 WHAT IS BLAST?
BLAST, or the Basic Local Alignment Search Tool, was specifically designed to search nucleotide and protein databases. It takes your query (DNA or protein sequence) and searches either DNA or protein databases for levels of identity that range from perfect matches to very low similarity. Using statistics, it reports back to you what it finds, in order of decreasing significance, and in the form of graphics, tables, and alignments.

There are multiple forms of BLAST, but in this chapter we concentrate on nucleotide BLAST (BLASTN, pronounced “blast en”). The query is a DNA sequence and the database you search is populated with DNA sequences, too. Table 3.1 outlines the query and the subjects of the search.

How does BLAST work?
As mentioned previously, many millions of DNA sequences have been collected in databases and are available for searching at Websites such as those at the NCBI. To conduct sequence-based queries with BLAST, databases must be in a special format that is optimized for this type of query. The annotation associated with the individual files must be removed, leaving just the sequences. Links to the annotation are still maintained so the identities of these sequences are not lost. Each sequence in the database is then broken into words, or short sequences, for comparison to the query.

When a search is submitted, BLASTN first takes your query DNA sequence and breaks it into words that are quite short (11 nucleotides). It then compares these words to those in the database. As BLAST has to compare many millions of words in this manner, and subsequent steps can be time-consuming, BLAST looks for two adjacent word pairs and, if their similarities and distance between words are acceptable, only those advance to the next set of calculations.

Starting with this local similarity, BLAST then tries to extend the similarity in either direction (Figure 3.1). Using the sequence immediately upstream and downstream of the word in the original query, BLAST starts keeping track of the consequences of lengthening the alignment between the query and the sequence in the database. Still matching? The significance score increases. Mismatches encountered? Penalty points accumulate until the cost outweighs the benefit and BLAST stops extending. This approach is finely tuned; if the penalty threshold for extension were too low, BLAST would stop trying to extend similarity very quickly and distant but significant sequence similarities would be missed. If the penalty threshold were too high, BLAST would be given too much freedom to keep extending past real areas of similarity and start collecting many truly insignificant hits.

Finally, all the alignments between the query and the database subjects, or “high-scoring segment pairs” (HSPs), are ranked based on length and significance. The best hits are kept and shown to you in the forms of a graphic, a table, and alignments between the query and the hits.
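The word-matching and extension steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the NCBI implementation: it looks only for exact 11-letter words, extends without gaps, and stops at the first seed it finds. The function names and the scoring values (+1 for a match, -3 for a mismatch, and a drop-off limit of 5) are assumptions chosen for the example.

```python
MATCH, MISMATCH, X_DROP = 1, -3, 5   # illustrative values, not BLAST defaults


def words(seq, k=11):
    """Yield (offset, word) for every overlapping k-letter word in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def extend(query, subject, q, s, step):
    """Extend an ungapped alignment from (q, s) in direction step (+1 or -1).

    Returns (length, score) of the best-scoring extension seen before the
    running score fell more than X_DROP below that best score.
    """
    score = best = best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += MATCH if query[q] == subject[s] else MISMATCH
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > X_DROP:
            break  # penalties now outweigh the benefit: stop extending
        q += step
        s += step
    return best_len, best


def seed_and_extend(query, subject, k=11):
    """Return (q_start, q_end, s_start, s_end, score) for the first seed, or None."""
    index = {}
    for j, word in words(subject, k):
        index.setdefault(word, []).append(j)
    for i, word in words(query, k):
        for j in index.get(word, []):
            left, left_score = extend(query, subject, i - 1, j - 1, -1)
            right, right_score = extend(query, subject, i + k, j + k, +1)
            return (i - left, i + k + right, j - left, j + k + right,
                    k * MATCH + left_score + right_score)
    return None


# The two sequences used in Figure 3.1
query = "CAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCTCTATGGTCCTTTGGGG"
subject = "CAACCTCAAGGGCACCTTTGCCACACTGAGTGACCTGCACTGTAAAGTTTTGCAT"
print(seed_and_extend(query, subject))
```

The returned coordinates are 0-based and end-exclusive, so `query[q_start:q_end]` is the aligned stretch of the query. Raising or lowering `X_DROP` shows the tuning problem described in the text: a small value ends the extension at the first few mismatches, while a large value lets it run on into unrelated sequence.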

Table 3.1 BLASTN definition

Type      Query        Database
BLASTN    Nucleotide   Nucleotide

Figure 3.1 Simple extension example for BLASTN. Starting with an initial match of “words,” BLAST extends the alignment between query and hit, keeping track of penalty points against, and increasing significance for, extending the alignment.

Alignment starts with initial word of 11

ACACTGAGTGA
|||||||||||
ACACTGAGTGA

Extension to the left has no mismatches, no penalty points
Extension to the right has mismatches and penalty points

GCACCTTTGCCACACTGAGTGAGCTGCTCTATG
|||||||||||||||||||||| |||| || |
GCACCTTTGCCACACTGAGTGACCTGCACTGTA

Extension to the left has no penalty points and can continue to grow
Extension to the right accumulates too many mismatch penalty points; extension in this direction stops

CAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCTCTATGGTCCTTTGGGG
||||||||||||||||||||||||||||||||| |||| || |     ||||
CAACCTCAAGGGCACCTTTGCCACACTGAGTGACCTGCACTGTAAAGTTTTGCAT

If left side cannot grow any more, the final alignment looks like this:

CAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCTCTATG
||||||||||||||||||||||||||||||||| |||| || |
CAACCTCAAGGGCACCTTTGCCACACTGAGTGACCTGCACTGTA

3.3 YOUR FIRST BLAST SEARCH
BLAST may be the most widely used sequence analysis program in the world. It is available as a tool from many Websites but is also downloadable as an application that works on your local personal computer or powerful server. It is free, but commercial parties have also created enhanced BLAST applications and charge a fee for these products. Please note that here only the Web form of BLAST will be discussed. Below is a step-by-step description of your first BLAST search. In future use, these searches can take less than a minute, including your analysis.

Find the query sequence in GenBank
For your first BLASTN search, we’ll use a sequence where information is very limited. This may be very typical of what you will face: you have an unknown sequence and you want to determine its identity, if possible.
1. Go to the NCBI Website, ncbi.nlm.nih.gov.
2. Near the top of the home page is the Entrez drop-down menu of database choices. Select “Nucleotide.”
3. In the text field next to the menu, enter the GenBank accession number, DD148865. Accession numbers are unique identifiers for sequence records in the GenBank database. By searching with this accession number, you will find just one DNA sequence file.
4. Either press the return key on your keyboard or hit the Search button on the Web page.

The Web page refreshes and you will see the nucleotide file. There are two main sections to this file (Figure 3.2): on top, information about this sequence and, below it, the DNA sequence. The information of a sequence record is generally referred to as the annotation. Although the DNA sequence is the key information for all GenBank records, the annotation section can be a rich source of knowledge about the sequence and should not be overlooked. Databases such as GenBank have a specific structure to their annotation: fields of information that are the same in each record. Examples include definition, accession number, and species of origin.

Figure 3.2 A GenBank file.

LOCUS       DD148865                 631 bp    DNA     linear   PAT 04-NOV-2005
DEFINITION  A group of genes which is differentially expressed in peripheral
            blood cells, and diagnostic methods and assay methods using the
            same.
ACCESSION   DD148865
VERSION     DD148865.1  GI:92839210
KEYWORDS    JP 2005102694-A/175.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 631)
  AUTHORS   Nojima,H.
  TITLE     A group of genes which is differentially expressed in peripheral
            blood cells, and diagnostic methods and assay methods using the
            same
  JOURNAL   Patent: JP 2005102694-A 175 21-APR-2005;
            Japan Science and Technology Agency,Hiroshi NOJIMA,GeneDesign Inc
COMMENT     OS   Homo sapiens
            PN   JP 2005102694-A/175
            PD   21-APR-2005
            PF   09-SEP-2004 JP 2004263092
            PI   hiroshi nojima
            CC
            FH   Key             Location/Qualifiers
            FT   misc_feature    (17)..(17)
            FT                   /note='n is a, c, g, or t'.
FEATURES             Location/Qualifiers
     source          1..631
                     /organism="Homo sapiens"
                     /mol_type="unassigned DNA"
                     /db_xref="taxon:9606"
ORIGIN
        1 gcaactgtgt tcactancaa cctcaaacag acaccatggt gcatctgact cctgaggaga
       61 agtctgccgt tactgccctg tggggcaagg tgaacgtgga tgaagttggt ggtgabgccc
      121 tgggcaggct gctggtggtc tacccttgga cccagaggtt ctttgagtcc tttggggatc
      181 tgtccactcc tgatgctgtt atgggcaacc ctaaggtgaa ggctcatggc aagaaagtgc
      241 tcggtgcctt tagtgatggc ctggctcacc tggacaacct caagggcacc tttgccacac
      301 tgagtgagct gcactgtgac aagctgcacg tggatcctga gaacttcagg ctcctgggca
      361 acgtgctggt ctgtgtgctg gcccatcact ttggcaaaga attcacccca ccagtgcagg
      421 ctgcctatca gaaagtggtg gctggtgtgg ctaatgccct ggcccacaag tatsactaag
      481 ctcgctttct tgctgtccaa tttctattaa aggttccttt gttccctaag tccaactact
      541 aaactggggg atattatgaa gggccttgag catctggatt ctgcctaata aaaaagcatt
      601 tattttcatt gcaaaaaaaa aaaaaaaaaa a

We will spend a lot of time examining and utilizing the annotation of files so let’s look at this record’s annotation closely. Down the left side of the annotation are section labels, all in uppercase. The LOCUS line contains some basic information: the name, usually synonymous with the accession number (DD148865); the length (631 bp); the type of molecule which was the source of this sequence (DNA); the topology of the source material (linear or circular); the division code (in this case, PAT, which stands for the Patent division of GenBank); and the date when the file was created or underwent revision (04-NOV-2005). For a full description of the fields in a GenBank file, see the “Release Notes” associated with the database. One way to find these is by using the drop-down menu at the top of the NCBI home page: select “NCBI Web Site” and enter “GenBank release notes” as a query.
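Because the LOCUS line has a fixed set of fields, it is easy to pull apart with a short script. The sketch below is only an illustration: the `Locus` type and `parse_locus` function are hypothetical names, and the code assumes the eight whitespace-separated tokens seen in this record. LOCUS lines in other records can carry extra or missing fields, which this sketch does not handle.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str       # usually the accession number
    length: int     # sequence length in base pairs
    molecule: str   # e.g. DNA
    topology: str   # linear or circular
    division: str   # GenBank division code, e.g. PAT
    date: str       # date of creation or last revision


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line laid out like the one in Figure 3.2."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return Locus(parts[1], int(parts[2]), parts[4], parts[5], parts[6], parts[7])


locus = parse_locus(
    "LOCUS       DD148865                 631 bp    DNA     linear   PAT 04-NOV-2005"
)
print(locus.name, locus.length, locus.division)
```

Naming each field, rather than indexing into a list, keeps later code readable when several records are compared side by side.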

The DEFINITION field is usually a brief description of the sequence, its origin, and any additional information that may prove valuable to the reader. The definition line for this record is unusually long, clearly taken from the title of the patent: “A group of genes which is differentially expressed in peripheral blood cells, and diagnostic methods and assay methods using the same.”

The ACCESSION and VERSION lines are related. GenBank uses two types of unique identifiers. Accession numbers are unique to a sequence record, but should that record be revised, then the accession number is given a new version number. When this sequence was submitted to GenBank, it was given an accession number of DD148865, and the version was DD148865.1. If this record is updated, for example if the annotation is revised by the author, then it will be given a new version number, DD148865.2. Both versions will be kept. Rather than force you to know which version is the latest, GenBank lets you retrieve the latest version by just entering the accession number (DD148865), without the version number extension (.1), as you did when you were asked to retrieve this sequence. GI numbers are also assigned as unique identifiers for each specific record, for practical reasons of constructing the GenBank database. Different versions of sequence records could have completely different GI numbers (for example, they won’t be sequential or appended with numbers or letters). We will only be working with accession numbers in this book.

KEYWORDS is a field where authors can record any other information they feel might be useful. However, as this is an optional field, you cannot rely on it to comprehensively search a database for related records. In this case, the authors (inventors) put the patent number (JP 2005102694-A/175) in this field so a search of Nucleotide for “JP 2005102694” finds 305 sequences that are associated with this patent. Note that not all terms are indexed for searches. For example, using more of the original patent number (adding “-A/175”) does not find these sequences.

The SOURCE and ORGANISM fields are related. The SOURCE is the genus and species according to the author of this record. This is a free text field, and so you might find variations and typographic mistakes; for example, Homo sapien instead of the correct Homo sapiens. The ORGANISM field is constrained and contains the full taxonomic classification for the source organism.

The REFERENCE field contains details about the sequence origin and can include publications. Also present under FEATURES is information found elsewhere in the annotation (for example, organism) but it may include additional facts or partial analysis results. We’ll see this later when we examine sequence records more rich with information. The COMMENT section may include information such as accession numbers of related sequences or references to other databases. In this patent record, it repeats basic details about the patent.

The last section of this file is the DNA sequence. It is one continuous sequence, broken up into groups of 10 nucleotides (with blank spaces in between), with 60 nucleotides per line. Each line is prefixed with the number of the first base it contains.
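The layout of the sequence section (a leading position number, then groups of 10 bases) can be undone with a few lines of code. The sketch below is an illustration only; `origin_to_sequence` is a hypothetical helper name, and it assumes every non-empty line starts with the position number, as in Figure 3.2.

```python
def origin_to_sequence(origin_lines):
    """Join sequence lines into one uppercase string, dropping numbers and spaces."""
    chunks = []
    for line in origin_lines:
        fields = line.split()
        if not fields:
            continue            # skip blank lines
        # fields[0] is the position of the first base on the line
        chunks.extend(fields[1:])
    return "".join(chunks).upper()


first_two_lines = [
    "        1 gcaactgtgt tcactancaa cctcaaacag acaccatggt gcatctgact cctgaggaga",
    "       61 agtctgccgt tactgccctg tggggcaagg tgaacgtgga tgaagttggt ggtgabgccc",
]
sequence = origin_to_sequence(first_two_lines)
print(len(sequence), sequence[:20])
```

The result is the bare string of bases that sequence-comparison programs work with, free of the numbering and spacing that make the record easier for people to read.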

Convert the file to another format
The GenBank file described above contains annotation and features that are not used by BLAST and it is necessary to remove these components to run the search. This can be easily accomplished by converting this file into another file format called FASTA (pronounced “FAST-AY,” the second syllable rhymes with “say”).
1. Near the top of the sequence Web page there is a list of formats under “Display Settings” which includes GenBank, FASTA, Graphics, and more. Choose FASTA.
2. When the page refreshes you see the file in FASTA format (Figure 3.3).


Figure 3.3 GenBank file DD148865 in FASTA format.

>gi|92839210|dbj|DD148865.1| A group of genes which is differentially expressed in peripheral blood cells, and diagnostic methods and assay methods using the same
GCAACTGTGTTCACTANCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGT
TACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGABGCCCTGGGCAGGCTGCTGGTGGTC
TACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACC
CTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCT
CAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGG
CTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGG
CTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATSACTAAGCTCGCTTTCT
TGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAA
GGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAAGCATTTATTTTCATTGCAAAAAAAAAAAAAAAAAA
A

Almost all the annotation is now missing. What description remains is on the top line which, in all FASTA files, begins with a “>” symbol. Next are fields, separated by vertical bars. The first field indicates the unique identifier for this file record (GI number 92839210). The next field shows the organization that first received this sequence (another sequence database called the DNA Data Bank of Japan, or DDBJ, abbreviated here as “dbj”). This is followed by the accession number, and the definition. The DNA sequence begins on a new line. This signals “everything past this point is just DNA sequence”: no numbers, spaces, or anything else among the sequence characters. This is the essence of the FASTA format: minimum annotation followed by sequence. In Figure 3.3 it appears that there are two lines of annotation because of the long definition. In fact, the annotation text has been “wrapped” to the next line due to page width.

This is a good opportunity to look at the sequence with your eyes. There are no numbers and spaces and with the simplicity of the format you might recognize details not visible in GenBank format. Do you see any pattern in the sequence? Any regions rich in As? Do you see any doublets or triplets that seem to be repeated throughout the sequence? Are there any bases that are not A, T, G, or C (Box 3.1)? This moment of “low-tech” examination will often reveal details that may help you interpret findings derived by other means. Throughout this book, take time to look at the sequences you encounter and your eyes will become trained at recognizing interesting features without the use of software. An obvious feature of this cDNA is the poly(A) tail.
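The structure just described (one “>” line of bar-separated fields, then nothing but sequence) is simple enough to read with a short script. The following is a minimal sketch with hypothetical function names. It assumes a single record whose top line follows the layout shown in Figure 3.3; FASTA files from other sources often use different top lines.

```python
def read_fasta(text):
    """Return (definition_line, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with '>'")
    return lines[0][1:], "".join(lines[1:]).upper()


def split_defline(defline):
    """Split a 'gi|number|database|accession| definition' top line into its fields."""
    parts = defline.split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("top line is not in the gi|...|...|...| layout")
    return {
        "gi": parts[1],
        "source_db": parts[2],
        "accession": parts[3],
        "definition": parts[4].strip(),
    }


def unusual_bases(seq):
    """List (1-based position, letter) for every character that is not A, C, G, or T."""
    return [(i, base) for i, base in enumerate(seq, 1) if base not in "ACGT"]


record = """>gi|92839210|dbj|DD148865.1| A group of genes
GCAACTGTGTTCACTANCAA
CCTCAAACAG
"""
defline, seq = read_fasta(record)
print(split_defline(defline)["accession"], len(seq), unusual_bases(seq))
```

The `unusual_bases` helper is a software version of the “low-tech” examination suggested above: it reports any letter that is not one of the four standard bases, such as the N near the start of this sequence.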

Performing BLASTN searches
At the NCBI home page, the link to the BLAST forms is usually near the top of the list of Popular Resources. This hyperlink will take you to a page listing different types of BLAST. In this case we are using a DNA sequence query and will be searching a DNA database, so you will be using the “nucleotide blast” or BLASTN form. Navigate to this choice and we’ll begin to populate the fields on the form.
● The Query Sequence field. The best way to provide sequence is the plain text found in FASTA format. Some Websites will take your DNA sequence and strip away any numbers, spaces, or otherwise non-DNA characters. This convenience is not found everywhere so it is best to get into the habit of using FASTA format. The NCBI also allows you to enter accession numbers, but in this case, paste the FASTA format of DD148865 into the Query Sequence field, including the annotation line. The BLAST program will grab the definition from this FASTA file and label your results. This will help you stay organized when conducting more than one BLAST search at a time. Depending on the Internet browser you are using, you may see the sequence underlined because the browser will interpret the 631 As, Ts, Gs, and Cs as a spelling error, but you can ignore that warning.
● The Database field. Next we have to choose the database to search. There are many to choose from in the drop-down menu; select the Reference RNA sequences, also known as RefSeq mRNA (refseq_rna) (Figure 3.4). This is a specialized, nonredundant database of sequences containing the definitive reference sequence for RNAs. Note that RNA sequences are represented as DNA: you will see Ts instead of Us.
● The Organism field. This field allows you to limit your search to a specific organism’s DNA. We could search all species, which is the default, but since the query is from Homo sapiens, and every human gene has (in theory) been sequenced, why not search a much smaller database and get a direct answer? Your results will also come back faster because of the smaller database size and the NCBI will use fewer computer resources to deliver your results. Enter “Homo sapiens” in this field (Figure 3.4). You’ll see that as you begin to type this in the text box, choices will come up and you can finish your entry by clicking on “Homo sapiens (taxid:9606).”
● Program Selection. The default setting uses a version of BLASTN called megaBLAST. This version of BLAST is optimized to find nearly identical hits and is only found at the NCBI Website. We will often be looking for distantly related hits so BLASTN will be used in this book (Figure 3.5). Click on the radio button next to this choice.
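Although this chapter uses only the Web form, it can help to see the four form choices written down as plain data. The sketch below collects them as parameters in the style of the NCBI BLAST URL API. It does not contact the NCBI. The helper name is hypothetical, and leaving out the MEGABLAST parameter is assumed here to correspond to choosing regular BLASTN rather than megaBLAST. Check the NCBI documentation before submitting a real request.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"  # NCBI BLAST address


def blastn_form_settings(fasta_text, organism_taxid=9606, database="refseq_rna"):
    """Collect the chapter's form choices as request parameters (nothing is sent)."""
    return {
        "CMD": "Put",                                   # submit a new search
        "PROGRAM": "blastn",                            # Program Selection
        "DATABASE": database,                           # Database field
        "ENTREZ_QUERY": f"txid{organism_taxid}[ORGN]",  # Organism field
        "QUERY": fasta_text,                            # Query Sequence field
    }


settings = blastn_form_settings(">example query\nGCAACTGTGTTCACTANCAA")
print(BLAST_URL + "?" + urlencode(settings))
```

Writing the settings out this way makes it easy to confirm that the database, organism, and program are what you intended, which is the same habit of reviewing every setting that the Web form calls for.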

Box 3.1 Uncertainty codes
As you work with nucleic acid sequence records and read scientific articles, you may come across representations of bases which are not A, T, G, or C. In fact there are several of these in Figure 3.3. These nonstandard letters are actually the uncertainty codes proposed by the International Union of Biochemistry and Molecular Biology (IUBMB) to represent certain groupings of nucleotides. On occasion, software or a scientist is unable to decide if a newly sequenced base is, for example, an A or a G. The three IUBMB symbols most frequently encountered are R for the puRines (A or G), Y for the pYrimidines (C or T), and N for aNy nucleotide. Here are the uncertainty codes:

IUBMB symbol    Definition
R               A or G
Y               C or T
K               G or T
M               A or C
W               A or T
S               C or G
B               C or G or T
D               A or G or T
V               A or C or G
H               A or C or T
N               G or A or T or C
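The table above translates directly into a lookup. The short sketch below, with hypothetical names, stores each symbol with the bases it can stand for and reports where such symbols occur in a sequence.

```python
# Uncertainty codes from the table above: symbol -> bases it may represent
UNCERTAINTY_CODES = {
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "V": "ACG", "H": "ACT", "N": "ACGT",
}


def uncertain_positions(seq):
    """Return (1-based position, symbol, possible bases) for each uncertainty code."""
    return [
        (i, base, UNCERTAINTY_CODES[base])
        for i, base in enumerate(seq.upper(), 1)
        if base in UNCERTAINTY_CODES
    ]


def could_be(symbol, base):
    """True if `symbol` (a standard base or an uncertainty code) may stand for `base`."""
    symbol, base = symbol.upper(), base.upper()
    return base in UNCERTAINTY_CODES.get(symbol, symbol)


print(uncertain_positions("TCACTANCAA"))
print(could_be("R", "G"), could_be("Y", "G"))
```

Running `uncertain_positions` on the full sequence in Figure 3.3 is a quick way to check your own visual inspection of the record.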

Figure 3.4 BLASTN database choices. The drop-down menu lists the databases and the species can be entered in the “Organism” field.


Figure 3.5 Select BLASTN as the program to use.

The search is now ready. Click on the “BLAST” button and wait for the results. Times will vary but this incredibly fast-working program will break your 631-nucleotide sequence into small “words,” search a database measured in the many thousands of sequences, identify the best hits, try to extend and join the sequences of similarity, then generate the statistics so you can better evaluate the hits. By the time you finish reading this brief description, all of this may have been performed and you may already be looking at the results. Depending on your browser settings, the next time you visit this BLASTN form, the defaults may have changed to what you used in this search. Make it a habit to review all settings before initiating a search.

3.4 BLAST RESULTS
The results from a BLAST search are divided into three sections: the graphic pane, a results table, and the alignments between the query and the hits. Although some conclusions can be obtained based on interpretations of sections individually, it is best to consider all three sections and draw upon their complementary content during your analysis. These sections are described sequentially below, but when reviewing your results, you should move back and forth between sections as needed.

Graphic The colorful graphic of the BLAST search results shows the query length across the top (see Figure 3.6 and color plates). The line goes from 0 to over 600, corresponding to the 5P end and 3P end, respectively. The sequences found by BLAST, the “hits,” appear below as horizontal bars in rows. If a hit lined up to every nucleotide of the query, then the bar goes across completely, as is almost the case in the first row. If other hits were similar to only the extreme 3P end of the query, then there would be short bars to the right, as there are in various rows of this graphic. To save space, several hits are often placed in the same row, as space and location allows. For example, the second row has a long red bar, and much shorter black and green bars (see Figure 3.6 and color plates). Bars in the same row do not represent different regions from the same sequence; they are different hits. Different regions of the same sequence are joined by a thin line, but there are none in this example. The color-coding within the graphic is generated by the statistics of each hit. As indicated by the key at the top of the graphic, hits with the highest score are red, the next highest are purple, then green, and so on. In this BLAST search, hits from all ranges of scores are found. What we see in Figure 3.6 (see also color plates) is a single high-scoring hit which covers almost the entire length of the query (the red bar in the first row). Moving down the graph, there are five other high-scoring hits (red) that line up with (approximately) the first 475 nucleotides of the query. Two moderately scoring hits (purple) line up with about 200 nucleotides of the query, and the rest are short, low-scoring hits (green, blue, and black bars). These short bars appear in several stacks because members within the stack have something in common with the same place in the query. By floating your computer mouse over each


Figure 3.6 The graphic pane of the NCBI BLASTN results. The query coordinates and length correspond to the numbered scale across the top. Sequences found by BLAST, “hits,” are represented as horizontal bars below this scale. These will vary in length, position, and color-coded scoring, shown here in gray shades. See color plates for a color version of this figure.

bar, the sequence definition line appears in the small window above the graphic. By clicking on the individual bars you can navigate to the alignments between the query and the hits. It is important to realize that these short bars show the length of the similarity between the query and the hit and do not necessarily represent the entire length of the hit. For example, the hit might be one million bases long, but only contain 30 nucleotides in common with the query. This identity will be represented by a bar that is 30 nucleotides long.
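The color-coding and bar lengths described above can be sketched in a few lines of Python. This is only an illustration, not NCBI's implementation: the score cut-offs (40, 50, 80, and 200) are an assumption based on the color key printed on the NCBI results page and are not stated in the text, and the function names are made up for this example. The scores and query coordinates come from alignments shown later in this section.

```python
# Map a hit's bit score to the color band used in the graphic pane.
# ASSUMPTION: the cut-offs (40, 50, 80, 200) are those printed on NCBI's
# color key; they are not stated in the chapter text.
COLOR_BANDS = [
    (200.0, "red"),
    (80.0, "purple"),
    (50.0, "green"),
    (40.0, "blue"),
]


def bar_color(bit_score: float) -> str:
    """Return the color band for a score; anything under 40 is black."""
    for lower_bound, color in COLOR_BANDS:
        if bit_score >= lower_bound:
            return color
    return "black"


def bar_length(query_start: int, query_end: int) -> int:
    """Length of the bar: the aligned stretch of the query, not the hit length."""
    return query_end - query_start + 1


if __name__ == "__main__":
    # (accession, bit score, query start, query end) from this section's alignments.
    examples = [
        ("NM_000518", 1085.0, 2, 612),
        ("XR_132577", 141.0, 312, 486),
        ("NR_045035", 53.6, 577, 631),
        ("NM_000794", 35.6, 347, 375),
    ]
    for accession, score, start, end in examples:
        print(accession, bar_color(score), bar_length(start, end))
```

Note that `bar_length` depends only on the query coordinates, which is why a hit that is a million bases long can still appear as a 30-nucleotide bar.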

Interpretation of the graphic Even without knowing the identities of the hits shown in the graphic, you can still conclude several things. First, there appears to be only one Homo sapiens reference mRNA sequence that aligns with almost the entire query. That is, BLAST was able to align almost every section of the query, in a continuous fashion, across a similar length of sequence within the hit. Considering the other high-scoring hits, it appears that the 3′ end of the query does not align with anything else in the database that is high scoring (ignoring the low-scoring hits for now). The endpoint of the five short red bars is around nucleotide 475, and two other purple bars align between nucleotides 300 and 475. Does this reflect some kind of feature about the query? More information is needed to find the answer and this is discussed later.

Results table Below the graphic is the BLAST results table that provides basic information about the hits along with the statistics of each hit. This table is long but the top hits are shown in Figure 3.7.


Figure 3.7 Tabular display of BLASTN hits. BLASTN lists the significant hits in a table that has both identifiers and descriptions of the hits, as well as statistical measures of the significance.

The first column in this table lists the accession numbers of the hits in the database. These accession numbers are hypertext so you can follow these to the RefSeq records where you can find the annotation on these sequences. The next column is the description. BLAST takes the information from the DEFINITION line in the database record and places it here, but because of space limitations, many of the descriptions will appear truncated. The next two columns are associated with the statistics of the database search. As mentioned earlier, BLAST uses statistics to sort through all the hits, shows you only the best, and then tells you why (statistically) they are the best. The first of these numbers is called the Max Score. Although the average user of BLAST often overlooks it, the change in this score is often important. If you see a sudden drop in the Max Score, expect to see a change in the query–hit alignment length, quality, or both. The next column, “Total Score,” becomes important when BLAST finds multiple, but not joined, sections of similarity between the query and the hit. For each area of similarity, BLAST generates an alignment and a score. If the Max Score is equal to the Total Score, then only a single alignment is present. If the Total Score is larger than the Max Score, then multiple alignments must be present and their individual scores have contributed to the Total Score. For this BLASTN with DD148865, the values in Max Score and Total Score are identical, indicating that only single alignments were generated for this BLASTN search. “Query Coverage” is the next column. The original query, DD148865, is 631 nucleotides long. If BLAST can align all 631 nucleotides of this query against a hit, then that would be 100% coverage. Remember, Query Coverage does not take into account the length of the hit, only the percentage of the query that aligns with the hit. 
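The relationship between Max Score, Total Score, and Query Coverage can be made concrete with a small sketch. The `HSP` record and the function names are hypothetical, chosen for this illustration only; the beta and delta hemoglobin values come from the alignments in this section, while the two-alignment hit at the end is invented to show the case where Total Score exceeds Max Score.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One alignment between the query and a hit (1-based, inclusive coordinates)."""
    query_start: int
    query_end: int
    bit_score: float


def max_score(hsps: List[HSP]) -> float:
    """Score of the single best alignment to this hit."""
    return max(h.bit_score for h in hsps)


def total_score(hsps: List[HSP]) -> float:
    """Sum of the scores of all alignments to this hit."""
    return sum(h.bit_score for h in hsps)


def query_coverage(hsps: List[HSP], query_length: int) -> float:
    """Percentage of query positions covered by at least one alignment.

    The hit's own length plays no part in the calculation.
    """
    covered = set()
    for h in hsps:
        covered.update(range(h.query_start, h.query_end + 1))
    return 100.0 * len(covered) / query_length


if __name__ == "__main__":
    query_length = 631  # DD148865
    beta = [HSP(2, 612, 1085.0)]   # NM_000518, one alignment
    delta = [HSP(3, 472, 706.0)]   # NM_000519, one alignment
    # Invented hit with two separate alignments, for illustration only.
    invented = [HSP(10, 100, 90.0), HSP(300, 380, 60.0)]

    for name, hsps in [("beta", beta), ("delta", delta), ("invented", invented)]:
        print(name, max_score(hsps), total_score(hsps),
              int(query_coverage(hsps, query_length)))
```

When a hit has a single alignment, the two scores are equal, as they are for every hit in this search.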
Next is the E value or Expect value, which represents the number of hits you would expect to find by chance given the quality of the alignment and the size of the database. If a database of only As, Ts, Gs, and Cs gets sufficiently large, you

start finding sequence similarities by chance, particularly with short queries. The E value in BLAST takes into account both the length and composition of the alignment along with the percentage identity found. A number close to zero means that the hit is very unlikely to be due to chance and can be treated as significant. BLAST results tables are sorted by E value, the most significant hits appearing at the top. When there are two or more identical E values, the Max Score is then used to sort the hits. The next column is called “Maximum Identity.” BLAST calculates the percentage identity between the query and the hit in a nucleotide-to-nucleotide alignment. If there are multiple alignments with a single hit, then only the highest percent identity is shown. The last column, “Links,” contains links to databases for that hit and these will not be discussed here.
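To see how score and database size feed into the E value, the sketch below uses the standard relationship E = m × n × 2^(−S), where S is the bit score, m the query length, and n the database length. The chapter does not give this formula, so treat it as background rather than part of the text. The database length of 100 million nucleotides is a made-up figure, and real BLAST uses slightly shorter "effective" lengths, so the numbers will not reproduce the table exactly.

```python
import math


def log10_expect(bit_score: float, query_length: float, database_length: float) -> float:
    """log10 of E = m * n * 2**(-S), computed in log space to avoid underflow."""
    return (math.log10(query_length)
            + math.log10(database_length)
            - bit_score * math.log10(2.0))


def expect_value(bit_score: float, query_length: float, database_length: float) -> float:
    """E value as a float; values too small to represent are returned as 0.0."""
    exponent = log10_expect(bit_score, query_length, database_length)
    if exponent < -300:
        return 0.0
    return 10.0 ** exponent


if __name__ == "__main__":
    query_length = 631          # DD148865
    database_length = 1e8       # ASSUMPTION: hypothetical database size
    for score in (35.6, 53.6, 141.0, 1085.0):
        print(score, expect_value(score, query_length, database_length))
```

Two behaviors match the description above: doubling the database length doubles the E value, and a very high score drives the E value so low that it is simply reported as 0.0.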

Interpretation of the table The top two hits are very significant, both having E values of 0.0. Even the eighth hit has a very small E value (1e-32) so this search has found a number of significant hits. Based on the first line of the table, it appears that the unknown query used in this BLASTN is nearly identical to the human beta hemoglobin mRNA sequence. Over 96% of the query’s length is 99% identical to the first hit. Just concentrating on the Maximum Identity column, the other top hits, although strong, show 93% or less identity when considering large portions of the query length. Note that the 6% drop in Maximum Identity and 22% drop in Query Coverage translated to a significant drop in the Max Score between the first and second hit. The descriptions of the next four hits reveal that the query has found other members of the hemoglobin family: delta, epsilon 1, gamma G, and gamma A. The next three hits, starting with accession number NR_001589, are a hemoglobin pseudogene and two predicted genes. The pseudogene may be both divergent and missing sequence found in the functional family members. Note that its Query Coverage is higher than that of the hit above it, but the percent identity is lower and the E value is greater. Gene predictions are often generated automatically, without human supervision, and can be based on incomplete experimental evidence. These may be missing portions, such as exons, of the real gene. The Maximum Identity of these two predictions to the query is identical to that seen with the epsilon 1 transcript (the third hit), but the Query Coverage is significantly lower. It would be interesting to analyze these sequences in detail to support or refute these predictions. After the two predicted genes, there is a dramatic drop in the Query Coverage, and the E value makes a huge jump. 
Although there are multiple hits that have high Maximum Identity, the large E value (approaching or greater than 1) indicates that the identities seen can be due to chance. Exploring the annotation of these hits would find that they have significantly different functions from hemoglobin. Not knowing anything else about the identities that are seen for these low-scoring hits, it is easy to conclude that these are not significant.
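The sorting rule and the "could be chance" judgment described above can be sketched as follows. The accession numbers, scores, and E values are taken from the alignments shown in this section; the dictionary layout, the function names, and the cut-off of 0.1 are assumptions made for this illustration (the text only says "approaching or greater than 1").

```python
from typing import Dict, List

# A few hits from this BLASTN search, deliberately listed out of order.
HITS: List[Dict] = [
    {"accession": "NM_000794", "max_score": 35.6, "evalue": 1.4},
    {"accession": "NM_000519", "max_score": 706.0, "evalue": 0.0},
    {"accession": "NR_045035", "max_score": 53.6, "evalue": 5e-06},
    {"accession": "NM_000518", "max_score": 1085.0, "evalue": 0.0},
    {"accession": "XR_132577", "max_score": 141.0, "evalue": 1e-32},
]


def sort_like_results_table(hits: List[Dict]) -> List[Dict]:
    """Sort by E value (ascending); break ties with Max Score (descending)."""
    return sorted(hits, key=lambda hit: (hit["evalue"], -hit["max_score"]))


def possibly_chance(hits: List[Dict], cutoff: float = 0.1) -> List[str]:
    """Accessions whose E value is at or above the (hypothetical) cutoff."""
    return [hit["accession"] for hit in hits if hit["evalue"] >= cutoff]


if __name__ == "__main__":
    for hit in sort_like_results_table(HITS):
        print(hit["accession"], hit["max_score"], hit["evalue"])
    print("Treat with caution:", possibly_chance(HITS))
```

The two hits with an E value of 0.0 are separated only by their Max Scores, which is the tie-breaking behavior described for the results table.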

The alignments Below the table are the alignments (also called high-scoring segment pairs, HSPs) between the query and the hits. The statistics seen in the BLAST table are repeated here, along with additional important numbers. Within the alignments, the E value is called “Expect.” The description line is now shown in its entirety and the length of the hit is shown. The alignments clearly show the relationships that BLAST has found between your query and the hits. Examine the data in Figure 3.8 above the alignment between the query (DD148865) and the first hit, NM_000518. The identity is 99%, with 607 of the 611 aligned positions identical. In the graphic above, this hit was shown as a solid red line almost all the way across the length of the query. Here, BLAST shows this by displaying horizontal pairs of sequence; the query is shown above the hit, now labeled the Subject (Sbjct). There are vertical lines between the two sequences wherever they have the same nucleotide. Continuous alignments are


Figure 3.8 BLASTN alignment between Query DD148865 and Sbjct, NM_000518. The Query and Sbjct lines are labeled and aligned bases have a vertical bar “|” between identical bases. Notice that nucleotide 2 of DD148865 aligns with nucleotide 17 of NM_000518, and nucleotide 61 aligns with nucleotide 76.

>ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
Length=626

 Score = 1085 bits (1202), Expect = 0.0
 Identities = 607/611 (99%), Gaps = 1/611 (0%)
 Strand=Plus/Plus

Query  2    CAACTGTGTTCACTANCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAA  61
            ||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||
Sbjct  17   CAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAA  76

Query  62   GTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGABGCCCT  121
            |||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct  77   GTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCT  136

Query  122  GGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCT  181
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  137  GGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCT  196

Query  182  GTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCT  241
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  197  GTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCT  256

Query  242  CGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT  301
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  257  CGGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACT  316

Query  302  GAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAA  361
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  317  GAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAA  376

Query  362  CGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGC  421
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  377  CGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGC  436

Query  422  TGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATSACTAAGC  481
            |||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||
Sbjct  437  TGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGC  496

Query  482  TCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTA  541
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  497  TCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTA  556

Query  542  AACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAAGCATTT  601
            |||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct  557  AACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAA-CATTT  615

Query  602  ATTTTCATTGC  612
            |||||||||||
Sbjct  616  ATTTTCATTGC  626

displayed in lines of 60 nucleotides, with both query and subject wrapping to new lines until the alignment ends. Each line is numbered at the beginning and the end, allowing you to see where the alignment begins and ends. Looking at this first alignment you can easily conclude that your “unknown” query, DD148865, is Homo sapiens beta hemoglobin. The query is nearly fully aligned to a reference sequence for human beta hemoglobin, NM_000518. What do the statistics tell you? This first hit is not a chance finding. In fact, the chance that you would encounter this randomly is so low that the calculated E value is reported as 0.0: the chance that you would find this by accident is effectively zero. The unknown must be Homo sapiens beta hemoglobin mRNA. Look back at Figure 3.2 and read the title of the patent associated with this sequence: “A group of genes which is differentially expressed in peripheral blood cells, and diagnostic methods and assay methods using the same.” Our conclusion is consistent with the subject matter

of the patent. Would it have been easier to just read the patent? It is important to remember that many sequences are published in patents before their identities are known, and so reading the patent may not reveal the identity. In this case, a simple BLAST search gave us the identity. The graphic (see Figure 3.6 and color plates) and the alignment (Figure 3.8) might suggest that DD148865 is a complete copy of beta hemoglobin, at least compared to this reference sequence, but it is not so. Notice the numbering of the lines at the beginning, or the 5′ end, of the sequence. These coordinates indicate that the second nucleotide of the query aligns to nucleotide 17 of the reference sequence. At the 5′ end, DD148865 is missing 16 nucleotides possessed by the reference sequence, NM_000518. In DD148865, the first base is a G while the 16th base of NM_000518 is an A. Rather than start the alignment with a mismatch, BLAST just trimmed off the first base of DD148865. At the other end of this first line, see that nucleotide 61 of the Query lines up with nucleotide 76 of the Sbjct. The statistics just above this alignment indicate that 607 of the 611 aligned positions are identical. Where are the four mismatches? The one discussed above at the beginning of the query is not included in these numbers because it is not shown. The first is in the first row of the alignment at coordinate 17 of the query. The authors knew there was a base here but could not tell which base. So the IUBMB symbol N for “any base” was used. The second mismatch is in line two of the alignment, at coordinate 116 of the query. The authors were not sure which nucleotide was here but they knew it was a C, G, or T. Thus, the IUBMB symbol for these possibilities (B) appears in this position. In the reference sequence NM_000518, the nucleotide in this position is a G, consistent with the nucleotide code (B) in DD148865. 
The third mismatch is at query coordinate 474 where there is an S (C or G) in the query and a C in the reference position, again consistent with the correct base. The final mismatch seen at coordinate 596 of the query is different (Figure 3.8). Here, the query has an extra base, a G, so a gap (-) is inserted in the reference sequence at this position. This maintains the alignment for the remainder of both sequences. Which is correct? The reference sequence should definitely be given serious consideration; however, this single base may be key to the author’s reason for submitting this sequence. Is this an interesting mutation, either natural or synthetic? Or is the sequence just incorrect at this position? Further reading and analysis would be required to answer this question. For example, can you find any other human beta globin mRNA sequences with this extra base? Look at the last lines of the alignment between this query and the hit: nucleotide 612 of DD148865 aligns with nucleotide 626 of NM_000518. What is the length of the entire query sequence? Looking back at the annotation for this record, the DD148865 sequence is 631 nucleotides long (top line of Figure 3.2). The alignment stops short of the poly(A) stretch, visible earlier at the end of the DD148865 sequence file (see Figure 3.3). Notice that the length of the reference sequence NM_000518 is given in Figure 3.8 as “Length=626.” BLAST took the alignment as far as it could go and then stopped when it ran out of reference sequence at the 3′ end. An added complication to the alignment interpretation is the poly(A) tail in DD148865. The poly(A) tails of many sequences are trimmed before the sequence is submitted to GenBank and other databases. Others are not trimmed, as seen in this example. It is possible that the original poly(A) tail for NM_000518 started in the same location as DD148865. But the poly(A) tail for sequences is variable in both length and location. 
Furthermore, RefSeq sequences are compiled from one or more sequences and the curators of RefSeq may have trimmed this sequence here for other reasons. Regardless, we can’t be sure what’s going on at the 3′ end, but the rest of the alignment is excellent and requires little consideration. As you work through this book, it is highly recommended that you generate simple drawings (pencil and paper are fine!) to understand the relationships that BLAST finds for you. Figure 3.9 is an example “sketch” which shows the relationship


Figure 3.9 A rough “sketch” outlining the relationship between the query and the hit. Coordinates are taken directly from the BLAST alignment. Although this is an “electronic” sketch, pencil and paper work fine.

                  1                              612 631
Query 5′          |--------------------------------|--| 3′
Sbjct 5′ |--------|--------------------------------| 3′
         1        16                             626

between the sequences described above, based on the coordinates. It shows that the query is missing the 5′ end but has more nucleotides at the 3′ end, although they are a stretch of As. Does this mean that DD148865 is only a fragment? This is a strong possibility, but it will take more analysis to confirm this. Remember that cDNA synthesis is an imperfect process and truncations, particularly at the 5′ end of cDNAs, are very common. The poly(A) tail is an obvious difference but database records are inconsistent about this feature. There are many sequences in RefSeq that have poly(A) tails, but NM_000518 does not. Remember that a poly(A) tail is added to transcripts after transcription takes place, so if you look at the genomic DNA, you would not see a stretch of As at the end of the last exon.
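The hunt for the four non-identical positions in Figure 3.8 can also be done mechanically. The sketch below walks one row of an alignment, keeps track of the query and subject coordinates, and classifies each difference using the standard nucleotide ambiguity codes (N, B, S, and so on). The function name and the three labels ("ambiguity", "gap", "mismatch") are choices made for this illustration; the two rows in the demonstration are copied from the first and tenth lines of Figure 3.8.

```python
# Nucleotide ambiguity codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def differences(query_row, sbjct_row, query_start, sbjct_start):
    """List (query_coord, sbjct_coord, query_base, sbjct_base, kind) per differing column.

    kind is "gap" when either row has "-", "ambiguity" when the query's code
    includes the subject base, and "mismatch" otherwise. For a gap, the
    coordinate reported for the gapped sequence is that of its next base.
    """
    found = []
    q_pos, s_pos = query_start, sbjct_start
    for q_base, s_base in zip(query_row.upper(), sbjct_row.upper()):
        if q_base != s_base:
            if "-" in (q_base, s_base):
                kind = "gap"
            elif s_base in IUPAC.get(q_base, ""):
                kind = "ambiguity"
            else:
                kind = "mismatch"
            found.append((q_pos, s_pos, q_base, s_base, kind))
        # A gap character does not use up a coordinate in its own sequence.
        if q_base != "-":
            q_pos += 1
        if s_base != "-":
            s_pos += 1
    return found


if __name__ == "__main__":
    row1_query = "CAACTGTGTTCACTANCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAA"
    row1_sbjct = "CAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAA"
    row10_query = "AACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAAGCATTT"
    row10_sbjct = "AACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAA-CATTT"
    print(differences(row1_query, row1_sbjct, 2, 17))
    print(differences(row10_query, row10_sbjct, 542, 557))
```

Run on the first row, this reports the N at query coordinate 17; on the tenth row, it reports the extra G at query coordinate 596, matching the positions discussed above.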

Other BLASTN hits from this query It is a good habit to look at additional alignments to confirm any conclusions you make from the first alignment and to learn more about your query. The second hit, NM_000519, is delta hemoglobin (Figure 3.10). If you knew nothing about the beta hemoglobin gene before, you now know that it is a member of a gene family and it is quite similar to another member, delta. The identity between the query and this second hit is 93%, 438 nucleotides out of 470. Looking at the alignment closely, you can now see nucleotides that do not align. For example, in the second line you can see that seven nucleotides are mismatched and missing the vertical line of identity between the query and reference sequences. Still, 93% identity is high and visually the alignment still appears strong. If these nucleotide differences fell within the coding region of the delta hemoglobin transcript, these could translate into a different amino acid sequence, but delta and beta hemoglobin would look quite similar in protein sequence nevertheless. The alignment length is shorter than that seen with the first hit. Only 470 nucleotides of the query align with delta hemoglobin, and 438 of these are identical (remember, the total length of the query is 631 nucleotides). The decrease in this length came from the 3′ end. This can be seen with the 3′ coordinates of the query (472 versus 612, above). However, the 5′ end and 3′ end of delta hemoglobin mRNA appear to be longer than that of beta hemoglobin. Draw a picture to see the difference. The annotation for the delta hemoglobin sequence indicates that the coding region is from nucleotides 196 to 639 of NM_000519. These numbers are in good (but not exact) agreement with the alignment. That is, the coding region between these two globins is relatively conserved and aligns well with BLAST, but the noncoding regions have diverged enough to not align. 
It appears that sequence immediately upstream of the coding regions (approximately nucleotides 163–195) is still quite conserved, reflecting the importance of this sequence. Comparing the first two hits, the E value stays the same (0.0), but the Score (Max Score in the results table) drops from 1085 to 706. The drop in score reflects the decrease in alignment length between the query and the hit, from 611 to 470 nucleotides, as well as the drop in percent identity. This shows how the Max Score can sometimes tell you more than the E value. Let’s now skip to the seventh alignment between the query and XR_132577 (Figure 3.11). This is a predicted mRNA. It has never been isolated in the laboratory, but instead it was stitched together by a gene prediction tool that took cDNAs

Figure 3.10 Alignment between query DD148865 and the second BLAST hit, delta hemoglobin NM_000519. Nonidentical nucleotides lack the vertical bar “|” seen between identical nucleotides.

>ref|NM_000519.3| Homo sapiens hemoglobin, delta (HBD), mRNA
Length=774

 Score = 706 bits (782), Expect = 0.0
 Identities = 438/470 (93%), Gaps = 0/470 (0%)
 Strand=Plus/Plus

Query  3    AACTGTGTTCACTANCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAG  62
            ||| |||||||||| |||||||||||||||||||||||||||||||||||||||||||||
Sbjct  163  AACAGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAG  222

Query  63   TCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGABGCCCTG  122
             |||| || | ||||||||||||||| ||||||||||||| |||||||||||| ||||||
Sbjct  223  ACTGCTGTCAATGCCCTGTGGGGCAAAGTGAACGTGGATGCAGTTGGTGGTGAGGCCCTG  282

Query  123  GGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTG  182
            |||||  | |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  283  GGCAGATTACTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTG  342

Query  183  TCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTC  242
            ||| ||||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct  343  TCCTCTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAGGTGCTA  402

Query  243  GGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTG  302
            ||||||||||||||||||||||||||||||||||||||||||||||| ||| |    |||
Sbjct  403  GGTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACTTTTTCTCAGCTG  462

Query  303  AGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAAC  362
            ||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||
Sbjct  463  AGTGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCTTGGGCAAT  522

Query  363  GTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCT  422
            |||||||| |||||||||||||   |||||||||| ||||||||||||| | ||||||||
Sbjct  523  GTGCTGGTGTGTGTGCTGGCCCGCAACTTTGGCAAGGAATTCACCCCACAAATGCAGGCT  582

Query  423  GCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTA  472
            ||||||||||| ||||||||||||||||||||||||||||| ||||||||
Sbjct  583  GCCTATCAGAAGGTGGTGGCTGGTGTGGCTAATGCCCTGGCTCACAAGTA  632

Figure 3.11 The seventh hit (XR_132577) from the BLASTN results with query DD148865. This gene prediction is very divergent compared to the top hits.

>ref|XR_132577.1| PREDICTED: Homo sapiens hypothetical LOC100653006 (LOC100653006), miscRNA
Length=255

 Score = 141 bits (156), Expect = 1e-32
 Identities = 136/175 (78%), Gaps = 0/175 (0%)
 Strand=Plus/Minus

Query  312  CACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTC  371
            ||||||||||||||||| ||||||||||||||||| | ||||||||| || ||||||||
Sbjct  255  CACTGTGACAAGCTGCATGTGGATCCTGAGAACTTAAAGCTCCTGGGAAATGTGCTGGTG  196

Query  372  TGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAG  431
               ||  ||||    || || |||||||||||||||||    ||||||||| |||  |||
Sbjct  195  ACCGTTTTGGCAATCCATTTCGGCAAAGAATTCACCCCTGAGGTGCAGGCTTCCTGGCAG  136

Query  432  AAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATSACTAAGCTCGCT  486
            ||  ||||| |||| ||||| | ||||||| ||  ||  ||  ||| ||||| ||
Sbjct  135  AAGATGGTGACTGGAGTGGCCAGTGCCCTGTCCTCCAGATACCACTGAGCTCACT  81

originating from this genomic region as evidence for its existence. This predicted transcript shows enough similarity to our query to appear on the list of hits with a very significant E value: 1e-32. There is 78% identity over a 175-nucleotide alignment, much smaller than the values seen with beta and delta hemoglobin. The sequence seems to be related to the query, but is clearly not a close member of the family. Maybe an exon was copied and moved to a distant location. Let’s return to the results table for a moment. Below the two predicted sequences (XR_132577 and XR_132954) the table shows a number of hits with increasing E values, well above the very low numbers seen at the top of the table. The biological functions of these low-scoring hits are also very varied, with no common theme for primary function. There is a cluster of hits to transcription variants of “Homo sapiens anterior pharynx defective 1 homolog A,” and the alignment of one of these is shown in Figure 3.12A. Notice that the poly(A) tail of the query contributed a large number of the aligned bases. BLAST tries to emphasize this to you by changing these nucleotides (a simple sequence) to lowercase. There may be some significance to these two 3′ ends being similar but the statistics say that care should be taken. Without the poly(A) contribution, only 36 nucleotides of the query’s remaining 612 nucleotides show some similarity to this transcript, which is hardly significant. Hit number 44 (NM_000794) of this BLASTN result has a Max Score of 35.6 and an E value of 1.4. By comparison, the first hit has a Max Score of 1085 and E value of 0.0. Alignments with numbers like those for hit number 44 can be expected to be very short and insignificant (Figure 3.12B). Now go back to the graphic. 
By floating your mouse over the bars in the graphic, find the bar that represents the alignment with the dopamine receptor D1 (NM_000794) seen in Figure 3.12B (hint: use the coordinates of the query and the score-expected color to narrow your region of searching). There are other hits that align to the same region, based on the stack of bars in the graphic. What does this mean? Dopamine is a neurotransmitter and this receptor is expressed in the brain. This function seems very different to that of the hemoglobins, which is to bind oxygen in the blood. A quick look at the dopamine receptor annotation shows that the region of alignment, between nucleotides 1053 and 1081, is

Figure 3.12 Low-scoring alignments from BLASTN with query DD148865. Compared to the top hit, these alignments have a very low Score, a high Expect value, and a very short alignment length. (A) Hit number 9, NR_045035. (B) Hit number 44, NM_000794.

(A)
>ref|NR_045035.1| Homo sapiens anterior pharynx defective 1 homolog A (C. elegans) (APH1A), transcript variant 7, non-coding RNA
Length=2216

 Score = 53.6 bits (58), Expect = 5e-06
 Identities = 45/55 (82%), Gaps = 3/55 (5%)
 Strand=Plus/Plus

Query  577   GATTCTGCCTAATAAAAAAGCATTTATTTTCATTGCaaaaaaaaaaaaaaaaaaa  631
             |||| || |||||||||||| ||   || | ||||  ||||||||||||||||||
Sbjct  2159  GATTTTGACTAATAAAAAAGAAT---TTGTAATTGTGAAAAAAAAAAAAAAAAAA  2210

(B)
>ref|NM_000794.3| Homo sapiens dopamine receptor D1 (DRD1), mRNA
Length=3373

 Score = 35.6 bits (38), Expect = 1.4
 Identities = 25/29 (86%), Gaps = 0/29 (0%)
 Strand=Plus/Plus

Query  347   CAGGCTCCTGGGCAACGTGCTGGTCTGTG  375
             || ||||||||| |||  |||||||||||
Sbjct  1053  CACGCTCCTGGGGAACACGCTGGTCTGTG  1081

in the coding region of the receptor. Remembering that codons are in groups of three nucleotides, a quick calculation says that this small alignment of 29 nucleotides encodes about 10 amino acids which appear to be similar between beta hemoglobin and the dopamine D1 receptor. Considering that these two proteins are 147 and 477 amino acids long, respectively, such a short similarity does not suggest similarity in protein function. But, nevertheless, it would be fun to investigate this small stretch of amino acids, especially since many other proteins share something in common with hemoglobin. You will need additional skills to study this, and you will receive them later on in this book.
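The "quick calculation" above can be written out explicitly. Because the reading frame of the 29-nucleotide stretch is not given here, the sketch below simply tries all three possible frame offsets; the function name and the idea of counting both complete and partially covered codons are choices made for this illustration.

```python
def codon_counts(length: int, frame_offset: int):
    """Return (complete codons, codons touched) for an aligned stretch.

    frame_offset is the number of bases (0, 1, or 2) at the start of the
    stretch that belong to a codon begun upstream of the alignment.
    """
    remaining = length - frame_offset
    complete = remaining // 3
    leftover = remaining % 3
    touched = complete + (1 if frame_offset else 0) + (1 if leftover else 0)
    return complete, touched


if __name__ == "__main__":
    alignment_length = 29            # query 347-375 against NM_000794
    for offset in (0, 1, 2):
        print(offset, codon_counts(alignment_length, offset))

    # Share of each protein that roughly 10 residues represent.
    for protein, residues in (("beta hemoglobin", 147), ("dopamine receptor D1", 477)):
        print(protein, round(100.0 * 10 / residues, 1), "%")
```

Whatever the frame, the stretch holds nine complete codons and touches ten or eleven, which is where "about 10 amino acids" comes from; that is a small fraction of either protein.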

Simultaneous review of the graphic, table, and alignments Although we reviewed the results sections independently, we did move back and forth between the sections to get a better view of the independent pieces of data. When getting your first look at BLAST results, it is often helpful to do a quick review of all the sections to get a general understanding of the results, and then examine the sections individually and more slowly for details. While looking at the table, you can visualize the hits across the query because you saw them in the graphics panel. When looking at the alignments, you can visualize the trend in the scores because you saw them in the table. And when you look at the graphics, you can visualize the identities because you have scanned the alignments. Now that you understand the components of the BLAST output, let’s look at the results again more quickly and introduce some additional results. Below is a fast-paced narrative to better demonstrate the quick review. In each case, look back at the section figure and follow along.

● The graphic shows a single high-scoring hit which stretches from end to end of the query. Five other high-scoring hits appear to end around query coordinate 475. Another pair of hits aligns from approximately nucleotide 300 to 475. Finally, there are multiple groups of low-scoring hits that are quite short in length. Some of these are at the 3′ end of the query; maybe the poly(A) stretch is responsible for most of these hits.

● Floating the mouse over the top red bar shows (in the small text window above the graphics pane) that the best hit is beta hemoglobin. The next five red bars represent the four members of the hemoglobin family (delta, epsilon 1, gamma G, and gamma A) and a pseudogene. Their bars are shorter than that of beta hemoglobin, consistent with the drop in Query Coverage seen in the table. The black bars are short and correspond to the low-scoring hits in the table. There is some vertical stacking of the black bars so this will have to be analyzed further by looking at the alignments.

● The table descriptions show multiple hits to members of the hemoglobin family. The Max Scores reflect a drop in quality and length of alignment (Query Coverage) after the best hit, beta hemoglobin. Four family members and a pseudogene show good alignment length and E values, but none are as good as the first hit, so it is safe to say that our query is closest to beta hemoglobin. Two predicted genes appear high on the list, but have shorter query coverage than the other family members. There are many other hits listed but their scores, query coverage, and E values are all very poor. Their descriptions are also varied, with names that look very different from the globins that bind oxygen.

● The alignments reflect what we have seen in the graphic and table. The best hit is an almost base-for-base alignment between the query and human beta hemoglobin. There are some missing bases at the ends of the alignments, but these should have little impact on deciding the identity of our unknown query. The next few alignments show that the sequences of the hemoglobin family are quite similar but some significant differences at the 3′ ends prevent full alignment. Except for some predictions, the rest of the hits appear to be insignificant.


3.5 BLASTN ACROSS SPECIES Now let’s perform another BLASTN search. Rather than try to identify an unknown, the goal of this search is to find hemoglobin genes in other animals.

BLASTN of the reference sequence for human beta hemoglobin against nonhuman transcripts In the first search of this chapter, we identified the reference mRNA sequence for human beta hemoglobin. Go back to those results and, using the accession number, retrieve this sequence from the database in FASTA format. Keep this beta hemoglobin browser window open for later reference. In another window, navigate to the NCBI BLASTN form and paste in the FASTA format for the reference sequence for human beta hemoglobin. When trying to identify our unknown, above, we restricted the BLASTN search to just reference sequences annotated as coming from Homo sapiens. This time, enter “vertebrates” in the Organism box, broadening the search but still restricting it to organisms that are likely to have hemoglobin (Figure 3.13). Again, select BLASTN (not megaBLAST) and launch the BLAST search. When the screen refreshes and the results appear, perform your quick review of the results and then look at the details. Let’s first look at the table (Figure 3.14). The first hit is the reference human beta hemoglobin cDNA: as expected, this BLAST query found itself in the database. Below the human sequence are hits from a variety of vertebrates. Should these Latin names look unfamiliar to you, use the NCBI Taxonomy database to look up the common name for these species. The leading hits are from primates, for example chimpanzee (Pan troglodytes), gibbon (Nomascus leucogenys), orangutan (Pongo abelii), marmoset (Callithrix

Figure 3.13 Configuring the BLASTN form to search reference mRNAs from other species.

Figure 3.14 Human beta hemoglobin BLASTN results table, showing hits across many species.

jacchus), and rhesus monkey (Macaca mulatta). Note that for many of these and other species, the mRNA is annotated as “Predicted,” reflecting, in most cases, that the sequence is derived from genomic sequencing, not sequencing of cDNA. This indicates how much attention these other animals have received in the study of their genes. Once you get outside of mainstream model organisms, relatively little mRNA/cDNA cloning of globins and many other genes has taken place. These genes were predicted based on very strong similarity to known mRNA sequences from human, mouse, rat, and other well-studied organisms.

Figure 3.14 shows that the E value for many of the top hits is 0.0. Even among these sequences, there is almost a twofold drop in Max Scores. The Query Coverage drops throughout the list until it gets to the cow (Bos taurus), where it is 100%. However, the percent identity between the human and cow sequences is only 81%, which pushed the cow hit down the list.

Looking at the alignment to the predicted chimpanzee beta hemoglobin, it is easy to see strong similarity between the human and chimpanzee sequences. The identities measurement indicates that the human and chimpanzee sequences are 99% identical. The alignment of 625 out of 626 nucleotides is nearly 100% (99.8%), but this field uses whole numbers and rounds this value down to 99%. With this and many other human–chimpanzee sequence alignments, you can clearly see that at the DNA level, chimpanzees are our closest relatives.

Moving down the table or the alignments, certain trends are seen. Unlike the first BLASTN, which found all the members of the human hemoglobin family among the top hits, this search finds many beta hemoglobin sequences from other species before most of the other human hemoglobin family members appear. Homo sapiens delta hemoglobin is the next human hit in the table, but no other human sequences are seen until much further down on the list.
This indicates that, at least at the mRNA sequence level, the beta hemoglobins in these top species are more similar to one another than human hemoglobin family members are to each other. At the bottom of the table in Figure 3.14, among all the beta hemoglobin hits, notice that the cow (Bos taurus) hits include “gamma” and “gamma 2,” which is a family member not seen in humans. As you explore genes from other species, you will see variation in names that will often reflect differences in physiology between organisms, differences in naming conventions, history, and even mistakes.

Looking at the graphic display for this BLASTN search (Figure 3.15), notice that many hits do not align with the 5′ or the 3′ ends of the query. Look at the alignment between the query and the Equus caballus (horse) beta hemoglobin (Figure 3.16). The alignment with the query starts at nucleotide 51 and ends with nucleotide 492. Why are so many sequences failing to align from end to end? The answer can be gathered from the alignment and the annotation of the sequence records. Go to the GenBank record (NM_000518) and look at the annotation for the human beta hemoglobin, and you will see that the coding region sequence, abbreviated as CDS, starts at nucleotide 51, the A of the ATG start codon. The horse sequence, NM_001164018, starts at this exact base (horse nucleotide number 1 aligns with human nucleotide 51). The 5′ untranslated region (UTR) is not even present in this horse reference sequence. Based on this BLAST search, it appears that the 5′ untranslated regions of many vertebrate mRNAs are not present in the sequence records or cannot align well with the human untranslated region; hence the common start site for many of the alignments is in the vicinity of human nucleotide 51.

The 3′ ends of many of these alignments terminate around query nucleotide 494 (seen in the graphic, Figure 3.15), which is the 3′ boundary of the human beta hemoglobin coding region. Certainly, divergence of sequence in the 3′ UTR explains some of these terminations. In the case of the horse sequence, there is


Latin species names

Throughout this book you will often encounter the Latin species names for organisms. Some will be easily recognized (for example, Rattus) but others can be guessed if you have knowledge of other topics, such as constellations of the evening sky, that also use Latin names. Bos taurus is the cow and the constellation is Taurus the bull. Canis Major is a hunting dog of Orion, and the Latin name for the dog is Canis familiaris. Sheep is Ovis aries, and the zodiac sign is Aries.


Figure 3.15 The BLASTN graphic of human beta hemoglobin mRNA against many species. Note the truncations at the 5′ (left) end of the graphic as well as the distinct boundary around nucleotide 500 at the 3′ end.

a simple reason to explain the sudden stop of alignment at horse nucleotide 442: the horse sequence comes to an end. Look at the description line of this alignment and it says “Length=444.” Scroll through the alignments and you’ll notice that many sequences do not include the 3′ UTR of the mRNA. Many genomes are annotated in an automated fashion and genes are predicted based on similarity to known, well-annotated genes from other organisms. Like the 5′ UTR, the 3′ UTRs of many predicted transcripts are underrepresented in the database.

In general, gene coding regions are more conserved than the noncoding regions. A single nucleotide change in the coding region can change the amino acid sequence, possibly alter the structure and function of the protein, or introduce a stop codon and truncate the translation product. There are fewer constraints for sequence and function on untranslated regions. However, there are very important regulatory elements at work in untranslated regions. As long as regulatory elements, if present, are not disrupted, many nucleotide substitutions, insertions, and deletions are tolerated and lead to sequence differences in untranslated regions so extensive that they fail to align using BLAST.
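The coordinates and percentages quoted in this section can be checked with a few lines of arithmetic. The Python sketch below uses only numbers given in the text and figures (CDS boundaries 51 and 494 in NM_000518, the horse record length of 444, and the identity counts 625/626 and 381/442). The function names are hypothetical, and truncating to a whole percent is an assumption based on the statement that the identities field "rounds down."

```python
def span_length(start, end):
    """Number of bases in an inclusive, 1-based coordinate range."""
    return end - start + 1


def whole_percent(identical, aligned):
    """Percent identity truncated to a whole number (assumed display rule)."""
    return (100 * identical) // aligned


human_cds = span_length(51, 494)       # human beta hemoglobin CDS in NM_000518
query_aligned = span_length(51, 492)   # human bases covered by the horse alignment
horse_aligned = span_length(1, 442)    # horse bases covered by the same alignment

print(human_cds)                        # 444, matching the horse "Length=444"
print(human_cds // 3)                   # 148 codons, counting the stop codon
print(query_aligned == horse_aligned)   # True: with no gaps, both spans are equal
print(whole_percent(625, 626))          # 99 (human vs. chimpanzee)
print(whole_percent(381, 442))          # 86 (human vs. horse)
```

Because the horse record is exactly as long as the human coding region, an alignment that begins at human nucleotide 51 has nowhere to go after the horse sequence ends.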

Paralogs, orthologs, and homologs

In the first BLASTN search of this chapter, we were able to identify members of the human hemoglobin family: in order, beta (the unknown found the reference sequence for itself), delta, epsilon 1, gamma G, and gamma A hemoglobins. Based on the identities between our beta hemoglobin query and these hits, it is clear the family members are still closely related to each other, the lowest identity

>ref|NM_001164018.1| Equus caballus hemoglobin, beta (HBB), mRNA
Length=444

 Score = 522 bits (578), Expect = 3e-146
 Identities = 381/442 (86%), Gaps = 0/442 (0%)
 Strand=Plus/Plus

Query  51   ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAAC  110
            |||||||| |||| |  ||| |||||| | || ||    |||||||||| |||||||||
Sbjct  1    ATGGTGCAACTGAGTGGTGAAGAGAAGGCAGCTGTCTTGGCCCTGTGGGACAAGGTGAAT  60

Query  111  GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAG  170
            | ||| |||||||||||||| |||||||||||||||||||| |||||||| ||||| |||
Sbjct  61   GAGGAAGAAGTTGGTGGTGAAGCCCTGGGCAGGCTGCTGGTTGTCTACCCATGGACTCAG  120

Query  171  AGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAG  230
            ||||||||||| ||||||||||||||||||| ||||| |||||| ||||||||||| |||
Sbjct  121  AGGTTCTTTGACTCCTTTGGGGATCTGTCCAATCCTGGTGCTGTGATGGGCAACCCCAAG  180

Query  231  GTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGCTCACCTGGAC  290
            |||||||| || ||||||||||||||     ||||| |||| ||| ||  ||| || |||
Sbjct  181  GTGAAGGCCCACGGCAAGAAAGTGCTACACTCCTTTGGTGAGGGCGTGCATCATCTTGAC  240

Query  291  AACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT  350
            ||||||||||||||||||||  | ||||||||||||||||||||||||||||||||||||
Sbjct  241  AACCTCAAGGGCACCTTTGCTGCGCTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT  300

Query  351  CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGC  410
            |||||||||||||||||||||||||||||||||||   ||||||||| |  |||||||||
Sbjct  301  CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTTGTTGTGCTGGCTCGCCACTTTGGC  360

Query  411  AAAGAATTCACCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAAT  470
            || || |||||||||    |||||||| ||||||| || ||||||||||||||||| |||
Sbjct  361  AAGGATTTCACCCCAGAGTTGCAGGCTTCCTATCAAAAGGTGGTGGCTGGTGTGGCCAAT  420

Query  471  GCCCTGGCCCACAAGTATCACT  492
            || ||||||||||| || ||||
Sbjct  421  GCACTGGCCCACAAATACCACT  442

being 76% between beta and gamma A. The simplest explanation for these very strong identities is that they all share a common ancestor. Some time in the distant past, there was a first hemoglobin gene. Then, through gene duplication events, other members arose, diverged in sequence, and became specialized in function. We can find them today as we did in the above BLASTN search. This stands as a model of what has happened throughout evolution; one gene gave rise to other family members. Gene family members within the same organism are referred to as paralogs. Paralogs share a common ancestor and reside in the same genome. They are clearly related to each other but are usually specialized and have different functions. Beta hemoglobin is expressed in adults, while gamma A hemoglobin protein is only found in the fetus. This specialization reflects the distinctly different oxygen-binding needs between an air-breathing adult and a fetus growing in a womb and getting oxygen from its mother’s blood. This is further illustrated by the human diseases called thalassemias, where a globin gene does not function and the other globin family members cannot adequately substitute for the lost function. As seen in the last BLASTN search, many other animals also have hemoglobins. Genes that perform identical functions in different organisms are called orthologs. The human beta hemoglobin gene is orthologous to the horse beta hemoglobin gene. Both are expressed in adult animals. Did they evolve


Figure 3.16 BLASTN alignment between human (NM_000518) and horse beta hemoglobin (NM_001164018). Note that the alignment starts at the “ATG” of the coding region (underlined), base number one of the horse sequence. There is no horse 5′ untranslated region.
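To see how the row of vertical bars in an alignment relates to the Identities count, it can help to rebuild it yourself. The following Python sketch takes the last row of the human and horse alignment (human nucleotides 471 to 492, horse nucleotides 421 to 442) and marks each position where the two bases agree. The function name match_line is hypothetical, and the sketch assumes an ungapped alignment, which holds for this figure (Gaps = 0/442).

```python
def match_line(query, subject):
    """Return a BLAST-style match line: '|' where bases agree, ' ' otherwise."""
    return "".join("|" if q == s else " " for q, s in zip(query, subject))


query = "GCCCTGGCCCACAAGTATCACT"    # human beta hemoglobin, nucleotides 471-492
subject = "GCACTGGCCCACAAATACCACT"  # horse beta hemoglobin, nucleotides 421-442

line = match_line(query, subject)
print(query)
print(line)
print(subject)
print(line.count("|"), "of", len(query), "positions identical")  # 19 of 22
```

Applying the same count to every row of the alignment and adding up the results gives the 381 identities reported in the header.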

independently? No. The simplest explanation is that there was a common animal ancestor who had the first beta hemoglobin gene, and the evolutionary descendants of that ancestor inherited this gene.

Homolog is a term that describes both paralogs and orthologs. When genes from different organisms are compared and it is not clear whether they are orthologous, the genes are described as homologs. When describing genes that show some identity but it is not clear if they are family members, it is safer to describe them as sharing homology. The human alpha hemoglobin and mouse beta hemoglobin clearly have a common ancestor but perform different functions. They are homologs, not orthologs, and certainly not paralogs.

3.6 BLAST OUTPUT FORMAT

The output of BLAST described above is the HTML or Web format of the results. This format allows easy navigation between the graphic, table, and alignments, as well as instant access to sequence files through hypertext. The NCBI and other Websites provide this to you because of these obvious advantages. However, you may encounter Websites, or an instance of BLAST that you run from a command-line interface such as a UNIX shell, where the output is raw text. In this case, the results may look like Figure 3.17. The advantage of this format is the simplicity; copying from this output, or parsing using a simple programming script, is uncomplicated by hidden formatting. The NCBI gives you the option to output your results as “plain text” by clicking on “Formatting options” near the top of the results page. In fact, if you are copying your own BLAST results and pasting them into reports, you may wish to use this format. This book will often show you the raw text format for simplicity. Note that in this raw form, the only columns after the description are the Score and E Value.
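As an illustration of how little is needed to parse the raw text format, here is a minimal Python sketch that splits one line of the results table into its parts. It assumes the layout seen in Figure 3.17, where the last two whitespace-separated fields on each line are the bit score and the E value. The function name parse_hit and the cutoff of 1e-10 are hypothetical choices made for this example, not settings from BLAST.

```python
def parse_hit(line):
    """Split one plain-text hit line into (identifier, description, score, e_value).

    Assumes the last two whitespace-separated fields are the score and E value.
    """
    text, score, e_value = line.rsplit(None, 2)
    identifier, _, description = text.partition(" ")
    return identifier, description.strip(), float(score), float(e_value)


hits = [
    "ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA        1034    0.0",
    "ref|NM_005330.3| Homo sapiens hemoglobin, epsilon 1 (HBE1), mRNA  381     8e-105",
    "ref|NM_080723.4| Homo sapiens neurensin 1 (NRSN1), mRNA           35.6    1.3",
]

parsed = [parse_hit(h) for h in hits]
significant = [p for p in parsed if p[3] < 1e-10]  # arbitrary example cutoff

for identifier, description, score, e_value in significant:
    print(identifier, score, e_value)
```

With these three lines, the two hemoglobin hits pass the cutoff and the neurensin hit (E value 1.3) does not.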

3.7 SUMMARY

In this chapter, the focus was BLASTN, a Web application that allows you to search nucleotide databases with nucleotide sequence queries. You learned how to paste a sequence into the query window, select a database to search, sometimes narrow your search to sequences of a certain species, and then launch the search. When the results came back you were able to quickly review the search by looking at the graphic and the table. You could look at the graphic to get a visual idea of how your query lined up to the hits, and sometimes saw that the query found smaller sequences. In the BLASTN results table, you saw the hits along with the statistics relating to them. Besides obvious criteria such as percent identity, there was the E value, which estimates how many hits of that quality would be expected by chance. An E value of 0.0 meant that chance was an extremely unlikely explanation for the hit. Finally, you looked at the alignments where you could see, base by base, how your query lined up with the hits.

EXERCISES

Exercise 1: Biofilm analysis

Public water supply lines are immersed in water for decades and a community of microorganisms thrives on these wet surfaces. These slippery coatings are referred to as biofilms and the bacterial makeup is generally unknown because scientists are unable to culture and study the vast majority of these organisms in the laboratory. In 2003, Schmeisser and colleagues published a study where they collected and sequenced the DNA from bacteria growing on pipe valves of a drinking water network in Northern Germany. Through sequence similarity, they were able to classify a large number of these organisms as belonging to certain species or groups. In this process they identified many new species. In this

(A)
                                                                  Score   E
Sequences producing significant alignments:                       (Bits)  Value

ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA        1034    0.0
ref|NM_000519.3| Homo sapiens hemoglobin, delta (HBD), mRNA       654     0.0
ref|NM_005330.3| Homo sapiens hemoglobin, epsilon 1 (HBE1), mRNA  381     8e-105
ref|NM_000184.2| Homo sapiens hemoglobin, gamma G (HBG2), mRNA    343     2e-93
ref|NM_000559.2| Homo sapiens hemoglobin, gamma A (HBG1), mRNA    334     1e-90
ref|XM_002344540.1| PREDICTED: Homo sapiens similar to PRO298...  241     2e-62
ref|XM_002347218.1| PREDICTED: Homo sapiens similar to PRO298...  241     2e-62
ref|XM_002343046.1| PREDICTED: Homo sapiens similar to PRO298...  241     2e-62
ref|NR_001589.1| Homo sapiens hemoglobin, beta pseudogene 1 (...  233     2e-60
ref|NM_001128602.1| Homo sapiens RAS guanyl releasing protein...  35.6    1.3
ref|NM_005739.3| Homo sapiens RAS guanyl releasing protein 1 ...  35.6    1.3
ref|NM_080723.4| Homo sapiens neurensin 1 (NRSN1), mRNA           35.6    1.3
ref|NM_016642.2| Homo sapiens spectrin, beta, non-erythrocyti...  35.6    1.3
ref|NM_000794.3| Homo sapiens dopamine receptor D1 (DRD1), mRNA   35.6    1.3
ref|NM_199077.1| Homo sapiens cyclin M2 (CNNM2), transcript v...  35.6    1.3
ref|NM_199076.1| Homo sapiens cyclin M2 (CNNM2), transcript v...  35.6    1.3
ref|NM_017649.3| Homo sapiens cyclin M2 (CNNM2), transcript v...  35.6    1.3
ref|NM_144666.2| Homo sapiens dynein heavy chain domain 1 (DN...  33.7    4.5
ref|NM_021020.2| Homo sapiens leucine zipper, putative tumor ...  33.7    4.5
ref|NM_000798.4| Homo sapiens dopamine receptor D5 (DRD5), mRNA   33.7    4.5
ref|NM_015221.2| Homo sapiens dynamin binding protein (DNMBP)...  33.7    4.5
ref|NM_080539.3| Homo sapiens collagen-like tail subunit (sin...  33.7    4.5
ref|NM_182515.2| Homo sapiens zinc finger protein 714 (ZNF714...  33.7    4.5
ref|NM_080538.2| Homo sapiens collagen-like tail subunit (sin...  33.7    4.5
ref|NM_005677.3| Homo sapiens collagen-like tail subunit (sin...  33.7    4.5
ref|NM_016315.2| Homo sapiens GULP, engulfment adaptor PTB do...  33.7    4.5
ref|NM_024686.4| Homo sapiens tubulin tyrosine ligase-like fa...  33.7    4.5
ref|NM_015540.2| Homo sapiens RNA polymerase II associated pr...  33.7    4.5
ref|NM_032444.2| Homo sapiens BTB (POZ) domain containing 12 ...  33.7    4.5
ref|NM_014234.3| Homo sapiens hydroxysteroid (17-beta) dehydr...  33.7    4.5
ref|NM_148414.1| Homo sapiens ataxin 2-like (ATXN2L), transcr...  33.7    4.5
ref|NM_007245.2| Homo sapiens ataxin 2-like (ATXN2L), transcr...  33.7    4.5
ref|NM_145714.1| Homo sapiens ataxin 2-like (ATXN2L), transcr...  33.7    4.5
ref|NM_148415.1| Homo sapiens ataxin 2-like (ATXN2L), transcr...  33.7    4.5
ref|NM_148416.1| Homo sapiens ataxin 2-like (ATXN2L), transcr...  33.7    4.5
ref|NM_016261.2| Homo sapiens tubulin, delta 1 (TUBD1), mRNA      33.7    4.5
ref|NM_000911.3| Homo sapiens opioid receptor, delta 1 (OPRD1...  33.7    4.5
ref|NM_007261.2| Homo sapiens CD300a molecule (CD300A), mRNA      33.7    4.5
ref|NM_020857.2| Homo sapiens vacuolar protein sorting 18 hom...  33.7    4.5

(B)
>ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
Length=626

 Score = 1034 bits (1146), Expect = 0.0
 Identities = 573/573 (100%), Gaps = 0/573 (0%)
 Strand=Plus/Plus

Query  1    GTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  54   GTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTG  113

Query  61   GATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGG  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  114  GATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGG  173

Query  121  TTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTG  180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  174  TTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTG  233

Figure 3.17 The “plain text” format of NCBI BLAST results. (A) The BLASTN results table. (B) The top of a BLASTN alignment.


exercise, you are to use BLASTN to repeat some of their analysis and identify the makeup of these biofilms. Below is a list of 10 sequence accession numbers from their study. You are to use the NCBI BLASTN Web form to search for sequence similarities to try to identify the bacteria growing within these biofilms.

AY187314 AY187315 AY187316 AY187317 AY187318
AY187325 AY187326 AY187330 AY187332 AY187333

1. Retrieve each sequence from the NCBI GenBank and, based on the annotation of these sequence records, identify what gene was used in their analysis.
2. For each sequence, convert the file format to FASTA using the “Display Settings.”
3. Navigate to the NCBI BLASTN Web form and paste the FASTA format of each DNA sequence into the Query window.
4. Choose the “Nucleotide collection (nr/nt)” as the database to be searched.
5. To save lots of time for your searches, restrict your search to “bacteria (taxid:2)” in the Organism field.
6. Pick “Somewhat similar sequences (BLASTN)” as the program to be used in the search.
7. When ready, launch the search by clicking on the “BLAST” button.
8. Open up additional Internet browser windows and launch the other searches.
9. Ten individual windows of results will be returned within a few minutes. Be sure to stay organized and record your conclusions for each accession number.
10. For each BLASTN search, survey the results graphic, table, and alignments to assign each unknown sequence to an organism. You may not find 100% identity between your query and the hits, except for the self-hit. Note that the first hit may also be an unknown so you should examine all the hits before drawing any conclusions as to what kind of bacteria the sequence came from.
11. Using the NCBI PubMed database or other Internet resources, try to find basic information about the genus and/or species; for example, habitats where these bacteria grow, and if they are associated with any diseases or environmental pollutants.
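The exercise is designed to be done through the Web forms, but with ten accession numbers you may prefer to script the retrieval in steps 1 and 2. The Python sketch below only builds the request addresses; it does not contact the NCBI. It assumes the NCBI E-utilities EFetch service, which is not covered in this chapter, and the function name fasta_url is hypothetical. Treat it as one possible shortcut rather than the method intended by the exercise.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service (an assumption for this
# sketch; the chapter itself uses the GenBank Web pages).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

ACCESSIONS = [
    "AY187314", "AY187315", "AY187316", "AY187317", "AY187318",
    "AY187325", "AY187326", "AY187330", "AY187332", "AY187333",
]


def fasta_url(accession):
    """Build an address that requests one nucleotide record in FASTA format."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,
        "rettype": "fasta",   # return the record as FASTA
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


for accession in ACCESSIONS:
    print(fasta_url(accession))
```

Each printed address can be pasted into a browser. If you automate the downloads, keep the requests few and slow, and still read the GenBank annotation for step 1, which the FASTA format does not include.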

Exercise 2: RuBisCO

It is often said that ribulose bisphosphate carboxylase (RuBisCO) is the most abundant protein on the planet. This enzyme is part of the Calvin cycle and is the key enzyme in the incorporation of carbon from carbon dioxide into living organisms. It is part of an enzyme complex found in plants, terrestrial or aquatic, and most probably played an important role in the development of our atmosphere and life on Earth.

Arabidopsis thaliana, a member of the mustard family, is an important model system for higher plants. It is easily cultivated in the laboratory, undergoes rapid development, and produces a large number of seeds, making it amenable to genetic studies. Although not important agronomically, Arabidopsis has provided fundamental knowledge of plant biology and it was the first plant genome to be sequenced (in 2000). In this exercise, you will use BLASTN to identify members of the RuBisCO gene family in Arabidopsis.

1. Retrieve the reference mRNA for the Arabidopsis RuBisCO small chain subunit 1b, NM_123204, at the NCBI Website.
2. Change the format to FASTA and paste the sequence into the NCBI BLASTN Web form Query window.
3. Set the database to “Reference RNA sequences (refseq_rna)” and restrict the organism to “Arabidopsis thaliana (taxid:3702).”
4. Set the program selection to “Somewhat similar sequences (BLASTN)” and click on the “BLAST” button to launch the search.
5. When the results are returned, you should now utilize the graphic, table, and alignments to identify the family members.
6. The Reference RNA database should not have any redundancy but two family members have alternatively spliced mRNAs. Compare the alignments carefully and examine the annotation (especially the coordinates of the coding regions) of all the relevant sequence records to describe and understand the major differences between these family member transcripts.
7. Create a table with a listing of the names of family member transcripts and their accession numbers, their mRNA length, the coordinates of the coding regions (CDS), and a brief description of what is observed in the alignments.

FURTHER READING

Altschul SF, Gish W, Miller W et al. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410. The first BLAST paper, a classic in the field of bioinformatics.

Altschul SF, Madden TL, Schäffer AA et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. This is another foundation paper by the creators of the BLAST suite of programs, describing significant improvements to the algorithm.

Schmeisser C, Stöckigt C, Raasch C et al. (2003) Metagenome survey of biofilms in drinking-water networks. Appl. Environ. Microbiol. 69, 7298–7309. This paper is relevant to the biofilms exercise.

Internet resources

To learn more about Arabidopsis and the Arabidopsis genome sequencing project, go to www.arabidopsis.org and click on their Education and Outreach portal. This is an extensive Website with many resources ranging from fundamental learning about Arabidopsis to detailed workings of plant research and how the genome was sequenced.

To learn more about the globin family of genes, go to the NCBI Bookshelf of electronic books, choose “Human Molecular Genetics” from the extensive list of textbooks, and in the search window for this text, enter “globin.” This gene was discussed or illustrated 74 times in this book, within a variety of topics. For example, there is a figure showing the “Evolution of the globin superfamily.” Due to their history, biology, biochemistry, genetics, diseases, size, and genomic structure, globin genes are frequent subjects in textbooks.


CHAPTER 4

Protein BLAST: BLASTP

Key concepts
• Protein BLAST (BLASTP) at the NCBI and ExPASy Websites
• The genetic code
• Amino acids and their overlapping properties
• The BLASTP scoring matrix

4.1 INTRODUCTION

In the last chapter, the focus was on nucleotide BLAST (BLASTN), a Web application that allows you to search nucleotide databases with nucleotide sequence queries. BLASTN gives you powerful search capabilities over simple text-based searches. You searched with unknowns and determined their identities based on similarities or identities to reference sequences. Now, if you were asked to find the insulin gene for the pig, you could start with the human version of insulin and search a DNA database, perhaps limiting your search to DNA sequences from Sus scrofa. There are other types of BLAST, giving you additional power and flexibility of approaches, and in this chapter we will cover protein BLAST (BLASTP, Table 4.1).

4.2 CODONS AND THE GENETIC CODE

With just four possible nucleotides (A, T, G, or C) in each position, a small gene such as insulin has a coding region that looks deceptively simple (Figure 4.1). Historically, it must have been a surprise when the chemical makeup of DNA was determined and it consisted of just four nucleotides. How could something so simple encode 20 amino acids and the phenomenal diversity of life in the world? The answer was equally simple; DNA sequence is read in “threes,” three nucleotides at a time, during the translation into protein sequence. These groups of

Table 4.1 Comparison of BLAST definitions

Type      Query        Database
BLASTN    Nucleotide   Nucleotide
BLASTP    Protein      Protein

The names, BLASTN and BLASTP, are easy to remember: you use a Nucleotide query to search a nucleotide database, and a Protein query to search a protein database, respectively.


Chapter 4: Protein BLAST: BLASTP

Figure 4.1 DNA sequence of the human insulin mRNA.

>NM_000207 Homo sapiens insulin (INS), mRNA, coding region only
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAG
CCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGG
CTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGG
GGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAAC
AATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAG

three are referred to as codons and, collectively, they make up the genetic code. If you consider all the possible combinations of three that can be made with four nucleotides, there are 64 possible triplets. It can be calculated as follows:

(4 possible bases in the first position) × (4 possible bases in the second position) × (4 possible bases in the third position) = 4 × 4 × 4 = 64 codons

Twenty codons would be adequate for the 20 amino acids, but all possible codons are utilized, with some redundancy, usually referred to as degeneracy. Three of the 64 codons signal the termination of mRNA translation, and these are called terminators. There are several ways to portray the genetic code but a commonly used chart appears in Figure 4.2.

As you examine the genetic code, you should notice a number of features. First, the degeneracy described above: there are two codons for phenylalanine (TTT, TTC), four codons for valine (GTT, GTC, GTA, GTG), and six codons for leucine (TTA, TTG, CTT, CTC, CTA, CTG). Codon assignments are not always multiples of two: methionine (ATG) and tryptophan (TGG) each have a single codon while isoleucine has three (ATT, ATC, ATA). Note that it is common to refer to a group of codons by using an “N” to represent “any base.” For example, the four codons for valine can be written as GTN.
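The 4 × 4 × 4 calculation and the “N” shorthand are easy to confirm for yourself. The short Python sketch below lists every possible triplet and expands a pattern such as GTN into its member codons. The function name expand is hypothetical, and the base order T, C, A, G is simply the order used in the genetic code chart.

```python
from itertools import product

BASES = "TCAG"  # same base order as the genetic code chart

# Every ordered combination of three bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
print(len(codons))  # 64


def expand(pattern):
    """Expand a codon pattern in which 'N' stands for any base."""
    choices = [BASES if base == "N" else base for base in pattern]
    return ["".join(triplet) for triplet in product(*choices)]


print(expand("GTN"))  # ['GTT', 'GTC', 'GTA', 'GTG']
```

Expanding the pattern NNN returns all 64 codons again, which is another way of stating the same arithmetic.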

Figure 4.2 The genetic code. All 64 possible triplets of the four nucleotides create a code for translating mRNA into protein sequence. To find the amino acid assigned to a given triplet or codon, start with the first nucleotide on the far left. Then find the column of the second nucleotide. Finally, find the narrow row of the third nucleotide on the far right. The intersection of all three vectors (Row × Column × Row) arrives at the amino acid assignment for a nucleotide triplet. For example, “GGG” is found in the intersection of the bottom large row, the far right column, and the last row. The three-letter and single-letter abbreviations for the amino acids appear in this figure.

                               Second Nucleotide
First      |      T       |      C       |      A       |      G       | Third
Nucleotide |              |              |              |              | Nucleotide
-----------+--------------+--------------+--------------+--------------+-----------
           | TTT Phe (F)  | TCT Ser (S)  | TAT Tyr (Y)  | TGT Cys (C)  | T
     T     | TTC Phe      | TCC Ser      | TAC Tyr      | TGC Cys      | C
           | TTA Leu (L)  | TCA Ser      | TAA Ter (*)  | TGA Ter (*)  | A
           | TTG Leu      | TCG Ser      | TAG Ter (*)  | TGG Trp (W)  | G
-----------+--------------+--------------+--------------+--------------+-----------
           | CTT Leu (L)  | CCT Pro (P)  | CAT His (H)  | CGT Arg (R)  | T
     C     | CTC Leu      | CCC Pro      | CAC His      | CGC Arg      | C
           | CTA Leu      | CCA Pro      | CAA Gln (Q)  | CGA Arg      | A
           | CTG Leu      | CCG Pro      | CAG Gln      | CGG Arg      | G
-----------+--------------+--------------+--------------+--------------+-----------
           | ATT Ile (I)  | ACT Thr (T)  | AAT Asn (N)  | AGT Ser (S)  | T
     A     | ATC Ile      | ACC Thr      | AAC Asn      | AGC Ser      | C
           | ATA Ile      | ACA Thr      | AAA Lys (K)  | AGA Arg (R)  | A
           | ATG Met (M)  | ACG Thr      | AAG Lys      | AGG Arg      | G
-----------+--------------+--------------+--------------+--------------+-----------
           | GTT Val (V)  | GCT Ala (A)  | GAT Asp (D)  | GGT Gly (G)  | T
     G     | GTC Val      | GCC Ala      | GAC Asp      | GGC Gly      | C
           | GTA Val      | GCA Ala      | GAA Glu (E)  | GGA Gly      | A
           | GTG Val      | GCG Ala      | GAG Glu      | GGG Gly      | G
-----------+--------------+--------------+--------------+--------------+-----------
(*) Termination codons are labeled as "Ter" and "*".
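Looking up codons by eye is good practice, but the chart can also be written as a small lookup table. In the Python sketch below, the 64 single-letter assignments are listed in chart order (first base, then second, then third, each running T, C, A, G), so pairing them with the codons rebuilds Figure 4.2. The names CODE and translate are hypothetical, and the sketch assumes the sequence it is given starts at the first base of a codon. The two test sequences are the first 24 and the last 15 bases of the insulin coding region in Figure 4.1.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# One-letter amino acids in chart order; "*" marks a termination codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon, ignoring any incomplete final codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODE[dna[i:i + 3]] for i in range(0, usable, 3))


print(CODE["GGG"])                            # G (glycine)
print(translate("ATGGCCCTGTGGATGCGCCTCCTG"))  # MALWMRLL
print(translate("AACTACTGCAACTAG"))           # NYCN*

# Degeneracy: how many codons are assigned to each amino acid or terminator.
counts = Counter(CODE.values())
print(counts["L"], counts["I"], counts["M"], counts["*"])  # 6 3 1 3
```

The counts agree with the features described in the text: six codons for leucine, three for isoleucine, one for methionine, and three terminators.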

For the most part, the genetic code is orderly. Related codons encode the same amino acid. For example, all proline codons begin with CC, all glycine codons begin with GG, and all valine codons begin with GT. Leucine codons can start with either a T or C, but all leucine codons have a T in the second position. With only four possibilities, having one nucleotide in common isn’t necessarily significant.

GCG and the Alanines

There once was a company that sold a large suite of sequence analysis programs, and although it is no longer in business, many institutions still use this old software. The company’s name was Genetics Computer Group, but everyone called it “GCG.” They had a company softball team called the “Alanines,” named after the codon GCG.

Looking elsewhere in the code, terminator codons are unique in that they do not encode an amino acid but, instead, signal the termination of translation. There are other groups of two or four codons for certain amino acids. A notable set of codons is for serine: there is one group of four (TCN) but another pair which seems to have little in common with the other four (AGT, AGC).

There is a large advantage to having a genetic code with redundancy. The homologous DNA sequences that were found with BLASTN in Chapter 3 showed good levels of conservation of DNA sequence, but there were definitely differences, especially across species. There is evolutionary pressure to maintain the same amino acid sequence for a protein. But when a change in DNA sequence occurs, the redundancy of codons means that there is not necessarily a change of the amino acid. In many cases, the third nucleotide of the codon can be modified with reduced or no consequences on the encoded amino acid. In fact, we can easily find instances where the conservation of amino acid sequence leaves a specific signature involving the third base. Here is another case where your eyes can easily recognize a pattern if you take the time to look and you know what you are looking for.

Recalling the Chapter 3 BLASTN search of human beta hemoglobin (NM_000518) against RefSeq, included on the list of hits was the rat epsilon hemoglobin mRNA (NM_001008890). One pair of aligned sequences from rat epsilon hemoglobin shows a pattern where there are pairs of identical bases and then the third is different (Figure 4.3). Despite there being 12 nonidentities in this alignment of 60 nucleotides, only one results in an amino acid difference for this region. Ten of those nucleotide differences occurred in the third position, resulting in no change in amino acid (for example, the codon labeled “2” in Figure 4.3B). One third-position change (codon “1” in Figure 4.3B) resulted in a D (GAT) to E (GAG) substitution, but these amino acids are biochemically similar so this difference is probably less significant than many other possible substitutions. The final difference in this aligned stretch is in the first position of the codon labeled “3” (CTG vs TTG), but both codons are for leucine.

Figure 4.3 Conservation of protein sequence signature. (A) BLASTN alignment between a coding region of human beta hemoglobin (NM_000518) and rat epsilon hemoglobin (NM_001008890) mRNAs. (B) The same alignment as in (A) with every other codon underlined for easier viewing and three codons marked with 1, 2, and 3. (C) Translations of this same coding region with one-letter abbreviations for the amino acids. The single amino acid change for this sequence is bold and underlined.

(A)
human beta hemoglobin   GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAG
                        || || || ||||||||||| ||| |||| || || || || || ||||| |||||||||
rat epsilon hemoglobin  GTTGAGGAGGTTGGTGGTGAAGCCTTGGGAAGACTTCTCGTTGTGTACCCATGGACCCAG

(B)
[Same alignment as (A). In the original figure every other codon is underlined and three codons are marked: 1 (GAT vs GAG), 2 (one of the codons with a silent third-position change), and 3 (CTG vs TTG).]

(C)
human beta hemoglobin   V D E V G G E A L G R L L V V Y P W T Q
rat epsilon hemoglobin  V E E V G G E A L G R L L V V Y P W T Q

It is important to remember that an organism doesn’t choose where the changes in nucleotide sequence will occur. What we see today are changes that were tolerated after they happened. Mutations are tolerated in positions where there is little or no impact on the structure or function of a particular sequence. For example, should the third base of an alanine codon change from GCG to GCA, GCC, or GCT, there is no consequence on the protein sequence since all four codons are for alanine. Importantly, mutations that lead to an advantage for the organism are positively selected. If an enzyme that breaks down sugar has a change in protein sequence and can now digest sugar more efficiently, a bacterium that harbors this new protein will grow faster than its neighbors and soon come to dominate the local environment.
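The third-base signature described for Figure 4.3 can also be checked with a short script. The following is a minimal Python sketch rather than part of the original exercise: it hard-codes the two 60-nucleotide segments from Figure 4.3, assumes they begin in frame, and uses helper names (codons, translate, compare) that were chosen here purely for illustration.

```python
from itertools import product

# Standard genetic code, built in TCAG order (a common compact encoding).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Aligned segments copied from Figure 4.3 (assumed to start in frame).
HUMAN = "GTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAG"
RAT = "GTTGAGGAGGTTGGTGGTGAAGCCTTGGGAAGACTTCTCGTTGTGTACCCATGGACCCAG"


def codons(seq):
    """Split a sequence into consecutive three-base codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    """Translate an in-frame DNA sequence to one-letter amino acids."""
    return "".join(CODON_TABLE[c] for c in codons(seq))


def compare(a, b):
    """Count nucleotide differences by codon position and list codons
    whose encoded amino acid differs."""
    by_position = {1: 0, 2: 0, 3: 0}
    changed = []
    for ca, cb in zip(codons(a), codons(b)):
        for i in range(3):
            if ca[i] != cb[i]:
                by_position[i + 1] += 1
        if CODON_TABLE[ca] != CODON_TABLE[cb]:
            changed.append((ca, cb, CODON_TABLE[ca], CODON_TABLE[cb]))
    return by_position, changed


by_position, changed = compare(HUMAN, RAT)
print("differences by codon position:", by_position)
print("amino acid changes:", changed)
print(translate(HUMAN))
print(translate(RAT))
```

Running it on other aligned coding segments only requires replacing the two strings, provided the segments are ungapped and in frame.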

Memorizing the genetic code

There are people who have memorized the entire genetic code. This is not essential to perform sequence analysis, but as you work your way through this book, there will be times where you will find it helpful to know some of the genetic code without having to look it up. Perhaps you could start with the beginnings and ends of coding regions.

● Most proteins begin with the codon ATG, which is translated to methionine (look at the first codon of the insulin-coding region in Figure 4.1).

● The translation of proteins ends with one of the codons TAA, TAG, TGA (look at the last codon of the insulin-coding region in Figure 4.1).
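These two rules are easy to test on any coding region. The sketch below is illustrative only: the function name and the short toy sequence are hypothetical and are not taken from Figure 4.1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_region(cds):
    """Report whether a DNA coding region starts with ATG, ends with a
    terminator codon, and has a length divisible by three."""
    cds = cds.upper()
    return {
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(cds) % 3 == 0,
    }


# Hypothetical toy sequence: ATG, three sense codons, then the stop TAG.
toy_cds = "ATGGCCCTGTGGTAG"
print(check_coding_region(toy_cds))
```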

4.3 AMINO ACIDS

For convenience, amino acids are abbreviated in two different ways, using three-letter and one-letter codes (Figure 4.4). Even in a small protein such as preproinsulin, all 20 amino acids are represented. Figure 4.5 shows the 110 amino acid protein sequence encoded by the insulin DNA sequence (see Figure 4.1). mRNAs are translated from the 5′ end to the 3′ end. The first amino acid of the protein is the N-terminus while the last amino acid is the C-terminus. However, these terms are also used to describe the terminal regions, not just individual amino acids. For example, sometimes the first five amino acids are described as the N-terminus, but it is just as appropriate to label the first 30 amino acids the same way. If you find it difficult to remember that the N-terminus is encoded by the 5′ end and the C-terminus by the 3′ end, just recall that “3” rhymes with “C.”

Figure 4.4 Amino acids and their abbreviations. The literature and specialized bioinformatics applications will often use the three-letter abbreviations, most of which are the first three letters of the amino acid’s name. But in most books, publications, and bioinformatics applications, the one-letter abbreviations for amino acids are seen. The column on the far right contains common ways to remember the abbreviations. Protein sequence can be derived from the coding sequence of an mRNA, or can be chemically determined. If the amino acid sequence of a protein is uncertain, an “X” is used as a placeholder. It represents any amino acid.

AMINO ACIDS AND THEIR SYMBOLS   CODONS                   MEMORY AID
A   Ala   Alanine               GCA GCC GCG GCT          Alanine
C   Cys   Cysteine              TGC TGT                  Cysteine
D   Asp   Aspartic acid         GAC GAT                  AsparDic acid
E   Glu   Glutamic acid         GAA GAG                  GluEtamic acid
F   Phe   Phenylalanine         TTC TTT                  Fenylalanine
G   Gly   Glycine               GGA GGC GGG GGT          Glycine
H   His   Histidine             CAC CAT                  Histidine
I   Ile   Isoleucine            ATA ATC ATT              Isoleucine
K   Lys   Lysine                AAA AAG                  K is adjacent to Lysine in the alphabet
L   Leu   Leucine               TTA TTG CTA CTC CTG CTT  Leucine
M   Met   Methionine            ATG                      Methionine
N   Asn   Asparagine            AAC AAT                  AsparagiNe
P   Pro   Proline               CCA CCC CCG CCT          Proline
Q   Gln   Glutamine             CAA CAG                  Qtamine
R   Arg   Arginine              AGA AGG CGA CGC CGG CGT  aRginine
S   Ser   Serine                AGC AGT TCA TCC TCG TCT  Serine
T   Thr   Threonine             ACA ACC ACG ACT          Threonine
V   Val   Valine                GTA GTC GTG GTT          Valine
W   Trp   Tryptophan            TGG                      tWiptophan
Y   Tyr   Tyrosine              TAC TAT                  tYrosine
X         Any amino acid

Figure 4.5 Protein sequence of human preproinsulin.

>NP_000198 insulin preproprotein [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG
GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

Take a moment to look at this preproinsulin protein sequence. What do you notice? Are some amino acids abundant, and others rare? Is it surprising that more than one-third of this protein is made up of three amino acids: 10 alanines (A), 12 glycines (G), and 20 leucines (L)? Do you see adjacent pairs of amino acids (LL, FF, RR)? Even English words appear at random (“VEAL,” “GAG,” “LEGS”). To test your powers of observation, try to find these three words and others.
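If you would like to check your observations, the counts can be reproduced with a few lines of Python. This is only a sketch: the sequence is copied from Figure 4.5, and the variable names are illustrative.

```python
from collections import Counter

# Human preproinsulin, NP_000198, copied from Figure 4.5.
PREPROINSULIN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
    "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

counts = Counter(PREPROINSULIN)
abundant = counts["A"] + counts["G"] + counts["L"]

print("length:", len(PREPROINSULIN))
print("distinct amino acids:", len(counts))
print("A, G, L:", counts["A"], counts["G"], counts["L"])
print("fraction A+G+L: %.2f" % (abundant / len(PREPROINSULIN)))

# Report the 1-based start position of each "word" (0 means not found).
for word in ("VEAL", "GAG", "LEGS"):
    print(word, PREPROINSULIN.find(word) + 1)
```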

Amino acid properties

Let’s return to the substitution we saw in the human–rat hemoglobin mRNA alignment above (Figure 4.3). We saw there was only one amino acid substitution, aspartic acid (D) for glutamic acid (E), and these amino acids were referred to as biochemically similar. Looking at Figure 4.6, these two amino acids are quite similar in appearance, with a common backbone and variable side chains or “R groups.” Both side chains are acidic and interact with their environment in a similar fashion. If evolution substitutes a glutamic acid for an aspartic acid, this may not change the shape of the protein very much. Glutamic acid is larger by three atoms (CH2), which can be a critical feature of this position in the protein. If the amino acid in this position were pointing inward to a reactive pocket, glutamic acid would protrude deeper into that pocket than aspartic acid and that could be crucial to the proper function of the protein. Alternatively, it may be crucial to provide room in a pocket and glutamic acid may be too large. Nature allows certain substitutions in many positions as long as both the structure and the properties of this region of the protein are conserved. Still, there is change, and sometimes abandonment, of one function as another function is acquired.

As we compare similar proteins, you will notice that certain regions tolerate insertions or deletions, which are collectively known as indels. If you examine the three-dimensional structure of these proteins, you often see that these variable regions are on the exterior surface of the protein. As long as a change does not alter the hydrophilic nature of this region (which is important since it comes in contact with the aqueous environment) and does not significantly alter the structure or function of the rest of the protein, it can be tolerated. In contrast, place a new hydrophilic or biochemically reactive amino acid in the hydrophobic interior of a protein which is normally folded into a tight ball, and the structure, function, and stability are liable to be affected.

Figure 4.6 The chemical structures of the amino acids aspartic acid and glutamic acid. Both amino acids share a common backbone (N–C–C). The variable lower part of each molecule is known as the side chain. From Zvelebil M and Baum JO (2008) Understanding Bioinformatics. Garland Science.

[Structure diagrams not reproduced. Aspartic acid (Asp, or D): side chain CH2–COO–. Glutamic acid (Glu, or E): side chain CH2–CH2–COO–.]

The amino acid side chains that are brought in close proximity may be orchestrating a complex chain of biochemical events that are extremely sensitive to the slightest change. Little or no substitution is tolerated in these regions because the amino acids here must have certain dimensions, charge, and other properties. These critical amino acids are often not adjacent to each other in the polypeptide chain. In Figure 4.7, amino acid side chains from distant regions of a hypothetical unfolded protein are brought together to create an active site. This active site may bind to a substrate and catalyze a change, for example. As we examine protein sequence alignments generated by BLASTP, you will notice “islands” of conservation that may indeed represent distant regions brought together for a key protein function.

Figure 4.7 A hypothetical protein showing side chains being brought into close proximity by folding of the protein. From Alberts B, Johnson A, Lewis J et al. (2007) Molecular Biology of the Cell, 5th ed. Garland Science.

[Diagram not reproduced. Panel labels: unfolded protein with amino acid side chains; FOLDING; folded protein with binding site.]


4.4 BLASTP AND THE SCORING MATRIX

As described in Chapter 3, a BLAST algorithm matches queries with database entries and, based on matches, tries to extend the alignments until penalty points accumulate past a certain threshold. With BLASTN, you have five possibilities in each position of an alignment between the query and a hit: A, T, G, C, and N (for any nucleotide). As BLASTN tries to find alignments, the scoring is relatively simple: an A should line up with an A, a T with a T, a G with a G, and a C with a C. Any base can align with N. Any other pair, such as an A lining up with a T, is a mismatch and a penalty is applied.

When working with protein sequences, you have 21 possibilities for each position (20 amino acids plus X for any amino acid). If you tried to keep it simple, you could design your search program to just reward matches and penalize mismatches. But as discussed above, there is an added layer of complication when trying to align protein sequences: nature often tolerates conservative differences in amino acid sequences. Figure 4.8 shows the 20 amino acids grouped by the properties of the side chains. A smaller penalty should be applied if an amino acid substitution maintains the properties important for a protein to function. Rather than just counting matches and mismatches, BLASTP uses a scoring matrix for matches, mismatches, and conservative substitutions. A scoring matrix rewards identities, gives “partial credit” for some mismatches, and penalizes others.

The scoring on a matrix is critical to the success of a BLASTP search. If penalties are too high and rewards too low, sequences distant but related to your query would go undetected by BLASTP. But if penalties are too low and rewards are too high, then sequences unrelated to your query would clutter the results.
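The simple nucleotide scheme described above can be written in a few lines. This is a sketch under stated assumptions, not the BLASTN implementation: the reward and penalty values are illustrative placeholders, gaps are not handled, and N is treated as matching any base, as in the description above.

```python
def simple_nucleotide_score(query, subject, reward=1, penalty=-2):
    """Score an ungapped, equal-length nucleotide alignment.

    reward and penalty are illustrative values, not BLASTN defaults.
    A position counts as a match when the bases are identical or when
    either base is N (any nucleotide)."""
    score = 0
    for q, s in zip(query.upper(), subject.upper()):
        if q == s or "N" in (q, s):
            score += reward
        else:
            score += penalty
    return score


print(simple_nucleotide_score("ATGC", "ATGC"))  # four matches
print(simple_nucleotide_score("ATGC", "ATNC"))  # N aligns with any base
print(simple_nucleotide_score("ATGC", "ATTC"))  # one mismatch
```

A protein version of this function would treat every mismatch alike, which is exactly the limitation a scoring matrix is meant to remove.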

Figure 4.8 Amino acids can be grouped based on the properties of their side chains. It is common to divide them into these four groups: basic, acidic, uncharged polar, and nonpolar. From Zvelebil M and Baum JO (2008) Understanding Bioinformatics. Garland Science.

[Structure diagrams not reproduced. Groups shown in the figure:]
BASIC SIDE CHAINS: lysine (Lys, or K), arginine (Arg, or R), histidine (His, or H)
ACIDIC SIDE CHAINS: aspartic acid (Asp, or D), glutamic acid (Glu, or E)
UNCHARGED POLAR SIDE CHAINS: asparagine (Asn, or N), glutamine (Gln, or Q), serine (Ser, or S), threonine (Thr, or T), tyrosine (Tyr, or Y)
NONPOLAR SIDE CHAINS: alanine (Ala, or A), valine (Val, or V), leucine (Leu, or L), isoleucine (Ile, or I), proline (Pro, or P), phenylalanine (Phe, or F), methionine (Met, or M), tryptophan (Trp, or W), glycine (Gly, or G), cysteine (Cys, or C)

Building a matrix

So how do you approach building a matrix? You could focus on physical size and create matrix scores based on the similarities of amino acid dimensions. Certainly, there are critical amino acids filling a cavity in a protein, or maybe not filling that cavity, and any change in size disrupts the important dimensions of, for example, an enzyme’s reactive site. Or you could place emphasis on the biochemical properties of the amino acid: whether it is acidic, basic, nonpolar, uncharged, hydrophobic, or hydrophilic. Like physical size, there are circumstances where the biochemical properties of a given position on the protein have to be maintained in order to maintain function or structure. People who study protein structures and functions will tell you that all of these properties are important. For every amino acid of every protein, there are different forces at work that shape the most important property for substitutions at those positions.

Rather than try to design a scoring matrix to cover all of these specific issues, a powerful approach would be to build a matrix based on what we see in protein families. For millions of years, nature has been experimenting and natural selection has left us with substitutions that work. Why not just catalog all of these substitutions and let the frequencies at each position guide us?

The flaw in the approach as stated is that “all of these substitutions” are too many. There are bacterial proteins that are related to human proteins and if you were to align these sequences, you would see so many substitutions that you could conclude that almost everything is tolerated and scoring-matrix penalties should be uniformly low. In other words, a matrix built with such far-ranging diversity would be quite “noisy” and not practical for most purposes. At the other extreme, if you were to base your matrix on the alignment of human and other primate sequences, you would see so few differences that you could conclude that, based on their rarity, there is low tolerance for substitutions in nature.

A compromise that serves as a general-purpose matrix is called BLOSUM62 (pronounced “blah sum 62”). Steve and Jorja Henikoff created a database of protein sequence alignments called BLOCKS where homologous and continuous (no gaps) sections of related proteins were aligned. The percentage identity allowed between aligned family members was then varied and different cutoffs were used to build matrices for testing. For example, a matrix was built where the percentage identity in the alignments was no greater than 80%, thereby excluding more closely related proteins. These matrices were then used to calculate the sensitivity of finding members of a divergent family of receptors. Using several database-searching programs, including BLAST, it was determined that alignments with identity no greater than 62% created a matrix that performed best and missed the fewest members of the receptor family. This percentage identity appears to be a good compromise for the tolerance of many amino acid substitutions and the BLOSUM62 matrix (Figure 4.9) is the default standard for BLASTP searches.

Across the bottom and along the left side of the scoring matrix are the one-letter amino acid abbreviations. Within the matrix are the scores for the intersections (alignments) between amino acid columns and rows. Maintenance (identity) of the amino acid is rewarded with a high score, although some identical amino acid alignments are more important than others. Compare C-C and A-A, for example. Cysteines (C) are given a high score because they are often involved in disulfide bridges, the chemical linking between two cysteines. These cysteine pairs are usually critical for the proper folding of a protein. Alanine, on the other hand, is quite common and, under many circumstances, biochemically neutral, and so its maintenance (identity) is awarded a lower score. Tryptophan (W) is a rare amino acid and so finding two tryptophans aligning should be rewarded. Zero is a neutral score. For example, the substitution of an alanine (A) for a glycine (G) receives neither a penalty nor a reward. Figure 4.8 shows that both have side groups that are nonpolar and rather small.
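To make the idea of matrix scoring concrete, here is a minimal Python sketch. It is not how BLASTP is implemented: it holds only a small excerpt of BLOSUM62 (values copied from Figure 4.9), looks up pairs in either order, and sums the scores over a short ungapped stretch taken from the human and rat preproinsulin sequences. The dictionary layout and function names are illustrative choices.

```python
# Excerpt of BLOSUM62 (values from Figure 4.9). Keys are the two
# one-letter codes in alphabetical order.
BLOSUM62_EXCERPT = {
    "AA": 4, "CC": 9, "WW": 11, "AG": 0, "DE": 2,
    "PP": 7, "KK": 5, "RR": 5, "EE": 5, "DD": 6,
    "ST": 1, "AV": 0,
}


def pair_score(a, b):
    """Look up the score for two aligned amino acids, in either order."""
    return BLOSUM62_EXCERPT["".join(sorted(a + b))]


def segment_score(query, subject):
    """Sum the matrix scores over an ungapped, equal-length alignment."""
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(pair_score("C", "C"), pair_score("A", "A"), pair_score("W", "W"))
print(pair_score("G", "A"))  # neutral substitution

# Human residues 52-60 aligned with the corresponding rat residues.
total = segment_score("PKTRREAED", "PKSRREVED")
print(total)
```

With identity-only scoring the T/S and A/V positions would be penalized equally; with the matrix, T/S earns partial credit while A/V is neutral.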

4.5 AN EXAMPLE BLASTP SEARCH

We’ll now perform a BLASTP search, examine the alignments between the query and the hits, and look for distant homologies found by the query. For our query, we will use the sequence of human preproinsulin (see Figure 4.5). We’ll also return to the NCBI Website to run BLASTP, and you’ll see that the BLASTN and BLASTP forms are quite similar, requiring little explanation here.

Figure 4.9 The BLOSUM62 matrix.

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V


Retrieving protein records

Protein records can be found much the same way as nucleotide records. Text queries at the NCBI main page give you access to all their major databases, including one for proteins (Figure 4.10). Most NCBI mRNA or Gene files will give you easy access to the protein sequences encoded by that nucleic acid. For example, in a RefSeq mRNA file, the translation product is displayed in an annotation feature called CDS (coding sequence) and a screenshot of this region of the annotation is in Figure 4.11. Although it is certainly convenient for the protein sequence to be visible from within the nucleotide record, it is not intended for any serious analysis other than a visual examination. Copying and pasting from this field will also yield a format that needs “cleaning up” to make it into FASTA format, for example.

However, within the CDS section is a link to the protein record. Notice the line labeled “protein_id” in Figure 4.11 and the hypertext “NP_000198.1” which will take you to the protein RefSeq record. Once there, you will find a link back to the corresponding mRNA record near the top of the file in the line labeled “DBSOURCE” (Figure 4.12). When viewing any database record at any Website, a small amount of hunting will often find links to the files you need.
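As an illustration of the “cleaning up” mentioned above, the following Python sketch converts a translation copied from a CDS feature into FASTA format. It is a sketch under assumptions: the pasted text is a hypothetical example of how such a field might look after copying (a /translation= qualifier with quotation marks, line breaks, and indentation), and the function name and 70-character line width are choices made here.

```python
import re

# Hypothetical paste from the CDS section of an mRNA record.
PASTED = '''/translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGF
                     FYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLE
                     NYCN"'''


def translation_to_fasta(pasted, header, width=70):
    """Strip the qualifier name, quotes, and whitespace from a pasted
    translation and return it as FASTA text."""
    match = re.search(r'/translation="([^"]*)"', pasted)
    raw = match.group(1) if match else pasted
    seq = re.sub(r"[^A-Za-z]", "", raw).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


fasta = translation_to_fasta(
    PASTED, "NP_000198.1 insulin preproprotein [Homo sapiens]"
)
print(fasta)
```

Following the protein_id link to the protein record remains the more reliable route; a script like this is only a convenience when a pasted translation is all you have.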

Running BLASTP

1. At the NCBI Website, find the protein RefSeq sequence for human preproinsulin, NP_000198.
2. Open another Web page and go to the NCBI BLAST page and choose Protein BLAST.
3. Paste in the sequence or enter the RefSeq accession number.
4. Change the database to “Reference proteins.”
5. Make sure BLASTP is the algorithm and click on the BLAST button.

Figure 4.10 The NCBI drop-down menu. This appears on the home page and allows convenient searching of the Protein database with text queries.

Figure 4.11 The CDS section of an NCBI mRNA record. This contains a translation of the protein encoded by this mRNA.

Figure 4.12 A RefSeq protein record. The line “DBSOURCE” has a hyperlink to the DNA record encoding this protein.


The results

The top of the BLASTP search results can often look very different from that seen with BLASTN. In this case, the query description has been greatly expanded and many different protein records are listed, each marked with a “>” sign. This is because all of these sequences (20 are seen in Figure 4.13) are identical to each other. As you look closely at this list, you can see that these files come from a number of databases. The gi (GenInfo Identifier) number appears first, followed by identifiers from the RefSeq (ref), Swiss-Prot (sp), GenBank (gb), EMBL (emb), and Protein Research Foundation (prf) databases. These different databases each have a human preproinsulin protein sequence which is 100% identical to the original query, NP_000198. The NCBI shows them here to spare you from these being the top 20 hits in your BLAST results. Looking closely at the list, you may be surprised to see the gorilla sequence, AAN06935, indicating that it is 100% identical to the human sequence.

Scrolling down, there is a graphic (see Figure 4.14 and color plates) that is quite similar to the BLASTN figure, with the query length going horizontally across the top and the hits being represented as colored bars. What is different is that the top of the graphic now has an additional section depicting any similarities between your query and the sequences of known protein families. In addition to searching a database of sequences (RefSeq), your query was searched against a database of conserved domains, and the two major hits are depicted here as thick bars. Appropriately, the domains are specific to insulin-like growth factors (ILGF) and a larger family or “superfamily” that includes insulin and many related sequences. Based on the scale across the top of this section, the domain that was detected begins near amino acid 25 and goes to the end of the query. Looking down at the rest of the graphic, colored differently because of lower scores, the hits are beginning to lose similarity to the N-terminus but are retaining similarity from around amino acid 25 to the end of the query, the same coordinates seen with the domain search results.

Figure 4.13 The top of the NCBI BLASTP results.

Figure 4.14 The graphic from the NCBI BLASTP results. The query is the human preproinsulin protein sequence NP_000198 and the database is RefSeq protein. See color plates for a color version of this figure.

The columns of the results table should also look familiar (Figure 4.15). Of course, the query found itself (NP_000198, the top hit) and, by association, the other hits listed at the top of the results window. But even with the expected 100% identity between the query and itself, notice that the E value is not 0.0. Human preproinsulin is only 110 amino acids long and the calculations performed by BLASTP indicate that the number of hits in the database expected to be found by chance is not 0.0, although 1e–73 is quite small and nobody would argue that this hit is by chance. Longer proteins finding themselves via BLASTP will have an E value of 0.0. The results table shows a list of insulin proteins. For most of these orthologs, you can see their entire descriptions, including the species. Past the human and chimpanzee hits, the drop in Max Score and increase in E value indicate that we should expect mismatches.

Figure 4.15 The table from the NCBI BLASTP results. The RefSeq protein database was searched with the human insulin protein sequence, NP_000198.

Figure 4.16 The BLASTP alignment between the human preproinsulin protein (Query) and the rat hit (Sbjct).

>ref|NP_062002.1| insulin-1 preproprotein [Rattus norvegicus]
Length=110

 Score = 171 bits (433), Expect = 2e-41, Method: Compositional matrix adjust.
 Identities = 91/110 (82%), Positives = 95/110 (86%), Gaps = 0/110 (0%)

Query  1   MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED  60
           MALWMR LPLLALL LW P PA AFV QHLCG HLVEALYLVCGERGFFYTPK+RRE ED
Sbjct  1   MALWMRFLPLLALLVLWEPKPAQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED  60

Query  61  LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN  110
            QV Q+ELGGGP AG LQ LALE + QKRGIV+QCCTSICSLYQLENYCN
Sbjct  61  PQVPQLELGGGPEAGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN  110

The alignments

The alignments generated by BLASTP have elements you saw in those generated by BLASTN. One striking difference is that BLASTP shows a letter between aligned identical amino acids rather than a vertical bar, and a new symbol, the plus (“+”) sign, is used for some nonidentities. These are amino acid substitutions scored as conservative changes by the BLOSUM62 matrix. Figure 4.16 shows the alignment between the human query and the rat insulin preproprotein. There are blank spaces for most nonidentities between amino acids, but four pairs where a “+” sign is used. These four conservative amino acid differences have a BLOSUM62 score of “1” or larger. All nonidentities where the BLOSUM62 scores are “0” or negative numbers have a blank space. Referring back to the BLOSUM62 matrix in Figure 4.9, we see the scores for selected pairs of amino acids in Figure 4.17.

Pair     BLOSUM62 Score
T + S     1
V + L     1
S + A     1
E + D     2
L   F     0
A   V     0
G   E    −2
L   P    −3

Figure 4.17 BLOSUM62 scores of eight selected pairings from the alignment shown in Figure 4.16. The “+” symbol between some pairs indicates conservative substitutions while the other pairs listed here have a blank space as seen in the BLASTP alignments in Figure 4.16.

Notice that no gaps were needed to generate a strong alignment between these distant species (Figure 4.16). Both proteins are 110 amino acids long. Although there are some regions which have mismatches (for example, the first 27 amino acids), there are stretches where there are none. Of the 19 mismatches, four are marked with a “+.” Like BLASTN, we see some calculations immediately above each alignment, such as the “Score” and “Expect” values. In addition to “Identities,” there is now “Positives” which is the sum of the Identities plus any amino acids aligned with a “+” sign. In the case of the human–rat alignment, this would be 91 identities plus 4 conservative changes = 95. Another way to refer to the Positives is similarities.
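The Identities and Positives values can be recomputed from the two aligned sequences. The sketch below is illustrative rather than a description of BLASTP internals: it copies the human and rat sequences from Figure 4.16, stores only the BLOSUM62 scores (from Figure 4.9) for the mismatched pairs that occur in this alignment, and rebuilds the middle line using the rule that a nonidentity with a score of 1 or more is shown as “+.” All names are chosen here for illustration.

```python
QUERY = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
    "LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
SBJCT = (
    "MALWMRFLPLLALLVLWEPKPAQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED"
    "PQVPQLELGGGPEAGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN"
)

# BLOSUM62 scores for the mismatched pairs in this alignment only
# (keys are the two one-letter codes in alphabetical order).
MISMATCH_SCORES = {
    "FL": 0, "AV": 0, "EG": -2, "DK": -1, "AQ": -1, "KN": 0,
    "PS": -1, "ST": 1, "LP": -3, "GP": -2, "LV": 1, "DS": 0,
    "PT": -1, "GV": -3, "AS": 1, "LR": -2, "DE": 2,
}


def midline(query, subject):
    """Rebuild the BLASTP middle line for an ungapped alignment."""
    out = []
    for a, b in zip(query, subject):
        if a == b:
            out.append(a)
        elif MISMATCH_SCORES["".join(sorted(a + b))] > 0:
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


mid = midline(QUERY, SBJCT)
identities = sum(a == b for a, b in zip(QUERY, SBJCT))
positives = identities + mid.count("+")

print(mid[:60])
print(mid[60:])
print("Identities = %d/%d" % (identities, len(QUERY)))
print("Positives  = %d/%d" % (positives, len(QUERY)))
```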

Distant homologies

Scroll through the alignments and you will observe a high degree of conservation. Insulin is critical to the regulation of blood sugar in many animals so it should not be surprising to see the similarities between distant species. Stop at the alignment with preproinsulin 1 of Oncorhynchus mykiss, the rainbow trout (Figure 4.18). Trout preproinsulin is 105 amino acids long, quite similar to the 110 amino acid human form. Despite the millions of years since we shared an ancestor with the fish, you see that trout preproinsulin is similar to the human sequence. You can also see the impact the “Positives” scoring has on the percentages. The human–trout alignment is 47% identical, but 61% positive or similar: a big jump.

>ref|NP_001118141.1| preproinsulin 1 [Oncorhynchus mykiss]
Length=105

 Score = 82.8 bits (203), Expect = 9e-15, Method: Compositional matrix adjust.
 Identities = 46/96 (47%), Positives = 59/96 (61%), Gaps = 12/96 (12%)

Query  18  GPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSL  77
           G D AAA   QHLCGSHLV+ALYLVCGE+GFFY PK  R+ + L    +       A
Sbjct  19  GADAAAA---QHLCGSHLVDALYLVCGEKGFFYNPK--RDVDPL----IGFLSPKSAKEN  69

Query  78  QPLALEGSLQ---KRGIVEQCCTSICSLYQLENYCN  110
           +    +  ++   KRGIVEQCC   C+++ L+NYCN
Sbjct  70  EEYPFKDQMEMMVKRGIVEQCCHKPCNIFDLQNYCN  105

Prior to amino acid 18, BLASTP could not align the two insulins but there is a solid region of similarity starting around trout amino acid 26 where the sequence conservation is high and probably essential to the function of the protein. As you recall from Figure 4.14, lower-scoring hits viewed with the graphic failed to align with the N-terminus of human preproinsulin but did show similarity starting somewhere around the amino acid 26 coordinate. After the insulin mRNA is translated, the N-terminus of the preproinsulin is cleaved off during the secretion process. The protein folds and creates three pairs of disulfide bonds, chemically linking distant cysteines. The proinsulin protein is then cleaved into two chains, but they remain linked to each other through these disulfide bonds and attain a structure critical to the function of the insulin protein. There are no mismatches in the important cysteine (C) positions, suggesting that there is no radical change in secondary structure. Other regions are very divergent, only showing a limited number of aligned amino acids or conservative substitutions.
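The summary numbers for the human–trout alignment, and the observation about the cysteines, can be confirmed directly from the gapped alignment. This is a minimal sketch with illustrative names; the two aligned strings are copied from Figure 4.18, with “-” marking gap positions.

```python
# Aligned human (query) and trout (subject) segments from Figure 4.18.
QUERY = (
    "GPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSL"
    "QPLALEGSLQ---KRGIVEQCCTSICSLYQLENYCN"
)
SBJCT = (
    "GADAAAA---QHLCGSHLVDALYLVCGEKGFFYNPK--RDVDPL----IGFLSPKSAKEN"
    "EEYPFKDQMEMMVKRGIVEQCCHKPCNIFDLQNYCN"
)

columns = list(zip(QUERY, SBJCT))
length = len(columns)
gaps = sum("-" in pair for pair in columns)
identities = sum(a == b for a, b in columns)

# Every cysteine in the human segment should face a cysteine in trout.
cysteine_partners = [b for a, b in columns if a == "C"]
cysteines_conserved = all(b == "C" for b in cysteine_partners)

print("alignment length:", length)
print("identities:", identities)
print("gaps:", gaps)
print("human cysteines:", len(cysteine_partners), "all conserved:", cysteines_conserved)
```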

4.6 PAIRWISE BLAST

The NCBI BLAST Web forms have the option of performing a pairwise alignment between two sequences of your choosing. BLAST will not necessarily align the two sequences from end to end; should only a short region of similarity be detected by BLASTP, this will be the only alignment you will see. To demonstrate the pairwise BLAST function, let’s return to the gorilla preproinsulin protein. Earlier in this chapter, BLASTP concatenated the names of all the protein sequences that were identical to the human preproinsulin protein when we used this sequence as a query (see Figure 4.13). By running a pairwise BLASTP alignment we should be able to verify the 100% identity between the gorilla and human preproinsulins.

1. Go to the NCBI BLASTP Web form as before. Notice a check box called “Align two or more sequences.”
2. Check that box and the window refreshes, replacing the database choice section with a second window for entering a sequence (see Figure 4.19).
3. Obtain the accession numbers for the gorilla and human preproinsulin sequences from Figure 4.13 or by other means.
4. Enter these two accession numbers as indicated in Figure 4.19.
5. Click on the BLAST button as usual.


Figure 4.18 BLASTP alignment between human preproinsulin (Query) and trout preproinsulin (Sbjct).


Figure 4.19 The NCBI pairwise BLASTP Web form. By clicking on the “Align two or more sequences” box, the form switches from normal BLASTP, with a query searching a database, to a form that aligns sequences of your choosing. This check box is found on all NCBI BLAST forms.

Very quickly, the results will appear and you will see a list of all the identical sequences as before (Figure 4.20). But below that is the alignment between the gorilla and human preproinsulin proteins. This alignment confirms the 100% identity between the two preproinsulins. Notice that the “hit” in the results is the human protein. The human preproinsulin, NP_000198, was entered in the lower part of the form (see Figure 4.19) and is therefore treated as the subject while the gorilla sequence entered at the top of the form is the query.

4.7 RUNNING BLASTP AT THE ExPASy WEBSITE Until now, we have only used sequence analysis tools that are found on the NCBI Website. Although we could use the NCBI tools for almost every chapter of this book, many excellent Websites and bioinformatics applications would be overlooked. ExPASy (rhymes with ecstasy), the Proteomics Server of the Swiss Institute of Bioinformatics, is among the best for tools specifically designed for the study of proteins. The ExPASy version of BLASTP is no exception and we will perform two database searches to demonstrate the features of its interface. First, let’s go to the home page, www.expasy.ch. Take a moment to see what is offered here by clicking on the left sidebar link called “Resources A..Z”: there are a number of databases and tools almost entirely focused on proteins. A comprehensive database created here is the UniProtKB database, made up of two divisions: TrEMBL and the Swiss-Prot databases. TrEMBL (pronounced as “tremble”) stands for “translated EMBL,” and is the non-curated translations of the EMBL DNA database. TrEMBL records are automatically generated by software with little or no human intervention. Many of the entries in Swiss-Prot (“swiss prote,” rhymes with wrote) are famously curated by people who take great care in providing accurate and often deep annotation of a sequence based on the


Figure 4.20 Alignment between gorilla (Query) and human (Sbjct) preproinsulins.

review of the literature and consultation with area experts. When given a choice between using an automatically generated file and a file curated by a proteomics expert, go with the curated file.

Searching for pro-opiomelanocortin using a protein sequence fragment

To perform the following two exercises, we’ll use protein sequences from the ExPASy Website. Near the top of the home page is a search function with a dropdown menu and text box (Figure 4.21). From the menu, select “UniProtKB.” Enter the UniProtKB accession number Q53WY7 and click on the “search” button. Examine this TrEMBL sequence record and you will see that it has little annotation. It is a 30 amino acid fragment of the human pro-opiomelanocortin protein (POMC). With BLASTP we’ll find the full-length version of this protein, gain access to a wealth of annotation, and see exactly where the fragment is within the complete protein. Like the NCBI Web form, you can use either the

Figure 4.21 Search function for the ExPASy Website.


protein sequence or the accession number as a query so there is no need to retrieve the sequence from this record. However, this ExPASy BLASTP form only works on accession numbers of the UniProtKB database; NCBI numbers cannot be used.

Go to the BLASTP form, either by selecting BLAST from the “Resources A..Z” link on the ExPASy home page or by typing in the Web address, web.expasy.org/blast. It is different from the NCBI Web form, but if you look closely you will see many of the same features: a window to enter the query sequence; a section to choose the database and apply species restrictions; a section to change search options; and a launch button. Your experience with the NCBI BLAST forms for BLASTN and BLASTP should have prepared you to use the ExPASy BLASTP form, so little explanation is given here.

Paste Q53WY7 into the appropriate field and choose Mammalia in the database drop-down menu to restrict the search to mammals. For reasons that will become apparent soon, the only change from the defaults for this exercise is that you must uncheck the box for “Filter the sequence for low-complexity regions” (see Figure 4.22). Click the “Run BLAST” button to launch the search.

When the window refreshes, you see a familiar basic format for the top of the results: a description of the query and the database searched. Unlike the NCBI BLASTP results, the table is next (see Figure 4.23). On the left side of the table there are two-letter abbreviations for the database where these hits reside: “sp” for Swiss-Prot and “tr” for TrEMBL. Like the NCBI Website, there are columns for accession number (AC), Description, Score, and E-value.

Next is a fantastic graphical representation of the hits (see Figure 4.24 and color plates). Along the left side are the names of the sequences.
In the center is a graph much like that seen at the NCBI BLAST page: the query goes across the top of the graph, and colored bars represent the query sequence aligning with the hit. The ExPASy color is based on percent identity, while the NCBI coloring is based on the Score. However, the real showpiece is the right graphic. The hits are represented as gray bars and the position of any alignments with the query is shown in color. Comparing the two graphics, look at the first hit, COLI_HUMAN. The query graphic (center panel) shows a solid bar from left to right indicating that the entire query (30 amino acids) aligns with the hit. You don’t know how long the hit is, or which part of it aligns with the query—the beginning, or the end, or somewhere in the middle? On the right graphic the answers appear; the first hit is roughly 200 amino acids long but the area aligning with the query is only a short segment near the middle, depicted in green. There is so much information here, in a simple graphic. Although you can determine the exact location of similarities based on the coordinates of the alignments, this one graph visualizes a lot of this information. You can see that most of the hits are about the same length and the location of the similarities is often aligned as well (in the first half of the protein sequence). There are exceptions and perhaps these can be grouped in some way by exploring their annotation. Note that the scale used on the right graph is not linear.

Figure 4.22 The options menu for ExPASy BLASTP. Filtering of low-complexity regions is controlled here.

Db  AC      Description                                                 Score  E-value
sp  P01189  COLI_HUMAN Pro-opiomelanocortin precursor (POMC) (Cort...   99     2e-21
tr  Q6FHC8  _HUMAN POMC protein (Fragment) [POMC] [Homo sapiens (Hu...  99     2e-21
tr  Q5TZZ7  _HUMAN Proopiomelanocortin (Adrenocorticotropin/ beta-l...  99     2e-21
tr  Q53WY7  _HUMAN Proopiomelanocortin (Fragment) [proopiomelanocor...  99     2e-21
tr  Q53T23  _HUMAN Proopiomelanocortin (Adrenocorticotropin/ beta-l...  99     2e-21
tr  A6XND7  _HUMAN Proopiomelanocortin preproprotein [Homo sapiens ...  94     7e-20
tr  Q8HZB2  _9PRIM Proopiomelanocortin (Fragment) [Gorilla gorilla ...  92     2e-19
tr  Q8HZB3  _PANTR Proopiomelanocortin (Fragment) [Pan troglodytes ...  91     4e-19
tr  Q8HZB1  _PONPY Proopiomelanocortin (Fragment) [Pongo pygmaeus (...  85     2e-17
sp  P01192  COLI_PIG Pro-opiomelanocortin precursor (POMC) (Cortic...   83     1e-16
tr  A8VWB5  _PIG Proopiomelanocortin protein [POMC] [Sus scrofa (Pig)]  83     1e-16
sp  P01190  COLI_BOVIN Pro-opiomelanocortin precursor (POMC) (Cort...   83     1e-16
tr  Q8HZB0  _9PRIM Proopiomelanocortin (Fragment) [Macaca sp]           83     2e-16
tr  Q8HZA9  _SAGOE Proopiomelanocortin (Fragment) [Saguinus oedipus...  83     2e-16
sp  P01191  COLI_SHEEP Pro-opiomelanocortin precursor (POMC) (Cort...   82     3e-16
tr  Q8MIC5  _SHEEP Proopiomelanocortin [pomc] [Ovis aries (Sheep)]      82     3e-16
tr  B4XH73  _CAPHI Alpha-melanocyte-stimulating hormone (Fragment) ...  82     3e-16


Figure 4.23 The results table for ExPASy BLASTP. In this search, the UniProtKB sequence Q53WY7 was used as a query against the mammalian division of the UniProtKB database.
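A results table like the one in Figure 4.23 is easy to handle in a script once each row is split into its five columns. The following Python sketch assumes the rows have been saved as plain text, one hit per line, in the column order of the figure. The names `Hit` and `parse_hit` are illustrative only and are not part of any ExPASy tool.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    db: str           # "sp" (Swiss-Prot), "tr" (TrEMBL), or "sp_vs" (splice variant)
    accession: str
    description: str
    score: int
    e_value: float


def parse_hit(row: str) -> Hit:
    """Split one whitespace-separated results row into its five columns."""
    fields = row.split()
    # The first two and last two fields are fixed; the description is everything between.
    return Hit(fields[0], fields[1], " ".join(fields[2:-2]),
               int(fields[-2]), float(fields[-1]))


# Three rows copied from Figure 4.23.
rows = [
    "sp P01189 COLI_HUMAN Pro-opiomelanocortin precursor (POMC) (Cort... 99 2e-21",
    "tr Q53WY7 _HUMAN Proopiomelanocortin (Fragment) [proopiomelanocor... 99 2e-21",
    "sp P01192 COLI_PIG Pro-opiomelanocortin precursor (POMC) (Cortic... 83 1e-16",
]
hits = [parse_hit(row) for row in rows]

# Keep only the curated Swiss-Prot records.
curated = [hit.accession for hit in hits if hit.db == "sp"]
print(curated)
```

Filtering on the "sp" label is one simple way to follow the advice given earlier in this section: when both kinds of record are available, start with the curated one.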

Figure 4.24 The results graphic for ExPASy BLASTP. The POMC fragment, UniProtKB accession number Q53WY7, was used as a query against the mammalian division of the UniProtKB database. On the far left are sequence names for each row. The center panel depicts the query and the best areas of identity with the hits. The right panel shows the hits as gray bars with the sequence identities with the query shaded lighter gray. See color plates for a color version of this figure.


Figure 4.25 The 30 amino acid query has perfect identity with P01189 between amino acids 75 and 104. The UniProtKB sequence Q53WY7 was used as a query against the mammalian division of the UniProtKB database.

sp P01189 COLI_HUMAN Pro-opiomelanocortin precursor (POMC) (Corticotropin-lipotropin) [Contains: NPP; Melanotropin gamma (Gamma-MSH); Potential peptide; Corticotropin (Adrenocorticotropic hormone) (ACTH); Melanotropin alpha (Alpha-MSH); Corticotropin-like intermediary peptide (CLIP); Lipotropin beta (Beta-LPH); Lipotropin gamma (Gamma-LPH); Melanotropin beta (Beta-MSH); Beta-endorphin; Met-enkephalin] [POMC] [Homo sapiens (Human)] 267 AA

Score = 99.0 bits (226), Expect = 2e-21
Identities = 30/30 (100%), Positives = 30/30 (100%)

    Query: 1   RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR 30
               RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR
    Sbjct: 75  RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR 104
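Because the alignment in Figure 4.25 has no gaps, the Query and Sbjct coordinates are related by simple arithmetic: each step along the query is one step along the subject. A minimal Python sketch of that mapping follows; the function name is made up for illustration, and the calculation does not hold for gapped alignments.

```python
def query_to_subject(query_pos: int, query_start: int, subject_start: int) -> int:
    """Map a 1-based query position onto the subject in an ungapped alignment."""
    return subject_start + (query_pos - query_start)


# Query 1-30 aligns with subject positions starting at 75 (Figure 4.25).
first = query_to_subject(1, query_start=1, subject_start=75)
last = query_to_subject(30, query_start=1, subject_start=75)
print(first, last)
```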

Scroll down to the alignments and the first hit is the full-length version of POMC, a Swiss-Prot record called COLI_HUMAN, accession number P01189 (Figure 4.25). The alignments are much like what is seen in the NCBI version of BLASTP. From the brief annotation, we learn that P01189 is 267 amino acids long and amino acids 75–104 align with the query. Looking at the graphic, most of the green color is to the left of the 100 (amino acid 100 on the scale aligns with the “1” of “100,” amino acid 200 aligns with the “2,” etc.), consistent with the alignment coordinates of 75–104.

Remembering that Swiss-Prot records are rich in annotation, explore P01189 (the accession numbers are hypertext) and you will learn that POMC is processed by proteases into 11 different peptides. The function of four of these peptides is well understood.

1. ACTH (adrenocorticotropic hormone) stimulates the release of another hormone, cortisol, from the adrenal glands. Cortisol is associated with stress.
2. MSH (melanocyte-stimulating hormone) influences the pigmentation of skin.
3. Beta-endorphin is an endogenous opioid, famously released when you run a long distance.
4. Met-enkephalin is another endogenous opioid.

Another section describes the connections between these peptides and diseases. The record also includes an extensive description of the Gene Ontology of POMC, or where and how this protein fits into the grand scheme of biology. For example, the peptides derived from POMC are involved with cell–cell signaling, they are secreted, and the receptors that bind the peptides are identified. Next come the amino acid coordinates for the 11 peptides. Based on the alignment coordinates from Figure 4.25, amino acids 75–104, the location of our fragment within the full-length protein, correspond to the melanotropin gamma peptide and part of a peptide with an unknown function. Also included are a large number of references and hyperlinks to dozens of other database records that provide an absolute wealth of information on POMC.

When this BLASTP search was launched, you were instructed to uncheck a box that controlled the filtering of simple sequences. To see the impact that complexity filtering has on the BLASTP result, repeat this BLASTP search with Q53WY7 and, this time, check the box under Options for “Filter the sequence for low-complexity regions.” Be sure to keep the current BLASTP results window, too, so you may easily compare them. You will see (Figure 4.26) that when this filtering

Score = 75.3 bits (170), Expect = 3e-14
Identities = 22/30 (73%), Positives = 22/30 (73%)

    Query: 1   RKYVMGHFRWDRFGRRNXXXXXXXXAGQKR 30
               RKYVMGHFRWDRFGRRN        AGQKR
    Sbjct: 75  RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR 104

is applied, the serine- and glycine-rich region of our query, SSSSGSSG, is turned into XXXXXXXX. This filtering is applied by default to shield you from nonspecific hits caused by regions of low complexity. Also notice that the calculations now indicate lower percent identity (that is, 73% versus 100%), a lower Score, and a less significant Expect value. In this case, turning off the filtering allows a clearer alignment and a more accurate representation for our very short query.


Figure 4.26 With low-complexity filtering turned on, eight amino acids of Q53WY7 are turned to “X.” UniProtKB sequence Q53WY7 was used as a query against the mammalian division of the UniProtKB database. Unlike the previous search, the low-complexity filtering parameter was checked. The percent identity has dropped from 100% to 73%.
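The drop from 100% to 73% identity can be reproduced by hand: once eight residues of the query are replaced by “X,” they can no longer match anything. The Python sketch below illustrates only this counting step. It is given the low-complexity region explicitly, which is an assumption made for the example; a real filter has to detect such regions itself, and that detection is not shown here.

```python
def mask(sequence: str, region: str) -> str:
    """Replace the first occurrence of a known low-complexity region with X's."""
    return sequence.replace(region, "X" * len(region), 1)


def count_identities(query: str, subject: str) -> int:
    """Count positions where two aligned, ungapped sequences carry the same residue."""
    return sum(q == s for q, s in zip(query, subject))


subject = "RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR"   # P01189, amino acids 75-104
query = "RKYVMGHFRWDRFGRRNSSSSGSSGAGQKR"     # Q53WY7

masked_query = mask(query, "SSSSGSSG")
for label, q in (("unfiltered", query), ("filtered", masked_query)):
    identities = count_identities(q, subject)
    percent = 100 * identities // len(q)
    print(f"{label}: {q} {identities}/{len(q)} ({percent}%)")
```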

Searching for repeated domains in alpha-1 collagen

Now let’s perform a search with a more complicated result so you can see the additional power of the graphics. On the ExPASy home page, retrieve the record for UniProtKB accession number Q9UMA6. This record is a 27 amino acid fragment of the human alpha-1 procollagen protein. Like the POMC example, above, we’ll find the full-length version of this protein, gain access to the extensive annotation on this protein, and see the relationship between the query and the hits.

Go back to the ExPASy BLASTP form and enter Q9UMA6 as the query. From the database choice drop-down menu, select Homo sapiens so the search is restricted to the human sequences. Launch the search. When the window refreshes, look at the BLASTP results table (Figure 4.27). The original table is quite long and, in this shortened version, rows were deleted (indicated by dots) to bring some rows at the bottom into view. Going down the first column, we see a new label for a number of hits, “sp_vs.” This indicates that they are Swiss-Prot records where there is variable splicing of the transcript, resulting in different protein products or isoforms from the same gene.

Figure 4.27 The BLASTP results of query Q9UMA6. The human division of the UniProtKB database was searched. Many rows of the results table were deleted, indicated by dots, to shorten the table.

Db     AC        Description                                                 Score  E-value
sp     P02452    CO1A1_HUMAN Collagen alpha-1(I) chain precursor (Alpha...   84     2e-17
tr     Q9UMA6    _HUMAN Type I collagen alpha 1 chain (Fragment) [COL1A1...  84     2e-17
tr     Q6LAN8    _HUMAN Collagen type I alpha 1 (Fragment) [COL1A1] [Hom...  84     2e-17
sp     P05997    CO5A2_HUMAN Collagen alpha-2(V) chain precursor [COL5A...   57     3e-09
tr     B4DNJ0    _HUMAN cDNA FLJ60734, highly similar to Collagen alpha-...  57     3e-09
sp     P13942    COBA2_HUMAN Collagen alpha-2(XI) chain precursor [COL1...   45     1e-05
tr     Q5STP6    _HUMAN Uncharacterized protein [COL11A2] [Homo sapiens ...  45     1e-05
tr     H0YIS1    _HUMAN Collagen alpha-2(XI) chain (Collagen, type XI, a...  45     1e-05
tr     H0YHY3    _HUMAN Collagen alpha-2(XI) chain [COL11A2] [Homo sapie...  45     1e-05
tr     H0YHN2    _HUMAN Collagen alpha-2(XI) chain [COL11A2] [Homo sapie...  45     1e-05
tr     H0YHM9    _HUMAN Collagen alpha-2(XI) chain [COL11A2] [Homo sapie...  45     1e-05
. . .
sp_vs  P13942-2  Isoform 2 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-3  Isoform 3 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-4  Isoform 4 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-5  Isoform 5 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-6  Isoform 6 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-7  Isoform 7 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05
sp_vs  P13942-8  Isoform 8 of Collagen alpha-2(XI) chain OS=Homo sapi...     45     1e-05


The first hit, Swiss-Prot record P02452, is the full-length version of the alpha-1 procollagen protein. Visit the Swiss-Prot record for this protein and you will be taken to a richly annotated file, as we saw for POMC. Collagen is the most abundant protein in your body and this is a great opportunity to learn more about it.

Scroll down to the graphic of the BLASTP search (see Figure 4.28 and color plates) and you will now see something that is perhaps puzzling. The center panel mostly shows single colored bars; for example, the top bars are green. But the right panel shows a single green bar and multiple yellow bars on the gray background of the hits; the query aligns to multiple places on the hit. There is one area of strong similarity (green) but there are other locations with weaker similarity (shades of yellow). This can’t be shown in the center panel, which has room for only one bar per hit, but is easily seen in the right panel.

This may be better understood while looking at the alignments. The top six alignments of the first hit are shown in Figure 4.29. Notice that the first alignment is 100% identical and the coordinates are amino acids 353–379 (this is the green bar on the graphic). However, there are other places on the hit where the query can be aligned and they can be seen on both the right graph and the alignments. There are many other hits that can align in multiple places with the query. They are all clearly related, based on the similar pattern of “stripes” this query has left on them in the graphic seen in the right side of Figure 4.28 (see also color plates). Exploring the annotation of these hits will shed light on their relationships.

Figure 4.28 The results graphic from BLASTP query Q9UMA6. The human division of UniProtKB was searched. Note the multiple regions of similarity between the query and the hits in the right panel. See color plates for a color version of this figure.

Let’s return to the table of hits for the query, a collagen fragment (Figure 4.27). The first hit is the full-length collagen alpha-1 that contains the fragment sequence, and the second hit is the fragment itself (self-hit). Despite the fact that the full-length sequence is 1464 amino acids long and the fragment is 27 amino acids long, the values for the Score and the E value are identical for the two hits. This demonstrates that these values are dependent on the length and quality of the alignment, and independent of the length of the Subject.
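This independence from the Subject length is what the usual form of the BLAST statistics predicts. The Expect value is computed from the bit score and the size of the search space, E = m x n x 2^(-S'), where m is the query length, n is the total length of the database (not of the individual hit), and S' is the bit score. The Python sketch below illustrates the relationship. The database size is a made-up number, and real BLAST programs apply further corrections such as effective lengths, so the output is not expected to reproduce the values in the figures.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2**(-S'): expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 27               # the Q9UMA6 fragment
DATABASE_LENGTH = 10_000_000    # hypothetical total number of residues searched

# Two hits with the same bit score in the same search receive the same E value,
# whether the hit itself is 27 or 1464 amino acids long: the hit's own length
# does not appear anywhere in the formula.
e_full_length = expect_value(84.2, QUERY_LENGTH, DATABASE_LENGTH)
e_fragment = expect_value(84.2, QUERY_LENGTH, DATABASE_LENGTH)
print(e_full_length == e_fragment)

# Each additional bit of score halves the E value.
print(expect_value(85.2, QUERY_LENGTH, DATABASE_LENGTH) / e_full_length)
```

The same formula explains why restricting a search to a smaller database (a smaller n) gives a lower E value for the same alignment.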

sp P02452 CO1A1_HUMAN Collagen alpha-1(I) chain precursor (Alpha-1 type I collagen) [COL1A1] [Homo sapiens (Human)] 1464 AA

Score = 84.2 bits (191), Expect = 6e-17
Identities = 27/27 (100%), Positives = 27/27 (100%)

    Query: 1   GEAGPQGPRGSEGPQGVRGEPGPPGPA 27
               GEAGPQGPRGSEGPQGVRGEPGPPGPA
    Sbjct: 353 GEAGPQGPRGSEGPQGVRGEPGPPGPA 379

Score = 44.3 bits (97), Expect = 6e-05
Identities = 20/36 (55%), Positives = 20/36 (55%), Gaps = 9/36 (25%)

    Query: 1   GEAGPQGPRGSEGPQGVRGEPGP---------PGPA 27
               GEAG QGP G  GP G RGE GP         PGPA
    Sbjct: 614 GEAGAQGPPGPAGPAGERGEQGPAGSPGFQGLPGPA 649

Score = 43.1 bits (94), Expect = 1e-04
Identities = 19/30 (63%), Positives = 19/30 (63%), Gaps = 3/30 (10%)

    Query: 1   GEAGPQGPRGSEGPQGV---RGEPGPPGPA 27
               GE GP GP G  G  G    RGEPGPPGPA
    Sbjct: 782 GESGPSGPAGPTGARGAPGDRGEPGPPGPA 811

Score = 40.5 bits (88), Expect = 8e-04
Identities = 20/35 (57%), Positives = 20/35 (57%), Gaps = 9/35 (25%)

    Query: 1   GEAGP---QGPRGS---EGPQGVRGEPGP---PGP 26
               GE GP   QGP G    EG  G RGEPGP   PGP
    Sbjct: 449 GEPGPVGVQGPPGPAGEEGKRGARGEPGPTGLPGP 483

Score = 38.8 bits (84), Expect = 0.003
Identities = 18/36 (50%), Positives = 19/36 (52%), Gaps = 11/36 (30%)

    Query: 1   GEAGPQGPRGSEGPQGVR----------GEPGPPGP 26
               G+ GP GPRG  GP G R          G PGPPGP
    Sbjct: 115 GDTGPRGPRGPAGPPG-RDGIPGQPGLPGPPGPPGP 149

Score = 36.7 bits (79), Expect = 0.012
Identities = 20/42 (47%), Positives = 20/42 (47%), Gaps = 15/42 (35%)

    Query: 1   GEAGPQGPRGS---EGPQGVRGE------------PGPPGPA 27
               G AGP GP G    EG  G RGE            PGPPGPA
    Sbjct: 890 GNAGPPGPPGPAGKEGGKGPRGETGPAGRPGEVGPPGPPGPA 931
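The Identities and Gaps figures reported for each alignment in Figure 4.29 can be checked by counting columns. The Python sketch below does this for two aligned rows in which “-” marks a gap; the function name is illustrative. The percentages in the figure appear to be rounded down (20/36 is shown as 55%), so the sketch uses integer division on that assumption.

```python
def alignment_stats(query_row: str, subject_row: str) -> tuple[int, int, int]:
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query_row, subject_row))
    identities = sum(q == s and q != "-" for q, s in columns)
    gaps = sum(q == "-" or s == "-" for q, s in columns)
    return identities, gaps, len(columns)


# Second alignment of Figure 4.29 (subject amino acids 614-649).
query_row = "GEAGPQGPRGSEGPQGVRGEPGP---------PGPA"
subject_row = "GEAGAQGPPGPAGPAGERGEQGPAGSPGFQGLPGPA"

identities, gaps, length = alignment_stats(query_row, subject_row)
print(f"Identities = {identities}/{length} ({100 * identities // length}%), "
      f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```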

Finally, there are seven isoforms, or alternative splice forms, listed for collagen alpha-2 (XI) in the results table. As indicated earlier, they are labeled “sp_vs” on the left column. They also appear in the graphic as P13942-2 to -8. Looking very carefully at the right graph, you can see that these sequences vary slightly in length and the yellow areas of similarity with the query move to the left or right, depending on the regions that are spliced out. This can be gathered from the alignments as well, with the added benefit of getting exact coordinates from the alignments. But the graphic captures the complex details of all these alignments in an easy-to-understand view. But what do all these “stripes” on the right graphic, and these multiple regions of similarity to our query, mean? What is witnessed here is the history of this protein. Early in evolution there was a protein with one domain similar to the


Figure 4.29 Six of the regions of similarity between the query Q9UMA6 and hit P02452. The top alignment of 100% identity is represented as the green bar in Figure 4.28 (see color plates) while the other alignments are shown as shades of yellow.


27 amino acid query. In time, the gene with this single domain had duplication events that generated multiple copies of this domain within the same protein. Over time, these domains diverged from each other, perhaps taking on new functions, and what we now see are those multiple domains scattered on the length of full-length alpha-1 procollagen. This domain is seen on other proteins as well, indicating that a single gene with multiple domains probably gave rise to other genes which have since diverged in sequence and function, but nevertheless maintain similarity in these domains.

4.8 SUMMARY

In this chapter, another member of the BLAST suite of database searching tools was introduced; BLASTP uses protein queries to search a protein database. Amino acids were described, and their overlapping properties allow many biochemically conservative substitutions in nature. In BLASTP alignments, these non-identical pairings are scored according to a substitution matrix, allowing distantly related proteins to be identified and compared. BLASTP searches were run both at the NCBI and the ExPASy Websites. Interesting properties of proteins were introduced, including the processing of proteins into smaller peptides and the repetition of related domains within a single protein.

EXERCISES

Exercise 1: Typing contest

This contest is actually educational and many students of sequence analysis find it fun. The E value and the other statistics generated by BLASTP are a function of the length and the quality of the alignment. Before getting to more serious exercises, we’ll have a typing contest which can be run in a number of ways: (a) type for a fixed time (for example, 2 minutes); or (b) have a race with others where everyone stops typing when the winner finishes first. Even fast and accurate typists will find that they are slower and less accurate when typing sequences rather than English words. With no spaces and all the characters in uppercase, it is easy to lose track of where you are.

1. Type the sequence (Figure 4.30) directly into an NCBI BLASTP query window.
2. When the typing has ended, run a BLASTP search against RefSeq proteins (no species defined).
3. Look at the Scores, E values, and percent identities. The sequence is from a real protein (can you identify it?) so you can achieve 100% identity to the first hit. If you were a fast and accurate typist, your E value will be lower (better) than other attempts where fewer amino acids were typed or more errors were made.
4. Record your query length, Score, percent identity, and E value.
5. Repeat the contest and compare these values to your first run.
6. Look at the alignments and notice if you accidentally typed in any conservative substitutions.
7. Based on the differences between repeat runs, determine why your E value changed.

Figure 4.30 Typing contest sequence.

DGQRGGGGGATGSVGGGKGSGVGISTGGWVGGSYFTDSYVITKNTRQFLVK IQNNHQYKTELISPSTSQGKSQRCVSTPWSYFNFNQYSSHFSPQDWQRLTN EYKRFRPKGMHVKIYNLQIKQILSNGADTTYNNDLTAGVHIFCDGEHAYPN ATHPWDEDVMPELPYQTWYLFQYGYIPVIHELAEMEDSNAVEKAICLQIPF FMLENSDHEVLRTGESTEFTFNFDCEWINNER


Exercise 2: How mammoths adapted to cold

Mammals have adapted to cold climates through both physical and physiological changes that allow them to survive and thrive in colder temperatures. Compared to their relatives in warmer climates, cold-adapted mammals have a heavier coat of hair and more abundant fat to keep them warm. Mammals can also limit heat loss by allowing their limbs to be colder than their bodies. However, these colder legs present a problem for oxygen transport. Globin proteins within red blood cells bind oxygen in the lungs and carry it elsewhere to be released. However, when blood cells are cold, they do not release oxygen very readily and cold limbs could quickly become starved of oxygen. Changes to the globin protein sequence are necessary to facilitate oxygen release at lower temperatures.

This was dramatically demonstrated when the globin genes of an extinct species, the woolly mammoth, were identified. Mammoths lived in an arctic climate and died off approximately 10,000 years ago, but well-preserved remains are frequently found in the present-day tundra. Globin DNA sequence was recovered from these remains and, through genetic engineering, translated and tested for oxygen binding at different temperatures. Campbell and colleagues were able to demonstrate that, compared to the Asiatic elephant (a close relative), the mammoth globin sequence had amino acid substitutions that allowed it to release oxygen at low temperatures.

In this exercise you will identify these changes by comparing the mammoth protein sequence to that of the Asiatic elephant. Specifically, a globin protein called beta/delta hybrid will be compared. This will be accomplished using pairwise BLASTP at the NCBI Website. You will later compare the nucleotide sequences and identify the changes at the codon level.

Pairwise comparison of mammoth and elephant globin proteins

1. Go to the NCBI BLASTP Web form and click the check box to turn it into a pairwise comparison tool.
2. In the two sequence fields, enter the accession number for the mammoth protein, ACV41408, and the Asiatic elephant protein, ACV41395, and click on the BLAST button.
3. When the screen refreshes, examine the alignment and identify the amino acid substitutions in the globin protein.
4. Using the BLOSUM62 matrix (Figure 4.9), what are the matrix scores for these substitutions?
5. Keep this screen open for later reference.
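Once two aligned protein sequences have been copied out of a pairwise alignment, the substitutions in step 3 can also be listed programmatically. The Python sketch below is a hedged illustration: the two sequences are short made-up strings, not the mammoth and elephant globins, and the function name is hypothetical. Positions count alignment columns, so they equal residue numbers only when the alignment has no gaps.

```python
def substitutions(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """List (1-based alignment column, residue in A, residue in B) for every mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [
        (column, a, b)
        for column, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b and "-" not in (a, b)   # gap columns are not substitutions
    ]


# Toy aligned sequences, for illustration only.
toy_a = "MKTAYIAK"
toy_b = "MKSAYIVK"
for column, a, b in substitutions(toy_a, toy_b):
    print(f"column {column}: {a} -> {b}")
```

Each reported pair can then be looked up in the BLOSUM62 matrix (Figure 4.9), as asked in step 4.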

Nucleotide changes in the woolly mammoth beta/delta hybrid gene

For this part of the analysis, you will need to retrieve the coding sequences of the two globin mRNAs. That is, we want to only compare the nucleotides that encode the amino acids and ignore the nucleotides of the untranslated regions of the mRNAs. The NCBI makes this quite easy with just a few clicks. When looking at a nucleotide record for each animal, there is an annotation “Feature” section called CDS (coding sequence). Click on the hypertext “CDS” and the screen will refresh, with the regions of the genomic DNA sequence that correspond to the predicted mRNA highlighted. In the bottom-right corner of the screen you will see “Display: FASTA GenBank Help.” Click on the “GenBank” link and a new page is created where you just have the coding sequence. You can verify this by looking at the first three nucleotides (usually the codon for methionine, ATG) and the last three nucleotides (a termination codon, either TAA, TAG, or TGA). The length of the sequence will now be much shorter. Convert this to FASTA format by clicking on either the FASTA link on the bottom of the page, or the other FASTA link at the upper left of the page.
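The visual check described above (ATG at the start, a termination codon at the end) can be written as a short test. The Python sketch below adds one further assumption that is not stated in the text: a complete coding sequence should have a length divisible by three. It is a quick plausibility check with a made-up function name, not a validator, since the text notes that the first codon is only usually ATG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_cds(sequence: str) -> bool:
    """Heuristic: starts with ATG, ends with a stop codon, length is a multiple of 3."""
    seq = sequence.upper()
    return (
        len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


# Toy sequences, for illustration only.
print(looks_like_cds("ATGGAATAA"))     # start codon, one codon, stop codon
print(looks_like_cds("GGATGGAATAA"))   # extra leading nucleotides, as in an mRNA with a 5' UTR
```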


1. Retrieve the NCBI files for the woolly mammoth (FJ716094) and Asiatic elephant (FJ716083) beta/delta hybrid globin mRNAs.
2. For each record, retrieve only the coding region as instructed above.
3. For each record, display the FASTA formatted file and paste these sequences separately into the two text fields of the NCBI BLASTN Web form for a pairwise alignment.
4. Click on the BLAST button. When the screen refreshes, you will see the pairwise alignment between the mammoth and Asiatic elephant globin coding sequences. In the protein sequence comparison, above, you identified three amino acid substitutions. You can now see the nucleotide changes responsible for these amino acid substitutions. Normally you would need coordinates to verify that you are working with the correct nucleotide changes. But in this case, there are only three nucleotide differences between these related species.
5. Use the genetic code (Figure 4.2) to determine the nucleotide changes that occurred to create these three amino acid substitutions. Manually translate the nucleotides into amino acids in the vicinity of each nucleotide mismatch to determine if the change was in the first, second, or third position of the affected codons. Since you know the amino acid sequence at each substitution for each animal, translating different groups of three nucleotides will allow you to answer this question.
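The manual translation in step 5 can be cross-checked with a short script once you have worked it out by hand. The Python sketch below assumes two coding sequences of equal length that start at the first codon position, and it uses the standard genetic code. The sequences shown are made-up toys, not the globin genes, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_changes(cds_a: str, cds_b: str) -> list[tuple[int, int, str, str, str, str]]:
    """For each nucleotide mismatch, return
    (codon number, position in codon, codon A, codon B, amino acid A, amino acid B).
    A codon with two mismatches is reported twice."""
    a, b = cds_a.upper(), cds_b.upper()
    if len(a) != len(b) or len(a) % 3 != 0:
        raise ValueError("sequences must be the same length and a multiple of 3")
    changes = []
    for i, (base_a, base_b) in enumerate(zip(a, b)):
        if base_a != base_b:
            start = i - i % 3                      # first nucleotide of the affected codon
            codon_a, codon_b = a[start:start + 3], b[start:start + 3]
            changes.append((i // 3 + 1, i % 3 + 1, codon_a, codon_b,
                            CODON_TABLE[codon_a], CODON_TABLE[codon_b]))
    return changes


# Toy coding sequences, for illustration only.
toy_changes = codon_changes("ATGAAAGTCTAA", "ATGAGAGTTTAA")
for codon_number, position, codon_a, codon_b, aa_a, aa_b in toy_changes:
    print(f"codon {codon_number}, position {position}: "
          f"{codon_a} ({aa_a}) -> {codon_b} ({aa_b})")
```

In the toy example the second change falls in the third codon position and leaves the amino acid unchanged, a reminder that not every nucleotide difference produces a substitution in the protein.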

Exercise 3: Longevity genes?

It has been known for many years that a calorie-restricted diet can lead to the extension of life expectancy. Early experiments in yeast, roundworm, and fly have shown that by limiting food, life expectancy can be extended by 30–40% with no apparent ill effects. Similar experiments with mice, rats, and nonhuman primates have shown that this effect has been conserved throughout evolution. In a spectacular demonstration of the effects of calorie restriction on the aging process, Colman and colleagues placed Rhesus macaque monkeys on a calorie-restricted diet and studied them for 20 years. Be sure to look at pictures of these monkeys, and their matched controls with no calorie restrictions, in the reference provided at the end of this chapter.

Many people are interested in living longer yet balk at the idea of reducing calorie consumption by one-third, the restriction necessary to attain benefit in other mammals. However, a factor in red wine called resveratrol has also been shown to modulate aging factors in flies. Could this be the answer to living longer?

Research has led to the discovery of a protein family called sirtuins that are regulated through calorie restriction and appear to be players in the pathways responsible for extending life span. Manipulation of these genes has defined some of their functions and they include effects on gene regulation, reactions to stress, respiration, and metabolism. Interestingly, the wine compound resveratrol has also been shown to modulate sirtuin activity, thus joining the two manipulations, calorie restriction and resveratrol treatment, at the molecular level.

In this exercise, you are going to find orthologs and paralogs of human SIR2, a member of the sirtuin family. Since the original experiments were done in yeast (Saccharomyces cerevisiae), you’ll start with the yeast Sir2p protein as a query and search for protein sequences using BLASTP. After finding the SIR2 ortholog in a higher organism, you will then use that organism’s protein sequence to find paralogs. You will need to stay organized. There will be multiple queries and multiple database restrictions. You should always be aware of what you are doing, and why you are doing it.

1. Go to the NCBI Website and use Sir2p and cerevisiae as your query terms for the Entrez interface. When you have a list of protein sequences, find the hypertext filter for the results (upper right of the results page) and click on “RefSeq.” From the shorter list of results, identify the yeast reference sequence for Sir2p. Most RefSeq proteins begin with “NP_.”
2. Using this yeast Sir2p reference protein sequence as a query, run a BLASTP search against RefSeq proteins, with the database restricted to Saccharomyces cerevisiae.
3. You should find your query in the list of hits. In addition, did you find any paralogs? Do they have names similar to Sir2p?
4. Exploring the RefSeq annotation for these top hits (via the hypertext RefSeq accession numbers in your BLASTP results), do these suspected paralogs have similar function? Make a list of their names and probable functions.
5. Using the yeast Sir2p reference protein sequence as a query, run another BLASTP search against RefSeq proteins, restricted to proteins of Drosophila melanogaster.
6. Can you identify the Drosophila ortholog to yeast Sir2p? What is its name?
7. Can you identify any Drosophila paralogs to Sir2? Make a list of their names and leave this window open.
8. Using the Drosophila Sir2 reference protein as a query, run another BLASTP search against Drosophila RefSeq proteins.
9. Compare the hits from the yeast query (step 5) to the hits with the Drosophila query (step 8). How do they compare?
10. How does the number of hits compare between the searches performed in step 5 and step 8? Explain why the numbers in the lists are different.
11. Using the yeast Sir2p reference protein as a query, perform a BLASTP search against RefSeq proteins, restricted to proteins of Homo sapiens.
12. Can you identify the Homo sapiens ortholog to yeast Sir2p? Is it an easy decision?
13. How many human SIR2 paralogs can you identify? Do not count the different isoforms. Keep this window open.
14. Using the human SIR2 reference protein as a query, run another BLASTP search against the human reference proteins.
15. How does the list of hits from step 11 compare to the list of hits from step 14?
16. Looking back at the graphics when the yeast Sir2p protein was used against the Drosophila (step 5) and human (step 11) databases, what were the coordinates of the domain with the highest similarity to the best hits? Using the Conserved Domains graphic, what does this domain correspond to?
17. Using the human SIR2 reference protein sequence, find the Macaca SIR2 ortholog using BLASTP, with species restriction to Macaca mulatta. How similar are these two orthologs?

FURTHER READING

Alberts B, Johnson A, Lewis J et al. (2008) Molecular Biology of the Cell, 5th ed. New York: Garland Science. A textbook for general knowledge of amino acids and proteins.

Campbell KL, Roberts JE, Watson LN et al. (2010) Substitutions in woolly mammoth hemoglobin confer biochemical properties adaptive for cold tolerance. Nat. Genet. 42, 536–540. This paper describes the work on the mammoth adaptation to cold weather.

Colman RJ, Anderson RM, Johnson SC et al. (2009) Caloric restriction delays disease onset and mortality in rhesus monkeys. Science 325, 201–204. This paper is mentioned in the exercise on longevity genes. Be sure to see the pictures of the monkeys in this article. There are additional pictures on the Internet.

Donmez G & Guarente L (2010) Aging and disease: connections to sirtuins. Aging Cell 9, 285–290. This paper describes the work mentioned in the longevity genes exercise.

Eddy S (2004) Where did the BLOSUM62 alignment score matrix come from? Nat. Biotechnol. 22, 1035–1036. This short article is a description of the BLOSUM62 matrix, widely used in bioinformatics applications.

Gasteiger E, Gattiker A, Hoogland C et al. (2003) ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31, 3784–3788. A general article on the ExPASy Website.

Henikoff S & Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA 89, 10915–10919. This article describes the construction of the blocks database and the derived matrices.

Pritchard LE & White A (2007) Neuropeptide processing and its impact on melanocortin pathways. Endocrinology 148, 4201–4207. This article discusses the proteolytic processing of POMC and the biology of the peptides generated by this cleavage.


CHAPTER 5

Cross-Molecular Searches: BLASTX and TBLASTN

Key concepts
• mRNA structure and cDNA synthesis
• BLASTX uses a nucleotide query to search a protein database
• TBLASTN does the opposite: it uses a protein query to search a nucleotide database
• Using BLASTX and TBLASTN to analyze cDNA sequences and probe huge cDNA libraries for sequences of interest

5.1 INTRODUCTION

So far, we have studied versions of BLAST that look for sequence identities and similarities within the same molecule type: BLASTN uses a nucleotide query to search a nucleotide database, and BLASTP uses a protein query to search a protein database. Two other versions of BLAST cross molecular boundaries. BLASTX uses a nucleotide query to search a protein database, and TBLASTN uses a protein query to search a nucleotide database (Table 5.1). BLASTX (pronounced “BLAST-ex”) and TBLASTN (pronounced “tea-BLAST-en”) are particularly useful when studying cDNA, so this chapter will begin by revisiting mRNA structure and cDNA synthesis, providing details that will make interpreting BLAST output easier.

Table 5.1 Four of the BLAST applications

Type       Query         Database
BLASTN     Nucleotide    Nucleotide
BLASTP     Protein       Protein
BLASTX     Nucleotide    Protein
TBLASTN    Protein       Nucleotide

BLASTN was covered in Chapter 3 and BLASTP was covered in Chapter 4.
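To make the idea of "crossing molecular boundaries" concrete, here is a minimal Python sketch of the conceptual step that lets a nucleotide sequence be compared with proteins: translating it in all six reading frames. It assumes the standard genetic code and is only an illustration of the concept; it is not the NCBI implementation, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    frames = {}
    minus_strand = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(minus_strand[offset:])
    return frames


if __name__ == "__main__":
    # First 21 nucleotides of the amelotin coding sequence shown in Box 6.3.
    for frame, protein in six_frame_translation("ATGAGGAGTACGATTCTACTG").items():
        print(frame, protein)
```

The "Frame = +1" or "Frame = +2" lines in TBLASTN and BLASTX reports refer to these same six frames.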


5.2 MESSENGER RNA STRUCTURE

Eukaryotic protein-encoding genes are transcribed by a protein complex containing RNA polymerase II. In an exquisitely evolved process, this complex will faithfully transcribe genomic DNA, often for thousands of nucleotides, and generate an RNA copy of the gene. Other enzymes then remove the transcribed introns and carry out further post-transcriptional processing, resulting in the mature RNA known as messenger RNA (mRNA).

There are three major regions of mRNA that are important for function (Figure 5.1). As the name suggests, the coding region is translated into protein sequence. The coding region is flanked by untranslated regions (UTR, pronounced “you-tea-are”) that are referred to as the 5′ untranslated and 3′ untranslated regions, or the 5′ UTR and 3′ UTR, respectively.

The 5′ UTR is usually less than several hundred nucleotides in length, and may be encoded by multiple exons. Shortly after transcription starts, the end of the nascent transcript is modified with a 5′ methyl “cap.” Features within the 5′ UTR are recognized by the ribosome and associated co-factors to initiate translation. The 5′ UTR often contains ATG codons that mark the beginning of short open reading frames, although these do not encode the principal translation product of the mRNA.

The coding region starts with ATG (the codon for methionine) and ends with one of the three terminator codons: TAA, TGA, or TAG. The size of the coding region ranges from several hundred to thousands of nucleotides in length. For example, titin has the largest coding region of any human gene, with over 300 exons and more than 80,000 coding nucleotides. Even though many amino acids can be specified by several different codons (see Figure 4.2), there is a bias to use certain codons and the differences between organisms can be quite striking. For example, in humans, the leucine codon CTG is used more than five times as often as CTA (40% versus 7% of the time). In Arabidopsis (thale cress), CTG is only used 5% of the time and the most favored codon for leucine is CTC (45%). It is thought that the observed codon bias reflects the need to either optimize or regulate translation rates and that codon choice is tied to the abundance of corresponding transfer RNAs for the codons.

The length of the 3′ UTR ranges from less than one hundred nucleotides to thousands of bases. It is not unusual for the 3′ UTR to be longer than the 5′ UTR and coding region combined. Shortly after a gene is completely transcribed, a “tail” of A nucleotides is added to the 3′ end of the mRNA. Unlike the rest of the mRNA, the poly(A) tail is synthesized in the absence of a template: there is no genomic sequence that corresponds to the poly(A) tail at the 3′ end of the last exon. This tail is quite long, often 200 nucleotides in length, and is thought to delay the degradation of mRNA. The 3′ UTR was widely ignored until it was recognized as containing important regulatory elements for translation and mRNA stability. In fact, you will find many sequence records in which only the coding region of mRNA is present and the UTR sequence is absent. In recent years, these elements in the 3′ UTR have been studied intensely, and we’ll be learning about them in Chapter 10.
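Codon bias of the kind described above can be measured directly from a coding sequence. The following Python sketch counts how often each leucine codon is used, as a fraction of all leucine codons in the sequence. It assumes the input is a coding region already in frame (starting at the ATG); the function name `codon_usage` and the toy sequence are hypothetical and for illustration only.

```python
from collections import Counter

# The six codons that specify leucine in the standard genetic code.
LEUCINE_CODONS = ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"]


def codon_usage(cds, codons):
    """Return the fraction of use of each listed codon within an in-frame CDS.

    Fractions are relative to the total count of the listed codons, so for
    the six leucine codons they sum to 1.0 when any leucine codon is present.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[codon] for codon in codons)
    if total == 0:
        return {codon: 0.0 for codon in codons}
    return {codon: counts[codon] / total for codon in codons}


if __name__ == "__main__":
    # Toy coding region: ATG CTG CTG CTA CTC TAA
    usage = codon_usage("ATGCTGCTGCTACTCTAA", LEUCINE_CODONS)
    for codon, fraction in usage.items():
        print(codon, f"{fraction:.0%}")
```

Run on a large set of coding regions from one organism, a tally like this is how figures such as "CTG 40%, CTA 7%" are obtained; a single short gene will give much noisier numbers.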

Figure 5.1 Messenger RNA (mRNA) structure. Eukaryotic mRNA has a coding region flanked by untranslated regions (5′ UTR and 3′ UTR). The 3′ end is polyadenylated after transcription of the genomic DNA is completed.

From an evolutionary viewpoint, there are many constraints on the coding region. As we learned in Chapter 4, if a protein sequence changes too much the protein may no longer fold properly and/or function. The 5′ and 3′ untranslated regions, however, have fewer constraints and evolve at a much faster rate. As long as the small but important regulatory elements found in each remain functional, the UTRs can diverge to the point where orthologous UTRs may no longer look similar yet the protein-coding regions can be easily aligned.

[Figure 5.1 diagram labels, from left to right: 5′, 5′ UTR, Coding region, 3′ UTR, AAAAA-3′]


5.3 cDNA Many of the details of mRNA and gene structure have been determined by the isolation and analysis of DNA that is complementary to mRNA, called cDNA. In this section, cDNA will be described extensively, including its synthesis, the different forms, and the bioinformatics databases where cDNAs are found.

Synthesis

For cloning and sequencing purposes, mRNA is usually synthetically converted to complementary DNA (cDNA). Techniques vary but a common approach is to first anneal a short run of Ts (oligo(dT) primer) to the 3′ poly(A) tail of mRNA. Reverse transcriptase, the enzyme used to convert mRNA to cDNA, is then added, which binds to the 3′ end of this oligo(dT) primer. The enzyme then starts moving toward the 5′ end of the mRNA, synthesizing cDNA based on the mRNA sequence that it is “reading.”

Once the first strand of cDNA is made, how do you make the cDNA double-stranded, usually a prerequisite for cloning? Again, techniques vary, but what often happens is the 3′ end of the first strand folds back onto itself and acts as a priming spot for an enzyme to begin synthesis of the second strand. Synthesis of the second strand then proceeds using the first strand cDNA as a template. As a result of this self-priming, some of the original 5′ end of the mRNA will not be represented.

Today there are new approaches that target the 5′ end of the mRNA during cDNA synthesis. For example, a “tail” of known sequence can be added to the 3′ end of the first strand cDNA, and then a primer with the complement of this sequence can be used to begin synthesis of the second strand. There are many different methods to reach the 5′ cap site so you will encounter sequences and databases where the annotation specifically states this goal.

Synthesizing cDNA can be technically difficult (Figure 5.2). Early in the history of genetic engineering, scientists struggled to clone and sequence full-length copies of cDNA. Even today, there are still significant limitations, not the least of which is that mRNA is very susceptible to degradation. If it is not handled carefully, mRNA can easily break down into small pieces. Subsequent steps to synthesize cDNA from this partially degraded mRNA would generate even smaller pieces (Figure 5.2C).
Another concern is that mRNA can form complex secondary structures. Since mRNA is single-stranded and composed of only four bases, regions can fold back onto each other and base-pair. These structures can then impede the movement of reverse transcriptase along the molecule and cause the enzyme to

(A) 5′ ----------------------------------------------------------AAAAAAAA 3′ (B) 3′

ref|NC_000071.5| Mus musculus strain C57BL/6J chromosome 5, MGSCv37 C57BL/6J
Length=152537259

Score = 28.1 bits (61), Expect = 3.5, Method: Compositional matrix adjust.
Identities = 12/18 (67%), Positives = 15/18 (84%), Gaps = 0/18 (0%)
Frame = +1

Query  1         MRSTILLFCLLGSTRSLP  18
                 M++ ILL CLLGS +SLP
Sbjct  88805329  MKTMILLLCLLGSAQSLP  88805382

Figure 6.14 Using TBLASTN to find the orthologous mouse amelotin exons with the human protein. (A) Amino acid sequence distributed by exon for human amelotin, NP_997722. (B) TBLASTN alignments generated by the individual peptides from the human query aligned with NC_000071. Note that the query coordinates have been changed to match those in (A). In addition, the alignments were ordered to reflect the exon order within the gene. (C) A multi-FASTA file created from sequences in (A). This was pasted directly into the NCBI TBLASTN form.

(B) continued

Score = 23.9 bits (50), Expect = 95, Method: Compositional matrix adjust.
Identities = 16/28 (58%), Positives = 17/28 (61%), Gaps = 0/28 (0%)
Frame = +2

Query  19        QLKPALGLPPTKLAPDQGTLPNQQQSNQ  46
                 QL PA G+P TK  P Q T   QQQ NQ
Sbjct  88807061  QLNPASGVPATKPTPGQVTPLPQQQPNQ  88807144

Score = 35.4 bits (80), Expect = 0.026, Method: Compositional matrix adjust.
Identities = 18/22 (82%), Positives = 20/22 (91%), Gaps = 0/22 (0%)
Frame = +2

Query  47        VFPSLSLIPLTQMLTLGPDLHL  68
                 VFPS+SLIPLTQ+LTLG DL L
Sbjct  88807880  VFPSISLIPLTQLLTLGSDLPL  88807945

Score = 28.1 bits (61), Expect = 5.2, Method: Compositional matrix adjust.
Identities = 18/29 (63%), Positives = 18/29 (63%), Gaps = 1/29 (3%)
Frame = +2

Query  69        LNPAAGMTPGTQTHPLTLGGLNVQQQLHP  98
                  NPAAG   G  T P TLG LN QQQL P
Sbjct  88809302  FNPAAGPH-GAHTLPFTLGPLNGQQQLQP  88809385

Score = 20.8 bits (42), Expect = 351, Method: Compositional matrix adjust.
Identities = 9/12 (75%), Positives = 10/12 (84%), Gaps = 0/12 (0%)
Frame = +3

Query  99        VLPIFVTQLGAQ  110
                 +LPI V QLGAQ
Sbjct  88810626  MLPIIVAQLGAQ  88810661

***Query 111-119 not found***

Score = 84.0 bits (206), Expect = 8e-17, Method: Compositional matrix adjust.
Identities = 45/87 (52%), Positives = 56/87 (65%), Gaps = 1/87 (1%)
Frame = +2

Query  121       QIFTSLIIHSLFPGGILPTSQAGANPDVQDGSLPAGGAGVNPATQGTPAGRLPTPSGT-D  180
                 QIFT L+IH LFPG I P+ QAG  PDVQ+G LP   AG     QGT  G + TP  T D
Sbjct  88813922  QIFTGLLIHPLFPGAIPPSGQAGTKPDVQNGVLPTRQAGAKAVNQGTTPGHVTTPGVTDD  88814101

Query  181       DDFAVTTPAGIQRSTHAIEEATTESAN  206
                 DD+ ++TPAG++R+TH  E  T +  N
Sbjct  88814102  DDYEMSTPAGLRRATHTTEGTTIDPPN  88814182

Query 208-209 not attempted

(C)
>1-18
MRSTILLFCLLGSTRSLP
>19-46
QLKPALGLPPTKLAPDQGTLPNQQQSNQ
>47-68
VFPSLSLIPLTQMLTLGPDLHL
>69-98
LNPAAGMTPGTQTHPLTLGGLNVQQQLHPH
>99-110
VLPIFVTQLGAQ
>111-119
GTILSSEEL
>120-207
PQIFTSLIIHSLFPGGILPTSQAGANPDVQDGSLPAGGAGVNPATQGTPAGRLPTPSGTDDDFAVTTPAGIQRSTHAIEEATTESANG


Chapter 6: Advanced Topics in BLAST

Box 6.3 Where can I find a protein sequence distributed by exons?

The distribution of amino acid sequence by coding exons (see Figures 6.10 and 6.14A) can be obtained in a variety of ways, but a convenient route is to look for the CCDS link in the annotation within an NCBI protein record (Figure 1). CCDS database entries exist for many model organisms.

CDS             1..209
                /gene="AMTN"
                /gene_synonym="MGC148132; MGC148133; UNQ689"
                /coded_by="NM_212557.2:90..719"
                /db_xref="CCDS:CCDS3542.1"
                /db_xref="GeneID:401138"
                /db_xref="HGNC:33188"
                /db_xref="HPRD:18270"
                /db_xref="MIM:610912"

Figure 1 CCDS link in the NCBI protein record. In bold is the link to the CCDS database.

This “CCDS” link goes to the Consensus CDS Protein Set database where you will find additional annotation of the coding DNA and protein sequence. It includes a colorized version of the sequences with a shift in color at the exon boundaries. In Figure 2, alternating exons are underlined for clarity.

Nucleotide Sequence (630 nt):
ATGAGGAGTACGATTCTACTGTTTTGTCTTCTAGGATCAACTCGGTCATTACCACAGCTCAAACCTGCTT
TGGGACTCCCTCCCACAAAACTGGCTCCGGATCAGGGAACACTACCAAACCAACAGCAGTCAAATCAGGT
CTTTCCTTCTTTAAGTCTGATACCATTAACACAGATGCTCACACTGGGGCCAGATCTGCATCTGTTAAAT
CCTGCTGCAGGAATGACACCTGGTACCCAGACCCACCCATTGACCCTGGGAGGGTTGAATGTACAACAGC
AACTGCACCCACATGTGTTACCAATTTTTGTCACACAACTTGGAGCCCAGGGCACTATCCTAAGCTCAGA
GGAATTGCCACAAATCTTCACGAGCCTCATCATCCATTCCTTGTTCCCGGGAGGCATCCTGCCCACCAGT
CAGGCAGGGGCTAATCCAGATGTCCAGGATGGAAGCCTTCCAGCAGGAGGAGCAGGTGTAAATCCTGCCA
CCCAGGGAACCCCAGCAGGCCGCCTCCCAACTCCCAGTGGCACAGATGACGACTTTGCAGTGACCACCCC
TGCAGGCATCCAAAGGAGCACACATGCCATCGAGGAAGCCACCACAGAATCAGCAAATGGAATTCAGTAA

Translation (209 aa):
MRSTILLFCLLGSTRSLPQLKPALGLPPTKLAPDQGTLPNQQQSNQVFPSLSLIPLTQMLTLGPDLHLLN
PAAGMTPGTQTHPLTLGGLNVQQQLHPHVLPIFVTQLGAQGTILSSEELPQIFTSLIIHSLFPGGILPTS
QAGANPDVQDGSLPAGGAGVNPATQGTPAGRLPTPSGTDDDFAVTTPAGIQRSTHAIEEATTESANGIQ

Figure 2 The Consensus CDS Protein Set database. This portion of the database file includes nucleotide and protein sequences colored by exon.

You can break the translation into “exons” by copying this sequence, pasting it into a word processor document, and introducing paragraph marks. The coordinates (for example, 1–18, 19–46, 47–68, etc., as used in Figure 6.14C) can be obtained by looking at the protein sequence record.
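The same splitting can be scripted rather than done by hand in a word processor. The Python sketch below slices the amelotin translation from Figure 2 using the exon coordinates listed in Figure 6.14C and prints a multi-FASTA file of the kind pasted into the TBLASTN form. The function names `split_by_exon` and `to_multi_fasta` are hypothetical, and the coordinates are 1-based and inclusive, as in the figure.

```python
# Human amelotin translation (209 aa), copied from Figure 2 of Box 6.3.
AMTN_PROTEIN = (
    "MRSTILLFCLLGSTRSLPQLKPALGLPPTKLAPDQGTLPNQQQSNQVFPSLSLIPLTQMLTLGPDLHLLN"
    "PAAGMTPGTQTHPLTLGGLNVQQQLHPHVLPIFVTQLGAQGTILSSEELPQIFTSLIIHSLFPGGILPTS"
    "QAGANPDVQDGSLPAGGAGVNPATQGTPAGRLPTPSGTDDDFAVTTPAGIQRSTHAIEEATTESANGIQ"
)

# Peptide coordinates as listed in Figure 6.14C (1-based, inclusive).
EXON_COORDS = [(1, 18), (19, 46), (47, 68), (69, 98),
               (99, 110), (111, 119), (120, 207)]


def split_by_exon(protein, coords):
    """Return a list of (name, peptide) pairs, one per coordinate range."""
    return [(f"{start}-{end}", protein[start - 1:end]) for start, end in coords]


def to_multi_fasta(records):
    """Format (name, sequence) pairs as a multi-FASTA string."""
    return "\n".join(f">{name}\n{sequence}" for name, sequence in records)


if __name__ == "__main__":
    print(to_multi_fasta(split_by_exon(AMTN_PROTEIN, EXON_COORDS)))
```

Naming each record by its coordinates, as in Figure 6.14C, makes it easy to tell which peptide produced which alignment when all of the queries are run together.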

permutations in parameters, no alignments were generated with BLAST (the NCBI message is “No significant similarity found”). Since we are working with well-annotated genes, proteins, and genomes, it was verified that the mouse genomic sequence did, indeed, have the missing exon. This was checked by performing a BLASTN search using the mouse amelotin cDNA as a query and mouse genomic DNA as the database to search. The disadvantage of this approach would be the numerous queries that have to be run individually. But again, the NCBI makes this easy by allowing a multi-FASTA file to be used in the Query window (Figure 6.14C).

6.5 REPETITIVE DNA

A large fraction of eukaryotic genomes, ranging from 35 to 50%, is repetitive in nature. The repeats are arranged in tandem arrays, clustered in groups, or scattered as singletons throughout the genome. Repetitive elements can be as short as two nucleotides, or as long as thousands of bases, and are repeated hundreds to many thousands of times. Many of these elements are considered “junk,” leftovers from millions of years of unchecked propagation and change. Others perform a vital function, some of it accidental, and have been major players in the structure and evolution of genomes. Because of their abundance, there are times when repetitive DNA interferes with sequence analysis, particularly similarity searching, so you need to understand these elements in order to compensate for them or alter your methods.

Below is a summary of some of the major kinds of repetitive DNA elements. They are studied extensively and have been subdivided into many classes, but only the major classes are described here, organized by size.

Simple sequences

As the name suggests, simple sequences consist of very small “repeats” of only two or more nucleotides. Examples include polypyrimidines (for example, CTCTCTCTCTCTCTCT) and single nucleotide stretches (for example, AAAAAAAAAAA, CCCCCCCCCCCC, etc.). The functions of these stretches are not known, yet they are often conserved in evolution. BLAST and other database-searching tools have filters that can spare you from the many thousands of hits when simple sequences are present in your query. For example, in BLAST there is a utility called DUST that identifies stretches where your query sequence is simple in composition, and then masks them for the search.
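To illustrate what "masking" means, the Python sketch below replaces long runs of a repeated one- or two-nucleotide unit with N so they cannot seed alignments. This is a deliberately simplified stand-in and not the DUST algorithm, which scores local sequence complexity rather than looking for exact repeats; the function name `mask_simple_sequence` and the `min_run` threshold are assumptions made for this example.

```python
import re


def mask_simple_sequence(dna, min_run=10):
    """Mask exact mono- and dinucleotide repeats of at least min_run bases.

    A simplified illustration only: runs such as AAAAAAAAAA or CTCTCTCTCT
    are replaced by the same number of N characters.
    """
    dna = dna.upper()
    pattern = re.compile(
        r"([ACGT])\1{%d,}|([ACGT]{2})\2{%d,}" % (min_run - 1, min_run // 2 - 1)
    )
    return pattern.sub(lambda match: "N" * len(match.group(0)), dna)


if __name__ == "__main__":
    query = "GATTACA" + "CT" * 8 + "GGATCC"
    print(query)
    print(mask_simple_sequence(query))
```

Because the masked sequence has the same length as the original, coordinates in any resulting alignments still refer to the unmasked query.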

Satellite DNA Telomeres are specialized structures that cap the ends of chromosomes. Centromeres are located within chromosomes and contain attachment points that bind spindle fibers, which pull apart newly replicated chromosomes. These ends and centers of chromosomes have specialized DNA elements collectively referred to as satellite DNA. These regions contain repeating elements, such as (GGGTTA)n, (GGGGTT)n, and (GGGGTTTT)n, the “n” referring to the number of repeats, which can range in the many thousands. Telomeres and centromeres are almost always condensed in structure and do not harbor any genes. Despite their relative abundance in the genome, you are not likely to encounter them during sequence analysis because these sequences are avoided or are unable to be sequenced in a high-throughput manner.

Mini-satellites These tandemly arrayed sequences consist of repeated elements of 10 or more nucleotides, and the sequence can vary considerably. Their variability provides genetic markers for identifying the DNA of individuals.

LINEs and SINEs

Long interspersed nuclear elements (LINEs) are dispersed repetitive elements that range in size from a few hundred to several thousand bases. They are thought to be parasitic elements that propagate themselves like retroviruses, and they contain similar structural elements. They are transcribed into RNA, then reverse transcribed into DNA and inserted into the genome. In fact, a poly(A) stretch is seen at the end of these elements in the genomic DNA. Much like the synthesis of cDNA, the natural reverse transcription step is thought to be inefficient, so the vast majority of the elements are truncated versions of the full-length “parents” (see Figure 6.15). In mammalian genomes, upward of several hundred thousand of these elements are found, but most are from the 3′ end of the parents because of this defective reverse transcription. Looking closely at the structure of the relatively few full-length copies, two overlapping reading frames are seen, with the large frame encoding a protein that has distant homology to reverse transcriptase. Since selective pressure is usually not present to maintain the sequence of the truncated versions, they rapidly evolve away from the original sequence until they are barely recognizable as LINE sequence.

Truncated versions                                         -----AAAAAAAAA
                                                     -----------AAAAAAAAA
                                            --------------------AAAAAAAAA
                                    ----------------------------AAAAAAAAA
Full length             ----------------------------------------AAAAAAAAA

Figure 6.15 LINE-1 elements. Reverse transcription of full-length LINE-1 mRNAs is thought to start at the 3′ end. After the reverse transcriptase stalls or “falls off” the mRNA, the truncated cDNA becomes double-stranded and is inserted into the genome. Most copies of LINE-1 elements are truncated at the 5′ end.

Hundreds of thousands of copies of short interspersed nuclear elements (SINEs) are scattered about the genome. Examples include the Alu elements in humans, and the B1 elements in mice. Alu and B1 elements are originally derived from the 7SL RNA gene. Like the LINE-1 elements, SINE transcripts get reverse transcribed and inserted into the genome. They are approximately 300 nucleotides in length but old elements have rapidly evolved and may be shorter and barely recognizable. SINEs and LINEs often end up in the 3′ UTRs of mRNAs and are frequently adjacent to exons.

To give you an idea of how frequently you might encounter Alu repeats, perform a pairwise BLASTN between an 8.45 million nucleotide genomic fragment, CH471135, and U14574, a short sequence containing an Alu element. The alignments will demonstrate the range in identity between Alu elements but the BLASTN graphic (Figure 6.16) shows the high frequency in genomic DNA, with each tiny vertical mark on the horizontal line corresponding to an Alu element. The relative abundance of these repetitive sequences was crucial in finding the human genes that cause cancer (see Box 6.4).

If you perform a BLASTN search and your query hits many different genes and chromosomes, immediately suspect that a repetitive element may be present in your query. Based on the coordinates within your query, you can remove this sequence for further analysis or filter your query using a BLASTN parameter (Figure 6.17).
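Removing or masking a suspected repeat by its query coordinates is simple to script. The Python sketch below shows both options, assuming 1-based inclusive coordinates as reported in BLAST alignments; the function names `mask_region` and `excise_region` and the toy sequence are hypothetical.

```python
def mask_region(sequence, start, end, mask_char="N"):
    """Replace positions start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return sequence[:start - 1] + mask_char * (end - start + 1) + sequence[end:]


def excise_region(sequence, start, end):
    """Delete positions start..end (1-based, inclusive) from the sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates must satisfy 1 <= start <= end <= length")
    return sequence[:start - 1] + sequence[end:]


if __name__ == "__main__":
    query = "ACGTACGTAC"
    print(mask_region(query, 3, 6))    # repeat replaced by Ns, length unchanged
    print(excise_region(query, 3, 6))  # repeat removed, sequence shortened
```

Masking is usually the safer choice, since it keeps the remaining coordinates unchanged; excising shifts every position downstream of the removed segment.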

Tandemly arrayed genes Out of necessity, nature has created gene families that are organized as large arrays of repeated sequences. Rather than scattered about the genome, these members of a gene family are organized into clusters in a very orderly arrangement. These large arrays meet the synthesis demands of the cell or have a structure that allows unique forms for gene regulation. Here are examples you may encounter. Ribosomal RNAs (rRNAs) are the “scaffolds” of the ribosomes and the genome has evolved to contain many copies of these genes. In humans, there are approximately 5000 copies of 5S rRNA genes, and several hundred copies of 28S, 5.8S, and 18S rRNA genes. They are tandemly arrayed (Figure 6.18).

Figure 6.16 Alu elements in human genomic DNA. A BLASTN pairwise alignment was generated using the genomic fragment CH471135 as a “Query” and an example Alu element U14574 as the subject. The vertical marks on the horizontal line represent Alu elements distributed on the 8.45 million nucleotide genomic DNA fragment.

Figure 6.17 A screenshot of the filter parameter in the NCBI BLAST form.

Histone genes are found in high copy number in some organisms out of necessity. The developing sea urchin embryo is a frequently used subject of biology classes due to the high number of cell divisions within the span of a few hours (the length of a laboratory class). As the cells divide, their DNA is replicated at an equally high rate. Once synthesized, genomic DNA must wrap around histone proteins to form chromatin. The generation of these essential proteins keeps pace with DNA synthesis by the transcription of tandemly arrayed clusters of histone genes. In the sea urchin Echinus esculentus, each of the early embryonic histone genes (H4, H2B, H3, H2A, and H1) is found in hundreds of repeating units adjacent to each other (Figure 6.19).

Box 6.4 Repetitive elements and human oncogene discovery

Oncogenes, genes that can cause cancer, were discovered with the help of repetitive DNA. It had been established that you could transfer the DNA from human tumors into cultured mouse cells using a technique called transfection. In this procedure, purified DNA is sheared into small pieces and then mixed with calcium phosphate to form a precipitate. Cultured cells take up this precipitate–DNA mixture and a very small percentage of these cells then stably incorporate this human DNA into their genomes. Mouse cells that took up human genes responsible for cancer-like growth changed shape and growth characteristics, forming piles of cells called foci. These cells could then be harvested from the culture dish. Somewhere in the mouse genome in these cells was a human gene responsible for this transformation, but could it be identified?

In the 1980s, relatively few genes had been sequenced and deposited into GenBank. So you couldn’t identify many human genes from mouse genes based on sequence. But the sequence of human and mouse repetitive DNA had been published and so this was used for oncogene identification. A genomic library was constructed from these cancerous mouse cells. In most clones in this library, all the DNA was mouse genomic DNA. But in a rare number of the clones, human genomic DNA was identified using radioactive probes for human repetitive elements such as Alu. The logic was that when the human tumor genomic DNA was sheared for transfection, pieces included genes plus flanking repetitive elements. By identifying mouse library DNA that contained these human repetitive elements, the adjacent gene must be the human oncogene that transformed the normal mouse cells into cancer cells. This was verified by transfecting normal mouse cells with these suspected, and now purified, human oncogenes. Mouse cells that took up this DNA were transformed into cancer cells.

Figure 6.18 A tandem array. Depicted here is a “head-to-tail” arrangement of repeated elements, with little or no other sequence between elements.

6.6 INTERPRETING DISTANT RELATIONSHIPS

This chapter and the previous three have all focused on the BLAST suite and associated topics. Although BLAST is a very useful tool, if it were replaced tomorrow, you should still understand the makeup of the databases, important areas of consideration, and approaches to detecting and identifying sequences through similarity. This section is an attempt to provide a set of guidelines to consider when judging distant homologies. This list is not comprehensive and should not be used as a “checklist” or an absolute test. In more ways than one, you play detective in sequence analysis, and have to gather evidence through various routes.

Name of the protein

If a hit has the same or similar name to your query, it should be looked at closely, but names can be misleading. As seen earlier in this chapter with the opsin family (Section 6.2), comparisons across species need careful consideration. Many high-throughput cDNA sequencing efforts are combined with automated annotation of these cDNAs, and no human being has examined the validity of the names given. For example, something sequenced from a kidney cDNA library might be (incorrectly) called “kidney-specific.” Also, if an automated BLAST search found a small region of similarity between the query and the hit, it may be labeled “similar to … .” You will need to manually assess the validity of sequences labeled in this way. Names derived from laboratory experiments and actual human thought are much more likely to be accurate.

H4-H2B-H3-H2A-H1→H4-H2B-H3-H2A-H1→H4-H2B-H3-H2A-H1→H4-H2B-H3-H2A-H1→…

Figure 6.19 Tandemly arrayed histone genes in a sea urchin. The early embryonic histone genes of the sea urchin Echinus esculentus are arranged in repeating units, each containing one copy of each of the five histone genes. The arrows depict the boundaries between adjacent elements.


Percentage identity

Distance in evolution is often tied to percentage identity. Percent identity between orthologs can be very high between closely related species (for example, 95% or greater for human and other primates), moderate between related species (for example, 80% or greater for human and other mammals), lower for more distant relationships (for example, 40–80% for human and other vertebrates), and even lower between kingdoms (for example, less than 40% for human and organisms outside the animal kingdom). You can sometimes see homology between proteins with the same function (for example, paralogs or orthologs) down to about 25% identity. Short stretches of high identity can be misleading, so consider the length of the identity and the query length. Some proteins are very highly conserved, showing little change over evolution (see rhodopsin in Section 6.2), while others diverge rapidly.
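Percent identity is simply the number of identical aligned positions divided by the alignment length, gaps included. The Python sketch below computes it for two of the human and mouse amelotin peptide alignments from Figure 6.14B. The function name `alignment_stats` is hypothetical, and the value is left unrounded; BLAST reports print whole-number percentages, so a printed figure can differ by a point from the unrounded value.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count identities and gap columns in two equal-length aligned strings.

    Gaps are written as '-'. Percent identity is identities divided by the
    full alignment length (gap columns included), as in a BLAST report.
    """
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    pairs = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return {
        "identities": identities,
        "length": len(pairs),
        "gaps": gaps,
        "percent_identity": 100.0 * identities / len(pairs),
    }


if __name__ == "__main__":
    # Human (query) versus mouse (subject) peptides from Figure 6.14B.
    print(alignment_stats("VFPSLSLIPLTQMLTLGPDLHL",
                          "VFPSISLIPLTQLLTLGSDLPL"))
    print(alignment_stats("LNPAAGMTPGTQTHPLTLGGLNVQQQLHP",
                          "FNPAAGPH-GAHTLPFTLGPLNGQQQLQP"))
```

The first alignment gives 18 identities over 22 positions and the second 18 over 29 with one gap, matching the "Identities" and "Gaps" counts in the figure.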

Alignment length and length similarity between query and hit

In the strongest situation, your query should align to a hit over its entire length, demonstrating that the similarity goes from end to end. But what if the length of similarity is much shorter than your query length? If your query is 1000 amino acids long but the similarity is limited to 200 amino acids, the size difference should tell you that these might not be related proteins. But to make things complicated, some proteins have conserved and nonconserved domains so the only similarity between distantly related proteins will be these smaller regions that define function and name. To illustrate this, Figure 6.20 is a BLASTP result graphic of a human kinase Query (NP_004825) against bacterial proteins. Note that the Query is over 1100 amino acids long but the 265 amino acid catalytic domain of the kinase is finding the catalytic domains of many bacterial kinases. The rest of the query sequence is less conserved between human and bacteria, and generates no alignments. Nevertheless, these distantly related kinases share similarity in a very important domain, establishing their link. But nature also frequently preserves the approximate length (size) of related proteins. For example, human insulin is 110 amino acids long and most real hits are 94–110 amino acids long (Section 4.5). Size, whether similar or different, should be considered carefully.

Figure 6.20 BLASTP results with a human kinase (NP_004825) as the Query, searching a database of bacterial proteins. Only the kinase domain is conserved.
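Both quantities discussed here, how much of the query an alignment covers and how similar the two proteins are in overall size, are easy to compute. The Python sketch below is only an illustration; the function names `query_coverage` and `length_ratio` are hypothetical, and coordinates are 1-based and inclusive as in a BLAST report.

```python
def query_coverage(query_start, query_end, query_length):
    """Fraction of the query spanned by an alignment (1-based, inclusive)."""
    if not 1 <= query_start <= query_end <= query_length:
        raise ValueError("coordinates must lie within the query")
    return (query_end - query_start + 1) / query_length


def length_ratio(query_length, hit_length):
    """Ratio of the shorter protein length to the longer one (0 to 1)."""
    if query_length <= 0 or hit_length <= 0:
        raise ValueError("lengths must be positive")
    return min(query_length, hit_length) / max(query_length, hit_length)


if __name__ == "__main__":
    # The example from the text: 200 aligned residues of a 1000-residue query.
    print(f"coverage: {query_coverage(1, 200, 1000):.0%}")
    # Insulin-sized proteins from the text: 110 versus 94 amino acids.
    print(f"length ratio: {length_ratio(110, 94):.2f}")
```

Neither number is a test of homology on its own: a low coverage can still reflect a genuinely shared domain, as the kinase example shows.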

E value Are two sequences related to each other? BLAST statistics can assist in the decision-making process but cannot always provide a definitive answer. As first introduced in Chapter 3, and explored further in Chapter 4, and even seen in a simple typing contest (Chapter 4, Exercise 1), E values are influenced by length and the quality of the alignment. Remember that short queries will give large E values, even in the best of alignments. If E values are very small numbers (close to zero) decisions are easier. If the E value is close to, or even above 1.0, be careful with your interpretation. Hits with very large E values can have description lines that show that distant relatives were found with BLAST. In the example in Figure 6.21, you see that hits with E values of 99 will still require consideration. Note that the best alignment (Figure 6.22A) and one of the last alignments (Figure 6.22B) are not remarkably different at first glance.
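The dependence of the E value on score and length can be made explicit with the standard relation between a bit score and the expected number of chance hits, E = m × n × 2^(−S′), where S′ is the bit score and m and n are the effective lengths of the query and the database. The Python sketch below applies that relation. The function name `expected_hits` is hypothetical, and the lengths used in the example are assumptions: BLAST applies its own length corrections and, with compositional adjustment, reported values will not match this simple calculation exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score.

    Implements E = m * n * 2**(-S'), with m and n taken as the (effective)
    query and database lengths supplied by the caller.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    query_length = 265          # length of the kinase catalytic domain query
    database_length = 1e9       # assumed database size, for illustration only
    for bit_score in (30.4, 50.0, 114.0):
        e = expected_hits(bit_score, query_length, database_length)
        print(f"{bit_score:6.1f} bits -> E = {e:.2g}")
```

Each additional bit of score halves the E value, and doubling the search space doubles it. A short query can only accumulate a limited bit score, which is why its E values stay comparatively large even for a perfect match.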

Gaps BLAST inserts gaps to lengthen alignments but keeps score to make sure it is worth the cost (first seen in Chapter 3). In Figure 6.23, a large gap was introduced but it resulted in an alignment at the C-terminus which is comparable to what is seen at the N-terminus. Gap creation and extension parameters can be varied in BLAST, but the results should look convincing. Introducing a large gap to provide a very small extension of the larger alignment is often not appropriate. Box 6.5 shows an example of where a large gap was appropriate.

Sequences producing significant alignments:                          Score (Bits)   E Value
ref|YP_593309.1| serine/threonine protein kinase [Candidatus ...     114            4e-24
ref|YP_001377321.1| protein kinase [Anaeromyxobacter sp. Fw10...     108            2e-22
ref|YP_002493996.1| serine/threonine protein kinase with WD40...     105            3e-21
ref|YP_002135874.1| serine/threonine protein kinase with WD40...     103            8e-21
ref|YP_002431860.1| serine/threonine protein kinase [Desulfat...     103            1e-20
. . many hits not shown… . .
ref|YP_739489.1| serine/threonine protein kinase [Shewanella ...     30.4           96
dbj|BAI92550.1| hypothetical protein [Arthrospira platensis N...     30.4           99
ref|YP_001500449.1| serine/threonine protein kinase [Shewanel...     30.4           99
ref|YP_257659.1| lipopolysaccharide core biosynthesis protein...     30.4           99
ref|ZP_05552205.1| conserved hypothetical protein [Fusobacter...     30.4           100

Figure 6.21 Hits with large E values can still be significant. The catalytic domain (265 amino acids) of human kinase NP_004825 was used in a BLASTP search with the following parameters: the E value maximum was set to 100 and the number of maximum target sequences set to 10,000. The non-redundant protein sequences (nr) database was searched, restricted to bacterial sequences. The top five and the bottom five hits from the results table are listed.


Chapter 6: Advanced Topics in BLAST

(A)
>ref|YP_593309.1| serine/threonine protein kinase [Candidatus Koribacter versatilis Ellin345]
 gb|ABF43235.1| serine/threonine protein kinase [Candidatus Koribacter versatilis Ellin345]
Length=381

 Score = 114 bits (286), Expect = 4e-24, Method: Compositional matrix adjust.
 Identities = 74/221 (34%), Positives = 125/221 (57%), Gaps = 30/221 (13%)

Query  23   GIFELVEVVGNGTYGQVYKGRHVKTGQLAAIKVM--------DVTEDEEEEIKLEINMLK  74
            G +E+V  +GNG  G+VYK RH  + +  A+KV+        +VT+    EI++  N+
Sbjct  10   GAYEIVGPIGNGGMGEVYKVRHTISQRTEAMKVLLSGAARRPEVTDRFVREIRVLANL--  67

Query  75   KYSHHRNIATYYGAFIKKSPPGHDDQLWLVMEFCGAGSITDLVKNTKGNTLKEDWIAYIS  134
               +H NIA  + AF       H+DQL +VMEF    ++++++  + G  L+ D +AYI
Sbjct  68   ---NHPNIAALHTAF------HHEDQLIMVMEFIEGKNLSEML--STGMVLR-DSVAYI-  114

Query  135  REILRGLAHLHIHHVIHRDIKGQNVLLTENAEVKLVDFGVS--AQLDRTVGRRNTFIGTP  192
            R+ +  LA+ H   VIHRDIK  N+++    +VKL+DFG++  +  D  +    + +G+
Sbjct  115  RQAVTALAYAHSQGVIHRDIKPSNIMINSAGQVKLLDFGLALMSTPDPRLTSSGSLLGSV  174

Query  193  YWMAPEVIACDENPDATYDYRSDLWSCGITAIEMAEGAPPL  233
            ++++PE I  +     T D RSDL++ G+T  E+  G  P+
Sbjct  175  HYISPEQIRGE-----TMDARSDLYAVGVTLFEVITGRLPI  210

(B)
>ref|YP_001500449.1| serine/threonine protein kinase [Shewanella pealeana ATCC 700345]
 gb|ABV85914.1| serine/threonine protein kinase [Shewanella pealeana ATCC 700345]
Length=609

 Score = 30.4 bits (67), Expect = 99, Method: Compositional matrix adjust.
 Identities = 43/169 (26%), Positives = 73/169 (44%), Gaps = 32/169 (18%)

Query  25   FELVEVVGNGTYGQVYKGRHVKTGQLAAIKVMDVT--EDEEEEIKLEINMLKKYSHHRNI  82
            ++ +E+VG G YG V+ G + K GQ    K   +T  +  ++ ++ E  ML +   H N+
Sbjct  53   YQELELVGKGAYGFVFAGVN-KLGQAHVFKFSRLTLPQHIQDRLEEEAFMLSQVI-HPNV  110

Query  83   --------ATYYGAFIKKSPPGHDDQLWLVMEFCGAGSITDLVKNTKGNTLKEDWIAYIS  134
                        G  +    PG D     + + C            +   L    +  I+
Sbjct  111  PPVIKFEHVGKQGILVMARAPGED-----LEQLC-----------IRVGALPVATVMNIA  154

Query  135  REILRGLAHLHIHH-VIHRDIKGQNVLLTENAE-VKLVDFG--VSAQLD  179
            R++   L +LH    +IH DIK  N++   N + + L+D+G  V AQ D
Sbjct  155  RQLAAILQYLHNGRPLIHGDIKPSNLVYDVNTQHLSLIDWGSAVFAQRD  203

Figure 6.22 Hits with large E values can still be significant. The catalytic domain (265 amino acids) of human kinase NP_004825 was used in a BLASTP search with the following parameters: the E value maximum was set to 100 and the number of maximum target sequences set to 10,000. The non-redundant protein sequences (nr) database was searched, restricted to bacterial sequences. (A) The alignment from the top hit. (B) The alignment from a bottom hit.

Conserved amino acids As sequences are aligned to your query, you may notice key amino acids that are conserved throughout evolution—a pattern of cysteines (C), a leucine (L) every seven amino acids, or some other signature that looks either unusual or interesting. This recognition should come with practice. Identifying specific protein motifs with software will be covered in Chapter 8. The NCBI BLAST programs, and other sites, check your query for signatures during the search and will display these domains as graphics (Figure 6.24). Should a domain be identified, click on these graphics to reach more information. You may learn that key amino acids

>emb|CAG08854.1| unnamed protein product [Tetraodon nigroviridis]
Length=291

 Score = 153 bits (387), Expect = 1e-37, Method: Compositional matrix adjust.
 Identities = 90/288 (32%), Positives = 156/288 (55%), Gaps = 23/288 (7%)

Query  40   LTIIGILSTFGNGYVLYMSSRRKKKLRPAEIMTINLAVCDLGISVVGKPFTIISCFCHRW  99
            L  I +L   GN  VL + SR  + L P  ++ +N++  D+ +SV G P ++ +    RW
Sbjct  5    LGFILVLGFLGNFLVLLVFSRFPRLLTPVNLLLVNISASDMLVSVFGTPLSLAASVRGRW  64

Query  100  VFGWIGCRWYGWAGFFFGCGSLITMTAVSLDRYLKICYLSYGVWLKRKHAYICLAAIWAY  159
            + G  GCRWYG++   FG  SL+++  +SL+RY ++ +       + + A I +AA W Y
Sbjct  65   LTGASGCRWYGFSNALFGVVSLVSYSLLSLERYAEVLWDPQTSASRYQRAKIAVAASWFY  124

Query  160  ASFWTTMPLVGLGDYVPEPFGTSCTLDWWLAQASVGGQVFILNILFFCLLLPTAVIVFSY  219
            + FWT  PL G   Y PE  GT+C++ W   Q +   + +I+ +  FCLLLP  V++F Y
Sbjct  125  SLFWTLPPLFGWSSYGPEGLGTTCSVQW--HQRTASSRSYIICLFIFCLLLPLLVMIFCY  182

Query  220  VKIIAKVKSSSKEVAHFDSRIHSSHV---------------------LEMKLTKVAMLIC  258
             +++  +++ S  V+   +R   S                        E  + ++ + +
Sbjct  183  GRMLLALRAWSLRVSAAGTRSRPSAAGGGSDCCTVCVLQVGGAAGERREALVLQMVLCMV  242

Query  259  AGFLIAWIPYAVVSVWSAFGRPDSIPIQLSVVPTLLAKSAAMYNPIIY  306
            AG+L+ W+PY  V++ ++FG P  +P   S++P+LLAK++ + NP+IY
Sbjct  243  AGYLLCWMPYGAVAMLASFGPPGVVPPTASLIPSLLAKTSTVLNPVIY  290

Figure 6.23 BLASTP alignment between human opsin-5 and CAG08854, from the pufferfish. Note the large gap.

Figure 6.24 Conserved domains identified by the NCBI BLAST program. Near the top of the BLASTP results is a graphic that will display conserved domains present in the query. Clicking on these objects will navigate to the Conserved Domain database where additional information can be found.

Box 6.5 What is going on with this alignment?

NP_005753 was used as a TBLASTN query against human genomic DNA and Figure 1 shows one of the alignments with genomic sequence file NC_000019.9.

Query  241    QYQFLEDAVRNQRKLLASLVKRLGDKHATLQKSTKEVRSS--------------------  280
              +YQFLEDAVRNQRKLLASLVKRLGDKHATLQKSTKEVRSS
Sbjct  22939  RYQFLEDAVRNQRKLLASLVKRLGDKHATLQKSTKEVRSS*VWVLGLWGWPRAARPYLA*  23118

Query  281    ------IRQVSDVQKRVQVDVKMAIL  300
                    IRQVSDVQKRVQVDVKMAIL
Sbjct  23119  PAVSPRIRQVSDVQKRVQVDVKMAIL  23196

Figure 1 Adjacent exons separated by a small intron.

Why are there two areas of identity between the query and the genomic DNA and a large gap? This is one contiguous genomic DNA sequence, the Subject (Sbjct) going from 22939 to 23196. It encodes query amino acids 242–280 AND 281–300. There was insufficient distance between these two exons on the genomic DNA sequence for BLAST to separate them into two separate alignments: the gap penalties did not exceed the benefit of including a distant 20 out of 20 match. Instead, you see two regions of identity separated by (translated) genomic DNA. The “*” symbols in the Sbjct line represent encountered stop codons in the intron.
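The trade-off described above can be put into rough numbers. The sketch below assumes the BLOSUM62 matrix with a gap existence cost of 11 and an extension cost of 1 per gapped position (common protein defaults, assumed here rather than taken from the search in Figure 1), and it scores only the identical second exon. The function names are invented for this illustration.

```python
# Identity (diagonal) scores from the BLOSUM62 matrix for the residues used below.
BLOSUM62_IDENTITY = {"I": 4, "R": 5, "Q": 5, "V": 4, "S": 4, "D": 6,
                     "K": 5, "M": 5, "A": 4, "L": 4}


def gap_cost(length: int, existence: int = 11, extension: int = 1) -> int:
    """Affine gap penalty: a fixed opening cost plus a cost per gapped position."""
    return existence + extension * length


def identity_score(segment: str) -> int:
    """Raw score of a segment aligned to an identical copy of itself."""
    return sum(BLOSUM62_IDENTITY[aa] for aa in segment)


second_exon = "IRQVSDVQKRVQVDVKMAIL"  # query residues 281-300 in Figure 1
gap_length = 20 + 6                   # gapped columns in the first and second blocks
gain = identity_score(second_exon)
cost = gap_cost(gap_length)
print(f"gain {gain}, gap cost {cost}, net {gain - cost}")
```

Under these assumptions the 20 identical residues are worth more than the gap costs, so extending through the translated intron raises the score of a single alignment rather than lowering it.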


Figure 6.25 Evenly spaced alignment patterns. Human leucine zipper 4 protein NP_057467 is aligned with an unknown sea anemone protein XP_001640018 by BLASTP. Note the patterns for serines, histidine–glycine pairs, and threonine–glutamine pairs created by small internal repeats in both sequences.

 Score = 84.3 bits (207), Expect = 2e-14
 Identities = 63/261 (24%), Positives = 93/261 (35%), Gaps = 15/261 (5%)

Query  107  NGQPLIEQEKCSDNYEAQAEKNQGQSEGNQHQSEGNPDKSEESQGQPEENHHSERSRNHL  166
            +      Q   S+     +      S      S      S    G     H    +++
Sbjct  139  SNT----QHDMSNTQHDMSNTQHDMSNTQHGMSNTQHGMSNTQHGMSNTQHGMSNTQHGM  194

Query  167  ERSL-SQSDRSQGQLKRHHPQYERSHGQYKRSHGQSERSHGHSERSHGHSERSHGHSERS  225
              +    S+   G     H   +  HG     HG S   HG S   H  S   HG S
Sbjct  195  SNTQHGMSNTQHGMSNTQHDMSKTQHGMSNTQHGMSNTQHGMSNTQHDMSNTQHGMSNTQ  254

Query  226  HGHSKR----SRSQGDLVDTQSDLIATQRDLIATQKDLIATQRDLIATQRDLIVTQRDLV  281
            H  S      S +Q D+  TQ D+  TQ  +  TQ D+ +TQ  +  TQ D+ +TQ  +
Sbjct  255  HDMSSTQHGMSNTQHDMSSTQHDMSNTQHGMSNTQHDMSSTQHGMSNTQHDMSITQHGMC  314

Query  282  ATERDLIN----QSGRSHGQS  298
             T+ D+ N     S   HG S
Sbjct  315  NTQHDMSNTQHGMSNTQHGMS  335

specify membership to a family or function. But remember, not everything has been discovered. These key amino acids should not be confused with similarity due to composition: for example, one proline-rich sequence aligning with another proline-rich sequence, with no related function. But having domains rich in one or more amino acids is a property, too. A protein may require a domain that is rich in hydrophobic amino acids, for example. Look at the alignment between a human and sea anemone protein in Figure 6.25. The percentage identity is low (24%) but the distance between certain amino acids has been maintained for over one billion years of evolution. This similarity deserves a closer look! Maybe these two proteins are folded the same way and have the same or related function.
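Regular spacing of this kind can also be checked by counting rather than by eye. The short sketch below tallies the distances between consecutive occurrences of a residue; the function name is invented for this illustration, and the test sequence is one 60-residue line of the sea anemone subject from Figure 6.25.

```python
from collections import Counter


def spacing_profile(sequence: str, residue: str) -> Counter:
    """Count the distances between consecutive occurrences of one residue."""
    positions = [i for i, aa in enumerate(sequence) if aa == residue]
    return Counter(b - a for a, b in zip(positions, positions[1:]))


# One line of the subject (sea anemone) sequence shown in Figure 6.25.
subject = "SNTQHGMSNTQHGMSNTQHDMSKTQHGMSNTQHGMSNTQHGMSNTQHDMSNTQHGMSNTQ"

for residue in "SHQ":
    print(residue, spacing_profile(subject, residue).most_common(2))
```

A single dominant spacing, as opposed to a broad spread of distances, is the numerical counterpart of the evenly spaced columns visible in the alignment. The same function could be pointed at leucines to look for a seven-residue repeat.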

6.7 SUMMARY This chapter marks the end of formal coverage of the BLAST programs in this book. Chapter 3 covered BLASTN, Chapter 4 explored BLASTP, and both BLASTX and TBLASTN were covered in Chapter 5. Chapter 6 advanced your knowledge of using BLAST through workflows, specialized topics, parameter adjustment, and frequently encountered problems. For the remainder of this book, BLAST can and will be used as a handy tool to call upon as you address sequence analysis problems with other bioinformatics applications.

EXERCISES

Exercise 1: Simple sequences

This first set of exercises explores very simple sequences: two-nucleotide combinations such as CTCTCT.

1. What are all the possible two-nucleotide combinations?
2. Considering that BLASTN searches both strands of DNA, what is the minimum number of searches required to look for all these possible two-nucleotide combinations in the human genome?
3. Perform BLASTN searches of all the possible two-nucleotide combinations: search the human RefSeq mRNA sequences with queries of length ~30 nucleotides. Was it necessary to adjust any parameters?
4. Looking at the coordinates of the hits, how does BLASTN handle the alignments between the query and the hits?
5. For each search, look at one mRNA sequence record of the hit and, by eye, find the two-nucleotide element found with BLASTN. This may be easier if you change the “Display Settings” from GenBank to FASTA.

Exercise 2: Reciprocal BLAST

Major urinary proteins (MUPs) are found in the urine of mammals and have been recently shown to play a role in the fear response between species. Mice can detect and are afraid of the smell of isolated rat and cat MUPs. They even fear the smell of these animal proteins produced in genetically engineered bacteria. The role of MUPs in animal interactions is complex, as they have also been shown to induce aggression and other behaviors within the same species. MUPs are detected by receptors in a specialized nasal structure called the vomeronasal organ. It will be interesting to discover how intraspecies and interspecies detection of MUPs shapes the behavior of animals. The number of MUP genes varies between mammals, ranging from a single copy (for example, dog) to over 20 copies (mouse). Although several primates appear to have functional MUP genes, humans only have a pseudogene, defective because of a single base-pair change. MUPs are also part of a larger family, lipocalins, which extends well beyond mammals.

You are a scientist who has been studying the role of mouse MUP26 in social behavior and wonder if the MUP genes of the opossum (Monodelphis domestica) would be interesting. The opossum is a marsupial, a distinct branch of mammals, and sequence analysis of its genes would represent a departure from many of the other MUP studies. You wonder if opossums may have a complex social communication with other animals, mediated by MUPs. As a start, you are interested in finding the ortholog to mouse MUP26 in the newly sequenced opossum genome.

1. Retrieve the mouse MUP26 RefSeq protein sequence and conduct a BLASTP search against the Monodelphis domestica non-redundant proteins.
2. When the results are returned, survey the top hits and pick one or more proteins that may be an opossum MUP26 ortholog.
3. Perform a reciprocal BLASTP search against Mus musculus RefSeq proteins.
4. Based on the opossum sequence that has the top hit to mouse MUP26 in a reciprocal BLAST, identify the opossum ortholog to the mouse MUP26.
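The bookkeeping behind a reciprocal search can be sketched in a few lines. Everything in the example below is hypothetical: the identifiers and bit scores are invented for illustration, and the function names are not part of any BLAST tool. The hit lists would in practice be transcribed from the results tables of the forward and reverse searches.

```python
def best_hit(hits):
    """Return the subject with the highest bit score from (subject, bits) pairs."""
    return max(hits, key=lambda pair: pair[1])[0] if hits else None


def reciprocal_best_hits(query, forward_hits, reverse_hits_by_subject):
    """List forward hits whose own best hit, in the reverse search, is the query."""
    return [subject for subject, _bits in forward_hits
            if best_hit(reverse_hits_by_subject.get(subject, [])) == query]


# Hypothetical identifiers and bit scores, for illustration only.
forward = [("opossum_protein_A", 210.0), ("opossum_protein_B", 185.0)]
reverse = {
    "opossum_protein_A": [("mouse_MUP_X", 230.0), ("mouse_MUP26", 205.0)],
    "opossum_protein_B": [("mouse_MUP26", 190.0), ("mouse_MUP_X", 150.0)],
}
print(reciprocal_best_hits("mouse_MUP26", forward, reverse))
```

In this made-up case the strongest forward hit is not reciprocal, which is why step 2 suggests carrying more than one candidate into the reverse search.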

Exercise 3: Exon identification with TBLASTN

In this exercise, you will use TBLASTN to identify the exons of a human gene with a human protein. You will be asked to navigate between databases to retrieve information.

1. Retrieve the RefSeq record for NP_005753.
2. Examine this sequence by eye. Considering that you will be using TBLASTN to identify human exons with this protein sequence, predict what parameter you may need to change in order to find all the exons.
3. From the RefSeq protein record, navigate to the CCDS database and identify the amino acids encoded by each exon. Construct a table, much like Figure 6.10, where each exon’s amino acids are on separate lines. How many exons encode protein sequences?
4. From the RefSeq protein record, navigate to the RefSeq mRNA record and identify how many exons are spliced together to form the mRNA.
5. From the annotation found with the RefSeq mRNA record, how long are the 5′ and 3′ UTRs?


Playing possum

In reaction to extreme fear the American opossum will collapse and reach a comalike state for up to four hours. This is the origin of the phrase, “playing possum,” or feigning death. During that time, the animal’s heart and respiratory rates decrease, there is excessive salivation and the release of urine and feces, and the animal emits a terrible odor. This death-like state often works as a defense, as predators soon lose interest and leave.

6. Using the default settings (reset the form if necessary) of the NCBI TBLASTN form, perform a pairwise TBLASTN of NP_005753 and genomic sequence AC016630.
7. Record and compare your predictions to the CCDS sequences. Is any sequence missing?
8. Change parameters to try to find any missing sequence. What parameter(s) did you change?
9. Were you able to find all the exons based on the protein sequence?

Exercise 4: Identification of orthologous exons with TBLASTN By a number of measures, the platypus is a very unusual mammal. It appears to have features, both physical and genetic, from birds, reptiles, and fish. It is so bizarre that when it was first discovered by scientists, it was thought of as an elaborate hoax. It has webbed feet like a duck, and its face is dominated by a specialized structure, reminiscent of a duck’s bill. The genus name of the platypus, Ornithorhynchus, should remind you of ornithology, the study of birds, and rhynchus is Latin for beak. The platypus “beak” is an organ that is specialized for hunting. While diving in water, the platypus closes its eyes, ears, and nostrils and relies entirely on the electrosensory receptors on its bill to locate the minute electrical signals generated by the muscular contractions of prey. Other features of the platypus are reminiscent of other animals. Remarkably, the platypus lays eggs rather than giving birth to live young, and the eggs are laid, like birds, through the single opening for both the digestive tract and reproduction. Once hatched, the young platypus is fed milk, not through nipples (the platypus has none) but through skin pores. The male platypus has a venom gland and barblike structures on its hind legs to deliver the toxin. This venom protein suggests a case of convergent evolution with reptilian venoms. Instead of maintaining its body temperature at 37°C like most other mammals, the platypus is a constant 32°C. The platypus has 52 chromosomes, 10 of which are sex chromosomes. Sequence analysis has shown that it has genes previously found only in birds, amphibians, and fish. But what about mammalian features that have been lost or radically changed: do we see changes at the gene level? This exercise explores a platypus gene that may no longer be under evolutionary selection. There are at least five mammalian genes responsible for the creation of tooth enamel. 
In Section 6.4, the coding regions of a mouse enamel protein, amelotin, were identified using the human protein. Now you will look for another gene responsible for tooth formation in the platypus genome. At birth the platypus has several teeth but soon loses these and is toothless for the rest of its life. Could a gene responsible for a fundamental tooth component lose function because it is no longer needed? You will address this and other questions in your analysis.

1. Using the human enamelin protein and TBLASTN, search the reference platypus genome using the default settings (except change the maximum number of target sequences to 10). When completed, identify the possible exons for the platypus enamelin protein and paste the appropriate alignments into a word processor document (in order from N- to C-terminus). This will be referred to as your “gene model.” This search may take a while so continue working on the next two questions below. Note that the term “gene model” usually refers to a predicted cDNA, not a collection of alignments or other evidence.
2. Look up the CCDS database entry for your query and identify the exons and encoded protein (by exon) to be expected should your query be similar to the platypus gene and protein. Construct a table, much like Figure 6.10, where the amino acids encoded by each exon are on separate lines and place this at the top of your gene model.
3. Read about the enamelin protein function by searching PubMed. There are other genes related to enamelin and these may also be found with this first TBLASTN. Remember: always work within the same genomic fragment (for example, NW_12345678), make sure the alignments you pick are all on the same strand, and check the genomic coordinates are sequential.
4. Vary the TBLASTN parameters and approaches, as demonstrated in this and other chapters, to search for coding exons that do not appear with the default TBLASTN parameters. When doing this step, be sure to perform pairwise TBLASTN alignments between your query and the appropriate platypus genomic fragment identified in step 1, above, to speed up your analysis and be a good neighbor of shared computer resources. Should you find additional exons, record what parameter changes led to your discoveries.
5. Add to and edit your gene model as you identify candidate exons. If necessary, examine and manually translate DNA sequence outside of alignments to add amino acids not aligned by BLAST.
6. From the information in your gene model, splice together what you think is the platypus enamelin protein. Perform another pairwise TBLASTN alignment between your model and the platypus genome to check your work.
7. Perform a BLASTP search with your prediction and identify the predicted platypus enamelin already in RefSeq. How do these compare?
8. Perform another pairwise BLASTP alignment between your model and the original query protein. How do they compare? Would any edits of your model improve the alignment, remembering that the platypus protein sequence is different from your query? Is there any sequence missing? From the standpoint of protein sequence, does it appear that the platypus gene might be full length and functional?
9. If necessary, examine the platypus genomic sequence by eye by retrieving the DNA sequence of your subject. Is there any unsequenced genomic DNA in the vicinity of where missing exons may be?

FURTHER READING

Apte S (2009) A disintegrin-like and metalloprotease (reprolysin-type) with thrombospondin type 1 motif (ADAMTS) superfamily: functions and mechanisms. J. Biol. Chem. 284, 31493–31497. A nicely illustrated mini-review article on this large and biochemically important family.

Catón J & Tucker AS (2009) Current knowledge of tooth development: patterning and mineralization of the murine dentition. J. Anat. 214, 502–515. A wonderful article on the genes and tissues of tooth development.

Goodier JL & Kazazian HH (2008) Retrotransposons revisited: the restraint and rehabilitation of parasites. Cell 135, 23–35. A nice review article on this class of repetitive elements, frequently (if not accidentally) encountered in sequence analysis.

Logan DW, Marton TF & Stowers L (2008) Species specificity in major urinary proteins by parallel evolution. PLoS One 3, e3280 (DOI:10.1371/journal.pone.0003280). A good discussion of this large family, with nice illustrations of gene organization and evolutionary relationships.

Mikkelsen TS, Wakefield MJ, Aken B et al. (2007) Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447, 167–177. A thorough and interesting genome paper with comparisons to other model genomes.

Shih C & Weinberg RA (1982) Isolation of a transforming sequence from a human bladder carcinoma cell line. Cell 29, 161–169. This paper describes the isolation of a human oncogene with the help of repetitive elements.

Terakita A (2005) The opsins. Genome Biol. 6, 213–221. A good review article covering the opsin family members, along with their structure and function.

Warren WC, Hillier LW, Marshall Graves JA et al. (2008) Genome analysis of the platypus reveals unique signatures of evolution. Nature 453, 175–184. This is a wonderful article that draws direct comparisons between the unusual features of the platypus and the genomic discoveries by the authors.

Yu Y-K & Altschul SF (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 21, 902–911. This article discusses the mathematical details of modifying substitution matrices to accommodate differences in protein composition.


CHAPTER 7

Bioinformatics Tools for the Laboratory

Key concepts
• Mapping restriction enzymes with the NEBcutter tool
• Converting DNA sequence with Reverse Complement
• Translation of DNA sequences with the ExPASy Translate tool
• Identifying open reading frames with the NCBI ORF Finder
• Using Primer3 software for designing PCR primers
• Measuring DNA composition with DNA Stats
• Measuring protein composition with the Composition/Molecular Weight Calculation Form
• The Sequence Retrieval System (SRS), a database-querying tool
• Graphically visualizing sequence similarity with DotPlot

7.1 INTRODUCTION

As you work your way through this book, and your own analyses, you will encounter times when you need a small utility that does one thing and does it well. This chapter will focus on sequence analysis utilities that come in handy for work in the average academic, biotechnology, or pharmaceutical laboratory. You’ve already discovered how a word-processing program can save you time and minimize errors through the “Find and Replace” function. These additional programs described in this chapter are more sophisticated and specific for sequence analysis. They will help you design and implement experiments as well as analyze and make sense of the data collected. They will often make the difference between knowing answers and just guessing. If you are ever doing anything manual, repetitive, and prone to error, you should ask yourself “is there a simple tool which can do this for me?” There are literally hundreds of bioinformatics utilities on the Internet and it would be impossible to present them all here. This chapter will survey some practical as well as academic applications. The interfaces can be radically different in approach. Some utilities will be simple to understand and use while others will require more work to get at answers. Help documents will range from polished publications to no help at all. Some sites are maintained by commercial entities while others are kindly offered by academic laboratories. You should get used to the variety and be prepared to seek replacements. The utilities presented here are good but you may find others more suited to your needs. There are also commercial desktop applications, which may provide most of what you need in one place. It is important to remember that there is unsatisfactory software out there: Web pages that don’t work, pages that generate ugly or confusing results, and sites that use outdated approaches or generate incorrect data. As you explore the Internet, remember to apply scientific methods of evaluation. You may have to experiment with sequences that you understand (controls) to verify that the Website you just discovered is working properly.

7.2 RESTRICTION MAPPING AND GENETIC ENGINEERING Open up a vendor catalog that caters to laboratories that clone, manipulate, and sequence genes and you see pages of maps of plasmids and viruses, the vectors of genetic engineering. Also included are lists of restriction enzymes that cleave these DNAs. Some of the first sequence analysis software was developed to identify restriction enzyme sites, generate maps, and assist in the planning of pioneering recombinant DNA experiments. Before describing a good example of software for this purpose, we will extensively review topics of “wet” laboratory work: restriction enzymes and the DNA sequences they cleave.

Restriction enzymes

One of the critical advances that led to the explosive growth of genetic engineering was the discovery and utilization of restriction enzymes. Restriction enzymes are bacterial proteins that cleave DNA into pieces. Rather than doing this randomly, restriction enzymes recognize a very specific short sequence, for example GAATTC, and cleave both strands of DNA at this site. These restriction sites are palindromes: the sequence, read 5′ to 3′, is the same on both strands of DNA, as seen in Figure 7.1. When describing the recognition sequence and cleavage site of a restriction enzyme, the shorthand is to only show one strand and to place a caret (^) to mark the spot where cleavage takes place (see Table 7.1). The EcoRI cut is asymmetric, G^AATTC, leaving a short single-stranded DNA sequence protruding. If a 5′ end protrudes after cleavage, then this is described as a “5′ overhang.” If a 3′ end protrudes, then this is described as a “3′ overhang.” When the cleavage takes place in the center of the recognition site, for example CCC^GGG, this leaves a “blunt” end with no overhang. Table 7.1 shows a sampling of restriction enzymes with a six-nucleotide recognition site.
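Both ideas in this paragraph, the palindrome and the type of end left by the cut, can be read directly from the caret notation. The sketch below is illustrative only: the function names are invented here, and the overhang rule assumes a palindromic site that is cut at the mirrored position on the opposite strand.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def describe_site(site_with_caret: str) -> tuple[bool, str]:
    """Report whether a site is palindromic and which kind of end the cut leaves."""
    cut = site_with_caret.index("^")          # bases to the left of the cut
    site = site_with_caret.replace("^", "")
    palindromic = site == reverse_complement(site)
    if cut * 2 == len(site):
        end = "blunt"
    elif cut * 2 < len(site):
        end = "5' overhang"
    else:
        end = "3' overhang"
    return palindromic, end


for name, site in [("EcoRI", "G^AATTC"), ("SmaI", "CCC^GGG"), ("KpnI", "GGTAC^C")]:
    print(name, describe_site(site))
```

A cut left of the center leaves the 5′ end protruding, a cut right of the center leaves the 3′ end protruding, and a cut at the center leaves none.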

Figure 7.1 The EcoRI restriction enzyme site. (A) The EcoRI site, GAATTC, with the flanking sequence depicted as Ns. Note that the site is palindromic: GAATTC is on the top strand while the complementary strand, reading 5′ to 3′, is identical: GAATTC. (B) The enzyme cleaves after the G on both strands. Once the backbone of the DNA is broken in two places, the short stretch of complementary bases is not enough to keep the strands together. (C) Separated, the two 5′ ends have unpaired bases or “sticky” ends of “AATT.”

The recognition site of commercially available restriction enzymes varies from 4 (for example, AluI, AG^CT) to 13 nucleotides (for example, SfiI, GGCCNNNN^NGGCC). There is another commercially available enzyme, CspCI, which recognizes a relatively short sequence but then cleaves far upstream of this site (Figure 7.2). As Figure 7.2 demonstrates, some enzymes do not require

(A) 5′-NNNNNGAATTCNNNNNN-3′
    3′-NNNNNCTTAAGNNNNNN-5′

(B) 5′-NNNNNG AATTCNNNNNN-3′
    3′-NNNNNCTTAA GNNNNNN-5′

(C) 5′-NNNNNG-3′          5′-AATTCNNNNNN-3′
    3′-NNNNNCTTAA-5′          3′-GNNNNNN-5′

Table 7.1 A sampling of restriction enzymes, their recognition sites, and species of origin

Enzyme    Recognition site    Bacteria species
BamHI     G^GATCC             Bacillus amyloliquefaciens
EcoRI     G^AATTC             Escherichia coli
HindIII   A^AGCTT             Haemophilus influenzae
KpnI      GGTAC^C             Klebsiella pneumoniae
PstI      CTGCA^G             Providencia stuartii
SacI      GAGCT^C             Streptomyces achromogenes
SalI      G^TCGAC             Streptomyces albus
SmaI      CCC^GGG             Serratia marcescens
SphI      GCATG^C             Streptomyces phaeochromogenes
XbaI      T^CTAGA             Xanthomonas badrii

Restriction enzymes are pronounced with different rules. For some, every letter is said. For others, syllables are used. Here are the common ways to pronounce the list: “bam-H-1,” “echo-are-1,” “hin-dee-3,” “kay-p-en-1,” “p-s-t-1,” “sack-1,” “sal-1,” “smah-1,” “s-p-h-1,” and “x-b-a-1.”

NN^NNNNNNNNNNNCAANNNNNGTGGNNNNNNNNNNNN

a specific sequence for the entire recognition site, but only require that some nucleotides be present. The other nucleotides can be any base (depicted as N). The 3′ and 5′ overhangs are also known as “sticky ends” and, under the right circumstances, allow joining with other DNA molecules that have the same sticky ends. For example, DNA fragments having protruding AATT sticky ends can be joined to other fragments having the same protruding and complementary bases. Genetic engineers quickly recognized this power of restriction enzymes and designed many experiments and procedures around the use of these enzymes. A common approach was to prepare inserts with restriction enzyme cleavage and clone them into corresponding “sticky ends” of a cloning vector. As an introduction to these steps, consider a scenario where you are cloning genomic DNA into a vector using just two restriction enzymes. In Figure 7.3, a piece of genomic DNA has been trimmed with SphI and EcoRI. Using the DNA in Figure 7.3 as an example insert, a vector is cut with SphI and EcoRI, generating sticky ends to accept the corresponding ends of the insert (Figure 7.4).

(A) SphI:  GCATG^C
    EcoRI: G^AATTC

(B) SphI                                     EcoRI
        cnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnng
    gtacgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncttaa

Figure 7.3 An insert with SphI and EcoRI sticky ends. (A) The recognition sequences for SphI and EcoRI restriction enzymes. (B) DNA was digested with a combination of SphI and EcoRI restriction enzymes. After digestion, the complementary bases of these restriction sites are exposed. The nucleotides here are depicted in lowercase in order to later follow their insertion into an uppercase plasmid sequence. Nucleotides recognized by the enzymes are underlined.

Figure 7.2 The recognition site for restriction enzyme CspCI. The enzyme recognizes the internal specific nucleotides, CAANNNNNGTGG, but then reaches upstream from these bases to cleave the DNA backbone (depicted with ^).
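Recognition sites that contain N can be searched for by treating each N as "any base". The sketch below is a minimal illustration rather than a description of how any mapping tool works internally. The function names and the test sequence are invented for this example, and because the CspCI site is not a palindrome, the reverse-complemented pattern is searched as well.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(site: str) -> str:
    """Reverse complement of a site; N stays N."""
    return site.translate(COMPLEMENT)[::-1]


def site_to_regex(site: str) -> str:
    """Turn a recognition site into a regular expression; N matches any base."""
    return "".join("[ACGT]" if base == "N" else base for base in site)


def find_site(sequence: str, site: str) -> dict[str, list[int]]:
    """0-based start positions of the site in either orientation on the top strand."""
    hits = {}
    for label, pattern in (("forward", site), ("reverse", reverse_complement(site))):
        # A lookahead keeps overlapping matches from hiding one another.
        regex = re.compile(f"(?={site_to_regex(pattern)})")
        hits[label] = [m.start() for m in regex.finditer(sequence)]
    return hits


CSPCI = "CAANNNNNGTGG"  # specific bases of the CspCI site (Figure 7.2)

# Hypothetical test sequence with one site in each orientation.
sequence = "GGAT" + "CAAACGTAGTGG" + "TTA" + "CCACGGGGGTTG" + "AT"
print(find_site(sequence, CSPCI))
```

The same translation would work for a site such as SfiI (GGCCNNNNNGGCC). Locating the actual cleavage positions, which for CspCI lie outside the recognized bases, is not attempted here.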


Alu repeats

In Section 6.5, the short interspersed repetitive sequences called Alu elements were described. Where did this name come from? If you digest human genomic DNA with the restriction enzyme AluI and size-fractionate the fragments on an agarose gel, you observe a general gradient of sizes from very large to very small. But embedded in this relatively smooth gradient of sizes is a prominent band of DNA fragments which all have approximately the same small size. This band caught the eye of scientists and upon analysis they discovered these thousands of repetitive elements that shared a conserved AluI restriction site. These were then named “Alu elements.”


Figure 7.4 Cloning of an SphI–EcoRI fragment. (A) A vector sequence with SphI and EcoRI sites underlined. Once cleaved (B), removal of the intervening sequence leaves (C). (D) The extending nucleotides of the insert, depicted in Figure 7.3, are complementary to the extending nucleotides of the vector. The insert (lowercase) is accepted into the vector (uppercase) and a ligase is used to restore the backbone of the DNA. (E) The two restriction enzyme sites (underlined), SphI and EcoRI, are restored and this vector plus insert could be cut again by these two enzymes.

(A)       SphI                                         EcoRI
    AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTCTTAT
    TTCGAACGTACGGACGTCCAGCTGAGATCTCCTAGGGGCCCATGGCTCGAGCTTAAGAATA

(B)       SphI                                          EcoRI
    AAGCTTGCATG|CCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCG|AATTCTTAT
    TTCGAAC|GTACGGACGTCCAGCTGAGATCTCCTAGGGGCCCATGGCTCGAGCTTAA|GAATA

(C) SphI                 EcoRI
    AAGCTTGCATG          AATTCTTAT
    TTCGAAC                  GAATA

(D) SphI                                                  EcoRI
    AAGCTTGCATG cnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnng AATTCTTAT
    TTCGAAC gtacgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncttaa GAATA

(E)       SphI                                      EcoRI
    AAGCTTGCATGcnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngAATTCTTAT
    TTCGAACgtacgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncttaaGAATA

Restriction enzyme mapping: the polylinker site

Using some of the earliest bioinformatics software, scientists simulated the digestion of DNA using many different restriction enzymes to find which would do the job. This procedure is called restriction enzyme mapping. For a simple demonstration of software that generates a restriction map, we’ll use a short DNA sequence known as a polylinker.

There are often many different steps to genetic engineering, with foreign inserts being manipulated a number of times to achieve the experimental goals. One of the greatest advancements in the field of genetic engineering was the construction of artificial vectors which carried features that allowed easy manipulation in the laboratory. To accommodate the many different possible insertions and approaches, a vector polylinker cloning site was invented. A polylinker is a marvel of engineering; in one example, within the span of 57 nucleotides there are 10 different commonly used restriction enzyme sites. This generates 53 different cloning site possibilities. To see these restriction sites, we’ll use a restriction mapping tool called the NEBcutter.

NEBcutter

There are several restriction mapping tools on the Internet, but the NEBcutter tool at the New England Biolabs Website is easy to use and generates a map with a clean look (tools.neb.com/NEBcutter2/index.php). For an example of a polylinker, a sequence can be found in GenBank file HQ418395, coordinates 1504–1560. This sequence is shown in Figure 7.5A in double-stranded form. In these 57 nucleotides are the sites for 10 restriction enzymes listed in Figure 7.5B, all of which have a six-nucleotide enzyme recognition site. Go to the GenBank file and copy the polylinker sequence as clean text (no numbers or spaces). Navigate to the NEBcutter Website, paste in the sequence, and hit “Submit.” When the screen refreshes you see a very complicated map with many more restriction enzyme sites than the 10 listed in Figure 7.5B. Many of these additional sites are not practical to use for digesting the polylinker site because they would cut the plasmid elsewhere too frequently. For example, AluI recognizes a four-base sequence for cutting, AG^CT. AluI cuts the 9921 nucleotide vector,

Restriction Mapping and Genetic Engineering

Figure 7.5 A polylinker cloning site. (A) Both strands of the 57 nucleotide sequence that contains 10 different restriction enzyme sites. (B) A list of the 10 restriction enzymes. (C) A screenshot, generated by the New England Biolabs NEBcutter, showing the location of all 10 sites.

(A)
AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC
TTCGAACGTACGGACGTCCAGCTGAGATCTCCTAGGGGCCCATGGCTCGAGCTTAAG

(B)
Enzyme    Recognition site
BamHI     G^GATCC
EcoRI     G^AATTC
HindIII   A^AGCTT
KpnI      GGTAC^C
PstI      CTGCA^G
SacI      GAGCT^C
SalI      G^TCGAC
SmaI      CCC^GGG
SphI      GCATG^C
XbaI      T^CTAGA

(C) [NEBcutter screenshot, not reproduced: both strands of the polylinker, numbered 1 to 57, with the cleavage positions of HindIII, SphI, PstI, SalI, XbaI, BamHI, SmaI, KpnI, SacI, and EcoRI marked on each strand.]

mentioned as a source of the polylinker sequence above, 41 times (see Box 7.1). To illustrate just the 10 sites listed in Figure 7.5B, click on “Custom Digest” (after the first “Submit”) in the NEBcutter form and use the check boxes to select the 10 enzymes. Then hit the “Digest” button to view the map seen in Figure 7.5C. The map generated by the NEBcutter tool shows the sequence, now both strands, and has arrows and labels to indicate the locations in the DNA backbones where the enzymes cleave. Notice how the enzyme recognition sites overlap.
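The site search behind a custom digest can be approximated in a few lines of Python. This is a minimal sketch, not NEBcutter's actual algorithm: it assumes an exact string match on the top strand only (sufficient here because all 10 recognition sites in Figure 7.5B are palindromic), and the function name `find_sites` is hypothetical.

```python
# Recognition sites from Figure 7.5B; "^" marks the cut position on the top strand.
ENZYMES = {
    "BamHI": "G^GATCC", "EcoRI": "G^AATTC", "HindIII": "A^AGCTT",
    "KpnI": "GGTAC^C", "PstI": "CTGCA^G", "SacI": "GAGCT^C",
    "SalI": "G^TCGAC", "SmaI": "CCC^GGG", "SphI": "GCATG^C",
    "XbaI": "T^CTAGA",
}

# Top strand of the 57-nucleotide polylinker (Figure 7.5A).
POLYLINKER = "AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC"


def find_sites(seq, enzymes):
    """Return sorted (site_start, enzyme, cut_after) tuples, 1-based.

    cut_after is the position of the last base before the top-strand cut.
    Assumption: palindromic sites, so only the top strand is searched.
    """
    seq = seq.upper()
    hits = []
    for name, pattern in enzymes.items():
        site = pattern.replace("^", "")
        offset = pattern.index("^")      # bases of the site left of the cut
        pos = seq.find(site)
        while pos != -1:                 # allow overlapping occurrences
            hits.append((pos + 1, name, pos + offset))
            pos = seq.find(site, pos + 1)
    return sorted(hits)


site_map = find_sites(POLYLINKER, ENZYMES)
for start, name, cut_after in site_map:
    print(f"{name:8s} site starts at {start:2d}, cuts after base {cut_after}")
```

Listing the hits by start position makes the overlap between neighboring recognition sites easy to see, since several sites begin before the previous one has ended.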

Box 7.1 How often do restriction enzymes cut DNA?

The frequency of restriction enzyme sites in a DNA sequence depends on the sequence composition, but for the most part it is a function of the number of nucleotides in the enzyme recognition site. If DNA sequence were completely random, a given four-nucleotide sequence would be found once every 256 nucleotides. This is calculated by multiplying the probabilities of finding each base in each position:

¼ × ¼ × ¼ × ¼ = 1/256

In the example shown in Figure 7.5, 41 AluI sites were found in a 9921 nucleotide vector. By calculation (9921/256), about 39 sites were expected, quite close to the number observed. The enzyme recognition sites in Figure 7.5B are six nucleotides long. These sites are expected to come up once per 4096 nucleotides. The vector is 9921 nucleotides long, so we would estimate approximately two sites for an enzyme that recognizes six bases. But remember, this is a cloning vector and some restriction sites may have been purposefully destroyed so that the polylinker site would contain the only EcoRI site, for example.
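The arithmetic in this box can be checked with a short calculation. This is a sketch that assumes completely random sequence with equal base frequencies, as the box does; the function name `expected_sites` is illustrative only.

```python
def expected_sites(sequence_length, site_length):
    """Expected number of occurrences of one specific site in random DNA.

    Assumes each of the four bases is equally likely at every position,
    so a site of n bases occurs on average once every 4**n nucleotides.
    """
    return sequence_length / 4 ** site_length


VECTOR_LENGTH = 9921  # nucleotides, from the text

print(4 ** 4, round(expected_sites(VECTOR_LENGTH, 4), 1))  # four-base cutter such as AluI
print(4 ** 6, round(expected_sites(VECTOR_LENGTH, 6), 1))  # six-base cutter such as EcoRI
```

Real sequences are not random, so these values are only a rough guide to how often an enzyme might cut.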


Input: 5′-ACTG-3′
Reverse-complement: 5′-CAGT-3′
Reverse: 3′-GTCA-5′
Complement: 3′-TGAC-5′
Desired outcome:
5′-ACTG-3′
3′-TGAC-5′

Figure 7.6 Output from the “Reverse Complement” utility on the Sequence Manipulation Suite Website. 5′ and 3′ labels have been added to more easily follow the result of the form’s actions. Of the three options, “complement” was needed to generate the desired outcome.

Generating reverse strand sequences: Reverse Complement

In addition to assisting in laboratory experiments, simple utilities can be used to generate illustrations of sequence manipulations. This can be very important in documenting the planning, execution, and result of genetic engineering procedures. For example, a simple utility called Reverse Complement was used to generate Figure 7.5A. When obtaining DNA sequences from most Websites, only one strand is provided. To show the second strand of the polylinker in Figure 7.5A, a utility was needed to generate the complement. In this case, the “Sequence Manipulation Suite,” which lists many utilities, was used (www.bioinformatics.org/sms2/index.html). A Web form, “Reverse Complement,” takes DNA sequence as input and gives the choice of reverse-complement, reverse, and complement. To illustrate these choices, consider a simple sequence, ACTG, and the desired outcome, seen in Figure 7.6. Reverse-complement refers to the complementary strand of the input, displayed in a 5′ to 3′ manner. Reverse literally reverses the order of the nucleotides, showing the sequence 3′ to 5′. Complement generates what we need for the illustration. It generates the complement but doesn’t flip the orientation as seen with reverse-complement. In the desired outcome, the second strand is now correct: the T is opposite the A, G opposite the C, and so on.
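The three options can be reproduced with a few lines of Python. This is a minimal sketch, not the Sequence Manipulation Suite's own code; it assumes an uppercase or lowercase sequence containing only A, C, G, and T, and the function names are illustrative.

```python
# Translation table that swaps each base for its Watson-Crick partner.
_PAIRS = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Same strand, written 3' to 5'."""
    return seq[::-1]


def complement(seq):
    """Opposite strand, written 3' to 5' so it lines up under the input."""
    return seq.translate(_PAIRS)


def reverse_complement(seq):
    """Opposite strand, written 5' to 3'."""
    return seq.translate(_PAIRS)[::-1]


query = "ACTG"
print("Input:              5'-" + query + "-3'")
print("Reverse-complement: 5'-" + reverse_complement(query) + "-3'")
print("Reverse:            3'-" + reverse(query) + "-5'")
print("Complement:         3'-" + complement(query) + "-5'")
```

Printing the input above the complement gives the two-strand display used for the polylinker illustration.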

DNA translation: the ExPASy Translate tool

Continuing our analysis of the polylinker sequence, above, let’s simulate the translation of this short sequence. A feature of the polylinker is that it is usually placed within a gene, in a region that tolerates a short insertion, and allows the read-through translation in that gene’s reading frame. The insertion of almost any DNA sequence into the polylinker disrupts the translation of the “reporter” gene and inactivation of this gene is utilized in screening clones. To translate the polylinker, go to the Translate tool on the ExPASy Website, web.expasy.org/translate. Simply paste in the single strand of the polylinker and select “Includes nucleotide sequence” from the drop-down menu. Click on “Translate Sequence” and the output shown in Figure 7.7 is seen.

Figure 7.7 ExPASy Translate tool. The six reading frames, translated by the ExPASy Translate tool.

5′3′ Frame 1
aagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattc
K L A C L Q V D S R G S P G T E L E F

5′3′ Frame 2
aagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattc
S L H A C R S T L E D P R V P S S N

5′3′ Frame 3
aagcttgcatgcctgcaggtcgactctagaggatccccgggtaccgagctcgaattc
A C M P A G R L - R I P G Y R A R I

3′5′ Frame 1
gaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagctt
E F E L G T R G S S R V D L Q A C K L

3′5′ Frame 2
gaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagctt
N S S S V P G D P L E S T C R H A S

3′5′ Frame 3
gaattcgagctcggtacccggggatcctctagagtcgacctgcaggcatgcaagctt
I R A R Y P G I L - S R P A G M Q A

All six reading frames, three forward and three in reverse, are presented with one-letter abbreviations for the amino acids. Dashes represent termination codons. Note that four of the six reading frames are open from end to end, allowing read-through translation at the insertion site in these frames.
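A six-frame translation like the one in Figure 7.7 can be sketched in Python. This is an illustration under stated assumptions, not the ExPASy implementation: it uses the standard genetic code, accepts only A, C, G, and T, and prints a dash for termination codons to mirror the figure. The names `translate` and `six_frames` are hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

POLYLINKER = "AAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCGAGCTCGAATTC"


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, stop_symbol="-"):
    """Translate complete codons from the first base; stops become stop_symbol."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]  # KeyError on non-ACGT input
        protein.append(stop_symbol if amino_acid == "*" else amino_acid)
    return "".join(protein)


def six_frames(seq):
    """Return a dict of frame label -> translation for all six frames."""
    frames = {}
    for label, strand in (("5'3'", seq), ("3'5'", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label} Frame {offset + 1}"] = translate(strand[offset:])
    return frames


frames = six_frames(POLYLINKER)
for label, protein in frames.items():
    status = "open" if "-" not in protein else "contains a stop"
    print(f"{label}: {protein} ({status})")
```

Counting the frames without a dash gives a quick check on how many frames permit read-through across the polylinker.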

7.3 FINDING OPEN READING FRAMES

Although the Translate tool above can be used to identify open reading frames, other tools give graphical output that helps you choose reading frames in larger sequences. This is a frequently encountered problem: does an EST or a section of genomic DNA encode a protein? Perhaps the identification of large reading frames will help. If you started with an unknown DNA sequence and knew that a protein-coding region was present, what properties in the DNA sequence could lead you to identifying this coding region? Similarity to known genes or proteins is one property, and this has been covered extensively in Chapters 3 (BLASTN) and 5 (BLASTX). An additional approach is to look for open reading frames (ORFs). In order to encode a protein, the DNA sequence must be free of translation termination codons for an extended length of one reading frame. This “open” reading frame would then end with a termination codon marking the 3′ boundary of the coding region. If you examine the other forward reading frames in the same region, you almost always see frequent termination codons that could remove these frames from consideration as coding sequence. If DNA sequence were completely random, you would encounter termination codons three times out of every 64 codons (see Figure 4.2), that is, about once every 21 codons or roughly once every 64 nucleotides. Of course, very small real coding regions could be hidden among random stretches of DNA sequence that happen to have very few stop codons. Although many mammalian proteins are hundreds or thousands of amino acids long, there are many real proteins that are less than 100 amino acids in length. Also, remember that the first nucleotide of all three termination codons is T (TAG, TGA, TAA), so a DNA sequence with few Ts could have many long stretches of sequence without any terminators but not necessarily code for a protein.
Other data must be gathered to predict if a short reading frame is real: for example, does the encoded protein sequence show any similarity to anything in the database? In small genomes, space is often limiting so you see very little noncoding sequence. Unlike mammalian genomes where you may have thousands of nucleotides between genes, bacterial or viral genomes can have genes separated by 100 nucleotides or less. Overlapping genes are not uncommon. Evolutionary constraints on these overlapping coding regions must be very strong since a single nucleotide change could alter two protein sequences at once.
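The idea of scanning each reading frame for a stretch free of termination codons can be sketched as follows. This is a simplified illustration, not the algorithm of any particular tool: it assumes an ORF begins at the first ATG after the previous stop, ends at the next in-frame stop codon, and is reported with coordinates on the strand that was scanned. The function name `find_orfs` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_nt=30):
    """Return (strand, frame, start, end, length) tuples, longest first.

    start and end are 1-based positions on the scanned strand; the length
    includes the ATG and the stop codon. ORFs that run off the end of the
    sequence without a stop are not reported in this sketch.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame ATG
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_nt:
                        orfs.append((strand, frame + 1, start + 1, i + 3, length))
                    start = None                   # look for the next ORF
    return sorted(orfs, key=lambda orf: -orf[4])


# Toy sequence: two leading bases, ATG, ten GCT codons, a TAA stop, two trailing bases.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
for orf in find_orfs(toy):
    print(orf)
```

Raising `min_nt` plays the same role as a minimum-length filter: it hides the many short ORFs that arise by chance, at the risk of missing genuinely small coding regions.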

The NCBI ORF Finder

To demonstrate the capabilities of the NCBI ORF Finder, let’s look at some genes that were studied in earlier chapters. Go to www.ncbi.nlm.nih.gov/gorf/gorf.html, and enter NM_181744, which is the accession number for a human opsin-5 transcript (first encountered in Section 6.2). Notice that you may also enter a sequence in FASTA format. Selecting the standard genetic code (the default in the drop-down menu), click on the “OrfFind” button. When the window refreshes, this tool shows you a graphic of ORFs in all six frames as colored boxes. By using a drop-down menu, you can raise or lower the minimum length of the ORFs shown (you have to hit the “Redraw” button to refresh the view). This filtering can reduce the number of small ORFs from your view since there are many that are not biologically real. But if you don’t find what you are looking for with a large window, you can lower this parameter and examine smaller coding regions. By clicking on the graphic of the open reading frame, the predicted translation will then be displayed below it. In addition, the table listing


Figure 7.8 The NCBI ORF Finder tool. A transcript of human opsin-5, NM_181744, was analyzed and shown are ORF predictions, both in graphic and table form. The sequence runs 5′ to 3′, left to right, and each reading frame gets a separate row in the graphic. The largest ORF of this transcript was clicked on, changing its color, marking the table listing with the same color, and showing the translation of that ORF below (only a portion of the translation is shown in this figure).

the ORFs refreshes and indicates which ORF you have selected (Figure 7.8). A nice feature is that you can use a built-in BLASTP function to test your ORF for sequence similarity to anything already in a small handful of databases. Looking at the results carefully, a 1065 nucleotide ORF is the largest and was found at the extreme 5′ end. The total length of the transcript is 3496 nucleotides, the ORF representing only 30% of the total length. This is typical for vertebrate genes, with a relatively short 5′ UTR (in this case, 28 nucleotides) and an extensive 3′ UTR.

Now, let’s compare three other transcripts: NM_014244, NM_198904, and NM_001492. As seen in Figure 7.9A and Figure 7.9B, the largest ORF is at the 5′ end of the transcript. Checking the annotation of these two genes, ADAMTS2 and GABA A receptor gamma 2, confirms that these large ORFs are the correct choices. However, Figure 7.9C shows a transcript requiring contemplation. This transcript, NM_001492, is a clear deviation from the pattern seen so far and is from a very unusual gene. This gene, GDF1, encodes two different proteins from a single transcript and is referred to as “bicistronic” to reflect this. The largest ORF at the 5′ end encodes the LASS1 protein, and the largest ORF at the 3′ end encodes the GDF1 protein. There are two RefSeq records: NM_001492 and NM_021267 for GDF1 and LASS1, respectively. In the annotation of each, there is a “misc feature” which points out that the other ORF and coding region exist. Even without the complication of the second large ORF, there are four other ORFs each over 500 nucleotides. Using the convenient BLASTP function within the ORF Finder shows that there is no record of these other ORFs producing a protein.

As mentioned before in Chapter 5, because of the tendency for reverse transcriptase to fall off mRNA prematurely, there is a 3′ bias in many cDNA libraries. The result is that the 5′ UTR of many cDNAs may be very short and not the correct length.
Only with careful and often different cDNA synthesis techniques are the full-length 5′ UTRs identified. Regardless, the ORF will tend to be at the 5′ end of the cDNA, with a 5′ UTR of several hundred nucleotides or less. Of course,

Figure 7.9 ORF Finder graphics of three human transcripts. (A) ADAMTS2, NM_014244, a 6772 nucleotide transcript with a 3636 nucleotide coding region shown as the largest ORF in the first reading frame. (B) GABA A receptor gamma 2, NM_198904, a 3957 nucleotide transcript with a 1428 nucleotide coding region shown as the largest ORF in the second reading frame. (C) A bicistronic transcript, NM_001492, encoding both LASS1 (coordinates 73–1125) and GDF1 (coordinates 1395–2513). The total length of the transcript is 2558 nucleotides.

there are exceptions. Some 3′ UTRs can be incredibly long, easily numbering in the thousands of nucleotides. Historically, some cDNAs were annotated as encoding novel proteins but they were later found to be all 3′ UTR sequence! Only later, when full-length cDNAs were identified, was the true identity of these DNAs determined, and the “novel” coding regions turned out to be short open reading frames in the 3′ UTR.

7.4 PCR AND PRIMER DESIGN TOOLS

The polymerase chain reaction (PCR) is one of the major tools of molecular biology and this genomic era. Double-stranded DNA is denatured to single strands through the elevation of incubation temperature, and then cooled to allow oligonucleotide primers to anneal to the now single-stranded templates. DNA polymerase uses these primers to synthesize the complementary strands for the templates, and then the double-stranded DNA is denatured again to start a new cycle. Only 20 rounds of PCR are needed to increase the amount of starting material by one million-fold, making rare sequences accessible and creating new approaches to cloning and detection. One round of PCR is illustrated in Figure 7.10.
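The million-fold figure follows from doubling the template in every cycle. The short calculation below is a sketch that assumes ideal amplification, with every template copied in every round; real reactions are less efficient. The function names are illustrative.

```python
import math


def fold_amplification(cycles):
    """Fold increase after a number of cycles, assuming perfect doubling."""
    return 2 ** cycles


def cycles_needed(fold):
    """Smallest whole number of ideal cycles that reaches the given fold increase."""
    return math.ceil(math.log2(fold))


print(fold_amplification(20))    # copies per starting molecule after 20 cycles
print(cycles_needed(1_000_000))  # cycles required for a million-fold increase
```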


Figure 7.10 The polymerase chain reaction. In this cartoon, one round of amplification is illustrated. (A) One double-stranded DNA template. (B) This is heated to melt it to single strands. (C) At cooler temperatures, oligonucleotide primers then anneal to the single-stranded DNA in a sequence-specific manner. (D) The 3′ ends of these oligonucleotides act as primers for DNA synthesis by a thermally stable DNA polymerase, generating double-stranded copies of the original starting material.

(A)
5′------------------------------3′
3′------------------------------5′
↓
(B)
5′------------------------------3′

3′------------------------------5′
↓
(C)
5′------------------------------3′
                           3′---5′
5′---3′
3′------------------------------5′
↓
(D)
5′------------------------------3′
3′------------------------------5′
5′------------------------------3′
3′------------------------------5′

The oligonucleotide primers for PCR are key to the success of the amplifications. They must possess a number of qualities to work properly, and software has been developed to assist in their proper design. First, they must be sequence-specific. PCR primers are typically added to DNA samples containing many millions of sequences, often genomic DNA. Should the primers be nonspecific, able to anneal in multiple places in the genome, subsequent reactions will amplify DNA in an unpredictable and nonproductive manner. In PCR cycles, an excess of oligonucleotide primers is used to ensure adequate amounts for million-fold amplification. These primers must not anneal to each other or they will be unavailable for their designed task. The primers must anneal to the template at the proper temperature, and in the presence of the salts necessary for the DNA synthesis to proceed. Their base composition and the sequence at their ends are important in optimizing their binding to the template. Since “A” is always paired with “T,” and “G” with “C,” base composition is usually expressed as %GC, leaving you to do the mental math to calculate the balance, %AT.
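Two of these properties are easy to estimate by hand or with a short script. The sketch below computes %GC and a rough melting temperature using the common rule of thumb of 2 °C per A or T and 4 °C per G or C; this rule is an assumption chosen for illustration and is not the more detailed calculation used by primer design software. It also includes a deliberately crude check for complementary 3′ ends. The function names are hypothetical, and the two primers are the pair reported in the Primer3 output of Figure 7.11A.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C.

    A simple approximation for short oligonucleotides, used here only
    for illustration.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def three_prime_ends_pair(primer_a, primer_b, n=4):
    """True if the last n bases of the two primers are perfectly complementary.

    A crude primer-dimer warning; real tools score partial matches as well.
    """
    pairs = str.maketrans("ACGT", "TGCA")
    end_a = primer_a.upper()[-n:]
    end_b = primer_b.upper()[-n:]
    return end_a == end_b.translate(pairs)[::-1]


left = "CCTAAGCTGATTCGGCCATA"
right = "CAATGCATGCAGAAAGAAGC"
for name, primer in (("left", left), ("right", right)):
    print(name, gc_percent(primer), rule_of_thumb_tm(primer))
print("3' ends complementary:", three_prime_ends_pair(left, right))
```

The %AT balance is simply 100 minus the %GC value.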

Primer3

Primer3 (frodo.wi.mit.edu/primer3) is a well-respected and widely used Web form for primer design. Two pages of parameters are available for primer refinement should the default settings not provide results. After entering your DNA sequence (the target or “source” sequence for priming), you are given the opportunity to mask common repeat units from a drop-down menu (Mispriming Library) by choosing the species. With literally millions of Alu and L1 repeats in many genomes, you MUST avoid these sequences when performing PCR cycles on genomic DNA or you may end up amplifying sequences from all over the target genome. To demonstrate the capabilities of the Primer3 Website, paste the DNA sequence for the human beta globin gene, NG_000007, into the source sequence text field. If you paste in the FASTA format (with the annotation line), the form will include the annotation in the results page. For this example, the goal is to design two primers that flank the second exon, nucleotide coordinates 273–495 in this sequence file. To input this information into Primer3, go to the “Targets” field and enter 273,223 representing the starting nucleotide and the length of the exon, respectively. Another way to designate the exon is to surround it with square brackets within

the field where you pasted in your sequence. Finally, you also have the option to identify specific regions to avoid (“Excluded Regions”). You may want to exclude sequences of adjacent exons, for example. Figure 7.11B shows the results from this example. Note that it is a simple, easy-to-understand text graphic that can easily be pasted into an electronic record of your laboratory work.
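The coordinate bookkeeping for the “Targets” field is easy to get wrong by one, so a short sketch may help. It assumes 1-based, inclusive coordinates, and that the position reported for the right primer is its 5′ end, the rightmost base of the product. The function names are hypothetical, and the numbers are those used in this example and reported in Figure 7.11A.

```python
def target_field(first, last):
    """Convert 1-based inclusive coordinates to a 'start,length' string."""
    return f"{first},{last - first + 1}"


def mark_target(seq, start, length):
    """Return the sequence with square brackets around the target region."""
    begin = start - 1                    # convert to a 0-based index
    end = begin + length
    return seq[:begin] + "[" + seq[begin:end] + "]" + seq[end:]


def product_size(left_start, right_start):
    """Amplicon length from the 5' positions of the left and right primers."""
    return right_start - left_start + 1


print(target_field(273, 495))           # second exon of the example
print(product_size(225, 522))           # primer positions from Figure 7.11A
print(mark_target("AAACCCGGG", 4, 3))   # toy sequence, for illustration only

# Both 20-nucleotide primers should lie outside the target they flank.
left_end = 225 + 20 - 1                 # last base of the left primer
right_inner = 522 - 20 + 1              # innermost base of the right primer
print(left_end < 273, right_inner > 495)
```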

Five primer pairs (this is the default number) are listed in Figures 7.11A and 7.11D, and one pair is displayed on the sequence in Figure 7.11B. Coordinates, melting temperature, and the primer sequences are displayed along with statistics in Figure 7.11E, including how many possibilities were considered (in this case, in the thousands!). Notice that the “Right” (downstream) primer design encountered many more problem sequences than the “Left” primer; downstream primers were rejected because of “bad GC%” and the melting temperature was a problem for many candidates.

STS: What’s that?

Looking at the annotation of the human beta globin gene region (NG_000007) there are many, many locations annotated as “STS.” These are Sequence Tagged Sites, which are short (100–500 nucleotides) sequences that are unique in the genome and were identified and designated for the purposes of generating large genome maps. If you have enough STSs, you can put together a map with accurate distances. Once they are placed on the map, they can be used (in pairs or more) to identify distances, insertions or deletions, or chromosomal rearrangements.

(A)
PRIMER PICKING RESULTS FOR gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@); and hemoglobin, beta (HBB); and hemoglobin, delta (HBD); and hemoglobin, epsilon 1 (HBE1); and hemoglobin, gamma A (HBG1); and hemoglobin, gamma G (HBG2), RefSeqGene on chromosome 11

No mispriming library specified
Using 1-based sequence positions
OLIGO          start  len     tm    gc%   any    3’  seq
LEFT PRIMER      225   20  60.19  50.00  5.00  3.00  CCTAAGCTGATTCGGCCATA
RIGHT PRIMER     522   20  59.57  45.00  8.00  2.00  CAATGCATGCAGAAAGAAGC
SEQUENCE SIZE: 2870
INCLUDED REGION SIZE: 2870

PRODUCT SIZE: 298, PAIR ANY COMPL: 4.00, PAIR 3’ COMPL: 2.00
TARGETS (start, len)*: 273,223

(B)
1   GGATCCTCACATGAGTTCAGTATATAATTGTAACAGAATAAAAAATCAATTATGTATTCA
61  AGTTGCTAGTGTCTTAAGAGGTTCACATTTTTATCTAACTGATTATCACAAAAATACTTC
121 GAGTTACTTTTCATTATAATTCCTGACTACACATGAAGAGACTGACACGTAGGTGCCTTA
181 CTTAGGTAGGTTAAGTAATTTATCCAAAACCACACAATGTAGAACCTAAGCTGATTCGGC
                                                >>>>>>>>>>>>>>>>
241 CATAGAAACACAATATGTGGTATAAATGAGACAGAGGGATTTCTCTCCTTCCTATGCTGT
    >>>>                            ****************************
301 CAGATGAATACTGAGATAGAATATTTAGTTCATCTATCACACATTAAACGGGACTTTACA
    ************************************************************
361 TTTCTGTCTGTTGAAGATTTGGGTGTGGGGATAACTCAAGGTATCATATCCAAGGGATGG
    ************************************************************
421 ATGAAGGCAGGTGACTCTAACAGAAAGGGAAAGGATGTTGGCAAGGCTATGTTCATGAAA
    ************************************************************
481 GTATATGTAAAATCCACATTAAGCTTCTTTCTGCATGCATTGGCAATGTTTATGAATAAT
    ***************