Bioinformatics: a primer
 9788122426465, 9788122416107, 8122416101, 8122426468

Table of contents :
Cover
Preface
Acknowledgement
Contents
1.1 "Bioinformatics"
1.2 The Objectives of Bioinformatics
2.1 Atomic Structure
2.2 Molecules
3.1 Constituents of Nucleic Acids
3.2 Polynucleotides
3.3 The Genome Projects
4.1 Amino Acids
4.2 Proteins
4.3 Forces Stabilizing the Molecular Structure
Chapter 5 Physicochemical Characterization of Biomolecules
5.1 Hydrodynamical Methods
5.2 Chromatographic Methods
5.3 Electrophoretic Methods
5.4 Blotting Techniques
6.1 Primary Structure Determination of Nucleic Acids
6.2 Primary Structure Determination of Proteins
7.1 X-Ray Diffraction Methods
7.2 Nuclear Magnetic Resonance (NMR) Spectroscopy
7.3 Imaging Methods
8.1 Protein-Nucleic Acid Interactions
8.4 Protein-Lipid Interactions
Chapter 9 The Protein-folding Problem
9.1 Genomics Analysis
9.2 Proteomic Analysis
10.1 Protein Folding Rules
10.2 Structure Prediction of Fibrous Proteins
10.3 Structure Prediction of Globular Proteins
10.4 Application of Structure Prediction Programs
11.1 Primary Structure (Sequence)
11.2 Databases
11.3 Genome Database Search
11.4 Protein Database Search
Chapter 12 Data Mining, Analysis and Modeling
12.1 Sequence Alignment Analysis
12.2 Pair-Wise Sequence Alignment
12.3 Multiple Sequence Alignment (MSA)
12.4 Phylogenetic Analysis
12.5 Secondary Structure Analysis
12.6 Motifs, Domains and Profiles
12.8 Protein Classification and Modeling
13.1 Disease Gene Identification
13.2 Genetic Variations and Genetic Diseases
13.3 Genetic Testing and Therapy
14.1 Genomics and Proteomics Analysis
14.2 Rational Design
14.3 Validation
Glossary
Index


Copyright © 2005, New Age International (P) Ltd., Publishers Published by New Age International (P) Ltd., Publishers All rights reserved. No part of this ebook may be reproduced in any form, by photostat, microfilm, xerography, or any other means, or incorporated into any information retrieval system, electronic or mechanical, without the written permission of the publisher. All inquiries should be emailed to [email protected]

ISBN (13) : 978-81-224-2646-5

PUBLISHING FOR ONE WORLD

NEW AGE INTERNATIONAL (P) LIMITED, PUBLISHERS 4835/24, Ansari Road, Daryaganj, New Delhi - 110002 Visit us at www.newagepublishers.com

Dedicated to my parents, in their memory, and to my wife and son, for their constant encouragement, support and patience.


Preface

Associations of elements, molecules, their complexes and aggregates carry physicochemical information that can be of chemical and biological importance. There is therefore a tendency to address a wide variety of physicochemical interactions under “Bioinformatics”. Such an approach unfortunately dilutes the focus of the main objectives of bioinformatics. “Bioinformatics” is the study (by experimental and computational methods) of biological information, from its storage sites (DNA/RNA) in the genome to the various gene products in the cell. In the life processes, to whose realm bioinformatics logically belongs, the fundamental building blocks of living systems are nucleic acids, proteins, carbohydrates, lipids and their complexes. The central aim of bioinformatics, therefore, is to elucidate (by experimental methods) and understand the structural (primary, secondary, tertiary etc.) features of these biological entities, and to correlate these structural features with the physicochemical interactions, functions and pathways among these molecular entities in the cell (experimentally as well as computationally). Bioinformatics is a multi-disciplinary subject, with immense scope in the molecular biology, biotechnology, pharmaceutical and medical fields: e.g., genome and protein sequencing, structure prediction and molecular modeling, design and development of novel molecules and drugs (molecular engineering), medical diagnostics and therapeutics. With the advances in experimental methodologies of molecular structure determination (X-ray diffraction; NMR spectroscopy), biology has become data-rich, with a considerable amount of experimental data available on complex biomolecular structures. Tremendous progress has also been taking place in computation, in terms of data storage capacity, processing and visual display.
These developments have made it possible to address a complex array of biological systems and interactions in a systematic and quantitative way. Bioinformatics as understood and pursued by computational and biomedical scientists is treated as synonymous with computational biology, which relies primarily on computational (theoretical) methods as applied to the biological and medical sciences, in areas such as genomics, proteomics and drug design. In this kind of approach there is a tendency to underplay the crucial importance of experimental methods of data acquisition and structure-function interpretation. But the availability of structural data from experimental methods is the core requirement in rational molecular drug design (molecular engineering) and validation. Lack of such data, or poor understanding of it, would lead to spurious and inconsistent molecular models and would thus undermine the very objective of bioinformatics. The major objectives of bioinformatics remain the same: (i) Development of computer databases and algorithms to analyze biological data. (ii) Processing and interpretation of these complex physicochemical databases on molecular interactions, structures and functions of biomolecules. (iii) Computational methods for structure prediction and molecular modeling (molecular engineering/drug design), based on available experimental data, including cases where the existing experimental techniques are too time-consuming or are unable to provide structural information because of inherent operational constraints. Until recently, progress in bioinformatics (quantitative biology) was initiated and nurtured by physical scientists (crystallographers; NMR specialists) and biologists. However, this situation is changing, with bioinformatics relying more and more upon computation-oriented problem-solving protocols. Biologists may know biological systems and their functions, but when they take up computational bioinformatics the emphasis should be on the structural and functional aspects of biological molecules vis-à-vis their physical and chemical characteristics. On the other hand, computational personnel may be versatile in the operational aspects of computer programs and algorithms, but a real handicap arises if they lack basic knowledge of the structures and functions of the biological systems whose complexity they are supposed to unravel. This situation is akin to that of a driver who is adept at driving automobiles but lacks basic knowledge of automobile engineering. Therefore, it is imperative that both groups have an operational knowledge of the essentials of molecular biophysics, molecular biochemistry and structural biology when undertaking the task of molecular modeling and design.
With these broad objectives in mind, the contents of this book, “Bioinformatics: A Primer”, are organized under molecular biophysics, experimental methods of structure elucidation, database search, data mining and analysis, computational methods of structure prediction, and rational molecular/drug design and validation, with easy interfaces between these areas and the various chapters. Ample tables and figures, culled from the Protein Data Bank (PDB) and other sources, are intended to give the reader insight into structure-function features at the molecular level. The exercise modules and bibliography for each chapter, and the glossary, are aimed at giving the reader a wider perception of the subject matter and of the scientific and technical terms. An index is also provided for easy access to the words and topics of the subject matter.
P. Narayanan

Acknowledgement

Thanks are due to Ms. Swarna Murthy, Bhabha Atomic Research Centre, Mumbai, for rendering bibliographic help. P. Narayanan


Contents

Preface
Acknowledgement
1. Bioinformatics: Introduction
    1.1 “Bioinformatics”
    1.2 The Objectives of Bioinformatics

Section I Biomolecular Structure (Molecular Biophysics)
2. Atoms and Molecules
    2.1 Atomic Structure
    2.2 Molecules
3. Features of Nucleic Acids
    3.1 Constituents of Nucleic Acids
    3.2 Polynucleotides
    3.3 The Genome Projects
4. Features of Proteins
    4.1 Amino Acids
    4.2 Proteins
    4.3 Forces Stabilizing the Molecular Structure

Section II Experimental Methods of Structure Elucidation (Bioinformatics-I)
5. Physicochemical Characterization of Biomolecules
    5.1 Hydrodynamical Methods
    5.2 Chromatographic Methods
    5.3 Electrophoretic Methods
    5.4 Blotting Techniques
6. Primary Structure (Sequence) Determination of Biomolecules
    6.1 Primary Structure Determination of Nucleic Acids
    6.2 Primary Structure Determination of Proteins
7. Spatial Structure Determination of Biomolecules
    7.1 X-Ray Diffraction Methods
    7.2 Nuclear Magnetic Resonance (NMR) Spectroscopy
    7.3 Imaging Methods
8. Protein-Ligand Interactions
    8.1 Protein-Nucleic Acid Interactions
    8.2 Protein-Protein Interactions
    8.3 Protein-Carbohydrate Interactions
    8.4 Protein-Lipid Interactions

Section III Towards Structure Prediction (Bioinformatics-II)
9. The Protein-folding Problem
    9.1 Genomics Analysis
    9.2 Proteomic Analysis
10. Computational Methods in Structure Prediction
    10.1 Protein Folding Rules
    10.2 Structure Prediction of Fibrous Proteins
    10.3 Structure Prediction of Globular Proteins
    10.4 Application of Structure Prediction Programs

Section IV Database Search, Analysis and Modeling (Bioinformatics-III)
11. Database Search
    11.1 Primary Structure (Sequence)
    11.2 Databases
    11.3 Genome Database Search
    11.4 Protein Database Search
12. Data Mining, Analysis and Modeling
    12.1 Sequence Alignment Analysis
    12.2 Pair-wise Sequence Alignment
    12.3 Multiple Sequence Alignment (MSA)
    12.4 Phylogenetic Analysis
    12.5 Secondary Structure Analysis
    12.6 Motifs, Domains and Profiles
    12.7 Pattern Recognition
    12.8 Protein Classification and Modeling
13. Medico- and Pharmacoinformatics
    13.1 Disease Gene Identification
    13.2 Genetic Variations and Genetic Diseases
    13.3 Genetic Testing and Therapy
14. Molecular Engineering
    14.1 Genomics and Proteomics Analyses
    14.2 Rational Design
    14.3 Validation

Glossary
Index

1 Bioinformatics: Introduction
The structural and physicochemical characteristics of atoms, molecules and their complexes in the cells of living organisms form the essence of biological information content and its use. Genetic materials, that is, genes and gene products (e.g. proteins), are the basis of the life processes. Understanding the intricate processes of information storage, retrieval and transmission in genes and gene products in the cell is the first logical step towards understanding the complex life processes. A better understanding of these biochemical processes would help in understanding the structure-function relationships and biochemical pathways in the life processes. An understanding of the behavior of biological systems at each level of their organization can only be achieved by careful study of the complex dynamical interactions between the components of these systems. For this understanding to be quantitative, it is necessary to develop structural, biophysical and biochemical mathematical models. Once developed, these models can be simulated, analyzed and visualized through the application of modern engineering and computational approaches. The acquisition of high-throughput biological data (e.g. from genome projects) at a fast rate has ushered in computer-intensive data analysis (in silico analysis). Computational methods are used to obtain meaningful data from gene expression microarrays (cDNA microarray-based RNA quantitation), proteomics, mass spectrometry (MS), two-dimensional gel electrophoresis (2-DE), protein-ligand interaction studies and other experiments, in order to establish biological pathways. It is hoped that this would ultimately help, in addition to showing how various factors are interconnected, in improving genes and gene products, designing better and new molecular species, identifying disease-susceptibility genes, defining and diagnosing disease on a molecular basis, and identifying targets for therapeutic intervention and the design of new drugs.

1.1 “BIOINFORMATICS”
Progress in structural biology has been closely associated with the emergence of a new area of quantitative biology, currently known as “bioinformatics”. Elucidation of the three-dimensional structures of biomolecular model complexes (those of nucleotides and peptides), primarily by X-ray crystallography, made it possible to address the dynamics of protein-nucleic acid interactions and the conformational study of biomolecules (e.g. Ramachandran analysis). Technical advances in structural biology, together with the development of fast computers with vast storage capacity for processing voluminous data, ushered in the era of macromolecular crystallography, which has immensely enriched structural and quantitative biology and was also the harbinger of bioinformatics. Besides, laboratory automation and the integration of improved technologies in the biological and allied sciences led to the rapid accumulation of vast amounts of genome sequence information, functional expression analysis and other types of experimental data. The considerable “algorithmic complexity” of biological systems requires a vast amount of detailed information at the cellular and molecular levels for their complete description. Thus, the need for systematic organization and analysis, towards an integrated view of biology, necessitated the introduction of computers, resulting in the evolution of a new area of quantitative biology, namely “bioinformatics”, dedicated to the computational “mining” or in silico analysis of these experimental data. Until recently, while the (experimental) biologists focused on accumulating more data from elaborate experimental approaches, the quantitative biologists concentrated on developing algorithms for the interpretation of these data, with minimal cross-interaction. The comprehensive introduction of computers into the natural sciences has completely changed the “mindset” of both the experimental and the quantitative biologists and has universalized the outlook towards approach and methodology in scientific research. This change in attitude has also, in part, ushered in greater acceptance of scientists from divergent fields: physicists, chemists, molecular biologists, biomedical and pharmacological personnel, and computer scientists. “Bioinformatics” is thus a generic term that encompasses the application of computational tools and approaches to the study of information content, organization, processing and utilization in biological systems (genes and gene products) at the molecular level, e.g. the study of diagnostic, therapeutic and prognostic features (biomedinformatics), and of structural and chemical features in molecular and drug design (cheminformatics).
It is the symbiotic relationship between computational and biological sciences with emphasis on computational aspects. In general, bioinformatics is understood and pursued as the study of information content and information flow in biological systems and processes. It is a bridge between observation (experimental data) in diverse biologically related disciplines and extrapolation of information, by computational analysis, about how the systems and processes function. It also envisages subsequent application of the knowledge and insights thus gained towards rational design and synthesis of new molecules (e.g. drugs, insulin and other biological compounds) tailor-made to desired specifications or conditions. Thus, the aims of bioinformatics are broad and diverse, and require trans-disciplinary collaboration between molecular biophysicists, drug chemists, molecular biologists and computer-aided modeling experts.

1.1.1 Information Content and Transmission
Double helical DNA (in general) is the molecule in which the genetic information (genetic code) is stored. The information is stored chemically by sequence-specific base-pairing (A=T and G≡C). This base-pair complementarity implies that the strands of nucleic acids act as templates for replication (duplication) of the genetic code. The genetic code is transcribed into a messenger RNA (mRNA) strand, which is complementary to its DNA template. The mRNA is the information-carrying link between the DNA and the synthesis (translation) of proteins in ribosomes. The tRNA molecules carry their respective amino acids to the ribosomal sites and interact with the mRNA (codon-anticodon interactions) in synthesizing proteins.
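The complementarity rule can be made concrete with a minimal sketch. It assumes the DNA template strand is written 3'→5', so that the complementary mRNA reads 5'→3' base by base; the function name `transcribe` and the example fragment are hypothetical illustrations, not taken from the text.

```python
# Complement rule for transcription: template DNA base -> mRNA base.
# In RNA, uracil (U) takes the place of thymine (T).
DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the mRNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(DNA_TO_MRNA[base] for base in template_3to5.upper())


# Hypothetical template fragment, used only to show the base-by-base pairing.
mrna = transcribe("TACAAAGGC")
print(mrna)
```

A lookup table keeps the pairing rule explicit; a base outside A, T, G, C raises a `KeyError` rather than being copied silently.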


The genetic code is the relationship between the sequence of bases in DNA (via its mRNA transcript) and the sequence of amino acids in proteins. An amino acid is coded by a set of three contiguous bases, called a codon. The genetic code is degenerate, because more than one codon can code for the same amino acid (Fig. 1.1).

Fig. 1.1 Dictionary of the Genetic Code. A = Adenine; G = Guanine; C = Cytosine; U = Uracil; ## = Stop signal (Source: Narayanan, P. (1998) J Uni Mumbai, 55(82); 41).
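The degeneracy of the code can be checked by counting codons per amino acid. The sketch below builds the standard genetic code from a compact 64-letter string in one-letter amino-acid notation, with `*` used here for the stop signal; the variable names are illustrative assumptions, not notation from the text.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in the order UUU, UUC, UUA, UUG,
# UCU, ... ('*' marks a stop signal).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degeneracy: how many codons code for each amino acid (or for stop).
degeneracy = Counter(CODON_TABLE.values())

for aa, count in sorted(degeneracy.items(), key=lambda item: -item[1]):
    print(aa, count)
```

In this table leucine (L), serine (S) and arginine (R) are each coded by six codons, while methionine (M) and tryptophan (W) are each coded by a single codon.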

1.1.2 Structural Defects and Genetic Significance
Genes are the units of heredity that provide the blueprint for our physical body, determining not only how we may live, but also the quality of that life. The quality of life can be drastically altered by disease, and genetic disease is perhaps the purest illustration of the relationship between our genes and our health. Since the genetic code in the mRNA is “read” sequentially in triplets of nucleotides (codons) and translated into proteins, any defect in the nucleotides, or any shift in the reading frame (frame-shift), will result in the synthesis of an altered protein (the basis of mutation). While genetic mutations are fundamental to evolution, they are also the cause of genetic (hereditary) diseases. There is a direct link, at the molecular level, between improper folding of protein tertiary structure and genetic diseases. Some genetic disorders can be debilitating or lethal. In the case of a single-point mutation, that is, a single nucleotide polymorphism (SNP), predisposition to the disease is directly associated with the presence of a single gene allele. To quote a few examples: Sickle cell anemia is due to a single-point mutation at the 6th position of the β-chain of hemoglobin, with the substitution of a hydrophobic valine residue in place of glutamic acid, which is acidic. This change of a single amino acid alters the structure of hemoglobin in such a way that the deoxygenated protein (dHbS) polymerizes and precipitates within the erythrocyte, leading to the characteristic sickle shape. Similarly, a single-point mutation substituting lysine for a critical threonine in antithrombin results in thrombosis. In the cystic fibrosis transmembrane-regulator (CFTR) protein (1480 residues and 170 kilodaltons), a three-nucleotide deletion that removes the crucial amino acid phenylalanine at position 508 leads to misfolding of the protein and to a defect in protein trafficking. A single-point mutation in rhodopsin (Pro → His at position 23) leads to retinal degeneration and blindness. The characteristic repeating unit of collagen monomers is (Gly-X-Y)n, and the glycine residue is crucial to the formation of collagen helices. Therefore, any mutation that replaces glycine by any other amino acid would lead to a host of pathological disorders. In many neurodegenerative diseases, such as Alzheimer’s, Parkinson’s and Huntington’s, soluble proteins aggregate into fibers; in Huntington’s disease this is related to tri-nucleotide repeat expansion (TNRE), that is, to a critical number of repeats of a particular amino acid residue (glutamine). Induced conformational change propagated from an abnormal conformer to its normal counterpart results in prion diseases. For common diseases such as cancer, the situation is less clear, and depends on genetic and environmental factors. A number of proteins can contribute to cellular transformation and carcinogenesis when their normal structure is altered by mutations in their genes. These genes are termed proto-oncogenes. For some of these proteins (e.g. the p21 protein of the c-ras gene), a single-point mutation at position 12 (replacing glycine) or at position 61 makes them oncogenic.
It is hoped that bioinformatics operations on the Human Genome Project (HGP) and other genomic data would shed light on many common diseases, and subsequently on their control. One such example is the development of inhibitors of HIV-1 protease, in which molecular modeling was used to optimize drug candidates.
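The effect of a point mutation and of a frame-shift on the translated protein can be sketched as follows. The mRNA fragment is an illustrative stand-in for the first seven codons of the mature β-globin chain (initiator methionine omitted), and the helper `translate` and all variable names are assumptions made for this example only.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the order UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate codon by codon; '*' marks a stop codon, trailing bases are ignored."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


# Illustrative fragment: Val-His-Leu-Thr-Pro-Glu-Glu.
normal = "GUGCACCUGACUCCUGAGGAG"

# Point mutation: A -> U in the middle base of codon 6 (GAG -> GUG, Glu -> Val).
pos = 5 * 3 + 1
sickle = normal[:pos] + "U" + normal[pos + 1:]

# Frame-shift: deleting a single base shifts every downstream codon.
shifted = normal[:3] + normal[4:]

print(translate(normal))   # normal reading frame
print(translate(sickle))   # one residue changed
print(translate(shifted))  # downstream residues changed, premature stop
```

The point mutation changes exactly one residue, whereas the single-base deletion alters every codon after it and, in this fragment, introduces a premature stop signal.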

1.2 THE OBJECTIVES OF BIOINFORMATICS
As stated earlier, bioinformatics deals with the (computational) methods of storing, retrieving and analyzing biological information (genetic code) as it passes from its storage sites (DNA/RNA) in the genome to the sites of synthesis of the various gene products (e.g., synthesis of proteins at the ribosomal sites), and with structure-function relationships and their effects in cells and organisms. Availability of the three-dimensional (tertiary) structure of a biomolecule (say, a protein) is vital for understanding its structure-function interactions at the molecular level and for undertaking any molecular design task. While primary structure (genomic or proteomic sequence) data can be obtained experimentally in a faster and more ‘routine’ way, the acquisition of secondary and tertiary structural data by experimental methods is still a time-consuming and tedious task. Therefore, theoretical “structure prediction” methods, which employ computational tools and algorithms to predict the secondary and tertiary structures of biomolecules from their primary structure (sequence) data, are an attractive alternative and the major objective of bioinformatics (molecular bioinformatics). The rationale of molecular bioinformatics is to conceptualize biology in terms of molecules and to apply “informatics” techniques to understand and organize the information associated with these biomolecules in cells and organisms.


Many novel genes are being uncovered through the systematic searching of available genomic sequence data, and their putative functions are being assigned through sequence identity algorithms. A general approach in bioinformatics is:
(1) Application of computer software tools for creating computer databases (both genomic and proteomic databases).
(2) Development of algorithms to utilize and manage these databases in knowledge-based analysis.
(3) Utilization of databases and computational methods in “structure prediction” methods.
(4) Use of (primary, secondary and tertiary) databases and “structure prediction” algorithms in rational molecular design to repair defective biomolecular species in the cell, and/or to synthesize better molecular species/drugs, tailor-made to desired requirements (genetic engineering/molecular engineering).
Nothing in the field of quantitative biology is really new. The paradigm shift lies in the large-scale, automated and integrated approach to molecular biology and the medical sciences. Two developments distinguish bioinformatics from the classical biological and allied sciences:
(1) Integration of advanced physical techniques (lasers, better sequencers, mass spectrometers etc.).
(2) The central role of computer-assisted operations in data acquisition and analysis.
Data collection is automated and directly connected to laser-based detectors. At the same time, data storage, retrieval and exchange operations are almost completely computer-based (in silico biological analysis). Automation in data acquisition and processing has resulted in a tremendous increase in throughput. Automated data acquisition enables scientists to spend more of their time on data analysis and interpretation. Experimental research in molecular biology has in recent years yielded a wealth of information, and a large amount of gene expression data is being added at a fast rate.
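As a minimal illustration of sequence-identity scoring, the sketch below computes the percent identity between two sequences that are assumed to be already aligned to the same length. The function name `percent_identity` and the two peptide strings are hypothetical; real database searches rely on alignment algorithms, which this sketch does not implement.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions with identical residues in two pre-aligned sequences.

    Positions where the residues are gaps ('-') are not counted as matches.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b and a != "-" for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical query and database hit differing at a single position.
query = "VHLTPEEKSA"
hit = "VHLTPVEKSA"
print(percent_identity(query, hit))
```

Dividing by the full alignment length is one of several conventions for percent identity; other tools divide by the shorter sequence or by the number of non-gap columns.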
Without functional assignment, the true goal of any genome project, which is to understand how genomes are organized, and expressed, and other functional features of the genome, cannot be achieved. That is, “structure mining”, not just “sequence mining” should be the objective of bioinformatics. Therefore, structural aspects of molecules (X-ray and NMR structural data) will become more and more important in future genomic research. However, the three-dimensional structure information by experimental methods (X-ray crystallography and NMR spectroscopy) is lagging behind the gene sequence data. Therefore, structure prediction by theoretical methods is a viable option in bioinformatics. The main objective of bioinformatics will be, therefore, to combine experimental structural data (mainly from X-ray crystallography and NMR spectroscopy) and theoretical methods of structure prediction, to understand the structure-function relationships in biomolecular complexes and utilize such ‘knowledge’ in rational design of new molecular species, drugs and therapeutic agents. In addition, a deeper understanding of complex biological systems will need a more quantitative type of biology that is closely integrated with the physical sciences (aim of quantitative biology). With the availability of large-scale genome sequence data, modern biology has become more data-rich and is faced with organizing and analyzing the sequence and other


EXERCISE MODULES
1. What are the objectives of Bioinformatics?
2. What is the genetic code and how is it stored and transmitted?
3. Why is the genetic code degenerate?
4. Give some examples of amino acids coded by multiple codons.
5. Which are the amino acids coded by the maximum number of codons?
6. Which are the amino acids coded by one codon?
7. What is the molecular basis of hereditary diseases?
8. What is mutation, and what are single-point mutations and frame-shift mutations?
9. What are the objectives of bioinformatics?
10. What is the importance of rational design of molecules?

BIBLIOGRAPHY
1. Baum, J. & Brodsky, B. (1999), Curr Opin Struct Biol., 9(1); 122. “Folding of peptide models of collagen and misfolding in disease”.
2. Bentley, D.R. (2000), Med Res Rev., 20; 189. “The Human Genome Project–an overview”.
3. Broder, S. & Venter, J.C. (2000), Curr Opin Biotechnol., 11; 581. “Whole genomes: the foundation of new biology and medicine”.
4. Carrell, R.W. & Gooptu, B. (1998), Curr Opin Struct Biol., 8(6); 799. “Conformation changes and disease– serpins, prions and Alzheimer’s”.
5. Davies, K.E. & Reid, A.P. (1988), IRL Press: New York. “Molecular basis of Inherited Diseases”.
6. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park/CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.
7. Dobson, C.M. (1999), Trends Biochem Sci., 24; 329. “Protein misfolding, evolution and disease”.
8. Harrison, P.M., et al. (1997), Curr Opin Struct Biol., 7(1); 53. “The Prion folding problem”.
9. Lee, P.S. & Lee, K.H. (2000), Curr Opin Struct Biol., 11; 171. “Genomic analysis”.
10. Narayanan, P. (1998), J Uni Mumbai, 55(82); 41. “Influence of base-stacking interactions on the variable degeneracy of the genetic code”.
11. Narayanan, P. (2001), Bhalani Pubs: Mumbai. “Clinical Biophysics: Principles and Techniques”.
12. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
13. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.
14. Perutz, M.F. (1992), W.H. Freeman: New York. “Protein Structure: New Approaches to Disease and Therapy”.
15. Stryer, L. (1995), Freeman Press: New York. “Biochemistry”, 4th Edn.
16. Watson, J.D., et al. (1987), Benjamin-Cummings: Menlo Park/CA. “Molecular Biology of the Gene”, 4th Edn.
17. Weinberg, R.A., et al. (1985), Sci Amer., 253(4); 48. “The molecules of Life”.
18. Wilson, J.M. (1993), Nature, 365; 691. “Cystic fibrosis: vehicles for gene therapy”.
19. Wlodawer, A. & Vondrasek, J. (1998), Annu Rev Biophys Biomol Struct., 27; 249. “Inhibitors of HIV-1 protease: a major success of structure-assisted drug design”.


Section I

Biomolecular Structures (Molecular Biophysics)


2 Atoms and Molecules
Matter is composed of individual entities, called elements, which are the basic building blocks of all molecules and chemical compounds. Each element is distinguishable from the others by the physical and chemical properties of its basic component, the atom. Molecules are formed from clusters of atoms held together by chemical bonds. There is a variety of covalent bonds (single, double, and triple), resulting in a variety of molecular species. A basic knowledge of these chemical bonds and their stereochemical features is necessary for understanding the essential features of molecular structures. In bioinformatics, the molecules and their interactions (DNA-DNA, DNA-protein and protein-protein interactions) should be addressed at the atomic and molecular levels. Therefore, in order to evaluate sequence and structural information, one must have a basic understanding of the representation of molecules in terms of atoms and bonds, known as a “chemical graph” (atomic structure without coordinate information).

2.1 ATOMIC STRUCTURE
An atom consists of a central, positively charged nucleus, in which practically all the mass (protons and neutrons) is concentrated, and negatively charged electrons distributed in different orbits around the nucleus. From the viewpoint of classical physics, the picture is analogous to planetary motion around a central massive star. But the concepts of classical physics can explain neither the stability of the atom nor the occurrence of discrete spectral lines. According to the laws of classical physics, an orbiting charged particle radiates energy, so electrons orbiting a nucleus would be unstable: they would spiral down and collapse into the nucleus. Niels Bohr (1885-1962) resolved this difficulty, and also ushered in quantum physics, with his ad hoc proposition that electrons revolving in their orbits do not radiate energy, and that spectral lines are due to electron transitions between the orbits (Fig. 2.1). According to the concepts of quantum physics, the state of an electron is described by wave functions (orbitals). Orbitals represent regions in space in which a particle of a particular energy is most likely to be found. Orbitals for electrons are specified by four quantum numbers: the principal quantum number, n; the angular (azimuthal) quantum number, l; the magnetic quantum number, m; and the spin quantum number, s. The orbitals belonging to the same principal quantum number (n) constitute a “shell”. Capital letters of the alphabet denote the “shells”.


Fig. 2.1 Origin of the Atomic Line Spectra

n = 1, 2, 3, 4, … (K, L, M, N, … shells)    (2.1)

Orbitals of the same shell but with different azimuthal quantum numbers constitute “subshells”. For a given n, l takes the values 0 to n – 1. The subshells are denoted by lowercase letters of the alphabet.

l = 0, 1, 2, 3, … (s, p, d, f, … subshells)    (2.2)

The magnetic quantum number, m, takes the 2l + 1 integer values from –l to +l. Its significance is realized when the particle is placed in a magnetic field (e.g. nuclear magnetic resonance, NMR). The spin quantum number, s, can have only two values, either +½ (↑) or –½ (↓). The quantum numbers, and the physical attributes to which they correspond, combined with Pauli’s exclusion principle, which states that no two electrons in an atom can have the same four quantum numbers, provide a rational approach towards determining the electronic configuration of atoms (Table 2.1), and thereby the physical and chemical periodicity of atoms (the periodic table of the elements). The order of filling of the electronic orbitals of an atom of atomic number Z is 1s, 2s 2p, 3s 3p, 4s 3d 4p, 5s 4d 5p, 6s 4f 5d 6p, 7s 5f 6d.

Table 2.1 The Electronic Configuration of Atomic Structure (number of orbitals in each subshell)

n (shell) | s (l = 0) | p (l = 1) | d (l = 2) | f (l = 3) | g (l = 4) | Total Orbitals | Number of Electrons
1 (K)     | 1         | -         | -         | -         | -         | 1              | 2
2 (L)     | 1         | 3         | -         | -         | -         | 4              | 8
3 (M)     | 1         | 3         | 5         | -         | -         | 9              | 18
4 (N)     | 1         | 3         | 5         | 7         | -         | 16             | 32
5 (O)     | 1         | 3         | 5         | 7         | 9         | 25             | 50
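The counts in Table 2.1 follow directly from the quantum-number rules and can be reproduced with a short Python sketch. The function names are illustrative. The filling order is generated with the Madelung (n + l) rule, which is an assumption here: the text lists the order but does not name the rule, and some real atoms deviate from it.

```python
def subshell_orbitals(l):
    """Orbitals in a subshell: m takes the 2l + 1 values from -l to +l."""
    return 2 * l + 1


def shell_summary(n):
    """Return (total orbitals, maximum electrons) for the shell with principal number n."""
    orbitals = sum(subshell_orbitals(l) for l in range(n))  # l = 0 .. n-1
    return orbitals, 2 * orbitals  # two spin states per orbital (Pauli principle)


def filling_order(max_n=7):
    """Subshell filling order from the (assumed) Madelung rule:
    lower n + l fills first; ties are broken by lower n.
    Only s, p, d and f subshells are considered."""
    subshells = [(n, l) for n in range(1, max_n + 1) for l in range(min(n, 4))]
    subshells.sort(key=lambda nl: (nl[0] + nl[1], nl[0]))
    return [f"{n}{'spdf'[l]}" for n, l in subshells]


if __name__ == "__main__":
    for n, shell in enumerate("KLMNO", start=1):
        print(shell, shell_summary(n))
    print(" ".join(filling_order()[:19]))
```

For n = 1 to 5, `shell_summary` gives 1, 4, 9, 16, 25 orbitals and 2, 8, 18, 32, 50 electrons, which are the last two columns of Table 2.1.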


2.2 MOLECULES The smallest entity of a chemical compound is a molecule. Ionic bonds do not lead to the formation of discrete molecules, but to the formation of extended aggregates (e.g. NaCl). Discrete molecular structures are formed from the association of atoms by chemical (covalent) bonds. A covalent bond is formed between two atoms when they share an electron pair. There are several types of chemical bonds: single bonds, double (and triple) bonds, conjugated bonds and coordination bonds. Single bonds (σ-bonds) are formed by the end-on overlap of atomic orbitals along the bond axis. The σ-bond is cylindrically symmetrical. Some examples of singly bonded molecules are aliphatic organic molecules, ring-structured mono-sugars, and saturated fats. Double (and triple) bonds consist of a σ-bond together with one (or two) π-bonds, which arise from the sideways overlap of p-orbitals. The three p-orbitals (px, py, and pz) are directed orthogonally along the three Cartesian axes. While the σ-bond is an axial covalent bond, electrons in π-molecular orbitals reside only above and below the bond axis. π-bonded structures are planar; aromatic molecules (benzene, anthracene, phenylalanine etc.) exhibit π-bond characteristics. Metals form coordination complexes with molecules, forming tetrahedral, square planar, trigonal pyramidal and octahedral coordinated moieties.

There are four major classes of biological macromolecules in cells: nucleic acids, proteins, carbohydrates and lipids (Table 2.2). Nucleic acids are involved in the storage and transmission of genetic information. Proteins are involved in a wide range of biological and biochemical activities, structural as well as functional. Nucleic acids and proteins are linear polymers of nucleotides and amino acids, respectively. Carbohydrates are linear as well as branched-chain polymers, and they are involved in structural support, energy storage and cell-cell communication. Lipids (the constituents of biological membranes) are grouped with the macromolecules but are not polymers; they form cell membranes and are involved in energy storage.
Table 2.2 Biological Macromolecules and their Functions

Macromolecule(s) | Function                                           | Examples
Nucleic Acids    | Storage and transmission of genetic information    | DNA; mRNA; tRNA
Proteins         | Structural and biochemical functions               | Globular proteins (hemoglobin); fibrous proteins (fibrin; silk)
Carbohydrates    | Structural support; energy storage                 | Cellulose; starch; glucose
Lipids           | Cell membranes; energy storage                     | Cell membranes; fats; cholesterol

EXERCISE MODULES
1. Why does classical physics fail to explain the stability of the atom?
2. What are the essential features of quantum physics?
3. What are orbitals and quantum numbers?
4. Explain the various kinds of bonds.
5. Draw formulas of some aliphatic molecules (linear and ring-structured), aromatic molecules and metal coordination complexes (Help: consult any organic chemistry textbook).


BIBLIOGRAPHY
1. Atkins, P.W. (1998), Oxford University Press: Oxford. “Physical Chemistry”, 6th Edn.
2. Hallett, F.R., et al. (1982), Methuen Pubs: Toronto. “Physics for the Biological Sciences: A Topical Approach to Biophysical Concepts”.
3. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
4. Pauling, L. (1960), Cornell University Press: New York. “Nature of the Chemical Bond”.
5. Rae, A.I.M. (1981), McGraw Hill: London. “Quantum Mechanics”.
6. Tinoco, I. Jr., et al. (1985), Prentice-Hall: Englewood Cliffs/NJ. “Physical Chemistry: Principles and Applications in Biological Sciences”, 2nd Edn.

3 Features of Nucleic Acids Nucleic acids, proteins, carbohydrates, and lipids and membranes are the major biological macromolecules of importance in the study of cell structure and function. However, from the standpoint of bioinformatics, nucleic acids, proteins and their complexes are the most important biological macromolecules. Therefore, an understanding of their structural features, vis-à-vis the structural characteristics of their constituents, is necessary to understand their functional characteristics. Nucleic acids (DNAs and RNAs, except tRNAs) are long thread-like macromolecules that play central roles in all the hereditary processes: storage of the genetic code, replication, transcription and translation into proteins. Coded in the nucleic acid (DNA/RNA) is biological (chemical) information that can be stored, replicated, transcribed and translated in the course of protein synthesis.

3.1 CONSTITUENTS OF NUCLEIC ACIDS Generally, the basic unit of a nucleic acid is a nucleotide (it is a dinucleotide in the left-handed Z-form of nucleic acids). A nucleotide comprises (i) a purine or pyrimidine base, (ii) a ribose or deoxyribose sugar and (iii) a phosphate group (Fig. 3.1).

3.1.1 Nucleic Acid Bases Nucleic acid bases are hetero-atom ring compounds (containing N and O atoms), and there are two classes of nucleic acid bases: (i) purines (R), namely adenine (A) and guanine (G), and (ii) pyrimidines (Y), namely thymine (T), uracil (U) and cytosine (C). The bases are planar, exhibiting aromatic nature with polar characteristics (due to nitrogen and oxygen moieties), and they exist in amino (NH2) and keto (C=O) tautomeric forms. Only bases in these “proper” tautomeric forms can form the correct hydrogen-bonded base-pairing (Watson-Crick base-pairing) patterns, which are essential for the storage and transmission of genetic information. In all naturally occurring nucleic acids, a nucleic acid base is covalently linked to a sugar moiety by the β-glycosyl bond (C1’–N bond), formed between the anomeric carbon atom (C1’) of the ribose sugar and the N9 of a purine or the N1 of a pyrimidine base. A nucleic acid base with a sugar unit is referred to as a nucleoside. The conformation of the base with respect to the sugar moiety is represented by the torsion angle (χ) around the glycosyl bond (C1’–N).


Base + Pentose sugar = Nucleoside
Base + Pentose sugar + Phosphate group = Nucleotide

Fig. 3.1 Constituents of a Nucleotide

3.1.2 The Sugars The sugars in nucleic acids are five-membered furanose rings: D-ribose in RNAs and 2-deoxy-D-ribose in DNAs (Fig. 3.2). The furanose sugar rings are puckered. Generally, the prominent sugar puckers are C2’-endo (in DNAs) and C3’-endo (in RNAs and DNA-RNA hybrids).

Fig. 3.2 Chemical Structures of Ribose and 2-Deoxyribose

3.1.3 The Phosphate Group The phosphate group is linked to the sugar at the C5'-position. A nucleoside with the attachment of a phosphate group is called a nucleotide.

3.2 POLYNUCLEOTIDES Nucleic acids are linear polymers formed by the condensation of two or more nucleotides linked via phosphodiester bonds. That is, they are polynucleotides. The formation of phosphodiester bonds in nucleic acids exhibits directionality. The conformation of the ribose-phosphate backbone of a nucleotide unit is represented by six torsion angles. In nucleic acids, all torsion angles are correlated; that is, structural changes follow a concerted motion.

In living cells, the genetic information is stored in DNA (RNA in some viruses), transcribed onto messenger RNA (mRNA) and then translated into proteins in ribosomes. The Watson-Crick hypothesis for the DNA double helix provides a rational explanation for the storage of the genetic code, replication and transcription processes.
1. DNA (B-form) is a right-handed double helix of ~ 20 Å diameter, formed by two antiparallel right-handed helical strands wound around each other.
2. The ribose-phosphate backbone, formed via 3’-5’ phosphodiester linkages, forms the periphery of the double helix structure.
3. The structure is stabilized by non-bonded interactions: H-bonding between bases of the two strands (polar) and base stacking (non-polar) (Fig. 3.3).
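The antiparallel, complementary arrangement of the two strands can be illustrated with a small Python sketch. The function names are illustrative, and the hydrogen-bond counts (two per A-T pair, three per G-C pair) are the standard Watson-Crick values rather than figures quoted in this section.

```python
# Watson-Crick partners for DNA bases
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds per base pair (A=T has two, G≡C has three)
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand):
    """Return the partner strand, written 5'->3'.

    The strands are antiparallel, so the complement is read in reverse."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand):
    """Total inter-strand hydrogen bonds when the strand is fully base-paired."""
    return sum(H_BONDS[base] for base in strand)


if __name__ == "__main__":
    top = "ATGC"
    print(top, reverse_complement(top), hydrogen_bonds(top))  # ATGC GCAT 10
```

Applying `reverse_complement` twice returns the original strand, which is one quick check that the pairing table is self-consistent.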


Fig. 3.4 Double-helical Structure of DNA (Ref: US Dept of Energy: Genomes to Life Project; http://doegenomes.to.org)

Fig. 3.5 Watson-Crick Type Base-pairing in Nucleic Acids (thymine-adenine and guanine-cytosine pairs)

flexibility and can exist in distinct tertiary structure folding (Figs. 3.6 & 3.7). The quaternary structure of nucleic acids generally refers to nucleic acid-ligand association; DNA-histone complexes and other nucleic acid-protein complexes are some examples (see protein-nucleic acid interactions in Chapter 8).

3.2.1 The Nucleic Acid Families Nucleic acids exist in several structural forms, depending upon relative humidity, the base sequence, and solvent and salt concentrations. They are classified into A-, B- and Z-families, based on their structural and conformational features. Due to subtle variations in their structural parameters, the helical arrangements in different families are macroscopically different and distinct. These structural families are interconvertible, depending on humidity and salt concentrations.

3.2.1.1 The A-Family RNAs and DNA-RNA hybrids belong to the A-family. The A-family exists in right-handed double helical form; it is structurally conservative and uniform in overall shape. The helix axis is displaced into the major groove (D = 4.5 Å) and the polynucleotide chains wrap around the helix axis like a ribbon. As a consequence, there is a deep and narrow major groove, but a shallow and wide minor groove. The A-family has both inter-strand and intra-strand base overlap.


3.2.1.2 The B-Family B-, C-, D- and T-DNA forms belong to the B-family. The B-form of double-helical DNA prevails under normal physiological conditions of low ionic strength and high relative humidity (> 90%). In the B-form the helix axis passes through the center of the basepairs (D = 0 Å). Therefore, the major and minor grooves are of equal depth, but unequal widths (Fig. 3.8). The B-form is structurally more flexible and is sensitive to sequence, base composition and environmental conditions (solvent and salt). When the relative humidity is reduced to < 75%, the B-form undergoes a reversible transformation to the A-form. The transformation is dependent on base composition.

Fig. 3.8(a) Double-helical Structure of B-DNA (CKP Model)

Fig. 3.8(b) Tertiary Structure of B-DNA (Dodecanucleotide) Unit (Ref: Drew, H.R., et al. (1981) Proc Natl Acad Sci, USA. 78; 2179. Source: Protein Data Bank; 1BNA.pdb)

3.2.1.3 The Z-Family At higher salt concentrations, poly(dG-dC) regions of the right-handed double helical B-DNA structure can transform into the left-handed Z-DNA form (zig-zag form). The structural features of the Z-form differ greatly from those of the B-form. These are:
(i) The double helix is left-handed.
(ii) The repeat unit is a dinucleotide (dG + dC).
(iii) The glycosyl bond conformation and the sugar pucker are different.
(iv) There is no major groove.
(v) There is no proper base stacking.


Fig. 3.8(c) Space-filling Model of B-DNA Repeat Unit

3.3 THE GENOME PROJECTS A gene is a segment of DNA that contributes to phenotype or function. Genes that do not appear to encode protein products (pseudogenes) can be characterized by sequence, transcription, or homology to another gene. The primary objectives of genome projects, pursued through gene amplification and gene sequencing, are twofold: (1) to make a series of descriptive diagrams (gene maps) of each chromosome (of humans and other organisms) at increasingly finer resolution, and (2) to achieve high-resolution gene mapping, production of large quantities of proteins, and inference of amino acid sequences from the gene sequence data (see chapter 9 for further details).

3.3.1 Gene Expression A gene is a specific sequence of nucleotides that carries the information required for protein synthesis. Gene expression, the process of transmitting genetic information for protein synthesis, is carried out in three distinct stages: (1) replication, (2) transcription and (3) translation. A host of regulatory features and factors is involved in realizing these processes.

3.3.1.1 Replication DNA replication is based upon the complementarity of the genetic (chemical) information stored in base-pairing. During replication the DNA double helix is unwound by helicases (Fig. 3.9), with each single strand becoming a template for the synthesis of a new, complementary strand. The replication is semi-conservative; each daughter molecule consists of one parent strand and one newly synthesized strand (Fig. 3.10).

Fig. 3.9 Structure of Rep Helicase + Single-stranded DNA Complex (Ref: Korolev, S., et al. (1997), Cell, 90; 635) (Source: Protein Data Bank: 1UAA.pdb)

Fig. 3.10 Semi-conservative Nature of DNA Replication (parent DNA and daughter DNA molecules; light = newly synthesized)

The replication process occurs at the replication forks, regions where DNA is unwound, exposing single strands that act as templates (starting at specific DNA sequences called origins of replication) (Fig. 3.11). RNA primes the DNA synthesis, that is, it initiates the DNA synthesis process (Fig. 3.12). An RNA polymerase (called primase) synthesizes a short stretch of RNA (~ 10 bases) that is complementary to one of the DNA template strands. This RNA chain serves as the primer for the synthesis of the new DNA molecule, and chain elongation is catalyzed by DNA polymerases (Fig. 3.13). DNA polymerases are template-directed. They catalyze the formation of a phosphodiester bond only if the base on the incoming nucleotide is complementary to the base on the template strand. DNA polymerase III (DNA pol III) catalyzes DNA synthesis in the 5’ → 3’ direction (Fig. 3.14). The polymerization process is bi-directional. The leading strand is polymerized continuously, and the lagging strand discontinuously (as Okazaki fragments). As a result, while one RNA primer is required for the leading strand, each Okazaki fragment requires its own RNA primer. The RNA portion of the RNA-DNA hybrid is hydrolyzed by DNA pol I (Fig. 3.15). DNA ligase joins the Okazaki fragments. The processes occurring at the replication fork and the various enzyme complexes involved are schematically represented in Fig. 3.16.
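The semi-conservative rule can be followed over several rounds of replication with a minimal Python sketch. Strands are tracked only by a label ("original" or "new"), not by sequence, and the function names are illustrative.

```python
def replicate(duplexes):
    """One round of semi-conservative replication.

    Each duplex is a pair of strand labels. Both strands act as templates,
    so every duplex yields two daughters, each keeping one template strand
    and gaining one newly synthesized strand."""
    daughters = []
    for strand_a, strand_b in duplexes:
        daughters.append((strand_a, "new"))
        daughters.append((strand_b, "new"))
    return daughters


def count_hybrids(generations):
    """Return (duplexes still holding an original strand, total duplexes)."""
    duplexes = [("original", "original")]
    for _ in range(generations):
        duplexes = replicate(duplexes)
    hybrids = sum(1 for duplex in duplexes if "original" in duplex)
    return hybrids, len(duplexes)


if __name__ == "__main__":
    for g in range(1, 4):
        print(g, count_hybrids(g))  # (2, 2), then (2, 4), then (2, 8)
```

However many rounds are run, exactly two duplexes retain a strand of the starting molecule, while the total doubles each generation.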


Fig. 3.11 DNA Synthesis at the Replication Fork

Fig. 3.12 Initiation of DNA Synthesis by an RNA Primer

Fig. 3.13 Chain-elongation Reaction catalyzed by DNA Polymerases

3.3.1.2 Transcription The templates for protein synthesis are RNAs:

DNA → (Transcription) → RNA → (Translation) → Protein


synthesis is catalyzed by RNA polymerases (RNA pol). Whereas prokaryotes have one RNA polymerase, eukaryotes have three (protein-encoding genes are transcribed by RNA polymerase II in eukaryotes). Transcription is initiated in the promoter region by a complex of different factors. The process is similar to DNA replication. Transcription involves three distinct stages: (i) initiation, (ii) elongation and (iii) termination.
(i) The RNA polymerase binds at the DNA promoter site, unwinds the DNA double helix, and initiates the synthesis of a transcript. The promoter sequences are not transcribed.
(ii) RNA pol moves along the DNA, maintaining the transcription “bubble” to expose the DNA template strand, and catalyzes the elongation of the transcript at its 3’ end.
(iii) Formation of a hairpin loop in the nascent RNA transcript results in the release of the RNA strand and RNA pol from the DNA template.
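The elongation stage amounts to copying the template strand base by base into RNA. A minimal Python sketch, assuming the template is supplied in the 3'→5' direction so that the transcript comes out 5'→3' (the function name and the example sequence are illustrative):

```python
# RNA base added opposite each DNA template base (U pairs with A in RNA)
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') made from a DNA template strand given 3'->5'.

    Promoter recognition and hairpin-driven termination are not modelled;
    the whole template is copied."""
    return "".join(RNA_PARTNER[base] for base in template_3_to_5)


if __name__ == "__main__":
    print(transcribe("TACGGT"))  # AUGCCA
```

The transcript has the same sequence as the non-template (coding) DNA strand, with U in place of T.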

3.3.1.3 Translation Translation is the unidirectional process that takes place on the ribosomes, whereby the genetic information present in an mRNA is converted into a corresponding sequence of amino acids in a protein. After transcription, the single-stranded mRNA is moved from the nucleus to the cytoplasm, to the ribosome, the protein synthesis apparatus. Activated tRNA molecules (aminoacyl-tRNAs) carrying specific amino acids are also brought to the ribosomes. The sequence of amino acids in a protein is determined by the sequence of codons (contiguous nucleotide triplets) in the mRNA as ‘read’ by the anticodons of tRNAs. The anticodon of a tRNA binds to a specific codon in the mRNA by complementary base-pairing (base-specific codon-anticodon hydrogen bonding).

Translation also follows initiation, elongation and termination steps. There are two tRNA-binding sites in the ribosome: the A-site (for entry of aminoacyl-tRNA) and the P-site (for peptidyl-tRNA, carrying the growing polypeptide). Initiation results in the binding of the initiator tRNA to the start signal of the mRNA. In bacteria, the first amino acid is always N-formylmethionine (fMet) and the initiation codon is preceded by the Shine-Dalgarno sequence. In eukaryotes, the first amino acid is methionine (Met) and the mRNA carries a 5’-cap. The initiator tRNA occupies the peptidyl (P) site on the ribosome.

Elongation consists of three steps: (i) binding of aminoacyl-tRNA (codon recognition), (ii) peptide bond formation and (iii) translocation (Fig. 3.17). Elongation starts with the binding of an aminoacyl-tRNA to the A-site (aminoacyl site). A peptide bond is then formed between the amino acids carried by the fMet-tRNA and the aminoacyl-tRNA. The resulting dipeptidyl-tRNA is then translocated from the A-site to the P-site, while the other (uncharged) tRNA molecule leaves the P-site. The mRNA moves a distance of three nucleotides and a new aminoacyl-tRNA binds to the empty A-site to start another round of elongation.
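Reading an mRNA three nucleotides at a time can be sketched in Python as follows. The 64-letter string is the standard genetic code written in the conventional U, C, A, G ordering. Starting at the first AUG is a simplification assumed here: the sketch ignores the Shine-Dalgarno sequence, the 5’-cap, and the tRNA and ribosome machinery.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG; '*' = stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA (5'->3') into one-letter amino acid codes.

    Reading starts at the first AUG and stops at the first stop codon
    (or when fewer than three nucleotides remain)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGGCCAAAUAAGG"))  # MAK
```

In the example the reading frame is AUG-GCC-AAA-UAA, so the product is Met-Ala-Lys and translation ends at the UAA stop codon.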
The elongation process is assisted by various elongation proteins (elongation factors) (Fig. 3.18). Encountering a “stop” codon, which is recognized by a protein release factor, leads to termination of the translation process and release of the polypeptide from the ribosome.

3.3.2 Gene Amplification Gene amplification is the selective increase in the number of copies of a specific gene coding for a specific protein, without a proportional increase in other genes. Gene amplification is necessary for obtaining sufficient quantities of a desired gene or gene fragment for further analysis, and for production of a desired protein in large quantities for sequencing (see chapter 6) and other analyses. Cloning and the polymerase chain reaction (PCR) are two molecular biological techniques in use for gene amplification.


3.3.2.1 Gene Cloning Gene cloning involves the use of recombinant DNA technology to propagate DNA fragments, isolated from chromosomes using restriction enzymes, inside a foreign host. Following introduction into suitable host cells, the DNA fragments can then be reproduced along with the host cell DNA. Cloning procedures are routinely employed to produce unlimited material for experimental study.

3.3.2.2 Polymerase Chain Reaction (PCR) Polymerase chain reaction (PCR) is a very versatile in vitro gene amplification method that has brought tremendous progress in molecular biology and genetics. PCR can amplify a desired DNA sequence of any origin hundreds of millions of times in hours. In gene amplification by PCR methods, a desired cDNA clone can be synthesized using mRNA as a template. Suitable primers are used to hybridize to the corresponding sequences, and they are extended in a chain synthesis reaction by DNA polymerases, using the inserted sequence as the template. The PCR mixture contains the four types of deoxynucleotides and two primers (~ 20 bases long). The mixture is (i) heated to denature thermally the double-stranded target molecule and separate its strands, (ii) cooled (annealing) to allow the primers to bind to their complementary sequences on the separated strands, and (iii) incubated to allow the polymerase to extend the primers into new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially, since each new double strand separates to become two templates for further synthesis. The reaction is efficient, specific, and extremely sensitive. The nucleotide that the polymerase attaches is complementary to the base in the corresponding position on the template strand (e.g. if the template base is C, the polymerase attaches G). The polymerase chain reaction proceeds with two primers, bound to the opposite strands of the gene target, with their 3’-ends pointing towards each other. When the reaction is used for sequencing (see chapter 6), chain extension is terminated by the incorporation of dideoxynucleotides; the result is a series of fragments of different lengths for each primer.
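The exponential growth of the target over repeated cycles can be made concrete with a small Python sketch. The per-cycle `efficiency` parameter (1.0 meaning perfect doubling) and the function names are assumptions introduced for illustration; the text itself only states that the target multiplies exponentially.

```python
import math


def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after a number of cycles; each cycle multiplies by (1 + efficiency)."""
    return initial_copies * (1 + efficiency) ** cycles


def cycles_needed(target_copies, initial_copies=1, efficiency=1.0):
    """Smallest whole number of cycles that reaches at least target_copies."""
    ratio = target_copies / initial_copies
    return math.ceil(math.log(ratio) / math.log(1 + efficiency))


if __name__ == "__main__":
    print(pcr_copies(1, 30))    # 1073741824.0, about a billion copies
    print(cycles_needed(1e8))   # 27
```

With ideal doubling, a single template reaches "hundreds of millions" of copies in under 30 cycles; a lower assumed efficiency simply increases the number of cycles needed.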

3.3.3 Gene Separation Cutting genomic DNA at specific sites with suitable restriction enzymes generates DNA fragments. The fragments are amplified either by cloning or by polymerase chain reaction (PCR) methods. Electrophoresis techniques are used to separate the fragments. Small-diameter capillary array gel electrophoresis permits the application of high electric fields, thus providing significantly faster separation than traditional slab gels (chapter 5). While conventional electrophoresis can separate fragments of < 40 kilobases, pulsed-field gel electrophoresis (PFGE) techniques have improved the separation of larger fragments (~ 10 megabases). This technique employs multiple electrodes, placed orthogonally with respect to the gel, and short pulses of alternating current are passed through the gel.
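Cutting at a recognition site and ordering the fragments by size can be sketched in Python. EcoRI (recognition site GAATTC, cut after the first G) is used purely as a familiar example and is not named in the text. Only one strand is considered, and sorting by length stands in for the gel, where shorter fragments migrate faster.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for the example enzyme, which cuts G^AATTC)."""
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        fragments.append(dna[start:position + cut_offset])
        start = position + cut_offset
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments


def gel_order(fragments):
    """Fragments ordered as they would run on a gel: shortest first."""
    return sorted(fragments, key=len)


if __name__ == "__main__":
    pieces = digest("AAGAATTCGGGGAATTCT")
    print(pieces)                               # ['AAG', 'AATTCGGGG', 'AATTCT']
    print([len(p) for p in gel_order(pieces)])  # [3, 6, 9]
```

Joining the fragments in their original order reproduces the input sequence, which is a convenient sanity check on the cut positions.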

3.3.4 Gene Sequencing Genome sequences are assembled from DNA sequence fragments of approximately 500 basepairs in length. Conventional (1st generation) gene sequencing employed the Maxam-Gilbert and Sanger methods. The Maxam-Gilbert method uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. The Sanger sequencing method (dideoxy method) uses an enzymatic procedure to synthesize DNA chains that terminate at positions occupied by one of the four bases, and then determines the resulting fragment lengths (see chapter 6). The multiplex sequencing procedure makes it possible to analyze ~ 40 clones on a single DNA-sequencing gel. Developments in gene sequencing techniques (2nd and 3rd generation) include ultra-thin electrophoresis, resonance ionization spectroscopy to detect suitable isotope labels, laser-induced fluorescence, gel-less flow cytofluorimetry, scanning-tunneling or atomic force microscopy, and mass spectrometry (see chapter 6).
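The logic of reading a sequence from fragment lengths in the dideoxy method can be sketched in Python. This is an idealized model under stated assumptions: one reaction ("lane") per base, every possible termination product present, and exact length resolution. The function names are illustrative.

```python
def dideoxy_fragments(new_strand):
    """Fragment lengths produced in each of the four termination reactions.

    In the reaction containing a given dideoxynucleotide, synthesis stops
    wherever that base is incorporated, so the lane holds one fragment
    length per occurrence of the base."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Recover the sequence by reading bands from shortest to longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


if __name__ == "__main__":
    lanes = dideoxy_fragments("GATTACA")
    print(lanes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
    print(read_gel(lanes))  # GATTACA
```

Each fragment length appears in exactly one lane, so sorting all bands by length and noting the lane of each band spells out the synthesized strand.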

3.3.5 Genetic Mapping A genome map describes the relative positions of genes and other markers, and the spacing between them, on each chromosome. Mapping involves (i) dividing the chromosomes into smaller fragments with restriction enzymes, and (ii) mapping the fragments to their respective locations on the chromosomes. Low-resolution maps are genetic linkage maps, which depict the relative chromosomal locations of DNA markers along the chromosome. Physical maps describe the chemical characteristics of the DNA molecule itself. Physical maps can be low-resolution or high-resolution maps. Low-resolution chromosomal maps are based on the banding patterns (light and dark bands reflecting regional variations in the amounts of A-T versus G-C) observed in light microscopy of stained chromosomes. High-resolution physical maps provide the complete basepair sequence of each chromosome in the genome. Determination of the basepair sequences of genes (high-resolution physical mapping) is necessary for inferring the amino acid sequences (primary structure) of the corresponding proteins.

EXERCISE MODULES
1. Build chemical structures of nucleic acid bases.
2. Build Watson-Crick basepairs (A = T; G ≡ C).
3. Build a nucleoside; rotate around the glycosyl (C1'–N) bond and observe the orientation of the base with respect to the sugar moiety.
4. Build a nucleotide; rotate around the C5'–O5' bond and observe the orientation of the phosphate group with respect to the sugar moiety.
5. What are the structural features of nucleic acids?
6. Describe the essential features of the Watson-Crick model of the DNA double helix.
7. Build a few turns of DNA double helix.
8. What are the primary, secondary and tertiary structural features of nucleic acids?
9. Describe various forms of nucleic acids.
10. What are the objectives of the genome project?
11. What is gene expression?
12. What is replication?
13. Why is replication semi-conservative?
14. What is a replication fork?
15. What is the role of a primer?
16. What are the functions of DNA pol I and DNA pol III?
17. What is transcription and what are its salient features?
18. What is translation and where does it occur?
19. What are the main features of translation?
20. What is the role of tRNAs in translation?
21. What is the relevance of genome projects?
22. What is gene amplification and why is it necessary?
23. What is gene cloning and how is it done?
24. What is polymerase chain reaction (PCR)?
25. What are the gene separation methods?
26. What are gene-sequencing methods?
27. What is genetic mapping?

BIBLIOGRAPHY
1. Baltimore, D. & Berg, A.A. (1995), Nature, 373; 287. “DNA-binding proteins”.
2. Berman, H.M., et al. (2000), Nucleic Acids Res., 28; 235. “Protein Data Bank”.
3. Blackburn, G.M. & Gait, M.J. (Eds) (1990), Oxford University Press: Oxford. “Nucleic Acids in Chemistry and Biology”.
4. Calladine, C.R. & Drew, H.R. (1992), Academic Press: New York. “Understanding DNA”.
5. Conn, G.L. & Draper, D.E. (1998), Curr Opin Struct Biol., 8(3); 278. “RNA structure”.
6. Darnell, J.E. Jr. (1985), Sci Amer., 253(4); 68. “RNA”.
7. Dickerson, R.E., et al. (1982), Science, 216; 475. “The anatomy of A-, B- and Z-DNA”.
8. Dickerson, R.E. (1983), Sci Amer., 249(6); 86. “DNA helix and how it is read”.
9. Felsenfeld, G. (1985), Sci Amer., 253(4); 58. “DNA”.
10. Innis, M., et al. (1990), Academic Press: San Diego, CA. “PCR Protocols: A Guide to Methods and Applications”.
11. Johnson, P.F. & McKnight, S.L. (1989), Annu Rev Biochem., 58; 799. “Eukaryotic transcriptional regulatory proteins”.
12. Kornberg, A. & Baker, T.A. (1992), W.H. Freeman: New York. “DNA Replication”.
13. Lodish, H., et al. (1995), Sci Amer Books: New York. “Molecular Cell Biology”, 3rd Edn.
14. Narayanan, P. (2001), Bhalani Pubs: Mumbai. “Clinical Biophysics: Principles and Techniques”.
15. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
16. Ptashne, M. (1988), Nature, 335; 683. “How eukaryotic transcriptional activators work”.
17. Saenger, W. (1984), Springer-Verlag: Berlin. “Principles of Nucleic Acid Structure”.
18. Tjian, R. (1995), Sci Amer., 272(3); 7. “Molecular machines that control genes”.
19. Walker, J.M. & Gaastra, W. (Eds) (1983), Croom Helm: London. “Techniques in Molecular Biology”.
20. Watson, J.D. & Crick, F.H.C. (1953), Nature, 171; 737. “The molecular structure of nucleic acids”.
21. Watson, J.D. & Crick, F.H.C. (1953), Nature, 171; 964. “Genetic implications of the structure of deoxyribonucleic acid”.
22. Watson, J.D., et al. (1988), Benjamin-Cummings: New York. “Molecular Biology of the Gene”.
23. ———— (1997), Human Genome Project, US Dept of Energy: Washington, DC. “A Primer on Molecular Genetics”.


4 Features of Proteins Proteins are important constituents of all biological systems. All biologically relevant proteins are linear (except for cystines) polypeptides, constituted from a repertoire of twenty L-amino acids. A polypeptide (protein) consists of repeating peptide (main-chain) units with different side-chain residues (R-groups). Physicochemical characteristics of side-chains and tertiary folding of the polypeptide backbone make each protein unique, structurally and functionally. Protein structures are organized under four structural categories. Non-bonded interactions (ionic, hydrogen bonds and van der Waals) stabilize the secondary, super-secondary, tertiary and quaternary structures of macromolecules.

4.1 AMINO ACIDS There are twenty L-amino acids that are the basic structural units of all naturally occurring proteins and enzymes. They all have (except proline, which is an imino acid) an amino group (NH3+), a carboxylate group (COO–), a hydrogen atom and a substituent group, R, called a side chain, all covalently bonded to the central tetrahedral α-carbon atom (Fig. 4.1). All amino acids (except glycine) have four different substituents at the α-carbon and therefore exhibit chirality (L- or D-form). Amino acids differ from each other structurally and functionally owing to the structure and the chemical nature of their side chains.

Fig. 4.1 Chemical Formula (Representation) of L-Amino Acids

4.1.1 Characteristics of Amino Acids The ionization states of amino acids are pH-dependent (Fig. 4.2). They exist in the zwitterionic state under physiological pH (~7) conditions. Amino acids can be characterized by the shape, size, and chemical nature of their side-chains (Fig. 4.3 and Table 4.1). Whereas nucleic acids have only four constituent bases, falling into two subclasses with similar structural and physicochemical characteristics, amino acids differ greatly in the size, shape, and chemical characteristics of their side chains. Based on the chemical nature of the side-chains under physiological conditions, there are:
(i) Five amino acids with charged R-groups: two acidic amino acids, aspartic acid and glutamic acid, and three basic amino acids, arginine, lysine and histidine.


Fig. 4.3 Chemical Structures of L-Amino Acids found in Proteins

Table 4.1 Amino Acids with their Side-chains (–R) and their Characteristics

Amino acid    | Abbreviation | Side-chain (–R group)      | pKa α-NH3+ | pKa α-COO– | pKa R-group | Characteristics
Alanine       | Ala (A)      | –CH3                       | 9.8        | 2.4        | -           | Small; Hydrophobic/Ambivalent
Arginine      | Arg (R)      | –CH2–(CH2)2–NH–C–(NH2)2+   | 9.0        | 1.8        | 12.5        | Large; Hydrophilic (Basic)
Asparagine    | Asn (N)      | –CH2–(C=O)–NH2             | 8.7        | 2.1        | -           | Small; Hydrophilic (Neutral)
Aspartic Acid | Asp (D)      | –CH2–COO–                  | 9.9        | 2.0        | 3.9         | Small; Hydrophilic (Acidic)
Cysteine      | Cys (C)      | –CH2–SH                    | 10.7       | 1.9        | 8.4         | Medium; Hydrophilic/Ambivalent
Glutamine     | Gln (Q)      | –(CH2)2–(C=O)–NH2          | 9.1        | 2.2        | -           | Small; Hydrophilic (Neutral)
Glutamic Acid | Glu (E)      | –CH2–CH2–COO–              | 9.4        | 2.1        | 4.1         | Small; Hydrophilic (Acidic)
Glycine       | Gly (G)      | –H                         | 9.8        | 2.4        | -           | Small; Hydrophobic/Ambivalent
Histidine     | His (H)      | –CH2–imidazole ring        | 9.3        | 1.8        | 6.0         | Large; Hydrophilic (Basic)
Isoleucine    | Ile (I)      | –CH(CH3)–CH2–CH3           | 9.8        | 2.3        | -           | Medium; Hydrophobic (branched)
Leucine       | Leu (L)      | –CH2–CH–(CH3)2             | 9.7        | 2.3        | -           | Medium; Hydrophobic (linear)
Lysine        | Lys (K)      | –CH2–(CH2)3–NH3+           | 9.1        | 2.2        | 10.6        | Large; Hydrophilic (Basic)
Methionine    | Met (M)      | –CH2–CH2–S–CH3             | 9.3        | 2.2        | -           | Medium; Hydrophobic/Ambivalent
Phenylalanine | Phe (F)      | –CH2–phenyl ring           | 9.3        | 2.1        | -           | Large; Hydrophobic (aromatic)
Proline       | Pro (P)      | Imino ring                 | 10.6       | 2.0        | -           | Medium; Hydrophobic/Ambivalent
Serine        | Ser (S)      | –CH2–OH                    | 9.2        | 2.2        | ~ 13        | Small; Hydrophilic/Ambivalent
Threonine     | Thr (T)      | –CH(OH)–CH3                | 9.1        | 2.1        | ~ 13        | Small; Hydrophilic/Ambivalent
Tryptophan    | Trp (W)      | –CH2–indole ring           | 9.4        | 2.4        | -           | Large; Hydrophobic/Ambivalent
Tyrosine      | Tyr (Y)      | –CH2–phenolic ring         | 9.2        | 2.2        | 10.4        | Large; Hydrophobic/Ambivalent
Valine        | Val (V)      | –CH–(CH3)2                 | 9.7        | 2.3        | -           | Medium; Hydrophobic


(ii) Five polar (neutral/ambivalent) amino acids: asparagine, glutamine, cysteine, serine and threonine.
(iii) Ten non-polar (hydrophobic/ambivalent) amino acids: alanine, glycine, phenylalanine, isoleucine, leucine, methionine, proline, tryptophan, tyrosine (uncharged) and valine.
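The three-way grouping in items (i) to (iii) can be turned into a small lookup for tallying the side-chain classes in a sequence. The sketch below is only an illustration: the class names, the `SIDE_CHAIN_CLASSES` table and the function names are assumptions introduced here, and the membership of each class simply follows the lists given in the text.

```python
from collections import Counter

# Side-chain classes as listed in items (i)-(iii), using one-letter codes.
# The dictionary and its keys are illustrative names, not from the text.
SIDE_CHAIN_CLASSES = {
    "acidic": set("DE"),            # aspartic acid, glutamic acid
    "basic": set("RKH"),            # arginine, lysine, histidine
    "polar": set("NQCST"),          # asparagine, glutamine, cysteine, serine, threonine
    "nonpolar": set("AGFILMPWYV"),  # the ten hydrophobic/ambivalent residues
}


def classify(residue):
    """Return the side-chain class of a one-letter amino acid code."""
    for name, members in SIDE_CHAIN_CLASSES.items():
        if residue in members:
            return name
    raise ValueError(f"not one of the 20 standard amino acids: {residue!r}")


def class_composition(sequence):
    """Count how many residues of each class occur in a sequence."""
    return Counter(classify(r) for r in sequence.upper())


if __name__ == "__main__":
    # A made-up hexapeptide, written N-terminus to C-terminus.
    print(class_composition("MKDEST"))
```

A tally of this kind is only a first, coarse description of a sequence; the pKa values in Table 4.1 would be needed for anything pH-dependent.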

4.2 PROTEINS
As stated, naturally occurring peptides and proteins are linear polymers, synthesized from a library of twenty L-amino acids, covalently linked through peptide (C–N) bonds formed by condensation (elimination of water) between the α-amino and α-carboxyl groups of successive amino acids (Fig. 4.4). A polypeptide consists of repeating peptide backbone (main-chain) units with side-chains (R1, R2, R3, ...). The repeating unit along the polypeptide backbone is (HN–CαH–CO). Typically, a peptide consists of fewer than 50 amino acids, while a protein has more than 50.

Fig. 4.4 Formation of Polypeptides from Amino Acids via Peptide (C–N) Bonds

Operationally, the structural and functional features of proteins and protein complexes are addressed at four levels of hierarchical structural organization: (1) primary structure, (2) secondary structure, (3) tertiary structure and (4) quaternary structure.


4.2.1 The Primary Structure (1°-Structure)
The number and linear order of the amino acids present in a peptide or protein constitute the primary structure (1°-structure). By convention, the N-terminal end (the residue with a free α-amino group) is written on the left and is numbered 1, and the C-terminal end (the residue with a free α-carboxyl group) is written on the right. The single-letter nomenclature of amino acids is used in amino acid sequence data (Fig. 4.5). Determination of the primary structure of a protein (see Chapter 6) is essential to understand the mechanisms of biochemical reactions, to trace evolutionary paths and to carry out computational structure prediction from sequence homologies with related proteins. The primary structure of a protein is generally a prerequisite for three-dimensional structure determination by physical techniques.

Fig. 4.5 Amino Acid Sequence of a Disulfide-containing Protein

Amino acid sequence data of proteins can be inferred from the base sequence of the corresponding nucleic acids. With the development of highly efficient gene sequencing methods, gene sequencing is faster and easier than protein sequencing, so this alternative is followed wherever it is feasible. However, there are certain ambiguities and limitations in inferring the amino acid sequence from gene sequencing:
(i) Degeneracy of codons (more than one codon coding for the same amino acid) leads to ambiguities.
(ii) The genetic code is not universal.
(iii) Deletion and insertion of nucleotide(s) can lead to an erroneous reading frame for the amino acids.
(iv) Post-translational modifications and disulfide bonds in proteins can be determined only by direct protein sequencing.
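Points (i) and (iii) above can be made concrete with a short translation sketch. It assumes the standard genetic code (which, as point (ii) notes, does not hold for every organism or organelle) and a coding-strand DNA sequence read from its first base; the function names and the example sequences are made up for illustration.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# Assumption: the standard code applies; some genomes use variant codes.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding-strand DNA sequence in frame 1 ('*' marks a stop)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codons_for(amino_acid):
    """All codons that encode one amino acid (illustrates degeneracy)."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


if __name__ == "__main__":
    gene = "ATGGCCAAATAA"                 # made-up example sequence
    print(translate(gene))                # in-frame reading
    print(translate(gene[:2] + gene[3:])) # one base deleted: the frame shifts
    print(codons_for("L"))                # several codons, one amino acid
```

Deleting a single base changes every downstream codon, which is why insertions and deletions in the nucleotide sequence (real or introduced by sequencing errors) are so damaging to an inferred protein sequence.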

4.2.2 The Secondary Structure (2°-Structure)
Ordered local segments (helices and sheets), reverse turns and loops, and local hydrogen bonding of the polypeptide backbone constitute the secondary structure of proteins. The secondary structure elements are the building blocks of the folding units in globular proteins. The secondary structure of a polypeptide (protein) is determined largely by local sequence information. According to the Pauling hypothesis on the peptide unit:
(i) The peptide bond (C–N bond) is rigid, and the amide group (O=C–NH) is planar as a result (Fig. 4.6).
(ii) There are only two degrees of rotational freedom per peptide unit, namely rotations about the single bonds N–Cα (φ) and Cα–C (ψ). These conformational angles are referred to as Ramachandran angles.
(iii) Hydrogen bonds play a crucial role in stabilizing the polypeptide chain conformation.
(iv) The trans configuration is preferred across the peptide bond.
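The conformational angles in point (ii) are torsion (dihedral) angles defined by four consecutive backbone atoms: φ by C(i-1), N(i), Cα(i), C(i) and ψ by N(i), Cα(i), C(i), N(i+1). A minimal sketch of how such an angle might be computed from Cartesian coordinates is given below. The function name and the test points are illustrative assumptions, and the result is returned in the –180° to +180° range rather than the 0° to 360° range.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 bond for four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Idealised geometry: the last atom is rotated a quarter turn out of plane.
    print(dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

With backbone coordinates from a structure file, φ for residue i would be `dihedral(C_prev, N, CA, C)` and ψ would be `dihedral(N, CA, C, N_next)`, where those four names stand for the corresponding atom positions.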

Angles around C and N atoms are ~ 120°; O = C—NH group is planar Color code: C = black; O = Red; N = Blue; H = White Fig. 4.6 Stereo-chemical Features of the Peptide Unit

Pauling postulated two ordered structures that should occur in polypeptides, namely the α-helix and the β-sheet.

4.2.2.1 α-Helix
The α-helix (3.6₁₃-helix) is a common secondary structure encountered in proteins of the globular class. It is a right-handed, rod-like helical segment stabilized by intra-molecular hydrogen bonds, parallel to the helix axis, between the NH and C=O groups of peptide units spaced four residues apart (Fig. 4.7(a)). Many proteins have a predominance of α-helices, such as myoglobin, melittin, and cytochrome C' (Fig. 4.7(b)). Other types of helices (3₁₀-helix, π-helix) (Table 4.2) also occur in proteins, as does the L-polyproline type helix, which is the conformation of collagen monomers.

Table 4.2 Structural Parameters of Ordered Segments in Polypeptides

Type | φ (°) (Cα–N) | ψ (°) (Cα–C) | n | d (Å) | P (Å) | Atoms in the loop
2₇-ribbon | 105 (–75) | 240 (60) | 2.0 | 2.8 | 5.6 | 7
3₁₀-helix | 130 (–50) | 155 (–25) | 3.0 | 2.0 | 6.0 | 10
3.6₁₃-helix (αR-helix) | 122 (–58) | 133 (–47) | 3.6 | 1.5 | 5.4 | 13
4.4₁₆-helix (π-helix) | 120 (–60) | 110 (–70) | 4.4 | 0.8 | 3.5 | 16
β-sheet (↑↑) | 60 (–120) | 295 (115) | 2.0 | 3.47 | 6.95 | -
β-sheet (↑↓) | 40 (–140) | 315 (135) | 2.0 | 3.47 | 6.95 | -
L-helix (Collagen) | 105 (–75) | 330 (150) | –3.0 | 3.12 | 9.36 | -

P = pitch of the helix; d = rise per residue; n = number of residues per turn.
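The three length parameters of Table 4.2 are not independent: taking d as the rise per residue along the axis, the pitch is the rise accumulated over one turn, P = n × d. The short sketch below checks this for a few rows and estimates the axial length of a helical segment. The dictionary and function names are illustrative, and the numbers are those tabulated above.

```python
# (n, d) pairs taken from Table 4.2: residues per turn, rise per residue in angstroms.
# The dictionary keys are informal labels chosen for this sketch.
HELIX_PARAMETERS = {
    "2_7-ribbon": (2.0, 2.8),
    "3_10-helix": (3.0, 2.0),
    "alpha-helix": (3.6, 1.5),
    "pi-helix": (4.4, 0.8),
}


def pitch(n, d):
    """Pitch P (angstroms) from residues per turn n and rise per residue d."""
    return n * d


def segment_length(num_residues, d):
    """Approximate axial length (angstroms) of a regular segment."""
    return num_residues * d


if __name__ == "__main__":
    for name, (n, d) in HELIX_PARAMETERS.items():
        print(f"{name}: P = {pitch(n, d):.2f} angstroms")
    # Axial length of a hypothetical 10-residue alpha-helix.
    print(segment_length(10, 1.5))
```

The products reproduce the tabulated pitches to within rounding (for example 4.4 × 0.8 = 3.52 against the tabulated 3.5 for the π-helix).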


Thick lines indicate the Polypeptide Backbone. Cα-Carbon atoms are numbered (Substituents at the Cα-atoms are deleted for clarity) Fig. 4.7(a) Features of Right-handed (3.6₁₃) α-Helix with Intra-molecular Hydrogen Bonds along the Helix Axis

Fig. 4.7(b) Example of a Protein (Cytochrome C’) Structure with Predominance of a-Helices (Ref: Dobs, A.J., et al.(1996), Acta Cryst D, Biol Crystallogr; 52; 356) (Source: Protein Data Bank: 1CGO.pdb)


4.2.2.2 β-Sheet

The β-sheet is a pleated sheet structure and results from intermolecular hydrogen bonds, perpendicular to the strand axis, between NH and C=O groups of neighbouring strands (Fig. 4.8(a)). The polypeptide chain is extended (Table 4.5). β-Sheets occur either as parallel (↑↑) or as antiparallel (↑↓) strands. Many proteins have a predominance of β-sheet structure (Fig. 4.8(b)).

Thick lines indicate the Polypeptide Backbone. (Substituents at the Cα-Carbon atoms are deleted for clarity) Fig. 4.8 (a) Features of β-Sheet Pleated Structure, with extended Conformation (Inter-molecular Hydrogen Bonds are transverse to the Strand Axis)


Fig. 4.8 (b) Example of a Protein (Retinol-binding Protein, RBP) Structure with Predominance of β-Sheets (Ref: Zanotti, G., et al. (2001), Biochim Biophys Acta; 64; 1550) (Source: Protein Data Bank: 1IIU.pdb)

4.2.2.3 Turns and Loops

While helices and sheets are ordered segments, because their residues have repeating backbone torsion angles, φ and ψ, and their hydrogen bonding patterns are periodic, turns and loops do not exhibit such regular secondary structural features. Turns are those regions in a protein where the polypeptide backbone folds back and changes the overall direction of the polypeptide chain by nearly 180°. There are cases of protein structures with a predominance of turns and loops (Fig. 4.9).

4.2.2.4 Conformation Analysis

Conformational flexibility of a polypeptide is restricted to certain regions because of the steric hindrance of moieties. The sterically allowed and disallowed conformations of the polypeptide backbone can be determined from Ramachandran conformation plots (Ramachandran φ, ψ maps). Ramachandran analysis demonstrates that the conformations of non-glycine polypeptides are severely restricted to certain φ-ψ regions only (Fig. 4.10). Ramachandran maps do not provide information on three-dimensional protein folding, that is, information on amino acid residues that are near neighbors in space but sequentially distant in the polypeptide chain. Folding information can be obtained from "distance" plots or "diagonal" plots. A distance plot gives the distances from each amino acid residue to all other residues of a polypeptide. Inspection of distance plots helps in discerning tertiary structural features of proteins and can provide a correlation of the structural units of a protein with its DNA exonic regions, even in protein structures that do not show apparent domain structure.
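A distance plot of the kind described above is simply the matrix of pairwise distances between residues, usually taken between Cα atoms. The sketch below builds such a matrix and lists residue pairs that are far apart in sequence but close in space. The 8 Å cutoff, the minimum sequence separation and the toy coordinates are assumptions made for illustration, not values from the text.

```python
import math


def distance_matrix(coords):
    """Pairwise distances between points given as (x, y, z) tuples."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def spatial_contacts(coords, cutoff=8.0, min_separation=3):
    """Residue index pairs (i, j) at least `min_separation` apart in sequence
    whose distance is within `cutoff` (same units as the coordinates)."""
    dm = distance_matrix(coords)
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if dm[i][j] <= cutoff]


if __name__ == "__main__":
    # Four made-up C-alpha positions forming a square of side 3.8 angstroms,
    # so that the first and last residues end up next to each other.
    ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    for row in distance_matrix(ca):
        print(["%.2f" % d for d in row])
    print(spatial_contacts(ca))
```

Plotting the matrix as an image, with sequence position on both axes, gives the familiar picture in which secondary structure elements appear as patterns near the diagonal and tertiary contacts appear off the diagonal.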


Fig. 4.9 Example of a Protein (Rubredoxin Mutant G10A) Structure with Predominance of Turns and Loops (Ref: Maher, M.J., et al. (1999) Acta Cryst D, Biol Crystallogr., 55; 962) (Source: Protein Data Bank: 1B13.pdb)

Fig. 4.10 Ramachandran (φ, ψ) Map


4.2.3 The Tertiary Structure (3°-Structure) The three-dimensional structure (spatial folding) of a protein is referred to as the tertiary structure (Fig. 4.11). The tertiary structure (protein folding) of a protein is unique and specific to each protein. The structural and functional features of proteins—binding sites of ligands (protein-drug interactions), the active sites of enzymes, or the binding sites for other proteins (protein—protein associations)—depend on their tertiary structures, and therefore, knowledge of the spatial folding of any protein is prerequisite to understand its structural and biochemical functions. This knowledge can also aid in our understanding of how particular mutations or variations in the gene that encodes a particular protein lead to changes in the behavior of that protein which can result in disease or in differences in drug interactions among different individuals.

Fig. 4.11 Ribbon Diagram of the Tertiary Structure of Protein Phosphatase (Ref: Barford, D., Flint, A. and Tonks, N) (http://biop.ox.ac.uk/www/mol_of_life/pdb/phos.html)

4.2.4 The Quaternary Structure (4°-Structure)
The structural organization of single-polypeptide chain (monomeric) proteins, such as myoglobin, trypsin, insulin, is complete at the tertiary level. However, in the case of proteins containing two or more polypeptide chains (oligomeric proteins), such as hemoglobin (Fig. 4.12), cytochrome oxidase, ATPase, there exists a quaternary level of organization. The quaternary structure deals with the specific arrangement of subunits with respect to one another in the protein complex. The same non-bonded interactions that come into play in the stabilization of tertiary structure folding (e.g. hydrogen bonding and van der Waals interactions) are also responsible for quaternary interactions. Hydrophobic interactions (non-directional and entropy-driven) play a major role in the higher-order structural organization and stability of quaternary structures in macromolecular complexes.

Fig. 4.12 Ribbon Diagram of Hemoglobin, a tetrameric (α2β2-subunit) Protein (Ref: Tame, J. & Vallone, B) (Source: Protein Data Bank: 1A3N.pdb)

4.3 FORCES STABILIZING THE MOLECULAR STRUCTURE
The structure is stabilized by various non-bonded molecular interactions, such as (i) electrostatic, (ii) van der Waals, (iii) hydrogen bonding, and (iv) hydrophobic interactions (Table 4.3).


(Uij = Madelung energy; N = number of molecules or 2N ions; α = Madelung constant; q = charge; rij = distance between ions i and j; z = number of nearest neighbors of an ion; λ and ρ = empirical parameters). Electrostatic interactions (salt bridges) occur between oppositely charged R-groups, such as those of lysine, arginine, aspartic acid and glutamic acid. The majority of the amino acids found on the exterior surfaces of globular proteins (at loops and turns) contain charged or polar R-groups. Salt bridges play a structural role in allosteric cooperativity in multi-subunit proteins (e.g. hemoglobin, aspartate transcarbamylase).

4.3.2 Van der Waals Interactions
Van der Waals forces are weak non-bonded interactions that act at contact distances (> 3.0 Å) between atoms, molecules and moieties. These forces, both attractive and repulsive, arise from dipole-dipole and dipole-induced dipole interactions. These interactions play a significant role in stabilizing the correct folding of proteins, hydration and solvent structure.

4.3.2.1 Dipole/Dipole Interactions

Atoms and molecules with zero net charge (q = 0) but with charge separation (the centers of positive and negative charge do not coincide) are permanent dipoles (polar entities), and they can generate electric fields. The dipole moment, μ, is

$\vec{\mu} = q\,\vec{d}$  (4.2)

where q here denotes the magnitude of the separated charges and d the vector joining their centers. Since the energy of a charged species is affected by an electric field, the energy of an ion or dipole will be affected by the presence of neighboring ions or dipoles. The interaction energy between permanent dipoles (polar molecules, μ ≠ 0) depends on the spatial orientation of the dipoles (Fig. 4.14).

Fig. 4.14 Non-bonded Interactions between Permanent Dipoles (μ ≠ 0)

$U = -\dfrac{2\mu_1\mu_2}{D r^3}$  for prolate spheroids

$U = -\dfrac{\mu_1\mu_2}{D r^3}$  for oblate spheroids  (4.3)

The average energy of interaction for randomly distributed dipoles is given by the Keesom equation.

$U = -\dfrac{2\mu_1^2\mu_2^2}{3 D r^6}\,\dfrac{1}{kT}$  Keesom equation  (4.4)

(D = dielectric constant; r = distance between two dipoles; k = Boltzmann constant; T = absolute temperature).

4.3.2.2 Dipole/Induced-Dipole Interactions

Ions or dipoles can induce dipole character in a neutral (μ = 0), polarizable molecule (Fig. 4.15).

$\mu_{ind} = \alpha E$  (4.5)

The energy of interaction is

$U = -\dfrac{2\alpha\mu^2}{D r^6}$  (4.6)

Fig. 4.15 Schematic of Induced-dipole Interactions

4.3.2.3 Induced Dipole/Induced-Dipole Interactions

The cohesive forces between neutral and non-polar species, called dispersive forces or London interactions, are due to a transient asymmetric charge distribution around atoms that induces dipoles in neighboring atoms. Dispersive forces are always attractive. For the induced-dipole/induced-dipole interaction, the quantum mechanical treatment gives

$U = -\dfrac{3}{4}\,\dfrac{I\alpha^2}{r^6}$  London equation  (4.7)

(α = polarizability tensor; E = electric field; I = 1st ionization potential).

The complete fusion of atoms is prevented by repulsive forces that arise from the overlap of the nuclei and electron clouds of different atoms at very close range. The total energy of interaction (attractive and repulsive) is represented by the Lennard-Jones potential

$U = -\dfrac{A}{r^6} + \dfrac{B}{r^{12}}$  Lennard-Jones potential  (4.8)

The first term represents attraction and the second repulsion (A and B are empirical constants).
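The balance between attraction and repulsion can be explored numerically. The sketch below assumes the common 6-12 form of the Lennard-Jones potential, U = –A/r⁶ + B/r¹², and uses the fact that setting dU/dr = 0 gives r⁶ = 2B/A at the energy minimum, where U = –A²/(4B). The function names and the values of A and B are arbitrary illustrations, not parameters for any real pair of atoms.

```python
def lennard_jones(r, A, B):
    """6-12 potential: attractive -A/r^6 term plus repulsive B/r^12 term."""
    return -A / r**6 + B / r**12


def equilibrium_distance(A, B):
    """Separation at the energy minimum, from dU/dr = 0 (r^6 = 2B/A)."""
    return (2.0 * B / A) ** (1.0 / 6.0)


def well_depth(A, B):
    """Energy at the minimum, U = -A^2 / (4B)."""
    return -A**2 / (4.0 * B)


if __name__ == "__main__":
    A, B = 2.0, 1.0  # arbitrary constants in arbitrary units
    r_min = equilibrium_distance(A, B)
    print(r_min, lennard_jones(r_min, A, B), well_depth(A, B))
    # Energies on either side of the minimum are higher than at the minimum.
    print(lennard_jones(0.9 * r_min, A, B), lennard_jones(1.1 * r_min, A, B))
```

At separations shorter than the minimum the r⁻¹² term dominates and the energy rises steeply, which is the numerical counterpart of the statement that atoms cannot fuse.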

4.3.3 Hydrogen Bonding
The hydrogen bond is a special case of an attractive permanent-dipole (polar) interaction. A hydrogen bond is formed whenever a polar donor group containing a hydrogen atom (e.g. O–H, N–H) interacts (at a distance of 2.5–3.0 Å) with an electronegative acceptor atom, such as O, N, Cl and F. The hydrogen atom has only one electron. When that electron is used to form a covalent bond with an electronegative atom (e.g. O, N, Cl), the electron cloud is pulled towards the electronegative atom, and the nucleus is partially unshielded. Consequently, the proton can interact directly with another electronegative atom nearby. The hydrogen bond is linear and directional. Though the hydrogen bond is a weak non-bonded interaction (~ 20 kJ/mol), it plays a crucial role in determining the physicochemical properties, structural stability and function of many compounds. For example, the unique properties of water are due to hydrogen bond networks. Ordered segments in proteins and other macromolecular complexes are due to intra- and intermolecular hydrogen bonds. Storage of genetic information in nucleic acids is via base-specific hydrogen bond patterns. In polypeptides, hydrogen bonds are formed between the NH and C=O groups of the polypeptide backbone. Eleven amino acids (out of 20) can form hydrogen bonds through their side chains:
(i) H-bond donors only: Arg (guanidinium group) and Trp (indole group).
(ii) H-bond donors and acceptors: the side chains of Asn, Gln, Ser and Thr can serve both as H-bond donors and as acceptors.
(iii) pH-dependent H-bonding: the hydrogen bonding potential of the side chains of Asp, Glu, His, Lys and Tyr is pH-dependent. These groups can serve as both donors and acceptors of H-bonds over a certain pH range, and as either acceptors or donors of H-bonds (but not both) at other pH values.

4.3.4 Hydrophobic Interactions
Hydrophobic (nonpolar) interactions are weak and non-directional. The formation of nonpolar associations is one of the most important factors in macromolecular folding, stacking, and higher-order (tertiary and quaternary) structure assembly. The driving force for the formation of a hydrophobic environment is entropy (positive entropy change).

EXERCISE MODULES
1. Explain (pictorially) the difference between an L- and a D-amino acid (Hint: Apply the Fischer convention).
2. Build models of all 20 L-amino acids.
3. Identify the amino acids according to size and shape (small, medium and large; linear or branched).
4. Identify the amino acids according to charge (hydrophilic: acidic, basic and polar neutral; hydrophobic: aliphatic and aromatic).
5. What is the primary structure of proteins?
6. There is no free rotation about the peptide bond. Why?
7. Build a decapeptide from the amino acid library (exclude Cys and Pro).
8. What is the secondary structure of proteins?
9. What are the structural parameters of an α-helix? What are the features of H-bonding?
10. What are the structural parameters of a β-sheet? What are the features of H-bonding in parallel and antiparallel β-sheets?
11. What is a reverse turn and what is its importance in globular proteins?
12. Which are the preferred residues in turns?
13. What is conformation analysis?
14. Comment on Ramachandran conformational plots.
15. What is the tertiary structure of proteins?
16. What is the quaternary structure of proteins?
17. Which are the non-bonded interactions stabilizing macromolecular structures?
18. What are salt bridges?
19. Comment on van der Waals interactions.
20. What is hydrogen bonding? What is the importance of hydrogen bonding in biology?
21. What are hydrophobic interactions and how are they important?

BIBLIOGRAPHY
1. Baum, S.J. & Scaife, C.W. (1987), McMillan: New York. “Chemistry: A Life Science Approach”, 3rd Edn.
2. Berman, H.M., et al. (2000), Nucl Acid Res., 28; 235. “Protein Data Bank”.
3. Branden, C-I. & Tooze, J. (1999), Garland Pubs: Philadelphia. “Introduction to Protein Structure”, 2nd Edn.
4. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.
5. Creighton, T.E. (1993), Freeman Press: New York. “Protein structures and Molecular Properties”, 2nd Edn.
6. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park/CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.
7. Doolittle, R.F. (1985), Sci Amer., 253(4); 88. “Proteins”.
8. Klotz, I.M., et al. (1970), Annu Rev Biochem., 39; 25. “Quaternary structure of proteins”.
9. Kyte, J. (1994), Garland Pubs: New York. “Structure in Protein Chemistry”.
10. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
11. Sackhein, G. (1991), Addison-Wesley: New York. “Introduction to Chemistry for Biology Students”, 4th Edn.
12. Stryer, L. (1995), H.C. Freeman: New York. “Biochemistry”, 4th Edn.
13. Voet, D. & Voet, J.D. (1990), John Wiley: New York. “Biochemistry”.
14. Walker, J.M. & Gaastra, W. (Eds) (1983), Croom Helm: London. “Techniques in Molecular Biology”.
15. Weinberg, R.A., et al. (1985), Sci Amer., 253(4); 48. “The molecules of Life”.

Section II

Experimental Methods of Structure Elucidation (Bioinformatics-I)


5 Physicochemical Characterization of Biomolecules

Analyses of structure-function aspects of biomolecules are carried out at various levels of organization (primary, secondary, tertiary and quaternary structures), and by various methods: biochemical (physicochemical), molecular biology and biophysical.
(i) Biochemical (physicochemical) methods comprise isolation, purification, identification and physicochemical characterization of hydrodynamical (molecular mass, shape, and size) parameters, and reaction mechanisms. Sequence (primary structure) analysis, which includes molecular biology methods, is also addressed under this category.
(ii) Biophysical methods encompass physicochemical characterization and structure determination by X-ray crystallography and NMR spectroscopy, augmented by other spectroscopic methods.

For any experimental structure elucidation of a biological macromolecule (say, a protein), the molecule should be available in pure (homogeneous) form and in sufficient quantities for biophysical characterization. Therefore, protein purification is the very first step to be undertaken. Protein purification is carried out by (i) physicochemical methods (chromatography, electrophoresis etc.) and/or (ii) molecular genetics methods (gene selection and amplification etc.). The next step is to determine the primary structure (amino acid sequence) of the protein. This can be achieved by amino acid sequencing and/or by gene sequencing methods. The final step, and the most challenging experimental task, is to determine the spatial (three-dimensional) structure of the protein. The only physical techniques available to date for determining three-dimensional structures at atomic and molecular levels are single crystal X-ray diffraction and multi-dimensional NMR spectroscopic methods. A general protocol of structure analyses of biological macromolecules is given in the flowchart (Figure 5.1).
Biochemical characterization of biomolecules includes purification, identification and determination of molecular mass, shape and size by various physicochemical methods. These include (1) hydrodynamical, (2) chromatographic, and (3) electrophoretic methods. Blotting techniques, which come under molecular biology can also be treated under the purview of biochemical characterization.


The diffusion coefficient (D) can be determined from the optical Doppler effect of the laser beam scattered by the moving particles. Schlieren optics can be employed in a centrifuge to measure the concentration gradient (∂C/∂x) across the boundary, as the solute behaves like a prism. The refractive index gradient (concentration gradient) can also be measured by the interference method.

5.1.2 Rotational Diffusion
Non-spherical particles undergo tumbling motion (rotational diffusion), and the shape of macromolecules can be determined by the rotational diffusion method.

$\Omega = \dfrac{RT}{N_A \zeta}$  (5.4)

$\zeta = 8\pi\eta r^3$  Stokes equation  (5.5)

$\tau = \dfrac{4\pi N_A \eta a b^2}{RT}$  for ellipsoids  (5.6)

(Ω = rotational diffusion coefficient; R = gas constant; T = absolute temperature; NA = Avogadro number; ζ = rotational frictional coefficient (torque per unit angular velocity); r = radius of the particle; τ = relaxation time; η = viscosity coefficient; a and b = major and minor axes of the ellipsoid). The rotational diffusion coefficient (Ω) can be determined by the flow birefringence method, by determining the orientation of elongated molecules in a velocity gradient produced by a mechanical shearing force. This method is applicable to the analysis of asymmetric rod-like molecules (e.g. DNA, collagen). The fluorescence depolarization method can also be used to determine the relaxation time, from which the size and shape of macromolecules can be estimated.

5.1.3 Light Scattering
Treating molecules as point dipoles, the dipole moment, μ, induced by an electric field, E, on dipoles of polarizability, α, is

$\mu = \alpha E$  (5.7)

An oscillating dipole emits radiation, and the ratio of the scattered (Iθ) and incident (I0) intensities is given by the Rayleigh equation

$\dfrac{I_\theta}{I_0} = \dfrac{16\pi^4\alpha^2 N \sin^2\theta}{\lambda^4 r^2}$  Rayleigh equation  (5.8)

The Rayleigh ratio, Rθ, is

$R_\theta = \dfrac{8\pi^4\alpha^2 N}{\lambda^4}$  (5.9)

(N = number of particles per unit volume; θ = angle of scattering; λ = wavelength of radiation; r = specimen to detector distance). The polarizability (α) is related to the refractive index, n, and concentration, C, of the particles.

$n^2 - 1 = 4\pi\alpha N$  (5.10)

Therefore,

$R_\theta = \dfrac{2\pi^2 n_0^2}{\lambda^4 N_A}\left(\dfrac{dn}{dC}\right)^2 C\,M$  (5.11)

$\dfrac{\Lambda C}{R_\theta} = \dfrac{1}{M}$  for ideal solutions  (5.12)

$\dfrac{\Lambda C}{R_\theta} = \dfrac{1}{M} + \dfrac{2BC}{M^2}$  for real solutions  (5.13)

where

$\Lambda = \dfrac{2\pi^2 n_0^2}{\lambda^4 N_A}\left(\dfrac{dn}{dC}\right)^2$

(M = molecular mass; n0 = refractive index of solvent; B = second virial coefficient). Scattering by the solvent and by the solution (solvent + solute) can be measured experimentally and the difference taken. (dn/dC) is determined with a differential refractometer. The quantity (ΛC/Rθ) is plotted as a function of concentration (C). The molecular mass can be obtained from the intercept of the plot (1/M), and the slope as C → 0 gives 2B/M², from which the 2nd virial coefficient (B) can be determined. The spectral width of the scattered radiation (due to Doppler broadening) can also be used to obtain information on translational and rotational diffusion coefficients, from which the radius of gyration and molecular shapes can be determined.
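The plotting procedure just described amounts to a straight-line fit of ΛC/Rθ against C: the intercept is 1/M and the slope is 2B/M². A minimal sketch of that analysis follows, using an ordinary least-squares line. The function names are illustrative and the data in the example are synthetic, generated from assumed values of M and B rather than taken from any measurement.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mass_and_virial(concentrations, lambda_c_over_r):
    """Molecular mass M and second virial coefficient B from the plot
    of (Lambda * C / R_theta) against C: intercept = 1/M, slope = 2B/M^2."""
    slope, intercept = fit_line(concentrations, lambda_c_over_r)
    M = 1.0 / intercept
    B = slope * M**2 / 2.0
    return M, B


if __name__ == "__main__":
    # Synthetic, noise-free data generated from assumed M and B.
    M_true, B_true = 5.0e4, 2.5e5
    conc = [0.001, 0.002, 0.003, 0.004]
    y = [1.0 / M_true + 2.0 * B_true * c / M_true**2 for c in conc]
    print(mass_and_virial(conc, y))
```

With real measurements the points scatter about the line, so the extrapolation to C = 0 carries an uncertainty that this sketch does not estimate.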

5.1.4 Sedimentation
The sedimentation rate of a particle (in a centrifuge) through a solution is related to the net force acting on the particle. The centrifugal force, Fc, acting on a particle is opposed by the frictional force, F', and at equilibrium Fc = F'.

$(1 - V\rho)\,m\,\omega^2 r = f\,s\,\omega^2 r$  (5.14)

$N_A f = \dfrac{RT}{D}$  Einstein equation  (5.1)

$\therefore\; M = \dfrac{RT\,s}{(1 - V\rho)\,D}$  (5.15)

(V = specific volume of the particle; ρ = density of the solution; m = mass of the particle; ω = angular velocity; r = distance of the particle from the revolving axis in a centrifuge; NA = Avogadro number; M = molecular mass; s = sedimentation coefficient; the unit 10⁻¹³ s is called a Svedberg). The specific volume (V) of the particle can be determined from the variation of the density of the solution with solute concentration. The sedimentation coefficient (s) can be measured in a centrifuge by the optical Doppler effect of laser light scattered by the moving particle. The molecular mass (M) can then be calculated if the diffusion coefficient (D) of the particle is known. Molecular mass can also be determined without knowledge of the diffusion coefficient by equating diffusion with sedimentation. This is realized in the sedimentation equilibrium method. At equilibrium, the diffusion rate is equal to the sedimentation rate.


$C\,\dfrac{dr}{dt} = D\,\dfrac{dC}{dr}$  (5.16)

Considering concentrations C1 and C2 at positions r1 and r2 from the rotor axis of the centrifuge,

$M = \dfrac{2RT\,\ln(C_2/C_1)}{(1 - V\rho)\,\omega^2\,(r_2^2 - r_1^2)}$  (5.17)

A linear gradient of sucrose (~ 5–20%) or CsCl is employed in analytical centrifugation methods to determine the sedimentation of macromolecules. Macromolecules are layered atop the gradient in a centrifuge tube and then subjected to centrifugal fields in excess of 10⁵ g. The sizes of unknown macromolecules can then be determined by comparing their migration distances in the gradient with those of known substances.
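The two routes to the molecular mass described in this section, sedimentation velocity combined with diffusion (Eq. 5.15) and sedimentation equilibrium (Eq. 5.17), can be written as two short functions. This is a sketch in SI units; the function names are illustrative, and the numerical inputs in the example are round, protein-like values chosen for illustration rather than measured data.

```python
import math

R_GAS = 8.314  # gas constant, J mol^-1 K^-1


def svedberg_mass(s, D, V, rho, T):
    """Eq. 5.15: M = R T s / ((1 - V rho) D), in kg/mol for SI inputs
    (s in seconds, D in m^2/s, V in m^3/kg, rho in kg/m^3, T in K)."""
    return R_GAS * T * s / ((1.0 - V * rho) * D)


def equilibrium_mass(c1, c2, r1, r2, omega, V, rho, T):
    """Eq. 5.17: M from concentrations c1, c2 at radii r1, r2 (metres)
    at sedimentation equilibrium; omega is the angular velocity in rad/s."""
    return (2.0 * R_GAS * T * math.log(c2 / c1)
            / ((1.0 - V * rho) * omega**2 * (r2**2 - r1**2)))


if __name__ == "__main__":
    # Illustrative inputs: s = 4.5 Svedberg, D = 6.9e-11 m^2/s,
    # V = 0.749e-3 m^3/kg, water-like solvent density, 20 degrees C.
    M = svedberg_mass(s=4.5e-13, D=6.9e-11, V=0.749e-3, rho=998.0, T=293.0)
    print(f"M is about {M:.1f} kg/mol")
```

Only the concentration ratio C2/C1 enters the equilibrium expression, so any quantity proportional to concentration (for example absorbance) can be used in its place.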

5.2 CHROMATOGRAPHIC METHODS
Chromatography is a physical technique used to separate mixtures of substances based on differences in the relative affinities of the substances for mobile and stationary phases. A mobile phase (liquid or gas) passes through a column containing a stationary phase of porous solid or of liquid coated on a solid support. Liquid chromatographic (LC) methods, where the mobile phase is a liquid, are extensively used in the physicochemical characterization of thermo-labile compounds (e.g. biomolecules) because of their simplicity of operation and versatility. High Performance Liquid Chromatographic (HPLC) methods are aimed at increasing the efficiency of liquid chromatography to a high level of sensitivity (10⁻¹² g/L). The advent of HPLC has resulted in tremendous progress in the separation of a wide variety of inorganic and organic compounds. Reversed-phase HPLC (based on the hydrophobicity of compounds) has become the workhorse for purification and characterization of molecules of biological importance. Denaturing high-performance liquid chromatography (dHPLC) is used in single nucleotide polymorphism (SNP) detection. The method discriminates between perfect and mismatched hybridization by measuring melting temperatures: mismatching results in a lowered melting temperature. It involves passing the hybridization mixture through a chromatographic column several times at different temperatures and deriving the melting temperature from changes in the chromatogram. Chromatographic separation methods rely on differences in physicochemical properties such as adsorption, solubility, size, affinity, and ionic mobility between solutes and solvents.

5.2.1 Thin Layer Chromatography (TLC)
In thin layer chromatography (TLC), the stationary phase is a thin coat of cellulose, alumina or silica on a plate (sorbent plate). Spots of known and unknown samples are applied along a line near the edge of the sorbent plate, and the plate is kept in a closed chamber that contains the mobile phase (solvent) in such a way that the line of spots is just above the solvent layer. As the mobile phase moves up the plate, the samples also migrate with it, the migration distances being determined by their relative affinities for the adsorbent and the solvent. The plate is removed and dried just before the solvent front reaches the top edge of the plate. Spots are identified by photometric or other suitable methods. The relative migration, Rf, is characteristic of the analyte.

$R_f = \dfrac{\text{Distance of migration of the analyte from the origin}}{\text{Distance of migration of the solvent from the origin}}$  (5.18)
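Because known and unknown samples are run side by side on the same plate, identification reduces to comparing Rf values. A small sketch of Eq. 5.18 and of a nearest-match lookup is shown below. The function names, the standards and all distances are hypothetical and serve only to illustrate the arithmetic.

```python
def rf_value(analyte_distance, solvent_distance):
    """Eq. 5.18: relative migration, both distances measured from the origin."""
    if solvent_distance <= 0:
        raise ValueError("solvent front must have moved from the origin")
    if not 0 <= analyte_distance <= solvent_distance:
        raise ValueError("analyte cannot travel farther than the solvent front")
    return analyte_distance / solvent_distance


def closest_standard(unknown_rf, standards):
    """Name of the standard whose Rf is nearest to that of the unknown."""
    return min(standards, key=lambda name: abs(standards[name] - unknown_rf))


if __name__ == "__main__":
    # Hypothetical plate: solvent front at 10.0 cm, two standards, one unknown.
    standards = {"standard A": rf_value(3.0, 10.0),
                 "standard B": rf_value(5.5, 10.0)}
    unknown = rf_value(3.1, 10.0)
    print(unknown, closest_standard(unknown, standards))
```

A nearest match is only suggestive; two different compounds can share an Rf in a given solvent system, so a second solvent or an independent method is normally needed for confirmation.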

5.2.2 Size-exclusion (Molecular-sieve) Chromatography
Size-exclusion chromatography separates molecules according to molecular size and is particularly applicable to the separation of high molecular mass macromolecules. The column packing is generally a glucose polymer in bead form with the required pore size. Molecules larger than the pore size cannot enter the pores, and are excluded and eluted first. The smaller molecules that can enter the pores of the beads are retarded in their rate of travel and are eluted last. Molecules of intermediate size are eluted in between, in order of decreasing molecular size. Beads of different pore sizes can be used, depending upon the desired protein size separation profile.

5.2.3 Ion-exchange Chromatography
Proteins (and nucleic acids) can be separated based on their net charge by the ion-exchange method. The sorbent column consists of a cellulose polymer, either negatively (cation exchanger) or positively (anion exchanger) charged. Separation is achieved in two steps: (i) adsorption (binding) of analytes of opposite charge to the sorbent column as a solution of analytes is passed through the column, and (ii) elution of the analytes, one at a time and separated from each other, by an ion gradient. Separations are generally performed under conditions in which one of the ionic species predominates in both phases.

5.2.4 Affinity Chromatography
Proteins have high affinity for their substrates, cofactors, receptors, or antibodies raised against them. This specific characteristic is exploited in protein purification by affinity chromatography. The sorbent column consists of beads with a specific affinity chemical group (X) bound to them. When a solution containing a mixture of proteins is passed through the column, the protein species with affinity for X are bound to the column. The bound proteins are then eluted from the column by passing a buffer solution containing a high concentration of the chemical group (X).
5.2.5 Supercritical Fluid Chromatography (SFC)
Supercritical fluid chromatography (SFC) is a separation method that combines the flexibility and versatility of other conventional extraction methods. It uses supercritical fluids (high-purity CO2) under very high pressure as the mobile phase. The SFC technique exploits the liquid-like solvating property and gas-like mobility of supercritical fluids for better separation. SFC has proven to offer advantages in terms of increased resolution, decreased purification and sample dry-down times, and purification capabilities complementary to HPLC.

5.3 ELECTROPHORETIC METHODS Proteins and nucleic acids can also be characterized according to size and charge by electrophoretic methods. The fundamental physical principle upon which all electrophoretic


techniques are based is the migration of charged particle(s) towards the electrode(s) of opposite polarity under the influence of an applied electric field. The migration of a charged particle under the influence of an applied electrical field depends on its electrophoretic mobility, µ, the frictional drag in the medium and other physicochemical parameters. Electrophoresis is a prelude to blotting techniques employed in molecular and immunobiology. The force, F, acting on a particle of charge q in an applied electric field, E, is F = qE. It is opposed by the frictional drag, F' = fv, and at equilibrium F = F':

$$F = qE = F' = fv = (6\pi\eta r)\,v \quad \text{(Stokes equation)} \qquad (5.19)$$

Electrophoretic mobility, µ, is defined as

$$\mu = \frac{v}{E} = \frac{q}{6\pi\eta r} \qquad (5.20)$$

(f = frictional coefficient; η = viscosity coefficient; v = terminal velocity; r = radius of the particle). If µ is positive (pH < the isoelectric point, so that the particle carries a net positive charge), the particle moves towards the cathode. If µ is negative (pH > the isoelectric point), the particle moves towards the anode. The physicochemical parameters that influence the electrophoretic mobility of a particle in an electric field are:
(i) The ionizable groups present on the surface of the particle.
(ii) Shape and rigidity of the particle.
(iii) Pore size of the separation matrix.
(iv) Characteristics of the buffer medium (pH, concentration etc.).
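Equations 5.19 and 5.20 can be evaluated numerically. The following is a minimal sketch in Python, assuming an idealized rigid sphere in a uniform medium and ignoring the buffer and matrix effects listed above. The function names and all numerical values (net charge, radius, viscosity, field strength) are illustrative assumptions, not measured data.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs


def stokes_friction(viscosity_pa_s, radius_m):
    """Frictional coefficient f = 6*pi*eta*r of a sphere (Stokes)."""
    return 6.0 * math.pi * viscosity_pa_s * radius_m


def electrophoretic_mobility(charge_c, viscosity_pa_s, radius_m):
    """Mobility mu = q / (6*pi*eta*r), Eq. 5.20, in m^2 V^-1 s^-1."""
    return charge_c / stokes_friction(viscosity_pa_s, radius_m)


def terminal_velocity(charge_c, field_v_per_m, viscosity_pa_s, radius_m):
    """Velocity v = mu * E at which q*E balances f*v, Eq. 5.19."""
    mu = electrophoretic_mobility(charge_c, viscosity_pa_s, radius_m)
    return mu * field_v_per_m


# Illustrative (assumed) values: net charge of -5 e, radius 3 nm,
# viscosity of water near room temperature, field of 1000 V/m.
q = -5 * ELEMENTARY_CHARGE
eta = 1.0e-3  # Pa s
r = 3.0e-9    # m
E = 1000.0    # V/m

mu = electrophoretic_mobility(q, eta, r)
v = terminal_velocity(q, E, eta, r)
direction = "anode" if mu < 0 else "cathode"
print(f"mu = {mu:.3e} m^2/(V s), v = {v:.3e} m/s, moves towards the {direction}")
```

The sign of the computed mobility follows the sign of the net charge, so a negatively charged particle is reported as moving towards the anode.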

Optimization of these parameters leads to greater flexibility and versatility of the electrophoresis methods. Any electrophoretic setup consists of at least two components: an electrophoresis unit and a power pack. The choice of the separating medium depends on the purpose.
Paper: Cheap and easy to use, but resolution is poor.
Cellulose acetate: Minimal adsorption, with faster and clearer separation. Better than paper and routinely used in clinical analysis of serum and other body fluid samples.
Agarose gels: Invariably used in macromolecular separation (nucleic acids and proteins). The gel is transparent, so photoscanning is possible.
Polyacrylamide gels: Offer many advantages. Used in protein purification and characterization. Photoscanning is possible.

Electrophoresis can be carried out in a buffer solution or in a gel soaked in the buffer solution. The adverse effects of diffusion and convection are minimal in a solid support and, in the case of gels, the pore size can be varied by varying the gel concentration, so that separation is dependent not only on the charge of the particle, but also on its size and shape (its physical features). Therefore, gel electrophoresis has become the standard method for separation and characterization of macromolecular constituents. Both slab gels and capillary gels are used to accomplish sizing. Slab gel matrices are generally cross-linked (4–6% polyacrylamide gel), whereas capillary gel matrices are non-crosslinked.


5.3.1 Horizontal Electrophoresis of Nucleic Acids
Nucleic acid constituents are large particles and, therefore, larger pore-size gels are required for their separation. Agarose gels meet this requirement, but they do not have the structural strength for a vertical setup. Therefore, a horizontal setup with agarose gels is invariably used in the separation of nucleic acid constituents (Fig. 5.2).

Fig. 5.2 Horizontal Electrophoresis Setup (labelled parts: lid, gel, wick, electrode, buffer, cooling plate)

The net charge of a nucleic acid is independent of the pH of the medium. That is, the charge/mass ratio is almost the same for all nucleic acids. Therefore, the electrophoretic separation of nucleic acids is based solely on the difference in their molecular mass, M. The molecular mass (M) of a nucleic acid can be determined from its electrophoretic mobility by running standard nucleic acid markers of known M on the same gel, with proper pore size. The pore size of an agarose gel depends on the gel concentration. A general guide is given in Table 5.1.

Table 5.1 Molecular Separations as a Function of Agarose Gel Concentration

Agarose gel concentration    Molecular separation size
0.3%                         DNA duplexes of 5 – 60 kb
0.4 – 0.5%                   Viral nucleic acids and plasmids
0.8%                         Larger restriction fragments (0.5 – 10 kb)
1 – 1.2%                     Smaller restriction fragments (< 5 kb)
~ 2%                         Smaller fragments (0.1 – 3 kb)
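Sizing an unknown fragment against markers run on the same gel can be sketched as a calibration fit. The example below is a minimal sketch that assumes the commonly used approximation that log10 of fragment size falls linearly with migration distance over the working range of the gel. The ladder values and function names are hypothetical and idealized, chosen only to show the procedure.

```python
import math


def fit_standard_curve(markers):
    """Least-squares line log10(size_bp) = slope * distance + intercept.

    markers: list of (migration_distance_cm, size_bp) pairs for the
    marker fragments of known size run on the same gel.
    """
    xs = [distance for distance, _ in markers]
    ys = [math.log10(size) for _, size in markers]
    n = len(markers)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_size_bp(distance_cm, slope, intercept):
    """Interpolate the size of an unknown band from its migration distance."""
    return 10 ** (slope * distance_cm + intercept)


# Hypothetical, idealized marker ladder: (distance in cm, size in bp).
ladder = [(1.0, 10000), (3.0, 1000), (5.0, 100)]
slope, intercept = fit_standard_curve(ladder)

unknown_distance = 2.0  # cm, assumed measurement
print(f"estimated size: {estimate_size_bp(unknown_distance, slope, intercept):.0f} bp")
```

Real gels deviate from this straight line for very large and very small fragments, so the estimate is only meaningful for bands that lie between the markers.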

Microchip electrophoresis has had considerable impact on DNA separations, in large part because it offers clear advantages over slab gel electrophoresis in automation, speed, and quantitative capability.

5.3.2 Column Electrophoresis Both slab gel and capillary gel (and free-solution) matrices can be employed to separate nucleic acids and proteins. In order to accomplish sizing, a sieving matrix is prepared and loaded either between two glass plates (slab gel), or into glass capillaries.


5.3.2.1 Slab Gel Electrophoresis
Electrophoretic separation of proteins is a function of the pore size and sieving effects of the gel matrix and the physicochemical properties of the buffer medium. A judicious optimization is necessary in all protein purification and characterization protocols. The vertical slab gel setup, with polyacrylamide gel supports, has become standard and universal in protein electrophoresis procedures. A vertical slab gel unit has two reservoirs of buffer (each containing an electrode) separated by the gel. A thin slab of gel is formed between two glass plates that are clamped together but held apart by plastic spacers. The gel slab consists of two phases: the lower part with higher gel concentration (separation gel) and the top part with lower gel concentration (stacking gel). The stacking gel, with its larger pore size, concentrates the sample so that it enters the separation gel as a concentrated band. A plastic comb, placed in the stacking gel (and removed after polymerization), provides loading wells for samples. The lower part of the separation gel is dipped in the electrophoresis tank that contains the buffer (Fig. 5.3).

Fig. 5.3 Schematic of Slab Gel Electrophoresis Apparatus (labelled parts: electrodes, glass plate, sandwich gel, buffer solution, detector assembly)

5.3.2.2 Polyacrylamide Gel Electrophoresis (PAGE)
Molecular separation by native PAGE is according to the net charge/mass ratio of the molecule and the sieving effects of the gel matrix. The gel is prepared by polymerizing acrylamide (CH2=CH.CO.NH2) with a small quantity of the cross-linker bisacrylamide (CH2(NH.CO.CH=CH2)2). Ammonium persulfate and tetramethylethylenediamine (TEMED) are added as initiator and catalyst of polymerization. Varying the concentration of the monomer and the cross-linker in the gel solution controls the pore size of the gel.

5.3.2.3 SDS-PAGE Conventional electrophoresis (PAGE, called the native PAGE) is employed for separation of proteins and not for the characterization of molecular mass, shape and size. The electrophoretic mobility of a protein depends simultaneously on its (i) net charge, (ii) its size (molecular mass) and shape and (iii) its structural rigidity. These factors vary with the experimental conditions. In order to establish quantitative relation between one of the parameters and the electrophoretic


5.3.2.4 Pore Gradient Gels
In this system, the slab gel is not of uniform pore size but has a linear pore-size gradient, established by varying the acrylamide gel concentration. A particle loaded on the gel will migrate rapidly through the dilute (large pore) gel region until it reaches a region of smaller pores, through which it moves only extremely slowly. The advantages are:
(i) Proteins covering a much greater range of M values can be separated.
(ii) Proteins of very similar M values (isoenzymes) can be resolved.
(iii) Choice of buffer and electrical conditions is not critical.
(iv) Inherent self-limiting of the migration of proteins.

Estimation of M by native PAGE (without SDS) can be achieved in a linear gradient gel (~ 3 – 20%). The distance, D, traveled by a protein in time, t, is given by

$$t = (aD + b)^{2} \qquad (5.23)$$

5.3.2.5 Immunoelectrophoresis
Immunoelectrophoresis is a combination of gel electrophoresis and immunodiffusion methods. Serum (antigen) samples are placed in wells made in agar plates and electrophoresis is carried out to separate the proteins according to their charge. After electrophoresis, longitudinal trenches are cut, and antiserum (antibody) samples are introduced into the trenches and incubated in a humid chamber. The proteins and antiserum samples diffuse, and precipitin lines are formed wherever the proteins (antigens) and antiserum samples (antibodies) meet (Fig. 5.5).

Fig. 5.5 Schematic of Immunoelectrophoretic Diffusion Setup

5.3.2.6 Isoelectric Focusing (IEF)
The isoelectric point of a protein is the pH at which the net charge of the protein is equal to zero. Polyacrylamide gel electrophoresis can also be used to separate amphoteric molecules (proteins) according to the differences in their isoelectric points (pI). This method is extremely useful in the separation of isoenzymes and for studying micro-heterogeneity in a protein (e.g. a protein may show a single band in SDS-PAGE, but three bands in IEF if it exists in mono-, di- and triphosphate forms). The apparatus usually consists of a narrow tube containing a mixture of polyacrylamide gel and ampholytes (which are small molecules with positive and negative charges). The


ampholytes have a wide range of isoelectric points, and when an electric field is applied, those with low pI migrate towards the anode, whereas those with high pI migrate towards the cathode. In this process a pH gradient is set up from one end of the gel to the other, as each ampholyte comes to rest at a position coincident with its isoelectric point. Proteins introduced into such a column migrate through it in the electric field until each one reaches the point at which the pH in the column exactly equals its own isoelectric point. Isoelectric focusing can also be achieved with immobilized pH gradient (IPG) strips. These are made by co-polymerizing acrylamide with a set of monomers that carry ampholyte functionality. By changing the concentration of the different monomers along the strip, pH gradients are covalently immobilized and stabilized in the gel. In this way IPG strips with various pH gradients (3-12) can be fabricated.
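The definition of the isoelectric point as the pH of zero net charge can be turned into a small calculation. The sketch below assumes the Henderson-Hasselbalch relation for each ionizable group and a single fixed pKa per group type. The pKa values are illustrative assumptions (published tables differ), the peptide sequences are hypothetical, and the approach ignores the influence of the folded structure on individual pKa values.

```python
# Illustrative pKa values (assumed; published pKa tables differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}            # basic side chains
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # acidic side chains
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    # A basic group carries +1 when protonated: fraction = 1/(1 + 10^(pH - pKa)).
    positive = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # An acidic group carries -1 when deprotonated: fraction = 1/(1 + 10^(pKa - pH)).
    negative = 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence:
        if residue in POSITIVE_PKA:
            positive += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            negative += 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Find the pH of zero net charge by bisection.

    The net charge decreases monotonically as the pH rises, so the sign
    of the charge at the midpoint tells us which half holds the pI.
    """
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical peptides: one rich in acidic residues, one rich in basic residues.
acidic_peptide = "DDEEDD"
basic_peptide = "KKRRKK"
print(f"pI (acidic peptide) = {isoelectric_point(acidic_peptide):.2f}")
print(f"pI (basic peptide)  = {isoelectric_point(basic_peptide):.2f}")
```

In a focusing gel, the acidic peptide would come to rest nearer the anode end of the pH gradient and the basic peptide nearer the cathode end.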

5.3.3 2-D Gel Electrophoresis (2-DE)
With the completion of genomic sequence analysis, attention is shifting towards understanding gene functions through the study of the different biomolecules involved in cellular functions. Proteins are one of the fundamental classes of biomolecules involved in various cellular functions. The term proteomics was coined to describe the large-scale study of the proteins related to a genome. The rapidly expanding field of proteomics relies heavily upon two-dimensional (2D) electrophoresis as the best method to separate the large number of proteins present in biological extracts. 2-DE, followed by mass spectrometric analysis of the isolated proteins, is the workhorse tool for proteomic analysis. Proteomic studies can be performed on a complex tissue level. The first and probably the most important experimental step in a proteomic study is the extraction of the proteome from its medium. Once the protein extract has been isolated from the lysate, a separation technique is required to separate the individual proteins. 2-D gel electrophoresis (2-DE) has been the method of choice for the large-scale purification of proteins in proteomic studies.

The 2-DE technique combines IEF, which separates proteins according to their charge (pI), with the size-separation method of SDS-PAGE. The first dimension is carried out in PA gels in narrow tubes in the presence of ampholytes, 8M urea and a non-ionic detergent. The denatured proteins separate in the gel according to their pIs. The full potential of the 2-DE method is better realized with the IPG strip. The protein extract is used to re-swell the IPG strip prior to focusing. Once the strip is swelled, an electric current is applied across the IPG strip. The proteins that are positively charged (that is, in the area of the strip with pH below their pI) move toward the cathode and encounter an increasing pH until reaching their pI, at which point they will be neutral. The proteins that are negatively charged (that is, in the area of the strip with pH above their pI) move toward the anode and encounter a decreasing pH until reaching their pI. The end result is that every protein is concentrated and constantly focused at its pI.

The second dimension of the 2-DE separation is formed by polymerizing an SDS-PAGE gel between two glass plates. The PA gel, incubated in a buffer, or the IPG strip is placed at the top of the SDS-PAGE gel. An electric field is then applied across the gel, and the proteins migrate into the second dimension, where they are separated according to their molecular mass (Mr). The proteins are detected by a staining protocol that creates a two-dimensional set of spots (Fig. 5.6). The location of each spot is determined by the protein's pI and Mr. The intensity of each spot


separation medium. The narrow diameter of the capillary minimizes the thermal effects associated with the electric field. As a result, a very high electric potential (~ 30 kV) can be applied across the capillary without degradation of the separation due to heating. Separation time is inversely proportional to the applied potential and, therefore, application of high fields results in rapid and efficient separation. Unlike slab gels, where separated molecular species are visualized after a fixed run time, capillary electrophoresis is a "finish-line" technique, where they are detected after traversing a fixed distance. In a combination of capillary electrophoresis and mass spectrometry, laser-induced fluorescence (LIF) detection is used to quantify the amount of protein in the sample, and each LIF peak is then analyzed by electrospray ionization time-of-flight mass spectrometry (ESI-TOF-MS).
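The inverse relation between separation time and applied potential follows directly from the definition of mobility (µ = v/E). As a minimal sketch, assuming a uniform field along the capillary and a constant apparent mobility, the time for an analyte to reach the detector is t = (L_d × L_t) / (µ × V), where L_t is the total capillary length and L_d the distance to the detector. The function name and the numerical values below are illustrative assumptions.

```python
def migration_time_s(mobility_m2_per_vs, voltage_v, total_length_m, length_to_detector_m):
    """Time for an analyte to reach the detector in a capillary.

    The field is E = V / L_total, the velocity is v = mu * E, and the
    analyte must traverse the fixed distance to the detector.
    """
    field_v_per_m = voltage_v / total_length_m
    velocity_m_per_s = mobility_m2_per_vs * field_v_per_m
    return length_to_detector_m / velocity_m_per_s


# Assumed values: apparent mobility 2e-8 m^2/(V s), 50 cm capillary,
# detector window 40 cm from the injection end.
mobility = 2.0e-8
for voltage in (15_000.0, 30_000.0):
    t = migration_time_s(mobility, voltage, 0.50, 0.40)
    print(f"{voltage / 1000:.0f} kV -> {t:.0f} s")
```

Doubling the potential halves the migration time, which is why the high potentials tolerated by narrow capillaries translate into fast runs.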

5.3.4.1 Free-solution Electrophoresis
Capillary free-solution electrophoresis does not have an analog in slab electrophoresis. A capillary is filled with a buffer and proteins are separated on the basis of their mobility in the buffer, which is related to the charge/size ratio of the protein: small, highly charged proteins have the highest mobility.

5.3.4.2 2-D Capillary Electrophoresis
In the two-dimensional capillary electrophoresis method, a mixture of fluorescently labeled proteins is introduced into the first capillary. An electric field is applied for a period of time. The aliquot that migrates out of the first capillary during that period is injected into a second capillary, whose entrance abuts the first capillary. The electric field across the first capillary is then switched off. The fraction injected into the second capillary is separated by applying an electric field across the second capillary. The process is repeated until every fraction in the first capillary has passed into the second capillary and been separated. This procedure provides a two-dimensional electropherogram as a raster image built from successive separations of the first capillary's fractions. 2-D capillary electrophoresis, with SDS-PEO gel separation in the first capillary and free-solution separation in the second capillary, would allow orthogonal separation. This method would make it possible to resolve complex mixtures into their constituents. It produces separations similar to IEF/SDS-PAGE in a fully automated system. When combined with laser-induced fluorescence detection, it provides orders of magnitude higher sensitivity and dynamic range than IEF/SDS-PAGE analysis.

5.3.4.3 Capillary Zone Electrophoresis (CZE)
Capillary zone electrophoresis (CZE) is an alternative to liquid chromatography (LC) on account of its ability to separate ionizable compounds. Separation of analytes in CZE is based on differences in their electrophoretic mobilities, which can be modified by adding suitable soluble reagents. CZE separations occur entirely in a liquid phase. As no mass transport between mobile and stationary phases is involved, no peak broadening results from this source. Therefore, the separation power is much greater than for a similar LC separation. Thus, CZE is more efficient, expeditious and selective than LC. In CZE, a capillary tube is filled with separation buffer, and one end of the tube is inserted into the microspray interface of a mass spectrometer. A sample reservoir, containing a protein digest, is introduced at the other end of the capillary tube, and an electric field is applied from the injection end to the microspray interface. The applied electric field generates a bulk flow of liquid toward the microelectrospray interface (electroosmotic flow). The analytes present


in the samples are driven into the capillary tube by the electric field. The sample reservoir is then replaced by a buffer reservoir, and an electric field is applied across the capillary tube. A bulk flow of liquid moves toward the microelectrospray interface, due to electroosmotic pumping, while the analytes are separated by electrophoresis. The separated analytes are then successively led into the tandem mass spectrometer (MS-MS), and tandem mass spectra are acquired.

5.3.5 Control Parameters and Detection
Selection of the buffer, its ionic strength and its pH is of utmost importance in the electrophoresis of proteins. The ionic strength of the buffer should be ~ 0.05 – 0.15 M. At low ionic strength there is rapid migration but greater diffusion; at high ionic strength sharp bands are obtained but there is higher heat production. Conductivity is determined by the nature of the component ions; e.g., Na+, K+ and phosphate buffers have higher conductivity than Tris-HCl because of the ions they contain. Tris buffers are ideal for fractionation of basic proteins, and cacodylate buffers (anions) are suited to acidic proteins at neutral pH values. Tracking dyes are added to the starting solution to keep track of ions that migrate in the same direction as the proteins to be fractionated. The mobility of buffer ions that migrate in the same direction as the macromolecules is of importance, and resolution is enhanced if the mobilities of both are comparable. In alkaline and neutral buffers, acidic proteins become negatively charged and migrate towards the anode and, therefore, negatively charged dyes are used (Bromophenol Blue as a marker for Cl− and Gly− ions). For electrophoresis in acidic medium, with proteins migrating towards the cathode, positively charged dyes are used (Methyl Green, Pyronin). As the charge/mass ratio of proteins can be altered by the pH of the buffer, the optimal pH value of a running buffer is that which ensures the maximum difference in the charge of the component proteins. For acidic proteins, optimal pH values fall in the neutral or slightly alkaline regions and such proteins migrate towards the anode. Tris buffers at pH ~ 8.9 are preferred. For basic proteins, it is preferable to choose a slightly acidic pH (4 – 5). K+-acetate and Tris-acetate (pH ~ 4.5) are most suitable. Detection and localization of protein bands are performed by staining gels with Coomassie Blue R-250 (CBB) and fixing them with trichloroacetic acid (TCA).
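Whether a candidate buffer falls in the recommended ionic-strength window can be checked with the standard definition I = ½ Σ c_i z_i², where c_i is the molar concentration and z_i the charge of each ion. The sketch below assumes fully dissociated salts; the example compositions and the helper names are illustrative assumptions, not recipes from the text.

```python
def ionic_strength(ions):
    """Ionic strength I = 0.5 * sum(c_i * z_i**2).

    ions: iterable of (molar_concentration, charge) pairs, one per
    ionic species, assuming complete dissociation.
    """
    return 0.5 * sum(concentration * charge ** 2 for concentration, charge in ions)


def in_recommended_range(value, low=0.05, high=0.15):
    """Check a value against the ~0.05 - 0.15 M window quoted in the text."""
    return low <= value <= high


# Assumed example compositions.
nacl_100_mm = [(0.100, +1), (0.100, -1)]   # 0.1 M NaCl
mgcl2_100_mm = [(0.100, +2), (0.200, -1)]  # 0.1 M MgCl2

for name, ions in (("0.1 M NaCl", nacl_100_mm), ("0.1 M MgCl2", mgcl2_100_mm)):
    strength = ionic_strength(ions)
    print(f"{name}: I = {strength:.2f} M, in range: {in_recommended_range(strength)}")
```

Because the charge enters as a square, a divalent salt contributes three times the ionic strength of a monovalent salt at the same molar concentration.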
Detection is also done by either fluorescence or silver staining methods, and by autoradiography if the sample is radiolabeled. Nucleic acids are identified by staining the gels with various dyes such as ethidium bromide, acridine orange and Pyronin. Methyl Green and DAPI selectively stain DNA but not RNA. For visible region fluorescence (blue or green excitation), fluorophores in the fluorescein and rhodamine families are most commonly used. For near IR fluorescence, the most common dyes are from the polymethine carbocyanine family. Mass spectrometry (MS) is a sought-after method in genomic and proteomic analysis (identification and sequence determination; see Chapter 6). All the fluorescence dye approaches require a fluorescence detector system to determine the position of protein/DNA spots on the gel. In principle, a fluorescence detector system has three basic components: (i) an exciting energy source, (ii) the fluorescent sample and (iii) a fluorescence detector (Fig. 5.7). The laser/detector assembly consists of a laser source (laser diode) as the excitation source, and a detector assembly (photomultiplier tube (PMT), or avalanche photodiode). The laser source is placed at an angle such that the focused polarized


Fig. 5.8 Schematic of a Blotting Apparatus (labelled parts: weight, glass plate, paper towels, filter paper, membrane, wick, gel, support, transfer buffer)

Fig. 5.9 Immunodetection of Protein Blots by Enzyme-linked Antibodies

probes under the appropriate hybridization conditions. Radioactive DNA molecules (cDNA) can be employed as probes in Southern and Northern blotting. Individual protein bands in Western blotting can be examined with specific antibodies (Fig. 5.9) or by other methods, such as fluorescence (Fig. 5.10).


Fig. 5.10 Schematic of Detection of Proteins in Western Blots by Fluorescence (labels in the figure: 2H2O2, 2H2O, enzyme (peroxidase), N2, fluorescent molecule (luminol), fluorescence, nonfluorescent species (aminophthalic acid))

EXERCISE MODULE
1. Which are the two branches of structure determination of biomolecules and what are their objectives?
2. Which are the physicochemical methods of structure elucidation clubbed under biochemical characterization?
3. Which are the hydrodynamical methods of structure elucidation and what information do they provide?
4. Which are the chromatographic methods that are employed in the physicochemical characterization of biomolecules?
5. What is the reason for the extensive use of electrophoretic methods in separation and characterization of biomolecules?
6. What are the procedures followed in the electrophoretic separation of (i) nucleic acids, and (ii) proteins?
7. Which are the important physicochemical methods employed in protein separation and characterization?
8. What kinds of complementary structural data do PAGE and SDS-PAGE provide?
9. Explain the use of immunoelectrophoretic methods in biomedical sciences.
10. What are the applications of 2-D gel electrophoresis (2-DE)?
11. What are the advantages of capillary electrophoretic methods?
12. What are the detection methods in gel electrophoresis?
13. What is the relevance of blotting techniques in genome analysis?


6 Primary Structure (Sequence) Determination of Biomolecules
Once a biological macromolecule (protein or nucleic acid) is obtained in a purified form, the next step is to determine its primary structure (sequence). Determination of the primary structure (sequence) of biological molecules is the starting point for generating computer databases and for computational analyses in bioinformatics studies. Availability of the primary structure of a macromolecule is a prerequisite for the determination of its spatial structure by physical techniques, such as X-ray diffraction and NMR spectroscopic methods. Molecular separation and amplification, sequence determination of component fragments and analysis of consensus sequences are the main steps in the primary structure determination of nucleic acids and proteins.

6.1 PRIMARY STRUCTURE DETERMINATION OF NUCLEIC ACIDS
DNA sequencing technology is a major component of all genomic studies. Gene amplification, separation and sequencing are hierarchical steps in the primary structure determination of nucleic acids. High-throughput sequencing methods (required for large-scale genome projects) use robotics, automated DNA-sequencing machines and computers to achieve speed and to handle large-scale data management. The "assembly line" purification platforms replicate the steps used in the manual template purification protocol.

6.1.1 Gene Amplification
Cutting genomic DNA at specific sites with suitable restriction enzymes generates DNA fragments. Gene amplification is necessary for obtaining sufficient quantities of the desired gene or gene fragment for further analysis, and for production of the desired protein in large quantities for analysis. The fragments are then amplified either by cloning or by polymerase chain reaction (PCR) methods.

6.1.1.1 Gene Cloning

Gene cloning involves the use of recombinant DNA technology to propagate DNA fragments, isolated from chromosomes using restriction enzymes, inside a foreign host. Following introduction into suitable host cells, the DNA fragments can then be reproduced along with the host cell DNA. Cloning procedures are routinely employed to produce unlimited material for experimental study.

6.1.1.2 Polymerase Chain Reaction (PCR)

Polymerase chain reaction (PCR) is a very versatile gene amplification method that has brought tremendous progress in molecular biology and genetics. It is an in vitro method of amplifying a desired DNA sequence of any origin hundreds of millions of times in hours. The procedure involves cycles of steps in which the double-stranded target sequence is denatured, oligonucleotide primers bordering the region to be amplified are annealed, and the primers are extended by a thermo-stable polymerase using dNTPs. For amplification by PCR, a desired cDNA clone can be synthesized using mRNA as a template. Suitable primers are used to hybridize to the corresponding sequences, and they are extended in a chain synthesis reaction by thermo-stable DNA polymerases, using the inserted sequence as the template. The PCR mixture contains the four deoxynucleotides (dNTPs) and two primers (~ 20 bases long). The mixture is heated to separate the strands of the target sequence and then cooled to allow (i) the primers to bind (anneal) to their complementary sequences on the separated strands, and (ii) the polymerase to extend the primers into new complementary strands. Repeated heating and cooling cycles multiply the target DNA exponentially, since each new double strand separates to become two templates for further synthesis. The nucleotide that the polymerase attaches is complementary to the base in the corresponding position on the template strand (e.g. if the adjacent template base is C, the polymerase attaches G). The polymerase chain reaction proceeds with two primers bound to opposite strands of the gene target, with their 3'-ends pointing at each other. In sequencing applications, the reaction is terminated by the incorporation of dideoxynucleotides, and the result is a series of fragments of different lengths for each primer.
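The exponential character of the amplification can be illustrated with a short calculation. This is a minimal sketch that assumes every cycle multiplies the number of target copies by (1 + efficiency), with an efficiency of 1.0 meaning perfect doubling; real reactions fall below this and eventually plateau. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_base(template_base):
    """Base the polymerase attaches opposite a given template base."""
    return COMPLEMENT[template_base]


def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    Each cycle multiplies the copies by (1 + efficiency); efficiency is
    the assumed fraction of templates that are successfully copied.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


print(complementary_base("C"))  # base attached opposite a template C
for n in (10, 20, 30):
    print(f"{n} cycles: {pcr_copies(1, n):.3g} copies (ideal), "
          f"{pcr_copies(1, n, efficiency=0.9):.3g} copies (90% efficiency)")
```

Under the ideal assumption, a single template passes one hundred million copies before the 30th cycle, which is consistent with amplification by hundreds of millions of times within hours.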

6.1.2 Gene Separation
Electrophoresis techniques are used to separate the fragments. Small-diameter capillary array gel electrophoresis permits the application of high electric fields, thus providing significantly faster separation than traditional slab gels. While conventional electrophoresis can separate fragments of < 40 kilobases, the pulsed-field gel electrophoresis (PFGE) technique has improved the separation of larger fragments (~ 10 million bases). This technique employs multiple electrodes, placed orthogonally with respect to the gel, and short pulses of current are passed through the gel, alternating back and forth between two directions. Automated electrophoretic methods with laser-induced detection have revolutionized gene sequencing methodologies.

6.1.3 Gene Sequencing
Genome sequences are assembled from DNA sequence fragments of approximately 500 base pairs in length. Knowing the sequence of a DNA molecule is vital for making predictions about its function and for facilitating manipulation of that molecule. Conventional (1st generation) gene sequencing employed the Maxam-Gilbert and Sanger methods. The Maxam-Gilbert method (also called the chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths. The Sanger sequencing method (dideoxy method) uses an enzymatic procedure to synthesize chains of varying length in four different reactions, stopping the DNA replication at positions occupied by one of the four bases, and then determines the resulting fragment lengths. This method is automated, and is now by far the most widely used technique for sequencing DNA. The multiplex sequencing procedure makes it possible to analyze ~ 40 clones on a single DNA-sequencing gel.


The principles of the Sanger sequencing method are:
(i) An enzymatic procedure synthesizes DNA chains of varying lengths, terminating replication at positions occupied by one of the four bases.
(ii) The fragments are then separated on gels by electrophoresis, and the identity and order of the nucleotides are determined from the sizes of the fragments.

A DNA polymerase extends an oligonucleotide primer annealed to a unique location on a DNA template by incorporating deoxynucleotides (dNTPs) complementary to the template. The 5'-end of every DNA fragment within a sample begins with the same priming sequence. Synthesis of the new DNA strand continues until the reaction is randomly terminated by the inclusion of a dideoxynucleotide (ddNTP). These nucleotide analogs are incapable of chain elongation, since the sugar moiety of the ddNTPs lacks the 3'-OH necessary for forming a phosphodiester bond with the next incoming dNTP. This results in a population of truncated sequencing fragments of varying length. The identity of the chain-terminating nucleotide at each position is specified by running four separate base-specific reactions, each of which contains a different dideoxynucleotide: ddATP, ddCTP, ddGTP, or ddTTP. The four fragment sets are loaded in adjacent lanes of a polyacrylamide gel and separated by electrophoresis according to fragment size. By this method, DNA fragments differing in length by one nucleotide can be resolved (Fig. 6.1). Detection can be achieved by (i) autoradiography, if a radioactive label is introduced into the sequencing reaction products, or by (ii) fluorescence, if the reaction products are labeled with an appropriate fluorescent dye. Fluorescence detection employs a direct methodology, which is simple, sensitive and amenable to automation.
The use of laser-induced fluorescence (LIF) technology, which can be coupled to computerized detection systems, has replaced most of the radioactive techniques in genomic studies, allowing automation of sequencing methods. Automated DNA sequencing by fluorescence techniques accomplish real-time detection of DNA fragments as they move through a portion of the electrophoresis gel that is irradiated by a laser beam– DNA is separated on an automated DNA sequencer and the fluorescent image of DNA migrating through the acrylamide gel is captured by CCD (charge-coupled diode) cameras similar to that found in common home video recorders.
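The logic of the four base-specific reactions can be illustrated with a toy model. The sketch below is only an idealization under stated assumptions: every possible termination product is assumed to be present, the primer length is ignored, and the template and function names are hypothetical.

```python
# Toy model of Sanger (dideoxy) sequencing with four base-specific reactions.
# Assumption: every possible chain-terminated fragment is present in its lane.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def synthesize(template_3to5):
    """Return the new strand (5'->3') built by the polymerase on the template."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

def lane_fragment_lengths(new_strand):
    """For each ddNTP reaction, list the lengths of the truncated fragments.

    A fragment of length n ends in a dideoxynucleotide, so every length in
    the ddA lane marks a position where the new strand carries an A.
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes

def read_gel(lanes):
    """Read the four lanes from the shortest fragment to the longest."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

template = "TACGGTCA"                 # hypothetical template, written 3'->5'
new_strand = synthesize(template)
lanes = lane_fragment_lengths(new_strand)
print(lanes)
print(read_gel(lanes))                # the sequence of the newly made strand
```

Reading the bands in order of increasing fragment length recovers the sequence of the synthesized strand, which is the complement of the template.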

Fig. 6.1 Electropherogram of a DNA Sequence

6.1.3.1 Latest Gene Sequencing Methods

Developments in gene sequencing techniques (second and third generation) include ultra-thin gel electrophoresis, resonance ionization spectroscopy to detect suitable isotope labels, gel-less flow cytofluorimetry (single-molecule detection in flowing solution), laser-induced fluorescence, confocal (LSCM), near-field scanning optical (NSOM), scanning-tunneling (STM) or atomic force microscopy (AFM), and mass spectrometry (MS). Single-molecule detection can be achieved with flow cytofluorimetry, which monitors fluorescently tagged macromolecules. A small sample volume in a flowing solution is obtained by introducing the sample from a capillary tube inserted into a flow cell (Fig. 6.2). While a fluorescent molecule passes through a focused laser beam, it undergoes repeated cycles of photon absorption and emission, so its presence is signaled by a burst of emitted photons that can be distinguished from the background. The background from Rayleigh and Raman scattering can be drastically reduced by using a pulsed laser beam and a single-photon timing technique.

Fig. 6.2 Schematic of Single-molecule Detection by Flow-cytofluorimetry

Confocal microscopy is well suited to the detection of fluorescently tagged single molecules. Either a fluorescence spectroscopic or a microscopic approach can be used for sequencing nucleic acids (DNA or RNA). In this approach, a nucleic acid strand is replicated by a polymerase using nucleotides linked to fluorophores via a linker arm. The fluorescently tagged strand is attached to a solid support (e.g. a latex bead) and suspended in a flowing stream. The bases are then cleaved sequentially by an exonuclease. The released labeled nucleotides are detected and identified by their fluorescence signature: by their spectral characteristics if the fluorophores are different, or by their different lifetimes if the same fluorophore is used (Fig. 6.3). The sequence is thus determined, at high throughput (several hundred bases per second), from the order in which the labeled nucleotides pass through the laser beam.


(I) = Synthesis of complementary strand with fluorescently labeled nucleotides; (II) = Attachment of strand to a support and suspension in a flow sample stream; (III) = Sequential cleavage by an exonuclease and detection. Fig. 6.3 Sequencing of Nucleic Acids by Fluorescently-labeled Single-molecule Detection Method

6.1.4 Genome Sequencing

Genome sequencing projects face several technical problems. The choice of sequencing strategy usually depends on the size of the target DNA molecule. Since present experimental methods provide reads of only ~500 base pairs, determining larger genomic sequences requires a strategy for assembling overlapping sequence fragments. The “shotgun” sequencing strategy (randomly sequencing small cloned pieces of the genome, with no foreknowledge of where on a chromosome each piece originally came from) is employed in most current large-scale genome sequencing projects. The strategy employed in the “whole genome shotgun” DNA sequencing method is as follows (Fig. 6.4): (i) Chromosomes are first separated by pulsed-field gel electrophoresis, and each DNA molecule is then broken into random fragments.


(ii) Purified DNA fragments are used to construct small-insert shotgun libraries, which are cloned and sequenced from each end. Enough sequencing is performed that each nucleotide in the genome is covered many times by fragments of ~500 bp. (iii) After sequencing, scaffolds of DNA sequence (series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence) are assembled by identifying overlapping stretches, to reconstruct the complete genome.
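How many reads are "enough" in step (ii) can be estimated with a simple idealization. The sketch below assumes reads of equal length that start at uniformly random positions, in which case the mean coverage is c = N × L / G and the expected fraction of bases left unsequenced is approximately e^(-c). The numbers used are hypothetical, and real libraries deviate from this model.

```python
import math

def coverage(n_reads, read_length, genome_size):
    """Mean number of times each base is sequenced: c = N * L / G."""
    return n_reads * read_length / genome_size

def fraction_uncovered(c):
    """Expected fraction of bases hit by no read, assuming random read starts."""
    return math.exp(-c)

# Hypothetical project: 100,000 reads of 500 bp from a 5 Mb genome.
c = coverage(n_reads=100_000, read_length=500, genome_size=5_000_000)
print(c)                      # mean coverage
print(fraction_uncovered(c))  # expected fraction of the genome left in gaps
```

Under these assumptions, ten-fold coverage leaves only a few bases in every hundred thousand unsequenced, which is why shotgun projects sequence each region many times over.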

Fig. 6.4 Schematic of the “Whole Genome Shotgun (WGS)” DNA Sequencing Protocol

A large segment of target DNA is randomly fragmented by physical shearing or enzymatic digestion into fragments in the range of 1 to 5 kb. Chromosomes of a target organism are purified, fragmented, and sub-cloned in fragments of about a kilobase pair. They are further subcloned as smaller fragments in plasmid vectors for DNA sequencing. First, the fragments are sequenced to determine the order of the bases in each. Next, overlapping fragments are built up into a multiple alignment, a process known as sequence assembly, from which a consensus sequence for the clone, called a “contig”, is obtained. Contig assembly is one of the most difficult and critical functions in DNA sequence analysis. Full chromosomal sequences are then assembled from the overlapping sequences in a highly redundant set of fragments. In genomic studies, the earliest stages of genome analysis are performed by automatic methods. The genome sequences are then annotated. More detailed information is collected by laboratory experiments and by closer examination of the sequence data.
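The idea of building a contig from overlapping reads can be sketched with a greedy merge. This is a minimal illustration only, assuming error-free reads from one strand and no repeats; production assemblers use far more elaborate methods. The reads and function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))      # drop exact duplicates
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                            # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)

# Hypothetical reads, given in arbitrary order.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))
```

Each merge joins two sequences at their shared stretch, so the three reads collapse into a single contig. Reads that share no overlap of at least `min_len` bases remain as separate contigs.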

6.1.4.1 cDNA SEQUENCING
(1) mRNA is isolated from human tissues, and the exon-coding sequences are copied into cDNA. This allows rapid isolation and annotation of putative gene sequences.
(2) ESTs (expressed sequence tags) are grouped into consensus groups on the basis of sequence overlaps.

6.1.4.2 NANOPORE SEQUENCING
Nanopore (solid-state and membrane) probe technology enables sequencing of a single DNA molecule: probing of a single DNA molecule and direct conversion of sequence information


2-mercaptoethanol or dithiothreitol (Cleland's reagent) to reduce the S–S bonds to sulfhydryl (SH) groups. The resultant free SH groups are alkylated with iodoacetic acid to prevent reoxidation.

Table 6.1 Chemical Reagents for Cleavage of Polypeptides

Reagent | Specificity
Cyanogen bromide (CNBr) | Highly specific; carboxyl side of Met or Trp.
Hydroxylamine | Specific; Asn–Gly bonds.
2-Nitro-5-thiocyanobenzoate | Specific; cysteine residues.

Table 6.2 Endopeptidases for Cleavage of Polypeptides

Enzyme | Specificity
Pepsin | Non-specific; cleaves N-terminal to F, L, W, Y only when next to P.
Thermolysin | Small neutral residues; cleaves N-terminal to F, I, M, V, W, Y, but not if next to P.
Elastase | Cleaves C-terminal to A, G, S, V, but not if next to P.
Trypsin | Highly specific for positively charged residues; cleaves C-terminal to R, K, but not if next to P.
Chymotrypsin | Prefers bulky hydrophobic residues; cleaves C-terminal to F, Y, W, but not if next to P.
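The C-terminal cleavage rules for trypsin and chymotrypsin in Table 6.2 can be applied to a sequence in a few lines. The sketch below assumes complete digestion and an ideal enzyme; the protein sequence and the function name are hypothetical.

```python
def digest(sequence, cut_after):
    """Cleave C-terminal to any residue in cut_after, unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence[:-1]):
        if residue in cut_after and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])         # the C-terminal peptide
    return peptides

protein = "MAKWGRPLFKYAEQR"                   # hypothetical sequence
tryptic = digest(protein, "KR")               # trypsin rule
chymotryptic = digest(protein, "FYW")         # chymotrypsin rule
print(tryptic)
print(chymotryptic)
```

Note that the R–P bond in the example is left intact by the trypsin rule, in line with the "not if next to P" condition in the table.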

Table 6.3 Exopeptidases in C-terminal Cleavage of Polypeptides

Enzyme | Specificity
Carboxypeptidase-A | Cleaves all C-terminal residues, except R, K or P, or if a P residue is next to the terminal residue.
Carboxypeptidase-B | Cleaves when the C-terminal residue is K or R, and not when a P residue is next to the terminal residue.
Carboxypeptidase-C | Cleaves all free C-terminal residues.
Carboxypeptidase-Y | Cleaves all free C-terminal residues.

6.2.2 Sequence Analysis

For sequence analysis, the cleaved fragments are separated and purified by various fractionation methods (reversed-phase HPLC, TLC, electrophoresis; see Chapter 5). Amino acid sequencing is carried out by the Edman degradation method or, more efficiently, by tandem mass spectrometry (MS-MS). In the Edman method, a polypeptide is cleaved sequentially, one residue at a time, from the amino end. The Edman reagent, phenylisothiocyanate (PITC), reacts with the terminal amino group of the peptide to form a phenylthiocarbamoyl (PTC) derivative. Under mildly acidic conditions the derivative forms a PTH-amino acid, which leaves


an intact peptide shortened by one residue. The PTH-amino acid is the end product of one cycle of Edman degradation. The remaining polypeptide chain can then be subjected to further cycles of the Edman degradation. The released amino acid derivatives are purified and identified by chromatographic methods (TLC, HPLC). Unlike the Edman method, the MS-MS approach can be applied to a mixture of cleavage fragments (no chemical separation and identification is needed). The order of the cleaved and sequenced fragments is obtained by comparing fragment overlaps. Chymotrypsin cleaves preferentially on the carboxyl side of aromatic and other bulky non-polar residues. Because the chymotryptic peptides overlap two or more tryptic peptides, they can be used to establish the sequential order of the tryptic peptides.

The amino acid sequence of a protein can also be inferred from the base sequence of the corresponding nucleic acid. With the development of highly efficient gene sequencing methods, gene sequencing has become faster and easier than protein sequencing, and this alternative is followed wherever it is feasible. However, there are certain ambiguities and limitations in inferring an amino acid sequence from a gene sequence: (i) Degeneracy of codons (more than one codon coding for the same amino acid) leads to ambiguities. (ii) The genetic code is not universal. (iii) Deletion or insertion of nucleotide(s) can lead to an erroneous reading frame for the amino acids. (iv) Post-translationally modified proteins and disulfide-containing proteins can be characterized only by direct protein sequencing. In addition, amino acid sequence data must be compared with DNA sequence data to identify initiation, termination and intron regions, and DNA sequence data provide no information on post-translational modifications (methylation, glycosylation, phosphorylation) of proteins.
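The use of overlapping peptides to order fragments can be sketched as a small search. The example below assumes that every peptide has already been sequenced correctly and simply tries each ordering of the tryptic peptides, keeping those in which all chymotryptic peptides occur; the peptides and the function name are hypothetical, and the brute-force search is practical only for a handful of fragments.

```python
from itertools import permutations

def order_by_overlap(tryptic, chymotryptic):
    """Return every chain built from the tryptic peptides that contains
    all chymotryptic peptides as contiguous stretches."""
    consistent = []
    for order in permutations(tryptic):
        chain = "".join(order)
        if all(peptide in chain for peptide in chymotryptic):
            consistent.append(chain)
    return consistent

# Hypothetical peptides from two digests of the same protein, in arbitrary order.
tryptic = ["YAEQR", "MAK", "WGRPLFK"]
chymotryptic = ["KY", "GRPLF", "AEQR", "MAKW"]
print(order_by_overlap(tryptic, chymotryptic))
```

Here the chymotryptic peptide MAKW bridges the tryptic peptides MAK and WGRPLFK, and KY bridges WGRPLFK and YAEQR, so only one ordering survives.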

6.2.3 Mass Spectrometry (MS) in Sequence Analysis

Until recently, Edman degradation was the method of choice for the identification of proteins. The situation has changed drastically with the integration of advanced physical technologies, such as mass spectrometric methods, for protein identification and sequence analysis. Of particular relevance to the “Genomes-to-Life” program is the unique ability of MS to identify a protein unambiguously, establishing the amino acid sequence (the order in which these building blocks of proteins are arranged) and determining the presence of post-translational modifications that can affect the protein’s function. Mass spectrometers generally couple three devices: (i) an ionization device, (ii) a mass analyzer (a device that separates a mixture of ions by their mass-to-charge ratios), and (iii) a detector. Mass spectrometric methods require the transfer of the analyte into the gas phase. The use of mass spectrometry in biological studies became possible with the development of gentle ionization techniques: matrix-assisted laser desorption ionization (MALDI) and electrospray ionization (ESI). MALDI and ESI permit the efficient transfer of proteins and peptides into the gas phase and into the mass spectrometer. Generally, the MS analysis begins with individually excising the spot of interest from a 2-DE gel. The spot is rinsed and subjected to enzymatic


digestion (trypsin). The most commonly used mass analyzers for protein biochemistry applications are time-of-flight (TOF), triple-quadrupole, quadrupole-TOF, and ion-trap instruments.

6.2.3.1 MALDI-TOF-MS

The purified proteins, isolated as spots from 2-D gel electrophoresis, are individually excised, digested with a protease (e.g. trypsin) that cuts the protein at predictable positions, and spotted on a MALDI plate for co-crystallization with a standard matrix solution. The plate is inserted into the vacuum chamber of the MS apparatus. A selected spot is then illuminated with a focused, pulsed laser beam whose wavelength is tuned to the absorbance wavelength of the matrix. Positively charged peptides in the gas phase are attracted toward the orifice of the flight tube, which is kept at a negative bias (Fig. 6.6). All the peptides are subjected to the same electric field over the same plate-to-orifice distance (L), and reach the orifice with a velocity proportional to $\sqrt{z/m}$ (z = charge and m = mass of the ion). A time-of-flight mass spectrometer is an arrangement based on the fact that ions of different mass-to-charge ratio need different times to travel a given distance in a field-free region after they have all been given the same translational energy. The time of flight through the tube increases with mass, so lighter ions have a shorter time of flight than heavier ones. The ions are thus separated according to their m/z ratio in the time-of-flight (TOF) tube and reach the detector at different times. The instrument records the intensity observed at the detector as a function of the time of flight, which is correlated with the m/z ratio. Peaks in an MS spectrum correspond to the masses of the individual peptides analyzed (“mass fingerprinting”).
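The dependence of flight time on mass can be made concrete with a short calculation. The sketch assumes an idealized linear instrument: an ion of charge z is accelerated through a potential difference V, so that zeV = mv²/2, and then drifts at constant velocity through a field-free tube. The accelerating voltage and drift length used below are hypothetical values, not instrument specifications.

```python
import math

E_CHARGE = 1.602176634e-19     # elementary charge, C
DALTON = 1.66053906660e-27     # atomic mass unit, kg

def flight_time(mass_da, charge, accel_voltage, drift_length):
    """Time (s) to cross a field-free tube after acceleration through accel_voltage.

    From z*e*V = m*v**2 / 2 the velocity is v = sqrt(2*z*e*V / m),
    so the flight time is drift_length / v, proportional to sqrt(m/z).
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * E_CHARGE * accel_voltage / mass_kg)
    return drift_length / velocity

# Hypothetical settings: 20 kV acceleration, 1 m drift tube, singly charged ions.
for mass in (1000.0, 2000.0, 4000.0):
    t = flight_time(mass, 1, 20_000.0, 1.0)
    print(mass, "Da ->", round(t * 1e6, 2), "microseconds")
```

Because the flight time scales with the square root of m/z, quadrupling the mass doubles the flight time, which is how the arrival time at the detector is converted into a mass.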

Fig. 6.6 Schematic of MALDI-TOF-MS Analysis of Peptide Mass Fingerprinting

MALDI-TOF-MS, combined with protein database searches by peptide mass fingerprinting, is a valuable, sensitive and high-throughput screening approach in proteomics. By comparing the peptide masses observed in the MS spectrum of a protein digest with the peptide masses predicted from all sequences in the protein databases, it is possible in principle to identify the protein of interest. Surface-enhanced laser desorption ionization (SELDI)-TOF-MS is an affinity-based, high-throughput proteomic fingerprinting tool that is closely related to MALDI-TOF and can be used for protein purification, identification, and target discovery and validation. Differential protein expression measured by SELDI brings the research team one step closer to the ultimate drug target than gene expression does.
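A database search by peptide mass fingerprinting can be sketched as follows. This is a deliberately simplified illustration: the two-entry "database", the mass tolerance and the scoring by a plain count of matching peaks are all assumptions, and the observed masses are simulated from one of the entries rather than measured. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons; a peptide also carries one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence):
    """Cleave C-terminal to K or R, but not when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def count_matches(observed, sequence, tolerance=0.2):
    """Number of observed masses explained by the predicted tryptic peptides."""
    predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tolerance for t in predicted) for m in observed)

# Hypothetical database and a simulated spectrum with a small mass error.
database = {"protein_A": "MAKWGRPLFKYAEQR", "protein_B": "GASPVKTCLR"}
observed = [peptide_mass(p) + 0.05 for p in tryptic_peptides(database["protein_A"])]

scores = {name: count_matches(observed, seq) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The entry whose predicted peptide masses account for the most observed peaks is reported as the candidate identification; real search engines also weigh mass accuracy, missed cleavages and modifications.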


Biomolecular interaction analysis mass spectrometry is a two-dimensional, chip-based, analytical technique for rapid and sensitive analysis of biomolecules. It represents a synergy of two individual technologies–surface plasmon resonance (SPR) sensing and MALDI-TOF mass spectrometry.

6.2.3.2 ESI-MS/MS

As peptide mass fingerprinting by MALDI-TOF-MS does not always identify a protein unambiguously, there has been increased emphasis on tandem mass spectrometry (MS/MS) equipped with collision-induced dissociation. Tandem mass spectrometry is an arrangement in which ions are subjected to two or more sequential stages of analysis (which may be separated spatially or temporally) according to their mass-to-charge ratio. One of the most popular types of tandem MS instrument is the triple-quadrupole mass spectrometer. Electrospray ionization (ESI) allows the transfer of non-volatile analytes (such as proteins and nucleic acids) from the liquid phase to the gas phase at atmospheric pressure. The basis of MS/MS is collision-induced dissociation, and ESI-based MS has several advantages over MALDI-TOF: (i) it can easily be coupled to different sample separation and sample introduction techniques, and (ii) multiply charged analytes give MS/MS spectra of higher quality. Nanoelectrospray-MS/MS is a newer adaptation of ESI methodology in conjunction with MS/MS. Only a very small amount of the unseparated peptide mixture is sprayed directly into the mass spectrometer. Nanospray/HPLC has become the method of choice for introducing protein digests into ESI-MS/MS. Full automation is possible: protein digests are loaded on a reversed-phase HPLC column, and the analytes, separated according to their hydrophobicity, are analyzed by MS. The method can be used for de novo protein sequencing and for the study of post-translational modifications. Protein database search algorithms are used to reveal the identity of the protein: the peptide amino acid sequence information contained in the MS/MS spectra is compared with known sequences in protein and genome databases. A single confident match between a peptide MS/MS spectrum and a protein sequence entry in a database can be enough to identify a protein or a family of proteins.
De novo sequencing (determination of sequences of genes or proteins whose sequence is not yet known) can be pursued with LC/MS/MS or nanoelectrospray MS/MS methods. These physical techniques (e.g. SELDI-TOF) can also be used to find specific proteomic patterns that distinguish healthy from diseased patients. The ability of the pattern itself to serve as the diagnostic represents a new paradigm for the application of proteomics to clinical specimen analysis and disease diagnosis. When more than two stages are involved, the technique is called multi-dimensional MS. Multiple mass spectrometry (MS/MS/MS) provides even greater certainty of identification, and more characterization information, than electrospray ionization/tandem mass spectrometry. Photoionization Mass Spectrometry (PI-MS) is another emerging technology that could become an important tool for high-throughput pharmaceutical analysis. PI-MS meets the requirements of many applications in which ESI and atmospheric pressure chemical ionization (APCI) underperform. Multi-isotope Imaging Mass Spectrometry (MI-MS) is a cutting-edge technology that enables visual and quantitative assessment of intra- and trans-cellular metabolic pathways, signal transduction, virus penetration, and localization of drugs.

Remarkable advances are being made in the study of protein expression through sequence analysis and identification, but major hurdles remain. 2-DE methods are cumbersome, and all the steps in a protein expression study must become much more reproducible and affordable before they can significantly further our knowledge of protein expression. Another major challenge is to improve the quantification of proteins. It is not sufficient to find that a protein is expressed; one must also know how much is expressed in order to identify important patterns.

EXERCISE MODULES
1. What are the main steps in the primary structure determination of nucleic acids and proteins?
2. Explain the procedure followed in nucleic acid sequence determination.
3. What is the objective of gene amplification?
4. What is gene cloning, and what is its role in the sequence determination of nucleic acids?
5. What is PCR, and what is its role in molecular biology?
6. Which physicochemical methods are used in gene separation?
7. Which electrophoretic methods are used in gene separation?
8. Which are the standard methods used in gene sequencing?
9. What are the physicochemical methods of analysis applied in genome projects?
10. What is the importance of primary structure determination of proteins?
11. What are the step-by-step methods used in the primary structure determination of proteins?
12. Write on the Edman method in protein sequencing.
13. What are the practical strategies employed in the separation of small polypeptides, large proteins and disulfide-containing proteins?
14. Write on the use of mass spectrometry in protein sequencing.
15. What are the applications and advantages of MALDI-TOF and ESI-MS/MS?


7 Spatial Structure Determination of Biomolecules

The most challenging experimental task in the structure elucidation of macromolecules is the determination of the three-dimensional (spatial) structure. To date, only two physical techniques are capable of providing spatial (secondary and tertiary) structural information on macromolecules (e.g. nucleic acids, proteins, carbohydrates, lipids and their complexes): X-ray diffraction and Nuclear Magnetic Resonance (NMR) spectroscopy. In addition, scanning and imaging techniques, such as fluorescence imaging, microscopy and tomography, provide gel patterns and the morphological and functional features of cells and organelles.

7.1 X-RAY DIFFRACTION METHODS

X-ray crystallography is, to date, the most important quantitative method for elucidating the three-dimensional architecture of matter in the crystalline state at molecular and atomic resolution. It has therefore become an indispensable analytical tool in the biological sciences. There seems to be no limit to the size of a molecule, nor to its structural complexity, for attempting a structure determination. Indeed, a very substantial part of what is known of biomolecular structures is due to X-ray crystallography. The primary condition is that the matter should be in the single-crystalline state. The power and scope of X-ray diffraction have been enhanced manifold by advances in the technology of producing and detecting X-rays. The use of highly intense, highly collimated, continuous-wavelength sources (synchrotron radiation) has helped in (i) reducing the data collection time and (ii) resolving the “phase problem”. X-ray detectors with increased sensitivity, dynamic range and counting rate have enabled fast and accurate data collection. These developments have enhanced the power of X-ray crystallography in solving macromolecular structures of immense complexity and have ushered in the study of real-time structural dynamics of biological molecules (time-resolved X-ray crystallography) at the molecular level. X-rays are produced when high-energy electrons are decelerated by a target, giving rise to an X-ray continuum (Bremsstrahlung) and, depending on the applied voltage, characteristic line spectra. The wavelengths of the characteristic X-ray lines employed in X-ray diffraction studies (e.g. CuKα = 1.54 Å) are comparable to interatomic distances, which is why these lines are used in structure elucidation by X-ray diffraction.


7.1.1 Principles of X-ray Diffraction

The basic principles of X-ray diffraction are similar to those of image formation in light and electron microscopy. In light and electron microscopes the scattered rays are focused by lens systems. There is no proper lens system to focus X-rays (the refractive index for X-rays is very close to 1 in all media). What is recorded in X-ray diffraction methods are the scattered X-rays (diffraction patterns, recorded on film or electronically). The operational procedures for transforming the diffraction patterns into a reconstructed image are quite intricate. Matter in the solid state can be amorphous, semi-crystalline or single-crystalline. X-ray diffraction is a scattering phenomenon. Though the amount of diffracted energy is the same for a substance in the amorphous or the crystalline state, the information content obtainable is not the same in all states: it is least in the amorphous state and greatest in the single-crystalline state. A single crystal is a manifestation of the regular and periodic arrangement of atoms, molecules or clusters of molecules in all three dimensions (the crystal lattice). Because of the periodic arrangement of the diffracting elements, scattering is coherent and produces observable diffraction patterns. In addition, the periodic arrangement channels the diffracted rays into certain directions only, greatly enhancing the signal-to-noise ratio (information content). Scattering is random in the amorphous state and gives rise to diffuse scattering. Scattering by the semi-crystalline state gives rise to diffraction lines and arcs. Scattering by the single-crystalline state gives rise to discrete diffraction spots.

7.1.2 Determination of Unit Cell Morphology

Because the scattering of X-rays by a single molecule is too weak to be measured (X-rays have high penetrating power), the single-crystalline state of matter is a prerequisite for three-dimensional structure determination by X-ray diffraction methods.

7.1.2.1 Unit Cell and Space Group

In crystalline substances, the crystal lattices act as diffraction gratings for X-rays. A unit cell is a parallelepiped (an imaginary box) that contains one basic unit of the structure; the translational repeat of the unit cell in all three dimensions represents the crystal. Each reflection (spot) of a diffraction pattern can be identified by a triple of indices, hkl, called Miller indices. The indices h, k and l give the number of parts into which the a, b and c axes of the unit cell, respectively, are divided by the corresponding set of lattice planes. The underlying principle of X-ray diffraction can be understood from the intuitive interpretation offered by Lawrence Bragg: a beam of X-rays incident upon a stack of parallel, equally spaced lattice planes appears to be reflected by these planes (Fig. 7.1). According to Bragg’s law (Bragg’s equation), the condition for constructive interference is

$$2d \sin\theta = n\lambda \tag{7.1}$$

(where d = interplanar spacing; θ = angle of incidence of the X-rays; λ = wavelength of the X-rays; n = integer 1, 2, 3,…). Bragg’s equation is used to determine the unit cell parameters (cell lengths and angles) from the spatial separation of the diffraction spots.
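As a worked illustration of Bragg’s equation (7.1), the short sketch below converts between the angle of incidence and the interplanar spacing for CuKα radiation (λ = 1.54 Å). The function names and the example angle are hypothetical.

```python
import math

CU_K_ALPHA = 1.54   # wavelength in angstroms, as quoted for CuK-alpha

def bragg_spacing(theta_deg, wavelength=CU_K_ALPHA, order=1):
    """Interplanar spacing d (angstroms) from 2 d sin(theta) = n lambda."""
    return order * wavelength / (2 * math.sin(math.radians(theta_deg)))

def bragg_angle(d, wavelength=CU_K_ALPHA, order=1):
    """Angle of incidence theta (degrees) at which planes of spacing d reflect."""
    s = order * wavelength / (2 * d)
    if s > 1:
        raise ValueError("no reflection: n * lambda exceeds 2d")
    return math.degrees(math.asin(s))

d = bragg_spacing(30.0)        # planes that reflect at theta = 30 degrees
print(round(d, 3))             # spacing in angstroms
print(round(bragg_angle(d), 3))
```

Since sin θ cannot exceed 1, planes closer together than λ/2 give no reflection, which limits the resolution attainable with a given wavelength.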


Fig. 7.1 Bragg’s Interpretation of X-ray Scattering (θ = angle of incidence; d = interplanar spacing)

7.1.2.2 Unit Cell Content (Z)

The unit cell content, Z, which is the number of molecules in the cell, can be determined from the unit cell parameters:

$$Z = \frac{V \times d_o \times N_A}{F} \tag{7.2}$$

(V = volume of the unit cell; $d_o$ = measured density of the crystal; F = formula mass; $N_A$ = Avogadro's number).
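A short numerical check of the unit cell content may help. The sketch below uses the physical relation Z = V × d_o × N_A / F, where N_A is Avogadro's number (written out here as an assumption about units: V in Å³, d_o in g/cm³, F in g/mol), together with the general formula for the volume of a parallelepiped. The example is rock salt (NaCl), chosen only because its cell edge, density and formula mass are well known; the function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23   # per mole

def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Volume (cubic angstroms) of a unit cell with edges a, b, c and angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

def unit_cell_content(volume_a3, density_g_cm3, formula_mass):
    """Z = V * d_o * N_A / F, with V converted from cubic angstroms to cm^3."""
    volume_cm3 = volume_a3 * 1e-24
    return volume_cm3 * density_g_cm3 * AVOGADRO / formula_mass

# NaCl: cubic cell, a = 5.64 angstroms, density 2.165 g/cm^3, formula mass 58.44.
v = cell_volume(5.64, 5.64, 5.64)
z = unit_cell_content(v, 2.165, 58.44)
print(round(v, 1), round(z, 2))
```

The result is close to 4, the number of NaCl formula units in the cubic cell. For a macromolecular crystal the measured density also includes solvent, so the calculation needs a corresponding correction.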

7.1.3 Structure Determination

Bragg’s equation yields only the unit cell parameters and geometry (external morphology), from the spatial separations of the diffraction spots (recorded photographically, with counters or by other methods). It gives no information about the structural architecture, namely the positional coordinates of the atoms and molecule(s) in the unit cell. Structure determination requires a deep understanding of wave properties, as X-ray diffraction is a wave phenomenon. A wave is characterized by two parameters, (i) amplitude and (ii) phase, and both quantities must be available for each reflection, hkl (diffraction spot), for the image reconstruction process.

7.1.3.1 The “Phase Problem”

All X-ray diffraction techniques record only the intensities of the reflections (hkl), that is, the squares of their amplitudes, but not their phases. The amplitude of each diffraction spot, hkl, can therefore be obtained from its measured intensity, but the diffraction data provide no phase information, and resolving the “phase problem” is central and crucial in X-ray crystallography. As the phases of the diffracted spots cannot be obtained experimentally, mathematical procedures are invoked to resolve the “phase problem” in the image reconstruction methods: a diffraction pattern (DP) is transformed into an image of the object by Fourier transformation (FT) methods.

$$\text{FT of the DP} \Leftrightarrow \text{Structure} \tag{7.3}$$


The phase difference, α, between a ray reflected by an atom at the origin of the unit cell and a ray reflected by the jth atom at position ($x_j$, $y_j$, $z_j$) is given by

$$\alpha = 2\pi(hx_j + ky_j + lz_j) \tag{7.4}$$

If the scattering power of the jth atom is $f_j$, then the sum of the contributions from all N atoms in the unit cell to the Bragg reflection, hkl, is given by the structure factor $F_{hkl}$:

$$F_{hkl} = \sum_{j=1}^{N} f_j \exp\left[2\pi i\,(hx_j + ky_j + lz_j)\right] \tag{7.5}$$

Fhkl is a complex quantity representing both the amplitude and phase of individual reflections, hkl (Fig. 7.2).

$$F_{hkl} = |F_{hkl}|\,\exp(i\alpha_{hkl}) \qquad (7.6)$$

(|Fhkl| = amplitude; αhkl = phase of the reflection hkl).
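The structure factor sum of equation (7.5) can be evaluated directly for a small example. The sketch below is illustrative only: the two-atom cell, its scattering powers and the function name `structure_factor` are assumptions made for the example, and the scattering powers are treated as constants.

```python
import cmath
import math


def structure_factor(atoms, h, k, l):
    """Sum f_j * exp[2*pi*i*(h*x_j + k*y_j + l*z_j)] over the atoms (equation 7.5)."""
    total = 0j
    for f_j, x, y, z in atoms:
        alpha = 2.0 * math.pi * (h * x + k * y + l * z)  # phase difference, equation 7.4
        total += f_j * cmath.exp(1j * alpha)
    return total


# Hypothetical unit cell: (scattering power, x, y, z) with fractional coordinates.
toy_cell = [(6.0, 0.0, 0.0, 0.0), (8.0, 0.25, 0.25, 0.25)]

F_111 = structure_factor(toy_cell, 1, 1, 1)
amplitude = abs(F_111)                         # |F_hkl| in equation 7.6
phase_deg = math.degrees(cmath.phase(F_111))   # alpha_hkl in equation 7.6
intensity = amplitude ** 2                     # the quantity an experiment records

print(f"|F| = {amplitude:.3f}, phase = {phase_deg:.2f} deg, I = {intensity:.1f}")
```

Only `intensity` is available from a diffraction experiment; `phase_deg` is lost, which is the phase problem in miniature.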


Fig. 7.2 Representation of the Structure Factor, Fhkl

X-rays are scattered by electrons, and the representation of the structure is in essence the electron density distribution of the molecule(s) in the unit cell. Since the electron density in a crystal varies continuously and periodically, the electron density, ρxyz, for a unit cell of volume, V, is

$$\rho_{xyz} = \frac{1}{V}\sum_{h}\sum_{k}\sum_{l} |F_{hkl}|\,\exp(i\alpha_{hkl}) \times \exp\left[-2\pi i\,(hx + ky + lz)\right] \qquad (7.7)$$

Accordingly, the electron density at any point in the unit cell can be computed from a Fourier series whose coefficients are the structure factors of the reflections. Both the amplitude |Fhkl| and the phase αhkl of each reflection (hkl) are needed to evaluate equation (7.7). The major effort in X-ray crystallography is to obtain the phase information and to carry out structure determination by iterative methods. Patterson methods, direct methods, molecular replacement, and anomalous dispersion are some of the standard methods for resolving the phase problem.

7.1.3.1.1 Patterson Method: In the Patterson method (also known as the heavy-atom method), the intensities, Ihkl, of the reflections (hkl) are used as coefficients to compute Patterson syntheses, with the phase of every reflection set to zero, so that no phase information is needed. The resulting map shows interatomic vectors rather than atomic positions. The method is useful for structures that contain a few heavy atoms, and in macromolecular crystallography it is applied in the form of isomorphous replacement.

7.1.3.1.2 Direct Methods: In these methods, starting phase sets are computed mathematically (statistically) from the intensity data by establishing internal phase relationships among the reflections. These methods have become standard in the structure elucidation of small molecules (structures with ~ 50 non-hydrogen atoms).

7.1.3.1.3 Molecular Replacement Method: This method is often employed in macromolecular crystallography. A molecule (or part of it) whose structure is known is rotated and translated in the unit cell to obtain the spatial position (initial coordinates) of the molecule whose structure is to be determined.

7.1.3.1.4 Anomalous Dispersion Method: Though the feasibility of the anomalous dispersion method for structure elucidation of macromolecules was proposed by Ramachandran and his group (the Madras group), its full potential in resolving the phase problem has been realized only with recent developments in macromolecular crystallography. From the wide-band wavelength spectrum obtainable from a synchrotron source, any desired wavelength(s) can be selected to suit the absorption edge(s) of the absorber(s), enhancing the effectiveness of the method. This is the basis of the multiwavelength anomalous dispersion (MAD) procedure for resolving the phase problem in macromolecular crystallography.
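Why the phases matter in the Fourier synthesis of equation (7.7) can be seen in a one-dimensional toy calculation. The sketch below is a simplified illustration, not a crystallographic program: the two-atom one-dimensional "cell" (V = 1), the range of h and all names are assumptions. It builds one map from the full structure factors and a second map from the amplitudes alone with every phase set to zero.

```python
import cmath
import math

# Hypothetical 1-D cell: (scattering power, fractional coordinate x).
atoms = [(8.0, 0.20), (3.0, 0.65)]
h_values = range(-10, 11)

# 1-D analogue of equation 7.5: F_h = sum_j f_j exp(2*pi*i*h*x_j).
F = {h: sum(f * cmath.exp(2j * math.pi * h * x) for f, x in atoms) for h in h_values}


def density(x, coefficients):
    """1-D analogue of equation 7.7 with V = 1: rho(x) = sum_h F_h exp(-2*pi*i*h*x)."""
    return sum(c * cmath.exp(-2j * math.pi * h * x) for h, c in coefficients.items()).real


grid = [i / 100 for i in range(100)]

# Map computed with both amplitudes and phases.
true_map = [density(x, F) for x in grid]

# Map computed with amplitudes only (all phases set to zero).
amplitudes_only = {h: abs(c) for h, c in F.items()}
zero_phase_map = [density(x, amplitudes_only) for x in grid]

peak_true = grid[true_map.index(max(true_map))]
peak_zero = grid[zero_phase_map.index(max(zero_phase_map))]
print("strongest peak with phases:   x =", peak_true)
print("strongest peak without phases: x =", peak_zero)
```

With the correct phases the strongest peak falls on the heavier atom; with the phases discarded the strongest peak moves to the origin, so the atomic positions can no longer be read off the map.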

7.1.4 Structure Refinement

With a set of calculated phases, αc, electron density maps are computed with the observed amplitudes, Fo, and refined (by least-squares or other procedures) to obtain a better fit between the observed (Fo) and calculated (Fc) structure factors. The procedure is iterative. The goodness of fit of the refinement is monitored by several standard statistical measures, one of them being the reliability index, R. A low R value indicates a well-refined structure.

$$R = \frac{\sum \big|\,|F_o| - |F_c|\,\big|}{\sum |F_o|} \qquad (7.8)$$
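The reliability index of equation (7.8) is simple to compute once observed and calculated amplitudes are available. The following minimal sketch uses invented amplitude values and a hypothetical function name `r_factor`; a real refinement would sum over thousands of reflections.

```python
def r_factor(f_obs, f_calc):
    """Reliability index R = sum(| |Fo| - |Fc| |) / sum(|Fo|), as in equation 7.8."""
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Invented amplitudes for four reflections (arbitrary units).
f_observed = [10.0, 25.0, 7.5, 40.0]
f_calculated = [9.0, 27.0, 7.0, 41.5]

R = r_factor(f_observed, f_calculated)
print(f"R = {R:.3f}")
```

As the model improves from one refinement cycle to the next, the calculated amplitudes approach the observed ones and R decreases.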

The flowchart (Fig. 7.3) highlights the stepwise procedures followed in single-crystal X-ray crystallography.

7.1.5 Fibrous Macromolecules

A major class of biomolecules (e.g. DNA, RNA, collagen and keratin-group proteins) exists in the fibrous (not single-crystalline) state. As fibers exhibit periodicity only along the fiber axis (z-axis), the information content of fiber diffraction data is meager. The entire intensity data (comprising layer lines of diffuse spots and streaks) can be recorded on a single photograph, and the data are insufficient to solve the structure ab initio. Many fibrous molecules are helical, with general structural characteristics: radius, rise per residue and pitch. The repeating units (averaged out) along the fiber axis are nucleic acid bases in nucleic acids, amino acid side-chains in proteins and protein subunits in viruses. The theory of helical fiber diffraction, represented in cylindrical coordinates, leads to a Fourier-Bessel transformation relating the diffraction pattern to the structure.


(iii) Intensity maxima are confined to layer lines only (no discrete reflections). (iv) Due to mirror and cylindrical symmetries, the diffraction photograph shows characteristic X pattern (e.g. DNA diffraction photograph). The stepwise procedures followed in X-ray structure determination of fibrous molecules are as in the flowchart (Fig. 7.4).

Fig. 7.4 Flowchart for X-ray Structure Determination of Fibrous Molecules

7.2 NUCLEAR MAGNETIC RESONANCE (NMR) SPECTROSCOPY

Atomic nuclei with odd mass number have half-integral spin (I = 1/2, 3/2, 5/2, ...) and give rise to nuclear magnetic resonance (NMR). When subjected to an external magnetic field, NMR-sensitive atomic nuclei, such as 1H, 13C, 15N, and 31P, behave like tiny bar magnets, precessing around the axis (z-axis) of the magnetic field with a frequency, ν, called the Larmor frequency:

$$\nu = \frac{\gamma B_0}{2\pi} \qquad (7.11)$$

NMR transitions between adjacent nuclear spin states are induced by the application of an oscillating magnetic field in the radio frequency region (r.f. field), perpendicular to the z-axis (in the azimuthal x-y plane), and resonance absorption occurs when

$$\Delta E = \frac{\gamma h B_0}{2\pi} \qquad (7.12)$$

(B0 = applied magnetic field; γ = nuclear gyromagnetic ratio; ∆E = energy difference; h = Planck constant). The strength of an NMR signal depends not only on the spin of the isotope but also on its abundance. The 1H isotope, with 99.98% natural abundance, is the most sensitive nucleus, which makes 1H-NMR spectroscopy the most sensitive NMR technique.
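As a numerical illustration of the Larmor relation ν = γB0/2π and of the resonance condition ∆E = hν, the sketch below evaluates both for 1H. The field strength of 14.1 T is an assumed example value, and the function names are hypothetical; the gyromagnetic ratio and Planck constant are the CODATA values.

```python
import math

GAMMA_1H = 2.6752218744e8   # proton gyromagnetic ratio, rad s^-1 T^-1 (CODATA)
PLANCK_H = 6.62607015e-34   # Planck constant, J s


def larmor_frequency_hz(gamma, b0_tesla):
    """Precession frequency nu = gamma * B0 / (2*pi)."""
    return gamma * b0_tesla / (2.0 * math.pi)


def transition_energy_joule(gamma, b0_tesla):
    """Energy gap between adjacent spin states, delta E = h * nu = gamma * h * B0 / (2*pi)."""
    return PLANCK_H * larmor_frequency_hz(gamma, b0_tesla)


b0 = 14.1  # assumed magnet strength in tesla
nu_1h = larmor_frequency_hz(GAMMA_1H, b0)
delta_e = transition_energy_joule(GAMMA_1H, b0)

print(f"1H Larmor frequency at {b0} T: {nu_1h / 1e6:.1f} MHz")
print(f"Energy gap: {delta_e:.3e} J")
```

A 14.1 T magnet therefore corresponds to a 1H frequency of about 600 MHz, and the energy gap is very small, which is why the transitions lie in the radio frequency region.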


7.2.1 The “Chemical Shift”

Atomic nuclei are shielded from the influence of the external magnetic field by the surrounding electrons, and the resulting displacement of the resonance frequency is called the “chemical shift”. In molecules, the nuclei of an element exist in different environments and thus give rise to different chemical shifts. For example, nuclei of one kind (say 1H) in different chemical environments (in CH, CH2, CH3 groups, etc.) in a molecule are distinguished by their chemical shifts. Chemical shift is the basis of NMR spectroscopy and is the most important parameter in NMR structural studies. The chemical shift, δ, is quoted in parts per million with respect to a standard reference frequency:

$$\delta = \frac{\nu_s - \nu_r}{\nu_r} \times 10^{6} \approx (\sigma_r - \sigma_s) \times 10^{6} \qquad (7.13)$$

(νs, νr and σs, σr are the resonance frequencies and shielding constants of the sample and reference compounds, respectively).
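The chemical shift is the frequency offset of the sample from the reference, divided by the reference frequency and scaled by 10^6. A minimal sketch of that arithmetic follows; the frequencies are invented for illustration and the function name is hypothetical.

```python
def chemical_shift_ppm(nu_sample_hz, nu_reference_hz):
    """delta = (nu_s - nu_r) / nu_r * 1e6, in parts per million."""
    return (nu_sample_hz - nu_reference_hz) / nu_reference_hz * 1e6


# Assumed example: a reference resonating at exactly 600 MHz and a
# proton whose resonance lies 4380 Hz above it.
nu_reference = 600.0e6
nu_sample = nu_reference + 4380.0

delta = chemical_shift_ppm(nu_sample, nu_reference)
print(f"delta = {delta:.2f} ppm")
```

Because δ is a ratio, the same nucleus gives the same value in ppm on instruments of different field strength, even though the offset in Hz changes.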

7.2.2 NMR Spectra

Multidimensional NMR spectroscopy has emerged as a complementary technique to X-ray crystallography in the three-dimensional structure determination of biomolecules. NMR analysis of a stable, pure protein (< 20,000 Daltons) in solution is possible, but the results are more ambiguous than those from X-ray crystallography. While X-ray based studies provide a direct map of the electron densities in different parts of a molecule in a crystal, in NMR this information is obtained indirectly, through measurement of “chemical shifts” and coupling constants. An NMR spectrum is a plot of resonance frequencies and their intensities.

Many of the effects that occur in NMR experiments are determined by the behavior of the magnetization vector, M, which represents the net magnetization. The magnetization vector, M, has two components: the longitudinal component, Mz, and the azimuthal (transverse) component, Mxy. At thermal equilibrium, Mz = M0 and Mxy = 0 (Fig. 7.5). Applying an rf pulse (in the x-y plane) perturbs the equilibrium state of magnetization; for a suitable pulse duration, Mz = 0 and Mxy ≠ 0. Once the rf field is switched off, M returns to its thermal equilibrium state, and the process is called relaxation (spin-lattice and spin-spin). By choosing the time duration, t, of the pulse, it is possible to turn the magnetization vector M in any desired direction. This is the underlying principle of all NMR spectroscopic experiments.

Fig. 7.5 Magnetization Vector, M, at Thermal Equilibrium

7.2.2.1 Free Induction Decay (FID)

The sensitivity of NMR spectra can be increased dramatically if short, intense radio frequency pulses replace the slow frequency sweep. The pulses cause a signal to be emitted by the nuclei, and this signal is measured as a function of time after the pulse. The decay profile of the pulsed NMR signal (signal amplitude as a function of time) is called the free induction decay (FID). FIDs can be converted by Fourier transformation into a frequency-domain spectrum, which gives amplitudes as a function of frequency. This is the basis of modern NMR spectroscopy, called FT-NMR.
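The conversion of an FID into a spectrum can be sketched with a synthetic signal. The example below is an illustration under stated assumptions: the FID is a single decaying complex oscillation at an invented offset frequency, the sampling parameters are arbitrary, and a plain discrete Fourier transform written out in full stands in for the fast Fourier transform used by real spectrometer software.

```python
import cmath
import math


def dft(signal):
    """Plain discrete Fourier transform: time-domain samples to frequency-domain bins."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


n_points = 256        # number of sampled points (assumed)
dwell_time = 0.001    # seconds between samples, giving a 1000 Hz spectral width
offset_hz = 125.0     # assumed resonance offset from the carrier frequency
t2 = 0.05             # assumed decay time constant in seconds

# Synthetic FID: an oscillation at offset_hz that decays exponentially.
fid = [
    cmath.exp(2j * math.pi * offset_hz * i * dwell_time) * math.exp(-i * dwell_time / t2)
    for i in range(n_points)
]

spectrum = dft(fid)
magnitudes = [abs(value) for value in spectrum]
peak_bin = magnitudes.index(max(magnitudes))
peak_hz = peak_bin / (n_points * dwell_time)   # convert bin index to frequency

print(f"Peak found at {peak_hz:.1f} Hz")
```

The time-domain decay becomes a single peak at the oscillation frequency; a faster decay (shorter `t2`) would broaden that peak.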

7.2.3 NMR Experiments

The spectra of complex biomolecules contain a large number of peaks, many of which are close together or overlap. Higher-field magnets, i.e. higher-frequency (e.g. 600 MHz) instruments, offer better peak resolution, enabling analysis of larger and larger molecules. Also, in NMR, sensitivity increases almost with the square of the magnetic field, so when the magnetic field strength is doubled, sensitivity increases about fourfold. Data can thus be acquired faster, or alternatively, samples can be run at lower concentrations in the same experimental time, which is particularly important for the study of large biomolecules.

NMR spectra contain information about the structure of molecules through the chemical shift, which is sensitive to the local electronic environment; through spin-spin coupling, which is transmitted through chemical bonds; and through the nuclear Overhauser effect (NOE), which arises from the interaction between the dipole moments of two nuclei in spatial proximity and thus provides information about the distance between nuclei and is sensitive to the positions of neighboring spins. There is a variety of NMR experiments (1D-, 2D- and 3D-NMR) for elucidating the three-dimensional structures of macromolecules (< 20,000 Daltons) in solution (Fig. 7.6).

Fig. 7.6 Schematic of Processes in NMR Experiments

In 1D-NMR, the only variable is the acquisition time, and the spectrum is a plot of amplitude as a function of frequency, ν. Multi-dimensional NMR experiments substantially reduce the overlap problems encountered in 1D-NMR spectra by effectively spreading the signals over several dimensions. To extend the analysis to multi-dimensional NMR, additional time-periods are introduced. Thus, multi-dimensional NMR experiments are composed of four sections: preparation, evolution, mixing, and detection (Fig. 7.6). The time-domain signals (FIDs) are Fourier transformed into frequency-domain spectra, S(ω1, ω2), and are plotted as contour maps.

In 2D-NMR, a second time-period is added. The evolution time, t1, is increased in equal increments, and for each t1 a complete signal is collected as a function of the detection time, t2 (t1 < t2). That is, the process is repeated for several values of t1, and a set of FIDs is collected instead of the single FID of 1D-NMR. The 2D-NMR spectrum is thus a plot of amplitude as a function of two frequencies, ν1 and ν2. In a 2D-NMR spectrum, the diagonal peaks correspond to the normal 1D-NMR spectrum. The off-diagonal peaks result from interactions between hydrogen atoms (protons) that are coupled to each other, through bonds or through space. In a 3D-NMR spectrum there are correlations of three different frequencies generated through the two different mixing times of the experiment.
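The step from a set of FIDs, one per t1 increment, to a two-frequency spectrum S(ω1, ω2) can be illustrated with a synthetic data set. This is a sketch under simplifying assumptions: the signal is a single invented component that evolves at one frequency during t1 and another during t2 (which would appear as an off-diagonal peak), the grid is only 32 by 32 points, and the names `dft` and `dft2` are hypothetical helpers.

```python
import cmath
import math


def dft(signal):
    """Plain discrete Fourier transform of a 1-D list of samples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft2(matrix):
    """2-D transform: first along t2 (each row), then along t1 (each column)."""
    rows = [dft(row) for row in matrix]
    columns = [dft(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]   # index order [w1][w2]


n = 32                 # points in each time dimension (assumed)
bin_w1, bin_w2 = 5, 11 # assumed frequencies, expressed as frequency-bin indices

# data[i1][i2]: the FID sampled at detection point i2, recorded for evolution increment i1.
data = [
    [
        cmath.exp(2j * math.pi * (bin_w1 * i1 + bin_w2 * i2) / n) * math.exp(-(i1 + i2) / 20.0)
        for i2 in range(n)
    ]
    for i1 in range(n)
]

spectrum = dft2(data)
_, peak_w1, peak_w2 = max(
    (abs(spectrum[a][b]), a, b) for a in range(n) for b in range(n)
)
print("Peak at (w1, w2) bins:", (peak_w1, peak_w2))
```

Since the two frequencies differ, the peak lies off the diagonal of the contour map; a component evolving at the same frequency in both periods would lie on the diagonal.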

7.2.3.1 Mixing Mechanisms

The variety of 2D- and 3D-NMR experiments arises from the different mixing mechanisms. Two fundamentally different mixing mechanisms are employed in NMR experiments: COSY and NOESY (Fig. 7.7).

Fig. 7.7 2D-COSY and NOESY Pulse Sequences

Correlated spectroscopy (COSY) relies on coherent transfer of information (spin-spin coupling) through chemical bonds. A COSY spectrum gives peaks between hydrogen atoms (protons) that are covalently connected. A very useful modification of COSY is to select only coherence that was transferred at some point in the experiment. This is achieved by applying a ‘double quantum filter’, giving double-quantum-filtered COSY (DQF-COSY).


In the solution phase, because of rapid Brownian motion, the dipolar interaction averages to zero on the NMR time scale. Thus there is no splitting of lines due to this interaction. However, the dipole-dipole interaction does contribute to the relaxation properties of the spin systems and affects the line widths and the polarization transfer from one spin system to another when a spin system is perturbed. This is the genesis of the nuclear Overhauser effect (NOE), which describes the change in intensity of one resonance line when another line in the spectrum is perturbed by rf radiation. A NOESY spectrum gives peaks between hydrogen atoms (protons) that are close together in space. The magnitude of the NOE is proportional to d^-6, where d is the internuclear distance, so for an NOE to occur the nuclei must be close to one another in space (< 5.0 Å apart). That is, nuclear Overhauser effect spectroscopy (NOESY) is a through-space, relaxation-mediated information transfer process, which provides internuclear distance information (< 5.0 Å).
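Because the NOE falls off as the inverse sixth power of the internuclear distance, an unknown distance can be estimated by comparing a cross-peak intensity with that of a proton pair whose separation is known. The sketch below shows this calibration; the reference distance, the intensities and the function name are assumptions chosen for illustration, and effects such as spin diffusion are ignored.

```python
def noe_distance(intensity, reference_intensity, reference_distance):
    """Estimate a distance from I proportional to d**-6:
    d = d_ref * (I_ref / I) ** (1/6)."""
    return reference_distance * (reference_intensity / intensity) ** (1.0 / 6.0)


# Assumed calibration pair: a fixed proton-proton distance of 1.8 angstrom
# giving a cross-peak of intensity 100 (arbitrary units).
reference_distance = 1.8
reference_intensity = 100.0

# A cross-peak 64 times weaker than the reference.
weak_peak = reference_intensity / 64.0
estimated_distance = noe_distance(weak_peak, reference_intensity, reference_distance)
print(f"Estimated distance: {estimated_distance:.2f} angstrom")
```

A 64-fold drop in intensity corresponds to only a doubling of the distance, which is why NOE cross-peaks become undetectable beyond about 5 Å.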

7.2.4 Structure Determination

While X-ray based structure elucidation procedures provide a direct map of the electron densities in different parts of a molecule in the crystalline state, in structure elucidation by NMR this information is obtained indirectly, through measurements of chemical shifts and coupling constants. The electron densities depend on the nature of the chemical bonds (single, double or triple; polar or non-polar). In diamagnetic systems (isolated molecules with no unpaired electrons), the chemical shift of a nucleus reflects the electron density distribution around the nucleus. Nuclei separated by a double or triple bond have greater s-electron densities and consequently larger coupling constants than nuclei separated by a single bond. Similarly, a proton in the vicinity of an electron-withdrawing group has lower electron density around it and is consequently less shielded. Structure determination rests on the accurate assignment of resonances and on fitting the model structure to the experimental data from the contour maps by various algorithms. NMR spectroscopy can generate a variety of distance constraints, which can be used to compute the three-dimensional structures of biomolecules.

The general strategy for three-dimensional structure determination by multi-dimensional NMR consists of basically two steps: (i) sequence-specific resonance assignments, and (ii) quantification of the interactions between the assigned nuclei. Multi-dimensional NMR spectra contain information about the interactions of protons (H) that are covalently linked (COSY spectra, in which the off-diagonal peaks of a 2D spectrum provide information on the correlations between protons in different parts of the molecule), or that are not covalently linked but are close in space (NOESY spectra, which provide internuclear distance information). The process of associating specific spins in the molecule with specific resonances is called the sequence-specific assignment of resonances. This is carried out with a combination of COSY, DQF-COSY, and NOESY spectral data. The first stage of the analysis is the identification of spin systems by COSY. NOESY is used to identify intra-residue, neighboring-residue and distant (in sequence) contacts. COSY and NOESY data are then used to construct and refine model structures by different algorithms (distance geometry, energy minimization, etc.).

For biological macromolecules, 1H-NMR is the method of choice, because
1. Protons (H) are present at many sites in biological molecules.
2. 1H has high natural abundance at each site.
3. The 1H nucleus is the most sensitive to detect.


In nucleic acids, the protons that belong to an individual base or sugar are identified with COSY. The association of specific bases with sugars is made using the NOESY spectrum. In proteins, the primary information for recognizing the type of amino acid associated with each resonance comes from couplings established in DQF-COSY spectra, and sequentially neighboring amino acid 1H spin systems are identified from the sequential NOE connectivities. The chemical shifts of NH and CH protons can be indicative of regular secondary structure. In general, the proton resonances from the backbone (main-chain directed assignments) and from the different types of protons on side-chains are grouped, based on chemical shifts, and analyzed. NOESY is used for sequence-specific assignments in helix, sheet and turn regions.

Much of protein NMR spectroscopy relies on spectral correlation techniques (Heteronuclear Multiple Quantum Correlation, HMQC) using 13C or 15N nuclei (heteronuclei). Spectral editing allows a subset of an entire spectrum to be observed. Normally, one observes a subset of the 1H spectrum that has been selected according to the nucleus to which the protons are attached. The same techniques allow the measurement of heteronuclear correlations (which, for example, show which 1H are attached to which 13C or 15N). Isotope substitution and hetero-nuclear (13C-, 15N- and 31P-NMR) methods can be used to simplify 1H-NMR spectra and also to obtain more information for sequence assignments. 13C or 15N can be used to select only the protons attached to them. 31P-NMR is also used in nucleic acid studies.

The flowchart (Fig. 7.8) gives the general steps in structure determination by NMR spectroscopy. NMR spectroscopy can provide structural details of the averaged-out conformations of molecules (< 20,000 Daltons) in solution (Fig. 7.9a). The results are more ambiguous, in contrast with the static and clearer picture of structures available from X-ray crystallography (Fig. 7.9b).
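The refinement of model structures against NOESY-derived distance constraints, mentioned in this section, depends on checking how well a trial model satisfies each constraint. The sketch below shows only that checking step, not distance geometry or energy minimization themselves; the atom labels, coordinates and upper bounds are invented, and `restraint_violations` is a hypothetical helper.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z) in angstrom."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def restraint_violations(coordinates, restraints):
    """Return (atom_i, atom_j, excess) for every pair farther apart than its upper bound."""
    violations = []
    for atom_i, atom_j, upper_bound in restraints:
        d = distance(coordinates[atom_i], coordinates[atom_j])
        if d > upper_bound:
            violations.append((atom_i, atom_j, d - upper_bound))
    return violations


# Invented trial model: proton label -> coordinates in angstrom.
model = {
    "res3_HA": (0.0, 0.0, 0.0),
    "res4_HN": (3.0, 0.0, 0.0),
    "res10_HB": (0.0, 4.0, 3.0),
}

# Invented NOE-derived upper bounds in angstrom.
noe_upper_bounds = [
    ("res3_HA", "res4_HN", 3.5),
    ("res3_HA", "res10_HB", 4.0),
]

violations = restraint_violations(model, noe_upper_bounds)
for atom_i, atom_j, excess in violations:
    print(f"{atom_i} to {atom_j}: exceeds bound by {excess:.2f} angstrom")
```

A refinement algorithm would adjust the coordinates to shrink such violations; because many different conformations can satisfy the same set of bounds, NMR yields an ensemble of averaged structures rather than a single static picture.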

Fig. 7.8 Flowchart of Steps in Structure Determination by NMR Spectroscopy


Fig. 7.9(a) Averaged Structure of Fibronectin by NMR Spectroscopy (Ref: Bocquier, A.A. (1999), Structure, 7; 1451) (Source: Protein Data Bank: 1QO6.pdb)

Fig. 7.9(b) Structure of Fibronectin by X-ray Crystallography (Ref: Leahy, D.J., Aukhil, I. & Erickson, H.P. (1996), Cell, 84; 155) (Source: Protein Data Bank: 1FNF.pdb)

Though NMR generally gives a lower-resolution structure than X-ray crystallography does, it does not require crystallization. In addition, NMR spectra can be used to establish quantitative spectral data-activity relationships (QSDAR) between molecules and their biological binding activity. All QSDAR models yield a relationship that may be used to predict the binding activity of a molecule from its experimental or simulated spectral data alone. Binding characteristics help determine how well a drug works: how effective and selective it is, and whether it can be administered in reasonable quantities. NMR studies can also be used for structure-based, site-specific screening. In this approach, two amino acid types are labeled with 13C and 15N. If this pair of amino acids occurs only once in the sequence, there will be only one peak in a one-dimensional/two-dimensional HNCO-type NMR spectrum. This technique allows screening for only those binders that interact with a specific site of the receptor.

7.3 IMAGING METHODS

Understanding a complex living system will require a thorough comprehension of the interactions of cells and tissues in the organism. The molecular machinery of life must be studied at all size scales, from atoms to complete organisms. Extensive information about the proteins that make up the cells’ functional units can be obtained through molecular biology, crystallography, and computational biology. But understanding their function within their natural environment will require examining these proteins within the cell, through all phases of cell behavior. Imaging is a very powerful unifying tool for such studies. Scanning technology, such as that used for scanning a fluorescence-labeled gel, is conceptually quite simple: a light source excites the labeled samples, and a detector system measures and records the emitted fluorescence.

Cellular and molecular imaging will be a key tool in translating structural knowledge into better ways of diagnosing, treating, and preventing disease. Imaging can identify the kinds of molecular structures/receptors that cover the surface of a tumor, information that can potentially predict how it may behave and respond to certain treatments. Monitoring the processes and pathways inside a cell as it transforms from normal to cancerous will allow this change to be detected earlier in the cancer process, perhaps before a tumor has had the chance to become fully malignant. Smart contrast agents can be used as tumor markers in imaging techniques. When smart contrast agents are injected into the body, they are undetectable. However, when they come into contact with tumor-associated enzymes called proteases, the smart agents change shape and become fluorescent. The fluorescent signal can then be detected using sophisticated imaging devices.

7.3.1 Structural Imaging

Optical microscopy (e.g. confocal microscopy) and electron microscopy (e.g. scanning (SEM) and transmission (TEM)) are used in morphological imaging and cellular studies. The tomographies, namely X-ray computed tomography (CT), magnetic resonance imaging (MRI), magnetic resonance microscopy (MRM), and single photon emission computed tomography (SPECT), are new additions to the growing number of imaging technologies at the cellular and molecular levels. Scanning-probe microscopies, such as scanning-tunneling and atomic-force microscopies, do not come under “microscopy” in the conventional sense. They are surface-scanning techniques that probe surface topography at atomic resolution.

7.3.1.1 Microscopies

Optical, scanning-probe and other novel imaging techniques, and electron microscopic methods come under microscopies.

7.3.1.1.1 Optical Microscopies: Among optical microscopies, laser scanning confocal microscopy (LSCM) is the one most widely used in biological studies. It is a light microscopic technique in which only a small spot is illuminated and observed at a time. An image is constructed through point-by-point scanning (raster scanning) of the field (a section of the specimen), and section by section in stepwise increments along an axis (z-axis), which enables 3-D image reconstruction. Thus, confocal microscopy permits analysis of the 3-D architecture of a specimen, which cannot be achieved by conventional light microscopy. Confocal microscopy can produce 3-D images of fluorescently-tagged gene products to determine their distribution in the cell during different stages of the cell cycle or under various environmental conditions.


Bio-photonic imaging (employing confocal microscopy) is a novel approach to functional genomics, target validation, drug screening and preclinical testing. It uses a bioluminescent reporter gene to tag a target of interest, which can be a gene, a cell, or a microorganism, in a whole mouse. Because light passes through tissue, the labeled mouse can be anesthetized and photographed with a camera capable of detecting the bioluminescence. This method can be used to label bacteria, infect an organism, and study the effect of antibiotics on the infection, or the effects of various drugs that can modify the response to infection. In oncology, such molecular and cell-based imaging can have a direct impact on cancer treatment and diagnosis, and the development and testing of new molecular-based therapies would benefit substantially from advances in the ability to image specific molecular and cell processes.

Two-photon or multiphoton excitation fluorescence scanning microscopy is based on excitation resulting from the successive or simultaneous (coherent) absorption of two or more photons by an atom or molecular entity, the energy of excitation being the sum of the energies of the photons. The ability to use near-IR excitation wavelengths gives multiphoton excitation scanning microscopy many advantages over single-photon confocal microscopy, e.g. clear three-dimensional images of biological tissues in real time with much-reduced photobleaching and photodamage to cells, since there are few intrinsic near-IR absorbing chromophores.

In fluorescence-based microscopy, specimens are stained with fluorescent materials, which emit light when excited by light. Immunofluorescence microscopy utilizes antibodies that are labeled with a fluorescent dye. Fluorescence lifetime imaging microscopy (FLIM) is an imaging technique in which the mean fluorescence lifetime of a chromophore is measured at each spatially resolvable element of a microscope image. Imaging of fluorescence lifetimes enables biochemical reactions to be followed at each microscopically resolvable location within the cell.

7.3.1.1.2 Scanning-probe Microscopies (SPM): Scanning-probe microscopy (SPM) is essentially a surface-probe technique. There are no lenses in SPM methods. Instead, a “probe” tip is brought very close to the specimen surface, and the interaction of the tip with the region of the specimen immediately below it is measured. The type of interaction measured defines the type of SPM. The technique is called atomic-force microscopy (AFM) if the interaction measured is the force between the atoms at the end of the “probe” tip and the atoms in the specimen surface, and scanning-tunneling microscopy (STM) if the quantum mechanical tunneling current is measured. These techniques provide topographic maps of the sample at atomic resolution. They are employed for characterizing the surface features of biomolecular complexes and molecular interactions.

In atomic force microscopy (AFM), a probe (force transducer) systematically rides across the surface of the sample, scanning it in a raster pattern. The vertical position is recorded optically as a spring attached to the probe rises and falls in response to the peaks and valleys of the surface. AFM can be used to scan conducting as well as non-conducting surfaces. Scanning-tunneling microscopy is based on the principle of quantum mechanical tunneling of electrons: the tunneling current is measured as a function of the distance between the tip of the probe and the specimen surface. One of the disadvantages of STM is that the specimen must have a conducting surface.
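The statement made for multiphoton excitation, that the excitation energy is the sum of the energies of the absorbed photons, can be checked with a short calculation using E = hc/λ. The wavelengths below are assumed example values and the function names are hypothetical.

```python
PLANCK_H = 6.62607015e-34    # Planck constant, J s
LIGHT_SPEED = 2.99792458e8   # speed of light, m/s


def photon_energy_joule(wavelength_nm):
    """Energy of one photon, E = h * c / wavelength."""
    return PLANCK_H * LIGHT_SPEED / (wavelength_nm * 1e-9)


def equivalent_single_photon_nm(wavelengths_nm):
    """Wavelength of the single photon carrying the summed energy of several photons."""
    total_energy = sum(photon_energy_joule(w) for w in wavelengths_nm)
    return PLANCK_H * LIGHT_SPEED / total_energy * 1e9


# Two near-IR photons of 800 nm absorbed together (assumed example).
equivalent_nm = equivalent_single_photon_nm([800.0, 800.0])
print(f"Two 800 nm photons excite like one {equivalent_nm:.0f} nm photon")
```

This is why a fluorophore that normally needs blue or near-UV light can be excited with near-IR light, which is absorbed far less by the surrounding tissue.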


Near-field scanning optical microscopy (NSOM), also known as scanning near-field optical microscopy (SNOM), is an aperture-less imaging method. It employs near-field probing rather than beam focusing, and thus circumvents the resolution limit imposed by diffraction effects, which are common to all aperture-based imaging techniques.

7.3.1.1.3 Novel Imaging Techniques: X-ray microscopy is an imaging technique that employs Fresnel zone plates for focusing scattered X-rays. The resolution (~ 100 Å) is about ten times better than in optical microscopy, and is ideal for imaging cells and sub-cellular particles. Coherent anti-Stokes Raman scattering (CARS) microscopy, based on Raman spectroscopy, allows one to localize specific types of molecules inside living cells without tagging them with fluorescent dyes or genetic modifications. In this technique, two laser beams are sent into the cell, at frequencies that differ by exactly the frequency at which a particular type of chemical bond (e.g. the C–H bond) in the cell is excited. The excited bond vibrates, emitting at a frequency characteristic of its vibrational mode, thus enabling a point-by-point chemical map of the cell to be imaged (chemical microscopy). Magnetic resonance microscopy (MRM) is a noninvasive and nondestructive imaging tool. MRM allows live cells to be studied simultaneously, providing a necessary link between cellular response and molecular information on the proteins and other molecules involved in a given cellular event. Surface plasmon resonance (SPR) microscopy is potentially useful for the study of molecular events in cell membranes, such as transport and trafficking processes involving the membrane.

7.3.1.1.4 Electron Microscopies: Electron microscopic methods provide ultrastructural details of biological specimens. Whereas the wavelength of visible light limits the resolution of light microscopy to hundreds of nanometers, the wavelength of intermediate-voltage electrons is only a fraction of an angstrom. It is therefore possible to determine the 3-D structure of proteins in biological specimens, such as supramolecular assemblies, organelles, cells, and tissues, using a low-dose electron beam and recordings from a large number of view angles in a transmission electron microscope (TEM). This technology is termed electron tomography. Electron tomography works essentially like a CT scan: a computer constructs a 3-D image of a flash-frozen cell from a series of image “slices” created by penetrating electron beams.

Scanning electron microscopy (SEM) is essentially a scanning-probe microscopy (with electron optics). The specimen is scanned in raster fashion by a finely focused electron beam, and the back-scattered and secondary electrons are collected and processed electronically to reconstruct the image. Unlike in TEM, specimen preparation in SEM is not as stringent, and SEM can be used to analyze thicker specimens.
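The claim that intermediate-voltage electrons have a wavelength of only a fraction of an angstrom follows from the de Broglie relation, λ = h/p, with the relativistic momentum of the accelerated electron. The sketch below evaluates it; the accelerating voltage of 300 kV is an assumed example value, and the function name is hypothetical.

```python
import math

PLANCK_H = 6.62607015e-34        # Planck constant, J s
ELECTRON_MASS = 9.1093837015e-31 # electron rest mass, kg
ELEMENTARY_CHARGE = 1.602176634e-19  # C
LIGHT_SPEED = 2.99792458e8       # m/s


def electron_wavelength_angstrom(voltage_volts):
    """Relativistic de Broglie wavelength of an electron accelerated through a voltage.

    p = sqrt(2*m*E*(1 + E/(2*m*c^2))) with E = e*V, and wavelength = h / p.
    """
    kinetic_energy = ELEMENTARY_CHARGE * voltage_volts
    momentum = math.sqrt(
        2.0 * ELECTRON_MASS * kinetic_energy
        * (1.0 + kinetic_energy / (2.0 * ELECTRON_MASS * LIGHT_SPEED ** 2))
    )
    return PLANCK_H / momentum * 1e10   # metres to angstrom


wavelength_300kv = electron_wavelength_angstrom(300e3)
print(f"Electron wavelength at 300 kV: {wavelength_300kv:.4f} angstrom")
```

The result, about 0.02 Å, is several orders of magnitude shorter than the wavelength of visible light; in practice the attainable resolution is limited by lens aberrations and radiation damage rather than by the wavelength.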

7.3.1.2 Tomographies

Molecular imaging fuses the disciplines of molecular biology, genetic engineering, immunology, cytology, and biochemistry with imaging. Integration of imaging techniques with computers has paved the way for three-dimensional description of the physiological and biochemical processes in healthy and defective organs. Advances in MRI/MR spectroscopy, MR microscopy (MRM), positron emission tomography (PET) and single photon emission computed tomography (SPECT) are used to evaluate normal and abnormal tissue metabolism and perfusion in response to genetic, physiological, or therapeutic challenges.


X-ray computed tomography (CT) images regional structure or anatomy by combining X-ray imaging with computer processing and presentation, to provide 3-D structural images of internal organs with greater clarity.

Magnetic resonance imaging (MRI) is a non-invasive method of imaging internal anatomy. The method rests on spatially localizing the nuclear spins by applying a linear magnetic field gradient along an axis. The imaging offers both near-cellular (i.e. 50 micron) resolution and whole-body imaging capability.

Single-photon emission computed tomography (SPECT) is a tomographic method that uses radionuclides that emit a single photon of a given energy. The camera is rotated 180 or 360 degrees around the patient to capture images at multiple positions along the arc. The computer is then used to reconstruct transaxial, sagittal, and coronal images from the 3-dimensional distribution of the radionuclides in the organ. SPECT can be used to observe biochemical and physiological processes as well as the size and volume of the organ.

7.3.2 Functional/Metabolic Imaging

Few techniques are available for investigating the molecular bases of human brain pathophysiology in vivo. The ability to observe both the structures and which of them participate in specific functions is due to two new techniques, functional magnetic resonance imaging (fMRI) and positron emission tomography (PET). fMRI provides high-resolution, noninvasive reports of neural activity detected by a blood oxygen level-dependent signal. fMRI techniques, with or without contrast agents, give imaging researchers the ability to study organ and cellular perfusion and to map cerebral function. The ability of functional/metabolic imaging studies to monitor pharmacological alterations may provide the basis for future testing of new drugs for the treatment of many diseases: malignancies, heart disease, Alzheimer disease, cerebrovascular diseases, multiple sclerosis, AIDS and others.

The positron emission tomography (PET) technique is based on the principle that the position of a positron-emitting radionuclide can be precisely determined, because annihilation of a positron-electron pair results in the emission of two gamma photons in diametrically opposite directions. PET builds images by detecting the energy given off by decaying radioactive isotopes. The technique is complementary to anatomic imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI). PET is used to study brain activity with a suitably labeled radionuclide. A rapid sequence of PET images of the brain can reveal the response of the brain to various chemical stimuli and pinpoint areas of abnormal activity.

EXERCISE MODULES
1. Which physical techniques provide spatial structural information?
2. What are the essential principles of three-dimensional structure determination by X-ray diffraction, and what are its advantages and limitations, and why?
3. Why is the single-crystalline state imperative in X-ray diffraction studies?
4. What are the steps in the determination of unit cell parameters by X-ray crystallography?
5. What is the “phase problem” in X-ray crystallography?
6. Name some methodologies employed in resolving the “phase problem” in X-ray structure determination.
7. Explain the procedural steps followed in the X-ray structure determination of fibrous molecules.
8. What is nuclear magnetic resonance (NMR)?
9. What is “chemical shift” and what is its importance in NMR spectroscopy?
10. What kinds of data are obtained from 1D and multi-dimensional NMR experiments?
11. What are the essentials of structure analysis by NMR spectroscopy, and what are the advantages and limitations in comparison with X-ray diffraction methods?
12. What are the optical microscopic methods in imaging?
13. What are the advantages of multi-photon excitation microscopy over confocal microscopy?
14. What is the principle on which scanning microscopies are based?
15. What are the applications of atomic force and scanning tunneling microscopies in biology?
16. What are the uses and advantages of NSOM and CARS microscopies?
17. Which tomographic imaging methods are used in biology and medicine?
18. Which are the two important techniques in functional imaging studies in medicine?


8 Protein-Ligand Interactions

Molecular associations and interactions are the basis of the transformation and regulation of genetic information and of all cellular actions and biochemical reactions, such as cell-cell recognition, neuronal signaling, hormonal action, and protein and enzyme functions. Central to such associations are protein-nucleic acid, protein-protein, protein-carbohydrate, and protein-lipid interactions. Knowledge of protein structure often plays a crucial role in functional identification and characterization.

8.1 PROTEIN-NUCLEIC ACID INTERACTIONS

Molecular interactions/associations are governed by:
(a) The size and shape of the ligand, which impose steric hindrance.
(b) The electronic potential at the site of interaction.

The molecular interactions that occur in protein-nucleic acid complexes are between:
(i) Protein side chains and DNA backbone (50%).
(ii) Protein side chains and DNA side chains (30%).
(iii) Protein backbone and DNA backbone.
(iv) Protein backbone and DNA side chain (1%).

These types of interactions can also serve as general types of interactions in all macromolecular associations. The interactions can be non-specific, such as those found in histone-DNA association in chromatin, or specific, as found in restriction endonuclease-DNA complexes. While non-specific interactions enable docking of the interacting molecular moieties, the specific interactions enable sequence-specific associations. Molecular interactions between functional groups can be classified as (1) electrostatic, (2) hydrogen bonding and (3) intercalation interactions.

8.1.1 Electrostatic Interactions

Electrostatic interactions in protein-nucleic acid complexes occur between positively charged side chains of proteins (e.g. lysyl, arginyl) and negatively charged phosphate groups of the nucleic acid backbone. Electrostatic interactions are also mediated by metal ions. The electrostatic potential of the base-pair moiety plays an important role in protein-nucleic acid interactions in the major and minor grooves (Fig. 8.1).
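To give a feel for the magnitude of such an attraction, the following is a minimal sketch that treats a lysyl ammonium group and a phosphate oxygen as two point charges and applies Coulomb's law. The function name, the 3 Å separation, the dielectric values and the conversion constant (about 332 kcal·Å/(mol·e²), as commonly used in molecular mechanics) are illustrative assumptions, not values from this text.

```python
# Point-charge Coulomb estimate; a rough illustration, not a force field.
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), approximate conversion factor


def coulomb_energy(q1: float, q2: float, distance_angstrom: float,
                   dielectric: float = 4.0) -> float:
    """Coulomb energy in kcal/mol between two point charges (in units of e)."""
    if distance_angstrom <= 0:
        raise ValueError("distance must be positive")
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * distance_angstrom)


# Lysine side-chain ammonium (+1) near a backbone phosphate oxygen (-1),
# with an assumed low dielectric constant for a buried interface.
salt_bridge = coulomb_energy(+1.0, -1.0, distance_angstrom=3.0)

# Same pair with the dielectric constant of bulk water (~80, assumed):
# the attraction is strongly screened.
screened = coulomb_energy(+1.0, -1.0, distance_angstrom=3.0, dielectric=80.0)
```

A negative energy indicates attraction. Comparing the two values shows how strongly the assumed dielectric environment affects the estimate, which is one reason such numbers should be read as order-of-magnitude guides only.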


In protein-nucleic acid complexes, aromatic side chains such as Phe, Trp and Tyr can interact in the major and minor grooves of nucleic acids by intercalation, and the complex is further stabilized by hydrogen bonding between the peptide and nucleic acid moieties. Such a combination of intercalation and hydrogen bonding imparts specificity to protein-nucleic acid interactions.

8.1.4 DNA-regulatory Proteins

DNA-regulatory proteins are associated with transcriptional control. These proteins bind to specific DNA sequences and thus help in switching gene expression on or off as required. Most of these proteins bind in the major groove of the DNA. Many of them have an ordered organization of secondary structures (super-secondary structures) that form distinct structural motifs (e.g. helix-turn-helix, zinc-finger and leucine-zipper motifs).

8.1.4.1 “Helix-turn-Helix” Motif

Many of the prokaryotic transcriptional regulatory proteins have the helix-turn-helix (HTH) structural motif (Figs. 8.2, 8.3, 8.4 & 8.5). The HTH motif is approximately 20 residues long, with a 7-residue helix, a short turn, and a 9-residue helix (the recognition helix). The “recognition helix” fits into the major groove of B-DNA. The specificities of the various helix-turn-helix motifs for binding to different DNA sequences arise primarily from the different amino acid side chains that emanate from the “recognition helix”. The other helix lies across the major groove and makes non-specific contacts with DNA.

Fig. 8.2 Schematic of “Helix-turn-Helix” Structural Motif


Fig. 8.3 Example of “Helix-turn-Helix” Motif found in DNA-regulatory Proteins

Fig. 8.4 Structure of Lambda Repressor-Operator Complex (Ref: Beamer, L.J. and Pabo, C.O. (1992) J Mol Biol., 227; 177) (Source: Protein Data Bank: 1LMB.pdb)


Fig. 8.5 Structure of Phage 434 Cro Protein with “Helix-turn-Helix” Structural Motif (Ref: Mondragon, A., Wolberger, C. & Harrison, S.C. (1989), J Mol Biol., 205; 179) (Source: Protein Data Bank: 2CRO.pdb)

8.1.4.2 “Zinc-finger” Motifs

Several types of zinc-finger motifs have been identified. In these, the Zn2+ ion is coordinated by Cys/His residues of the protein (Fig. 8.6). Zinc-finger motifs are found not only in DNA-binding proteins but in proteins in general (Fig. 8.7), where they are involved in protein-protein interactions.

Fig. 8.6 Structure of Zif268 Protein-DNA Complex with “Zinc-finger” Structural Motif (Ref: Erickson, M.E., et al. (1996) Structure, 4; 1171) (Source: Protein Data Bank: 1AAY.pdb)


Fig. 8.7 Example of “Zinc-finger” Structural Motif found in Proteins (Ref: Kowalski et al. (1999) J Biomol NMR, 13; 249)

8.1.4.3 “Leucine-zipper” Motif

The leucine-zipper motif has been found in several eukaryotic transcriptional regulatory proteins (Fig. 8.8). The motif (~30 amino acid residues) consists of Leu or Ile at seven-residue intervals (heptad spacing). The basic motif is (–L–X6–L–X6–L–X6–L–X6–).
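The heptad spacing lends itself to a simple sequence scan. The sketch below is a hedged illustration: it looks for four Leu/Ile residues spaced seven positions apart using a regular expression. The function name, the choice of four repeats and the test sequence are assumptions made for the example; the sequence is synthetic, not a real protein, and a real zipper search would also consider helical propensity and the adjacent basic region.

```python
import re

# Leu or Ile at positions i, i+7, i+14 and i+21 (four heptad-spaced residues).
# The zero-width lookahead allows overlapping candidates to be reported.
HEPTAD_PATTERN = re.compile(r"(?=([LI](?:.{6}[LI]){3}))")


def find_heptad_repeats(sequence: str) -> list[int]:
    """Return 0-based start positions of candidate heptad repeats."""
    return [m.start() for m in HEPTAD_PATTERN.finditer(sequence.upper())]


# Synthetic sequence (illustrative only): two flanking residues,
# four L-X6 units, two flanking residues.
toy_sequence = "MK" + "LAAAAAA" * 4 + "GG"
hits = find_heptad_repeats(toy_sequence)
```

Each reported position is only a candidate; heptad-spaced leucines also occur by chance, so hits would normally be checked against structural or experimental evidence.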

Fig. 8.8 (i) Examples of “Leucine-zipper” Structural Motif found in DNA-regulatory Proteins (Ref: Vinson, C.R., et al. (1989) Science, 246; 211)

8.1.5 Other DNA-binding Proteins

Histones, DNA-polymerases and restriction endonucleases are some of the proteins associated with nucleic acids and their functions. Histones are basic proteins that interact with DNA


Fig. 8.10 Structure of Human TBP Core Domain-DNA Complex (Ref: Nikolov, D.B., et al. (1996) Proc Natl Acad Sci, USA. 93; 4862) (Source: Protein Data Bank: 1CDW.pdb)

Fig. 8.11 Structure of EcoRV Endonuclease-DNA Complex (Ref: Kostrewa, D. & Winkler, F.K. (1995) Biochemistry 34; 683) (Source: Protein Data Bank: 1RVA.pdb)

Protein-DNA interactions can be detected by DNA footprinting, gel shift analysis, yeast one-hybrid assays or Southwestern blots. They can also be analyzed by genetic analysis and X-ray crystallography. Theoretical analysis of molecular dynamics (MD) trajectories may potentially reveal factors governing DNA recognition by proteins. This approach presupposes that MD trajectories accurately reflect the behavior of DNA molecules in solution. Therefore, accurate prediction of protein binding requires that MD simulations first be validated by comparison with experimental dynamics data. Nuclear magnetic resonance (NMR) experiments provide a powerful way to obtain insight into the dynamics of molecules in solution. Thus, DNA structural and dynamic features derived from NMR data are compared with MD simulations to extract motional modes that may have functional significance.

8.2 PROTEIN-PROTEIN INTERACTIONS

Protein-protein interactions underlie biological pathways, regulatory systems and signaling cascades. They play a major role in almost all relevant physiological processes occurring in living organisms, including DNA replication and transcription, RNA splicing, protein biosynthesis, and signal transduction. The molecular interactions that occur in protein-protein associations are the same as those in protein-nucleic acid complexes, namely non-bonded interactions: ionic, hydrogen bonding, van der Waals, and hydrophobic interactions. Structure-function aspects can be determined by X-ray crystallography and NMR spectroscopy (Chapter 7). Physicochemical and biomolecular methods include phage display, protein arrays, immunoprecipitation assays, and the yeast two-hybrid system (a screening technique to identify genes encoding interacting proteins).

Yeast two-hybrid is an approach to studying protein-protein interactions. The basic format involves the creation of two hybrid molecules, one in which a “bait” protein is fused to one part of a transcription factor (typically its DNA-binding domain), and one in which a “prey” protein is fused to the complementary part (typically its activation domain). If the bait and prey proteins interact, the two transcription-factor parts fused to them are brought into proximity with each other. As a result, a specific signal is produced, indicating that an interaction has taken place.

Yeast three-hybrid is a modification of the yeast two-hybrid system. The third component may be a hybrid RNA or a small molecule that acts as a cell-permeable chemical inducer of dimerization. The three-hybrid system enables the detection of RNA-protein interactions in yeast using simple phenotypic assays.

8.3 PROTEIN-CARBOHYDRATE INTERACTIONS

Many proteins are covalently conjugated with carbohydrates by post-translational modification. These proteins, called glycoproteins, are classified as O-linked if the sugars are attached to the –OH groups of serine or threonine, and as N-linked if the sugars are attached to the amide nitrogen of the asparagine side chain. Glycoproteins are involved in a wide variety of biological functions. For example, variability in the composition of the carbohydrate moieties of erythrocyte glycoproteins determines blood group specificity. Carbohydrates of glycoproteins appear to act as recognition markers in various cellular processes.

8.4 PROTEIN-LIPID INTERACTIONS

Protein-lipid interactions are predominantly hydrophobic in character. The major function of lipoproteins is to aid in the storage and transport of lipids and cholesterol.

EXERCISE MODULES
1. What are the various types of protein-nucleic acid interactions?
2. What are the physicochemical parameters that govern these interactions?
3. Which are the amino acid residues involved in electrostatic interactions?
4. How do metals mediate in electrostatic interactions?
5. What are the features of protein-nucleic acid interactions in the major groove and which amino acids can take part in these interactions?
6. What are the features of protein-nucleic acid interactions in the minor groove and which amino acids can take part in these interactions?
7. Which is the major component in the specificity of protein-nucleic acid interactions?
8. What is the importance of oligodentate hydrogen bonding?
9. What is intercalation and what is its role in protein-nucleic acid interactions?
10. Which are the structural motifs in DNA-regulatory proteins?
11. What are the essential features of the helix-turn-helix structural motif?
12. What are the essential features of the zinc-finger structural motif?
13. What are the essential features of the leucine-zipper structural motif?
14. What is the role of glycoproteins?
15. What is the role of lipoproteins?



Section III

Towards Structure Prediction (Bioinformatics-II)


9 The Protein-folding Problem

The first half of the genetic code, namely how a gene sequence is translated into a polypeptide, is fairly straightforward and has been resolved (vide: works of Khorana, Nirenberg and Matthaei). The gene sequence is read in triplets of nucleotides (codons), each of which is translated into an amino acid during protein synthesis. However, the second half of the genetic code, namely inferring the three-dimensional folding of a protein (tertiary structure) from its amino acid sequence (primary structure), is still an unresolved problem. This problem of predicting the tertiary structure of a protein from its amino acid sequence data is called the protein-folding problem, and is also known under various names, bioinformatics being the latest and most sought-after word. The protein-folding problem is a central theme in molecular biology and bioinformatics. There is a direct correlation between protein folding and genetic diseases. Therefore, proper knowledge of protein-folding mechanisms is fundamental to our understanding of protein function vis-a-vis its structure, and of genetic diseases, life processes and evolution at the molecular level. A rational approach towards molecular modeling (molecular engineering/drug design), namely the design of novel molecules to suit desired requirements, rests on the availability of the three-dimensional structures of the proteins that are to be engineered. The only physical techniques available to date for determining three-dimensional structures at atomic and molecular levels are single-crystal X-ray diffraction and multi-dimensional NMR spectroscopic methods. The “bottleneck” encountered towards bioinformatics (by experimental methods) arises from the difficulties and delays in obtaining the tertiary structure data of macromolecules by these experimental methods, due to inherent constraints and operational restraints.
A single-crystalline state of matter is a prerequisite for initiating X-ray diffraction studies, while the size (mass) of the molecule is the limiting factor for NMR spectroscopic methods. Many proteins fail to crystallize and/or cannot be obtained or dissolved in large enough quantities for NMR spectroscopic studies. In addition, these methods involve elaborate technical procedures, and the operational constraints make the existing experimental techniques of macromolecular structure determination a challenging and time-consuming task. However, these experimental techniques are crucial and are being pursued, and constant efforts are being made to minimize the operational constraints and enhance their usefulness. The primary structure (amino acid sequence) data are obtainable faster and in a more “routine” way than the tertiary structure data. With the advent of gene cloning and fast gene sequencing techniques (e.g. laser-activated fluorescence scanning), the availability of sequence data is growing at a very fast rate compared to the structural data (sequence/structure deficit ratio ~ 1,500:1). Therefore, theoretical prediction of the tertiary structures of proteins from their primary structure data is an alternative approach towards molecular engineering (molecular bioinformatics). This approach, though highly complex and challenging, is an attractive and desirable one for multiple reasons:
1. To realize the full potential of rapidly growing gene sequence data.
2. For rational molecular design (drugs).
3. For molecular (protein) engineering (rationally altered proteins).
4. For de novo synthesis (design) of proteins.

All the structural information necessary for proper tertiary folding is in the primary structure data. The molecular forces involved in the tertiary structure stability are the same nonbonded interactions––van der Waals interactions, hydrogen bonding and hydrophobic interactions. However, the task of predicting the tertiary structure of a protein a priori from its amino acid sequence data is not at all simple and straightforward. This is because protein folding is a highly cooperative process, involving various molecular interactions. That is, the second half of the genetic code, namely, inferring the unique spatial folding of proteins from their amino acid sequence data, still remains an unsolved problem. This complexity is referred to as the protein-folding problem. The protein-folding problem can be addressed either by genomic or proteomic-based analyses, or by methods.

9.1 GENOMICS ANALYSIS

The genome is the sum of the genes and intergenic sequences of the haploid cell. The complex and richly structured data from genomics pose the greatest encoding problem. From this perspective, the sequence data from human and other genomes provide great opportunities for theorists interested in establishing the complete genetic information and its structure in diverse organisms. Availability of genome sequences also provides an opportunity to explore genetic variability between organisms as well as within individual organisms. Genome sequences covering all the genes of an organism enable one to identify the genes that influence metabolism, cell division and disease processes, to carry out evolutionary modeling (Fig. 9.1), and to manipulate gene expression. Genomic analysis comprises both structural genomics and functional genomics. A newly identified gene in an organism can be compared to the existing database of information to find another that has a similar function. Tracing the phylogenetic history of genes, gene domains and gene linkages in diverse organisms is one of the most challenging aspects of genome analysis (bioinformatics).

9.1.1 Structural Genomics

As traditionally defined, the term structural genomics refers to the use of sequencing and mapping technologies, with bioinformatic support, to develop complete genome maps (genetic, physical, and transcript maps) and to elucidate genomic sequences for different organisms, particularly humans. It is concerned with gene structure and the relative positions of the genes in a chromosome (gene mapping). Polymorphisms (sequence variations) that lie close to a trait are seldom separated from it from one generation to the next. Therefore, these polymorphisms may be used for mapping and identifying important genes.


inheritance. Distance is measured in base pairs. Physical maps describe the chemical characteristics of the DNA molecule itself. Physical maps are particularly important when searching for disease genes by positional cloning strategies and for DNA sequencing. Physical maps can be low-resolution or high-resolution maps. Low-resolution chromosomal maps are based on the banding patterns (light and dark bands reflecting regional variations in the amounts of A-T versus G-C) observed in light microscopy of stained chromosomes. A cDNA map shows the position of expressed DNA regions (exons), the regions transcribed into mRNA, relative to particular chromosomal regions. A cDNA map can provide the chromosomal location for genes whose functions are currently unknown. For disease-gene hunters, the map can also suggest a set of candidate genes to test when the approximate location of a disease gene has been mapped by genetic linkage techniques. High-resolution physical maps (e.g. by shotgun sequencing and sequence assembly) provide the complete base-pair sequence of each chromosome in the genome. Determination of the base-pair sequences of genes (high-resolution physical mapping) is necessary for inferring the amino acid sequences (primary structure) of the corresponding proteins. The two current approaches in high-resolution physical mapping are termed top-down (producing a macrorestriction map) and bottom-up (resulting in a contig map). With either strategy the maps represent ordered sets of DNA fragments that are generated by cutting genomic DNA with restriction enzymes. The fragments are then amplified by cloning or by polymerase chain reaction (PCR) methods. Electrophoretic techniques are used to separate the fragments according to size into different bands, which can be visualized by direct DNA staining or by hybridization with DNA probes of interest.
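The cut-and-separate procedure described above can be mimicked in a few lines. The following is a minimal sketch, assuming a linear DNA molecule and the enzyme EcoRI (recognition site GAATTC, cutting after the first G on the top strand); the function names and the test sequence are hypothetical and purely illustrative.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut a linear DNA at every occurrence of `site`.

    cut_offset is the cut position within the site; 1 corresponds to
    EcoRI, which cuts G^AATTC on the top strand.
    """
    sequence = sequence.upper()
    fragments, start = [], 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


def gel_order(fragments: list[str]) -> list[int]:
    """Fragment lengths, longest first (longer fragments migrate less on a gel)."""
    return sorted((len(f) for f in fragments), reverse=True)


# Synthetic sequence with two EcoRI sites (illustrative only).
toy_dna = "AAAA" + "GAATTC" + "CCCCCCC" + "GAATTC" + "TT"
pieces = digest(toy_dna)
band_sizes = gel_order(pieces)
```

Only the top strand is modeled here; the list of band sizes is the kind of pattern that a restriction map or fingerprint is built from.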
DNA fingerprinting is a method of assembling overlapping cloned DNA molecules (contig maps), based on restriction fragment analysis or Southern blot hybridization patterns: digestion of the DNA with restriction enzymes is followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences. The introduction of sequence tagged sites, which are short DNA segments defined by their unique sequences, allowed the use of PCR in contig assembly. Another large-scale physical mapping approach is radiation hybrid mapping, or site-specific chromosome fragmentation. The introduction of pulsed field gel electrophoresis (PFGE) and fluorescence in situ hybridization (FISH) were major technological advances in sequence analysis. Contig maps are important because they make it possible to study a complete, and often large, segment of the genome by examining a series of overlapping clones, which together provide an unbroken succession of information about that region. Current mapping methods leave a large number of gaps, to be filled by other experimental methods. Chromosome walking is one strategy for filling in gaps. It involves hybridizing a primer of known sequence to a clone from an unordered genomic library and synthesizing a short complementary strand. The complementary strand is then sequenced and its end is used as the next primer for further walking. Genome mapping is a reconstruction of the entire set of chromosomes for a given organism, showing the relative position of every gene. Genome control maps would identify all the components of the transcriptional machinery that have roles at any particular promoter and the contribution those specific components make to the coordinate regulation of genes. The map will facilitate modeling of the molecular mechanisms that regulate gene expression and implicate components of the transcription apparatus in functional interactions with gene-specific regulators.

Genome sequencing projects face several technical problems. Since present experimental methods can read only stretches of ~500 base pairs at a time, determination of larger genomic sequences requires a strategy to assemble overlapping sequence fragments. The “shotgun” strategy is employed in most of the current large-scale genome sequencing projects. Chromosomes of a target organism are purified, fragmented, and sub-cloned in fragments on the order of kilobase pairs. They are further sub-cloned as smaller fragments in plasmid vectors for DNA sequencing. First, the gene fragments are sequenced to determine the order of the bases in each sequence. Next, overlapping fragments are built up in a multiple alignment, a process known as sequence assembly, from which a consensus sequence for the clone is obtained. Full chromosomal sequences are then assembled from the overlapping sequences in a highly redundant set of fragments. In genomic studies, the earliest stages of genome analysis are performed by automatic methods. The genome sequences are then annotated. More detailed information is collected by laboratory experiments and a closer examination of the sequence data.
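The idea of building a longer sequence from overlapping fragments can be illustrated with a toy greedy merge. This is a simplified sketch under strong assumptions (error-free reads, a single strand, no repeats), not the algorithm used by production assemblers; the function names and the reads are made up for the example.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_length: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)


# Synthetic reads sampled from the made-up sequence "ATGCGTACGTTAGC".
reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
contigs = greedy_assemble(reads)
```

With real data, sequencing errors and repeated regions make greedy merging unreliable, which is why the high redundancy of the fragment set and a consensus step matter in practice.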

9.1.2 Genome Annotation and Comparative Genomics

Each fragment of DNA contains unique features. A DNA fragment may encode a portion of a gene or a gene control sequence, or the fragment may be a portion of a genome that has no apparent function. The raw genomic sequence (base-pair) data is meaningless without an analysis of the various factors/regions that constitute the sequence. The elucidation and description of biologically relevant features in the sequence is essential for genome data to be useful. The goal of genomic annotation is to extract biologically relevant information from raw genomic sequence data. The quality of the annotation has a direct impact on the value of the sequence. The process is iterative: it requires finding putative coding regions, identifying what each region codes for, and using the available evidence to refine the coding and control regions. Genome annotation requires a spectrum of bioinformatics tools, each tuned to the genome under analysis. Annotation is done at two levels: at the DNA sequence level and at the predicted protein sequence level. Genome sequence databases contain an assortment of data types that cannot be treated alike. These include untranslated regions (UTRs), coding region sequences (CDRs), introns (intervening sections of DNA which occur almost exclusively within a eukaryotic gene, but which are not translated to amino acid sequences in the gene product), exons, ribosome-binding sites, and translational termination sites (Fig. 9.2). DNA sequence analysis estimates boundaries between coding and non-coding sequences, determines gene structures, translates protein-coding genes into protein sequences, and characterizes the conditions under which different forms of a gene may be expressed, along with a host of other structural and functional information.
Once the individual coding sequences are discerned, genome annotation constructs systems of genes and gene products by combining knowledge gained from sequence analysis with knowledge from other data sources. Once a properly annotated DNA sequence is available, it is possible to infer the roles of all of the gene products, how they are controlled and interact, and their possible involvement in disease.
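The step of translating a protein-coding sequence into a protein sequence can be sketched as a codon-by-codon lookup. The example below is a minimal illustration that assumes the standard genetic code, an intron-free coding sequence given in frame on the coding strand, and only the bases A, C, G and T; the function and variable names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' = stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(coding_dna: str) -> str:
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    coding_dna = coding_dna.upper()
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up coding sequence: ATG GCC AAG TAA
protein = translate("ATGGCCAAGTAA")
```

Ambiguous bases such as N would raise a KeyError in this sketch; annotation pipelines handle those cases, alternative genetic codes and spliced genes explicitly.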


Shine-Delgarno sequences. Gene identification in prokaryotes is simplified by their lacking introns. Eukaryotic genes, on the other hand, are commonly organized as coding regions (exons) and non-coding regions (introns, transposable elements, pseudogenes, repeat elements, and UTRs) (Fig. 9.2), and hence may comprise several disjoining ORFs and the gene products may be of different lengths. Non-coding regions contain regulatory residues with promoters and transcription factor-binding sites. The introns are removed from the pre-mature mRNA through a process called splicing, which leaves the exons untouched, to form an active mRNA. The main task of gene identification (in eukaryotic DNA) involves coding region recognition (intron/exon discrimination) and detection of splice sites (boundaries between exons and introns). A central problem in bioinformatics is the assignment of function to sequenced open reading frames (ORFs). The most common approach is based on inferred homology using a statistically based sequence similarity methods (SIM) (e.g. BLAST, PSI-BLAST). Open reading frame expressed sequence tags (ORESTES) approach provides sequence information along the whole length of each transcript, rather than just the ends. The method involves low-stringency PCR to produce cDNA libraries, samples of which are then sequenced. Lately, alternative nonSIM based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP) that is based on combining evidence from amino-acid attributes, predicted structure and phylogenic patterns; and uses a combination of Inductive Logic Programming data mining, and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology. For many eukaryotic organisms, the complete genome sequence is not available. 
What is available for some of the organisms is a large collection of partial gene sequence data from randomly chosen cDNA clones, called expressed sequence tags (ESTs). This approach provides a rapid way to identify genes in any organism. The general strategy for an EST project involves construction of cDNA libraries from a variety of tissues at different stages of development, and the subsequent large-scale single-pass sequencing of clones from these libraries. Because of the single-pass sequencing, the error rate in the EST sequences can be high. All the same, EST libraries are useful for preliminary identification of genes by database similarity searches. ESTs and cDNAs help to refine gene boundaries (Fig. 9.3).

Fig. 9.3 EST and cDNA Alignment to help resolve Exon Boundaries
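
The ORF-based view of gene identification used in this section can be made concrete with a short sketch. The Python function below is only an illustration, not a production gene finder: it scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. The name find_orfs and the min_codons cut-off are assumptions introduced here. The reverse strand would be handled by running the same scan on the reverse complement, and introns are ignored entirely, so the sketch suits the prokaryotic case only.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a real gene finder).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return a list of (start, end, frame) tuples for ATG...stop ORFs.

    start is the 0-based index of the A in ATG, end is exclusive and
    includes the stop codon. min_codons counts the codons from ATG up to,
    but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```

For the toy sequence CCATGAAATTTGGGTAACC the scan reports a single ORF, (2, 17, 2): it starts at index 2, ends before index 17 and lies in frame 2.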

EST searches can be done at the DNA level with BLASTn or FASTA, or at the protein level with tBLASTn, tBLASTx or FASTAx. Motif identification can be done with searches against BLOCKS, PFAM, and PROSITE (Chapter 12). If DNA or protein level sequence similarities are significant and informative, then their functional information can be transferred to the genomic query sequence as a putative function annotation. In eukaryotes, after transcription, the mRNA for a gene may be alternatively spliced (alternative transcripts). This phenomenon adds to the complexity of identifying intron-exon and exon-intron splice junctions. However, alternative splicing can be deduced if sufficient EST information is available for a predicted gene. Therefore, EST information generated by cDNA sequencing projects is critical to annotate and interpret a eukaryotic genome completely. EST sequencing projects aim to create as much data as possible for the genome. They sample libraries of cDNA molecules prepared from a variety of tissues. Intron/exon prediction draws on consensus sequences at the intron-exon and exon-intron splice junctions, base composition and codon usage. UTRs occur both in DNA and RNA. They are portions of the sequences flanking coding region sequences (CDRs) but are not themselves translated. Availability of protein sequences helps to refine intron/exon boundaries (Fig. 9.4).

Fig. 9.4 Protein Sequence Comparison to refine Intron/Exon Boundaries
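
Splice-site detection, named above as one ingredient of intron/exon prediction, is commonly introduced through the GT-AG rule: most eukaryotic introns begin with GT (donor site) and end with AG (acceptor site). The text does not state this rule; it is brought in here as background, and the function name candidate_introns and the min_len threshold are hypothetical. The sketch below only enumerates GT...AG spans. A real predictor would also weigh the full junction consensus, base composition and codon usage, as described above.

```python
def candidate_introns(dna, min_len=6):
    """List (start, end) spans such that dna[start:end] begins with GT
    and ends with AG and is at least min_len bases long.

    This over-predicts heavily by design: it is a first filter, not a
    splice-site model.
    """
    dna = dna.upper()
    # Index of the G in every GT dinucleotide (candidate donor sites).
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    # Index just past the G in every AG dinucleotide (candidate acceptor ends).
    acceptor_ends = [j + 2 for j in range(len(dna) - 1) if dna[j:j + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]


if __name__ == "__main__":
    seq = "ATGGTAAGTCCCAGGCT"
    for start, end in candidate_introns(seq):
        print(start, end, seq[start:end])
```

Each reported span is only a candidate; EST or protein alignments of the kind shown in Figs. 9.3 and 9.4 would be needed to decide which, if any, is a real intron.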

Complete CDRs are rarely sequenced in one reaction. Variable-length, overlapping fragments are therefore aligned in a multiple sequence alignment (sequence assembly) to obtain a consensus sequence. This also minimizes cloning errors. Once coding sequences are identified and extracted from the genome, evidence for each coding region sequence (CDR) is collected from a variety of sequence similarity and motif search tools (BLAST, FASTA and PFAM). When a search of the databases reveals several ESTs, computational methods can be used to cluster the ESTs by sequence similarity to known genes (Chapter 12). This allows each predicted gene to be compared against an array of EST sequences, enabling more effective and informative annotations. ESTs are not only incomplete, but also to a certain degree inaccurate. Nevertheless, because gene prediction methods are only partially accurate, partial cDNA copies of expressed genes (ESTs) serve to confirm that a predicted gene is transcribed. The use of experimentally determined EST data in gene structure prediction and refinement underscores the importance of the integrated approach in bioinformatics: combining experimental data with computational tools of analysis.

Comparative genomics is the comparison of all the proteins in two or more proteomes and of the relative locations of related genes in separate genomes. This includes a comparison of gene number, gene content, and gene location on chromosomes. The availability of complete genome sequences makes possible a comparison of all of the proteins encoded by a genome (the proteome of the organism) with those of another. Comparison of the DNA sequences of two organisms provides information on gene relationships, on conserved sequences that help to identify gene coding regions, and on evolutionary history. A comparison of each protein in the proteome with all other proteins distinguishes unique proteins from proteins that have arisen from gene

for the genome by revealing which genes are expressed at a particular stage of the cell cycle, and genetic variation. With the increasing number of completely sequenced genomes, it is now possible to make DNA microarrays in which all the genes of an organism are represented, and to assess the expression of all these genes simultaneously. The main applications of DNA microarray chips are in gene expression profiling, mutational analysis, detection of single nucleotide polymorphisms (SNPs), and pharmacogenomics. In this analysis, all the genes of an organism are represented by oligonucleotide sequences (non-redundant ESTs), spread in an array on a microscope slide. Fluorescent labels are attached to cDNA prepared by reverse-transcribing mRNA from cells, and the arrayed oligonucleotides are collectively hybridized to this labeled cDNA library. The amount of label binding to each oligonucleotide spot is a reflection of the amount of the corresponding mRNA in the cell. The labeled (fluorescent) spots are scanned with a microscope or imaging equipment. From this analysis, a set of genes that respond in an identical manner may be identified. Messenger RNA expression arrays immobilize stretches of mRNA and are used to measure the concentration of mRNA species in a sample as a function of tissue type, cell cycle and other environmental conditions. The gene-transcript profile technique (referred to as transcript imaging) is particularly appealing because RNA transcripts represent the primary output of the genome, and RNA-based and protein-based measurements are complementary. For disease classification, a ‘molecular signature’ of a tissue may be obtained by allowing tissue RNA to react with a DNA microarray. This information may help to refine disease classifications, to guide the choice of therapy and to find new therapeutic targets. Serial analysis of gene expression (SAGE) is an alternative approach to microarrays for gene identification and quantification and for monitoring global gene expression. 
It is a sequence-based method, which involves the production of short nucleotide tags (10-16 bp) from expressed genes that are then concatenated and sequenced, revealing the identity of multiple tags simultaneously. The study of global gene expression, using DNA microarrays or SAGE, relies on the observed phenotype changes due to dynamic changes in a particular mRNA population. An automated hybrid of SAGE and differential display enables complete elucidation of gene expression patterns for a given tissue or cell. It requires a complex series of steps involving multiplex PCR, cDNA cloning, in vitro transcription, cDNA construction, sequencing gel analysis, and quantification. These tools provide researchers with a powerful platform for exploring novel gene-disease relationships.

Genome annotation is a tripartite, iterative procedure. (1) First, the coding regions of a genomic sequence must be discerned from the non-coding regions. The genomic sequence is split into overlapping contiguous sequences (contigs: DNA sequences that have been assembled solely on the basis of direct sequencing information). Repeat regions are identified. (2) Once a gene is identified, or predicted, the next step is to identify putative functions and possible homologues in other organisms and within the genome. These tasks require the use of multiple bioinformatics tools. Each coding sequence is compared with sequences in the databases at both the DNA and protein levels. Gene finding tools assist in identifying genes. ESTs, cDNA sequences, and known protein sequences are compared with the predicted DNA sequences. Analysis of gene products (protein sequences translated from each coding sequence) by sequence similarity, profile and other search methods allows identification of functions. Once a set of nucleotide sequences is available it is possible to ascribe

known and predicted genes of two or more genomes. Each protein of one proteome is then selected in turn as a query sequence against the proteome of another organism or the combined proteome of a group of organisms.
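
The query-by-query comparison of proteomes described above can be sketched as follows. This is a toy illustration under loud assumptions: a real analysis would score pairs with BLAST or FASTA, whereas here the hypothetical similarity function is simply the fraction of shared 3-residue words. The reciprocal_best_hits step is a commonly used convention for pairing related proteins and is not taken from the text. All protein names and sequences in the example are made up.

```python
def kmers(seq, k=3):
    """Set of overlapping k-residue words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy score in [0, 1]: shared k-mers / all k-mers (stand-in for BLAST)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_hits(queries, targets):
    """Use each query protein in turn; return {query: (best target, score)}."""
    hits = {}
    for qname, qseq in queries.items():
        tname = max(targets, key=lambda t: similarity(qseq, targets[t]))
        hits[qname] = (tname, similarity(qseq, targets[tname]))
    return hits


def reciprocal_best_hits(proteome_a, proteome_b):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab = best_hits(proteome_a, proteome_b)
    ba = best_hits(proteome_b, proteome_a)
    return sorted((a, b) for a, (b, score) in ab.items()
                  if score > 0 and ba[b][0] == a)


if __name__ == "__main__":
    # Hypothetical two-protein "proteomes".
    org_a = {"a1": "MKTAYIAKQR", "a2": "GGSPLWWCCH"}
    org_b = {"b1": "MKTAYIAKQK", "b2": "GGSPLWFCCH"}
    print(reciprocal_best_hits(org_a, org_b))
```

Proteins of one organism that find no hit above a chosen score in the other would be the candidates for "unique" proteins in this kind of comparison.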

9.2.1 Amino Acid Folding Propensity Parameters

The physicochemical characteristics of the amino acids in a sequence (their shape, size and charge; Table 9.1) influence the tertiary folding of a protein. Other parameters that influence the folding are pH, side chains and solvent interactions. Knowledge of the physicochemical and positional characteristics of residues in protein complexes is essential for understanding the functional features of proteins vis-à-vis their structure; for ab initio methods of structure prediction, which incorporate the physicochemical properties of the amino acids in a protein to calculate the 3-D structure (lowest-energy model) on the basis of protein folding; and for applications of computation-based proteomics studies.

Fig. 9.7 A Flowchart of Proteomic Analysis

Table 9.1 Structural and Physicochemical Properties of Amino Acids

Amino acid           Mass (dalton)   Volume (Å³)   HP scale (K & D)#   Surface area buried (%)##
Alanine (A)               71.1            67              1.8                  74
Arginine (R)             186.2           148             -4.5                  64
Asparagine (N)           114.1            96             -3.5                  63
Aspartic acid (D)        115.1            91             -3.5                  62
Cysteine (C)             103.2            86              2.5                  91
Glutamine (Q)            128.1           114             -3.5                  62
Glutamic acid (E)        129.1           109             -3.5                  62
Glycine (G)               52.0            48             -0.4                  72
Histidine (H)            137.1           118             -3.2                  78
Isoleucine (I)           113.3           124              4.5                  88
Leucine (L)              113.2           124              3.8                  85
Lysine (K)               128.2           135             -3.9                  52
Methionine (M)           131.2           124              1.9                  85
Phenylalanine (F)        147.2           135              2.8                  88
Proline (P)               97.1            90             -1.6                  64
Serine (S)                87.1            73             -0.8                  66
Threonine (T)            101.1            93             -0.7                  70
Tryptophan (W)           186.2           163             -0.9                  85
Tyrosine (Y)             163.2           141             -1.3                  76
Valine (V)                99.1           105              4.2                  86

(#) Hydrophobicity of amino acid side chains (Source: Kyte & Doolittle, 1982).
(##) Mean surface area (%) buried (Source: Rose et al., 1985).
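
The HP scale column of Table 9.1 is typically used by averaging the values over a sliding window along the sequence, which gives a hydropathy profile whose peaks suggest buried or membrane-spanning stretches. The sketch below uses the Kyte & Doolittle values exactly as tabulated above. The function name hydropathy_profile and the default window of 7 residues are illustrative choices made here, not prescriptions from the text.

```python
# Kyte & Doolittle hydropathy values, copied from Table 9.1.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Mean hydropathy of every window of `window` consecutive residues.

    Returns len(seq) - window + 1 values; value i describes residues
    i .. i + window - 1. Raises KeyError for non-standard residue codes.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    scores = [KD[res] for res in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    # Hypothetical peptide: hydrophobic stretch followed by an acidic stretch.
    for value in hydropathy_profile("IIIIDDDD", window=4):
        print(round(value, 2))
```

Positive window averages indicate predominantly hydrophobic stretches and negative averages hydrophilic ones; the window length controls how much the profile is smoothed.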

Most amino acids play more than one structural and functional role. On the basis of size and shape, glycine is the smallest amino acid, with only an H atom for a side chain. This gives Gly greater conformational flexibility and lets it fit in where other residues would be too bulky. Proline lacks a free amide hydrogen, which prevents main-chain hydrogen bonding. It imposes an additional structural constraint on the backbone relative to any other amino acid. Both Gly and Pro are helix disrupters, which is an important factor in globular proteins. The SH group of cysteine is the most reactive group of any amino acid side chain. The formation of disulfide (S–S) bonds between cysteines within proteins is important to the formation of active structural domains in a large number of proteins.

On the basis of hydrophobicity/hydrophilicity, a protein assumes a structure in which polar, hydrogen-bonding and non-polar interactions are maximized simultaneously. Thus, the buried surface area of an amino acid side chain is often used as a measure of the contribution of the hydrophobic effect to protein folding. Non-polar amino acids (Ile, Leu, Phe and Met) are generally found in the interior, whereas charged residues (Arg, His, Lys, Asp and Glu) tend to be exposed on the surface of proteins. Ala, Ile, Leu, Met and Val are the aliphatic residues that fit together in the hydrophobic interiors of proteins. Neutral polar residues (Asn, Gln, Cys, Ser and Thr) are found on the surface as well as inside the protein. Ser and Thr have OH groups and play important roles as active-site residues.

Reverse turns exhibit the following structural and physicochemical characteristics:
1. Reverse turns are polar due to the presence of NH and C=O groups, so they are usually situated at the molecular surface (the boundary separating protein and solvent), in contact with the solvent, water.
2. Backbone and side chain hydrogen bonds are disrupted in turn regions.
3. Solvent participation is an important influence in turn stabilization.
4. Regions adjacent to the reverse turn have many hydrophobic residues.
5. Turns in a hydrophobic environment (e.g. the interior of proteins) are rare. In such cases, bound water molecules and hydrogen bonds between side chains and the polypeptide backbone are involved in neutralizing the hydrophilicity.

9.2.2 Protein Folding Kinetics Methods

No one has the technology in place to solve hundreds of crystal structures for hundreds of new proteins. Therefore, the current experimental recourse is to screen protein-folding conditions (intermediate structures). The protein-folding problem can be analyzed experimentally by kinetic methods, that is, by analyzing the folding pathways. Partially folded protein intermediates are chemically trapped and physically separated, and their structures are analyzed by biophysical and biochemical methods. Introducing mutations and analyzing their effects on the kinetics of folding can also be used to study folding pathways. The results of these studies are:
(i) The structural elements collapse as a compact unit (molten globule) and later reorganize towards the native structure.
(ii) All but the rate-limiting steps are readily reversible, so that the initial folding process is rapid.
(iii) Inter-conversions of the molten globule state with the fully unfolded state are rapid and non-cooperative, whereas those with the fully folded state are slow and cooperative.
(iv) Partially folded intermediates are inherently unstable.
(v) The transition state has ordered segments largely intact.
(vi) Local sequence alterations and environment can influence the overall structure of proteins.
(vii) Folding does not proceed by stepwise acquisition of the native structure.
(viii) There is a preferential accumulation of the most stable intermediate in the folding pathway, because the productive pathway for refolding leads from this intermediate.

Molecular dynamics (MD) simulations can be used to study the dynamics of protein folding, and to predict protein-folding rates. Prediction of such stable intermediate folds is one of the important procedures in structure prediction methods (see Chapter 12).
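
Results (i) to (iii) above can be read as a minimal kinetic scheme, U ⇌ M → N, in which the unfolded state U and the molten globule M interconvert rapidly and the step from M to the native state N is slow and rate-limiting. The sketch below integrates that scheme with a simple Euler step. It is a toy model under stated assumptions: the three rate constants are hypothetical values chosen only to separate the fast and slow phases, and the slow step is treated as irreversible for simplicity.

```python
def simulate_folding(k_um=100.0, k_mu=50.0, k_mn=1.0,
                     t_end=10.0, dt=1e-4, every=1000):
    """Euler integration of U <-> M -> N.

    k_um, k_mu: fast, reversible collapse/unfolding rates (hypothetical, 1/s)
    k_mn:       slow, rate-limiting folding rate (hypothetical, 1/s)
    Returns a list of (time, U, M, N) tuples sampled every `every` steps.
    """
    u, m, n = 1.0, 0.0, 0.0            # start from the fully unfolded state
    trajectory = [(0.0, u, m, n)]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        flux_um = (k_um * u - k_mu * m) * dt   # net U -> M in this step
        flux_mn = k_mn * m * dt                # M -> N in this step
        u -= flux_um
        m += flux_um - flux_mn
        n += flux_mn
        if step % every == 0:
            trajectory.append((step * dt, u, m, n))
    return trajectory


if __name__ == "__main__":
    for t, u, m, n in simulate_folding()[::10]:
        print(f"t={t:5.1f}  U={u:.3f}  M={m:.3f}  N={n:.3f}")
```

With these illustrative rates the intermediate accumulates within the first fraction of a second and then drains slowly into N, which mirrors the preferential accumulation of the most stable intermediate noted in (viii).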

9.2.3 Phylogenetic (Evolution) Methods

The study of relationships between groups of organisms is called taxonomy. The branch of taxonomy that deals with genetic sequences is called phylogenetics or molecular evolution. Molecular evolution encompasses the self-assembly of molecules into higher structures and the subsequent evolution from proteins to living organisms. The concept of molecular self-replication is based on the application of chemical kinetics to template-induced polynucleotide replication and translation. The protein-folding problem can be addressed by phylogenetic methods via either the phenotypic approach or the “cladistic” approach. The phenotypic approach relies heavily on sequence data, while the cladistic approach relies on knowledge of ancestral relationships as well as sequence data (see Chapter 8). Rational design of new proteins is based on insights obtained from the basic mechanisms of evolution: mutation and recombination.

9.2.3.1 Evolution and Function

The basic mechanisms of evolution are mutation, recombination and natural selection. These processes are closely related to genes and the dynamics of their replication and translation. The study of evolutionary aspects of proteins provides insights into important and conserved regions in protein folding. The dynamic processes in the evolution of new proteins are modification of side chains and insertion and deletion of amino acid residues, all of which affect the folding. Generally, changes in the interior of proteins are conservative, so that the main-chain folding is not drastically affected.

Homology generally means a relationship between nucleic acid or protein sequences that are descended from a common ancestral sequence. Homology (evolutionary relationship) can be inferred from sequence similarity results. From the observation of similarity, one might be able to infer whether the sequence similarity relates to functional similarity. Sequences that have diverged greatly during evolution cannot be detected by simple sequence similarity search methods. In such cases, computational methods based on profile searches, which go beyond simple pair-wise sequence similarity methods, should also be tried. Sequence alignments are intended to unravel evolutionary pathways and/or structural homology between two proteins. However, sequence homology does not necessarily indicate functional homology. Phylogenetic homology does not necessarily imply structural homology, and neither of them necessarily implies functional homology. Comparisons of tertiary structures of homologous proteins (proteins related by divergence from a common ancestor) have shown that three-dimensional structures have been better conserved during evolution than the primary structures. The conservation of main-chain folding underscores the fact that (i) conservation of structural and physicochemical features imposes stringent conditions on folding pathways, and (ii) proteins that serve similar functions in all organisms retain strong structural homology.

Mutations and gene duplications affect the expression of the genetic code in proteins. Accumulation of point mutations by natural drift (phenotype mutations) results in structural changes in homologous proteins, although their functional properties are similar. These evolutionary changes are continuous and small. Gene fusion and gene duplication, on the other hand, introduce drastic and discontinuous structural changes. Evolution of completely new proteins/enzymes by point mutations is comparatively slow, and their emergence can be explained by gene duplication. Proteins with new functions are produced by gene duplication. Through mutation and natural selection, one of the copies can develop a new function. Following speciation, a newly derived genome will inherit the families of its ancestor organisms, but will also develop new ones to meet evolutionary challenges. One example of gene duplication and divergent evolution is the α/β-barrel domain found in several classes of proteins. Gene duplication led to several copies of the gene, and gene fusion led to proteins with different functions that share a common architecture. Combination of two dissimilar polypeptides by gene fusion generally results in the formation of a new polypeptide with altered functional features (e.g. α-lactalbumin). Different proteins can be generated by juxtaposition of different combinations of exons. For example, a combination of α-lactalbumin and an enzyme, transferase, has resulted in the formation of a new enzyme, lactose synthetase.

9.2.3.2 Evolutionary Trends in Protein Structures

A common mechanism for the evolution of new proteins is gene duplication to form more copies, after which the new copies evolve independently by point mutation and by amino acid insertions and deletions to yield proteins with new functions. In related organisms, the gene content of the genome and the gene order on the chromosome are likely to be conserved. As the relationship between the organisms becomes more distant, local groups of genes remain clustered together, but chromosomal rearrangements move the clusters to other locations.

Evolution is achieved via either divergent evolution or convergent evolution. Production of different protein species by mutation of genes descended from a common ancestral gene is called divergent evolution (e.g. myoglobins, hemoglobins, cytochrome c and insulins). Convergent evolution is the acquisition of similar functional features by a class of proteins with dissimilar primary structures and structural conformations (e.g. subtilisins and pancreatic proteinases). These proteins have different tertiary structures, but similar functions.

Analysis of three-dimensional structures of proteins shows that proteins with high primary structure homology have closely similar tertiary structures, and hence similar functions. The converse is not always true. Closely related sequences, which may be assumed to share a common structure, may not share the same function (e.g. lysozymes and lactalbumins; insulin and relaxin; plasma albumin and fetoprotein; ovalbumin and antithrombin). For example, hen egg-white lysozyme shares 50 percent homology with α-lactalbumin but their functions are different. The β-barrel structural motif, found in many protein classes, may be associated with different functions. Also, proteins with different primary structures (analogous proteins from convergent evolution) can have similar tertiary structures and functions. This reinforces the fact that during evolutionary changes (divergent or convergent evolution) the tertiary structures are conserved more strongly than the primary structures.
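
Figures such as the "50 percent homology" quoted above are, in practice, obtained as percent identity over an alignment of the two sequences. A minimal sketch of that calculation follows. It assumes the two sequences have already been aligned (same length, gaps written as "-"), and the choice of denominator (columns in which neither sequence has a gap) is one convention among several, adopted here for illustration. The example strings are made up and are not the lysozyme or lactalbumin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the ungapped alignment columns.

    Both inputs must come from the same pairwise alignment and therefore
    have equal length. Returns 0.0 if no column is free of gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue          # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical aligned fragments.
    print(percent_identity("KVFG", "KQFT"))
    print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```

A high value supports an inference of common ancestry, but, as the examples above show, it does not by itself establish a shared function.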

9.2.3.3 Molecular Structure and Evolution

Evolution can be studied from the molecular perspective, through the comparative analysis of tertiary structures of proteins from various organisms and environments. In divergent evolution the functions appear to be the same, but there are changes in the tertiary structures to adapt to different environments. Therefore, sequence comparisons of orthologous proteins (proteins that perform the same functions in different species, e.g. insulin, hemoglobins and lysozyme) open the way to the construction of phylogenetic trees. Such phylogenetic studies (e.g. cytochrome c, globins) provide valuable structural information on protein folding dynamics, conserved regions, invariant residues crucial for proper folding/function, and regions prone to additions and deletions. Similarly, sequence comparison studies of paralogous proteins, which are proteins with different but related functions in an organism (e.g. G-proteins in the signal-transduction process), provide valuable insights into the structure-function relationships in proteins. Interacting pairs of proteins co-evolve to maintain functional and structural complementarity. Consequently, such a pair of protein families shows similarity between their phylogenetic trees. Evaluation of the degree of co-evolution of family pairs by the global protein structural interactome map (PSIMAP, a map of all the structural domain–domain interactions in the PDB) would improve the accuracy of prediction based on ‘homologous interaction’.
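
One simple way to quantify the "similarity between phylogenetic trees" of two interacting families is to correlate their pairwise evolutionary distance matrices, computed over the same set of species in the same order. The sketch below does this with a Pearson correlation over the upper triangles. It is offered as an assumption-laden illustration of the idea, not as the PSIMAP procedure, and the distance values in the example are made up.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("distances must not all be identical")
    return cov / sqrt(var_x * var_y)


def coevolution_score(dist_a, dist_b):
    """Correlation between two families' inter-species distance matrices.

    Row/column k of both matrices must refer to the same species.
    Values near 1 suggest similar trees, i.e. possible co-evolution.
    """
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


if __name__ == "__main__":
    # Hypothetical distances for three species in two protein families.
    family_a = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
    family_b = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]
    print(round(coevolution_score(family_a, family_b), 3))
```

A high score on its own is weak evidence, because all proteins from the same set of species share the underlying species tree; some correction for that background signal would be needed in a real analysis.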

EXERCISE MODULES

1. What is the genesis of the “protein-folding” problem?
2. Why is structure prediction so important in bioinformatics?
3. What are the parameters influencing the tertiary folding?
4. What is a genome?
5. What is structural genomics?
6. What is genetic linkage mapping?
7. What is physical mapping?
8. What are EST databanks and how are they prepared?
9. What is the importance of annotation and comparative genomics?
10. What are gene chips and what are their applications in genomic studies?
11. What is functional genomics?
12. Why is the protein-folding problem so complex?
13. What is a hydrophobicity scale?
14. What is a proteome?
15. What is intra-proteomic analysis?
16. What is inter-proteomic analysis?
17. What are the kinetic methods of protein folding analysis?
18. What is taxonomy?
19. What is molecular evolution?
20. What is the phenotypic approach to evolutionary study?
21. What is the cladistic approach to evolutionary study?
22. How do you correlate the evolution of proteins with their functions?
23. What are the evolutionary trends in protein structure?
24. How do you study the evolution of proteins from their 3-D structure?

BIBLIOGRAPHY

1. Baldwin, R.L. (1989), TiBS, 14; 291. “How does protein folding get started?”
2. Baxevanis, A.D. & Ouellette, B.F. (Eds). (1998), Wiley & Sons: New York. “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins”.
3. Birren, B., et al. (Eds). (1997), Cold Spring Harbor Laboratory Press: New York. “Genome Analysis: A Laboratory Manual”.
4. Blake, C.C.F. (1985), Int Rev Cytology, 93; 149. “Exons and the evolution of proteins”.
5. Bork, P. & Koonin, E.V. (1996), Curr Opin Struct Biol., 6(3); 366. “Protein sequence motifs”.
6. Branden, C. & Tooze, J. (1999), Garland Pubs: Philadelphia. “Introduction to Protein Structure”, 2nd Edn.
7. Brennan, R.G. & Matthews, B.W. (1989), Trends Biochem Sci., 14; 286. “Structural basis of DNA-protein recognition”.
8. Brown, T.A. (1999), Wiley-Liss: New York. “Genomes”.
9. Chee, M., et al. (1996), Science, 274; 610–14. “Accessing genetic information with high-density DNA arrays”.
10. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.
11. Chothia, C. & Lesk, A.M. (1986), EMBO J., 5; 823. “The relation between the divergence of sequence and structure in proteins”.
12. Cohen, C. & Parry, D.A.D. (1993), TiBS, 11; 245. “α-helical coiled coils: a widespread motif in proteins”.
13. Creighton, T.E. (1978), Prog Biophys Mol Biol., 33; 231. “Experimental studies of protein folding and unfolding”.
14. Creighton, T.E. (1993), Freeman Press: New York. “Proteins: Structures and Molecular Properties”, 2nd Edn.
15. Danchin, A. (1999), Curr Opin Struct Biol., 9(3); 363. “From protein sequence to function”.
16. Dickerson, R.E. (1980), Sci Amer., 243(3); 136. “Cytochrome c and the evolution of energy metabolism”.
17. Dickerson, R.E. & Geis, I. (1983), Benjamin-Cummings: Menlo Park, CA. “Hemoglobin: Structure, Function, Evolution and Pathology”.
18. Dobson, C.M. & Karplus, M. (1999), Curr Opin Struct Biol., 9(1); 92. “The fundamentals of protein folding: bringing together theory and experiment”.
19. Dunham, I.N., et al. (2000), Nature, 404; 904. “The DNA sequence of human chromosome”.
20. Dutt, M.J. & Lee, K.H. (2000), Curr Opin Biotechnol., 11; 176. “Proteomic analysis”.
21. Eisen, M.B. & Brown, P.O. (1999), Methods Enzymol., 303; 179. “DNA arrays for analysis of gene expression”.
22. Eisenhaber, F., Persson, B. & Argos, P. (1995), Crit Rev Biochem Mol Biol., 30; 1. “Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence”.
23. Englander, S.W. (1993), Science, 262; 848. “In pursuit of protein folding”.
24. Farber, G. & Petsko, G.A. (1990), TiBS, 15; 228. “The evolution of α/β-barrel enzymes”.
25. Fickett, J.W. (1996), Trends Genet., 12; 316. “Finding genes by computer: the state of the art”.
26. Gilbert, W., de Souza, S.J. & Long, M. (1997), Proc Natl Acad Sci USA, 94; 7698. “Origin of genes”.
27. Hegde, P., et al. (2000), Biotechniques, 29; 548. “A concise guide to cDNA microarray analysis”.
28. Johnson, M.S. & Overington, J.P. (1993), J Mol Biol., 233; 716. “A structural basis for sequence comparison”.
29. Kim, W.K., Bolser, D.M. & Park, J.H. (2004), Bioinformatics, 20(7). “Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (PSIMAP)”.
30. Kleywegt, G.J. (1999), J Mol Biol., 285; 1887. “Recognition of spatial motifs in protein structures”.
31. Kyte, J. (1994), Garland Pubs: New York. “Structure in Protein Chemistry”.
32. Kyte, J. & Doolittle, R.F. (1982), J Mol Biol., 157; 105. “A simple method for displaying the hydropathic character of a protein”.
33. Lee, P.S. & Lee, K.H. (2000), Curr Opin Struct Biol., 11; 171. “Genomic analysis”.
34. Lesk, A.M. (1991), IRL Press: London. “Protein Architecture: A Practical Approach”.
35. Lesk, A.M. & Chothia, C. (1980), J Mol Biol., 136; 225. “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of globins”.
36. Levitt, M. & Chothia, C. (1976), Nature, 261; 552. “Structural patterns in globular proteins”.
37. Li, W.H. (1997), Sinauer Associates: Sunderland, MA. “Molecular Evolution”.
38. Lilley, D.M.J. (Ed). (1995), IRL Press: Oxford. “DNA-Protein: Structural Interactions”.
39. Luthy, R., Bowie, J.U. & Eisenberg, D. (1992), Nature, 356; 83. “Assessment of protein models with three-dimensional profiles”.
40. Marcotte, E.M., et al. (1999), Science, 285; 751. “Detecting protein function and protein-protein interactions from genome sequences”.
41. Marshall, A. & Hodgson, J. (1998), Nature Biotechnol., 16; 27. “DNA chips: An array of possibilities”.
42. Martin, A., et al. (1998), Structure, 6; 875. “Protein folds and functions”.
43. Mount, D.W. (2001), Cold Spring Harbor Lab Press: New York. “Bioinformatics: Sequence and Genome Analysis”.
44. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
45. Orengo, C.A., et al. (1994), Nature, 372; 631. “Protein superfamilies and domain superfolds”.
46. Orengo, C.A., et al. (1999), Curr Opin Struct Biol., 9(3); 374. “From protein structure to function”.
47. Overington, J.P. (1992), Curr Opin Struct Biol., 2; 394. “Comparison of three-dimensional structures of homologous proteins”.
48. Pabo, C.O. & Sauer, R.T. (1992), Annu Rev Biochem., 61; 1053. “Transcription factors: structural families and principles of DNA recognition”.
49. Pandey, A. & Mann, M. (2000), Nature, 405; 837. “Proteomics to study genes and genomes”.
50. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.
51. Qian, N. & Sejnowski, T.J. (1988), J Mol Biol., 202; 865. “Predicting the secondary structure of globular proteins using neural network models”.
52. Richardson, J.S. (1981), Adv Prot Chem., 34; 167. “The anatomy and taxonomy of protein structure”.
53. Richardson, J.S. (1985), Methods Enzymol., 115; 349. “Describing patterns of protein tertiary structure”.
54. Rose, G.D., et al. (1985), Science, 229; 834. “Hydrophobicity of amino acid residues in globular proteins”.
55. Sackheim, G. (1991), Addison-Wesley: New York. “Introduction to Chemistry for Biology Students”, 4th Edn.
56. Sanchez, R. & Sali, A. (1997), Curr Opin Struct Biol., 7; 206. “Advances in comparative protein structure modeling”.
57. Sensen, C.W. (Ed). (2001), Wiley-VCH: Weinheim. “Biotechnology (5b): Genomics and Bioinformatics” (2nd Edn).
58. Struhl, K. (1989), Trends Biochem Sci., 14; 137. “Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eukaryotic transcriptional regulatory proteins”.
59. Sun, Z. & Jiang, B. (1996), J Prot Chem., 15; 675. “Conformation of commonly occurring super-secondary structures (basic motifs) in protein databank”.
60. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999), Curr Opin Chem Biol., 3(5); 548. “Evolution of protein function, from a structural perspective”.
61. Vriend, G. & Sander, C. (1991), Proteins, 11; 52. “Detection of common three-dimensional structures in proteins”.
62. Zaidi, F.N., Nath, U. & Udgaonkar, J.B. (1997), Nature Struct Biol., 4; 1016. “Multiple intermediates and transition states during protein unfolding”.

10 Computational Methods in Structure Prediction

Theoretical (computational) methods of tertiary structure prediction and model building of proteins (and other biological molecules), on the basis of the known three-dimensional structures of their homologues, are at present the alternative way to obtain structural information (molecular bioinformatics). By virtue of the genome projects, the sequence databases are growing faster than the structural databases, and there is a sequence/structure deficit (1,500:1). The reason is that, although the primary structure databases of proteins are growing rapidly, three-dimensional structural data remain scarce because of inherent limitations and procedural constraints of the existing experimental methods (X-ray crystallography and NMR spectroscopy). All the same, all structure prediction methods (statistical, physical and simulation methods) are empirical with inherent limitations, and they all rely in one way or another on experimental data for validation.

10.1 PROTEIN FOLDING RULES Simply stated, the ‘protein-folding’ problem is to predict the 3-D folding of a protein from its primary structure (amino acid sequence) data. Empirical rules for protein-folding problem have been formulated from the “knowledge” culled from the experimental data from structure determination by physical techniques, molecular biology and quantitative biochemistry. From the experimental data, it is proposed that steps toward the native folding of a protein are through stages—(i) rapid collapse of random-coil state to “molten globule” state, (ii) slow process of semi-compact states to a transition state, (iii) fast folding from the transition to the final native state, and (iv) there is a preferential accumulation of the most stable intermediate in folding pathway. From the “knowledge-based” studies, the empirical rules that govern the packing that occurs between and among secondary structural elements to form structural motifs, modules, domains, and tertiary structures are: 1. Residues that become buried in the interior of a protein close-pack. Close packing and exclusion of water and burial of hydrophobic groups in the interior are the major determinants of tertiary folding. 2. Polar and charged residues are predominantly found at the reverse turns and surfaces of protein molecules. 3. The Ramachandran conformation angles of the main-chain and side-chains of the polypeptides lie in the narrowly allowed regions.


4. Secondary structural elements in proteins retain conformations close to the minimum free energy conformations of the isolated secondary structures. They interact in a manner that induces no appreciable steric strain.
5. Packing interactions (non-bonded interactions) contribute to folding stability and, therefore, tend to be conserved in protein folding processes.
6. Structural motifs, modules, domains and topologies are more important than amino acid sequence homology in the evolution of protein structures. That is, the 3-D structure of proteins is more faithfully conserved than the underlying sequence.
7. In proteins with natural physical constraints, such as disulfide-containing proteins, the S–S bridges are found as integral parts of structural motifs and influence the tertiary folding.
8. In disulfide-containing proteins, the S–S moieties confer not only structural stability but also functional features. There is a hierarchy of S–S bridges in stabilizing structural motifs/moieties.

10.2 STRUCTURE PREDICTION OF FIBROUS PROTEINS
Proteins are classified under two broad categories: (1) fibrous proteins and (2) globular proteins. Fibrous proteins are long and thread-like (e.g. collagen), while globular proteins (e.g. myoglobin, insulin) are compact globules, owing to reverse turns of the polypeptide backbone. Protein-folding rules are relatively simple in the case of fibrous proteins. Many of them are helical (or sheet-like), and in many cases several helical polypeptide monomers entwine to form coiled-coil superhelical structures, so that they exist in relatively simple structural motifs.

10.2.1 The Keratin Group Proteins
The primary structures of the α-keratin group (k-m-e-f: keratin, myosin, epidermin, fibrinogen) proteins (e.g. hair, wool, myosin, and fibrinogen) show a general trend towards a heptad-residue repeat, with a preponderance of non-polar residues at the 1st and 4th positions and charged residues at the 5th and 7th positions. The residues at the 2nd, 3rd and 6th positions are generally polar. The heptad-residue repeat aligns the 1st and 4th non-polar residues into a non-polar strip on one side of the helix. The non-polar faces of several helices associate to form a coiled-coil structure. Silk (β-keratin) has an antiparallel β-pleated sheet structure with a six-residue repeat unit, (Gly-Ser-Gly-Ala-Gly-Ala)n. Individual sheets are packed together so that Gly faces Gly and Ala (or Ser) faces Ala (or Ser).

10.2.2 The Collagen Group Proteins
The structural unit of collagen, tropocollagen, comprises three left-handed polyproline-type helical monomers. The amino acid sequence of each polypeptide monomer is (Gly-X-Y)n, with n = 333. Generally X is proline and Y is 4-hydroxyproline. Because every third residue is glycine and the stereochemical constraints are strong, it is straightforward to predict the tertiary structures of collagen group proteins a priori from the amino acid sequence data.
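The Gly-X-Y periodicity can be tested directly on a sequence. The following is a minimal sketch, not an established program: the function name gly_xy_fraction and the use of one-letter residue codes are assumptions made for illustration. It tries the three possible triplet frames and reports the largest fraction of triplets that begin with glycine.

```python
def gly_xy_fraction(sequence: str) -> float:
    """Return the highest fraction of complete triplets that start with Gly,
    taken over the three possible triplet frames (hypothetical helper)."""
    seq = sequence.upper()
    best = 0.0
    for offset in range(3):
        # Complete triplets only: the last start index is len(seq) - 3.
        triplets = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        if not triplets:
            continue
        hits = sum(1 for t in triplets if t[0] == "G")
        best = max(best, hits / len(triplets))
    return best


if __name__ == "__main__":
    # A collagen-like stretch (Pro written as P) versus an arbitrary sequence.
    print(gly_xy_fraction("GPPGPPGPPGPP"))
    print(gly_xy_fraction("ACDEFGHIK"))
```

A value close to 1.0 over a long stretch suggests a collagen-like repeat; short sequences can reach moderate values by chance, so the score is only a first filter.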


10.3 STRUCTURE PREDICTION OF GLOBULAR PROTEINS
In the case of globular proteins, the structure prediction rules are inadequate. Many different amino acid sequences give similar three-dimensional structures, and we do not yet fully understand the rules of protein folding. The protein-folding problem is highly complex in globular proteins because (i) all twenty amino acids can be involved in the construction of the polypeptide; (ii) there are no repeating units that would have simplified the problem; and (iii) globular proteins are not linear: the direction of the backbone changes many times to produce a compact globular shape. Therefore, the protein-folding problem in globular proteins is attempted at several levels of complexity, incorporating empirical protein-folding rules culled from known 3-D structures, physicochemical data and “knowledge” accumulated from other sources. The success rate depends on the structural features of the class of proteins addressed.
Structural topologies (motifs and domains), and not amino acid sequence homologies, are better conserved in evolution; different amino acid sequences can give rise to similar protein structures. That is, three-dimensional spatial architecture is more important in protein folding than amino acid sequence homology. Therefore, the protein-folding problem is better addressed by classifying protein structures by shapes, motifs and modules and then carrying out molecular modeling procedures. Structure prediction methods will have better success if pattern matching is given more importance. That is, instead of matching the amino acid sequence of a protein with the sequences of its homologues, the objective should be to match the amino acid sequence of a test protein with a given topology/shape/profile. This ‘inverse folding’ method, in which proteins are identified by structural motifs and shapes and amino acid sequences are aligned to fit the structural motifs, is a recent development in structure prediction by statistical methods.
The complexity of the protein-folding problem can be reduced for certain classes of proteins, such as immunoglobulins and disulfide-containing proteins, where inherent natural constraints influence the tertiary folding interactions.

10.3.1 Secondary Structure Prediction
The secondary structure elements (helix, sheet and turn) constitute the building blocks of the folding units in globular proteins. The aim of secondary structure prediction is, in essence, to identify and locate the helices, strands and random-coil segments within a protein from its amino acid sequence data. Prediction of the secondary structure of a protein is therefore often used as the first step toward predicting its tertiary structure. A variety of empirical and statistical methods are available to predict secondary structures of proteins from amino acid sequence data: (i) Chou-Fasman and GOR, (ii) neural network and (iii) nearest neighbor. These methods try to predict the propensity of each amino acid to be part of a helix, a sheet or a coil region in a protein. Protein sequences are processed as sliding windows of fixed-length segments, usually 7 to 17 amino acids long. The central residue of each window is then assigned one of the states, namely helix, sheet or coil, depending on its propensity (Table 10.1).

Table 10.1 2-D Structure Propensity Chart for Amino Acid Residues

Amino acid            α-helix    β-sheet    Reverse turn
Alanine (A)            1.40       0.75       0.80
Arginine (R)           1.20       0.90       0.90
Asparagine (N)         0.78       0.66       1.54
Aspartic acid (D)      1.00       0.66       1.40
Cysteine (C)           0.95       1.10       1.00
Glutamine (Q)          1.17       1.00       0.95
Glutamic acid (E)      1.45       0.51       0.77
Glycine (G)            0.63       0.85       1.60
Histidine (H)          1.12       0.83       0.90
Isoleucine (I)         1.00       1.57       0.47
Leucine (L)            1.30       1.20       0.60
Lysine (K)             1.22       0.70       1.02
Methionine (M)         1.30       1.12       0.60
Phenylalanine (F)      1.15       1.23       0.60
Proline (P)            0.50       0.60       1.50
Serine (S)             0.72       0.95       1.40
Threonine (T)          0.78       1.43       0.50
Tryptophan (W)         1.03       1.26       0.90
Tyrosine (Y)           0.74       1.40       1.00
Valine (V)             0.96       1.66       0.50

Source: Chou, P.Y. & Fasman, G.D. (1978); Fasman, G.D. (Ed). (1990).
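The sliding-window assignment that uses Table 10.1 can be sketched as follows. This is an illustrative simplification rather than any published program: the names PROPENSITY and assign_states, the default window of 7 residues, and the rule "take the state with the highest mean propensity in the window" are assumptions. The numerical values are copied from Table 10.1.

```python
# Propensities from Table 10.1: (alpha-helix, beta-sheet, reverse turn).
PROPENSITY = {
    "A": (1.40, 0.75, 0.80), "R": (1.20, 0.90, 0.90), "N": (0.78, 0.66, 1.54),
    "D": (1.00, 0.66, 1.40), "C": (0.95, 1.10, 1.00), "Q": (1.17, 1.00, 0.95),
    "E": (1.45, 0.51, 0.77), "G": (0.63, 0.85, 1.60), "H": (1.12, 0.83, 0.90),
    "I": (1.00, 1.57, 0.47), "L": (1.30, 1.20, 0.60), "K": (1.22, 0.70, 1.02),
    "M": (1.30, 1.12, 0.60), "F": (1.15, 1.23, 0.60), "P": (0.50, 0.60, 1.50),
    "S": (0.72, 0.95, 1.40), "T": (0.78, 1.43, 0.50), "W": (1.03, 1.26, 0.90),
    "Y": (0.74, 1.40, 1.00), "V": (0.96, 1.66, 0.50),
}
STATES = "HET"  # H = helix, E = sheet (strand), T = reverse turn


def assign_states(sequence: str, window: int = 7) -> str:
    """Label each residue with the state whose mean propensity is highest
    in the window centered on it (windows are truncated at the chain ends)."""
    seq = sequence.upper()
    half = window // 2
    labels = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half):min(len(seq), i + half + 1)]
        means = [sum(PROPENSITY[r][k] for r in segment) / len(segment)
                 for k in range(3)]
        labels.append(STATES[means.index(max(means))])
    return "".join(labels)


if __name__ == "__main__":
    print(assign_states("AEELLKKAEELLKK"))
    print(assign_states("VTVYVTVY"))
```

Residues outside the twenty standard one-letter codes raise a KeyError in this sketch; a practical tool would have to decide how to treat them.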

The Chou-Fasman method is based on analyzing the frequency of each of the twenty amino acids in α-helices, β-sheets and turns of proteins of known structure. The propensity of an amino acid for a given conformation is its frequency fi in that conformation divided by the average frequency of all residues in that conformation, so that values above 1.0 mark residues that favor the conformation. The sequence is first scanned for a short stretch of amino acids with a high probability of starting a nucleation event that could form one type of secondary structure. For an α-helix, a prediction is made when four of six contiguous amino acids have a propensity > 1.0 for the α-helix. For a β-strand, three amino acids out of five contiguous residues with a propensity > 1.0 for the β-strand are taken as a nucleation region. These nucleation regions are extended along the sequence in each direction until the average propensity of four contiguous amino acids drops below 1.0. Turns are modeled as tetrapeptides. Prediction of secondary structure can be aided by examining the periodicity of amino acids with hydrophobic side chains in the protein structure. Hydrophobicity tables that give a hydrophobicity value for each amino acid are used to locate the most hydrophobic regions of the protein. A sliding window is moved across the sequence and the average hydrophobicity of the amino acids within the window is plotted. In a hydrophobic moment display, the tendency of hydrophobic and polar amino acids to segregate to opposite sides of a structure is plotted against the Ramachandran angles from one residue to the next along the protein chain. Whichever procedure is followed, the general features of secondary structure prediction programs are to predict ordered regions, reverse turns and loops, and to formulate further empirical rules to predict how these secondary structure elements fold further into compact super-secondary and tertiary structures. For example, the hydrogen-bond network of an α-helix is formed within a single segment of the chain, so the H-bonds of helices can be localized to intra-segment partners. Thus, it is presumed that helical secondary structure depends mainly on local sequence information (short-range interactions), and not on the final folding, and can therefore be predicted from neighborhood amino acid sequence data. On the other hand, the H-bond network of a β-sheet is formed between different segments (strands) of the chain, and thus requires long-range interactions. Prediction methods for secondary structure have limited accuracy. Such empirical assumptions have no physical meaning and sometimes lead to erroneous results. This is because tertiary folding is highly cooperative, and limited knowledge of a few aspects of a structure does not necessarily give the insight needed to predict the whole structure. Interactions between sequentially distant residues override the intrinsic conformational propensities of individual residues to achieve a proper tertiary folding. In essence, piecemeal information does not provide total information on protein folding. The direct approach to the protein-folding problem is to predict the three-dimensional structure of a protein from its amino acid sequence data. This is carried out by direct search methods (e.g. pair-wise sequence search methods). The converse approach to the protein-folding problem asks: for a particular type of folding, what are the compatible amino acid sequences?
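The helix nucleation and extension steps of the Chou-Fasman procedure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the full published algorithm: it uses only the α-helix column of Table 10.1, ignores helix breakers and the competition between helix and sheet assignments, and the names HELIX_P, helix_nuclei and predict_helix are hypothetical.

```python
# Alpha-helix propensities copied from Table 10.1.
HELIX_P = {
    "A": 1.40, "R": 1.20, "N": 0.78, "D": 1.00, "C": 0.95, "Q": 1.17,
    "E": 1.45, "G": 0.63, "H": 1.12, "I": 1.00, "L": 1.30, "K": 1.22,
    "M": 1.30, "F": 1.15, "P": 0.50, "S": 0.72, "T": 0.78, "W": 1.03,
    "Y": 0.74, "V": 0.96,
}


def helix_nuclei(seq: str, window: int = 6, min_formers: int = 4) -> list:
    """Start indices of windows in which at least `min_formers` of
    `window` residues have a helix propensity above 1.0."""
    return [i for i in range(len(seq) - window + 1)
            if sum(HELIX_P[r] > 1.0 for r in seq[i:i + window]) >= min_formers]


def predict_helix(sequence: str, window: int = 6, probe: int = 4) -> str:
    """Mark predicted helical residues with 'H' and all others with '-'."""
    seq = sequence.upper()
    marks = ["-"] * len(seq)

    def mean(a: int, b: int) -> float:
        return sum(HELIX_P[r] for r in seq[a:b]) / (b - a)

    for start in helix_nuclei(seq, window):
        left, right = start, start + window  # half-open interval [left, right)
        # Extend to the right while the tetrapeptide ending at the newly
        # added residue keeps a mean propensity of at least 1.0.
        while right < len(seq) and mean(right - probe + 1, right + 1) >= 1.0:
            right += 1
        # Extend to the left under the mirror-image condition.
        while left > 0 and mean(left - 1, left + probe - 1) >= 1.0:
            left -= 1
        for i in range(left, right):
            marks[i] = "H"
    return "".join(marks)


if __name__ == "__main__":
    print(predict_helix("GGGGAELKAEGGGG"))  # prints --HHHHHHHHHH--
```

In the example, the central stretch AELKAE nucleates the helix, and the extension step carries it two glycines further on each side before the four-residue average falls below 1.0. This shows how the extension rule can overrun the true end of a helix, one reason why such predictions are only approximate.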

10.3.2 Super-secondary Structure Classification
Proteins have an ordered organization of secondary structures that form distinct structural motifs and domains. These motifs and domains (e.g. Greek-key, helix-turn-helix, zinc-finger, and leucine-zipper) are termed super-secondary structures. One way of approaching the protein-folding problem is to classify protein structures by structural motifs and domains and to incorporate this information in structure prediction procedures. This approach lies one or two steps below the final tertiary structure in the prediction hierarchy. Family search methods (e.g. the template method) are based on comparison of protein motifs, domains or families. These programs rest on two assumptions: that proteins with similar amino acid sequences have similar structures (which is not necessarily true), and that motifs, domains and tertiary folds, rather than amino acid sequences, are conserved during evolution. Well-defined types of folding units (super-secondary structures) can be classified under (i) all α-, (ii) all β-, (iii) α/β and α+β, and (iv) other types of structure classes.

10.3.2.1 All α-motif Structures
In globular proteins, α-helices are packed to form various structural motifs. These range from simple ‘up-down’ helices (e.g. melittin, myohemerythrin, cytochrome c′ and cytochrome b562) (Fig. 10.1), through more complex helix bundles (e.g. ribonuclease inhibitor protein (Fig. 10.2)) and transmembrane helix motifs and domains in membrane proteins (e.g. rhodopsin), to more diverse helical folds, as in lysozyme (Fig. 10.3) and the ‘globin fold’, where the α-helices of the bundle are wrapped around the core in different directions so that sequentially adjacent α-helices are usually not adjacent to each other in space (e.g. myoglobin) (Fig. 10.4). These structural units are maintained by preserving the hydrophobic interior core during evolution.


Fig. 10.1(i) Structure of Melittin (Example of “all α-Helix” Structural Motif) (Ref: Eisenberg, D., Gribskov, M. & Terwilliger, T.C.) (Source: Protein Data Bank: 2MLT.pdb)

Fig. 10.1(ii) Structure of Myohemerythrin (Example of “Up-down α-Helices Bundle” Structural Motif) (Ref: Sheriff, S., Hendrickson, W.A. & Smith, J.L. (1987)) (Source: Protein Data Bank: 2MHR.pdb)

Fig. 10.2 Structure of Ribonuclease Inhibitor Protein (Example of “Multiple Helices” Structural Motif) (Ref: Kobe, B. & Deisenhofer, J. (1996) J Mol Biol., 264; 1028) (Source: Protein Data Bank: 1BNH.pdb)

Fig. 10.3 Structure of Lysozyme (Example of “Diverse Helices” Structural Motif) (Ref: Weaver, L.H., Grutter, M.G. & Matthews, B.W. (1995) J Mol Biol., 245; 54) (Source: Protein Data Bank: 153L.pdb)


Coiled-coil structures typically comprise two or three α-helices coiled around each other to form a super-coiled structure, with seven residues for every two turns. Coiled-coil regions in proteins may be identified by searching for the 7-residue (heptad) periodicity, (a-b-c-d-e-f-g)n. The a and d residues are usually hydrophobic amino acids. The leucine-zipper motif is typically made of two parallel α-helices held together by interactions between hydrophobic leucine residues located at every 7th position in each helix, that is, at approximately every two turns of the α-helix. The zipper holds the protein subunits together. The binding of the subunits forms a “scissor-grip”-like structure with ends that lie in the major groove of the DNA double helix. The predicted motif is a coiled-coil structure. Membrane-spanning α-helices resemble the α-helices that are buried in the structural core of a protein.
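A search for heptad periodicity can be illustrated with a short script. This is a rough sketch, far simpler than dedicated coiled-coil prediction programs: the hydrophobic residue set, the requirement of four consecutive leucines spaced seven residues apart, and the function names are all assumptions chosen for illustration.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumed set of hydrophobic residues


def best_heptad_register(sequence: str):
    """Return (offset, fraction): the heptad frame in which positions
    a and d are most often occupied by hydrophobic residues."""
    seq = sequence.upper()
    best_offset, best_fraction = 0, 0.0
    for offset in range(7):
        # Position a is index 0 of each heptad, position d is index 3.
        core = [seq[i] for i in range(len(seq)) if (i - offset) % 7 in (0, 3)]
        if not core:
            continue
        fraction = sum(r in HYDROPHOBIC for r in core) / len(core)
        if fraction > best_fraction:
            best_offset, best_fraction = offset, fraction
    return best_offset, best_fraction


def leucine_zipper_starts(sequence: str, repeats: int = 4) -> list:
    """Indices where Leu recurs at every 7th position `repeats` times."""
    seq = sequence.upper()
    span = 7 * (repeats - 1)
    return [i for i in range(len(seq) - span)
            if all(seq[i + 7 * k] == "L" for k in range(repeats))]


if __name__ == "__main__":
    model = "LKELEDK" * 4  # idealized heptad with Leu at positions a and d
    print(best_heptad_register(model))   # prints (0, 1.0)
    print(leucine_zipper_starts(model))  # prints [0, 3]
```

A high fraction in one frame and low fractions in the others indicates a consistent heptad register; real sequences contain stutters and skips, so the frame often has to be evaluated over local windows rather than over the whole chain.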

Fig. 10.4 Structure of Myoglobin (Example of Helical-wheel “Globin-fold” Structural Motif) (Ref: Watson, S.C. & Kendrew, J.C) (Source: Protein Data Bank: 1MBN.pdb)

10.3.2.2 All β-motif Structures
The “β-turn-β” motif structures have antiparallel β-strands joined by a turn/loop (Fig. 10.5). The β-strands have a right-handed twist, and packing of β-strands gives a barrel-like structure. β-Motif structures range from simple “strand-turn-strand” motifs (e.g. T cell coreceptor protein CD8 and superoxide dismutase (Fig. 10.6)) to more complex Greek-key, β-barrel, and “Swiss roll” structural motifs (Fig. 10.7). The structural motif in immunoglobulins, the “immunoglobulin-fold”, which is also found in human cell adhesion protein CD2 (Fig. 10.8), is a version of the Greek-key structural motif.

Fig. 10.5 Schematic of “β-turn-β” Structural Motif


Fig. 10.6(i) Structure of Human T Cell Coreceptor Protein CD8 (Example of “Strand-turn-Strand” Structural Motif) (Ref: Leahy, D.J., Axel, R. & Hendrickson, W.A. (1992) Cell, 68; 1145) (Source: Protein Data Bank: 1CD8.pdb)

Fig. 10.6(ii) Structure of Superoxide Dismutase (Example of “Strand-turn-Strand” Structural Motif) (Ref: Carugo, K.D., et al. (1996) Acta Crystallogr D: Biol Crystallogr., 52; 176) (Source: Protein Data Bank: 1XSO.pdb)

Fig. 10.7(i) Structure of Bovine Lens γ-Crystallin Protein (Example of “β-Barrel” Structural Motif) (Ref: Najmudin, S., et al. (1993) Acta Crystallogr D: Biol Crystallogr., 49; 223) (Source: Protein Data Bank: 4GCR.pdb)

Fig. 10.7(ii) Structure of Transthyretin (Example of “β-Barrel” Structural Motif) (Ref: Sunde, M., et al. (1996) Eur J Biochem., 236; 491) (Source: Protein Data Bank: 1TFP.pdb)


Fig. 10.8 Structure of Human Cell Adhesion Protein CD2 (Example of a Structure with “Immunoglobulin-fold” Structural Motif) (Ref: Bodian, D.L., et al. (1994) Structure, 2; 755) (Source: Protein Data Bank: 1HNF.pdb)

10.3.2.3 (α + β) and (α/β)-Motif Structures
Examples of (α + β) structures are insulin, trypsin inhibitor, ribonuclease, glutaredoxin, p21, and bacteriochlorophyll-containing protein (Fig. 10.9). α/β-Motif structures (e.g. the Rossmann fold) are built up from the β-α-β structural motif (Fig. 10.10). Such structures (e.g. flavodoxin, carbonic anhydrase, adenylate kinase, and triose phosphate isomerase) consist of a central core of parallel β-strands (closed into a β-barrel in some cases) with α-helices wound around it (Fig. 10.11).

Fig. 10.9(i) Structure of T4 Glutaredoxin Protein (Example of (α+β) Structural Motif) (Ref: Eklund, H., et al. (1992) J Mol Biol., 228; 596) (Source: Protein Data Bank: 1ABA.pdb)

Fig. 10.9(ii) Structure of Protein p21 (Example of (α+β) Structural Motif) (Ref: Wittinghofer, F., et al. (1991) Environ Health Perspect., 93; 11) (Source: Protein Data Bank: 121P.pdb)

Fig. 10.9(iii) Structure of Bacteriochlorophyll-containing Protein (Example of (α + β) Structural Motif) (Ref: Tronrud, D.E. & Matthews, B.W. (1993) Photosyn Reaction Center, 1:13) (Source: Protein Data Bank: 4BCL.pdb)

Fig. 10.10 Schematic of “β-α-β” (“Rossmann-fold”) Structural Motif

Fig. 10.11(i) Structure of Carbonic Anhydrase (Example of (α/β) Structural Motif) (Ref: Nair, S.K. & Christianson, D.W. (1991) J Amer Chem Soc., 113; 9455) (Source: Protein Data Bank: 1HCA.pdb)

Fig. 10.11(ii) Structure of Adenylate Kinase (Example of (α/β) Structural Motif) (Ref: Berry, M.B. & Phillips, Jr. G.N. (1998) Proteins, 32; 276) (Source: Protein Data Bank: 1ZIN.pdb)


Fig. 10.11(iii) Structure of Triose Phosphate Isomerase (Example of (α/β) Structural Motif) (Ref: Lolis, E., et al. (1990) Biochemistry, 29; 6609) (Source: Protein Data Bank: 1YPI.pdb)

10.3.2.4 Other Structural Motifs
There are many super-secondary structures that do not fall under the more common structural classes. In the structural motif common to all scorpion venom toxins and many snake venom toxins, the disulfide bonds impose natural constraints and stabilize the helix and sheet regions of the motif (Fig. 10.12). The EF-hand structural motif is found in many calcium-binding proteins (Fig. 10.13). Motifs such as ‘helix-turn-helix’, ‘zinc-finger’, and ‘leucine-zipper’ are found in nucleic acid regulatory proteins (Fig. 10.14). These structural motifs influence the spatial folding, and such structural data are helpful in tertiary structure prediction analyses.

Fig. 10.12 Structure of Erabutoxin (A Snake Venom Toxin; Disulfide Bonds Stabilize the Structure) (Ref: Nastopoulos, L., et al. (1998) Acta Crystallogr D: Biol Crystallogr., 54; 964) (Source: Protein Data Bank: 1QKE.pdb)

Fig. 10.13 Structure of Calmodulin (“E-F Hand” Structural Motif) (Ref: Wilson, M.A. & Brunger, A.T. (2000) J Mol Biol., 301; 1237) (Source: Protein Data Bank: 1EXR.pdb)

Fig. 10.14 Structure of Zif268-DNA Complex (“Zinc-finger” Structural Motif) (Ref: Pavletich, N.P. & Pabo, C.O. (1991) Science, 252; 809) (Source: Protein Data Bank: 1ZAA.pdb)

10.4 APPLICATION OF STRUCTURE PREDICTION PROGRAMS
The stated goal of structural genomics and proteomics is to generate a set of structures representative of most of the possible motifs, profiles and folds of proteins, and then to solve the structures of new proteins on the basis of known motifs, profiles and other fold-structure relationships. Many empirical procedures, programs, and structure prediction and model-building packages are available: single and multiple-sequence alignment, protein family classification, inverted protein prediction, and pattern recognition protocols, to name a few (see Chapter 12). Sequence comparison and database searching are the pre-eminent approaches in these methods (see Chapter 11). Protein family classification provides an effective means of understanding structure and function. Fold recognition (e.g. GenTHREADER) has become an important approach to the protein structure prediction problem. The complexity of the structure prediction problem can be reduced in certain classes of proteins, where structural (geometrical) constraints strongly influence the tertiary folding interactions. Such classes of proteins are immunoglobulins, disulfide-containing proteins, metallo-proteins (EF-hand, zinc-finger motifs) and other classes of proteins with distinctly identifiable motifs (e.g. helix-turn-helix, leucine-zipper). In disulfide-containing proteins, the S–S bridges impose natural constraints and influence the folding. S–S bridge moieties contain both secondary and tertiary structure features, and such S–S moieties help in minimizing the cooperative processes in tertiary structure folding. In these classes of proteins, the structure prediction problem can be simplified by incorporating (i) ‘knowledge’ governing the packing interactions between and among the various structural elements, and (ii) ‘knowledge’ about the structural motifs, their hierarchies, and the roles of S–S bridges in these structures. Artificial neural network (ANN) procedures are increasingly applied to gene recognition, secondary structure prediction, protein family classification and molecular design. The principles are based on an analogy with the functioning of biological neural networks, with interfaced networks of inputs (dendrites), processing algorithms (soma), and outputs (axons) (Fig. 10.15). In the neural network approach, computer programs are trained to recognize amino acid sequence patterns that are located in known structures and to distinguish these patterns from other patterns not located in these structures.

Fig. 10.15 Flowchart of Neural Network Protocols in Bioinformatics
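Before a neural network can be trained on sequence patterns, each residue window has to be turned into numbers. One common choice is one-hot (binary) encoding of a window centered on the residue to be classified. The sketch below shows only this input step, with no training; the 13-residue default window, the '-' spacer symbol for positions beyond the chain ends, and the function names are assumptions made for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus '-' as an end spacer


def one_hot(symbol: str) -> list:
    """Binary vector with a single 1 at the position of `symbol`."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(symbol)] = 1
    return vector


def encode_windows(sequence: str, window: int = 13) -> list:
    """Return one input vector per residue: the concatenated one-hot codes
    of the window centered on that residue (padded with '-' at the ends)."""
    seq = sequence.upper()
    half = window // 2
    padded = "-" * half + seq + "-" * half
    encoded = []
    for i in range(len(seq)):
        vector = []
        for symbol in padded[i:i + window]:
            vector.extend(one_hot(symbol))
        encoded.append(vector)
    return encoded


if __name__ == "__main__":
    inputs = encode_windows("ACDEFGHIK", window=5)
    print(len(inputs), len(inputs[0]))  # prints 9 105
```

Each vector would be presented to the input layer of the network, and the output units (for example one each for helix, sheet and coil) would be compared with the known state of the central residue during training.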

EXERCISE MODULES
1. Which physical techniques are employed to elucidate the 3-D structures of macromolecules?
2. Why are structure prediction methods an alternative route to experimental methods?
3. What are the protein folding mechanisms?
4. What are the differences between fibrous and globular proteins?
5. What are the general structural characteristics of fibrous molecules?
6. Why is the structure prediction of fibrous proteins simpler than that of globular proteins?
7. What are the salient features of collagen group proteins?
8. Try to build the coiled-coil structure of an α-keratin protein from the amino acid sequence and the hydrophobic propensity of amino acids.
9. Why are the protein folding rules more complex in the case of globular proteins?
10. What are the constituents of secondary structure in proteins?
11. Which physicochemical parameters assist in the prediction of secondary structure?
12. Why are the secondary structure prediction methods not that reliable?
13. Model the secondary structure of a globular protein from its amino acid sequence data (obtain the sequence data from a Web site).
14. What are the various classes of folding units in proteins?
15. In which cases can the complexity of structure prediction be minimized, and why and how?
16. Download proteins of various classes (all α-, all β-, α/β, zinc-finger, helix-turn-helix) and observe their domains, topologies and structures.

BIBLIOGRAPHY
1. Attwood, T.K. & Parry-Smith, D.J. (2002), Pearson (Education): Delhi. “Introduction to Bioinformatics”.
2. Bajorath, J., Stenkamp, R. & Aruffo, A. (1993), Protein Sci., 2; 1798. “Knowledge-based model building of proteins: concepts and examples”.
3. Baldwin, R.L. (1989), TIBS, 14; 291. “How does protein folding get started?”
4. Barton, G.J. (1995), Curr Opin Struct Biol., 5(3); 372. “Protein secondary structure prediction”.
5. Berman, H.M., et al. (2000), Nucleic Acids Res., 28; 235. “Protein Data Bank”.
6. Blundell, T.L., et al. (1987), Nature, 326; 347. “Knowledge-based prediction of protein structures and the design of novel molecules”.
7. Bork, P. & Koonin, E.V. (1996), Curr Opin Struct Biol., 6(3); 366. “Protein sequence motifs”.
8. Bowie, J.U. & Eisenberg, D. (1993), Curr Opin Struct Biol., 3; 437. “Inverted protein prediction”.
9. Branden, C. & Tooze, J. (1999), Garland Press: New York. “Introduction to Protein Structure”, 2nd Edn.
10. Burge, C. & Karlin, S. (1997), J Mol Biol., 268; 78–94. “Prediction of gene structures in human genomic DNA”.
11. Chothia, C. (1984), Annu Rev Biochem., 53; 537. “Principles that determine the structure of proteins”.
12. Chothia, C. & Lesk, A.M. (1986), EMBO J., 5; 823. “The relation between the divergence of sequence and structure in proteins”.
13. Chou, P.Y. & Fasman, G.D. (1978), Annu Rev Biochem., 47; 251. “Empirical predictions of protein conformation”.
14. Creighton, T.E. (1992), Freeman Press: New York. “Protein Folding”.
15. Creighton, T.E. (1993), Freeman Press: New York. “Proteins: Structures and Molecular Properties”, 2nd Edn.
16. Cuff, J.A. & Barton, G.J. (2000), Proteins, 40; 502. “Application of multiple sequence alignment profiles to improve protein secondary structure prediction”.
17. Danchin, A. (1999), Curr Opin Struct Biol., 9(3); 363. “From protein sequence to function”.
18. Dobson, C.M. & Karplus, M. (1999), Curr Opin Struct Biol., 9(1); 92. “The fundamentals of protein folding: bringing together theory and experiment”.
19. Dubchak, I., Holbrook, S.R. & Kim, S-H. (1993), Proteins, 16; 79. “Prediction of protein folding class from amino acid composition”.
20. Eisenhaber, F., Persson, B. & Argos, P. (1995), Crit Rev Biochem Mol Biol., 30; 1. “Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence”.
21. Engel, J. (1991), Curr Opin Cell Biol., 3; 779. “Common structural motifs in proteins of the extracellular matrix”.
22. Englander, S.W. (1993), Science, 262; 848. “In pursuit of protein folding”.
23. Farber, G. & Petsko, G.A. (1990), TIBS, 15; 228. “The evolution of α/β barrel enzymes”.
24. Fasman, G.D. (Ed). (1990), Plenum Press: New York. “Prediction of Protein Structure and the Principles of Protein Conformation”.
25. Frishman, D. & Argos, P. (1995), Proteins, 23; 566. “Knowledge-based protein secondary structure assignment”.
26. Garnier, J., Gibrat, J.F. & Robson, B. (1996), Methods Enzymol., 266; 540. “GOR method for predicting protein secondary structure from amino acid sequence”.
27. Gribskov, M. & Veretnik, S. (1996), Methods Enzymol., 266; 198. “Identification of sequence pattern with profile analysis”.
28. Guex, N., Diemand, A. & Peitsch, M.C. (1999), Trends Biochem Sci., 24; 364–67. “Protein modeling for all”.
29. Hadley, C. & Jones, D.T. (1999), Struct Fold Desn., 7; 1099–1112. “A systematic comparison of protein structure classifications: SCOP, CATH and FSSP”.

30. Janin, J. & Chothia, C. (1980), J Mol Biol., 143; 95. “Packing of α-helices onto β-pleated sheets and the anatomy of α/β-proteins”.
31. Johnson, M.S. & Overington, J.P. (1993), J Mol Biol., 233; 716. “A structural basis for sequence comparison”.
32. Jones, D.T. (1999), J Mol Biol., 292; 195–202. “Protein secondary structure prediction based on position-specific scoring matrices”.
33. Kleywegt, G.J. (1999), J Mol Biol., 285; 1887–97. “Recognition of spatial motifs in protein structures”.
34. Lesk, A.M. (1991), IRL Press: London. “Protein Architecture: A Practical Approach”.
35. Lesk, A.M. & Chothia, C. (1980), J Mol Biol., 136; 225. “How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of globins”.
36. Levitt, M. & Chothia, C. (1976), Nature, 261; 552. “Structural patterns in globular proteins”.
37. Lilley, D.M.J. (Ed) (1995), IRL Press: Oxford. “DNA-Protein: Structural Interactions”.
38. Lohman, R., Schneider, G. & Behrens, D. (1994), Protein Sci., 3; 1597. “A neural network model for the prediction of membrane-spanning amino acid sequences”.
39. Lupas, A. (1996), Methods Enzymol., 266; 513. “Prediction and analysis of coiled-coil structure”.
40. Luthy, R., Bowie, J.U. & Eisenberg, D. (1992), Nature, 356; 83. “Assessment of protein models with three-dimensional profiles”.
41. Merz, K.M. & Le Grand, S.M. (1994), Birkhauser: Boston/MA. “The Protein Folding Problem and Tertiary Structure Prediction”.
42. Moult, J. (1999), Curr Opin Biotech., 10(6); 583. “Predicting protein three-dimensional structures”.
43. Mount, D.W. (2001), Cold Spring Harbor Lab Press: New York. “Bioinformatics: Sequence and Genome Analysis”.
44. Narayanan, P. (2003), New Age Intl Pubs: New Delhi. “Essentials of Biophysics” (2nd Print).
45. Narayanan, P. & Lala, K. (1992), Life Sciences, 50; 683. “Prediction of tertiary structure in ‘scorpion toxin’ type structures”.
46. Orengo, C.A., et al. (1994), Nature, 372; 631. “Protein superfamilies and domain superfolds”.
47. Orengo, C.A., et al. (1994), Curr Opin Struct Biol., 4(3); 429. “Classification of protein folds”.
48. Perutz, M.F. (1991), W.H. Freeman: New York. “Protein Structure and Function”.
49. Qian, N. & Sejnowski, T.J. (1988), J Mol Biol., 202; 865. “Predicting the secondary structure of globular proteins using neural network models”.
50. Richardson, J.S. (1985), Methods Enzymol., 115; 349. “Describing patterns of protein tertiary structure”.
51. Ripley, B.D. (1996), Cambridge University Press: Cambridge. “Pattern Recognition and Neural Networks”.
52. Rost, B., Schneider, R. & Sander, C. (1997), J Mol Biol., 270; 471. “Protein fold recognition by prediction-based threading”.
53. Struhl, K. (1989), TIBS, 14; 137. “Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eucaryotic transcriptional regulatory proteins”.
54. Sun, Z. & Jiang, B. (1996), J Prot Chem., 15; 675. “Conformation of commonly occurring super-secondary structures (basic motifs) in protein databank”.
55. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999), Curr Opin Chem Biol., 3(5); 548. “Evolution of protein function, from a structural perspective”.
56. Vriend, G. & Sander, C. (1991), Proteins, 11; 52. “Detection of common three-dimensional structures in proteins”.
57. Wu, C.H. & McLarty, J.W. (2000), Elsevier: Amsterdam. “Neural Networks and Genome Informatics”.
58. Zuker, M. (2000), Curr Opin Struct Biol., 10; 303. “Calculating nucleic acid secondary structure”.


Section IV

Database Search, Analysis and Modeling (Bioinformatics-III)


11 Database Search
The first step towards protein structure prediction, starting from the beginning, is to isolate and purify the desired protein, say “MyProtein”, by experimental methods. The next step is to determine the amino acid sequence of “MyProtein”, either by gene sequencing methods or by physicochemical methods (refer to Chapters 4, 5 & 6). Once the primary structure of “MyProtein” has been determined experimentally, the rest of the protocols (database searches for sequence similarity, alignment, analysis and modeling) are all computational procedures. Databases are of several types: (1) Primary databases contain one principal kind of information, such as gene sequence data. (2) Secondary databases contain information derived from other databases, such as sequence alignments (e.g. motifs, profiles, domains). (3) Knowledge databases contain structural and functional information from many sources (e.g. hydrophobicity, pH, active sites etc.).

11.1 PRIMARY STRUCTURE (SEQUENCE)
Both genetic methods (gene selection and amplification) and physicochemical methods (chromatography and electrophoresis) can be used for isolation and purification of the desired protein, “MyProtein”. With the advent of gene cloning and polymerase chain reaction (PCR) techniques, it is possible to purify defined fragments of DNA in large quantities. Beginning with a single molecule of DNA, the PCR method can generate millions of copies of the DNA fragment in a short span of time. Wherever feasible, gene-cloning methods can be employed to obtain large quantities of homogeneous protein more quickly. The primary structure (amino acid sequence data) can be obtained either by protein sequencing or by gene sequencing methods (see Fig. 1.2 of Chapter 1). Laser-activated fluorescence technology has increased the speed of gene sequencing methods, and the amino acid sequence of a protein can be inferred from its gene sequence (refer to Chapter 4). However, inferring the amino acid sequence from the gene sequence has some pitfalls and ambiguities, which should be kept in mind. Some of the factors leading to these ambiguities are:
1. A set of three contiguous nucleotides (codon) codes for an amino acid. Any frame-shift error in reading the gene sequence would lead to a wrongly inferred amino acid sequence.

152 Bioinformatics: A Primer

2. Genetic code is degenerate as in most cases more than one codon code for the same amino acid (see Chapter 4). Therefore, there are ambiguities in such cases. 3. Genomic DNA sequences contain assortment of data types (e.g. untranslated sequences, coding and non-coding, transcription and translation regions). In addition, information gathered at the mRNA level fails to represent the changes occurring at the protein level. This is due to numerous regulation mechanisms in place during protein expression and post-expression.
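The first two pitfalls can be made concrete with a short sketch. The Python code below is an illustration only: the example sequence and the function names are assumptions introduced here, and the codon table is the standard genetic code. It translates the same DNA in two reading frames to show the effect of a frame shift, and lists the synonymous codons of an amino acid to show degeneracy.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA starting at offset `frame` (0, 1 or 2); incomplete codons are dropped."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def synonymous_codons(amino_acid):
    """Return every codon that codes for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


# Hypothetical example gene, chosen only for illustration.
gene = "ATGGCCATTGTAATGGGCCGCTGA"
print(translate(gene, 0))      # correct reading frame
print(translate(gene, 1))      # same DNA read one base out of frame
print(synonymous_codons("L"))  # leucine is encoded by six codons
```

Reading the same bases one position out of frame gives an unrelated peptide with a premature stop codon, which is why the reading frame must be established before a protein sequence is inferred.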

11.2 DATABASES

Database search and analysis fall into two categories. (1) Genomics analysis includes analysis of nucleic acid composition, restriction enzyme cleavage sites, transcription factors, promoter sites, secondary structure and sequence similarity searches. (2) Proteomics analysis includes determination of amino acid composition, sequence alignment, phylogenetic analysis, sequence similarity searches, and prediction of secondary structure, motifs, profiles, domains and tertiary structure (Fig. 11.1).

Fig. 11.1 Organization of Biological Databases

Searching sequence databases is one of the most common tasks performed on a newly discovered protein or nucleic acid. The search is used to find out (i) whether the sequence is already in a database, (ii) if it is new, what its structure (secondary and tertiary) and function are likely to be, and (iii) whether it contains active sites, substrate-binding sites etc. Databases are effectively electronic filing cabinets, a convenient and efficient means of storing vast amounts of information.

The process of database search starts with the retrieval of sequences similar to that of “MyProtein” for further analysis. For example, if “MyProtein” has been purified from snake venom, it could be a toxin, a lipase, a phosphodiesterase or another of the proteins found in snake venoms. Depending on the class of protein and the number of amino acids in “MyProtein”, a database search can be initiated to obtain sequences of neurotoxins, cytotoxins, or other classes of proteins from various species. This can be done with various approaches and programs, and from various genomic and proteomic databases. Most of the databases (databanks) are Web-based. Other sources are journals, authors, research groups, and institutions.


11.2.1 Search Sites and Search Engines

WWW search engines are good tools to get started from the Web sites. Some of the Web sites and engines are:

• Altavista
• Google
• Yahoo
• Infoseek
• Medline
• Meta Crawler
• Web Crawler
• Research Index
• Shared What’s New Tool
• Pedro’s Biomolecular Research Tools
• GAC: (http://compbio.ornl.gov/gac/index.shtml/). Genome Annotation Consortium.
• COG: (http://www.ncbi.nlm.nih.gov/cog/). A gene classification system: clusters of orthologous groups.
• DDBJ: (http://www.ddbj.nig.ac.jp/). DNA Databank of Japan. A nucleic acid database.
• DSSP: Database of secondary structures of proteins from PDB.
• EBI: (http://www.ebi.ac.uk/). European Bioinformatics Institute; UK, an outstation of the EMBL.
• EMBL: (http://www.ebi.ac.uk/). European Molecular Biology Laboratory; Germany.
• ExPASy: (http://www.expasy.ch/). Expert Protein Analysis System, a molecular biology server in Switzerland, with SWISS-PROT, PROSITE, 2D-PAGE, and other proteomics tools. Key site for protein sequence and structure information.
• GDB: (http://www.gdb.org). The genome databank.
• GenBank: (http://www.ncbi.nlm.nih.gov/Web/GenBank/). GenBank, the genetic sequence database of the National Institutes of Health (NIH, USA), is an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database, which comprises the DNA Databank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL, Germany) and GenBank at NCBI, USA.
• GenomeNet: (http://www.genome.ad.j/). Genome database, Japan.
• GOLD: Genomes On-Line Database. Provides a list of all genome projects worldwide.
• GRAIL: (http://compbio.ornl.gov/Grail-1.3/). Gene Recognition and Assembly Internet Link software. A suite of tools designed to provide analysis and putative annotation of DNA sequences, both interactively and through automated computation.

• GSDB: (http://www.seqim.ncgr.org/). Genome sequence database, USA.
• HGP: (http://www.ornl.gov/TechResources/HumanGenome/). Human Genome Project, USA.
• NCBI: (http://www.ncbi.nlm.nih.gov/). National Center for Biotechnology Information; NIH, USA.
• NDB: Nucleic acid structure database.
• NRL 3D: Sequence/structure database derived from PDB (at Johns Hopkins University, Baltimore, USA).
• OMIM: (http://www3.ncbi.nlm.nih-gov/Omim). On-line Mendelian Inheritance in Man (for human genes and genomics at NCBI).
• PDB: (http://www2.ebi.ac.uk/pdb/index.shtml). Protein Data Bank; Brookhaven National Laboratory; USA.
• PDBFINDER: A database comprising PDB, DSSP and HSSP.
• PEDANT: (http://www2.ebi.ac.uk/pdb/index.shtml). A protein extraction, description and analysis tool.
• Sanger Center: (http://www.sanger.ac.uk/DataSearch/). Genomic sequencing and genomics analysis server (UK).
• SRS: (http://www.srs.hgmp.mrc.ac.uk/). Sequence Retrieval System.
• STACK: (http://www.sanbi.ac.za/Dbases.html). Sequence tag alignment and consensus knowledge database.
• TRANSFAC: (http://transfac.gbf.de/TRANSFAC/index.html). Transcription factor database, for transcription factors and transcription factor-binding sites.
• EMBASE: (http://www.ncbi.embase.com/). Bibliographic index to biomedical and pharmacological literature.
• PUBMED: (http://www.ncbi.nlm.nih.gov/PubMed/). Covers mainly medical literature.
• SWISS-PROT: (http://expasy.hcuge.ch/sprot/sprot-top.html). A protein sequence database (Switzerland).
• TIGR: (http://www.tigr.org/). The Institute for Genomic Research.
• Other sites: (http://www.hgmp.mrc.ac.uk/GenomeWeb/). A list of other sites.
• Journals: Nature–Molecular Biology; Molecular Biology; Proteins; Protein Science; Nucleic Acids Research; Current Opinion in Structural Biology; Bioinformatics.

11.2.2 Sequence Retrieval Programs

Various sequence retrieval programs are available. Some of the programs and sources are:

• BLAST: Basic Local Alignment Search Tool (home page: NCBI, USA). A sequence retrieval and sequence similarity search engine that consists of a suite of programs: BLASTN (nucleotide BLAST), BLASTP (protein BLAST), BLASTX (translated BLAST), PhyloBLAST and PIR-BLAST.
• CLEVER: Command-line ENTREZ version from NCBI. It is an interactive tool for browsing the ENTREZ database using only text input/output.
• ENTREZ: (http://www3.ncbi.nlm.nih.gov/ENTREZ/). ENTREZ is a powerful search engine that is part of the NCBI server. It covers the nucleotide and protein sequences in GenBank as well as Medline. The program allows one to start with only a tentative set of keywords, or a sequence identified in the laboratory, and rapidly retrieves a list of relevant records and related database sequences.
• FASTA: (http://www2.igh.cnrs.fr/bin/fasta-guess.cgi). Sequence retrieval and similarity search database.
• FETCH: FETCH is a sequence retrieval program that retrieves sequences from GenBank and other databases. The program requires the exact locus name or accession number of a sequence.
• NetFETCH: A sequence retrieval program that retrieves sequences from NCBI’s NetENTREZ Web server. Sequences can be retrieved by name or accession number.
• LOOKUP: LOOKUP is a sequence retrieval program that uses SRS (Sequence Retrieval System). It is useful if the accession number is not known but one wishes to download the sequences of all proteins related to the query protein. LOOKUP identifies sequences by name, accession number, keyword, title, reference, feature or date. The output is a list of sequences.
• E-mail Servers: These servers are useful for those who do not have full access to the Internet with a graphical WWW browser. NCBI has an e-mail query service: ([email protected]). The query format is: DATALIB $$ Titles MaxDocs # BEGIN Words. DATALIB and BEGIN are mandatory. $$ = gb (for GenBank); e (for EMBL); sp (for Swiss-Prot). # = number of sequences needed.
• EMBL Get: EMBL sequences can be obtained via e-mail: ([email protected]). The input format is: get nuc: ##; get prot: ##. (## = accession number).

11.3 GENOME DATABASE SEARCH

Database searches can be based either on gene sequence databases (genome informatics) or on protein sequence (proteome) databases. Genome informatics includes (i) functional genomics, which deals with interpretation of the function of nucleotide sequences on a genomic scale, and (ii) structural genomics, which deals with classification and prediction of protein structures from gene sequence data. A vast amount of gene sequence data is available (e.g. from genome sequencing projects). The two main kinds of database widely used for novel gene discovery are high-throughput genomic databases and expressed sequence tag (EST) databases. ESTs are single-pass, partial sequences of 50-500 nucleotides derived from cDNA libraries. They provide a direct window onto the expressed genome. EST sequences are generated by the shotgun sequencing method. The sequencing is random, so a sequence can be generated several times and can be inaccurate.


The major nucleic acid sequence databases are:

• COMPEC: (http://compec.bionet.nsc.ru/). A database that contains protein-DNA and protein-protein interactions for composite regulatory elements.
• EID: (http://golgi.harvard.edu/gilbert/eid/). An Exon-Intron database, derived from GenBank.
• MAGPIE: Multipurpose Automated Genome Interpretation Environment. It is a genome analysis and annotation system that adds graphical representation to the results.
• dbEST: Database of Expressed Sequence Tags (EST) at NCBI.
• DDBJ: Nucleotide database, Japan.
• EMBL: Information can be retrieved from EMBL using the SRS system.
• GenBank: DNA database from NCBI. Information can be retrieved using Entrez. GenBank is available via FTP.
• GSDB: The Genome Sequence Database, from the National Center for Genome Resources.
• MEDLINE: The facility provides abstracts from the original published articles.

11.3.1 Gene Structure and Gene Sequences

Genome sequence databases are used for codon usage analysis, restriction mapping, identification of coding regions (exons), repeats, translation (protein coding) and motif identification. Pattern recognition methods play an important role in elucidating the location and significance of genes throughout the genome. Genome sequence databases contain an assortment of data types that cannot be treated alike. The raw sequence (base-pair) data is meaningless without analyzing (annotating) the various regions that constitute the sequence. These include untranslated regions (UTRs), coding region sequences (CDRs), introns and exons, ribosome-binding sites, and translational termination sites (see Chapter 9 and Fig. 9.2).

Gene identification in prokaryotes is simplified by the absence of introns. In eukaryotes, the intron-exon and exon-intron splice junctions have to be identified. Intron/exon prediction relies on consensus sequences at the splice junctions, base composition and codon usage. UTRs occur in both DNA and RNA. They are portions of the sequence that flank the coding region sequences (CDRs) but are not themselves translated. Recognition of transcription and translation signals involves prediction of the promoter sites that function in the initiation and termination of transcription and translation (CCAAT box; GC box; TATA box). Translation normally initiates at an ATG codon, and in many eukaryotic promoters transcription starts about 30 base pairs downstream of the TATA box.

In an arbitrary DNA sequence, it is not known whether the first base marks the start of the coding sequence. It is therefore essential to carry out a six-frame translation (three forward and three reverse frames). Thus, for any piece of DNA sequence, the result of a six-frame translation is six potential protein sequences. The simplest method of finding DNA sequences that code for proteins is to search for open reading frames (ORFs). An ORF is normally the longest reading frame uninterrupted by a stop codon (TAA, TAG or TGA). Coding regions (CDRs) can be recognized from (i) sufficient ORF length, (ii) the presence of a flanking Kozak sequence, (iii) patterns of codon usage, and (iv) the presence of ribosome-binding sites (Shine-Dalgarno sequences) upstream of the start codon.

While the coding region is a single open reading frame (ORF) in prokaryotes, eukaryotic genes are commonly organized as exons (coding regions) and introns (non-coding regions), and hence may comprise several disjoint ORFs, and the gene products may be of different lengths. The main task of gene identification in eukaryotic DNA involves coding region recognition (intron/exon discrimination) and splice site detection.

Complete CDRs are rarely sequenced in one reaction. Variable-length, overlapping fragments are therefore aligned in a multiple sequence alignment (sequence assembly) to obtain a consensus sequence. This also minimizes cloning errors. The majority of the DNA sequence data in the databases consists of partial sequences, Expressed Sequence Tags (ESTs), obtained by random sequencing of cDNA copies of cellular mRNA. EST libraries are useful for preliminary identification of genes by database similarity searches. Screening the predicted protein sequences against an EST library confirms the prediction and the expression of the gene. Cloning and sequencing the intact cDNA may then be used for a more detailed analysis. An EST database of an organism can be analyzed for the presence of gene families, orthologs and paralogs. Since gene prediction methods are only partially accurate, partial cDNA copies of expressed genes (ESTs) serve to confirm that a predicted gene is transcribed. ESTs are not only incomplete but also, to a certain degree, inaccurate. When a database search reveals several ESTs, EST analysis protocols are used. These include sequence similarity, sequence assembly (multiple sequence alignment) and sequence clustering algorithms (see Chapter 12).

A large number of regulatory sequences (promoters, enhancers etc.) have been identified and collected into databases.

• EPD: Eukaryotic Promoter Database. Provides a comprehensive compilation of eukaryotic transcriptional sites (promoters).
• FINDPATTERNS: Searches DNA sequences for the occurrence of transcription initiation sites.
• FRAMES: Open reading frames (ORFs).
• GEMS: (Genomatrix, Germany). Provides an output of mono-, di-, and trinucleotide frequencies.
• GSDB: Genome sequence database (from GenBank).
• HTD: Human Transcription Database. It provides information related to RNA molecules that have been sequenced.
• MAP: (ORFs).
• Mat Inspector: (uses TransFac Database).
• Model-it: Produces pictures of DNA (needs RasMol program in *.pdf file).
• ORF Finder: (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Provides ORFs in a sequence as colored bars (NIH-NCBI, USA).
• Plotorf: Plotting of open reading frames (ORFs). It displays ORFs in a sequence, defined as the longest sequences starting with a “start” codon (usually ATG) and ending with a “stop” codon. The largest ORFs are likely to be true genes.
• Promoter Scan: Searches for promoter sites (NIHBIMS, USA).
• Protein Back translation: Gives DNA sequence from protein sequence.
• Reverse Translate a Protein: Gives DNA sequence from protein sequence.
• Signal Scan: Searches for promoter sites (NIHBIMS, USA).
• TESS (USA): Transcription Element Search Software.
• TransFac (Germany): Transcription Factor Database.
• TRRD: (http://www.mgs.bronet.nsc.ru/mgs/dbases). Transcription Regulatory Regions Database. Provides information about DNA regulatory regions (DNA-protein interactions).
• VecScreen: Screens the DNA sequence for potential vector sequence (NCBI, USA).
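The six-frame translation and ORF search described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of ORF Finder, Plotorf or any other tool listed above: the function names and the test sequence are hypothetical, an ORF is taken to run from the first ATG to the next in-frame stop codon, and only the standard genetic code is used.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon (TAA, TAG, TGA).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; a trailing incomplete codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the six potential protein sequences keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, protein) for the longest ATG-to-stop stretch in any frame."""
    best = ("", "")
    for frame, protein in six_frame_translation(dna).items():
        # Every piece except the last one is terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best


# Hypothetical test sequence.
print(six_frame_translation("CCATGGCTAAATTTGGGTAGCC"))
print(longest_orf("CCATGGCTAAATTTGGGTAGCC"))
```

A real gene finder would add the other criteria named in the text (minimum ORF length, Kozak or Shine-Dalgarno context, codon usage) before accepting an ORF as a coding region.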

Coding regions (CDRs) can be found by searching with tools such as:

• CODONPREFERENCE: Statistical algorithm that measures codon usage. Identifies protein-coding sequences.
• Codon Usage (USA): Analysis of different ORFs in a gene sequence.
• Frame Plot (Japan): Permits selection of the maximum size of the ORF and of the start codon, which can be used in similarity searches.
• GENLANG: Pattern recognition program.
• GenMark: Provides a family of programs for ORF analysis.
• GRAIL: An ORF identification tool. Provides analysis of protein coding potential of a DNA sequence.
• REPuter (Germany): Provides maximal repeats in complete genomes.
• TESTCODE
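Codon usage, on which several of the tools above rely, is at its simplest a table of codon frequencies in a known reading frame. The sketch below is an assumption-laden illustration rather than the statistic used by CODONPREFERENCE or any other listed program; the function name and the example sequence are hypothetical.

```python
from collections import Counter


def codon_usage(orf):
    """Relative frequency of each codon in an in-frame coding sequence."""
    orf = orf.upper()
    # Split into consecutive triplets; a trailing incomplete codon is ignored.
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical ORF: ATG CTG CTG CTA TAA
usage = codon_usage("ATGCTGCTGCTATAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

Comparing such frequencies for a candidate ORF with the frequencies observed in known genes of the same organism is one way to judge whether the ORF is likely to be protein-coding.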

Genome sequence comparison can be carried out by several methods.

• Core Genes: Determines core set of genes.
• Gene Builder: Gene Builder is a versatile, multi-module operation system. Each module is executed independently, with various options. Some of the modules with options are:
  1. Mode: There are two options, GENE and EXON. The GENE option is used for predicting the full gene model. Potential coding fragments (PCFs) are used to construct the gene models with the maximum coding potential by dynamic programming. The EXON option is used for selecting only the exons with the best scores. The EXON option can be useful for long genomic sequences with an unknown number of potential genes, since there is very little over-prediction.
  2. Sequencing error correction: This option corrects potential sequencing errors due to frame-shifts and substitutions in the “stop” codons. The predicted gene model can be substantially improved if these errors are eliminated.
  3. Splice sites: Classification analyses combined with the weight-matrix method are used for splice site prediction.
  4. Potential coding regions: Potential coding regions are found by combining the protein coding potential, calculated by using the dicodon statistic, and the splicing signals. With the “All” option, all potential coding exons are used for gene construction. The “All” option is useful as the first step of sequence analysis when no information about gene content is available for a query sequence. With the “Good” and “Excellent” options, only the exons of “good” and “excellent” quality are used for gene prediction. If the “Protein similarity” option is selected, only the exons with similarity to a selected homologous protein are used for gene reconstruction.
  5. First and last coding exons: This option is useful where several genes are present in a query sequence and the gene location can be confirmed by using homology with a chosen protein.
  6. EST mapping: A homology search is performed against the EST database, and the position of the homologous EST sequences in relation to the query sequence is given in the output.
• GeneFinder: The algorithm first predicts all possible potential exons, and then uses dynamic programming to search for the optimal combination of these exons and construct a gene model.
• GenHacker (Japan): Predicts gene structure in microbial genomes using a hidden Markov model (HMM).
• GenScan: (http://genes.nit.edu/genscan.html/). GenScan is a general-purpose gene identification program. It determines the most likely gene structure for each sequence, based on probabilistic models. The score of a predicted feature (e.g. an exon) is a log-odds measure of the quality of the feature based on local sequence properties (refer to Chapter 12 for details). Forward and backward recursions are performed that allow determination of the most likely gene structure in the sequence and the probability of each exon, together with the corresponding predicted amino acid sequences.
• Pairwise FLAG: Alignment of small genomes (gigabases). Performs local alignment for two different DNA sequences.
• PipMaker: DNA sequence alignment tool.
• SCAN2: Provides color-coded graphical alignment of genome-length DNAs.
• VISTA: Visualization tools for alignments. Allows alignment of two genome-length sequences.
• WebGene: A multi-package system for gene structure analysis and prediction.
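The idea of predicting candidate exons first and then choosing the best combination by dynamic programming, as described for Gene Builder and GeneFinder, can be sketched as follows. This is a deliberately simplified model and not the algorithm of either program: exons are hypothetical (start, end, score) triples with integer, inclusive coordinates, the only constraint is that chosen exons must not overlap, and reading-frame compatibility and splice signals are ignored.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick non-overlapping exons with the highest total score.

    exons: list of (start, end, score) tuples, coordinates inclusive.
    Returns (total_score, chain) where chain is ordered along the sequence.
    """
    exons = sorted(exons, key=lambda e: e[1])   # order by end coordinate
    ends = [e[1] for e in exons]
    best = [(0.0, [])]                          # best[i]: optimum over the first i exons
    for i, (start, end, score) in enumerate(exons):
        # k = number of earlier exons that end strictly before this one starts
        k = bisect_right(ends, start - 1, 0, i)
        take_score = best[k][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[k][1] + [(start, end, score)]))
        else:
            best.append(best[i])                # skipping this exon is at least as good
    return best[-1]


# Hypothetical candidate exons with made-up scores.
candidates = [(1, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 300, 6.0)]
print(best_exon_chain(candidates))
```

Sorting by end coordinate lets each step reuse the optimum already computed for all exons that finish before the current one begins, which is what makes the search efficient compared with trying every combination.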

11.4 PROTEIN DATABASE SEARCH

Protein sequence similarity searches are more sensitive than DNA sequence similarity searches, because:

1. DNA has only four bases, as compared to twenty amino acids.
2. Pair-wise comparison of DNA bases is scored as “match” or “mismatch”, whereas two amino acids can share varying degrees of similarity, based on their physicochemical properties.
3. Proteins have database information at various levels (primary, secondary and tertiary level databases).

For phylogenetic analysis, however, DNA sequence is better suited, because:

(i) The pattern of mutations, insertions and deletions at the nucleotide level is definitive.
(ii) Silent mutations, that is, mutations at the DNA level that do not result in an amino acid substitution at the protein level because of the redundancy of the genetic code, are visible only in the DNA sequence.

Many reputed genome databases (EBI; NCBI; SWISS-PROT) also have protein databases. Some others are:

• CluSTr: (http://www.ebi.ac.uk/clustr). A database from the SWISS-PROT and TrEMBL protein databanks. It can be used for (i) search for new protein families, (ii) annotation of newly sequenced proteins, (iii) prediction of functions of new proteins, and (iv) proteome analysis.
• DIP: (http://dip.doe-mbi.ucla.edu/). Database of Interacting Proteins. Contains information on protein-protein interactions. Gives the experimental methods used for determining interactions.
• HSSP: (http://www.sander.embl-heidelberg.de/hssp/). A database of homology-derived secondary structure of proteins (EMBL, Germany).
• MIPS: (Martinsried Institute for Proteins Sequence). European partner of PIR for genomic and protein data analysis.
• MMDB: (http://www.ncbi.nlm.nih.gov/Structure/). Molecular modeling database. An NCBI source that contains all the experimentally determined 3-D structural data in the PDB.
• NRL-3D: Produced by PIR from sequences in the PDB database.
• PDB: (Protein Data Bank, Brookhaven, USA).
• PIR: (http://www_nbrf.georgetown.edu/pir/). (Protein Information Resource, USA). PIR produces the protein sequence database (PSD) of functionally annotated protein sequences.
• PIR-NREF: A comprehensive database for sequence searching and protein identification.
• PIR-PSD: (http://www_nbrf.georgetown.edu/pir/). (Protein Information Resource, USA). PIR produces the protein sequence database (PSD) of functionally annotated protein sequences.
• SRS: Sequence Retrieval System.
• SWISS-PROT: (http://expasy.hcuge.ch/sprot/sprot-top.html). A database of protein sequences and structures, translated from the EMBL genomic database.
• TrEMBL: (http://www.ebi.ac.uk). TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequences.
• UCL: (University College, London). Protein sequence and structure analysis.

11.4.1 Sequence Similarity Search

BLAST, FASTA, and other search tools are used to carry out sequence similarity searches. Some of these servers are:

• BLAST: (http://www.ncbi.nlm.nih.gov/blast3.html/). A suite of Basic Local Alignment Search Tools (NCBI, USA) for sequence similarity search.
• CINEMA: A Color Interactive Editor for Multiple Alignments for nucleic acids and proteins.
• CLUSTAL: Multiple sequence alignment tool based on clustering algorithms.
• Consensus: Takes the output of CLUSTAL or other multiple sequence alignment programs and calculates the consensus sequence.
• DiAlign: Constructs pair-wise and multiple alignments by comparing whole segments of the sequences.
• FASTA: (http://www2.ebi.ac.uk/fasta3/). Sequence similarity search tool.
• LALIGN: A tool for matching two sequences.
• SAPS: Statistical Analysis of Protein Sequences, which includes analysis of amino acid composition, charge, hydrophobicity, transmembrane segments etc.
• Sanger Center: (http://www.sanger.ac.uk/DataSearch/). The Sanger Center for Database Search Services.
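The difference between match/mismatch scoring and graded amino acid similarity, given in Section 11.4 as a reason for the sensitivity of protein searches, can be illustrated with a toy scoring scheme. Everything in this sketch is an assumption made for illustration: the grouping of amino acids by physicochemical character and the scores 2/1/0 are hypothetical and are not the substitution matrices used by BLAST or FASTA.

```python
# Hypothetical grouping of the 20 amino acids by physicochemical character.
GROUPS = {
    "hydrophobic": "AVLIMC",
    "aromatic": "FWY",
    "polar": "STNQ",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def identity_score(a, b):
    """Match/mismatch scoring: 1 for each identical position, 0 otherwise."""
    return sum(1 for x, y in zip(a, b) if x == y)


def similarity_score(a, b, identical=2, similar=1, different=0):
    """Graded scoring for two aligned protein segments of equal length."""
    score = 0
    for x, y in zip(a, b):
        if x == y:
            score += identical
        elif GROUP_OF[x] == GROUP_OF[y]:
            score += similar        # conservative substitution
        else:
            score += different
    return score


# Two hypothetical aligned peptide segments with one identical residue.
print(identity_score("LKDF", "IRDW"))    # identities only
print(similarity_score("LKDF", "IRDW"))  # identities plus conservative changes
```

With identities alone the two peptides look almost unrelated, whereas the graded score registers that three of the four differences are conservative substitutions.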

Various servers and databases are available for sequence similarity search and phylogenetic analysis. Some of them are:

• DISTANCES: Calculates pair-wise distances between groups of sequences for phylogenetic analysis.
• PARSIMONY: A cladistic algorithm for constructing ancestral relationships.
• PAUP: (http://www.lms.si.edu/PAUP/). Phylogenetic Analysis Using Parsimony.
• PHYLIP: A collection of many programs: UPGMA, parsimony, neighbor-joining and maximum likelihood algorithms.
• PhyloBLAST: Compares the query protein to the SWISS-PROT/TrEMBL database and carries out phylogenetic analysis.
• PILEUP: PILEUP uses UPGMA to create a dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.
• Tree base & Tree view: For graphical representation of phylogenetic trees.
• Tree Gen: Tree generation from distance data.
• UPGMA: Unweighted Pair Group Method, a clustering algorithm using arithmetic averages.

11.4.2 Secondary Structure Search

There are many secondary structure prediction tools/servers available. The prediction programs rely on the propensity parameters of amino acids and their physicochemical characteristics, such as hydrophobicity, charge and solvation.

• BCM Search Launcher: Provides access to a large collection of secondary structure prediction tools.
• CoDe: Secondary structure consensus prediction.
• DAS: Transmembrane prediction server (Sweden).
• DSC: (http://www.bmm.icnet.uk/dsc/). A linear discriminator secondary structure prediction program.
• JPRED: Consensus method of secondary structure prediction.
• PHDsec: Prediction of secondary structure (EMBL, Germany).
• PHDhtm: Transmembrane helix location prediction and topology.
• PREDATOR: (http://www.embl-heidelberg.de/). A program that can predict the secondary structure of a single sequence or of a number of related sequences.
• PROFsec: Secondary structure prediction server.
• ProScale: Predicts hydrophobicity, α-helix, β-sheet and other features.
• PSA (Boston, USA): Protein sequence analysis database with a protein secondary structure prediction server.
• PSI-Pred (UK): PSI-BLAST profiles.
• ReDe (USA): Transmembrane prediction server.
• SSCP: Prediction based on amino acid composition input information.
• SSPRED: Prediction of secondary structure from SWISS-PROT database.
• TMHMM: Prediction of transmembrane helices in proteins (Denmark).
• TMPred: Prediction of transmembrane regions and orientation (ISREC, Switzerland).
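As an illustration of how a physicochemical parameter such as hydrophobicity feeds into prediction, the sketch below computes a sliding-window hydropathy profile. It assumes the Kyte-Doolittle hydropathy scale and a window of 7 residues, neither of which is named in the text, and it is not the method of any particular server listed above; the function name and test peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Mean hydropathy of every window of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic stretch followed by an acidic stretch.
profile = hydropathy_profile("LLLLLLLDDDDDDD")
print([round(v, 2) for v in profile])
```

Sustained runs of high values in such a profile are the kind of signal that transmembrane helix predictors look for, while the actual servers combine several parameters and trained models.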


11.4.3 Motifs, Domains and Profiles Search

Many proteins are organized into structural motifs (super-secondary structures) and domains that are highly conserved. Profiles are mathematical representations of conserved regions, encompassing domain alignments. Some of the motif, domain and profile servers are:

• BLOCKS: (http://www.ebi.ac.uk). Search tool of motifs and protein classification.
• CDD-Search: Conserved domain database search.
• COILSCAN: Identifies coiled-coil regions.
• HTHSCAN: Helix-turn-helix motifs scan.
• InterPro Search: Meta site profile scan server (USA).
• Leucine Zippers: Leucine-zipper motifs scan.
• MAST: Motif Alignment and Search Tool for searching sequence databases for sequences that contain one or more groups of motifs.
• MeMe (USA): Motif elicitation tool.
• MOTIF: Meta site motif search (Japan).
• PFAM: (http://www.sanger.ac.uk/pfam). Encodes sequence conservation within aligned families.
• PRINTS: (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS). Search of motifs and protein classification.
• PRODOM: A protein domain database (France).
• PROFILES: A profile search database.
• Profile Scan: Meta site profile scan server (ISREC, Switzerland) that searches a sequence against a library of profiles.
• PROSITE: (http://www2.ebi.ac.uk/ppsearch/). Secondary structure database from EBI. It is the best starting point for motif search.
• SMART: (http://smart.emol-heidelberg.de/). Genetically mobile protein domain search server.
• TOPITS: Fold recognition by prediction-based threading.

11.4.4 Pattern Recognition Search

Pattern recognition programs follow the reverse process of sequence analysis. Rather than predict how a sequence will fold, they predict how well a fold will match a sequence. Some of the pattern recognition servers are:

• BLASTPAT: BLAST-based patterns database search.
• EPAT: Pattern search (for PDB; SWISS-PROT; PIR databases).
• FASTAPAT: FASTA-based patterns database search.
• FINDPATTERNS: A pattern recognition tool.
• PRATT: Searches for patterns conserved in a set of protein or nucleic acid sequences. It is able to discover patterns conserved in sites of unaligned protein sequences.
• PatternP
• PATScan: Search for patterns conserved in a set of protein or nucleic acid databases.

11.4.5 Protein Classification

Proteins can be classified under various categories based on their structural similarities. Some of the protein classification databases are:

• ASTRAL: (http://astral.stanford.edu/). Compendium for sequence and structure analysis. It is partially derived from, and augments, the SCOP database. Most of the resources provided here depend upon the coordinate files maintained and distributed by the Protein Data Bank (PDB).
• CATH: (http://www.biochem.ucl.ac.uk/bsm/CATH). The Class, Architecture, Topology & Homology database is a hierarchical domain classification of protein structures.
• GeneFind: An integrated neural network protein classification database.
• iProClass (USA): Integrated protein classification resource that provides comprehensive family relationships and structural/functional features of proteins.
• ProClass: Provides summary description of protein family, structure and function for PIR-PSD, SWISS-PROT and TrEMBL.
• SCOP: Clustering algorithm that provides hierarchical structural classification.

11.4.6 Tertiary Structure Modeling

Tertiary structures are predicted by homology modeling methods. Some of the homology-based modeling databases are:

• 3DinSight: (http://www.rtc.riken.go.jp/jouhou/3dinsight/3DinSight.html) (Japan). An integrated database and search tool for structure, property and function of biomolecules. The structural data, functional data (motifs, mutations, protein-nucleic acid binding, protein-ligand binding etc.) and property data (amino acid properties and thermodynamic data of proteins) of biomolecules are held in a relational database, so that flexible searches can be done by a combination of queries (SQL).
• 3D-JIGSAW (UK): A homology protein-modeling tool.
• 3D-PSSM: Protein fold recognition tool.
• DALI: The server searches the Protein Data Bank (PDB) for structural homologues of a query protein.
• Decipher (USA): A modeling server with a variety of nucleic acid and protein analysis tools.
• FAMS: Fully Automated Modeling System (Japan), based upon the homology modeling method and including a structural optimization process.
• FSSP: Database of families of structurally similar proteins, derived from PDB, at EMBL.
• MODELLER: (http://guitar.rockefeller.edu/modeller/). A 3D-structure modeling server (New York University, USA).
• PDBSum: The database provides summaries of structural analyses of PDB data files.
• Predict Protein: The server (EMBL, Germany) is used to find structural homologues of a query protein sequence.
• Predict Protein Server: (http://www.embl-heidelberg.de/predictprotein/predictprotein.html). A neural network-based prediction server used to find structural homologues of a query protein sequence.
• ProSAL (Sweden): A Meta site for protein analysis and characterization.
• SDSC1: San Diego Supercomputer Center Protein Structure Homology Modeling.
• SWISS MODEL Server: 3D-structure by homology modeling (> 50% homology).
• WHATIF: (http://www.umbi.kun.nl/whaif/). A web interface (EMBL, Germany) that provides tools for examining PDB files.

11.4.7 Knowledge Databases

Knowledge databases contain structural and functional information from many sources (e.g. hydrophobicity, pH, active sites etc.), for mining sequence databases in conjunction with mass spectrometry and other data. Some of these are:

• PeptIdent: Protein identification using pI, Mr, and peptide mass fingerprinting data (at ExPASy).
• ProteinProspector: Proteomics tools for mining sequence databases in conjunction with mass spectrometry.
• ProtParam: Resource for amino acid composition, Mr, pI, and extinction coefficients (at ExPASy).
• ProtScale: Resource for hydrophobicity and other conformational parameters (at ExPASy).
• Prowl: Resource for protein chemistry and mass spectrometry.

EXERCISE MODULES

1. What are the general experimental methods of determining the primary structure of proteins?
2. Name some methods that have enhanced gene sequencing.
3. What are the pitfalls of inferring protein sequences from gene sequences?
4. What is a database?
5. What are the two major categories of database searches?
6. What are search sites and search engines?
7. What is a sequence retrieval program?
8. Comment on genome database search.
9. What are the genome database searches used for?
10. What are the database types obtainable from genome database searches?
11. What are open reading frames (ORFs)?
12. What are coding regions (CDRs), and expressed sequence tags (ESTs)?
13. Why is protein database search more sensitive than gene database search?
14. Name a few programs used in sequence similarity search.
15. What are the secondary structure prediction tools based upon?
16. Define motifs, domains and profiles.
17. What is the basis for tertiary structure prediction?


Data Mining, Analysis and Modeling 169

Data mining and analysis aim at the nontrivial extraction, by computational (in silico) methods, of previously unknown and potentially useful information from data, or at the search for relationships and patterns that exist in databases. The structure prediction of a protein from its amino acid sequence by data mining and analysis procedures depends on the amount of structural information available from various types of databases. Three situations can arise:

1. A high degree of sequence similarity is found, and three-dimensional structure data of homologous protein(s) are available.
2. Poor sequence similarity is found in the database search.
3. No sequence similarity is found in the database search.

If “MyProtein” has a high degree of sequence similarity with the protein sequences from the database search, and if three-dimensional structure(s) of homologous protein(s) in the series is (are) available, then structure prediction of the test protein is relatively simple and reliable, and a fairly accurate tertiary structure model can be generated with computational algorithms, based upon the tertiary structures of the homologous protein(s) (Path 1 of Fig. 12.1). For modeling of (putative) proteins, alignment of sequences is followed by insertions, deletions and replacements in the three-dimensional structure(s) of the homologous protein(s). The initial models are refined using energy minimization and other procedures to give final structures without appreciable steric hindrance. Alternatively, a model can be built on the basis of each available known tertiary structure, and the resulting models tested for packing of side chains, solvent accessibility and other physicochemical parameters. The information from all the known tertiary structures of the homologous family can also be used simultaneously, but selectively, in the modeling procedure. Once a 3-D model of a protein (“MyProtein”) is available, its structure can be viewed from various directions to visualize the tertiary folding, and used towards the rational design of new versions of the protein.

If the sequence similarity between “MyProtein” and the database proteins is poor, more elaborate alignment and structure prediction procedures are required to arrive at plausible model structure(s) (Path 2). The computational procedures include secondary structure prediction, fold recognition, alignment of motifs and profiles, and finally 3D-structure modeling.

If for some reason the database search does not find any proteins with sequence similarity to the “MyProtein” sequence (it could be a new class of protein), secondary and tertiary structure prediction procedures are based solely on statistical methods (Path 3).

The correctness (validity) of the modeled structure(s) should be treated with due caution and skepticism. The real test of a model structure is to crystallize the protein and determine its three-dimensional structure by X-ray diffraction methods. The overall approach towards protein structure prediction, depending on the sequence similarity, is given in Table 12.1.

Table 12.1 Protein Structure Approach depending on Sequence Identity

Sequence similarity    Approach                                   Path taken (Fig. 12.1)
> 80%                  Sequence alignment                         Path 1
50–80%                 Sequence alignment; pair-wise alignment    Path 1 & Path 2
25–50%                 Consensus methods; profile methods         Path 2
< 25%                  De novo structure prediction methods       Path 3
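The decision rule of Table 12.1 can be written as a small lookup. The sketch below is only an illustration: the function name `prediction_approach` is hypothetical, and the handling of the exact boundary values (80%, 50%, 25%) is an assumption, since the table does not say to which row they belong.

```python
def prediction_approach(similarity_percent):
    """Return (approach, path) for a sequence similarity given in percent.

    Boundary handling is an assumption: exactly 80% falls in the 50-80% row,
    exactly 50% in the 50-80% row, and exactly 25% in the 25-50% row.
    """
    if not 0 <= similarity_percent <= 100:
        raise ValueError("similarity must lie between 0 and 100")
    if similarity_percent > 80:
        return ("Sequence alignment", "Path 1")
    if similarity_percent >= 50:
        return ("Sequence alignment; pair-wise alignment", "Path 1 & Path 2")
    if similarity_percent >= 25:
        return ("Consensus methods; profile methods", "Path 2")
    return ("De novo structure prediction methods", "Path 3")


if __name__ == "__main__":
    for value in (92, 63, 31, 12):
        print(value, prediction_approach(value))
```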


12.1 SEQUENCE ALIGNMENT ANALYSIS

Sequence alignment is the process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Sequence similarity analysis is the single most powerful method available for structural and functional inference from databases. It allows the inference of homology between proteins, and homology in turn helps one to infer whether similarity in sequence implies similarity in function.

Methods of analysis can be grouped into two categories: (i) sequence alignment-based search, and (ii) profile-based search. Fundamentally, sequence-based alignment searches are string-matching procedures. A sequence of interest (the query sequence) is compared with sequences (targets) in a databank, either pair-wise (two at a time) or with multiple target sequences, by searching for a series of individual characters. Two sequences are aligned by writing them across a page in two rows. Identical or similar characters are placed in the same column, and non-identical characters can be placed either in the same column as a mismatch or opposite a gap in the other sequence. A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into vertical register.

The objective of sequence alignment analysis is to analyze sequence data to make reliable predictions of protein structure, function and evolution vis-à-vis the three-dimensional structure. Such studies include detection of orthologous (same function in different species) and paralogous (different but related functions within an organism) features. Analysis procedures include various statistical algorithms for sequence alignment, pattern matching and prediction of structure directly from sequence. Sequences that have diverged greatly during evolution cannot be detected by simple sequence similarity search methods. In such cases, computational methods that go beyond simple pair-wise sequence similarity, comprising multiple sequence alignment and profile-matching searches, are tried for meaningful results.

A chain of n amino acid residues can form 20^n different polypeptides, so the problem of protein structure prediction becomes astronomical even for a small protein of 100 amino acids (20^100 is about 10^130). One way of reducing this problem is to rely on statistical methods to search for structural similarities (protein families), based on sequence similarities, using probabilities calculated from the observed frequencies of amino acids in the family classes.

Thus, sequence similarity analysis is the cornerstone of bioinformatics. It is useful for discovering structural, functional and evolutionary information in sequences. Sequence alignments indicate the changes that could have occurred between two homologous sequences and a common ancestral sequence during evolution. Sequence alignment from the database search is the operation upon which all other computational procedures are based:

1. It is necessary for inferring phylogenetic relationships.
2. It is the starting point for predicting the secondary structure of proteins.
3. It is a prerequisite for all “knowledge-based” protein family classification and tertiary structure prediction.


Sequence similarity search algorithms rest on the premise that if two sequences are sufficiently similar, they almost invariably have similar biological functions and are descended from a common ancestor.

Sequence conservatism → Structural conservatism → Functional conservatism

Two proteins that have a certain number of amino acids in common at aligned positions are said to be identical to that degree (25% identical for 40 common amino acids in a 160-residue sequence). That a stretch of two sequences is nearly identical does not imply that they are homologous, that is, related by divergence from a common ancestor. Homology is not synonymous with similarity: similarity is a value between 0 and 100%, whereas there are no degrees of homology. Sequences are either homologous or not. However, a high level of sequence similarity is a strong indication of homology, implying a common divergent evolutionary relationship.

The sequence similarity problem can be stated as follows: given two sequences, find the best alignment that can be obtained by sliding one sequence along the other. A major complication arises from insertions or gaps in the alignment of sequences. To prevent the accumulation of too many gaps in an alignment, the introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. Usually, gap penalties (the costs of inserting and extending gaps) are chosen to be length dependent. Typically, the cost of extending a gap (gap elongation) is 5-10 times lower than the cost of introducing a gap (gap open). The quality of an alignment can be measured in terms of the number and length of gaps introduced and the number of mismatches remaining in the alignment. A matrix relating such parameters represents the distance between two sequences. Various methodologies, including mutation matrices (scoring matrices), dotplots, global and local sequence alignments and other algorithms, are available to address the sequence alignment problem.
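The percent identity and the gap-open/gap-extension penalties described above can be illustrated on an alignment that has already been written out as two rows. This is a minimal sketch under stated assumptions: the function names are hypothetical, "-" marks a gap, identity is taken over all alignment columns (one of several conventions in use), and the penalty values (10 to open, 1 to extend) are illustrative choices that merely respect the 5-10 fold ratio mentioned in the text.

```python
def percent_identity(row_a, row_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    same = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * same / len(row_a)


def alignment_score(row_a, row_b, match=1, mismatch=0, gap_open=10, gap_extend=1):
    """Score an existing alignment with length-dependent gap penalties.

    The first position of a gap costs gap_open; every further position of
    the same gap costs gap_extend (assumed, illustrative values).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be of equal length")
    score = 0
    gap_in = None  # which row currently carries an open gap: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            score -= gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            score += match if a == b else mismatch
            gap_in = None
    return score


if __name__ == "__main__":
    print(percent_identity("GATTACA", "GA--ACA"))
    print(alignment_score("GATTACA", "GA--ACA"))
```

Because extending a gap is much cheaper than opening one, a scheme of this kind favors a few long gaps over many short ones.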

12.1.1 Similarity/Distance Matrices

A sequence can be described in terms of the number of bits needed to specify its message. The correspondence between two aligned sequences can be expressed in terms of a similarity/identity score. Scoring penalties are introduced to minimize the number of gaps. The total alignment score is then a function of the identity between aligned residues and the gap penalties incurred. A compilation of the similarity scores in pair-wise alignment into a matrix is called a scoring matrix. Such matrices are constructed for:

1. Evaluating match/mismatch between any two characters (residues).
2. A score for insertion/deletion.
3. Optimization of total score.
4. Evaluating the significance of the alignment.

12.1.2 Construction of Scoring Matrix

Scoring matrices implicitly represent a particular theory of evolution. Elements of a matrix specify the weight to be assigned to a given comparison (i) by the measure of similarity for replacing one residue with another (similarity matrix), or (ii) by the cost for the replacement (distance matrix). Similarity matrices are used for database searching, while distance matrices are naturally used for phylogenetic tree construction.


The distance score (D) is usually calculated as the number of mismatches in an alignment divided by the total number of matches and mismatches. Ignoring gaps, it represents the fraction of positions that must be changed to convert one sequence into the other.

\[ D = \frac{\text{Mismatches}}{\text{Matches} + \text{Mismatches}} \qquad (12.1) \]
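As an illustration of the distance score, the sketch below counts matches and mismatches over the ungapped columns of a given alignment and returns the mismatch fraction, following the verbal definition above. The function name `distance_score` and the use of "-" as the gap symbol are assumptions made for the example.

```python
def distance_score(row_a, row_b):
    """Mismatches / (matches + mismatches), ignoring columns with a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be of equal length")
    matches = mismatches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # gapped columns are not counted
        if a == b:
            matches += 1
        else:
            mismatches += 1
    compared = matches + mismatches
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


if __name__ == "__main__":
    d = distance_score("GATTACA", "GACTA-A")
    print("D =", d, " S =", 1 - d)
```

For the two rows in the usage example, five of the six ungapped columns match, so D = 1/6 and the complementary similarity is 5/6.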

Similarity (S) and distance (D) matrices are inter-convertible (S = 1 – D). An understanding of the theories underlying scoring matrices can aid in making proper choices. By determining the number of mutational changes by sequence alignment methods, a quantitative measure of the distance between any pair of sequences can be obtained. These values can be used to reconstruct a phylogenetic tree, which describes the relationship between the gene sequences. The more mutations required to change one sequence into the other, the more unrelated the sequences and the lower the probability that they share a recent common ancestor sequence. Conversely, the more alike a pair of sequences, the fewer the changes required to change one sequence into the other, and the greater the likelihood that they share a recent common ancestor sequence.

Distances between DNA sequences are relatively simple to compute as a sum, that is, the least number of steps required to change one sequence into the other (D = X + Y). DNA distances are preferable in phylogenetic tree analysis because (i) the patterns of mutations, insertions and deletions at the nucleotide level are definitive, and (ii) silent mutations at the DNA level do not result in an amino acid substitution (because of the redundancy of the genetic code) and are therefore visible only in the DNA sequence. A simple matrix of the frequencies of the 12 possible types of replacement (each of the four bases can be replaced by any of the three other bases) can be used. Differences due to insertions/deletions are generally given a larger score than substitutions.

In the distance method, all possible pairs of sequences are aligned to determine which pairs are the most similar or closely related. The alignment provides a measure of the genetic distance between the sequences. The distance measurements are then used to predict the evolutionary relationship. A matrix of distance scores among all of the sequences is first made. A scoring matrix is a tool to quantify how well a certain model is represented in the alignment of two sequences. The less similar the sequences, the higher the distance score between them. The method is based on the assumption (Markov model) that proteins evolve through a succession of independent point mutations (the change from one state to another does not depend on the previous history of the state) that are accepted in a population and can subsequently be observed in the sequence pool. The degree of match between two letters (residues) can be represented in a matrix. The score is

\[ \text{Score} = \frac{p_{ij}}{q_i \, q_j} \qquad (12.2) \]

where \(p_{ij}\) is the probability that residue i is substituted by residue j, and \(q_i\) and \(q_j\) are the background probabilities of residues i and j, respectively.

The simplest matrix in use is the identity (similarity) matrix: two letters score +1 if they are the same and 0 if they are not. For DNA sequences, the identity matrix is given in Table 12.2.

Table 12.2 Identity Scoring Matrix for DNA sequences

       A    T    G    C
A      1    0    0    0
T      0    1    0    0
G      0    0    1    0
C      0    0    0    1

Replacement matrix: Rij = 1 (for i = j); Rij = 0 (for i ≠ j).

Nucleotide bases fall into two classes depending on the ring structure of the bases: two-ring purine bases (A and G) and single-ring pyrimidine bases (T and C). A mutation that conserves the ring number (A ↔ G, or T ↔ C) is called a transition, and a mutation that changes the ring number (purine ↔ pyrimidine) is called a transversion. Use of a transition/transversion matrix with weighted scores reduces noise in comparisons of distantly related sequences (Table 12.3).

Table 12.3 Transition/Transversion Scoring Matrix for DNA sequences

       A    T    G    C
A      0    5    1    5
T      5    0    5    1
G      1    5    0    5
C      5    1    5    0
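The transition/transversion weighting can also be generated from the purine/pyrimidine classes instead of being typed in as a table. The following is a minimal sketch: the function names are hypothetical, the weights (0 for identity, 1 for a transition, 5 for a transversion) are the values used in Table 12.3, and the two sequences are assumed to be of equal length and already aligned without gaps.

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # single-ring bases


def substitution_cost(x, y, transition=1, transversion=5):
    """Cost of replacing base x by base y under a transition/transversion model."""
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if x == y:
        return 0
    # A transition keeps the ring class (A<->G or C<->T); anything else is a transversion.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def weighted_distance(seq_a, seq_b):
    """Sum of substitution costs over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be of equal length")
    return sum(substitution_cost(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(weighted_distance("ATGC", "GTCC"))
```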

Distances between amino acid sequences are more difficult to calculate, because:

1. Some amino acids can be changed by the replacement of a single DNA base (single-point mutation), while the replacement of other amino acids requires two or three base changes within the DNA sequence.
2. While conservative mutations of amino acids do not have much effect on structure and function, other replacements can be functionally lethal.

Since all point mutations arise from nucleotide changes, the probability that an observed amino acid pair is related by chance, rather than by inheritance, should depend only on the number of point mutations necessary to transform one codon into the other. A matrix resulting from this model would define the distance between two amino acids by the minimal number of nucleotide changes required (genetic code matrix). It may be more useful to compare the sequences at the purine (R)-pyrimidine (Y) level. To this can be added other physicochemical attributes of amino acids (hydrophobicity matrix, size and volume matrix etc.).
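An element of the genetic code matrix can be computed directly from the codon table. The sketch below assumes the standard genetic code, written with DNA bases in the conventional T, C, A, G order, and takes the distance between two amino acids as the smallest number of base differences over all pairs of their codons. The function name `min_base_changes` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...); "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each one-letter amino acid code to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append(codon)


def min_base_changes(aa1, aa2):
    """Minimal number of nucleotide changes needed to turn a codon for aa1 into one for aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


if __name__ == "__main__":
    for pair in ("FL", "MW", "MY"):
        print(pair, min_base_changes(pair[0], pair[1]))
```

In this model Phe and Leu are one base change apart, whereas Met (ATG) and Tyr (TAT, TAC) differ at all three codon positions.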

12.2 PAIR-WISE SEQUENCE ALIGNMENT

Pair-wise alignment is a fundamental process in sequence comparison analysis. Pair-wise alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. In a pair-wise comparison, if gaps and local alignments are not considered (i.e. fixed-length sequences), the optimal alignment method can be tried, and the number of computations required for two sequences is roughly proportional to the square of the average length, as is the case in dotplot comparison. The problem becomes complicated, and is no longer feasible by the simple optimal alignment method, when gaps and local alignments are considered.

That a program can align two sequences is not proof that a relationship exists between them. Statistical values are used to indicate the level of confidence that should be attached to an alignment. A maximum match between two sequences is defined as the largest number of amino acids from one protein that can be matched with those of another protein, while allowing for all possible deletions. A penalty is introduced to provide a barrier to arbitrary gap insertion. Dotplots, dynamic programming and word (k-tuple) methods are the common pair-wise alignment procedures.

12.2.1 Dotplot Analysis

Dotplot analysis is essentially a signal-to-noise graph, used in the visual comparison of two sequences to detect regions of close similarity between them (Fig. 12.2). The two sequences are written along the x- and y-axes, and dots are plotted at all positions where identical residues are observed, that is, at the intersection of every row and column that has the same letter in both sequences. Within the dotplot, an unbroken diagonal stretch of dots indicates a region where the two sequences are identical. Two similar sequences are characterized by a broken diagonal, the interruptions indicating the locations of sequence mismatch. A pair of distantly related sequences, with fewer similarities, has a noisier plot (Fig. 12.3). Isolated dots that are not on the diagonal represent random matches that are probably not related to any significant alignment. Detection of matching regions may be improved by filtering out random matches in a dotplot. Filtering (overlapping, fixed-length windows etc.) can be used to place a dot only when a group of successive nucleic acid bases (10–15) or amino acid residues (2–3) match, to minimize noise.

12.2.2 Dynamic Programming Methods

The best solution for the pair-wise sequence alignment problem seems to be an approach called dynamic programming. Dynamic programming methods assure the optimal global (Needleman-Wunsch method) or local (Smith-Waterman method) alignment by effectively exploring all possible alignments and choosing the best. These methods allow the introduction of artificial gaps in aligned sequences to create an optimal alignment (Fig. 12.4). The divide-and-conquer principle is extensively used in dynamic programming: subdivide a problem that is too large to be computed into smaller problems that can be computed efficiently, then assemble the information to give a solution for the large problem. Scores for each comparison are stored in a table, like a spreadsheet, inside the program. These individual scores are then used to build an alignment score, stepping through the table from beginning to end. A key part of alignment methods is the scoring scheme, which uses three kinds of score: (i) The first is a comparison matrix, which gives a single score for every possible match and mismatch between two bases. (ii) The second is a penalty to be subtracted each time a gap is made in one sequence so that two other matching regions can be better aligned. (iii) The third is a penalty to be subtracted each time a gap is extended by another residue.
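Returning to the dotplot of Section 12.2.1, the effect of window filtering can be seen in a few lines of code. This is a minimal sketch with hypothetical function names: a dot is placed at (i, j) only when `window` successive residues starting at position i of one sequence are identical to those starting at position j of the other, and `render` prints the plot as text.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return the set of (i, j) where seq_x[i:i+window] equals seq_y[j:j+window]."""
    if window < 1:
        raise ValueError("window must be at least 1")
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the dotplot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for j, residue in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    s = "GATTACA"
    print(render(s, s, dotplot(s, s, window=1)))  # diagonal plus random matches
    print()
    print(render(s, s, dotplot(s, s, window=3)))  # only the diagonal survives
```

Comparing a short sequence with itself shows the point: with a window of 1 the repeated A and T residues produce off-diagonal dots, while a window of 3 leaves only the main diagonal.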


Fig. 12.4 Global and Local Sequence Alignments

12.2.2.1 Global Alignment

Global alignment is an alignment of two nucleic acid or protein sequences over their entire length. The Needleman-Wunsch algorithm (GAP program) is one of the methods for carrying out pair-wise global alignment of sequences by comparing a pair of residues at a time. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array (one sequence along the x-axis and the other along the y-axis), and pathways through the array represent all possible comparisons (every possible combination of match, mismatch, insertion and deletion). Statistical significance is determined by employing a scoring system: match = 1 and mismatch = 0 (or any other relative scores), and a penalty for a gap. Each cell in the matrix is examined, the maximum score along any path leading to the cell is added to its present contents, and the summation is continued. In this way the maximum match (maximum sequence similarity) pathway is constructed. The maximum match is the largest sum of cell scores obtainable along any pathway, and the corresponding pathway defines the optimal alignment. Leaps to non-adjacent diagonal cells in the matrix indicate the need for gap insertion to bring the sequences into register. Complete diagonals of the array contain no gaps.

The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to align all the characters of one sequence with all the characters of a second sequence. It works well for sequences that show similarity across most of their lengths. Globally optimal alignment is a difficult problem (biological sequences may have gaps and insertion sequences relative to each other). There are limitations to global alignment methods:

1. Global alignment algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence.
2. Inferences from global properties about local properties are not valid for all biological sequences.
3. Short and highly similar sub-sequences may be missed in the alignment because the rest of the sequence outweighs them.
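The table-filling and traceback steps of global alignment can be sketched as follows. This is a simplified illustration of Needleman-Wunsch-style dynamic programming, not the GAP program itself: it uses match = 1 and mismatch = 0 as in the text, and a single linear gap penalty of 1 per gapped position, which is an assumed value. The function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = -gap * j          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] - gap,       # gap in b
                score[i][j - 1] - gap,       # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(best)
    print(top)
    print(bottom)
```

Several alignments can share the optimal score, and the traceback shown returns only one of them. The table has (n + 1)(m + 1) cells, which is the quadratic cost mentioned in Section 12.2.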

12.2.2.2 Local Alignment

Local alignment is an alignment of some portion of two nucleic acid or protein sequences. Smith-Waterman algorithm is a variation of the dynamic programming approach to generate


forms which occur in nature. The phrase also refers to an actual sequence, which approximates the theoretical consensus. A known conserved sequence set is represented by a consensus sequence. Commonly observed supersecondary protein structures are often formed by conserved sequences. Sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as for the alignment of two sequences.

Table 12.4 (a) Multiple Sequence Alignment of a highly conserved Region of a Protein Family

Sequence              1    2    3    4    5    6    7
I                     G    A    G    G    V    G    K
II                    G    G    G    S    G    G    L
III                   G    A    R    G    V    G    K
IV                    G    A    S    G    V    G    K
V                     G    G    A    G    V    G    K
VI                    G    A    G    E    S    G    K
VII                   G    G    G    G    S    G    F
VIII                  G    A    C    G    V    G    K
Consensus Sequence    G    A/G  g    g    v    G    k

Table 12.4 (b) Multiple Sequence Alignment of a Disulfide-containing Protein Family

      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
1     G   C   K   Y   G   C   L   K   L   G   E   N   E   G   C   D
2     G   C   K   K   T   C   Y   K   L   G   E   N   D   F   C   N
3     N   C   K   Y   E   C   K   K   –   –   –   D   D   Y   C   N
4     G   C   K   V   W   C   V   I   N   –   –   N   E   E   C   G
5     D   C   V   Y   E   C   Y   N   P   K   G   –   S   Y   C   N
6     G   C   K   L   S   C   F   –   I   R   P   S   G   Y   C   G
7     G   C   T   V   S   C   G   T   –   –   –   –   –   –   C   –
Sq    g   C   k   x   x   C   x   x   x   g   x   n   x   x   C   n

(Note: In a consensus sequence, if the same residue occurs throughout a column, an upper case letter is used; if a particular residue occurs most of the time, a lower case letter is used; if several residues occur in equal numbers, all are mentioned; and if the distribution in a column is random, the letter X is used. In disulfide-containing proteins, retention of the S-S bridges is a priority, and accordingly the cysteines are kept in register, with manual intervention if necessary.)
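The consensus rules in the note can be approximated in code. The sketch below is one possible reading, not the procedure used to build Table 12.4, and it is not guaranteed to reproduce every symbol in those tables: the 50% majority threshold, the requirement that tied residues occur at least twice, the use of "-" for gaps (which are ignored), and the function names are all assumptions.

```python
from collections import Counter


def consensus_symbol(column, majority=0.5):
    """Consensus symbol for one alignment column (a string or list of residues)."""
    residues = [r for r in column if r != "-"]   # gaps are ignored
    if not residues:
        return "-"
    counts = Counter(residues)
    top = max(counts.values())
    leaders = sorted(r for r, c in counts.items() if c == top)
    if len(counts) == 1:
        return leaders[0].upper()                # same residue throughout
    if len(leaders) == 1 and top / len(residues) >= majority:
        return leaders[0].lower()                # one residue most of the time
    if len(leaders) > 1 and top >= 2:
        return "/".join(leaders)                 # several residues in equal number
    return "x"                                   # no clear preference


def consensus(sequences):
    """Consensus symbols, column by column, for equal-length aligned sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [consensus_symbol(col) for col in zip(*sequences)]


if __name__ == "__main__":
    print(consensus(["GAKCV", "GAKCL", "GGRCI", "GGSCM"]))
```

A list of symbols is returned rather than a string because a tied column such as A/G occupies more than one character.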

Alignment of a large number of sequences by pair-wise dynamic programming is almost impossible, because the problem grows exponentially with the number of sequences involved. Shortcut progressive methods, based on a heuristic approach, are therefore available for multiple alignment of sequences. Most of the available programs (BLAST, FASTA etc.) use an incremental method that makes pair-wise alignments of the most closely related sequences, and then progressively adds less related sequences or groups of sequences to the aligned group. BLAST and FASTA very quickly find the “best” diagonal between a pair of sequences. Both make use of amino acid substitution matrices (PAM-250 or BLOSUM62) to score and assess pair-wise sequence alignments, and are particularly good at identifying matches with 25%–100% sequence similarity to the query sequence. PSI-BLAST can identify matches with 15%–25% sequence identity. Once a multiple sequence alignment has been found, the number or types of changes in the aligned sequence residues may be used for phylogenetic analyses. BLAST is now a widely used sequence alignment tool for proteins and nucleic acids. FASTA is more sensitive than BLAST in detecting distantly related protein sequences.

• BLITZ: (http://www.ebi.ac.uk/searchs/blitz.html). Fast comparison of protein sequences against SWISS-PROT.

12.3.1 BLAST Suite of Algorithms

Basic Local Alignment Search Tool (BLAST) is from NCBI/GenBank (USA). It consists of a suite of algorithms that provide fast, accurate and sensitive database searching. BLOSUM62 is the default scoring matrix. BLAST works better on protein sequence databases. A general operational procedure is:

1. It takes each word from the query sequence, optionally filtered to remove low-complexity regions, and locates all similar words in the current test sequence. It initially throws away all database sequences that do not have a similar match.
2. If similar words are found (3 amino acids or 11 nucleotides), BLAST tries to expand the alignment to the adjacent words (gaps not allowed).
3. High-scoring segment pairs (HSPs) are generated. An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score is above the threshold score.
4. After all words are tested, a set of HSPs is chosen for that database sequence. Two sequences, a scoring system, and a threshold score define a set of HSPs.
5. Several non-overlapping HSPs may be combined in a statistical test to create a longer, more significant match.

A suite of BLAST programs is:

• BLAST: Un-gapped BLAST. The program may miss the similarity if two sequences do not have a single highly conserved region.
• Gapped BLAST: Seeks only one of the un-gapped alignments that make up a significant match. Dynamic programming is used to extend a central pair of aligned residues in both directions to yield the final gapped alignment.
• PSI-BLAST: Position-Specific Iterated BLAST is a generalized BLAST algorithm that incorporates both pair-wise and multiple sequence alignment methods. It is used for the identification of weak sequence similarities. It uses a position-specific score matrix in place of the query sequence.
  1. It takes as input a protein sequence and compares it to protein databanks, constructs a multiple alignment from a Gapped BLAST search, and generates a “profile” from any significant local alignments.
  2. The profile is compared to the protein databases, again seeking the best possible local alignments. PSI-BLAST estimates the statistical significance of the local alignments found, using “significant” hits to extend the profile search until convergence.
• BLASTN: Compares the nucleotide query sequence against all nucleotide sequences in the non-redundant databases (DNA → DNA). Suited for high-scoring matches; not suited for distant relationship matching.
• BLASTP: Compares a protein query sequence against all protein sequences (gapped) in the non-redundant databases (Protein → Protein). Suited for finding homologies.
• BLASTX: The query nucleotide sequence is translated in all six reading frames (each frame gapped) and the conceptual translation products are compared against all protein sequences in non-redundant databases (DNA (translated) → Protein). Suited for finding ESTs and for new DNA searches aimed at finding novel proteins.
• TBLASTN: Compares a protein query sequence against nucleotide sequence databases, dynamically translated in all six reading frames (each frame gapped) (Protein → DNA (translated)). Suited for finding ESTs and novel proteins.
• TBLASTX: Compares the six-frame translation of a nucleotide query sequence against the six-frame (ungapped) translation of nucleotide sequence databases (DNA (translated) → DNA (translated)). Suited for ESTs and gene structure annotations.
• BEAUTY: (http://dot.imgen.bcm.tmc.edu/seq-search/protein-search.html). BLAST Enhanced Alignment Utility that predicts the function of the protein being tested. It adds additional information on sequence family membership, the location of the conserved domains, and the locations of any annotated domains and sites directly into BLAST search results. These enhancements make it much easier to detect weak, but functionally significant, matches in BLAST database searches. The BLOCKS server offers a variety of BLAST searches that use as a query sequence a consensus sequence derived from multiple sequence alignment of a set of related proteins. The consensus sequence is called a cobbler sequence.
• BLAST-2: (http://www.ch.embnet.org/software/frameBLAST.html). A newer release of BLAST that allows insertions or deletions in the aligned sequences. Gapped alignments may be more biologically significant. Synonymous with gapped BLAST. Compares two sequences against one another, employing BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX.
• PIR-BLAST: Provides general BLASTP against the entire non-redundant reference protein database (PIR-NREF).
• PhyloBLAST: Compares the query protein sequence to the SWISS-PROT and TrEMBL databases using WU-BLAST and then performs phylogenetic analysis.

The help manual is available on the Web at: (http://www.ncbi.nlm.nih.govt/BLAST/blasthelp.html).
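Steps 1 to 4 of the operational procedure described in this section (word lookup, ungapped extension, and selection of HSPs above a threshold) can be illustrated with a toy nucleotide search. This is only a sketch under stated assumptions: the match/mismatch scores, the word size of 4, the drop-off value and all function names are chosen for illustration and are not taken from the BLAST programs themselves; for simplicity it seeds on identical words only, whereas BLAST also considers similar words.

```python
# Toy BLAST-style search: exact word seeds followed by ungapped extension.
# All scoring constants below are illustrative assumptions, not BLAST defaults.
MATCH, MISMATCH = 5, -4
WORD = 4        # demo word size (the text cites 11 for nucleotides)
X_DROP = 8      # stop extending once the score falls this far below the best seen

def index_words(seq, w=WORD):
    """Map each word of length w to the list of positions where it starts."""
    table = {}
    for i in range(len(seq) - w + 1):
        table.setdefault(seq[i:i + w], []).append(i)
    return table

def extend_one_way(query, subject, qi, si, step, start_score):
    """Walk along the diagonal in one direction; return (best score, residues added)."""
    score = best = start_score
    added = k = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += MATCH if query[qi] == subject[si] else MISMATCH
        k += 1
        if score > best:
            best, added = score, k
        elif best - score > X_DROP:
            break
        qi += step
        si += step
    return best, added

def extend(query, subject, qi, si, w=WORD):
    """Extend a word hit in both directions without gaps.
    Returns (score, query_start, subject_start, length)."""
    best, right = extend_one_way(query, subject, qi + w, si + w, +1, w * MATCH)
    best, left = extend_one_way(query, subject, qi - 1, si - 1, -1, best)
    return best, qi - left, si - left, left + w + right

def find_hsps(query, subject, threshold=30):
    """Return the distinct segment pairs whose score reaches the threshold."""
    words = index_words(query)
    hsps = set()
    for si in range(len(subject) - WORD + 1):
        for qi in words.get(subject[si:si + WORD], []):
            hsp = extend(query, subject, qi, si)
            if hsp[0] >= threshold:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)

query = "ACGTACGTTTGACCA"
subject = "GGG" + query + "GGG"
print(find_hsps(query, subject))
```

Every seed that lies on the same diagonal extends to the same segment pair, so collecting the results in a set leaves one HSP per distinct region. Short chance matches on other diagonals fall below the threshold and are discarded.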

12.3.2 FASTA Algorithm

FASTA is a program for rapid alignment of pairs of DNA or protein sequences. Rather than comparing individual residues in the two sequences, the FASTA algorithm is based on the idea of identifying short words (k-tuples) common to both sequences under comparison. In a dotplot, regions of similarity between two sequences show up as diagonals. Comparison of k-tuples between the two sequences can be viewed as focusing on diagonal matches in a dynamic programming matrix. FASTA calculates the sum of these dots along each diagonal in the following way:

1. Match identical words from each list and then create diagonals by joining adjacent matches (non-overlapping words).
2. Find the sum of identical words.
3. Rescale using a PAM matrix and retain the top-scoring segments.
4. Join segments using gaps, and eliminate other segments.
5. Use (Smith-Waterman) local dynamic programming to create an optimum alignment.

The related algorithms FASTX and FASTY translate a query DNA sequence in all three forward reading frames and compare all three frames to a protein sequence database. TFASTX and TFASTY compare a query protein sequence to a DNA sequence database, translating each DNA sequence in all six reading frames. This is generally the best way to scan EST databases.
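The first two steps of the procedure (finding identical k-tuples and summing them along each diagonal) can be sketched as follows. The function names are hypothetical, and the sketch counts overlapping k-tuples, which is a simplification of the non-overlapping joining described in step 1; rescaling with a PAM matrix and the later steps are not shown.

```python
from collections import defaultdict

def ktuple_diagonals(query, target, k=2):
    """Count identical k-tuples on each diagonal.
    A diagonal is identified by its offset: target position minus query position."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)

def best_diagonal(query, target, k=2):
    """Return (offset, count) for the diagonal with the most shared k-tuples."""
    diag = ktuple_diagonals(query, target, k)
    return max(diag.items(), key=lambda item: item[1]) if diag else None

print(best_diagonal("HEAGAWGHEE", "PAWHEAE"))
```

Because only a lookup table of words is needed, the cost grows with the number of shared words rather than with the full size of the dynamic programming matrix, which is what makes the method fast.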

• FASTA: (http://www2.igh.cnrs.fr/bin/fasta-guess.cgi).
• FASTA3: (http://www2.ebi.ac.uk/fasta3/).
• Octopus: It is a program for rapid interpretation of BLAST, BLAST-2, and FASTA output test files.

12.3.3 PILEUP Algorithm

The PILEUP algorithm estimates the best alignment for a group of sequences using a pair-wise approach. PILEUP uses a global alignment procedure (GCG GAP program).

1. First, similarity scores are calculated between all sequences to be aligned, and they are clustered into a tree structure by the UPGMA method.
2. Next, the most similar pairs of sequences are aligned, and averages (similar to consensus sequences) are calculated for the aligned pairs.
3. The final multiple alignment is performed by a series of progressive, pair-wise alignments between sequences and clusters of sequences.


12.3.4 CLUSTAL Algorithm

The CLUSTAL algorithm uses a local alignment program (GCG BESTFIT program), which can be advantageous for aligning highly diverged sequences that have some regions of homology but are dissimilar in other regions. The CLUSTAL algorithm is based on the premise that similar sequences are likely to be evolutionarily related. Thus, the method aligns sequences in pairs, following the branching order of a phylogenetic tree of the family. Similar sequences are aligned first, and more distantly related sequences are added later. Once the pair-wise alignment scores have been calculated, they are used to cluster the sequences into groups. Some of the advantages of CLUSTAL are:

1. A gap and its length are distinct quantities, and different weights are given to each.
2. Different weights are given to different types of mismatches. E.g. a transition (Leu - Val) is more probable than a transversion (Leu - Asp), and hence is treated with a different weight.
3. It can use a rapid alignment method (FASTA), or the slower and more accurate Smith-Waterman method.
4. It can add individual sequences to an existing alignment, or align two groups of pre-aligned sequences with each other.
5. It can realign selected sequences or selected regions of the alignment, leaving the unselected portions of the alignment constant.
6. Secondary structure features, such as regions of hydrophobicity and proximity to other groups, are incorporated.
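The first advantage listed, separate weights for the existence of a gap and for its length, is commonly expressed as an affine gap penalty. The sketch below assumes illustrative penalties of 10 for opening a gap and 0.5 for each additional position; these values and the function names are assumptions made for the example and are not CLUSTAL's actual settings.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine gap cost: one charge for opening a gap plus a smaller charge
    for each additional gapped position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

def alignment_gap_cost(aligned_a, aligned_b, gap_open=10.0, gap_extend=0.5):
    """Sum the affine penalties over every run of '-' in either row of a
    pair-wise alignment."""
    total = 0.0
    for row in (aligned_a, aligned_b):
        run = 0
        for ch in row + "X":            # sentinel closes a trailing gap run
            if ch == "-":
                run += 1
            else:
                total += gap_penalty(run, gap_open, gap_extend)
                run = 0
    return total

print(gap_penalty(1), gap_penalty(4))
print(alignment_gap_cost("AC---GT", "ACTTTG-"))
```

With these weights one gap of length four costs far less than four separate gaps of length one, which favors alignments with a few long insertions or deletions over many scattered ones.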

12.3.5 Strategies for Sequence Similarity Search

It is necessary to decide at the outset whether to search nucleic acid or protein databases. Whether to use a protein or nucleic acid sequence query depends upon the biological information desired. If the sequence is a protein, or the gene sequence codes for a protein, then the search should almost always be performed at the protein level, because proteins with their 20-letter alphabet allow one to detect far more distant similarity than do nucleic acids with their 4-letter alphabet.

• Initial search should be done with the heuristic algorithm programs (e.g. BLAST and FASTA).
• If the query sequence is an unknown sequence, then matching a gene fragment will probably not contribute much useful information.
• It is possible to automatically translate a DNA sequence into an amino acid sequence in all six reading frames (BLASTX) and compare it to a protein sequence database; or to compare a protein sequence to the six-reading-frame translation of all DNA base sequences (TFASTA & TBLASTN).

Some of the sequence similarity alignment servers/databases are:

• ALIGN: Applies the BLOSUM50 matrix to deduce the optimal alignment between two sequences.
• CINEMA: (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.1/). A Color Interactive Editor for Multiple Alignments for nucleic acids and proteins.
• CLUSTAL: Multiple sequence alignment tool based on clustering algorithms.
• Consensus: Takes CLUSTAL or multiple sequence alignment programs and calculates the consensus sequence.
• DCA: Multiple sequence alignment tool, better suited to distantly related sequences.
• DiAlign: Constructs pair-wise and multiple alignments by comparing whole segments of the sequence. Suitable for local similarity search.
• FASTA: Sequence similarity search tool.
• LALIGN: A tool for matching of two sequences.
• MULTALIN: (http://www.toulouse.inra.fr/multalin.html). A multiple sequence alignment server with hierarchical clustering algorithm. Sequence alignment is highlighted with color code for easy visualization.
• PSSM: Position-Specific Scoring Matrix. Analysis of multiple sequence alignment for conserved blocks. It represents an alignment of sequences of the same length (no gaps). Sliding the matrix along the sequence one position at a time scores every possible sequence position. The amino acid substitution scores in each column of the PSSM are used to evaluate each sequence position.
• SAPS: Statistical Analysis of Protein Sequences, which includes analysis of amino acid composition, charge, hydrophobicity and transmembrane segments.
• T-COFFEE: Multiple sequence alignment tool.
• USC Server: Aligns two sequences with dynamic programming.
• VSNS: (http://www.techfak.uni-bielfeld.de/bed/Curric/MulAli/). An excellent, comprehensive resource for multiple sequence alignment, software and tutorials.
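The PSSM entry in the list above describes sliding a matrix along a sequence and scoring every position. A minimal sketch of that idea follows, using a four-letter nucleotide alphabet to keep the example short (the same code works with a 20-letter amino acid alphabet). The uniform background frequencies, the pseudocount of 1 and the function names are assumptions made for illustration.

```python
import math

def build_pssm(block, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, ungapped sequences.
    Background frequencies are assumed uniform over the alphabet."""
    width = len(block[0])
    background = 1.0 / len(alphabet)
    total = len(block) + pseudocount * len(alphabet)
    pssm = []
    for col in range(width):
        column = [seq[col] for seq in block]
        pssm.append({
            a: math.log2(((column.count(a) + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm

def scan(pssm, sequence):
    """Slide the matrix along the sequence one position at a time.
    Returns (best score, start position of the best window)."""
    width = len(pssm)
    scores = [
        (sum(pssm[k][sequence[i + k]] for k in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

block = ["TATA", "TATA", "TACA", "TATT"]
pssm = build_pssm(block)
score, start = scan(pssm, "GGCTATAGC")
print(round(score, 2), start)
```

The pseudocount keeps residues that were never observed in a column from receiving a score of minus infinity, so a single unexpected residue lowers a window's score without ruling it out.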

BLAST and FASTA and related programs are statistically based sequence similarity search (SIM) methods. Lately, alternative non-SIM-based bioinformatic methods are becoming popular. One such method is Data Mining Prediction (DMP) that is based on combining evidence from amino-acid attributes, predicted structure and phylogenic patterns; and uses a combination of Inductive Logic Programming data mining, and decision trees to produce prediction rules for functional class. DMP predictions are more general than is possible using homology.

12.4 PHYLOGENETIC ANALYSIS

Phylogenetic analysis of a family of related sequences is a determination of how the family might have been derived during evolution. Placing the sequences as outer branches on a tree depicts the evolutionary relationships among the sequences. The branching relationships on the inner part of the tree then reflect the degree to which different sequences are related.

12.4.1 The Dayhoff Mutation Data Matrix

Once the evolutionary relationship of two sequences is established, the residues that did exchange are similar (conservative mutations). This is the underlying principle behind the Dayhoff mutation data matrix compilation. The Dayhoff mutation data matrix is based on the concept of the percentage-accepted mutation (PAM). Proteins are organized into families based on the degree of sequence similarity. From aligned sequences, a phylogenetic tree is derived showing graphically which sequences are most related and therefore share a common branch on the tree. After the construction of the evolutionary trees, they are used with scoring matrices to evaluate the amino acid changes that occurred during evolution of the genes for the proteins in the organisms from which they originated. Subsequently, a set of tables (matrices) giving the percentage of amino acid mutations accepted by evolutionary selection, known as PAM tables, is determined. PAM tables show which amino acids are most conserved at the corresponding positions in two sequences during evolution. Steps in the construction of the mutation matrix are:

1. Align sequences that are at least 85% identical and determine pair exchange frequencies.
2. Compute frequencies of occurrence.
3. Compute relative mutabilities.
4. Compute a mutation probability matrix.
5. Compute evolutionary distance scale.
6. Calculate log-odds matrix.

1st Step: Pair-exchange frequencies

PAM (Point Accepted Mutation) is a unit of evolutionary distance between two amino acid sequences of closely related proteins. PAM1 = 1 accepted point mutation (no insertions or deletions) event per 100 amino acids. The PAM1 matrix can be multiplied by itself N times to give the matrix for a distance of N PAM. PAM250 = 250 point mutations/100 amino acids (mutations occur multiple times at any given position, with identity score = 20%). Tally replacements “accepted” by natural selection in all pair-wise comparisons. Aij = number of times amino acid j is replaced by amino acid i in all comparisons. In the final scoring matrix, a score > 0 indicates two amino acids that are functionally equivalent and/or easily inter-mutable; a score < 0 indicates two amino acids that are seldom inter-changeable.

2nd Step: Frequencies of occurrence, fi

$$f_i = \frac{\text{observations of the } i\text{th amino acid}}{\text{observations of any amino acid}} \tag{12.3}$$

$$\sum_i f_i = 1 \tag{12.4}$$

3rd Step: Relative mutabilities

The amino acids that do not mutate are to be taken into account. This is the relative mutability of the amino acid.

$$m_j = f_j \times (\text{number of times amino acid } j \text{ is observed to change})$$

4th Step: Mutation probability matrix, Mij

Mij = probability that amino acid i, in row i of the matrix, will replace an amino acid in column j.

$$M_{ij} = \frac{m_j A_{ij}}{\sum_{i=1}^{20} A_{ij}} \tag{12.5}$$

$$M_{ii} = 1 - m_i \tag{12.6}$$

The diagonal elements represent the probability that the amino acid will remain unchanged.

5th Step: The evolutionary distance scale

An evolutionary distance between two sequences is the number of point mutations that was necessary to evolve one sequence into another (the distance is the minimum number of mutations). Since Mii represents the probabilities for amino acids to remain unchanged, multiplying the matrix by λ gives the matrix the evolutionary distance of PAM1. The mutation probability is

$$M_{ij} = \frac{\lambda\, m_j A_{ij}}{\sum_{i=1}^{20} A_{ij}} \tag{12.7}$$

$$M_{ii} = 1 - \lambda m_i$$

In the framework of this model, a mutation probability matrix for any distance can be obtained by multiplying the 1 PAM matrix with itself the required number of times.

6th Step: Log-odds matrix

The probability that some event is observed by random chance, $p_i^{\mathrm{ran}}$, is $p_i^{\mathrm{ran}} = f_i$. The relatedness is

$$R_{ij} = \frac{M_{ij}}{f_i} \tag{12.8}$$

The log-odds matrix, Sij, is the log-odds ratio of two probabilities: the probability that two amino acid residues are aligned by evolutionary descent to the probability that they are aligned by chance.

$$S_{ij} = \log(R_{ij}) \tag{12.9}$$
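Steps 4 to 6 can be traced numerically with a toy three-letter alphabet. This is only a sketch: the exchange counts A, frequencies f and relative mutabilities m below are made-up numbers, so the resulting scores have no biological meaning. Two further assumptions are labeled in the code: λ is chosen so that 1% of residues change in one PAM step (consistent with the definition of PAM1 as one accepted mutation per 100 amino acids), and the logarithm in (12.9) is taken to base 10.

```python
import math

def mutation_matrix(A, m, lam):
    """Eq. (12.7): M[i][j] = lam * m[j] * A[i][j] / sum_i A[i][j] off the diagonal,
    and M[j][j] = 1 - lam * m[j]. The diagonal of A is assumed to be zero."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_total = sum(A[i][j] for i in range(n))
        for i in range(n):
            if i == j:
                M[i][j] = 1.0 - lam * m[j]
            else:
                M[i][j] = lam * m[j] * A[i][j] / col_total
    return M

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(M, power):
    """Multiply the 1 PAM matrix with itself the required number of times."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, M)
    return result

def log_odds(M, f):
    """Eqs. (12.8) and (12.9): S[i][j] = log10(M[i][j] / f[i]) (base 10 assumed)."""
    n = len(M)
    return [[math.log10(M[i][j] / f[i]) for j in range(n)] for i in range(n)]

# Hypothetical data for a three-letter alphabet.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]           # exchange counts, zero diagonal
f = [0.5, 0.3, 0.2]         # frequencies of occurrence, summing to 1
m = [0.8, 1.0, 1.5]         # relative mutabilities, taken as given inputs

# Assumed normalization: on average 1% of residues change in one PAM step.
lam = 0.01 / sum(fj * mj for fj, mj in zip(f, m))

pam1 = mutation_matrix(A, m, lam)
pam250 = matrix_power(pam1, 250)
scores = log_odds(pam250, f)
for row in scores:
    print([round(s, 2) for s in row])
```

Each column of the mutation matrix sums to 1, because an amino acid either stays unchanged or is replaced by one of the others. Repeated multiplication preserves this property while the diagonal entries shrink, reflecting the accumulation of changes at larger distances.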

12.4.1.1 Limitations of the PAM Model

The PAM model is built on assumptions that are imperfect.

1. The Markov model, in which replacement at any site depends only on the amino acid at that site and the probability given by the table, is an imperfect representation of evolution. Replacement is not equally probable over the entire sequence (e.g. local conserved sequences).
2. The assumption that each amino acid position is equally mutable is incorrect. Sites vary considerably in their degree of mutability.
3. Many sequences depart from average amino acid composition.
4. Errors in PAM1 are magnified in extrapolation to PAM250.
5. The model is devised using the most mutable positions rather than the most conserved positions, which reflect chemical and structural properties of importance.
6. Distant relationships can only be inferred.

12.4.2 Blocks Model

Other scoring matrix models are all based on the basic concepts of Dayhoff. In the Blocks substitution matrix (BLOSUM) method, the starting data are conserved blocks, aligned in order to represent distant relationships more explicitly. In this method, the sequences of the individual proteins in each of the families are aligned in the regions defined by the blocks. Each column in the aligned sequences then provides a set of possible amino acid substitutions. The types of substitutions are then scored for all aligned patterns in the database and used to prepare a scoring matrix, the “BLOSUM” matrix, indicating the frequency of each type of substitution. More common (conservative) substitutions should represent a closer relationship between two amino acids in related proteins, and thus receive a more favorable score in sequence alignment. Conversely, radical substitutions should be less favored. Patterns of different identities are grouped in different groups: 60% identical patterns are grouped under one substitution matrix, blosum60, those 80% alike under blosum80, and so on. BLOSUM matrix values are given as log-odds scores of the ratio of the observed frequency of amino acid substitution divided by the frequency expected by chance. While the PAM matrix is designed to track evolutionary origins of proteins, the BLOSUM model is designed to find their conserved domains. The better reliability of the blocks method is due to:

1. Many sequences from aligned families are used to generate matrices.
2. Any potential bias introduced by counting multiple contributions from identical residue pairs is removed by clustering sequence segments on the basis of minimum percentage identity.
3. Clusters are treated as single sequences (Blosum60; Blosum80 etc.).
4. Log-odds matrix is calculated from the frequencies, Aij, of observing residue i in one cluster aligned against residue j in another cluster.
5. Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as is the case with PAM matrices).
6. Detects distant similarities more reliably than Dayhoff matrices.

• ALIGN: (http://www2.igh.cnrs.fr/bin/align-guess.cgi). Applies BLOSUM50 matrix to deduce the optimal alignment between two sequences.
• BLOCKS: Multiple alignment of ungapped segments corresponding to the most highly conserved region of proteins.
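The log-odds calculation described for the BLOSUM method (observed pair frequency divided by the frequency expected by chance) can be sketched for a single small block. The block contents and function name are hypothetical, the clustering by percentage identity is omitted, and the scaling of the scores to half-bits (2 × log2) is an assumption of this example.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Log-odds scores (in half-bits) from residue pairs counted down each
    column of an ungapped block of equal-length sequences."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    observed = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Probability of each residue occurring in a pair.
    p = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores

# Hypothetical block: two conserved columns and one column with a substitution.
block = ["ASA", "ASA", "ASA", "ASS"]
for pair, score in sorted(blosum_like_scores(block).items()):
    print(pair, round(score, 2))
```

In this toy block, identical pairs occur more often than chance would predict and receive positive scores, while the A/S substitution occurs less often than expected and receives a negative score.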

12.4.3 Clustering Algorithms

Clustering, or grouping, procedures for large data sets are a set of statistical methods (a major subgroup of numerical analysis), based on similarity criteria for appropriately scaled variables that represent the data of interest. Sequence clustering algorithms take a large number of sequences and subdivide them into clusters, based on the extent of shared sequence identity in a minimum overlap region. These algorithms use evolutionary distances to build phylogenetic trees. The tree construction is based solely on the relative number of similarities and differences between a set of sequences. Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa. The Kohonen self-organizing method is one such clustering algorithm, based on neural networks, for construction of phylogenetic trees from sequence information. Clustering algorithms can also be used for cluster analysis in other problems, such as clustering of ESTs by sequence similarity to known genes. This allows each predicted gene to be compared against an array of EST sequences, enabling more effective annotation.

12.4.4 Distance Method

The distance method calculates the pair-wise distances between a group of sequences to produce a phylogenetic tree of the group by the program GROWTREE, based either on UPGMA or the neighbor-joining method. The sequence pairs that have the smallest number of sequence changes between them are termed “neighbors”. On a tree, these sequences share a common node or a common ancestral position and are each joined to that node by a branch. Finding the closest neighbors among a group of sequences by the distance method is often the first step in producing a multiple sequence alignment. CLUSTALW uses the neighbor-joining distance method as a guide to multiple sequence alignment. The score between two sequences is the number of mismatched positions in the alignment or the number of sequence positions that must be changed to generate the other sequence. A general approach is:

1. Find the most closely related sequences A and B.
2. Treat the rest of the sequences as a single composite sequence.
3. Calculate the average distance from A to all other sequences, and B to all other sequences.
4. Use these values to calculate distances (a and b).
5. Now treat A and B as a single composite sequence AB, calculate the average distance between AB and each of the other sequences, and make a new distance table.
6. Repeat the steps.

Some of the phylogenetic programs are:

• Phylip: (http://evolution.genetics.washington.edu/Phylip.html). Phylogenetic inference package; a collection of many programs: UPGMA, Parsimony, Neighbor-joining, Maximum likelihood algorithms.
• PhyloBLAST: Compares the query protein to the SWISS-PROT/TrEMBL database and carries out phylogenetic analysis.
• PILEUP: PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm.
• TreeBase: For graphical representation of phylogenetic trees.
• Tree Gen: Tree generation from distance data.
• TreeView: (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html). For graphical representation of phylogenetic trees.
• UPGMA: Unweighted Pair Group Method is a clustering algorithm using arithmetic averages. It calculates branch lengths between the most closely related sequences, and then averages the distance between this pair or sequence characters until all the sequences are included in the tree.
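The UPGMA procedure (join the closest pair, average its distances to everything else, repeat) can be sketched in a few lines. The distance table below is hypothetical, the function name and the bracketed-string representation of the tree are choices made for this example, and node heights are taken as half the distance between the joined clusters.

```python
def upgma(labels, distances):
    """Cluster labels by UPGMA.
    distances: dict {(a, b): d} with one entry for every pair of labels.
    Returns (tree as a bracketed string, list of (cluster_a, cluster_b, height))."""
    size = {label: 1 for label in labels}          # cluster -> number of leaves
    d = {tuple(sorted(pair)): value for pair, value in distances.items()}
    merges = []
    while len(size) > 1:
        # Closest pair of clusters; ties are broken by label for reproducibility.
        (a, b), dab = min(d.items(), key=lambda item: (item[1], item[0]))
        new = "(%s,%s)" % (a, b)
        merges.append((a, b, dab / 2))
        for c in size:
            if c in (a, b):
                continue
            dac = d.pop(tuple(sorted((a, c))))
            dbc = d.pop(tuple(sorted((b, c))))
            # Size-weighted average = arithmetic mean over all leaf pairs.
            d[tuple(sorted((new, c)))] = (size[a] * dac + size[b] * dbc) / (size[a] + size[b])
        del d[(a, b)]
        size[new] = size.pop(a) + size.pop(b)
    return next(iter(size)), merges

# Hypothetical pair-wise distances between four sequences.
distances = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
             ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
tree, merges = upgma(["A", "B", "C", "D"], distances)
print(tree)
for a, b, height in merges:
    print(a, b, height)
```

Weighting each average by cluster size means every original sequence contributes equally to the distance between two clusters, which is what "unweighted" refers to in the method's name.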

These similarity/distance matrix comparison methods and other statistical algorithms are based on phenotypic similarities of the species, without taking into account the evolutionary history that brought the species to the current phenotypes. Computer algorithms based on the phenotypic models rely heavily on sequence data in calculating evolutionary distances.

12.4.5 Cladistic Methods

Cladistic methods of phylogenetic analysis rely on current data as well as knowledge of ancestral relationships. They are based on the explicit assumption that sets of sequences (proteins) have evolved from a common ancestor by a process of mutation and selection without mixing. Evolutionary trees reconstructed via these methods are called ‘cladograms’. Computer algorithms based on the cladistic model generally rely on PARSIMONY or maximum likelihood methods for the calculation of relationships and building trees. Parsimony uses position-specific information in a multiple sequence alignment. The maximum likelihood method takes into account every sequence, every sequence change, and a specific model of sequence evolution.

• PARSIMONY: PARSIMONY is the most popular algorithm for constructing ancestral relationships. It allows the use of all known evolutionary information in tree building. It involves evaluating all possible trees and giving each tree a score based on the number of evolutionary changes that are needed to explain the observed data. For each aligned position (vertical column in the multiple sequence alignment), phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified. The most parsimonious tree is the one that requires the fewest evolutionary changes for all sequences to derive from a common ancestor. This method is best suited to sequences that are similar and to small numbers of sequences.
• Maximum Likelihood Method: This algorithm attempts to reconstruct a phylogenetic tree using an explicit model of evolution: let all sites be selectively neutral and let them spontaneously mutate at a constant rate per gamete per generation.
• PAUP: (http://onyx.si.edu/PAUP). Phylogenetic Analysis Using Parsimony. Starting with a set of aligned sequences, PAUP can search for phylogenetic trees that are optimal according to parsimony, distance, or maximum likelihood criteria using heuristic, branch-and-bound or exhaustive tree searching algorithms.
• PAUPSEARCH: Calculates trees.
• PAUPDISPLAY: Produces graphical version of PAUPSEARCH tree files.
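The parsimony score of a given tree, the number of changes needed to explain each alignment column, can be computed with Fitch's small-parsimony procedure. Fitch's method is named here as one standard way to do this counting; the text does not say which procedure the listed programs use. The trees, the alignment and the function names below are hypothetical, and only two candidate topologies are compared rather than all possible trees.

```python
def fitch(tree, states):
    """Fitch counting for one alignment column.
    tree: nested 2-tuples of leaf names; states: dict leaf -> character.
    Returns (set of candidate ancestral states, minimum number of changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch counts over every column.
    alignment: dict leaf -> aligned sequence (all of equal length)."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {leaf: seq[i] for leaf, seq in alignment.items()})[1]
        for i in range(length)
    )

alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "TCG"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, alignment), parsimony_score(tree2, alignment))
```

The tree with the lower total is the more parsimonious of the two. A full search would repeat this scoring over many topologies, which is why the method is practical mainly for small numbers of sequences.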


boring amino acids. This method examines a sequence window of ~13–17 residues and assumes that the central amino acid in the window will adopt a conformation that is determined by the side chains of all the amino acids in the window. Amino acid segments with the highest local scores for a particular secondary structure are assigned to that structure. However, secondary structure does not depend only on the conformational preferences of individual amino acids. Distant interactions within the amino acid sequence may influence local secondary structure. Vector methods provide a fast and reliable way to align structures. It is a much simpler computational problem to compare vector representations of secondary structures than to compare positions of all Cα or Cβ atoms in those structures. The most sophisticated methods for secondary structure prediction are neural network algorithms. In the neural network approach, computer programs are trained to recognize amino acid sequence patterns that are located in known secondary structures and to distinguish these patterns from other patterns not located in these structures. Once individually aligning sets of secondary structural elements have been identified, they are clustered into large alignment groups. This clustering generates a large number of possible groups of secondary structural elements from which the most likely ones must be selected. One of the methods is to align the atomic coordinates of a helix (or a sheet) in one protein with those of the matched helix (sheet) in the second structure and calculate the root mean square deviation. Some of the protein secondary structure prediction programs are:

• BCM Search Launcher: Provides access to a large collection of secondary structure prediction tools.
• CoDe: Secondary structure consensus prediction.
• DAS: Transmembrane prediction server (Sweden).
• DSC: (http://www.bmm.icnet.uk/dsc/). Linear discrimination secondary structure prediction server.
• JPRED (EBI, UK): Consensus method of secondary structure prediction server, based upon PHD, PREDATOR, DSC, ZPRED and MulPred programs.
• PHDsec: Prediction of secondary structure (EMBL, Germany).
• PHDhtm: Transmembrane helix location prediction and topology.
• PREDATOR: (http://www.embl-heidelberg.de/). Program that can predict the secondary structure of a single sequence, or of a number of related sequences. It is based on an analysis of amino acid patterns in structures that form hydrogen bond interactions between adjacent β-strands and between n and n + 4 residues in α-helices.
• PROFsec: Secondary structure prediction server.
• ProScale: Predicts hydrophobicity, α-helix, β-sheet and other parameters.
• PSA (Boston, USA): Protein sequence analysis database with a protein secondary structure prediction server. Predicts probable secondary structures and folding classes for a given amino acid sequence.
• PSI-Pred (UK): Uses PSI-BLAST profiles to predict secondary structure, transmembrane topology and fold recognition.
• ReDe (USA): Transmembrane prediction server.
• SAPS: Statistical Analysis of Protein Sequences: analysis of amino acid composition, charge, hydrophobicity, transmembrane regions and other parameters.
• SOUSI: Classification of secondary structure prediction of membrane proteins.
• SSCP: Prediction of content of helix, strand and coil for a query protein using the amino acid composition as the only input information.
• SSPRED: A three-state secondary structure prediction routine based on SWISS-PROT database.
• TMHMM: Prediction of transmembrane helices in proteins (Denmark).
• TMPred: Prediction of transmembrane regions and orientation (Switzerland).
• tRNA Scan-SE: Provides cloverleaf diagram of the tRNA molecules.
• VAST: Statistical method similar to BLAST. VAST score is the number of superimposable secondary structural elements found in comparing two sequences.

Secondary structure prediction can be carried out with statistical predictive methods with manual intervention wherever necessary. The principle behind manual intervention methods is to look for patterns of residue conservation that are indicative of secondary structures.

12.5.1.1 α-Helices

An α-helix has a periodicity of 3.6 residues per turn. So, in α-helices with one face buried in the protein core and the other exposed to solvent, residues at positions i, i + 3, i + 4, i + 7 (where i is a residue in a helix) will lie on one face of the helix. Thus patterns showing such conservation are indicative of α-helical regions.
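The i, i + 3, i + 4, i + 7 pattern follows directly from the periodicity: 3.6 residues per turn corresponds to a rotation of 100° per residue around the helix axis. The short sketch below checks this arithmetic. The function names and the choice of 60° as the half-width of a "face" are assumptions made for illustration.

```python
def helical_angles(positions, degrees_per_residue=100.0):
    """Angular position of each residue around the helix axis
    (3.6 residues per turn = 100 degrees per residue)."""
    return [(p * degrees_per_residue) % 360.0 for p in positions]

def same_face(positions, half_width=60.0):
    """True if every residue lies within half_width degrees of the
    direction of the first residue in the list."""
    reference = helical_angles(positions[:1])[0]
    for angle in helical_angles(positions):
        # Signed angular difference folded into the range [-180, 180).
        difference = abs((angle - reference + 180.0) % 360.0 - 180.0)
        if difference > half_width:
            return False
    return True

print(helical_angles([0, 3, 4, 7]))
print(same_face([0, 3, 4, 7]), same_face([0, 1, 2, 3]))
```

Residues i + 3, i + 4 and i + 7 fall within 60° of residue i, whereas consecutive residues are spread around the axis, which is why a conserved hydrophobic pattern at this spacing suggests a buried helical face.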

12.5.1.2 β-Strands

β-Strands that are half buried in the protein core will tend to have hydrophobic residues at positions i, i + 2, i + 4, i + 6, etc., and polar residues at i + 1, i + 3, i + 5, etc.

12.5.1.3 Reverse Turns

As reverse turns exhibit polar character, they are usually found at the molecular surface regions of proteins. Hydrophobic residues are found in regions adjacent to the turns. The pattern of occurrence of helix- and sheet-breakers and hydrophobic residues is a sign of the presence of reverse turns. Prediction of loop regions based on sequence alone is difficult, because loop regions vary in length, sequence and conformation. A better approach is to align the available tertiary structures and use distance geometry to obtain various classes of loop conformations, followed by choosing the best conformer on the basis of an ‘energy minimization’ procedure.

12.6 MOTIFS, DOMAINS AND PROFILES

Search and analysis for structural motifs and domains are a part of pattern recognition protocols. A motif is an aggregation of secondary structural elements. A protein may contain a single motif or multiple motifs. A structural domain is a segment of polypeptide chain that can fold into a spatially separable entity/moiety in globular proteins. Profiles encompass full domain alignments, by defining which residues are allowed at given positions in the sequence, which positions are highly conserved and which positions/regions tolerate insertions.

12.6.1 Motifs

The function of a protein is to a large extent a consequence of regions of local structural elements in the amino acid sequence, the motifs. Motifs are components of a more fundamental unit of structure and function, namely, the protein module. Proteins may have modules corresponding to different units of function, and these modules may be present in different order. Structural motifs, once identified in one protein structure, can be used as templates to search the entire database of protein structures. One approach to pattern recognition is to characterize a family by reducing a single conserved motif to a consensus pattern (e.g. E-F hand, helix-turn-helix, and zinc-finger motifs). The motif search programs ignore all but invariant positions in an alignment, and just describe the key residues that are conserved and define the family. For example, H-[FW]-X-[LIVM]-X-G-X(5)-[LV]-H-X(3)-[DE] describes the motif found in a family of DNA-binding proteins. An element of tolerance is introduced to motif search by grouping amino acids according to physicochemical properties. There are proteins that have multiple motifs. These motifs (conserved regions) can be used to create a “fingerprint” so that in a database search there is a better chance of identifying a distant relative. One of the approaches is to excise groups of motifs from alignments and convert the sequence information they contain into unweighted scoring matrices, to create a “fingerprint”. The BLOCKS algorithm, which searches for conserved amino acids in a family of proteins, is an alternative method of multiple motif search. In this method, each cluster is treated as a single segment, each with a score that gives a measure of its relatedness. Blocks within a family are converted to position-specific substitution matrices (PSSMs), which are used to make database searches.

• BLOCKS: (http://www.blocks.org). Search of motifs and protein classification. Motifs or blocks are created automatically, by detecting the most highly conserved regions of each protein family (USA).
• BlockMaker: Finds conserved blocks in a group of two or more unaligned, related protein sequences (USA).
• COILSCAN: Identifies coiled-coil regions.
• Gibbs Motif Sampler: Identification of conserved motifs in nucleic acid or protein sequences. It searches for the statistically most probable motifs and can find the optimal width and the number of these motifs in each sequence.
• HTHSCAN: Helix-turn-helix motifs scan.
• HSSP: Database of homology-derived secondary structure of proteins.
• Leucine Zippers: Searches for leucine-zipper motifs (Germany).
• MAST: Motif Alignment and Search Tool for searching sequence databases for sequences that contain one or more groups of motifs.


• MEME: Multiple EM for Motif Elicitation. MEME locates one or more ungapped patterns in sequences. A search is conducted over a range of possible motif widths, and the most likely width for each profile is chosen after one iteration of the EM (expectation-maximization) algorithm. The EM algorithm then iterates to find the best EM estimate for the width.
• MOTIF: Meta site motifs search server (Japan).
• Multi Coil: Predicts the location of coiled-coil regions in amino acid sequences.
• PRINTS: (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS). Search of motifs and protein classification. Motifs are encoded as ungapped, unweighted local alignments.
• PROSITE: (http://expasy.ch/sprot/prosite.html). Secondary structure/domain database from SWISS-PROT. It is based on highly conserved residues in a protein family and contains a comprehensive list of documented protein domains. It is the best starting point for a motif search. Motifs in proteins are encoded as patterns, which are good at identifying enzyme classes by their active site motif.

BLOCKS and PRINTS are two motif databases that represent protein or domain families by one or more ungapped multiple alignment fragments.
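The consensus pattern quoted above for the DNA-binding family can be turned into an ordinary regular expression. The following is a minimal sketch in Python, assuming the usual PROSITE-style conventions (square brackets for allowed residues, curly braces for excluded residues, X for any residue, a count in parentheses for repeats). The function names and the test sequence are hypothetical and serve only as illustration; they are not part of any of the programs listed above.

```python
import re

# One pattern element: a residue, an allowed set [..], or an excluded set {..},
# optionally followed by a repeat count such as (5) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Normalise en dashes and stray spaces left over from typesetting.
    cleaned = pattern.replace("\u2013", "-").replace(" ", "").strip("-.")
    parts = []
    for element in cleaned.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unrecognized pattern element: {element!r}")
        core, low, high = match.groups()
        core = core.upper()
        if core == "X":
            core = "."                      # X matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} lists excluded residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def find_motifs(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


PATTERN = "H-[FW]-X-[LIVM]-X-G-X(5)-[LV]-H-X(3)-[DE]"
# Made-up sequence with one embedded instance of the pattern (illustration only).
SEQUENCE = "MKHFALAGAAAAALHAAADKK"

print(prosite_to_regex(PATTERN))
print(find_motifs(PATTERN, SEQUENCE))
```

A plain regular expression reports only exact matches to the pattern; the tolerance gained by grouping residues with similar physicochemical properties would have to be written into the bracketed sets by hand.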

12.6.2 Domains

Globular proteins exhibit domain structure, a domain being an independently folding unit within the protein. Large proteins (> 50,000 daltons) tend to fold into structural domains. Structural domains are contiguous stretches of 100-150 amino acid residues that have a globular fold. They cannot be divided into smaller units, and they represent fundamental building blocks that can be used to understand the function and evolution of proteins. One or several structural domains form functional domains, with functions (and evolutionary significance) that are distinct from other parts of the protein. Because protein domains are often the evolutionarily conserved fragments of proteins, at both the sequence and structural levels, organizing databases by domain classification has many advantages for protein structure prediction and modeling. Fold comparison allows detection of many distant relationships and extends protein families, because similarity of the protein fold persists beyond divergence of the amino acid sequence. If one can identify, through sequence analysis, the location or presence of domains, it is often possible to gain greater insight not only into the probable function but also into the evolutionary history of that protein. Long stretches of repeated amino acid residues, particularly Pro, Gln, Ser and Thr, often indicate linker sequences and are usually good places to split a protein into domains. Internal protein domains can indicate whether the protein is likely to be involved in signaling or to be a transmembrane protein. Transmembrane segments are also very good dividing points, since they separate extracellular from intracellular domains and comprise ordered segments (e.g. α-helices) in the intracellular domains (Fig. 12.9). Profiles, hidden Markov models and other profile search algorithms are used to search domain databases.


Fig. 12.9 Structure of Integral Membrane Light Harvesting Protein Complex with Multiple α-Helices (Ref: Prince, S.M., et al. (1997) J Mol Biol., 268; 412) (Source: Protein Data Bank: 1KZU.pdb)
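The linker heuristic mentioned in the text (stretches rich in Pro, Gln, Ser and Thr as candidate domain boundaries) can be sketched as a sliding-window scan. The window length of 10 residues and the 70% threshold below are assumptions chosen only for illustration, as are the function name and the test sequence; they are not values taken from any published method.

```python
def find_linker_candidates(sequence, window=10, min_fraction=0.7,
                           linker_residues="PQST"):
    """Return merged (start, end) half-open ranges of windows rich in linker residues."""
    regions = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        fraction = sum(res in linker_residues for res in segment) / window
        if fraction >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous hit: extend that region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Made-up sequence: two featureless blocks joined by a P/Q/S/T-rich stretch.
demo = "A" * 20 + "PQSTPQSTPQST" + "A" * 20
print(find_linker_candidates(demo))
```

The reported range is wider than the linker itself because every qualifying window is kept whole; a candidate boundary found this way would still need to be checked against a domain database.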

Interacting pairs of proteins co-evolve to maintain functional and structural complementarity. Consequently, such a pair of protein families shows similarity between their phylogenetic trees. Evaluating the degree of co-evolution of family pairs with the global protein structural interactome map (PSIMAP, a map of all the structural domain-domain interactions in the PDB) would improve the accuracy of predictions based on ‘homologous interaction’. Some database sites and programs for domain search are:

• CDD-Search: (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml). Conserved domain database search; the domains are collected from Pfam and SMART.
• INTERPRO: (http://www.ebi.ac.uk/interpro/). Integrated resource of protein domains and functional sites; it combines the PFAM, PRINTS, ProSite and SWISS-PROT/TrEMBL resources.
• PRODOM: (http://protein.toulouse.inra.fr/prodom.html). Groups of sequence segments or domains from similar sequences found in the SWISS-PROT database by BLASTP multiple sequence alignment.


multiple aligned sequence data to identify additional members of the family, or (ii) by using the sequence to search (by Pfam) against publicly available HMM profiles. In the first case, a model of a sequence family is first produced and initialized with prior information about the sequences. The trained model may then be used to produce the most probable multiple sequence alignment as posterior information. Using a publicly available HMM profile is convenient if the domain is already present in the database. In the HMM, each column of the model gives the probability of a match, insert or delete at the corresponding column of the multiple sequence alignment. Each state generates an observation and has a table of amino acid emission probabilities. There are also transition probabilities for moving from state to state. A protein is represented as a sequence of probabilities, corresponding to a path through the model. The HMM generates a protein sequence by emitting amino acids as it progresses through a series of interconnected match, delete or insert states. The objective is to find the best HMM for a group of sequences by optimizing the transition probabilities between states and the amino acid composition of each match state in the model. The HMM is a probabilistic representation of a section of a multiple sequence alignment, with position-dependent character distributions and position-dependent insertion and deletion gap penalties. A protein sequence can be generated from the HMM by starting at the beginning (Fig. 12.11), following any one of many pathways from one type of sequence variation (state) to another along the state transition arrows, and terminating at the end. Each

Fig. 12.11 Schematic of Hidden Markov Model (HMM)


sequence is a match state. An insert state (hexagon) produces random amino acid letters for insertions between aligned columns, and a delete state (circle) produces a deletion in the alignment with probability 1. The scores show the probability that an amino acid occurs in a particular state. The “hidden” aspect of the model arises from the fact that the state sequence is not directly observed. Instead, one must infer the state sequence from a sequence of observed data using the probability model. A general procedure is:

1. The model is initialized with estimates of the transition probabilities and of the amino acid composition of each match and insert state.
2. All possible pathways through the model for generating each sequence in turn are examined.
3. A new version of the HMM is produced that uses the results found in step 2 to generate new transition probabilities and match/insert state compositions.
4. Steps 2 and 3 are repeated about ten more times, until the parameters do not change significantly.
5. The trained model is used to provide the most likely path for each sequence (Viterbi algorithm).

Limitations of HMMs are:

(i) The HMM is a linear model and is unable to capture higher-order correlations among amino acids in proteins. In reality, amino acids that are far apart in the linear sequence of a globular protein may be physically close to each other in the folded protein.
(ii) The Markovian assumption that the future is independent of the past, given the present, is not strictly applicable in biology (e.g. clustering of hydrophobic amino acids in proteins, and conserved regions).

Some profile search programs/tools/servers/sites are:

• eMATRIX Search: Meta site profile search (USA). Includes BLOCKS, DOMO, PRINTS, PRODOM, and PROSITE databases.
• InterPro Search: Meta site profile scan server (USA) that includes PFAM, PRINTS and PROSITE.
• PANT: Meta site server (USA). Searches PROSITE patterns and PROFILES, BLOCKS, PFAM and PRINTS.
• PFAM: (http://www.sanger.ac.uk/software/pfam). Protein family database that contains curated multiple sequence alignments for each protein family. It contains functional annotation, databank links for each family and literature references. These multiple sequence alignments are used to construct HMM profiles. A library of these profiles is used in turn to identify protein domains in uncharacterized query sequences. PFAM excels in extracellular domain search.
• PROFILES: Weighted matrices provide a sensitive means of detecting distant sequence relationships, where only a few residues are conserved.
• Profile Scan: Meta site profile scan server (ISREC, Switzerland) that searches a sequence against a library of profiles. The server includes PROSITE, PFAM, and Gribskov.
• SAM: Protein family search tool based on hidden Markov models.
• TOPITS: Fold recognition by prediction-based threading.
• UCLA-DOE: Protein fold recognition server (USA).

Because of their information-rich descriptions, PROFILE, PRODOM, and PFAM databases are able to detect even very distant instances of a motif not otherwise detectable.
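The last step of the HMM procedure described in this section, finding the most likely state path with the Viterbi algorithm, can be illustrated with a small sketch. The model below is a deliberately simplified two-state HMM, not a full profile HMM (which also has silent delete states that need special handling). The state names, the two-letter observation alphabet (h for hydrophobic, p for polar) and all probabilities are assumptions made up for the illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_log_probability, best_state_path) for the observations."""
    # score[t][s]: best log-probability of any path ending in state s at position t.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + math.log(trans_p[p][s]))
            score[t][s] = (score[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path


# Hypothetical toy model: a hydrophobic "core" state and a polar "loop" state.
STATES = ["core", "loop"]
START = {"core": 0.5, "loop": 0.5}
TRANS = {"core": {"core": 0.9, "loop": 0.1},
         "loop": {"core": 0.1, "loop": 0.9}}
EMIT = {"core": {"h": 0.8, "p": 0.2},
        "loop": {"h": 0.2, "p": 0.8}}

best_score, best_path = viterbi("hhhppp", STATES, START, TRANS, EMIT)
print(best_path)
```

Working in log space avoids numerical underflow on long sequences; the sketch assumes that every probability in the tables is strictly positive.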

12.7 PATTERN RECOGNITION

Pattern recognition programs follow the reverse process of sequence analysis. Rather than predicting how a sequence will fold, they predict how well a fold will match a sequence. That is, they match a sequence to a given topology rather than search for the topology of a given sequence. Pattern recognition methods attempt to detect similarities between 3-D structures that are not accompanied by any significant sequence similarity. The general approach involves calculating a table of propensities that gives the probability of each type of amino acid being found in a given environment. For a given structure, each position can be assigned to one of the environments. Dynamic programming is then used to find the best match of the sequence to the pattern of environments found in a given fold. Some of the common programs are:

• BLASTPAT: BLAST-based patterns database search.
• EPAT: Pattern search (for PDB; SWISS-PROT; PIR databases).
• FASTAPAT: FASTA-based patterns database search.
• FINDPATTERNS
• Meta-MEME: The program uses the HMM method to find motifs (conserved sequence domains) in a set of related protein domains.
• PRATT: A pattern recognition tool. Searches for patterns conserved in a set of protein or nucleic acid sequences. It is able to discover patterns conserved in sites of unaligned protein sequences.
• PATScan: Search for patterns conserved in a set of protein or nucleic acid databases.
• PatternP: Search for patterns in protein sequences in a PDB file.
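The table of propensities on which the pattern recognition approach of this section rests can be sketched as a log-odds calculation: how much more (or less) often a residue type is seen in an environment than expected from its overall frequency. The counts below are invented, the two environments ("buried", "exposed") are an assumed simplification, and the pseudocount is a choice made here to avoid taking the logarithm of zero.

```python
import math
from collections import Counter


def propensity_table(observations, pseudocount=1.0):
    """Map (amino_acid, environment) to log( P(aa | env) / P(aa) )."""
    observations = list(observations)
    residues = sorted({aa for aa, _ in observations})
    environments = sorted({env for _, env in observations})
    pair_counts = Counter(observations)
    residue_counts = Counter(aa for aa, _ in observations)
    env_counts = Counter(env for _, env in observations)
    total = len(observations)
    n_res, n_env = len(residues), len(environments)

    table = {}
    for env in environments:
        for aa in residues:
            # Smoothed conditional and background frequencies.
            p_aa_given_env = ((pair_counts[(aa, env)] + pseudocount)
                              / (env_counts[env] + pseudocount * n_res))
            p_aa = ((residue_counts[aa] + pseudocount * n_env)
                    / (total + pseudocount * n_res * n_env))
            table[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return table


# Invented counts: leucine mostly buried, lysine mostly exposed.
observed = ([("L", "buried")] * 8 + [("L", "exposed")] * 2
            + [("K", "buried")] * 1 + [("K", "exposed")] * 9)
table = propensity_table(observed)
for key in sorted(table):
    print(key, round(table[key], 3))
```

Positive values mark residue types that are over-represented in an environment and negative values those that are under-represented; such a table supplies the per-position scores that a dynamic programming alignment then sums.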

12.8 PROTEIN CLASSIFICATION AND MODELING

The most general description of the three-dimensional structure of a protein is in terms of its spatial fold, that is, the topology of its polypeptide chain. This classification disregards local differences, and is mostly concerned with the secondary structure elements and their mutual disposition.


12.8.1 Protein Classification

Proteins are classified into families according to their sequence similarity (PSSM), secondary structure (class), motif (architecture), profile (HMM) and homology. Protein classification methods are based on the premise that structural similarities shared by proteins reflect common evolutionary origins. Many proteins are made up of modules (regions of conserved amino acid patterns comprising one or more motifs). Proteins from widely divergent biological sources may share several such modules, but the modules may not be in the same order. A protein family whose members have the same domains in the same order, but also have dissimilar regions, is designated a homeomorphic family. Proteins with the same biochemical functions have been examined for the presence of structurally conserved amino acid patterns that represent an active site or other important feature (Prosite catalog). Proteins can also be classified by clustering methods, which have been used to identify groups of proteins that lack a relative with known structure and hence are suitable for structure analysis. Some of the protein classification databases are:

• 3D-PSSM: (http://www.bmm.icnet.uk/3dpssm/). Database based on structure similarity in SCOP.
• CATH: (http://www.biochem.ac.uk/bsm/CATH/). The Class, Architecture, Topology and Homology database is a hierarchical domain classification of protein structures.
• FSSP: (http://www2.embl-ebi.ac.uk/dali/fssp/). The Fold classification based on Structure-Structure alignment of Proteins database is based on a structural alignment of all pair-wise combinations of the proteins in the PDB structural database by the structural alignment program DALI.
• GeneFind: An integrated neural network protein classification database. It is based on the MOTIFIND neural networks and the ProClass family database. GeneFind uses a multilevel filter system, with MOTIFIND, BLAST and Smith-Waterman pair-wise alignment programs.
• LPFC: (http://www-canis.stanford.edu/projects/helix/LPFC/). A library of protein family cores based on multiple sequence alignment of protein cores using amino acid substitution matrices based on structure.
• MMDB: The molecular modeling database contains PDB structures that have been categorized into structurally related groups by the vector alignment search tool (VAST).
• iProClass (USA): Integrated protein classification resource that provides comprehensive family relationships and structural/functional features of proteins.
• ProClass: (http://www-nbrf.georgetown.edu/gfserver/proclass.html). Provides a summary description of protein family, structure and function for PIR-PSD, SWISS-PROT and TrEMBL.
• Prosite: (http://www.expasy.ch/prosite/). Database of groups of proteins with similar biochemical functions, derived on the basis of amino acid patterns.
• SCOP: (http://www.mrc.lmb.ac.uk/SCOP/). Clustering algorithm for proteins with sequence identity > 30%. Provides hierarchical structural classification.

It is important in protein classification analysis that experimental 3-D structure data are available for at least one representative protein of every family of homologous proteins, for validation of predicted models. The quality of such predicted models is related to the level of structural similarity between the query protein, whose sequence is available, and the representative protein, whose 3-D structure is also available.

12.8.2 Tertiary Structure Modeling

There are a number of methods for predicting the 3-D structure of a protein from its amino acid sequence. The best approach is to find, by sequence analysis, a link between the sequence of the query protein and a protein of known 3-D structure. If the query protein sequence shows significant homology to another protein of known 3-D structure, then a fairly accurate model of the 3-D structure of the query protein can be obtained via homology modeling (Path 1 of Fig. 12.1), by superposition of the query sequence onto the sequence of a related protein whose 3-D structure has been experimentally determined. There are ~ 500 common structural folds for ~ 12,000 3-D structures. That is, many different sequences can adopt the same fold. Thus, there are many combinations of amino acids that can assemble into the same 3-D conformation. This implies that, while significant sequence similarity is an indicator of an evolutionary relationship between sequences, significant structural similarity may or may not be an indicator of an evolutionary relationship. Structural homology modeling methods provide a means to expand the number of known structures by building models of other proteins that show some level of similarity to a protein of known 3-D structure. If a portion of the sequence matches a domain of a protein of known 3-D structure, a 3-D homology model can be constructed from that protein. When a global sequence alignment shows > 45% homology, the amino acids should be quite superimposable in the 3-D structures of the proteins. To compare two structures, the positions of atoms in the two 3-D structures are compared. These methods initially examine the positions of secondary structural elements (α-helices and β-strands) within a protein domain to determine whether the number, type and relative positions of these elements are similar. Stabilizing the secondary structure elements to maximize the hydrophobicity of the core is an important feature in the prediction and modeling of protein folds. Even when no homologue of known 3-D structure is found, it may be possible to predict a model by fold recognition methods (SCOP; CATH). Steps in building a homology-based model are:

1. Obtain a relative of the query sequence.
2. Build the template structure from the protein structure database.
3. Align conserved residues that are predicted to be buried or exposed with those known to be buried or exposed in the template structure (use the PHD server).
4. Align the backbone first and then add the side chains of the query sequence.
5. Line up every secondary structure element with its appropriate counterpart.
6. Ensure that H-bonding patterns are not disturbed in the secondary structure.
7. Conserve residue properties (size, polarity, hydrophobicity etc.).
8. Optimize the structure using energy minimization.

• Dali: (http://www.embl-ebi.ac.uk/dali/). The Dali server is a network service for comparing protein structures in 3-D. Dali compares the coordinate list of a query protein structure against those in the Protein Data Bank (PDB). In favorable cases, comparing 3-D structures may reveal biologically interesting similarities that are not detectable by comparing sequences.

Instead of exhaustive database searching, in which a 3-D query structure is compared to each and every structure in the database, a rapid 3-D protein structure retrieval system (“ProtDex2”) can be used to search the database without accessing every 3-D structure in it. This retrieval process is based on an inverted-file index, constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the 3-D protein structures in the database. The ProtDex2 algorithm is faster than other protein comparison algorithms, such as DALI.

12.8.2.1 Distance Matrix Method

The method is similar to dotplot analysis. It uses a graphic procedure to identify the atoms that lie most closely together in the 3-D structure. If two proteins have a similar structure, the graphs of these structures will be superimposable. Distances between Cα-atoms along the polypeptide chain can be compared by a 2-D matrix representation of the structure. The distance matrix compares geometric relationships between the structures without regard to alignment. The sequence of the protein is listed both across the top and down the side of the matrix. Each matrix position represents the distance between the corresponding Cα-atoms in the 3-D structure. The program DALI (distance alignment tool) uses this method to align protein structures.
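A Cα distance matrix of the kind described above can be computed directly from atomic coordinates. The sketch below uses four invented Cα positions (not taken from any PDB entry) and shows why the representation is convenient: the matrix is unchanged when the structure is rotated and translated, so two structures can be compared without first superimposing them. This is only the matrix construction; it is not the DALI alignment algorithm itself.

```python
import math


def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha positions (x, y, z)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def mean_abs_difference(d1, d2):
    """Average absolute difference between two equally sized distance matrices."""
    n = len(d1)
    return sum(abs(d1[i][j] - d2[i][j]) for i in range(n) for j in range(n)) / (n * n)


# Invented C-alpha coordinates (in angstroms), for illustration only.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
# The same structure rotated by 90 degrees about the z-axis and then shifted.
moved = [(-y + 5.0, x - 3.0, z + 2.0) for x, y, z in coords]

original = distance_matrix(coords)
rotated = distance_matrix(moved)
print(round(original[0][1], 2))
print(mean_abs_difference(original, rotated) < 1e-9)
```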

12.8.2.2 Structure Profile Method

The structure profile method is an environmental template method: a sequence profile is used to predict which amino acids might be able to fit into a given structural position. The environment of each amino acid in each known structural core is determined, including the secondary structure and the buried surface of the side chains. On the basis of these physicochemical parameters, each position is classified into one of eighteen types: six types representing increasing levels of residue burial and fraction of surface covered by polar atoms, combined with three classes of secondary structure. Each amino acid is then assessed for its ability to fit into that type of site in the structure. The sequence of the protein is then aligned with a series of such environmentally defined positions in the structure, to see whether a series of amino acids in the sequence can be aligned with the assigned structural environments of a given core. The procedure is repeated for each protein core in the structural database, and the best matches of the query sequence to a core are identified. The structural 3-D profile is a table of scores with one row for each amino acid position in the core, a column for each possible amino acid substitution at that position, and two columns for deletion penalties at that site (Fig. 12.12). Each position in the core is assigned to one of 18 classes of structural environment. The scores in each row reflect the suitability of a given amino acid for that particular environment. The penalty at each core position reflects the acceptability of an insertion or deletion of one or more amino acids at that position in the structure. If the position is within the core, these penalties are generally high, reflecting incompatibility with the structure, but they are lower for positions on the surface of the core and within loop regions. Dynamic programming is used to identify an optimal, best-scoring alignment. If a target structure is found to have a significantly high score, then the query sequence is predicted to have a fold similar to that of the target core.

Fig. 12.12 3-D Structure Profile Scheme
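The alignment of a sequence against a 3-D profile can be sketched as a global dynamic programming problem with position-specific gap penalties. The sketch below makes several simplifying assumptions that are not in the method as described: only two environment classes (B for buried, E for exposed) instead of eighteen, a single per-residue gap penalty per position instead of two penalty columns, and invented score values.

```python
def align_to_profile(sequence, profile, scores):
    """Best global alignment score of a sequence against a structural profile.

    profile: list of (environment, gap_penalty), one entry per core position.
    scores:  dict mapping (amino_acid, environment) to a compatibility score.
    """
    n, m = len(sequence), len(profile)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                # Residue i placed at profile position j.
                env = profile[j - 1][0]
                candidates.append(best[i - 1][j - 1] + scores[(sequence[i - 1], env)])
            if j > 0:
                # Profile position j left empty (deletion in the sequence).
                candidates.append(best[i][j - 1] - profile[j - 1][1])
            if i > 0:
                # Residue i inserted next to profile position j (or position 1 at the start).
                candidates.append(best[i - 1][j] - profile[max(j, 1) - 1][1])
            best[i][j] = max(candidates)
    return best[n][m]


# Invented scores: hydrophobic residues favour buried sites, polar ones exposed sites.
SCORES = {("L", "B"): 2.0, ("V", "B"): 2.0, ("K", "B"): -2.0, ("D", "B"): -2.0,
          ("L", "E"): -1.0, ("V", "E"): -1.0, ("K", "E"): 1.0, ("D", "E"): 1.0}
# Two buried core positions (gaps costly) followed by two exposed ones (gaps cheap).
PROFILE = [("B", 4.0), ("B", 4.0), ("E", 1.0), ("E", 1.0)]

print(align_to_profile("LVKD", PROFILE, SCORES))
print(align_to_profile("KDLV", PROFILE, SCORES))
```

The first sequence places hydrophobic residues in the buried positions and scores higher than the second, which has the same composition in the wrong order. Repeating the calculation over every core in a structural database and ranking the scores corresponds to the search described in the text.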

12.8.2.3 3D-1D Profile Sum Method

In this method (Fig. 12.13), sequences are screened to prepare a 3-D profile, a discrete list of scores for matching a 1-D sequence to a 3-D structure. The procedure takes into account amino acid neighbors, main-chain conformations and secondary structural features of each residue in the structure.

12.8.2.4 Contact Potential Method

In the contact potential method, each structural core is represented as a 2-D contact matrix (similar to the distance matrix method of the program DALI). A matrix is produced with the amino acids in the structure listed across the rows and down the columns. Each matrix position holds the distance between the corresponding pair of amino acids in the structure. Groups of amino acids in closest contact produce recognizable patterns. The objective is to superimpose sets of amino acid pairs in the query sequence onto the distance matrix of the core. To find the best combination, the approximate conformational energies of the predicted pairs are summed to estimate the conformational stability of the predicted structure. Contact energies can thus be used to choose the correct core in a structural database.
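The summation of pairwise contact energies can be sketched as follows. Each candidate core is reduced here to a list of position pairs that are in contact, and the sequence is threaded onto the core position by position without gaps. The two-letter residue alphabet (H for hydrophobic, P for polar), the energy values and the two cores are all invented for illustration; real contact potentials are derived statistically from known structures.

```python
def contact_energy(sequence, contacts, pair_energy):
    """Sum pairwise contact energies for a sequence threaded onto a core.

    contacts:    list of (i, j) index pairs that are in contact in the core.
    pair_energy: dict mapping a sorted residue pair to an energy value.
    """
    total = 0.0
    for i, j in contacts:
        key = tuple(sorted((sequence[i], sequence[j])))
        total += pair_energy[key]
    return total


def best_core(sequence, cores, pair_energy):
    """Return the name of the core with the lowest summed contact energy."""
    return min(cores, key=lambda name: contact_energy(sequence, cores[name], pair_energy))


# Invented energies: only hydrophobic-hydrophobic contacts are favourable.
PAIR_ENERGY = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}
# Two hypothetical cores, each given as a list of contacting positions.
CORES = {"core_a": [(0, 3), (0, 5), (1, 4)],
         "core_b": [(1, 4), (2, 5), (0, 4)]}
QUERY = "HPPHPH"

print(contact_energy(QUERY, CORES["core_a"], PAIR_ENERGY))
print(best_core(QUERY, CORES, PAIR_ENERGY))
```

In this toy case the first core brings the hydrophobic residues of the query into contact and therefore receives the lower (more favourable) energy.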

• MODELLER: (http://guitar.rockefeller.edu/modeller/). Dynamic programming alignment of sequences and 3-D structure modeling.
• PDBSum: The database provides summaries of structural analyses of PDB data files, structural classification, and programs to plot schematic diagrams of protein-ligand interactions, protein structure motifs etc.
• Predict Protein: The Meta-server (EMBL, Germany) is used to find structural homologues of a query protein sequence, to detect functional motifs and domains, and to predict secondary structure based on a single sequence or a multiple sequence alignment.
• ProSAL: A Meta site for protein analysis and characterization (Sweden).
• ProtDex2: ([email protected]). Rapid 3-D protein structure database searching using information retrieval techniques.
• SDSC1: San Diego Supercomputer Center Protein Structure Homology Modeling. This is a site to try if the query sequence does not show any sequence homology to existing proteins in databases.
• SRS: (http://srs.hgmp.mrc.ac.uk/). Sequence retrieval system.
• SWISS MODEL: (http://www.expasy.ch/swissmodel/). Sequence alignment of a query sequence with a known 3-D structure by homology modeling (> 50% homology).
• WHATIF: (http://www.umbi.kun.nl/whatif/). A web interface (EMBL, Germany) that provides tools for examining PDB files.

12.8.3 3-D Structure Viewing

Once the 3-D model is available, various programs can be used to view the 3-D structure; they may also be used to compare it with other homologous structures in databases (e.g. the PDB) by superposition. The superposed structures can be viewed with CHIME or RasMol. Some of the structure viewing programs are:

• CHIME: (http://www.umass.edu/microbio/chime/). Good for lecture presentation.
• Cn3D: (http://www.ncbi.nih.gov/structure/). Provides viewing of 3-D structures from Entrez.
• ExPASy
• GRASP
• LOCK (USA): Hierarchical protein structure superposition tool.
• MolMol
• Prepi
• RasMol: (http://www.umass.edu/microbio/rasmol/). Most commonly used viewer program for Windows.
• Swiss 3D Image: (http://www.expasy.ch/sw3d/) (ExPASy, Switzerland). An image database that provides high quality pictures of biological macromolecules with known three-dimensional structures.


EXERCISE MODULES

1. What is sequence alignment?
2. What is the basis of sequence similarity search algorithms?
3. What are the similarity/distance matrices meant for?
4. What is meant by pair-wise sequence alignment?
5. What is dotplot analysis?
6. What is dynamic programming?
7. What are the differences between global and local alignments of sequences?
8. What is multiple sequence alignment and what is its objective?
9. Which are the programs for multiple sequence alignment?
10. Name BLAST suite and FASTA algorithms for multiple sequence alignment.
11. What are the best strategies for sequence similarity search?
12. What is phylogenetic analysis?
13. What is the Dayhoff mutation data matrix?
14. What are the limitations of Dayhoff’s PAM model?
15. What are the features of the BLOSUM algorithm?
16. What are distance method algorithms?
17. What are the cladistic methods in evolutionary analysis?
18. What is an evolutionary tree and how is it constructed?
19. What are the protocols for secondary structure prediction?
20. What are the methods for helix prediction?
21. What are the methods for strand prediction?
22. How are turns and loops predicted?
23. What are the methods and programs for predicting motifs, domains, and profiles?
24. What is the relevance of Hidden Markov models in multiple sequence alignments and pattern recognition analysis?
25. What are the protocols for pattern recognition?
26. Which are the programs used in protein classification?
27. What are the various methods in protein tertiary structure prediction and modeling?
28. What is sequence threading?
29. Name some servers/databases for protein modeling.
30. Which are the programs for 3-D structure viewing?

“Blocks database and its application”. 52. Henikoff, J.G., Henikoff, S. & Pietrokovski, S. (1999), Biotransformatics, 15; 471. “Blocks+: a nonredundant database of protein alignment blocks derived from multiple compilations “. 53. Henikoff, S. & Henikoff, J.G. (2000), Adv Protein Chem., 54; 73. “Amino acid substitution matrices “. 54. Higgins, D.G., Thompson, J.D. & Gibson, T.J. (1996), Methods Enzymol., 266; 383. “Using CLUSTAL for multiple sequence alignments”. 55. Hofmann, K., et al. (1999), Nucleic Acids Res., 27; 215. “The PROSITE database, its status in 1999”. 56. Holm, L., et al. (1991), Protein Sci., 1; 1691. “A database of protein structure families with common folding motifs”. 57. Holm, L. & Sander, C. (1993), J Mol Biol., 233; 123. “Protein structure comparison by alignment of distance matrices”. 58. Holm, L. & Sander, C. (1997), Nucleic Acids Res., 25; 231. “Dali/FSSP classification of three-dimensional protein folds”. 59. Hubbard, D.T. (1999), Nucleic Acid Res., 27; 254. “SCOP: a structural classification of proteins database”. 60. Janin, J. & Chothia, C, (1980), J Mol Biol., 143; 95. “Packing of a-helices on to b-pleated sheets and the anatomy of a/b-proteins”. 61. Johnson, M.S. & Overignton, J.P. (1993), J Mol Biol., 233; 716. “A structural basis for sequence comparison”.

Data Mining, Analysis and Modeling 209 62. Jones, D.T. (1999), J Mol Biol., 292; 195. “Protein secondary structure prediction based on position-specific scoring matrices”. 63. Jones, D.T. (1999), J Mol Biol., 287; 797. “Gen TGREADER: efficient and reliable protein fold recognition method for genomic sequences”. 64. Karplus, K., Barrett, C. & Hughey, D.G. (1998), Bioinformatics, 14; 846. “Hidden Markov models for detecting remote protein homologies”. 65. Kasuya, A. & Thornton, J.M. (1999), J Mol Biol., 286; 1673. “Three-dimensional structure analysis of PROSITE patterns”. 66. Kim, W. K., Bolser, D.M. & Park, J.H. (2004), Bioinformatics, 20(7); ([email protected]). “Large-scale co-evolution analysis of protein structural interlogues using the global protein structural interactome map (PSIMAP)”. 67. King, R.D., Weiss, P.H. & Clare, A. (2004), Bioinformatics; ([email protected]). “Confirmation of data mining based predictions of protein function”. 68. Kleywegt, G.J. (1999), J Mol Biol., 285; 1887. “Recognition of spatial motifs in protein structures”. 69. Konforti, B. (1999), Nature Struct Biol., 6; 505. “Rules for protein-DNA interactions”. 70. Kreil, D.P. & Etzold, T. (1999), Trends Biochem Sci., 24; 155. “DATABANKS– a catalogue database of molecular biology databases”. 71. Krogh, A. et al. (1994), J Mol Biol., 235; 1501. “Hidden Markov models in computational biology: application to protein modeling”. 72. Kyte, J. & Doolittle, R.F. (1982), J Mol Biol., 157; 105. “A simple method for displaying the hydropathic character of a protein”. 73. Lakowski, R.A., et al. (1997), TiBS., 22; 488. “PDBsum: A web-based database of summaries and analyses of all PDB structures”. 74. Lee, R.H. (1992), Nature, 356; 543. “Protein model building using structural homology”. 75. Lesk, A.M. (1991) IRL Press: London. “Protein Architecture: A Practical Approach”. 76. Lesk, A.M. & Chothia, C. (1980), J Mol Biol., 136; 225. 
“How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of globins”. 77. Levitt, M. & Chothia, C. (1976), Nature, 261; 552. “Structural patterns in globular proteins”. 78. Lipman, D.J. & Pearson, W.R. (1985), Science, 227; 1435. “Rapid and sensitive protein similarity searches”. 79. Lohman, R., Schneider, G. & Behrens, D. (1994), Protein Sci., 3; 1597. “A neural network model for the prediction of membrane-spanning amino acid sequences”. 80. Lupas, A. (1996), Methods Enzymol., 266; 513. “Prediction and analysis of coiled-coil structures”. 81. Luthy, R., Bowie, J.U. & Eisenberg, D. (1992), Nature, 356; 83. “Assessment of protein models with threedimensional profiles”. 82. Martin, A., et al. (1998), Structure, 6; 875. “Protein folds and functions”. 83. Mewes, H.W., et al. (2000), Nucleic Acids Res., 28; 37. “MIPS: a database for genomes and protein sequences”. 84. Michie, A.D., Jones, M.L. & Attwood, T.K. (1996), Trends Biochem Sci., 21(5); 191. “DbBrowser: integrated access to databases worldwide”. 85. Morgenstern, B., et al. (1998), Bioinformatics. 14; 290. “DIALIGN: Finding local similarities by multiple sequence alignment”. 86. Moult, J. (1999), Curr Opin Biotechnol., 10(6); 583. “Predicting protein three-dimensional structure”. 87. Mount, D.W. (2001), Cold Spring Harbor Lab Press: New York. “Bioinformatics: Sequence and Genome Analysis”.

210

Bioinformatics: A Primer

88. Murzin, A.G., et al. (1995), J Mol Biol., 247; 536. “SCOP: A structural classification of proteins database for investigation of sequence and structures”. 89. Needleman, S.B. & Wunsch, C.D. (1970), J Mol Biol., 48; 443. “A general method applicable to the search for similarities in the amino acid sequences of two proteins”. 90. Orengo, C.A., et al. (1994), Nature, 372; 631. “Protein superfamilies and domain superfolds”. 91. Orengo, C.A., et al. (1994), Curr Opin Struct Biol., 4(3); 423. “Classification of protein folds”. 92. Orengo, C.A., et al. (1997), Structure, 5(8); 1093. “CATH-a hierarchical classification of protein domain structures”. 93. Orengo, C.A., et al. (1999), Curr Opin Struct Biol., 9(3); 374. “From protein structure to function”. 94. Overington, J.P. (1992), Curr Opin Struct Biol., 2; 394. “Comparison of three-dimensional structures of homologous proteins”. 95. Pabo, C.O. & Sauer, R.T. (1992), Annu Rev Biochem., 61; 1053. “Transcription factors: structural families and principles of DNA recognition. 96. Pearson, W.A. (1990), Methods Enzymol., 183; 63. “Rapid and sensitive sequence comparison with FASTA and FASTP”. 97. Pearson, W.R. (2000), Methods Mol Biol., 132; 185. “Flexible sequence similarity searching with FASTA3 program package. 98. Pearson, W.R. & Miller, W. (1992), Methods Enzymol., 210; 575. “Dynamic Programming algorithms for biological sequence comparison”. 99. Quain, N. & Sejnowski, T.J. (1988), J Mol Biol., 202; 865. “Predicting the secondary structure of globular proteins using neural network models”. 100. Richardson, J.S. (1985), Methods Enzymol, 115; 349. “Describing patterns of protein tertiary structure”. 101. Ripley, B.D. (1996), Cambridge University Press: Cambridge. “Pattern Recognition and Neural Networks”. 102. Rose, G.D., et al. (1985), Science, 229; 834. “Hydrophobicity of amino acid residues in globular proteins”. 103. Sali, A. & Overington, J.P. (1994), Protein Sci., 3; 1582. 
“Derivation of rules for comparative protein modeling from a database of protein structure alignments”. 104. Sali, A., et al. (1995), Proteins, 23; 318. “Evaluation of comparative protein modeling by MODELLER”. 105. Sanchez, R. & Sali, A. (1997), Curr Opin Struct Biol., 7; 206–14. “Advances in comparative protein structure modeling”. 106. Sayle, R.A. & Milner-White, E.J. (1995), TiBS., 20; 374. “RASMOL: Biomolecular graphics for all”. 107. Schuler, G.D., et al. (1996), Methods Enzymol., 266; 141. “Entrez: molecular biology database and retrieval system”. 108. Sensen, C.W., (Ed). (2001), Wiley-VCH: Weinheim. “Biotechnology(5b): Genomics and Bioinformatics” (2nd Edn). 109. Smith, T.F. & Waterman, M.S. (1981), J Mol Biol., 147; 195. “Identification of common molecular subsequences”. 110. Sonnhammer, E.L., et al. (1998), Nucleic Acid Res., 26(1); 320. “Pfam: multiple sequence alignments and HMM-profiles of protein domains”. 111. Struhl, K. (1989), TiBS., 14; 137. “Helix-turn-helix, zinc-finger, and leucine-zipper motifs for eucaryotic transcriptional regulatory proteins”. 112. Sun, Z. & Jiang, B. (1996), J Prot Chem., 15; 675. “Conformation of commonly occurring super-secondary structures (basic motifs) in protein databank”. 113. Swindells, M.B. & Thornton, J.M. (1991), Curr Opin Struct Biol., 1; 219. “Modeling by homology”.

Data Mining, Analysis and Modeling 211 114. Todd, A.E., Orengo, C.A. & Thornton, J.M. (1999), Curr Opin Chembiol., 3(5); 548. “Evolution of protein function, from a structure perspective”. 115. Vriend, G. & Sander, C. (1991), Proteins, 11; 52. “Detection of common three-dimensional structures in proteins”. 116. Wodak, S.J. & Rooman, M.J. (1993), Curr Opin Struct Biol., 3; 247. “Generating and testing protein folds”. 117. Wu, C.H., Shivakumar, S. & Huang, H. (1999), Nucleic Acids Res., 27; 272. “ProClass protein family database”. 118. Zucker, M. (2000), Curr Opin Struct Biol., 10; 303. “Calculating nucleic acid secondary structure”.

13 Medico- and Pharmacoinformatics

Genes are units of heredity that provide the blueprint for our physical body and our well-being. Quality of life can be drastically altered by disease, and genetic disease is perhaps the purest illustration of the relationship between our genes and our health. In this context, the major impact of the new era of genomic biology (applications of genome sequencing projects), which combines experimental data from gene expression microarrays, electrophoresis, mass spectrometry and other experimental techniques with computational methods, has been and will continue to be felt in the medical and pharmacological fields. Vast amounts of genome sequence data (ESTs) with refined annotations are available. These databases, together with analysis of tissue-specific assays, will be the sources of information for understanding the molecular basis of genetic diseases, for gene discovery (some of the discovered genes may carry disease propensity) and genetic screening, and for finding new methods of diagnosis, design of novel genes, and personalized gene-based therapies. A general approach towards genetic disease management comprises (i) identification of the disease-causing gene, (ii) diagnostic/screening tests, and (iii) therapeutic protocols (Fig. 13.1).

Fig. 13.1 Flowchart of Databank-based Genetic Disease Management

13.1 DISEASE GENE IDENTIFICATION

Chromosome abnormalities are responsible for a significant portion of genetic disorders that appear to arise de novo. Chromosomal abnormalities can occur within the same chromosome (deletions, insertions, duplications and inversions) or between chromosomes (translocations), or can involve changes in chromosome number (ploidy). Chromosomal abnormalities can now be detected during pregnancy through amniocentesis and cytogenetic analyses, namely chromosome banding and fluorescence in situ hybridization (FISH) analysis, a type of in situ hybridization in which target sequences are stained with fluorescent dye so that their location and size can be determined using fluorescence microscopy.

13.1.1 Linkage Analysis and Positional Cloning

Once a mutant, a marker allele (one of several alternate forms of a gene that occur at the same locus on homologous chromosomes and govern the same biochemical and developmental process), or a susceptibility region of a chromosome has been identified that is associated with a disease, specific genetic and biochemical assays for detecting the physiological and metabolic changes caused by the disease can be developed for diagnostic screening. However, for the vast majority of individuals, these data are insufficient. An alternative strategy is linkage mapping analysis and isolation of disease genes based on their chromosomal location ("positional cloning"). The positional cloning approach relies on a three-step process: (i) localizing a disease gene to a chromosomal subregion, generally by using traditional linkage analysis, (ii) searching databases for an attractive candidate gene within that subregion, and (iii) testing the candidate gene for disease-causing mutations.

First, the position of the gene must be mapped. Linkage analysis can be used to pinpoint which chromosomal neighborhoods contain disease alleles by scanning the genomes of family members in affected pedigrees for alleles that appear to be linked to the disease phenotype. DNA fingerprinting methods (restriction fragment length polymorphisms (RFLPs) and variable number of tandem repeats (VNTRs)), which rely on enzymatic cleavage of DNA followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences, have been used in linkage analysis. Chromosomal aberrations associated with the disease may be detected by cytogenetic methods (e.g. FISH analysis).

After linkage mapping, the next step is to look through the base pairs that have co-segregated with the known genetic markers. The genes in this region must be identified, and a specific mutation in one of them must be shown to cause the disease. This process is termed positional cloning. Linkage analysis is useful where the family is large enough that DNA samples can be obtained from several affected individuals in the family. Traditional linkage maps are not useful in identifying the genes responsible for the majority of complex human diseases.
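The strength of evidence for linkage between a marker and a disease locus is conventionally summarized as a LOD score (logarithm of the odds). The text does not name this statistic, so the following is only a minimal sketch of the standard two-point calculation. It assumes phase-known, fully informative meioses, and the function names and the example counts (2 recombinants in 20 meioses) are hypothetical.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score at recombination fraction theta.

    Compares the likelihood of the observed recombinants under linkage
    (recombination fraction theta) with the likelihood under free
    recombination (theta = 0.5). Assumes phase-known meioses.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + non_recombinants * math.log10(1.0 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


def best_theta(recombinants: int, meioses: int) -> tuple[float, float]:
    """Scan theta = 0.01 ... 0.50 and return (theta, LOD) with the highest LOD."""
    grid = [i / 100 for i in range(1, 51)]
    theta = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return theta, lod_score(recombinants, meioses, theta)


if __name__ == "__main__":
    # Hypothetical family data: 2 recombinants observed in 20 meioses.
    theta, lod = best_theta(2, 20)
    print(f"max LOD = {lod:.2f} at theta = {theta:.2f}")
```

A LOD score of 3 or more is the conventional threshold for accepting linkage. Real pedigrees with unknown phase or incomplete penetrance need dedicated linkage software rather than this closed-form expression.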

13.1.2 Impact of Genomics on Disease Gene Identification

An understanding of the fundamental nature of the mechanisms of disease will permit the causes rather than the symptoms of disease to be addressed. Concomitant with our fundamental understanding of disease will come improved intervention through insight into disease predisposition, earlier disease detection and disease characterization. The development of new mapping technologies, associated with the genome projects, has accelerated the pace of identification of disease genes, and the underlying principles of mutational mechanisms. This has also introduced the computational tools to search or “mine” DNA and protein sequence

sequence amplification by PCR: PCR-based genotyping of DNA microsatellites and single nucleotide polymorphism (SNP) technology. Gene chip technology can also be used to analyze transcription by providing a snapshot of gene expression for all the genes expressed in a given cell or tissue (gene expression profiles). It is also applied to genotyping the DNA of an individual for the major alleles of each gene using SNPs. For example, cDNA can be prepared from a tumor biopsy and used to create an expression profile for that tumor, to detect common oncogenes. Detection is carried out by annealing fluorescent-labeled DNA to the microarrays, followed by computer-aided scanning and scoring. Serial analysis of gene expression (SAGE) is another method that can be used instead of microarrays for expression profiling. This gene chip/SNP detection technology has also been extended to protein analysis through proteomics, using mass spectrometry, to probe protein-protein interactions occurring within the cell. High-throughput screening of SNPs can be accomplished by peptide-nucleic acid (PNA) detection of the polymorphisms, coupled to MALDI-TOF-MS.

13.2.2 Monogenic Diseases

Rare genetic diseases, such as sickle cell anemia and cystic fibrosis (CF), are often Mendelian and monogenic: they result from mutations in which predisposition for the disease is directly associated with the presence of a single gene allele. Sickle cell anemia is a classic example of a monogenic disease. It was one of the first inherited diseases for which the molecular basis was established: a connection between a single gene mutation, a single nucleotide change (single nucleotide polymorphism, that is, a single-point mutation) leading to the substitution of a valine for glutamic acid at position 6 in the β-subunit of the hemoglobin protein complex, and a disease phenotype. Cystic fibrosis (CF) is another example of a monogenic disease. Defects in the cystic fibrosis transmembrane conductance regulator (CFTR) locus directly predispose an individual to cystic fibrosis, due to deletion of phenylalanine at position 508 (ΔF508). Some monogenic diseases exhibit allelic heterogeneity, with multiple mutations in the underlying disease gene. Examples are Duchenne muscular dystrophy (DMD), a common muscle-wasting disease, and osteogenesis imperfecta (OI), in which a mutation of either of the two genes that make up type-I collagen results in brittle or malformed bones.

For monogenic diseases, a genetic component for the disease becomes evident from pedigree analysis. Once the genetic component has been suggested from pedigree analysis, a search for common alleles between affected members can be carried out, in which one progresses from a disease phenotype to a candidate gene. Genetic mapping for single-gene (Mendelian) traits can be conducted by linkage analysis. This involves the identification of DNA markers, previously placed on the genetic map, that co-segregate with the disease in one or more families. Mutation screening is another approach, in which DNA samples from multiple individuals are tested for the presence of a specific single mutation. Methods include single nucleotide primer extension assays, oligonucleotide microarrays (gene chips) for SNP detection, and a combination of both.
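The sickle cell substitution described above (glutamic acid to valine at position 6) can be reproduced with a single base change. The sketch below is illustrative only: the DNA fragment is modeled on the start of the β-globin coding sequence and should be treated as an assumption rather than a verified reference sequence, the codon table covers only the codons in that fragment, and all names are hypothetical.

```python
# Partial codon table: only the codons that occur in the fragment below.
CODON_TABLE = {
    "ATG": "Met", "GTG": "Val", "CAT": "His", "CTG": "Leu",
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu", "AAG": "Lys",
}

# Illustrative fragment (assumed): initiator Met followed by residues 1-8.
NORMAL = "ATGGTGCATCTGACTCCTGAGGAGAAG"


def translate(dna: str) -> list[str]:
    """Translate a coding-strand DNA string codon by codon."""
    usable = len(dna) - len(dna) % 3
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


def mature_peptide(dna: str) -> list[str]:
    """Drop the initiator Met so that residue numbering starts at Val-1."""
    return translate(dna)[1:]


def point_mutate(dna: str, index: int, new_base: str) -> str:
    """Return a copy of dna with one base replaced (0-based index)."""
    return dna[:index] + new_base + dna[index + 1:]


# The codon for residue 6 occupies bases 18-20; change its middle base A -> T.
SICKLE = point_mutate(NORMAL, 19, "T")

if __name__ == "__main__":
    print("normal:", "-".join(mature_peptide(NORMAL)))
    print("sickle:", "-".join(mature_peptide(SICKLE)))
```

Comparing the two printed peptides shows that one nucleotide difference (GAG to GTG) is enough to change residue 6 while leaving every other residue untouched.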

13.2.3 Polygenic Diseases

Most common genetic diseases, such as diabetes, asthma, cancer, and heart disease, are often polygenic, involving multiple genes, and multifactorial: the genes, together with environmental factors, predispose the individual to disease. For multifactorial diseases, multiple genes, each having a small effect on the phenotype, are involved. Determining the genomic component(s) of non-Mendelian characters is more difficult, because their phenotypic expression often depends on the interaction of a myriad of genetic, social and environmental factors. Also, several modes of inheritance depend on either the sex of the individual or the sex of the parent transmitting the trait. The majority of X-linked recessive diseases affect predominantly males, and X-linked dominant diseases are more frequent in females. The phenotypic expression of those common genetic diseases that are time-dependent relies on a mechanism known as “imprinting”, which describes the dependence of the disease on the parent transmitting the trait. Environmental factors, such as diet or exposure to infectious agents, may also affect the expression of the disease phenotype and thus the penetrance of the trait.

It is these polygenic, non-Mendelian, complex diseases that have the greatest impact on the human population. The structural and functional information from the HGP and other genome projects will have a much greater impact on the elucidation of the etiology of common, multifactorial diseases. From the HGP data, the reference sequence of a normal gene will provide the starting point for detecting disease-causing mutations. Documentation of “normal” variations will provide a basis for the subsequent identification of pathogenic variations. Comparative genomics in mammals will provide both the opportunity to develop animal models for human diseases and a chance to increase our understanding of how genes and gene families have evolved.
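The sex bias of X-linked disorders mentioned in this section follows from simple allele counting. The sketch below is a simplified illustration, not a method given in the text: it assumes Hardy-Weinberg proportions, full penetrance, and a disease allele frequency q that is the same in both sexes. The function name and the value q = 0.01 are hypothetical.

```python
def x_linked_frequencies(q: float) -> dict[str, float]:
    """Expected fractions of affected individuals for an X-linked allele.

    Males carry one X chromosome, so a fraction q of males carry the allele.
    Females carry two: q*q are homozygous and 2*q*(1-q) are heterozygous.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return {
        # Recessive: females need two copies, males only one.
        "recessive_males": q,
        "recessive_females": q * q,
        # Dominant: one copy suffices, and females have two chances to carry it.
        "dominant_males": q,
        "dominant_females": 1.0 - (1.0 - q) ** 2,
    }


if __name__ == "__main__":
    freqs = x_linked_frequencies(0.01)  # hypothetical allele frequency
    for label, value in freqs.items():
        print(f"{label}: {value:.6f}")
```

Under these assumptions a recessive allele with q = 0.01 affects males about 100 times more often than females, whereas a dominant allele affects females nearly twice as often as males.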

13.3 GENETIC TESTING AND THERAPY

The purpose of genetic testing/screening is to identify carriers of genetic disorders that could predispose the carrier or the progeny to an inherited disease. Clinical programs are aimed at detecting the symptoms of the disease at an early stage. Therapeutic procedures are aimed at improving the present therapeutic protocols and/or developing new and more effective therapies.

13.3.1 Genetic Testing/Screening

Genetic testing/screening can be carried out by (i) linkage analysis, and (ii) direct detection of the mutants by mutation screening or other methods (Fig. 13.3). Once it is established which gene is responsible for the disease in a given family, linkage analysis can be used to predict whether a person at risk has in fact inherited the mutation-bearing chromosome.

13.3.1.1 Biomarkers

Biomarkers play an important role in disease diagnostics and in drug discovery and monitoring. Biological markers are measurable and quantifiable biological parameters (e.g. specific enzyme concentration, specific hormone concentration, specific gene phenotype distribution in a population, presence of biological substances), which serve as indices for health- and physiology-related assessments, such as disease risk, environmental exposure and its effects, disease diagnosis, and metabolic processes. They provide the basis for developing new diagnostic products,

structure-function information rather than the traditional trial-and-error method, and targeted at specific sites in the body and at particular biochemical events leading to disease, promise to have fewer side effects than many of today’s medicines. New methods in protein profiling include surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) and antibody arrays.

13.3.1.2 Mutation Screening

Mutation scanning refers to the process of analyzing DNA sequences or genes for the presence of any possible mutation. This can be carried out by (i) single-stranded conformational analysis (SSCA), or by (ii) hetero-duplex analysis.

The SSCA method is based on the sequence-dependent mobility of single-stranded DNA molecules in non-denaturing polyacrylamide gel electrophoresis. The DNA to be analyzed is amplified by PCR, the two strands of the DNA molecule are separated, and electrophoretic separation is carried out on a non-denaturing polyacrylamide gel. The two strands will usually migrate at different rates, even though they have the same molecular mass. Any change in the base composition as a result of mutations may further modify the mobility of the fragment in the gel.

The hetero-duplex analysis involves comparing the electrophoretic mobility of normal double-stranded DNA with that of “heteroduplex” DNA, which contains one normal strand and a complementary strand containing the mutated sequence. The target DNA is amplified by PCR, and the strands of the DNA double helix are separated by raising the temperature to 95°C. During subsequent cooling, the complementary DNA strands re-anneal. The mobility of heteroduplex DNA will differ from that of homo-duplex DNA. Large base-pair mismatches may also be analyzed by using electron microscopy to visualize heteroduplex regions.

High-throughput genotyping methods (determination of relevant nucleotide-base sequences in each of the two parental chromosomes) are available for diagnosis, drug efficacy, and toxicity. These methods utilize genomic DNA that, after digestion, reacts with a SNP array to yield an individual SNP pattern. These variations can, for instance, provide information about the diagnosis of a certain disease, or about the effectiveness or side effects of a certain drug.
The examination of single chromosome sets (haploid sets), as opposed to the usual chromosome pairings (diploid sets), is important because mutations in one copy of a chromosome pair can be masked by normal sequences present on the other copy. In diploid organisms, such as humans, the linkage of particular SNP genotypes on each chromosome in a homologous pair (the haplotype) may provide additional information not available from SNP genotyping alone.
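The point that haplotypes carry information beyond SNP genotypes can be made concrete by listing the haplotype pairs consistent with one unphased genotype. The following is a minimal sketch under stated assumptions: genotypes are given as unordered allele pairs per SNP, and the function name and example alleles are hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """List every pair of haplotypes consistent with unphased SNP genotypes.

    genotypes holds one (allele, allele) tuple per SNP, in chromosomal order.
    Each returned pair is one possible assignment of alleles to the two
    homologous chromosomes.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort so that (hap1, hap2) and (hap2, hap1) count as the same pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


if __name__ == "__main__":
    # Heterozygous at both SNPs: the genotype alone cannot tell which
    # alleles sit together on the same chromosome.
    print(compatible_haplotype_pairs([("A", "G"), ("C", "T")]))
    # Homozygous at the first SNP: the phase is unambiguous.
    print(compatible_haplotype_pairs([("A", "A"), ("C", "T")]))
```

An individual who is heterozygous at two SNPs is compatible with two different haplotype pairs, and the number of possibilities doubles with each further heterozygous site, which is why phase has to be determined experimentally or inferred statistically.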

13.3.1.3 Differential Gene Expression Profiling

The ability to profile the differences between biological samples is of fundamental importance in biology. Differential gene expression (DGE) and differential protein expression (DPE) are screening methods that are widely used for target validation. Current technologies available for gene expression analysis are DNA microarrays and SAGE. Gene expression profiling (determination of the pattern of genes expressed under specific circumstances or in a specific cell) has been used extensively to examine biological models and disease tissues in an effort to understand the disease process and identify therapeutic targets. This involves studying the expression (as mRNA) of thousands of genes in a cell or tissue, and how gene expression changes under various conditions. Genome-wide expression profiling of disease states and treatment conditions represents a significant advance in the areas of discovery of molecular disease markers, therapeutic target validation, predictive toxicology, and clinical monitoring of patients. It allows a comprehensive high-throughput screening of the effects of an insult (genetic, physiologic, pathologic, etc.) on gene expression in tissues and specific cell populations of interest. These techniques may aid in determining the function of a newly discovered gene or in discovering new biomarkers and therapeutics for patients with disease.

Molecular profiling (MP) performs the analysis on homogeneous cell samples instead of larger tissues that may contain mixed cell populations. It is a dynamic new discipline, capable of generating a global view of mRNA, protein patterns, and DNA alterations in various cell types and disease processes. MP integrates the expanding genetic databases from the Human Genome Project with newly developed expression analysis technologies and holds great promise to help us to (i) understand the molecular anatomy of normal cells and cells in various stages of disease, (ii) develop new diagnostic and therapeutic targets for clinical intervention, and (iii) explain the relationship between genotype and phenotype in humans, which is still largely unknown.

Human genic bi-allelic sequences (HGBASE), a database of intra-genic (promoter to end of transcription) sequence polymorphism, facilitates genotype-phenotype association studies based upon the rapidly growing number of known, gene-related single nucleotide polymorphisms (SNPs). HGBASE includes intra-genic sequence variants found in ‘normal’ individuals. HGBASE is not limited to bi-allelic polymorphisms but covers all types of intra-genic variation. Polymorphisms that are probably functionally important (e.g. codon changes) and others (e.g. intron variations) are all included, because all can potentially be employed as surrogate markers for unknown nearby functional variants due to linkage disequilibrium. A key component of future genomic research and drug development will be the study of epigenetic imprinting (drug- or environment-induced changes in gene expression) indicative of disease and/or pharmacological or environmental exposure.
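Differential expression between two conditions is commonly summarized as a log2 fold change per gene. The sketch below only illustrates that idea and is not a procedure taken from the text: the gene names and expression values are made up, the pseudocount of 1 and the two-fold cutoff are assumed conventions, and a real analysis would use replicates, normalization and a statistical test.

```python
import math


def log2_fold_changes(control: dict[str, float],
                      treated: dict[str, float],
                      pseudocount: float = 1.0) -> dict[str, float]:
    """Log2 ratio of treated to control expression for genes present in both.

    The pseudocount keeps the ratio finite when a gene has zero signal.
    """
    shared = control.keys() & treated.keys()
    return {
        gene: math.log2((treated[gene] + pseudocount)
                        / (control[gene] + pseudocount))
        for gene in shared
    }


def differentially_expressed(fold_changes: dict[str, float],
                             threshold: float = 1.0) -> list[str]:
    """Genes whose absolute log2 fold change meets the threshold (1.0 = two-fold)."""
    return sorted(g for g, lfc in fold_changes.items() if abs(lfc) >= threshold)


if __name__ == "__main__":
    # Hypothetical expression values for normal and diseased tissue.
    normal = {"geneA": 99.0, "geneB": 49.0, "geneC": 199.0}
    diseased = {"geneA": 399.0, "geneB": 49.0, "geneC": 49.0}
    changes = log2_fold_changes(normal, diseased)
    print(changes)
    print(differentially_expressed(changes))
```

Working on a log scale makes up- and down-regulation symmetric: a four-fold increase and a four-fold decrease give +2 and -2 respectively.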

13.3.1.4 Differential Protein Expression Profiling

The importance of the protein-based methods is that they measure the final expression product rather than an intermediate. In addition, some of them enable the detection of posttranslational protein modifications (e.g., phosphorylation, glycosylation, carboxylation) and protein complexes, and in some cases yield information about protein localization. Protein-based methods are important because they measure observables that are not readily detected in other ways. It is likely that expression proteomics will be a useful tool in drug target discovery and in studying the effects of various biological stimuli on the cell. Protein expression profiling is typically used for target discovery, toxicological studies or disease marker discovery. Like gene expression, protein expression can be profiled. This assay provides an indication of the relative levels of protein expression between two different conditions, whether they are disease vs. health, tissue vs. tissue, or normal vs. drug treated. Antibodies can be used to tag the profiled proteins, or the proteins themselves can be hapten derivatized, so that they in turn become targets for the immuno-RCA signal amplification complex. Hapten derivatization of the profiled proteins is one way to make this a universal assay.

Current technologies for protein profiling and search for biomarkers rely on traditional proteomic approaches, such as 2-dimensional electrophoresis (2-DE) with quantification via mass spectrometry. However, 2-DE is not ideally suited to rapid, large-scale protein expression screening. The physical process of separating proteins via 2-DE remains long, multi-step, labor intensive, and often results in irreproducible data. These attempts have proven less useful, especially for biomarkers that are serum-based. Alternative methods are sought to bypass 2-D gels, using combinations of protein arrays (protein chips), super-critical fluid chromatography (SFC), capillary electrophoresis, and mass spectrometry for protein analysis. Other new approaches combine the power of artificial intelligence-based algorithms and high-throughput proteomic fingerprinting tools such as SELDI-TOF (artificial intelligence-based bioinformatics) to find specific proteomic patterns that can distinguish healthy from diseased patients. The ability of the pattern itself to become the diagnostic represents a new paradigm for the application of proteomics to clinical specimen analysis and disease diagnosis. Direct profiling of expressed proteins via SELDI brings the research team one step closer to the ultimate drug target than gene expression. Differences in expression patterns between distinct biological states (healthy/ diseased tissue) should allow direct detection of biomarker patterns. Protein expression analysis can indicate what proteins are expressed, but it is also important to know where proteins are expressed, and where they go over time (as with secreted proteins). By mapping relative distribution of proteins, abundance, tissue specificity, and movement (in healthy versus diseased tissue and in control versus treated tissue), one can gain a greater understanding of these proteins’ functions and determine which are likely to be the best drug targets. 
Fluorescence microscopy may be employed for protein localization studies.

13.3.1.5 Metabolic Engineering Approach

Metabolomics is the analysis of cellular metabolites, and provides a powerful tool for gaining insight into functional biology. Monitoring the levels of numerous small molecules within a cell, and how those levels change under different conditions, is complementary to gene expression and proteomic studies. Metabolic profiles of bodily fluids such as plasma, cerebrospinal fluid and urine reflect both normal variation and the physiological impact of disease and pharmaceuticals on organ systems. The metabolic engineering (ME) approach is a “knowledge-based” alteration (by recombinant DNA and other genetic techniques) of the metabolic pathways found in an organism, in order to better understand and use cellular pathways for chemical transformation, energy transduction, and supramolecular assembly. Metabolic profiling is actively being applied to studies of drug toxicity, drug efficacy and model organisms, as well as humans and plants. In vitro screening of the metabolic characteristics of a new chemical entity would help to mimic in vivo conditions. When integrated with gene-expression profiles, metabolic profiling provides plausible explanations and testable hypotheses for the interactions regulating the observed expression changes.

13.3.2 Therapy

Most prescription drugs have side effects, and a certain percentage of patients do not get the desired benefit from drug treatment. Inherited genetic differences (represented by SNPs) not only alter our susceptibility to disease but can also affect the response of individual patients to administered drugs. The effect of genetic variation on the efficacy of a drug depends on polymorphisms in the many genes involved in (i) drug metabolism, (ii) drug transport, and (iii) drug targets.


[Drug metabolism] + [Drug transport] + [Drug targets] = [Drug effects]

Pharmacogenomics attempts to identify disease genes, and to find new and individualized therapies, based on the knowledge of gene polymorphisms. Gene therapy, the medical procedure that involves replacing, manipulating, or supplementing nonfunctional genes with healthy genes to treat human disease, is one such attempt. In future, a genetic blueprint will allow screening of genetic sequence variations to tailor individualized treatments, so that therapies are safer and more effective.

EXERCISE MODULES

1. What are the goals of protein engineering?
2. What are the aims of disease gene identification?
3. What is the impact of genome projects on disease gene identification?
4. How do you identify monogenic diseases?
5. What are the difficulties in identifying polygenic diseases?
6. How does genomic data help in assessing polygenic diseases?
7. What are the protocols for genetic screening?
8. Comment on future therapeutic methods in disease control.


Chapter 14 Molecular Engineering

Mutation in the evolutionary process is Nature’s answer for producing molecules with altered characteristics that can survive and propagate in altered environmental situations. But the timeframe of the evolutionary process is long, and it is essentially a trial-and-error process. Therefore, the major objective of all molecular engineering methods is to produce molecular species (mutant species), on a laboratory time-scale, with altered (improved) characteristics to suit required situations. The objective of selective mutation by laboratory techniques is to obtain better proteins. The same rationale lies behind rational drug design procedures. The goal of protein engineering methods is, therefore, to apply computational procedures (computational biology) to design proteins with desired characteristics, to obtain these structural mutants by the experimental methods of molecular biology, and to test (validate) their functions. The focus of molecular medicine is translating the understanding of health and disease at the cellular and molecular level into the development of new therapies and diagnostics (e.g. gene therapy, DNA-based testing, vaccine design). Research in comparative genomics has yielded valuable insight into the mechanisms of transcription and the function of noncoding DNA. This new level of understanding will enable drug discovery researchers to put genomic information in context, and to link sequence to downstream biological events within a broader biological context. The molecular basis of targeted therapies will enable a new class of compounds that will be more effective and less toxic than traditional classes, and deliver the promise of genomics.

14.1 GENOMICS AND PROTEOMICS ANALYSES

Traditional methods (pharmaceutical and medicinal chemistry) of drug design/discovery have been ligand/target-centric (Fig. 14.1); but with the availability of a vast amount of genomic sequence data and annotated structural and functional data, there has been a shift towards genome-centric drug design/discovery procedures (Fig. 14.2).

Fig. 14.1 Flowchart of Ligand/Target-centric Drug Design/Discovery

Fig. 14.2 Flowchart of Genome-centric Drug Design/Discovery

14.1.1 Genomics Approach

The genomic approach is rapidly transforming the ways in which new drugs are discovered, developed, and ultimately prescribed to the patients in need. By applying an understanding of the human genome to the study of disease, the genome-based approach has reduced the major bottlenecks besetting the conventional methods of drug discovery. Genome-based drug discovery has the potential to identify the target molecules that underlie disease processes themselves, as opposed to symptoms. However, good methods of target identification and validation will be necessary to realize this potential. The major trend in lead optimization is the movement toward in silico (computational) and in vitro high-throughput screening (HTS) approaches. Technical advances (e.g. PCR and blotting methods, gene chips, supercritical fluid chromatography (SFC), and 2-D electrophoresis-MS), together with the availability of genetic markers, annotated genomic and proteomic data, and high-resolution SNP maps generated by the genome projects, have revolutionized the study of human genetic variation and rational drug design/discovery procedures. The latest approaches, such as artificial intelligence-based algorithms and high-throughput proteomic fingerprinting tools (e.g. SELDI-TOF), in which the MS pattern itself becomes the diagnostic, enable the application of proteomics to clinical specimen analysis and therapy. As sequencing technologies progressed, the genome databases facilitated the rapid cloning of novel genes, and the inference of putative functions from the comparison of expressed sequence tags (ESTs: complementary DNA (cDNA) sequences generated from the mRNAs of genes expressed in the cells of various organisms). Such database information provides a framework from which to develop more rational treatment strategies, such as genotype-specific therapies. The general approach in rational drug design utilizing genomic data is:

1. Identifying variations within specific genes that cause or predispose to disease.
2. Identifying gene-environment (DNA-DNA and DNA-protein) interactions that might have pharmacogenomic implications.
3. Identifying variations in immune response genes, which have implications for vaccine development.

14.1.2 Proteomics Approach

The general concept of ascribing function to new proteins by discovering small-molecule ligands (such as drugs, nutrients and toxins) is referred to as chemical proteomics. The chemical proteomics approach makes use of synthetic small molecules that covalently modify a set of related enzymes and subsequently allow their purification and/or identification as valid drug targets. Furthermore, such methods enable rapid biochemical analysis and small-molecule screening of targets, thereby accelerating the often-difficult process of target validation and drug discovery. The method uses labeled irreversible protease inhibitors to isolate or identify active proteases in complex mixtures by two-dimensional (2-D) gel electrophoresis, or by using protease-activity chips with MALDI-TOF or MALDI-quadrupole-TOF (MALDI-Q-TOF) mass spectrometric identification of the captured proteases. In proteomics analysis, characterization of novel proteins to the sub-family level (e.g. signal transduction proteins) is of paramount importance in rational drug discovery studies. Computational tools, such as sequence alignment, profile matching, HMMs, homology modeling, and neural networks, are employed to bring out the many hidden patterns and relationships in sequence, 2-D gel and MS databases. Sequence similarity at places along the polypeptide chain where conservation is highest, such as the active site, can be used to predict substrate (ligand) specificity, to analyze structure-function relationships, and to design inhibitor analogs/drugs. Even a global 3-D view of a protein is helpful for mapping residues that are proximal in space but far apart in sequence and for making use of site-directed mutagenesis results, and it provides a structural context for the analysis of structural mutants.
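The remark that conservation is highest at functionally important positions can be made concrete with a small sketch. The fragment below assumes that a multiple sequence alignment is already available; the four aligned fragments are invented for illustration, and the score used (the fraction of sequences sharing the most common residue in a column) is only one of several possible conservation measures.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.

    Gap characters ("-") never count as the conserved residue.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_positions(alignment, threshold=1.0):
    """Zero-based column indices whose conservation meets the threshold."""
    scores = column_conservation(alignment)
    return [i for i, score in enumerate(scores) if score >= threshold]

# Hypothetical aligned fragments of equal length (illustration only).
alignment = [
    "GDSGGP",
    "GDSGGA",
    "GDSGSP",
    "ADSGGP",
]
print(column_conservation(alignment))
print(conserved_positions(alignment))
```

Columns that remain invariant across diverse family members are natural first candidates for active-site residues and for site-directed mutagenesis experiments.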
Knowledge of the 3-D structure of a protein, and in particular of the site of interaction with ligands, allows computational screening of large libraries of compounds for their binding potential to a specific site on the protein. Screening the 3-D structures of proteins of unknown function (from databases) against substrates and co-factors, together with molecular modeling, would help in the optimization of drug candidates and the identification of potential lead compounds. Knowledge of the amino acid composition and bulk properties (pI, Mr, and shape) of proteins can be particularly useful in the isolation, purification, and characterization of any newly identified proteins. Several specific databases are available for protein property prediction based on physicochemical properties, shape and function. Computational chemistry plays an important role in these studies. Certain amino acids remain highly conserved even among diverse members of protein families. These highly conserved sequence patterns are called “signature sequences”, and in many cases they define the active site of a protein (the PROSITE database is an excellent resource for them). Post-translational modifications of proteins (glycosylation, phosphorylation) greatly affect the structure and function of proteins. Identification of these sites can be quite useful in understanding structure-function relationships. In general, protein structure prediction methods employ high-resolution 3-D structural features of biologically relevant sites. However, attempts are underway to utilize low-resolution protein structural data for biochemical function assignment: methods that automatically generate a library of 3-D functional descriptors for the structure-based prediction of enzyme active sites, based on functional and structural information automatically extracted from public databases. Many interaction databases are available for such molecule-molecule interactions. Some of these are:

• Aminoacyl-tRNA: (http://rose.man.poznam.pl/aars/index.html). Contains aminoacyl-tRNA synthetase (AARS) sequences for many organisms. The collected pairs of AARS + tRNA can be used to create RNA-protein interaction records.
• BIND: (http://bioinfo.mshri.on.ca/BIND/BIND_prop/index.html). Biomolecular Interaction Network Database. With BIND, computer simulations of whole-cell models of disease processes, spanning medicine to agriculture, will be possible.
• BRENDA: (http://www.brenda.uni-koeln.de/). A database for enzymes that contains annotated information on enzymes: structure, reaction, specificity, post-translational modifications, and cross-references to structure databanks.
• COMPEL: (http://compel.bionet.nic.ru/). Databank containing protein-DNA and protein-protein interactions of composite regulatory elements (CREs).
• DIP: (http://dip.doe_mbi.ula.edu). Database of Interacting Proteins. Contains information on protein-protein interactions, and also gives the experimental methods used to determine the interactions.
• EMP: (http://wit.mcs.aml.gov/EMP/). An enzyme database that is chemical reaction-based.
• ENZYME: (http://www.expasy.ch/enzyme/). This database is an annotated extension, linked to SWISS-PROT, and deals with enzyme structure-function features.
• FIMM: (http://sdmc.krdl.org.sg.8080/fimm/). Provides information on protein interactions that are immunologically important.
• GeneNet: (http://www.mgs.bionet.nsc.ru/systems/mgl/genenet/). Describes genetic networks from the gene through the cell to the organism level using a chemical reaction-based formalism.

• KEGG: (http://www.genome.ad.jp/kegg/). Kyoto Encyclopedia of Genes and Genomes. The databank provides information on many metabolic and regulatory pathways, some of them as graphical diagrams.
• PFBP: (http://www.ebi.ac.uk/research/pfmp). Protein Function and Biochemical Networks. The aim of the databank is to provide details on metabolism, gene regulation, transport, and signal transduction. It uses a graph abstraction for the interaction data and can describe chemical reaction pathways.
• WIT: (http://wit.mcs.anl.gov/WIT2). What Is There? This database is aimed at reconstructing metabolic pathways in newly sequenced genomes by comparing predicted proteins with proteins in known metabolic networks.
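Returning to the signature sequences and post-translational modification sites discussed before the database list, the way a PROSITE-style pattern is matched against a sequence can be sketched in a few lines. The converter below handles only the basic pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, repeat counts in parentheses, and `<` / `>` anchors), and the test sequence is invented. The pattern `N-{P}-[ST]-{P}` is the commonly quoted N-glycosylation consensus; treat the function names and the sequence as illustrative assumptions rather than as part of any PROSITE software.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # Single letters and [ABC] groups are already valid regex syntax.
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)

def find_sites(pattern, sequence):
    """Zero-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(regex, sequence)]

# Invented sequence for illustration; the second Asn is followed by Pro.
sequence = "MKNGSAVNPTL"
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", sequence))
```

A match of such a short pattern indicates only a potential site; whether the site is actually modified has to be established experimentally.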

14.2 RATIONAL DESIGN

The rapidly growing body of structural information emerging from genomics-derived targets and the industrialization of protein structure determination is dramatically altering computer-assisted rational molecular/drug design approaches. The result is direct structure-based drug design, which combines structural biology with computational and medicinal chemistry (e.g. 3D-QSAR) in order to design drugs rather than merely select drugs that modulate a protein target of interest. Combining medicinal chemistry, computer-aided design, and biochemistry enables accelerated progress from “target-to-hit,” “hit-to-lead,” or “lead-to-candidate.” The first step towards molecular engineering rests on the availability of three-dimensional structure information for the molecule under consideration (or for its close structural homologues). Ab initio prediction of the three-dimensional structures of proteins from primary structure data is not possible at present. Therefore, knowledge of the three-dimensional structure of a protein or its homologue(s) is a prerequisite for the rational design of improved “mutants”. The involved technical procedures and constraints of the experimental methods of structure determination (primarily X-ray diffraction and NMR spectroscopy), and the time they require, are hindrances to obtaining experimental structural data. While matter in the single-crystalline state is a prerequisite for X-ray analysis, the size of the protein is the limiting factor in the case of NMR spectroscopic methods. In the absence of experimental data on the tertiary structures of proteins, computer-aided model building on the basis of the known three-dimensional structure of a homologous protein is at present the only reliable alternative for obtaining structural information for the design of new proteins.
In silico modeling, that is, the modeling of biological pathways and processes and the predictive simulation of cellular processes, holds the potential to enhance both efficacy and efficiency throughout the drug discovery and development process. Though molecular/drug design is generally computer-aided, incorporating the pharmacokinetics and molecular properties of the intended drug molecules would greatly help in design protocols. The de novo design of bioactive compounds, by incremental construction of a ligand model within a model of the receptor or enzyme active site whose structure is known from X-ray or NMR data, is becoming a valuable and integral part of drug discovery. The input of biocomputing in drug discovery is twofold: (i) the computer may help to optimize the pharmacological profile of existing drugs by guiding the synthesis of new and “better” compounds; (ii) as more and more structural information on possible protein targets and their biochemical role in the cell becomes available, completely new therapeutic concepts can be developed. Computer analysis helps in both steps: to find out about the possible biological functions of a protein by comparing its amino acid sequence with databases of proteins of known function, and to understand the molecular workings of a given protein structure. Understanding the biological or biochemical mechanism of a disease then often suggests the types of molecules needed for new drugs. The general strategy of all “computer-aided”, “rational-design” procedures is to incorporate experimental data, sequence homologies, information from structural databases, and physicochemical parameters as “input data” for designing “new molecules” (designed structural mutants). Experimentally determined structures, or modeled structures that are validated with a high level of confidence, are used as “lead molecules” for the rational design of desired molecules. The efficacy of these “designed structural mutants” in their altered functions can be determined by obtaining the molecular species (or some of them) experimentally (e.g. by gene manipulation, site-directed mutagenesis, de novo synthesis) and testing their functions vis-à-vis their altered structures. The procedures involve an iterative process of “experimental data, theoretical structure prediction, experimental validation” (Fig. 14.3).

Fig. 14.3 Operational Procedures in Molecular Engineering

14.2.1 Information from Experimental and Structural Databases

Three of the most prominent uses of modern molecular modeling applications are (i) structure analysis, (ii) homology modeling, and (iii) docking. Information from experimental methods consists of three-dimensional structural data from X-ray crystallography and NMR spectroscopy, and partial structural and functional information from other spectroscopic, physicochemical (e.g. electrophoresis and mass spectrometry), and biomolecular (e.g. gene expression microarrays) methods. All procedures in computational biology, structure prediction, and computer-aided design depend on information from experimental and structural databases as inputs for computer-aided modeling (“knowledge-based” computer modeling) of possible tertiary structures, for selectively redesigning them, and for testing them to validate the prediction. Bioinformatics offers a means to get to a structure through sequence, while structure-aided molecular design offers a means to get to a molecule/drug through structure. In essence, it is a blend of computational chemistry with computational biology to create software that will aid protein chemists in understanding, evaluating and predicting the structure, function and activity of medically and industrially important proteins/molecules. The procedure incorporates as inputs the propensity profiles of amino acids in the tertiary structure folding of proteins, the invariant, semi-invariant and variant residues in the sequence, and other structural and functional information. Computational methods of design involve optimizing the types of mutations, their positions in the sequence, their stereochemical compatibility (e.g. Ramachandran conformational maps), and overall thermodynamic stability by energy minimization and other mathematical procedures. Rationally designed proteins with expected functions can be synthesized by experimental methods that involve molecular genetics protocols. The aim of all computer-aided rational drug/molecular design (CAMD) protocols is to bring together computer-assisted techniques, such as three-dimensional quantitative structure-activity relationships (3D-QSAR), to discover, design and optimize compounds with desired structure and properties. 3D-QSAR analyzes the quantitative relationship between the biological activity of a set of compounds (with a putative use as drugs) and their three-dimensional properties using statistical correlation methods. The general approach is:

(i) Lead optimization by considering both receptor binding and pharmacologically important properties.
(ii) Conformational search in a protein’s binding site to find the optimal positioning of ligands. This is carried out by subdividing molecules into fragments, searching the conformations of the fragments, and assembling the fragment conformations into molecular conformations (combinatorial chemistry).
(iii) Computational screening of ligands for desired drug-like properties, with a scoring function used to rate the binding affinity or activity of each trial ligand.
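The statistical correlation at the heart of QSAR methods can be illustrated with a deliberately reduced sketch. The fragment below is not 3D-QSAR: it fits a straight line between a single molecular descriptor and a measured activity by ordinary least squares, which is the simplest instance of the same idea. The descriptor (a lipophilicity value) and the activity values are invented for illustration, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation coefficient.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

def predict(descriptor, slope, intercept):
    """Predicted activity of a new compound from its descriptor value."""
    return slope * descriptor + intercept

# Invented training set: one descriptor and one activity value per compound.
lipophilicity = [0.5, 1.1, 1.8, 2.4, 3.0]
activity = [4.9, 5.6, 6.1, 6.9, 7.4]

slope, intercept, r = fit_line(lipophilicity, activity)
print(f"activity = {slope:.2f} * descriptor + {intercept:.2f} (r = {r:.3f})")
print(f"predicted activity at descriptor 2.0: {predict(2.0, slope, intercept):.2f}")
```

Practical QSAR models use many descriptors (steric, hydrophobic, hydrogen-bonding, electronic) and must be validated on compounds that were not used for fitting.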

14.2.1.1 Screening

A host of screening methods are available for drug/molecular design: bioassays, chemical assays (e.g. ELISA, radioimmunoassays), electron microscopy, and structure-based screening methods. The combination of combinatorial chemistry and high-throughput screening (HTS) has greatly accelerated drug discovery and development protocols. Immunoassays are ligand-binding assays. There are three basic components in any immunoassay: (i) a specific antibody (or antigen) capable of binding to the analyte, (ii) the antigen to be detected and/or quantified, and (iii) a system to measure the amount of the antigen in the sample. The antibody can be linked to a radioisotope (radioimmunoassay, RIA), to an enzyme that catalyses a monitored reaction (enzyme-linked immunosorbent assay, ELISA), or to a fluorescent compound by which the location of an antigen can be visualized (immunofluorescence). The electron microscopy-based drug screening method enables direct imaging of minimally perturbed cells, tissues, and microbes at molecular resolution. Automation in electron microscopy opens up a new sphere of high-content drug assays.


Structure-based screening methods for drug discovery targets have recently emerged as an alternative and complementary tool to conventional high-throughput bioassay-based screening. They combine the power of NMR spectroscopy, X-ray crystallography, and automatic docking, which provide the means to apply structural information (from NMR, X-ray, and modeling) to identify hits, select targets, and optimize the hits in terms of their affinities and specificities. While chemistry on structure-based hits can be aided by X-ray crystal structures of ligand-target complexes, such complexes are often difficult to crystallize. Therefore, structure-based NMR screening is a better alternative for identifying drug-like small-molecule hits from customized libraries. NMR-detected hits are turned into leads through chemical optimization that is guided by 3-D structural data.

14.2.1.2 Virtual Screening

Though the three-dimensional molecular structure is one of the foundations of structure-based drug design, the data available are often for the shape of a protein and of a drug separately, but not for the two together. Docking programs are computational methods that perform automated docking of ligands (small molecules such as a candidate drug) to their macromolecular targets (usually proteins, sometimes DNA). These programs select molecules predicted to be highly complementary to the receptor structure and can screen many such ligands against the protein (virtual screening). Virtual screening (in silico screening) technology offers the ability to screen many more compounds at once than the traditional laboratory-based method.
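The overall shape of a virtual screening run (filter a library, then rank the survivors by predicted binding) can be sketched as follows. Two assumptions are brought in from outside this text and should be read as such: the drug-likeness filter is Lipinski's "rule of five" (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors), and the `dock_score` field is a hypothetical stand-in for whatever score a docking program would report, with lower values taken to mean better predicted binding. All compound data are invented.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # daltons
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors
    dock_score: float   # hypothetical docking score; lower = better (assumption)

def rule_of_five_violations(c):
    """Number of Lipinski rule-of-five criteria that the compound violates."""
    return sum([
        c.mol_weight > 500,
        c.logp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def screen(library, top_n, max_violations=1):
    """Drop compounds with too many violations, then rank by docking score."""
    drug_like = [c for c in library if rule_of_five_violations(c) <= max_violations]
    return sorted(drug_like, key=lambda c: c.dock_score)[:top_n]

# Invented library (illustration only).
library = [
    Compound("cpd-A", 350.4, 2.1, 2, 5, -8.2),
    Compound("cpd-B", 720.9, 6.3, 6, 12, -10.5),
    Compound("cpd-C", 410.5, 3.8, 1, 6, -9.1),
    Compound("cpd-D", 298.3, 1.2, 3, 4, -6.7),
]

for hit in screen(library, top_n=2):
    print(hit.name, hit.dock_score)
```

Note that the best-scoring compound in this toy library is removed by the filter: a favorable docking score alone does not make a molecule a useful lead.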

14.3 VALIDATION

The validity of these “designed structural mutants” with respect to their altered functions can be verified by obtaining the molecular species (or some of them) experimentally (e.g. by gene manipulation, site-directed mutagenesis, de novo synthesis) and testing their functions vis-à-vis their altered structures (Fig. 14.3). Their three-dimensional structures can be determined by the existing physical techniques for structure determination, namely NMR spectroscopy and X-ray crystallography (by difference-Fourier techniques). The structural features (folding) of a protein may not be grossly altered by mutation, but mutation(s) of the crucial residue(s) may lead to drastic changes in the functional features of that protein. Mutations at a particular site are determined by the changes that occur at the neighboring sites in the core. That is, steric constraints play an important part in specifying the detailed properties of a protein. But there are many examples where a single-point mutation at a crucial position of a protein (or an enzyme) has drastic effects on its function. Protein engineering (molecular tinkering) at the genetic level is aimed at changing the functions of the protein under consideration by altering the amino acid residue(s) at crucial positions. Such studies can probe the structure-function relationships of macromolecules at the molecular level. Site-directed mutagenesis is one of the strategies in molecular biology (genetics) for modifying the functions of proteins by rational replacement of amino acid residue(s) at crucial site(s), by single-point mutation or by cassette mutation (randomly altering a set of selected residues). This is a novel strategy to inspect reaction mechanisms, alter chemical behavior, and improve the structural (thermo-stability) and chemical (functional) characteristics of proteins. The functions of enzymes can be analyzed by altering the amino acid residue(s) at the crucial position(s). Such methods are a harbinger of protocols for designing enzyme activation/inactivation and control. Site-directed mutagenesis, combined with electrophysiological methods, opens up the possibility of detailed analysis of the voltage-dependent gating of membrane proteins and, perhaps, of their eventual design. Automated primer generators are now available that analyze the original nucleotide sequence and the desired amino acid sequence and design a primer that carries a new restriction enzyme site. These advances in gene manipulation techniques have made the design of novel proteins easier and faster, and have facilitated the study of reaction mechanisms and structure-function relationships. De novo synthesis is another procedure in the rational design and synthesis protocols of proteins.
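The sequence-level bookkeeping behind a single-point replacement can be sketched briefly. The fragment below uses the standard genetic code, replaces the codon of one residue with a codon for the desired amino acid, and chooses the codon that needs the fewest base changes; that choice is a simplifying assumption made here, not a rule taken from the text. It does not design a primer or introduce a restriction site, and the coding sequence is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA sequence (reading frame 1) into one-letter residues."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def mutate_residue(dna, position, new_aa):
    """Replace the codon of residue `position` (1-based) by a codon for `new_aa`.

    Among synonymous codons, the one differing least from the original is used.
    """
    start = 3 * (position - 1)
    old_codon = dna[start:start + 3]
    if position < 1 or len(old_codon) < 3:
        raise ValueError("position lies outside the coding sequence")
    candidates = [codon for codon, aa in CODON_TABLE.items() if aa == new_aa]
    if not candidates:
        raise ValueError(f"unknown amino acid: {new_aa!r}")
    best = min(candidates, key=lambda c: sum(a != b for a, b in zip(c, old_codon)))
    return dna[:start] + best + dna[start + 3:]

# Invented coding fragment: Met-Asp-Ser-Gly.
wild_type = "ATGGATAGCGGT"
mutant = mutate_residue(wild_type, position=3, new_aa="A")   # Ser3 -> Ala
print(translate(wild_type), "->", translate(mutant))
print(wild_type, "->", mutant)
```

A mutagenic primer would then be built around the altered codon, with flanking wild-type sequence long enough to anneal to the template.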

14.3.1 Limitations of Virtual Screening

While experimentally determined structure-function data are necessary and highly desirable for validating computational methods of molecular/drug design, many pharmaceutical companies, on account of time and cost, prefer computational methods. Many statistical algorithms, predictive biosimulation, molecular dynamics and conformational analysis, and other structure-function relationship analyses are employed, grouped under virtual screening (in silico screening). 3D-QSAR methods employ statistical correlation methods incorporating molecular parameters such as structural (steric), hydrophobic, hydrogen-bonding, and electronic features. Conformational analysis consists of the exploration of energetically favorable spatial arrangements (shapes) and molecular conformations using molecular dynamics calculations, simulation procedures (e.g. the Monte Carlo method) that randomly sample the conformational space of a molecule, or analysis of experimentally determined structural data (from NMR and X-ray structure analysis). Modern pharmaceutical research is underpinned by chemical genomics, a high-throughput approach to biology and chemistry. But virtual screening methods are still beset with problems and uncertainties in validation protocols. The large and growing number of approaches to virtual screening points both to the complexity of the process and to the difficulty of creating ideal solutions. Many virtual high-throughput screening (HTS) methods generate many hits, but little or unreliable information. Given the large number of possible targets being identified by genomics, it is inconceivable that high-throughput screening (the brute-force approach) of synthesized compound libraries will be able to meet the challenge of identifying “a small molecule for every protein.” Therefore, screening should be limited to generating enough information to start chemical programs, and the goal should be to screen the fewest possible compounds. Towards this goal, computational filters and analysis are being applied for smarter design of libraries and for better selection and prioritization of compounds for screening, as well as for structure-based design and lead optimization.
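The Monte Carlo exploration of conformational space mentioned in this section can be illustrated on the smallest possible case, a single torsion angle. The sketch below uses the Metropolis acceptance rule; the energy function is a toy expression chosen so that its global minimum lies at 180 degrees, and the temperature factor `kT`, the step size, and the random seed are arbitrary illustrative values rather than parameters of any real force field.

```python
import math
import random

def torsion_energy(phi_deg):
    """Toy torsional energy (arbitrary units); global minimum of 0 at 180 degrees."""
    phi = math.radians(phi_deg)
    return (1.0 + math.cos(3.0 * phi)) + 0.5 * (1.0 + math.cos(phi))

def metropolis_search(start_deg, steps, kT=0.6, max_step=30.0, seed=0):
    """Randomly sample the torsion angle and return the best (angle, energy) seen."""
    rng = random.Random(seed)
    phi = start_deg % 360.0
    energy = torsion_energy(phi)
    best_phi, best_energy = phi, energy
    for _ in range(steps):
        # Propose a random change of the torsion angle.
        trial = (phi + rng.uniform(-max_step, max_step)) % 360.0
        trial_energy = torsion_energy(trial)
        delta = trial_energy - energy
        # Metropolis rule: always accept downhill moves, sometimes accept uphill ones.
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            phi, energy = trial, trial_energy
            if energy < best_energy:
                best_phi, best_energy = phi, energy
    return best_phi, best_energy

# Start in a local minimum near 60 degrees and let the walk cross the barrier.
best_phi, best_energy = metropolis_search(start_deg=60.0, steps=5000)
print(f"lowest-energy torsion found: {best_phi:.1f} deg (E = {best_energy:.3f})")
```

The occasional acceptance of uphill moves is what allows the search to leave a local minimum; for a real molecule the same scheme runs over many torsion angles at once, which is where the cost and the sampling uncertainties discussed above arise.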

EXERCISE MODULES

1. What are the goals of protein engineering?
2. Why has the emphasis of drug design shifted towards the genome-centric approach?
3. Write about the genomics and proteomics approaches towards drug design.
4. Comment on gene manipulation methods.
5. Which physical techniques provide structural data?
6. What are the essentials of rational design of proteins?
7. What are the inputs for the computer-aided rational design protocols?
8. What are the various experimental methods in screening and validation?
9. What is virtual screening?
10. What are the protocols for validation of designed structures?
11. What is site-directed mutagenesis and what is its importance in molecular engineering?
12. What are the procedures for de novo synthesis?
13. What are the limitations of virtual screening?


Glossary

Adhesion: Force of attraction between unlike molecules.
Algorithm: A methodical, logical sequence of steps, typically involving a repetition of operations, by which a task can be performed.
Alignment: The result of comparison of two or more sequences to determine the degree of their similarity.
Alignment score: Computed sum based on the number of matches, insertions, and deletions within an alignment.
Alleles: Mutually exclusive forms of the same gene, occupying the same locus on homologous chromosomes, and governing the same biochemical and developmental process.
Allosteric: Refers to a change in the properties (usually including shape) of a protein following the binding of another molecule to the protein.
Amphiphilic: Molecules containing both polar (hydrophilic) and apolar (hydrophobic) groups.
Ampholyte: Small molecule with positive and negative charges.
Analogue: Non-homologous proteins that have similar structural folding architecture but have arisen through convergent evolution.
Annotation: Adding pertinent information, comments, or notations, such as the genes coding for amino acid sequences.
Anticodon: A triplet of contiguous nucleotides on tRNA that binds to the triplet of contiguous nucleotides (codon) on mRNA.
Atomic-force microscopy: A type of scanning-probe imaging technique that provides high-resolution topological maps.
Base-pairing: Complementary hydrogen-bonded base pairs (e.g. A=T; C≡G) in nucleic acids.
Bayesian Technique: A stochastic procedure used to estimate parameters of a distribution based on an observed distribution.
Biochips: Miniaturized arrays of a large number of oligonucleotides (DNA microarrays).
Bioinformatics: An interdisciplinary subject to analyze the sequence data of nucleic acids and proteins and predict the structure and function of biological macromolecular complexes.
Biomarkers: Measurable and quantifiable biological parameters (e.g. specific enzymes or biological substances).
Biphotonic excitation: The simultaneous (coherent) absorption of two photons (of either the same or different wavelengths), the energy of excitation being the sum of the energies of the two photons.

BLAST: Basic Local Alignment Search Tool; used in genome informatics for similarity search of DNA and protein sequences.
BLOCKS: Conserved, ungapped, aligned sequence segments in a set of related proteins.
CARS microscopy: A “chemical microscopy” that is based on Raman spectroscopy.
Catalyst: Substance that accelerates the rate of a chemical reaction without being used up in the process.
cDNA: Complementary DNA that is synthesized in the laboratory from an mRNA template using reverse transcriptase.
CDR sequence: Coding region sequence; an uninterrupted nucleic acid sequence that codes for a protein.
“Chemical shift”: Shift in the resonance frequency of an atomic nucleus caused by the shielding of the nucleus from the influence of the external magnetic field.
Chromatography: A physical technique used to separate mixtures of substances based on differences in the relative affinities of the substances for mobile and stationary phases.
Chromosome: Self-replicating genetic structure in cells, which contains DNA sequences that constitute genes.
Clone: An exact copy of a gene, a cell or an organism (obtained asexually).
Cloning: The process of generating identical copies of a DNA fragment from a single template DNA.
Cluster analysis: A procedure of grouping together a set of objects from a large group of related objects.
Cobbler: A single sequence that represents the most conserved regions in a multiple sequence alignment.
Coding sequence (CDS): A region of DNA or RNA that codes for the sequence of amino acids in a protein.
Codon: A triplet of contiguous nucleotides on mRNA that codes for an amino acid.
COG: Clusters of orthologous groups that include orthologs and paralogs.
Confocal microscopy: A type of aperture-based light microscopy.
Conformation: Spatial arrangement of structural moieties due to rotation around a single bond.
Consensus sequence: A pseudo-sequence that summarizes the residue information contained in a multiple sequence alignment.
Conserved sequence: A base sequence in DNA that has remained essentially unaltered throughout evolution.
Contig: A length of contiguous sequence assembled from partial, overlapping sequences generated in a “shotgun” sequencing project.
Correlation: A statistical measure that indicates the extent to which two factors vary together and thus how well either factor predicts the other.
Dalton: Unit of mass that is equivalent to one-twelfth the mass of an atom of carbon-12 (approximately the mass of a hydrogen atom).

Databases: Computer-based organization of sequence and structural data of biomolecules.
Data mining: Search for non-trivial information or relationships in databases.
Dendrogram: A graphical representation of an evolutionary tree from the output of a hierarchical clustering method.
Differential Gene Expression: Screening technologies that are widely used for target validation.
Directed mutagenesis: Alteration of DNA at a specific site and its reintroduction into an organism to study any effects of the change.
DNA: A complex molecule containing the genetic information that makes up the chromosomes.
DNA fingerprinting: A procedure in which multilocus band patterns of a DNA sample are generated by digestion of the DNA with restriction enzymes followed by electrophoresis and visualization by hybridization with probes specific for repetitive sequences.
DNA footprinting: A method for determining the sequence specificity of DNA-binding proteins. The method utilizes a DNA-damaging agent, which cleaves DNA at every base pair; DNA cleavage is inhibited where the ligand binds to DNA.
Docking: A virtual-screening technique in drug design that evaluates the binding of ligands to macromolecular targets.
Domain: A discrete portion of a protein with a unique function.
Doppler effect: Apparent change of frequency due to motion of the source relative to the observer.
Dotplot: A graphical comparison of two sequences.
Dynamic programming: A mathematical method of solving a complex problem by combining solutions to sub-problems.
Electrophoresis: Molecular separation method based on the migration of charged particles under the influence of an applied electric field.
ELISA: Enzyme-Linked Immunosorbent Assay; an immunoassay utilizing an antibody labeled with an enzyme marker.
Encoding: The processing of information into the memory system, for example by extracting meaning.
EST: Expressed Sequence Tag; a partial sequence of a cDNA clone, which can act as an identifier of a gene.
Evolution: Diversification and mutation of living organisms.
Exon: The protein-coding region of a DNA sequence.
Expressed Sequence Tag (EST): A partial sequence of a DNA molecule that is a part of a cDNA molecule and can act as an identifier of a gene.
FASTA: A sequence similarity search algorithm.
Fingerprint: A group of ungapped motifs characteristic of a family member.
FISH: Fluorescence in situ hybridization; a method in which target sequences are stained with fluorescent dye so that their location and size can be determined using fluorescence microscopy.
Frame-shift: An alteration in the reading frame of DNA resulting from an inserted or deleted base.

Frame-shift mutation: A type of mutation in which a number of nucleotides not divisible by three is deleted from or inserted into a coding sequence.
Free induction decay: The time-dependent decay profile (signal amplitudes as a function of time) of a pulsed NMR signal.
Gap: Mismatch in the alignment of two sequences caused by insertion or deletion.
Genes: The fundamental structural and functional units (nucleic acids) of heredity that code for proteins.
Gene duplication: A particular kind of mutation: production of one or more copies of any piece of DNA, including a gene or even an entire chromosome.
Gene expression: The processes (transcription and translation) of converting genetic code into amino acid sequences.
Gene informatics: Database searches to analyze sequence information and prediction of structure and function from the sequence data.
Genetic code: The library of contiguous triplet nucleotides (codons) that code for 20 amino acids and stop signals.
Genetic map: A linear arrangement of the relative positions of genes along a chromosome.
Gene marker: A gene or other identifiable portion of DNA, in one haploid set of chromosomes, whose inheritance can be followed.
Genome: The complete set of all genetic material present in the chromosomes of an organism (measured as number of base pairs).
Genomics: Study of gene structure and functions.
Genotype: Genetic composition of an individual.
Genotyping: The determination of relevant nucleotide-base sequences in each of the two parental chromosomes.
Global alignment: The alignment of two nucleic acid or protein sequences over their entire length.
Half-life: The time needed for (1) half the atoms of a radioactive substance to decay or (2) half the amount of a substance (e.g., a drug) to be metabolized or excreted.
H-T-H motif: Helix-turn-helix structural motif found in proteins (e.g. in many of the prokaryotic transcriptional regulatory proteins).
Heuristic: Empirical procedure (“rule-of-thumb” strategy) to solve a problem based on “tested-and-correct” rules.
Homology: Relationship between two or more biological species, systems or molecules that are related by divergent evolution from a common ancestor.
Hydrolysis: Decomposition of a substance by the insertion of water molecules between certain of its bonds.
Hydrophilic: Water-loving.
Hydrophobic: Water-repelling.

Image reconstruction: The mathematical procedures involved in the transformation of the diffraction patterns for structure determination.
Immunoassay: A ligand-binding assay that uses a specific antigen or antibody, capable of binding to the analyte, to identify and quantify substances. The antibody can be linked to a radioisotope (radioimmunoassay, RIA), or to an enzyme (ELISA).
Immunoelectrophoresis: Combination of gel electrophoresis and immunodiffusion methods.
Indel: An insertion or deletion in sequence alignment.
Information theory: Procedure that collects information in terms of bits, the minimal amount of structural complexity needed to encode a given piece of information.
In silico analysis: Analysis by computational (theoretical) methods, in contrast to experimental methods of data acquisition and interpretation.
Intron: A region of DNA within a gene that does not encode protein sequence.
Ion: Atom or group of atoms that has an electrical charge arising from the gain or loss of electrons.
Ionic bond: Chemical bond formed between ions of opposite charge.
Isoelectric point: pH at which a polar (charged) molecule has zero net charge.
Isomer: Molecule with the same molecular formula as another but with a different structural formula (e.g., glucose and fructose).
Isotonic: Having the same solute concentration (and hence the same osmotic pressure) as the solution under comparison.
Isotope: Atom that differs in weight from other atoms of the same element because of a different number of neutrons in its nucleus.
Iterative: A sequence of operations that is performed repeatedly.
Kohonen map: An unsupervised self-organization (clustering) algorithm.
K-tuple: Identical short stretch of sequence, also called a word.
Leucine-zipper: A structural motif found in DNA-regulatory proteins.
Ligand: A molecule that binds to another molecule or to a cell.
Local alignment: The alignment of some portion of two nucleic acid or protein sequences.
MALDI-TOF-MS: A mass spectrometric technique that is used for the analysis of biological macromolecules.
Mapping: Determination of the physical location of a gene on a chromosome.
Markov model: A mathematical procedure based upon states and transition probabilities used in sequence alignment.
Microarrays: Arrays in which nucleic acids representing genes are spotted on a substrate and then tested against a sample to evaluate mRNA levels, and thus gene expression.
Molecular engineering: Computational design (drug design) of molecular species with required structural and functional characteristics.
Molecule: Smallest particle of a covalently bonded element or compound that retains the properties of that substance.

Motif: An aggregate of secondary structural elements forming a super-secondary structure.
MRI: Magnetic resonance imaging; a non-invasive imaging technique based on locating nuclear spins in a specimen.
Multiple sequence alignment: An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions are aligned in the same column.
Multiplex sequencing: A sequencing approach that uses several pooled samples, greatly increasing sequencing speed. Used in high-throughput sequencing.
Mutation: A heritable change in the nucleotide sequence of genomic DNA (or RNA in RNA viruses).
Neural Networks: Artificial neural networks (ANN) are a collection of information-processing mathematical models that draw on the analogies of adaptive biological learning.
NMR spectroscopy: A versatile spectroscopy method that is used in structure determination of biomolecules.
NOE: Nuclear Overhauser Effect; an interaction between the dipole moments of two nuclei in spatial proximity that provides information about the distance between the nuclei.
NOESY: An NMR technique used to help determine protein structures. It reveals how close different protons (hydrogen nuclei) are to each other in space.
Northern blotting: An analysis technique for the identification of RNA.
NSOM: Near-field Scanning Optical Microscopy; a scanning-probe imaging technique that circumvents the diffraction limit and thus improves image resolution.
Odds score: The ratio of the likelihood of two events or outcomes. In sequence alignment, it is the frequency with which two characters are aligned in related sequences divided by the frequency with which those same two characters align by chance alone.
Open reading frame (ORF): The sequence of DNA or RNA located between the start codon (initiation codon) and the stop codon (termination codon) that encodes a gene. An ORF is potentially able to encode a polypeptide.
Operator: The DNA sequence in prokaryotes to which a repressor or activator protein binds, turning on (or off) transcription of the associated structural genes of the operon.
Optimal alignment: Statistically the best possible alignment of the given characters.
Orbital: Wave function in space.
Ortholog: Homologous protein (or gene) that performs the same function in different species.
PAM matrix: Percent Accepted Mutation table that describes the probability that one base or amino acid has changed during the course of evolution.
Paralog: Homologous protein (or gene) that performs different but related functions within the same organism.
PCR method: Polymerase chain reaction (PCR), a gene amplification method.
Peptide mass fingerprinting: A method of protein identification by means of mass spectrometry.
“Phase problem”: The problem of recovering, by mathematical methods, the phase information for diffraction spots (hkl) in X-ray structure determination.

Phenotype: Observable traits (characteristics) of an organism.
Phylogenetic analysis: Study of the evolutionary relationships.
Physical map: A linearly ordered set of DNA fragments encompassing the genome or region of interest.
Polymorphisms: Genetic variations, encompassing any of the many types of variations in DNA sequence that are found within a given population (mutations, point-mutations, and SNPs).
Positional cloning: Use of genetic maps to determine the location of a disease gene.
Primary structure: The linear sequence of amino acids in a protein.
Primer: Short preexisting polynucleotide chain to which a polymerase enzyme can add new nucleotides.
Profile: A matrix representation of a conserved region in a multiple sequence alignment that allows for gaps in the alignment.
Promoter: Region on a DNA molecule involved in RNA-polymerase binding to initiate transcription.
Protein engineering: A technique used to produce proteins with altered or novel amino acid sequences.
Protein folding: The process by which a polypeptide chain attains the spatial structure that is unique to an individual protein.
Protein-folding problem: The problem of determining the tertiary structure of a protein from its amino acid sequence data.
Proteome: The entire collection of proteins that are coded by the genome of an organism.
Pseudogenes: Genes that do not code for proteins, or silent genes that contain elements that control gene expression.
QSAR: Quantitative structure-activity relationship; a mathematical approach towards linking molecular structure to function and activity.
Quaternary structure: Arrangement of protein monomers (by non-covalent forces) in a multimeric protein complex.
Radioimmunoassay (RIA): A radiolabeled immunoassay method.
Ramachandran map: A plot of sterically allowed and not-allowed conformations of a polypeptide backbone.
Rayleigh scattering: Elastic scattering of light by molecules.
Reading frame: A sequence of codons beginning with an initiation codon and ending with a termination codon.
Regression analysis: The procedure of fitting a mathematical model to data.
Reverse turn: A secondary structure element in proteins where the polypeptide backbone turns sharply on itself.
Scaffolds: Ordered set of contigs placed on the chromosome.
Scanning-probe microscopy: Surface-probe imaging technique that provides high-resolution topological maps.

Scanning-tunneling microscopy: A surface-probe imaging technique that is based on quantum mechanical tunneling current. Provides high-resolution topological maps.
Secondary structure: Helix, sheet and loop segments in a protein.
Sequence alignment: A linear comparison of nucleic acid or amino acid sequences (with insertions) to bring equivalent positions in adjacent sequences into the correct register.
“Shotgun” sequencing: High-throughput sequencing method which involves randomly sequencing tiny cloned pieces of the genome, with no foreknowledge of where on a chromosome the piece originally came from.
Silent mutation: A nucleotide substitution that does not result in an amino acid substitution in the protein, because of the redundancy of the genetic code.
Single crystalline: Matter wherein the internal organization of atoms, molecules or clusters of molecules is regular and periodic in all three dimensions (crystal lattice).
SNPs: Single nucleotide polymorphisms (SNPs) are single base-pair positions in genomic DNA at which different sequence alternatives (alleles) exist in normal individuals in some population(s), wherein the least frequent allele has an abundance of 1% or greater.
Southern blotting: An analysis technique for the identification of DNA.
Splice sites: Boundaries between exons and introns.
Substrate: Substance that is acted upon by an enzyme.
Tertiary structure: Three-dimensional structure of a molecule (e.g. protein).
Threading: A procedure of aligning the sequence of an unknown protein with a known three-dimensional structure.
Transcription: Synthesis of an RNA copy from a DNA sequence.
Transcription factor: A regulatory protein that is required to initiate transcription of a gene in eukaryotes.
Translation: The process of converting the template information in mRNA into synthesis of a protein.
Transmembrane protein: Protein that passes one or more times through the lipid bilayer of a cell membrane.
Unidentified reading frame (URF): An open reading frame encoding a protein of undefined function.
Unit cell: A parallelepiped (an imaginary box) that contains one basic unit of the structure; translational repetition of the unit cell in all three dimensions generates the crystal.
Valence: Number of electrons gained, lost, or shared by an atom in bonding to one or more other atoms.
Virtual screening: Validation of drug design by computational (in silico) methods.
Western blotting: An analysis technique for the identification of proteins or peptides that have been electrophoretically separated and transferred by blotting to strips of nitrocellulose paper. Radiolabeled antibody probes or fluorophores are used to detect the blots.
Zinc-finger motif: A structural motif found in DNA-regulatory proteins.
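
The glossary entries for Dynamic programming, Global alignment, Local alignment, Gap and Alignment score can be made concrete with a small worked sketch. The Python code below is only an illustration: the function names and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for this example and do not come from the text. It fills a score table from solutions to smaller sub-problems, once over the entire length of both sequences (the Needleman-Wunsch scheme) and once allowing the alignment to start and stop anywhere (the Smith-Waterman scheme).

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a and b over their entire length."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of sub-segments of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A score is never allowed to drop below zero, so a poor
            # stretch simply ends the local alignment.
            cur[j] = max(0, prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

With these assumed scores, aligning ACGT against AGT globally requires one gap, and the best local alignment of the second pair is the shared segment ACG. Practical programs use substitution tables such as the PAM matrix in place of the fixed match and mismatch values used here.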

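The glossary entries for Codon, Genetic code, Open reading frame (ORF), Frame-shift mutation and Silent mutation can likewise be illustrated in a few lines. The following Python sketch is a simplified illustration under stated assumptions: it uses the standard genetic code written for DNA (T in place of U), accepts only uppercase A, C, G and T, scans only the three forward reading frames, and the names `translate` and `find_orfs` are invented for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in the order T, C, A, G
# at each position; "*" marks a stop (termination) codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; one or two trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=1):
    """Return (start, end, protein) for each ATG...stop stretch found in
    the three forward reading frames; end is one past the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # initiation codon
            elif start is not None and CODON_TABLE[codon] == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCTTAAGG"))
    print(translate("ATGGCTTAA"))   # reference sequence
    print(translate("ATGGCCTAA"))   # third-position substitution in codon 2
    print(translate("ATGCTTAA"))    # one base deleted after the start codon
```

The last three calls compare a reference sequence with a single-base substitution that leaves the encoded amino acid unchanged (a silent mutation) and with a single-base deletion that shifts the reading frame of everything downstream (a frame-shift mutation).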
Index a-helix, 36, 192 b-sheet, 38 ,192 3D-QSAR, 231, 240 Agarose gel, 58 Amino acids, 30 characteristics of, 30 chemical formula of, 30 folding propensity of, 126 hydrophilic, 32 hydrophobic, 33 ionization states of, 31 neutral, 34 side chains, 32 zwitterionic state of, 30 Ampholytes, 61, 234 Anion exchanger, 56 Autoradiography, 72 Base-pairing, 15, 234 Watson-Crick type, 15, 18 Bessel function, 87 Bioinformatics, 1, 4, 52, 234 medico-, 212 molecular, 4, 116, 133 objectives of, 4 pharmaco-, 212 Biology computational, 6, 223 molecular, 5 quantitative, 5 structural, 6 systems, 6 Biomarkers, 216, 234 Biomedinformatics, 2 Biomolecules, 51-69 physicochemical characterization of, 51 spatial structure determination of, 83-101

Biophysics molecular, 6, 9-48 Blotting techniques Northern, 66, 239 Southern, 66, 241 Western, 66, 68, 241 Bonds (chemical) conjugated, 13 coordination, 13 double, 13 glycosyl, 15 peptide, 34 phosphodiester, 16 single, 13 Bragg, 84 Bragg’s equation, 84, 85 Bremsstrahlung, 83 Brewester’s angle, 66 Cation exchanger, 56 cDNA sequencing, 75 Cellulose acetate, 57 Charge-coupled diode (CCD), 72 Chemical graph, 11 Chemical shift, 90, 235 Cheminformatics, 2 Chromatography, 55, 235 affinity, 56 dHPLC, 55 HPLC, 55 ion-exchange, 56 liquid, 55 reversed-phase, 55 size-exclusion, 56 supercritical fluid (SFC), 56 thin layer (TLC), 55

Index Chromosomal maps, 117, 235 banding patterns, 28 genetic, 117 linkage maps, 117, 213 physical, 117 positional cloning, 213 Coding region (CDR), 120 Codon, 3, 25, 235 Codon usage, 120 Conformation analysis, 39, 235 Ramachandran plots, 39 Contig, 75, 118, 235 Coomassie Blue, 65 Crystal lattice, 84 Cystic fibrosis, 4, 215 Cystic fibrosis transmembrane regulator (CFTR), 4, 215 Cytofluorimetry, 73 Data mining, 168-211 analysis, 168 prediction, 121 Database search, 151-167 genome, 155 motifs, domains and profiles, 163, 192, 194 pattern recognition, 163, 199 primary structure, 151 protein, 160 search engines, 153 secondary structure, 162 sequence similarity, 161, 183 Databases, 152, 236 BIND, 226 BLOCKS, 163, 187, 193, 235 BRENDA, 226 COILSCAN, 163, 193 COMPEC, 156, 226 DIP, 226 EMBL, 156 ENZYME, 226 ExPASy, 153 genomic, 152, 155 HTHSCAN, 193 knowledge, 151, 165 NCBI, 154 Parsimony, 161, 189 PDB, 160 PFAM, 198 PILEUP, 162, 188 primary, 151

PRINTS, 163, 194 PRODOM, 163, 195 PROFILES, 163, 198 PROSITE, 163, 194 proteomic, 152 secondary, 151, 152 structural, 228 SWISS-PROT, 154 De novo synthesis, 116, 231 Dideoxynucleotide (ddNTP), 72 Diffusion, 52 rotational, 53 translational, 52 Dipole moment, 53 Disease gene identification, 212 impact of genomics, 213 DNA, 18, 236 A-form, 18 B-form, 20 replication, 21 transcription, 23 Z-form, 20 DNA fingerprinting, 118, 236 DNA microarrays, 124 DNA-histone complex, 108 Doppler effect, 53 Dotplot analysis, 174, 236 Drug discovery, 224 genome-centric, 224 “knowledge-based”, 228 ligand/target-centric, 224 Einstein, 52 Electron density, 86 Electrophoretic methods, 56, 236 2-D capillary, 64 2-D gel, 62 capillary, 63 capillary zone (CZE), 64 column, 58 free-solution, 64 horizontal, 58 IEF/SDS-PAGE, 63, 64 immunoelectrophoresis, 61, 238 microchip, 58 polyacrylamide gel (PAGE), 59 pore gradient, 61 pulse-field gel (PFGE), 71, 118 SDS-PAGE, 59, 60

243

244 Bioinformatics: A Primer slab gel, 59 Electrophoretic mobility, 57 Electrospray ionization (ESI), 78, 80 Elongation factors, 25 Endopeptidases, 77 Energy Madelung, 44 potential, 43 Eukaryote, 120, 121 Evolution, 236 and function, 128 and structure, 130 convergent, 129 divergent, 129 molecular, 130 Evolutionary tree, 190 Exon, 121, 236 Exopeptidases, 77 Expressed sequence tag (EST), 121, 157, 236 FISH, 118, 213, 216 Fourier transformation, 83 Free induction decay (FID), 90, 237 Frictional coefficient, 52 Frictional drag, 57 Functional informatics, 125 Gene, 237 amplification, 25, 70 cloning, 27, 70 duplication, 237 expression, 21, 218, 237 mapping, 28 replication, 21 separation, 27, 71 sequencing, 27, 71, 73 structure, 156 Gene chips, 215 Gene expression analysis, 123 DNA microarrays, 124 gene chips, 215 SAGE, 124, 215 Gene sequencing, 27, 71 dideoxy method, 71 Maxam Gilbert method, 71 multiplex, 71 Sanger method, 71 Gene therapy, 221 Genetic code, 3, 237 dictionary of, 3

Genetic diseases, 3, 214 Alzheimer’s, 4 common, 216 cystic fibrosis (CF), 215 Huntington’s, 4 imprinting, 216 Mendelian trait, 215 monogenic, 215 non-Mendelian trait, 216 Parkinson’s, 4 polygenic, 216 sickle cell anemia, 3, 215 single nucleotide polymorphism (SNP), 3, 214 Genetic engineering, 5 Genetic maps, 117 Genetic testing, 216, 217 biomarkers, 216 differential gene expression profiling, 218 differential protein expression profiling, 219 linkage analysis, 213 molecular profiling (MP), 219 mutation screening, 218 Genetic variations, 214 Genome annotation, 119 Genome sequencing, 74 “shotgun” strategy, 74, 241 Genomics, 116, 237 analysis, 116, 223 comparitive, 119 functional, 123 structural, 116 Glycoproteins, 110 Hidden Markov Model (HMM), 196, 238 HIV-1 protease, 4, 233 Human genome project (HGP), 21 Hydrogen bonding, 46, 103 sequence-specific, 103 Image reconstruction, 84, 238 Imaging methods, 95 functional/metabolic, 99 microscopies, 96 structural, 96 tomographies, 98 Immobilized pH gradient (IPG), 62 Immunoassay, 236 ELISA, 229 RIA, 229, 240 In silico analysis, 5, 227, 238

Index Information content, 2, 84 Interactions codon-anticodon, 2 dipole/dipole, 43, 44 dipole/induced-dipole, 43, 45 electrostatic, 43, 102 hydrogen bonding, 46 hydrophobic, 46, 237 induced-dipole/induced-dipole, 43, 45 ionic, 43 molecular, 102 non-bonded, 43 protein-carbohydrate, 110 protein-ligand, 102-111 protein-lipid, 110 protein-nucleic acid, 102 protein-protein, 110 Van der Waals, 43, 44 Intron, 121, 238 Intron/exon boundary, 122 Isoelectric focusing (IEF), 61 Isoelectric point (pI), 57, 61, 238 Keesom equation, 45 Kozak sequence, 120 Larmor frequency, 89 Laser-induced fluorescence (LIF), 72 Lennard-Jones potential, 46 Light scattering, 53 Linkage analysis, 117, 213 Lipoproteins, 110 London equation, 46 Macromolecule fibrous, 87 globular, 134, 135 Madelung formula, 43 Magnetization vector, 90 Mass fingerprinting, 79 Mass spectrometry (MS), 78 ESI-MS/MS, 80 MALDI-TOF-MS, 78, 79 MI-MS, 80 SELDI-TOF-MS, 80 tandem (MS/MS), 80 Metabolic engineering, 220 Metabolomics, 220 Methyl green pyronin, 65

245

Microscopy atomic-force (AFM), 73, 97, 234 biphotonic excitation confocal, 97, 234 chemical, 98 coherent anti-Stokes Raman scattering (CARS), 98, 235 confocal, 73, 96, 235 electron, 96 fluorescence, 97 fluorescence lifetime imaging (FLIM), 97 magnetic resonance (MRM), 98 NSOM, 73, 98 optical, 96 scanning-probe (SPM), 97, 240 scanning-tunneling (STM), 73, 97, 241 surface plasmon resonance (SPR), 98 transmission electron (TEM), 98 X-ray, 98 Mining data, 121, 168, 236 sequence, 5 structure, 5 Molecular mass, 52 shape, 53 size, 53 Molecular dynamics (MD), 109 Molecular engineering, 223-233, 238 genomic approach, 223 proteomics approach, 225 rational design, 227 Molecular structure forces stabilizing, 42 Molten globule, 133 Mutation screening, 218 single-point, 3 structural, 223 Mutation screening, 215, 218 Nanopore sequencing, 75 Neutrons, 11 Niels Bohr, 11 NMR Spectra, 90 COSY, 92 DFQ-COSY, 92 HMQC, 94 NOESY, 92 NMR spectroscopy, 89 multi-dimensional, 91

246 Bioinformatics: A Primer Nucleic acid bases, 15 adenine, 15 cytosine, 15 guanine, 15 thymine, 15 uracil, 15 Nucleic acid families, 18 A-form, 18 B-form, 20 Z-form, 20 Nucleic acids, 15-29 cDNA, 75, 121, 235 constituents of, 15 DNA, 16, 18 double helical structure, 17, 18 families of, 18 mRNA, 24 polynucleotides, 16 primary structure of, 70 RNA, 16 tRNA, 19 Nucleoside, 15,16 Nucleotide, 16 Okazaki fragments, 22 Oligonucleotides, 16 Open reading frame (ORF), 120, 156, 157 Orbitals, 11 Patterns recognition, 163 Pauli, 12 Pauli’s exclusion principle, 12 Peptide fragmentation, 76 Peptide unit, 36 Pauling’s hypothesis of, 35 Pharmacogenomics, 221 Phase problem, 83 anomalous dispersion method, 87 direct methods, 87 molecular replacement method, 87 Patterson method, 86 Phylogenetic analysis, 184 BLOCKS model, 187, 235 cladistic methods, 189 Dayhoff method, 184 distance method, 186, 188 log-odds matrix, 186 PAM model, 185 Polyethylene oxide (PEO), 64

Polymerase DNA, 22 RNA, 22 Polymerase chain reaction (PCR), 27, 71 Polymorphisms, 3 Polynucleotides, 16 Polypeptide, 34 fragmentation, 76 Primary structure, 70-82 of nucleic acids, 70 of proteins, 31, 76 Prokaryote, 120 Protein classification, 200 CATH, 164, 200 SCOP, 164, 201 Protein folding, 115 clasdistic approach, 128 kinetics methods, 127 phylogetic methods, 128 problem, 115-132 rules, 133 Protein P21, 4 Protein sequencing 2-DE/MS method, 79 Edman method, 77 Protein structure a/b-barrel domains, 129, 142 domains, 194 evolutionary trends in, 129 motifs, 193 prediction, 116 profiles, 196 Protein trafficking, 4 Proteins, 30-48 amino acid sequencing, 76 classification, 34, 164 collagen group, 134 disulfide-containing, 144 DNA-regulatory, 104 fibrous, 87, 134 globular, 134, 135 homologous, 169 keratin group, 134 monomeric, 41 oligomeric, 41 orthologous, 130, 170 paralogous, 130, 170 primary structure of, 35 purification of, 77 quaternary structure of, 41

    secondary structure of, 35
    spatial structure of, 70
    tertiary structure modeling, 201
    tertiary structure of, 41, 201
    three-dimensional structure of, 41
Proteomics, 62, 225
    analysis, 125
    inter-proteome, 125
    intra-proteome, 125
Protons, 11
Pseudogenes, 21, 240
Quantum number
    angular, 11
    azimuthal, 11
    magnetic, 11
    principal, 11
    spin, 11
Ramachandran
    analysis, 39
    angles, 40
    map, 39, 240
Rational design, 227
    CAMD, 229
Rayleigh equation, 53
Rayleigh scattering, 53, 73, 240
Refractive index, 53
Relative mobility, 60
Relaxation, 90
    spin-lattice (T1), 90
    spin-spin (T2), 90
Replication, 21
RNA translation, 25
SAGE, 124, 215
Schlieren optics, 53
Scoring matrix, 171
Search engines, 153
Secondary structure elements
    α-helices, 192
    β-sheets, 192
    reverse turns, 192
Sedimentation, 54
Separation gel, 59
Sequence alignment analysis, 170
    BLAST, 180
    Blocks method, 187
    cladistic methods, 189
    CLUSTAL algorithm, 183, 184
    clustering algorithms, 183, 187
    consensus sequence, 75, 122, 157, 179
    distance method, 188
    dotplot, 173, 174
    dynamic programming, 174
    FASTA, 182
    global alignment method, 176
    k-tuple method, 177
    local alignment method, 176
    multiple sequence alignment, 178
    Needleman-Wunsch algorithm, 176
    pair-wise alignment, 173, 177
    PAUP, 189
    phylogenetic analysis, 184
    PILEUP algorithm, 182
    sequence assembly, 122
    Smith-Waterman algorithm, 176
    strategies for, 183
Sequence retrieval programs, 154
    BLAST, 154
    ENTREZ, 155
    FASTA, 155
    FETCH, 155
Shine-Dalgarno sequence, 25, 120
Sickle cell anemia, 3
Single nucleotide polymorphism (SNP), 3, 214
Site-directed mutagenesis, 230
Space group, 84
Spectra
    atomic, 12
    frequency-domain, 91
    line, 12
    NMR, 90
Stacking gel, 59
Stokes, 53
Stokes-Einstein equation, 53
Structural motif
    α/β, 141, 142
    β-barrel, 140, 142
    β-turn-β, 139
    all α-helices, 137, 138
    all β-strands, 139
    EF-hand, 143
    Greek key, 139
    helix-turn-helix, 104, 143
    immunoglobulin-fold, 139, 141
    leucine-zipper, 107
    Rossmann (βαβ) fold, 141
    swiss roll, 139
    zinc-finger, 106, 143, 144, 241
Structure
    atomic, 11, 12
    determination, 83, 85, 93
    factor, 86
    prediction, 133
    primary, 35, 70, 76, 240
    quaternary, 41, 240
    refinement, 87
    secondary, 35, 190, 241
    spatial, 41, 83
    tertiary, 41, 241
    three-dimensional, 41, 205
Structure classification, 200
    domains, 194
    profiles, 196
    super-secondary, 137
Structure factor, 86
Structure prediction, 134
    “knowledge-based”, 135
    Chou-Fasman method, 136
    computational methods of, 133-147
    contact potential method, 203
    distance matrix method, 171, 202
    inverse folding method, 135
    of fibrous proteins, 134
    of globular proteins, 135
    pattern recognition, 163, 199
    profile sum method, 203
    secondary, 135, 190
    sequence threading method, 204
    structure profile method, 202
    tertiary, 164, 201
Sugar
    deoxyribose, 16
    furanose, 16
    oxyribose, 16
Sugars
    N-linked, 110
    O-linked, 110
Svedberg, 54
Systems biology, 6
TATA box, 120, 156
Tautomer, 15
Tomography, 98
    CT, 98, 99
    electron, 98
    fMRI, 99
    magnetic resonance imaging (MRI), 99
    PET, 99
    SPECT, 98, 99
Transcription, 23
Translation, 25
Tri-nucleotide repeat expansion (TNRE), 4
tRNA, 19
    aminoacyl, 25
    fMet, 25
Turns and loops, 39
Unit cell, 84, 241
Unit cell content, 85
Untranslated region (UTR), 120, 121
Validation, 230
Virial coefficient, 54
Virtual screening, 230, 241
Watson-Crick, 1
Watson-Crick hypothesis, 16, 18
Wave functions, 11
X-ray crystallography, 1
    macromolecular, 1
    time-resolved, 83
X-ray diffraction, 83
    principles of, 84
    single-crystal, 84