Concepts and Experimental Protocols of Modelling and Informatics in Drug Design (ISBN 9780128205464)

Concepts and Experimental Protocols of Modelling and Informatics in Drug Design discusses each experimental protocol utilized…


English · 387 pages · 2021







Concepts and Experimental Protocols of Modelling and Informatics in Drug Design

Om Silakari, Department of Pharmaceutical Sciences and Drug Research, Punjabi University, Patiala, India

Pankaj Kumar Singh, Department of Chemistry and Pharmacy, University of Sassari, Sassari, Italy

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website:

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-820546-4

For information on all Academic Press publications visit our website at

Publisher: Stacy Masucci
Senior Acquisitions Editor: Rafael E. Teixeira
Editorial Project Manager: Sam W. Young
Production Project Manager: Niranjan Bhaskaran
Senior Cover Designer: Mark Rogers
Typeset by MPS Limited, Chennai, India

Contents

Preface
Acknowledgment

Chapter 1: Fundamentals of molecular modeling
1.1 Molecular modeling
1.2 Molecular representation
  1.2.1 Cartesian coordinate
  1.2.2 Polar coordinate
  1.2.3 Internal coordinate
1.3 Computer graphics
1.4 Molecular models
  1.4.1 CPK models
  1.4.2 Dreiding models
  1.4.3 Computer models
1.5 Molecular surfaces
  1.5.1 Van der Waals surface (VWS)
  1.5.2 Solvent accessible surface
  1.5.3 Solvent excluded surface
  1.5.4 Charged partial surface area (CPSA)
1.6 Workstations
  1.6.1 GPU hardware
1.7 Principles of molecular modeling
  1.7.1 Molecular mechanics
  1.7.2 Molecular dynamics
  1.7.3 Quantum mechanics
References

Chapter 2: QSAR: Descriptor calculations, model generation, validation and their application
2.1 Introduction


2.2 Fundamental principle of QSAR
2.3 QSAR methodology
  2.3.1 Data preparation
  2.3.2 Data analysis
  2.3.3 Validation
2.4 Descriptor calculations for QSAR models
  2.4.1 Types of QSAR descriptors
2.5 Development of Hansch models and their validation
  2.5.1 General guidelines for derivation of Hansch QSAR model
2.6 QSAR model generation using Free Wilson approach
  2.6.1 Limitations
2.7 QSAR model generation using mixed approach
2.8 3D QSAR analyses
  2.8.1 Comparative molecular field analysis (CoMFA)
  2.8.2 Comparative molecular similarity indices analysis (CoMSIA)
2.9 Conventional QSAR versus 3D-QSAR
2.10 Conclusion
References

Chapter 3: Small molecule databases: A collection of promising bioactive molecules
3.1 Introduction
3.2 BindingDB
  3.2.1 Description
  3.2.2 Details
3.3 ChEBI
  3.3.1 Description
  3.3.2 Details
3.4 ChemSpider
  3.4.1 Description
3.5 ChEMBL
  3.5.1 Description
3.6 ZINC
  3.6.1 Description
  3.6.2 Details
3.7 PubChem
  3.7.1 Description
3.8 DrugBank
  3.8.1 Description
References


Chapter 4: Database exploration: Selection and analysis of target protein structures
4.1 Introduction
4.2 Protein databases
  4.2.1 UniProt: the Universal Protein knowledgebase
  4.2.2 Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)
  4.2.3 Binding database
  4.2.4 Therapeutic target database
References

Chapter 5: Homology modeling: Developing 3D structures of target proteins missing in databases
5.1 Introduction
5.2 Methodology of homology modeling
  5.2.1 Template recognition and initial alignment
  5.2.2 Alignment correction
  5.2.3 Backbone building
  5.2.4 Loop modeling
  5.2.5 Side-chain modeling
  5.2.6 Ligand modeling
  5.2.7 Model optimization
  5.2.8 Model validation
5.3 Software for homology modeling
  5.3.1 Robetta
  5.3.2 Modeller
  5.3.3 3D-JURY
  5.3.4 Swiss-Model
5.4 Conclusion
References

Chapter 6: Molecular docking analysis: Basic technique to predict drug-receptor interactions
6.1 Introduction: what is molecular docking?
6.2 Theory of docking
  6.2.1 Sampling algorithms
  6.2.2 Scoring functions
6.3 Types of molecular docking
  6.3.1 Rigid docking: rigid ligand and rigid receptor docking
  6.3.2 Constrained docking: flexible ligand and rigid receptor


  6.3.3 Flexible docking: flexible ligand and flexible receptor docking
6.4 Standard methodology for molecular docking
6.5 Software available for molecular docking
6.6 Conclusion
References

Chapter 7: Molecular dynamic simulations: Technique to analyze real-time interactions of drug-receptor complexes
7.1 Introduction
7.2 Principles of MD simulations
  7.2.1 Definitions
  7.2.2 Calculating averages from an MD simulation
  7.2.3 Classical mechanics
  7.2.4 Algorithms
7.3 Steps of MD simulations
  7.3.1 Initialization
  7.3.2 Energy minimization
  7.3.3 Heating the simulation system
  7.3.4 Equilibration at a constant temperature
  7.3.5 Production stage of MD trajectory (NVE ensemble)
7.4 Applications of MD simulations in drug discovery
  7.4.1 Identifying cryptic and allosteric binding sites
  7.4.2 Improving the computational identification of small-molecule binders
  7.4.3 Advanced free-energy calculations using MD simulations
7.5 MD simulations: Current limitations
References

Chapter 8: Water mapping: Analysis of binding site spaces to enhance binding
8.1 Introduction
8.2 Thermodynamics
8.3 Predicting location and nature of water molecule: To be or not to be replaced?
8.4 Strategies to identify cavity “waters”
  8.4.1 Molecular docking
  8.4.2 Molecular dynamics
  8.4.3 Free energy calculations
8.5 Loopholes and limitations


8.6 Conclusion
References

Chapter 9: Ligand-based pharmacophore modeling: A technique utilized for virtual screening of commercial databases
9.1 Introduction
9.2 Methodology of pharmacophore modeling or mapping
  9.2.1 Input: Data set preparation and conformational search
  9.2.2 Conformational search
  9.2.3 Feature extraction
  9.2.4 Pattern identification
  9.2.5 Scoring of the model
  9.2.6 Validation of pharmacophore
  9.2.7 Applications of pharmacophore modeling
9.3 In-process determinants for quality pharmacophore modeling
  9.3.1 Molecular alignments
  9.3.2 Handling flexibility
  9.3.3 Alignment algorithms
  9.3.4 Key aspects of scoring and optimization
9.4 Automated pharmacophore generation methods
  9.4.1 Geometry- and feature-based methods
  9.4.2 Field-based methods
  9.4.3 Pharmacophore fingerprints
  9.4.4 ChemX/ChemDiverse, PharmPrint, OSPPREYS, 3D keys, Tuplets
  9.4.5 Other methods
9.5 Conclusion
References

Chapter 10: Fragment-based drug design: Connecting small substructures for a bioactive lead
10.1 Introduction
10.2 General strategy for fragment-based drug design
  10.2.1 Techniques for finding fragments
  10.2.2 Converting fragments into hits and leads
  10.2.3 Hit identification and validation
10.3 Recent advancements in FBDD techniques
  10.3.1 Fragment-based molecular evolutionary approach
  10.3.2 Construction and deconstruction approach
  10.3.3 Computational functional group mapping


  10.3.4 Multitasking computational model approach
10.4 Limitations
10.5 Conclusion
References

Chapter 11: Scaffold hopping: An approach to improve the existing pharmacological profile of NCEs
11.1 Introduction
11.2 Computational approaches of scaffold hopping
  11.2.1 Pharmacophore searching
  11.2.2 Recombination of ligand fragments
  11.2.3 Molecular similarity method
11.3 Conclusion
References

Chapter 12: Hotspot and binding site prediction: Strategy to target protein–protein interactions
12.1 Introduction
12.2 Protein–protein binding sites
12.3 Types of protein–protein interaction regions
12.4 Computational prediction of protein binding sites
  12.4.1 Protein–protein docking
  12.4.2 Binding site prediction based on the protein sequence
  12.4.3 Binding site prediction based on the protein structure
  12.4.4 Energy-based methods
12.5 Hot-spot residues at protein interfaces
12.6 Prediction of hot spots in protein–protein interactions
  12.6.1 Hot-spot prediction based on the sequence
  12.6.2 Hot-spot prediction based on the structure
  12.6.3 Hot-spot prediction based on the unbound protein structure
12.7 Conclusion
References

Chapter 13: In silico SNP analysis: An aid to identify novel potential deleterious SNPs in drug targets
13.1 Introduction
13.2 Sequence-based approaches to SNP analysis
13.3 Structure-based approaches to SNP analysis
13.4 Sequence-based prediction tools

  13.4.1 SIFT
  13.4.2 MAPP
  13.4.3 PANTHER
  13.4.4 Parepro
  13.4.5 PhD-SNP
  13.4.6 SNPs&GO
13.5 Structure-based prediction tools
  13.5.1 PolyPhen
  13.5.2 SNPs3D
  13.5.3 LS-SNP
  13.5.4 SNPeffect
  13.5.5 SNAP
  13.5.6 PMUT
  13.5.7 SAPRED
  13.5.8 MutPred
  13.5.9 MuD
13.6 Conclusion
References

Chapter 14: ADMET tools: Prediction and assessment of chemical ADMET properties of NCEs
14.1 Introduction
14.2 Prediction of physicochemical properties
  14.2.1 Solubility and solubilization
  14.2.2 Permeability and active transporters
  14.2.3 Hydrogen bonding
  14.2.4 Ionization constant (or dissociation constant)
  14.2.5 Lipophilicity
14.3 Prediction of ADME and related properties
  14.3.1 Absorption
  14.3.2 Bioavailability
  14.3.3 Blood–brain barrier penetration
  14.3.4 Transporters
  14.3.5 Dermal and ocular penetration
  14.3.6 Plasma-protein binding
  14.3.7 Volume of distribution
  14.3.8 Clearance
  14.3.9 Metabolism
  14.3.10 Toxicity
14.4 Computational tools for ADMET prediction
14.5 Conclusion
References


Chapter 15: Cheminformatic tools: Identify suitable synthesis procedures to realize designed molecules
15.1 Introduction
15.2 Strategies for computer-assisted prediction of synthetic schemes
  15.2.1 Template library-based
  15.2.2 Template-free
  15.2.3 Focused template application
15.3 Approaches to validate selected synthetic route
  15.3.1 Classifying reaction feasibility
  15.3.2 Predicting mechanistic steps
  15.3.3 Ranking templates
  15.3.4 Ranking products
  15.3.5 Generating products
15.4 Tools developed so far
15.5 Conclusion
References

Chapter 16: Statistical methods and parameters: Tools to generate and evaluate theoretical in silico models .................................................333
16.1 Introduction .................................................................................................... 333
16.2 Data analysis methods.................................................................................... 334
16.2.1 Principal components analysis (PCA) ................................................. 334
16.2.2 Cluster analysis ................................................................................... 334
16.3 Regression methods........................................................................................ 334
16.3.1 Simple regression analysis .................................................................. 335
16.3.2 Multiple linear regressions .................................................................. 336
16.3.3 Stepwise multiple linear regression ..................................................... 336
16.3.4 Principal components regression (PCR) .............................................. 336
16.3.5 Partial least square regression analysis................................................ 337
16.3.6 Genetic function approximation (GFA)............................................... 340
16.3.7 Genetic partial least squares (G/PLS) ................................................. 340
16.4 Evaluation of in silico models ....................................................................... 341
16.4.1 Internal validation ............................................................................... 341
16.4.2 External validation .............................................................................. 345
16.4.3 Virtual screening validation ................................................................ 348
16.5 Conclusion...................................................................................................... 348
References ................................................................................................................ 349

Glossary .............................................................................................................351
Index ..................................................................................................................367

Preface

The main reason for proposing a new book in the field of CADD is the increasing focus of educational institutions on molecular modeling as a key and rational approach to drug discovery. Research scholars, undergraduate students, and sometimes even teachers of drug design and medicinal chemistry do not readily find books to guide their research ideas and CADD-based experiments. Students and teachers usually gain theoretical knowledge from the various books available in the market, but these lack proper experimental application of the tools. Most of the books available in libraries focus on the theoretical concepts of computer-aided drug design and the use of molecular modeling in general. A thorough examination of such books disappoints scholars and teachers, as they provide no insight into how these theoretical concepts are actually applied to solve research problems. Even the books that discuss the different software packages used in in silico analysis fail to provide clear guidelines that a layman could follow.

Therefore, we wanted to write a book that reflects on the issues researchers face in the practical application of CADD concepts, rather than a theory book of definitions and explanations. This book is intended as a handbook for the practical application of the different in silico tools available today, along with information about hardware and a list of the software packages commonly employed in CAMM. Of all the current trends in medicinal chemistry, CADD-based studies are the most common and rational approach. More importantly, the correct use of such tools is essential to obtain reliable results, and therefore the user, who could be anyone from an undergraduate student to a doctoral fellow, should be well versed in their application. Experience in solving the problems that arise while performing an analysis is also necessary.
We were greatly intrigued by the vast applications offered by this field and believe that researchers would benefit more from a book that provides sample exercises in each chapter to guide an end user in practising a CAMM exercise on any software package or online server. Users can re-perform the given exercises, which will help them understand the correct interpretation of results in the context of their own studies.




Such a book can facilitate the actual application of the multiple CAMM-based approaches employed in the process of drug design. A user of a CAMM-based tool does not usually have a set standard against which to compare the results obtained. This book provides users with standard protocols and results that can be used to validate their own output and build confidence in it. We also kept in mind that information on the general principles underlying the basic concepts of molecular modeling is imperative if the book is to become a significant piece of literature. We aimed at an approach that would make sense and appeal to today's research scholars; thus we incorporated a subsection in each chapter that summarizes the current state of knowledge for each tool and technique of this field. To avoid complexity in the discussions, we have provided graphical representations of the general protocol followed in each in silico technique for drug design. We have deliberately omitted detailed discussion of obscure theoretical principles and have focused on simple explanations and information on how to use computers and artificial intelligence in the process of drug design.

This book will also help readers understand the utility of freely available software tools for unravelling the complex process of identifying a suitable drug for a pathological condition and developing it. Each chapter discusses the basic protocols used in the process of lead identification, optimization, and finally prediction of the mechanism of action, and provides a set exercise that researchers can use to optimize and validate the tool they employ.
Additionally, this book provides a good number of exercises for UG students (B.Pharm, B.Tech (bioinformatics), BSc (bioinformatics), BSc (biotechnology), BSc (biochemistry)) and PG students (M.Pharm (all streams), MSc (bioinformatics)), valuable for training them in practical applications. Thus, this book may serve beginners in molecular modeling well, and may be followed by graduate students to gain basic knowledge of the tools and routine exercises of molecular modeling.

Acknowledgment

The authors would like to thank Dr. Bhawna Vyas, Research Associate, Department of Chemistry, Punjabi University, Patiala; Ms. Shalki Choudhary, Senior Research Fellow, DPSDR, Punjabi University, Patiala; and Ms. Himanshu Verma, Senior Research Fellow, DPSDR, Punjabi University, Patiala, for providing assistance and suggestions while compiling this book. The authors are also indebted to the supportive, and at the same time critical, faculty members of the Department of Pharmaceutical Sciences and Drug Research. The authors would also like to acknowledge the support and guidance of Samuel Young, Editorial Project Manager, Elsevier, and Rafael Teixeira, Acquisitions Editor, Cancer Research/Oncology, Medical Informatics/Bioinformatics, Systems Biology and Biostatistics, Elsevier, along with other members of the editorial team at Elsevier. Finally, the time spent on the preparation of this book was made available only with the forbearance of our families, friends, and research groups, and we thank all of them for their patience and understanding. A special mention to Prof. Mario Sechi, Department of Chemistry and Pharmacy, University of Sassari, Sassari, Italy, for his constant guidance and support and for providing an excellent environment during the execution of this work.



Fundamentals of molecular modeling

1.1 Molecular modeling

Molecular modeling describes the generation, representation and/or manipulation of the 3-D structures of chemical and biological molecules, along with the determination of physicochemical properties that can help interpret the structure-activity relationships (SAR) of biological molecules. Molecular modeling provides scientists with five major types of information:

a. The 3D structure of molecules
b. The chemical and physical characteristics of the molecules
c. Comparison of the structure of a molecule with other, different molecules
d. Visualization of complexes formed between different molecules/macromolecules
e. Prediction of how new related molecules might look

The field of bioinformatics emerged rapidly in the early 2000s to analyse the exponentially increasing data produced by the introduction of automated whole-genome and protein sequencing techniques [1]. It grew from the pioneering, laborious mapping and comparison of protein and gene sequences in molecular biology, through an intense phase that can largely be viewed as 'database mining' and the development of efficient computer-based algorithms, into a science of its own that has today reached a high level of maturity and sophistication. Bioinformatics tools are nowadays used with great success in structural biology, computational chemistry, genetics, molecular biology, the pharmaceutical industry, pharmacology and more. In this chapter, a brief outline of the basics of molecular modeling is given, focusing on the interfaces between medicinal chemistry, pharmacology, computational chemistry, informatics, artificial intelligence and machine learning. This includes molecular representations, computer graphics, molecular surfaces and underlying principles such as molecular mechanics, quantum mechanics and molecular dynamics [2]. The aim is to provide a brief introduction to a vast and rapidly growing field. Subsequent chapters present more specialized drug design tools and techniques that build upon the foundations given here.

1.2 Molecular representation

One of the most basic and usually ignored components of a molecular modeling study is the representation of the molecules. Since the beginning of molecular modeling studies




there have been several refinements in the methods used to represent molecules in in silico studies. To represent the 3D structure of a molecule and the electronic properties associated with it, certain coordinate systems are required. The following coordinate systems are generally used for molecular representation.

1.2.1 Cartesian coordinate

In the Cartesian coordinate system, two perpendicular lines are chosen in the plane and the coordinates of a point are assigned as its distances to those lines (Fig. 1.1A). Similarly, for the 3-D representation of molecules, three mutually perpendicular planes are chosen and the three coordinates of a point are assigned as its distances to each of the planes (Fig. 1.1B). Depending on the direction and order of the coordinate axes, the system may be right-handed or left-handed (Fig. 1.1C) [3]. This coordinate system is important for understanding the orientation of molecules in molecular space on a computer: in Fig. 1.1B and 1.1C, the orientations of the chair conformations of cyclohexane differ because their Cartesian coordinates differ.
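As a minimal illustration of how Cartesian coordinates are used computationally, the sketch below (plain Python, with hypothetical atom positions) computes the distance between two atoms directly from their 3-D coordinates:

```python
import math

def distance(a, b):
    """Euclidean distance between two points given in Cartesian coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Hypothetical placement: the two carbons of ethene, 1.31 Å apart along x
c1 = (0.0, 0.0, 0.0)
c2 = (1.31, 0.0, 0.0)
print(round(distance(c1, c2), 2))  # 1.31
```

The same function works in 2-D or 3-D, since it simply sums the squared differences over however many coordinates are supplied.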

Figure 1.1 Coordinate system for representation of a molecule: Cartesian (A) 2D, (B) & (C) 3D and (D) polar coordinate.


1.2.2 Polar coordinate

In this system, a point is chosen as the pole and a ray from this point is taken as the polar axis. For a given angle θ, there is a single line through the pole whose angle with the polar axis is θ. Then, for a given number r, there is a unique point on this line whose signed distance from the origin is r. For a given pair of coordinates (r, θ) there is a single point, but any point is represented by many pairs of coordinates. For example, (r, θ), (r, θ + 2π) and (−r, θ + π) are all polar coordinates for the same point (Fig. 1.1D) [1].
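The many-to-one nature of polar coordinates described above can be checked numerically; this small sketch converts each pair (r, θ) to Cartesian form and confirms that the three listed pairs name the same point:

```python
import math

def polar_to_cartesian(r, theta):
    """Convert plane polar coordinates (r, theta) to Cartesian (x, y)."""
    return (r * math.cos(theta), r * math.sin(theta))

# (r, θ), (r, θ + 2π) and (−r, θ + π) all describe the same point
p1 = polar_to_cartesian(2.0, math.pi / 3)
p2 = polar_to_cartesian(2.0, math.pi / 3 + 2 * math.pi)
p3 = polar_to_cartesian(-2.0, math.pi / 3 + math.pi)
for other in (p2, p3):
    assert all(abs(x - y) < 1e-9 for x, y in zip(p1, other))
```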

1.2.3 Internal coordinate

A more chemically intuitive way of writing the coordinates is to use the internal coordinates of a molecule (i.e. bond lengths, bond angles and torsion angles). Internal coordinates are usually written as a Z-matrix [1]. Here is an example of a Z-matrix for ethene (C2H4):

Atom number   Atom type   (ref)   Distance (Å)   (ref)   Bond angle (°)   (ref)   Torsion angle (°)
1             C
2             C           1       1.31
3             H           1       1.07           2       121.5
4             H           1       1.07           2       121.5            3       180.0
5             H           2       1.07           1       121.5            3       180.0
6             H           2       1.07           1       121.5            4       180.0

The first line of the Z-matrix defines atom number 1 (carbon). Atom 2 is also a carbon atom, and is at a distance of 1.31 Å from atom 1 (the approximate length of a carbon-carbon double bond). The third column defines the atom to which the distance in column 4 refers, i.e. atom 3 (a hydrogen) is 1.07 Å from atom 1 (the length of the C-H bond). Similarly, the atom numbers in columns 5 and 7 define which atoms are involved in the bond angle and torsion angle (values given in columns 6 and 8, respectively). So, for example, atom number 6 is a hydrogen. It is 1.07 Å from atom 2, the bond angle involves atoms 6-2-1, and the torsion angle is for atoms 6-2-1-4. All the torsion angles are 180°, showing that the molecule is planar. This system of coordinates is required to generate a unique conformation of a molecular system on the computer screen.
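As an illustrative sketch (not taken from any particular modeling package), the Z-matrix above can be converted to Cartesian coordinates using the standard internal-to-Cartesian placement found in many modeling codes (often called NeRF). The bond lengths and angles are those of the ethene table:

```python
import math

def sub(u, v): return [a - b for a, b in zip(u, v)]
def norm(u): return math.sqrt(sum(a * a for a in u))
def unit(u): n = norm(u); return [a / n for a in u]
def cross(u, v): return [u[1] * v[2] - u[2] * v[1],
                         u[2] * v[0] - u[0] * v[2],
                         u[0] * v[1] - u[1] * v[0]]

def place_atom(a, b, c, r, theta_deg, phi_deg):
    """Place a new atom at distance r from c, with bond angle theta (b-c-new)
    and dihedral phi (a-b-c-new); angles in degrees (NeRF-style placement)."""
    t, p = math.radians(theta_deg), math.radians(phi_deg)
    d2 = [-r * math.cos(t), r * math.cos(p) * math.sin(t), r * math.sin(p) * math.sin(t)]
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))
    m = cross(n, bc)
    # rotate d2 from the local frame (axes bc, m, n) into lab coordinates
    return [c[i] + bc[i] * d2[0] + m[i] * d2[1] + n[i] * d2[2] for i in range(3)]

# Ethene, following the Z-matrix table above (distances in Å, angles in degrees)
c1 = [0.0, 0.0, 0.0]                                # atom 1: C at the origin
c2 = [1.31, 0.0, 0.0]                               # atom 2: C, 1.31 Å along x
t = math.radians(121.5)
h3 = [1.07 * math.cos(t), 1.07 * math.sin(t), 0.0]  # atom 3: H, angle 3-1-2 = 121.5°
h4 = place_atom(h3, c2, c1, 1.07, 121.5, 180.0)     # atom 4: torsion 4-1-2-3 = 180°
h5 = place_atom(h3, c1, c2, 1.07, 121.5, 180.0)     # atom 5: torsion 5-2-1-3 = 180°
h6 = place_atom(h4, c1, c2, 1.07, 121.5, 180.0)     # atom 6: torsion 6-2-1-4 = 180°

# Torsions of 180° throughout should give a planar molecule (z = 0 everywhere)
assert all(abs(atom[2]) < 1e-9 for atom in (h3, h4, h5, h6))
```

Running the sketch reproduces the planarity that the Z-matrix encodes implicitly: every generated atom lies in the xy-plane.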

1.3 Computer graphics

Computer graphics displays are either vector or raster. Raster images, also known as bitmaps, are composed of individual pixels of color. Each color pixel contributes to the



overall image. Raster images might be compared to pointillist paintings, which are composed of a series of individually colored dots of paint. Each paint dot in a pointillist painting might represent a single pixel in a raster image. When viewed as an individual dot, it is just a color; but when viewed as a whole, the colored dots make up a vivid and detailed painting. The pixels in a raster image work in the same manner, which provides rich detail and pixel-by-pixel editing.

Unlike raster graphics, which are composed of colored pixels arranged to display an image, vector graphics are made up of paths, each with a mathematical formula (vector) that tells the path how it is shaped and what color it is bordered with or filled by. Since mathematical formulas dictate how the image is rendered, vector images retain their appearance regardless of size; they can be scaled infinitely. Vector images can be created and edited in programs such as Illustrator, CorelDRAW, and Inkscape [4].

On vector displays, the lines making up the image are traced on the face of the CRT. The lines are continuous strokes and appear very straight and smooth; however, only lines and dots can be drawn on vector systems. Filled areas, such as molecular surfaces, must be represented by many closely spaced lines or dots, which adds greatly to the complexity of the image. On raster displays, the CRT is repeatedly scanned horizontally, as on a television screen, and the image is made up of discrete pixels. Lines can appear jagged, depending on the resolution of the CRT being used. Because of the pixel-based method used in raster systems, filled areas are more readily drawn on these systems than on vector systems [5].
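The jagged appearance of lines on raster displays can be made concrete with a small sketch: Bresenham's classic line algorithm converts a mathematically defined (vector) line into the discrete pixels a raster display would light up.

```python
def rasterize_line(x0, y0, x1, y1):
    """Bresenham's algorithm: convert a vector line (two endpoints) into the
    list of discrete pixel coordinates a raster display would illuminate."""
    pixels = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return pixels

# A shallow line steps unevenly between rows, which is the visual "jaggedness"
print(rasterize_line(0, 0, 5, 2))
# [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]
```

A vector display, by contrast, would store only the two endpoints and trace the stroke continuously, which is why vector lines stay smooth at any scale.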

1.4 Molecular models

The simplest types of models are CPK models, Dreiding models and computer models, which provide convenient ways to represent molecules.

1.4.1 CPK models

CPK models are physical models in which a color-coded molecular model assembly kit is provided for representing organic molecular structures (Fig. 1.2) [6]. These models consist of two basic and complementary types of construction unit capable of being interlocked. The first basic construction units are color-coded plastic tubes, representing the bonds between adjacent atoms, which can be coupled to the second type of unit. The second basic construction units are coupling spheres, color coded according to the valency of the atoms to be joined, whose centers represent atom centers. These coupling spheres have radial arms substantially located on the surface of a sphere whose center is the center of the coupling unit. The units are made of plastic and are of two types: one adapted for planar-trigonal coupling of three tubes separated by angles of about 120°, and the other adapted for tetrahedral coupling


Figure 1.2 CPK model representation of quinazoline molecule.

of four tubes separated by angles of about 109.5°. The first and second construction units can be joined together and held immobile by friction, the radial arms and/or the cavities of the tubes being tapered, so that skeletal models of complex organic macromolecules may be assembled in which the distance between the centers of two directly connected coupling units represents the distance between the joined atoms [7]. These models give a good representation of the shape of a molecule and can be manipulated to produce its various conformations. However, they cannot be used to present the electronic properties of molecules, they cannot be superimposed upon one another to compare molecular conformation and shape, and their bond lengths and angles cannot be adjusted.

1.4.2 Dreiding models

Dreiding models are physical models that use thin metal or plastic rods to represent bonds [8]. Bond lengths and angles are fixed, although rotations around bonds are easy to perform. One can also demonstrate the ready conversion of one boat form into another, and then stop halfway between the two to preserve the twist form. Selection of appropriate balls and use of rubber tubing connectors to form double bonds and C3-C4 cycloalkanes permits construction of numerous interesting cis and trans olefins, optically active allenes, and small-ring compounds, in some of which optical isomerism is superimposed on geometrical isomerism. However, these models give a poor representation of molecular volume and cannot be used to show electronic properties. Depending on the complexity of the model, they could possibly be superimposed upon one another for comparison of molecular conformation (Fig. 1.3).



Figure 1.3 Dreiding model representation of quinazoline molecule.

Figure 1.4 Computer model representation of ligand protein complex.

1.4.3 Computer models

Computer models can be used to draw a virtually limitless variety of molecular representations, from stick figures to molecular surfaces (Fig. 1.4). Computer graphics models can also very readily be used to represent electronic properties of molecules. These models can be easily superimposed and accurately constructed using any bond lengths, angles and torsion angles. Elements are color coded and can easily be recognized on the basis of the color assigned to each particular element (Fig. 1.5).


Figure 1.5 Filled color coded computer model representation of ligand.

A disadvantage of computer models is that they are not physical, three-dimensional models; computer molecular modeling tools portray images in a way that merely seems three dimensional. Initially, models were drawn on cathode ray tubes (CRTs) using special-purpose computer hardware, and the images were limited to two dimensions. The third dimension was realized by rapidly displaying slightly different two-dimensional images, using time as a parameter to represent the third Cartesian dimension; this technique is referred to as real-time graphics. Another technique used to give graphics a three-dimensional look involves drawing the 'in front' parts of an image more brightly than the parts that are 'in back'. More modern computer graphics systems, such as the PS350 and the Silicon Graphics IRIS workstation, allowed the techniques of real-time graphics, stereographics and intensity depth cueing to be combined to produce a 3-D image with multiple colors.

1.5 Molecular surfaces

The molecular surface is a fundamental aspect of a structure, as it is through the complementarity of shape and chemistry of the surface that molecules interact with each other. The molecular surface is defined as 'the surface in contact with a probe sphere while the sphere rolls over the surface of the molecule'. More recently, the increasing power of raster graphics systems has allowed more complex images to be viewed interactively, and this has led to the development of many techniques for representing solid molecular surfaces (Fig. 1.6). In the modeling of macromolecules such as proteins, DNA and RNA, and of small molecules with biological significance, each constituent atom is considered as a simple sphere with its van der Waals radius. The visualization or simulation of these overlapping balls can be done by surface mesh generation techniques, where the shape quality of the mesh has a strong influence on simulation accuracy [9]. There are mainly three types of molecular


Chapter 1

Figure 1.6 Representation of different molecular surfaces.

surfaces that play an important role in the drug design process: the van der Waals surface (VWS), the solvent-accessible surface (SAS) and the solvent-excluded surface (SES).

1.5.1 Van der Waals surface (VWS)

The VWS is simply the union of all possible overlapping balls. The interactions of these surfaces are important in various chemical and biological processes, such as the formation of the tertiary structure of a biopolymer and electron tunneling in protein crystals. The probable role of weak VWS interactions in reaction dynamics is an issue of great concern [10]. Usually, for macromolecules such as proteins and nucleic acids, part of the VWS may be buried within the interior of the molecule. Van der Waals surface-bound molecules are held together by weak attractive forces, such as dispersive, electrostatic, charge-transfer and hydrogen-bond interactions between closed-shell atoms or molecules, and molecules bound to each other by these types of attractive force possess low dissociation energies. The understanding of both reactive and non-reactive dynamics in VWS complexes requires detailed information on potential energies [11]. The van der Waals radius of each sphere varies slightly with its covalent bonding environment, and these radii are needed for the evaluation of protein volume, interior packing, and also the packing at the protein-water interface [12].
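Because the VWS is simply a union of atomic spheres, testing whether a point lies inside it is straightforward. The sketch below uses a hypothetical two-atom fragment with commonly quoted van der Waals radii (the coordinates and radii are illustrative, not taken from any particular structure):

```python
import math

# Hypothetical fragment: a carbonyl C=O, with commonly quoted vdW radii in Å
ATOMS = [((0.00, 0.0, 0.0), 1.70),  # carbon
         ((1.22, 0.0, 0.0), 1.52)]  # oxygen

def inside_vdw(point, atoms=ATOMS):
    """A point lies inside the van der Waals surface if it falls within
    any atomic sphere (the surface is the union of overlapping balls)."""
    return any(math.dist(point, centre) <= radius for centre, radius in atoms)

assert inside_vdw((0.5, 0.0, 0.0))      # between the two nuclei
assert not inside_vdw((5.0, 0.0, 0.0))  # well outside the molecule
```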

1.5.2 Solvent accessible surface

The solvent-accessible surface can be recognized as the surface created by the center of the solvent, regarded as a rigid sphere, as it rolls around the van der Waals surface.

This term was initially coined by Lee and Richards in a study of protein interactions [13]. They were interested in analysing the interaction of proteins with solvent molecules to determine the hydrophobicity and folding of proteins. To obtain the molecular surface that a solvent can access, a probe sphere is rolled over the van der Waals surface, and the trace of the center of the probe sphere describes the SAS [14]. The SAS has become a common thread for many researchers, especially in the case of non-polar molecules, as the free energy of aqueous solvation is proportional to the SAS, which is in turn proportional to the number of solvent molecules in contact with the solute molecule. Thus, the SAS is a central quantity in several solvation models used in molecular mechanics (MM) [15]. This surface can be used to determine the amino acid environment energy, which depends on an accurate and rapid estimation of the SAS. It can also be used to compute the partition coefficient (logP), an important parameter extensively used in studying structure-activity relationships. Further, in molecular modeling studies, the interactions of a compound with a non-polar phase can be determined by using SAS information both in vitro and in vivo [16]. Constructive solid geometry operations can be applied for the representation of the SAS, and the implicit functions underlying the molecular surface can be defined through the steps displayed in Fig. 1.7: first the sign change of these functions, then the identification of atoms and the definition of the function used for the map evaluation of the molecular surface, and eventually the clustering of atoms. The function f_SAS(·) can be computed by adding the contributions of those atoms (C_i, R_i), i ∈ I, that belong to the sphere of center x and radius 2r, which is a constructive solid geometry operation [17].
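A common numerical route to the SAS area, distinct from the constructive-solid-geometry formulation above, is the Shrake-Rupley method: scatter test points on each atom's probe-expanded sphere and count those not buried inside any neighbour. A minimal sketch (hypothetical radii, water-sized probe of 1.4 Å):

```python
import math

def sphere_points(n):
    """Quasi-uniform points on a unit sphere (golden-spiral distribution)."""
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        pts.append((math.cos(golden * i) * r, math.sin(golden * i) * r, z))
    return pts

def sasa(atoms, probe=1.4, n_points=500):
    """Shrake-Rupley estimate of solvent-accessible surface area (Å²).
    atoms: list of ((x, y, z), vdw_radius)."""
    total = 0.0
    test = sphere_points(n_points)
    for i, (ci, ri) in enumerate(atoms):
        R = ri + probe  # radius of the probe-expanded sphere
        accessible = 0
        for p in test:
            q = (ci[0] + R * p[0], ci[1] + R * p[1], ci[2] + R * p[2])
            # a test point counts only if it is outside every other expanded sphere
            if all(math.dist(q, cj) >= rj + probe
                   for j, (cj, rj) in enumerate(atoms) if j != i):
                accessible += 1
        total += 4.0 * math.pi * R * R * accessible / n_points
    return total

# Sanity check: a lone atom's SASA is exactly the area of its expanded sphere
lone = [((0.0, 0.0, 0.0), 1.7)]
print(round(sasa(lone), 1))  # 4π(1.7 + 1.4)² ≈ 120.8
```

For multi-atom inputs the estimate converges with the number of test points; production codes use the same idea with neighbour lists for speed.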

1.5.3 Solvent excluded surface

The solvent-excluded surface (SES) is one of the most popular surface definitions in the fields of biophysics and molecular biology. It is divided into two parts, the contact surface and the reentrant surface. The contact surface is the part of the van der Waals surface of each atom that is accessible to a probe sphere of a given radius. The reentrant surface is the inward-facing part of the probe sphere while it is in contact with more than one atom [13]. Later, Connolly developed the mathematical representation of the SES for arbitrary biomolecules in terms of concave patches, saddle patches, and convex spherical patches [18]. Among these regions, the convex contact segments of the van der Waals surface are in direct contact with the solvent surrounding the system, while in the concave reentrant segments the solvent sphere is in contact with two or more atomic spheres of the structure. The Connolly representation of the SES is a standard tool in molecular modeling that allows quantitative and qualitative comparison of molecules [14]. The actual representation of solvent-excluded surfaces is displayed in Fig. 1.8.



Figure 1.7 Steps involved defining implicit functions underlying molecular surface.

1.5.4 Charged partial surface area (CPSA)

The physical and chemical properties of charged partial surfaces play an important role in the specificity and selectivity of ligand-protein interactions. CPSA descriptors were developed to extract information about molecular structure that helps in characterizing the intermolecular interactions for QSAR. Practically, they are useful for ascertaining toxicity and for describing local and global electrophilicity in non-covalent interactions. CPSA descriptors have been used to distinguish antagonists and agonists that bind to estrogen receptors, and also have utility in characterizing partial charges and conformational changes [19]. Charged partial surfaces, such as hydrophilic and hydrophobic surface areas, have utility in various phenomena related to protein adsorption. Usually, adsorption onto hydrophobic surfaces is more effective than onto hydrophilic surfaces. Not only adsorption, but also some protein exchange processes occur with more ease on hydrophobic surfaces than on hydrophilic ones. There is a strong correlation of


Figure 1.8 Representation of molecular/solvent-excluded surface.

target selectivity with the physical and chemical properties of these surfaces [20]. Modulation of the hydrophobic surface can be achieved by several factors, one of which is temperature. Temperature was identified as a factor inducing exposure of the hydrophobic surface through studies of its effect on the chaperone activity of α-crystallin [21].

1.6 Workstations

Usually, vector systems have been preferred by molecular modelers because of their speed and the high quality of their line drawing. Since these vector systems use special-purpose hardware, they have been more expensive than raster systems and are used as display devices separate from the host computer, which is used to store and modify the molecular coordinates. Nowadays, however, new computer graphics workstations have been introduced: raster systems comprising a computer with a full operating system and mass storage integrated with the graphics display.

1.6.1 GPU hardware

The general-purpose graphics processing unit (GPU) is a very fast computational resource that has become very popular, although performance is strongly dependent on the hardware. Many computers have GPU boards that are used mainly for graphics. To use the GPU for an application program, we must install CUDA, a computing platform and programming model for GPUs. Most GPU-MD programs adopt CUDA for


Chapter 1

GPU computations. The slot number of each GPU board in the computer must be explicitly indicated in CUDA. Every year, new GPU hardware appears on the market with updated CUDA versions, and a GPU program should be tuned for each GPU, since the performance of GPU programs depends on the balance between the number of GPU cores and the memory bandwidth. In contrast to CPUs, which are used for all application programs, the application of GPUs is quite limited: the GPU is used only when GPU programs are available. GPU computing is particularly suitable for running molecular dynamics simulation programs such as AMBER, Gromacs, NAMD and psygene-G/myPresto, some of which are freely available. However, one of the most serious problems for end users is how to set up a GPU machine for these MD programs. Another problem is that the system size for GPU computation must be larger than a minimum size determined by the program. Since most GPU programs adopt a space-decomposition method for parallel computing, the system must be decomposed into subsystems; this means that the MD of a small system (such as a single molecule) is not suitable for GPU computing.

1.7 Principles of molecular modeling

The modeling of molecules to understand molecular phenomena in chemistry, biochemistry, biophysics, molecular biology, drug discovery and design, pharmacogenomics, pharmacology, etc. is based on the calculation of the different kinds of energy associated with a molecular system. A molecular system is associated with three types of energy, i.e. potential energy, kinetic energy and quantum energy. The calculation and application of these energies depend upon the type of problem under consideration. Different fundamental principles are employed to calculate these three kinds of molecular energy: potential energy, kinetic energy and quantum energy can be calculated by applying the concepts of molecular mechanics (MM), molecular dynamics (MD) and quantum mechanics (QM), respectively. A brief account of these three concepts is given in the following sections.

1.7.1 Molecular mechanics

Molecular mechanics treats the molecule as a collection of atoms held together by springs. This assumption makes it possible to apply Newtonian mechanics (Hooke's law) to calculate the potential energy of a molecular system, by considering the atoms to be held together by elastic or harmonic forces. These forces can be described by potential energy functions of structural features (internal coordinates) such as bond lengths, bond angles, torsion angles and the non-bonded interactions of a molecule. Non-bonded atoms (more than two bonds apart) interact through van der Waals attraction, steric repulsion, and electrostatic attraction/

repulsion. Any deviation from the ideal values of the internal coordinates while we draw molecules on the computer screen is accompanied by strain and leads to an increase in potential energy. For example, the ideal bond angle of the sp3-hybridized carbon in methane is 109°28′; deviation from this value leads to angle strain and a subsequent increase in the potential energy of the methane molecule. Thus, every molecular system experiences different kinds of strain, viz. bond strain, angle strain, torsional strain and strain due to non-bonded interactions, which all contribute to the potential energy of a conformation of that molecular system. The combination of these potential energy functions is known as a 'force field', which can be defined as a set of rules for parameterizing the potential energy functions of molecules. Fig. 1.9 displays these different kinds of strain along with the formulas used to calculate the corresponding energy contributions to the potential

Figure 1.9 Components of the potential energy in a force field arising from the different kinds of strain produced in a molecule of four atoms.


Chapter 1

energy of a conformation in the force field. Thus the energy, E or Etotal, of a molecule in the force field arises from deviations from 'ideal' structural features and can be approximated by a sum of the strain energies contributed by these deviations. This can be expressed in the most general form as

Etotal = Es + Eb + E(ω) + Enb + …

where Es = energy of a bond being stretched or compressed from its natural bond length; Eb = energy of bending bond angles from their natural values; E(ω) = torsional energy due to twisting about bonds; Enb = energy of the non-bonded interactions; and Etotal = the difference in energy between the real molecule and a hypothetical molecule in which all structural values such as bond lengths, bond angles and dihedral angles are exactly at their ideal values.

The objective of molecular mechanics is to predict the energy associated with a given conformation of a molecule. However, molecular mechanics energies have no meaning as absolute quantities; only differences in energy between two or more conformations are meaningful. MM models predict the energy of a molecule as a function of its conformation. This allows predictions, firstly, of equilibrium geometries and transition states and, secondly, of relative energies between conformers or between different molecules. Different kinds of force fields have been developed over the years. Some include additional energy terms that describe other kinds of deformation, and some account for coupling between bending and stretching in adjacent bonds in order to improve the accuracy of the mechanical model [22].

Energy minimization

The energy of a system can be calculated using the molecular mechanics force field. Most often one is interested in determining minimum-energy structures. This can be done by finding the coordinates at which the first derivatives of the potential function equal zero. Numerous algorithms are available for this geometry optimization. The simplest and most straightforward of these is the method of steepest descent, in which we move down the gradient in a direction parallel to the net force.
Steepest descent leads directly to the nearest local minimum by following a path determined by moving from the previous value to a new value by some constant k times the direction in which the energy is decreasing. Another method is the conjugate gradient method; unlike steepest descent, it utilizes information from the previous gradients along with the current gradient to locate the minimum: conjugate gradient methods take the second search direction to be a linear combination of the current gradient and the previous one. Conjugate gradient methods are more efficient than steepest descent and require fewer energy evaluations and gradient calculations. They can also induce large changes in the coordinates while searching for a minimum, but their convergence characteristics are better than those of steepest descent. The next method, the Newton-Raphson procedure, is a powerful, convergent minimization procedure. In the Newton-Raphson algorithm, the second-derivative matrix must be available. The method is based on the assumption that the energy is quadratically dependent on the coordinates, i.e., that it behaves like a classical spring. If the energy function were truly quadratic, the increments x would lead directly to the minimum in one step; this is of course almost never the case for the potential surface of complex biomolecules. Newton-Raphson methods do not need to do linear interpolations like conjugate gradient, so energy evaluations can be sped up by a factor of about 2-3. A good compromise is the method of Fletcher and Powell. This algorithm combines the advantages of steepest descent with Newton-Raphson: it requires only the first derivatives, but builds up the second derivatives by successive approximations. The method is both quadratically convergent and efficient [23].

Classes of force field methods

Class II methods: include higher-order terms and cross terms; higher accuracy; used for small or medium-sized molecules. Examples: Allinger's MM1-MM4, EFF, and CFF.

Class I methods: for very large molecules (e.g., proteins); made cheaper by using only quadratic Taylor expansions and neglecting cross terms. Examples: AMBER, CHARMM, GROMOS, etc.
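The two minimization strategies just described can be contrasted on a toy quadratic "potential" (illustrative only; a real force-field surface is not quadratic). Steepest descent takes many small steps of size k against the gradient, while for a quadratic surface a single Newton-Raphson step using the second-derivative (Hessian) matrix lands exactly on the minimum.

```python
import numpy as np

# Toy potential E(x) = 0.5 * x^T A x - b^T x with known analytic minimum.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

def gradient(x):
    return A @ x - b          # first derivative of E

# Steepest descent: step against the gradient by a small constant k.
x = np.zeros(2)
k = 0.1
for _ in range(500):
    x = x - k * gradient(x)

# Newton-Raphson: one step using the Hessian (here exactly A) suffices
# because this toy surface really is quadratic.
x_newton = np.zeros(2) - np.linalg.solve(A, gradient(np.zeros(2)))

x_exact = np.linalg.solve(A, b)   # analytic minimum for comparison
print(np.allclose(x, x_exact, atol=1e-6), np.allclose(x_newton, x_exact))
```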
These methods are made even cheaper by using "united atoms."

List of some important common force fields

AMBER: Assisted Model Building with Energy Refinement. Designed for the simulation of peptides and nucleic acids [24].

CHARMm: Chemistry at HARvard Macromolecular Mechanics. Used to model macromolecular structures such as proteins [25].

CFF92: Consistent force field. Designed for accurate descriptions of both small and large molecules [26].

CVFF: Consistent valence force field. Parameterized to reproduce peptide and protein properties [26].

ESFF: Extensible systematic force field. A rule-based force field, currently under development; its goal is to provide the widest possible coverage [27].


MM2: Designed for small molecules; gives good results for a wide range of systems [28].

DREIDING: A simple generic force field for predicting the structure and dynamics of organic, biological and main-group inorganic molecules [29].

XED: Extended Electron Distribution. Designed to provide a more sophisticated description of the charge distribution around a molecule.

OPLS3: Optimized Potentials for Liquid Simulations 3. A recent force field designed to provide broad coverage of drug-like small molecules and proteins; it provides nearly two orders of magnitude more explicitly fit torsional parameters than other small-molecule force fields.

OPLS3e: An improved version of OPLS3 that provides more accurate and transferable torsional parameterization; its performance in predicting protein-ligand binding affinities is better than that of OPLS3.

ReaxFF: A reactive force field meant for molecular dynamics simulations of large-scale reactive chemical systems (thousands of atoms); it uses Coulomb and van der Waals potentials to describe non-bonded interactions between atoms.

Parmbsc1: A refined force field for DNA simulations, parameterized from high-level quantum mechanical data and tested on nearly 100 systems covering most of DNA structural space.

CHARMM36m: A recent, improved force field for folded and intrinsically disordered proteins.

AMOEBA: A polarizable atomic multipole force field with parameterization for DNA and RNA, including electrostatic, van der Waals, valence, and particularly torsional parameters.

List of computer programs that are predominantly used for molecular mechanics calculations

a. Abalone: Biomolecular simulations, protein folding.
b. ADF: ReaxFF, UFF, QM-MM with Amber and Tripos force fields, DFT and semiempirical methods, conformational analysis with RDKit; partly GPU-accelerated.
c. ADUN: QM-MM calculations with empirical valence bond (EVB); framework based (GNUstep-Cocoa); SCAAS for spherical boundary conditions.
d. Ascalaph Designer: Molecular building (DNA, proteins, hydrocarbons, nanotubes), molecular dynamics, GPU acceleration.
e. Automated Topology Builder (ATB): Automated molecular topology building service for small molecules (<99 atoms).
f. Avogadro: Molecule building and editing (peptides, small molecules, crystals), conformational analysis, 2D/3D conversion; extensible interfaces to other tools.
g. COSMOS: Hybrid QM-MM COSMOS-NMR force field with fast semi-empirical calculation of electrostatic and/or NMR properties; 3D graphical molecule builder and viewer.

h. CP2K: Atomistic and molecular simulations of solid-state, liquid and biological systems.
i. Culgi: Atomistic simulations and mesoscale methods, database integration, mapping techniques, interface to MOPAC/NWChem.
j. Discovery Studio: Comprehensive life-science modeling and simulation suite focused on optimizing the drug discovery process: small-molecule simulations, QM-MM, pharmacophore modeling, QSAR, protein-ligand docking, protein homology modeling, sequence analysis, protein-protein docking, antibody modeling, etc.
k. HOOMD-blue: General-purpose molecular dynamics highly optimized for GPUs; includes various pair potentials, Brownian dynamics, dissipative particle dynamics, rigid-body constraints, energy minimization, etc.
l. ICM: Powerful global optimizer in an arbitrary subset of internal variables, NOEs, docking (protein, ligand, peptide), EM density placement.
m. MacroModel: OPLS-AA, MMFF, GB/SA solvent model, conformational sampling, minimization, MD. Includes the Maestro GUI, which provides visualization, molecule building, calculation setup, job launch and monitoring, project-level organization of results, and access to a suite of other modeling programs.
n. MAPS: Building, visualization and analysis tools in one user interface, with access to multiple simulation engines.
o. Materials Studio: Environment that brings materials simulation technology to desktop computing, solving key problems in R&D processes.
p. MBN Studio: Standard and reactive CHARMM force fields; molecular modeler (carbon nanomaterials, biomolecules, nanocrystals); explicit library of examples.
q. MedeA: Combines leading experimental databases and major computational programs such as the Vienna Ab initio Simulation Package (VASP), LAMMPS and GIBBS with sophisticated materials property prediction, analysis and visualization.
r. MCCCS Towhee: Originally designed to predict fluid-phase equilibria.
s. MDynaMix: Parallel MD; free open source (GNU GPL).
t. MolMeccano: Semi-automatic force field parameterizer.
u. Orac: Molecular dynamics simulation program to explore free-energy surfaces in biomolecular systems at the atomic level.
v. NAB: Generates models for unusual DNA and RNA.
w. NWChem: High-performance computational chemistry software; includes quantum mechanics, molecular dynamics and combined QM-MM methods.
x. StruMM3D (STR3DI32): Sophisticated 3D molecule builder and viewer, advanced structural analytical algorithms, full-featured molecular modeling and quantitation of stereo-electronic effects, docking, and handling of complexes.



y. Scigress: MM, DFT, semiempirical methods, parallel MD, conformational analysis, linear-scaling SCF, protein-ligand docking, batch processing, virtual screening, automated builders (molecular dynamics, proteins, crystals).
z. Spartan: Small-molecule (<2000 amu) MM and QM tools to determine conformation, structure, properties, spectra, reactivity, and selectivity.
aa. TeraChem: High-performance GPU-accelerated ab initio molecular dynamics and TD-DFT software package for very large molecular or even nanoscale systems. Runs on NVIDIA GPUs under 64-bit Linux; has heavily optimized CUDA code.
bb. UCSF Chimera: Visually appealing viewer, amino acid rotamers and other building tools; includes Antechamber and MMTK; Ambertools plugins in development.
cc. VEGA ZZ: 3D viewer, multiple file-format support, 2D and 3D editor, surface calculation, conformational analysis, MOPAC and NAMD interfaces, MD trajectory analysis, molecular docking, virtual screening, database engine, parallel design, OpenCL acceleration, etc.
dd. VLifeMDS: Complete molecular modeling software: QSAR, combinatorial library generation, pharmacophore, cheminformatics, docking, etc.

Limitations of molecular mechanics

Although the concept of molecular mechanics is widely accepted for a variety of molecular modeling procedures in drug design, it has some important limitations. For example:

• Molecular mechanics cannot be used to study processes that involve cleavage of covalent bonds, such as chemical reactions, changes in protonation state, etc.
• It cannot be used to study properties that depend on the electron distribution: NMR shielding constants, spectroscopic data, etc.
• There are problems with metal atoms, which can form donor-acceptor bonds as well as vdW/electrostatic non-bonding interactions.
• Stacking interactions are usually not well described by force fields.
• The method has limited precision, significantly lower than that of QM calculations.
• Empirical parameters must be obtained before the method can be used, so it requires parameterization.
• Some force fields can overcome some of these problems, but usually only for specific types of molecules.

1.7.2 Molecular dynamics

Molecular dynamics, developed in the 1970s, consists of the numerical, step-by-step solution of the classical equations of motion of molecules in a biological system [30]. The idea of molecular dynamics is to mimic what atoms do in real life, assuming a given potential energy function. The energy function allows us to calculate the force experienced by any atom for given positions of the other atoms, and Newton's laws tell us how those forces will affect the motions of the atoms [31]. Motion is inherent to all chemical processes. Simple vibrations, like bond stretching and angle bending, give rise to IR spectra. Chemical reactions, ligand-receptor binding and other complex molecular or cellular processes are associated with many kinds of intra- and intermolecular motion, including conformational transitions and local vibrations, which are the usual subjects of molecular dynamics studies. Molecular dynamics simulation is concerned with searching for stable conformations of a molecule, or of a complex of molecules, under the influence of the external environment. Molecular dynamics alters the intramolecular degrees of freedom in a step-wise fashion, analogous to energy minimization; however, whereas the individual steps in energy minimization are merely directed at establishing a downhill direction to a minimum, the steps in molecular dynamics meaningfully represent the changes in atomic position over time (i.e., velocity). Molecular dynamics is a deterministic process based on the simulation (production of a computer model) of molecular motion. A simulation starts by giving each atom in the molecule some kinetic energy. This makes the molecule move around, and it is possible to follow this motion by solving Newton's equations of motion for each atom and by incrementing the position and velocity of each atom using a small time increment (time step). The basic algorithm of a molecular dynamics simulation involves dividing the total simulation time (e.g., 100 ns = 100 × 10⁻⁹ s) into discrete time steps (usually femtoseconds; 1 fs = 10⁻¹⁵ s), giving a total of 10⁸ iterations (100 × 10⁻⁹/10⁻¹⁵ = 10⁸).
The equation of Newton's second law of motion, as displayed in Fig. 1.10, is used in the molecular dynamics formalism to simulate atomic motion. Knowledge of the atomic forces and masses can be used to solve this equation for the positions of each atom along a series of time steps (1 fs each) over the total simulation period (e.g., 100 ns). The force on an atom can be calculated from the change in energy between its current position and its position a small distance away; this can be recognized as the derivative of the energy with respect to the change in the atom's position. The energies can be calculated using either molecular mechanics or quantum mechanics methods. Molecular mechanics energies are limited to applications that do not involve drastic changes in electronic structure, such as bond making/breaking, whereas quantum mechanical energies can be used to study dynamic processes involving chemical changes. Integration of the laws of motion generates successive configurations of the system. The result is a trajectory (a series of snapshots of the structural changes over time); the trajectory specifies how the position and velocity of each particle in the system vary with time.

Figure 1.10 Molecular dynamics formalism.

The time step is chosen to be sufficiently small that the simulation moves atoms in small enough increments that the positions of the surrounding atoms do not change significantly per incremental move. In practice, trajectories are not obtained directly from Newton's equation, because no analytical solution is available. As described in Fig. 1.10, first the atomic accelerations are computed from the forces and masses; the velocities are next calculated from the accelerations; and lastly, the positions are calculated from the velocities. A trajectory between two states can be subdivided into a series of substates separated by a small time step, Δt (e.g., 1 femtosecond). The initial atomic positions at time t are used to predict the atomic positions at time t + Δt; the positions at t + Δt are used to predict the positions at t + 2Δt, and so on. The "leapfrog" method is a common numerical approach to calculating trajectories based on Newton's equation; the method derives its name from the fact that the velocity and position information successively alternate at half time-step intervals. Molecular dynamics calculations can be performed using AMBER, CHARMm, CHARMM/GAMESS, Discover, QUANTA/CHARMm and SYBYL.

Applications of MD

• MD simulations are used to determine where a ligand molecule binds to its receptor, and how it changes the binding strength of molecules that bind elsewhere (in part by changing the protein's structure). This information is then used to alter the molecule so that it has a different effect.
• MD simulation can be used to describe many kinds of events involved in drug-receptor interaction, including solvation, the conformational changes required for initial complex formation, and any conformational and covalent rearrangements that may occur subsequent to binding.
• It is a powerful method for generating the thermally accessible molecular conformations that are needed in calculations of entropies, enthalpies and other thermodynamic properties.
• MD calculations can be used to predict how changes in the chemical structure of a drug change the equilibrium constant for binding to the receptor, if high-resolution structures of the drug-receptor complex are available (docking analysis) [32].
• Simulations can be performed in which a receptor protein transits spontaneously from its active structure to its inactive structure. This type of simulation study can be used to describe the mechanism by which drugs binding to one end of the receptor cause the other end of the receptor to change shape (activate) [33,34].
• MD simulation can also be used to understand the process of protein folding, for example, in what order the secondary structure elements form. Note, however, that MD is generally not the best way to predict the folded structure [35].


List of computer programs that are predominantly used for molecular dynamics calculations

a. ACEMD: Molecular dynamics with CHARMM and Amber force fields; runs on NVIDIA GPUs, highly optimized with CUDA.
b. Desmond: High-performance MD; has a comprehensive GUI for building, visualizing and reviewing results, and for calculation setup and launch.
c. Energy Calculation and Dynamics (ENCAD): The first MD simulation software, by Michael Levitt.
d. FoldX: Energy calculations, protein design.
e. GROMACS: High-performance MD.
f. GULP: Molecular dynamics and lattice optimization.
g. NAMD + VMD: Fast, parallel MD, CUDA.
h. YASARA: Molecular graphics, modeling, simulation.

1.7.3 Quantum mechanics

MM-based conformational energy profiling of a molecular system does not consider the energy of the electrons, whereas quantum mechanics provides detailed insight into the electronic nature of molecular structure and allows one to analyze phenomena not yet parameterized for molecular mechanics. Therefore QM methods can be used to calculate those electronic properties which are involved in the physical and chemical reactions of drugs with their environment [36-38]. Additionally, the calculation of conformational energy profiles and intermolecular interactions in a variety of contexts is best done using quantum mechanical methods. The electronic energy in QM can be calculated by solving Schrödinger's equation (Eq. 1.2), a second-order differential equation in the wave function ψ of the electron wave (electron density) with respect to the space coordinates (x, y and z):

HΨ = EΨ
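For intuition, HΨ = EΨ can be solved numerically for the simplest textbook system, a particle in a one-dimensional box, by discretizing the Hamiltonian as a finite-difference matrix and diagonalizing it. This is only a minimal sketch in atomic units; it is not how molecular QM codes work internally, but it shows the Schrödinger equation as an eigenvalue problem.

```python
import numpy as np

# Solve H psi = E psi for a particle in a 1-D box of length L = 1
# (atomic units: hbar = m = 1). The kinetic operator -1/2 d^2/dx^2
# is discretized with the standard three-point finite difference.
n = 500                          # interior grid points
L = 1.0
dx = L / (n + 1)

# Tridiagonal Hamiltonian matrix from the second-difference stencil.
main = np.full(n, 1.0 / dx**2)
off = np.full(n - 1, -0.5 / dx**2)
H = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

energies = np.linalg.eigvalsh(H)[:3]      # three lowest eigenvalues

# Analytic eigenvalues for comparison: E_k = k^2 * pi^2 / 2
exact = np.array([k**2 * np.pi**2 / 2 for k in (1, 2, 3)])
print(np.round(energies, 3), np.round(exact, 3))
```

The numerical eigenvalues agree with the analytic ones to better than 0.1% on this grid, illustrating the point made in the text that solving HΨ = EΨ yields the energies of a system.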


This equation for a given molecular system can be solved by two types of QM method, ab initio and semiempirical. Schrödinger's equation, solved by these methods, addresses the following questions:

• Where are the electrons and nuclei of a molecule in space? This leads to the determination of the configuration, conformation, size, shape, etc. of the molecule.
• Under a given set of conditions, what are their energies? This provides important thermodynamic properties, i.e., heat of formation, conformational stability, chemical reactivity, spectral properties, etc.

Ab initio methods involve no empirical approximations. In most ab initio methods all electrons are explicitly included, and the methods have the advantage of not requiring any parameterization; they can therefore be used for all types of system. They are limited to tens of atoms and are best performed using a supercomputer. They can be applied to organics, organometallics, and molecular fragments (e.g., the catalytic components of an enzyme), considering a vacuum or implicit solvent environment, and can also be used to study ground, transition, and excited states (with certain methods). Specific software implementations of ab initio methods include GAMESS and GAUSSIAN [39-41].

Semiempirical methods involve the introduction of some approximations. In a semiempirical method, only the valence electrons are explicitly included; some integrals are neglected and others are approximated by parameters derived from experiment. The selection of the most appropriate method depends on the size of the molecule and the type of molecular property (e.g., conformation, electron density, electrostatic potential, frontier orbitals, etc.) to be derived. These methods are limited to hundreds of atoms. They can be applied to organics, organometallics, and small oligomers (peptides, nucleotides and saccharides), and can be used to study ground, transition, and excited states (with certain methods). AMPAC and MOPAC are general-purpose Quantum Chemical Program Exchange (QCPE) semiempirical molecular orbital packages for the study of solid state, molecular structures and reactions. These packages include the CNDO, INDO, PSILO, MINDO, AM1, and PM3 Hamiltonians [42,43].

EXERCISES

Exercise 1: Draw the structures of various conformations of the following molecule and calculate the conformational energy profile of these conformers using an advanced force field.
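A conformational energy profile of the kind asked for in Exercise 1 can also be sketched programmatically by scanning a torsion angle over a rotatable bond. The three-term cosine series below is the common functional form for torsional potentials; the coefficients V1-V3 are invented placeholders, not parameters of any real force field.

```python
import math

# Torsional energy profile E(omega) scanned over a rotatable bond.
V1, V2, V3 = 1.5, -0.5, 1.0   # kcal/mol (hypothetical coefficients)

def torsion_energy(omega_deg):
    w = math.radians(omega_deg)
    return (0.5 * V1 * (1 + math.cos(w))
            + 0.5 * V2 * (1 - math.cos(2 * w))
            + 0.5 * V3 * (1 + math.cos(3 * w)))

# Scan the dihedral in 30-degree increments over a full rotation.
profile = {w: round(torsion_energy(w), 3) for w in range(0, 361, 30)}

# The lowest-energy points of the scan are the preferred conformers.
minima = [w for w, e in profile.items() if e == min(profile.values())]
print(profile[0], profile[180], minima)
```

With these toy coefficients the anti arrangement (180°) is the global minimum and the eclipsed one (0°) the maximum; plotting `profile` gives the conformational energy profile.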

Exercise 2: Draw the structures of, and perform molecular mechanics based energy minimization on, a set of heterocycles employing the MM2 force field.

Note: In this exercise we are going to perform molecular mechanics based energy minimization of a set of small heterocyclic molecules. The set of ligands chosen for this exercise was randomly selected from the literature and belongs to the triazole scaffold.



Requirements:
1. Operating system: Windows (7, 8 and/or 10)
2. Software to use: ChemDraw [44,45] [Note: Please check the version of your operating system and download accordingly]

Step-by-step protocol:
1. Sketching and preparing the set of ligands. Several free tools are available to sketch and prepare the ligands; in the current exercise the 3D structures of the ligands are developed using ChemDraw.
2. Energy minimization via the MM2 force field:
   • Open ChemBio3D Ultra.
   • Click on File; Open; browse to the ligands; select a ligand; Open.
   • Click on Calculations; click MM2; click Minimize Energy.
   • Select the job type Minimize Energy; set the Minimum RMS Gradient to 0.1; click Run.
3. Results. For each ligand, the different energy components are displayed on the screen. A table of the obtained results can be prepared as in Table 1.1.

Table 1.1: List of all the ligands along with the different energy components calculated via the MM2 force field.

Etotal = (Stretch + Bend + Stretch-Bend + Torsion + Non-1,4 VDW + 1,4 VDW + Dipole/Dipole)

1. (1.7761 + 31.1204 - 0.0314 - 4.8523 + 2.5787 + 15.4713 - 1.8623) = 44.2004 kcal/mol
2. (1.8049 + 30.9064 - 0.0138 - 4.2874 + 2.2612 + 16.1936 - 1.4461) = 45.4189 kcal/mol
3. (1.8574 + 30.9732 + 0.0104 - 3.4894 + 1.6274 + 16.1952 - 1.8591) = 45.3153 kcal/mol
4. (2.6347 + 43.7022 - 0.0055 - 3.5430 + 3.5695 + 18.4370 - 2.6659) = 62.1289 kcal/mol
5. (2.4217 + 34.0380 + 0.2621 - 6.8284 + 1.7583 + 18.8462 + 2.1342) = 52.6321 kcal/mol
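The component sums in Table 1.1 can be double-checked with a few lines of Python. The numbers below are simply the table's displayed values re-entered; because the displayed components are themselves rounded, the recomputed totals agree with the tabulated Etotal values to within about 0.0002 kcal/mol.

```python
# Re-adding the MM2 energy components from Table 1.1 as a sanity check.
# Column order: Stretch, Bend, Stretch-Bend, Torsion,
#               Non-1,4 VDW, 1,4 VDW, Dipole/Dipole   (kcal/mol)
components = [
    [1.7761, 31.1204, -0.0314, -4.8523, 2.5787, 15.4713, -1.8623],
    [1.8049, 30.9064, -0.0138, -4.2874, 2.2612, 16.1936, -1.4461],
    [1.8574, 30.9732,  0.0104, -3.4894, 1.6274, 16.1952, -1.8591],
    [2.6347, 43.7022, -0.0055, -3.5430, 3.5695, 18.4370, -2.6659],
    [2.4217, 34.0380,  0.2621, -6.8284, 1.7583, 18.8462,  2.1342],
]
totals = [round(sum(row), 4) for row in components]
print(totals)  # each entry is that ligand's Etotal in kcal/mol
```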

References

[1] A.R. Leach, Molecular Modelling: Principles and Applications, Pearson Education, 2001.
[2] K.M. Merz Jr., D. Ringe, C.H. Reynolds, Drug Design: Structure- and Ligand-Based Approaches, Cambridge University Press, 2010.
[3] D. Kihara, L. Sael, R. Chikhi, J. Esquivel-Rodriguez, Molecular surface representation using 3D Zernike descriptors for protein shape comparison and docking, Curr. Protein Peptide Sci. 12 (2011) 520-530.
[4] S.J. Hollinger, System for generation of a composite raster-vector image, Google Patents, 1998.
[5] D.W. Higgins, D.M. Scott, System and method for synchronizing raster and vector map images, Google Patents, 2007.
[6] W.L. Koltun, Precision space-filling atomic models, Biopolymers: Original Res. Biomol. 3 (1965) 665-679.
[7] F.H. Clarke Jr., Organic molecular model assembly, Google Patents, 1976.
[8] L.F. Fieser, Plastic Dreiding models, J. Chem. Educ. 40 (1963) 457.
[9] M. Chen, B. Lu, TMSmesh: a robust method for molecular surface mesh generation using a trace technique, J. Chem. Theory Comput. 7 (2010) 203-212.
[10] Z. Shen, H. Ma, C. Zhang, M. Fu, Y. Wu, W. Bian, et al., Dynamical importance of van der Waals saddle and excited potential surface in C(1D) + D2 complex-forming reaction, Nat. Commun. 8 (2017) 14094.
[11] M.J. Ondrechen, Z. Berkovitch-Yellin, J. Jortner, Model calculations of potential surfaces of van der Waals complexes containing large aromatic molecules, J. Am. Chem. Soc. 103 (1981) 6586-6592.
[12] A.J. Li, R. Nussinov, A set of van der Waals and coulombic radii of protein atoms for molecular and solvent-accessible surface calculation, packing evaluation, and docking, Proteins: Struct., Funct., Bioinf. 32 (1998) 111-127.
[13] J.-L. Pascual-Ahuir, E. Silla, I. Tunon, GEPOL: an improved description of molecular surfaces. III. A new algorithm for the computation of a solvent-excluding surface, J. Comput. Chem. 15 (1994) 1127-1138.
[14] J. Gasteiger, T. Engel, Chemoinformatics: A Textbook, John Wiley & Sons, 2006.
[15] J. Weiser, A.A. Weiser, P.S. Shenkin, W.C. Still, Neighbor-list reduction: optimization for computation of molecular van der Waals and solvent-accessible surface areas, J. Comput. Chem. 19 (1998) 797-808.
[16] W. Dunn III, M. Koehler, S. Grigoras, The role of solvent-accessible surface area in determining partition coefficients, J. Med. Chem. 30 (1987) 1121-1126.
[17] W. Rocchia, M. Spagnuolo, Computational Electrostatics for Biological Applications: Geometric and Numerical Approaches to the Description of Electrostatic Interaction Between Macromolecules, Springer, 2014.
[18] B. Liu, B. Wang, R. Zhao, Y. Tong, G.W. Wei, ESES: software for Eulerian solvent excluded surface, J. Comput. Chem. 38 (2017) 446-466.
[19] D. Stanton, S. Dimitrov, V. Grancharov, O. Mekenyan, Charged partial surface area (CPSA) descriptors QSAR applications, SAR QSAR Environ. Res. 13 (2002) 341-351.
[20] P. Ducheyne, Comprehensive Biomaterials, Elsevier, 2015.
[21] K.P. Das, W.K. Surewicz, Temperature-induced exposure of hydrophobic surfaces and its effect on the chaperone activity of α-crystallin, FEBS Lett. 369 (1995) 321-325.
[22] K. Vanommeslaeghe, O. Guvench, Molecular mechanics, Curr. Pharm. Des. 20 (2014) 3281-3292.
[23] R. Fletcher, M.J. Powell, A rapidly convergent descent method for minimization, Comput. J. 6 (1963) 163-168.
[24] J. Wang, R.M. Wolf, J.W. Caldwell, P.A. Kollman, D.A. Case, Development and testing of a general amber force field, J. Comput. Chem. 25 (2004) 1157-1174.
[25] K. Vanommeslaeghe, E. Hatcher, C. Acharya, S. Kundu, S. Zhong, J. Shim, et al., CHARMM general force field: a force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields, J. Comput. Chem. 31 (2010) 671-690.
[26] J.R. Maple, M.J. Hwang, T.P. Stockfisch, U. Dinur, M. Waldman, C.S. Ewig, et al., Derivation of class II force fields. I. Methodology and quantum force field for the alkyl functional group and alkane molecules, J. Comput. Chem. 15 (1994) 162-182.
[27] S. Shi, L. Yan, Y. Yang, J. Fisher-Shaulsky, T. Thacher, An extensible and systematic force field, ESFF, for molecular modeling of organic, inorganic, and organometallic systems, J. Comput. Chem. 24 (2003) 1059-1076.
[28] N.L. Allinger, Y.H. Yuh, J.H. Lii, Molecular mechanics. The MM3 force field for hydrocarbons. 1, J. Am. Chem. Soc. 111 (1989) 8551-8566.
[29] S.L. Mayo, B.D. Olafson, W.A. Goddard, DREIDING: a generic force field for molecular simulations, J. Phys. Chem. 94 (1990) 8897-8909.
[30] J.A. McCammon, B.R. Gelin, M. Karplus, Dynamics of folded proteins, Nature 267 (1977) 585.
[31] E. Hairer, C. Lubich, G. Wanner, Geometric numerical integration illustrated by the Störmer-Verlet method, Acta Numerica 12 (2003) 399-450.
[32] R.O. Dror, H.F. Green, C. Valant, D.W. Borhani, J.R. Valcourt, A.C. Pan, et al., Structural basis for modulation of a G-protein-coupled receptor by allosteric drugs, Nature 503 (2013) 295.
[33] M.P. Bokoch, Y. Zou, S.G. Rasmussen, C.W. Liu, R. Nygaard, D.M. Rosenbaum, et al., Ligand-specific regulation of the extracellular surface of a G-protein-coupled receptor, Nature 463 (2010) 108.
[34] R.O. Dror, A.C. Pan, D.H. Arlow, D.W. Borhani, P. Maragakis, Y. Shan, et al., Pathway and mechanism of drug binding to G-protein-coupled receptors, Proc. Natl. Acad. Sci. 108 (2011) 13118-13123.
[35] K. Lindorff-Larsen, S. Piana, R.O. Dror, D.E. Shaw, How fast-folding proteins fold, Science 334 (2011) 517-520.
[36] P. Atkins, J. De Paula, J. Keeler, Atkins' Physical Chemistry, Oxford University Press, 2018.
[37] D.J. Tannor, Introduction to Quantum Mechanics: A Time-Dependent Perspective, University Science Books, 2007.
[38] O.A. Arodola, M.E. Soliman, Quantum mechanics implementation in drug-design workflows: does it really help? Drug Des., Dev. Ther. 11 (2017) 2551.
[39] F. Jensen, Introduction to Computational Chemistry, John Wiley & Sons, 2017.
[40] J.P. Perdew, A. Ruzsinszky, Fourteen easy lessons in density functional theory, Int. J. Quantum Chem. 110 (2010) 2801-2807.
[41] K. Burke, Perspective on density functional theory, J. Chem. Phys. 136 (2012) 150901.
[42] W.J. Hehre, A Guide to Molecular Mechanics and Quantum Chemical Calculations, Wavefunction, Irvine, CA, 2003.
[43] M.J. Dewar, E.G. Zoebisch, E.F. Healy, J.J. Stewart, Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model, J. Am. Chem. Soc. 107 (1985) 3902-3909.
[44] K.R. Cousins, Computer Review of ChemDraw Ultra 12.0, ACS Publications, 2011.
[45] Z. Li, H. Wan, Y. Shi, P. Ouyang, Personal experience with four kinds of chemical structure drawing software: review on ChemDraw, ChemWindow, ISIS/Draw, and ChemSketch, J. Chem. Inf. Comput. Sci. 44 (2004) 1886-1890.


QSAR: Descriptor calculations, model generation, validation and their application

2.1 Introduction

The biological activity of any bioactive molecule (drug or natural substrate) depends on its 3D structure and associated physicochemical properties. Variation of the physicochemical properties through structural modification may lead to variation in the biological activity. Fundamentally, these variations in physicochemical properties affect the interactions of bioactive molecules with their biological counterparts in the molecular recognition process, in which the two partners specifically recognize each other through intermolecular forces, i.e., hydrophobic, polar, electrostatic and steric [1]. In quantitative structure-activity relationship (QSAR) analysis, these intermolecular forces are quantified in terms of descriptors and then correlated with the quantified biological activity by mathematical procedures such as regression analysis. Thus QSAR analyses derive models which describe the structural dependence of biological activity either by physicochemical parameters (Hansch analysis), by indicator variables encoding different structural features (Free-Wilson analysis), or by three-dimensional molecular property profiles of the compounds (steric and electrostatic fields in 3D space in comparative molecular field analysis, CoMFA, and pharmacophoric features in pharmacophore mapping) [2]. Drugs, which exert their biological effects by interacting with a specific target such as an enzyme, a receptor, an ion channel, a nucleic acid or any other biological macromolecule, must have a 3D structure which, in the arrangement of its functional groups and in its surface properties, is more or less complementary to a binding site on the target [3]. Thus it is broadly hypothesized that the better the steric fit and the complementarity of the surface properties of a drug to its binding site, the higher its affinity will be, and the higher its biological activity may be [4].
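A Hansch-type QSAR model is, at its core, a multiple linear regression of activity on physicochemical descriptors. A minimal sketch with a made-up dataset (the logP and sigma values and the activities below are invented for illustration, not taken from any real compound series):

```python
import numpy as np

# Hansch-style multiple linear regression:
#   log(1/C) = a*logP + b*sigma + c
# Descriptor values and activities below are invented toy data.
logP = np.array([1.2, 2.0, 2.8, 3.5, 4.1])
sigma = np.array([0.0, 0.2, -0.1, 0.3, 0.1])
activity = 0.9 * logP + 1.5 * sigma + 2.0   # noise-free toy "measurements"

X = np.column_stack([logP, sigma, np.ones_like(logP)])  # design matrix
coeffs, *_ = np.linalg.lstsq(X, activity, rcond=None)
a, b, c = coeffs

# Coefficient of determination, r^2 (trivially ~1 for noise-free data).
pred = X @ coeffs
ss_res = np.sum((activity - pred) ** 2)
ss_tot = np.sum((activity - activity.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(round(a, 3), round(b, 3), round(c, 3), round(r2, 3))
```

With real data the fitted coefficients indicate how strongly each descriptor (here hydrophobicity and an electronic parameter) drives activity, and r² is one of the statistics used to validate the model.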
A complication arises from the functionalities of the biological macromolecules typically involved in ligand-protein interactions: certain structural features of the ligand determine whether a compound is a substrate, an inhibitor, a competitive receptor antagonist, an allosteric receptor antagonist, a functional receptor antagonist, a receptor agonist, or an allosteric effector molecule [5].




Although the fit of the 3D structure and the complementarity of the surface properties of a drug to its binding site are vital for its biological activity, another equally important aspect is the ability of the drug to reach this binding site [6]. Even in simple in vitro systems, such as enzymatic assays, the surrounding water molecules compete to form hydrogen bonds with the binding site and with the functional groups of the ligand [7]. The balance of hydrogen bonds in solution and in the bound state alters the affinity. In more complex biological systems, such as cells, isolated organs, or whole animals, a certain range of lipophilicity enables the drug to make its random way from the site of administration to the site of action, i.e. to cross several lipophilic and hydrophilic barriers, lipid membranes as well as aqueous phases [8]. In the case of nonspecific biological activities caused by membrane perturbation, only the distribution of the drug and its local concentration in a certain membrane compartment are responsible for its biological activity. While the affinity of a ligand for its binding site results from the sum of all hydrophobic, polar, electrostatic, and steric interactions, the influence of lipophilicity and ionization on the distribution of a drug in a biological system is much more complex [4]. As long as the biological system is kept constant, the interaction of two different drugs with the binding site, as well as their distribution in the system, depends only on the chemical structures of the compounds [9]. If these structures are closely related, e.g. having a chlorine atom instead of a hydrogen atom in a certain position, the differences in their physicochemical properties, and thus the differences in the interaction forces, can easily be described in a quantitative manner. The corresponding difference in biological activities should then be directly related to the differences in these properties.
This is indeed the case, and all quantitative models of structure-activity relationships are based on the assumption of a more or less strict additivity of group contributions to biological activity values [10]. In many cases nonlinear models are needed to describe, in addition to binding and intrinsic activity, the dependence of drug transport and distribution on lipophilicity and ionization. While the classical models of quantitative structure-activity analysis do not consider the 3D arrangement of functional groups, some recent approaches deal with this problem and describe biological activities in terms of favorable and unfavorable interaction spheres, derived from the hydrophobic, electrostatic, and steric interaction fields of the ligands [11]. The methods of quantitative structure-activity relationships developed during the past 30 years are nowadays widely applied to describe the relationships between the chemical structures of molecules and their biological activities. Many attempts have been made to understand structure-activity relationships in physicochemical terms (or in terms of structural features, using indicator variables for individual substituents and groups) and to design new drugs on a more rational basis [12]. Interestingly, QSAR analyses are often retrospective studies, performed to determine whether the investigated structures followed a rational design or not. Only after performing syntheses

and biological testing can a quantitative relationship be derived. Often the optimization of a lead compound is accompanied, step by step, by QSAR analyses. Whether QSAR really helps to find the optimum hit within a series of biologically active molecules cannot be decided in general [13]. Obviously, the QSAR results depend on the validity of the underlying hypotheses, on the complexity of the test model and on the precision of the biological data. For new compounds within a congeneric series, the quality of prediction of the biological activity values is related to the spanned parameter space and to the distance of the physicochemical properties of the new analogues from those of the other compounds. To mention only a few other effects, it also depends on the conformational flexibility of the ligand and its binding site, on multiple binding modes, and on differences in transport and metabolism [3].

2.2 Fundamental principle of QSAR

QSAR is a ligand-based, indirect drug design technique, most often applicable when the three-dimensional structure of the target is unknown. A hypothetical model of the target binding site can be deduced by comparing the physicochemical properties or common structural features of a set of structures that bind to the same active site with varied affinity. In any kind of biological activity, e.g. drug activity, normal hormone- or neurotransmitter-mediated activity, or enzyme-substrate activity, the ligands (drug, hormone, neurotransmitter or substrate) must be recognized by their respective targets (receptor or enzyme). This is called the molecular recognition process, as described in Fig. 2.1. Molecular recognition is specific and based on the spatial distribution of certain properties of the active site, which are complementary to those of the interacting ligands. In QSAR analyses, we develop the relationship between structural and biological properties by correlating the variation in structural or property descriptors of compounds with the variation in activity. It can be expressed in its most general form by the following equation:

Biological response = f (physicochemical and/or structural descriptors)   (2.1)

The physicochemical descriptors of the ligand include both 2D and 3D parameters that account for hydrophobicity, topology, electronic properties and steric effects, and are determined empirically or by computational methods. These effects contribute to the change in standard free energy (ΔG°) in the molecular recognition process and thus affect the affinity of the ligand-target (LT) complex and therefore the biological response. A high affinity of the ligand (expressed in terms of the association constant) for its target indicates a high magnitude of biological activity (BA). Thus, in the expression given in Fig. 2.1, BA is linearly correlated with ΔG° and therefore QSAR models are usually linear free energy relationships.
The association constant K of this expression indirectly measures activities, including chemical



Figure 2.1 Schematic diagram for fundamental principle of all QSAR analyses.

measurements and observations in biological assays, which can be used in QSAR. QSAR is currently applied in many disciplines, including drug design, ADME prediction, toxicity assessment and environmental risk assessment.
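The linear free energy relationship above can be illustrated numerically. The sketch below (with hypothetical association constants, not data from the book) shows that ΔG° = −RT ln Ka changes by a constant increment per log unit of Ka, which is why log-scaled activities are used as the dependent variable in QSAR:

```python
import math

# Illustrative sketch: the standard free energy of binding relates to the
# association constant Ka by dG = -RT ln(Ka), so log-scaled activity is
# linearly proportional to dG -- the basis of linear free energy models.

R = 1.987e-3   # gas constant, kcal mol^-1 K^-1
T = 298.15     # temperature, K

def delta_g(ka):
    """Standard binding free energy (kcal/mol) from an association constant."""
    return -R * T * math.log(ka)

kas = [1e6, 1e7, 1e8]                      # hypothetical Ka values (M^-1)
dgs = [delta_g(k) for k in kas]
# Each tenfold increase in Ka lowers dG by the same increment
# (about 1.36 kcal/mol at 298 K), i.e. log(activity) vs. dG is linear.
steps = [round(dgs[i] - dgs[i + 1], 3) for i in range(len(dgs) - 1)]
print(dgs, steps)
```
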

2.3 QSAR methodology

QSAR modeling provides an effective way of establishing and exploiting the relationship between chemical structures and their biological actions toward the development of novel drug candidates. Theoretically, QSAR analysis is the application of mathematical and

statistical methods for the development of models to understand the trend of biological activities with respect to the properties of compounds [14]. The whole process of QSAR modeling can be broadly divided into four steps, as described in Fig. 2.2: data preparation, data analysis, validation and application [15].

Figure 2.2 Methodology of QSAR.



Recently, the Organisation for Economic Co-operation and Development (OECD) developed a set of principles for the development and validation of QSAR models which, in particular, requires "appropriate measures of goodness-of-fit, robustness, and predictivity" [16]. The OECD guidance document especially emphasizes that QSAR models should be rigorously validated using external sets of compounds that were not used in the model development.

2.3.1 Data preparation

The first part of QSAR analysis is the preparation of an appropriate molecular dataset for acquiring or calculating molecular descriptors (quantities characterizing molecular structures) and correlating them with biological activity using a suitable statistical method [17]. Three kinds of information are required for QSAR modeling:

i. The correct structures of the molecules present in the dataset, i.e. structural data.
ii. Quantified biological activities of the ligands observed in experimental in vivo and in vitro assays as dependent variables, i.e. biological data.
iii. 2D and 3D descriptors that quantify the intermolecular forces involved in the molecular recognition process as independent variables, i.e. QSAR descriptors.

Structural data: Structures of reported compounds can be collected from the literature, and those of newly synthesized compounds can be confirmed by spectral analyses. Prior to QSAR modeling, all structures in a dataset should be verified with respect to their correct representation by considering the following practical aspects, which may affect the values of the molecular descriptors and therefore the correlation in the final QSAR model:

• Stereoisomerism of the compounds should be well defined. Biological data of molecules with undefined stereoisomerism (e.g. racemic mixtures, impure compounds) cannot be included in the dataset.
• Tautomerism of the molecules at the given pH should be considered while preparing the ligand structures, because different tautomers have different structures and therefore different physicochemical properties (e.g. keto-enol or lactam-lactim tautomerism).
• Structures consisting of several disconnected parts should be removed.
• Salts of the molecules and duplicate structures should be removed.
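The last two curation rules above can be sketched in a few lines. This is an illustrative fragment only (the SMILES records and the length-based "largest fragment" rule are simplifications, not the book's protocol; real workflows use a cheminformatics toolkit for standardization):

```python
def strip_salt(smiles):
    """Keep only the largest disconnected fragment of a SMILES record.
    Fragment size is crudely approximated by string length (illustration
    only; a real toolkit would compare heavy-atom counts)."""
    return max(smiles.split("."), key=len)

def curate(records):
    """records: list of (smiles, activity) pairs.
    Strips counter-ions and drops duplicate parent structures."""
    seen, curated = set(), []
    for smi, act in records:
        parent = strip_salt(smi)
        if parent not in seen:
            seen.add(parent)
            curated.append((parent, act))
    return curated

raw = [
    ("CCO.Cl", 5.2),      # hypothetical salt record: parent kept, Cl dropped
    ("CCO", 5.3),         # duplicate parent structure: removed
    ("c1ccccc1O", 4.1),
]
print(curate(raw))        # [('CCO', 5.2), ('c1ccccc1O', 4.1)]
```
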

Generation of the correct bioactive conformations of the ligands is the starting point and the most important step of QSAR modeling; any error in this step cannot be corrected in later stages of the analysis. Thus, while generating bioactive conformations on the computer screen, consider all possible factors that may affect the real-time behavior of each ligand

in vivo, e.g. the pH of the environment where the ligand acts, its ionization state, etc. The choice of appropriate MM and QM methods for preparing the conformations of the ligands is also important and should not be ignored. Different modules for the preparation of ligand structures are available in freeware and commercial software. It is also important to have freely available molecular format converters such as OpenBabel [18].

Biological data: All kinds of quantified biological activity data whose variation depends only on structural variation in the ligands can be considered. This includes:

• Affinity data, e.g. binding energies to a receptor.
• Constants determined in kinetic studies of biochemical or physiological processes, e.g. rate constants such as association/dissociation constants, Michaelis-Menten constants and inhibition constants Ki.
• Doses of drugs in molar concentration required to produce a 50% effect, e.g. EC50 and IC50.
• In the case of toxicity modeling, the lethal concentration in water LC50, or the lethal dose LD50.
• For pharmacokinetic modeling, absorption rate constants, distribution parameters, clearance, rate constants of metabolic degradation, and elimination rate constants.
• In vitro biological activity values from various cell line assays.
• In vivo biological activity values.

There are some important points one should consider while using these biological activities for QSAR modeling:

• Biological activity data of all molecules in the dataset should be determined by the same experimental protocol, preferably measured in the same laboratory, to avoid errors that may occur due to method variations.
• There should be proper variation in the activity data (at least two log units), with molecules of intermediate activity.
• The mechanism of biological action of all drugs should be well defined, i.e. all molecules in the dataset should interact with the same active site of the target.
• Experimental data should be available from one source; correlations between measurements obtained from different sources cannot be used directly in QSAR studies.

Alternatively, compounds in the dataset may be given a rank or assigned to categories of activity. In the majority of such cases, binary classification is used, in which a compound is classified as either active or inactive [19]. Another key consideration is the target receptor of the compounds in the dataset.
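As a small illustration of preparing the dependent variable, the sketch below converts hypothetical IC50 values to pIC50 and checks the recommended spread of at least two log units (the assay values are invented for the example):

```python
import math

def pic50(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    return -math.log10(ic50_nm * 1e-9)

ic50s_nm = [3.0, 45.0, 220.0, 1800.0, 9500.0]   # hypothetical assay values
activities = [round(pic50(x), 2) for x in ic50s_nm]
spread = max(activities) - min(activities)

print(activities)        # activities on a common log scale
print(spread >= 2.0)     # True: dataset spans at least two log units
```
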

According to the nature of the activity data, QSAR studies can be divided into continuous (activity takes many different values within some interval), category (activity is represented by ranks or ordinal numbers), and classification (activities are different types of biological properties that cannot be rank-ordered) approaches [20].



QSAR descriptors: After the dataset is selected and curated, the next task is the acquisition or calculation of descriptors. According to Todeschini and Consonni [21], molecular descriptors can be grouped into zero-dimensional, one-dimensional, two-dimensional, three-dimensional and other descriptors. Some descriptors are experimental or calculated physicochemical properties of molecules, such as molecular weight, molar refraction, energies of HOMO and LUMO, normal boiling point, octanol/water partition coefficient, molecular surface area, molecular volume, etc. Most of the descriptors discussed in this chapter can be calculated by the Dragon software. Molconn-Z is another widely used descriptor calculation program, which calculates more than 800 descriptors. A relatively small but diverse set of molecular descriptors can be calculated by the MOE software [22]. Many descriptors calculated from the 3D structure of molecules (3D descriptors) have been developed and published as well. Although these are inherently more rigorous, one should keep in mind that their calculation is much more time- and resource-consuming. In many QSAR applications, the calculation of 3D descriptors must be preceded by conformational search and 3D structure alignment. However, even for rigid compounds, it is not generally known whether the alignment corresponds to the real positions of the molecules in the receptor binding site [23]. It has been demonstrated that in many cases QSAR models based on 2D descriptors have comparable (or even superior) predictivity to models based on 3D descriptors. Thus, when 3D QSAR studies are necessary, the 3D alignment of the molecules should preferably be obtained by docking studies, if possible [24]. Virtually any molecular modeling software package contains a set of its own descriptors, and many other descriptors not mentioned here can be found in the specialized literature.
There are sets of descriptors that take values of 0 or 1 depending on the presence or absence of certain predefined molecular features (or fragments), such as oxygen atoms, aromatic rings, rings, double bonds, triple bonds, halogens, and so on. These sets of descriptors are called molecular fingerprints, structural keys, indicator variables or dummy variables (as used in the Free-Wilson type approach). Such descriptors can be represented by bit strings. Molecular holograms are similar to fingerprints; however, they use counts of features rather than their presence or absence [25]. Prior to QSAR studies, processing of the descriptors is required. This includes the exclusion of descriptors having the same value for all compounds in the dataset, as well as of duplicate descriptors. So that descriptors with larger numeric ranges do not dominate the QSAR models, all descriptors are usually normalized (in most cases, range scaling or autoscaling is used). Molecular holograms, AP descriptors and molecular field values around molecules are not normalized. Preferably, descriptors with low variance, and one of each pair of highly correlated descriptors, should be excluded as well [26]. Finally, the data for QSAR model development can be represented in the form of a table, in which each compound is a row and each descriptor, as well as the activity, is a column.
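A minimal sketch of the descriptor processing just described, dropping constant-valued descriptors and autoscaling the rest, might look like this (descriptor names and values are illustrative):

```python
import statistics

def preprocess(table):
    """table: {descriptor_name: list of values, one per compound}.
    Drops constant-valued descriptors and autoscales the rest to zero
    mean and unit standard deviation."""
    scaled = {}
    for name, values in table.items():
        if len(set(values)) == 1:          # same value for every compound:
            continue                       # carries no information, drop it
        mean = statistics.mean(values)
        sd = statistics.stdev(values)
        scaled[name] = [(v - mean) / sd for v in values]
    return scaled

descriptors = {                            # illustrative values
    "MW":    [180.2, 194.2, 208.3, 222.3],
    "nRing": [1, 1, 1, 1],                 # constant -> removed
    "logP":  [1.2, 1.7, 2.3, 2.9],
}
scaled = preprocess(descriptors)
print(sorted(scaled))                      # ['MW', 'logP']
```
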

The problem of outliers

The success of QSAR modeling depends on the appropriate selection of a dataset. One of the main deficiencies of many chemical datasets is that they do not fully satisfy the main hypothesis underlying all QSAR studies: similar compounds are expected to have similar biological activities or properties [27]. Maggiora defines a "cliff" as a region of the descriptor space where properties change so rapidly that adding or deleting one small chemical group can lead to a dramatic change in a compound's property. In other words, small changes in descriptor values can lead to large changes in molecular properties. Generally, in this case there may be not just one outlier but a subset of compounds whose properties differ from those on the other "side" of the cliff. In other words, cliffs are areas where the main QSAR hypothesis does not hold, and cliff detection remains a major QSAR problem that has not been adequately addressed in most reported studies. There are two types of outliers to be aware of: leverage (or structural) outliers and activity outliers. In the case of activity outliers, the problem of "cliffs" should be addressed as well [28]. Structural outliers can be defined as compounds that are largely dissimilar to all other compounds in the descriptor space.
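The cliff idea can be made concrete with a toy check that flags pairs of compounds that are close in descriptor space yet far apart in activity. The distance and activity thresholds below are arbitrary illustrations, not recommended values:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_cliffs(descriptors, activities, d_max=0.5, act_min=2.0):
    """Flag index pairs that are close in (normalized) descriptor space
    but far apart in activity, i.e. pairs violating the similarity
    hypothesis."""
    cliffs = []
    for i in range(len(descriptors)):
        for j in range(i + 1, len(descriptors)):
            close = euclidean(descriptors[i], descriptors[j]) < d_max
            jump = abs(activities[i] - activities[j]) > act_min
            if close and jump:
                cliffs.append((i, j))
    return cliffs

X = [(0.10, 0.20), (0.15, 0.25), (2.0, 1.9)]   # toy normalized descriptors
y = [7.8, 5.1, 6.0]                            # toy pIC50 values
print(find_cliffs(X, y))                       # [(0, 1)]: an activity cliff
```
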

2.3.2 Data analysis

QSAR model development

The general QSAR data analysis and validation workflow is represented in Fig. 2.3. Following the data curation step, a fraction of compounds (typically 15% of the total dataset) is randomly selected as an external evaluation set. The remaining subset of compounds (the modeling set) is rationally divided multiple times into pairs of training and test sets that are used for model development and validation, respectively. Multiple QSAR techniques can be employed, based on the combinatorial exploration of all possible pairs of descriptor sets and various supervised data analysis techniques, and models characterized by high accuracy in predicting both training and test set data are selected. Validated models are finally tested using the external evaluation set. A critical step of external validation is the use of applicability domains (ADs). If external validation demonstrates significant predictive power of the models, the model is used for virtual screening of available chemical databases to identify putative active compounds.

Methods

QSAR modeling techniques employ various methods of multidimensional data analysis, and it is impossible to discuss all of them here. These methods can be classified into linear and nonlinear approaches. Linear methods include simple and multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), etc. [29]. The main distinctive characteristic of these methods is the linearity of the function



Figure 2.3 Flowchart representing various steps of QSAR modeling.

approximating the biological activity in terms of their arguments (which are molecular descriptors). In linear discriminant analysis (LDA), linear combinations of descriptors are built which define hyperplanes separating the representative points of different classes of compounds in the multidimensional descriptor space. Non-linear methods can be derived from linear methods or from more complex approaches that predict compound activities from their descriptors by means of non-linear relationships. Many nonlinear methods are derived from linear methods by transforming them with the so-called kernel trick [30]: the calculations are executed in a so-called feature space in which linear methods are applied. The advantage of these methods is that there is no need to calculate the transformation functions directly. Examples of such methods include non-linear support vector machines (SVMs) and support vector regression (SVR), nonlinear discriminant analysis, kernel-PCA, kernel-PLS, etc. [31]. In the multidimensional feature space, an SVM builds a soft-margin hyperplane which separates points belonging to two different classes, or several hyperplanes to separate points of a larger number of classes. In contrast, SVR builds a hyperplane

such that as many points as possible are within the margin [32]. Other non-linear methods include k-nearest neighbors QSAR, in which the activity of a compound is predicted as a (weighted) average of the activities of its nearest neighbors. k-nearest neighbor methods can include stochastic or stepwise variable (descriptor) selection [33]. Another large group of generally nonlinear methods are artificial neural networks (ANNs) [34]. Ensembles of ANNs can make use of bagging and boosting approaches [34]. ANNs consist of groups of artificial neurons. In feed-forward back-propagation neural networks, the neurons are organized into input, hidden, and output layers. Input layer neurons receive the descriptor values of compounds, which are passed with different weights to the hidden layer neurons [29]. A neuron activation function is then applied at each neuron to the sum of the weighted inputs, and the results are passed to the output layer neurons, which calculate the predicted activities of the compounds. During the training process, the parameters of the neuron functions and the weights are adjusted so that the total error of prediction is minimized. There are also network architectures with multiple hidden layers. Recursive partitioning (RP) methods build decision trees in order to precisely assign compounds to their classes. A tree consists of one root node containing all objects (compounds), intermediate (or decision) nodes, and leaf (terminal) nodes. A measure of node purity is introduced; for example, it could be the ratio of the counts of compounds belonging to the majority and minority classes in a node. At each node, the procedure tries to partition the data to increase the purity measure, that is, to make the difference between the sum of the child node purities and the parent node purity as high as possible. The analysis is based on the descriptor value distributions between classes at the node.
If such a partition at the node is impossible, it becomes a leaf node. Additional criteria may be imposed, e.g. on the minimum number of compounds in a leaf node. Compounds in each node satisfy certain descriptor criteria. After growing, some leaves are consecutively removed based on the improvement of classification at them (so-called pruning of the tree). Without pruning, the tree could be overfitted. Prediction consists of moving a query compound down the tree (based on its descriptor values) until it reaches a leaf node. The predicted class of a compound is defined as the majority class in this node. There are also RP regression methods, which are used if the response variable is continuous [31]. Random Forest methods construct ensembles of trees based on multiple random selections of subsets of descriptors and bootstrapping of compounds. The compounds not selected in a particular bootstrap sample constitute the so-called out-of-bag set, which is used as the test set. The trees are not pruned. The best trees in the forest are chosen for consensus prediction of external compounds [35]. The method can include bagging and boosting approaches.
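As a concrete example of one of the simpler methods above, here is a minimal k-nearest-neighbors QSAR predictor (toy descriptors and activities; a real implementation would also select descriptors and optimize k):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(x, X_train, y_train, k=2):
    """Distance-weighted average of the activities of the k nearest
    training compounds (a minimal k-NN QSAR prediction)."""
    nearest = sorted(
        (euclidean(x, xt), yt) for xt, yt in zip(X_train, y_train)
    )[:k]
    weights = [1.0 / (d + 1e-6) for d, _ in nearest]   # inverse-distance
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # toy descriptors
y_train = [6.0, 6.4, 6.2, 8.9]                              # toy activities
print(round(knn_predict((0.1, 0.1), X_train, y_train, k=2), 2))
```
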

2.3.3 Validation

QSAR modeling is a theoretical technique for predicting the activities of newly designed molecules; therefore, the selected model should be rigorously validated before its use for



designing new molecules and forecasting their activity. The following are commonly used validation tests:

Applicability domains

The applicability domain (AD) was developed to avoid unjustified extrapolation in activity prediction [36]. In usual drug design studies, the AD is defined as a Euclidean distance threshold (DT) between a query compound and its closest k nearest neighbors of the training set. It is calculated as follows:

DT = ȳ + Zσ


Here, ȳ is the average Euclidean distance between each compound and its k nearest neighbors in the training set; k is optimized in the course of QSAR modeling, and the distances are calculated using only the descriptors selected by the optimized model; σ is the standard deviation of these Euclidean distances; and Z is an arbitrary cutoff parameter defined by the user [37]. The default value of Z is 0.5, which formally places the allowed distance threshold at the mean plus one-half of the standard deviation. Euclidean distances can also be calculated using all descriptors. Thus, if the distance of an external compound from its nearest neighbor in the training set, within either the entire descriptor space or the selected descriptor space, exceeds these thresholds, no prediction is made [38]. Instead of Euclidean distances, other distance and similarity measures can also be used.

Y-randomization

To establish model robustness, the Y-randomization (randomization of the response variable) test should be used. This test consists of repeating all the calculations with scrambled activities of the training set; ideally, the calculations should be repeated at least five times. The goal of this procedure is to establish whether the good statistics of the models built with the real activities of the training set are due to overfitting or chance correlation. If the predictive power for the training or the test set of all models built with randomized activities is significantly lower than that of the models built with real activities, the latter are considered reliable. Using different parameters of the model development procedure, multiple QSAR models with acceptable statistics are built; suppose the number of these models is m. The Y-randomization test can also give n models with acceptable statistics. For acceptance of the models developed with the real activities of the training set, the condition n << m should be satisfied. The Y-randomization test is particularly important for small datasets.

External validation

Consensus prediction, which is the average of the predicted activities over all predictive models, always provides the most stable results [39]. The consensus prediction of biological
Y-randomization test is particularly important for small datasets. External validation Consensus prediction, which is the average of predicted activities over all predictive models, always provides the most stable results [39]. The consensus prediction of biological

activity for an external compound on the basis of several QSAR models is more reliable and provides better justification for the experimental exploration of hits. External evaluation set compounds are predicted by models that have passed all the validation criteria described above. Each compound is predicted by the models for which the compound is within the AD. In fact, each external set compound should be within the AD of the training set within the entire descriptor space as well (vide supra). A useful parameter for consensus prediction is the minimum number (or percentage) of models for which a compound must be within the AD; it is defined by the user. If the compound is found within the AD of a lower number of models, it is considered to be outside of the AD. The prediction value is the average of the predictions by all models. If a compound is predicted by more than one model, the standard deviation of all predictions by these models is also calculated. For classification and category QSAR, the average prediction value is rounded to the closest integer (which is a class or category number); in the case of imbalanced datasets, rounding can be done using a moving threshold. Predicted average classes or categories (before rounding) that are closer to the nearest integers are considered more reliable [39]. Using these prediction values, the AD can be defined by a threshold on the absolute difference between the predicted and the rounded predicted activity. Sometimes, however, the external evaluation set may have a much smaller range of activities than the modeling set, so it may be impossible to obtain a sufficiently large R2 value (and other acceptable statistical characteristics) for it.
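The applicability-domain test described above (DT = ȳ + Zσ) can be sketched as follows, using toy training data and assuming k = 1 and Z = 0.5 (an illustrative leave-one-out estimate of the neighbor distances, not a specific software implementation):

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_mean_dist(x, pool, k):
    """Average distance from x to its k nearest points in pool."""
    return statistics.mean(sorted(euclidean(x, p) for p in pool)[:k])

def ad_threshold(X_train, k=1, z=0.5):
    """DT = mean + z * stdev of each training compound's average
    distance to its k nearest neighbors (leave-one-out)."""
    dists = [knn_mean_dist(x, X_train[:i] + X_train[i + 1:], k)
             for i, x in enumerate(X_train)]
    return statistics.mean(dists) + z * statistics.stdev(dists)

def in_domain(x, X_train, k=1, z=0.5):
    return knn_mean_dist(x, X_train, k) <= ad_threshold(X_train, k, z)

X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy training set
print(in_domain((0.5, 0.5), X_train))   # True: close to the training data
print(in_domain((8.0, 8.0), X_train))   # False: prediction withheld
```
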

2.4 Descriptor calculations for QSAR models

Descriptors are parameters that encode certain structural features and properties, needed to describe the intermolecular forces of drug-receptor interactions, as well as the transport and distribution of a drug, in a quantitative manner, and to correlate them with biological activity [21,40]. The parameters are classified into the following classes:

a. 2D: 2D parameters use only the atoms and connection information of the molecule for the calculation; 3D coordinates and individual conformations are not considered.
b. i3D: Internal 3D parameters use 3D coordinate information about each molecule; however, they are invariant to rotations and translations of the conformation.
c. x3D: External 3D descriptors also use 3D coordinate information, but additionally require an absolute frame of reference (e.g., molecules docked into the same receptor).

2.4.1 Types of QSAR descriptors

On the basis of the types of intermolecular interaction forces displayed in Fig. 2.4, descriptors can be broadly classified into three classes: electronic descriptors, which account for electrostatic attraction and repulsion forces; lipophilic descriptors, which determine the lipophilic or entropic contribution; and steric descriptors, which account for van der Waals types of



Figure 2.4 Intermolecular interaction forces for molecular recognition process.

interaction and steric clashes. Sometimes indicator variables or dummy variables are also used to understand the weightage of individual structural components in eliciting the activity.

Electronic parameters

i. Apol: The sum of atomic polarizabilities (Apol) descriptor computes the sum of the atomic polarizabilities, which are calculated from the A coefficients used for molecular mechanics calculations.
ii. σx: The Hammett electronic parameter is the substituent constant σx, which quantifies the electronic effect of substituent x relative to hydrogen. σx is determined from the influence of a substituent on the ionization of benzoic acid. It is defined by

log (Kx/KH) = ρσx   (2.3)

where KH is the equilibrium or rate constant for the parent (unsubstituted) compound and Kx is the equilibrium or rate constant for the derivative, and is measured

experimentally. Positive values correspond to electron withdrawal and negative values to electron release.
iii. DIP: The dipole moment, measured in Debye (D) units, is an electronic descriptor that indicates the strength and orientation behavior of a molecule in an electrostatic field. Both the magnitude and the components (X, Y, Z) of the dipole moment are calculated. It is estimated using partial atomic charges and atomic coordinates:

μ (Debye, D) = e (e.s.u.) × d (in cm)   (2.4)


iv. HOMO: The highest occupied molecular orbital (HOMO) is the highest energy level in the molecule that contains electrons. It is crucially important in governing molecular reactivity and properties. When a molecule acts as a Lewis base (an electron-pair donor) in bond formation, the electrons are supplied from the molecule's HOMO, and how readily this occurs is reflected in the energy of the HOMO. Molecules with high HOMOs can donate their electrons more readily and are hence relatively reactive compared to molecules with low-lying HOMOs; thus the HOMO descriptor measures the nucleophilicity of a molecule.
v. LUMO: The lowest unoccupied molecular orbital (LUMO) is the lowest energy level in the molecule that contains no electrons. It is also important in governing molecular reactivity and properties. When a molecule acts as a Lewis acid (an electron-pair acceptor) in bond formation, incoming electron pairs are received in its LUMO. Molecules with low-lying LUMOs accept electrons more readily than those with high LUMOs; thus the LUMO descriptor measures the electrophilicity of a molecule.

Hydrophobic parameters

i. AlogP: Log of the partition coefficient. Hydrophobicities of solutes can readily be determined by measuring partition coefficients, designated P. By convention, P is defined as the ratio of the concentration of the solute in octanol to its concentration in water. ALogP (the octanol/water partition coefficient) of a solute is an equilibrium constant related to the free energy difference of the solute between water and 1-octanol, and consequently to the hydrophobic character of the molecule.
ii. Hansch lipophilic substituent constant, π: For substituent X, it is given by the difference of its log P from the log P for hydrogen:

πX = log (PX/PH)   (2.5)


where PX is the partition coefficient of the X-substituted benzene and PH is the partition coefficient of the unsubstituted benzene. A positive value indicates the hydrophobic nature of the substituent.
iii. FH2O and Foct: Desolvation free energies for water and for 1-octanol (kcal mol-1). Foct and FH2O are physicochemical properties associated with LFE models


of a molecule. These properties have proven useful as molecular descriptors in structure-activity analyses. All LFE computations are based solely on the connectivity of the atoms in a molecule; they are not conformationally dependent. Foct is the 1-octanol desolvation free energy and FH2O is the aqueous desolvation free energy derived from a hydration shell model developed by Hopfinger, where Foct and FH2O are in kcal mol-1. FH2O and Foct can be calculated for each molecule by searching the molecule for recognizable substituent groups and their bonding patterns and summing the substituent constant contributions for each group present in the molecule.
Steric parameters
i. Taft's steric constant Es: The first steric parameter to be quantified and used in QSAR studies was Taft's steric constant Es, defined as
Es = log(kX/kH)A (2.6)
where kX and kH represent the rates of acid hydrolysis of the esters XCH2COOR and CH3COOR, respectively.
ii. Molar refractivity (MR): A linear free energy molecular descriptor that can relate chemical structure to observed chemical behavior. The molar refractivity is the molar volume corrected by the refractive index. The molecular refractivity index of a substituent is a combined measure of its size and polarizability (Dunn III, 1977). It can be calculated according to the following formula:
MR = [(n² − 1)/(n² + 2)] × (MW/d) (2.7)
where n is the refractive index, MW is the molecular weight, and d is the compound's density.
iii. Molecular surface area (AREA): The molecular surface area descriptor is a 3D spatial descriptor that describes the van der Waals area of a molecule. The molecular surface area determines the extent to which a molecule exposes itself to the external environment. This descriptor is related to binding, transport, and solubility.
iv.
Radius of gyration (ROG): The radius of gyration is calculated using the following equation:
ROG = √[Σ(xi² + yi² + zi²)/N] (2.8)
where N is the number of atoms and x, y, z are the atomic coordinates relative to the center of mass.
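As a quick numerical check of the formulas above (the π definition, Eq. 2.7 and Eq. 2.8), here is a small Python sketch; the inputs (benzene-like n, MW and d, approximate log P values, and a two-atom toy geometry) are illustrative placeholders rather than data from this chapter.

```python
import math

def hansch_pi(logp_x: float, logp_h: float) -> float:
    """Hansch lipophilic constant: difference in log P vs. the unsubstituted parent."""
    return logp_x - logp_h

def molar_refractivity(n: float, mw: float, d: float) -> float:
    """Eq. (2.7): MR = ((n^2 - 1) / (n^2 + 2)) * (MW / d)."""
    return (n ** 2 - 1) / (n ** 2 + 2) * (mw / d)

def radius_of_gyration(coords) -> float:
    """Eq. (2.8): ROG = sqrt(sum(x^2 + y^2 + z^2) / N), coordinates taken
    relative to the (unweighted) center of mass."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                         for x, y, z in coords) / n)

print(round(hansch_pi(2.84, 2.13), 2))                    # pi(Cl) ~ 0.71, hydrophobic
print(round(molar_refractivity(1.501, 78.11, 0.876), 1))  # ~26.3 for benzene-like input
print(round(radius_of_gyration([(0, 0, 0), (2, 0, 0)]), 3))
```

The two-point ROG case is easy to verify by hand: both atoms sit one unit from the center of mass, so the result is 1.0.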

i. Molecular volume (Vm): A 3D spatial descriptor that defines the molecular volume inside the contact surface. The molecular volume is calculated as a function of conformation and is related to binding and transport.
ii. Molecular weight (MW): Molecular weight terms have also been used as descriptors, particularly in cellular systems, or in distribution and transport studies where diffusion is the mode of operation.
iii. Sterimol descriptors: These descriptors define the steric constraints of a given substituent along several fixed axes. Five parameters are deemed necessary to define shape: L, B1, B2, B3, and B4.
Sterimol L: the steric length parameter, measured along the substitution point bond axis.
Sterimol B1 through B4: the steric distances perpendicular to the bond axis.
Sterimol B: the overall maximum steric distance perpendicular to the bond axis.
Indicator variables
Indicator variables are sometimes called dummy variables or de novo constants. They are used in multiple regression analysis to account for certain features that cannot be described by the continuous variables described above. In QSAR equations they normally stand for a certain structural element (presence, 1, or absence, 0, of a substituent or molecular fragment). These variables can also be used to account for other structural features, e.g. intramolecular hydrogen bonding, hydrogen bond donor and acceptor properties, ortho effects, cis/trans isomerism, or a different parent skeleton. Commonly used count-type variables include:
a. Chirality: counts the number of chiral centers (R or S) in the current molecule.
b. H-bond acceptor (HBA): counts the number of hydrogen bond acceptor groups in the current molecule.
c. H-bond donor (HBD): counts the number of hydrogen bond donor groups in the current molecule.
d. Rotbond (ROTBOND): counts the number of bonds in the current molecule having rotations that are considered to be meaningful for molecular mechanics.
All terminal H atoms are ignored (for example, methyl groups are not considered rotatable).
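The 0/1 encoding described above can be sketched in a few lines of Python; the feature names below are hypothetical examples invented for illustration, not descriptors from any specific software package.

```python
# Hypothetical structural features encoded as 0/1 indicator (dummy) variables.
FEATURES = ["ortho_substituent", "intramolecular_H_bond", "cis_isomer"]

def indicator_vector(present_features):
    """Return 1 for each feature present in the molecule, 0 for each absent."""
    return [1 if f in present_features else 0 for f in FEATURES]

print(indicator_vector({"intramolecular_H_bond"}))  # [0, 1, 0]
print(indicator_vector(set()))                      # [0, 0, 0]
```

Each vector becomes one row of dummy variables in the regression table, alongside the continuous descriptors.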

2.5 Development of Hansch models and their validation
The Hansch model is used to correlate biological activity values with the physicochemical properties of a set of molecules. It is a property-property relationship model in which various conceivable combinations of lipophilic, polarizability, electronic and steric parameters are correlated with biological activity values in linear, parabolic and bilinear equations. The Hansch model is based on the following assumptions:
• A particular kinetic mechanism operates in a biological response.
• Relative to π, there is a biphasic variation of biological activity.
• Physicochemical parameters govern the rate constant of the rate-determining step of drug action, i.e., the molecular recognition process, as described in Fig. 2.1.

The general forms of Hansch QSAR equations are:
For linear (Fig. 2.5A) in-vitro activity data:
log(1/C) = aπ + bσ + cEs + e
For bilinear or parabolic (Fig. 2.5B) in-vivo activity data:
log(1/C) = aπ + bπ² + cσ + dEs + e
where C = concentration of the drug necessary to produce a specific biological response, π = Hansch lipophilic substituent constant (which can be replaced or complemented with log P), σ = Hammett substituent constant, and Es = Taft steric constant.

2.5.1 General guidelines for derivation of Hansch QSAR model
The following general guidelines are used to generate and select an appropriate Hansch QSAR model:
• A wide range of different parameters should be tried, including local properties (also called substituent constants, e.g. π, σ, Es, indicator variables and sterimol parameters) and global properties (e.g. HOMO, LUMO, Vm, MW, etc.). The parameters selected for the best equation should have an inter-correlation coefficient of not more than 0.6-0.7; an exception is the combination of linear and square terms, for example π and π² [10].
• All reasonable parameters must be correlated by an appropriate statistical procedure, e.g. stepwise multiple linear regression (SMLR). The best equation is normally the one with the lowest standard deviation, all terms being significant at the 95% confidence interval; alternatively, the equation with the highest overall F value may be selected as the best one.
• All things being (approximately) equal, one should accept the simplest model.
• One should have at least five to six data points per variable to avoid chance correlations (this rule applies only to data sets of intermediate size; for small data sets more parameters may be allowed if they are based on a reasonable model; for larger data sets, e.g. n > 30, this recommendation leads to equations that include too many variables).
• The model should be consistent and able to explain all aspects of the physical, organic and biomedicinal chemistry of the molecular recognition process under consideration.

Figure 2.5 Graph of Hansch QSAR model: linear model (A), bilinear model (B).
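The guideline that all reasonable parameters be correlated by a statistical procedure can be made concrete with a minimal ordinary-least-squares sketch in pure Python (normal equations solved by Gaussian elimination); the substituent values and "true" coefficients below are synthetic numbers invented only to show the mechanics of fitting log(1/C) = aπ + bσ + cEs + e, not data from this chapter.

```python
# Minimal ordinary-least-squares sketch for a linear Hansch equation.
# All data below are synthetic, generated from known coefficients for illustration.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares(X, y):
    """beta = (X^T X)^-1 X^T y; X already contains a column of ones for the intercept."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Synthetic substituent table: (pi, sigma, Es) per compound.
data = [(0.56, -0.17, -1.24), (0.71, 0.23, -0.97), (-0.67, 0.06, -0.46),
        (0.86, 0.54, -2.58), (0.00, 0.00, 0.00), (1.02, -0.27, -1.40)]
true_a, true_b, true_c, true_e = 1.2, 0.8, 0.3, 0.5
X = [[pi, s, es, 1.0] for pi, s, es in data]
y = [true_a * pi + true_b * s + true_c * es + true_e for pi, s, es in data]
a, b, c, e = least_squares(X, y)
print(round(a, 3), round(b, 3), round(c, 3), round(e, 3))  # recovers 1.2 0.8 0.3 0.5
```

Because the synthetic activities contain no noise, the fit recovers the generating coefficients exactly; with real data, the standard deviation and F value of the fit would be used to compare candidate equations as described above.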

2.6 QSAR model generation using Free Wilson approach
The Free Wilson approach is a true structure-activity relationship model [41]. It is based on the following assumptions:
a. All the drugs tested have the same parent structure.
b. The substitution pattern in the various derivatives has to be the same.
c. The substituents have to contribute to the biological activity additively and in the same position, with a constant amount that is independent of the presence or absence of other substituents in the molecule.
Therefore, the total activity (A) of a derivative is the sum of constant, independent partial contributions:
A (total activity) = contribution of R1 + contribution of R2 + ... + contribution of parent
A = Σ (aij · Ij) + μ
where Ij = substituent at the jth position (present, 1, or absent, 0), aij = contribution of substituent Ij, and μ = contribution of the parent structure.
The equation is solved by multiple linear regression analysis using the presence or absence of the different substituents as independent dummy parameters, while the measured activities serve as the dependent variable. The Free Wilson method is useful in the following three conditions:
• When nothing is known about the mode of action,
• When biological testing is slow compared to the synthesis, and
• When the physicochemical properties of the substituents are unknown.
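The additivity assumption behind the Free Wilson model can be illustrated with a toy calculation in which activities are constructed additively and a group contribution is then recovered from them; the parent contribution and all group-contribution values are invented numbers.

```python
# Toy illustration of the Free Wilson additivity assumption.
mu = 1.0                              # parent contribution (invented)
a = {"R1:CH3": 0.4, "R1:Cl": 0.7}     # contributions at position 1 (invented)
b = {"R2:OH": -0.2, "R2:H": 0.0}      # contributions at position 2 (H = reference)

# Activities constructed additively: A = mu + a[R1] + b[R2]
activity = {(r1, r2): mu + a[r1] + b[r2] for r1 in a for r2 in b}

# Recover the R2:OH contribution as the mean activity shift vs. the reference R2:H.
est_R2_OH = sum(activity[(r1, "R2:OH")] - activity[(r1, "R2:H")] for r1 in a) / len(a)
print(round(est_R2_OH, 2))  # -0.2
```

In a real analysis this averaging is replaced by the multiple linear regression on dummy variables described above, but the recovered group contribution is the same idea.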

Advantages:
• The table for regression can be easily generated.
• Addition and elimination of compounds is simple and does not significantly change the values of the other regression coefficients.
• Any compound may be chosen as the reference compound because all have the same parent contribution.
The Free Wilson approach is easy to apply, especially in the early phase of SAR; it is a simple method to derive substituent contributions and to take a first look at their possible dependence on different structural features.

2.6.1 Limitations
• Structural variation is necessary at a minimum of two different positions of substitution.
• Every substituent that occurs only once in the data set leads to a single-point determination; the corresponding group contribution contains the whole experimental error of that one biological activity value.
• In most cases a large number of parameters is needed to describe relatively few compounds, sometimes leading to equations that are statistically not significant.
• Only a small number of new analogues can be predicted.
• Predictions for substituents that are not included in the analysis are generally impossible.
• It is limited to linear additive SARs.

2.7 QSAR model generation using mixed approach
The mixed approach of QSAR model generation employs a combination of the Hansch and Free Wilson models [42]. Assumptions for the mixed approach include:
• All the drugs tested have the same parent structure.
• The substitution patterns in the various derivatives have to be the same.
• The substituents contribute to the biological activity additively, independent of the presence or absence of other substituents.
• It exploits the relationship between Hansch analysis and Free Wilson analysis.

Hansch analysis and the Free Wilson model differ in their application, but they are closely related. The group contribution of any substituent in the Free Wilson approach can be assumed to be derived from a linear Hansch equation. Thus Free Wilson contributions contain all possible physicochemical contributions of the substituents; therefore the Free Wilson approach always gives the upper limit of correlation that can be achieved by a linear Hansch analysis. Due to the relationship between Hansch and Free Wilson analysis, indicator variables can be included in Hansch analysis. Both models can be combined into a mixed approach, in linear and bilinear form, which offers the advantages of both approaches and widens their applicability:
log(1/C) = contribution of R1 + contribution of R2 + ... + contribution of parent
log(1/C) = (aπ + bσ + cEs) of R1 + k2R2 + k3R3 + ... + contribution of parent (2.12)
= (aπ + bσ + cEs) of R1 + Σai + c
where Σai = sum of group contributions.

2.8 3D QSAR analyses
Comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) are promising 3D QSAR approaches for structure-activity correlation [43]. Classical QSAR methods are based only on the magnitude of particular physical properties and do not consider any directional preferences. 3D QSAR models look at a molecule in three dimensions, from the viewpoint of the "receptor", and describe the magnitude and directional preferences of molecular interactions. Thus, 3D QSAR analysis is an analysis of the quantitative relationship between the biological activity of a set of compounds and the 3D properties of their bioactive conformations, having magnitude as well as directional preferences.

2.8.1 Comparative molecular field analysis (CoMFA)
Principle
CoMFA is the first 3D QSAR approach, reported by Cramer et al. in 1988 [44], and has been extensively explored in the literature for 3D QSAR modeling of a variety of biologically active compounds. Basically, a CoMFA model is derived by comparing the steric and electrostatic interaction fields in the 3D space around a set of aligned congeneric molecules and correlating this comparison with the variation in their biological activity. CoMFA QSAR modeling is based on the following basic assumptions:
• The numerical property values most relevant to biological activity are shape-dependent.
• At the molecular level, the interactions which produce an observed biological effect are usually non-covalent.
• Molecular mechanics force fields, most of which treat non-covalent (non-bonded) interactions only as steric and electrostatic forces, can account precisely for a great variety of observed molecular properties.



Thus it can be hypothesized that a suitable sampling of the steric and electrostatic fields surrounding a set of ligand (drug) molecules might provide all the information necessary for understanding their observed biological properties [44]. There are four typical features of the CoMFA approach:
• Ligand molecules are represented by their steric and electrostatic fields, sampled at the intersections of a three-dimensional lattice.
• 3D structures of ligands are aligned by the "field fit" technique, which allows optimal mutual alignment within a series by minimizing the RMS field differences between molecules.
• Models are generated by correlating steric and electrostatic fields in 3D space with biological response by partial least squares (PLS), using cross-validation to maximize the likelihood that the results have predictive validity.
• Results are graphically represented as contoured three-dimensional coefficient plots.
CoMFA methodology
CoMFA is a seven-step process that starts with the selection of an appropriate set of congeneric molecules. It is an alignment-sensitive method, and there are a number of points one has to consider for generating a highly predictive 3D QSAR model.
Step 1: Select the set of molecules, build structures, assign charges and generate low-energy conformations
In the first step, a set of molecules is selected for inclusion in the analysis. As the most important precondition, all molecules have to interact with the same kind of receptor (or enzyme, ion channel, transporter) in the same manner, i.e., with identical binding sites in the same relative geometry (comparable conformations and similar orientation). A certain subgroup of molecules is selected to constitute a training set from which the CoMFA model is derived. The remaining molecules are considered a test set, which independently proves the validity of the derived model(s).
Step 2: Molecular alignment
A CoMFA study requires that the 3D structures of the molecules be aligned according to a suitable conformational template, assumed to be the bioactive conformation, so that they have comparable conformations and similar orientations. A pharmacophore hypothesis is derived to orient the superposition of all individual molecules and to offer a rational and consistent alignment.
Step 3: Generation of 3D fields
After superposition of the molecules, a rectangular box is placed around all molecules, keeping a minimum distance of a few Å around the structures. A grid distance (default value 2.0 Å) is selected to generate points at the intersections of a regular 3D lattice. According to the dimensions of the box and the chosen grid distance, normally a few to several thousand points are generated. CoMFA calculates steric fields using a Lennard-Jones potential and electrostatic fields using a Coulombic potential. A probe atom or group, such as a neutral carbon atom (probe for van der Waals interactions), a charged atom (probe for Coulombic interactions) or a hydrogen bond donor and acceptor (probe for hydrogen bonding interactions), is used to determine these potential or interaction energies between each molecule and every grid point. The relative alignment of the individual molecules at the time of computing their fields is the most important, sensitive and adjustable parameter in a CoMFA analysis.
Step 4: Compile the data table and remove redundancy
The biological activity and the calculated steric and electrostatic interaction energies at each grid point are entered in columns. Data redundancy at the grid points is reduced by using a cutoff value of 30 kcal/mol and a column filtering value of 2 kcal/mol in the PLS analysis.
Step 5: Partial least squares regression analysis
3D QSAR models are generated by partial least squares (PLS) regression analysis using the interaction energies as independent variables and the biological activity as the dependent variable. It is the most promising multivariate statistical method: many, even hundreds or thousands, of independent variables (X-block) can be correlated with one or several dependent variables (Y-block). A linear PLS model finds new latent variables, also called X-scores, denoted ta (a = 1, 2, 3, ... A). These scores are linear combinations of the original variables Xk with weight coefficients wka:
ta = Σk wka Xik
where a = index of components, i = index of objects, and k = index of independent variables.
Step 6: Evaluation of the model
As in regression analysis, in PLS analysis the correlation coefficient r also increases with the number of extracted vectors. Depending on the number of components, often near-perfect correlations are obtained in PLS analyses, owing to the large number of x variables. Correspondingly, the goodness of fit (a high r² and small S value) is no criterion for the validity of a PLS model. The significance of additional PLS vectors is determined by cross-validation. In the most common leave-one-out cross-validation, one object (i.e., one biological activity value) is eliminated from the training set and a PLS model is derived from the remaining compounds. This model is used to predict the biological activity value of the compound which was not included in the model. The same procedure is repeated after elimination of another object until all objects have been eliminated once. The sum of the



squared differences, PRESS = Σ(ypred − yobs)², between these "outside" predictions and the observed y values is a measure of the internal predictivity of the PLS model. For larger data sets, an alternative to the leave-one-out technique is recommended to yield more stable PLS models: several objects are eliminated from the data set at a time, randomly or in a systematic manner, and the excluded objects are predicted by the corresponding model.
Step 7: Final predictive model as contour plot
The results of a PLS analysis can be transformed into regression coefficients of the X-block variables that are used for the calculation and prediction of biological activity values. Because of the large number of regression coefficients, a direct interpretation of the corresponding equation is impossible. An appropriate way to visualize the results is the generation of contour maps, which show the volumes of regions that are larger or smaller than certain user-defined positive or negative values. Contour maps of the final PLS model can be obtained displaying the most relevant regions of space where the variations in activity with the steric and electrostatic fields are largest. Color coding is used to characterize the contour plots for each of these two fields (Fig. 2.6).
Limitations of CoMFA
CoMFA is an alignment-sensitive method and suffers from certain inherent limitations:
• The force field functions do not model all interaction types.
• They show singularities at the atomic positions.
• Deliberately defined cutoff values are needed.
• Contour plots are often not continuously connected.
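The leave-one-out procedure described under Step 6 can be sketched with a toy example; a one-descriptor least-squares line stands in for the PLS model, and the x/y values are synthetic.

```python
# Toy leave-one-out cross-validation: refit on n-1 points, predict the held-out
# point, accumulate PRESS = sum((y_pred - y_obs)^2) and derive q^2.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.1, 1.9, 3.2, 3.9, 5.1]   # synthetic, nearly linear activities

press = 0.0
for i in range(len(x)):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]   # leave object i out
    m, c = fit_line(xs, ys)
    press += (m * x[i] + c - y[i]) ** 2             # "outside" prediction error

mean_y = sum(y) / len(y)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
q2 = 1.0 - press / ss_total
print(round(q2, 3))  # close to 1 for this nearly linear toy data
```

Because every prediction is made for a compound the model has not seen, q² computed this way is a far more honest validity criterion than the fitted r².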

2.8.2 Comparative molecular similarity indices analysis (CoMSIA)
CoMSIA is an alternative, modified form of CoMFA that computes property fields based on similarity indices of the drug molecules using a Gaussian-type distance dependence, so that no singularities occur at the atomic positions [45]. Accordingly, no arbitrary definitions of cutoff limits and no deficiencies due to different slopes of the fields are encountered. Due to the cutoff settings in the CoMFA fields and the steepness of the potential close to the molecular surface, CoMFA maps are often rather fragmentary and not continuously connected, which makes their interpretation difficult. The maps obtained by the CoMSIA approach are superior and easier to interpret. The CoMSIA maps highlight those regions within the area occupied by the ligand skeletons that require particular physicochemical properties important for activity; this is a significant guide for tracing the features that really matter, especially with respect to the design of novel compounds. In CoMSIA, the steric indices are related to the third power of the atomic radii, the electrostatic descriptors are derived from atomic partial charges, the hydrophobic fields are derived from atom-based


Figure 2.6 CoMFA methodology.

parameters, and the hydrogen bond donor and acceptor indices are obtained from a rule-based method derived from experimental values. Similarity indices are calculated using a Gaussian-type distance dependence between the probe and the atoms of the molecules of the data set. This functional form requires no arbitrary definition of cutoff limits, and the similarity indices can be calculated at all grid points inside and outside the molecule. The value of the attenuation factor is set to 0.30.
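The Gaussian-type distance dependence can be sketched as follows; the attenuation factor 0.3 is taken from the text, while the probe position and the per-atom property values are illustrative assumptions, and the exact weighting used by CoMSIA software is not reproduced here.

```python
import math

# Sketch of a Gaussian-type distance dependence for similarity indices.
ALPHA = 0.3  # attenuation factor quoted in the text

def similarity_at_point(grid_point, atoms):
    """Sum Gaussian-weighted property contributions of all atoms at one lattice point."""
    gx, gy, gz = grid_point
    total = 0.0
    for x, y, z, w in atoms:  # w = physicochemical property value of the atom
        r2 = (x - gx) ** 2 + (y - gy) ** 2 + (z - gz) ** 2
        total += w * math.exp(-ALPHA * r2)
    return total

atoms = [(0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, -0.5)]
# Unlike a Lennard-Jones field, the value stays finite even exactly at an atom position:
print(round(similarity_at_point((0.0, 0.0, 0.0), atoms), 4))
```

The absence of singularities is visible directly: evaluating the index at an atom's own coordinates simply returns a finite Gaussian-weighted sum, which is why no cutoff limits are needed.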


Contour plot analysis in CoMSIA
CoMSIA does not calculate interaction energies but distance-dependent similarity indices (the similarity of the probe to the molecule's atoms), resulting in smooth contour plots. Color coding for the contour maps of the CoMSIA model:

Property                        Increases activity    Decreases activity
Steric field                    Green                 Yellow
Electrostatic field (positive)  Blue                  Red
Electrostatic field (negative)  Red                   Blue
Hydrophobic                     Orange                Black
Hydrophilic                     Black                 Orange
HB acceptor field               Magenta               Red

2.9 Conventional QSAR versus 3D-QSAR
Conventional QSAR uses descriptors that are single numbers, each describing some aspect of the molecule. 3D-QSAR uses a 3D grid of points around the molecule, each point having properties associated with it, such as electron density or electrostatic potential. In general, conventional QSAR is best used for computing properties that are a function of nonspecific interactions between the molecule and its surroundings. For these properties, small changes in molecular structure generally give small changes in the property. For example, conventional QSAR is the method of choice for computing normal boiling points, passive intestinal absorption, blood-brain barrier permeability, colligative properties, etc. Conversely, 3D-QSAR is better for computing very specific interactions, such as how tightly a compound will bind to the active site of one specific protein.

2.10 Conclusion
Although there has been a question mark over the viability and practical utility of QSAR modeling, due to instances of poor external performance, lax scientific practices, and the advent of newer models built using HTS data, it still provides accurate and simple estimation within its domain. In this chapter, we have discussed various QSAR modeling methodologies and offered guidelines for developing rigorous and properly validated QSAR models that, if followed, afford multiple and diverse successful applications of QSAR. The enormous, continuing growth of data in the molecular sciences, from medicinal chemistry to the different "omics" disciplines, suggests a growing importance of the QSAR approach to molecular data modeling. The developing trend toward minimizing animal use in biomedical research places additional focus on QSAR as a source of alternative predictors of in vivo effects in both animals and humans. Hopefully, this chapter will help both computational and experimental

chemists to develop reliable QSAR models and to use these models to optimally exploit the experimental data to guide future studies.

Exercises
Solved exercises
Exercise 1: To develop a CoMFA model for the di-arylheterocyclic class of COX-2 inhibitors using the Sybyl software
1. Collect structures and biological activity data (IC50 values, with a minimum of two log units of variation) of COX-2 inhibitors from the literature [46].
2. Convert the IC50 to pIC50 by taking the −log of the IC50 values.
3. Build the three-dimensional structures of all molecules by considering the bioactive conformation of the ligand SC-558, co-crystallized with COX-2 (PDB code: 1CX2), followed by energy minimization.
4. Generate the bioactive conformation of each molecule by docking all the molecules into the COX-2 protein (PDB code: 1CX2, co-crystallized with SC-558) and use these for alignment (Fig. 2.7).
5. Divide the molecules into training and test sets (follow Racz et al. [47] for grouping criteria).

Figure 2.7 Aligned pose of bioactive conformations of data set molecules.
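Steps 2 and 5 of the exercise can be sketched in Python; the compound names and IC50 values (in mol/L) are placeholders, not data from ref. [46].

```python
import math
import random

# Sketch for steps 2 and 5: convert IC50 (mol/L) to pIC50 and split compounds
# into training and test sets. Names and IC50 values are placeholders.

def pic50(ic50_molar: float) -> float:
    """pIC50 = -log10(IC50), with IC50 in mol/L."""
    return -math.log10(ic50_molar)

ic50s = {"cpd1": 5e-9, "cpd2": 1.2e-7, "cpd3": 8e-6}
activities = {name: round(pic50(v), 2) for name, v in ic50s.items()}
print(activities)

ids = sorted(ic50s)
random.seed(0)                    # reproducible split
random.shuffle(ids)
cut = int(0.8 * len(ids)) or 1    # keep roughly 80% for training
train, test = ids[:cut], ids[cut:]
print(train, test)
```

A random split is only the simplest option; the grouping criteria of Racz et al. [47], cited in step 5, should be preferred for the actual exercise.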



6. Use the training set molecules for model generation and the test set molecules for model validation.
7. Add a CoMFA column with default settings (steric and electrostatic fields calculated using an sp3-hybridized carbon atom with +1 charge as the probe atom, step size 2 Å, truncation value 30 kcal/mol).
8. Run PLS initially using leave-one-out (LOO) cross-validation to determine the optimum number of components, ONC (highest q²LOO and lowest sLOO).
9. Run the PLS analysis with the non-cross-validation option, using a column filtering value of 2 kcal/mol, to compute the other statistical parameters (r², F and S) for the contour map.
10. For better visualization, the PLS results are presented as interactive graphics consisting of contour plots of the coefficients of the field variables at each lattice intersection, showing favorable and unfavorable regions in three-dimensional space that are associated with the biological activity (Fig. 2.8).
11. Derive a correlation between the field point values (steric and electrostatic) and the biological activity by using the partial least squares (PLS) method, which

Figure 2.8 CoMFA contour map showing steric (green: favorable; yellow: unfavorable) and electrostatic (red: electronegative; blue: electropositive) field points around the co-crystallized ligand SC-558.

determines the effect of each substituent of the congeneric series on the biological activity. Use the CoMFA descriptors, steric and electrostatic, as independent variables and the pIC50 values as dependent variables.

Table Statistical results of the CoMFA model.
Parameters       CoMFA model
N                6
q2               0.733
PRESS            0.804
r2               0.989
r2bootstrap      0.992
S                0.160
F                418.6
r2pred           0.768
SDEPtest         0.509
Steric           0.297
Electrostatic    0.703

N = number of components, q2 = cross-validation correlation coefficient, PRESS = predictive sum of squared errors, r2 = correlation coefficient, r2bootstrap = bootstrapped correlation coefficient, S = standard error of estimate, F = F ratio, r2pred = predictive r2, SDEPtest = standard deviation of error of prediction.

12. Predict the activity of the training set molecules using the developed CoMFA model.
13. Draw a correlation graph between the predicted and experimental pIC50 values (Fig. 2.9).
For practice, a dataset of 305 diaryl-heterocyclic COX-2 inhibitors can be found in [48].
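For step 13, the agreement between predicted and experimental pIC50 values can be quantified by the squared correlation coefficient of the correlation graph; the value pairs below are synthetic examples, not results from this exercise.

```python
# Squared Pearson correlation between experimental and predicted pIC50 values.

def r_squared(obs, pred):
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)

experimental = [6.1, 6.8, 7.2, 7.9, 8.4]   # synthetic pIC50 values
predicted = [6.0, 6.9, 7.1, 8.1, 8.3]      # synthetic model predictions
print(round(r_squared(experimental, predicted), 3))  # close to 1 for good agreement
```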

Figure 2.9 Correlation graph between the actual and predicted activities of training and test set molecules.



Exercise 2: To derive a QSAR model for nimesulide derivatives using the Hansch approach
1. Collect the molecules with a single substitution and their biological activity data from the literature [49].
2. Enter the data in an Excel sheet.
3. Assign the values of the aromatic substituent constants corresponding to each substituent [50].
4. Take the log of the biological activity data.
5. Consider the biological activity data as the dependent variable and the aromatic substituent descriptors as independent variables.
6. Apply multiple linear regression (MLR) by the forward selection method using the STATISTICA software.
7. Derive the equation and determine the statistical parameters.
8. This analysis will result in the following equation:
Log(BA) = 0.046(0.122)π + 0.489(0.365)logMR + 0.842(0.314)σp + 0.111(0.292)HBA + 0.132(0.200)HBD − 0.801(0.265)   (Equation 1)
n = 18; r = 0.780; sc = 0.296; F = 3.728; prob > F = 0.029
where
n = number of samples in the regression
r = correlation coefficient
r² = coefficient of determination
sc = standard error of the regression (numbers in parentheses)
F = F ratio (used to determine the statistical significance of the derived equation)
prob > F = probability of finding a greater F ratio
9. Find the correlation between the dependent and independent variables using the above equation.

Table Correlation matrix for the parameters used in Equation 1.
          log (BA)   π        log MR   σp       HBA      HBD
log (BA)  1.00
π         −0.31      1.00
log MR    0.52       −0.08    1.00
σp        0.61       −0.47    0.11     1.00
HBA       0.55       −0.41    0.70     0.36     1.00
HBD       0.04       −0.32    0.06     −0.14    −0.03    1.00

Table Dataset for Hansch model along with biological activity and values of aromatic substituent constants.

Compound ID



log (BA)




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20


3.44 2.52 2.70 2.32 2.02 2.19 2.13 2.19 2.02 1.69 1.71 1.04 0.88 0.75 0.76 0.71 0.66 0.52 0.34 0.09

0.537 0.401 0.431 0.365 0.305 0.339 0.328 0.340 0.305 0.228 0.233 0.017 2 0.056 2 0.123 2 0.120 2 0.148 2 0.179 2 0.288 2 0.463 2 1.047

2 0.28 2 0.57 2 0.55 2 1.27 2 1.49 2 1.51 2 0.88 2 1.63 2 0.06 0.61 0.40 0.51

7.36 6.33 11.18 14.57 9.81

0.867 0.801 1.048 1.163 0.992

0.78 0.66 0.50 0.36 0.36

5.02 13.49 21.10 13.82 15.73 17.47 24.12 10.28 13.70 15.83 9.22 18.42 5.65 1.03

0.701 1.130 1.324 1.141 1.197 1.242 1.382 1.012 1.137 1.199 0.965 1.265 0.752 0.013

0.54 0.72 2 0.01 0.00 0.30 0.45 0.07 0.10 0.49 0.48 0.15 0.03 2 0.17 0.00

2 0.38 2 1.53 0.06 0.39 1.07 0.56 0.00




1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0

0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0


% inhibition per μM of drug per kg of body weight. Hansch lipophilic substituent constant. c Molar refractivity (polarizability parameter). d Hammett substituent constant. e Hydrogen bond acceptor. f Hydrogen bond donor. b

Unsolved exercises
Exercise 3: To estimate the individual contribution of each substituent to the analgesic activity and toxicity of the following indan-amine derivatives using the Free Wilson approach [51].


Table (substituent columns listed in extraction order per compound group)

Compounds 1-3: N(CH3)2, N(C2H5)2; ED50 = 0.46, 0.47, 0.30; LD50 = 2.75, 1.90, 5.00
Compounds 4-6: N(CH3)2, N(C2H5)2; ED50 = 0.24, 0.48, 0.46; LD50 = 1.77, 1.55, 5.20
Compounds 7-9: C2H5, C2H5, C2H5; N(CH3)2, N(C2H5)2; ED50 = 0.22, 0.30, 0.46; LD50 = 1.75, 1.58, 5.00
Compounds 11-12: ED50 = 0.21, 0.38; LD50 = 1.28, 4.18
Compounds 14-16: H, N(C2H5)2; C6H5, C6H5, C6H5; ED50 = 0.43, 0.32, 0.70; LD50 = 2.13, 1.28, 2.60
Compounds 17-19: N(CH3)2, N(C2H5)2; C6H5, C6H5, C6H5; ED50 = 0.37, 0.16, 0.53; LD50 = 1.64, 0.85, 3.95
Compounds 20-22: C6H5, H, H; C2H5, C2H5, H; ED50 = 0.18, 0.21, 0.21; LD50 = 1.49, 2.70, 5.10
Compounds 24-26: N(CH3)2, N(C2H5)2; ED50 = 0.46, 0.36, 0.41; LD50 = 2.60, 3.57, 5.00
Compounds 27-29: C2H5, C2H5, C2H5; N(CH3)2, N(C2H5)2; ED50 = 0.37, 0.46, 0.38; LD50 = 6.20, 3.00, 7.40

ED50 and LD50 are expressed as mg/10 g upon intra-peritoneal administration to mice.

For detailed study, please refer an article by Free and Wilson [51].
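The Free-Wilson calculation asked for above can be sketched in pure Python. The data set below is hypothetical and perfectly additive (it is not the exercise data): for a balanced, complete series like this, each substituent's contribution equals its group-mean deviation from the grand mean, which coincides with the least-squares Free-Wilson fit.

```python
from statistics import mean

# Hypothetical, perfectly additive Free-Wilson data: (R1, R2, log 1/C).
data = [
    ("H", "H", 0.2), ("H", "N(CH3)2", 0.3), ("H", "N(C2H5)2", 0.4),
    ("CH3", "H", 0.4), ("CH3", "N(CH3)2", 0.5), ("CH3", "N(C2H5)2", 0.6),
    ("C2H5", "H", 0.6), ("C2H5", "N(CH3)2", 0.7), ("C2H5", "N(C2H5)2", 0.8),
]

mu = mean(a for _, _, a in data)  # grand mean activity

def contributions(position):
    """Mean deviation from the grand mean for each substituent
    at the given position (0 = R1, 1 = R2)."""
    groups = {}
    for row in data:
        groups.setdefault(row[position], []).append(row[2])
    return {s: mean(v) - mu for s, v in groups.items()}

r1, r2 = contributions(0), contributions(1)

# Additivity check: predicted activity = mu + a(R1) + b(R2).
for s1, s2, obs in data:
    pred = mu + r1[s1] + r2[s2]
    print(f"{s1:6s} {s2:10s} obs={obs:.2f} pred={pred:.2f}")
```

With real data the additivity is only approximate, and the deviations between observed and predicted values indicate how well the Free-Wilson assumption holds for the series.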

References
[1] S. Asirvatham, B.V. Dhokchawle, S.J. Tauro, Quantitative structure activity relationships studies of nonsteroidal anti-inflammatory drugs: a review, Arab. J. Chem. (2016).
[2] A.K. Debnath, Quantitative structure-activity relationship (QSAR) paradigm – Hansch era to new millennium, Mini Rev. Med. Chem. 1 (2001) 187–195.
[3] R. Mannhold, P. Krogsgaard-Larsen, H. Timmerman, QSAR: Hansch Analysis and Related Approaches, John Wiley & Sons, 2008.
[4] X. Du, Y. Li, Y.-L. Xia, S.-M. Ai, J. Liang, P. Sang, et al., Insights into protein–ligand interactions: mechanisms, models, and methods, Int. J. Mol. Sci. 17 (2016) 144.
[5] M.S. Salahudeen, P.S. Nishtala, An overview of pharmacodynamic modelling, ligand-binding approach and its application in clinical practice, Saudi Pharm. J. 25 (2017) 165–175.
[6] E. Lionta, G. Spyrou, D.K. Vassilatis, Z. Cournia, Structure-based virtual screening for drug discovery: principles, applications and recent advances, Curr. Top. Med. Chem. 14 (2014) 1923–1938.
[7] J. Schiebel, R. Gaspari, T. Wulsdorf, K. Ngo, C. Sohn, T.E. Schrader, et al., Intriguing role of water in protein-ligand binding studied by neutron crystallography on trypsin complexes, Nat. Commun. 9 (2018).
[8] O. Zupančič, A. Bernkop-Schnürch, Lipophilic peptide character: what oral barriers fear the most, J. Controlled Release 255 (2017) 242–257.
[9] W. Yu, A.D. MacKerell, Computer-aided drug design methods, Antibiotics, Springer, 2017, pp. 85–106.
[10] H. Kubinyi, QSAR: Hansch analysis and related approaches, Trends Pharmacol. Sci. 16 (1995) 280.
[11] M.G. Damale, S.N. Harke, F.A. Kalam Khan, D.B. Shinde, J.N. Sangshetti, Recent advances in multidimensional QSAR (4D-6D): a critical review, Mini Rev. Med. Chem. 14 (2014) 35–55.
[12] H. Kubinyi, Quantitative structure-activity relationships (QSAR) and molecular modelling in cancer research, J. Cancer Res. Clin. Oncol. 116 (1990) 529–537.
[13] A. Cherkasov, E.N. Muratov, D. Fourches, A. Varnek, I.I. Baskin, M. Cronin, et al., QSAR modeling: where have you been? Where are you going to? J. Med. Chem. 57 (2014) 4977–5010.
[14] S. Kausar, A.O. Falcao, An automated framework for QSAR model building, J. Cheminformatics 10 (2018) 1.
[15] R. Cox, D.V. Green, C.N. Luscombe, N. Malcolm, S.D. Pickett, QSAR workbench: automating QSAR modeling to drive compound design, J. Comput. Mol. Des. 27 (2013) 321–336.
[16] R. Diderich, Tools for category formation and read-across: overview of the OECD (Q)SAR Application Toolbox, In Silico Toxicology: Princ. Appl. (2010) 385–405.
[17] T.I. Oprea, A. Tropsha, Target, chemical and bioactivity databases – integration is key, Drug Discov. Today: Technol. 3 (2006) 357–365.
[18] D. Fourches, E. Muratov, A. Tropsha, Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research, J. Chem. Inf. Model. 50 (2010) 1189–1204.
[19] A. Golbraikh, X.S. Wang, H. Zhu, A. Tropsha, Predictive QSAR modeling: methods and applications in drug discovery and chemical risk assessment, Handb. Comput. Chem. (2016) 1–48.
[20] J. Leszczynski, Handbook of Computational Chemistry, Springer Science & Business Media, 2012.
[21] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, John Wiley & Sons, 2008.
[22] A. Kovatcheva, A. Golbraikh, S. Oloff, J. Feng, W. Zheng, A. Tropsha, QSAR modeling of datasets with enantioselective compounds using chirality sensitive molecular descriptors, SAR QSAR Environ. Res. 16 (2005) 93–102.
[23] A. Cherkasov, F. Ban, O. Santos-Filho, N. Thorsteinson, M. Fallahi, G.L. Hammond, An updated steroid benchmark set and its application in the discovery of novel nanomolar ligands of sex hormone-binding globulin, J. Med. Chem. 51 (2008) 2047–2056.
[24] W. Zheng, A. Tropsha, Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle, J. Chem. Inf. Comput. Sci. 40 (2000) 185–194.
[25] J.-L. Faulon, A. Bender, Handbook of Chemoinformatics Algorithms, CRC Press, 2010.
[26] M. Goodarzi, B. Dejaegher, Y.V. Heyden, Feature selection methods in QSAR studies, J. AOAC Int. 95 (2012) 636–651.
[27] G.M. Maggiora, On outliers and activity cliffs – why QSAR often disappoints, ACS Publications, 2006.
[28] R. Guha, J.H. Van Drie, Structure–activity landscape index: identifying and quantifying activity cliffs, J. Chem. Inf. Model. 48 (2008) 646–658.
[29] P.K. Singh, A. Negi, P.K. Gupta, M. Chauhan, R. Kumar, Toxicophore exploration as a screening technology for drug design and discovery: techniques, scope and limitations, Arch. Toxicol. 90 (2016) 1785–1802.
[30] D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, G.-H. Fu, Exploring nonlinear relationships in chemical data using kernel-based methods, Chemometrics Intell. Lab. Syst. 107 (2011) 106–115.
[31] R.A. Berk, Classification and regression trees (CART), Statistical Learning From a Regression Perspective, Springer, 2008, pp. 1–65.
[32] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[33] S. Ajmani, K. Jadhav, S.A. Kulkarni, Three-dimensional QSAR using the k-nearest neighbor method and its interpretation, J. Chem. Inf. Model. 46 (2006) 24–31.
[34] D.W. Salt, N. Yildiz, D.J. Livingstone, C.J. Tinsley, The use of artificial neural networks in QSAR, Pesticide Sci. 36 (1992) 161–170.
[35] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[36] J. Jaworska, N. Nikolova-Jeliazkova, T. Aldenberg, QSAR applicability domain estimation by projection of the training set in descriptor space: a review, Altern. Lab. Anim. 33 (2005) 445–459.
[37] J.-H. Hsieh, X.S. Wang, D. Teotico, A. Golbraikh, A. Tropsha, Differentiation of AmpC beta-lactamase binders vs. decoys using classification kNN QSAR modeling and application of the QSAR classifier to virtual screening, J. Comput. Mol. Des. 22 (2008) 593–609.
[38] H. Zhu, L. Ye, A. Richard, A. Golbraikh, F.A. Wright, I. Rusyn, et al., A novel two-step hierarchical quantitative structure–activity relationship modeling work flow for predicting acute toxicity of chemicals in rodents, Environ. Health Perspect. 117 (2009) 1257–1264.
[39] L. Zhang, H. Zhu, T.I. Oprea, A. Golbraikh, A. Tropsha, QSAR modeling of the blood–brain barrier permeability for diverse organic compounds, Pharm. Res. 25 (2008) 1902.
[40] B. Jhanwar, V. Sharma, R. Singla, B. Shrivastava, QSAR – Hansch analysis and related approaches in drug design, Pharmacol. Online 1 (2011) 306–344.
[41] H. Kubinyi, Free Wilson analysis. Theory, applications and its relationship to Hansch analysis, Quant. Struct.-Act. Relatsh. 7 (1988) 121–133.
[42] D.J. Wood, L. Carlsson, M. Eklund, U. Norinder, J. Stålring, QSAR with experimental and predictive distributions: an information theoretic approach for assessing model quality, J. Comput. Mol. Des. 27 (2013) 203–219.
[43] K. Kim, Comparative molecular field analysis (CoMFA), Molecular Similarity in Drug Design, Springer, 1995, pp. 291–331.
[44] R.D. Cramer, D.E. Patterson, J.D. Bunce, Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins, J. Am. Chem. Soc. 110 (1988) 5959–5967.
[45] G. Klebe, Comparative molecular similarity indices analysis: CoMSIA, 3D QSAR in Drug Design, Springer, 1998, pp. 87–104.
[46] P.A. Datar, E.C. Coutinho, A CoMFA study of COX-2 inhibitors with receptor based alignment, J. Mol. Graph. Model. 23 (2004) 239–251.
[47] A. Rácz, D. Bajusz, K. Héberger, Consistency of QSAR models: correct split of training and test sets, ranking of models and performance parameters, SAR QSAR Environ. Res. 26 (2015) 683–700.
[48] P. Chavatte, S. Yous, C. Marot, N. Baurin, D. Lesieur, Three-dimensional quantitative structure–activity relationships of cyclo-oxygenase-2 (COX-2) inhibitors: a comparative molecular field analysis, J. Med. Chem. 44 (2001) 3223–3230.
[49] W. Wilkerson, A quantitative structure–activity relationship analysis of a series of 2'-(2,4-difluorophenoxy)-4'-substituted methanesulfonanilides, Eur. J. Med. Chem. 30 (1995) 191–197.
[50] B. Skagerberg, D. Bonelli, S. Clementi, G. Cruciani, C. Ebert, Principal properties for aromatic substituents. A multivariate approach for design in QSAR, Quant. Struct.-Act. Relatsh. 8 (1989) 32–38.
[51] S.M. Free, J.W. Wilson, A mathematical contribution to structure-activity studies, J. Med. Chem. 7 (1964) 395–399.


Chapter 3

Small molecule databases: A collection of promising bioactive molecules

3.1 Introduction

The early steps in a modern drug discovery project typically include identifying a biological macromolecule that plays a key role in a disease process, and seeking a low-molecular-weight compound that inactivates this macromolecular target by binding to it with high affinity. Ligand discovery still involves a substantial component of trial and error, and with advances in computer-aided drug design and the accelerated growth of biological screening, large volumes of binding data are generated for each target. As these data grow in both volume and complexity, the need for computational tools to retrieve and analyse them becomes more imperative. When published, these data become a valuable resource for scientists studying the same macromolecular target, and also for those seeking to develop improved computational models of molecular recognition.

3.2 BindingDB

Currently, binding affinity data are published almost exclusively in scientific journals, which provide an archival service but are neither readily accessible in electronic formats nor searchable in useful ways. By providing these missing functionalities, especially to small companies and researchers in academia, a database of measured binding affinities can improve the discovery of targeted ligands. Applications of BindingDB include analysis of the ligands of a specific target to identify chemical features or pharmacophores that correlate with affinity, assistance in the development of quantitative structure-activity relationships, and interpretation of measured entropies and enthalpies of binding in the context of a receptor's 3D structure. It offers the possibility of publishing very large data sets, as well as raw experimental data, which can be useful in assessing data quality. It is a publicly accessible database which contains about 20,000 experimentally determined binding affinities of protein–ligand complexes, for 110 targets (including their isoforms and mutational variants) and about 11,000 small-molecule ligands. The data are obtained from the scientific literature; data collection focuses on proteins that are targets of drugs and for which structural data are present in the Protein Data Bank (PDB). The BindingDB website supports a range of query types,




including searches by chemical structure, substructure and similarity; protein sequence; ligand and protein names; affinity ranges; and molecular weight. Datasets generated by BindingDB queries can be downloaded in the form of annotated SDfiles for further investigation, or used as a basis for virtual screening of a compound database uploaded by the user. The data in BindingDB are linked to the literature in PubMed via PubMed IDs and to structural data in the PDB via PDB IDs, as well as to chemical and sequence searches [1,2].
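Annotated SDfiles such as those exported by BindingDB are plain text: each record ends with a `$$$$` delimiter, and each data item follows a `> <FieldName>` tag line. A minimal sketch of extracting the property fields from such a file; the record content and field names here ("Target Name", "Ki (nM)") are invented for illustration, not BindingDB's exact tags:

```python
# A tiny, hypothetical one-record SDfile (molblock abridged).
sdf_text = """\
example-ligand
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
>  <Target Name>
Trypsin

>  <Ki (nM)>
420

$$$$
"""

def parse_sdf_properties(text):
    """Yield one {field: value} dict per $$$$-delimited record."""
    for record in text.split("$$$$"):
        props, field = {}, None
        for line in record.splitlines():
            if line.startswith(">"):
                # Field name sits between angle brackets: >  <Name>
                field = line[line.find("<") + 1:line.find(">", 1)]
            elif field is not None and line.strip():
                props[field] = line.strip()
                field = None
        if props:
            yield props

records = list(parse_sdf_properties(sdf_text))
print(records[0]["Ki (nM)"])  # -> 420
```

For real work a cheminformatics toolkit that also parses the molblock (atoms and bonds) would be preferable; this sketch only recovers the annotation fields.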

3.2.1 Description

Data collection for BindingDB focuses on targets whose 3D structures are available in the Protein Data Bank (PDB) [3] or can be accurately modeled. Such data are of particular interest because they are useful in structural analysis and are applicable to the development and validation of computational models of binding. This focus overlooks additional drug targets whose structures could be built by comparative modeling, but restricting attention to proteins of known structure permits BindingDB to complement, rather than overlap, other binding databases that collect data for membrane proteins with no 3D structures available. Proteins are prioritized for data collection based upon their importance as drug targets or model systems, as well as the availability of relevant data. Once a protein is selected, relevant scientific articles are analysed and their data are extracted and deposited into BindingDB. Data from multiple laboratories and companies are incorporated in order to obtain a wide range of chemotypes for the targeted protein. Web-accessible forms also allow direct deposition by experimentalists. The majority of the data are based upon enzyme inhibition studies, but a smaller number of data from the more informative method of isothermal titration calorimetry are included. Each data entry includes detailed experimental conditions, such as solution composition, temperature and pH, because these can affect the measured affinities [4]. BindingDB also assembles data for many ligands that are not represented in the PDB: only about 2% of ligands in BindingDB have an exact match in the PDB, and about 15% have 90% similarity to a ligand in the PDB. Thus, BindingDB's data collection differs significantly from databases that gather affinities only for protein–ligand complexes in the PDB, notably Binding MOAD, which holds about 1400 entries, PDBbind with about 1600, and AffinDB with about 750.

3.2.2 Details

The BindingDB website provides a rich set of tools for querying, analysing and downloading binding data. Search capabilities include queries by target name; ligand name; ligand structure, substructure and similarity; affinity range; and target sequence, via BLAST [5]. Query results are presented in a summary table, with the option to drill down to obtain more

detail on a given measurement. Available details include links to PubMed, citation data and the option to obtain all binding data from the same publication; sequence data; and SMILES strings and chemical structures [6]. Hyperlinks to the PDB allow easy navigation to structural data for a given ligand, protein or complex. Additional tools permit the user to build a 'data set', which can be downloaded in the form of an MDL SDfile comprising the chemical structures, target information and affinities. The website also provides web-accessible tools for virtual screening of candidate ligands; we are not aware of any other public website that provides this functionality. The user provides a training set of ligands that are active against a given target or class of targets, either by using queries to form a BindingDB dataset or by uploading an SDfile from disk. The user then uploads his or her own SDfile of candidate ligands, selects one of three machine-learning methods installed on the BindingDB server, and starts the calculation. The software returns the user's ligands ranked, where the top-ranked compounds are most likely to share the activity of the training set of active compounds. The results can be downloaded in the form of an SDfile with the score of each compound; optionally, the compounds in the SDfile can be ranked according to their scores. The three machine-learning methods are as follows.

Maximum similarity: JChem chemical fingerprints are computed for each active compound and for each candidate ligand with default parameters [7]. The software computes the Tanimoto similarity of each candidate compound to each active, and the candidate compounds are ranked according to their maximal similarity to any active.
Binary kernel discrimination (BKD): JChem chemical fingerprints are computed, with default parameters, for each active compound and for decoy compounds that are expected to be inactive. The decoy compounds can either be supplied by the user, or BindingDB can supply a random set of drug-like compounds drawn from the ZINC database [8].

Support vector machine (SVM): As for BKD, a set of active compounds and a set of decoys is first established. The user is then presented with a list of quantitative molecular descriptors to be used for model development and screening; a reasonable default set is suggested by the website to aid the user. Descriptors are computed for all the compounds using Molconn-Z (eduSoftLC), and the descriptor set is then refined to avoid using highly correlated, redundant descriptors [9]. The LibSVM software is then trained on a subset of the actives and decoys, and applied to the remaining active and decoy compounds to generate rankings of the training set and test set, as previously described [10].

Figure 3.1 Protocol followed to predict molecules in the BindingDB database.

Maximum similarity is the fastest of the three methods and the most convenient for very large screening sets. The BKD method is slower, but can recover more diverse actives. The SVM method is also slower than maximum similarity, but is arguably the best at finding actives that differ significantly from the known actives used to train the algorithm (Fig. 3.1).

Availability: BindingDB is freely accessible on the web, and may also be accessed by following links from compounds at PubChem. To download SDfiles, users must complete a simple registration process and agree not to republish the data without explicit permission. Users are invited to contact the BindingDB team through the 'Email us' link to participate in the user forum.
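The maximum-similarity ranking described above can be sketched with toy set-based fingerprints standing in for JChem fingerprints; all compound names and bit sets below are invented:

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical fingerprints: each molecule is a set of hashed feature bits.
actives = {
    "active-1": {1, 4, 7, 9},
    "active-2": {2, 4, 8, 9},
}
candidates = {
    "cand-A": {1, 4, 7, 10},   # close to active-1
    "cand-B": {3, 5, 6, 11},   # unlike any active
    "cand-C": {2, 4, 8, 9},    # identical to active-2
}

# Score each candidate by its best Tanimoto similarity to any active,
# then rank candidates by that score.
scores = {
    name: max(tanimoto(fp, afp) for afp in actives.values())
    for name, fp in candidates.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # -> ['cand-C', 'cand-A', 'cand-B']
```

The "max over actives" step is what distinguishes this ranking from similarity to a single reference compound: a candidate only needs to resemble one of the actives to score well.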

3.3 ChEBI

Chemical Entities of Biological Interest (ChEBI) is an open-access dictionary of molecular entities focused on 'small' chemical compounds. The molecular entities in this

database are mainly natural products or synthetic products used to intervene in the processes of living organisms. Macromolecules such as nucleic acids, proteins and peptides (cleavage products of proteins) are not included in ChEBI. In addition to molecular entities, ChEBI encompasses groups (parts of molecular entities) and classes of entities, and it includes an ontological classification with specified relationships between molecular entities or classes of entities and their parents and/or children. ChEBI is available online. In 2002, a project was initiated at the European Bioinformatics Institute (EBI) to create a definitive, freely available dictionary of Chemical Entities of Biological Interest. ChEBI has now grown to represent more than 12,000 molecular entities, groups and classes. The term 'molecular entity' refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity [11]. A group is a characterized, linked collection of atoms, or a single atom, within a molecular entity [11]. ChEBI consists of classes of molecular entities as well as classes of groups. The scope of ChEBI encompasses not only 'biochemical compounds' but also agrochemicals, pharmaceuticals, isotopes, laboratory reagents and subatomic particles.

3.3.1 Description

The terminology used in ChEBI is 'definitive' in the sense that, where applicable, it is explicitly endorsed by international bodies such as NC-IUBMB and IUPAC. The entire data set is available to all without constraint, as MySQL table dumps and Open Biomedical Ontologies (OBO) format files. Although the initial objective of ChEBI was to standardize biochemical terminology, the need to store and represent 2D chemical structures was recognized from the start. In accordance with the principles outlined above, ChEBI has adopted open standards for the representation of chemical structure, such as the IUPAC International Chemical Identifier (InChI) [12], and will shortly incorporate Chemical Markup Language (CML) [13]. The connectivity and stereochemistry (2D structure) of the majority of the small organic molecules in ChEBI (including isotope-labeled ones) can be represented as InChI strings.

3.3.2 Details

In order to create ChEBI, data from a number of different sources were incorporated and then merged. Data for the initial release were drawn from three main sources:

IntEnz - the Integrated relational Enzyme database of the EBI. IntEnz contains the Enzyme Nomenclature, the recommendations of the NC-IUBMB on the nomenclature and classification of enzyme-catalysed reactions [14].



KEGG COMPOUND - one part of the LIGAND composite database of the Kyoto Encyclopedia of Genes and Genomes (KEGG) [15].

Chemical Ontology - originally developed as 'Chemical Ontology' by Michael Ashburner and Pankaj Jaiswal; the initial alpha release was subsumed into ChEBI and is currently being refined and extended.

ChEBI is designed as a relational database, implemented on an Oracle database server. A number of utility applications, implemented mainly in Java and Unix scripts, provide additional functionality around the database, such as the loading of data from external sources. Specialized web-based interfaces provide both public access to the data and restricted access to the annotation tool. The principal data fields are:

ChEBI ID: a unique and stable identifier for the entity.
ChEBI Name: the name for an entity recommended for use by the biological community.
Definition: where appropriate, the meaning of class names is explained by means of a short verbal 'definition'.
Structural diagrams: ChEBI stores 2D or 3D structural diagrams as connection tables in MDL molfile format. One entity can have one or more connection tables.
IUPAC InChI: a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations [12].
SMILES: SMILES (Simplified Molecular Input Line Entry System) is a simple but comprehensive chemical line notation [16]. SMILES specifically represents a valence model of a molecule and is widely used as a data exchange format.
Formula: where possible, formulae are assigned to entities and groups.
Ontology: every ChEBI entry contains a list of parent and child entries and the names of the relationships between them.
IUPAC name: a name provided for an entity based on current recommendations of IUPAC.
Synonyms: alternative names for an entity which either have been used in EBI or external sources or have been devised by the annotators based on recommendations of IUPAC, NC-IUBMB or their associated bodies. The source of each synonym is clearly identified.

Database cross-references: a field 'Database Links' contains one or more manually entered accession numbers for entries in public databases relevant to the given ChEBI entry.
Registry Number: the Chemical Abstracts Service (CAS) Registry Number is a unique numeric identifier assigned to a substance when it enters the CAS REGISTRY database.
Comment: a free-text comment may be added to some terms, especially in cases where confusing terminology has been used historically.
ChEBI Ontology: ontologies are structured controlled vocabularies; generally they are graph-theoretic structures consisting of 'terms', which form the nodes of the graphs, linked by 'relations', which form the edges between the nodes.
Relationships: there is, in the OBO community, an effort to standardize the relationships used in biomedical ontologies [17]. A significant difference from a 'classic' OBO ontology such as the Gene Ontology is that some of the ChEBI relationships are necessarily cyclic. The relationship 'A is conjugate acid of B' means that the relationship 'B is conjugate base of A' is always true, while the relationships 'E is tautomer of K' and 'R is enantiomer of S' likewise mean that 'K is tautomer of E' and 'S is enantiomer of R' are always true. The members of these cyclic relationships are placed at the same hierarchical level of the ontology. The relationships were introduced out of a need to formalize the differences between terms that are often (incorrectly) used interchangeably, especially in the biochemical literature.
Web access: ChEBI can be accessed via the web.
Web Services: the main aim of ChEBI Web Services is to provide programmatic access to the ChEBI database in order to aid users in integrating ChEBI into their applications. Web Services provide a standard means of interoperating between different software applications [18].
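The cyclic relationships described above can be modelled as directed edges whose assertion automatically implies the inverse edge; the entity names and graph representation below are illustrative, not ChEBI's internal schema:

```python
# Each relation maps to its inverse; tautomer and enantiomer relations
# are their own inverses (symmetric).
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "is tautomer of": "is tautomer of",
    "is enantiomer of": "is enantiomer of",
}

def assert_relationship(graph, subject, relation, obj):
    """Add subject -[relation]-> obj and the implied inverse edge."""
    graph.setdefault(subject, []).append((relation, obj))
    graph.setdefault(obj, []).append((INVERSE[relation], subject))

ontology = {}
assert_relationship(ontology, "acetic acid", "is conjugate acid of", "acetate")
assert_relationship(ontology, "(R)-lactate", "is enantiomer of", "(S)-lactate")

print(ontology["acetate"])  # -> [('is conjugate base of', 'acetic acid')]
```

Materializing both directions is what makes the relationship "cyclic": following the edge from either member leads back to the other, and both entities sit at the same hierarchical level.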
All data in the database and on the FTP server are non-proprietary or derived from a non-proprietary source; they are thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source. Apart from web access, the entire ChEBI data set is provided in four different formats and can be downloaded from the FTP server:

Flat-file table dumps - ChEBI is stored in a relational database and is available as ChEBI tables in a flat-file, tab-delimited format. Various spreadsheet tools are available to import these into a relational database. The files are stored in the same structure as the relational database.


Chapter 3

Oracle binary table dumps - ChEBI provides an Oracle binary table dump that can be imported into an Oracle relational database using the 'imp' command. The parameter file import.par should reside in the same directory when the import is done. The command to execute is: imp database_name/database_password@instance_name PARFILE=import.par

Generic Structured Query Language (SQL) table dumps - ChEBI provides a generic SQL dump consisting of SQL insert statements. The archive file, called generic_dump.zip, consists of 12 files which contain SQL table insert statements for the entire database. The file called compounds.sql should always be inserted first in order to avoid any constraint errors.

OBO ontology format - ChEBI provides the ChEBI ontology in OBO format version 1.2. The open-source ontology editor OBO-Edit can be used to view the OBO file.
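For orientation, an entry in OBO 1.2 format is a plain-text [Term] stanza along the following lines (schematic; the definition text and the parent term shown here are abridged and chosen for illustration):

```
[Term]
id: CHEBI:15377
name: water
def: "..." []
synonym: "H2O" RELATED []
is_a: CHEBI:24431 ! chemical entity (illustrative placement)
```

Each tag (id, name, def, synonym, is_a) corresponds to one of the data fields described in Section 3.3.2, which is why the OBO dump and the relational table dumps carry essentially the same information in different forms.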

3.4 ChemSpider

ChemSpider is a freely accessible chemical database that offers access to varied types of information associated with almost 25 million unique chemical compounds, sourced from and linked to almost 400 separate data sources on the Web [19]. ChemSpider is a search engine layered over terabytes of chemistry data; it is also a crowdsourcing community for chemists, who contribute their skills, data and knowledge to database curation and enhancement. ChemSpider therefore resembles Wikipedia in that it encourages participation and contributions from the community. In May 2009, the Royal Society of Chemistry (RSC) acquired ChemSpider. By aggregating data from nearly 400 different data sources and linking them by means of the chemical structure, which acts as the primary record in the database, ChemSpider has been able to link PubChem, Wikipedia, ChEBI and the Kyoto Encyclopedia of Genes and Genomes (KEGG), a patent database, chemical vendors, and chemistry journals that provide open and closed access to the information. Where possible, each chemical record retains links to the original source of the material, thereby providing micro-attribution. ChemSpider also provides users with source information of particular interest, including where to purchase a chemical, information associated with chemical toxicity, metabolism data, and so on. Aggregating that level of connected information via a classical search engine such as Google would be very time-consuming.

3.4.1 Description

ChemSpider allows registered users to enter information and to annotate and curate the records. Currently, the Chemical Abstracts Service (CAS) is the standard chemical resource [19].

The database is updated regularly with new additions. It is now integrated with the RSC publishing process, whereby new compounds are identified in RSC articles and then deposited and released to the community upon article publication. More than a million name–identifier relationships have been manually or robotically curated [20]. ChemSpider depends on the crowdsourcing activities of the community. ChemSpider is superior to a simple Google search, as the variety of information about a compound provided at ChemSpider is difficult to match on any other free Web site. The data continue to be updated and validated by practicing chemists, and in most cases they have also been reviewed for accuracy. ChemSpider also provides links to further information via other online sources, including Google Scholar, Google Books and Google Patents; Microsoft Academic Search; the RSC publishing Web site and the RSC databases; and an ever-increasing number of government, commercial and academic databases. Recently another feature, ChemSpider SyntheticPages, has been created to provide an online database for accessing information on chemical synthesis procedures. It enables chemists to populate an online database with their chemical reactions and to outline how to perform a reaction. Each reaction is issued a digital object identifier (DOI), making it possible, for example, for students to add this online 'publication' to their resume.

3.5 ChEMBL

ChEMBL is a freely available database of binding, functional and ADMET information for a large number of drug-like bioactive compounds. These data are manually abstracted on a regular basis from the primary published literature, then further curated and standardized to maximize their quality and utility across a wide range of chemical biology and drug-discovery research problems. Currently, the database holds 5.4 million bioactivity measurements for more than 1 million compounds and 5200 protein targets, and is accessible through a web-based interface, data downloads and web services. A wealth of information related to the activity of small molecules and biotherapeutics is available in the literature, and access to this information can enable many types of drug-discovery analysis and decision making [21,22]. Owing to the continuing shift of fundamental research on disease mechanisms from the private to the public sector, access to this information has become increasingly important. However, bioactivity data published in journal articles are commonly found in a relatively unstructured format and are labor-intensive to search and extract. ChEMBL aims to bridge this gap by providing broad coverage across a diverse set of organisms, targets and bioactivity measurements reported in the scientific literature, together with a range of user-friendly search capabilities [23].



3.5.1 Description

The core activity data in the ChEMBL database are manually extracted from the full text of peer-reviewed scientific publications in a wide variety of journals. Each publication is used as a source to abstract details of the compounds tested, the assays performed and any target information for these assays. Structures for small molecules are recorded in full machine-readable format, even though in the original publication a structure is often provided only as a scaffold with a list of R-group substituents, or is referred to only by name. Before loading into the database, structures are analyzed for potential problems and then normalized according to a set of rules to establish consistency of representation; preferred representations are used for certain common groups. Details of all types of assays performed are extracted from each publication, including binding assays (which analyze the interaction of the compound with the target directly), functional assays (which often measure indirect effects of the compound on a pathway, system or whole organism) and ADMET assays (which measure pharmacokinetic properties of the compound, interaction with key metabolic enzymes or toxic effects on cells/tissues). The activity endpoints measured in these assays are recorded with the values and units as reported in the paper but, for the purposes of improved querying, are also standardized where possible by converting them to a common measurement unit for a given activity type. The utility of the bioactivity data is maximized by carrying out detailed manual annotation of targets within ChEMBL. Where the intended molecular target of an assay is reported in a publication, this information is extracted, together with associated details of the organism in which the assay was performed.
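The unit-standardization step described above can be sketched as a conversion of reported values to a common unit (here nM), optionally followed by a negative log transform of the molar activity (cf. ChEMBL's pChEMBL value). The conversion table and helper names are illustrative, not ChEMBL's internal code:

```python
import math

# Multipliers from common reported units to nM (assumed table).
TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}

def standardize(value, unit):
    """Convert a reported activity value to nM."""
    return value * TO_NM[unit]

def p_activity(value_nm):
    """Negative log10 of the molar activity (cf. the pChEMBL value)."""
    return -math.log10(value_nm * 1e-9)

ic50_nm = standardize(0.25, "uM")            # 0.25 uM -> 250 nM
print(ic50_nm, round(p_activity(ic50_nm), 2))  # -> 250.0 6.6
```

Converting everything to one unit is what makes measurements of the same activity type comparable across publications; the log transform additionally puts IC50/Ki-style values on a scale where larger means more potent.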
A ‘multi’ field in the database records cases where it is not clear whether the compound interacts non-specifically with multiple proteins, indicating that less confidence should be placed in those target assignments. Furthermore, protein targets are classified into a manually curated family hierarchy, following the nomenclature commonly used by drug-discovery scientists, while organisms are classified according to a simplified subset of the NCBI taxonomic structure [24]. This also permits the data to be queried at a higher level. Approved drugs: In addition to literature-derived data, ChEMBL contains structures and annotation for Food and Drug Administration (FDA)-approved drugs. For each drug entry, information associated with approved products, including administration routes, trade names, dosage information and approval dates, is incorporated into the database. Structures for novel drug ingredients are manually assigned and, for protein therapeutics, amino acid sequences may be included where available. Each drug is also annotated according to drug type (natural-product-derived small molecule, synthetic small molecule, antibody, protein, oligonucleotide, oligosaccharide, inorganic etc.), whether there are ‘black box’ safety warnings associated with a product containing that active ingredient, whether it is a known prodrug, the earliest approval date (where known), whether it is

dosed as a defined single stereoisomer or racemic mixture, and whether it has a therapeutic application (as opposed to imaging/diagnostic agents, additives etc.). This information allows users of the bioactivity data to assess whether a compound of interest is an approved drug, and therefore likely to have an advantageous safety/pharmacokinetic profile or to be orally bioavailable. The most important entity types within ChEMBL are compounds, targets, documents and assays. Each extracted document possesses a list of associated compound records and assays, linked together by activities (i.e. the actual endpoints measured in the assay, with their types, values and units). Since the same compound may have been tested multiple times in different assays and publications, the compound records are collapsed, based on structure, into a non-redundant molecule dictionary. The Standard IUPAC International Chemical Identifier (InChI) representation is used to determine which compounds are identical and which must be registered with new identifiers [25]. In general, the Standard InChI differentiates stereoisomers of a compound but not tautomers; therefore stereoisomers are given unique identifiers, whereas tautomers are not. A smaller number of protein therapeutics and substances with undefined structures have also been included in the molecule dictionary. Additional information is then associated with the table entries, such as structure representations, synonyms, calculated properties, parent-salt relationships and drug information. Similarly, a non-redundant target dictionary stores a list of the proteins, nucleic acids, subcellular fractions, cell lines, tissues and organisms that are the subject of investigation. Each assay is then mapped to one or more entries in this dictionary and linked to the target dictionary for further information, such as protein family classification.
Each record in the documents, assays, molecule dictionary and target dictionary tables is assigned a unique ChEMBL identifier, which takes the form of a ‘CHEMBL’ prefix immediately followed by an integer. External identifiers and Standard InChI Keys are recorded for these entities where possible [26]; where data are retrieved from other resources, the original identifiers are retained. PubMed identifiers or Digital Object Identifiers (DOIs) are stored for documents [27]. Protein targets are represented by primary accessions in the UniProt protein database [28], and organism targets are assigned NCBI taxonomy IDs and names. ChEMBL contains a much larger proportion of active compounds identified using dose-response assays; the number of distinct protein targets with dose-response measurements recorded in ChEMBL exceeds 4,000. All ChEMBL literature-derived assays are now included in PubChem BioAssay, and a subset of PubChem assays has been loaded into ChEMBL. Similarly, compounds and binding measurements from ChEMBL have been integrated into BindingDB, and BindingDB data are reciprocally incorporated into ChEMBL. The ChEMBL database is available via a simple, user-friendly interface at https://www.ebi. This interface allows one to search for compounds, targets or assays of interest in a number of ways. Alternatively, targets can be browsed as per the protein



family or organism. Since the database includes only protein targets for which bioactivity data are available, users can also carry out a BLAST search of the ChEMBL target dictionary with a protein sequence of interest. This can help identify closely related proteins with activity data even when the sequence of interest itself is not represented in the database. Having retrieved a target, or multiple targets of interest, users can apply a simple drop-down menu to display all associated bioactivity data, or to filter the available data and select activity types of interest. The resulting bioactivity table provides details of each compound tested, the measured activity type, value and units, a description of the assay, details of the target and, importantly, a link to the publication from which the data were extracted. Data from this view can be exported as a text file or spreadsheet for further analysis. Alternatively, users may have a particular compound of interest and can retrieve potency, selectivity or ADMET information for it or for closely related compounds. Compounds can also be searched via a keyword query using names/synonyms or ChEMBL identifiers. The interface provides a choice of several different drawing tools, allowing users to sketch a structure or substructure of interest [29]. A compound similarity or substructure search of the database can also be carried out to retrieve ChEMBL compounds similar to, or containing, the input structure. Having retrieved a list of compounds of interest, users can view and filter a variety of calculated properties, such as molecular weight, calculated lipophilicity and polar surface area, via a graphical display [30]. This can be useful to restrict the set of compounds to those likely to have appropriate drug-like properties before retrieving or filtering the associated bioactivity data [31].
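The property-based filtering just described can be mimicked offline. The sketch below applies illustrative drug-likeness cut-offs (the classic Lipinski-style molecular weight and logP limits, plus a commonly used polar-surface-area ceiling); the thresholds and record fields are assumptions for the example, not ChEMBL's interface defaults.

```python
# Minimal property filter of the kind applied in the ChEMBL interface.
# Thresholds (MW <= 500, logP <= 5, PSA <= 140) are illustrative assumptions.
def drug_like(mw: float, logp: float, psa: float) -> bool:
    """True if the calculated properties fall inside the illustrative limits."""
    return mw <= 500 and logp <= 5 and psa <= 140

# Hypothetical retrieved compounds with precomputed properties.
compounds = [
    {"id": "A", "mw": 320.4, "logp": 2.1, "psa": 75.0},
    {"id": "B", "mw": 612.7, "logp": 6.3, "psa": 180.0},
]
kept = [c["id"] for c in compounds if drug_like(c["mw"], c["logp"], c["psa"])]
# kept == ["A"]
```

In practice one would export the property table from the interface and apply such a predicate in a script or spreadsheet.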
ChEMBL provides report-card pages for each of its main data types (compounds, targets, assays and documents). These pages can be explored to gather further details about the entity of interest, such as names and synonyms, journal/abstract details, drug annotation, structures and calculated physicochemical properties, together with cross-references to other resources. Each report card contains a series of clickable graphical ‘widgets’ summarizing, and providing rapid access to, all of the bioactivity data available for that entity. A table view of approved drugs is also provided, with relevant annotation indicated by a series of sortable icons. Users may download the structures for these drugs or go to the report cards to access further information, such as bioactivity data. While the ChEMBL interface provides the functionality required for many common use cases, some users may prefer to download the database and query it locally. Each release of ChEMBL is openly available in numerous formats, including Oracle, MySQL, an SD file of compound structures and a FASTA file of the target sequences, under a Creative Commons Attribution-ShareAlike 3.0 Unported license. Finally, to allow greater interoperability of the ChEMBL data with molecular interaction and pathway data, a subset of the database is available in PSI-MITAB 2.5 format [32] via PSICQUIC web services [33].
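When working with local downloads of ChEMBL, it is often useful to validate identifiers before joining tables. As noted above, ChEMBL identifiers are a 'CHEMBL' prefix immediately followed by an integer; a small validator for that scheme:

```python
# Validate the ChEMBL identifier scheme described in the text:
# the literal prefix 'CHEMBL' immediately followed by an integer.
import re

_CHEMBL_ID = re.compile(r"^CHEMBL\d+$")

def is_chembl_id(s: str) -> bool:
    """True if s matches the CHEMBL<integer> identifier format."""
    return bool(_CHEMBL_ID.match(s))

# is_chembl_id("CHEMBL25") is True; "chembl25" and "CHEMBL" are rejected.
```

Such a check catches truncated or case-mangled identifiers early when parsing exported files.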

Small molecule databases: A collection of promising bioactive molecules 77

3.6 ZINC

ZINC is a free, non-commercial database of purchasable compounds for rapid testing of drug-design hypotheses; it can be both searched and downloaded online. The database supports multiple protonation models, stereochemistries, tautomeric forms, regioisomeric forms (E/Z isomerism), suppliers, and 3D conformational sampling. Molecules can be annotated using both alphanumeric and numeric data, and the system makes it straightforward to add new molecules, tag or remove those that are no longer available, and fix those that have errors. The database supports quick searching and downloading, and is updated regularly. It was built from ten vendor catalogs, most of which are updated monthly on the Web or on CD-ROM. Molecules with formula weight ≥ 700, calculated LogP ≥ 6 or ≤ −4, number of hydrogen-bond donors ≥ 6, number of hydrogen-bond acceptors ≥ 11, or number of rotatable bonds ≥ 15 are filtered out. All molecules containing atoms other than H, C, N, O, F, S, P, Cl, Br, or I are also removed. These rules are guidelines intended to keep the database loosely in line with current opinion in the field.
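The inclusion rules above translate directly into a predicate. This sketch takes precomputed property values and a list of element symbols as inputs (how those values are computed is outside the scope of the example):

```python
# The ZINC pre-filters described in the text, written as a predicate.
# Property values are assumed to be precomputed; elements is a list of
# element symbols present in the molecule.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F", "S", "P", "Cl", "Br", "I"}

def passes_zinc_filter(mw, logp, hbd, hba, rotb, elements) -> bool:
    """True if a molecule survives the stated ZINC inclusion rules."""
    if mw >= 700 or logp >= 6 or logp <= -4:   # size and lipophilicity limits
        return False
    if hbd >= 6 or hba >= 11 or rotb >= 15:    # H-bonding and flexibility limits
        return False
    return set(elements) <= ALLOWED_ELEMENTS   # organic-element whitelist
```

A molecule with MW 350, LogP 2.5 and modest H-bonding passes, whereas an organosilicon compound or a 750 Da molecule is rejected.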

3.6.1 Description

Molecules are obtained from the compound suppliers as 2D SDF files, which are converted to isomeric SMILES using an OpenEye tool (http://www.eyesopen.com). OpenEye's filter program (version 1.0.2) desalts the molecules and then filters out undesirable ones. Typically, over 70% of compounds are achiral and have no regioisomeric (E/Z) ambiguity. A single substance may be represented by one or more SMILES strings. OpenEye's Omega program is used to generate initial 3D models from the unambiguous isomeric SMILES. Schrödinger's LigPrep program is employed to create relevant, correctly protonated forms of the molecule within the pH range 5-9.5. The semi-empirical quantum mechanical program AMSOL computes the partial atomic charges and atomic desolvation penalties for each 3D conformation corresponding to each protonation state, stereoisomer, and tautomer [34]. 3D conformations are generated by OpenEye's Omega and distilled into a flexibase format using the program mol2db [35]. Omega is employed because it computes accessible conformations relatively accurately and efficiently [36]; calculating small-molecule conformations remains an active area of research in the field. ZINC molecules are annotated by molecular property, including molecular weight, calculated LogP, number of rotatable bonds, number of hydrogen-bond donors and acceptors, number of chiral centers, number of chiral double bonds (E/Z isomerism), polar and apolar desolvation energy (in kcal/mol), number of rigid fragments, and net charge. The octanol-water partition coefficient (calculated LogP) of each molecule loaded into ZINC is computed with a fragment-based implementation by Molinspiration, which agrees well with experimentally measured LogP for a diverse test set of


Chapter 3

molecules [37]. Before filtering, the calculated LogP from OpenEye's implementation of Wang's algorithm is used to decide whether a molecule should be loaded into ZINC, because it is an integral part of OpenEye's filtering tools [38]. Each molecule is also annotated with the vendor and original catalog number for each commercial source of that compound, and molecules can be annotated for function or activity when such information is available. Molecules processed using this protocol, whether from a vendor's catalog or as the result of a Web-originated request, are loaded into the relational database using a Perl script.

3.6.2 Details

The ZINC database is designed as a relational database that supports efficient loading, incremental updating, querying, and subsetting of the data. A purely relational structure is fast, efficient and, in the case of MySQL, free. However, exporting subsets of a relational database is slow; to address this problem, molecule subsets are exported into ready-to-download compressed files, and database-intensive work is scheduled in batch mode. Once prepared, subsets can be downloaded rapidly, completely bypassing the relational database. MySQL 4.0 and the Perl DBI/DBD toolkit are used, and OpenEye's depict tool (part of the Ogham suite) renders the 2D depictions. The Cactvs suite and Molinspiration software are used for canonicalization, proofreading, and property calculations. ZINC presently contains 727,842 purchasable compounds, and this number continues to grow. Of these, 494,915 are Lipinski-compliant [39], with the caveat that Molinspiration's LogP is used in place of cLogP. Of these, 202,134 are "lead-like" molecules, having molecular weight in the range 150-350, calculated LogP ≤ 4, number of hydrogen-bond donors ≤ 3, and number of hydrogen-bond acceptors ≤ 6. A total of 34,224 molecules are "fragment-like", with calculated LogP between −2 and 3, ≤ 3 hydrogen-bond donors, ≤ 6 hydrogen-bond acceptors, ≤ 3 rotatable bonds, and molecular weight ≤ 250. Single violations of the Lipinski rules in ZINC are mostly owing to high calculated LogP values. The number of rotatable bonds is another widely followed metric of suitability for screening; more than half of the molecules in ZINC have five or fewer. A Web server is available to distribute the database, allowing investigators to search, subset, browse, and download some or all of the molecules in SMILES, mol2, SDF, and DOCK flexibase formats [35].
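The "lead-like" and "fragment-like" criteria quoted above can be expressed as a small classifier. Property values are again assumed to be precomputed; the fragment-like test is applied first since its limits are the stricter of the two.

```python
# Classifier for the ZINC subset definitions given in the text.
# Inputs are precomputed: molecular weight, calculated LogP, H-bond donors,
# H-bond acceptors, and rotatable bonds.
def zinc_subset(mw, logp, hbd, hba, rotb) -> str:
    """Label a molecule as fragment-like, lead-like, or other."""
    if mw <= 250 and -2 <= logp <= 3 and hbd <= 3 and hba <= 6 and rotb <= 3:
        return "fragment-like"
    if 150 <= mw <= 350 and logp <= 4 and hbd <= 3 and hba <= 6:
        return "lead-like"
    return "other"
```

For instance, a 200 Da molecule with LogP 1.5 and two rotatable bonds classifies as fragment-like, while a 300 Da molecule with LogP 3.5 and six rotatable bonds is lead-like only.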
Users can also upload and process their own molecules on the web server. The ZINC Web server runs on a dual-processor 2.4 GHz Xeon server, has a similar machine dedicated solely to running MySQL, and can draw on a 50-CPU 2.4 GHz Xeon Linux cluster for processing. Users may search the ZINC database by several criteria. Limits on molecular properties such as molecular weight and net charge may be specified on the left-hand side of the search Web page. On the bottom left, individual ZINC registration codes (the unique serial number assigned to each substance in ZINC) may be specified, either by typing them in or by choosing a text file of codes to upload from the browsing computer; molecules matching any of the specified ZINC codes will be found. A constraint on the compound vendor can also be specified. On the right, the Java Molecular Editor (JME) can be used to draw molecular substructures [40]. A list of SMILES strings in a text file can also be uploaded and searched [16]. The Database Browser can be used to review the results of a search: it displays molecules in a table including the ZINC registration code, a 2D sketch, purchasing information, and molecular properties such as the number of rotatable bonds and calculated LogP. Clicking on a vendor's catalog number links to the vendor's e-commerce Website, where available. The following options are also available: (a) download individual molecules or the set of all matched molecules in mol2, SMILES, SDF, or DOCK flexibase format; (b) download a table of molecular properties and purchasing information for analysis in a spreadsheet; and (c) create a subset for docking or download. Many users may be interested in only some of the molecules in ZINC. The ZINC Web pages permit the download of subsets by vendor and by other criteria, such as Lipinski-compliant, "lead-like", and "fragment-like" compounds. The search page may be used to download small subsets immediately or to create user-defined subsets using arbitrary criteria, including functional groups and molecular properties. Once prepared, each subset is available in SMILES, mol2, SDF, and DOCK flexibase formats. Large files are broken into slices of approximately 20-100 MB for easier download. In the limit, the entire ZINC database may be downloaded, and the same infrastructure can be used to process one's own molecules.
Users may upload their own molecules to the ZINC server in SMILES, SDF, or mol2 format and have them processed using the same protocol used to build ZINC. The uploaded molecules subsequently appear as a subset for download in the usual way and disappear from the server after a week. Through the ZINC database, 3D molecules can be obtained in various formats compatible with most docking programs. To accelerate experimental testing, this straightforward tool provides direct links to e-commerce systems for purchasing compounds online. The interface also allows tables of data to be downloaded to a spreadsheet, enabling users to graph properties and spot trends within the database.

3.7 PubChem

PubChem is an open repository of chemical structures and their biological test results. It was established with the aim of discovering chemical probes, through high-throughput screening of small molecules that modulate the activity of gene products [41]. PubChem comprises three related databases: Substance, Compound and BioAssay.



The Substance database (primary accession: SID) includes sample descriptions provided by depositors, whereas the Compound database (primary accession: CID) contains the unique chemical structures extracted from the substance depositions. The PubChem BioAssay database (primary accession: AID) includes bioactivity screens of the chemical substances described in PubChem, and serves as the public repository for the results of biological screening contributed by the NIH Molecular Libraries Program, industrial companies and other research organizations [42]. A BioAssay data entry comprises contributed bioactivity descriptions and test results, such as percentage inhibition of activity, generated by one assay protocol. Nearly 30 academic institutions, government agencies, research laboratories and industrial assay vendors have deposited biological test results, either generated by HTS campaigns or extracted from the literature, to the PubChem BioAssay repository. The PubChem BioAssay database currently holds more than 1,400 bioassay depositions and 45 million biological activity outcomes for over 700,000 compounds with unique chemical structures. PubChem can be accessed through the NCBI Entrez system. In Entrez's PubChem Compound database, one can search for a compound by chemical synonym, and link to a list of screening experiments involving that compound via the 'BioAssays' link. One can also find all bioassay tests for a specific target by querying the protein target name. The Entrez 'Limits' facility allows construction of a specific query based on one's research need. Furthermore, PubChem provides a set of web-based tools that combine chemical and biological activity information and support in-depth data analysis and navigation, facilitating the identification of chemical probes and biologically interesting targets within the PubChem databases. For these tools, PubChem implements a queuing mechanism to balance the web requests.
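Besides the Entrez interface, PubChem records can be addressed by accession programmatically. The sketch below builds request URLs following the public PUG REST path scheme (base URL and path layout should be verified against the current PubChem documentation):

```python
# Sketch of building PubChem PUG REST request URLs with the standard library.
# The base URL and path layout follow the public PUG REST scheme, but should
# be verified against the current PubChem documentation.
import urllib.parse

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def assay_summary_url(cid: int) -> str:
    """URL for the BioAssay summary of one compound (CID) in JSON."""
    return f"{PUG_REST}/compound/cid/{cid}/assaysummary/JSON"

def cid_lookup_url(name: str) -> str:
    """URL resolving a chemical synonym to its CID(s) in JSON."""
    return f"{PUG_REST}/compound/name/{urllib.parse.quote(name)}/cids/JSON"
```

Fetching these URLs (for example with `urllib.request.urlopen`) returns JSON that can be parsed and joined on SID/CID/AID accessions.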
The BioAssay information system provides exploratory data analysis tools built on 'summary' biological test results, alongside the diverse, assay-specific screening descriptions and test results. For each tested chemical sample, PubChem requires a summary result that defines the bioactivity outcome and bioactivity score. The PubChem bioactivity outcome summary comprises five categories: chemical probe, active, inactive, inconclusive and unspecified. PubChem uses these summary results to integrate the bioactivity analysis tools, provide a comprehensive review of biological tests, compare biological activity data from multiple screenings, and further explore structure-activity relationships. Numerous publicly available databases provide bioactivity information and data-mining tools; PubChem additionally provides a structure-activity analysis tool that permits one to derive and compare the activity profile of a compound, with quantitative bioactivity data, across a broad range of targets. Overall, the PubChem bioactivity analysis services are clearly advantageous in several respects. These services are seamlessly

integrated and permit users to set and refine the research focus as needed during data analysis. Leveraging the powerful facilities of the NCBI Entrez system allows one to exploit the relationships between the data content of PubChem and other NCBI databases. Other unique features of the PubChem bioactivity analysis services include facilities supporting test-result drill-down, registration-free bulk download of screening results and structures of tested chemicals, and the similarity matrices used in various data analyses. Multiple entry points are provided for these services to support chemistry-, bioassay- or molecular-target-centric analysis.
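The five-category outcome summary described above lends itself to a simple aggregation: tallying, per compound, how often each outcome was reported across assays. This is a toy aggregation for illustration, not PubChem's actual scoring logic, and the CIDs used are arbitrary.

```python
# Toy aggregation of per-assay outcomes into the five PubChem summary
# categories named in the text; not PubChem's actual scoring logic.
from collections import Counter

OUTCOMES = {"chemical probe", "active", "inactive", "inconclusive", "unspecified"}

def summarize_outcomes(records):
    """Count bioactivity outcomes per compound CID across assays."""
    summary = {}
    for cid, outcome in records:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        summary.setdefault(cid, Counter())[outcome] += 1
    return summary

# Hypothetical (CID, outcome) records from several assays.
records = [(2244, "active"), (2244, "inactive"),
           (2244, "active"), (3672, "inconclusive")]
summary = summarize_outcomes(records)
# summary[2244]["active"] == 2
```

Such a tally is the starting point for the compound-centric bioactivity views discussed below.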

3.7.1 Description

The PubChem BioAssay Summary service is the primary service for presenting depositor-provided information, represented by a PubChem BioAssay accession (AID). This includes a summary of data attribution, assay description, experimental protocol, depositor comments, screening outcome methodology and definitions of the reported readouts. It also includes depositor-supplied cross-links to tested substance samples, hit compounds, protein and gene targets, PubMed publications, and other information resources. Overall, it provides a comprehensive description of a bioassay report, helping users to understand the scientific goal of the experiment, the biological background of the testing system, the assay technology exploited, the discoveries made by the screening test, as well as the thresholds used for deriving bioactivity outcomes and possible sources of artifacts. The PubChem BioAssay database tracks and archives each update of an assay submission. The PubChem BioActivity Summary tool permits one to aggregate all available screening results and readily examine and compare biological outcomes across multiple assays for one or more tested compounds or substances. It reports and summarizes the available screening bioactivity outcomes for a single chemical sample or a set of them. Depending on the user's goals, the bioactivity summary can be switched between substance-centric and compound-centric views. If centered on substance descriptions, the tool provides a summary view of all available biological tests and the respective bioactivity outcomes contributed by a single organization. Substances deposited by the MLSMR (NIH Molecular Libraries Small Molecule Repository) may be screened by multiple screening centers within the NIH Molecular Libraries Program.
If centered on compound descriptions, the service provides a detailed view of the biological activity summary by aggregating all screening data across multiple contributors for the unique chemical structures. The BioActivity Summary service provides powerful selection and revision features that enable one to rapidly refocus the analysis by modifying the set of selected compounds and assays. Additionally, bioassays with similar bioactivity profiles, or with similar protein target sequences, can be added to the selection.



To focus on confirmatory assays, or on assays with specific molecular targets, one may use the filtering features in the 'Other Filters' pop-up menu. A common entry point for the BioActivity Summary tool is a single PubChem compound summary record: invoking the tool from such a record generates an overview of all biological screenings performed for that compound. The Structure-Activity Analysis tool can then be used for subsequent analysis, enabling evaluation of the SAR and bioactivity profile of an analog series. Other entry points include the NCBI Entrez 'DocSum' reports for PubChem substance, compound and bioassay records, where the BioActivity Summary tool can be invoked for each individual record, as well as for an entire data set resulting from an Entrez search, by using the explicit 'BioActivity Analysis' link or by clicking the double six-membered-ring icon in the 'Tools' area. For example, one may start, in Entrez's PubChem Compound database, with a compound submitted to PubChem by a journal article reporting specific enzyme inhibitors; to verify the reported inhibition activity, one can compare the published bioactivity information with the biological tests deposited in PubChem. Alternatively, one may start a structure search with a given substructure using the service provided at http://pubchem.ncbi., and launch the BioActivity Summary tool for the resulting compound set via the link described above. In another scenario, one may search the PubChem BioAssay database for all available screening tests for a particular target, then use the BioActivity Summary tool to examine the bioactivity outcomes from each screening experiment, compare the hit lists and compile a library of bioactive compounds for the target. Users can also access this analysis tool through the common gateway of the PubChem BioActivity Analysis Service.
From this entry point, the set of assays and substances/compounds subject to the analysis can be specified by entering an ID list, providing a text file containing the IDs (comma-separated, or one ID per line), or referring to an Entrez search history. The BioActivity Summary analysis results are saved on a temporary server and remain available only for a limited period of time, usually 48 hours. However, the state of the analysis can be saved using the 'SaveView' feature to aid scientific communication; the analysis can be resumed by importing the saved file through the web server at http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=qfile under the common gateway of the PubChem BioActivity Analysis Service. Overall, the BioActivity Summary service aims to provide insight into the activity profiles of compounds using multiple screening test results, and offers an efficient platform for defining and gathering an interesting set of compounds and a panel of assays for further analysis.

3.8 DrugBank

DrugBank is a comprehensive repository of drug, drug-target and drug-action information, maintained and enhanced by extensive

literature surveys performed by skilled biocurators and domain experts. The quality, breadth and uniqueness of its data have made DrugBank particularly popular and highly regarded among medicinal chemists, pharmaceutical researchers, clinicians, educators and the general public. Because most of the data in DrugBank are expertly curated from primary literature sources, it has become the reference drug data source for a number of well-known databases such as PharmGKB, KEGG, ChEBI, GeneCards, PubChem, PDB, Wikipedia and UniProt. Since its first release in 2006, DrugBank has been continuously evolving and rapidly expanding in order to meet the growing demands and changing needs of its users. The first version of DrugBank was limited to data on selected Food and Drug Administration (FDA)-approved drugs and their drug targets [43]. Pharmacological, molecular-biological and pharmacogenomic data were added in DrugBank 2.0, along with a significant increase in the number of approved and experimental drugs [44]. DrugBank 3.0, released in 2010, included data on drug-food and drug-drug interactions, metabolic enzymes and transporters, as well as pharmacokinetic and pharmacoeconomic information [45]. DrugBank has since been augmented to capture the increasing body of quantitative knowledge about drugs, and improved technologies for characterizing drugs, their metabolites and their downstream effects. Notably, significant improvements and large-scale additions have been made in the areas of QSAR, ADMET, pharmacogenomics and pharmacometabolomics. Existing information about drug salt forms, structures, names, targets and actions has also been updated, and the database has been expanded with numerous approved and experimental drugs, along with a number of new data fields describing each drug.
New search tools have also been developed or improved to make information easier to find. Many of these enhancements were driven by user feedback and suggestions.

3.8.1 Description

Fundamentally, DrugBank is a dual-purpose bioinformatics-cheminformatics database with a strong focus on analytic, quantitative and molecular-scale information about both drugs and their targets. In many respects, it combines the data-rich molecular biology content normally found in curated sequence databases such as Swiss-Prot and UniProt [46] with the equally rich data found in chemical reference handbooks and medicinal chemistry textbooks. To compile, confirm and validate this comprehensive collection of data, several hundred journal articles, more than a dozen textbooks, nearly 30 different electronic databases, and at least 20 in-house or web-based programs were individually searched, accessed, compared, written or run over the course of four years. The team of DrugBank archivists and annotators included a physician, two accredited pharmacists and three bioinformaticians with dual training in molecular biology/chemistry and computing science.



DrugBank currently contains >4,100 drug entries, corresponding to >12,000 different trade names and synonyms. These drug entries are selected according to the following rules: the molecule must be non-redundant, must have more than one type of atom, must have a known chemical structure and must be identifiable as a drug or drug-like molecule by more than one reputable data source. To facilitate more targeted research and exploration, DrugBank is divided into four major categories: (i) FDA-approved small molecule drugs (>700 entries), (ii) FDA-approved biotech (protein/peptide) drugs (>100 entries), (iii) nutraceuticals or micronutrients such as vitamins and metabolites (>60 entries) and (iv) experimental drugs, including de-listed drugs, unapproved drugs, illicit drugs, enzyme inhibitors and potential toxins (3,200 entries). These individual 'drug types' are also bundled into two larger categories: all compounds (experimental + FDA + nutraceuticals) and all FDA-approved drugs. DrugBank's coverage of nontrivial FDA-approved drugs is 80% complete. In addition, these drug entries are linked to >14,000 protein (i.e. drug target) sequences. More complete information about the drug targets, numbers of drugs and non-redundant drug targets (including their sequences) is available on the DrugBank 'download' page. The entire database, comprising text, sequence, structure and image data, occupies nearly 16 GB, most of which can be freely downloaded. DrugBank is a fully searchable, web-enabled resource with many built-in tools and features for viewing, sorting and extracting drug or drug-target data. The DrugBank homepage provides detailed instructions on where to locate and how to use these browsing/search tools. As with any web-enabled database, DrugBank supports standard text queries (through the text search box located on the home page).
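The four inclusion rules above can be made concrete as a checklist. The record fields below (`structure`, `atom_types`, `sources`) are hypothetical, chosen only to express the rules; they are not DrugBank's actual schema.

```python
# The four stated DrugBank inclusion rules as a checklist over a hypothetical
# candidate record: known structure, non-redundant, more than one atom type,
# and identified as a drug by more than one reputable source.
def accept_drug_entry(entry: dict, existing_structures: set) -> bool:
    """Apply the four stated criteria to a candidate drug record."""
    has_structure = bool(entry.get("structure"))
    non_redundant = entry.get("structure") not in existing_structures
    multi_atom_types = len(set(entry.get("atom_types", []))) > 1
    multi_source = len(entry.get("sources", [])) > 1
    return has_structure and non_redundant and multi_atom_types and multi_source

entry = {"structure": "CC(=O)Oc1ccccc1C(=O)O",   # aspirin SMILES, for illustration
         "atom_types": ["C", "O", "H"],
         "sources": ["FDA", "KEGG"]}
ok = accept_drug_entry(entry, existing_structures=set())
# ok is True; the same entry re-submitted against an existing copy is rejected
```

The redundancy check here keys on the structure string; in practice a canonical identifier such as an InChI Key would be the safer join key.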
General database browsing is offered using the 'Browse' and 'PharmaBrowse' buttons located at the top of each DrugBank page. To facilitate this, DrugBank is divided into synoptic summary tables which, in turn, are linked to more detailed 'DrugCards' (in analogy to the very successful GeneCards concept [47]). All of the database summary tables can rapidly be browsed, sorted or reformatted (using up to six different criteria) in a manner similar to viewing PubMed abstracts. Clicking on the DrugCard button found in the leftmost column of any given DrugBank summary table opens a webpage describing the drug of interest in much greater detail. Each DrugCard entry contains >80 data fields, with half of the information devoted to drug/chemical data and the other half devoted to drug target or protein data. In addition to providing comprehensive numeric, sequence and textual data, each DrugCard also contains hyperlinks to other databases, digital images, abstracts and interactive applets for visualizing molecular structures. Beyond the general browsing features, DrugBank also offers a more specialized 'PharmaBrowse' feature, designed for physicians, pharmacists and medicinal chemists who tend to think of drugs in clusters of indications or drug classes. This browsing tool provides navigation hyperlinks to >70 drug classes, which in turn list the FDA-approved drugs associated with each class. Each drug name is then linked to its respective DrugCard.

Small molecule databases: A collection of promising bioactive molecules

An important feature distinguishing DrugBank from other online drug resources is its comprehensive support for higher-level database searching and selection functions. In addition to the data viewing and sorting features already described, DrugBank provides a local BLAST search that supports both single and multiple sequence queries, a Boolean text search [48], a chemical structure search utility and a relational data extraction tool [49]. These can all be accessed via the database navigation bar located at the top of every DrugBank page. The BLAST search (SeqSearch) is particularly useful, as it potentially allows users to quickly and simply identify drug leads from newly sequenced pathogens. Specifically, a new sequence, a group of sequences or an entire proteome may be screened against DrugBank's database of known drug target sequences by simply pasting the FASTA-formatted sequence (or sequences) into the SeqSearch query box and pressing the 'submit' button. A significant hit reveals, through the associated DrugCard hyperlink, the name(s) or chemical structure(s) of potential drug leads that may act on that query protein (or proteome). DrugBank's structure similarity search tool (ChemQuery) can be used in a manner similar to its sequence search tools. Users may sketch (through ACD's freely available chemical sketching applet) or paste a SMILES string [16] of a possible lead compound into the ChemQuery window. Submitting the query launches a structure similarity search that looks for substructures of the query compound matching DrugBank's database of known drug or drug-like compounds. High-scoring hits are presented in a tabular format with hyperlinks to the corresponding DrugCards. The ChemQuery tool allows users to quickly determine whether their compound of interest acts on the desired protein target.
This kind of chemical structure search may also reveal whether the compound of interest may unexpectedly interact with unintended protein targets. In addition to these structure similarity searches, the ChemQuery utility also supports compound searches on the basis of chemical formula and molecular weight ranges. DrugBank's data extraction utility (Data Extractor) employs a simple relational database system that allows users to select one or more data fields and to search for ranges, occurrences or partial occurrences of words, strings or numbers. The Data Extractor uses clickable web forms so that users may intuitively construct SQL-like queries. Using a few mouse clicks, it is relatively simple to construct very complex queries ('find all drugs less than 600 daltons with LogPs less than 3.2 that are antihistamines') or to build a series of highly customized tables. The output from these queries is provided in HTML format with hyperlinks to all associated DrugCards.

Exercise: Build a library of small neutral compounds containing a sulfonamide using the ZINC database.



Step-by-step protocol:
1. Open the ZINC database on your PC/laptop.
2. In the ZINC search page, draw a sulfonamide group into the JME editor.
3. Click "Save SMILES" or simply type the SMARTS pattern "NS(=O)(=O)" directly into the SMILES field.
4. Using the molecular properties fields, specify a maximum molecular weight of 300.
5. Input a minimum and a maximum molecular charge of zero.
6. A browser displaying the ~2600 purchasable compounds in ZINC satisfying these constraints will appear.
7. Click on "Download table" to bring up a spreadsheet in Excel.
8. If the range of values of this subset is satisfactory, return to the ZINC Database Browser and download this subset in mol2, SDF, SMILES, or flexibase format by clicking on the appropriate button at the top of the page. Vendor information is included with this subset.
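For readers who want to reproduce the spirit of this exercise offline, the sketch below applies the same three constraints (sulfonamide pattern, molecular weight ≤ 300, neutral charge) to a handful of SMILES strings. It is a deliberately naive, purely textual stand-in: a real workflow should use a cheminformatics toolkit, since substring matching on SMILES is not a true substructure search, the weight estimate counts heavy atoms only, and neutrality is approximated by rejecting explicit +/- charges.

```python
import re

# Naive textual stand-in for the ZINC exercise filter; all simplifications
# are stated in the lead-in above. The example SMILES are illustrative.
WEIGHTS = {"C": 12.01, "N": 14.01, "O": 16.00, "S": 32.07,
           "F": 19.00, "CL": 35.45, "BR": 79.90}

def heavy_atom_weight(smiles):
    """Rough molecular weight from heavy atoms only (hydrogens ignored)."""
    atoms = re.findall(r"Cl|Br|[CNOSFcnos]", smiles)
    return sum(WEIGHTS[a.upper()] for a in atoms)

def passes_filter(smiles, max_weight=300):
    has_sulfonamide = "S(=O)(=O)N" in smiles or "NS(=O)(=O)" in smiles
    is_neutral = "+" not in smiles and "-" not in smiles
    return (has_sulfonamide and is_neutral
            and heavy_atom_weight(smiles) <= max_weight)

library = ["CS(=O)(=O)N",          # methanesulfonamide-like: kept
           "c1ccccc1",             # benzene: no sulfonamide
           "NS(=O)(=O)c1ccccc1"]   # benzenesulfonamide-like: kept
kept = [s for s in library if passes_filter(s)]
```

Here only the two sulfonamide-containing entries survive the filter.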

References

[1] X. Chen, M. Liu, M.K. Gilson, BindingDB: a web-accessible molecular recognition database, Combinatorial Chem. High Throughput Screen. 4 (2001) 719–725.
[2] X. Chen, Y. Lin, M.K. Gilson, The binding database: overview and user's guide, Biopolymers: Original Res. Biomol. 61 (2001) 127–141.
[3] H.M. Berman, The protein data bank: a historical perspective, Acta Crystallogr. Sect. A: Found. Crystallogr. 64 (2008) 88–95.
[4] J. Zhang, M. Aizawa, S. Amari, Y. Iwasawa, T. Nakano, K. Nakata, Development of KiBank, a database supporting structure-based drug design, Comput. Biol. Chem. 28 (2004) 401–407.
[5] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.
[6] D. Weininger, A. Weininger, J.L. Weininger, SMILES. 2. Algorithm for generation of unique SMILES notation, J. Chem. Inf. Comput. Sci. 29 (1989) 97–101.
[7] F. Csizmadia, JChem: Java applets and modules supporting chemical database handling from web browsers, J. Chem. Inf. Comput. Sci. 40 (2000) 323–324.
[8] J.J. Irwin, B.K. Shoichet, ZINC – a free database of commercially available compounds for virtual screening, J. Chem. Inf. Model. 45 (2005) 177–182.
[9] R.N. Jorissen, M.K. Gilson, Virtual screening of molecular databases using a support vector machine, J. Chem. Inf. Model. 45 (2005) 549–561.
[10] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm (2012).
[11] A.D. McNaught, Compendium of Chemical Terminology, Blackwell Science, Oxford, 1997.
[12] A. McNaught, The IUPAC international chemical identifier, Chem. Int. (2006) 12–14.
[13] P. Murray-Rust, H.S. Rzepa, Chemical markup, XML, and the World Wide Web. 4. CML schema, J. Chem. Inf. Comput. Sci. 43 (2003) 757–772.
[14] A. Fleischmann, M. Darsow, K. Degtyarenko, W. Fleischmann, S. Boyce, K.B. Axelsen, et al., IntEnz, the integrated relational enzyme database, Nucleic Acids Res. 32 (2004) D434–D437.
[15] M. Kanehisa, S. Goto, M. Hattori, K.F. Aoki-Kinoshita, M. Itoh, S. Kawashima, et al., From genomics to chemical genomics: new developments in KEGG, Nucleic Acids Res. 34 (2006) D354–D357.

[16] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988) 31–36.
[17] B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, et al., Relations in biomedical ontologies, Genome Biol. 6 (2005) R46.
[18] R.G. Côté, P. Jones, R. Apweiler, H. Hermjakob, The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries, BMC Bioinf. 7 (2006) 97.
[19] H.E. Pence, A. Williams, ChemSpider: An Online Chemical Information Resource, ACS Publications, 2010.
[20] K.M. Hettne, A.J. Williams, E.M. van Mulligen, J. Kleinjans, V. Tkachenko, J.A. Kors, Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining, J. Cheminf. 2 (2010) 3.
[21] J. Mestres, E. Gregori-Puigjané, S. Valverde, R.V. Solé, The topology of drug–target interaction networks: implicit dependence on drug properties and target families, Mol. Biosyst. 5 (2009) 1051–1057.
[22] M.J. Keiser, V. Setola, J.J. Irwin, C. Laggner, A.I. Abbas, S.J. Hufeisen, et al., Predicting new molecular targets for known drugs, Nature 462 (2009) 175.
[23] W.A. Warr, ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI), J. Comput. Aided Mol. Des. 23 (2009) 195–198.
[24] E.W. Sayers, T. Barrett, D.A. Benson, E. Bolton, S.H. Bryant, K. Canese, et al., Database resources of the National Center for Biotechnology Information, Nucleic Acids Res. 38 (2009) D5–D16.
[25] S.E. Stein, S.R. Heller, D.V. Tchekhovskoi, An open standard for chemical structure representation: the IUPAC chemical identifier, in: International Chemical Information Conference, 2003.
[26] P. De Matos, R. Alcántara, A. Dekker, M. Ennis, J. Hastings, K. Haug, et al., Chemical entities of biological interest: an update, Nucleic Acids Res. 38 (2009) D249–D254.
[27] N. Paskin, Digital object identifier (DOI®) system, Encycl. Library Inf. Sci. 3 (2010) 1586–1592.
[28] U. Consortium, Ongoing and future developments at the Universal Protein Resource, Nucleic Acids Res. 39 (2010) D214–D219.
[29] P. Ertl, Molecular structure input on the web, J. Cheminf. 2 (2010) 1–9.
[30] P. Ertl, B. Rohde, P. Selzer, Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties, J. Med. Chem. 43 (2000) 3714–3717.
[31] C.A. Lipinski, F. Lombardo, B.W. Dominy, P.J. Feeney, Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv. Drug Deliv. Rev. 23 (1997) 3–25.
[32] S. Kerrien, S. Orchard, L. Montecchi-Palazzi, B. Aranda, A.F. Quinn, N. Vinod, et al., Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions, BMC Biol. 5 (2007) 44.
[33] B. Aranda, H. Blankenburg, S. Kerrien, F.S. Brinkman, A. Ceol, E. Chautard, et al., PSICQUIC and PSISCORE: accessing and scoring molecular interactions, Nat. Methods 8 (2011) 528.
[34] B.Q. Wei, W.A. Baase, L.H. Weaver, B.W. Matthews, B.K. Shoichet, A model binding site for testing scoring functions in molecular docking, J. Mol. Biol. 322 (2002) 339–355.
[35] D.M. Lorber, M.K. Udo, B.K. Shoichet, Protein–protein docking with multiple residue conformations and residue substitutions, Protein Sci. 11 (2002) 1393–1408.
[36] J. Boström, J.R. Greenwood, J. Gottfries, Assessing the performance of OMEGA with respect to retrieving bioactive conformations, J. Mol. Graph. Model. 21 (2003) 449–462.
[37] A. Jarrahpour, J. Fathi, M. Mimouni, T.B. Hadda, J. Sheikh, Z. Chohan, et al., Petra, Osiris and Molinspiration (POM) together as a successful support in drug design: antibacterial activity and biopharmaceutical characterization of some azo Schiff bases, Med. Chem. Res. 21 (2012) 1984–1990.
[38] R. Wang, Y. Fu, L. Lai, A new atom-additive method for calculating partition coefficients, J. Chem. Inf. Comput. Sci. 37 (1997) 615–621.
[39] C.A. Lipinski, Drug-like properties and the causes of poor solubility and poor permeability, J. Pharmacol. Toxicol. Methods 44 (2000) 235–249.



[40] P. Ertl, O. Jacob, WWW-based chemical information system, J. Mol. Struct.: THEOCHEM 419 (1997) 113–120.
[41] E.A. Zerhouni, Clinical research at a crossroads: the NIH roadmap, J. Investig. Med. 54 (2006) 171–173.
[42] A.J. Harmar, R.A. Hills, E.M. Rosser, M. Jones, O.P. Buneman, D.R. Dunbar, et al., IUPHAR-DB: the IUPHAR database of G protein-coupled receptors and ion channels, Nucleic Acids Res. 37 (2008) D680–D685.
[43] D.S. Wishart, C. Knox, A.C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, et al., DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res. 34 (2006) D668–D672.
[44] D.S. Wishart, C. Knox, A.C. Guo, D. Cheng, S. Shrivastava, D. Tzur, et al., DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res. 36 (2007) D901–D906.
[45] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, et al., DrugBank 3.0: a comprehensive resource for 'omics' research on drugs, Nucleic Acids Res. 39 (2010) D1035–D1041.
[46] A. Bairoch, R. Apweiler, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, et al., The universal protein resource (UniProt), Nucleic Acids Res. 33 (2005) D154–D159.
[47] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, D. Lancet, GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support, Bioinformatics (Oxford, England) 14 (1998) 656–664.
[48] D. Oppenheimer, A. Ganapathi, D. Patterson, USENIX Symposium on Internet Technologies and Systems, Seattle, WA, 2003.
[49] S. Sundararaj, A. Guo, B. Habibi-Nazhad, M. Rouani, P. Stothard, M. Ellison, et al., The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli, Nucleic Acids Res. 32 (2004) D293–D295.


Chapter 4

Database exploration: Selection and analysis of target protein structures

4.1 Introduction

Protein science is entering a new era that promises to unlock many of the mysteries of the cell's inner functioning. Next-generation sequencing is transforming the way DNA information is accessed and, as the variety of protein assays that can be linked to a DNA or RNA read-out grows, protein information is being collected at an increasing rate. Novel insights into the mechanics of large protein assemblies are now also accessible through electron microscopy. However, this wealth of molecular data will be worth little without being available to and interpretable by the scientific community. Structural bioinformatics is therefore becoming an increasingly important component of modern drug discovery [1].

4.2 Protein databases

4.2.1 UniProt: the Universal Protein knowledgebase

UniProt is a long-standing collection of databases that enables scientists to retrieve the vast amount of sequence and functional information associated with proteins. The Swiss-Prot, TrEMBL and PIR protein database activities united to form the Universal Protein Knowledgebase (UniProt) consortium, providing the scientific community with a single, centralized, authoritative resource for protein sequences and functional information. It is a high-quality database meant to serve as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. The database contains over 60 million sequences, of which over half a million have been curated by experts who critically review experimental and predicted data for each protein. The remaining sequences are annotated automatically by rule-based systems built on this expert-curated knowledge. UniProt also includes a pipeline to remove redundant, highly similar proteomes that cause excessive redundancy; the initial run of this pipeline decreased the number of sequences in UniProt by 47 million. To help with the interpretation of genomic variants, detailed protein information is provided for the major genome browsers. A SPARQL endpoint is also provided, which allows complex queries over more than 22 billion triples of data in UniProt. UniProt resources can be accessed via the UniProt website. The UniProt databases consist of three database layers:

UniProt Archive (UniParc): the most comprehensive publicly accessible non-redundant protein sequence collection available.

UniProt Knowledgebase (UniProtKB): provides a central database of protein sequences with consistent annotations and functional information. It includes information about gene names, protein names, organism, taxonomic classification and protein attributes including sequence length and status. General annotations, ontologies and sequence annotations/features include information about function, post-translational modifications (PTMs), subcellular locations of the protein, diseases associated with deficiencies or abnormalities of the protein, enzyme-specific information, the role of the protein in biological, cellular and molecular processes, polymorphism and similarities to other proteins. It also includes information about secondary structure, quaternary structure and the use of a protein in biotechnological processes and as a pharmaceutical drug. It provides cross-references to external data collections such as nucleotide sequence databases (DDBJ/EMBL/GenBank), 2D PAGE, various protein domain and family characterization databases, 3D protein structure databases, PTM databases, species-specific data collections, variant databases, disease databases, etc. Thus, UniProt is a central hub providing biomolecular information archived in more than 50 databases cross-referenced in UniProt.

UniProt NREF databases (UniRef): three databases created by automatic procedures. NREF 100 is a comprehensive non-redundant sequence collection clustered by sequence identity and taxonomy with source attribution. NREF 90 and NREF 50 are built from NREF 100 to provide non-redundant sequence collections for faster homology searches.
All records from all source organisms with mutual sequence identity of >90% or >50%, respectively, are merged into a single record on the basis of UniProt Knowledgebase records. The UniProtKB is the central resource that combines UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains over 550,000 annotated protein sequences. For these entries, experimental information is extracted from the literature, organized and summarized, making it much easier for scientists to access protein information. UniProtKB/TrEMBL contains some 60 million sequences, largely derived from high-throughput sequencing of DNA; these entries are annotated by rule-based automatic annotation systems. A series of UniRef databases is also provided that

Database exploration: Selection and analysis of target protein structures 91 afford sequence sets trimmed at various levels of sequence identity [2]. Finally, the UniProt Archive (Uni-Parc) is also available that provides a complete set of known sequences, including historical obsolete sequences [3]. Features The accelerating growth of sequenced genomes present great challenges for databases with their major advancement being driven by sequencing of very similar or almost identical strains of the same bacterial species. UniProtKB saw an exponential growth by 2014, reaching a peak of 90 million sequences with a high level of redundancy (e.g. 4080 proteomes of Staphylococcus aureus comprising 10.88 million proteins). New opportunities for proteome-wide analysis and interpretation is provided to users with this wealth of protein information. However, it creates challenges in capturing, searching, preserving and presenting proteome data to the scientific community. In 2015, a method to identify and remove highly redundant bacterial proteomes within species groups was developed. Briefly, the method establishes redundant proteomes by performing pairwise alignment of sets of sequences for pairs of proteomes and subsequently applies graph theory to find dominating sets that result in removal of proteomes that cause a minimal loss of information. The UniProt proteome database ( provide access to proteomes for over 56,000 species with completely sequenced genomes. The majority of these proteomes are based on the translation of genome sequence submissions to the International Nucleotide Sequence Database Consortium (INSDC) but also include predictions from Ensembl, RefSeq reference genomes and vector/parasite specific databases like Vector Base [4] and Worm Base ParaSite [5]. 
Some proteomes may also include protein sequences based on high-quality cDNAs that cannot be mapped to the current genome assemblies; these have been manually reviewed following supporting evidence and/or careful analysis of homologous sequences from closely related organisms. Proteomes can include both manually reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries. The proportion of reviewed entries varies between proteomes and is clearly greater for the proteomes of intensively curated model organisms. The unreviewed records are updated by automatic annotation systems for every release, ensuring consistency and up-to-date annotations. A proteome identifier is used that uniquely identifies the set of proteins corresponding to a single assembly of a completely sequenced genome. It is important to realize that there may be additional records for a particular species that do not belong to the defined proteome. As the number of new proteomes in UniProt increases, a selection of species is provided as reference proteomes representing a broad coverage of the tree of life. These reference proteomes are chosen through consultation with the research community or computationally determined from proteome clusters, based on an algorithm that considers the best overall annotation score, an automatically calculated score which provides a heuristic measure of



the annotation content of a proteome. There are currently 5631 reference proteomes that represent a cross-section of the taxonomic diversity found in UniProtKB. They are the central point of both manual and automatic annotation, aiming to give the best annotated protein sets for the selected species. They include model organisms and other proteomes of interest to biomedical and biotechnological research. To augment the growing set of reference proteomes, UniProt has established pan proteomes, analogous to the pan genome concept [6]. A pan proteome is the full set of proteins expressed by a group of highly similar organisms. Pan proteomes provide a representative set of all the sequences within a taxonomic group and capture unique sequences not included in the group's reference proteome. UniProtKB pan proteomes encompass all non-redundant proteomes and are aimed at users interested in the study of genome evolution, phylogenetic comparisons and gene diversity. For each reference proteome cluster, also known as a representative proteome group [7], a pan proteome is the set of all the sequences in the reference proteome plus the unique protein sequences found in other species or strains of the cluster but not in the reference proteome. Pan proteomes are available as files of FASTA-formatted sequences on the FTP site.

Manual curation progress

Expert literature-based curation is gathered in the UniProtKB/Swiss-Prot section of the UniProt Knowledgebase (UniProtKB), which constitutes a cornerstone of UniProt. UniProtKB/Swiss-Prot provides high-quality annotation for experimentally characterized proteins and serves as a comprehensive catalogue of information on proteins. It comprises more than 550,000 curated proteins, including all protein-coding genes for a number of key organisms, and contains in-depth information extracted from more than 210,000 fully curated publications.
Expert curation is by far the most reliable method to compile gold-standard information and provide an up-to-date knowledgebase incorporating experimental information. More than a million articles are indexed every year in PubMed; strict prioritization of articles and proteins for curation therefore becomes imperative. A key challenge in literature selection is the identification of relevant high-quality articles that allow for the comprehensive curation of a protein. Literature curation of post-translational modifications (PTMs) and their consequences is a priority due to their important role in generating protein complexity and regulating protein activity, thus controlling many biological processes. Curation of experimentally determined PTMs from the literature is crucial, as many PTMs cannot be reliably predicted by computational tools. Over the years, experimental PTMs have been catalogued in UniProtKB/Swiss-Prot, a catalogue that can be utilized both for the development of high-quality training sets to enhance bioinformatics algorithms and as an essential library for the identification of proteins by proteomics. Priority is given to articles that describe new PTMs and/or characterize the effects of modifications. In addition, to

complement this approach, a semi-automatic pipeline was developed for the integration of high-throughput proteomics data, distinct from expert curation, which adds PTMs from manually evaluated large-scale proteomics publications [8]. In UniProtKB entries, the 'PTM/Processing' section contains PTM information, characterizing the modified residues and summarizing what is known about the PTMs. Descriptions of the modified residues are organized in a machine-readable format, with the use of standardized vocabularies being essential to organize knowledge for subsequent retrieval. Four annotation types define distinct types of amino acid modifications: Modified residue, Disulfide bond, Cross-link and Glycosylation. Each PTM annotation is associated with a controlled vocabulary established in collaboration with the RESID database [9] and the Proteomics Standards Initiative Protein Modification ontology (PSI-MOD) [10]. A list of all terms used in UniProtKB is available in the ptmlist.txt document. Traceability of data is crucial, as every new piece of knowledge is associated with the original source of the information and the type of supporting evidence, employing the Evidence and Conclusion Ontology [11]. Together with describing amino acid modifications on the protein sequence in a structured format, the 'PTM' subsection also provides a complete summary of the available PTM information [12]. Complete and consistent annotation has to be ensured: in the course of PTM curation, curators must also check that the annotation content of enzymes that mediate modifications is up to date. To date, more than 450 different types of PTMs have been described in UniProtKB/Swiss-Prot, 45,000 experimentally proven modification sites have been curated and more than 22,000 UniProtKB/Swiss-Prot entries contain experimental information on PTMs.
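As a minimal sketch of what such a machine-readable PTM annotation might look like in code (the record fields are illustrative assumptions, not UniProt's actual file format; only the four annotation type names come from the text, and ECO:0000269 is the ECO term for experimental evidence):

```python
# The four UniProtKB PTM annotation types named in the text; everything else
# about this record structure is an invented illustration.
PTM_TYPES = {"Modified residue", "Disulfide bond", "Cross-link", "Glycosylation"}

def make_ptm_annotation(ptm_type, position, description, evidence_code):
    """Build a simple, machine-readable PTM annotation record."""
    if ptm_type not in PTM_TYPES:
        raise ValueError(f"unknown PTM annotation type: {ptm_type}")
    return {"type": ptm_type, "position": position,
            "description": description, "evidence": evidence_code}

ann = make_ptm_annotation("Modified residue", 15, "Phosphoserine", "ECO:0000269")
```

Restricting records to a controlled set of annotation types is what makes such data reliably retrievable later, which is the point the text makes about standardized vocabularies.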
Two complementary rule-based systems have also been developed, through which the unreviewed protein sequences of UniProtKB/TrEMBL are automatically annotated with a high degree of accuracy. The first system, UniRule, consists of annotation rules devised by the biocurators as part of the process of curating the experimental literature for UniProtKB/Swiss-Prot. UniRule generation follows several priorities, i.e. focusing on using and annotating new functional data of interest for proteomes, such as enzymes and pathways, and expanding the coverage into new taxonomic and protein families. The developers ensured that the supporting technical infrastructure enables these rules to be accurately and efficiently created, applied and maintained. UniRule is complemented by the Statistical Automatic Annotation System (SAAS), a completely automatic decision-tree-based system in which rules are derived automatically from UniProtKB/Swiss-Prot entries that share common annotations and characteristics. Both UniRule and SAAS utilize the hierarchical InterPro classification of protein family and domain signatures [13] as a basis for protein classification and functional annotation. These rules share a common syntax that specifies the predicted annotations, comprising protein



nomenclature, function, subcellular location and catalytic residues, and the necessary conditions, such as the requirement for conserved functional residues and motifs. InterPro integrates signatures from the HAMAP [14] and PIRSF [15] projects within the UniProt Consortium. In addition to the ongoing rule-based systems development, another automatic annotation approach is also followed, focusing on enriching the feature annotation (FT) of UniProtKB/TrEMBL. The SAMs (Sequence Analysis Methods) are a suite of methods from external providers (currently Coils v2.2, Phobius, TMHMM v2.0 and SignalP v4.0) employed to automatically generate sequence features such as Chain, Signal, Transmembrane and Coil regions. The results of these methods are combined and refined as per the UniProt standards with the addition of the appropriate UniProtKB annotation. The new predictions are propagated into all the UniProtKB/TrEMBL records that have no feature predictions from UniRule. Another recent advancement in this collaboration is the annotation of domain predictions in UniProtKB/TrEMBL entries from the InterPro member databases PROSITE, SMART and Pfam; the name of the domain is taken from UniProtKB/Swiss-Prot annotation or InterPro entry names. The resulting enrichment has increased significantly with these complementary approaches. Annotation retrieved from the automatic annotation systems is labeled with an evidence attribution indicating the specific source rule/method, both on the UniProt website and in the downloadable UniProtKB files.

Website

UniProt follows a user-centered design process, involving many users worldwide with varied research backgrounds and use cases, to augment its website and add new features. UniProt FTs describe the sites and regions of biological interest, such as an enzyme's active and binding sites, PTMs, domains, etc., which play an important role in understanding what a protein does. With the growth in biological data, the integration and visualization of different data aspects have been addressed.
With the growth in biological data, integration and visualization different data aspects have been covered. Users can zoom into and out of an area by simply clicking on a feature that in turn will trigger a pop-up with more information about the feature, such as the feature description, position, and any available evidence. The view can be customized to hide or show feature tracks. The viewer display all features of a UniProtKB entry and includes additional mapped features from large-scale studies, currently for variants and proteomics data. The viewer is available for every UniProtKB protein entry through the link ‘Feature viewer’ under the ‘Display’ heading on the left hand side of an entry page. One can query the data by taxonomy, genome and proteome identifiers. Later, filter the results for either reference or non-redundant proteomes. The result table indicates the number of UniProt entries for each proteome, that can either viewed or download in a range of formats by user. Individual proteome pages provide a short overview with details about the organism and genome assembly. The components of the genome assembly have been listed in a table where it is possible to view or download the UniProt entries for the

selected components or the entire proteome. A download link for the pan proteome is also provided in case a proteome is part of one. UniProt continues to adapt its data gathering, processing and display to improve the utility and availability of protein information for the benefit of all.
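UniProtKB sequence data discussed throughout this section is commonly distributed in FASTA format, whose header line follows the pattern `>db|Accession|EntryName ProteinName OS=... OX=... GN=... PE=... SV=...` (with `sp` marking reviewed Swiss-Prot entries and `tr` unreviewed TrEMBL entries). A small parser for that pattern is sketched below; the accession and values in the example header are made up.

```python
import re

# Parse a UniProtKB-style FASTA header into its component fields.
def parse_uniprot_header(header):
    m = re.match(r">(sp|tr)\|([^|]+)\|(\S+)\s+(.*)", header)
    if not m:
        raise ValueError("not a UniProtKB header")
    db, accession, entry_name, rest = m.groups()
    # OS/OX/GN/PE/SV key=value pairs; values may contain spaces (e.g. OS)
    fields = dict(re.findall(
        r"(OS|OX|GN|PE|SV)=(.+?)(?=\s+(?:OS|OX|GN|PE|SV)=|$)", rest))
    return {"reviewed": db == "sp", "accession": accession,
            "entry_name": entry_name, **fields}

hdr = (">sp|P00000|EXMP_HUMAN Example protein "
       "OS=Homo sapiens OX=9606 GN=EXMP PE=1 SV=2")
info = parse_uniprot_header(hdr)
```

The lazy match plus lookahead is what lets multi-word values such as the organism name (`OS=Homo sapiens`) be captured without swallowing the next key.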

4.2.2 Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)

The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) provides a structural view of biology to aid research and education, by developing tools and resources. It is the single archive of experimentally determined structures of nucleic acids, proteins and complex assemblies. Presently, it comprises >84,000 entries, derived data files and related data dictionaries. The RCSB PDB web site uses the curated 3D macromolecular data contained in the PDB archive to offer unique methods to access, report and visualize data. Recent activities concentrate on improving methods for simple and complex PDB data searching, generating specialized access to chemical component data and providing domain-based structural alignments. New educational resources, such as author profiles that display a researcher's PDB entries in a timeline, are displayed at PDB-101, the educational view of the main web site for exploring a structural view of biology. The RCSB PDB provides access to the experimentally determined structures of nucleic acids, proteins and complex assemblies in the PDB [16]. With >570,000 files, the PDB requires >130 GB of storage space. Data are updated weekly and loaded into the relational database that supports the website. The PDB is maintained by the members of the Worldwide PDB (wwPDB): the RCSB PDB (USA) [17], PDB in Europe (PDBe) [18], PDB Japan (PDBj) [19] and BioMagResBank [20]. Data are deposited to the PDB, curated and annotated following wwPDB standards, and then made available on an FTP server. Each wwPDB partner offers unique 'views' of PDB data through the different query, analysis and visualization tools provided on their respective web sites.
To better serve these interests, users can customize the RCSB PDB home page and individual 'Structure Summary' pages by moving relevant data widgets [21] to different locations on the page, while areas of less interest are hidden or minimized. PDB data can be searched in many different ways. Simple searches can be performed using the top menu bar, including searches by molecule name, author name, sequence or ligand ID. Queries with multiple constraints, such as 'find all protein homodimers bound to DNA', can be built using 'Advanced Search'. The 'Browse Database' option permits exploration of the PDB archive using different hierarchical trees. Browsers are available to search for related terms and structures based on many different classifications, such as Biological Process, Cellular Component and Molecular Function [22], Enzyme Commission number, the Transporter Classification System [23], and the structure classifications SCOP [24] and CATH [25]. Data distribution summaries, shown as pie charts and lists of hyperlinks, are available for standard features of PDB entries (resolution, release date, experimental method, polymer type, organism and taxonomy). These drill-down distributions provide another way to browse and select data from any search results or from the whole archive.
Summary of the search result: The result of the search is summarized to include information regarding the molecular description, citation, source and related PDB entries. It also includes information from external data sources, i.e. external domain annotations and structural biology knowledge-based data.
Sequence display: Includes the composition of the secondary structures of the protein, with the full sequence displayed corresponding to SCOP, DSSP and PDB.
Sequence similarity data: Sequence entities in the PDB are clustered on the basis of sequence similarity.
3D-similarity result: Provides data on 3D structural similarities, along with P-value, RMSD, % identity, % similarity, etc.
Biology and Chemistry Reports: Provides structure information, protein information from UniProt, and genetic details including genetic source and genome information.
Method Details: Provides the details of the method employed for solving the crystal structure of the query protein.
Geometry: Includes information about the B-factor, bond lengths, bond angles and dihedral angles.
Other tools:
Download Files: Files of structures, sequences and ligands can be downloaded in various formats, including PDB, mmCIF, XML and SD file formats.
Compare Structures: Structures can be compared on the basis of pairwise sequence alignment and pairwise structure alignment, employing various methods.
Drug and Drug-Target Mapping: Maps drugs and drug targets to the corresponding PDB entries.
Query results can be refined, used to explore individual structures and exported to generate interactive and tabular reports. Tabular report features include online data sorting, column customization, filtering and output to other report formats. These reports also contain data from, and links to, external resources. User feedback is an important influence on the evolution of the RCSB PDB resource.
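The geometry-related quantities above are ultimately read from the coordinate records of a downloaded file. As a hedged sketch (plain Python, relying on the standard fixed columns of the PDB format; the two ATOM lines are fabricated for illustration), one can parse ATOM records and summarize B-factors like this:

```python
def parse_atoms(pdb_text: str):
    """Parse ATOM/HETATM records from PDB-format text.

    Uses the standard fixed columns of the PDB format (coordinates in
    columns 31-54, B-factor in columns 61-66, 1-indexed).
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "resname": line[17:20].strip(),
                "x": float(line[30:38]),
                "y": float(line[38:46]),
                "z": float(line[46:54]),
                "bfactor": float(line[60:66]),
            })
    return atoms

def mean_bfactor(atoms) -> float:
    """Average temperature factor over the parsed atoms."""
    return sum(a["bfactor"] for a in atoms) / len(atoms)

# Two fabricated ATOM records, for demonstration only.
sample = (
    "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00 25.00\n"
    "ATOM      2  CB  ALA A   1      12.560   6.800  -6.100  1.00 30.00\n"
)
```

Running `mean_bfactor(parse_atoms(sample))` on the two fabricated atoms averages their B-factors of 25.00 and 30.00.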

Web site features
Simple searches: The most common use of the web site is the simple text search. To further improve text search, an autocomplete feature has been added to guide the user to more specific results. After typing a few letters in the top bar, a suggestion box organizes specific result sets into different categories. Each suggestion, which includes the number of results, links to the set of matching structures. Some of the suggestions use external data resources, such as the NCBI organism taxonomy tree [26]. The top bar search is context-specific and intelligently detects the type of user input. Entering a sequence text string in the search box returns possible Basic Local Alignment Search Tool (BLAST) [27] search options. Chemical formulas and SMILES strings [28] are also recognized. If the suggestions are not what the user is looking for, it is still possible to perform a standard text search of the PDB entry (in mmCIF format) by pressing enter or clicking on the search icon. Top bar simple searches can also be limited to specific categories by selecting the 'Author', 'Macromolecule', 'Sequence' or 'Ligand' icon. The 'Author' icon restricts the search to the names of depositors or primary citation authors. The 'Macromolecule' icon returns structures based on polymer names from the PDB and associated entries in cross-referenced sequence databases such as UniProtKB [29]. The 'Sequence' icon reveals a link to additional options for selecting the method and parameters for a sequence search. Similarly, the 'Ligand' icon links to further options, including a chemical structure editor to draw a structure, and a form to search for ligands by name, identifier, formula and molecular weight.
Advanced search features: Advanced Search expands on the search functionality of the top bar searches by employing additional and more specific data categories.
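Before turning to Advanced Search: the context-sensitive detection of top-bar input described above can be mimicked with a simple heuristic. The sketch below (Python) uses our own simplified rules, not the site's actual logic, to distinguish a PDB ID, a protein sequence and a SMILES string:

```python
import re

AMINO = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acid letters

def classify_query(text: str) -> str:
    """Very rough guess at what a search-box input represents."""
    t = text.strip()
    # PDB IDs: a digit followed by three alphanumeric characters.
    if re.fullmatch(r"[0-9][A-Za-z0-9]{3}", t):
        return "pdb_id"
    # Protein sequences: a long run of standard amino acid letters.
    if len(t) >= 20 and set(t.upper()) <= AMINO:
        return "sequence"
    # SMILES: chemistry-style punctuation such as bonds, rings, branches.
    if re.fullmatch(r"[A-Za-z0-9@+\-\[\]()=#/\\%.]+", t) and set(t) & set("=#()[]"):
        return "smiles"
    return "text"
```

Anything that matches none of the patterns falls back to a plain text search, much as the real site falls back to standard text queries.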
Advanced Search can combine multiple searches of specific types of data with a logical AND or OR. The result is a list of structures that comply with ALL or ANY of the search criteria, respectively. Advanced Search options include: 'All/Experimental Type/Molecule Type' to quickly access all PDB entries or a subset based on experimental and macromolecular type or structure determination/phasing method; 'Link Records' to find structures containing inter-residue connectivity (LINK records in PDB entries) that cannot be inferred from the primary structure; structures determined by electron microscopy for which experimental data files are available in the PDB or at the Electron Microscopy Data Bank [30]; and Pfam ID [31]. All Advanced Search query results can be further refined, filtered to remove similar sequences or used to generate reports.
Structure alignments: Sequence and structure alignments are standard methods for investigating the evolutionary and functional relationships between proteins [32]. The Protein Comparison Tool offers a number of sequence and structure alignment algorithms to perform detailed analysis of pairwise relationships [33]. Additional algorithms can be accessed by submitting alignments to some of the leading external web servers [34]. The Protein Comparison Tool has also been used to provide pre-calculated alignments, which are updated weekly with new incoming protein structures [33]. The first version of this tool was based on alignments of whole protein chains; recently, this has been refined to provide alignments on a domain basis. The domain-based calculation extends the sequence clustering approach: to remove redundancy, a 40% sequence identity clustering procedure based on complete polypeptide chains is applied, and a representative chain is selected from each sequence cluster [17]. If the representative chain contains multiple domains, each is included. SCOP 1.75 domain assignments are used when available; otherwise, assignments are computed using the Protein Domain Parser (PDP). Pairwise alignments of the domains are performed with the jFatCat version [33] of FatCat [32]. For each PDB entry, the '3D Similarity' tab provides a visual summary of the protein chains. Fig. 4.1 highlights how the residues listed in the sequence (SEQRES) and in the atom records (ATOM) map onto the relevant parts of the UniProtKB sequence, along with annotations from DSSP [35], SCOP, PDP [36] and Pfam [37]. The results of the pre-calculated database searches are shown in a table that displays the most important calculated alignment scores. For multi-domain proteins, it is possible to switch between the results for different domains by selecting a domain from the pull-down menu above the table, or by clicking on a domain in the sequence image. The results table can be sorted and filtered, and links to the 3D structure alignment in Jmol [38] and to information about similar domains.
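The 40% sequence-identity clustering used to remove redundancy can be illustrated in miniature. The sketch below (Python) applies a greedy, representative-based scheme to pre-aligned, equal-length toy sequences; the production pipeline is of course far more elaborate (BLAST-based comparisons, gapped alignments), so treat this purely as a conceptual example:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences (gap '-' positions excluded)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def greedy_cluster(seqs, threshold=40.0):
    """Assign each sequence to the first cluster representative it matches
    at or above the identity threshold; otherwise it founds a new cluster."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if percent_identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters
```

With three toy sequences, two near-identical chains fall into one cluster and an unrelated chain founds its own, mirroring how one representative per cluster is then carried forward.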
Ligand reporting and visualization: Information about the chemistry and structure of all small-molecule components found in the PDB is contained in the Chemical Component Dictionary maintained by the wwPDB [39]. Specialized ligand queries can be made using the top bar search or Advanced Search. Special support is also offered for the analysis of ligands associated with PDB entries.

Figure 4.1 Residues listed in the sequence (SEQRES).

The RCSB PDB web site builds on the functionality developed for the small-molecule resource Ligand Expo [40] by providing special support for the analysis of ligands associated with PDB entries. Any ligands included with a PDB entry are listed in the 'Ligand Chemical Component' widget of the entry's 'Structure Summary' page. This area displays the name and formula of each ligand, links to the summary page for the ligand, and provides access to 3D visualization of the ligand in the context of that particular PDB entry using the Ligand Explorer viewer [41]. For non-trivial ligands, a PoseView [42] interaction diagram shows which atoms or areas of the ligand and the polymer interact with each other, as well as the type of interaction. 'Ligand Summary' pages are organized into widgets similar to the Structure Summary pages for individual PDB entries, highlighting different types of hyperlinked information. These widgets provide an overview of the ligand, with links to PDB entries where the component appears as a non-standard component of a polymer or as a non-polymer, links to ligand summary pages for similar ligands and stereoisomers, 2D and 3D visualization, and links to many external resources. Ligand Summary pages also display information about molecules that have been annotated as having sub-components. Ligand Summary Reports are generated for query result sets and can be downloaded as a text file or a spreadsheet. These reports include information about the selected ligands, such as formula, molecular weight, name and SMILES string, as well as which PDB entries are related to the ligand and how they are related. Each ligand included in the report can be expanded to show a sub-table of all related PDB entries that contain the ligand, the entries that contain the ligand as a free ligand, and entries that contain the ligand as part of a polymer.
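Ligand report fields such as formula and molecular weight are related by simple arithmetic. The sketch below (Python) is a hypothetical helper with a deliberately tiny table of standard atomic weights; it computes an approximate molecular weight from a formula string of the kind shown in ligand reports:

```python
import re

# Approximate standard atomic weights (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "P": 30.974, "S": 32.06}

def formula_weight(formula: str) -> float:
    """Molecular weight from a formula like 'C9 H8 O4' or 'C9H8O4'.

    Raises KeyError for elements missing from the small table above;
    a real implementation would cover the full periodic table.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula.replace(" ", "")):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total
```

For instance, aspirin's formula C9H8O4 sums to roughly 180.16 g/mol, matching the weight reported on its ligand page.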
Visualization of molecular surfaces
Protein Workshop [41] is one of several 3D molecular viewers offered from the RCSB PDB web site. It offers quick default styles and views, with additional appearance options. Chains and atoms can be selected either by clicking on the structure or from a tree display of the molecules. Protein Workshop supports molecular surfaces to assist in the display of quaternary structure, binding sites and protein–protein interactions. Surfaces are created for all macromolecule chains in a PDB entry using the Euclidean distance transform algorithm of Xu and Zhang [43]. For biological assemblies, surfaces are generated using the symmetry operations of the space group, which allows the display of even the largest assemblies in the PDB on a standard laptop computer. Surfaces can be color coded by chain, entity (unique macromolecules) and hydrophobicity. Color-blind-friendly color schemes were adopted from ColorBrewer, a tool for selecting color schemes for maps [44]. In addition, options to export high-resolution images with custom sizes for publications and posters are available for the three RCSB PDB viewers: Protein Workshop, Simple Viewer and Ligand Explorer.

Web services
Web services allow efficient, remote interaction with PDB data on the fly, eliminating the need for local data storage. The RCSB PDB provides RESTful search and fetch services that return XML files in response to URL requests. Search services return PDB ID lists for queries based on advanced search queries. Fetch services return data (such as entity descriptions, ligand information and external annotations) for a given list of IDs.
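Since search services return plain PDB ID lists, combining the results of several queries client-side reduces to set logic, mirroring the AND/OR behavior of Advanced Search. A minimal sketch (Python; the function is illustrative, not an official client):

```python
def combine_queries(ids_a, ids_b, operator="AND"):
    """Combine two PDB ID result lists.

    'AND' keeps entries matching ALL criteria (set intersection);
    'OR' keeps entries matching ANY criterion (set union).
    """
    a, b = set(ids_a), set(ids_b)
    merged = a & b if operator == "AND" else a | b
    return sorted(merged)
```

The sorted ID list can then be passed directly to a fetch service to retrieve descriptions or annotations for the matching entries.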

4.2.3 Binding database
BindingDB is a publicly accessible database of measured binding affinities, focusing on the interactions of proteins considered to be drug targets with small, drug-like molecules. It is maintained by the Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, California, and can be accessed at www.bindingdb.org. At present, this database contains 1,009,290 binding data for 6589 proteins and 427,325 small molecules. It also provides 750 protein–ligand validation sets; each dataset is a congeneric series with at least one associated protein–ligand co-crystal structure, and binding data are provided for each set. BindingDB includes data from the BindingDB project itself, selected PubChem confirmatory BioAssays, and ChEMBL entries for which a well-defined protein target is provided. It employs docking of the many compounds for which no co-crystal structures are available, using the existing co-crystal structures as models; the results are then integrated with the rest of BindingDB, including the validation sets. The data are derived from a variety of measurement techniques, such as enzyme inhibition and kinetics, isothermal titration calorimetry (ITC), NMR, and radioligand and competition assays. Searches can be performed for targets and/or ligands. Target searches can be made on the basis of sequence, name, Ki, IC50, Kd, EC50, koff (dissociation rate constant), kon (association rate constant), ΔG°, ΔH°, −TΔS°, pH (enzymatic assay), pH (ITC), substrate or competitor, compound molecular weight, and chemical structure. The options for ligand search include: FDA drug, through which binding data for 2744 FDA-approved drugs can be retrieved; chemical structure (drawn or given as a SMILES pattern); name; and target, through which ligands can be selected for a particular target, yielding the ligands for that target and retrieving the binding data for the retrieved molecules.
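The affinity and thermodynamic quantities listed above are interconvertible by standard relations, e.g. ΔG° = RT·ln Kd and pKd = −log10 Kd (standard state 1 M). A short sketch (Python; the helper names are ours):

```python
import math

R_KCAL = 1.987204e-3  # gas constant, kcal·mol^-1·K^-1

def binding_free_energy(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard binding free energy ΔG° = RT·ln(Kd), in kcal/mol.

    Kd must be in molar units; negative values indicate favorable binding.
    """
    return R_KCAL * temp_k * math.log(kd_molar)

def p_affinity(value_molar: float) -> float:
    """pKd / pIC50: the negative base-10 logarithm of a molar value."""
    return -math.log10(value_molar)
```

A 1 nM binder, for instance, corresponds to pKd = 9 and ΔG° of about −12.3 kcal/mol at 25 °C.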
The search for binding data of a protein–ligand pair can also be made through the citation. This database has already been discussed in detail in Chapter 3.


4.2.4 Therapeutic target database
The Therapeutic Target Database (TTD) is designed to give information about known therapeutic targets, i.e. the proteins and nucleic acids described in the literature, the targeted disease conditions, the corresponding drugs/ligands directed at each of these targets, and pathway information. It also provides links to the relevant databases containing information about target function, sequence, 3D structure, ligand-binding properties, clinical development status, drug structure, and therapeutic class, with full referencing. Cross-links to other databases are also introduced to facilitate access to information about the sequence, 3D structure, function, nomenclature, drug/ligand-binding properties, drug usage and effects, and related literature for each target. The source of the TTD database is the BIDD (Bioinformatics and Drug Design) group, based in the Department of Pharmacy, National University of Singapore. The database currently contains entries for 433 targets covering 125 disease conditions, along with 809 drugs/ligands directed at these targets. Each entry can be retrieved through multiple methods, including target name, disease name, drug/ligand name, drug/ligand function and drug therapeutic classification. Pharmaceutical agents generally bind to a particular protein or nucleic acid target to exert their therapeutic effect. So far, hundreds of proteins and nucleic acids have been explored as therapeutic targets. Rapid advances in the genetic [45], structural [46] and functional [47] information on disease-related genes and proteins not only raise strong interest in the search for new therapeutic targets, but also promote the study of various aspects of known targets, including the molecular mechanisms of their binding agents and related adverse effects [48], and the pharmacogenetic implications of sequence or proteomic variations [49].
The knowledge gained from such studies is important in facilitating the design of more potent, less toxic and personalized drugs. The development of advanced computational methods for bioinformatics [46], molecular modeling [50], drug design and pharmacokinetics analysis [51] increasingly uses known therapeutic targets to refine and test algorithms and parameters. In TTD, the available literature is searched to collect all of the therapeutic targets. It has been reported that approximately 500 therapeutic targets have been exploited in currently available medical treatments [52]. However, the description of some of these targets in the literature was not specific enough to point to a particular protein or nucleic acid as the target; such targets are not included in this database.
Details
TTD has a web interface. The entries of this database are generated from a search of pharmacology textbooks, review articles and a number of recent publications. The database incorporates: ICD-10-CM and ICD-9-CM codes of the International Classification of Diseases (covering 897 disease conditions, 893 targets and 5697 drugs), biomarkers (1755 biomarkers for 365 disease conditions), drug scaffolds


Figure 4.2 TTD database web interface.

(210 scaffolds for 714 drugs and leads), 1008 nature-derived agents, 20,818 multi-target agents against 385 target pairs, and the activity data of 1436 agents against 297 cell lines. The TTD database web interface is shown in Fig. 4.2. The search options of this database include:
1. Search whole database
2. Advanced search

3. Target similarity search
4. Drug similarity search
5. QSAR models: both 2D and 3D models can be searched using the drug target name or the chemical type of the ligand.
6. Target validation: provides drug potency against the target, drug potency against the disease model, and the results of target gene knockout/genetic variation in animal models.
7. Multi-target agents: combinations of targets, i.e. target pairs, and multi-target agent information; structure and potency information is available for 20,818 multi-target agents against 385 target pairs.
8. Drug combinations: provides data on synergistic, additive, antagonistic, potentiative and reductive combinations.
9. Nature-derived drugs: approved, clinical-trial and preclinical drugs, together with their species of origin, searchable by target name or drug/ligand name.
Searches involving any combination of these search or selection fields are also supported. The search is case-insensitive. In a query, a user can specify the full name or any part of the name in a text field, or choose one item from a selection field. The wildcard characters '%' and '_' are supported in text fields: '_' represents any one character and '%' represents a string of characters of any length. In the interface, all therapeutic targets that satisfy the search criteria are listed along with the disease conditions to be treated, the drugs or ligands directed at the target, and the drug class. More detailed information on a target can be obtained by clicking the corresponding target name. From this detail page, one finds the target name, the corresponding disease condition and cross-links to other databases, the target's function in its pathway and its corresponding natural ligand, known drugs or ligands directed at the target, drug function, drug therapeutic classification, and additional cross-links to other databases that provide useful information about the target.
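The '%' and '_' wildcard semantics described above map directly onto regular expressions. A small sketch (Python; illustrative only, case-insensitive like the TTD search):

```python
import re

def ttd_wildcard_match(pattern: str, text: str) -> bool:
    """Match TTD-style wildcards against a string.

    '%' matches any run of characters (like regex '.*'); '_' matches
    exactly one character (like regex '.'); everything else is literal.
    """
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, text, re.IGNORECASE) is not None
```

For example, the pattern '%kinase%' matches any target name containing 'kinase', while 'EGF_' matches 'EGFR' but not 'EGF'.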
The functional properties of an identified target can be obtained through cross-linking to the On-line Medical Dictionary (OMD) and the SWISS-PROT database [53]. The target sequence can be retrieved from the cross-link to the SWISS-PROT database. The available 3D structure of the target can be accessed through cross-linking to the Protein Data Bank (PDB) [54]. For an enzymatic target, its nomenclature can be obtained from the cross-link to the Enzyme Data Bank [55]. Ligand-binding properties may be obtained from the cross-link to the Computed Ligand Binding Energy (CLiBE) database. The related literature can be accessed from cross-links to the relevant entries in the PubMed database [56]. As research in proteomics [57] and pathways [58] progresses, the relevant information can be incorporated or the corresponding databases can be cross-linked to TTD to provide

more comprehensive information about the drug targets and their relationship to other biomolecules and cellular processes.
Exercise: Prepare a list of reported kinase targets and their inhibitors for the management of breast cancer using TTD.
Step-by-step protocol:
1. Open the TTD database on your PC/laptop.
2. Under "Search whole database", go to the "Search Drugs and Targets by Disease" option.
3. Type "Breast cancer" and click search. A page displaying the list of various reported protein targets along with their inhibitors will appear.
4. Filter the kinase proteins and prepare the list.
5. Click on "Drug info" for further details of each drug.
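Once the result table has been exported, the filtering step of the exercise amounts to a simple selection. The sketch below (Python) uses three fabricated records mimicking a TTD disease-search result; the field names and record layout are our own, not TTD's export format:

```python
# Fabricated rows imitating an exported TTD 'Breast cancer' result table.
records = [
    {"target": "Epidermal growth factor receptor", "type": "kinase",
     "drug": "Lapatinib"},
    {"target": "Estrogen receptor alpha", "type": "nuclear receptor",
     "drug": "Tamoxifen"},
    {"target": "Cyclin-dependent kinase 4", "type": "kinase",
     "drug": "Palbociclib"},
]

# Keep only kinase targets, pairing each with its reported inhibitor.
kinase_targets = [(r["target"], r["drug"])
                  for r in records if r["type"] == "kinase"]
```

The same selection applied to a full export would produce the kinase/inhibitor list requested by the exercise.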

References
[1] M. Naderi, R.G. Govindaraj, M. Brylinski, eModel-BDB: a database of comparative structure models of drug–target interactions from the Binding Database, GigaScience 7 (2018) giy091.
[2] B.E. Suzek, Y. Wang, H. Huang, P.B. McGarvey, C.H. Wu, UniProt Consortium, UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches, Bioinformatics 31 (2014) 926–932.
[3] R. Leinonen, F.G. Diez, D. Binns, W. Fleischmann, R. Lopez, R. Apweiler, UniProt archive, Bioinformatics 20 (2004) 3236–3237.
[4] G.I. Giraldo-Calderón, S.J. Emrich, R.M. MacCallum, G. Maslen, E. Dialynas, P. Topalis, et al., VectorBase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases, Nucleic Acids Res. 43 (2014) D707–D713.
[5] K.L. Howe, B.J. Bolt, S. Cain, J. Chan, W.J. Chen, P. Davis, et al., WormBase 2016: expanding to enable helminth genomic research, Nucleic Acids Res. 44 (2015) D774–D780.
[6] D. Medini, C. Donati, H. Tettelin, V. Masignani, R. Rappuoli, The microbial pan-genome, Curr. Opin. Genet. Dev. 15 (2005) 589–594.
[7] C. Chen, D.A. Natale, R.D. Finn, H. Huang, J. Zhang, C.H. Wu, et al., Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation, PLoS One 6 (2011) e18910.
[8] M. Tognolli, D. Baratin, S. Poux, A. Bridge, M.L. Famiglietti, M. Magrane, et al., The UniProtKB guide to the human proteome, Database 2016 (2016).
[9] J.S. Garavelli, The RESID database of protein modifications as a resource and annotation tool, Proteomics 4 (2004) 1527–1533.
[10] L. Montecchi-Palazzi, R. Beavis, P.-A. Binz, R.J. Chalkley, J. Cottrell, D. Creasy, et al., The PSI-MOD community standard for representation of protein modification data, Nat. Biotechnol. 26 (2008) 864.
[11] M.C. Chibucos, C.J. Mungall, R. Balakrishnan, K.R. Christie, R.P. Huntley, O.
White, et al., Standardized description of scientific evidence using the Evidence Ontology (ECO), Database 2014 (2014).
[12] M.L. Valenstein, A. Roll-Mecak, Graded control of microtubule severing by tubulin glutamylation, Cell 164 (2016) 911–921.
[13] A. Mitchell, H.-Y. Chang, L. Daugherty, M. Fraser, S. Hunter, R. Lopez, et al., The InterPro protein families database: the classification resource after 15 years, Nucleic Acids Res. 43 (2014) D213–D221.

[14] I. Pedruzzi, C. Rivoire, A.H. Auchincloss, E. Coudert, G. Keller, E. De Castro, et al., HAMAP in 2015: updates to the protein family classification and annotation system, Nucleic Acids Res. 43 (2014) D1064–D1070.
[15] A.N. Nikolskaya, C.N. Arighi, H. Huang, W.C. Barker, C.H. Wu, PIRSF family classification system for protein functional and evolutionary analysis, Evol. Bioinf. 2 (2006) 117693430600200033.
[16] H. Berman, K. Henrick, H. Nakamura, Announcing the worldwide protein data bank, Nat. Struct. Mol. Biol. 10 (2003) 980.
[17] P.W. Rose, B. Beran, C. Bi, W.F. Bluhm, D. Dimitropoulos, D.S. Goodsell, et al., The RCSB Protein Data Bank: redesigned web site and web services, Nucleic Acids Res. 39 (2010) D392–D401.
[18] S. Velankar, Y. Alhroub, C. Best, S. Caboche, M.J. Conroy, J.M. Dana, et al., PDBe: protein data bank in Europe, Nucleic Acids Res. 40 (2011) D445–D452.
[19] A.R. Kinjo, H. Suzuki, R. Yamashita, Y. Ikegawa, T. Kudou, R. Igarashi, et al., Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format, Nucleic Acids Res. 40 (2011) D453–D460.
[20] E.L. Ulrich, H. Akutsu, J.F. Doreleijers, Y. Harano, Y.E. Ioannidis, J. Lin, et al., BioMagResBank, Nucleic Acids Res. 36 (2007) D402–D408.
[21] P.E. Bourne, B. Beran, C. Bi, W. Bluhm, R. Dunbrack, A. Prlić, et al., Will widgets and semantic tagging change computational biology? PLoS Comput. Biol. 6 (2010) e1000673.
[22] G.O. Consortium, Gene ontology consortium: going forward, Nucleic Acids Res. 43 (2014) D1049–D1056.
[23] M.H. Saier Jr, M.R. Yen, K. Noto, D.G. Tamang, C. Elkan, The transporter classification database: recent advances, Nucleic Acids Res. 37 (2008) D274–D278.
[24] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247 (1995) 536–540.
[25] A.L.
Cuff, I. Sillitoe, T. Lewis, A.B. Clegg, R. Rentzsch, N. Furnham, et al., Extending CATH: increasing coverage of the protein structure universe and linking structure with function, Nucleic Acids Res. 39 (2010) D420–D426.
[26] NCBI Resource Coordinators, A. Acland, R. Agarwala, T. Barrett, J. Beck, D.A. Benson, et al., Database resources of the National Center for Biotechnology Information, Nucleic Acids Res. 42 (2014) D7.
[27] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[28] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988) 31–36.
[29] UniProt Consortium, Reorganizing the protein space at the Universal Protein Resource (UniProt), Nucleic Acids Res. 40 (2011) D71–D75.
[30] C.L. Lawson, M.L. Baker, C. Best, C. Bi, M. Dougherty, P. Feng, et al., EMDataBank.org: unified data resource for CryoEM, Nucleic Acids Res. 39 (2010) D456–D464.
[31] R.D. Finn, P. Coggill, R.Y. Eberhardt, S.R. Eddy, J. Mistry, A.L. Mitchell, et al., The Pfam protein families database: towards a more sustainable future, Nucleic Acids Res. 44 (2015) D279–D285.
[32] Y. Ye, A. Godzik, Flexible structure alignment by chaining aligned fragment pairs allowing twists, Bioinformatics 19 (2003) ii246–ii255.
[33] A. Prlić, S. Bliven, P.W. Rose, W.F. Bluhm, C. Bizon, A. Godzik, et al., Pre-calculated protein structure alignments at the RCSB PDB website, Bioinformatics 26 (2010) 2983–2985.
[34] M.J. Sippl, M. Wiederstein, Detection of spatial correlations in protein structures and molecular complexes, Structure 20 (2012) 718–728.
[35] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (1983) 2577–2637.
[36] N. Alexandrov, I.
Shindyalov, PDP: protein domain parser, Bioinformatics 19 (2003) 429–430.

[37] E.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman, R. Durbin, Pfam: multiple sequence alignments and HMM-profiles of protein domains, Nucleic Acids Res. 26 (1998) 320–322.
[38] R.M. Hanson, Jmol: a paradigm shift in crystallographic visualization, J. Appl. Crystallogr. 43 (2010) 1250–1260.
[39] K. Henrick, Z. Feng, W.F. Bluhm, D. Dimitropoulos, J.F. Doreleijers, S. Dutta, et al., Remediation of the protein data bank archive, Nucleic Acids Res. 36 (2007) D426–D433.
[40] Z. Feng, L. Chen, H. Maddula, O. Akcan, R. Oughtred, H.M. Berman, et al., Ligand Depot: a data warehouse for ligands bound to macromolecules, Bioinformatics 20 (2004) 2153–2155.
[41] J.L. Moreland, A. Gramada, O.V. Buzko, Q. Zhang, P.E. Bourne, The Molecular Biology Toolkit (MBT): a modular platform for developing molecular visualization applications, BMC Bioinforma. 6 (2005) 21.
[42] K. Stierand, M. Rarey, Drawing the PDB: protein–ligand complexes in two dimensions, ACS Med. Chem. Lett. 1 (2010) 540–545.
[43] D. Xu, Y. Zhang, Generating triangulated macromolecular surfaces by Euclidean distance transform, PLoS One 4 (2009) e8140.
[44] M. Harrower, C.A. Brewer, ColorBrewer.org: an online tool for selecting colour schemes for maps, Cartographic J. 40 (2003) 27–37.
[45] L. Peltonen, V.A. McKusick, Dissecting human disease in the postgenomic era, Science 291 (2001) 1224–1229.
[46] A. Šali, 100,000 protein structures for the biologist, Nat. Struct. Biol. 5 (1998) 1029.
[47] E.V. Koonin, R.L. Tatusov, M.Y. Galperin, Beyond complete genomes: from sequence to structure and function, Curr. Opin. Struct. Biol. 8 (1998) 355–363.
[48] K.B. Wallace, A. Starkov, Mitochondrial targets of drug toxicity, Annu. Rev. Pharmacol. Toxicol. 40 (2000) 353–388.
[49] E.S. Vesell, Advances in pharmacogenetics and pharmacogenomics, J. Clin. Pharmacol. 40 (2000) 930–938.
[50] W.D. Cornell, P. Cieplak, C.I. Bayly, I.R. Gould, K.M. Merz, D.M.
Ferguson, et al., A second generation force field for the simulation of proteins, nucleic acids, and organic molecules, J. Am. Chem. Soc. 117 (1995) 5179–5197; erratum: J. Am. Chem. Soc. 118 (1996) 2309.
[51] Y. Chen, D. Zhi, Ligand–protein inverse docking and its potential use in the computer search of protein targets of a small molecule, Proteins: Struct., Funct., Bioinf. 43 (2001) 217–226.
[52] J. Drews, Drug discovery: a historical perspective, Science 287 (2000) 1960–1964.
[53] A. Bairoch, R. Apweiler, The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Res. 28 (2000) 45–48.
[54] J. Westbrook, Z. Feng, S. Jain, T.N. Bhat, N. Thanki, V. Ravichandran, et al., The protein data bank: unifying the archive, Nucleic Acids Res. 30 (2002) 245–248.
[55] A. Bairoch, The ENZYME database in 2000, Nucleic Acids Res. 28 (2000) 304–305.
[56] J. McEntyre, D. Lipman, PubMed: bridging the information gap, CMAJ 164 (2001) 1317–1319.
[57] A. Dove, Proteomics: translating genomics into products? Nat. Biotechnol. 17 (1999) 233.
[58] S. Scharpe, I.M. De, Peptide truncation by dipeptidyl peptidase IV: a new pathway for drug discovery? Verhandelingen-Koninklijke Academie voor Geneeskunde van Belgie 63 (2001) 5–32; discussion 32–33.


Homology modeling: Developing 3D structures of target proteins missing in databases
5.1 Introduction
Structure-based drug design (SBDD) is a multidisciplinary effort that fuses the ideas of traditional medicinal chemistry with analyses of the 3D structure of the target protein, computational chemistry, small-molecule 3D database searching and ab initio ligand design. The starting point of SBDD is an accurate 3D structure of the target protein under consideration. The 3D structure of a protein can be determined by two experimental techniques, i.e. X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Sometimes a protein sample is not suitable for 3D structure determination by experimental methods because of certain limitations, e.g. conformational instability, difficulty in obtaining a pure crystal, or the inability to wait for crystallography results because of time constraints, and therefore an alternative method is required. Homology modeling is one such theoretical technique for building a 3D structure of a protein from its amino acid sequence, based on the observation that amino acid sequence homology above a given level leads to similar 3D structures of proteins. It is a comparative protein modeling approach in which the amino acid sequence of an unknown protein (also called the target protein, which is to be modeled) is compared with the amino acid sequence of a known homologous protein (also called the reference or template protein, for which the 3D structure has already been solved by NMR or X-ray crystallography) [1]. Fundamentally, this technique utilizes knowledge about:
• The structural architecture of proteins
• The rules governing their folding
• The essential interactions that hold the protein together
• The energies involved

The success of this method depends upon the availability of at least one reference or template homologous protein structure, determined experimentally by either crystallography or NMR. Given an experimentally established protein structure (template), models can be generated for a homologous sequence (target) that shares with the template either significant sequence identity (30% or more) or structural similarity [2]. Models, by definition, are an abstraction and hence may contain errors. Depending on the degree of sequence identity or similarity and the quality of the alignment, the accuracy of homology models compared to the actual experimental structure can be as good as 1-2 Å Cα RMSD (root-mean-square deviation between corresponding Cα atoms) [3,4]. As a general rule, models built with over 50% sequence identity are accurate enough for drug discovery applications, those between 25% and 50% identity can be used to assess target druggability and design mutagenesis experiments, and those between 10% and 25% are at most speculative [5]. Although model quality is directly related to the identity (or similarity) between template and target sequences, this rule does not always hold. Conversely, an overall high sequence identity might mask dissimilarities in certain regions, such as exposed loops, which are most likely to be flexible, adding uncertainty to the model and thereby rendering it of lesser value for drug discovery applications. The choice of template, inaccurate alignments and inefficient refinement methods are the main sources of error in homology modeling [6]. Sequence alignment plays an important role in developing an accurate homology model. Pairwise alignment tools implement dynamic programming methods to search for optimal alignments (local or global) between a pair of sequences. This approach is useful for searching databases for homologous sequences. Multiple sequence alignment methods simultaneously align several sequences to identify conserved regions, predict functional sites and protein function, and aid phylogenetic analysis [7]. The latter approach is particularly suited to proteins with low sequence identities.
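The dynamic-programming search behind pairwise alignment can be illustrated with a minimal Needleman-Wunsch sketch; the match/mismatch/gap scores below are illustrative defaults, not those of any particular alignment tool:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment by dynamic programming (Needleman-Wunsch)."""
    n, m = len(a), len(b)
    # score matrix with gap-penalty initialisation along the borders
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # traceback to recover one optimal alignment
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            ai.append(a[i - 1]); bi.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            ai.append(a[i - 1]); bi.append('-'); i -= 1
        else:
            ai.append('-'); bi.append(b[j - 1]); j -= 1
    return ''.join(reversed(ai)), ''.join(reversed(bi)), S[n][m]
```

For example, aligning "ACD" against "AD" yields "ACD" over "A-D", opening one gap rather than accepting a mismatch.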
In cases where a single template fails to provide complete structural information for the target, the multi-template approach [8], or mixing-and-matching techniques [9], are used to incorporate structural information from multiple homologous templates and improve the overall model quality.

Some important definitions

One should be familiar with certain technical terms used while discussing the principle, methodology and applications of homology modeling. Herein, we give the precise meaning of these terms.

Identity: Two proteins that have a certain number of amino acids in common at aligned positions are said to be identical to that degree, e.g. if 43 out of 144 aligned residues are common, the two proteins are 29.9% identical.

Similarity: Often a residue is replaced by one with similar physicochemical properties. Such mutations are said to be conservative, and various scoring schemes can be defined to quantify how similar two sequences are. Such scores are the measure of similarity.

Homologous: Two proteins are called homologous if they are evolutionarily related and stem from a common ancestor.

Constraint: A requirement that the system is forced to satisfy, e.g. constrained bonds or angles are forced to adopt a specified value throughout the simulation.

Restraint: If bonds or angles are restrained, they are able to deviate from the desired values; restraints only act to encourage particular values.

Domains: Independent structural units that can be found alone or in conjunction with other domains, each responsible for a specific function. Domains are evolutionarily related.

Motif: A set of contiguous secondary structure elements that either have a particular functional significance or define a portion of an independently folded domain.

Fold: A protein fold is defined by the arrangement of the secondary structure elements of the structure relative to each other in space.

Databases required for homology modeling

A large amount of information is required for developing a homology model for a sequence of unknown structure. This information can be collected by mining various databases, including sequence databases and structure databases, available online. The most widely used sequence databases are:

GenBank: This database and related resources are freely accessible via the NCBI homepage.

SWISS-PROT: A large, curated protein sequence database valued for its high-quality annotation, use of standard nomenclature, direct links to specialized databases, and minimal redundancy.

TrEMBL: A computer-annotated supplement to SWISS-PROT. Many of these amino acid sequences can be grouped into families of proteins with a common structural domain.

The most popular structural databases are:

SCOP: Structural Classification of Proteins, freely available at http://scop.mrc/

CATH: Class, Architecture, Topology and Homologous superfamily.

PDB: The Protein Data Bank is used to obtain 3D structures of proteins using an entry code (e.g. 3EDZ).

PMP: The Protein Model Portal provides access to models computed by comparative modeling methods at different partner sites, as well as to various interactive services for model building and quality assessment.

ModBase: A database of comparative protein structure models.

SWISS-MODEL Repository: A database of annotated 3D protein structure models generated by the SWISS-MODEL homology-modeling pipeline.

THE-DB: A threading model database for comparative protein structure analysis of the E. coli K12 and human proteomes.
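The percent-identity figure used in the definitions above (e.g. 43 identical residues out of 144 aligned positions = 29.9%) can be computed directly from a gapped alignment. This is a minimal sketch; note that normalizing by the full alignment length is only one of several conventions in use:

```python
def percent_identity(aln1, aln2):
    """Percent identity between two aligned (equal-length, gapped) sequences,
    normalized by the full alignment length."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    # count columns where both residues are present and identical
    matches = sum(1 for a, b in zip(aln1, aln2) if a == b and a != '-')
    return 100.0 * matches / len(aln1)
```

With 43 matches over 144 alignment positions this returns 29.9% (to one decimal place), as in the worked example above.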


5.2 Methodology of homology modeling

Homology-based protein modeling is a multistep process that is cumulative and iterative in nature. It is cumulative because generation of a correct final structure of the target protein depends upon how accurately each step is executed. The process becomes iterative when, after a model has been generated, a specific problem is identified during validation and earlier steps must be revisited. The general procedure, as described in Fig. 5.1, involves the following steps:

• Template recognition and initial alignment
• Alignment correction
• Backbone generation
• Loop modeling
• Side chain modeling

Figure 5.1 General steps involved in homology modeling.

• Ligand modeling
• Model optimization
• Model validation
• Iteration of previous steps in case specific problems are identified

5.2.1 Template recognition and initial alignment

This is the initial step, in which a program or server compares the sequence of unknown structure with known structures stored in the PDB. The most popular tools are BLAST (Basic Local Alignment Search Tool), PSI-BLAST (Position-Specific Iterated BLAST) and fold recognition methods [10]. A BLAST search of the database for optimal local alignments with the query gives a list of known protein structures that match the sequence. There are certain points one should consider while selecting template proteins for homology modeling:

• Sequence identity between the template and target proteins should usually be higher than 30%. When the sequence identity is below 30%, hits from BLAST are not reliable. For the most trusted models, greater than 60% identity is necessary for a sequence of 25 aligned residues, whereas for a large sequence of 250 aligned residues, 25% identity may be sufficient.
• In addition to percent identity, it is advisable to consider several other factors, including resolution, while selecting a 3D-structure template protein. The best structure is one with resolution better than 1.5 Å and B-factor < 15 Å². Structures with B-factor > 50 Å² should be avoided if possible.
• The similarity of each hit is summarized by its E-value (expected value); values closer to zero indicate a higher degree of similarity. The E-value describes the number of times the alignment score would be expected by chance [11].
• Profile-based sequence alignment is more sensitive in detecting evolutionary relationships among proteins and genes [12-14]. Profile-sequence alignment properly aligns approximately 42-47% of residues in the 0-40% sequence identity range, roughly double the figure for pairwise sequence methods [15].
• If it is not possible to identify a single template protein for the entire unknown target sequence, then multiple sequence alignment is a better alternative.

Single sequence alignment

Single sequence alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences. It involves pairwise sequence alignment. It is only applicable when the target-template sequence identity is higher than 40%. When target-template sequence identity is lower than 40%, models built using multiple templates are more accurate than those built using a single template only, and this trend is accentuated as one moves into more remote target-template pair cases [1].

Multiple sequence alignment

Multiple alignments are typically built heuristically, most commonly by progressive alignment. Progressive alignments are simple to perform and allow large alignments of distantly related sequences to be constructed [16]. This approach is implemented in the most widely used programs, ClustalW and ClustalX. Alignment of divergent protein sequences can be performed with high accuracy using the ClustalW program. ClustalW includes many features, such as assigning individual weights to each sequence in a partial alignment, and amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Specific importance is given to residue-specific gap penalties in hydrophilic regions, which encourage new gaps in potential loop regions [17]. HMMs (Hidden Markov Models) are a class of probabilistic models generally applicable to time series or linear sequences [18,19]. Profile HMMs are very effective in detecting conserved patterns in multiple sequences. The SATCHMO algorithm in the LOBSTER package simultaneously constructs a similarity tree and compares multiple sequence alignments at each internal node of the tree using HMMs. A newer HMM method, SAM-T98, is known for finding remote homologs of protein sequences. The method begins with a single target sequence and iteratively builds an HMM from the sequence and the homologs found using the HMM for database search. It is also used to construct model libraries automatically from sequences.
The LAMA program aligns two multiple sequence alignments by first transforming them into profiles and then comparing these with each other using the Pearson correlation coefficient [20]. The COMPASS program was developed to locally align two multiple sequence alignments with an assessment of statistical significance; it compares two profiles by constructing a matrix of scores for matching every position in one profile to each position in the other, followed by either local or global dynamic programming to calculate the optimal alignment [21]. Similarly, T-Coffee uses progressive alignment as an optimization technique [22]. T-Coffee can merge heterogeneous data into alignments. 3D-Coffee incorporates a link to the FUGUE threading package, which carries out sequence alignment using local structural information [23]. The probabilistic program PROBCONS, benchmarked on BAliBASE, is among the most accurate methods available for multiple alignment [24]. In simple words, PROBCONS is like T-Coffee, but it uses probabilities instead of heuristic algorithms. HOMSTRAD is based exclusively on sequences with known 3D structures and PDB files. Katoh et al. extended HOMSTRAD by incorporating a large number of close homologues, as found by BLAST search, which tends to increase the accuracy of the alignment [25,26]. Sadreyev and Grishin reported that the accuracy of profile alignments can be increased by including confident homologues with the help of the COMPASS program [21].
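The LAMA-style comparison of profile columns mentioned above boils down to the Pearson correlation coefficient between two position-specific score vectors; a minimal sketch (the vectors in the test are invented for illustration):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score vectors,
    e.g. two columns of amino acid frequencies from different profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # covariance and standard deviations (unnormalized; factors cancel)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

A correlation near +1 marks two profile positions with very similar residue preferences; values near 0 or below mark dissimilar positions.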

5.2.2 Alignment correction

After identifying one or more possible templates, alignment correction is performed. Alignment errors are the main cause of deviations in comparative modeling, even when the correct template is chosen. In recent years, significant progress has been made in the development of sensitive alignment methods based on iterative searches, e.g. PSI-BLAST [27]; Hidden Markov Models (HMMs), e.g. SAM [18] and HMMER [19]; and profile-profile alignment, such as FFAS03 [28], ProfileScan [29] and HHsearch. There are several approaches for scoring alignments and for removing badly aligned sequences or correcting poorly aligned regions. One such approach was introduced by Lassmann et al., who proposed two functions to compare alignments: the average overlap score and the multiple overlap score. The average overlap score identifies difficult alignment cases by expressing the similarity among several alignments, while the multiple overlap score estimates the biological correctness of individual alignments [30]. The quality of a multiple sequence alignment can be improved with a protocol suggested by Muller and co-workers. Several other protocols and approaches have likewise been suggested for alignment correction [31-33].

5.2.3 Backbone building

After the target-template alignment, the next step in homology modeling is model building, starting with backbone generation. Creating the backbone is trivial for most of the model: one simply copies the coordinates of the template residues that appear in the alignment with the model sequence. If two aligned residues differ, only the backbone coordinates (N, Cα, C and O) can be copied. If they are identical, one can also include the side chain (at least the more rigid side chains, since rotamers tend to be conserved). A variety of methods can be used to build a protein model for the target; generally, rigid-body assembly [34], segment matching [34], satisfaction of spatial restraints [35] and artificial evolution [36] are used. Rigid-body assembly relies on the natural dissection of the protein structure into conserved core regions, variable loops that connect them, and side chains that decorate the backbone. Model accuracy depends on template selection and alignment accuracy; accordingly, a good modeling method allows a degree of flexibility and automation, making it easier and faster to obtain good models. Segment matching constructs the model using a subset of atomic positions from template structures as guiding positions. All-atom segments that match the guiding positions can be obtained either by scanning all known protein structures, including those not related to the sequence being modeled [37], or by a conformational search restrained by an energy function [38]. Modeling by satisfaction of spatial restraints is based on the generation of many constraints or restraints on the structure of the target sequence, using its alignment to related protein structures as a guide. Restraint generation is based on the assumption that the corresponding distances between aligned residues in the template and target structures are similar.

Model refinement

Model refinement is a very important task that requires efficient sampling of conformational space and a means to accurately identify near-native structures [39]. The homology model building process evolves through a series of amino acid residue substitutions, insertions and deletions. Model refinement involves tuning the alignment and modeling loops and side chains. The refinement process usually begins with an energy minimization step using one of the molecular mechanics force fields [40]; for further refinement, techniques such as molecular dynamics, Monte Carlo and genetic algorithm-based sampling can be applied [41]. Monte Carlo sampling focused on the regions most likely to contain errors, while allowing the whole structure to relax in a physically realistic all-atom force field, can significantly improve the accuracy of models in terms of both backbone conformation and the placement of core side chains. The accuracy of the alignment strongly depends on the degree of sequence similarity. Misalignments sometimes result in errors that are hard to remove at later stages of refinement [42].
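The backbone-copying rule described above (copy N, Cα, C, O for substituted residues; carry the side chain over only when the residue is conserved) can be sketched as follows. The alignment and coordinate data structures are simplified assumptions for illustration, not a real PDB parser:

```python
def copy_backbone(alignment, template_coords):
    """For each aligned (target_aa, template_aa, template_resnum) triple,
    copy backbone atoms (N, CA, C, O) from the template; identical residues
    also inherit their side-chain atoms, since rotamers tend to be conserved.
    Gapped positions (None) are skipped and left for loop modeling.
    template_coords: {resnum: {atom_name: (x, y, z)}}.
    """
    BACKBONE = {"N", "CA", "C", "O"}
    model = []
    for target_aa, template_aa, resnum in alignment:
        if target_aa is None or template_aa is None:
            continue  # insertion/deletion: handled later by loop modeling
        atoms = template_coords[resnum]
        if target_aa == template_aa:
            copied = dict(atoms)  # conserved residue: keep side chain too
        else:
            copied = {a: xyz for a, xyz in atoms.items() if a in BACKBONE}
        model.append((target_aa, copied))
    return model
```

In practice real modeling programs also rebuild the substituted side chains immediately; here they are simply omitted to keep the sketch minimal.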

5.2.4 Loop modeling Homologous proteins have gaps or insertions in sequences, referred to as loops whose structures are not conserved during evolution. Loops are considered as the most variable regions of a protein where insertion and deletion often occurs. Loops often determine the functional specificity of a protein structure. Loops contribute to active and binding sites. The accuracy of loop modeling is a major factor in determining the usefulness of homology models for studying protein-ligand interactions [43]. Loop structures are more difficult to predict than the structure of the geometrically highly regular strands and helices because loops exhibit greater structural variability than strands and helices. Length of a loop region is generally much shorter than that of the whole protein chain. Modeling a loop region possess challenges, which are not likely to be present in the global protein structure. Modeled loop structure has to be geometrically consistent with the rest of the protein structure [44].

Loop prediction methods

Loop prediction methods can be evaluated by asking: (1) how is the backbone constructed; (2) what range of loop lengths is possible; (3) how widely is the conformational space searched; (4) how are side chains added; (5) how are conformations scored (i.e., the potential energy function) and (6) how extensively has the method been tested? Most loop construction methods have been tested only on the native structures from which the loops were taken [45]; in reality, homology modeling is a more complicated process requiring several choices to be made in building the complete structure.

Database methods: Database methods of loop structure prediction measure the orientation and separation of the backbone segments flanking the region to be modeled, and then search the PDB for segments of the same length that span a region of similar size and orientation. In recent years, as the size of the PDB has increased, database methods have continued to attract attention. Database methods are suitable for loops of up to 8 residues [46].

Construction methods: The main alternative to database methods is construction of loops by random or exhaustive search. Moult and James performed a systematic search to predict loop conformations up to 6 residues long [47]. They established various useful concepts in loop modeling by construction: (1) the use of a limited number of Φ, ψ pairs for construction; (2) construction from each end of the loop simultaneously; (3) discarding partial loop conformations that cannot span the remaining distance with the residues left to be modeled; (4) using side-chain clashes to reject partial loop conformations and (5) the use of electrostatic and hydrophobic free energy terms in evaluating predicted loops [48].

Scaling-relaxation method: In the scaling-relaxation method, a full segment is sampled and its end-to-end distance is measured.
If this distance does not fit the anchor separation, the segment is scaled so that it matches the end-to-end distance of the protein anchors, which results in very short bond distances but physical connections to the anchors. From there, energy minimization is performed on the loop, slowly relaxing the scaling constant until the loop is scaled back to full size [49].

Molecular mechanics/molecular dynamics: Other loop prediction methods build chains by sampling Ramachandran conformations randomly, keeping partial segments only as long as they can still complete the loop with the remaining residues to be built [50]. These methods are capable of building longer loops, since they spend less time in unlikely conformations than grid searches. They are based on Monte Carlo or molecular dynamics simulations with simulated annealing to generate many conformations, which can then be energy minimized and tested with some energy function to choose the lowest-energy conformation for prediction [51].
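A cheap geometric test used while growing loops, i.e. discarding partial conformations that can no longer bridge the gap, is to compare the anchor separation with the maximum span of the remaining residues. The 3.8 Å Cα-Cα virtual-bond length is a standard approximation (not a value from this text), and the function is only an illustrative sketch:

```python
from math import dist  # Python 3.8+

CA_CA = 3.8  # approximate trans Calpha-Calpha distance in angstroms

def loop_can_close(anchor_n_ca, anchor_c_ca, n_residues):
    """True if n_residues could still geometrically bridge the gap between
    the two anchor Calpha positions in the fully extended limit."""
    gap = dist(anchor_n_ca, anchor_c_ca)
    return gap <= (n_residues + 1) * CA_CA
```

A loop builder would call this after placing each residue and immediately reject any partial conformation for which it returns False.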


5.2.5 Side-chain modeling

Side-chain modeling is an important step in predicting protein structure by homology. Side-chain prediction usually involves placing side chains onto fixed backbone coordinates obtained from a parent structure, generated from ab initio modeling simulations, or a combination of the two. Protein side chains tend to exist in a limited number of low-energy conformations called rotamers. Side-chain prediction methods select rotamers based on the protein sequence and the given backbone coordinates, using a defined energy function and search strategy. Side-chain quality can be assessed by the root-mean-square deviation (RMSD) over all atoms or by the fraction of correct rotamers found [52].
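The RMSD quality measure mentioned above is straightforward to compute for two coordinate sets that have already been superposed; a minimal sketch:

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    coordinates, assumed to be optimally superposed already."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    # sum of squared per-axis differences over all atoms
    sq_sum = sum((x - y) ** 2
                 for p, q in zip(coords_a, coords_b)
                 for x, y in zip(p, q))
    return sqrt(sq_sum / len(coords_a))
```

Restricting the atom lists to side-chain atoms gives the side-chain RMSD used to judge rotamer placement; note that the optimal superposition itself (e.g. by the Kabsch algorithm) is assumed to have been done beforehand.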

5.2.6 Ligand modeling

Ligand-based homology modeling is an advancement of the conventional homology modeling strategy: it combines a Boltzmann-weighted randomized modeling procedure with a specialized algorithm for the proper handling of insertions and deletions of any selected extra atoms during the energy tests and minimization stages of the modeling procedure [53]. The ligand-based option is useful when one wishes to build a homology model in the presence of a ligand docked to the primary template, or of other molecules known to be complexed with the sequence to be modeled. Both model building and refinement take into account the presence of the ligand in terms of its specific steric and chemical features [54]. Ligand-based homology modeling can be performed using MODELLER and other freely available tools. In MODELLER, a ligand can be included in a model in two ways. The first case is where the ligand is not present in the template structure but is defined in the MODELLER residue topology library; such ligands include water molecules, metal ions, nucleotides, heme groups and many others. The second case is where the ligand is already present in the template structure. One can either assume that the ligand interacts similarly with the target and the template, in which case MODELLER can extract and satisfy distance restraints automatically, or that the relative orientation is not necessarily conserved, in which case the user must supply restraints on the relative orientation of the ligand and the target (the conformation of the ligand is assumed to be rigid).

5.2.7 Model optimization

To optimize a complete protein model, very high accuracy in the energy function is required, because there are many more ways to move a model away from the target structure than toward it, which is why energy minimization must be used carefully. At every minimization step, a few big errors are removed (such as bumps, i.e., too-short atomic distances), while many small errors are introduced. Once the big errors are removed, the small ones start accumulating and the model drifts away from the target structure. As a rule of thumb, today's modeling programs therefore either restrain the atom positions and/or apply only a few hundred steps of energy minimization. In short, model optimization will not fully work until energy functions (force fields) become more accurate. Two ways to achieve such accuracy are quantum force fields and self-parameterizing force fields. A third approach to model optimization is subjecting the modeled protein to molecular dynamics simulation on a femtosecond (10⁻¹⁵ s) timescale to mimic the true folding process. One thus hopes that the model will complete its folding and "home in" on the true structure during the simulation.

5.2.8 Model validation

Each step in homology modeling depends on the preceding ones; errors may therefore be accidentally introduced and propagated, so model validation and assessment are necessary before interpreting the model. A protein model can be evaluated as a whole as well as in individual regions [2]. Initially, the fold of a model can be assessed by its high sequence similarity to the template. One basic requirement for a constructed model is good stereochemistry [5]. The most important factor in the assessment of a constructed model is the scoring function. Validation programs evaluate the location of each residue in a model with respect to the expected environment found in high-resolution X-ray structures [55]. Techniques used to detect misthreading in X-ray structures can be used to detect alignment errors in homology models. Errors in models are common, and most attention is needed during refinement and validation. Errors in a model are usually estimated by (1) superposition of the model onto the native structure with the structure alignment program Structal [56] and calculation of the RMSD of Cα atoms; (2) generation of a Z-score, a measure of the statistical significance between matched structures, using the structure alignment program CE, where high scores indicate good structural similarity and (3) development of a scoring function capable of discriminating good and bad models. The most commonly employed method for validating a homology model is the Ramachandran plot, which shows the empirical distribution of (φ, ψ) data points observed in a single structure. From a Ramachandran plot, protein structural scientists can determine which torsional angles are permitted and obtain insight into the structure of peptides. Statistical effective energy functions [57] are based on the observed properties of amino acids in known structures.
A variety of statistical criteria have been derived for various properties, such as the distribution of polar and apolar residues inside or outside a protein, thus detecting misfolded models [58]. Solvation potentials can detect local errors and complete misfolds [59], and packing rules have been implemented for structure evaluation [60]. A model is said to be valid only when few distortions in atomic contacts are present. The Ramachandran plot is probably the most powerful determinant of protein quality [61]; when the Ramachandran plot quality of the model is markedly worse than that of the template, it is likely that an error occurred in backbone modeling. WHAT_CHECK determines whether Asn, His or Gln side chains need to be rotated by 180° about their χ2, χ2 or χ3 angle, respectively. Side-chain torsion angles are essential for hydrogen bonding and are sometimes altered during the modeling process. Conformational free energy distinguishes the native structure of a protein from incorrectly folded decoys. A distinct advantage of such physically derived functions is that they are based on well-defined physical interactions, making it easier to learn and gain insight from their performance. In addition, ab initio methods have shown success in recent CASP experiments. One major drawback of a physical-chemical description of the folding free energy of a protein is that the required treatment of solvation usually comes at significant computational expense. Fast solvation models, such as the generalized Born model, and a variety of simplified scoring schemes may prove extremely useful in this regard [62]. A number of freely available programs can be used to verify homology models; among them, WHAT_CHECK addresses typical crystallographic problems. Validation programs are generally of two types: (1) the first category (e.g. PROCHECK and WHAT IF) checks for proper protein stereochemistry, such as symmetry checks, geometry checks (chirality, bond lengths, bond angles, torsion angles) and structural packing quality and (2) the second category (e.g. VERIFY3D and PROSAII) checks the fitness of the sequence to the structure and assigns a score to each residue fitting its current environment. GRASP2 is a newer model assessment program developed by the Honig group [63]; for example, gaps and insertions can be mapped onto the structures to verify that they make sense geometrically. It is suggested that manual inspection be combined with these programs to further identify problems in the model.
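A crude, purely illustrative version of the Ramachandran check counts residues whose (φ, ψ) torsions fall inside coarse rectangular approximations of the allowed regions; the boxes below are rough assumptions, far simpler than the empirical distributions used by PROCHECK:

```python
# Coarse rectangular approximations of allowed regions, in degrees:
# (phi_min, phi_max, psi_min, psi_max); illustrative only.
ALLOWED_REGIONS = [
    (-180, -45, -75, 0),    # right-handed alpha helix
    (-180, -45, 90, 180),   # beta sheet
    (30, 90, 0, 90),        # left-handed alpha helix
]

def in_allowed_region(phi, psi):
    """True if the (phi, psi) pair falls in one of the coarse allowed boxes."""
    return any(p0 <= phi <= p1 and s0 <= psi <= s1
               for p0, p1, s0, s1 in ALLOWED_REGIONS)

def ramachandran_fraction(torsions):
    """Fraction of residues with (phi, psi) in an allowed region;
    a low fraction flags likely backbone-modeling errors."""
    hits = sum(in_allowed_region(phi, psi) for phi, psi in torsions)
    return hits / len(torsions)
```

A model whose Ramachandran fraction is noticeably lower than its template's, as computed by a real validation program, warrants revisiting the backbone-building step.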

5.3 Software for homology modeling

5.3.1 Robetta

Robetta is a publicly available, fully automated protein structure prediction server [64]. It uses the Robetta de novo and homology modeling structure prediction methods. As a guiding principle, the structure prediction of a protein depends on its sequence homology to a protein of known structure; in the absence of such sequence homology, models can be built using the Robetta de novo structure prediction methods [65]. To predict structures for full-length protein sequences, Robetta uses a domain prediction method. Regions that are homologous to sequences with known structures are modeled using a comparative modeling protocol. The initial step of this domain prediction method is called 'Ginzu' [66]. It involves BLAST [67], PSI-BLAST [68] and 3D-Jury [69] to detect regions in the query sequence that are homologous to experimentally determined structures. It then uses multiple sequence alignment (MSA)-based methods to predict the putative domains. The order of the procedure follows the reliability of each method: it starts with the most reliable method, BLAST, followed by PSI-BLAST, and so on. After a match is found for the query sequence, the remaining unmatched portion is used as input for the next step. For de novo modeling of protein domains, Robetta uses a protocol quite similar to that used in CASP5 [70]. A library of fragments representing the range of accessible local structures for all short segments of the protein chain is built from the Protein Data Bank (PDB). Structures are then assembled by random fragment insertion using a scoring function that favors native protein-like features. Robetta generates large numbers of alternative "decoy" conformations for the target and up to two sequence homologs, subsequently filters the decoy ensemble to remove non-protein-like conformations, and clusters the remaining structures based on Cα root-mean-square deviation (RMSD) over all ungapped positions. Final models are selected from the top clusters. To identify potential similarities to proteins of known structure, the final models are then searched against a representative set of PDB chains using Mammoth [71].

5.3.2 Modeller MODELLER ( is a computer program used for comparative protein structure modeling [72]. The input to be given to the program is a sequence alignment of the amino acid sequence of the target protein with the amino acid sequence of its template protein. The target protein is then modeled with the template structure. If the target protein has sequence similarities to more than one protein sequences, then a multiple sequence alignment of the target protein sequence with other proteins is given as input to the program. MODELLER is very fast as it can automatically calculate a model containing all non-hydrogen atoms within minutes with no user intervention on a Pentium processor. Not only model building, MODELLER can also perform some auxiliary tasks such as alignment of two or more protein sequences and/or structures or their fold-assignment [73], clustering of sequences and or structures and ab initio modeling of loops in protein structures. MODELLER models a protein structure by satisfaction of spatial restraints. To satisfy spatial restraints it uses either distance geometry or optimization techniques obtained from the alignment of the target sequence with the template structures. From two sources MODELLER can extract spatial restraints. They are as follows: First, homology-derived restraints on the distances and dihedral angles in the target sequence are extracted from its alignment with the template structure(s). Second, stereo chemical restraints such as bond length and bond angle preferences are obtained from the Charmm-22 molecular mechanics

force field [74], together with statistical preferences of dihedral angles and non-bonded atomic distances derived from a representative set of all known protein structures. After a target-template alignment is established, MODELLER automatically calculates a 3D structure of the target protein. The model is then optimized with the conjugate gradients method, and molecular dynamics simulation is performed to minimize violations of the spatial restraints. Thus MODELLER produces the best possible model consistent with the spatial restraints. The model building procedure of MODELLER is similar to structure determination by NMR spectroscopy.
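MODELLER's real optimizer works on full molecular-mechanics restraints in 3D. Purely as a toy illustration of the restraint-satisfaction idea, the sketch below places three "atoms" on a line so that hypothetical pairwise distance restraints are satisfied, using naive numerical gradient descent on the total restraint violation (all names and numbers here are invented for illustration):

```python
def violation(x, restraints):
    """Total squared violation of distance restraints.
    x: list of 1-D atom coordinates; restraints: {(i, j): target_distance}."""
    return sum((abs(x[i] - x[j]) - d) ** 2 for (i, j), d in restraints.items())

def satisfy(x, restraints, lr=0.1, steps=2000, h=1e-5):
    """Crude numerical gradient descent on the restraint violation."""
    x = list(x)
    for _ in range(steps):
        grad = []
        for k in range(len(x)):
            xp, xm = list(x), list(x)
            xp[k] += h
            xm[k] -= h
            grad.append((violation(xp, restraints) - violation(xm, restraints)) / (2 * h))
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

# Three atoms on a line: neighbours 1.5 apart, the two ends 3.0 apart
restraints = {(0, 1): 1.5, (1, 2): 1.5, (0, 2): 3.0}
model = satisfy([0.0, 0.5, 0.7], restraints)
print(round(violation(model, restraints), 6))  # 0.0
```

The same logic, generalized to thousands of 3D restraints from both the homology-derived and stereochemical sources, is what "modeling by satisfaction of spatial restraints" amounts to.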

5.3.3 3D-Jury

3D-Jury is a simple meta-predictor which generates meta-predictions using variable sets of models obtained from various sources, and thereby helps to improve the quality of protein structure prediction. The 3D-Jury system is available via the Structure Prediction Meta Server. The system is comparable with other meta-servers, but it has the highest combined specificity and sensitivity [69]. It follows a fold recognition approach. 3D-Jury takes as input a group of models produced by a set of servers. The system does not attempt to select the single correct model from the set of preliminary models, and it neglects the confidence scores assigned to each of the models. All models of the group are compared with each other to assign a similarity score to each pair. This score equals the number of Cα atom pairs that lie within 3.5 Å of each other after optimal superposition. If this score is less than 40, the pair of models is considered non-similar and the score is set to zero; a score of 40 or more indicates that the two models have approximately a 90% chance of belonging to the same fold class. The 3D-Jury score of a model = (sum of all similarity scores of the considered model pairs)/(number of considered pairs + 1) [75]. There are two modes of operation of the 3D-Jury system. One is the best-model mode (3D-Jury-single), which takes only one model from each server for use in the model building process. The other is the all-models mode (3D-Jury-all), which considers all models from the servers. The best-model mode of the 3D-Jury system provides the best outcomes on a set of eight servers. 3D-Jury is an extremely sensitive system, as it can act even on problematic targets. Further, the 3D-Jury system can also be operated as a meta-meta-predictor, where models collected from other meta-predictors are used in the consensus calculations. The 3D-Jury system follows a simple protocol, so it can be easily reproduced and incorporated into other fold recognition programs.
This improves the quality of the predictions [69].
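The scoring rule just described is easy to express in code. The sketch below (plain Python; coordinates are assumed to be already optimally superposed, so the superposition step itself is omitted) computes the pairwise similarity scores and the final 3D-Jury score for a toy 50-residue target:

```python
import math

def similarity(model_a, model_b, cutoff=3.5, threshold=40):
    """Number of corresponding Calpha pairs within `cutoff` Angstrom;
    scores below `threshold` are treated as non-similar (set to zero)."""
    n = sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)
    return n if n >= threshold else 0

def jury_score(model, other_models):
    """3D-Jury score: sum of similarity scores over the considered
    pairs, divided by (number of considered pairs + 1)."""
    scores = [similarity(model, m) for m in other_models]
    return sum(scores) / (len(scores) + 1)

# Toy 50-residue traces: one near-identical model, one shifted far away
target = [(1.5 * i, 0.0, 0.0) for i in range(50)]
close = [(1.5 * i + 1.0, 0.0, 0.0) for i in range(50)]  # all pairs within 3.5 A
far = [(1.5 * i + 10.0, 0.0, 0.0) for i in range(50)]   # all pairs beyond 3.5 A
print(jury_score(target, [close, far]))  # (50 + 0) / (2 + 1)
```

Models that agree with many other servers' models accumulate high scores; outliers, like `far` here, contribute nothing.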

5.3.4 SWISS-MODEL

SWISS-MODEL is a server for automated comparative modeling of three-dimensional protein structures. Depending on the complexity of each modeling task, the server aids the user in building and evaluating protein homology models

differently. The SWISS-MODEL server can model a protein structure with minimal user involvement [76]; in the simplest case, only an amino acid sequence needs to be provided as input. For more complex modeling tasks, additional user involvement may be indispensable. The SWISS-MODEL server therefore offers three main interaction modes: first approach mode (automated mode), alignment mode, and project mode [77].

5.4 Conclusion

Homology modeling has noteworthy potential as a tool in rational drug design, specifically in high-throughput in silico screening or simulation strategies. The quality of the final structure depends principally on the quality of the target-template alignment; any improvement in the alignment protocols will improve the final model. However, there will always be structural differences between target and templates, and these differences have to be recognized and compensated for by ab initio modeling or by optimization approaches.

Exercise 1: To generate, validate and run a homology model for a DHFR-TS (dihydrofolate reductase-thymidylate synthase) protein.


Requirements

1. Operating system: Windows (7, 8 and/or 10)
2. Freeware for non-commercial use: Modeller 9.19 (can be downloaded at http://; a visualizer (PyMOL, available at [Note: please check the version of your operating system and download accordingly.]

Step-by-step protocol

The homology modeling was carried out via Modeller 9.19 using various Modeller scripts in order to elucidate the three-dimensional structure of leishmanial DHFR-TS.

Step 1: Template recognition and initial alignment

The modeling was carried out with the help of the 'Advanced' tutorial of Modeller 9.19, which covers modeling of a protein-ligand complex based on multiple alignment and loop refinement. Three types of input files are required by Modeller in order to generate a protein structure: the query sequence in PIR format, the structures of template proteins in PDB format, and Python command files (scripts) in plain text format. First, the query/target sequence of Leishmania donovani DHFR-TS (Accession: AAM88660.1) was obtained in FASTA format from NCBI. This query sequence is utilized for two purposes: (a) to run BLAST in order to identify appropriate template proteins on the basis of sequence identity and E-value (expected value); (b) the FASTA format of the query sequence was converted to PIR format and saved as a 'DHFR-TS.ali' file readable by

modeller. 3INVA and 2H2QA, two Trypanosoma cruzi dihydrofolate reductase-thymidylate synthase structures, each having a sequence identity of 67% with the query and an E-value of zero, were selected as templates.

Step 2: Model building and loop refinement

The solved X-ray crystal structures of both template proteins were obtained from the Protein Data Bank. A number of Python-based scripts, such as model_energies.py, were executed using Modeller in order to generate the protein structure. Each script has its own significance, from sequence alignment, to generation of probable protein models, to evaluation of the generated protein models. Moreover, output files were generated for each script in the form of log files. The model_mult.log file lists all the generated models (Table 5.1). Each model was evaluated on the basis of its DOPE (Discrete Optimized Protein Energy) and GA341 scores; a model is considered ideal when it possesses the lowest DOPE score and a GA341 value of 1. Therefore, on the basis of DOPE and GA341, DHFR-TS.B99990005.pdb was selected for further evaluation. As a result, a DOPE profile for the protein model was generated as 'model.profile'. Gnuplot was utilized to compare the DOPE score profiles of the templates and the model (Fig. 5.2). The plot showed that the conformation of the loop around residues 200-217 had a higher DOPE score as compared to the model based on single templates. Therefore, the selected model was subjected to the loop refinement function of Modeller and further scripts were executed. As a result, a total of ten probable loop models were generated by Modeller (Table 5.2). The best loop model (DHFR-TS.BL00050001.pdb) was selected on the basis of DOPE score. A DOPE score profile was generated for the selected loop model and used for comparison against the profile of the previously selected best protein model (DHFR-TS.B99990005.pdb), as shown in Fig. 5.3. The DOPE score for residues around 200-217 was successfully decreased.
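The model-selection criterion described above is easy to automate. A small sketch using the values from Table 5.1 (reading the leading "2" of the extracted DOPE scores as a minus sign, since DOPE scores are negative):

```python
# (model name, DOPE score, GA341) as reported in Table 5.1, with the
# extracted scores read as negative numbers
models = [
    ("DHFR-TS.B99990001.pdb", -58972.06641, 1.00000),
    ("DHFR-TS.B99990002.pdb", -59282.17969, 1.00000),
    ("DHFR-TS.B99990003.pdb", -59327.55859, 1.00000),
    ("DHFR-TS.B99990004.pdb", -59120.92578, 1.00000),
    ("DHFR-TS.B99990005.pdb", -59355.75391, 1.00000),
]

# keep only models judged reliable by GA341, then pick the lowest DOPE score
reliable = [m for m in models if m[2] >= 1.0]
best = min(reliable, key=lambda m: m[1])
print(best[0])  # DHFR-TS.B99990005.pdb
```

The same two-stage filter (GA341 as a fold-reliability gate, DOPE as the tie-breaking energy) matches the selection reported in the protocol.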
Step 3: Validation and evaluation of generated homology model

Table 5.1: List of generated protein models.

S.No.   DHFR-TS model             DOPE score      GA341
1.      DHFR-TS.B99990001.pdb     -58972.06641    1.00000
2.      DHFR-TS.B99990002.pdb     -59282.17969    1.00000
3.      DHFR-TS.B99990003.pdb     -59327.55859    1.00000
4.      DHFR-TS.B99990004.pdb     -59120.92578    1.00000
5.      DHFR-TS.B99990005.pdb     -59355.75391    1.00000


Figure 5.2 DOPE score profile for protein model DHFR-TS.B99990005.pdb.

Table 5.2: Summary of successfully generated loop models.

S.No.   DHFR-TS model              DOPE score
1.      DHFR-TS.BL00010001.pdb     -57552.054688
2.      DHFR-TS.BL00020001.pdb     -57439.125000
3.      DHFR-TS.BL00030001.pdb     -58450.937500
4.      DHFR-TS.BL00040001.pdb     -58514.281250
5.      DHFR-TS.BL00050001.pdb     -58533.960938
6.      DHFR-TS.BL00060001.pdb     -58320.531250
7.      DHFR-TS.BL00070001.pdb     -57791.804688
8.      DHFR-TS.BL00080001.pdb     -58450.85156
9.      DHFR-TS.BL00090001.pdb     -58181.117188
10.     DHFR-TS.BL00100001.pdb     -58327.539063

The validation of the generated model was done with the help of the Ramachandran plot and various web-based programs such as ProSA (protein structure analysis) and ProQ (protein quality predictor). The Ramachandran plot showed that more than 99% of the non-glycine residues were in the allowed and partially allowed regions. Only three of the non-glycine residues were in the disallowed region, but all of these were far from the binding site and therefore had less effect on


Figure 5.3 DOPE score profile for the protein model after loop refinement.

the structure quality (Fig. 5.4). The ProSA analysis of L. donovani DHFR-TS predicted the Z-score to be -10.55, which is within the range of native conformations of crystal structures (Fig. 5.5A). The ProSA analysis also showed the overall residue energies of the L. donovani DHFR-TS model (Fig. 5.5B). The ProQ web analysis likewise predicted the model to be extremely good, with a predicted LGscore of 5.475 and a predicted MaxSub of 0.420. The final refinement of the model was done with the help of molecular dynamics simulations, which were carried out for a period of 50 ns. The RMSD plot of the modeled protein is given in Fig. 5.6.

Exercise 2. To perform the sequence alignment of protein kinase D 1 (PKD1) with CaMK-1G (PDB ID: 2JAM_A) using PROMALS3D and report various parameters to assess the quality of the alignment.

Exercise 3. To search for a template to build a model of the target lactate dehydrogenase from Trichomonas vaginalis using the BLAST tool.

Exercise 4. To identify template proteins for human aquaporin-8 using multiple sequence alignment and report the top-ranked template proteins along with their scoring parameters.

Exercise 5. To model the 3D structure of the mutant Cys387Ser DprE1 for designing novel inhibitors to treat totally drug-resistant tuberculosis. (Hint: Cys387Ser DprE1 is a mutant of


Figure 5.4 The Ramachandran plot of the modeled L. donovani DHFR-TS.

Figure 5.5 Analysis results from ProSA: (A) Z-score and (B) energy profile diagrams.


Figure 5.6 The RMSD plot of the modeled L. donovani DHFR-TS protein.

DprE1, which is a target for MDR-TB and XDR-TB; this single point mutation is responsible for totally drug-resistant TB. Refer to article [78].)

Exercise 6. To model the 3D structure of human carboxylesterase hCES2 including water molecules. (Hint: the water molecule in the active binding site has an important role in catalyzing the hydrolysis reaction. Refer to article [79].)

References [1] A. Fiser, Template-based protein structure modeling, Comput. Biol., Springer, 2010, pp. 73 94. ˇ [2] M.A. Martı´-Renom, A.C. Stuart, A. Fiser, R. Sa´nchez, F. Melo, A. Sali, Comparative protein structure modeling of genes and genomes, Annu. Rev. Biophys. Biomol. Struct. 29 (2000) 291 325. [3] S.K. Burley, A. Joachimiak, G.T. Montelione, I.A. Wilson, Contributions to the NIH-NIGMS protein structure initiative from the PSI production centers, Structure 16 (2008) 5 11. [4] K. Ginalski, Comparative modeling for protein structure prediction, Curr. Opin. Struct. Biol. 16 (2006) 172 177. [5] A. Hillisch, L.F. Pineda, R. Hilgenfeld, Utility of homology models in the drug discovery process, Drug. Discov. Today 9 (2004) 659 669.

Homology modeling: Developing 3D structures of target proteins missing in databases 127 [6] P. Larsson, B. Wallner, E. Lindahl, A. Elofsson, Using multiple templates to improve quality of homology models in automated homology modeling, Protein Sci. 17 (2008) 990 1002. [7] R.C. Edgar, S. Batzoglou, Multiple sequence alignment, Curr. Opin. Struct. Biol. 16 (2006) 368 373. [8] J. Cheng, A multi-template combination algorithm for protein comparative modeling, BMC Struct. Biol. 8 (2008) 18. [9] T. Liu, M. Guerquin, R. Samudrala, Improving the accuracy of template-based predictions by mixing and matching between initial models, BMC Struct. Biol. 8 (2008) 24. [10] W.-C. Wong, S. Maurer-Stroh, F. Eisenhaber, Not all transmembrane helices are born equal: towards the extension of the sequence homology concept to membrane proteins, Biol. Direct 6 (2011) 57. [11] W.R. Pearson, BLAST and FASTA similarity searching for multiple sequence alignment, Multiple Sequence Alignment Methods, Springer, 2014, pp. 75 101. [12] E. Lindahl, A. Elofsson, Identification of related proteins on family, superfamily and fold level, J. Mol. Biol. 295 (2000) 613 625. [13] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, et al., Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods, J. Mol. Biol. 284 (1998) 1201 1210. [14] J.M. Sauder, J.W. Arthur, R.L. Dunbrack Jr, Large-scale comparison of protein sequence alignment algorithms with structure alignments, Proteins: Struct., Funct., Bioinf. 40 (2000) 6 22. [15] J. Skolnick, J.S. Fetrow, From genes to protein structure and function: novel applications of computational approaches in the genomic era, Trends Biotechnol. 18 (2000) 34 39. [16] D.-F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisitetto correct phylogenetic trees, J. Mol. Evol. 25 (1987) 351 360. [17] J.D. Thompson, D.G. Higgins, T.J. 
Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22 (1994) 4673 4680. [18] K. Karplus, C. Barrett, R. Hughey, Hidden Markov models for detecting remote protein homologies, Bioinforma (Oxford, Engl.) 14 (1998) 846 856. [19] S.R. Eddy, Profile hidden Markov models, Bioinforma (Oxford, Engl.) 14 (1998) 755 763. [20] S. Pietrokovski, Searching databases of conserved sequence regions by aligning protein multiplealignments, Nucleic Acids Res. 24 (1996) 3836 3845. [21] R.I. Sadreyev, N.V. Grishin, Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs, Bioinformatics 20 (2004) 818 828. [22] C. Notredame, D.G. Higgins, J. Heringa, T-Coffee: a novel method for fast and accurate multiple sequence alignment, J. Mol. Biol. 302 (2000) 205 217. [23] J. Shi, T.L. Blundell, K. Mizuguchi, FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties, J. Mol. Biol. 310 (2001) 243 257. [24] C.B. Do, M.S. Mahabhashyam, M. Brudno, S. Batzoglou, ProbCons: probabilistic consistency-based multiple sequence alignment, Genome Res. 15 (2005) 330 340. [25] K. Katoh, K.-i. Kuma, H. Toh, T. Miyata, MAFFT version 5: improvement in accuracy of multiple sequence alignment, Nucleic Acids Res. 33 (2005) 511 518. [26] J.D. Thompson, F. Plewniak, O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Res. 27 (1999) 2682 2690. [27] S.F. Altschul, T.L. Madden, A.A. Scha¨ffer, J. Zhang, Z. Zhang, W. Miller, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389 3402. [28] L. Jaroszewski, L. Rychlewski, Z. Li, W. Li, A. Godzik, FFAS03: a server for profile profile sequence alignments, Nucleic Acids Res. 33 (2005) W284 W288. [29] M.A. Marti-Renom, M. 
Madhusudhan, A. Sali, Alignment of protein sequences by their profiles, Protein Sci. 13 (2004) 1071 1087.

128 Chapter 5 [30] T. Lassmann, E.L. Sonnhammer, Automatic assessment of alignment quality, Nucleic Acids Res. 33 (2005) 7120 7128. [31] J. Muller, C.J. Creevey, J.D. Thompson, D. Arendt, P. Bork, AQUA: automated quality improvement for multiple sequence alignments, Bioinformatics 26 (2010) 263 265. [32] J.D. Thompson, J.-C. Thierry, O. Poch, RASCAL: rapid scanning and correction of multiple sequence alignments, Bioinformatics 19 (2003) 1155 1161. [33] A. Roy, B. Taddese, S. Vohra, P.K. Thimmaraju, C.J. Illingworth, L.M. Simpson, et al., Identifying subset errors in multiple sequence alignments, J. Biomol. Struct. Dyn. 32 (2014) 364 371. [34] V. Collura, J. Higo, J. Garnier, Modeling of protein loops by simulated annealing, Protein Sci. 2 (1993) 1502 1510. [35] B. Rost, Twilight zone of protein sequence alignments, Protein Eng. 12 (1999) 85 94. [36] Z. Xiang, Advances in homology protein structure modeling, Curr. Protein Peptide Sci. 7 (2006) 217 227. [37] M. Claessens, E. Van Cutsem, I. Lasters, S. Wodak, Modelling the polypeptide backbone with ‘spare parts’ from known protein structures, Protein Eng., Des. Sel. 2 (1989) 335 345. [38] L. Holm, C. Sander, Database algorithm for generating protein backbone and side-chain co-ordinates from a Cα trace: application to model building and detection of co-ordinate errors, J. Mol. Biol. 218 (1991) 183 194. [39] C.W. van Gelder, F.J. Leusen, J.A. Leunissen, J.H. Noordik, A molecular dynamics approach for the generation of complete protein structures from limited coordinate data, Proteins: Struct., Funct., Bioinf. 18 (1994) 174 185. [40] J. Zhu, H. Fan, X. Periole, B. Honig, A.E. Mark, Refining homology models by combining replica-exchange molecular dynamics and statistical potentials, Proteins: Struct., Funct., Bioinf. 72 (2008) 1171 1188. [41] R. Han, A. Leo-Macias, D. Zerbino, U. Bastolla, B. Contreras-Moreira, A.R. 
Ortiz, An efficient conformational sampling method for homology modeling, Proteins: Struct., Funct., Bioinf. 71 (2008) 175 188. [42] K.M. Misura, D. Baker, Progress and challenges in high-resolution refinement of protein structure models, Proteins: Struct., Funct., Bioinf. 59 (2005) 15 29. [43] J. Lee, D. Lee, H. Park, E.A. Coutsias, C. Seok, Protein loop modeling by using fragment assembly and analytical loop closure, Proteins: Struct., Funct., Bioinf. 78 (2010) 3428 3436. [44] E. Michalsky, A. Goede, R. Preissner, Loops In Proteins (LIP)—a comprehensive loop database for homology modelling, Protein Eng. 16 (2003) 979 985. [45] T.A. Jones, S. Thirup, Using known substructures in protein model building and crystallography, EMBO J. 5 (1986) 819 822. [46] M.J. Sutcliffe, I. Haneef, D. Carney, T. Blundell, Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures, Protein Eng., Des. Sel. 1 (1987) 377 384. [47] J. Moult, M. James, An algorithm for determining the conformation of polypeptide segments in proteins by systematic search, Proteins: Struct., Funct., Bioinf. 1 (1986) 146 163. [48] J.-L. Pellequer, S.-wW. Chen, Does conformational free energy distinguish loop conformations in proteins? Biophys. J. 73 (1997) 2359 2375. [49] R. Sowdhamini, S.D. Rufino, T.L. Blundell, A database of globular protein structural domains: clustering of representative family members into similar folds, Fold. Des. 1 (1996) 209 220. [50] P.S. Shenkin, D.L. Yarmush, R.M. Fine, H. Wang, C. Levinthal, Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ringlike structures, Biopolym.: Orig. Res. Biomol. 26 (1987) 2053 2085. [51] H. Shirai, N. Nakajima, J. Higo, A. Kidera, H. Nakamura, Conformational sampling of CDR-H3 in antibodies by multicanonical molecular dynamics simulation, J. Mol. Biol. 278 (1998) 481 496.

Homology modeling: Developing 3D structures of target proteins missing in databases 129 [52] A.A. Canutescu, A.A. Shelenkov, R.L. Dunbrack Jr, A graph-theory algorithm for rapid protein side-chain prediction, Protein Sci. 12 (2003) 2001 2014. [53] M. Levitt, Accurate modeling of protein conformation by automatic segment matching, J. Mol. Biol. 226 (1992) 507 533. [54] L. Michielan, M. Bacilieri, A. Schiesaro, C. Bolcato, G. Pastorin, G. Spalluto, et al., Linear and nonlinear 3D-QSAR approaches in tandem with ligand-based homology modeling as a computational strategy to depict the pyrazolo-triazolo-pyrimidine antagonists binding site of the human adenosine A2A receptor, J. Chem. Inf. Model. 48 (2008) 350 363. [55] D. Petrey, B. Honig, Protein structure prediction: inroads to biology, Mol. Cell 20 (2005) 811 819. [56] M. Gerstein, M. Levitt, Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins, Protein Sci. 7 (1998) 445 456. [57] M.J. Sippl, Knowledge-based potentials for proteins, Curr. Opin. Struct. Biol. 5 (1995) 229 235. [58] G. Baumann, C. Froo¨mmel, C. Sander, Polarity as a criterion in protein design, Protein Eng., Des. Sel. 2 (1989) 329 334. [59] L. Holm, C. Sander, Evaluation of protein models by atomic solvation preference, J. Mol. Biol. 225 (1992) 93 105. [60] L.M. Gregoret, F.E. Cohen, Novel method for the rapid evaluation of packing in protein structures, J. Mol. Biol. 211 (1990) 959 974. [61] R.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Errors in protein structures, Nature 381 (1996) 272. [62] D. Petrey, B. Honig, Free energy determinants of tertiary structure and the evaluation of protein models, Protein Sci. 9 (2000) 2181 2191. [63] D. Petrey, B. Honig, GRASP2: visualization, surface properties, and electrostatics of macromolecular structures and sequences, Methods Enzymol., Elsevier, 2003, pp. 492 509. [64] D.E. Kim, D. Chivian, D. 
Baker, Protein structure prediction and analysis using the Robetta server, Nucleic Acids Res. 32 (2004) W526 W531. [65] C.A. Rohl, C.E. Strauss, D. Chivian, D. Baker, Modeling structurally variable regions in homologous proteins with rosetta, Proteins: Struct., Funct., Bioinf. 55 (2004) 656 677. [66] D. Chivian, D.E. Kim, L. Malmstro¨m, P. Bradley, T. Robertson, P. Murphy, et al., Automated prediction of CASP-5 structures using the Robetta server, Proteins: Struct., Funct., Bioinf. 53 (2003) 524 533. [67] S. McGinnis, T.L. Madden, BLAST: at the core of a powerful and diverse set of sequence analysis tools, Nucleic Acids Res. 32 (2004) W20 W25. [68] S.F. Altschul, E.V. Koonin, Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases, Trends Biochem. Sci. 23 (1998) 444 447. [69] Mv. Grotthuss, J. Pas, L. Wyrwicz, K. Ginalski, L. Rychlewski, Application of 3D-Jury, GRDB, and Verify3D in fold recognition, Proteins: Struct., Funct., Bioinf. 53 (2003) 418 423. [70] J. Westbrook, Z. Feng, L. Chen, H. Yang, H.M. Berman, The protein data bank and structural genomics, Nucleic Acids Res. 31 (2003) 489 491. [71] A.R. Ortiz, C.E. Strauss, O. Olmea, MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison, Protein Sci. 11 (2002) 2606 2621. [72] N. Eswar, D. Eramian, B. Webb, M.-Y. Shen, A. Sali, Protein structure modeling with MODELLER, Structural Proteomics, Springer, 2008, pp. 145 159. [73] N. Eswar, B. Webb, M.A. Marti-Renom, M. Madhusudhan, D. Eramian, My. Shen, et al., Comparative protein structure modeling using MODELLER, Curr. Protoc. ProteSci. 50 (2007). 2.9. 1-2.9. 31. [74] N. Eswar, M. Madhusudan, M. Marti-Renom, A. Sali, Build_profile: a module for calculating sequence profile in MODELLER. http, in, 2005. [75] My. Shen, A. Sali, Statistical potential for assessment and prediction of protein structures, Protein Sci. 15 (2006) 2507 2524. [76] K. Arnold, L. Bordoli, J. Kopp, T. 
Schwede, The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling, Bioinformatics 22 (2006) 195 201.

130 Chapter 5 [77] F. Kiefer, K. Arnold, M. Ku¨nzli, L. Bordoli, T. Schwede, The SWISS-MODEL repository and associated resources, Nucleic Acids Res. 37 (2008) D387 D392. [78] H. Verma, S. Choudhary, P.K. Singh, A. Kashyap, O. Silakari, Decoding the signature of molecular mechanism involved in mutation associated resistance to 1, 3-benzothiazin-4-ones (Btzs) based DprE1 inhibitors using BTZ043 as a reference drug, Mol. Simul. 45 (2019) 1515 1523. [79] S. Choudhary, O. Silakari, hCES1 and hCES2 mediated activation of epalrestat-antioxidant mutual prodrugs: unwinding the hydrolytic mechanism using in silico approaches, J. Mol. Graph. Model. 91 (2019) 148 163.


Molecular docking analysis: Basic technique to predict drug-receptor interactions

6.1 Introduction: what is molecular docking?

Molecular docking analysis is an important and very commonly employed tool in the structure-based drug design (SBDD) strategy, used to gain an understanding of the binding interactions between a ligand (small molecule) and its target receptor (protein). It represents an optimization of a model of the ligand-protein complex by refinement of the separation between the partners, with variation in their relative orientation but fixed internal geometry of each component of the model [1-5]. In simpler terms, docking is a method that predicts the preferred orientation of one molecule relative to a second when they bind to each other to form a stable complex. It involves the binding of a ligand to a protein in order to achieve an overall 'best fit'. Docking analysis can therefore be used to optimize a lead molecule for a desired molecular interaction by anticipating the required modifications to its structure on the basis of the 3D structure of the interacting protein. Nowadays docking is not limited to small molecules; it is also used to model protein-protein binding, and to analyse and predict protein-protein interactions. Through the decades, docking analysis has established itself as an easy, reliable and time-saving technique for the rational selection of hit molecules against a specific biological target [6-9]. Molecular docking analysis helps the user to model the atomic-level interaction between a small molecule and a protein, which allows one to characterize the orientation and conformation of the small molecule in the binding site of the target protein and thereby disclose the fundamental mechanism regulating the biochemical processes [10-12]. To simplify, the docking process can be better understood in two basic steps.
The first step is the prediction of the ligand conformation and orientation within the binding cavity of the macromolecule, commonly phrased as the pose; the second step is the assessment and calculation of the binding affinity between the two components of the model, i.e. the ligand and the protein, usually termed the docking score. Thus, to give any value or significance to the docking exercise, determining the exact location of the binding site is the primary and basic requirement. Docking performed without any assumption about the location of the binding site is called blind docking.


Considering the initial explanation of the ligand-receptor binding mechanism as "lock and key", the earliest reported docking methods [13] considered both the ligand and the receptor as rigid bodies. However, the introduction of the "induced-fit" theory [14,15] led to a revision of molecular docking protocols, with the understanding that both the ligand and the protein structure (receptor) should be treated as flexible during molecular docking. Even so, due to computational limitations, molecular docking analysis is generally and most popularly performed with a flexible ligand and a rigid receptor [16,17]. With time, several molecular modeling experts developed different tools to compensate for the flexibility of the receptor [18,19]. Still, these are not very commonly employed, probably because of the advancement in the field of molecular dynamics (refer to Chapter 7), which involves simulation of a ligand-protein complex in a physiological environment, providing the most relevant real-life analysis. To understand the theoretical aspects of docking analysis, there are certain terms with which one should be familiar. These include:

Conformation - the 3D shape of the ligand being docked.
Translation - a position in Cartesian space, relative to the protein's active site.
Orientation - rotations around the Cartesian axes while keeping the center of mass at a fixed point in space.
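These three terms can be made concrete in code. The sketch below applies a pose to a toy two-atom ligand: a rotation about the centre of mass (orientation; restricted to the z-axis here for brevity) followed by a shift in Cartesian space (translation), leaving the internal geometry (conformation) unchanged:

```python
import math

def apply_pose(coords, translation, theta_z):
    """Apply a rigid-body pose: rotation by theta_z (radians) about the
    z-axis through the centre of mass, followed by a translation.
    The ligand's internal geometry (conformation) is left unchanged."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    c, s = math.cos(theta_z), math.sin(theta_z)
    tx, ty, tz = translation
    posed = []
    for x, y, z in coords:
        x0, y0, z0 = x - cx, y - cy, z - cz            # centre at origin
        xr, yr = c * x0 - s * y0, s * x0 + c * y0      # orientation change
        posed.append((xr + cx + tx, yr + cy + ty, z0 + cz + tz))  # translation
    return posed

# Two-atom "ligand" rotated 90 degrees and moved 2 Angstrom along x;
# the bond length (internal geometry) is preserved
ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
posed = apply_pose(ligand, (2.0, 0.0, 0.0), math.pi / 2)
```

A full docking engine varies all three Euler angles, the translation vector and, for flexible ligands, the torsion angles of rotatable bonds; this sketch only illustrates what the vocabulary refers to.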

6.2 Theory of docking

Fundamentally, the aim of molecular docking is to predict a putative model of the ligand-receptor complex using computational tools and methods. As mentioned in the introduction, molecular docking mainly involves two interrelated steps: first, sampling of the conformations of the ligand in the binding/active site of the protein; and second, ranking these conformations using a scoring function. In an ideal scenario, the employed sampling algorithm should provide a binding mode/pose of the ligand that is coherent with the experimental binding mode, and the employed scoring function should rank the most probable pose highest among all the generated poses/binding modes.

6.2.1 Sampling algorithms

Considering the number of parameters involved in determining the binding modes between two molecules (six degrees of translational and rotational freedom plus the conformational degrees of freedom of both the ligand and the protein), there are presumably a large number of possible binding modes in a ligand-protein complex model. Thus, various sampling algorithms have been developed and are widely used in molecular docking software. Fig. 6.1 outlines various sampling algorithms commonly used in different docking tools.


Figure 6.1 Graphical representation of different sampling algorithms used in molecular docking.

Shape matching algorithm

The shape matching algorithm considers the geometrical overlap between two molecules and is based on the molecular shape map of a ligand in the binding site of the target protein, in terms of shape features and chemical information [20,21]. Different algorithms are employed in order to generate several alignments between ligand and receptor. The protein and the ligand are represented as pharmacophores, and the distances between pharmacophore features within the protein and the ligand are calculated for a match [22]. New ligand conformations are governed by the distance matrix, built from the atomic coordinates of the pharmacophore features and the corresponding ligand atoms. Chemical properties, like hydrogen-bond donors and acceptors, can be taken into account during the match. Such shape matching algorithms for ligand docking are available in the DOCK [13], FLOG [23], LibDock [24] and SANDOCK [25] programs. This approach may also identify the possible binding sites of a protein by a macromolecular surface search. Besides, shape-matching-specific algorithms also establish the possible conformations of the predicted binding sites.

Incremental construction

This method splits the ligand into fragments that are separately placed into the active site in a fragmental, incremental fashion [26]. After the fragments are docked, the parts are fused together. Different orientations of the fragments are generated to fit the active site; this fragmentation thus allows the algorithm to account for ligand flexibility. The rigid fragment that is docked first works as an "anchor" and is usually the largest fragment, or the piece expected to have significant interactions with the protein. This anchor is then joined to the flexible parts of the ligand, which contain the rotatable bonds.
In this way the ligand is gradually "constructed" inside the binding site of the receptor, and this method is therefore also known as the anchor-and-grow

method. The incremental construction method has been used in DOCK 4.0 [27], FlexX [26], Hammerhead [28], SLIDE [29] and eHiTS [30].

Multiple copy simultaneous search

The multiple copy simultaneous search (MCSS) is a fragment-based method for either de novo design of ligands or modification of known ligands to enhance their binding affinity. In this algorithm, 1000 to 5000 copies of a functional group are randomly placed in the binding site and subjected to simultaneous energy minimization. Interactions among the copies are omitted and only interactions with the protein are considered. Eventually, a set of energetically favorable binding sites and orientations of the fragments is identified. In this way the whole binding site is mapped using different fragments, and a novel molecule can then be designed through linkage of selected fragments having a perfect match with the binding site [31,32].

LUDI

LUDI is another de novo design algorithm, built around three components: interaction sites, molecular fragments and a bridge. An interaction site is a position in space, not occupied by the enzyme, where a fragment can have favorable interactions with the enzyme. This algorithm focuses on the hydrogen bonds and hydrophobic contacts which could be formed between the ligand and the protein [33]. Accordingly, a set of interaction sites is generated either by searching a database or by using rules. This is followed by fitting of the fragments onto the interaction sites and their evaluation by distance criteria. The final step is the connection of the fitted fragments via a bridge to obtain a single molecule [34].

Monte Carlo (MC) method

The Monte Carlo (MC) method is a stochastic method which searches the conformational space by randomly modifying a ligand conformation. In this algorithm, poses are generated for the ligand through bond rotation and rigid-body translation or rotation [35]. The conformations thus obtained are then evaluated by an energy-based selection criterion.
Successful conformations are saved and the procedure is continued until a predefined number of conformations has been collected. This algorithm offers a genuine advantage: because large random variations are possible, the ligand can cross energy barriers on the potential energy surface. Programs employing the MC algorithm include an earlier version of AutoDock [36], ICM [37] and QXP [38].

Genetic algorithms
Another algorithm that, like Monte Carlo, belongs to the class of well-known stochastic methods is the genetic algorithm (GA) [39]. The basic concept of GA is derived from Darwin's theory of evolution. In this algorithm, the degrees of freedom of the ligand are encoded as binary strings termed genes. These genes make up a hypothetical 'chromosome' which represents the pose of the ligand. GA algorithms also include two kinds of genetic operators, mutation and crossover. Mutation makes random changes to the genes, and crossover exchanges genes between two chromosomes; by acting on the chromosome, these operators change the conformation of the ligand, producing a new pose. Each new pose is then assessed by the scoring function, and the ones that survive (i.e., exceed a threshold) are used for the next generation. Genetic algorithms have been used in AutoDock [40], GOLD [41], DIVALI [42] and DARWIN [43].
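The encoding and operators described above can be illustrated with a minimal genetic algorithm in Python. The `fitness` function here is a deliberately trivial stand-in for a docking score (it just counts 1-bits in the chromosome), and all names and parameter values are invented for illustration:

```python
import random

rng = random.Random(0)   # fixed seed so the run is reproducible

def fitness(chrom):
    # stand-in for a docking score: here, simply the count of 1-bits
    return sum(chrom)

def crossover(a, b):
    cut = rng.randrange(1, len(a))          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    # flip each bit with a small probability
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]

def evolve(n_bits=16, pop_size=20, generations=40):
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]     # selection: keep the fitter half
        children = [mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
```

In a real docking GA the bits would encode torsion angles and the rigid-body position of the ligand, and the fitness would be the docking score of the decoded pose.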

6.2.2 Scoring functions
In a molecular docking analysis, conformational sampling is guided by a scoring function (or energy function) that is used to determine the fitness between the protein and the ligand or, in simpler terms, to separate correct poses from incorrect ones. The final docked conformations are also selected on the basis of these scores. The accuracy of the scoring function is therefore an important determinant of the quality of molecular docking results. However, scoring functions estimate, rather than calculate, the binding affinity between the protein and ligand, adopting various assumptions and simplifications [44] (Fig. 6.2). Overall, scoring functions can be divided into the following sub-types: force-field-based, empirical and knowledge-based scoring functions [45].

Figure 6.2 Graphical representation of different scoring functions used in molecular docking.

Force field
Force-field-based scoring functions assess the binding energy by calculating the sum of the non-bonded (electrostatic and van der Waals) interactions [46]. The electrostatic interactions are calculated by a Coulombic formulation using a distance-dependent dielectric function to modulate the contribution of charge–charge interactions, since simple point-charge calculations do not accurately model the real protein environment. The van der Waals term is calculated using a Lennard-Jones potential function, and different parameter sets can result in a varied "hardness" of the potential, controlling how closely ligand atoms may approach the binding pocket. Extensions of force-field-based scoring functions also consider hydrogen bonds, solvation and entropy contributions. However, force-field-based scoring functions have the limitation of slow computational speed. Also, because a cut-off distance is used to manage non-bonded interactions, the accuracy of long-range effects involved in binding is limited. Molecular docking tools such as DOCK, GOLD and AutoDock use these scoring functions with slight variations, such as in the treatment of hydrogen bonds or the form of the energy function.

Empirical scoring functions
Empirical scoring functions treat the total binding energy as a composition of several energy components, including hydrogen bonding, ionic interactions, the hydrophobic effect and binding entropy [47–49]. For the calculation, each component is multiplied by a coefficient and the terms are then summed to give a final score. The coefficients are determined via regression analysis fitted to a test set of ligand–protein complexes with known binding affinities. Although empirical scoring functions have relatively simple energy terms to evaluate, their applicability to ligand–protein complexes beyond the training set is unclear. Additionally, each term in an empirical scoring function may be treated differently by different software, and the number of terms included also differs. LUDI, PLP [50] and ChemScore [51] are examples of empirical scoring functions.

Knowledge-based
Knowledge-based scoring functions use statistical analysis of crystal structures of ligand–protein complexes to obtain the interatomic contact frequencies and/or distances between the ligand and protein [52,53]. They are based on the assumption that the more favorable an interaction, the greater its frequency of occurrence.
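Returning to the force-field family for a moment, the non-bonded sum it describes reduces to a double loop over ligand–protein atom pairs. A minimal sketch, with illustrative made-up parameters rather than any real force field's values:

```python
import math

def pairwise_energy(r, qi, qj, eps=0.2, rmin=3.5):
    """One ligand-protein atom pair: Lennard-Jones 12-6 plus a Coulomb term
    with a simple distance-dependent dielectric, epsilon(r) = 4r.
    eps (well depth, kcal/mol) and rmin (optimal distance, Angstrom)
    are illustrative values, not a real force field's parameters."""
    lj = eps * ((rmin / r) ** 12 - 2 * (rmin / r) ** 6)
    coulomb = 332.0 * qi * qj / (4.0 * r * r)   # 332 converts e^2/A to kcal/mol
    return lj + coulomb

def ff_score(ligand, protein, cutoff=8.0):
    """Sum non-bonded interactions over all ligand-protein atom pairs
    within a cutoff. Atoms are (x, y, z, partial_charge) tuples."""
    total = 0.0
    for (x1, y1, z1, q1) in ligand:
        for (x2, y2, z2, q2) in protein:
            r = math.dist((x1, y1, z1), (x2, y2, z2))
            if 0.0 < r < cutoff:
                total += pairwise_energy(r, q1, q2)
    return total
```

The cutoff in `ff_score` is exactly the approximation criticized in the text: pairs beyond it contribute nothing, so long-range effects are lost.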
These frequency distributions are then converted into pairwise atom-type potentials. The score is calculated by favoring preferred contacts and penalizing repulsive interactions between each atom in the ligand and protein within a given cut-off. The key advantage of knowledge-based functions is their computational simplicity, which allows them to be used to screen large compound databases. They can also model some uncommon interactions, such as sulfur–aromatic or cation–π contacts, which are often poorly handled by empirical approaches. However, as some interactions are rarely observed in the training sets of crystal structures, knowledge-based functions suffer from bias inherent in the selection of proteins. The obtained parameters may therefore not be suitable for widespread use, especially for interactions involving metals or halogens. PMF [54], DrugScore [55], SMoG [56] and Bleep [57] are examples of knowledge-based functions, which differ mainly in the size of their training sets, the form of the energy function, the definition of atom types, the distance cut-off and other parameters.
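The conversion of contact statistics into a pairwise potential is the inverse-Boltzmann step, which can be sketched in a few lines; the frequencies below are invented, not taken from any real training set:

```python
import math

def knowledge_based_potential(observed, reference, kT=0.593):
    """Inverse-Boltzmann conversion of contact statistics into a pairwise
    potential: w(r) = -kT * ln( f_observed(r) / f_reference(r) ).
    `observed` maps a distance bin (Angstrom) to the contact frequency of
    one atom-type pair in the crystal-structure training set; `reference`
    holds the frequencies expected by chance. All numbers are made up."""
    return {r: -kT * math.log(observed[r] / reference[r]) for r in observed}

# Hypothetical frequencies for, say, a carbonyl O ... amide N pair:
observed  = {3.0: 0.30, 4.0: 0.15, 5.0: 0.05}
reference = {3.0: 0.10, 4.0: 0.15, 5.0: 0.10}
w = knowledge_based_potential(observed, reference)
# contacts seen more often than chance get a favorable (negative) energy,
# contacts seen less often than chance get an unfavorable (positive) one
```

The bias discussed above is visible here: if a contact type is rare in the training set, its `observed` frequency is poorly estimated and the derived potential is unreliable.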

Consensus
Consensus scoring is a more recent strategy that combines several different scores to assess docking conformations [58]. A pose of a ligand, or a potential binder, is accepted only when it scores well under a number of different scoring schemes. Consensus scoring usually improves enrichment substantially (i.e., the percentage of strong binders among the high-scoring ligands) in virtual screening, and improves the prediction of bound conformations and poses [59]. However, the prediction of binding energies may still be inaccurate. Also, the usefulness of consensus scoring diminishes when the terms in the different scoring functions are significantly correlated. CScore [60] is an example, combining the DOCK, ChemScore, PMF, GOLD and FlexX scoring functions. Typical scoring functions face the problem of affinity prediction partly because of their limited treatment of the solvation effect. One way to address this problem is physics-based scoring, e.g. MM-PB/SA and MM-GB/SA (MM stands for molecular mechanics; PB and GB for Poisson-Boltzmann and Generalized Born, respectively; SA for solvent-accessible surface area), which is used in rescoring or lead optimization to improve the accuracy of binding affinity prediction. Promising results have been obtained with MM-PB/SA [61] and MM-GB/SA [62] in several studies.
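One common way to implement the consensus idea is rank-by-rank: each scoring function ranks the ligands, and the average rank decides. A small sketch with made-up scores for three hypothetical ligands (lower score = better binder):

```python
def consensus_rank(score_tables):
    """Rank-by-rank consensus: each scoring function ranks the ligands
    (lower score = better binder here), and ligands are then ordered by
    their average rank across all functions."""
    ranks = {}
    for scores in score_tables:
        ordered = sorted(scores, key=scores.get)        # best first
        for rank, ligand in enumerate(ordered, start=1):
            ranks.setdefault(ligand, []).append(rank)
    return sorted(ranks, key=lambda lig: sum(ranks[lig]) / len(ranks[lig]))

# invented scores mimicking three different scoring functions
dockscore = {"lig1": -9.1,  "lig2": -7.4,  "lig3": -8.2}
chemscore = {"lig1": -30.5, "lig2": -28.0, "lig3": -31.0}
pmf_score = {"lig1": -55.0, "lig2": -40.0, "lig3": -48.0}
hits = consensus_rank([dockscore, chemscore, pmf_score])
# lig2 ranks last under every function, so consensus places it last
```

Averaging ranks rather than raw scores sidesteps the problem that different functions report energies on incompatible scales; the correlation caveat above still applies, since three near-identical functions would add no information.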

6.3 Types of molecular docking

6.3.1 Rigid docking: rigid ligand and rigid receptor docking
The first type of molecular docking treats both the ligand and the receptor as rigid bodies. In this type of protocol, the search space is very limited, involving only three translational and three rotational degrees of freedom. Some ligand flexibility can nevertheless be accounted for by employing a pre-computed set of ligand conformations, or by allowing a degree of atom–atom overlap between the protein and the ligand. Early versions of programs such as DOCK and FLOG, and some protein–protein docking programs such as FTDOCK [63], use rigid ligand and rigid receptor docking.
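The six rigid-body degrees of freedom can be made concrete: three Euler angles of rotation plus a translation vector, applied to every ligand atom while leaving internal geometry untouched. A minimal sketch:

```python
import math

def rigid_transform(coords, angles, translation):
    """Apply the six rigid-body degrees of freedom to a ligand: rotations
    about z, y, x (Euler angles, radians) followed by a translation.
    Bond lengths and angles are untouched -- the ligand moves as one body."""
    ax, ay, az = angles

    def rot(p):
        x, y, z = p
        # rotate about z
        x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
        # rotate about y
        x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
        # rotate about x
        y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
        return x, y, z

    tx, ty, tz = translation
    return [(x + tx, y + ty, z + tz) for (x, y, z) in map(rot, coords)]

# 90-degree rotation about z, then a shift by (1, 0, 0):
moved = rigid_transform([(1.0, 0.0, 0.0)], (0.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0))
```

A rigid-docking search enumerates only these six parameters, which is why its search space is so much smaller than that of flexible docking.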

6.3.2 Constrained docking: flexible ligand and rigid receptor
In this method the ligand is considered flexible whereas the protein is kept rigid: conformational changes occur only in the ligand, which adopts an optimized conformation to form a stable complex with the constrained protein molecule. Since, under the induced-fit paradigm, both the ligand and the receptor change their conformations to form a minimum-energy, perfect-fit complex [15], the ideal type of docking should consider the flexibility of both partners. However, the computational cost is very high when the receptor is also flexible. The common approach, a trade-off between accuracy and computational time, is therefore to treat the ligand as flexible while the receptor is kept rigid during docking. Almost all docking programs, such as AutoDock and FlexX, adopt this methodology.


6.3.3 Flexible docking: flexible ligand and flexible receptor docking
In this methodology both the protein and the ligand are considered flexible, i.e., conformational changes occur in both of them. The intrinsic mobility of proteins has been shown to be closely related to ligand binding behavior [64]. Incorporating receptor flexibility is a significant challenge in the field of docking. Ideally, MD simulations could model all the degrees of freedom in the ligand–receptor complex, but MD suffers from inadequate sampling. Another hurdle is its high computational expense, which prevents the method from being used to screen large chemical databases. Several theoretical models based on conformer selection and conformational induction have therefore been proposed to describe the flexible ligand–protein binding process. According to the definitions given by Teague, conformer selection refers to a process in which a ligand selectively binds to a favorable conformation out of a number of protein conformations, while conformational induction describes a process in which the ligand converts the protein into a conformation that it would not spontaneously adopt in its unbound state. In some cases, this conformational conversion can be likened to a partial refolding of the protein. Various methods are currently available to implement receptor flexibility. The simplest is called "soft docking" [65], which involves decreasing the van der Waals repulsion energy term in the scoring function to allow a degree of atom–atom overlap between the receptor and ligand. For example, the LJ 8-4 potential in GOLD and the smooth potential in AutoDock 3.0 belong to this class. Although this method may not capture much flexibility, it has the advantage of computational efficiency, as the receptor coordinates remain fixed and only the van der Waals parameters are adjusted. Utilizing rotamer libraries is another approach to modeling receptor flexibility [66].
Rotamer libraries contain sets of side-chain conformations, usually derived from statistical analysis of experimental structural data. The advantages of using rotamers are the relative speed of sampling and the avoidance of minimization barriers. ICM (Internal Coordinates Mechanics) [37] is a program using rotamer libraries with the biased probability methodology [67], coupled with a Monte Carlo search of the ligand conformation. AutoDock 4 [68] adopts a simultaneous sampling method to deal with side-chain flexibility: several side chains of the receptor can be selected by the user and sampled simultaneously with the ligand using the same methods. Other portions of the receptor are treated rigidly through a grid energy map during sampling. The grid energy map, introduced by Goodford [69], stores energy information about the receptor and simplifies the interaction energy calculation between ligand and receptor. Another way to deal with protein flexibility is to use an ensemble of protein conformations, which corresponds to the theory of conformer selection [70]. A ligand is separately docked into a set of rigid protein conformations rather than a single one, and the results are merged depending on the method of choice [71]. This method was originally implemented in DOCK, which generates an average potential energy grid of the ensemble, and it has been extended in many programs in different ways. For example, FlexE collects multiple crystal structures of a given protein, merging the similar parts while marking the dissimilar areas as alternatives. During the incremental construction of a ligand, discrete protein conformations are sampled in a combinatorial fashion, and the highest-scoring protein structure is selected based on a comparison between the ligand and each alternative. A hybrid method is another practical strategy to model receptor flexibility; one example is Glide, a very popular docking program. The methods mentioned above include either only side-chain flexibility or full flexibility of the receptor, so another method, Local Move Monte Carlo (LMMC) loop sampling, focuses on sampling conformations of loop-containing active sites. This method starts by changing one torsion angle (the driver torsion), followed by the adjustment of the six subsequent torsions, allowing the rest of the chain to remain in its original position while preserving all bond lengths and bond angles. The pioneering work on local moves was done by Go and Scheraga [72], who developed a solution for the system of equations defining the values of the six torsion angles that preserve the backbone bond lengths and angles. Hoffmann and Knapp first applied the local move method in a MC simulation of polyalanine folding that included a suitable Jacobian [73], required for maintaining detailed balance. They demonstrated that this method samples the conformational space more efficiently than single moves [74].
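The "soft docking" idea introduced earlier in this section — e.g. GOLD's LJ 8-4 potential — is easy to see in a generalized Lennard-Jones n-m form: lowering the exponents flattens the repulsive wall, so slight receptor–ligand overlap is penalized far less. The `eps`/`rmin` values below are illustrative, not any program's parameters:

```python
def lj(r, n=12, m=6, eps=0.2, rmin=3.5):
    """Generalized Lennard-Jones n-m potential with well depth eps at rmin.
    n=12, m=6 is the standard 'hard' form; n=8, m=4 gives the softer wall
    used in soft docking, which tolerates a little atom-atom overlap."""
    x = rmin / r
    return eps / (n - m) * (m * x ** n - n * x ** m)

# the same clashing distance, scored by the hard and the soft form:
hard = lj(2.8)               # standard 12-6
soft = lj(2.8, n=8, m=4)     # soft 8-4
# both are repulsive (positive), but the soft wall penalizes the
# overlap far less, mimicking a small induced-fit adjustment
```

Both forms share the same minimum, −eps at r = rmin, so only the steepness of the repulsive wall differs — which is exactly the property soft docking exploits.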

6.4 Standard methodology for molecular docking
Molecular docking falls under the structure-based drug design approach. A new chemical entity can be designed using molecular docking analysis by predicting the favorable binding mode of a ligand in the active site of the target of interest. Upon analyzing the predicted binding mode, one can either identify a new lead molecule (lead identification) or optimize an already existing lead molecule (lead optimization). Irrespective of the software selected, there are some general steps for a docking analysis, as described in Fig. 6.3.

Step I: Selection of an appropriate 3D structure of the target protein
The accuracy of molecular docking results depends heavily upon the quality of the 3D structure of the target protein. Protein structures are usually determined by two experimental methods, X-ray crystallography and NMR spectroscopy, and can also be obtained by a theoretical technique, homology modeling. Ideally, the 3D structure used for docking should be one determined experimentally, with preference given to X-ray crystallography. If an experimentally solved structure is not available, homology modeling can be used to generate the structure; however, the reliability of such docking results depends on the quality of the developed homology model.

Figure 6.3 General outline for docking analysis.

There are some practical considerations for selecting an appropriate structure of the target protein for the docking analysis. These are as follows:

• High-resolution structures, ideally better than 2.5 Å, should be selected.
• Avoid structures that are incomplete or have missing side chains/loops.
• Consider the B-value while selecting a structure. It is a quantitative measure of the accuracy of an X-ray crystallographic structure; a high value indicates poor accuracy.
• Ideally, a structure of the target protein co-crystallized with the ligand of interest is preferred as the starting structure for the analysis.
• If the target protein requires a biologically important co-factor, then the crystallographic structure with the co-factor attached should be used.
• If more than one target structure is available, key residues can be overlapped by least-squares superpositioning to identify structural variability, followed by reconstruction of missing side chains; a representative structure can then be selected for analysis.
• 3D structures of target proteins can be collected from open-source databases including the Protein Data Bank (PDB), Relibase, Binding MOAD, etc.
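The resolution criterion above can be automated by reading the "REMARK   2" record of a downloaded PDB file. A small sketch (NMR entries, which carry "NOT APPLICABLE" in that record, are rejected):

```python
def resolution_ok(pdb_text, cutoff=2.5):
    """Check the 'REMARK   2 RESOLUTION.' record of a PDB file against a
    resolution cutoff (Angstrom). Returns False for entries without a
    numeric resolution, e.g. NMR structures ('NOT APPLICABLE')."""
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2 RESOLUTION."):
            field = line.split("RESOLUTION.")[1].replace("ANGSTROMS", "").strip(" .")
            try:
                return float(field) <= cutoff
            except ValueError:
                return False          # e.g. "NOT APPLICABLE"
    return False                      # record absent: reject by default

xray_good = "REMARK   2 RESOLUTION.    2.00 ANGSTROMS."
xray_low  = "REMARK   2 RESOLUTION.    3.10 ANGSTROMS."
nmr_entry = "REMARK   2 RESOLUTION. NOT APPLICABLE."
```

Such a filter is useful when screening many candidate PDB entries for the same target before the manual checks (missing loops, B-values, co-factors) listed above.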

Step II: Preparation of the target protein
Structures downloaded from open-source databases are not ready to use. Before submitting them for docking jobs in any software, they must be prepared. Preparation of protein structures includes:

• Removal of exterior water molecules, except those involved in binding with the ligand or catalytic waters.
• Addition of hydrogen atoms to the target at the desired pH; the side chains of some amino acid residues, such as Arg, Lys, Asp and Glu, are ionized at physiological pH.
• Calculation of partial charges of amino acid side chains, co-factors etc., wherever required. Not all, but some, docking tools require this type of consideration.
• Correction of bond orders and atom types.
• Optimization of hydroxyl orientations and side-chain flip states for residues such as Asn, Gln and His, wherever applicable.
• Restrained minimization of the protein structure to reorient side-chain hydroxyl groups and alleviate potential steric clashes.

Step III: Selection of ligands
Step IV: Ligand preparation
All docking tools require 3D structures of each ligand, including explicit hydrogens. Different tools offer different options for generating the 3D structure of a ligand and placing it in the active site. Molecules used for docking are either real molecules, molecules that are going to be synthesized, or molecules obtained from databases (ZINC, Asinex, Cambridge, Maybridge, InterBioScreen etc.). Depending upon the source, the steps required to process the molecules may vary. While selecting a molecule for docking, it is important that the protonation, tautomeric and stereoisomeric forms of the ligand are well defined, so that errors due to these parameters can be avoided.

Step V: Specification of the active site
On the basis of the available information about the active site, the search space can be defined for the docking analysis. There are two possibilities regarding the availability of active-site information for the target:

• The binding site is not defined; blind docking is then performed by exploring the entire surface of the target.
• The binding site is previously defined — a complex of the target co-crystallized with a corresponding ligand is available, the active-site amino acid residues are reported, or mutagenesis data are available — and the search space can then be reduced to concentrate on the region of interest. It is important to note that the bounding box should extend a reasonable distance beyond the active site in all directions.
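The first clean-up task of Step II — removing waters while sparing required residues — can be sketched directly over raw PDB records; `keep_het` is a hypothetical argument listing residue names that must survive (e.g. a required co-factor, or "HOH" itself to keep waters):

```python
def strip_waters(pdb_lines, keep_het=()):
    """Remove water molecules (HOH residues) from PDB ATOM/HETATM records
    while keeping protein atoms and any residue names listed in keep_het.
    Uses the fixed-column PDB format: residue name sits in columns 18-20."""
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20].strip()   # residue-name columns (0-indexed 17:20)
            if resname == "HOH" and resname not in keep_het:
                continue                    # drop this water atom
        kept.append(line)
    return kept

lines = [
    "ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00 20.00           N",
    "HETATM  900  O   HOH A 301      10.000  10.000  10.000  1.00 30.00           O",
]
```

A real preparation pipeline would also add hydrogens, assign charges and fix bond orders, as listed above — those steps need chemistry-aware tools, not plain text editing.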

Step VI: Setting of parameters as per the project under consideration
The setting of input parameters for running a docking program depends upon the software used and the research problem being addressed. A number of options are available, for example for flexible active sites, including:

• Scoring methods
• Search methods
• Solvation methods for the handling of encapsulated active sites, etc.

It is highly advisable to understand all the options available in the software package. It is also recommended that input options be validated at the beginning of the project.

Step VII: Interpretation of docking results
Docking is basically used to understand the 'molecular recognition' process in a variety of contexts, including enzyme–substrate complex formation, ligand–receptor complex formation, drug–receptor interactions, protein–protein interactions etc. To predict the binding mode, drug designers explore docking results in the following ways:

• Binding energy

The binding energy of the ligand in the active site is a very important quantitative result of the docking analysis and can be used to:
1. Compare the relative biological activity of the ligand against available standard drugs.
2. Calculate the enthalpic–entropic contributions of the ligand binding to its target.
3. Understand the effect of solvation of ligands on binding.
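Docking programs in the AutoDock family report both a binding free energy and an inhibition constant; the two are linked by ΔG = RT ln Ki, so one can always be checked against the other. A sketch at an assumed temperature of 298.15 K:

```python
import math

R = 1.9872e-3     # gas constant, kcal/(mol*K)
T = 298.15        # assumed room temperature, K

def inhibition_constant(delta_g):
    """Convert a predicted binding free energy (kcal/mol, negative for
    favorable binding) to an inhibition constant via dG = RT * ln(Ki)."""
    return math.exp(delta_g / (R * T))

def delta_g(ki):
    """Inverse conversion: Ki (molar) back to a free energy in kcal/mol."""
    return R * T * math.log(ki)

ki = inhibition_constant(-9.99)   # an example docking score of -9.99 kcal/mol
# gives a Ki of roughly 4.8e-8 M, i.e. in the tens of nanomolar
```

Note that each additional −1.36 kcal/mol of binding energy (RT·ln 10 at this temperature) tightens Ki by one order of magnitude, which is a quick way to compare ligands against a standard drug.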

• Binding pose

The binding pose of the ligand fitted to the active site should be visualized and properly examined. Analysis of the binding pose in the active site can be used to identify:
1. Functional groups that hang out of the active site and may be extraneous.
2. Spots in the active site that the ligand does not fill, which can therefore be explored by extending the ligand to give a stronger fit.
3. Important residues that can be explored for selectivity, specificity and differential binding in the selected target.

• RMSD

If the crystal structure of the target protein in complex with a reference ligand is known, then the RMSD between the docked pose and the reference crystallographic binding pose can be calculated to assess the docking efficiency of the tool. RMSD values should be <2 Å for good efficiency.
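The RMSD check can be computed directly from the two coordinate sets, assuming the docked and reference poses share the same atom ordering and sit in the same (receptor) frame:

```python
import math

def rmsd(pose_a, pose_b):
    """Heavy-atom RMSD between two poses of the same ligand, given as
    equally ordered lists of (x, y, z) coordinates. No re-alignment is
    done: both poses are assumed to already sit in the receptor's frame."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same atoms in the same order")
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

# a two-atom toy ligand: the docked pose is shifted 0.5 A along x
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
docked  = [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0)]
# rmsd(crystal, docked) is 0.5, well under the 2 A acceptance threshold
```

In practice symmetry-equivalent atoms (e.g. the two oxygens of a carboxylate) should be matched before computing RMSD, otherwise a correct pose can score an artificially high value.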

6.5 Software available for molecular docking
DOCK was the first automated procedure for docking a molecule into a receptor site and is still being actively developed. It characterizes the ligand and receptor as sets of spheres which can be overlaid by means of a clique-detection procedure [75]. Geometric and chemical matching algorithms are used, and the ligand–receptor complexes can be scored by accounting for steric fit, chemical complementarity and/or pharmacophore similarity. In its later versions, an incremental construction method and an exhaustive search were added to account for ligand flexibility. The exhaustive search randomly generates a user-defined number of conformers as a multiple of the number of rotatable bonds in the ligand. With respect to scoring, the latest version, DOCK 6.4, includes both AMBER-derived force-field scoring with implicit solvent [76] and GB/SA and PB/SA solvation scoring [77].

FLOG generates ligand conformations on the basis of distance geometry and uses a clique-finding algorithm to calculate the sets of distances. Up to 25 explicit conformations of the ligand can be docked, providing some flexibility. FLOG allows users to define essential points which must be paired with a ligand atom, an approach that is useful if an important interaction is already known before docking. Conformations are scored with a function considering van der Waals, electrostatic, hydrogen-bonding and hydrophobic interactions.

AutoDock incorporates Monte Carlo simulated annealing, evolutionary, genetic and Lamarckian genetic algorithm methods to model ligand flexibility while keeping the receptor rigid. The scoring function is based on the AMBER force field, including van der Waals, hydrogen-bonding, electrostatic interaction, conformational entropy and desolvation terms. Each term is weighted using an empirical scaling factor obtained from experimental data. AutoDock 4.0 is able to model receptor flexibility by allowing side chains to move; additionally, protein–protein docking interactions can be evaluated in this version. AutoDock Vina was released as the latest version for molecular docking and virtual screening [78]. By redocking the 190 receptor–ligand complexes that had been used as a training set for AutoDock 4, AutoDock Vina simultaneously showed an approximately two-orders-of-magnitude improvement in speed and significantly better accuracy of binding mode prediction.

FlexX uses an incremental construction algorithm to sample ligand conformations. The base fragment is first docked into the active site by matching hydrogen-bond pairs, metal and aromatic-ring interactions between the ligand and protein. The remaining components are then built up incrementally, in accordance with a set of predefined rotatable torsion angles, to account for ligand flexibility. The FlexX scoring function is based on Böhm's work [79]. Its current version includes terms for electrostatic interactions, directional hydrogen bonds, rotational entropy, and aromatic and lipophilic interactions. Interactions between functional groups are also taken into account by assigning a type and geometry to each group.

Glide uses a series of hierarchical filters to search the possible poses and orientations of the ligand within the binding site of the receptor. Ligand flexibility is handled by an exhaustive search of the ligand torsion-angle space. Initial ligand conformations are selected based on torsion energies and docked into receptor binding sites with soft potentials; a rotamer exploration is then used to further model receptor flexibility.

GOLD (Genetic Optimization for Ligand Docking) uses a genetic algorithm for docking flexible ligands into protein binding sites.
GOLD has been extensively tested and has shown excellent performance for pose prediction and good results in virtual screening. GOLD is supplied as part of CSD-Discovery, which also includes Hermes. Hermes provides the graphical user interface for GOLD; it is designed to assist with the preparation of input information for docking with GOLD, visualization of docking results and calculation of descriptors. GOLD can be run in batch mode with one or more prepared protein input structures, prepared ligand input structures, and a GOLD .conf file including all the settings for the docking, e.g. the definition of the binding site and any constraints. GOLD will only produce reliable results if the ligands have correct protonation states set. Hermes will automatically derive SYBYL atom types from the input provided. Protein and ligand structures can be prepared using standard molecular modeling packages, as long as the SYBYL atom types are set correctly [80].

IFREDA [71] utilizes a hybrid method that combines soft potentials and multiple receptor conformations to account for receptor flexibility. Other programs, like QXP, perform a Monte Carlo search of ligand conformations followed by a minimization step. During minimization, user-defined parts of the protein are allowed to move in order to avoid atom clashes between the ligand and receptor. SLIDE is designed to incorporate flexibility with the ability to remove clashes by directed, single-bond rotation of either the ligand or the side chains of the protein. An optimization approach based on mean-field theory is applied to model induced-fit complementarity between the ligand and the protein (Table 6.1).

Table 6.1: List of some common molecular docking tools used for SBDD.

Name of software   Year of release   Developer                                            License
DOCK               1988              University of California-San Francisco               Freeware
AutoDock           1990              The Scripps Research Institute                       Freeware
GOLD               1995              University of Sheffield, GlaxoSmithKline, and CCDC   Commercial
FlexX              2001              BioSolveIT                                           Commercial
Surflex-Dock       2003              Tripos                                               Commercial
Glide              2004              Schrödinger                                          Commercial
MOE                2005              Chemical Computing Group                             Commercial
MolDock            2006              Molegro ApS                                          Freeware
AutoDock Vina      2010              The Scripps Research Institute                       Freeware
SwissDock          2011              Swiss Institute of Bioinformatics                    Freeware

6.6 Conclusion
In recent years, the molecular docking based screening approach, in which small molecules are docked into a known protein structure, has become the basic and primary strategy for validating the design of small-molecule heterocycles for a specific target. Nowadays, research scholars in the field of medicinal chemistry are expected to have basic knowledge of molecular docking along with practical, user-level experience. Freeware tools like AutoDock are considered accurate enough to validate designs and to report in drug-design studies; AutoDock is an excellent non-commercial docking program that is widely used. The use of complementary experimental and informatic techniques increases the chance of success in many stages of the drug discovery process. Thus, in the experimental exercise section of this chapter, we provide a molecular docking exercise. It covers the basic steps employed in performing a molecular docking analysis with AutoDock and discusses how to interpret the obtained results to derive useful information and to determine the accuracy of your own protocol.

Exercise 1: To perform molecular docking studies of a given set of molecules into the PARP-1 protein.

Note: In this exercise we are going to perform a molecular docking study of a set of small-molecule heterocycles against the PARP-1 target. PARP-1 is a well-known target for breast cancer and other pathological conditions. There are several FDA-approved drugs for the inhibition of PARP-1, including olaparib. To validate the study, redocking of olaparib will also be performed. The set of ligands chosen for this exercise was randomly selected from the literature and belongs to the triazole scaffold.

Requirements:
1. Operating system: Windows (7, 8 and/or 10).
2. Freeware for non-commercial use: MGL Tools (can be downloaded at http://), a visualizer (PyMOL, or Discovery Studio Visualizer, available at visualization-download.php), and ACD/ChemSketch (available at resources/freeware/chemsketch/). [Note: please check the version of your operating system and download accordingly.]
3. Binary files: AutoDock and AutoGrid .exe files, available freely at http://autodock. Download, extract and copy autodock4.exe and autogrid4.exe into a folder (PARP_docking) in the Downloads folder of the computer.

Step by step protocol:
1. Retrieving protein .pdb files from major databases.

The first and foremost requirement in the process of molecular docking is the availability of a 3D structure of the target protein. Usually these structures can be downloaded from easily accessible protein databases; if a protein structure is not available, homology modeling is employed (discussed in a separate chapter). The most commonly used database for retrieving protein structures is the RCSB PDB; visit the home page at home/. Type the query protein or enzyme (in this case, our query was "PARP1"). Select the enzyme; download files; click the PDB file and download it.

For the selection of the protein, the first and foremost criterion is the process by which the crystallized structure was solved, i.e., 3D structures solved via X-ray crystallography should be preferred over NMR. The second criterion is the resolution of the structure, and finally, if multiple high-resolution protein structures are available, a cross-docking [81] protocol can be employed. For the current study, we utilized PDB ID: 2rd6, with a co-crystallized benzimidazole derivative, as the 3D structure of PARP-1.

2. Sketching and preparing the set of ligands.

The structures of all the ligands used in the exercise are listed in Table 6.2. Several free tools are available to sketch and prepare ligands for molecular docking; in the current exercise the 3D structures of the ligands were built using ChemSketch. As AutoDock accepts files only in a limited set of formats, the ligands must be converted into .pdb format. If the sketching tool does not provide such an option, OpenBabel can be used to convert the file format.

Table 6.2: List of small molecule heterocycles selected from literature for the exercise. (The table lists the compound ID, structure and molecular weight of each ligand; the structures are not reproduced here.)

Note: move the .pdb files for both the protein and the ligands into the folder 'PARP_docking' in the Downloads folder.

3. Preparing PDBQT files for the protein and ligands (protein.pdbqt, ligand.pdbqt) and the grid and docking parameter files (a.gpf and a.dpf) using AutoDock 4.2

Open AutoDock from the desktop (the shortcut is created after successful installation of MGL Tools).
Open File; Read Molecule; select and open 2rd6.pdb (the target molecule will appear on screen).
In the dashboard, click on 2rd6, click on chain A, scroll down and delete the co-crystallized ligand represented as 78P900.
Click on Edit; click on Delete Water.
Again Edit; click on Hydrogens; click on Add; click Polar Only; click OK.
Again Edit; click on Atoms; click on Assign AD4 Type.
Again Edit; click Charges; Add Kollman Charges; click OK.
Click on File; click on Save; click on Write as PDBQT and save as protein.pdbqt into the PARP_docking folder.
Select 2rd6 on the dashboard; right click and delete.
Open Ligand; click Input; click Open. Select the ligand; click Open; click OK.
Again open Ligand; click Torsion Tree; click Detect Root.
Again open Ligand; click Torsion Tree; click Set Number of Torsions; set the number of active torsions (from 1 to 6); click Dismiss.
Again open Ligand; click Aromatic Carbons; click Aromaticity Criterion; click OK (if 'Enter angle in Degrees: 7.5').
Again open Ligand; click Output; click Save as PDBQT; save the ligand file in the PARP_docking folder. Repeat this exercise for all the ligands.
For the preparation of the Grid Parameter File (a.gpf): open Grid; click Macromolecule; click Open; go to the PARP_docking folder and select protein.pdbqt.
Again open Grid; click Set Map Types; as we have already deleted the co-crystallized ligand in this case, click directly; check the map types; click Accept.
Again open Grid; click Grid Box (in this case X, Y, Z dimensions of 50 × 50 × 50 were used). The X, Y, Z center (Center Grid Box) was then used to orient the grid box exactly over the cavity. Click File; click Close saving current.
Again open Grid; click Output; click Save GPF; name the file a.gpf and save it in the same folder (PARP_docking).
For the preparation of the Docking Parameter File (a.dpf): open Docking; click Macromolecules; click Set Rigid Filename; go to the PARP_docking folder and select protein.pdbqt.

Molecular docking analysis: Basic technique to predict drug-receptor interactions 149

Again Docking; click Ligand; click Open; go to the PARP_docking folder and select ligand 1.
Again Docking; click Search Parameters; click Genetic Algorithm; click Accept (the defaults are used here, but the number of GA runs can be changed).
Again Docking; click Docking Parameters; click Accept (using the defaults).
Again Docking; click Output; click Lamarckian GA (4.2); name the file a.dpf; save the a.dpf file (.dpf format) in the PARP_docking folder.

4. Performing molecular docking using the command prompt

Open the command prompt (by typing cmd in the Start menu) and enter the following commands, pressing Enter after each one:

cd Downloads
cd PARP_docking
autogrid4.exe -p C:/User/Downloads/PARP_docking/a.gpf -l C:/User/Downloads/PARP_docking/a.glg &

Note: please check the address of the folder where the .gpf file is stored.

autodock4.exe -p C:/User/Downloads/PARP_docking/a.dpf -l C:/User/Downloads/PARP_docking/a.dlg &

Note: run the autodock4.exe command only after the autogrid4.exe command has completed. The status of the submitted job can be checked using the system's task manager.

5. Analyzing results

Open AutoDock; click Analyze; click Docking; click Open; select a.dlg; click Open; click OK.
Again Analyze; click Conformations; click Play ranked with energy; click &; click Show Information.
Click through to observe each conformation from 1 to 10 (in this case, ligand 4 was found to be the best, with a binding energy of −9.99 kcal/mol and an inhibition constant of 47.49 nM) (Fig. 6.4).

Exercise 2: To compare the docking efficiency of different software considering protein structures of COX-1 and COX-2 co-crystallized with different ligands using cross-docking experiments.

Note: This exercise is taken from a research study published by Consalvi et al. [82]. For details, please refer to the paper available online at bmc.2014.12.041.
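Step 3 centers the grid box over the binding cavity. When the co-crystallized ligand is still present in the downloaded PDB file, one quick way to estimate that center is to average the ligand's HETATM coordinates. A minimal Python sketch; the residue name "LIG" and the demo lines are hypothetical placeholders, not values taken from the 2rd6 structure:

```python
# Estimate an AutoGrid box center by averaging the coordinates of a
# co-crystallized ligand's HETATM records (standard PDB fixed columns:
# residue name in columns 18-20, x in 31-38, y in 39-46, z in 47-54).

def grid_center(pdb_lines, resname):
    """Return the (x, y, z) centroid of HETATM records matching resname."""
    xs, ys, zs = [], [], []
    for line in pdb_lines:
        if line.startswith("HETATM") and line[17:20].strip() == resname:
            xs.append(float(line[30:38]))
            ys.append(float(line[38:46]))
            zs.append(float(line[46:54]))
    n = len(xs)
    if n == 0:
        raise ValueError("no HETATM records for residue %r" % resname)
    return (sum(xs) / n, sum(ys) / n, sum(zs) / n)

def fake_hetatm(resname, x, y, z):
    """Build a correctly spaced HETATM line for the demo below."""
    return "HETATM    1  C1  %3s A 401    %8.3f%8.3f%8.3f" % (resname, x, y, z)

# Demo with two hypothetical ligand atoms; in practice the lines would
# be read from the real .pdb file, e.g. open("2rd6.pdb").readlines().
demo = [fake_hetatm("LIG", 10.0, 20.0, 30.0),
        fake_hetatm("LIG", 12.0, 22.0, 32.0)]
center = grid_center(demo, "LIG")   # (11.0, 21.0, 31.0)
```

With a real structure, the three values in `center` are what would be typed into the Center Grid Box fields of the Grid Box dialog.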


Figure 6.4 Cartoon representation of ligands in an overlapped manner, docked into PARP1.

Step-by-step protocol

Open the Protein Data Bank site on your PC/laptop.
Download the crystal structures of the proteins co-crystallized with their ligands in .pdb format.
Divide each energy-minimized COX complex into ligand and protein.
Randomize the conformation of the separated ligand and dock it into all the proteins except the native one.
Superimpose the cross-docked pose of the ligand in each protein over the original crystal pose of the corresponding ligand.
Calculate the RMSD between all the superimposed poses.
Further, calculate the average RMSD for each protein-ligand cross-docked complex.

Results: Provided in the tables below (Tables 6.3 and 6.4).

Unsolved problems

Exercise 1: To download the structures of PPARγ co-crystallized with various clinically used drugs having resolution better than 2.5 Å, and analyze the various types of intermolecular interactions between ligand and receptor.

Exercise 2: To download the structures of the COX-1 enzyme co-crystallized with clinically used acidic NSAIDs and identify the important amino acid residues in their respective active sites. Also compare their interactions.
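The RMSD step of the protocol above reduces to a simple formula once the poses are superimposed: RMSD = sqrt((1/N) Σ_i |r_i(docked) − r_i(crystal)|²). A minimal sketch, assuming the two poses are already aligned and the atoms are matched one-to-one (real ligands would need an atom-mapping step first):

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two matched lists of (x, y, z) tuples; assumes the
    poses are already superimposed and the atom ordering is identical."""
    if len(coords_a) != len(coords_b):
        raise ValueError("atom counts differ")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Toy two-atom "ligand": the docked pose shifts each atom by 5 angstroms.
crystal = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
docked = [(3.0, 4.0, 0.0), (4.0, 5.0, 1.0)]
print(rmsd(crystal, crystal))  # 0.0
print(rmsd(crystal, docked))   # 5.0
```

A cross-docked pose is conventionally counted as accurate (it contributes to DA%) when this value is below a cutoff, commonly 2.0 Å.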

Table 6.3: Docking efficiency for COX-1. For each COX-1 crystal structure and each docking program/scoring function compared (Paradocks, among others), the table lists the RMSD (Å) of the cross-docked pose, together with the per-program docking accuracy (DA%), average RMSD and standard deviation (SD).

Table 6.4: Docking efficiency for COX-2. For each COX-2 crystal structure (PDB IDs 1PXX, 3LN0, 3LN1, 3MQE, 3NT1, 3NTB, 3NTG, 3Q7D, 3QMO, 3RR3, 4E1G, 4FM5, 4M10 and 4M11) and each docking program/scoring function compared (Paradocks, among others), the table lists the RMSD (Å) of the cross-docked pose, together with the per-program docking accuracy (DA%ᵃ), average RMSD and standard deviation (SDᵇ).

a Docking accuracy. b Standard deviation value. c Names of scoring functions available with the respective docking programs.

Exercise 3: To calculate the RMSD in a re-docking exercise for acetylcholinesterase (AChE) co-crystallized with donepezil (PDB ID: 4EY7, 6O4W, 5NAP, 5NAU) and rivastigmine (PDB ID: 1GQR, 6EUE, 2BAG).

References

[1] A.C. Anderson, The process of structure-based drug design, Chem. Biol. 10 (2003) 787–797.
[2] G. Schneider, Virtual screening: an endless staircase? Nat. Rev. Drug Discov. 9 (2010) 273.
[3] J. de Ruyck, G. Brysbaert, R. Blossey, M.F. Lensink, Molecular docking as a popular tool in drug design, an in silico travel, Adv. Appl. Bioinforma. Chem.: AABC 9 (2016) 1.
[4] Y.-C. Chen, Beware of docking!, Trends Pharmacol. Sci. 36 (2015) 78–95.
[5] T. Katsila, G.A. Spyroulias, G.P. Patrinos, M.-T. Matsoukas, Computational approaches in target identification and drug discovery, Comput. Struct. Biotechnol. J. 14 (2016) 177–184.
[6] W.P. Walters, M.T. Stahl, M.A. Murcko, Virtual screening – an overview, Drug Discov. Today 3 (1998) 160–178.
[7] G. Schneider, H.-J. Böhm, Virtual screening and fast automated docking methods, Drug Discov. Today 7 (2002) 64–70.
[8] B. Waszkowycz, T.D.J. Perkins, R.A. Sykes, J. Li, Large-scale virtual screening for discovering leads in the postgenomic era, IBM Syst. J. 40 (2001) 360–376.
[9] M.J. Yunta, Docking and ligand binding affinity: uses and pitfalls, Am. J. Model. Optim. 4 (2016) 74–114.
[10] B.J. McConkey, V. Sobolev, M. Edelman, The performance of current methods in ligand–protein docking, Curr. Sci. (2002) 845–856.
[11] M. Martinez-Archundia, B. Colin-Astudillo, L. Gómez-Hernández, E. Abarca-Rojano, J. Correa-Basurto, Docking analysis provide structural insights to design novel ligands that target PKM2 and HDC8 with potential use for cancer therapy, Mol. Simul. (2019) 1–9.
[12] M. Buonanno, E. Langella, N. Zambrano, M. Succoio, E. Sasso, V. Alterio, et al., Disclosing the interaction of carbonic anhydrase IX with cullin-associated NEDD8-dissociated protein 1 by molecular modeling and integrated binding measurements, ACS Chem. Biol. 12 (2017) 1460–1465.
[13] I.D. Kuntz, J.M. Blaney, S.J. Oatley, R. Langridge, T.E. Ferrin, A geometric approach to macromolecule–ligand interactions, J. Mol. Biol. 161 (1982) 269–288.
[14] D. Koshland, Correlation of structure and function in enzyme action, Science 142 (1963) 1533–1541.
[15] G.G. Hammes, Multiple conformational changes in enzyme catalysis, Biochemistry 41 (2002) 8221–8228.
[16] N. Moitessier, P. Englebienne, D. Lee, J. Lawandi, C.R. Corbeil, Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go, Br. J. Pharmacol. 153 (2008) S7–S26.
[17] E. Perola, W.P. Walters, P.S. Charifson, A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance, Proteins: Struct., Funct., Bioinf. 56 (2004) 235–249.
[18] W. Sherman, T. Day, M.P. Jacobson, R.A. Friesner, R. Farid, Novel procedure for modeling ligand/receptor induced fit effects, J. Med. Chem. 49 (2006) 534–553.
[19] T. Sander, T. Liljefors, T. Balle, Prediction of the receptor conformation for iGluR2 agonist binding: QM/MM docking to an extensive conformational ensemble generated using normal mode analysis, J. Mol. Graph. Model. 26 (2008) 1259–1268.
[20] A.T. Brint, P. Willett, Algorithms for the identification of three-dimensional maximal common substructures, J. Chem. Inf. Comput. Sci. 27 (1987) 152–158.
[21] R. Norel, D. Fischer, H.J. Wolfson, R. Nussinov, Molecular surface recognition by a computer vision-based technique, Protein Eng. Des. Sel. 7 (1994) 39–46.
[22] R. Dias, J. de Azevedo, F. Walter, Molecular docking algorithms, Curr. Drug Targets 9 (2008) 1040–1047.

[23] M.D. Miller, S.K. Kearsley, D.J. Underwood, R.P. Sheridan, FLOG: a system to select 'quasi-flexible' ligands complementary to a receptor of known three-dimensional structure, J. Comput. Mol. Des. 8 (1994) 153–174.
[24] D.J. Diller, K.M. Merz Jr, High throughput docking for library design and library prioritization, Proteins: Struct., Funct., Bioinf. 43 (2001) 113–124.
[25] P. Burkhard, P. Taylor, M. Walkinshaw, An example of a protein ligand found by database mining: description of the docking method and its verification by a 2.3 Å X-ray structure of a thrombin–ligand complex, J. Mol. Biol. 277 (1998) 449–466.
[26] M. Rarey, B. Kramer, T. Lengauer, G. Klebe, A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol. 261 (1996) 470–489.
[27] T.J. Ewing, S. Makino, A.G. Skillman, I.D. Kuntz, DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases, J. Comput. Mol. Des. 15 (2001) 411–428.
[28] W. Welch, J. Ruppert, A.N. Jain, Hammerhead: fast, fully automated docking of flexible ligands to protein binding sites, Chem. Biol. 3 (1996) 449–462.
[29] V. Schnecke, L.A. Kuhn, Virtual screening with solvation and ligand-induced complementarity, in: Virtual Screening: An Alternative or Complement to High Throughput Screening?, Springer, 2000, pp. 171–190.
[30] Z. Zsoldos, D. Reid, A. Simon, B.S. Sadjad, A. Peter Johnson, eHiTS: an innovative approach to the docking and scoring function problems, Curr. Protein Peptide Sci. 7 (2006) 421–435.
[31] A. Miranker, M. Karplus, Functionality maps of binding sites: a multiple copy simultaneous search method, Proteins: Struct., Funct., Bioinf. 11 (1991) 29–34.
[32] A. Caflisch, A. Miranker, M. Karplus, Multiple copy simultaneous search and construction of ligands in binding sites: application to inhibitors of HIV-1 aspartic proteinase, J. Med. Chem. 36 (1993) 2142–2167.
[33] H.-J. Böhm, LUDI: rule-based automatic design of new substituents for enzyme inhibitor leads, J. Comput. Mol. Des. 6 (1992) 593–606.
[34] H.-J. Böhm, The computer program LUDI: a new method for the de novo design of enzyme inhibitors, J. Comput. Mol. Des. 6 (1992) 61–78.
[35] T.N. Hart, R.J. Read, A multiple-start Monte Carlo docking method, Proteins: Struct., Funct., Bioinf. 13 (1992) 206–222.
[36] D.S. Goodsell, A.J. Olson, Automated docking of substrates to proteins by simulated annealing, Proteins: Struct., Funct., Bioinf. 8 (1990) 195–202.
[37] R. Abagyan, M. Totrov, D. Kuznetsov, ICM – a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation, J. Comput. Chem. 15 (1994) 488–506.
[38] C. McMartin, R.S. Bohacek, QXP: powerful, rapid computer algorithms for structure-based drug design, J. Comput. Mol. Des. 11 (1997) 333–344.
[39] C.M. Oshiro, I.D. Kuntz, J.S. Dixon, Flexible ligand docking using a genetic algorithm, J. Comput. Mol. Des. 9 (1995) 113–130.
[40] G.M. Morris, D.S. Goodsell, R.S. Halliday, R. Huey, W.E. Hart, R.K. Belew, et al., Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function, J. Comput. Chem. 19 (1998) 1639–1662.
[41] M.L. Verdonk, J.C. Cole, M.J. Hartshorn, C.W. Murray, R.D. Taylor, Improved protein–ligand docking using GOLD, Proteins: Struct., Funct., Bioinf. 52 (2003) 609–623.
[42] K.P. Clark, Flexible ligand docking without parameter adjustment across four ligand–receptor complexes, J. Comput. Chem. 16 (1995) 1210–1226.
[43] J.S. Taylor, R.M. Burnett, DARWIN: a program for docking flexible molecules, Proteins: Struct., Funct., Bioinf. 41 (2000) 173–191.
[44] R. Wang, Y. Lu, S. Wang, Comparative evaluation of 11 scoring functions for molecular docking, J. Med. Chem. 46 (2003) 2287–2303.
[45] D.B. Kitchen, H. Decornez, J.R. Furr, J. Bajorath, Docking and scoring in virtual screening for drug discovery: methods and applications, Nat. Rev. Drug Discov. 3 (2004) 935.
[46] J. Åqvist, V.B. Luzhkov, B.O. Brandsdal, Ligand binding affinities from MD simulations, Acc. Chem. Res. 35 (2002) 358–365.

[47] H.-J. Böhm, Prediction of binding constants of protein ligands: a fast method for the prioritization of hits obtained from de novo design or 3D database search programs, J. Comput. Mol. Des. 12 (1998) 309.
[48] A.N. Jain, Scoring noncovalent protein–ligand interactions: a continuous differentiable function tuned to compute binding affinities, J. Comput. Mol. Des. 10 (1996) 427–440.
[49] R.D. Head, M.L. Smythe, T.I. Oprea, C.L. Waller, S.M. Green, G.R. Marshall, VALIDATE: a new method for the receptor-based prediction of binding affinities of novel ligands, J. Am. Chem. Soc. 118 (1996) 3959–3969.
[50] G.M. Verkhivker, D. Bouzida, D.K. Gehlhaar, P.A. Rejto, S. Arthurs, A.B. Colson, et al., Deciphering common failures in molecular docking of ligand–protein complexes, J. Comput. Mol. Des. 14 (2000) 731–751.
[51] M.D. Eldridge, C.W. Murray, T.R. Auton, G.V. Paolini, R.P. Mee, Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes, J. Comput. Mol. Des. 11 (1997) 425–445.
[52] A.V. Ishchenko, E.I. Shakhnovich, Small molecule growth 2001 (SMoG2001): an improved knowledge-based scoring function for protein–ligand interactions, J. Med. Chem. 45 (2002) 2770–2780.
[53] M. Feher, E. Deretey, S. Roy, BHB: a simple knowledge-based scoring function to improve the efficiency of database screening, J. Chem. Inf. Comput. Sci. 43 (2003) 1316–1327.
[54] I. Muegge, Y.C. Martin, A general and fast scoring function for protein–ligand interactions: a simplified potential approach, J. Med. Chem. 42 (1999) 791–804.
[55] H. Gohlke, M. Hendlich, G. Klebe, Knowledge-based scoring function to predict protein–ligand interactions, J. Mol. Biol. 295 (2000) 337–356.
[56] R.S. DeWitte, E.I. Shakhnovich, SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence, J. Am. Chem. Soc. 118 (1996) 11733–11744.
[57] J.B. Mitchell, R.A. Laskowski, A. Alex, J.M. Thornton, BLEEP – potential of mean force describing protein–ligand interactions: I. Generating potential, J. Comput. Chem. 20 (1999) 1165–1176.
[58] P.S. Charifson, J.J. Corkery, M.A. Murcko, W.P. Walters, Consensus scoring: a method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins, J. Med. Chem. 42 (1999) 5100–5109.
[59] M. Feher, Consensus scoring for protein–ligand interactions, Drug Discov. Today 11 (2006) 421–428.
[60] R.D. Clark, A. Strizhev, J.M. Leonard, J.F. Blake, J.B. Matthew, Consensus scoring for ligand/protein interactions, J. Mol. Graph. Model. 20 (2002) 281–295.
[61] P.A. Kollman, I. Massova, C. Reyes, B. Kuhn, S. Huo, L. Chong, et al., Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models, Acc. Chem. Res. 33 (2000) 889–897.
[62] W.C. Still, A. Tempczyk, R.C. Hawley, T. Hendrickson, Semianalytical treatment of solvation for molecular mechanics and dynamics, J. Am. Chem. Soc. 112 (1990) 6127–6129.
[63] H.A. Gabb, R.M. Jackson, M.J. Sternberg, Modelling protein docking using shape complementarity, electrostatics and biochemical information, J. Mol. Biol. 272 (1997) 106–120.
[64] S.J. Teague, Implications of protein flexibility for drug discovery, Nat. Rev. Drug Discov. 2 (2003) 527.
[65] D.A. Gschwend, A.C. Good, I.D. Kuntz, Molecular docking towards drug discovery, J. Mol. Recognit. 9 (1996) 175–186.
[66] A.R. Leach, Ligand docking to proteins with discrete side-chain flexibility, J. Mol. Biol. 235 (1994) 345–356.
[67] R. Abagyan, M. Totrov, Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and proteins, J. Mol. Biol. 235 (1994) 983–1002.
[68] G.M. Morris, R. Huey, W. Lindstrom, M.F. Sanner, R.K. Belew, D.S. Goodsell, et al., AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility, J. Comput. Chem. 30 (2009) 2785–2791.
[69] P.J. Goodford, A computational procedure for determining energetically favorable binding sites on biologically important macromolecules, J. Med. Chem. 28 (1985) 849–857.

[70] R.M. Knegtel, I.D. Kuntz, C. Oshiro, Molecular docking to ensembles of protein structures, J. Mol. Biol. 266 (1997) 424–440.
[71] C.N. Cavasotto, R.A. Abagyan, Protein flexibility in ligand docking and virtual screening to protein kinases, J. Mol. Biol. 337 (2004) 209–225.
[72] N. Go, H.A. Scheraga, Ring closure and local conformational deformations of chain molecules, Macromolecules 3 (1970) 178–187.
[73] L. Dodd, T. Boone, D. Theodorou, A concerted rotation algorithm for atomistic Monte Carlo simulation of polymer melts and glasses, Mol. Phys. 78 (1993) 961–996.
[74] D. Hoffmann, E.-W. Knapp, Polypeptide folding with off-lattice Monte Carlo dynamics: the method, Eur. Biophys. J. 24 (1996) 387–403.
[75] C. Bron, J. Kerbosch, Algorithm 457: finding all cliques of an undirected graph, Commun. ACM 16 (1973) 575–577.
[76] E.C. Meng, B.K. Shoichet, I.D. Kuntz, Automated docking with grid-based energy evaluation, J. Comput. Chem. 13 (1992) 505–524.
[77] X. Zou, Y. Sun, I.D. Kuntz, Inclusion of solvation in ligand binding free energy calculations using the generalized-Born model, J. Am. Chem. Soc. 121 (1999) 8033–8043.
[78] O. Trott, A.J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading, J. Comput. Chem. 31 (2010) 455–461.
[79] H.-J. Böhm, The development of a simple empirical scoring function to estimate the binding constant for a protein–ligand complex of known three-dimensional structure, J. Comput. Mol. Des. 8 (1994) 243–256.
[80] N.S. Pagadala, K. Syed, J. Tuszynski, Software for molecular docking: a review, Biophys. Rev. 9 (2017) 91–102.
[81] P. Singh, O. Silakari, Molecular dynamics and pharmacophore modelling studies of different subtype (ALK and EGFR (T790M)) inhibitors in NSCLC, SAR QSAR Environ. Res. 28 (2017) 221–233.
[82] S. Consalvi, S. Alfonso, A. Di Capua, G. Poce, A. Pirolli, M. Sabatino, et al., Synthesis, biological evaluation and docking analysis of a new series of methylsulfonyl and sulfamoyl acetamides and ethyl acetates as potent COX-2 inhibitors, Bioorg. Med. Chem. 23 (2015) 810–820.


Molecular dynamic simulations: Technique to analyze real-time interactions of drug-receptor complexes

7.1 Introduction

One of the principal tools in the theoretical study of biological molecules is the method of molecular dynamics (MD) simulations. This computational method calculates the time-dependent behavior of a molecular system. The MD method was first introduced by Alder and Wainwright in the late 1950s to study the interactions of hard spheres [1,2]. Many important insights concerning the behavior of simple liquids emerged from their studies. The next major advance came in 1964, when Rahman carried out the first simulation using a realistic potential for liquid argon [3]. The first MD simulation of a realistic system was done by Rahman and Stillinger in 1974, when they carried out a simulation of liquid water [4]. The first protein simulations appeared in 1977 with the simulation of the bovine pancreatic trypsin inhibitor (BPTI) [5]. Thereafter, MD simulation became a very common practice in drug design. Today, one can routinely find in the literature MD simulations of solvated proteins, protein-DNA complexes and lipid systems addressing a variety of issues, including the thermodynamics of ligand binding and the folding of small proteins. The number of simulation techniques has also greatly expanded; many specialized techniques now exist for particular problems, including mixed quantum mechanical-classical simulations, which are employed to study enzymatic reactions in the context of the full protein. MD simulation techniques are also widely used in experimental procedures such as X-ray crystallography and NMR structure determination. MD simulations provide detailed information on the fluctuations and conformational changes of proteins and nucleic acids, and these methods are now routinely used to investigate the structure, dynamics and thermodynamics of biological molecules and their complexes.
These methods are also used in the determination of structures from X-ray crystallography and from NMR experiments. Biological molecules exhibit a wide range of time scales over which specific processes occur: local motions (0.01–5 Å; 10⁻¹⁵ to 10⁻¹ s) include atomic fluctuations, sidechain motions and loop motions; rigid-body motions (1–10 Å; 10⁻⁹ to 1 s) include helix motions, domain motions (hinge bending) and subunit motions; and large-scale motions (>5 Å; 10⁻⁷ to 10⁴ s) include helix-coil transitions, dissociation/association, and folding and unfolding [6]. MD simulations permit the study of complex, dynamic processes that occur in biological systems, including protein stability, conformational changes, protein folding, molecular recognition (proteins, DNA, membranes, complexes) and ion transport, and they provide the means to carry out drug design and to support structure determination by X-ray crystallography and NMR.

7.2 Principles of MD simulations

MD simulations generate information at the microscopic level, including atomic positions and velocities. The conversion of this microscopic information to macroscopic observables such as pressure, energy and heat capacities requires statistical mechanics, which is fundamental to the study of biological systems by MD simulation [7]. In an MD simulation, one often wishes to explore the macroscopic properties of a system through microscopic simulations, for example, to calculate changes in the binding free energy of a particular drug candidate, or to examine the energetics and mechanisms of a conformational change. The connection between microscopic simulations and macroscopic properties is made via statistical mechanics, which provides the rigorous mathematical expressions that relate macroscopic properties to the distribution and motion of the atoms and molecules of the N-body system; MD simulations provide the means to solve the equations of motion of the particles and evaluate these mathematical formulas. With MD simulations, one can study both thermodynamic properties and time-dependent (kinetic) phenomena [8]. The goal of MD simulations is to understand and to predict macroscopic phenomena from the properties of the individual molecules making up the system, which could range from a collection of solvent molecules to a solvated protein-DNA complex [9]. In order to connect the macroscopic system to the microscopic system, time-independent statistical averages are often introduced. We start this discussion by introducing a few definitions.

7.2.1 Definitions

The thermodynamic state of a system is usually defined by a small set of parameters: for example, the temperature, T, the pressure, P, and the number of particles, N. Other thermodynamic properties may be derived from the equations of state and other fundamental thermodynamic equations [10]. The mechanical or microscopic state of a system is defined by the atomic positions, q, and momenta, p; these can also be considered as coordinates in a multi-dimensional space

called phase space. For a system of N particles, this space has 6N dimensions. A single point in phase space, denoted by Γ, describes the state of the system. An ensemble is a collection of points in phase space satisfying the conditions of a particular thermodynamic state [11]. A MD simulation generates a sequence of points in phase space as a function of time; these points belong to the same ensemble, and they correspond to the different conformations of the system and their respective momenta. Equivalently, an ensemble can be viewed as a collection of all possible systems which have different microscopic states but an identical macroscopic or thermodynamic state. Different ensembles with different characteristics exist [12,13]:

Microcanonical ensemble (NVE): the thermodynamic state is characterized by a fixed number of atoms, N, a fixed volume, V, and a fixed energy, E. This corresponds to an isolated system [14].

Canonical ensemble (NVT): the collection of all systems whose thermodynamic state is characterized by a fixed number of atoms, N, a fixed volume, V, and a fixed temperature, T [15].

Isobaric-isothermal ensemble (NPT): characterized by a fixed number of atoms, N, a fixed pressure, P, and a fixed temperature, T [16].

Grand canonical ensemble (μVT): characterized by a fixed chemical potential, μ, a fixed volume, V, and a fixed temperature, T [17].

7.2.2 Calculating averages from a MD simulation

An experiment is usually made on a macroscopic sample that contains an extremely large number of atoms or molecules sampling an enormous number of conformations. In statistical mechanics, averages corresponding to experimental observables are defined in terms of ensemble averages; an ensemble average is an average taken over a large number of replicas of the system considered simultaneously [18]. The ensemble average is given by

⟨A⟩_ensemble = ∫∫ dp^N dr^N A(p^N, r^N) ρ(p^N, r^N)

where A(p^N, r^N) is the observable of interest, expressed as a function of the momenta, p, and the positions, r, of the system. The integration is over all possible values of r and p.

The probability density of the ensemble is given by

ρ(p^N, r^N) = (1/Q) exp[−H(p^N, r^N)/(k_B T)]

where H is the Hamiltonian, T is the temperature, k_B is Boltzmann's constant and Q is the partition function

Q = ∫∫ dp^N dr^N exp[−H(p^N, r^N)/(k_B T)]

This integral is generally extremely difficult to calculate because one must account for all possible states of the system. In a molecular dynamics simulation, the points in the ensemble are calculated sequentially in time, so to calculate an ensemble average the simulation must pass through all possible states corresponding to the particular thermodynamic constraints. Another way, as done in an MD simulation, is to determine a time average of A, which is expressed as

⟨A⟩_time = lim_{τ→∞} (1/τ) ∫_0^τ A(p^N(t), r^N(t)) dt ≈ (1/M) Σ_{t=1}^{M} A(p^N, r^N)

where t is the simulation time, M is the number of time steps in the simulation and A(p^N, r^N) is the instantaneous value of A. The dilemma appears to be that one can calculate time averages by MD simulation, but the experimental observables are assumed to be ensemble averages. Resolving this dilemma leads to one of the most fundamental axioms of statistical mechanics, the ergodic hypothesis, which states that the time average equals the ensemble average. The basic idea is that if the system is allowed to evolve in time indefinitely, it will eventually pass through all possible states. One goal of a MD simulation is therefore to generate enough representative conformations that this equality is satisfied. If this is the case, experimentally relevant information concerning structural, dynamic and thermodynamic properties can be calculated using a feasible amount of computer resources. Because the simulations are of fixed duration, one must be certain to sample a sufficient amount of phase space; a MD simulation must be sufficiently long that enough representative conformations have been sampled.
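The equality of time and ensemble averages can be illustrated with a toy system (an illustration added here, not a simulation from the text): a harmonic oscillator x(t) = A cos(ωt + φ) with a random phase. The time average of x² along one long trajectory and the ensemble average of x² over many replicas both converge to the same value, A²/2.

```python
import math
import random

random.seed(1)
A = 1.0
omega = 2.0 * math.pi  # one oscillation period per unit time

# Time average: one trajectory, sampled at M equally spaced points
# covering 10 full periods.
phi = random.uniform(0.0, 2.0 * math.pi)
M, total_t = 100_000, 10.0
dt = total_t / M
time_avg = sum((A * math.cos(omega * i * dt + phi)) ** 2
               for i in range(M)) / M

# Ensemble average: many replicas, each with its own random phase,
# all observed at the same instant t0.
K, t0 = 100_000, 0.37
ens_avg = sum((A * math.cos(omega * t0 + random.uniform(0.0, 2.0 * math.pi))) ** 2
              for _ in range(K)) / K

# Both estimates approach the exact ensemble value A**2 / 2 = 0.5,
# as the ergodic hypothesis requires for this system.
```

Real biomolecular systems are far harder: trajectories of finite length may never visit all relevant regions of phase space, which is exactly why sufficient sampling matters.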

7.2.3 Classical mechanics

The MD simulation method is based on Newton's second law, or the equation of motion, F = ma, where F is the force exerted on the particle, m is its mass and a is its acceleration. From a knowledge of the force on each atom, it is possible to determine the acceleration of each atom in the system [19]. Integration of the equations of motion then yields a trajectory that describes the positions, velocities and accelerations of the particles as they vary with time. From this trajectory, the average values of properties can be determined. The method is deterministic; once the positions and velocities of each atom are known, the state of the system can be predicted at any time in the future or the past. Molecular dynamics simulations can be time-consuming and computationally expensive. Simulations of solvated proteins are typically calculated up to the nanosecond time scale; however, simulations into the millisecond regime have been reported. Newton's equation of motion is given by

F_i = m_i a_i

where F_i is the force exerted on particle i, m_i is the mass of particle i and a_i is the acceleration of particle i. The force can also be expressed as the gradient of the potential energy,

F_i = −∇_i V

Combining these two equations yields

−dV/dr_i = m_i (d²r_i/dt²)

where V is the potential energy of the system. Newton's equation of motion can then relate the derivative of the potential energy to the changes in position as a function of time. A simple application of Newton's second law of motion:

F = m·a = m (dv/dt) = m (d²x/dt²)

Taking the simple case where the acceleration is constant,

a = dv/dt

we obtain an expression for the velocity after integration,

v = at + v_0

and since

v = dx/dt

we can once again integrate to obtain

x = vt + x_0

Combining this equation with the expression for the velocity, we obtain the following relation, which gives the value of x at time t as a function of the acceleration, a, the initial position, x_0, and the initial velocity, v_0:

x = (1/2)at² + v_0 t + x_0

The acceleration is given as the derivative of the potential energy with respect to the position, r, a52

1 dE m dr

Therefore, to calculate a trajectory, one only needs the initial positions of the atoms, an initial distribution of velocities and the acceleration, which is determined by the gradient of the potential energy function. The equations of motion are deterministic, i.e., the positions and the velocities at time zero determine the positions and velocities at all other times, $t$. The initial positions can be obtained from experimental structures, such as the x-ray crystal structure of the protein or the solution structure determined by NMR spectroscopy. The initial distribution of velocities is usually determined from a random distribution with the magnitudes conforming to the required temperature and corrected so there is no overall momentum, i.e.,

$$ P = \sum_i m_i v_i = 0 $$

The velocities, $v_i$, are often chosen randomly from a Maxwell-Boltzmann or Gaussian distribution at a given temperature, which gives the probability that an atom $i$ has a velocity $v_{ix}$ in the x-direction at a temperature $T$:

$$ p(v_{ix}) = \left( \frac{m_i}{2\pi k_B T} \right)^{1/2} \exp\left( -\frac{1}{2} \frac{m_i v_{ix}^2}{k_B T} \right) $$

The temperature can be calculated from the velocities using the relation

$$ T = \frac{2}{3 N k_B} \sum_{i=1}^{N} \frac{p_i^2}{2 m_i} $$

where $N$ is the number of atoms in the system.
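This initialization can be sketched in a few lines of NumPy: draw each velocity component from a Gaussian at the target temperature, remove the net momentum, and recover the temperature from the kinetic energy. Reduced units with $k_B = 1$ and unit masses are assumed for simplicity.

```python
import numpy as np

def init_velocities(masses, T, rng):
    """Maxwell-Boltzmann velocities (kB = 1, reduced units) with the
    centre-of-mass momentum removed so that sum(m_i * v_i) = 0."""
    n = len(masses)
    # Each Cartesian component is Gaussian with variance kB*T/m_i.
    v = rng.normal(0.0, np.sqrt(T / masses)[:, None], size=(n, 3))
    p_total = (masses[:, None] * v).sum(axis=0)   # net momentum
    return v - p_total / masses.sum()             # subtract the COM drift

def temperature(masses, v):
    # (3N/2) kB T = sum_i m_i v_i^2 / 2, so T = sum(m v^2) / (3N) with kB = 1
    return (masses[:, None] * v**2).sum() / (3 * len(masses))

rng = np.random.default_rng(0)
m = np.ones(5000)
v = init_velocities(m, T=2.5, rng=rng)
print(np.allclose((m[:, None] * v).sum(axis=0), 0.0))  # True: no net momentum
print(temperature(m, v))                               # close to the target 2.5
```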


7.2.4 Algorithms

The potential energy is a function of the atomic positions (3N) of all the atoms in the system. Due to the complicated nature of this function, there is no analytical solution to the equations of motion; they must be solved numerically [20]. Numerous numerical algorithms have been developed for integrating the equations of motion. Some of them include:

i. Verlet algorithm
ii. Leap-frog algorithm
iii. Velocity Verlet
iv. Beeman's algorithm

Important: In choosing which algorithm to use, one should consider the following criteria:

a. The algorithm should conserve energy and momentum.
b. It should be computationally efficient.
c. It should permit a long time step for integration.

Integration Algorithms

All the integration algorithms assume the positions, velocities and accelerations can be approximated by a Taylor series expansion:

$$ r(t + \delta t) = r(t) + v(t)\,\delta t + \frac{1}{2} a(t)\,\delta t^2 + \cdots $$
$$ v(t + \delta t) = v(t) + a(t)\,\delta t + \frac{1}{2} b(t)\,\delta t^2 + \cdots $$
$$ a(t + \delta t) = a(t) + b(t)\,\delta t + \cdots $$

where $r$ is the position, $v$ is the velocity (the first derivative with respect to time), $a$ is the acceleration (the second derivative with respect to time), etc.

Verlet algorithm

To derive the Verlet algorithm one can write

$$ r(t + \delta t) = r(t) + v(t)\,\delta t + \frac{1}{2} a(t)\,\delta t^2 $$
$$ r(t - \delta t) = r(t) - v(t)\,\delta t + \frac{1}{2} a(t)\,\delta t^2 $$

Summing these two equations, one obtains

$$ r(t + \delta t) = 2 r(t) - r(t - \delta t) + a(t)\,\delta t^2 $$

The Verlet algorithm uses positions and accelerations at time $t$ and the positions from time $t - \delta t$ to calculate new positions at time $t + \delta t$. The Verlet algorithm uses no explicit velocities. The advantages of the Verlet algorithm are that (i) it is straightforward, and (ii) the storage requirements are modest. The disadvantage is that the algorithm is of moderate precision.

The Leap-frog algorithm

$$ r(t + \delta t) = r(t) + v\!\left(t + \tfrac{1}{2}\delta t\right)\delta t $$
$$ v\!\left(t + \tfrac{1}{2}\delta t\right) = v\!\left(t - \tfrac{1}{2}\delta t\right) + a(t)\,\delta t $$

In this algorithm, the velocities are first calculated at time $t + \frac{1}{2}\delta t$; these are used to calculate the positions, $r$, at time $t + \delta t$. In this way, the velocities leap over the positions, then the positions leap over the velocities. The advantage of this algorithm is that the velocities are explicitly calculated; however, the disadvantage is that they are not calculated at the same time as the positions. The velocities at time $t$ can be approximated by the relationship:

$$ v(t) = \frac{1}{2}\left[ v\!\left(t - \tfrac{1}{2}\delta t\right) + v\!\left(t + \tfrac{1}{2}\delta t\right) \right] $$

The Velocity Verlet algorithm

This algorithm yields positions, velocities and accelerations at the same time $t$. There is no compromise on precision.

$$ r(t + \delta t) = r(t) + v(t)\,\delta t + \frac{1}{2} a(t)\,\delta t^2 $$
$$ v(t + \delta t) = v(t) + \frac{1}{2}\left[ a(t) + a(t + \delta t) \right]\delta t $$

Beeman's algorithm

This algorithm is closely related to the Verlet algorithm:

$$ r(t + \delta t) = r(t) + v(t)\,\delta t + \frac{2}{3} a(t)\,\delta t^2 - \frac{1}{6} a(t - \delta t)\,\delta t^2 $$
$$ v(t + \delta t) = v(t) + \frac{1}{3} a(t + \delta t)\,\delta t + \frac{5}{6} a(t)\,\delta t - \frac{1}{6} a(t - \delta t)\,\delta t $$

The advantage of this algorithm is that it provides a more accurate expression for the velocities and better energy conservation. The disadvantage is that the more complex expressions make the calculation more expensive.
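The velocity Verlet update is simple enough to implement directly. The sketch below integrates a 1-D harmonic oscillator (acceleration a(x) = −x in reduced units, an illustrative stand-in for the force-field gradient) and checks criterion (a) above, energy conservation:

```python
def velocity_verlet(x, v, accel, dt, n_steps):
    """Velocity Verlet: positions, velocities and accelerations are all
    obtained at the same time t, as in the equations above."""
    a = accel(x)
    traj = []
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt    # r(t + dt)
        a_new = accel(x)
        v = v + 0.5 * (a + a_new) * dt        # v(t + dt)
        a = a_new
        traj.append((x, v))
    return traj

accel = lambda x: -x    # harmonic oscillator with m = k = 1 (illustrative)
traj = velocity_verlet(x=1.0, v=0.0, accel=accel, dt=0.01, n_steps=10000)
energies = [0.5 * v * v + 0.5 * x * x for x, v in traj]
drift = max(energies) - min(energies)
print(drift < 1e-4)     # True: total energy stays essentially constant
```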


7.3 Steps of MD simulations

This part describes the general steps in setting up and running MD simulations (Fig. 7.1). Stages in a typical MD trajectory include:

i. Initial input
ii. Energy minimization
iii. Heating
iv. Equilibration
v. Production (NVE MD simulations)

The goal of MD simulations is to obtain a single production trajectory of a given length (typically nanoseconds), from which conformational properties of a protein can be computed. These steps are presented assuming that the initial solvated structure of a protein and the topology files are already prepared.

7.3.1 Initialization

Three input files are needed to start the simulations. These are the topology file, the coordinate file, and the force field file. The topology file contains all the information about the structure and connectivity of atoms in the system as well as a few parameters of the force field. The structure of the topology file is as follows:

(a) atom number;
(b) segment name;
(c) residue number (may not be sequential);
(d) residue name;
(e) atom name;
(f) atom type;
(g) partial charge;
(h) atomic mass;
(i) flag used to indicate a constrained atom

The topology file also contains the information (atom numbers) that defines covalent bonds, bond angles, dihedral angles, and improper torsion angles:

Bond lengths (defined by pairs of atom numbers)
Bond angles (defined by triplets of atom numbers)


Figure 7.1 Steps involved in MD simulations.

Dihedral angles (defined by sets of four atom numbers)
Improper torsion angles (defined by sets of four atom numbers)

The coordinates of the initial structure are taken from a .pdb file (written according to the PDB format):

(a) atom number;
(b) atom name;
(c) residue name;
(d) residue number (may not be sequential);
(e) xyz coordinates;
(f) occupancy (confidence in the determination of the atom position in X-ray diffraction);
(g) temperature factor (uncertainty due to thermal disorder);
(h) segment name

The third input file is the force field file, which provides virtually all the parameters for the force field. These include the parameters for the bond-length, bond-angle, dihedral-angle, improper and Lennard-Jones potentials.
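Because the PDB format is fixed-column, the coordinate fields listed above can be unpacked by character position. A minimal sketch (the ATOM record below is hypothetical, not taken from an actual structure):

```python
def parse_atom_record(line):
    """Unpack a PDB ATOM/HETATM line by the fixed columns of the PDB format."""
    return {
        "atom_number": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
        "temperature_factor": float(line[60:66]),
    }

# Hypothetical ATOM record for illustration only:
rec = "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 25.00           N"
atom = parse_atom_record(rec)
print(atom["atom_name"], atom["residue_name"], atom["residue_number"])  # N MET 1
```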

7.3.2 Energy minimization

The first "real" step in preparing the system for production MD simulations involves energy minimization. The purpose of this stage is not to find a true global energy minimum, but to adjust the structure to the force field and the particular distribution of solvent molecules, and to relax possible steric clashes created by guessing the coordinates of atoms during the generation of the topology file.
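The relaxation idea can be illustrated with steepest-descent minimization, a standard first-stage minimizer: step downhill along the negative gradient until the change falls below a tolerance. The 1-D quadratic "potential" here is purely illustrative; a real run uses the full force-field energy and its gradient.

```python
def steepest_descent(grad, x, step=0.1, tol=1e-8, max_iter=10000):
    """Move along -grad(E) until the position change is below tol."""
    for _ in range(max_iter):
        x_new = x - step * grad(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy potential E(x) = (x - 2)**2 with gradient dE/dx = 2*(x - 2);
# the strained start at x = 10 relaxes to the minimum at x = 2.
grad = lambda x: 2.0 * (x - 2.0)
print(round(steepest_descent(grad, x=10.0), 4))  # 2.0
```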

7.3.3 Heating the simulation system

During this stage, the temperature of the system is linearly increased from 0 K to 310 K within 300 ps. At each integration step, velocities are reassigned (i.e., drawn) from a new Maxwell distribution and the temperature is incremented by 0.001 K. Both the instantaneous temperature T(t) and the potential energy Ep increase with time during heating; note that they fluctuate due to the finite system size.
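The heating schedule amounts to repeatedly nudging the target temperature upward and rescaling velocities by sqrt(T_wanted/T_current). The sketch below uses reduced units (kB = 1, unit masses) and a coarser increment than the 0.001 K of the protocol above so that it runs quickly; the logic is the same.

```python
import math
import random

def temperature(v):
    # kB = 1, unit mass, one degree of freedom per velocity component
    return sum(vi * vi for vi in v) / len(v)

def heat(v, t_target, dT):
    """Linear heating: rescale velocities in increments of dT up to t_target."""
    T = temperature(v)
    while T < t_target - 1e-9:
        T_wanted = min(T + dT, t_target)
        scale = math.sqrt(T_wanted / T)
        v = [vi * scale for vi in v]
        T = temperature(v)
    return v, T

random.seed(1)
cold = [random.gauss(0.0, 1.0) for _ in range(300)]   # cold start, T near 1
v_hot, T = heat(cold, t_target=310.0, dT=0.5)         # the text uses dT = 0.001 K
print(abs(T - 310.0) < 1e-6)                          # True: target reached
```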

7.3.4 Equilibration at a constant temperature

The equilibration stage is used to equilibrate the kinetic and potential energies, i.e., to distribute the kinetic energy "pumped" into the system during heating among all degrees of freedom. This usually implies that the potential energy "lags behind" and must be equilibrated with the kinetic energy; in other words, some kinetic energy must be transferred to potential energy. As soon as the potential energy levels off, the equilibration stage is complete.

7.3.5 Production stage of MD trajectory (NVE ensemble)

This stage of the MD trajectory is used to sample the structural characteristics and dynamics of the protein at 310 K. Velocity reassignment or rescaling must now be turned off.

7.4 Applications of MD simulations in drug discovery

MD simulations offer insight into protein motions, which often play important roles in drug discovery. A single protein conformation tells little about protein dynamics. The static models produced by NMR, X-ray crystallography, and homology modeling provide valuable insights into macromolecular structure, but molecular recognition and drug binding are very dynamic processes. When a small molecule like a drug approaches its target in solution, it encounters not a single, frozen structure, but rather a macromolecule in constant motion. Upon binding, the ligand may further induce conformational changes that are not typically sampled when the ligand is absent [21]. Regardless, receptor motions clearly play an essential role in the binding of most small-molecule drugs. Several techniques have been developed to exploit the information about these motions that molecular dynamics simulations can provide.


7.4.1 Identifying cryptic and allosteric binding sites

NMR and X-ray crystallographic structures often reveal well-defined binding pockets that accommodate endogenous ligands; however, sometimes the models produced by these experimental techniques obscure other potentially druggable sites. As these sites are not immediately obvious from the available structures, they are sometimes called cryptic binding sites. MD simulations are excellent tools for identifying such sites [22]. Aside from cryptic binding sites, MD simulations can also be used to identify druggable allosteric sites. In one study, Ivetac and McCammon performed simulations of the human β1 (β1AR) and β2 (β2AR) adrenergic receptors [23]. Multiple protein conformations were extracted from these simulations, and the protein surface was computationally 'flooded' with small organic probes using FTMAP to identify potential binding sites [24]. Regions of the protein surface where the organic probes consistently congregated across multiple structures were then identified as potential allosteric sites. In all, five potential sites were identified, some of which were not evident in any of the existing crystal structures.

7.4.2 Improving the computational identification of small-molecule binders

One common technique used to identify the precursors of potential drugs in silico is virtual screening. A docking program is used to predict the binding pose and energy of a small-molecule model within a selected receptor binding pocket. Unfortunately, traditional docking that relies on a single receptor structure is problematic. Some legitimate ligands may indeed bind to the single structure selected, but in reality most receptor binding pockets have many valid conformational states, any one of which may be druggable. In a traditional virtual screen, true ligands are often discarded because they in fact bind to receptor conformations that differ markedly from that of the single static structure chosen. To better account for receptor flexibility, a virtual-screening protocol called the relaxed complex scheme (RCS) has been developed [25]. Rather than docking many compound models into a single NMR or crystal structure, each potential ligand is docked into multiple protein conformations, typically extracted from a molecular dynamics simulation. Thus, each ligand is associated not with a single docking score but rather with a whole spectrum of scores. Ligands can be ranked by a number of spectrum characteristics, such as the average score over all receptors. Thus, the RCS effectively accounts for the many receptor conformations sampled by the simulations. It has been used successfully to identify a number of protein inhibitors [26]. While these successes are promising, the relaxed complex scheme certainly has its weaknesses. Aside from being based on molecular dynamics simulations that are

themselves subject to crude force-field approximations and inadequate conformational sampling, the scheme relies on computer-docking scoring functions that of necessity are optimized for speed at the expense of accuracy. To facilitate high-throughput virtual screening, these scoring functions often treat subtle influences on binding energy, like conformational entropy and solvation energy, only superficially, sacrificing accuracy for the sake of greater speed [27].
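The bookkeeping behind the RCS reduces to aggregating each ligand's scores over the receptor ensemble. A sketch with hypothetical ligands and docking scores (more negative is better, as in most docking programs), ranked by the ensemble average:

```python
# Hypothetical docking scores (kcal/mol) of three ligands against four
# receptor conformations extracted from an MD trajectory.
scores = {
    "ligand_A": [-7.2, -9.8, -6.9, -10.1],  # strong against two snapshots
    "ligand_B": [-8.0, -8.1, -7.9, -8.2],   # consistent moderate binder
    "ligand_C": [-5.5, -6.0, -5.8, -5.9],   # weak against every conformation
}

def rcs_rank(scores, stat=lambda xs: sum(xs) / len(xs)):
    """Rank ligands by a statistic of their score spectrum (default: mean)."""
    return sorted(scores, key=lambda lig: stat(scores[lig]))

print(rcs_rank(scores))            # ['ligand_A', 'ligand_B', 'ligand_C']
print(rcs_rank(scores, stat=min))  # rank by best single-conformation score
```

Ranking by the minimum instead of the mean rewards ligands that bind one rarely sampled conformation very well, one of the "spectrum characteristics" the text mentions.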

7.4.3 Advanced free-energy calculations using MD simulations Though docking programs are optimized for speed rather than accuracy, more accurate, albeit computationally intensive, techniques for predicting binding affinity do exist. These techniques, which include thermodynamic integration [28], single-step perturbation [29], and free energy perturbation [30], are based in large part on MD simulations. Free energy is a state function, meaning that the free-energy difference associated with a given event like a drug binding to its receptor is determined only by the energy prior to that event and the energy following it; while the path taken from the initial to the final state may influence receptor-ligand kinetics, it has no bearing on the free energy. Perhaps the ligand diffuses slowly towards the active site and slips easily into the binding pocket. Perhaps the protein unfolds entirely and then refolds around the ligand. Perhaps the ligand in solution is beamed to a starship in orbit, only to rematerialize in the active site a few seconds later. The mechanics do not matter; the free energy depends only on the initial energy in solution and the final energy following the binding event. Simulating a receptor-ligand system long enough to capture an entire binding event is not currently feasible. However, it is still possible to calculate a drug’s binding affinity using a technique called ‘alchemical transformation’, first described in 1984 [31]. This transformation is not too different from the starship example given above. During the course of a molecular dynamics simulation, the electrostatic and van der Waals forces produced by ligand atoms are turned down gradually enough to avoid undesirable artifacts. Eventually, the ligand is no longer able to interact with the protein or solvent. For all practical purposes, the ligand has disappeared. 
It does not matter that this transformation is not at all physical; free energy is a state function, so the path from the initial to the final state, whether real or imaginary, is irrelevant. It is not clear, however, in what context the ligand should be figuratively annihilated in this way. Should a molecular dynamics simulation be run in which the bound ligand vanishes? What about the ligand in the solution? To address these questions, alchemical

transformations are selected based on a thermodynamic cycle. As free energy is a state function that depends only on the energy of the initial and final states, a system that proceeds from one state around this free-energy cycle only to return to the same initial state must show no change in total free energy; that is,

$$ \Delta G_{bind} + \Delta G_{protein} - \Delta G'_{bind} - \Delta G_{water} = 0 $$

where $\Delta G_{protein}$ and $\Delta G_{water}$ are the free energies of annihilating the ligand in the binding site and in water, respectively, and $\Delta G'_{bind}$, the "binding" free energy of the fully decoupled ligand, is zero. Therefore, it is possible to estimate a drug's free energy of binding, an indirect measurement of drug potency, by running two simulations, one in which the receptor-bound ligand disappears, and one in which the solvated ligand disappears. A similar task, calculating relative ligand binding energies, is useful during drug optimization when one wishes to determine whether a given chemical change will improve the potency of a candidate ligand. In this case, rather than annihilating the entire ligand, one section of the ligand is gradually transformed. For example, a key carbon atom might be gradually converted into an oxygen atom to see if the binding affinity is improved or diminished. These kinds of alchemical molecular dynamics simulations may provide medicinal chemists with useful insights that can guide further drug development. A series of early fortuitous results that agreed remarkably well with experiments led to an enthusiasm for molecular-dynamics-based free-energy calculations in the 1980s and early 1990s that was not matched in subsequent decades, as computational predictions fell short of experimental measurements [32]. However, steady algorithmic and engineering advances in recent years have led to renewed attention. All molecular-dynamics-based drug-discovery techniques would benefit from improved force fields, but the alchemical techniques are, in addition, uniquely sensitive to inadequate conformational sampling [33].
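Numerically, the cycle closure is simple arithmetic: the two annihilation simulations give ΔG_protein and ΔG_water, and the binding free energy follows. The numbers below are made up purely for illustration.

```python
# Double-annihilation thermodynamic cycle (illustrative numbers, kcal/mol).
dG_water = 45.3    # hypothetical: annihilate the ligand in water
dG_protein = 57.1  # hypothetical: annihilate the ligand in the binding site
# Cycle closure (with the decoupled ligand's "binding" free energy = 0):
# dG_bind + dG_protein - dG_water = 0
dG_bind = dG_water - dG_protein
print(round(dG_bind, 1))  # -11.8: binding is favourable in this example
```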
When molecular dynamics simulations of insufficient length are used to identify cryptic sites, allosteric sites, or pharmacologically relevant binding-pocket conformations for virtual-screening projects, the risk is that some suitable receptor conformations will be missed. The conformations that are identified, however, are still useful; the results of the simulation are therefore incomplete, but not necessarily wrong. However, the alchemical techniques used to calculate free energies of binding are far more dependent on thorough conformational sampling than are RCS screens. If molecular dynamics simulations fail to sample system conformations that are in fact sampled ex silico, these conformations will not contribute to the total calculated energy, leading to an incorrect prediction of the binding affinity. MD simulations are computationally demanding and often of necessity unacceptably short; insufficient conformational sampling is therefore a common problem that future algorithmic and hardware-engineering efforts must address.


7.5 MD simulations: Current limitations

These successes aside, the utility of MD simulations is still limited by two principal challenges: the force fields used require further refinement, and high computational demands prohibit routine simulations greater than a microsecond in length, leading in many cases to inadequate sampling of conformational states [34]. As an example of these high computational demands, consider that a 1-μs simulation of a relatively small system (approximately 25,000 atoms) running on 24 processors takes several months to complete. Apart from the challenges related to the high computational demands of these simulations, the force fields used are also approximations of the quantum-mechanical reality that reigns in the atomic regime. While simulations can accurately predict many important molecular motions, they are poorly suited to systems where quantum effects are important, for example, when transition metal atoms are involved in binding. To overcome this challenge, some researchers have introduced quantum mechanical calculations into classical molecular-dynamics force fields: the motions and reactions of enzymatic active sites or other limited areas of interest are simulated according to the laws of quantum mechanics, and the motions of the larger system are approximated using molecular dynamics. While far from the computationally intractable 'ideal' of using quantum mechanics to describe the entire system, this hybrid technique has nevertheless been used successfully to study a number of systems. For example, in one recent simulation of Desulfovibrio desulfuricans and Clostridium pasteurianum [Fe-Fe] hydrogenases, a 'QM (quantum mechanical) region' was defined encompassing a metal-containing region of the protein thought to be catalytically important, and the remainder of the protein was simulated using classical MD [35].
The simulations revealed an important proton transfer in the QM region, a bond-breaking and bond-formation event that could not have been modeled with a traditional force field. The hypothesized catalytic mechanism was subsequently supported by experimental evidence. Besides bond breaking and formation, electronic polarization, the redistribution of electron density among groups of atoms that are chemically bonded, is another quantum-mechanical effect that, with few exceptions, is generally ignored. In classical molecular dynamics simulations, each atom is assigned a fixed partial charge before the simulation begins. In reality, however, the electron clouds surrounding atoms are constantly shifting according to their environments, so the partial charges of atoms would be better represented as dynamic and responsive. Despite wide agreement on the importance of accounting for electronic polarization, after 30 years of development a generally accepted polarizable force field has not been forthcoming, and molecular dynamics simulations using the polarizable force fields that are available are rare [36]. Nevertheless, a number of polarizable force fields are currently under development, and future implementations may lead to improved accuracy [37].

In addition to neglecting quantum-mechanical effects, molecular dynamics studies are also limited by the short time scales typically simulated. To reproduce thermodynamic properties and/or to fully elucidate all binding-pocket configurations relevant to drug design, all the possible conformational states of the protein must be explored by the simulation. Unfortunately, many biochemical processes, including receptor conformational shifts relevant to drug binding, occur on time scales that are much longer than those amenable to simulation. With some important exceptions, simulations are currently limited to at most millionths of a second; indeed, most simulations are measured in billionths of a second [38]. A number of solutions to this challenge have already seen limited use. For example, in accelerated molecular dynamics (aMD), large energy barriers are artificially reduced [39]. Though this process inevitably introduces some artifacts, it does allow proteins to shift between conformations that would not be accessible given the time scales of conventional molecular dynamics. These novel conformations can then be further studied using classical molecular dynamics or other techniques. Novel hardware has also been used to overcome the time-scale limitations of conventional molecular dynamics simulations. Many of the same calculations required for these simulations are commonly performed by video-game and computer-graphics applications. Consequently, the same graphics processing units (GPUs) designed to speed up video games can be used to speed up molecular dynamics simulations as well, usually by an order of magnitude [40]. Not satisfied with merely adapting MD code to run on specialized graphics processors, some engineers have designed new processors specifically for these simulations. The research group of D.E. Shaw is one notable advocate of this approach.
They have built a supercomputer codenamed Anton that is capable of performing microseconds of simulation per day. With Anton, simulations longer than one millisecond have successfully captured protein folding and unfolding as well as drug-binding events [41]. Shortcomings certainly still exist, but these and other future techniques will likely make great progress towards overcoming current limitations on conformational sampling.

Exercise: To perform an MD simulation of a complex of a triazolopyrimidine ligand with the protein structure of c-MET (PDB ID: 3DKF)

Note: In this exercise, we are going to use a previously docked complex, so the step of protein preparation is skipped.

i. Operating system: Linux
ii. Software: Maestro as the interface provided by Schrödinger, and Desmond to run the simulations [Note: please check the version of your operating system and download accordingly]
iii. Binary files: .pdb file for the complex

STEP BY STEP PROTOCOL:

The complex obtained after docking (ap_isopropyl-cmet.pdb) of the triazolopyrimidine ligand into the protein structure of c-MET (PDB ID: 3DKF) is used in this exercise for the Desmond simulation.

Note: Please click on the following link to download the .pdb file used in this exercise.

• Select Applications > Desmond > System Builder. This launches the Desmond System Builder panel. The System Builder generates a solvated system that includes the solute (protein, protein complex, protein-ligand complex, protein immersed in a membrane bilayer, etc.) and the solvent water molecules with counter ions.
• Select TIP3P from the Predefined option menu.
• Select Orthorhombic from the Box shape option menu.
• Select Buffer as the Box size calculation method.
• Enter 10.0 in the Distances (Å) box.
• Select Show boundary box and click Calculate.

The System Builder can also be instructed to minimize the volume of the simulation box by aligning the principal axes of the solute along the box vectors or the diagonal. This can save computational time if the solute is not allowed to rotate in the simulation box.

• Click the Ions tab.
• For Ion placement, select Neutralize.

• In the Salt concentration box, enter 0.150. This will add ions to the simulation box that represent background salt at physiological conditions.
• Click Start to start building the solvated structure. You will need to provide a job name and choose a host on which to run the System Builder job. When complete, the solvated structure in its simulation box appears in the workspace.

Now we are ready to perform the Desmond simulation. Expert users will typically want to start a simulation from the command line. However, for this tutorial, we will run it from the MD panel in Maestro.

• Select Applications > Desmond > Molecular Dynamics.
• Import the model system into the MD environment: select either Load from Workspace or Import from file (and select a .cms file), and then click Load. The import process may take several minutes for large systems.

• In the Simulation time box, set the total simulation time to 10 ns.
• Select Relax model system before simulation.

This is a vital step to prepare a molecular system for production-quality MD simulation. Maestro's default relaxation protocol includes two stages of minimization (restrained and unrestrained) followed by four stages of MD runs with gradually diminishing restraints.

• Click Advanced Options to set parameters for the simulation.
• Click Start. The Molecular Dynamics-Start dialog box appears. Select Append new entries from the Incorporate option menu in the Output area to indicate that the results of the Desmond simulation should be added to the current Maestro project.
• Click Start. The Desmond simulation process begins execution.

A completed Desmond simulation of the ap_isopropyl-cmet.pdb structure is written to your working directory with the base name desmond_job by default.

• Select Project > Import Structures and read in the desmond_job-out.cms file.
• Select Project > Show Table to open the Project Table. The Trajectory Player can be launched from the Project Table; note the blue T in the Title column.
• Click T to open the Trajectory Player (Fig. 7.2).

Figure 7.2 Interaction diagram obtained after MD simulations.

Performing simulation quality analysis

• Select Applications > Desmond > Simulation Quality Analysis.
• Open the Simulation Event Analysis (SEA) tool from the Applications > Desmond menu. When the SEA panel opens, the Trajectory Player panel opens automatically and the Workspace view changes to show only the solute in wireframe representation. Keep the Trajectory Player panel open for the full duration of the analysis.
• Click Browse next to Structure to select the desmond_job-out.cms file. Then, choose the analysis to perform (for example, RMSD) and the atoms to which to apply the analysis (for example, the protein backbone) from the Properties area, which has multiple tabs (Fig. 7.3).

Figure 7.3 RMSD plot obtained after Simulation Event Analysis.
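What the RMSD analysis computes is, at its core, the root-mean-square deviation of the selected atoms from a reference frame. A NumPy sketch on synthetic coordinates (the superposition step is omitted, frames are assumed pre-aligned, and the perturbations are made up for illustration):

```python
import numpy as np

def rmsd(frame, reference):
    """RMSD between two (N, 3) coordinate arrays, assumed superimposed."""
    diff = frame - reference
    return np.sqrt((diff * diff).sum() / len(reference))

rng = np.random.default_rng(7)
ref = rng.random((100, 3)) * 10.0   # synthetic "backbone" coordinates
# Frames with increasing random displacement from the reference:
frames = [ref + rng.normal(0.0, s, ref.shape) for s in (0.1, 0.5, 1.0)]
values = [rmsd(f, ref) for f in frames]
print(values[0] < values[1] < values[2])  # True: RMSD grows with perturbation
```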


References

[1] B.J. Alder, T.E. Wainwright, Phase transition for a hard sphere system, J. Chem. Phys. 27 (1957) 1208–1209.
[2] B.J. Alder, T.E. Wainwright, Studies in molecular dynamics. I. General method, J. Chem. Phys. 31 (1959) 459–466.
[3] A. Rahman, Correlations in the motion of atoms in liquid argon, Phys. Rev. 136 (1964) A405.
[4] F.H. Stillinger, A. Rahman, Improved simulation of liquid water by molecular dynamics, J. Chem. Phys. 60 (1974) 1545–1557.
[5] J.A. McCammon, B.R. Gelin, M. Karplus, Dynamics of folded proteins, Nature 267 (1977) 585.
[6] P. Priya, M. Kesheri, R.P. Sinha, S. Kanchan, Molecular dynamics simulations for biological systems, Pharmaceutical Sciences: Breakthroughs in Research and Practice, IGI Global, 2017, pp. 1044–1071.
[7] D. McQuarrie, Statistical Mechanics, Harper and Row, New York, 1976 (Chapter 21).
[8] D. Chandler, Introduction to Modern Statistical Mechanics, Oxford University Press, 1987, p. 288. ISBN-13: 978-0195042771.
[9] R.E. Wilde, S. Singh, Statistical Mechanics: Fundamentals and Modern Applications, Wiley-Interscience, 1998.
[10] T.L. Hill, Thermodynamics of Small Systems, Courier Corporation, 1994.
[11] J. Zhang, Molecular Dynamics Analyses of Prion Protein Structures: The Resistance to Prion Diseases Down Under, Springer, 2018.
[12] A.A. Berlin, R. Joswik, N.I. Vatin, The Chemistry and Physics of Engineering Materials: Modern Analytical Methodologies, CRC Press, 2018.
[13] T.L. Hill, An Introduction to Statistical Thermodynamics, Courier Corporation, 1986.
[14] D.J. Callaway, A. Rahman, Microcanonical ensemble formulation of lattice gauge theory, Phys. Rev. Lett. 49 (1982) 613.
[15] J.-H. Kim, S.-H. Lee, Molecular dynamics simulation studies of benzene, toluene, and p-xylene in a canonical ensemble, Bull. Korean Chem. Soc. 23 (2002) 441–446.
[16] T. Okabe, M. Kawata, Y. Okamoto, M. Mikami, Replica-exchange Monte Carlo method for the isobaric-isothermal ensemble, Chem. Phys. Lett. 335 (2001) 435–439.
[17] D. Adams, Grand canonical ensemble Monte Carlo for a Lennard-Jones fluid, Mol. Phys. 29 (1975) 307–311.
[18] S. Park, K. Schulten, Calculating potentials of mean force from steered molecular dynamics simulations, J. Chem. Phys. 120 (2004) 5946–5961.
[19] P. Coveney, F. Giordanetto, N. Stringfellow, Obtaining Scalable Performance from Molecular Dynamics Codes on HPC Machines, 2002.
[20] M. Ciocoiu, Materials Behavior: Research Methodology and Mathematical Models, CRC Press, 2018.
[21] D.E. Koshland, Application of a theory of enzyme specificity to protein synthesis, Proc. Natl. Acad. Sci. 44 (1958) 98–104.
[22] J.D. Durrant, H. Keränen, B.A. Wilson, J.A. McCammon, Computational identification of uncharacterized cruzain binding sites, PLoS Negl. Trop. Dis. 4 (2010) e676.
[23] A. Ivetac, J.A. McCammon, Mapping the druggable allosteric space of G-protein coupled receptors: a fragment-based molecular dynamics approach, Chem. Biol. Drug Des. 76 (2010) 201–217.
[24] R. Brenke, D. Kozakov, G.-Y. Chuang, D. Beglov, D. Hall, M.R. Landon, et al., Fragment-based identification of druggable 'hot spots' of proteins using Fourier domain correlation techniques, Bioinformatics 25 (2009) 621–627.
[25] R.E. Amaro, R. Baron, J.A. McCammon, An improved relaxed complex scheme for receptor flexibility in computer-aided drug design, J. Comput. Aided Mol. Des. 22 (2008) 693–705.
[26] J.D. Durrant, M.D. Urbaniak, M.A. Ferguson, J.A. McCammon, Computer-aided identification of Trypanosoma brucei uridine diphosphate galactose 4′-epimerase inhibitors: toward the development of novel therapies for African sleeping sickness, J. Med. Chem. 53 (2010) 5025–5032.
[27] D.B. Kitchen, H. Decornez, J.R. Furr, J. Bajorath, Docking and scoring in virtual screening for drug discovery: methods and applications, Nat. Rev. Drug Discov. 3 (2004) 935.
[28] S.A. Adcock, J.A. McCammon, Molecular dynamics: survey of methods for simulating the activity of proteins, Chem. Rev. 106 (2006) 1589–1615.
[29] F. Schwab, W.F. van Gunsteren, B. Zagrovic, Computational study of the mechanism and the relative free energies of binding of anticholesteremic inhibitors to squalene-hopene cyclase, Biochemistry 47 (2008) 2945–2951.
[30] J.T. Kim, A.D. Hamilton, C.M. Bailey, R.A. Domoal, L. Wang, K.S. Anderson, et al., FEP-guided selection of bicyclic heterocycles in lead optimization for non-nucleoside inhibitors of HIV-1 reverse transcriptase, J. Am. Chem. Soc. 128 (2006) 15372–15373.
[31] C. Chipot, A. Pohorille, Free Energy Calculations, Springer, 2007.
[32] C. Chipot, D.A. Pearlman, Free energy calculations. The long and winding gilded road, Mol. Simul. 28 (2002) 1–12.
[33] J.A. McCammon, Computer-aided drug discovery: physics-based simulations from the molecular to the cellular level, Physical Biology: From Atoms to Medicine, World Scientific, 2008, pp. 401–410.
[34] J.D. Chodera, D.L. Mobley, M.R. Shirts, R.W. Dixon, K. Branson, V.S. Pande, Alchemical free energy methods for drug discovery: progress and challenges, Curr. Opin. Struct. Biol. 21 (2011) 150–160.
[35] G. Hong, A. Cornish, E. Hegg, R. Pachter, On understanding proton transfer to the biocatalytic [Fe-Fe]H sub-cluster in [Fe-Fe]H2ases: QM/MM MD simulations, Biochim. Biophys. Acta (BBA) - Bioenergetics 1807 (2011) 510–517.
[36] W.L. Jorgensen, Special Issue on Polarization, ACS Publications, 2007.
[37] P. Cieplak, F.-Y. Dupradeau, Y. Duan, J. Wang, Polarization effects in molecular mechanical force fields, J. Phys.: Condens. Matter 21 (2009) 333102.
[38] D. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R. Dror, M. Eastwood, et al., Atomic-level characterization of the structural dynamics of proteins, Science 330 (2010) 341–346.
[39] D. Hamelberg, J. Mongan, J.A. McCammon, Accelerated molecular dynamics: a promising and efficient simulation method for biomolecules, J. Chem. Phys. 120 (2004) 11919–11929.
[40] M.S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A.L. Beberg, et al., Accelerating molecular dynamic simulation on graphics processing units, J. Comput. Chem. 30 (2009) 864–872.
[41] Y. Shan, E.T. Kim, M.P. Eastwood, R.O. Dror, M.A. Seeliger, D.E. Shaw, How does a drug molecule find its target binding site? J. Am. Chem. Soc. 133 (2011) 9181–9183.


Water mapping: Analysis of binding site spaces to enhance binding

8.1 Introduction

Water molecules play a very important role in the binding of a ligand to a protein. Water does not merely surround the protein but is often an integral part of protein-ligand binding, where it is involved in key mechanistic steps as well [1]. These multiple roles are associated with water's unusual and unique properties: its small size, the dipolar nature arising from its charge distribution, the capacity to act both as a hydrogen bond donor and acceptor, and the entropic gain associated with release to bulk solvent when bound to proteins and ligands [2]. Moreover, water can form a complex hydrogen-bonding network between ligand and protein [3]. As evident from a study of 392 protein-ligand complexes, water-mediated binding is very common: about 85% of these complexes involved at least one water molecule bridging the interaction between the ligand and the protein [4]. Furthermore, the displacement of an ordered water molecule can drastically affect a ligand's binding affinity [5]. It is therefore crucial to include explicit water molecules in computational drug design [6]. Careful consideration of hydration sites has also been shown to improve the predictivity of 3D QSAR models [7], ensure stable molecular dynamics simulations [8], and improve the accuracy of rigorous free energy calculations [9]. In some cases, continuum solvent models have also been reported to improve with the addition of explicit water molecules [10]. Traditionally, ordered water molecules were ignored in ligand docking studies and ligands were docked into desolvated binding sites. There are now, however, a number of docking protocols that include explicit water molecules and claim improved accuracy in many cases [11].
On the flip side, it has also been reported that including such water molecules may hamper efforts to predict a ligand's correct binding mode [12]. A popular strategy in rational drug design is to modify a ligand so that it displaces an ordered water molecule into the bulk solvent [13]. The driving force is the favorable entropic gain resulting from the increase in the water molecule's translational and orientational degrees of freedom. However, the targeted displacement of an ordered water molecule may be unsuccessful [14], and can even decrease affinity if the


ligand is unable to replace the water molecule's hydrogen bonds correctly and fulfil its stabilizing role [15]. Water molecules also have important implications in the lead optimization process. For instance, rigorous theoretical studies have investigated how changing a water-displacing functional group affects a ligand's affinity [16]. In addition, water molecules are important pharmacophoric features of a binding site [17], and the chemical diversity of potential inhibitors generated in silico has been reported to be greatly affected by the targeted displacement of ordered water molecules [18]. The locations of water molecules are typically taken from X-ray crystal structures and may be validated by observing the same position in other crystal structures of the same protein. Nevertheless, there are inherent problems with identifying hydration sites by crystallography: water molecules can be artefactual, may be too mobile to identify, or may not be observed because of low resolution [19]. In cases such as homology modeling, there is no structural knowledge of water molecules at all. Hence, it is necessary to be able to accurately predict water locations within binding sites. Water molecules within the active site of a protein target can take on various roles and should therefore be incorporated in structure-based drug design. They may mediate protein-ligand interactions by bridging between the ligand and the protein, or may be displaced by a ligand. A ligand that can displace and mimic these structural water molecules will have a more favorable binding free energy [20]. X-ray crystallography is the major experimental source of information on water molecules within protein-ligand complexes. However, the inherent mobility of water molecules makes their determination and visualization very challenging.
When the crystal structure of a protein does not show any explicit water molecules in the active site, this does not necessarily mean that they are absent in reality. Water molecules that are in rapid exchange with bulk solvent, or that are relatively mobile within the binding site, will not be observed in crystallographic studies, but may still play a major role in protein function. An example is cytochrome c oxidase, a redox-driven proton pump that creates the membrane proton gradient responsible for driving ATP synthesis in aerobic cells. Its crystal structure does not show any water molecule in the catalytic center [21]. However, since protons are delivered to the catalytic center, where the reduction of molecular oxygen occurs, at least some water molecules must be present there [22]. In 2007, Lu et al. performed an extensive analysis of water molecules at the protein-ligand interface, observed in high-resolution crystal structures [4]. A total of 1829 ligand-bound water molecules were observed in 392 complexes, of which 18% are surface water molecules and 72% are interfacial water molecules; of the latter, 76% are considered to be bridging water molecules. The number of ligand-bound water molecules in each

complex ranges from 0 to 21. On average, between three and four water molecules are involved in bridging hydrogen bonds between a protein and its ligand. Several examples of hydrogen-bonded bridges between protein and ligand are reported in the literature [23]. In structure-based drug design, one of the central strategies is to modify lead molecules slightly to obtain or improve certain therapeutic properties [24]. The rationale behind this approach is that similar molecules bind in a similar fashion to a target receptor, thus possibly inducing the same effect. Nevertheless, the new compound may adopt a different binding mode due to the presence of internal water molecules. This effect was examined in 2006 in a comparative study of X-ray structures of 206 pairs of structurally similar ligands binding to a common protein [25]. When the water molecule architecture within a pair was compared, a difference in the water positions was observed in 68% of the cases. This effect was seen throughout all protein classes of the 206 complexes. A well-known example is HIV-1 protease, a homodimer forming a single symmetric active site [26]. The substrate-free HIV-1 protease active site (PDB code 1G61) displays the two catalytic residues Asp25 and Asp125 as well as two water molecules (300 and 301) [27]. Water 301 is a conserved water molecule, located on the HIV-1 protease symmetry axis, bridging the two subunits. NMR studies showed that this water molecule has a long residence time [28]. It is commonly observed to donate two hydrogen bonds to inhibitors and to accept two hydrogen bonds from the backbone of the protein residues Ile50 and Ile150, situated on the protein 'flaps' [29]. An example is Kynostatin 272 (KNI-272), a potent and selective inhibitor of HIV-1 protease [30].
Four structural water molecules (including water 301) were observed both by crystallography and by NMR spectroscopy at the interface between HIV-1 protease and KNI-272 (PDB code 1HPX) [31]. On the other hand, potent cyclic urea inhibitors were designed to displace and mimic the interactions of water 301, leading to more potent inhibitors [32].

8.2 Thermodynamics

For drug discovery purposes, a detailed understanding of the interactions between biological macromolecules and small chemical compounds is of vital importance. Multiple factors play a role in protein-ligand binding, as elucidated by in-depth computational studies [33]. One of the most important among them is the free energy of binding (ΔG). Assessment of ΔG offers insight into the process of ligand binding, and its efficient and accurate calculation is still one of the holy grails of computational chemistry. In recent years, experimental methods such as isothermal titration calorimetry have become quite popular for determining the thermodynamics of ligand binding [34].

The binding free energy is the result of a series of events occurring upon ligand binding, one of which is the (de)solvation of both ligand and host. In order to score ligands according to their binding affinity, three factors should be taken into account: the ligand, the protein, and the solvent in which both are solvated. From the perspective of the ligand, complexation with the protein should offer favorable interactions such as hydrogen bonds, π-π interactions, etc. These interactions contribute favorably to the enthalpy of binding. On the other hand, upon binding the ligand loses conformational, translational, and rotational freedom, resulting in an unfavorable entropic contribution to the binding process. According to the conformational selection model, a protein is present as an ensemble of different conformations [35]. Low-energy conformations are more likely and will be observed more often, while higher-energy conformations occur only rarely. A ligand that binds to the active site may only be able to do so when the protein is in a specific conformation. For tightly binding ligands, the favorable protein-ligand enthalpy may be sufficient to stabilize the protein in a high-energy conformational state; from the perspective of the protein, this is an unfavorable enthalpic contribution. As the ligand now locks the protein in this conformational state, its flexibility is also reduced, leading to an additional unfavorable entropic contribution. Although it is an important contribution to the free energy, the effect of the solvent is often ignored in modeling approaches. On one hand, the loss of direct solvent-ligand and solvent-protein interactions (hydrogen bonds) leads to an unfavorable enthalpic contribution; on the other hand, structured water molecules are released to bulk solvent upon ligand binding, which leads to a favorable increase in entropy [36].
Dunitz showed that the release of a highly ordered water molecule from the active site to bulk solution theoretically results in an entropic gain of up to 7 cal/(mol·K) [37]. Li and Lazaridis assessed the role of structural water molecules in protein-ligand complex formation by calculating their contribution to the thermodynamics of protein solvation. Using statistical mechanics and inhomogeneous fluid solvation theory, they computed the contribution of the displaced water molecule in HIV-1 protease upon binding of DMP450 and concluded that water displacement is indeed favorable for binding [38]. In another study, calculating the contribution of an ordered water molecule to the thermodynamic properties of a concanavalin A complex showed that its displacement by ligand modification resulted in a decrease in binding free energy, in agreement with experimental data [5]. It was shown that the entropic penalty of ordering the water molecule is large but is outweighed by the favorable enthalpy gain of the water-protein interactions [39]. Bridging water molecules (possibly outside the protein-ligand interface) were estimated to contribute up to 3 kcal/mol to the binding affinity [40]. This indicates that structural details of the binding cavity are very important and that the contribution of a single water molecule to the binding free energy cannot be generalized into a single value.
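Dunitz's estimate is easy to turn into a free-energy number. The short sketch below, illustrative arithmetic only, converts the entropy gain quoted above into a −TΔS contribution at room temperature:

```python
# Convert Dunitz's upper-bound entropy gain for releasing one ordered
# water molecule (about 7 cal/(mol*K), as quoted above) into a
# free-energy contribution -T*dS at room temperature.

def entropic_free_energy_gain(delta_s_cal=7.0, temperature_k=298.0):
    """Return -T*dS in kcal/mol for one released water molecule."""
    return -temperature_k * delta_s_cal / 1000.0  # cal -> kcal

print(f"-T*dS = {entropic_free_energy_gain():.2f} kcal/mol")  # about -2.1
```

At 298 K this gives roughly −2.1 kcal/mol per released water, which puts the bridging-water contribution of up to 3 kcal/mol cited above into perspective.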


8.3 Predicting location and nature of water molecules: To be or not to be replaced?

Numerous studies have shown the importance of being able to predict whether water molecules in a protein binding site will be conserved or displaced upon ligand binding, and several methods are available that address this problem (Fig. 8.1). Water sites can be predicted by running molecular dynamics or Monte Carlo simulations with an explicit water model and taking the peaks in water density, or by averaging over water molecule locations [41]. These techniques have the benefit of including entropic effects in the prediction but can be very time consuming to run, especially for buried cavities, due to the long time it takes for water to permeate into the protein. Grand canonical Monte

Figure 8.1 Various water-mapping tools with their basic principle.

Carlo methods can significantly reduce the length of the simulation [42], although they can still be computationally demanding. The grid-based Monte Carlo method JAWS attempts to strike a balance between rapid solvation techniques and full molecular simulations that explicitly treat entropic effects [43]. It also has the added advantage of producing an estimate of the free energy of displacing a water molecule into the bulk solvent, although this value may not be well converged [44]. A notable integral equation theory approach, the 3D reference interaction site model (3D-RISM), has been reported to successfully predict the solvation structure within protein cavities [45] and in ligand binding sites [46]. Inhomogeneous fluid solvation theory (IFST), as popularized by Lazaridis, uses a short molecular simulation to calculate the thermodynamics of water molecules in protein binding sites [47]. A great advantage of IFST is that the free energy is broken down into its enthalpic and entropic contributions, and these values can then be used to understand the thermodynamics of ligand binding [48]. IFST also forms the basis of WaterMap [49], which calculates the binding thermodynamics of displaced water molecules and has been used to understand affinity and ligand selectivity in a number of ways [50]. Fast solvation methods have also been pursued for a number of years. A popular empirical method is GRID, which predicts favorable hydration sites by placing a small chemical probe on a grid around the protein and calculating an empirical interaction energy at every grid point [51]. The water probe is able to make up to four hydrogen bonds with the protein.
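The density-peak idea behind the simulation-based methods above can be sketched in a few lines: bin water-oxygen positions from a trajectory onto a 3-D grid and report the highest-occupancy voxel as a putative hydration site. The coordinates below are synthetic; in practice they would come from a trajectory parser, and the grid spacing is an arbitrary choice.

```python
import numpy as np

# Minimal grid-based water mapping sketch. A fake "trajectory" of 1000
# snapshots places one water oxygen oscillating around (5, 5, 5) Angstrom.
rng = np.random.default_rng(0)
coords = rng.normal(loc=[5.0, 5.0, 5.0], scale=0.3, size=(1000, 3))

edges = [np.arange(0.0, 10.5, 0.5)] * 3           # 0.5 A voxels, 0-10 A box
hist, _ = np.histogramdd(coords, bins=edges)
occupancy = hist / coords.shape[0]                # fraction of frames/voxel

peak = np.unravel_index(np.argmax(hist), hist.shape)
center = [edges[d][peak[d]] + 0.25 for d in range(3)]  # voxel midpoint
print("Predicted hydration site near", center,
      "occupancy %.2f" % occupancy[peak])
```

Entropic information is what this simple occupancy map lacks; methods such as IFST and WaterMap additionally decompose the free energy of each such site.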
A novel mean-field method reported by Setny and Zacharias places potential water sites on a lattice and iteratively solves the solvent distribution using a semi-heuristic cellular automata approach [52]. The fact that water sites form distinctive distributions around amino acids has been exploited by a number of knowledge-based methods [53]. The GRID program has proven useful both for checking water locations and for placing water molecules that appear to be missing from complexes due to crystallographic uncertainty [54]. As an example, GRID was used to predict the water molecules required to propose correct binding modes of carbohydrates in heat-labile enterotoxin. One of the early examples, AQUARIUS, predicted solvent sites within a protein by mapping each amino acid to a data set of crystal structures [55]. SuperStar is another knowledge-based method that combines structural data from the Protein Data Bank (PDB) and the Cambridge Structural Database (CSD) to predict chemical propensity maps within protein cavities [56]. Schymkowitz et al. similarly used water distributions around amino acids to predict buried water molecules [57]. The distributions were clustered and then optimized using the FoldX force field. When the water

molecules coordinated by two or more polar atoms were considered, FoldX reported a success rate of 76%. Most recently, Rossato et al. developed AcquaAlta, which derives favorable water geometries from the CSD and ab initio calculations to predict the location of water molecules that bridge polar interactions between the ligand and the protein [58]. AcquaAlta predicted 76% of crystallographic water positions in the training set and 66% in the test set. As the affinities, binding modes, and chemical diversity of a series of ligands can be greatly affected by the water molecules in a protein binding site, it is important to predict which water molecules are displaced or conserved during the binding process. Some docking procedures, although different in implementation, involve switching explicit water molecules 'on' and 'off' [59]. Other approaches have used the structural features of a water molecule's environment to predict whether it will be displaced or not, without any prior knowledge of the ligand. Using a k-nearest-neighbors genetic algorithm, Consolv reported 75% accuracy in predicting whether a binding site water molecule would be displaced [60]. The program examines the environment of each water molecule in the ligand-free structure using the crystallographic temperature factors (B-factors) of the water molecules, the number of hydrogen bonds between protein and water, and the density and hydrophilicity of neighboring protein atoms. Nevertheless, Consolv failed to predict the water molecules that are displaced by a polar group in the ligand. Consolv is incorporated in the SLIDE docking procedure to predict water-mediated interactions with the ligand [61]. However, since Consolv uses crystallographic temperature factors as structural descriptors, it cannot be applied to predicted water sites.
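A Consolv-style prediction can be sketched as a plain k-nearest-neighbors vote over per-water descriptors. The sketch below uses the descriptor types named above (B-factor, protein-water hydrogen bonds, neighbor-atom density, neighbor hydrophilicity), but every numeric value is fabricated for illustration and the distance metric is a simple normalized Euclidean one, not Consolv's tuned scheme.

```python
import numpy as np

# Toy training set: each row is (B-factor, H-bonds, neighbor density,
# neighbor hydrophilicity); labels are 1 = conserved, 0 = displaced.
# All values are invented for illustration.
train_X = np.array([
    [15.0, 3, 12, 0.8],   # tightly held water in a polar pocket
    [18.0, 2, 10, 0.7],
    [40.0, 0,  5, 0.2],   # mobile water in apolar surroundings
    [35.0, 1,  6, 0.3],
])
train_y = np.array([1, 1, 0, 0])

def knn_predict(x, X, y, k=3):
    # Normalize each descriptor to unit range so no single one dominates.
    lo, hi = X.min(axis=0), X.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    Xn, xn = (X - lo) / scale, (x - lo) / scale
    nearest = np.argsort(np.linalg.norm(Xn - xn, axis=1))[:k]
    return int(round(y[nearest].mean()))  # majority vote

query = np.array([16.0, 3, 11, 0.75])     # resembles the conserved waters
print(knn_predict(query, train_X, train_y))  # 1 = predicted conserved
```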
Amadasi and coworkers combined the HINT force field with the Rank score to classify water molecules into two broad categories: conserved/functionally displaced and sterically displaced/missing. The HINT force field is a non-Newtonian force field based on experimentally determined log Po/w values [62]. It implicitly includes entropic contributions arising from water molecules in the bulk solvent. The total HINT score for the complex interaction is given by the sum of the contributions resulting from protein-ligand, ligand-water, and protein-water interactions. Their first study correctly classified 76% of the water molecules tested, while their second study reported a classification accuracy of 87%. Their analysis included weakly bound water molecules, up to a maximum of 4 Å away from the protein. The Rank algorithm, on the other hand, calculates the geometrical quality of potential hydrogen bonds formed by each water molecule to non-water atoms in a solvated protein [63]. Amadasi et al. combined HINT and Rank analysis to characterize water molecules bound to proteins both in the presence and absence of ligands [64]. A water molecule with a high Rank and high HINT score is regarded as unlikely to make further interactions with the ligand and is largely irrelevant to the binding process, while a water molecule with moderate Rank

and a high HINT score is available for ligand-water interaction. A water molecule that is displaced by a ligand is characterized by lower Rank and HINT scores. With these 'guidelines', HINT and Rank scores were calculated for 50 water molecules bound in the active sites of 4 apo-proteins for which the holo-structure was also known. For 76% of these water molecules, the behavior in the complex was predicted correctly. The HINT method has been applied successfully in several studies. When it was used to estimate the free energy of binding in HIV-1 protease-ligand complexes, the correlation between the HINT scores and the experimentally determined binding constants showed r² = 0.63 with a standard error of ±0.95 kcal/mol. WaterScore, developed by García-Sosa et al., uses multivariate logistic regression analysis to establish a statistical correlation between the structural properties of water molecules in the binding site of a free protein crystal structure and the probability of observing the water molecules in the same location in the crystal structure of the protein-ligand complex [65]. Using the B-factor of the water molecule, the solvent-accessible surface area, the total hydrogen bond energy, and the number of protein-water contacts, it tries to distinguish between conserved and displaced water molecules. WaterScore reported 67% accuracy in classifying displaced and conserved waters, although water molecules displaced because of steric clashes with the ligand were not considered, meaning that the program cannot be used to test a design strategy in which the ligand explicitly competes with the solvent for a binding site. Barillari et al. used the computationally expensive double-decoupling method to calculate the binding free energies of 54 water molecules in protein-ligand complexes [66].
They found that water molecules that could be displaced by a ligand were, on average, less strongly bound than conserved water molecules by 2.5 kcal/mol. Another protocol, WaterDock, utilizes the freely available AutoDock Vina tool to predict the location of ordered water molecules in ligand binding sites to a very high degree of accuracy [67]; crucially, a WaterDock prediction takes only a matter of seconds to produce. WaterDock was validated against high-resolution crystal structures, neutron diffraction data, and molecular dynamics simulations. Using a validation set of proteins for which high-resolution X-ray structures had been determined at least twice, WaterDock was able to predict 88% of 'consensus' water sites with a mean error of 0.78 Å. Using 14 structures of OppA bound to lysine-X-lysine tripeptides, WaterDock predicted 97% of the ordered water molecules, with on average one false positive per structure. DOWSER predicts favorable hydration sites based on the average interaction energy during a short molecular dynamics simulation [68]. An energy cutoff of −10 kcal/mol is used to determine whether interior cavities are occupied by water or not. It was successfully applied to

identify water molecules in the internal cavity of the Staphylococcal nuclease protein prior to molecular dynamics simulations [69]. Despite the positive strides that have been made in understanding the role of ordered waters, no single method is able to answer how displaceable a water molecule is and what it is likely to be displaced by. When there is limited experimental knowledge of a binding site's solvation structure, addressing these questions becomes even less clear.
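The Rank/HINT 'guidelines' described above amount to a small decision rule, which can be sketched as follows. The numeric thresholds here are placeholders chosen for illustration, not the published cut-offs.

```python
def classify_water(rank_score, hint_score, rank_hi=4.0, hint_hi=200.0):
    """Rule-of-thumb classification of a bound water following the
    Rank/HINT guidelines in the text. Threshold values are placeholders."""
    high_rank = rank_score >= rank_hi
    high_hint = hint_score >= hint_hi
    if high_rank and high_hint:
        return "conserved (unlikely to interact with the ligand)"
    if high_hint:  # moderate Rank, high HINT
        return "available for ligand-water interaction"
    return "likely displaced by the ligand"

print(classify_water(5.2, 310.0))
print(classify_water(2.1, 250.0))
print(classify_water(1.0, 40.0))
```

Real implementations replace these hard thresholds with calibrated scores, but the three-way outcome mirrors the interpretation scheme used in the studies above.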

8.4 Strategies to identify cavity "waters"

8.4.1 Molecular docking

Docking methods are among the most commonly used virtual screening approaches in computational drug design and have been successfully applied to a number of pharmaceutical targets to predict the binding modes and affinities of potential receptor agonists and enzyme inhibitors. Currently, more than 60 docking programs and more than 30 scoring functions are available [70]. Although considerable effort has been devoted to the development of accurate and fast docking procedures, there is certainly still room for improvement. Protein flexibility, the presence of water molecules, and the evaluation of binding free energies are the main issues that have not been solved satisfactorily. For many years, explicit water molecules were not taken into account in docking studies but were simply removed, with solvation effects considered only implicitly through the empirical scoring function. If it is known (or predicted) which water molecules in the protein active site are to be conserved and which are to be replaced, these can be included in the docking experiment. It remains a problem, though, if the behavior of the water molecules differs per ligand [71]. In those cases, one attempts to apply a protocol that decides whether to keep or displace each of the important water molecules [59]. AutoDock was developed as an automated docking procedure for flexible ligands, which uses a Lamarckian genetic algorithm [72]. Structural water heterogeneity and protein mobility were incorporated by combining multiple target structures within a single grid of interaction energies [73]. The approach was tested on 21 complexes of HIV-1 protease with different inhibitors. Twenty of these inhibitors rely on the structural water for stability, and one inhibitor displaces this water, requiring the water site to be empty for proper binding.
Multiple solvation models and protein conformations were incorporated into one docking procedure by using a combined grid derived from 20 structures both with and without the crystallographic water. The disadvantage of this approach is the computational cost of generating the various grids.

FITTED was developed to address the major challenges in molecular docking, namely protein flexibility and displaceable water molecules, and, in the latest version (2.6), also ring conformational search [74]. It is based on a genetic algorithm, with a new potential energy term added to the AMBER force field that accounts for displaceable water molecules. It makes use of a switching function that scales the (intermolecular) energies involving a water molecule. FITTED improved docking results of inhibitors for a set of HIV-1 protease, thymidine kinase, trypsin, factor Xa, and MMP structures when water molecules were considered (79% accuracy) as opposed to when they were removed (67% accuracy). Also, the occurrence of water molecules was predicted with nearly 80% accuracy [75]. Although the first version of FITTED was found to be rather inefficient, later versions did increase in speed. However, in comparison to other docking methods it was found to be (too) time-consuming, and therefore not competitive. FlexX samples the conformational space of the ligand on the basis of a discrete model and uses a tree-search technique for placing the ligand incrementally into a rigid binding site [76]. The scoring function is modified from Böhm's scoring function, which includes entropic, hydrogen-bonding, ionic, aromatic, and lipophilic terms [77]. The FlexX docking procedure is unique in the way it 'predicts' potential locations of water molecules rather than relying on crystallographic positions. This algorithmic extension is presented as the particle concept [78], which places spherical objects between the ligand and the protein during the incremental docking procedure. These objects (particles) can form molecular interactions such as hydrogen bonds to the ligand as well as to the protein, and interact with both sterically and physico-chemically.
A single water molecule is modeled as a particle with the radius of an oxygen atom, able to form four hydrogen bonds, but particles can also be used to integrate small molecules, single atoms, or metal atoms in the protein-ligand interface. The FlexX docking method using the particle concept was tested on a set of 200 known protein-ligand complexes taken from the Protein Data Bank. The average improvement of docking accuracy using the particle concept was small, and water locations matching those observed in the crystal structures were predicted in only 35% of the cases. However, the docking result is drastically improved in specific test cases such as HIV-1 protease, where water 301 is known to play a critical role. GLIDE (Grid-based ligand docking with energetics) approximates a complete systematic search of the conformational, orientational, and positional space of the docked ligand [79]. It uses a series of hierarchical filters, a screening funnel, to search for possible locations of the ligand in the active site. To include solvation effects, Glide docks explicit water molecules (determined from protein crystal structures) into the binding site for each energetically competitive ligand pose and employs empirical terms in the GlideScore scoring function that measure the exposure of various functional groups to the explicit waters.
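The geometric core of a particle-style bridging-water check, as used in the FlexX particle concept described above, can be sketched with plain distances: a water "particle" is a plausible bridge if it lies within hydrogen-bonding range of a polar atom on each side. The 2.6-3.5 Å window is a typical heavy-atom O···O/O···N range assumed here for illustration; FlexX additionally scores directionality and sterics.

```python
import math

def particle_bridges(water, protein_atom, ligand_atom, lo=2.6, hi=3.5):
    """True if an oxygen-sized water 'particle' sits within hydrogen-bonding
    distance (heavy-atom distance window, in Angstrom) of both a protein
    and a ligand polar atom. Distance window is an assumed typical range."""
    d_prot = math.dist(water, protein_atom)
    d_lig = math.dist(water, ligand_atom)
    return lo <= d_prot <= hi and lo <= d_lig <= hi

w = (0.0, 0.0, 0.0)
print(particle_bridges(w, (2.9, 0.0, 0.0), (-2.8, 0.5, 0.0)))  # True
print(particle_bridges(w, (5.0, 0.0, 0.0), (-2.8, 0.5, 0.0)))  # False
```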

GOLD is an automated ligand docking program that uses a genetic search algorithm to explore the full range of ligand conformational flexibility with partial flexibility of the protein, and satisfies the fundamental requirement that the ligand must displace loosely bound water on binding [80]. In order to predict whether a specific water molecule should be bound or displaced, GOLD estimates the free-energy change associated with the transfer of a water molecule from bulk solvent to its binding site in a protein-ligand complex. For a water molecule to be bound to a protein-ligand complex, its intrinsic binding affinity needs to outweigh the loss of rigid-body entropy on binding, parameterized as a constant penalty. In addition to being switched 'on' or 'off', water molecules can be allowed to rotate. Since allowing water molecules to toggle increases the complexity of the docking procedure, there is a limit to the number of water molecules that can be included; ideally, a maximum of three water molecules is advised. SLIDE (Screening for Ligands by Induced-Fit Docking) is a screening tool that looks for chemical and geometrical similarity between the ligand and a binding site template. First, the binding site of the protein is described by a template of favorable interaction points above its surface, onto which ligand atoms are mapped during the search. An empirical scoring function (SlideScore) is used to efficiently divide large databases into feasible and infeasible compounds. The most promising ligand candidates are docked, and a ranked list of some 100 potential ligands for a given protein target is produced. SLIDE is able to reduce large compound databases of more than 175,000 organic compounds to a ranked list of approximately 100 docked potential ligands within an hour to a day.
The positions of water molecules in the binding site of the protein crystal structure are analyzed, and their conservation is predicted by Consolv. SLIDE shifts water molecules when they collide with protein or ligand atoms, thereby addressing the need to optimize water locations upon ligand docking; however, the program does not consider the energetics associated with these moves. If the collision of a water molecule cannot be resolved by relocation, the water molecule is considered displaced, and a penalty is added to the final score if a lost hydrogen bond is not replaced by a corresponding protein-ligand interaction.
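The water on/off logic that GOLD uses, as described above, reduces to a simple comparison: keep a water 'on' only when its interaction reward outweighs a constant rigid-body entropy penalty. The sketch below illustrates that bookkeeping; the penalty value and score units are placeholders, not GOLD's fitted parameters.

```python
# Sketch of GOLD-style water toggling: the water state that yields the
# better total score is selected. The penalty is a placeholder constant
# standing in for the parameterized loss of rigid-body entropy.
ENTROPY_PENALTY = 3.0  # arbitrary score units, illustrative only

def score_with_water_toggle(base_score, water_interaction):
    """Return (best_score, water_state) for a single toggleable water."""
    on_score = base_score + water_interaction - ENTROPY_PENALTY
    off_score = base_score  # water displaced to bulk: no reward, no penalty
    if on_score > off_score:
        return on_score, "on"
    return off_score, "off"

print(score_with_water_toggle(50.0, 5.5))  # strong interaction -> 'on'
print(score_with_water_toggle(50.0, 1.0))  # penalty dominates  -> 'off'
```

With several toggleable waters the number of on/off combinations grows exponentially, which is why a maximum of about three waters is advised in practice.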

8.4.2 Molecular dynamics

Internal water molecules may make an important contribution to the protein-ligand interaction. High-resolution crystallographic studies have contributed a detailed understanding of the functional roles of internal water molecules. However, the amount of structural information on water molecules is limited: only water molecules that are highly ordered and bound at specific sites are observed in crystal structures. Structurally disordered water molecules are

invisible to X-ray diffraction methods, yet these might also be of importance to the protein-ligand complex. MD simulations may be used to detect and study the structure and dynamics of internal water molecules. The number of water molecules observed in MD simulations can be larger than the number reported in crystal structures. Damjanovic et al. performed MD simulations of 10 ns on the Staphylococcal nuclease (SN) protein. The locations of internal water molecules were determined from the trajectories and compared with the locations of water molecules observed in 100 available crystal structures. Three of the internal water molecules found by MD simulations correspond to water molecules observed in the crystal structure, and all have a residence time > 4 ns. Two of these water molecules are bound in a protein loop and have the longest residence times (7.6 and 10 ns); these hydration sites remain occupied during the entire length of the simulation. The third water molecule is buried inside the hydrophobic core and has a residence time of 4.4 ns. Furthermore, the simulations reveal the presence of water molecules situated in a hydrophobic region in the interior cavity at locations where no water molecules have been observed crystallographically [69]. The results appeared to be independent of the force field used (CHARMM or AMBER). In another study, the hydration of the binding pocket of a fatty acid-binding protein (FABP) was investigated [81]. MD simulations were carried out for rat FABP, in apo-form and with bound palmitate (holo-form). FABP in apo-form contains a large interior cavity filled with water molecules. Crystal structures of apo-FABP display 30 water sites, including seven to nine well-ordered water molecules, a number confirmed by MD simulations. This large water content of the internal cavity implies its relevance for understanding protein-ligand interactions.
Also, solvent exchange pathways between the FABP interior and exterior were studied. From MD simulations, it was found that holo-FABP shows a relatively constant count of 21-22 internal water molecules. Hence, it can be concluded that fatty acid binding displaces about eight water molecules; this number would not be available from crystal structure determination, which can only identify well-ordered water molecules. Furthermore, it was shown that solvent inside apo-FABP showed the characteristics of a water droplet, while solvent in holo-FABP benefits from interactions with the ligand head group and forms slightly stronger interactions with protein residues. Another interesting finding was that the rate of water exchange between the interior cavity and the bulk was much higher in holo-FABP than in apo-FABP, which is counter-intuitive given the smaller number of water molecules inside the binding cavity. The authors speculated that a favorable exchange pathway is opened when a ligand is bound.
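Occupancies and residence times like those quoted above can be extracted from a trajectory by tracking, frame by frame, whether a water oxygen sits within a cutoff of the hydration site. A minimal sketch of the bookkeeping, with made-up distances standing in for real MD output:

```python
def residence_stats(distances, cutoff=1.2, dt=0.01):
    """Occupancy and longest continuous residence time of a hydration site.

    distances: per-frame distance (Å) of the nearest water oxygen to the
    site centre; cutoff: site radius (Å); dt: time per frame (ns).
    A sketch of the bookkeeping only; real analyses read these distances
    from an MD trajectory.
    """
    occupied = [d <= cutoff for d in distances]
    occupancy = sum(occupied) / len(occupied)
    longest = run = 0
    for inside in occupied:
        run = run + 1 if inside else 0      # extend or reset the current stay
        longest = max(longest, run)
    return occupancy, longest * dt

# illustrative distances: the site empties once, giving a 3-frame longest stay
occ, t_res = residence_stats([0.4, 0.6, 1.5, 0.8, 0.9, 0.7])
```

A site whose occupancy stays near 1.0 for the whole trajectory corresponds to the fully occupied hydration sites reported for the SN loop waters.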

8.4.3 Free energy calculations

As the driving force of all molecular processes, the free energy is the key property to calculate when describing ligand binding to a protein. As several factors contribute to the

free energy of binding, its accurate calculation is complex; still, various methods have been developed to address these factors [82]. The double decoupling method uses molecular dynamics simulations to calculate, in a theoretically sound way, the free energy of tying up water molecules in the binding site of protein complexes [83]. A number of studies have been published in which free energy calculations are used to calculate the contribution of water molecules to binding free energies [84]. A slightly older but extensive example is the Cytochrome P450cam (CYP450cam) enzyme. Its substrate binds in a buried active site, displacing partially ordered solvent. Although six water sites have been assigned crystallographically to the substrate-free cavity, they are partially disordered. Moreover, the volume of the entire cavity of CYP450cam is around 300 Å3, suggesting that it could accommodate up to ten water molecules. However, the surroundings of the active site are mainly hydrophobic. Using MD simulations with thermodynamic integration for cavities containing five to eight water molecules, it was confirmed that six water molecules are thermodynamically most favorable [85]. Free energy calculations were also used to quantify the favorable contribution of a water molecule mediating the interaction between CYP450cam and 2-phenyl-imidazole at -11.6 kJ/mol. CYP450cam in complex with camphor offers space for a water molecule hydrogen bonding with the camphor molecule. In agreement with the X-ray structure, in which this water was not observed, it was calculated to contribute unfavorably with +15.8 kJ/mol. These findings were combined to calculate the absolute binding affinity of camphor to CYP450cam by exchanging the substrate against six water molecules, using the free energy perturbation (FEP) method [86].
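Thermodynamic integration of the kind used in the CYP450cam study estimates a free-energy difference by averaging dU/dλ in a series of coupling windows and integrating over λ from 0 to 1. A sketch of the final integration step only, with illustrative (not simulated) window averages:

```python
def thermodynamic_integration(lambdas, du_dlambda):
    """Trapezoid-rule integration of <dU/dlambda> over the coupling path.

    lambdas: coupling-parameter windows from 0 to 1; du_dlambda: the
    ensemble average of dU/dlambda in each window. The numbers used
    below are illustrative, not real simulation output.
    """
    dG = 0.0
    for i in range(len(lambdas) - 1):
        width = lambdas[i + 1] - lambdas[i]
        dG += 0.5 * (du_dlambda[i] + du_dlambda[i + 1]) * width
    return dG

# five windows with made-up averages (kJ/mol per unit lambda)
dG = thermodynamic_integration([0.0, 0.25, 0.5, 0.75, 1.0],
                               [-40.0, -25.0, -12.0, -4.0, 0.0])
```

In practice each window average comes from an equilibrated MD simulation, and convergence is judged by adding windows until the integral stabilizes.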
Using the double decoupling method, Hamelberg and McCammon showed that the standard free energy associated with localization of a water molecule in the binding site is -1.9 ± 0.5 kcal/mol for the trypsin/benzamidine complex and -3.1 ± 0.6 kcal/mol for the HIV-1 protease/KNI-272 complex, respectively. In both cases the localized water molecules stabilize the protein-ligand complex, most strongly so for the HIV-1 protease. All four waters observed by X-ray diffraction and NMR spectroscopy in HIV-1 protease were examined using the same method by Lu et al. [87]. They confirmed that waters 301 and 607 contribute favorably to the complexes, by -1.9 ± 0.4 kcal/mol and -1.1 ± 0.3 kcal/mol, respectively. On the other hand, waters 566 and 608 seem to contribute only minimally, with 0.4 ± 0.5 kcal/mol and -0.3 ± 0.3 kcal/mol, respectively. This indicates that not all structural waters contribute equally to the stabilization and binding of ligands. The results were also shown to depend on the protonation state of Asp25 and Asp125, with variations as large as 1.3 kcal/mol for water 301. Using a similar approach, the binding free energies of 54 water molecules in six proteins (HIV-1 protease, neuraminidase, trypsin, factor Xa, scytalone dehydratase, OppA) were calculated [66]. The 54 water molecules were divided into 18 conserved water molecules and 36 displaceable ones.


Figure 8.2 Strategies used for water mapping.

The average binding free energy of conserved water molecules was calculated to be -6.2 kcal/mol, and that of water molecules displaced by ligands -3.7 kcal/mol. Tightly bound water molecules are generally located in highly polar cavities and make three to four hydrogen bonds with the protein and the ligand. Loosely bound water molecules, on the other hand, are generally located in partially apolar cavities and are involved in fewer than three hydrogen bonds with the protein and the ligand. Bayesian statistics were subsequently applied to calculate the probability that a water molecule is displaced, given its binding free energy. Based on these predictions, the design strategy may be modified so as to maximize interactions with conserved water molecules and target those that may be displaced (Fig. 8.2).
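The Bayesian step can be sketched as follows, taking the class-average free energies quoted above (-3.7 kcal/mol for displaced, -6.2 kcal/mol for conserved waters) and the 36/18 class split as priors. The Gaussian likelihoods and the width `sigma` are illustrative assumptions, not the published model:

```python
import math

def p_displaced(dG, mu_disp=-3.7, mu_cons=-6.2, sigma=1.5, prior_disp=36 / 54):
    """Posterior probability that a water is displaceable, given its binding
    free energy dG (kcal/mol), via Bayes' rule with Gaussian likelihoods.

    The class means and priors follow the averages quoted in the text;
    sigma and the normal-likelihood form are illustrative assumptions.
    """
    def gauss(x, mu):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) \
            / (sigma * math.sqrt(2 * math.pi))

    num = gauss(dG, mu_disp) * prior_disp
    den = num + gauss(dG, mu_cons) * (1 - prior_disp)
    return num / den

p_weak = p_displaced(-2.0)   # weakly bound water: likely displaceable
p_tight = p_displaced(-7.0)  # tightly bound water: likely conserved
```

A design strategy could then target waters with a high posterior for displacement while preserving interactions with the rest.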

8.5 Loopholes and limitations

The important roles of water molecules in the binding pockets of proteins have been addressed over several decades of experimental and theoretical work [88]. Despite the fact that water molecules play an essential role in protein-ligand recognition and interaction, only a few studies have shown that current methodologies for including active site water molecules improve the accuracy of automated docking procedures. An extensive study by Roberts and Mancera on 240 protein-ligand complexes that contain water molecules in the binding site confirmed the importance of including water molecules in docking studies. In 2002, Nissink et al. observed a small effect of leaving out water molecules in complexes where they mediate protein-ligand interactions [89]. They compared the quality of the docking poses of a set of 55 protein complexes in which no water is present in the active

site with those of a set of 40 protein complexes in which a water molecule is observed experimentally but not included in the GOLD docking experiment. Higher success rates (73% docked within 1.5 Å) were observed for the set without waters than for the set with waters (61%), but the effect is actually rather small. The role of water when using several docking programs and scoring functions was examined for 19 Cytochrome P450 and 19 thymidine kinase complexes. Ligands were docked including (i) no water, (ii) crystallographically observed water, or (iii) water molecules predicted by a GRID-based protocol. Taking into consideration any kind of water molecule increased the performance of all three tested docking programs, GOLD, FlexX and Autodock. As the results with predicted water molecules seemed quite satisfactory, the study was later extended to a homology model of Cytochrome P450 2D6, which was used to predict sites of metabolism in 65 substrates in a virtual screening experiment [90]. Again, the inclusion of predicted (static) water molecules seemed to improve the results. Later, the X-ray structure of the protein became available, which did not contain active site water molecules [91]. Rather, similar success rates in the docking poses could be obtained by using multiple water-free conformations of the very malleable active site [92]. Very recently, it has been shown that the inclusion of water molecules determined from MD simulations does have an effect on the docking results of individual substrates, but no overall improvement could be observed [93]. Most of the work mentioned above evaluates the effect of water molecules on systems in which the 'correct' poses are known.
Even though no final conclusion can be drawn yet as to the need to include water molecules, or the way in which this is best done, water molecules have already been described to play important roles in predictive virtual screening studies [94]. Molecular dynamics simulations have proven useful for studying protein hydration and provide microscopic details. While docking can be considered a high-throughput virtual screening procedure, MD simulations are still a low-throughput approach. MD simulations are often performed in explicit water solvent, making it straightforward to include structural active site water molecules. The ability of MD simulations to determine the most likely sites of water binding in internal pockets and cavities depends on their efficiency in sampling all the hydration possibilities of internal protein sites. This can be enhanced significantly by performing multiple MD simulations, as well as simulations started from different initial hydration states [95]. Statistical-mechanically sound evaluations of the binding affinity of water molecules to such sites are informative, but come at an even larger computational cost.

8.6 Conclusion

Water molecules play multiple roles in the life of organisms, which can be associated with water's unusual and unique properties: its small size, the dipolar nature caused by its charge

distribution, the capacity to act both as a hydrogen bond donor and acceptor, and the entropic gain associated with release to bulk solvent when bound to proteins and ligands. Although water molecules are small and consist of only two atom types, and although the association of a ligand with a protein in an aqueous medium is described by a 'simple' process of molecular association, water molecules are difficult to determine and model. We have examined the role of water molecules in protein-ligand interactions and the relevance of including them in computational drug design. Water molecules can be present in the protein active site, and may subsequently be displaced or conserved upon ligand binding. Conserved waters are trapped in the binding site and may stabilize the complex and mediate protein-ligand interactions through the formation of hydrogen bond networks. Water molecules that are displaced by an incoming ligand are released to bulk solvent, thereby adding a favorable entropic contribution to the binding free energy of the ligand. We have discussed the details and difficulties of including water molecules in computational structure-based drug design methods, such as docking procedures and molecular dynamics simulations. Despite the evidence of its role as a moderator and mediator of protein-ligand interactions in a manner that can increase binding affinity and selectivity, this role remains difficult to predict and interpret. Therefore, the versatility of water molecules has probably not been exploited to its fullest extent in structure-based drug design.

Exercise: To perform an analysis of specific water sites using the WATCLUST plugin.

Note: To start using WATCLUST, you first have to perform a molecular dynamics simulation of your system in an explicit solvent [96]. Then, you have to load the trajectory snapshots into VMD.
The algorithm aligns each snapshot structure to a reference and finds the selected regions of space where water sites are located. You may choose which atoms/residues should be included in the alignment and the region of space to be sampled. For thermodynamic calculations, the program uses the NAMD software. You can download and install NAMD by following its instructions and extracting the namd2 file; you will then just need to copy the executable file namd2 to the /usr/bin directory.
Requirements:
1. Operating system: Linux
2. Software: VMD, NAMD and the WATCLUST plugin [Note: please check the version of your operating system and download accordingly]
3. Binary files: pdb file for the complex
Installing WATCLUST:
1. Download the compressed file from DOWNLOADS
2. Untar the file in a directory of your choice:

3. Install the plugin. (The plugin should now be available in the "VMD Main" menu at "Extensions" > "Analysis" > "WATCLUST")
Using WATCLUST:
To access the plugin, from the "VMD Main" menu go to: "Extensions" > "Analysis" > "WATCLUST". The menu is divided into several sub-menus: "Selections", "Reference" and "Settings".
Selections menu:
Use molecule: Choose the molecule ID containing the explicit solvent MD trajectory.
Selection: Select the atoms/residues that define the site of interest where the clustering analysis will be made. You can write the atoms/residues in the usual VMD way and select the molecule ID containing the explicit solvent MD trajectory. Alternatively, you may use an Autodock grid map to perform the selection of the atoms lying inside the box. To do so, load the map in the VMD Main window, choose its molecule ID and click on the "Grid as selection" button.
Frames: Choose the frames from the trajectory to be considered in the analysis, starting from the "First" frame until the frame "Last" is reached, selecting every "Step" frame.
Reference menu:
To obtain a reference for the alignment of MD snapshots, you may select a given frame from the trajectory or use a structure of the protein loaded with another molecule ID.
Frame: Choose a frame from the loaded trajectory to use as reference in the alignment of the rest of the frames.
Use molecule: Choose a loaded molecule to use as a reference to align all the selected frames from the loaded trajectory.
Settings menu:
This section allows changing the parameters of the clustering algorithm and the presentation of the results.
Watnumbermin: Minimum number of water molecules a cluster should host, considering all selected frames, in order to define an actual water site.
Dist: Maximum distance, in Å, between contiguous water molecules from different snapshots in order to be part of the same water site.
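Conceptually, the clustering controlled by "Dist" and "Watnumbermin" pools the water-oxygen positions from all aligned frames and groups those falling within the distance threshold of a growing cluster centre, keeping only clusters populated by at least the minimum number of waters. A simplified greedy sketch of that idea (not the plugin's actual code):

```python
def cluster_water_sites(oxygens, dist=0.6, watnumbermin=3):
    """Greedy distance clustering of water-oxygen positions pooled over
    aligned MD frames.

    A simplified sketch of the clustering that the 'Dist' and
    'Watnumbermin' settings control; it is NOT the plugin's actual code.
    oxygens: list of (x, y, z) tuples in Å; returns a list of
    (centre, population) pairs for the surviving water sites.
    """
    sites = []  # running clusters stored as [sum_x, sum_y, sum_z, count]
    for p in oxygens:
        for s in sites:
            centre = (s[0] / s[3], s[1] / s[3], s[2] / s[3])
            if sum((a - b) ** 2 for a, b in zip(p, centre)) <= dist ** 2:
                # water joins the first site whose centre lies within 'dist'
                for i in range(3):
                    s[i] += p[i]
                s[3] += 1
                break
        else:
            sites.append([p[0], p[1], p[2], 1])  # start a new candidate site
    # keep only sites populated by at least 'watnumbermin' waters
    return [((s[0] / s[3], s[1] / s[3], s[2] / s[3]), s[3])
            for s in sites if s[3] >= watnumbermin]

# four oxygens clustered near the origin survive; the lone outlier is dropped
sites = cluster_water_sites([(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
                             (0.05, 0.05, 0), (5, 5, 5)])
```

The population of each surviving site, divided by the number of frames, corresponds to the water finding probability (WFP) reported by the plugin.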
Once you have defined every adjustable parameter, click the “CALCULATE WATER SITES” button to perform the calculation.

Once the calculation has finished, click the "COLORING WATER SITES" button to represent the different water sites found by the program as VDW balls. With the "Coloring parameter" set to WFP, the different colors of the water sites represent variations in their relative water finding probability: red for low, white for intermediate and blue for high WFP values. The "Clusters" box shows the different water sites found by the program together with some statistical properties. All this information is saved in a text file called watclust.dat.
Thermodynamics calculations:
IMPORTANT: You need NAMD to perform these calculations. This allows the calculation of thermodynamic properties of the water sites (ΔE and ΔS of solvation, according to the selection of energy, entropy or both). Click the CALCULATE button to perform the calculations. You will find the results in the energy.csv, entropy.csv and thermodynamics.dat files. Once the water site determination is done, the program generates a log file with the parameters set by the user for the run, and a folder containing several files with water site properties. You can find the log file and the other files in the destination shown in the "Current Log File" box.
Generated files:
The program generates a folder named WSDATA.# containing files with different properties of the water sites. The number # is randomly generated and can be found in the "Current Log File" box (the name of the log file and the folder are the same).
watcent.pdb: A file in PDB format containing the coordinates of the centre of mass of all the water sites. You can superimpose this file on the reference frame/structure to visualize the different water sites calculated.
watclust#.pdb: A set of files in PDB format, one for each water site. Each file contains the coordinates of the water oxygens of all the water molecules that define water site number #.
watclust.dat: A text file summarizing the statistical properties of all the water sites shown in the "Clusters" box.
watr#.dat: A set of text files, one for each water site. Each file contains the data for the WFP(r) and g(r) graphics for water site number #. The increment in the distance used to calculate the g(r) and WFP(r) values is set by the user in the "dr" parameter.
r = distance from the centre of the water site, in Å.
g = radial distribution function values.
WFP = water finding probability values.
energy.csv: A text file summarizing the electrostatic and vdW energies for all the water sites. Energies are decomposed against protein and solvent. Standard errors are included.

entropy.csv: A text file summarizing the rotational and translational entropies for all the water sites.
thermodynamics.dat: A text file summarizing the thermodynamic properties for all the water sites.
If you want to load a previous run, click "File" > "Load Log File..." and choose the log file for that run.
"Save Map As..." button: This function modifies an Autodock grid map with the information from the water sites.

References

[1] P. Ball, Water as an active constituent in cell biology, Chem. Rev. 108 (2008) 74-108.
[2] P. Cozzini, M. Fornabaio, A. Marabotti, D.J. Abraham, G.E. Kellogg, A. Mozzarelli, Free energy of ligand binding to protein: evaluation of the contribution of water molecules by computational methods, Curr. Med. Chem. 11 (2004) 3093-3118.
[3] S. Sleigh, P. Seavers, A. Wilkinson, J. Ladbury, J. Tame, Crystallographic and calorimetric analysis of peptide binding to OppA protein, J. Mol. Biol. 291 (1999) 393-415.
[4] Y. Lu, R. Wang, C.-Y. Yang, S. Wang, Analysis of ligand-bound water molecules in high-resolution crystal structures of protein-ligand complexes, J. Chem. Inf. Model. 47 (2007) 668-675.
[5] C. Clarke, R.J. Woods, J. Gluska, A. Cooper, M.A. Nutley, G.-J. Boons, Involvement of water in carbohydrate-protein binding, J. Am. Chem. Soc. 123 (2001) 12238-12247.
[6] S.E. Wong, F.C. Lightstone, Accounting for water molecules in drug design, Expert. Opin. Drug. Discov. 6 (2011) 65-74.
[7] M.O. Taha, M. Habash, Z. Al-Hadidi, A. Al-Bakri, K. Younis, S. Sisan, Docking-based comparative intermolecular contacts analysis as new 3-D QSAR concept for validating docking studies and in silico screening: NMT and GP inhibitors as case studies, J. Chem. Inf. Model. 51 (2011) 647-669.
[8] H.G. Wallnoefer, S. Handschuh, K.R. Liedl, T. Fox, Stabilizing of a globular protein by a highly complex water network: a molecular dynamics simulation study on factor Xa, J. Phys. Chem. B 114 (2010) 7405-7412.
[9] J. Luccarelli, J. Michel, J. Tirado-Rives, W.L. Jorgensen, Effects of water placement on predictions of binding affinities for p38α MAP kinase inhibitors, J. Chem. Theory Comput. 6 (2010) 3850-3856.
[10] H.G. Wallnoefer, K.R. Liedl, T. Fox, A challenging system: free energy prediction for factor Xa, J. Comput. Chem. 32 (2011) 1743-1752.
[11] R. Thilagavathi, R.L. Mancera, Ligand-protein cross-docking with water molecules, J. Chem. Inf. Model. 50 (2010) 415-421.
[12] D.
Bellocchi, A. Macchiarulo, G. Costantino, R. Pellicciari, Docking studies on PARP-1 inhibitors: insights into the role of a binding pocket water molecule, Bioorg. Med. Chem. 13 (2005) 1151-1157.
[13] J.M. Chen, S.L. Xu, Z. Wawrzak, G.S. Basarab, D.B. Jordan, Structure-based design of potent inhibitors of scytalone dehydratase: displacement of a water molecule from the active site, Biochemistry 37 (1998) 17735-17744.
[14] R. Kadirvelraj, B.L. Foley, J.D. Dyekjær, R.J. Woods, Involvement of water in carbohydrate-protein binding: concanavalin A revisited, J. Am. Chem. Soc. 130 (2008) 16933-16942.
[15] V. Mikol, C. Papageorgiou, X. Borer, The role of water molecules in the structure-based design of (5-hydroxynorvaline)-2-cyclosporin: synthesis, biological activity, and crystallographic analysis with cyclophilin A, J. Med. Chem. 38 (1995) 3361-3367.
[16] A.T. García-Sosa, R.L. Mancera, Free energy calculations of mutations involving a tightly bound water molecule and ligand substitutions in a ligand-protein complex, Mol. Inform. 29 (2010) 589-600.

[17] D.G. Lloyd, A.T. García-Sosa, I.L. Alberts, N.P. Todorov, R.L. Mancera, The effect of tightly bound water molecules on the structural interpretation of ligand-derived pharmacophore models, J. Comput. Mol. Des. 18 (2004) 89-100.
[18] A.T. García-Sosa, R.L. Mancera, The effect of a tightly bound water molecule on scaffold diversity in the computer-aided de novo ligand design of CDK2 inhibitors, J. Mol. Model. 12 (2006) 422-431.
[19] A.M. Davis, S.J. Teague, G.J. Kleywegt, Application and limitations of X-ray crystallographic data in structure-guided ligand and drug design, Comput. Struct. Approach. Drug. Discov.: Ligand-Protein Interact. (2007) 73.
[20] Z. Li, T. Lazaridis, Water at biomolecular binding interfaces, Phys. Chem. Chem. Phys. 9 (2007) 573-581.
[21] T. Tsukihara, K. Shimokata, Y. Katayama, H. Shimada, K. Muramoto, H. Aoyama, et al., The low-spin heme of cytochrome c oxidase as the driving element of the proton-pumping process, Proc. Natl Acad. Sci. 100 (2003) 15304-15309.
[22] M. Tashiro, A.A. Stuchebrukhov, Thermodynamic properties of internal water molecules in the hydrophobic cavity around the catalytic center of cytochrome c oxidase, J. Phys. Chem. B 109 (2005) 1015-1022.
[23] Y.K. Yee, A.L. Tebbe, J.H. Linebarger, D.W. Beight, T.J. Craft, D. Gifford-Moore, et al., N2-Aroylanthranilamide inhibitors of human factor Xa, J. Med. Chem. 43 (2000) 873-882.
[24] W.L. Jorgensen, Efficient drug lead discovery and optimization, Acc. Chem. Res. 42 (2009) 724-733.
[25] J. Boström, A. Hogner, S. Schmitt, Do structurally similar ligands bind in a similar fashion? J. Med. Chem. 49 (2006) 6716-6725.
[26] A. Wlodawer, J. Vondrasek, Inhibitors of HIV-1 protease: a major success of structure-assisted drug design, Annu. Rev. Biophys. Biomol. Struct. 27 (1998) 249-284.
[27] B. Pillai, K. Kannan, M. Hosur, 1.9 Å x-ray study shows closed flap conformation in crystals of tethered HIV-1 PR, Proteins: Struct., Funct., Bioinf. 43 (2001) 57-64.
[28] Y.-X.
Wang, D.I. Freedberg, P.T. Wingfield, S.J. Stahl, J.D. Kaufman, Y. Kiso, et al., Bound water molecules at the interface between the HIV-1 protease and a potent inhibitor, KNI-272, determined by NMR, J. Am. Chem. Soc. 118 (1996) 12287-12290.
[29] M. Fornabaio, F. Spyrakis, A. Mozzarelli, P. Cozzini, D.J. Abraham, G.E. Kellogg, Simple, intuitive calculations of free energy of binding for protein-ligand complexes. 3. The free energy contribution of structural water molecules in HIV-1 protease complexes, J. Med. Chem. 47 (2004) 4507-4516.
[30] S. Kageyama, T. Mimoto, Y. Murakawa, M. Nomizu, H. Ford, T. Shirasaka, et al., In vitro anti-human immunodeficiency virus (HIV) activities of transition state mimetic HIV protease inhibitors containing allophenylnorstatine, Antimicrob. Agents Chemother. 37 (1993) 810-817.
[31] E.T. Baldwin, T.N. Bhat, S. Gulnik, B. Liu, I.A. Topol, Y. Kiso, et al., Structure of HIV-1 protease with KNI-272, a tight-binding transition-state analog containing allophenylnorstatine, Structure 3 (1995) 581-590.
[32] S. Grzesiek, A. Bax, L.K. Nicholson, T. Yamazaki, P. Wingfield, S.J. Stahl, et al., NMR evidence for the displacement of a conserved interior water molecule in HIV protease by a non-peptide cyclic urea-based inhibitor, J. Am. Chem. Soc. 116 (1994) 1581-1582.
[33] D.L. Mobley, K.A. Dill, Binding of small-molecule ligands to proteins: "what you see" is not always "what you get", Structure 17 (2009) 489-498.
[34] S. Leavitt, E. Freire, Direct measurement of protein binding energetics by isothermal titration calorimetry, Curr. Opin. Struct. Biol. 11 (2001) 560-566.
[35] H.A. Carlson, Protein flexibility and drug design: how to hit a moving target, Curr. Opin. Chem. Biol. 6 (2002) 447-452.
[36] V. Helms, Protein dynamics tightly connected to the dynamics of surrounding and internal water molecules, ChemPhysChem 8 (2007) 23-33.
[37] J.D. Dunitz, The entropic cost of bound water in crystals and biomolecules, Science 264 (1994) 670-671.
[38] Z.
Li, T. Lazaridis, Thermodynamic contributions of the ordered water molecule in HIV-1 protease, J. Am. Chem. Soc. 125 (2003) 6636-6637.

[39] Z. Li, T. Lazaridis, The effect of water displacement on binding thermodynamics: concanavalin A, J. Phys. Chem. B 109 (2005) 662-670.
[40] H. Wang, A. Ben-Naim, A possible involvement of solvent-induced interactions in drug design, J. Med. Chem. 39 (1996) 1531-1539.
[41] R.H. Henchman, J.A. McCammon, Extracting hydration sites around proteins from explicit water simulations, J. Comput. Chem. 23 (2002) 861-869.
[42] H. Resat, M. Mezei, Grand canonical ensemble Monte Carlo simulation of the dCpG/proflavine crystal hydrate, Biophys. J. 71 (1996) 1179-1190.
[43] J. Michel, J. Tirado-Rives, W.L. Jorgensen, Energetics of displacing water molecules from protein binding sites: consequences for ligand optimization, J. Am. Chem. Soc. 131 (2009) 15403-15411.
[44] J. Michel, J.W. Essex, Prediction of protein ligand binding affinity by free energy simulations: assumptions, pitfalls and expectations, J. Comput. Mol. Des. 24 (2010) 639-658.
[45] T. Imai, R. Hiraoka, A. Kovalenko, F. Hirata, Locating missing water molecules in protein cavities by the three-dimensional reference interaction site model theory of molecular solvation, Proteins: Struct., Funct., Bioinf. 66 (2007) 804-813.
[46] T. Imai, K. Oda, A. Kovalenko, F. Hirata, A. Kidera, Ligand mapping on protein surfaces by the 3D-RISM theory: toward computational fragment-based drug design, J. Am. Chem. Soc. 131 (2009) 12430-12440.
[47] T. Lazaridis, Inhomogeneous fluid approach to solvation thermodynamics. 2. Applications to simple fluids, J. Phys. Chem. B 102 (1998) 3542-3550.
[48] Z. Li, T. Lazaridis, Thermodynamics of buried water clusters at a protein-ligand binding interface, J. Phys. Chem. B 110 (2006) 1464-1475.
[49] R. Abel, T. Young, R. Farid, B.J. Berne, R.A. Friesner, Role of the active-site solvent in the thermodynamics of factor Xa ligand binding, J. Am. Chem. Soc. 130 (2008) 2817-2831.
[50] D.D. Robinson, W. Sherman, R.
Farid, Understanding kinase selectivity through energetic analysis of binding site waters, ChemMedChem 5 (2010) 618-627.
[51] P.J. Goodford, A computational procedure for determining energetically favorable binding sites on biologically important macromolecules, J. Med. Chem. 28 (1985) 849-857.
[52] P. Setny, M. Zacharias, Hydration in discrete water. A mean field, cellular automata based approach to calculating hydration free energies, J. Phys. Chem. B 114 (2010) 8667-8675.
[53] N. Thanki, J. Thornton, J. Goodfellow, Distributions of water around amino acid residues in proteins, J. Mol. Biol. 202 (1988) 637-657.
[54] R.C. Wade, K.J. Clark, P.J. Goodford, Further development of hydrogen bond functions for use in determining energetically favorable binding sites on molecules of known structure. 1. Ligand probe groups with the ability to form two hydrogen bonds, J. Med. Chem. 36 (1993) 140-147.
[55] W.R. Pitt, J.M. Goodfellow, Modelling of solvent positions around polar groups in proteins, Protein Eng., Des. Sel. 4 (1991) 531-537.
[56] M.L. Verdonk, J.C. Cole, R. Taylor, SuperStar: a knowledge-based approach for identifying interaction sites in proteins, J. Mol. Biol. 289 (1999) 1093-1108.
[57] J.W. Schymkowitz, F. Rousseau, I.C. Martins, J. Ferkinghoff-Borg, F. Stricher, L. Serrano, Prediction of water and metal binding sites and their affinities by using the Fold-X force field, Proc. Natl Acad. Sci. 102 (2005) 10147-10152.
[58] G. Rossato, B. Ernst, A. Vedani, M. Smiesko, AcquaAlta: a directional approach to the solvation of ligand-protein complexes, J. Chem. Inf. Model. 51 (2011) 1867-1881.
[59] N. Huang, B.K. Shoichet, Exploiting ordered waters in molecular docking, J. Med. Chem. 51 (2008) 4862-4865.
[60] M.L. Raymer, P.C. Sanschagrin, W.F. Punch, S. Venkataraman, E.D. Goodman, L.A. Kuhn, Predicting conserved water-mediated and polar ligand interactions in proteins using a K-nearest-neighbors genetic algorithm, J. Mol. Biol.
265 (1997) 445-464.
[61] V. Schnecke, L.A. Kuhn, Virtual screening with solvation and ligand-induced complementarity, Virtual Screening: An Alternative or Complement to High Throughput Screening? Springer, 2000, pp. 171-190.

[62] G.E. Kellogg, S.F. Semus, D.J. Abraham, HINT: a new method of empirical hydrophobic field calculation for CoMFA, J. Comput. Mol. Des. 5 (1991) 545-552.
[63] D.L. Chen, G.E. Kellogg, A computational tool to optimize ligand selectivity between two similar biomacromolecular targets, J. Comput. Mol. Des. 19 (2005) 69-82.
[64] A. Amadasi, F. Spyrakis, P. Cozzini, D.J. Abraham, G.E. Kellogg, A. Mozzarelli, Mapping the energetics of water-protein and water-ligand interactions with the "natural" HINT forcefield: predictive tools for characterizing the roles of water in biomolecules, J. Mol. Biol. 358 (2006) 289-309.
[65] A.T. García-Sosa, R.L. Mancera, P.M. Dean, WaterScore: a novel method for distinguishing between bound and displaceable water molecules in the crystal structure of the binding site of protein-ligand complexes, J. Mol. Model. 9 (2003) 172-182.
[66] C. Barillari, J. Taylor, R. Viner, J.W. Essex, Classification of water molecules in protein binding sites, J. Am. Chem. Soc. 129 (2007) 2577-2587.
[67] O. Trott, A.J. Olson, AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading, J. Comput. Chem. 31 (2010) 455-461.
[68] L. Zhang, J. Hermans, Hydrophilicity of cavities in proteins, Proteins: Struct., Funct., Bioinf. 24 (1996) 433-438.
[69] A. Damjanović, B. García-Moreno, E.E. Lattman, A.E. García, Molecular dynamics study of water penetration in staphylococcal nuclease, Proteins: Struct., Funct., Bioinf. 60 (2005) 433-449.
[70] S.F. Sousa, P.A. Fernandes, M.J. Ramos, Protein-ligand docking: current status and future challenges, Proteins: Struct., Funct., Bioinf. 65 (2006) 15-26.
[71] N. Moitessier, P. Englebienne, D. Lee, J. Lawandi, C.R. Corbeil, Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go, Br. J. Pharmacol. 153 (2008) S7-S26.
[72] D.S. Goodsell, G.M. Morris, A.J.
Olson, Automated docking of flexible ligands: applications of AutoDock, J. Mol. Recognit. 9 (1996) 1 5. ¨ sterberg, G.M. Morris, M.F. Sanner, A.J. Olson, D.S. Goodsell, Automated docking to multiple target [73] F. O structures: incorporation of protein mobility and structural water heterogeneity in AutoDock, Proteins: Struct., Funct., Bioinf. 46 (2002) 34 40. [74] C.R. Corbeil, N. Moitessier, Docking ligands into flexible and solvated macromolecules. 3. Impact of input ligand conformation, protein flexibility, and water molecules on the accuracy of docking programs, J. Chem. Inf. Model. 49 (2009) 997 1009. [75] C.R. Corbeil, P. Englebienne, N. Moitessier, Docking ligands into flexible and solvated macromolecules. 1. Development and validation of FITTED 1.0, J. Chem. Inf. Model. 47 (2007) 435 449. [76] M. Rarey, B. Kramer, T. Lengauer, G. Klebe, A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol. 261 (1996) 470 489. [77] H.-J. Bo¨hm, The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure, J. Comput. Mol. Des. 8 (1994) 243 256. [78] M. Rarey, B. Kramer, T. Lengauer, The particle concept: placing discrete water molecules during proteinligand docking predictions, Proteins: Struct., Funct., Bioinf. 34 (1999) 17 28. [79] R.A. Friesner, R.B. Murphy, M.P. Repasky, L.L. Frye, J.R. Greenwood, T.A. Halgren, et al., Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein 2 ligand complexes, J. Med. Chem. 49 (2006) 6177 6196. [80] M.L. Verdonk, G. Chessari, J.C. Cole, M.J. Hartshorn, C.W. Murray, J.W.M. Nissink, et al., Modeling water molecules in protein 2 ligand docking using GOLD, J. Med. Chem. 48 (2005) 6504 6515. [81] D. Bakowies, W.F. van Gunsteren, Simulations of apo and holo-fatty acid binding protein: structure and dynamics of protein, ligand and internal water, J. Mol. Biol. 
315 (2002) 713 736. ˚ qvist, Free energy calculations ¨ sterberg, M. Almlo¨f, I. Feierberg, V.B. Luzhkov, J. A [82] B.O. Brandsdal, F. O and ligand binding, Advances in Protein Chemistry, Elsevier, 2003, pp. 123 158. [83] D. Hamelberg, J.A. McCammon, Standard free energy of releasing a localized water molecule from the binding pockets of proteins: double-decoupling method, J. Am. Chem. Soc. 126 (2004) 7683 7689.

Water mapping: Analysis of binding site spaces to enhance binding 201 [84] H. Yu, S.W. Rick, Free energies and entropies of water molecules at the inhibitor 2 protein interface of DNA gyrase, J. Am. Chem. Soc. 131 (2009) 6608 6613. [85] V. Helms, R.C. Wade, Hydration energy landscape of the active site cavity in cytochrome P450cam, Proteins: Struct., Funct., Bioinf. 32 (1998) 381 396. [86] V. Helms, R.C. Wade, Computational alchemy to calculate absolute protein 2 ligand binding free energy, J. Am. Chem. Soc. 120 (1998) 2710 2713. [87] Y. Lu, C.-Y. Yang, S. Wang, Binding free energy contributions of interfacial waters in HIV-1 protease/ inhibitor complexes, J. Am. Chem. Soc. 128 (2006) 11830 11839. [88] R.L. Mancera, Molecular modeling of hydration in drug design, Curr. Opin. Drug. Discov. Dev. 10 (2007) 275 280. [89] J.W.M. Nissink, C. Murray, M. Hartshorn, M.L. Verdonk, J.C. Cole, R. Taylor, A new test set for validating predictions of protein ligand interaction, Proteins: Struct., Funct., Bioinf. 49 (2002) 457 471. [90] C. de Graaf, C. Oostenbrink, P.H. Keizers, T. van der Wijst, A. Jongejan, N.P. Vermeulen, Catalytic site prediction and virtual screening of cytochrome P450 2D6 substrates by consideration of water and rescoring in automated docking, J. Med. Chem. 49 (2006) 2417 2430. [91] P. Rowland, F.E. Blaney, M.G. Smyth, J.J. Jones, V.R. Leydon, A.K. Oxbrow, et al., Crystal structure of human cytochrome P450 2D6, J. Biol. Chem. 281 (2006) 7614 7622. [92] J. Hritz, A. de Ruiter, C. Oostenbrink, Impact of plasticity and flexibility on docking results for cytochrome P450 2D6: a combined approach of molecular dynamics and ligand docking, J. Med. Chem. 51 (2008) 7469 7477. [93] R. Santos, J. Hritz, C. Oostenbrink, Role of water in molecular docking simulations of cytochrome P450 2D6, J. Chem. Inf. Model. 50 (2009) 146 154. [94] R. Brenk, E. Meyer, K. Reuter, M.T. Stubbs, G.A. Garcia, F. 
Diederich, et al., Crystallographic study of inhibitors of tRNA-guanine transglycosylase suggests a new structure-based pharmacophore for virtual screening, J. Mol. Biol. 338 (2004) 55 75. [95] A. Damjanovi´c, J.L. Schlessman, C.A. Fitch, A.E. Garcı´a, Role of flexibility and polarity as determinants of the hydration of internal cavities and pockets in proteins, Biophys. J. 93 (2007) 2791 2804. [96] E.D. Lopez, J.P. Arcon, D.F. Gauto, A.A. Petruk, C.P. Modenutti, V.G. Dumas, et al., WATCLUST: a tool for improving the design of drugs based on protein-water interactions, Bioinformatics 31 (2015) 3697 3699.


Ligand-based pharmacophore modeling: A technique utilized for virtual screening of commercial databases

9.1 Introduction

The concept of pharmacophores, i.e. simple molecules and chemical groups arranged in a certain order, was introduced nearly a century ago [1]. Interest in and focus on pharmacophores have increased in recent years, following the advances in computational chemistry research [2]. The specific biological activity of any drug molecule is due to its specific 3D structure, in which the important functional groups are arranged in such a way that they form complementary molecular interactions with the corresponding biological target to produce a certain pharmacological activity. These specifically arranged functional groups in 3D space are called 'pharmacophores', i.e. the functional groups responsible for the specific pharmacological activity. Pharmacophore mapping is thus an important and unifying concept in rational drug design: it embodies the notion that molecules are active at a particular enzyme or receptor because they possess a number of chemical features (i.e., functional groups) that interact favorably with the target and possess a geometry complementary to it. The interactions involved in the molecular recognition of small molecules by their respective targets are usually steric and electrostatic. On this basis, and to avoid any misconception, IUPAC released an official definition in 1998: 'A pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response'. Certain salient features of pharmacophores include:

• The pharmacophore describes the essential steric and electronic, function-determining points necessary for an optimal interaction with a relevant pharmacological target.
• The pharmacophore does not represent a real molecule or a real association of functional groups, but a purely abstract concept that accounts for the common molecular interaction capacities of a group of compounds towards their target structure. Pharmacophores are not specific functional groups (e.g. sulfonamides) or "pieces of molecules" (e.g. dihydropyridines, arylpiperazines).

Concepts and Experimental Protocols of Modelling and Informatics in Drug Design. DOI: © 2021 Elsevier Inc. All rights reserved.


• A pharmacophore can be considered as the highest common denominator of a group of molecules exhibiting a similar pharmacological profile and which are recognized by the same site of the target protein.

Pharmacophore mapping, or pharmacophore modeling, is a ligand-based drug design approach. Often, all alignment-based methods and molecular field and potential calculations are classified as pharmacophore perception techniques. The term 'pharmacophore model' usually refers to one specific type of perception, namely 3D feature-based pharmacophore models represented by geometry or location constraints, qualitative or quantitative. An extrapolation of the pharmacophore approach to a set of multi-dimensional descriptors (pharmacophore fingerprints) has been developed, mostly for library design and focusing purposes [3,4]. A pharmacophore model can be derived by direct analysis of the structure of a known ligand, either from:

i. its most stable conformer, when the structure of the target protein is not available (Fig. 9.1A), or
ii. its bioactive conformation observed in complex with the target protein, if the crystal structure of the complex is available (Fig. 9.1B).

One of the key considerations for pharmacophore model generation is automated alignment: correct alignment is the first and most important prerequisite for a successful pharmacophore identification process. Other essential issues in pharmacophore modeling are conformational search, pharmacophore feature definitions, compound structure storage, and screening. Various ways of perceiving pharmacophores have been explored, known issues with pharmacophore modeling have been addressed in one way or another, and several computer-based applications with a pharmacophore focus have been created since the 1980s. Many of these programs are not intensively used today, but we consider that they should be mentioned in this chapter: ALADDIN [5], DANTE [6], APOLLO [7], RAPID [8], SCREEN and its PMapper from ChemAxon, and ChemX fingerprints from Chemical Design (now Accelrys).

Figure 9.1 Pharmacophore derived from: the most stable conformation (A), the bioactive conformation (B).

This chapter is based on the personal experience of the authors; it is not intended as a direct comparison between packages, but rather as a summarized status of the current developments in pharmacophore modeling technology from our perspective.

9.2 Methodology of pharmacophore modeling or mapping

Pharmacophore model development is a seven-step process. The general steps for the generation of pharmacophores are described in Fig. 9.2.

9.2.1 Input: Data set preparation and conformational search

A computer is not an animal model but a computational machine: the results it produces are only as good as the input data it is given. Therefore, to generate a highly predictive pharmacophore model, the selection of an appropriate data set, along with the preparation of suitable conformations of all molecules in it, is the crucial first step in pharmacophore modeling; any error at this step cannot be corrected in later steps. In selecting an appropriate data set, certain points should be kept in mind. These include:

• The biological activity of the selected molecules should span at least 4 log orders, with some intermediate values (e.g. IC50 values of the actives covering roughly 0.001-10 nM).
• The biological activity of all molecules in the data set should be determined using the same experimental protocol, because errors vary from method to method and these variations may otherwise be incorporated into the predictions.
• A pharmacophore model is a hypothetical model of the active site of the target protein; therefore, all molecules in the data set should interact with the same active site.

Figure 9.2 General steps for the generation of pharmacophores.

• To cover the maximum number of features required for the selected pharmacological activity, the active molecules in the data set should have high structural diversity.
• The stereochemistry of biologically active molecules plays an indispensable role in eliciting specific biological activity, because the intermolecular interactions in molecular recognition are a purely 3D phenomenon. Therefore, the stereochemistry of the molecules considered for model generation should be well defined; data for racemic mixtures or molecules with uncertain stereochemistry should be ignored.

After selecting an appropriate data set, the molecules are sketched and cleaned using the builder tool of the software in use, and the sketched molecules are optimized with an appropriate force field. Since the pH of the medium may affect the bioactive conformation of the molecules, the ionization state of each molecule at the specific pH (given in the assay procedure) should be considered during the conformational search step. Most current pharmacophore generation packages include compound builders, but users can also import structures from external sources using common file formats, for example SMILES, MOL, SD or MOL2.
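The data-set criteria above (an activity spread of at least 4 log orders, a single assay protocol) are easy to check programmatically. A minimal sketch, using hypothetical IC50 values; the function names and the example numbers are illustrative, not from any specific package:

```python
import math

def activity_log_range(ic50_nm):
    """Span of the activity data in log10 units; >= 4 is desirable."""
    return math.log10(max(ic50_nm)) - math.log10(min(ic50_nm))

def dataset_ok(ic50_nm, assay_ids):
    """Check two selection criteria: >= 4 log orders of activity spread
    and a single experimental protocol for all measurements."""
    return activity_log_range(ic50_nm) >= 4.0 and len(set(assay_ids)) == 1

# Hypothetical training set: IC50 values (nM), all from one enzymatic assay.
ic50 = [0.001, 0.05, 1.2, 10.0]
assays = ["enzymatic"] * len(ic50)
```

Here `dataset_ok(ic50, assays)` is true, since the activities span exactly 4 log orders and come from one protocol.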

9.2.2 Conformational search

The conformational search step addresses the flexibility of the input ligands and can be performed either as a separate initial step or combined with the ligand preparation process. To avoid errors due to the conformational flexibility of each ligand, instead of using only the Global Energy Minimum (GEM) of each molecule, the entire low-energy conformational space, containing the GEM and some Local Energy Minima (LEM) near it, is used. The conformational space of each ligand is determined by conformational sampling, using a maximum of 250 conformations within an energy threshold of 20 kcal/mol above the global energy minimum, as displayed in Fig. 9.3.
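The 250-conformation / 20 kcal/mol selection rule described above amounts to a simple filter over per-conformer energies (in practice the energies would come from a force-field calculation); a minimal sketch:

```python
def select_conformers(energies, window=20.0, max_confs=250):
    """Keep conformers whose energy lies within `window` kcal/mol of the
    global energy minimum (GEM), lowest-energy first, capped at `max_confs`.
    `energies` is a list of conformer energies in kcal/mol; the returned
    list contains the indices of the retained conformers."""
    gem = min(energies)
    kept = [(e, i) for i, e in enumerate(energies) if e - gem <= window]
    kept.sort()  # lowest energy first
    return [i for _, i in kept[:max_confs]]
```

For example, `select_conformers([5.0, 0.0, 19.9, 30.0])` keeps the first three conformers (the fourth is more than 20 kcal/mol above the GEM) and returns their indices ordered by energy.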

Figure 9.3 Conformational sampling in conformational search for pharmacophore modeling (schematic plot of potential energy vs. conformations, indicating the GEM, nearby LEMs, and the window of 250 conformations within 20 kcal/mol).

A conformational expansion analysis is necessary in order to identify a conformation that makes functional groups available for interaction with the macromolecular target. This is probably the most critical step, since the goal is not only to have the most representative coverage of the conformational space of a molecule, but also to have either the bioactive conformation as part of the set of generated conformations, or at least a cluster of conformations that are close to the bioactive conformation. The methods that can be used for this purpose fall roughly into four categories: systematic search in torsional space, optionally followed by clustering; stochastic methods, e.g. Monte Carlo; sampling methods, e.g. Poling [9]; and molecular dynamics. The resulting set of conformations can be further optimized by minimization, with or without solvent. There are numerous references in the literature showing the effects of various sets of conformational models on pharmacophore generation; however, the goal of this chapter is not to describe and analyze the different approaches. Marshall et al. described the so-called Active Analog Approach [10,11], in which the conformational space of flexible molecules is constrained to the geometry of a reference molecule (generally active and as rigid as possible). Pharmacophore models are then derived from the set of resulting alignments. This approach has been used successfully since the mid-1980s and still forms the basis of many existing automated pharmacophore modeling techniques.

9.2.3 Feature extraction

The representation of pharmacophores varies from one package to another and includes the nature of the pharmacophore points (fragments, chemical features) and the geometric constraints connecting these points (distances, torsions, three-dimensional coordinate location constraints). The interpretation of the chemical structures of the molecules can be done at two levels:

• Substructural, where molecules can be decomposed into different fragments, each fragment carrying certain specifications (e.g. basic nitrogen or aromatic ring).
• Functional, where an abstraction of the structure is made such that each molecular fragment of the compound is expressed by the general property it carries. In the current stage, the properties mapped on the fragments are chemical properties, e.g. hydrophobic or ionic interactions or hydrogen bonding features.

For pharmacophore model development, each ligand structure is represented by a set of points in 3D space, which coincide with various chemical features that may facilitate noncovalent binding between the ligand and its target receptor. These pharmacophore sites are characterized by type, location and, if applicable, directionality. Most of the software packages usually provide a set of six color coded pharmacophoric features as described in Fig. 9.4. Some common functional groups corresponding to these pharmacophoric features are given in Table 9.1.
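At its simplest, feature extraction is a lookup from functional-group labels to the feature types of Table 9.1. A toy sketch; the feature-type abbreviations follow the table, but the specific group-to-feature assignments below are illustrative examples, not the book's exact definitions:

```python
# Illustrative mapping from functional-group labels to pharmacophoric
# feature types (HBA, HBD, PI, NI, hydrophobic, ring aromatic).
FEATURE_MAP = {
    "carboxylate":   ["HBA", "NI"],
    "hydroxyl":      ["HBA", "HBD"],
    "primary_amine": ["HBD", "PI"],
    "benzene_ring":  ["RA", "aromatic_hydrophobic"],
    "isopropyl":     ["aliphatic_hydrophobic"],
}

def assign_features(groups):
    """Collect the set of pharmacophoric feature types carried by a list
    of functional-group labels (one group can carry several features)."""
    feats = set()
    for g in groups:
        feats.update(FEATURE_MAP.get(g, []))
    return sorted(feats)
```

Real packages assign features by 3D substructure perception (often via SMARTS-like patterns) rather than by label lookup, but the many-to-many relation between groups and feature types is the same.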

Figure 9.4 Pharmacophoric features commonly used by pharmacophore modeling.

Table 9.1: Some common functional groups along with pharmacophoric features.
Pharmacophoric feature / Functional groups
Hydrogen bond acceptor (HBA) / RCOO−, ...
Hydrogen bond donor (HBD) / ...
Positive ionizable group (PI) / ...
Negative ionizable group (NI) / ...
Aliphatic hydrophobic / ...
Aromatic hydrophobic / ...
Ring aromatic feature / ...
(The functional-group entries of this table were drawn as chemical structures in the original and could not be reproduced here.)







9.2.4 Pattern identification

The presence of pharmacophoric features in ligand molecules is not by itself sufficient for specific recognition by the desired target; the mutual arrangement of these features in three-dimensional space, i.e. the pattern of the pharmacophore model, is equally important. Thus, the determination of the pharmacophore features together with the inter-feature distances is called pattern identification in pharmacophore mapping. The majority of pharmacophore generation packages generate qualitative pharmacophores that do not consider the activity (potency) of the molecules, so in general equipotent molecules have to be used. Most of these methods are based on minimizing the RMS superposition error between conformations of various compounds while trying to increase the three-dimensional overlay of pharmacophores. The result is generally multiple pharmacophore solutions, ranked according to different metrics depending on the package used. To our knowledge, currently only two packages are capable of generating SAR models on-the-fly by directly using activity values (Ki or IC50): Catalyst® HypoGen [12] and Apex3D [13]. A stereoview of pharmacophore models displaying pattern identification in Catalyst (A) and Phase (B) is shown in Fig. 9.5.

Figure 9.5 Pattern identification in pharmacophore mapping: in Catalyst (A); in Phase (B).
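A pharmacophore pattern of the kind just described can be reduced to the set of typed inter-feature distances, and two patterns can then be compared within a distance tolerance. A toy sketch (a real implementation must also search over feature correspondences and conformations):

```python
import math
from itertools import combinations

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def feature_pattern(sites):
    """sites: list of (feature_type, (x, y, z)). Returns the sorted list
    of pairwise (type_a, type_b, distance) tuples describing the pattern."""
    pattern = []
    for (ta, pa), (tb, pb) in combinations(sites, 2):
        a, b = sorted((ta, tb))
        pattern.append((a, b, distance(pa, pb)))
    return sorted(pattern)

def matches(pattern_a, pattern_b, tol=1.0):
    """True if the two patterns pair up feature types exactly and all
    corresponding inter-feature distances agree within `tol` angstroms."""
    if len(pattern_a) != len(pattern_b):
        return False
    return all(a[:2] == b[:2] and abs(a[2] - b[2]) <= tol
               for a, b in zip(pattern_a, pattern_b))
```

For example, a three-feature HBA/HBD/RA query matches a candidate whose features sit within the tolerance of the query geometry, and rejects one whose ring-aromatic feature has moved several angstroms away.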

9.2.5 Scoring of the model

The software generates a number of pharmacophoric hypotheses, and the best hypothesis is selected on the basis of scoring parameters. Although the parameters and the way they are calculated vary from software to software, the ultimate idea of this scoring is to determine the capacity of a hypothesis to identify active hits from a pool of active and inactive compounds, how well the hypothesis fits the active molecules, and how accurately it predicts the activity of the compounds. To illustrate this scoring, the parameters used to select the best hypothesis in two of the most popular packages, Catalyst from Accelrys and Phase from Schrödinger, are described here. In Catalyst, the initial selection of the best hypothesis is based on cost analysis and some statistical parameters, as described in Table 9.2. In the Phase module of Schrödinger, two types of parameters are considered: one type ranks a hypothesis on the basis of its ability to pick active molecules from the database and to discriminate these actives from inactives (the Survival and Survival-inactive scores); the second type determines how well the generated pharmacophore model is justified by the structures contributing to the hypothesis (the Site, Vector and Volume scores). The details of these parameters are described in Table 9.3.

Table 9.2: Cost values and some statistical parameters used in the Catalyst software for selection of the best hypotheses.
Null cost - Represents the highest cost, that of a pharmacophore with no features
Fixed cost - Represents the simplest model that fits all data perfectly
Error cost - Depends on the root-mean-square differences between the estimated and the actual activities of the training set molecules
Configuration cost - Depends on the complexity of the pharmacophore hypothesis space; should have a value < 17
RMSD & r - Represent the quality of the correlation between the estimated and the actual activity data

Table 9.3: Pharmacophore selection parameters in Phase of Schrödinger.
Survival - A high Survival score indicates a high ability to identify active molecules
Survival-inactive - A high value indicates a high ability to discriminate active from inactive molecules
Site - Measures how closely the site points are superimposed, in an alignment to the pharmacophore, of the structures that contribute to the selected hypothesis, based on the RMSD of the site points of a ligand from those of the reference
Vector - Measures how well the vectors for HBA, HBD and RA features are aligned in the structures that contribute to the selected hypothesis, when the structures themselves are aligned to the pharmacophore
Volume - Measures how much the volumes of the contributing structures overlap when aligned on the pharmacophore; the volume score is the average of the individual volume scores
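Several of these parameters reduce to simple geometry. For instance, a site score of the kind described in Table 9.3 is based on the RMSD between matched site points after alignment; a simplified sketch (not Phase's exact definition):

```python
import math

def site_rmsd(ligand_sites, reference_sites):
    """RMSD between matched 3D site points of an aligned ligand and a
    reference pharmacophore; lower values mean closer superposition."""
    if len(ligand_sites) != len(reference_sites):
        raise ValueError("site points must be matched one-to-one")
    sq = sum(sum((a - b) ** 2 for a, b in zip(p, q))
             for p, q in zip(ligand_sites, reference_sites))
    return math.sqrt(sq / len(ligand_sites))
```

A ligand whose site points coincide with the reference gives an RMSD of zero; the score degrades smoothly as the points drift apart.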

9.2.6 Validation of pharmacophore

After performing pharmacophore analysis on a set of compounds, the user typically has to select the model(s) with biological and/or statistical relevance, often from multiple possible solutions, and use them for further research. Validation of the pharmacophore models is therefore a critical aspect of the pharmacophore generation process. The generated pharmacophore model should be statistically significant, should accurately predict the activity of molecules, and should precisely identify active compounds from a database [14]. The derived pharmacophore model is therefore validated for its ability to retrieve the correct active molecules from a database in pharmacophore-based virtual screening, and also for its ability to predict the activity of the identified compounds. Commonly used validation methods include:

• Test set prediction
• Fischer's randomization test
• GH score calculation
• Enrichment factor
• Applicability domain calculation
• Receiver operating characteristic curve

These methods are discussed in detail in Chapter 16. Ultimately, the effectiveness of the generated model can only be determined by biological testing of the identified compounds and comparison of the results with those anticipated.
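Two of the listed metrics have closed-form expressions. A sketch of the enrichment factor and the Güner-Henry (GH) score as commonly defined in the virtual-screening literature, where D is the database size, A the number of actives in it, Ht the number of hits retrieved and Ha the number of active hits:

```python
def enrichment_factor(ha, ht, a, d):
    """EF = (Ha/Ht) / (A/D): how many times more concentrated the actives
    are in the hit list than in the whole database."""
    return (ha / ht) / (a / d)

def gh_score(ha, ht, a, d):
    """Guner-Henry goodness-of-hit score, balancing the recall of actives
    against the purity of the hit list; approaches 1 for an ideal model."""
    return (ha * (3 * a + ht)) / (4 * ht * a) * (1 - (ht - ha) / (d - a))
```

For example, retrieving 40 actives among 100 hits from a database of 1000 compounds containing 50 actives gives an enrichment factor of 8.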

9.2.7 Applications of pharmacophore modeling

Pharmacophore mapping is the most widely used ligand-based drug design approach. It can be applied in a variety of contexts in both the pharmacodynamic and pharmacokinetic phases of drug action, and it plays an indispensable role in the process of drug discovery and development. The majority of these applications utilize pharmacophores as a screening tool. Many examples in the literature show their successful use in finding new scaffolds, i.e. lead identification, as well as in lead optimization, i.e. analogue design, against various pharmacological targets [15].

Pharmacophore based virtual screening (PBVS)
This is a computational technique used in drug lead identification. It deals with the rapid search of large libraries of chemical structures to identify those structures that are most likely to map onto the query pharmacophore. Fig. 9.6 describes the identification of twenty-two aldose reductase (ALR2) inhibitors, selective over aldehyde reductase (ALR1), by pharmacophore-based hierarchical virtual screening.

Pharmacophore based de novo designing
This is the design of new small molecules by connecting pharmacophoric features through rigid fragments or a molecular framework. The following two computer programs can be explored for this purpose:

NEWLEAD - This program takes as input a set of disconnected molecular fragments that are consistent with a pharmacophore model; the selected sets of disconnected pharmacophore fragments are subsequently connected using linkers (such as atoms, chains or ring moieties).

PhDD - This program can automatically generate small molecules that satisfy the requirements of an input pharmacophore hypothesis. A series of assessments of the generated molecules is then carried out, including assessments of drug-likeness, bioactivity and synthetic accessibility.


Figure 9.6 Pharmacophore based hierarchical virtual screening to identify selective ALR2 inhibitors.

Multi-targeting by pharmacophore
Single-target drugs are sometimes not suitable for the treatment of multifactorial diseases, e.g. cancer, Alzheimer's disease, autoimmune disorders and diabetic complications, which involve multiple pathophysiological pathways: if any one of these pathways is blocked, the others still continue to progress the disease. For such diseases, multi-target directed drugs (MTDs), which simultaneously block more than one pathway, are suitable. The design of MTDs can be effectively achieved by pharmacophore modeling: multiple pharmacophore models can be generated for the various targets involved in the disease process, and these models can then be used in combination in virtual screening to obtain MTDs. Fig. 9.7 displays the design of 27 dual SYK and ZAP-70 inhibitors using pharmacophore modeling.

Figure 9.7 Designing of drug-like multi-targeted molecules using pharmacophore based virtual screening (screening funnel: 1.5 million molecules from the Phase database, screened successively with the SYK and ZAP-70 pharmacophore models and a QikProp ADME filter; 1000, 261, 44 and finally 27 molecules at successive stages).

Target identification by pharmacophore
This is based on the principle of predicting potential drug targets for a given small molecule via a 'reverse' pharmacophore mapping approach. In reverse pharmacophore mapping, a small molecule is submitted to a database of pharmacophore models of various biological targets, and on the basis of fit values the candidate targets, i.e. those for which a pharmacophore model is present in the database, can be prioritized. This approach can also be used to explore the mechanism of action of herbal medicines with known therapeutic potential. A web server, PharmMapper, is available for this purpose.

ADME-tox prediction by pharmacophore
Pharmacophore models can be used to identify possible interactions of drugs with the functional proteins involved in metabolism, clearance and transport, such as P-glycoprotein and the organic cation transporter, by matching the pharmacophoric features of test molecules to those of drug molecules with a well-known ADME-tox profile.

Pharmacophore mapping for addressing resistance issues
A pharmacophore model can be generated from resistance-associated molecules; this model can then be used to discard likely resistant molecules during screening. Pharmacophore-based design can also be performed for inhibitors of the proteins that are responsible for resistance (Pgp/MRP-1).
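The hierarchical screening funnels of Figs. 9.6 and 9.7 amount to applying successive boolean filters to a library and recording the survivors at each stage. A minimal sketch; the filter functions and toy "molecules" below are hypothetical stand-ins for pharmacophore matching and ADME prediction:

```python
def hierarchical_screen(library, filters):
    """Apply successive filters to a library, recording the number of
    survivors at each stage, as in a hierarchical screening funnel."""
    counts = [len(library)]
    hits = list(library)
    for keep in filters:
        hits = [m for m in hits if keep(m)]
        counts.append(len(hits))
    return hits, counts

# Toy molecules carrying precomputed fit scores against two targets.
library = [{"syk": s, "zap70": z} for s in (0.2, 0.8) for z in (0.3, 0.9)]
hits, counts = hierarchical_screen(
    library,
    [lambda m: m["syk"] > 0.5, lambda m: m["zap70"] > 0.5],
)
```

Here `counts` records the funnel (4 molecules, then 2, then 1), and the single survivor is the molecule scoring well against both targets.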

9.3 In-process determinants for quality pharmacophore modeling

9.3.1 Molecular alignments

Molecular alignment and superposition is a prerequisite for pharmacophore development; conversely, some alignment methods require a pharmacophore as a starting point [16]. Molecular alignment is not limited to providing a basis for pharmacophore elucidation: it can also be used to derive 3D-QSAR models that can potentially estimate binding affinities, in addition to indirectly providing insight into the spatial and chemical nature of the receptor-ligand interaction of the putative receptor. Essentially, an alignment endeavors to produce a set of plausible relative superpositions of different ligands, hopefully approximating their putative binding geometry [17]. Many of the issues and concerns in the generation of pharmacophore models are inherent in the different alignment methods, and these issues can be used to differentiate or categorize the plethora of available algorithms [18].

9.3.2 Handling flexibility

The primary issue in pharmacophore generation is ligand flexibility, which is vital to the determination of the relevant binding conformation of each of the ligands concerned. With respect to flexibility, alignment methods can be considered rigid, semi-flexible or flexible. Rigid methods, while generally simpler and faster, require a presumption of the bioactive conformation of the ligands; this is often not possible and also removes the impartiality of the method. Semi-flexible methods are fed with pre-generated conformers, which are processed in a sequential, iterative or combinatorial manner. These methods lead to a further series of considerations, such as whether the weighting, number and spread of conformers are determined by energy cut-offs or Boltzmann probability distributions, and whether solvation models should be used. Flexible methods are those in which the conformational analysis is performed on-the-fly; these are generally the most time-consuming, as they require rigorous optimization [19].
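When conformers are weighted by a Boltzmann probability distribution, the weights follow directly from the relative conformer energies; a minimal sketch:

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature=298.15):
    """Relative population of each conformer from its energy above the
    global minimum: w_i is proportional to exp(-dE_i / RT), normalized
    so that the weights sum to 1."""
    gem = min(energies_kcal)
    factors = [math.exp(-(e - gem) / (R_KCAL * temperature))
               for e in energies_kcal]
    z = sum(factors)  # partition-function-like normalization
    return [f / z for f in factors]
```

Degenerate conformers receive equal weight, and a conformer several kcal/mol above the minimum contributes almost nothing at room temperature.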

9.3.3 Alignment algorithms

Two types of algorithms are used in alignment methods: point-based and property-based. In point-based algorithms, pairs of atoms or pharmacophores are usually superposed using least-squares fitting. These algorithms often use clique detection methods [20], which are based on the graph-theoretical approach to molecular structure: a clique, i.e. a completely connected subgraph, is used to identify all possible combinations of atoms or functional groups defining common substructures for the alignment. The greatest limitation of these algorithms is the need for predefined anchor points, since the generation of these points can become problematic in the case of dissimilar ligands.

Property-based algorithms, often also termed field-based, make use of grid or field descriptors, the most popular of which are those obtained from the program GRID, developed by Goodford [21]. These are generated by defining a three-dimensional grid around a ligand and calculating the energy of interaction between the ligand and a given probe at each grid point. These diverse descriptors include various molecular properties such as molecular shape and volume, electron density, charge distribution (e.g. molecular electrostatic potentials) and even high-level quantum mechanical calculations. These algorithms are commonly broken down into three stages, which are subject to much variation. First, each ligand is represented by a set of spheres or Gaussian functions displaying the property or properties of interest; usually the property is first calculated on a grid and subsequently transformed to the sphere or Gaussian representation. A number of random or systematically sampled starting configurations are then generated, depending on the degrees of freedom considered: rotational, translational and conformational. Finally, local optimizations are performed with some variant of the classical similarity measure of the intermolecular overlap of the Gaussians as the objective function. While earlier property-based alignment methods were commonly used, they have been surpassed by Gaussian molecular representations and Gaussian overlap optimization, which provide high information content and avoid the dependence on additional parameters such as grid spacing, while also providing a substantial increase in speed. Variations on these algorithms have included the application of Fourier-space methods to optimize the electron density overlap, similar to the molecular replacement technique in X-ray crystallography [22], and differentially weighted molecular interaction potentials or field terms [23]. Another interesting alternative has been to apportion the conformational space of the ligands into fragments, compute the property field on pairs of fragments, and determine the alignment by pose clustering and an incremental build-up procedure of retrieved fragment pairs [24].
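Clique detection for point-based alignment can be illustrated on a correspondence graph: each node pairs a feature of molecule A with a same-type feature of molecule B, and an edge connects two nodes when the corresponding intra-molecular distances agree within a tolerance. A brute-force toy sketch (real programs use efficient Bron-Kerbosch-style enumeration instead of exhaustive subset search):

```python
import math
from itertools import combinations

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def max_common_clique(mol_a, mol_b, tol=0.5):
    """Largest set of feature pairings between two molecules, each given
    as a list of (feature_type, (x, y, z)), such that every pair of
    pairings has consistent intra-molecular distances (within `tol`)."""
    nodes = [(i, j) for i, (ta, _) in enumerate(mol_a)
                    for j, (tb, _) in enumerate(mol_b) if ta == tb]

    def compatible(n1, n2):
        (i1, j1), (i2, j2) = n1, n2
        if i1 == i2 or j1 == j2:      # one feature cannot pair twice
            return False
        da = dist(mol_a[i1][1], mol_a[i2][1])
        db = dist(mol_b[j1][1], mol_b[j2][1])
        return abs(da - db) <= tol

    # Exhaustive search from the largest subset downwards.
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []
```

Two identical three-feature arrangements yield a clique of size 3; displacing one feature breaks the corresponding distance constraint and shrinks the clique to 2.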

9.3.4 Key aspects of scoring and optimization All alignment methods require some quantitative measure or fitness function, to assess the degree of overlap between the ligands being aligned and to monitor the progression of that optimization. This is most often manifested as a molecular similarity score or alignment index [25]. Typically in point-based algorithms, the optimization process endeavors to reduce the root-mean-square (RMS) deviation of the distances between the points or cliques by least-squares fitting. However, interesting variations have been developed including the use of distance matrices to represent any given conformation of a ligand [26]. Simulated annealing is used to optimize the fitness function, which is a quantification of the sum of the elements of the difference distance matrix created by calculating the magnitude of the difference for all corresponding elements of two matrices. Another optimization method, related to the least-squares fitting used in point-based algorithms, is the directed tweak method [27]. This is a torsional space optimizer, in which the rotatable bonds of the ligands are adjusted at search time to produce a conformation which matches the 3D query as closely as possible. As directed tweak involves the use of analytical derivatives, it is very fast and allows for an RMS fit to consider ligand flexibility. In property-based alignments where the molecular fields are represented by sets of Gaussian functions, the intermolecular overlap of the Gaussians is used as the fitness function or

similarity index. The two most common optimization methods are Monte Carlo and simulated annealing [28]. Other straightforward optimization algorithms include gradient-based methods and the simplex method, which seeks the vector of parameters corresponding to the global extremum (maximum or minimum) of an n-dimensional function by searching through the parameter space [29]. Further, more sophisticated optimization algorithms include neural networks and genetic algorithms, which mimic the process of evolution as they attempt to identify a global optimum [30]. In an alignment procedure, chromosomes may encode the conformation of each ligand in addition to intramolecular feature correspondences, orientational degrees of freedom, torsional degrees of freedom or other information such as molecular electrostatic potential fields. During the optimization the chromosomes undergo manipulation by genetic operators such as crossover and mutation. Alignment methods are also known to combine different optimization methods, such as a genetic algorithm and a directed tweak method [31].
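The difference-distance-matrix fitness described earlier is easy to sketch. Because inter-point distances are invariant to rotation and translation, the score depends only on internal geometry, which is what makes it a convenient objective for simulated annealing. A minimal Python illustration (data format hypothetical):

```python
import math

def distance_matrix(coords):
    """Inter-point distance matrix for one conformation, given as a list
    of (x, y, z) tuples."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]

def difference_fitness(coords_a, coords_b):
    """Sum of the magnitudes of all elements of the difference distance
    matrix. Zero means the two conformations have identical internal
    geometry, regardless of their orientation in space."""
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(da)
    return sum(abs(da[i][j] - db[i][j]) for i in range(n) for j in range(n))

# A rigidly rotated copy has identical internal distances, so fitness is zero.
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
rotated = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]  # 90 deg about z
```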

9.4 Automated pharmacophore generation methods

The currently available pharmacophore perception methods are reviewed here in three major categories: geometry- and feature-based methods, field-based methods and pharmacophore fingerprints. Finally, the methods that do not fall into any of the above categories are described in an additional section.

9.4.1 Geometry- and feature-based methods

DISCO (DISCOtech)
DISCO was not originally reported as an automated pharmacophore identification program [32], but it had considerable influence over the development of modern pharmacophore modeling tools. By design, no conformational engine was implemented in DISCO, as no universal force fields and methods suitable for all types of compounds were available [32]. However, Tripos (the commercial distributor) provides access to 3D converters and conformational search engines such as Concord® and Confort® via the Sybyl interface. The distance geometry approach has been used successfully for subsequent pharmacophore modeling with DISCO by the authors of DISCO and other researchers [33]. DISCO considers three-dimensional conformations of compounds not as coordinates but as sets of inter-point distances, an approach similar to a distance geometry conformational search. Distances are calculated between the coordinates of heavy atoms labeled with interaction functions such as HBD, HBA or hydrophobic. One atom can carry more than one label. The atom types are considered insofar as they determine which interaction type

the respective atom would be engaged in. The points of the hypothetical locations of the interaction counterparts in the receptor macromolecule also participate in the distance matrix. These are calculated from the idealized projections of the lone pairs of participating heavy atoms or H-bond forming hydrogens. The hydrophobic points are handled in such a way that hydrophobic matches are limited to only one atom in a hydrophobic chain, and there is a differentiation between aliphatic and aromatic hydrophobes. A minimum constraint on pharmacophore points of a certain type can be set [33]. DISCO relies on the Bron-Kerbosch clique detection algorithm for inter-distance comparison. In DISCO, multiple conformations per compound are considered in the alignment and the stereochemistry is preserved. However, there is no mechanism for selecting conformations within the algorithm apart from the alignment to other structures, hence the user has to provide a conformational model that contains only the desired (low-energy) conformations. As a direct consequence, conformationally restrained compounds should be the preferred input for the program, provided that they carry the same activity as the more flexible analogues, and the performance of the program tends to decrease with increasing flexibility of the input compounds. During a DISCO run, one compound is taken as reference and each conformation of the remaining compounds is aligned onto the reference conformation in order to find a pharmacophore match. Typically, the least flexible compound serves as the reference, as this reduces the pharmacophore space to explore and the number of results left to evaluate. The result is a set of scored pharmacophore solutions rather than a single model. The score is based on the number of participating molecules, the number of features and the inter-feature distances.
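The clique-detection step can be pictured as building a correspondence graph whose cliques are candidate matches: a node pairs a labeled point of one molecule with a same-labeled point of another, and two nodes are connected when the corresponding inter-point distances agree within a tolerance. The sketch below only constructs the graph on which an algorithm such as Bron-Kerbosch would then run; feature labels, coordinates and the tolerance value are hypothetical:

```python
import math
from itertools import combinations

def correspondence_graph(points_a, points_b, tol=0.5):
    """Build the correspondence graph for two molecules given as lists of
    (label, (x, y, z)) features. Nodes pair same-labeled points; edges
    connect pairs of correspondences whose inter-point distances agree
    within `tol`. Cliques in this graph are candidate pharmacophore
    matches (the clique search itself is not shown)."""
    nodes = [(i, j) for i, (la, _) in enumerate(points_a)
                    for j, (lb, _) in enumerate(points_b) if la == lb]
    edges = set()
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # one point cannot map to two different partners
        d_a = math.dist(points_a[i][1], points_a[k][1])
        d_b = math.dist(points_b[j][1], points_b[l][1])
        if abs(d_a - d_b) <= tol:
            edges.add(((i, j), (k, l)))
    return nodes, edges

# Two two-feature molecules whose HBD-HBA distances agree within tolerance.
mol_a = [("HBD", (0.0, 0.0, 0.0)), ("HBA", (3.0, 0.0, 0.0))]
mol_b = [("HBD", (1.0, 1.0, 0.0)), ("HBA", (1.0, 4.1, 0.0))]
nodes, edges = correspondence_graph(mol_a, mol_b)
```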
Higher model quality is achieved by automatically iterating through a number of variables such as the distance tolerance (specified as minimum, maximum and increment) and the number of features and compounds used in the analysis [34]. The resulting pharmacophores are required to match all features in all compounds. The pharmacophore points in the Tripos implementation of DISCO, currently marketed under the name DISCOtech™, can be represented as Tripos UNITY® [35] query features, and the models can be used directly for UNITY database searches or in combination with 3D QSAR methods such as CoMFA, as described in [36].

GASP
GASP (Genetic Algorithm Superposition Program) uses a genetic algorithm for pharmacophore identification. GASP was developed by Jones, Willett and Glen in the mid-1990s. The methods used in GASP are similar to those in the leading docking application GOLD, developed by the University of Sheffield, GlaxoSmithKline and CCDC [37,38]. Unlike others, the conformational search is performed on-the-fly in GASP and represents an integral part of the program. Each compound is input as a single, low-energy conformation

and random rotations and translations are applied in order to explore the conformational variation prior to superposition. The first step in the pharmacophore generation process with GASP is the determination of the pharmacophore features: rings, donors (protons) and acceptors (lone pairs). The atoms defined as HBA carriers can be aliphatic and aromatic nitrogens, alcohols, ethers, carbonyl and carboxyl oxygens, and halogens; HBD carriers include amines and hydroxyls [32,39]. Projection points for the hydrogen bonding features are considered during pharmacophore analysis. GASP considers only aromatic structures as hydrophobic and there is no option to modify any of the pharmacophore feature definitions or introduce new ones [33]. If a training set consists of N compounds, a chromosome will consist of 2N − 1 strings: N binary strings encoding the conformational information about each compound and N − 1 integer strings representing the feature mappings of the training set members to a single reference (base) molecule. The length of each integer string equals the number of features in the respective molecule. The compound with the fewest pharmacophore features is selected as the base molecule. No more than one molecule can be used for that purpose. Essentially, the program tries to maximize the mappings using a least-squares method while trying to satisfy a fitness function comprising three components: the similarity score of the mapped features, the volume integral of the aligned structures and the internal steric energy of the participating conformers, where the weighting of each contribution can be adjusted by the user. GASP uses two genetic operators, crossover (two parents produce two children) and mutation (one parent produces one child), to evolve models with a maximum fitness score and therefore the highest-quality structural alignment.
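The 2N − 1 string layout of the chromosome can be mocked up as follows; the string lengths, bit encoding and random initialization are illustrative, not GASP's actual representation:

```python
import random

def make_chromosome(n_torsions_per_mol, n_features_per_mol,
                    bits_per_torsion=8, seed=0):
    """For N training-set molecules, build the 2N - 1 strings described in
    the text: N binary strings encoding torsion angles (conformation) and
    N - 1 integer strings mapping each remaining molecule's features onto
    the base molecule (the one with the fewest features). Encodings and
    lengths here are hypothetical stand-ins."""
    rng = random.Random(seed)
    n = len(n_torsions_per_mol)
    base = min(range(n), key=lambda i: n_features_per_mol[i])
    conf_strings = [[rng.randint(0, 1) for _ in range(t * bits_per_torsion)]
                    for t in n_torsions_per_mol]
    map_strings = [[rng.randrange(n_features_per_mol[base])
                    for _ in range(n_features_per_mol[i])]
                   for i in range(n) if i != base]
    return conf_strings, map_strings

# Three molecules with 2, 3 and 1 rotatable bonds and 4, 5 and 3 features;
# the third (fewest features) becomes the base molecule.
conf_strings, map_strings = make_chromosome([2, 3, 1], [4, 5, 3])
```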
The similarity score for the overlaid molecules is the sum of the scores of the similarity match between donors, acceptors and aromatic rings. The volume integral is determined as the mean volume integral per molecule with the base molecule. Finally, the internal van der Waals energy is calculated as a Lennard-Jones potential and represented as the difference from the preceding conformer. All features of all molecules must match in the alignment, hence no outliers are allowed, and sometimes subsetting may be required during the training set preparation phase in order to separate out compounds that carry somewhat different pharmacophoric information. Owing to the nature of the algorithm, each run may result in a slightly different solution. Several solutions can then be collected, ranked according to fitness score and analyzed visually in order to find the most suitable answer. Similarly to DISCO, the alignments coming from GASP can be used as a starting point for CoMFA studies [40].

GALAHAD
GALAHAD is a joint development between Tripos, the University of Sheffield, Novo Nordisk and Biovitrum [41,42]. The program uses a modified GA and seems to address

the limitations of GASP in terms of increasing performance, reducing bias towards a single template (base) molecule, introducing partial matching and providing an improved multi-objective Pareto scoring function. GALAHAD allows the use of pre-generated conformations as a starting point, which increases the speed of the calculation. Each molecule is represented as a core and a set of torsions. In the alignment phase, a new method is used in which every molecule is compared with every other, hence no template is required. Pharmacophore similarity rather than feature mapping is used for the comparison, which should result in shorter run times. The fact that not all features are required to map contributes to the ability of the models to accommodate more diverse structures. Unlike GASP, GALAHAD reports multiple solutions from a single run, which are ranked according to their scores and can be resubmitted for refinement if desired.

Catalyst
Catalyst® was launched in 1992 by BioCAD (now Accelrys) as a tool for automated pharmacophore pattern recognition in a collection of compounds, based on chemical features correlated with three-dimensional structure and biological activity data [43]. Catalyst models (hypotheses) consist of sets of abstracted chemical features arranged at certain positions in three-dimensional space. The feature definitions are designed to cover different types of interactions between ligand and target, e.g. hydrophobic, H-bond donor, H-bond acceptor, positive ionizable and negative ionizable. Except in some special cases, different chemical groups that lead to the same type of interaction, and thus to the same type of biological effect, are handled as equivalent. The directions of the H-bonds are usually determined and are given by vectors. Distinct chemical features in a particular conformation of a compound must be located within the tolerance constraints in order to satisfy the model.
These models can be used directly as three-dimensional database search queries in the Catalyst environment. The pharmacophore identification process as implemented in the Catalyst package involves 3D structure generation, followed by conformational search and definition of the pharmacophore points consistent with the training set. For the construction of molecular structures, a 2D formula editor is provided in combination with 3D conversion. Standard potential energy minimization is performed using a modified parameter set of the CHARMm force field [44]; the conformational models are built using Monte Carlo conformational analysis together with poling, as described in the next section. Catalyst provides two algorithms for automated pharmacophore arrangement search. HypoGen uses biological assay data (e.g. IC50 or Ki) to derive hypotheses that can predict quantitatively the activity of compounds, whereas HipHop seeks a common three-dimensional configuration of chemical features shared among a set of active molecules. In the case of HypoGen, similarly to 3D QSAR, all members of the training set must possess the same binding mode; the second method optionally allows automatic elimination of

compounds that may have a different molecular site of action. The resulting models undergo a complex evaluation process by the program and the top-scoring results are reported to the user.

i. HipHop
The HipHop algorithm attempts to produce an alignment of compounds expressing a certain activity against a particular target and, by superposition of diverse conformations, to find common three-dimensional arrangements of features shared between them [45]. Even though HipHop does not use activity data as input, it is a good idea to select highly active, chemically diverse compounds when composing training sets whenever possible. HipHop identifies common features by a pruned exhaustive search, starting with the simplest possible (two-feature) arrangements and expanding the model to three, four, five features and so on until no more common configurations can be found. This involves a search through two large spaces: the conformational space of the training set and the pharmacophore domain. HipHop does not need a particular reference conformation. If required, HipHop will attempt consecutively to align with each other all conformers of every training set member. Still, at least one molecule, with its entire conformational model (the principal compound), must be specified as a reference. Which exact conformer will then be present in the alignment depends on the remaining compounds and their conformational diversity, and also on the conditions of the run. First, the program identifies matches and the distribution of the chosen features among the training set members, followed by the alignment procedure. The features are considered superimposed when each of them lies within a specified distance (tolerance) from the ideal location, and at the same time the RMS deviation for the configuration as a whole is measured.
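The tolerance-plus-RMSD test just described can be sketched in a few lines; the tolerance value and the (x, y, z) site format are illustrative, not Catalyst's internal representation:

```python
import math

def check_mapping(ideal_sites, mapped_sites, tol=1.5):
    """A conformer maps a feature configuration when every mapped feature
    lies within `tol` of its ideal site location; the RMS deviation over
    the whole configuration is measured at the same time. Returns
    (all_within_tolerance, rmsd)."""
    devs = [math.dist(a, b) for a, b in zip(ideal_sites, mapped_sites)]
    rmsd = math.sqrt(sum(d * d for d in devs) / len(devs))
    return all(d <= tol for d in devs), rmsd

# A two-feature configuration: one conformer maps it, the other does not.
ideal = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ok, rmsd = check_mapping(ideal, [(0.5, 0.0, 0.0), (3.0, 0.5, 0.0)])
bad_ok, _ = check_mapping(ideal, [(0.5, 0.0, 0.0), (6.0, 0.0, 0.0)])
```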
The quantitative estimation of the goodness of match between a molecule and a configuration of features (Fit) can be used, similarly to a scoring function, to rank virtual screening results. In the ideal case, superposition of all input molecules is desired. Sometimes it can be advantageous to permit some molecules, up to a specified number, to miss one, one particular or more than one of the features of a configuration in order to map all the remaining features. The benefit of such an option is that it allows one to work with compounds that may have a different binding mode or show activity in a particular assay as a result of an alternative mechanism of action or experimental errors. In most cases, the result of a HipHop run will be numerous configurations of features, so there is a need to score and rank them. The ranking of the HipHop models is based on rarity. Maximizing the score of a configuration minimizes the probability that the training set molecules map the model by chance, making the pharmacophore specific.

ii. HypoGen
The HypoGen algorithm is designed to correlate structure and activity data for pharmacophore model generation. HypoGen consists of three phases: constructive,

subtractive and optimization. Generally, the constructive phase is similar to the procedure of the HipHop algorithm. The training set is divided into two subsets, "active" and "inactive" compounds. First, all pharmacophores shared between the two most active compounds are identified by systematically overlaying all their conformations; then only hypotheses that fit a minimum subset of features present in the remaining active compounds are kept. In the subtractive phase, the program inspects the hypotheses already created and removes those most common to the inactive part of the training set. Compounds are considered inactive when their activity lies 3.5 logarithmic units (this value is user-adjustable) below that of the most active compound. The subtractive phase is followed by an optimization phase in which simulated annealing is used to improve the predictive power of the hypotheses. Small changes are made to the models and they are scored according to their accuracy in activity estimation. Finally, the simplest models that correctly estimate activity are selected (Occam's razor) and the top N solutions are reported to the user. An important assumption made within both HipHop and HypoGen is that more contacts to the receptor, and therefore more features per molecule, lead to enhanced activity. It is well known from practice that this is often not true; large and feature-rich compounds may be barely active because of unfavorable steric interactions. An extension to the HypoGen algorithm, HypoRefine, is intended to help solve this problem by placing exclusion volumes in key locations derived from atoms of well-fitting but inactive compounds.
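The activity threshold used to split the training set can be expressed directly in code. Only the 3.5 log unit window comes from the text; the compound names, IC50 values and dictionary format are hypothetical:

```python
import math

def partition_training_set(ic50_by_name, window=3.5):
    """Split a {name: IC50} training set the way the subtractive phase
    does: compounds whose activity lies more than `window` logarithmic
    units below that of the most active compound (i.e. whose IC50 is more
    than `window` log units higher than the lowest IC50) are treated as
    inactive. The 3.5 default mirrors the user-adjustable value above."""
    best = min(ic50_by_name.values())        # lowest IC50 = most active
    cutoff = math.log10(best) + window
    actives = {n for n, a in ic50_by_name.items() if math.log10(a) <= cutoff}
    inactives = set(ic50_by_name) - actives
    return actives, inactives

# Hypothetical nanomolar-to-submillimolar training set.
actives, inactives = partition_training_set({"a": 1e-9, "b": 5e-8, "c": 1e-4})
```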
On the other hand, when insufficient activity data or only HTS data are present, the HipHopRefine algorithm allows the use of "negative" information from inactive compounds matching the pharmacophore in order to generate a grid-based exclusion volume, which eliminates false-positive HTS hits and increases enrichment rates [46].
Database searching in Catalyst: The database search process starts with a rapid screening step within which molecules possessing the properties required from potential hits are sorted out from those that can be excluded a priori. The screen involves a substructure match followed by screens matching three-dimensional pharmacophore features, molecular shapes or exclusion volumes and text constraints (1D properties) if present in the query (through Oracle). All this greatly reduces the number of potential hit compounds in the database. The next step of the search process tries a rigid fit of each conformation of each compound to the corresponding features. Compounds are selected as hits after the first successful mapping of all features, and once all compounds have passed the procedure a hit list is obtained. The Best database search first identifies all potentially suitable compounds by using loosened constraints, thus including those that would fail a rigid search. Within this preliminary list, the algorithm attempts to modify the conformers additionally so that they can fit the original query while remaining below a certain

energy threshold. The use of a Best search is justified when one has to deal with too small hit lists. Once a hit list has been obtained, Catalyst provides the possibility to compute fit values that can be used for scoring.

Phase
Phase is the pharmacophore generation module provided by Schrödinger. Like other modules available from Schrödinger, Phase uses the Maestro interface as the visualization tool [2]. Maestro provides a molecule sketcher and all the common molecular file formats are supported. The pharmacophore generation module in Phase generates pharmacophore models using the four to five step procedure described below.
Ligand preparation: Molecule construction and 2D to 3D conversion are performed using the LigPrep application in the Maestro modeling environment [47]. Ionization at a given pH or neutralization, tautomer enumeration and stereoisomer enumeration are also supported. Stereoisomers can be treated either as separate or as identical molecules. The molecule preparation step also includes conformational expansion using a torsional search or a combined Monte Carlo Multiple Minimum/Low Mode search. During the search, intramolecular hydrogen bonds are not considered. Molecules can be minimized (the OPLS-2005 and MMFF force fields are available), and two continuum solvation models (distance-dependent dielectric or GB/SA) are also provided [48]. A double criterion is used to eliminate redundant conformations; it uses distances between pairs of corresponding atoms within a 1 kcal/mol energy window. Using all compounds chosen to participate in a pharmacophore analysis, a molecular spreadsheet can be created and the user can manually select the molecules that will belong to the set that defines the reference pharmacophore space (the active set).
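The double criterion for eliminating redundant conformers can be sketched as a greedy filter. Only the 1 kcal/mol window comes from the text; the distance cutoff, the (energy, coordinates) format and the greedy strategy are assumptions for illustration:

```python
import math

def prune_conformers(conformers, energy_window=1.0, dist_cutoff=0.5):
    """Greedy sketch of a double criterion: a conformer is discarded as
    redundant when an already-kept conformer lies within `energy_window`
    kcal/mol AND every pair of corresponding atoms is closer than
    `dist_cutoff`. Conformers are (energy, coords) pairs, coords being a
    list of (x, y, z) tuples; thresholds are illustrative."""
    kept = []
    for energy, coords in sorted(conformers):      # low energy first
        redundant = any(
            abs(energy - e) <= energy_window and
            max(math.dist(a, b) for a, b in zip(coords, c)) < dist_cutoff
            for e, c in kept)
        if not redundant:
            kept.append((energy, coords))
    return kept

conformers = [
    (0.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (0.3, [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]),  # near-duplicate of the first
    (0.4, [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]),  # genuinely different geometry
]
kept = prune_conformers(conformers)
```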
Creating the pharmacophoric sites: Similarly to other software packages such as DISCO and Catalyst, Phase uses chemical features (hydrophobic, H-bond acceptor, H-bond donor, negative charge, positive charge and aromatic ring) to define the pharmacophore points, called sites. These features are encoded in SMARTS and can be edited. H-bonding and aromatic ring features are vectorised features (their directionality is considered).
Finding common pharmacophores: Using the sites defined in the previous step, pharmacophores common either to all or to a user-defined number of the selected active molecules can be generated. Phase uses a tree-based partitioning algorithm for that purpose, which places pharmacophore configurations in multi-dimensional boxes and groups them according to their inter-site distances. The user has control over the size of the pharmacophore models (maximum number of features), and also the inter-pharmacophore point spacing. Pharmacophores containing between three and seven sites can be generated. A given pharmacophore can be edited (feature addition or removal) and

the excluded volume can be added in order to include additional information based on inactive molecules.
Scoring the pharmacophores: All ligands are then aligned on the models. The model ranking is performed using a user-weighted scoring function consisting of:
• The quality of the alignment (RMSD in the site-point positions);
• A vector score that measures the angle deviation (average cosine) between the vectorised features on the molecules;
• A volume similarity (common/total) score based on the overlap of steric models of heavy atoms in each pair of molecules.

Partial mapping of the molecules on a pharmacophore model is allowed. At this stage, pharmacophore models and alignments can be visualized. Excluded volumes can be added manually after aligning the inactive molecules on the pharmacophore models.
Building a QSAR model: The generation of a QSAR model is done as a post-processing step of pharmacophore generation. This is conceptually different from the Catalyst/HypoGen approach, in which SAR data are actually used to build the pharmacophore models and are reflected in them. In the QSAR approach of Phase, molecular structures are aligned on the pharmacophore, a rectangular grid that encompasses the aligned molecules is created (generating uniformly sized cubes) and partial least-squares (PLS) regression is used. As in CoMFA, favorable and unfavorable regions can be visualized. Both atoms and pharmacophores can be used for the models.
Screening: Phase also has its own database management system, with the possibility of storing either single conformations or different sets of conformations for the molecules. As part of the processing within this system, molecules can be cleaned (chirality, ionization). Conformers can also be generated on-the-fly when performing the database search. In addition to conformations, indexing of a database can be done by adding pharmacophore sites. Partial matching of hits on a pharmacophore query is allowed. The pharmacophore search hits are ranked using a fitness function.

Pharmacophores in MOE
MOE (Molecular Operating Environment) is the modeling platform developed by the Chemical Computing Group. This platform provides access to different sets of computational tools, ranging from bioinformatics, protein modeling and structure-based design to pharmacophore perception. All these applications have been integrated using the Scientific Vector Language (SVL). The pharmacophore models built in MOE are qualitative.
There is no possibility of using the SAR of a set of molecules in the building of the models.

The workflow that is used in MOE can be divided into four main steps:
Generate annotations: In MOE, molecules are stored in a database with their associated set of conformations. Several methods can be applied to expand the conformational space of organic molecules, ranging from molecular dynamics to stochastic methods and systematic search. A fragment-based high-throughput methodology is provided for the construction of conformation databases. For each molecular conformation, an annotation can be generated using a so-called Pharmacophore Pre-Processor. The goal is to encode all the possible structural features (H-bond donors and acceptors, including tautomers; anions and cations, including resonance forms; and hydrophobic and aromatic areas) that describe the ligand's pharmacophore. This tool recognizes the different conformations of a molecule by molecular graph comparison. However, its use is optional and annotation can be performed during the database search (with the obvious consequence of increased search times). Molecules can then be visualized using the Database Viewer.
Create a pharmacophore query: The definition of pharmacophores is done manually by applying so-called schemes using a Pharmacophore Query Editor. A template molecule is generally used for this purpose. In the MOE environment, a scheme is a collection of functions that define how each ligand is annotated. This is accessed via an SVL function. The default scheme is called PCH (Polarity-Charged-Hydrophobicity). New schemes can be created to represent certain molecules better, e.g. Planar-Polar-Charged-Hydrophobicity [49]. If the structural information of a receptor is not available, molecule alignments can be performed using an all-atom flexible alignment procedure that combines a force field and a 3D similarity function based on Gaussian descriptions of shape and pharmacophore features to produce an ensemble of possible alignments of a collection of small molecules [50].
Pharmacophore queries can be derived from the resulting set of aligned conformations of known actives. Currently, there is no automated tool in MOE that can generate pharmacophore models from a set of active/inactive molecules. As a consequence, there is no pharmacophore scoring or ranking or a validation method implemented in the program. Database search: The so-generated pharmacophore is then used for database mining. In MOE, molecules are stored in databases with pre-calculated conformations. No new conformations are generated during a database mining experiment. Compounds are aligned with the query using a rigid-body superposition, with no flexible adjustment of the rotatable bonds. Full or partial mapping of the pharmacophore features can be obtained, with user control of the pharmacophore matching rules. Excluded volumes can be used to refine a query further. Editing the pharmacophore query for refinement: The built-in query editor allows the user to refine a previously built pharmacophore model further.


9.4.2 Field-based methods

This perception certainly involves a degree of oversimplification, yet it allows easier coverage of different conformational states, which may otherwise result in completely different fields. The high complexity of 3D descriptors and the dependence on the binding mode and the alignment associated with field-based methods make these accurate but labor-intensive 3D QSAR methods less suitable in a virtual screening process, but undoubtedly useful tools for compound optimization.

CoMFA and CoMSIA
CoMFA and CoMSIA have already been discussed in detail in Chapter 2.

eXtended electron distribution (XED)
As an alternative to describing molecules by their structural features (substructural elements, functional groups), and similarly to CoMFA, this approach uses field points to describe the van der Waals and electrostatic minima and maxima that surround molecules and compares these field points. The field points that are used are derived from molecular electrostatic potential descriptors. The XED model is marketed by Cresset BioMolecular and forms the basis for the proprietary virtual screening technology FieldPrint™ [51]. The eXtended Electron Distribution (XED) force field was first described by Vinter [51]. This force field proposes a different electrostatic treatment of molecules to that found in classical molecular mechanics methods. In classical methods, charges are placed on atomic centers, whereas the XED force field explicitly represents electron anisotropy as an expansion of point charges around each atom [52]. This force field is now available in Cresset BioMolecular's software package. Apaya et al. were the first to describe the applicability of electrostatic extrema values in drug design, on a set of PDE III inhibitors [53]. Conformational expansion of molecules (also called conformation hunting in Cresset's XedeX™ software module) applies a Monte Carlo approach combined with fast molecular dynamics for ring conformations.
The minimization of the conformations is done using the XED force field, in order to assign correct charges. Based on the results obtained by Boström [54], this method performs comparably to other available methods when considering the RMS difference between the bound conformation and the closest conformation found, and the number of conformations found with an RMS value between 0.0 and 1.0 Å. Three types of field points can be calculated with XED: positive and negative extrema and van der Waals points (also called "sticky" points). These points are calculated by moving probes on a grid of points placed above the van der Waals molecular surface. Extrema values are found using a 3D simplex algorithm and coincident positions are filtered out [55]. The field points are color coded and their radius reflects the

depth of the energy well. Pairwise molecule comparison can be performed using these field points only. A score reflecting the degree of similarity of the two sets of field points is calculated. This avoids having to pre-align the molecules, as is the case for other field-based methods (CoMFA). As an extension to this, Cresset developed the FieldPrint technology to encode a molecule's complex 3D field pattern in a 1D string and store it in a database [56]. This database can be searched with the field print of any molecule to retrieve compounds that do not necessarily belong to the same chemical class. Cresset's database contains over 1,500,000 distinct commercially available compounds [57].
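To see how a 3D field pattern can be collapsed into an alignment-free 1D descriptor, in the spirit of (but not identical to) Cresset's proprietary encoding, one can histogram typed inter-field-point distances; because only internal distances are used, the descriptor is invariant to rotation and translation:

```python
import math
from itertools import combinations

def field_print(points, bins=(2.0, 4.0, 8.0)):
    """Collapse a set of (type, (x, y, z)) field points into an
    alignment-free 1D descriptor by counting typed point pairs per
    distance bin. A toy stand-in for the proprietary encoding; bin edges
    and point types are hypothetical."""
    counts = {}
    for (ta, ca), (tb, cb) in combinations(points, 2):
        d = math.dist(ca, cb)
        b = sum(d > edge for edge in bins)          # distance bin index
        key = (tuple(sorted((ta, tb))), b)
        counts[key] = counts.get(key, 0) + 1
    return counts

def print_similarity(fa, fb):
    """Tanimoto-like similarity on the two count dictionaries."""
    keys = set(fa) | set(fb)
    common = sum(min(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    total = sum(max(fa.get(k, 0), fb.get(k, 0)) for k in keys)
    return common / total if total else 0.0

# A molecule and a rigidly rotated copy give identical descriptors.
pts = [("neg", (0.0, 0.0, 0.0)), ("pos", (3.0, 0.0, 0.0)), ("vdw", (0.0, 3.0, 0.0))]
rot = [("neg", (0.0, 0.0, 0.0)), ("pos", (0.0, 3.0, 0.0)), ("vdw", (-3.0, 0.0, 0.0))]
```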

9.4.3 Pharmacophore fingerprints

Pharmacophore fingerprints are defined as the binary-encoded information (key) about the presence or absence of pharmacophore features and distances in a single molecule or a compound collection. This concept can be extended to include the occurrence counts of distinct pharmacophores. Usually the focus is put on two- to four-point fingerprints, but larger numbers can be used, and the utilization of up to nine-point pharmacophores has been described. Pharmacophore triplets are widely used, as traditionally they have been considered to be most effective in terms of information content versus complexity. The pharmacophore space is binned, and the method of binning and the bin size are of significant importance. The most common application of pharmacophore fingerprints is in the area of diversity and similarity calculations and compound library focusing and selection, but 3D pharmacophore descriptors can also be used for the analysis of structure-activity relationships, in decision trees and in QSAR. Fingerprint focusing methods commonly use similarity coefficients such as Tanimoto to retrieve or classify compounds of interest out of a typically large collection.
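A minimal three-point fingerprint with binned distances, compared by the Tanimoto coefficient, can be sketched as follows; the bin size and the key canonicalization are deliberately simplified relative to production implementations:

```python
import math
from itertools import combinations

def triplet_fingerprint(features, bin_size=2.0):
    """Binary three-point pharmacophore fingerprint: every triplet of
    labeled features (label, (x, y, z)) becomes an on-bit keyed by the
    sorted labels plus the sorted, binned inter-feature distances. The
    canonicalization here is a simplification (it does not tie each
    distance to a specific label pair)."""
    fp = set()
    for trip in combinations(features, 3):
        labels = tuple(sorted(t[0] for t in trip))
        dists = tuple(sorted(int(math.dist(a[1], b[1]) // bin_size)
                             for a, b in combinations(trip, 2)))
        fp.add((labels, dists))
    return fp

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient on binary fingerprints stored as sets of
    on-bits: |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical four-feature molecule: 4 choose 3 = 4 triplet bits.
mol = [("HBD", (0.0, 0.0, 0.0)), ("HBA", (3.0, 0.0, 0.0)),
       ("AR", (0.0, 4.0, 0.0)), ("HYD", (5.0, 5.0, 0.0))]
fp = triplet_fingerprint(mol)
```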

9.4.4 ChemX/ChemDiverse, PharmPrint, OSPPREYS, 3D Keys, Tuplets

Numerous examples of 3D fingerprint methods have been described in the literature, but one of the most popular applications is ChemX/ChemDiverse from Chemical Design/Oxford Molecular (now Accelrys). Another example, of an in-house pharmacophore fingerprint construction, is PharmPrint by Affymax. The Oriented Substituent Pharmacophore PRopErtY Space (OSPPREYS) approach, introduced by Martin and Hoeffel, is in software terms an extension of CCG's MOE package, written using SVL. The 3D oriented substituent pharmacophores are aimed towards a better representation of diversity and similarity in combinatorial libraries in the 3D pharmacophore space. Combinatorial library design often operates only on substituents rather than on the final products, as the complications related to the

Ligand-based pharmacophore modeling 227 conformational coverage in the 3D space and the scaffold dependency limit the product-based approaches to smaller libraries. The 3D oriented substituent pharmacophores add two more points and the corresponding distances to each substituent pharmacophore which represent the relationship of the substituents in the product with only little additional information. The fingerprints permit the creation of property space by multidimensional scaling (MDS) and, since scaffold independent, can be stored separately and applied to different libraries. The Accelrys implementation of pharmacophore fingerprint descriptors is called 3D Keys. This application is based on standard Catalyst feature definitions and is a part of the Cerius2 software package. The collection of all combinations of three (triplets) or four (quadruplets) features in 3D space over all conformations of all compounds in the supplied data set is computed. Each triplet or quadruplet is characterized by a set of feature types and the corresponding interfeature distances. Optionally, appearance counts can be included in a fingerprint. Using these fingerprints, the property space of molecules can explored on the basis of pharmacophore diversity after MDS. These fingerprint descriptors can be used for diverse and similar elections, clustering, library comparison and optimization or applied to decision trees and QSAR. Finally, relevant pharmacophore hypotheses can be extracted from the keys and used for database mining. 3D Keys can be derived both from small molecules and from three-dimensional receptor binding site features. Another, similar application is Tripos Tuplets, which handles two to four point fingerprints, with the option of requesting the presence of certain features or substructures in the fingerprint. Tuplets can be used for clustering, can provide the basis for similarity selections and can utilize both ligand and target information. 
Tuplets can be applied for the purpose of identifying alternative binding modes as well as for deriving hypotheses from compounds, UNITY queries or binding pockets, which then can be analyzed using multiple similarity measurements.

9.4.5 Other methods

SCAMPI

Most of the above-mentioned pharmacophore generation techniques use a small number of user-selected molecules, commonly called a data set, to derive the pharmacophore models. With the advent of high-throughput methods, there was a need to extract pharmacophore information from much larger data sets [58]. SCAMPI (Statistical Classification of Activities of Molecules for Pharmacophore Identification) is a program developed in the C language by Chen et al. [59]. According to the authors, it allows the use of data sets of approximately 1000-2000 compounds. The

SCAMPI program has been implemented so that users can visualize the molecules and the generated pharmacophores in the Sybyl environment. Two different, but connected, spaces are searched by the program:

• The conformational space, representing all possible conformations of the compounds;
• The correspondence space, representing all the possible correspondences of chemical features and configurations among the different compounds.
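The correspondence search, described in the next paragraph, partitions the data set with a Student's t-test over the presence or absence of a molecular descriptor. A minimal, stdlib-only Python sketch of such a split criterion (the data structures are hypothetical, not SCAMPI's actual ones) could look like this:

```python
from math import sqrt

def t_stat(x, y):
    """Pooled two-sample Student's t statistic for two lists of activities."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    sp = sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    if sp == 0.0:
        return 0.0
    return (mx - my) / (sp * sqrt(1.0 / nx + 1.0 / ny))

def best_split(molecules, activities, descriptors):
    """Pick the descriptor whose presence/absence best separates the
    activities, i.e. the one with the highest |t| value.
    molecules: list of sets of descriptor keys present in each molecule."""
    best = None
    for d in sorted(descriptors):
        with_d = [a for m, a in zip(molecules, activities) if d in m]
        without_d = [a for m, a in zip(molecules, activities) if d not in m]
        if len(with_d) < 2 or len(without_d) < 2:
            continue  # cannot estimate a variance from fewer than two values
        t = abs(t_stat(with_d, without_d))
        if best is None or t > best[1]:
            best = (d, t)
    return best
```

In SCAMPI the p-value of this statistic is additionally Bonferroni-corrected for the number of descriptors tested; that correction is omitted in this sketch.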

As opposed to other pharmacophore generation methods, which treat the conformer expansion and pharmacophore identification phases separately, SCAMPI combines the two searches and lets them depend on each other. SCAMPI reads multiple MOL2 files containing the structures and a data file containing the biological activities. The conformational expansion of the molecules is done by random search techniques, with no post-clustering; this search is performed in Cartesian and internal coordinate space. The pharmacophore points are represented by chemical features, in addition to specific atoms such as nitrogen, oxygen, sulfur, phosphorus, fluorine and the other halogens. The correspondence search uses a recursive partitioning algorithm, comparable to the FIRM and SCAM programs [59]. The split criterion used by SCAMPI to partition the whole data set into multiple subsets is a Student's t-test with a Bonferroni-corrected p-value. The test is based on the presence or absence of a feature, also called a molecular descriptor; absence of a feature means that the feature could not be identified in any of the generated conformations. The molecular descriptor that gives the highest t value is the one selected for the split. Both a substructure-based and a rule-based search method have been implemented, for the detection of features represented by groups of atoms and of features represented by single atoms. The pharmacophore build-up procedure is similar to that in Catalyst HipHop: two-point pharmacophores, characterized by two features and a binned distance, are searched first; a new point is added only if one is found, and the process continues until no more pharmacophore points can be found. Pharmacophores are already recorded in the conformer generation phase. The activity of the molecules is handled in a semi-quantitative manner.

THINK

THINK (To Have Information and Knowledge) is a modular system developed by Treweren Consultants to assist with lead generation and optimization.
This system allows structure-based virtual screening, data analysis and pharmacophore profiling, and is organized into different modules. Around a core module, which provides chemical structure reading and writing and command scripts for batch and server jobs, there are a 2D module for data analysis and de novo derivative generation, a 3D module for 3D coordinate and conformer generation, a pharmacophore module for pharmacophore perception, a Microsoft GUI module available only for the Microsoft Windows version of the program, and a Screening Database module consisting of Treweren's current collection of drug-like molecules.

In THINK, molecules can be built using a 2D editor, and the program reads MOL, SD, SMILES and PDB files. Three-dimensional coordinates of molecules, when not available from the input file, are generated automatically by the program itself. Two classical methods are available in THINK for the conformational expansion of molecules: systematic search and random search. When the systematic search option is used, a contacts check avoids high-energy conformations and reduces the overall processing time. The random method uses a random number generator to select conformations from within the estimated total number of conformations. The implementation does not prevent identical conformations, arising from symmetry, from being output. These conformations are used in the pharmacophore generation and site search modules. The so-called pharmacophore centers use classical chemical functions such as donors, acceptors, acids, bases, hydrophobic functions and positive and negative charges; metal ions and electron donor lone pairs are also possible centers, and users can define their own functions. THINK considers fuzzy two-, three- or four-center pharmacophores; if a given molecule contains more than three or four centers, then all possible groups of two to four centers are taken. The distances (including a tolerance) between the pharmacophore centers are measured exactly and then allocated to distance bins, each distance being represented by the bin into whose range it falls. The distance bins transform the distances within each pharmacophore into a set of integers that give a more compact representation of the pharmacophores. Pharmacophore profiles represent the set of all pharmacophores found across the conformers of a series of conformers or a series of molecules; each pharmacophore added to the profile has to be unique.
This profile helps to show the spread of pharmacophores across the conformational space of a molecule or a series of molecules. No summing of the exhibited pharmacophores or normalization is done. There is no direct graphical representation of pharmacophore models. The pharmacophores can be saved to a file in CSV format that can be imported into a MySQL or Oracle database. This approach permits the use of standard SQL queries to extract common pharmacophores within sets of molecules, helping to discriminate between active and inactive compounds. The Receptor Site search module of THINK uses these pharmacophores to quickly eliminate conformations of molecules that cannot bind to a receptor site.

Feature trees

Feature trees have been described by Rarey and Dixon as a new way of analyzing the similarity of molecules [60]. This approach is based on building trees that represent molecules: the trees describe the major building blocks of molecules, in addition to their overall arrangement, and they are conformation independent. Different types of pairwise comparison algorithms are available to compare the trees of different molecules.
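Returning to THINK's binned-distance representation described above: mapping measured inter-center distances to compact integer codes can be sketched as follows. The bin edges below are hypothetical illustrations, not THINK's actual defaults:

```python
# Hypothetical bin edges in angstroms; a distance is represented by the
# index of the bin into whose range it falls (the last bin is open-ended).
BIN_EDGES = [2.0, 3.0, 4.5, 6.0, 8.0, 11.0, 15.0]

def distance_bin(d):
    """Return the integer bin index for a distance in angstroms."""
    for i, edge in enumerate(BIN_EDGES):
        if d < edge:
            return i
    return len(BIN_EDGES)

def encode_pharmacophore(centers, distances):
    """Compact encoding of one two- to four-center pharmacophore:
    sorted center types plus the bin index of each inter-center distance."""
    return (tuple(sorted(centers)), tuple(distance_bin(d) for d in distances))
```

A profile, in these terms, would simply be the set of such encodings collected over all conformers of a series of molecules, each one added only if not already present.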

9.5 Conclusion

In this chapter, we have tried to demonstrate the great diversity of software tools available to the researcher in the area of ligand-based pharmacophore modeling. With the expansion of combinatorial chemistry techniques and the need to manipulate very large amounts of real or virtual chemical data, pharmacophore-based techniques have proved their potential in the areas of database mining with pharmacophore queries and library design using pharmacophore fingerprints. A lot of effort has been invested over the past 20 years in the optimization of the different steps of pharmacophore generation: molecular editing and 3D representation, combinatorial enumeration, conformational expansion and pharmacophore perception methodologies for small drug-like data sets. However, some areas of ligand-based pharmacophore modeling still have potential for improvement:

• Validation: most of the available packages approach validation only from a given angle. The problems of validation are addressed elsewhere in this book.
• Chemical space coverage: it can be considered a limitation of the majority of today's ligand-based approaches that only small sets of chemical structures (training or learning sets) are used to derive pharmacophore models. Consequently, these sets cover only a small portion of chemical space, and the performance of the resulting models generally tends to decrease when evaluating large data sets or other chemical classes of compounds. As there is no unique answer to complex problems such as multiple independent data, large and diverse data sets or receptor flexibility, so-called ensemble pharmacophores, consisting of multiple models generated from different subsets of large sets of chemical structures, could represent an approach worth pursuing in the future.

Exercise: To generate a pharmacophore model for molecules selected from the literature using PharmaGist [61,62].
Requirements:
1. Operating system: Windows (7, 8 and/or 10)
2. Freeware for non-commercial use: the online server PharmaGist (available at http://
Note: PharmaGist is a free webserver for developing pharmacophore models, but it has a limitation on the number of input molecules: a maximum of only 32 ligands is allowed.

Step-by-step protocol:
1. Prepare the ligands (not more than 32 molecules) in MOL2 format.
2. Go to the PharmaGist webserver.
3. Choose file: upload the input molecules in MOL2 format.
4. Select the number of output pharmacophores.
5. Provide an email address (as PharmaGist is a free webserver, the user has to provide an email address, to which the results, including the pharmacophores and other validation parameters, will be communicated).
6. Set a key molecule (with this option the user can set the first ligand in the input to serve as a key/pivot molecule; in this case all the other ligands are aligned onto it and all the pharmacophore candidates include this key ligand. By default, the algorithm iteratively selects each input ligand to serve as a key/pivot).
7. Select the minimum number of features in the pharmacophore (the minimal number of spatially distinct features in the reported pharmacophore candidates).
8. Select the feature weighting options (the user can modify the weights of the features in the scoring function of the algorithm).
9. Select a user-defined feature (if needed, an additional feature type can be defined by the user; a feature file in the appropriate format has to be prepared).
10. Results: the results are emailed to the address provided above.
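Since PharmaGist accepts at most 32 ligands, it can be useful to sanity-check a multi-molecule MOL2 file before uploading. A minimal sketch, relying only on the standard Tripos `@<TRIPOS>MOLECULE` record marker that begins each molecule in a MOL2 file:

```python
PHARMAGIST_MAX_LIGANDS = 32  # input limit stated by the PharmaGist webserver

def count_mol2_molecules(lines):
    """Count molecule records in multi-molecule MOL2 content
    (any iterable of lines, e.g. an open file object)."""
    return sum(1 for line in lines if line.strip() == "@<TRIPOS>MOLECULE")

def within_pharmagist_limit(lines, limit=PHARMAGIST_MAX_LIGANDS):
    """True if the MOL2 content holds no more molecules than the limit."""
    return count_mol2_molecules(lines) <= limit
```

Typical use would be `with open("ligands.mol2") as fh: ok = within_pharmagist_limit(fh)`, where `ligands.mol2` is whatever file you intend to upload.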

Unsolved exercises for practice
1. To generate qualitative pharmacophore models of the COX-1 and COX-2 enzymes using the structures of approved NSAIDs. (Hint: use the structures of FDA-approved coxibs to generate a COX-2 specific pharmacophore.)
2. To identify some new "drug-like" proton pump inhibitors using a pharmacophore-based virtual screening approach. (Hint: develop a pharmacophore model for the proton pump using its various FDA-approved inhibitors given in medicinal chemistry textbooks; screen any freely available database using this pharmacophore as the query, followed by a Lipinski filter.)
3. To identify carbonic anhydrase inhibitors selective towards isoforms IX and XII over isoforms I and II using multi-pharmacophore guided hierarchical virtual screening. (Hint: develop pharmacophore models for the various isoforms of carbonic anhydrase using their reported inhibitors, and use these models in combination for virtual screening to screen out the selective ones.)
4. To study the blood-brain barrier (BBB) penetration capacity of various CNS-active drugs using pharmacophore modeling of a transporter protein present in this barrier.
5. To develop a pharmacophore-based filter for the common enzyme involved in resistance to beta-lactam antibiotics. (Hint: develop a pharmacophore model using the structures of inhibitors of the enzyme involved in the resistance.)


References
[1] P. Ehrlich, Über den jetzigen Stand der Chemotherapie, Ber. Dtsch. Chem. Ges. 42 (1909) 17-47.
[2] K. Poptodorov, T. Luu, R.D. Hoffmann, Pharmacophore model generation software tools, Methods Princ. Med. Chem. 32 (2006) 17.
[3] J. Mason, A. Good, E. Martin, 3-D pharmacophores in drug discovery, Curr. Pharm. Des. 7 (2001) 567-597.
[4] M.J. McGregor, S.M. Muskal, Pharmacophore fingerprinting. 2. Application to primary library design, J. Chem. Inf. Comput. Sci. 40 (2000) 117-125.
[5] J.H. Van Drie, D. Weininger, Y.C. Martin, ALADDIN: an integrated tool for computer-assisted molecular design and pharmacophore recognition from geometric, steric, and substructure searching of three-dimensional molecular structures, J. Comput. Mol. Des. 3 (1989) 225-251.
[6] J. Van Drie, R. Nugent, Addressing the challenges posed by combinatorial chemistry: 3D databases, pharmacophore recognition and beyond, SAR QSAR Environ. Res. 9 (1998) 1-21.
[7] J. Snyder, S. Rao, K. Koehler, A. Vedani, R. Pelliciari, APOLLO pharmacophores and the pseudoreceptor concept, Trends QSAR Mol. Model. 92 (1993) 44-51.
[8] P.W. Finn, L.E. Kavraki, J.-C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, et al., RAPID: randomized pharmacophore identification for drug design, Comput. Geometry 10 (1998) 263-272.
[9] A. Smellie, S.L. Teig, P. Towbin, Poling: promoting conformational variation, J. Comput. Chem. 16 (1995) 171-187.
[10] O. Güner, History and evolution of the pharmacophore concept in computer-aided drug design, Curr. Top. Med. Chem. 2 (2002) 1321-1332.
[11] C.-G. Wermuth, T. Langer, Pharmacophore identification, 3D QSAR in Drug Design, Theory Methods Appl. (1993) 117-136.
[12] Y. Kurogi, O.F. Güner, Pharmacophore modeling and three-dimensional database searching for drug design using Catalyst, Curr. Med. Chem. 8 (2001) 1035-1055.
[13] V. Golender, B. Vesterman, E. Vorpagel, APEX-3D expert system for drug design, Network Science 2 (1996).
[14] N. Triballeau, H. Bertrand, F. Acher, Are you sure you have a good model? Methods Princ. Med. Chem. 32 (2006) 325.
[15] A. Evers, G. Hessler, H. Matter, T. Klabunde, Virtual screening of biogenic amine-binding G-protein coupled receptors: comparative evaluation of protein- and ligand-based virtual screening protocols, J. Med. Chem. 48 (2005) 5448-5465.
[16] A.C. Good, D.L. Cheney, Analysis and optimization of structure-based virtual screening protocols (1): exploration of ligand conformational sampling techniques, J. Mol. Graph. Model. 22 (2003) 23-30.
[17] C. Lemmen, T. Lengauer, Computational methods for the structural alignment of molecules, J. Comput. Mol. Des. 14 (2000) 215-232.
[18] S.-Y. Yang, Pharmacophore modeling and applications in drug discovery: challenges and recent advances, Drug Discov. Today 15 (2010) 444-450.
[19] D. Schneidman-Duhovny, O. Dror, Y. Inbar, R. Nussinov, H.J. Wolfson, Deterministic pharmacophore detection via multiple flexible alignment of drug-like molecules, J. Comput. Biol. 15 (2008) 737-754.
[20] E.J. Gardiner, P.J. Artymiuk, P. Willett, Clique-detection algorithms for matching three-dimensional molecular structures, J. Mol. Graph. Model. 15 (1997) 245-253.
[21] P.J. Goodford, A computational procedure for determining energetically favorable binding sites on biologically important macromolecules, J. Med. Chem. 28 (1985) 849-857.
[22] J.W.M. Nissink, M.L. Verdonk, J. Kroon, T. Mietzner, G. Klebe, Superposition of molecules: electron density fitting by application of Fourier transforms, J. Comput. Chem. 18 (1997) 638-645.
[23] M. Barbany, H.G.D. Terán, F. Sanz, J. Villà-Freixa, Towards a MIP-based alignment and docking in computer-aided drug design, Proteins: Struct., Funct., Bioinf. 56 (2004) 585-594.

[24] M.C. Pitman, W.K. Huber, H. Horn, A. Krämer, J.E. Rice, W.C. Swope, FLASHFLOOD: a 3D field-based similarity search and alignment method for flexible molecules, J. Comput. Mol. Des. 15 (2001) 587-612.
[25] F. Melani, P. Gratteri, M. Adamo, C. Bonaccini, Field interaction and geometrical overlap: a new simplex and experimental design based computational procedure for superposing small ligand molecules, J. Med. Chem. 46 (2003) 1359-1371.
[26] J. Mills, I.J. de Esch, T.D.J. Perkins, P.M. Dean, SLATE: a method for the superposition of flexible ligands, J. Comput. Mol. Des. 15 (2001) 81-96.
[27] T. Hurst, Flexible 3D searching: the directed tweak technique, J. Chem. Inf. Comput. Sci. 34 (1994) 190-196.
[28] V. Černý, Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm, J. Optim. Theory Appl. 45 (1985) 41-51.
[29] W. Spendley, G.R. Hext, F.R. Himsworth, Sequential application of simplex designs in optimisation and evolutionary operation, Technometrics 4 (1962) 441-461.
[30] G. Jones, P. Willett, R.C. Glen, A genetic algorithm for flexible molecular overlay and pharmacophore elucidation, J. Comput. Mol. Des. 9 (1995) 532-549.
[31] S. Handschuh, M. Wagener, J. Gasteiger, Superposition of three-dimensional chemical structures allowing for conformational flexibility by a hybrid method, J. Chem. Inf. Comput. Sci. 38 (1998) 220-232.
[32] O.F. Güner, Pharmacophore Perception, Development, and Use in Drug Design, International University Line, 2000.
[33] Y. Patel, V.J. Gillet, G. Bravi, A.R. Leach, A comparison of the pharmacophore identification programs: Catalyst, DISCO and GASP, J. Comput. Mol. Des. 16 (2002) 653-681.
[34] G.-Y. Liu, X.-L. Ju, J. Cheng, Z.-Q. Liu, 3D-QSAR studies of insecticidal anthranilic diamides as ryanodine receptor activators using CoMFA, CoMSIA and DISCOtech, Chemosphere 78 (2010) 300-306.
[35] P.S. Galatin, D.J. Abraham, A nonpeptidic sulfonamide inhibits the p53-mdm2 interaction and activates p53-dependent transcription in mdm2-overexpressing cells, J. Med. Chem. 47 (2004) 4163-4165.
[36] D. Jung, J. Floyd, T.M. Gund, A comparative molecular field analysis (CoMFA) study using semiempirical, density functional, ab initio methods and pharmacophore derivation using DISCOtech on sigma 1 ligands, J. Comput. Chem. 25 (2004) 1385-1399.
[37] G. Jones, P. Willett, R.C. Glen, A.R. Leach, R. Taylor, Development and validation of a genetic algorithm for flexible docking, J. Mol. Biol. 267 (1997) 727-748.
[38] G. Jones, P. Willett, R. Glen, GASP: genetic algorithm superimposition program, in: Pharmacophore Perception, Development, and Use in Drug Design, International University Line, La Jolla, CA, 2000, pp. 85-106.
[39] S.-K. Lin, Pharmacophore perception, development and use in drug design. Edited by Osman F. Güner, Molecules 5 (2000) 987-989.
[40] H. Yuan, A.P. Kozikowski, P.A. Petukhov, CoMFA study of piperidine analogues of cocaine at the dopamine transporter: exploring the binding mode of the 3α-substituent of the piperidine ring using pharmacophore-based flexible alignment, J. Med. Chem. 47 (2004) 6137-6143.
[41] N.J. Richmond, C.A. Abrams, P.R. Wolohan, E. Abrahamian, P. Willett, R.D. Clark, GALAHAD: 1. Pharmacophore identification by hypermolecular alignment of ligands in 3D, J. Comput. Aided Mol. Des. 20 (2006) 567-587.
[42] N.J. Richmond, P. Willett, R.D. Clark, Alignment of three-dimensional molecules using an image recognition algorithm, J. Mol. Graph. Model. 23 (2004) 199-209.
[43] J. Sutter, J. Li, A.J. Maynard, A. Goupil, T. Luu, K. Nadassy, New features that improve the pharmacophore tools from Accelrys, Curr. Comput. Drug Des. 7 (2011) 173-180.
[44] B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J. States, S. Swaminathan, M. Karplus, CHARMM: a program for macromolecular energy, minimization, and dynamics calculations, J. Comput. Chem. 4 (1983) 187-217.
[45] D. Barnum, J. Greene, A. Smellie, P. Sprague, Identification of common functional configurations among molecules, J. Chem. Inf. Comput. Sci. 36 (1996) 563-571.

[46] S. Toba, J. Srinivasan, A.J. Maynard, J. Sutter, Using pharmacophore models to gain insight into structural binding and virtual screening: an application study with CDK2 and human DHFR, J. Chem. Inf. Model. 46 (2006) 728-735.
[47] Schrödinger Release 1: Maestro, Schrödinger, LLC, New York, NY, 2018.
[48] W.L. Jorgensen, D.S. Maxwell, J. Tirado-Rives, Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids, J. Am. Chem. Soc. 118 (1996) 11225-11236.
[49] C. Choudhury, G.N. Sastry, Pharmacophore modelling and screening: concepts, recent developments and applications in rational drug design, in: Structural Bioinformatics: Applications in Preclinical Drug Discovery Process, Springer, 2019, pp. 25-53.
[50] P. Labute, C. Williams, M. Feher, E. Sourial, J.M. Schmidt, Flexible alignment of small molecules, J. Med. Chem. 44 (2001) 1483-1490.
[51] J.G. Vinter, Extended electron distributions applied to the molecular mechanics of some intermolecular interactions, J. Comput. Mol. Des. 8 (1994) 653-668.
[52] G. Chessari, C.A. Hunter, C.M. Low, M.J. Packer, J.G. Vinter, C. Zonta, An evaluation of force-field treatments of aromatic interactions, Chem. Eur. J. 8 (2002) 2860-2867.
[53] R.P. Apaya, B. Lucchese, S.L. Price, J.G. Vinter, The matching of electrostatic extrema: a useful method in drug design? A study of phosphodiesterase III inhibitors, J. Comput. Mol. Des. 9 (1995) 33-43.
[54] J. Boström, Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools, J. Comput. Mol. Des. 15 (2001) 1137-1152.
[55] J.G. Vinter, K. Trollope, Multiconformational composite molecular potential fields in the analysis of drug action. I. Methodology and first evaluation using 5-HT and histamine action as examples, J. Comput. Mol. Des. 9 (1995) 297-307.
[56] T. Cheeseright, M. Mackey, S. Rose, A. Vinter, Molecular field technology applied to virtual screening and finding the bioactive conformation, Expert Opin. Drug Discov. 2 (2007) 131-144.
[57] S. Rose, A. Vinter, Molecular field technology and its applications in drug discovery, Innov. Pharm. Technol. 23 (2007) 14-18.
[58] G.S. Sittampalam, S.D. Kahl, W.P. Janzen, High-throughput screening: advances in assay technologies, Curr. Opin. Chem. Biol. 1 (1997) 384-391.
[59] X. Chen, A. Rusinko III, A. Tropsha, S.S. Young, Automated pharmacophore identification for large chemical data sets, J. Chem. Inf. Comput. Sci. 39 (1999) 887-896.
[60] M. Rarey, J.S. Dixon, Feature trees: a new molecular similarity measure based on tree matching, J. Comput. Mol. Des. 12 (1998) 471-490.
[61] D. Schneidman-Duhovny, O. Dror, Y. Inbar, R. Nussinov, H.J. Wolfson, PharmaGist: a webserver for ligand-based pharmacophore detection, Nucleic Acids Res. 36 (2008) W223-W228.
[62] Y. Inbar, D. Schneidman-Duhovny, O. Dror, R. Nussinov, H.J. Wolfson, Deterministic pharmacophore detection via multiple flexible alignment of drug-like molecules, in: Annual International Conference on Research in Computational Molecular Biology, Springer, 2007, pp. 412-429.


Fragment based drug design: Connecting small substructures for a bioactive lead

10.1 Introduction

Various innovations in combinatorial chemistry and high throughput screening have led to the expansion of compound databases comprising diverse scaffolds, to expedite the drug discovery and development process [1]. Many compounds never make it to clinical use, because the whole process faces thousands of failures, and success requires highly sophisticated technology as well as manpower with great scientific minds. The discovery of a new drug molecule relies on the search for an appropriate chemical lead that can elicit the desired activity towards the biological target. However, such leads need to be further optimized so as to improve their success rate in clinical trials. High throughput screening (HTS), which involves screening a large number of compounds against a biological target of interest, is considered a productive approach to obtain hit candidates that can later be regarded as chemical leads. Besides the successful applications of HTS in drug development, the approach has some limitations: for instance, it has a very low hit rate, covers only a narrow portion of drug-like chemical space, and produces hits with unknown mechanisms [2]. In 1996, Fesik and co-workers discovered ligands of the FK506 binding protein, the target of the immunosuppressant FK506, using the FBDD (fragment based drug design) approach as an alternative to HTS [3]. The FBDD technique subsequently revolutionized the drug discovery process: to date about thirty drug candidates derived using FBDD are in clinical trials, and two FBDD-derived drugs, vemurafenib and venetoclax, have been approved by the FDA [4]. FBDD is a technique that involves numerous strategies to screen small organic fragments for binding to a particular site on the target protein.

There are various factors which play an important role in the effective binding of these fragments to the target of interest. For instance, the molecular weight of the fragments is crucial: fragments should be large enough to probe key binding interactions with the protein residues and small enough to reduce the chances of unfavorable steric or electrostatic interactions. Fragment hits are generally selected according to the 'Rule of Three' (MW ≤ 300, hydrogen bond donors ≤ 3, hydrogen bond acceptors ≤ 3, cLogP ≤ 3, NROT ≤ 3, and PSA ≤ 60 Å²) [5]. To simplify the structure-activity relationship (SAR) analysis, and accordingly accelerate optimization, the selected fragments should be chemically less complex than larger compounds. The FBDD approach employs significant biophysical and structural characterization of fragments at the beginning of the drug discovery process, which yields higher-content data at lower throughput than the traditional HTS technique. The low throughput and high information content represent a challenge to integrating FBDD into the drug discovery process. The appropriate utilization of FBDD in drug discovery depends upon the effective use of biophysical, biostructural and biochemical approaches in lead identification and optimization. Bioinformatic tools play a significant role in modeling the binding of these fragments, helping to integrate the data obtained from these sources and guiding decision-making [6].
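The Rule of Three mentioned in the introduction is straightforward to express as a filter. In the sketch below the descriptor values are assumed to be precomputed (in practice they would come from a cheminformatics toolkit), and the dictionary keys are hypothetical names, not any toolkit's API:

```python
# Rule-of-Three limits: molecular weight, H-bond donors/acceptors,
# calculated logP, rotatable bonds, polar surface area.
RULE_OF_THREE = {"mw": 300, "hbd": 3, "hba": 3, "clogp": 3, "nrot": 3, "psa": 60}

def passes_rule_of_three(descriptors):
    """Return True if every descriptor is within its Rule-of-Three limit."""
    return all(descriptors[name] <= limit
               for name, limit in RULE_OF_THREE.items())
```

Applied to a library, such a filter simply keeps the fragment-sized compounds and discards drug-sized ones.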

10.2 General strategy for fragment based drug design

FBDD involves the identification of hot spots in the protein that can be crucial for the binding of a ligand in the active site. The identification of these hot spots aids in designing suitable ligands that can, in turn, interact with that particular region of the protein. By analyzing the various receptor-ligand interactions, such as hydrogen bonding or hydrophobic interactions, a suitable fragment can be identified. Once a fragment displays optimal interactions with the binding site of the protein, it can be grown, linked, or merged to design a potential ligand. The design of the ligand involves the generation of small libraries of molecules based on the fragment hit, to explore substitutions that influence the affinity and selectivity of the designed ligand for the binding site of the protein. Creating libraries with a large number of fragments increases the statistical probability of identifying a suitable fragment. A brief outline of the FBDD approach is displayed in Fig. 10.1.

10.2.1 Techniques for finding fragments

A fundamental tenet of FBDD is that small and simple fragments are able to sample a diverse chemical space [7,8]. Fragment libraries are commonly designed to maximize their chemical diversity, whilst limiting certain physical properties of the compounds [9] and ensuring that they are amenable to the types of assays that will be used to identify hits. Different aspects of library design include the optimal size of the fragments in the library, the number of fragments needed to sample an appropriate chemical diversity, and the shape and 3D character of the fragments to be screened [10]. Several detailed descriptions of the process of assembling a general fragment-screening library and assessing its quality and performance have been reported [11]. The design of fragment libraries for FBDD includes some general considerations also used for HTS [12]. Additionally, some other considerations, unique to fragment libraries, are taken into account. One such example is a smaller size than typical HTS compounds, because fragments will ultimately be elaborated. To restrict the molecules to within the acceptable limits of these unique considerations, Jhoti and colleagues proposed the "rule of three" (as discussed in the introduction section) [9], following Lipinski's famous "rule of five" [13].

Figure 10.1 Fragment based drug design approach.

In the last few years, various efforts have been made to find appropriate fragments for effective drug design. Scientists at Vertex Pharmaceuticals computationally dissected known drugs into fragments corresponding to molecular frameworks and side chains; these analyses demonstrated that most drugs can be represented by a relatively small set of molecular architectures [14]. From this, they constructed a small library of fewer than 200 fragments specifically designed for NMR screening, called a SHAPES library [15]. The compounds were chosen not just to represent fragments found in known drugs but also to be highly soluble, nonreactive, and commercially available. Lewell and colleagues described the use of a "retrosynthetic combinatorial analysis procedure" (RECAP) to identify recurring fragments from known drugs [16]. Fesik and colleagues have proposed enriching fragment libraries with "privileged molecules," such as biphenyls, that have been experimentally shown to bind to proteins frequently [17]. Fragment library design has been reviewed more recently, with the computational deconstruction of drugs into fragments remaining an active research focus [18]. For targets with fairly rigid binding sites, "virtual screening" methods can be used to augment default libraries with fragments selected on the basis of their structural complementarity to the protein; one of the first methods described was the program DOCK [19].
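Diversity-driven library design of the kind described above is often implemented with a greedy "MaxMin" picker: repeatedly add the compound most dissimilar to everything already selected. A stdlib-only sketch over hypothetical binary fingerprints (stored as Python sets of "on" bit indices):

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def maxmin_pick(fingerprints, k):
    """Greedy MaxMin diverse-subset selection: return indices of up to k
    compounds, each chosen to maximize its distance to the picked set."""
    if not fingerprints:
        return []
    picked = [0]  # seed with the first compound
    while len(picked) < min(k, len(fingerprints)):
        best_idx, best_score = None, -1.0
        for i, fp in enumerate(fingerprints):
            if i in picked:
                continue
            # distance to the picked set = 1 - similarity to nearest member
            d = min(1.0 - tanimoto(fp, fingerprints[j]) for j in picked)
            if d > best_score:
                best_idx, best_score = i, d
        picked.append(best_idx)
    return picked
```

The same scheme works with any fingerprint type; only the similarity function changes.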
The predictive utility of docking methods decreases with the conformational mobility of the protein and ligand, so these methods are ideally suited to analyzing small, inflexible fragments [20]. In fact, one of the pioneering docking programs, LUDI, was designed specifically to identify and subsequently combine

fragments that complement a user-specified site on a protein [21]. With a collection of actual fragments in hand, several screening methods are available, including functional screening, nuclear magnetic resonance (NMR), mass spectrometry (MS, both noncovalent and covalent), and X-ray crystallography. A typical fragment screening approach is displayed in Fig. 10.2. To improve the quality of the fragment identification step, it is important to characterize the types of functional groups that will be used to develop structure-activity relationships (SAR) and the possible binding modes, so that the designed ligand can be accommodated in the binding site of the protein. The ligand

Figure 10.2 Fragment screening protocol.

selection protocol follows the rules of combinatorial chemistry and preferentially identifies ligands that are synthetically feasible. High-quality ligands with various substitutions can be produced by considering either the shape, size, and physicochemical properties of the binding pocket or the pharmacophoric features of known active ligands.

10.2.2 Converting fragments into hits and leads
Fragment growing helps improve the drug-like properties of a selected fragment and improves the ligand interaction profile of the newly generated compound. The binding site of the protein contains several residues that create hydrophobic or hydrophilic environments. Multiple fragments with different physicochemical properties that occupy and favor particular regions of the binding pocket can be identified and coupled together to interact with the binding pocket of the enzyme. Three broad strategies for converting fragments into hits and leads are fragment optimization; fragment growing, merging, and/or linking; and in situ fragment assembly. Fragment optimization closely resembles traditional medicinal chemistry, in which various substitutions or expansions are made to the initial hit (or fragment, in this case) in order to improve affinity and other properties. Fragment merging and linking generally involve combining elements from a fragment with elements from a known substrate, inhibitor, or another fragment to create a hybrid molecule [22]. This approach can improve potency as well as physicochemical or ADME (absorption, distribution, metabolism, and excretion) properties. Finally, in situ fragment assembly, which encompasses areas such as dynamic combinatorial chemistry, uses the target as a template for the synthesis of inhibitors from fragments [23]. These three strategies serve as a useful organizing principle; in practice, they have considerable overlap. For example, fragment linking may involve fragment optimization, and in situ methods may include fragment optimization or fragment linking. In the following sections, we will consider a variety of examples in which fragments are identified and then converted to hits or even leads.
Fragment optimization
Chemical optimization strategies typically focus on screening hits with low micromolar or better affinities, but simpler fragments have also been successfully optimized. While these fragments may have low intrinsic affinities, they typically possess binding specificity sufficient to serve as viable anchors for subsequent derivatization. The identified fragments can be joined through various linkers following a synthetic protocol, to generate synthetically accessible novel compounds with improved drug-likeness. The generated compounds should possess the required affinity and selectivity for the binding pocket of the protein. In most successful applications of fragment optimization, the anchoring fragments bound through discrete, specific contacts, and their binding modes were preserved throughout the optimization process. Known inhibitors or substrates also aided in the

selection of initial fragments for screening and guided optimization, often by helping to circumvent undesired qualities.
Fragment growing
By using the structure of a protein, the key interactions within the various regions of its binding pocket can be explored in order to identify a suitable fragment. The identified fragment can then be grown by substituting it differently at various positions. Usually, both fragment identification and fragment substitution follow the rules of combinatorial chemistry to arrive at synthetically feasible compounds. The fragments can be substituted according to either the shape, size, and physicochemical properties of the remaining part of the binding pocket or the pharmacophoric features of known active compounds. This growing process generally improves the receptor-ligand interactions and the drug-likeness of the identified fragment.
Fragment linking
Various types of fragments can be identified on the basis of the different sub-regions they occupy within the binding pocket of a particular target. These sub-regions may have different structural features and therefore create corresponding (hydrophobic or hydrophilic) environments, depending on the arrangement of amino acid residues. Fragments that favor these sub-regions of the binding pocket may have different physicochemical properties and, consequently, different activities. Besides activity, the drug-likeness of identified fragments is also of prime importance, and it can be increased by introducing linkers between two or more fragments. The linking of multiple fragments is usually done keeping synthetic rules in mind, to generate synthetically feasible novel molecules. The new molecules designed by fragment linking are expected to have both high affinity and selectivity for the binding pocket.
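At its core, the linking step is a combinatorial enumeration: fragments favoring different sub-regions are joined through candidate linkers and the products are filtered for size. A toy sketch, in which plain string concatenation of SMILES-like strings stands in for real bond formation and a crude character count stands in for a heavy-atom descriptor (a genuine workflow would use a cheminformatics toolkit with valid attachment points):

```python
from itertools import product

# Toy enumeration of fragment-linking candidates: every pair of fragments is
# joined through every linker, then crudely filtered by a heavy-atom budget.
# Fragments are SMILES-like strings joined by naive concatenation; real bond
# formation would require a cheminformatics toolkit.

site_a_frags = ["c1ccccc1", "c1ccncc1"]   # fragments favoring sub-region A
site_b_frags = ["C(=O)N", "S(=O)(=O)N"]   # fragments favoring sub-region B
linkers      = ["C", "CC", "OCC"]         # short aliphatic/ether linkers

def heavy_atoms(smiles):
    """Rough heavy-atom count: alphabetic atom symbols, ignoring hydrogens."""
    return sum(1 for ch in smiles if ch.isalpha() and ch not in "Hh")

candidates = []
for a, link, b in product(site_a_frags, linkers, site_b_frags):
    smiles = a + link + b                 # naive concatenation as "linking"
    if heavy_atoms(smiles) <= 16:         # keep synthetically small designs
        candidates.append(smiles)

print(len(candidates))  # 12 candidate linked molecules
```

In practice the filter would also score each candidate against the binding pocket, but the enumerate-then-filter shape of the loop is the same.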
Fragment merging
Normally, known lead molecules are available for nearly every given target and may act as probes for fundamental studies of that target. These known lead molecules may occupy a large portion of the binding pocket and leave some empty space for the binding of a suitable fragment. As in fragment linking, linkers can also be used to join a known lead molecule with suitable fragment(s). The suitable fragments for each binding site are selected by keeping in mind the desired physicochemical properties of the final product, forming relatively strong receptor-ligand interactions, and adding a linker that yields a synthetically reasonable compound. Besides the design of novel compounds, fragment merging can also be considered a promising tool for chemical modification and the generation of derivatives.

Emerging strategies: in situ fragment assembly
The process of using fragment assembly to generate de novo leads is greatly facilitated when techniques such as NMR or crystallography can be used to identify fragments that bind to nearby sites in a mutually compatible manner. Competition studies can also provide useful indirect evidence for selecting combinations of fragments that can bind concurrently as potential candidates for linking. But even with this type of information, productively linking or merging fragments remains a significant technical challenge. This challenge is further magnified when the target protein has a flexible binding surface. Several labs are now exploring ways of using the target protein both to select and to combine pairs of fragments in situ. In effect, the protein assembles its own inhibitor by selecting fragments that can cross-link to each other when brought into mutual proximity. The final set of examples illustrates this emerging area of investigation.

10.2.3 Hit identification and validation
The fragments can also be merged with already known leads of particular proteins, which serve as probes for fundamental studies using various techniques. The known leads may occupy the major portion of the binding site, while fragments occupy the remaining favored region. Suitable linkers can be used to link the lead compound with the identified fragment. The generated lead compounds can then be validated and verified for their effectiveness by carrying out suitable biological assays [24]. Fragments typically bind to the target protein with low affinity, in the micromolar to millimolar range. This necessitates the use of robust and highly sensitive techniques such as NMR, SPR, and protein crystallography to detect and characterize binding and to provide insight into enzyme inhibition or activation by the fragment. However, these techniques require highly sophisticated and costly equipment. Computational techniques can be integrated into the initial steps to hasten fragment identification. Techniques such as binding site identification, assessment of binding site druggability, and molecular docking to screen libraries of diverse molecular fragments help assure the affinity and selectivity of the extended fragments towards the binding site. Computational techniques are less time-consuming and more cost-effective than experimental techniques.
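Because fragment affinities sit in the micromolar-to-millimolar range, fragments are commonly compared with larger hits on a per-heavy-atom basis using ligand efficiency, LE = -ΔG / N_heavy with ΔG = RT ln Kd. A short illustration with made-up Kd and atom-count values:

```python
import math

# Converting a dissociation constant (Kd) into binding free energy and
# ligand efficiency (LE = -dG / heavy atoms), a standard FBDD metric for
# comparing weak fragment hits with larger leads. Values are illustrative.

R = 1.987e-3   # gas constant, kcal/(mol*K)
T = 298.0      # temperature, K

def delta_g(kd_molar):
    """Binding free energy dG = RT ln(Kd), with Kd in mol/L."""
    return R * T * math.log(kd_molar)

def ligand_efficiency(kd_molar, n_heavy):
    """LE in kcal/mol per heavy atom; ~0.3 or better is a common threshold."""
    return -delta_g(kd_molar) / n_heavy

# A 1 mM fragment with 12 heavy atoms vs. a 10 nM lead with 38 heavy atoms:
print(round(ligand_efficiency(1e-3, 12), 2))   # 0.34 (fragment)
print(round(ligand_efficiency(1e-8, 38), 2))   # 0.29 (lead)
```

Despite binding five orders of magnitude more weakly, the fragment has the higher ligand efficiency, which is why weak fragment hits are still attractive starting points.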

10.3 Recent advancements in FBDD techniques
Various advances in the field of FBDD have been introduced to simplify the generation and design of lead compounds displaying optimum interactions with the enzyme binding site.


10.3.1 Fragment-based molecular evolutionary approach
The fragment-based molecular evolutionary approach was proposed by Kawai and his research group in 2014 for the de novo design of novel drug-like molecules [25]. The study aimed to develop a similarity-based evolutionary approach to generate molecules that resemble a reference structure but differ in side chains and scaffold, while remaining synthetically feasible. This method uses the active ligand as a reference molecule to explore chemical space, generating libraries of seed fragments by fragmentation of the reference molecules, followed by the generation of individual sets of molecules from the library of seed fragments. The next step involves crossover or mutation of the identified ligands to create the offspring for the next generation. Here, mutation involves randomly selecting molecules from the library and applying strategies such as adding, removing, or replacing a fragment to construct a new molecule, whereas crossover generates two new molecules by exchanging fragments between the selected molecules. Finally, the designed molecules are evaluated for similarity to the reference molecule by calculating the Tanimoto coefficient as a fitness function, and a tournament method identifies the surviving molecules. The protocol of the approach is shown in Fig. 10.3.
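The Tanimoto fitness function used in this approach can be computed directly from fingerprint bit sets. A minimal sketch with illustrative fingerprints (real ones would come from a fingerprinting toolkit):

```python
# Tanimoto coefficient as the fitness function of the evolutionary approach:
# similarity between two molecules' fingerprints, represented here as sets of
# "on" bit positions (illustrative fingerprints, not real ones).

def tanimoto(fp_a, fp_b):
    """|A & B| / |A | B| for two sets of fingerprint bits (0.0 .. 1.0)."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

reference = {1, 4, 9, 17, 23, 42}        # bits set for the reference ligand
offspring = [
    ("mol_A", {1, 4, 9, 17, 23, 42}),    # identical fingerprint
    ("mol_B", {1, 4, 9, 23}),            # substructure-like overlap
    ("mol_C", {5, 8, 31}),               # unrelated bits
]
# Survivors of a generation: the molecules most similar to the reference.
ranked = sorted(offspring, key=lambda m: tanimoto(m[1], reference), reverse=True)
print([name for name, _ in ranked])  # ['mol_A', 'mol_B', 'mol_C']
```

In the full algorithm this ranking would feed a tournament selection step rather than a simple sort, but the fitness value is the same.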

Figure 10.3 Strategy of fragment-based molecular evolutionary approach.


Figure 10.4 Deconstruction-reconstruction strategy.

10.3.2 Construction and deconstruction approach
Later, in 2015, Chen et al. proposed the construction and deconstruction approach as a promising strategy for designing drug-like molecules with high potency and efficacy [26]. This approach involves deconstruction of the substituents of a reported ligand to generate a fragment library. These fragments are then redesigned by reconstruction into a unique chemical entity, reintroducing various original structural features to generate a drug-like molecule. However, this reconstruction step is tricky, as the fragments need to be optimized; it can be simplified by the use of computational filters such as Lipinski's and Veber's rules along with some public algorithms. This technique has been successfully employed for the identification of signal transducer and activator of transcription 3 (STAT3) inhibitors [27], monoamine transporter inhibitors [28,29], glucokinase activators [30], sigma-1 receptor ligands [31], and AMPA receptor positive allosteric modulators [32]. Hence, deconstruction of a known ligand into fragments and reconstruction of these fragments to generate new leads is a rational approach that may facilitate the drug discovery process with improved outcomes. The protocol employed for deconstruction and reconstruction is shown in Fig. 10.4.
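The Lipinski/Veber screening mentioned for the reconstruction step can be sketched as a simple filter over precomputed descriptors (the dict keys and example values are illustrative, not a toolkit API):

```python
# Drug-likeness screen for reconstructed molecules, combining Lipinski's
# rule of five with Veber's criteria. Descriptors are assumed precomputed;
# the keys below are illustrative, not a toolkit API.

def lipinski_ok(m):
    # Rule of five: MW <= 500, logP <= 5, H-bond donors <= 5, acceptors <= 10.
    return (m["mw"] <= 500 and m["logp"] <= 5
            and m["hbd"] <= 5 and m["hba"] <= 10)

def veber_ok(m):
    # Veber: rotatable bonds <= 10 and polar surface area <= 140 A^2.
    return m["rotb"] <= 10 and m["tpsa"] <= 140

def drug_like(m):
    return lipinski_ok(m) and veber_ok(m)

reconstructed = [
    {"name": "cand_1", "mw": 342.4, "logp": 2.1, "hbd": 2, "hba": 5,
     "rotb": 4, "tpsa": 78.0},
    {"name": "cand_2", "mw": 612.7, "logp": 5.8, "hbd": 4, "hba": 9,
     "rotb": 12, "tpsa": 155.0},
]
print([m["name"] for m in reconstructed if drug_like(m)])  # ['cand_1']
```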

10.3.3 Computational functional group mapping
Later, in 2016, Olgun Guvench proposed the computational functional group mapping (cFGM) technique as an analogue of experimental fragment-based design that allows the generation of comprehensive atomic-resolution 3D maps suggesting the affinity of functional groups that can be incorporated, as fragments, into drug-like molecules targeting the protein of interest [33]. These 3D maps can then be used by medicinal chemists to focus their efforts on designing synthetically feasible ligands bearing functional groups that are

necessary for affinity and specificity towards the target protein and that display optimum pharmacokinetic properties. cFGM relies on all-atom explicit-solvent MD and includes approaches such as the use of co-solvents, mixed-solvent MD (MixMD), and Site Identification by Ligand Competitive Saturation (SILCS). This technique has several advantages over traditional experimental FBDD: it can detect low-binding-affinity regions; mapping all functional groups helps determine which specific regions of the target favor particular interactions; and it avoids aggregation of hydrophobic fragments and fragment-induced denaturation of the target.

10.3.4 Multitasking computational model approach
The first multitasking (mtk) computational model for the in silico fragment-based design of multiple-target inhibitors was developed by Alejandro Speck-Planche and M. Natalia D. S. Cordeiro [34]. In their report, they presented a case study on the design of molecules with good inhibitory activities against multiple breast cancer-related proteins, in order to demonstrate the productivity of the proposed approach. As an alternative to conventional high-throughput screening or virtual screening, the mtk computational model takes high-quality experimental data into account to screen active molecules from a huge database. For this purpose, the quantitative contribution of each fragment to inhibitory activity is calculated, and the physicochemical interpretation of descriptors is considered in the mtk computational model. Both the fragment contributions and the descriptor interpretations provide guidance for the design of multi-targeting molecules. Establishing the mtk computational model involves three main steps. In the first step, biological data for molecules are collected from databases such as ChEMBL. For the selection of molecules, restrictions such as the number of reported activity measurements, the type of measurements, and the accuracy and reproducibility of the assays can also be applied. The next step involves the calculation of molecular descriptors for the selected molecules; for instance, topological descriptors can be calculated using QuBiLS-MIDAS [35], an algorithm based on multilinear algebraic maps and discrete mathematics. The third and last step is the generation of the mtk computational model, in which the statistical cases are divided into training and test sets. The thoroughly optimized model is then determined from the training set.
The test set is used to validate the model and determine whether it has the desired predictive power. Several statistical indices are utilized to assess model quality. The established mtk computational model can then be used to find the quantitative contribution of each fragment to the biological activity. Fragments with positive contributions to the biological activities are selected for fragment linking and merging to

generate new molecules. The generated molecules are examined for drug-likeness, and their effectiveness is verified through virtual screening between the target(s) and the newly generated molecule(s).
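The external validation step described above can be illustrated with the standard statistical indices computed on a held-out test set; the labels and predictions below are hypothetical:

```python
import math

# External validation of a classification model such as the mtk example:
# the model is built on a training set and judged on a held-out test set
# with standard statistical indices. Labels/predictions below are made up.

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def indices(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    acc  = (tp + tn) / len(y_true)
    sens = tp / (tp + fn)              # sensitivity (recall on actives)
    spec = tn / (tn + fp)              # specificity (recall on inactives)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc  = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return acc, sens, spec, mcc

# Held-out test set: 1 = inhibitor, 0 = non-inhibitor (hypothetical data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
acc, sens, spec, mcc = indices(y_true, y_pred)
print(round(acc, 2), round(sens, 2), round(spec, 2))  # 0.8 0.8 0.8
```

A balanced index such as the Matthews correlation coefficient (0.6 here) is generally more informative than raw accuracy when the actives and inactives are imbalanced.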

10.4 Limitations
In spite of being a very effective drug design approach, FBDD faces several limitations. The fragments identified for the development of novel molecules are relatively small compared with lead or drug-like molecules, which makes their molecular docking much more difficult [36]. The main reasons for this difficulty are the insufficient interactions of identified fragments with the binding site of the target protein and their low affinity towards their respective targets, owing to the limited number of functional groups available. There is a chance of missing some important and potentially useful fragment hits because of their tiny size and weak interactions with the protein. Additionally, the relatively small size of these fragments can also result in non-selective docking modes. Moreover, if a fragment's physicochemical properties are compatible with multiple sub-regions of the binding pocket, it may be favored by several of them. Under these circumstances, predicting fragment placement becomes difficult and the subsequent analyses become time-consuming. On the flip side, regarding the implementation of structure-based techniques in FBDD, there has been a great deal of advancement in molecular docking technology over the last few decades. Researchers can choose between systematic and stochastic methods for the conformational search, and between force-field-based, empirical, and knowledge-based scoring functions [37]. Nevertheless, the accuracy of receptor-ligand binding mode prediction is somewhat hampered when designing novel fragments. It is even more challenging to obtain reliable and accurate predictions for small fragments than for the lead/drug molecule itself. For greater accuracy and better results, reliable protein models and precisely defined binding pockets are required to perform in silico FBDD.
Apart from this, because crystal structure information is not available for the majority of proteins, especially membrane proteins such as GPCRs, the application of this drug development strategy becomes more challenging. Homology modeling resolves this issue to some extent if the target protein shares high sequence similarity with a crystallized template; however, the overall situation is not that optimistic. Also, static models are not very effective at representing proteins in dynamic biological environments, because the rotation of even one or a few key residues may cause a drastic conformational change in the shape of the binding site and, consequently, may change the whole protein structure. Moreover, the binding modes obtained from static models may not reproduce the receptor-ligand interactions seen in reality.


10.5 Conclusion
Drug discovery and development is a continuous and evolving process owing to the dire need for therapeutic agents for complex pathological conditions. With the ever-increasing cost and time involved in this exercise, computer-assisted tools and techniques have taken centre stage. Of the multiple CADD-based strategies utilized to discover new leads, fragment-based drug design approaches have become more popular and are now widely used in both industry and academia. Thus, in this chapter, the different strategies used to explore and implement fragment-based drug design, along with recent success stories of this approach, are outlined. FBDD has been significantly successful owing to some key distinctions from other techniques, including the ability of fragments to probe chemical space more efficiently than larger molecules, leading to a higher hit rate. Another reason is the fact that fragments form high-quality interactions (as demonstrated by their high binding energy relative to their size). Finally, fragments have the ability to uncover novel hot spots on the surface of target proteins, including challenging targets such as PPIs. These advantages, in addition to the relatively low start-up cost of establishing a fragment library, make FBDD an efficient and cost-effective approach for the design and development of novel small molecules against a target of interest.
Solved exercise
Exercise 1: FBDD-based design of EGFR inhibitors using the free web server ACFIS [38].
Note: ACFIS contains three associated tools to perform fragment-based drug design. PARA_GEN generates molecular force field parameters, which are required both for MD simulation and for the other modules of the server. CORE_GEN generates a pharmacophore structure from a bioactive molecule based on fragment deconstruction.
CORE_GEN can also be used as a tool to guide the optimization of a bioactive compound to improve its ligand efficiency. The third module, CAND_GEN, is a simple tool to perform the pharmacophore-linked fragment virtual screening (PFVS) approach.
Browser recommendation: This site is designed for use with Firefox, Google Chrome, Apple Safari, and IE10 or later, as it makes use of special features available only in these browsers.
Steps:
1. The main input for the ACFIS server is a 3D-structure file (pdb or mol2 for PARA_GEN; pdb for CORE_GEN and CAND_GEN). A processible pdb file can be obtained from the Protein Data Bank or from docking results, and should contain ATOM records for protein atoms, HETATM records for non-standard residues (including the ligand), and TER records to separate different chains and to mark non-standard residues.
2. In this study, we downloaded "1M17" as the 3D-structure file for EGFR.
3. Run ACFIS in Primary Mode.

Note: Primary Mode includes all three tools (PARA_GEN, CORE_GEN, and CAND_GEN) together, so that users can run ACFIS in a user-friendly way. After submission of a complex pdb file, the server will generate several cores and select the first-ranking core fragment to perform fragment virtual screening automatically. To avoid bugs in the uploaded pdb file, an initial file-checking module is also provided in Primary Mode. The three modules can also be used independently in Advanced Mode.
1. Use PARA_GEN to generate force field parameters.
Note: The input to PARA_GEN should be a valid pdb or mol2 file. Add hydrogen atoms properly before you upload your files. Atomic charges can be calculated automatically; three charge methods are available, and AM1-BCC is the recommended one. The output of PARA_GEN is a ZIP file containing a PREP file, a FRCMOD file, a LIB file, and a job log file. The PREP and FRCMOD files are needed as inputs to CORE_GEN and CAND_GEN, while the LIB and FRCMOD files are needed for MD simulation using an online MD server such as MDWeb. Most PARA_GEN jobs finish within 30 minutes.
Results for PARA_GEN:
1. Job Id: P963085de108f433687
2. Molecule structure file uploaded: 1m17.pdb
3. Assign total formal charge via AUTO_CALCULATE
4. Charge method specified: gas
5. Submission time: 2019-11-29 20:14:03
6. Server time: 2019-11-29 21:04:00
2. Use CORE_GEN to analyse your bioactive ligand and obtain an ideal pharmacophore.

Note: CORE_GEN provides an alternative route for fragment screening, based on the fact that a compound with medium bioactivity may contain an ideal fragment that can be used as a core. CORE_GEN can also point out the unfavourable part of a molecule by fragment deconstruction and binding energy calculation. A protein-ligand complex in pdb format is required as input to CORE_GEN, and the binding mode of the ligand should be sufficiently accurate. For the protein, hydrogen atoms should be removed. We recommend deleting domains far away from the binding site to minimize the calculation cost.
1. You can reach the job submission page by clicking the left picture on the CORE_GEN homepage. You can customize your job through 4 steps. Files are uploaded after a click on the "upload" button. The job will start to run after you click the "submit" button on the next page. The server takes about one hour to finish the calculation.
2. You are guided to the result page after successful submission. Messages about all submitted CORE_GEN jobs are printed on this page; you can bookmark this page, and the job status will update automatically. By clicking the id of the job, results of

CORE_GEN are presented via a web page containing a summary table showing the structure of each fragment, the ligand, and their binding energies. Because molecular physicochemical properties are also important for drug design, links are created for each ligand and fragment to the Molinspiration molecular property prediction server. You can see basic molecular properties by clicking the name of the pharmacophore. The structure of the final protein-ligand complex and each protein-fragment complex is shown with JSmol on clicking the picture. The structure of the protein-core complex can be downloaded by clicking the DOWNLOAD PDB icon. You can perform pharmacophore-linked fragment virtual screening with the generated pharmacophore by simply clicking the RUN CAND_GEN icon.

Results for CORE_GEN: a summary table (structures not reproduced here) listing the parent ligand and each generated core fragment (e.g., Core4) with their structures and binding modes, together with ΔH, −TΔS, and ΔG values in kcal/mol.
3. Use CAND_GEN to perform fragment linking.
Note: The input to CAND_GEN is similar to that of CORE_GEN. You should upload a protein-core structure from CORE_GEN. For the protein, hydrogen atoms should be removed. For the core, only a linker hydrogen atom should be retained; it will be replaced with different fragments. You can upload your files and customize your job in a similar way to CORE_GEN. The result pages of CORE_GEN and CAND_GEN are also similar.
Results for CAND_GEN (only the top 9 molecules are presented here):


A ranked table (structures not reproduced here) showing each candidate molecule's structure and binding mode, together with ΔH(PB), ΔH(GB), −TΔS, ΔG(PB), and ΔG(GB) values in kcal/mol.
Unsolved exercise
Exercise 2: Develop thymidine kinase inhibitors utilizing a pharmacophore-guided FBDD approach.
Note: For help, please refer to

References
[1] S.J.Y. Macalino, V. Gosu, S. Hong, S. Choi, Role of computer-aided drug design in modern drug discovery, Arch. Pharmacol. Res. 38 (2015) 1686–1701.
[2] M.J. Wildey, A. Haunso, M. Tudor, M. Webb, J.H. Connick, High-throughput screening, Annual Reports in Medicinal Chemistry, Elsevier, 2017, pp. 149–195.
[3] S.B. Shuker, P.J. Hajduk, R.P. Meadows, S.W. Fesik, Discovering high-affinity ligands for proteins: SAR by NMR, Science 274 (1996) 1531–1534.
[4] D.A. Erlanson, S.W. Fesik, R.E. Hubbard, W. Jahnke, H. Jhoti, Twenty years on: the impact of fragments on drug discovery, Nat. Rev. Drug Discov. 15 (2016) 605.
[5] M. Congreve, R. Carr, C. Murray, H. Jhoti, A 'rule of three' for fragment-based lead discovery? Drug Discov. Today 8 (2003) 876–877.
[6] D.A. Erlanson, W. Jahnke, R. Mannhold, H. Kubinyi, G. Folkers, Fragment-Based Drug Discovery: Lessons and Outlook, John Wiley & Sons, 2016.
[7] T. Fink, J.-L. Reymond, Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery, J. Chem. Inf. Model. 47 (2007) 342–353.
[8] L. Ruddigkeit, R. Van Deursen, L.C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52 (2012) 2864–2875.
[9] M. Congreve, R. Carr, C. Murray, H. Jhoti, A rule of three for fragment-based lead discovery? Drug Discov. Today 8 (2003) 876–877.

[10] G.M. Keserű, D.A. Erlanson, G.G. Ferenczy, M.M. Hann, C.W. Murray, S.D. Pickett, Design principles for fragment libraries: maximizing the value of learnings from pharma fragment-based drug discovery (FBDD) programs for use in academia, J. Med. Chem. 59 (2016) 8189–8206.
[11] B.C. Doak, C.J. Morton, J.S. Simpson, M.J. Scanlon, Design and evaluation of the performance of an NMR screening fragment library, Aust. J. Chem. 66 (2014) 1465–1472.
[12] S.J. Teague, A.M. Davis, P.D. Leeson, T. Oprea, The design of leadlike combinatorial libraries, Angew. Chem. Int. Ed. 38 (1999) 3743–3748.
[13] C.A. Lipinski, F. Lombardo, B.W. Dominy, P.J. Feeney, Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv. Drug Deliv. Rev. 23 (1997) 3–25.
[14] G.W. Bemis, M.A. Murcko, Properties of known drugs. 2. Side chains, J. Med. Chem. 42 (1999) 5095–5099.
[15] J. Fejzo, C.A. Lepre, J.W. Peng, G.W. Bemis, M.A. Murcko, J.M. Moore, The SHAPES strategy: an NMR-based approach for lead generation in drug discovery, Chem. Biol. 6 (1999) 755–769.
[16] X.Q. Lewell, D.B. Judd, S.P. Watson, M.M. Hann, RECAP: retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry, J. Chem. Inf. Comput. Sci. 38 (1998) 511–522.
[17] P.J. Hajduk, M. Bures, J. Praestgaard, S.W. Fesik, Privileged molecules for protein binding identified from NMR-based screening, J. Med. Chem. 43 (2000) 3443–3447.
[18] M. Vieth, M.G. Siegel, R.E. Higgs, I.A. Watson, D.H. Robertson, K.A. Savin, et al., Characteristic physical properties and structural fragments of marketed oral drugs, J. Med. Chem. 47 (2004) 224–232.
[19] B.K. Shoichet, R.M. Stroud, D.V. Santi, I.D. Kuntz, K.M. Perry, Structure-based discovery of inhibitors of thymidylate synthase, Science 259 (1993) 1445–1450.
[20] J.A. Erickson, M. Jalaie, D.H. Robertson, R.A.
Lewis, M. Vieth, Lessons in molecular recognition: the effects of ligand and protein flexibility on molecular docking accuracy, J. Med. Chem. 47 (2004) 45–55.
[21] H.J. Böhm, A novel computational tool for automated structure-based drug design, J. Mol. Recognit. 6 (1993) 131–137.
[22] O. Ramström, J.-M. Lehn, Drug discovery by dynamic combinatorial libraries, Nat. Rev. Drug Discov. 1 (2002) 26.
[23] A. Ganesan, Strategies for the dynamic integration of combinatorial synthesis and screening, Angew. Chem. Int. Ed. 37 (1998) 2828–2831.
[24] D.A. Erlanson, R.S. McDowell, T. O'Brien, Fragment-based drug discovery, J. Med. Chem. 47 (2004) 3463–3482.
[25] K. Kawai, N. Nagata, Y. Takahashi, De novo design of drug-like molecules by a fragment-based molecular evolutionary approach, J. Chem. Inf. Model. 54 (2014) 49–56.
[26] H. Chen, X. Zhou, A. Wang, Y. Zheng, Y. Gao, J. Zhou, Evolutions in fragment-based drug design: the deconstruction–reconstruction approach, Drug Discov. Today 20 (2015) 105–113.
[27] H. Chen, Z. Yang, C. Ding, L. Chu, Y. Zhang, K. Terry, et al., Fragment-based drug design and identification of HJC0123, a novel orally bioavailable STAT3 inhibitor for cancer therapy, Eur. J. Med. Chem. 62 (2013) 498–507.
[28] S.M. Stahl, C. Lee-Zimmerman, S. Cartwright, D. Ann Morrissette, Serotonergic drugs for depression and beyond, Curr. Drug Targets 14 (2013) 578–585.
[29] A.M. Capelli, F. Micheli, Triple monoamine uptake inhibitors, Pharm. Pat. Anal. 1 (2012) 469–481.
[30] W. Mao, M. Ning, Z. Liu, Q. Zhu, Y. Leng, A. Zhang, Design, synthesis, and pharmacological evaluation of benzamide derivatives as glucokinase activators, Bioorg. Med. Chem. 20 (2012) 2982–2991.
[31] S. Brune, S. Pricl, B. Wünsch, Structure of the σ1 receptor and its ligand binding site: miniperspective, J. Med. Chem. 56 (2013) 9809–9819.
[32] H. Chen, C.Z. Wang, C. Ding, C. Wild, B. Copits, G.T.
Swanson, et al., A combined bioinformatics and chemoinformatics approach for developing asymmetric bivalent AMPA receptor positive allosteric modulators as neuroprotective agents, ChemMedChem 8 (2013) 226 230.

Fragment based drug design: Connecting small substructures for a bioactive lead 253 [33] O. Guvench, Computational functional group mapping for drug discovery, Drug Discov. Today 21 (2016) 1928 1931. [34] A. Speck-Planche, M.N.D. Cordeiro, Fragment-based in silico modeling of multi-target inhibitors against breast cancer-related proteins, Mol. Divers. 21 (2017) 511 523. [35] C.R. Garcı´a-Jacas, Y. Marrero-Ponce, L. Acevedo-Martı´nez, S.J. Barigye, J.R. Valde´s-Martinı´, E. Contreras-Torres, QuBiLS-MIDAS: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps, J. Comput. Chem. 35 (2014) 1395 1409. [36] R.E. Hubbard, I. Chen, B. Davis, Informatics and modeling challenges in fragment-based drug discovery, Curr. Opin. Drug Discov. Dev. 10 (2007) 289 297. [37] L.G. Ferreira, R.N. Dos Santos, G. Oliva, A.D. Andricopulo, Molecular docking and structure-based drug design strategies, Molecules 20 (2015) 13384 13421. [38] G.-F. Hao, W. Jiang, Y.-N. Ye, F.-X. Wu, X.-L. Zhu, F.-B. Guo, et al., ACFIS: a web server for fragment-based drug discovery, Nucleic Acids Res. 44 (2016) W550 W556.


Scaffold hopping: An approach to improve the existing pharmacological profile of NCEs

11.1 Introduction

The similarity principle is one of the pillars of modern medicinal chemistry: structurally related compounds tend to display similar biological activities [1], i.e., they can exert related effects as ligands of the same macromolecular receptor. Conversely, the more distantly related two chemical structures are, the lower the probability that they share the same biological effect. Molecular modeling and cheminformatics experts have addressed this question by inventing a host of computational procedures that calculate molecular similarity as independently as possible of the details of chemical structure. Regardless of the type of molecular similarity measure used, however, there is always a trade-off between the probability of finding a compound with the desired activity and the degree of 'novelty' of a proposed alternative structure. At the atomic level, interactions between receptors and ligands can roughly be described in terms of hydrophobic contacts and additional specific, most often polar, interactions [2]. A compound can be considered similar to another under two criteria: first, if their shapes match, and second, if they form the same directed interactions, such as hydrogen bonds. Unfortunately, these descriptors are of little use if the chemical structure of only a single active compound is known, because drug-sized molecules can typically adopt many alternative low-energy conformations with varied shapes. However, if the biologically relevant conformation of one ligand is known from an X-ray structure, or if its structure is rigid, this conformation can be used as a template to search for novel structures. Descriptors encoding shape and hydrogen-bonding capability have the advantage of being completely independent of chemical structure: the molecules are regarded 'from outside', as they act on a receptor.
This increases the likelihood of identifying truly novel scaffolds [3]. Many tools have been established to flexibly superimpose molecules onto a rigid query structure, probably the earliest being the program SEAL [4]. The term "scaffold hopping" was coined by Hoffmann-La Roche researcher Gisbert Schneider to describe the identification of isofunctional molecular structures with different molecular backbones

Concepts and Experimental Protocols of Modelling and Informatics in Drug Design. © 2021 Elsevier Inc. All rights reserved.


having similar or improved properties, thereby accelerating the drug discovery process. Using this approach, a lead compound, i.e., a compound from a series of related compounds having some of the desired biological activity, can be characterized and modified to obtain another molecule with a better profile of the required properties that is devoid of unwanted side effects. Previous studies suggest that the scaffold hopping technique can assist in obtaining leads based on different scaffolds [5]. An alteration of the central chemical template of a compound is often desirable for several reasons: (i) replacement of a lipophilic scaffold by a more polar one to improve solubility; (ii) substitution of a metabolically labile scaffold with a more stable or less toxic one to improve the pharmacokinetic properties; (iii) replacement of a very flexible scaffold (such as a peptide backbone) by a rigid central scaffold to significantly improve the binding affinity; and (iv) a change in the central scaffold to generate a novel structure that is patentable [6]. Thus, the goal of scaffold hopping is to identify structurally diverse compounds that are similar in activity/property space [7]. Core structure modifications or replacements might be attempted on the basis of chemical knowledge, at least for individual compound series, but searching for scaffold hops requires computational frameworks. In the benchmarking of virtual screening methods, the assessment of scaffold hopping potential has become a gold standard, which is not without problems. Scaffolds extracted from bioactive compounds represent a wide spectrum of structural relationships, ranging from chemically very similar scaffolds to others that are structurally unrelated. Benchmark studies reported in the literature are often statistical in nature and do not specify, or insufficiently specify, how scaffold hopping is evaluated and whether different structural relationships have been considered.
While benchmarking is a necessary but not sufficient exercise for evaluating virtual screening methods, prospective scaffold hopping applications starting from known active compounds are more interesting. In medicinal chemistry and drug design, scaffold hopping aims to find replacements for chemically labile compounds, for compound series with flat structure-activity relationships (SARs), or for (natural) molecules with limited synthetic accessibility. In addition, the pharmaceutical industry is also attempting to search for compounds that share activity characteristics with known candidates or drugs, but are chemically sufficiently different from them to help establish a competitive patent position [8]. Thus, computational scaffold hopping is an intellectually stimulating scientific exercise, although interest in it is often rather pragmatic in nature.

11.2 Computational approaches of scaffold hopping

Scaffold hopping can be defined as the derivation of compounds with novel core structures from a parent compound. The question that naturally emerges is how different the structures of the derivative molecules must be from their parents for the evolution to be classified as scaffold hopping; in other words, how novel must the derivatives be? Boehm


Figure 11.1 Some of the approved drugs developed by scaffold hopping.

et al. classified two scaffolds as different if different synthetic routes were used to synthesize them, no matter how small the structural change might be [9]. This criterion has proven valid in many cases where the chemical structures are closely related. Additionally, such derivatives can be claimed in different patents, or filed as different new drug applications for approval by the Food and Drug Administration (FDA). For instance, the major structural variation between the two phosphodiesterase type 5 (PDE5) inhibitors Sildenafil and Vardenafil (Fig. 11.1) is the swap of a carbon atom and a nitrogen atom in a 5-6 fused ring, yet this difference is enough for the two molecules to be covered by different patents [10]. The two approved cyclooxygenase-2 (COX-2) inhibitors, Rofecoxib (Vioxx) and Valdecoxib (Bextra) (Fig. 11.1), differ only in the 5-membered hetero rings connecting the two phenyl rings, yet they were sold by Merck and Pharmacia/Pfizer separately [11]. Scaffold hopping using 3D pharmacophores has been carried out successfully [12], and the concept of scaffolds and scaffold hopping in the context of molecular topologies has been reported [13].

11.2.1 Pharmacophore searching

Given a single 3D structure of an active ligand, it is possible to search for compounds that closely mimic this structure; however, it is impossible to distinguish which of its features are essential for binding and which are variable. Such a differentiation becomes possible if a series of ligands is known. If these ligands are structurally diverse but share

common features and can adopt similar shapes, a 3D pharmacophore can be derived, that is, a minimal set of spatially oriented features a compound must possess in order to be active [14,15]. 3D pharmacophores have been successfully employed for scaffold hopping. Typically, they are built manually or in a semi-automated fashion and then used to search large multiconformer databases of chemical structures for compounds matching the pharmacophore. Both flexible superposition and 3D pharmacophore searching methods can assist in retrieving known compounds from databases. Often, however, this is not sufficient to discover novel scaffolds, since even the largest corporate collections represent only a minute fraction of drug-like chemical space. In addition, any existing compound could be covered by competitor patents, considering that the major pharmaceutical companies today restock their screening libraries from the same commercial sources. The de novo design program Skelgen [16,17] attempts to generate novel molecules by utilizing a set of 3D pharmacophore features and an inclusion shape (derived from a set of superimposed ligands) as input. Within this pseudo-receptor, the program builds new ligand structures that fulfill the pharmacophore constraints.
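The essence of a 3D pharmacophore search, namely checking whether a candidate conformer presents the query's feature types with matching pairwise distances, can be sketched in a few lines of Python. The feature types, coordinates, and tolerance below are hypothetical and purely illustrative; real tools enumerate many conformers per compound and use far more efficient matching algorithms.

```python
from itertools import permutations
from math import dist

def matches_pharmacophore(query, conformer, tol=1.0):
    """Return True if the conformer presents features of the same types as
    the query, with all pairwise inter-feature distances agreeing within
    'tol' angstroms (brute-force assignment search)."""
    q_types = [t for t, _ in query]
    for cand in permutations(conformer, len(query)):
        if [t for t, _ in cand] != q_types:
            continue
        if all(abs(dist(query[i][1], query[j][1]) -
                   dist(cand[i][1],  cand[j][1])) <= tol
               for i in range(len(query)) for j in range(i + 1, len(query))):
            return True
    return False

# Hypothetical 3-point pharmacophore: donor, acceptor, aromatic centroid
query = [("donor",    (0.0, 0.0, 0.0)),
         ("acceptor", (3.0, 0.0, 0.0)),
         ("aromatic", (0.0, 4.0, 0.0))]

# A conformer of a different scaffold presenting the same feature geometry
hit = [("aromatic", (1.0, 4.9, 0.0)),
       ("donor",    (1.2, 1.0, 0.2)),
       ("acceptor", (4.1, 0.9, 0.1))]

print(matches_pharmacophore(query, hit))   # True
```

Because only feature types and their geometry enter the comparison, a compound built on an entirely different scaffold can satisfy the query, which is exactly what makes pharmacophore searching useful for scaffold hopping.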

11.2.2 Recombination of ligand fragments

A third approach to generating ideas for novel ligands beyond what is contained in screening libraries is the recombination of ligand fragments. The program SPLICE was written in the early nineties to post-process the results of 3D database searching [18]. Two operations are performed: portions of structures that match a 3D query but make no contribution to fulfilling a pharmacophore feature are cut off, and (partial) solution structures are assembled into composite structures by linking fragments at overlapping bonds. Researchers at Vertex reported a related method called BREED, which operates on sets of superimposed X-ray structures of related enzyme complexes instead of 3D database searching results, and generates new structures by recombining inhibitor fragments connected by single bonds. The steps involved in BREED for the generation of novel inhibitors by recombining fragments are summarized in Fig. 11.2 [19]. This is an attractive way of capitalizing on the vast amount of structural information contained in the Protein Data Bank (PDB) for many target classes, and it can quickly lead to promising scaffold ideas, in particular in combination with results from crystal-based fragment screening [20]. Clearly, crystal structures of protein-ligand complexes are the richest source of information for designing modified or novel scaffolds. Many examples of successful structure-based design have been reported, especially for kinases [21]. A common element of such studies is the use of key interaction centers and the active site shape as constraints, i.e., the key elements of molecular recognition mentioned above. Structures of several


Figure 11.2 BREED method for generating novel inhibitors by recombining fragments.

ligands in complex with CDK2 (cyclin-dependent kinase 2) have been solved over the years and deposited in the Protein Data Bank (PDB). Both the cofactor ATP and the unspecific inhibitor Staurosporine were starting points for the development of inhibitors. Methods like CAVEAT and SPLICE incorporate the conformational properties of molecules and can thereby provide new solutions. Exchanging and recombining molecular fragments, however, is common practice in medicinal chemistry and does not always require 3D structural information. Closing or opening ring structures, replacing one ring system with another, or modifying linker types and lengths between two ring systems can be effective procedures for obtaining a new class of compounds. Unless such modifications take place at the periphery of a

structure, one can indeed think of scaffold hopping through bioisosteric replacement. Substantial efforts have been made to derive substituent replacement rules and to generate the corresponding databases [22]. Recognizing that the replacement of one ring system by another is a way to modify a scaffold, researchers at GlaxoSmithKline have compiled a database of common ring systems that can be searched like a 2D version of CAVEAT [23].
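The fragment-recombination idea behind SPLICE and BREED can be illustrated with a toy sketch: each ligand is decomposed into fragments occupying named attachment slots (stand-ins for the overlapping single bonds at which the real programs cut), and hybrid molecules are enumerated by mixing fragments across ligands. All fragment and slot names here are hypothetical.

```python
from itertools import product

def recombine(ligands):
    """Enumerate hybrid molecules by exchanging fragments between ligands
    that share the same attachment slots (a toy stand-in for cutting and
    re-joining at overlapping single bonds, as done by SPLICE/BREED)."""
    slots = sorted({slot for lig in ligands for slot in lig})
    choices = {slot: sorted({lig[slot] for lig in ligands if slot in lig})
               for slot in slots}
    for combo in product(*(choices[s] for s in slots)):
        yield dict(zip(slots, combo))

# Two hypothetical inhibitors decomposed at their rotatable bonds
lig_a = {"hinge": "aminopyrimidine", "linker": "amide", "tail": "piperazine"}
lig_b = {"hinge": "azaindole",       "linker": "urea",  "tail": "morpholine"}

hybrids = list(recombine([lig_a, lig_b]))
print(len(hybrids))   # 2*2*2 = 8 combinations, including both parents
```

With two fragments per slot the enumeration is trivial; the combinatorial growth with more parents and slots is exactly why the real methods rank hybrids by geometric or energetic criteria rather than enumerating everything.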

11.2.3 Molecular similarity methods

A large number of molecular similarity methods have been established that are explicitly based on the 2D structure of molecules (connectivity and atom types) but nevertheless aim to describe similarity as independently as possible of the details of the substructure. Unlike in 3D, where shape and interaction vectors are obvious descriptors, there is no unique recipe for achieving this goal in two dimensions. One typical approach is to generate vector or bit-string descriptions of molecules (Fig. 11.3), from which similarity values can be calculated very quickly. Each element of such a vector denotes the presence or absence (or frequency of occurrence) of a small structural element or pharmacophore feature. Rarey and Dixon have developed a feature tree-based similarity metric, which achieves abstraction from the molecular structure in a different manner [24]. A feature tree is a "shrunk" version of a molecular graph, in which each node constitutes an acyclic atom or an entire ring system with a set of assigned properties. To calculate a similarity value, two feature trees are explicitly matched onto each other. Owing to its generalized representation of rings, the feature tree method is particularly valuable for substituting heterocycles for each other or identifying alternative arrangements of rings. Details on similarity searching have recently been reviewed [25]. Clearly, similarity searching aims not only at the identification of an optimal single metric, but also at an optimal way of combining the results of different metrics, because each method focuses on slightly

Figure 11.3 Bit string representation for a molecule.

different compound features, whose relative importance is not known. The combination of bioisosteric replacement with similarity searching could also be a powerful approach for scaffold hopping. Both vector-based descriptions of molecules (so-called CATS correlation vectors) and the feature tree method have been employed to design novel scaffolds in 2D de novo design algorithms [26]. In these algorithms, chemical structures are assembled through fragment joining, and the resulting new scaffolds are evaluated by their similarity to the query. Ironically, the better these methods work, the less interesting the results will be: if the chemical space spanned by the fragments is complete, and if the search algorithm locates the global minimum, it will retrieve the query itself as the best answer. Thus, the success reported so far relies on the fact that local minima in incomplete chemical spaces can yield interesting alternative scaffolds. In the feature tree fragment space method, this shortcoming has been addressed through the concept of a target similarity value: the level of dissimilarity to the query that the output structures should display is adjustable to high (conservative) or lower values (more drastic structural changes) [27].
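Similarity between such bit-string fingerprints is most often quantified with the Tanimoto coefficient. A minimal sketch, with fingerprints written as sets of 'on' bit positions (the bit assignments below are invented for illustration):

```python
def tanimoto(fp1, fp2):
    """Tanimoto coefficient between two fingerprints given as sets of
    'on' bit positions: |A & B| / |A | B|."""
    union = fp1 | fp2
    return len(fp1 & fp2) / len(union) if union else 1.0

# Hypothetical structural-key fingerprints (bit index = presence of a feature)
query     = {1, 4, 7, 9, 12, 15}
candidate = {1, 4, 7, 9, 13, 15}   # shares 5 of 7 distinct bits with the query
decoy     = {2, 3, 5, 8}

print(round(tanimoto(query, candidate), 3))   # 0.714
print(round(tanimoto(query, decoy), 3))       # 0.0
```

Because the bits encode small substructural or pharmacophoric features rather than the whole scaffold, a compound with a different core can still score highly against the query, which is the property these descriptors exploit for scaffold hopping.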

11.3 Conclusion

The analysis of the available drugs for a given target clearly demonstrates that it is possible to find a set of structurally diverse compounds that bind to the same receptor. The underlying assumption of scaffold hopping is therefore clearly correct. However, it should be noted that serendipity has played a large role in many of these discoveries, and a large number of new drugs are structurally rather close to known compounds. There is therefore a continued strong need to develop new approaches that identify novel compounds in a more straightforward, systematic fashion. A large number of tools are now available, and the concept of scaffold hopping, or bioisosteric replacement, is widely recognized; interestingly, some of these tools have been available for more than a decade. Computational scaffold hopping is here to stay, no doubt, but to reach its full potential, future improvements must go well beyond successful benchmark calculations. To this end, realistically assessing method performance, better understanding the fundamental and target-related constraints of scaffold hopping, and further increasing the success rate of practical applications are the prime tasks. So far, neither computational scaffold hopping nor any other individual computational approach has lifted drug design onto a new level. However, scaffold hopping belongs to the core of drug design methods, and any substantially new development in this field will attract a high level of interest beyond the computational community.

Solved exercise

Exercise 1: Scaffold hopping of a MurF inhibitor using the free web server BoBER [28].

Note: BoBER (Base of Bioisosterically Exchangeable Replacements) is a freely available web server which implements an interface to a database of bioisosteric and scaffold hopping replacements obtained by mining the whole Protein Data Bank. BoBER enables medicinal chemists to quickly search for and obtain new ideas about possible bioisosteric or scaffold hopping replacements that could be used to improve a specific hit or lead drug-like compound.

Browser recommendation: The site is designed for use with Firefox, Google Chrome, Apple Safari and IE10 or later, as it makes use of special features available only in these browsers.

1. Input drug structure: Clicking the Example structure button will input the structure (a MurF inhibitor) into the molecular editor.

2. The Submit query button will execute the fragmenting procedure, and a new window will be displayed.
3. Fragment selection: We click on the fragment (tetrahydro-benzothiophene) we wish to replace within the query structure. We select the Intra-Family radio button within the Custom options and, as the structure is quite unique, we also select the Interchangeable join atoms radio button to expand the chemical space from which the bioisosteric replacements are sought. All other options are left as they are. Clicking the Submit query button will open the Results window.


4. Results: We can observe that BoBER found, in this case, one bioisosteric replacement for our fragment (tetrahydro-thienopyridine). The join atoms colored alike (red) and the N.sp2 atoms are the ones that overlap within the BoBER database. Clicking the glyphicon icon shows a dropdown menu of the overlapping join atoms (in this case only one pair) at which the replacement should take place.

Unsolved exercises

Exercise 1: Design a new butterfly structure by scaffold hopping of the central core heterocycle.

Exercise 2: On the basis of scaffold hopping, identify some new lead molecules as phosphodiesterase-5 (PDE5) inhibitors. (Hint: perform scaffold hopping for the nitrogen-containing fused heterocyclic ring system using BoBER; some of the well-known PDE5 inhibitors are sildenafil and vardenafil.)

Exercise 3: Perform scaffold hopping for the imidazole ring in some FDA-approved imidazole-containing drugs.

Exercise 4: Identify some false substrates of thymidylate synthase as anticancer drugs using scaffold hopping of the pyrimidine heterocycle of dTMP, as given below.

References

[1] M.A. Johnson, G.M. Maggiora, Concepts and Applications of Molecular Similarity, Wiley, 1990.
[2] H.J. Böhm, G. Klebe, What can we learn from molecular recognition in protein-ligand complexes for the design of new drugs? Angew. Chem. Int. Ed. Engl. 35 (1996) 2588-2614.
[3] S.K. Kearsley, G.M. Smith, An alternative method for the alignment of molecular structures: maximizing electrostatic and steric overlap, Tetrahedron Comput. Methodol. 3 (1990) 615-633.
[4] G. Klebe, T. Mietzner, F. Weber, Methodological developments and strategies for a fast flexible superposition of drug-size molecules, J. Comput. Aided Mol. Des. 13 (1999) 35-49.
[5] H. Zhao, Scaffold selection and scaffold hopping in lead generation: a medicinal chemistry perspective, Drug Discov. Today 12 (2007) 149-155.
[6] B.S. Sekhon, N. Bimal, Scaffold hopping in drug discovery, RGUHS J. Pharm. Sci. 2 (2012) 10.
[7] Y. Hu, D. Stumpfe, J. Bajorath, Recent advances in scaffold hopping: miniperspective, J. Med. Chem. 60 (2016) 1238-1246.
[8] H.-J. Böhm, A. Flohr, M. Stahl, Scaffold hopping, Drug Discov. Today Technol. 1 (2004) 217-224.
[9] G. Schneider, P. Schneider, S. Renner, Scaffold-hopping: how far can you jump? QSAR Comb. Sci. 25 (2006) 1162-1171.
[10] N.T. Southall, Ajay, Kinase patent space visualization using chemical replacements, J. Med. Chem. 49 (2006) 2103-2109.
[11] G.A. FitzGerald, COX-2 and beyond: approaches to prostaglandin inhibition in human disease, Nat. Rev. Drug Discov. 2 (2003) 879.
[12] T. Langer, E. Krovat, Chemical feature-based pharmacophores and virtual library screening for discovery of new leads, Curr. Opin. Drug Discov. Dev. 6 (2003) 370-376.
[13] N. Brown, E. Jacoby, On scaffolds and hopping in medicinal chemistry, Mini Rev. Med. Chem. 6 (2006) 1217-1229.
[14] A.C. Good, J.S. Mason, Three-dimensional structure database searches, Rev. Comput. Chem. (1996) 67-117.
[15] J.H. Van Drie, Strategies for the determination of pharmacophoric 3D database queries, J. Comput. Aided Mol. Des. 11 (1997) 39-52.
[16] N.P. Todorov, P.M. Dean, Evaluation of a method for controlling molecular scaffold diversity in de novo ligand design, J. Comput. Aided Mol. Des. 11 (1997) 175-192.
[17] M. Stahl, N.P. Todorov, T. James, H. Mauser, H.-J. Boehm, P.M. Dean, A validation study on the practical use of automated de novo design, J. Comput. Aided Mol. Des. 16 (2002) 459-478.
[18] C.M. Ho, G.R. Marshall, SPLICE: a program to assemble partial query solutions from three-dimensional database searches into novel ligands, J. Comput. Aided Mol. Des. 7 (1993) 623-647.
[19] A.C. Pierce, G. Rao, G.W. Bemis, BREED: generating novel inhibitors through hybridization of known ligands. Application to CDK2, p38, and HIV protease, J. Med. Chem. 47 (2004) 2768-2775.
[20] D. Fattori, Molecular recognition: the fragment approach in lead generation, Drug Discov. Today 9 (2004) 229-238.
[21] T. Honma, Recent advances in de novo design strategy for practical lead identification, Med. Res. Rev. 23 (2003) 606-632.
[22] P. Ertl, Cheminformatics analysis of organic substituents: identification of the most common substituents, calculation of substituent properties, and automatic identification of drug-like bioisosteric groups, J. Chem. Inf. Comput. Sci. 43 (2003) 374-380.
[23] X.Q. Lewell, A.C. Jones, C.L. Bruce, G. Harper, M.M. Jones, I.M. McLay, et al., Drug rings database with web interface. A tool for identifying alternative chemical rings in lead discovery programs, J. Med. Chem. 46 (2003) 3257-3274.
[24] M. Rarey, J.S. Dixon, Feature trees: a new molecular similarity measure based on tree matching, J. Comput. Aided Mol. Des. 12 (1998) 471-490.
[25] T. Lengauer, C. Lemmen, M. Rarey, M. Zimmermann, Novel technologies for virtual screening, Drug Discov. Today 9 (2004) 27-34.
[26] G. Schneider, M.-L. Lee, M. Stahl, P. Schneider, De novo design of molecular architectures by evolutionary assembly of drug-derived building blocks, J. Comput. Aided Mol. Des. 14 (2000) 487-494.
[27] M. Rarey, M. Stahl, Similarity searching in large combinatorial chemistry spaces, J. Comput. Aided Mol. Des. 15 (2001) 497-520.
[28] S. Lešnik, B. Škrlj, N. Eržen, U. Bren, S. Gobec, J. Konc, et al., BoBER: web interface to the base of bioisosterically exchangeable replacements, J. Cheminf. 9 (2017) 62.


Hotspot and binding site prediction: Strategy to target protein-protein interactions

12.1 Introduction

Proteins function through interaction networks that are ubiquitously found in all essential cell processes. Understanding the structure, function, and mechanism of these interaction networks at the molecular level is one of the upcoming concepts being explored for its potential to yield efficacious drug targets. Experimental techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) are well established for obtaining detailed structural knowledge of protein-protein interactions at atomic resolution. However, in crystallographic structures it is often difficult to distinguish true biological interactions from crystal packing contacts [1]. Therefore, computational tools have been developed, based on residue conservation, interface size, or other interface descriptors, to distinguish crystal packing from obligate and non-obligate interactions [2]. There are also databases which provide true biological units, such as Protein Quaternary Structure (PQS) [3] or ProtBuD [4], while others are specialized in storing and curating structural data on protein-protein interactions: 3DCOMPLEX, PiBase, Protein3D, Structural Classification of Protein-Protein Interfaces (SCOPPI), DOCKGROUND, or Surface Properties of INterfaces: Protein-Protein interfaces (SPIN-PP) [5]. The main challenge regarding the structural comprehension of protein interactions is that the number of available 3D complex structures is still low with respect to the total number of protein-protein interactions that occur in living organisms. Thus, when the structure of a given protein-protein complex is not easily available for technical reasons, the practical approach is to characterize the interface, i.e., to identify the surface residues of the interacting proteins that are involved in the interactions.
Some experimental approaches, such as cross-linking, site-directed mutagenesis, alanine scanning, or NMR chemical shift mapping, aim to characterize protein interfaces using methods that are faster than atomic-level structure determination and could also be amenable to high-throughput application. However, considering the difficulties and costs of the experimental techniques, computational tools have emerged to complement the experimental efforts by


characterizing, classifying, and predicting protein interfaces [5]. These computational predictions of interface residues have become an essential tool in the drug design and discovery process targeting specific protein-protein interactions. Further characterization of the protein interfaces is required to identify hot-spot residues, i.e., the residues that are the largest contributors to the binding affinity [6]. Although this identification is typically achieved through mutational studies, different computational approaches have recently been developed to predict putative protein-protein hot spots. The identification of such hot spots is important for drug design targeting protein-protein interactions. In this chapter, we discuss the main computational methods reported in the literature to predict protein-protein interfaces and to identify the key hot-spot residues.

12.2 Protein-protein binding sites

Protein-protein interfaces are usually large in comparison to small-molecule binding sites, and their residue composition and physicochemical character are also quite varied [7,8]. Further studies showed that the shape and composition of interfaces depend largely on the type of interaction. Thus, while obligate homodimer interfaces have a clearly different physicochemical composition from the solvent-exposed surface, protein-protein interfaces in hetero-complexes cannot be easily differentiated from the rest of the surface [9]. This seems to contradict the fact that the relationship between the chemical and physical attributes of interacting surfaces is vital for the formation of hetero-complexes, which frequently involve non-obligate and transient interactions. Indeed, some studies based on continuum electrostatic calculations suggested that protein-protein interfaces are naturally designed to exploit electrostatic and hydrophobic forces in very different ways [10]. According to a recent definition, a protein-protein interface is divided into core and rim regions, plus a third region called the 'support' [11]. Studies show that core residues contribute over two-thirds of the contact surface. Interestingly, the composition of the support region is similar to that of the protein interior, while the composition of the rim region is similar to that of the exposed surface. Researchers have hypothesized that part of a protein-protein interface (support and rim) could pre-exist on a non-interacting protein surface, so that evolving into a protein interface would need only a few mutations to achieve the typical core composition. Given that most analyses of protein binding sites are based on available X-ray data, flexibility aspects have typically been overlooked in spite of their enormous importance.
Molecular dynamics (MD) simulations have shown the existence of anchor residues for molecular recognition [12], which are more rigid than the rest of the interface and correlate well with conserved hot-spot residues [13]. Moreover, a recent systematic analysis of the dynamic properties of interface residues has shown a correlation with the interface type, size, and nature of the complex [14]. Allosteric effects are also

important in order to understand and predict protein binding sites [15]. Other aspects that are usually difficult to address from analyses based purely on crystallographic data are those related to transient interactions [16], some of which are mediated by linear motifs, or to hub proteins in protein interaction networks, which show shared binding sites for different interaction partners [17].

12.3 Types of protein-protein interaction regions

For the analysis of protein-protein interaction sites and hot spots, it is important to distinguish between different types of protein-protein interaction regions. Generally, protein-protein interactions are classified into two basic types: permanent complexes, whose individual components are not structurally stable as monomers, and non-obligatory or transient complexes, whose components are stable as monomers [18]. Another classification of protein-protein interfaces includes four different types: homo-obligomers, homocomplexes, hetero-obligomers, and heterocomplexes [19]. Here, 'obligomer' and 'complex' correspond to permanent and non-obligatory, respectively. Two further types of interfaces can also be defined: intra-domain (interactions within the same domain) and domain-domain (interactions between domains of the same chain); although these might be evolutionarily close to protein-protein interactions, they are not directly relevant for protein association. Of course, domain-domain interactions are not very different from obligomers in the sense that the separated subunits are not stable, but the difference is that obligomers are formed by two separate entities, while domain interactions are subject to additional physical constraints, such as the peptide linker(s), which bring other considerations into the equation. The most rational classification of protein-protein interactions combines several criteria, regarding (1) the similarity of the subunits (homo-oligomeric and hetero-oligomeric complexes), (2) the thermodynamics of the association (non-obligate and obligate complexes), and (3) the kinetics of the interaction (transient and permanent complexes) [20]. More recent interface classifications are based on geometrical and evolutionary criteria, aiming at the automated high-throughput annotation of new interfaces [21].

12.4 Computational prediction of protein binding sites
Based on the above considerations, different strategies have been developed for the specific prediction of protein-protein binding sites [22,23]. The different strategies used for the prediction of protein binding sites and hot spots are presented in Fig. 12.1.

12.4.1 Protein protein docking
Protein-protein docking is a computational method that predicts the structure of the bound protein-protein complex from the coordinates of the two unbound proteins.

Figure 12.1 Different computational strategies used for the prediction of protein binding sites.

A number of protein-protein docking methods with reasonable predictive success have been reported [6]. Most docking methods are based on a rigid-body approach, which is valid for cases with only small side-chain movement upon binding. Many sampling methods perform an exhaustive search using spherical polar Fourier, fast Fourier transform (FFT), or geometric hashing algorithms. The most common FFT-based docking programs are MolFit, FTDock, ZDOCK, and Global RAnge Molecular Matching (GRAMM-X). Besides FFT correlation, the

most successful shape-based methods include Hex and PatchDock. In contrast, another group of rigid-body docking methods employs energy-based sampling by energy minimization, molecular dynamics (MD), or Monte Carlo methodology, in combination with different energy-based scoring schemes, as in RosettaDock, Haddock, ICM-DISCO, or ATTRACT. Many methods score and further refine the generated docking poses after the initial docking step; for instance, ClusPro/SmoothDock uses an energy-based minimization function plus an additional clustering stage.
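The FFT trick used by programs such as FTDock and ZDOCK can be illustrated with a toy shape-complementarity score: the correlation of two 3D occupancy grids over all relative translations of the ligand is computed in a single pass via the convolution theorem. This is an illustrative sketch of the principle, not the scoring function of any particular program; the grids and grid size are invented for the example.

```python
import numpy as np

def shape_correlation(receptor_grid, ligand_grid):
    """Correlation of two 3D occupancy grids over all (circular) translations
    of the ligand, computed in one pass via the convolution theorem."""
    R = np.fft.fftn(receptor_grid)
    L = np.fft.fftn(ligand_grid)
    return np.real(np.fft.ifftn(R * np.conj(L)))

# Toy 8x8x8 grids: a 3x3x3 receptor region and a 2x2x2 ligand block.
rec = np.zeros((8, 8, 8)); rec[2:5, 2:5, 2:5] = 1.0
lig = np.zeros((8, 8, 8)); lig[0:2, 0:2, 0:2] = 1.0

scores = shape_correlation(rec, lig)          # one score per translation
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```

The best translation is the one that buries the whole ligand block inside the receptor region; real programs add a negative "core" penalty so that deep overlap is punished rather than rewarded.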

12.4.2 Binding site prediction based on the protein sequence
Very few methods for interface prediction are based solely on sequence information. Among them, correlated mutations, identified from multiple sequence alignments, have been used to detect putative protein-protein interfaces from sequences, based on the hypothesis that residues involved in intermolecular contacts tend to mutate simultaneously during evolution [24]. In a different approach, receptor-binding domains are predicted by analysing the hydrophobicity distribution along protein sequences [25]. The predictions are reported to have between 59% and 80% coverage (sensitivity), depending on the set of protein interactions used, which shows how strongly the predictive results depend on the data set. A method using support vector machines (SVMs) for interface prediction entirely based on the protein sequence showed similar sensitivity but a rather low positive predictive value [26]. Another machine learning-based method, Interaction Sites Identified from Sequence (ISIS), has also been developed to identify interacting residues from protein sequences alone. This method improves prediction accuracy by integrating evolutionary information without using the 3D structure of any protein as reference, and it has a very high positive predictive value for interface residues at the expense of a very low sensitivity. Interestingly, the fact that the method predicted only very few residues, but with high accuracy, suggests that these residues might be truly important for the binding.
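As a minimal illustration of sequence-only prediction in the spirit of the hydrophobicity-based approach [25], the sketch below flags residues whose windowed mean Kyte-Doolittle hydrophobicity exceeds a cutoff. The window size and cutoff are arbitrary choices for the example, not published parameters of any method.

```python
# Kyte-Doolittle hydrophobicity scale (standard values).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def predict_interface(seq, w=2, cutoff=1.5):
    """Flag positions whose mean hydrophobicity over a +/-w window exceeds
    `cutoff` -- a toy hydrophobic-patch predictor, sequence input only."""
    hits = []
    for i in range(len(seq)):
        lo, hi = max(0, i - w), min(len(seq), i + w + 1)
        mean = sum(KD[a] for a in seq[lo:hi]) / (hi - lo)
        if mean >= cutoff:
            hits.append(i)
    return hits

# A hydrophobic stretch flanked by charged residues.
patch = predict_interface("DDDDDIIIIIDDDDD")
```

Real sequence-based predictors replace the single hydrophobicity feature with sequence profiles and train an SVM or neural network on known interfaces, but the sliding-window feature extraction is the same.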

12.4.3 Binding site prediction based on the protein structure
The advantage of sequence-based methods for interface prediction is that they can potentially be applied to a broader set of cases. However, their predictive results are usually poorer than those obtained from methods based on structural information.
Empirical scoring
As already discussed, it is difficult to extract common physicochemical or structural properties from protein-protein complexes in order to find simple patterns on protein surfaces that can identify protein binding sites. However, some specific features can be

observed in certain types of interactions. Many groups have developed knowledge-based functions that can be used for binding site prediction, such as interface propensities derived from complex structures or conservation scores. One such method, based on residue interface propensities and surface patch physicochemical properties (solvation potential, hydrophobicity, planarity, protrusion, and accessible surface area), was developed on a set of 59 complexes, which included homodimers, heterocomplexes, and antibody-antigen complexes; the predicted patches showed some correlation with the real interfaces [27]. Another method using interface residue propensity values derived from datasets of structures is InterProSurf, which is based on a propensity scale and the solvent accessibility of residues, plus further clustering [28]. SiteEngine uses hierarchical scoring schemes to combine different descriptors, such as a low-resolution surface representation of physicochemical properties and surface shape [29]. The Protein intErface Recognition (PIER) method has also been proposed for the identification of interface residues [30], based on the statistical properties of each surface atom type and subsequent clustering in patches generated as in the optimal docking area (ODA) method [31].
Sequence conservation
Many methods include sequence conservation or evolutionary information in addition to other descriptors. For instance, a 3D cluster analysis of residue conservation scores based on the alignment of homologous sequences was shown to identify protein-protein interfaces and functional residues, as demonstrated on a set of 35 protein families [32]. Similarly, the Evolutionary Trace (ET) method identifies functional residues, potentially involved in protein-protein interactions, based on the analysis of sequence alignments, mapping of conserved/unconserved residues onto the 3D structure, and clustering [33].
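Conservation-based predictors of this kind all start from a per-column conservation score computed on a multiple sequence alignment. A minimal sketch using normalised Shannon entropy follows; the normalisation constant and gap handling are illustrative choices, not those of any specific server.

```python
import math
from collections import Counter

def column_conservation(msa):
    """Score each alignment column as 1 - normalised Shannon entropy
    (gaps ignored); 1.0 = fully conserved, lower = more variable."""
    scores = []
    for c in range(len(msa[0])):
        column = [seq[c] for seq in msa if seq[c] != '-']
        n = len(column)
        entropy = -sum((k / n) * math.log2(k / n)
                       for k in Counter(column).values())
        # Normalise by the maximum possible entropy over 20 amino acids.
        scores.append(1.0 - entropy / math.log2(20))
    return scores

# Toy alignment: column 0 is invariant, column 2 carries one substitution.
cons = column_conservation(["ACDW", "ACDY", "ACEW", "ACDW"])
```

Methods such as ET or ConSurf then map the most conserved positions onto the 3D structure and cluster them into candidate functional patches.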
ConSurf is another interface prediction method based on sequence conservation [34]. The procedure analyzes multiple sequence alignments in search of conserved functional regions, which are then mapped onto the 3D structure of the interacting proteins. Along this line, the Joint Evolutionary Trees (JET) method, built on the ET method, focuses on improving the sequence alignments and the functional and structural detection of conserved residues [34]. Similarly, the TreeDet server predicts functional sites based on conservation and evolutionary information [35], and the related 'specificity determining positions' (SDPs) can be applied to identify protein-protein binding sites [36]. The ProMate server uses a combination of conservation, physical, and empirical parameters to predict protein-protein interfaces with a Naïve Bayesian approach [37]. Protein INterface residUe Prediction (PINUP) uses a combination of parameters based on residue energy, interface propensity, and conservation scores, optimized on a dataset of 57 proteins, on which the method yielded 44.5% positive predictive value and 42.2% sensitivity [38]. The Protein-protein Interaction prediction by Structural Matching (PRISM) server is based on geometric complementarity and residue conservation. This server is also applicable in

performing challenging tasks such as predicting whether two given proteins will interact at all [39]. The What Information does Surface Conservation Yield (WHISCY) server uses surface conservation and structural information to predict interface residues, achieving more than three times higher accuracy than random predictions [40].
Machine learning techniques
The cons-Protein-Protein Interaction Site Prediction (cons-PPISP) method is based on a neural network that takes into consideration the sequence profiles of the adjacent residues and their solvent exposure [41]. This machine learning method was trained on 615 pairs of non-homologous protein-protein complexes (homodimers and heterodimers) and was subsequently tested on different bound and unbound proteins. For unbound proteins, about 70% of the predicted residues were correctly located at the given protein-protein interfaces. Patch Finder Plus is another neural network-based method that combines residue conservation, frequency and composition, surface concavity, accessible area, and H-bond potential, with the goal of finding large electrostatic patches. The method was developed to find DNA-binding regions, but in some cases these regions can also overlap with protein binding sites [42]. Another neural network method based on evolutionary information and physicochemical surface properties has been reported with high predictive coverage [43]. The web server Protein-Protein Interface PREDiction (PPI-PRED) uses an SVM to evaluate different parameters such as surface shape, solvent accessible surface area, conservation, electrostatic potential, hydrophobicity, and interface residue propensity.
The reported success rates using leave-one-out cross-validation are difficult to compare to other methods because of the non-standard definition of a correct prediction (a patch with over 50% positive predictive value and 20% sensitivity ranked in the top three) and the test set used, composed of transient and obligate interfaces [44]. The Solvent accessibility based Protein-Protein Interface iDEntification and Recognition (SPPIDER) method uses machine learning approaches, such as SVMs and neural networks, to evaluate relative solvent accessibility predictions as a fingerprint for interaction sites, together with a number of other parameters [45]. The authors showed that this method improved on the predictions obtained with other parameters such as evolutionary conservation, physicochemical character, and structural features. Another SVM method, based on evolutionary conservation signals and local surface properties and trained on a non-redundant dataset of 1494 protein-protein interfaces, showed 39% positive predictive value and 57% sensitivity at residue-level predictions in cross-validation tests on the bound conformations of a total of 632 dimers (of which 518 were homodimers) [46].
Meta-servers
The meta-server meta-PPISP combines the neural network method cons-PPISP with other web servers such as ProMate and PINUP. The different scores were

combined with weighting factors obtained by a linear regression method trained on 35 non-redundant proteins. The cross-validation predictive results improved over those of the individual servers (PPV increased by 4.8-18.2 percentage points) [47]. In general, meta-servers can be a convenient way of accessing several different servers, but caution is advised when interpreting the results, in order to evaluate the contribution of each individual method.
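The combination step can be sketched with ordinary least squares: fit one weight per server on residues with known interface labels, then score new residues with the weighted sum. The per-residue scores and labels below are toy numbers, not output of the actual servers, and the real meta-PPISP weighting details may differ.

```python
import numpy as np

# Toy per-residue interface scores from three hypothetical servers (columns);
# rows are residues of a small training set with known interface labels.
X = np.array([[0.9, 0.2, 0.8],
              [0.8, 0.3, 0.7],
              [0.1, 0.7, 0.2],
              [0.2, 0.6, 0.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])  # 1 = interface residue

# Fit one weight per server by ordinary least squares (linear regression).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Meta-score: weighted combination of the individual server scores.
meta_score = X @ w
```

With correlated input scores the fitted weights also reveal how much each server actually contributes, which is the caution raised in the text.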

12.4.4 Energy-based methods
Other methods for predicting protein binding sites are based on energy considerations. For instance, the Solvation potential, Hydrophobicity, Accessible surface area, Residue interface propensity, Planarity and Protrusion (SHARP2) server combines solvation potential and hydrophobicity calculations with other geometric descriptors and propensity scores. The ODA method is built on the hypothesis that desolvation is a crucial factor in protein-protein binding. It is based on a computer algorithm that identifies continuous surface patches of optimal docking desolvation energy. The size of the patches is not fixed; it is determined by an iterative procedure that, from each starting point, searches for the circular surface patch with the most favorable desolvation energy. The method was benchmarked on 66 unbound non-redundant protein structures involved in non-obligate protein-protein hetero-complexes, where the ODA-predicted regions corresponded to real interfaces in 80% of the cases [31]. The limitation is that the method can only be applied to cases where the desolvation effect is important; in approximately the other half of the cases, in which the role of electrostatics is more evident, there is no predictive signal. ODA has been applied to proteins of biological and therapeutic interest and has shown good predictive results [48]. Recently, this method has been implemented in the SEQMOL package. The above-described desolvation energy descriptor has also been included as part of a method to predict protein interaction sites based on clefts in protein surfaces using Q-SiteFinder [49]. This desolvation descriptor was compared to several others and achieved excellent predictive results in all types of complexes, being the top predictor for antibody-antigen and the 'other' type of complexes. Energy-based docking simulations have also been used to identify protein interfaces.
Although the ultimate goal of docking is to predict the binding mode of two interacting proteins, it has been observed that docking solutions sample the interface regions more frequently, even when using a low-resolution protein representation [50]. This is consistent with the fact that conformational changes upon protein-protein association are often limited to local movements, which suggests that in many cases protein-protein association can be represented by a rigid-body fit [51]. The inclusion of energy-based

descriptors to sample and score rigid-body docking poses improved the docking energy landscapes and the tendency towards the real interfaces. This led to the development of a residue-based normalized interface propensity (NIP) parameter, computed from the ensemble of the roughly 100 lowest-energy docking poses, which is used to identify surface residues potentially involved in protein-protein interactions. A cut-off value of 0.4 for the NIP is reported to predict known protein-protein interfaces on unbound proteins with a PPV of over 80%, although with quite low sensitivity. The method identifies only a few residues of the interface, but its high accuracy suggests that these residues might be the important ones for the interaction.
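The NIP idea reduces to counting, for each residue, how often it appears at the docked interface among the low-energy poses and normalising against the average frequency. The normalisation below is an illustrative sketch; the exact formula used by pyDock may differ, and the counts are invented.

```python
import numpy as np

def nip_scores(interface_counts, n_poses):
    """Normalised interface propensity: how often each residue appears at
    the docked interface across the low-energy poses, relative to the mean
    over all residues (this particular normalisation is illustrative)."""
    freq = np.asarray(interface_counts, dtype=float) / n_poses
    mean = freq.mean()
    return (freq - mean) / (1.0 - mean)

# Interface appearances of five surface residues in the 100 lowest-energy poses.
nip = nip_scores([90, 50, 10, 5, 5], 100)

# Apply the 0.4 cut-off quoted in the text.
predicted_interface = [i for i, v in enumerate(nip) if v >= 0.4]
```

A residue present in nearly every low-energy pose scores close to 1, matching the high-PPV, low-sensitivity behaviour described above: few residues pass the cut-off, but those that do are strong interface candidates.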

12.5 Hot-spot residues at protein interfaces
Protein-protein interfaces are formed by a wide variety of residues, many of which are important for specificity or for dynamic considerations. However, it has been reported that most of the binding free energy is usually contributed by a small number of residues, known as 'hot spots' [52]. The term 'hot spot' was later used to describe the key residues in the human growth hormone-receptor complex [53,54]. Different studies have tried to find characteristic structural features of the known hot spots. The current view is that protein-protein interfaces are made up of different residues that play a major role in the specific interactions, with a group of conserved hot-spot residues acting as binding site anchors required for the stabilization of the complex. One basic observation is that the number of hot spots increases with the size of the interface. Structurally, hot spots are surrounded by moderately conserved and energetically less important residues forming a hydrophobic O-ring responsible for bulk solvent exclusion [55]. They appear to be clustered in tightly packed regions in the center of the interface [54]. However, no single attribute, such as shape, charge, or hydrophobicity, has been found that can unequivocally define a hot spot by itself. Tryptophan, arginine, and tyrosine are the most frequently found hot-spot residues, while leucine, serine, threonine, and valine are the least frequent [56]. Hot spots have recently been shown to correlate with relevant nodes of residue networks in protein interfaces [57]. Interestingly, in hub proteins of protein-protein networks, different hot regions can be used to bind different partners [58]. Regarding flexibility, MD simulations have shown that hot spots are quite rigid compared to the surrounding interface residues. Identifying the hot spots of a given interaction is not straightforward.
Every contributing residue must be taken into account. The energetic contribution of each residue can be determined experimentally by alanine scanning mutagenesis in combination with various biophysical methods [59]. There are databases of experimentally measured binding energies of hot-spot residues, such as ASEdb, the Binding Interface Database (BID), and HotSprint.

However, the experimental characterization of protein interfaces in search of hot spots is still costly and technically cumbersome, so several computational methods have been developed for the prediction of hot spots in protein-protein interactions.

12.6 Prediction of hot spots in protein protein interactions
Different scoring schemes have been reported for computational hot-spot prediction, based on residue conservation, hydrogen bonding, or full binding energy calculations. Other approaches have combined all of these features with machine learning techniques. Although a few methods can predict hot spots based only on protein sequences, most of the available methods need some structural information as input.

12.6.1 Hot-spot prediction based on the sequence
Very few methods have been reported to make hot-spot predictions based only on the protein sequence. A neural network method, ISIS, initially designed to predict interface residues from protein sequences, was later also applied to determine hot spots on a dataset of 296 mutations pertaining to 30 different complexes. The remarkably high predictive rates reported from the protein sequence alone can be explained by the restricted definition of positive and negative predictions used in the benchmark. Hot spots were defined as those residues with a ΔΔG upon mutation > 2.5 kcal/mol, whereas non-hot-spot residues were only those with ΔΔG = 0 kcal/mol. Thus, the test left out all mutants with ΔΔG < 0 (even though these residues should be considered non-hot spots) or 0 < ΔΔG ≤ 2.5 kcal/mol, which can in fact be of high interest in a realistic situation and are perhaps the most difficult residues to classify. Most other methods establish a single cut-off to classify the predictions as hot spot or non-hot spot, so the success rates reported for the ISIS method cannot be compared to them.
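The restricted labelling scheme described above is easy to state in code; the thresholds are the ones quoted in the text, and the function name is ours.

```python
def classify_mutant(ddG, hot_cutoff=2.5):
    """ISIS-style restricted labelling of an alanine mutant by its change
    in binding free energy ddG (kcal/mol): only ddG > 2.5 counts as a hot
    spot and only ddG == 0 as a non-hot spot; everything in between (and
    all stabilising mutants, ddG < 0) is simply left out of the benchmark."""
    if ddG > hot_cutoff:
        return "hot spot"
    if ddG == 0.0:
        return "non-hot spot"
    return "unlabelled"

labels = [classify_mutant(x) for x in (3.1, 0.0, 1.2, -0.5)]
```

The "unlabelled" branch makes the benchmark bias explicit: the borderline mutants a single-cutoff method must classify are exactly the ones this scheme never scores.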

12.6.2 Hot-spot prediction based on the structure
Most of the available hot-spot prediction methods are based on the 3D structure of the complex. Such methods can be classified as based on empirical functions, conservation data, or energy considerations.
Empirical function
Due to their energetic importance, hot-spot residues are expected to be conserved at the interfaces within the members of a given family [56]. This concept has been used to computationally identify hot spots in a method known as Multiple Alignment of Protein-Protein InterfaceS (MAPPIS). This approach predicts hot spots by

detecting spatially conserved patterns through multiple alignment of physicochemical interactions and binding properties in 3D space [60]. MAPPIS success rates in predicting hot spots have been analyzed on a dataset of 440 mutants from 10 different complexes, yielding quite good predictive rates [61]. However, this method needs a sufficient number of high-resolution complex structures of functionally similar proteins in order to build reliable structural alignments. A recently reported web server, HotPoint [62], predicts hot spots using an empirical model based on relative accessibility in the complex state and knowledge-based pair potentials.
Energy-based
Several methods for hot-spot prediction are based on computational alanine scanning of a protein-protein complex. This approach consists in computing the variation of binding affinity (ΔΔG) upon in silico mutation of a given residue to alanine. One such method utilizes the energy-based scoring function in ROBETTA to predict hot spots, with reported high success on a dataset of 380 mutants pertaining to 19 different complexes. The FOLD-X Energy Function (FOLDEF) has also been used to provide a fast and quantitative evaluation of the interactions involved in a protein-protein complex [63]. On a set of 40 single alanine mutations from three complexes, this approach achieved 61% PPV and 72% sensitivity [64]. Using a different energy-based approach, two machine learning models, K-FADE and K-CON, have been reported to predict hot-spot residues on the basis of physical/biochemical features [65]. Combined, the two models are reported to provide better results than the individual ones, and better still when integrated with ROBETTA.
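Computational alanine scanning reduces to a loop over interface residues: mutate, re-score, take the difference. The energy function below is a toy stand-in for a real scoring function such as FOLDEF or ROBETTA, and the per-residue contributions are invented for the example; only the loop structure and the ΔΔG = ΔG_mut - ΔG_wt bookkeeping carry over.

```python
def alanine_scan(residues, binding_energy, hot_cutoff=2.0):
    """In silico alanine scanning: mutate each position to ALA, re-score
    the complex, and keep positions whose ddG = dG_mut - dG_wt meets the
    cutoff (2.0 kcal/mol is a commonly used hot-spot threshold)."""
    wild_type = binding_energy(residues)
    ddg = {}
    for i, res in enumerate(residues):
        mutant = residues[:i] + ["ALA"] + residues[i + 1:]
        ddg[res + str(i)] = binding_energy(mutant) - wild_type
    return {name: v for name, v in ddg.items() if v >= hot_cutoff}

# Toy per-residue contributions (kcal/mol, negative = favourable); a real
# application would call a full force-field scoring function here.
CONTRIB = {"TRP": -4.0, "ARG": -3.0, "SER": -0.5, "ALA": -0.2}

def toy_energy(residues):
    return sum(CONTRIB[r] for r in residues)

hotspots = alanine_scan(["TRP", "SER", "ARG"], toy_energy)
```

In this toy model the bulky tryptophan and arginine lose the most binding energy when truncated to alanine, mirroring the residue preferences noted in Section 12.5.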

12.6.3 Hot-spot prediction based on the unbound protein structure
All of the above-described methods can give reliable predictions of hot spots for a given protein-protein complex. Nevertheless, the requirement of the 3D structure of the complex is a major limitation. Unfortunately, for most protein-protein interactions the 3D structure of the complex is not yet available, so the aforementioned methods have limited applicability. Only a few hot-spot prediction methods have been reported that rely solely on the structures of the unbound proteins. An interface prediction method based on computational protein-protein docking simulations, and therefore suitable when the structure of the complex is not available, has recently been applied to hot-spot prediction. The pyDockNIP method is a variation of the original algorithm for interface prediction. The original NIP calculations used a slower docking approach based on an ICM pseudo-Brownian rigid-body docking search with a complete energy function including van der Waals, hydrogen bonding, electrostatics, and desolvation terms. This rather sophisticated docking and scoring scheme was

replaced by a simpler one, which had achieved similar docking results in previous tests, based on a faster FFT-based docking search (FTDock and ZDOCK) and a simple energy function implemented in pyDock. The new NIP values from the FFT-based rigid-body docking and pyDock scoring successfully predicted known hot-spot residues. This was the first reported systematic application of protein-protein docking calculations to the identification of hot-spot residues. This kind of approach can be especially helpful in drug design projects targeting protein-protein interactions when no structural information about the complex is available. Another interesting aspect of the NIP values is that they can also be seen as residue binding free energies estimated from the Boltzmann population of the two states in which a given residue can be found after the docking simulations: either exposed or involved in the docking interface. A similar approach, but based on the distribution of MD conformations of organic solvents around the protein, has also detected hot spots in selected test cases [66].

12.7 Conclusion
Protein-protein interfaces are usually large and formed by a variety of residues, which is needed in order to achieve high affinity and specificity in protein-protein recognition. The physicochemical features of protein interfaces strongly depend on the type of association, ranging from obligate complexes, in which the separated components are not stable, to transient interactions, in which the lifetime of the complex is extremely short. Based on observed structural and physicochemical patterns, conservation, and energy considerations, a number of computational methods have been reported for the prediction of protein binding sites, which we have discussed here. Most of the methods show quite good predictive success rates, so interface prediction is becoming a common tool to help characterize a given protein-protein interaction. However, important challenges remain, such as the difficulty of identifying the relevant interface for each partner in cases of shared binding sites or multiple interfaces, or the lack of truly negative data in benchmark tests. Future efforts should focus on including flexibility and allosteric considerations in the predictions, as well as on improving affinity and specificity predictions when dealing with multiple interfaces. In spite of the variety of protein interfaces, it has been observed that most of the binding affinity usually arises from only a few residues, the so-called hot spots, which are important from a functional and practical point of view and can be used as starting points for drug discovery targeting protein-protein interactions. We report here a survey of methods that have been developed to predict such hot spots. The majority of them need the structure of the protein-protein complex, although a few are able to identify hot spots on the structures of the unbound proteins or homology-based models, which opens the door

to large-scale identification of hot-spot residues in protein interaction networks. However, the currently limited availability of experimental data is hampering further advancement in method development. One of the goals of binding site and hot-spot prediction methods is to assist therapeutic drug discovery programs. The field is highly promising, and several small molecules have already been reported to disrupt protein interactions of therapeutic interest.
Solved exercise
Exercise 1: Analyse and identify the hot-spot residues present in the HER2 protein using the HotPoint web server [62].
Note: HotPoint is based on a few simple rules involving solvent accessibility and the energetic contribution of residues. The thresholds of the model were adjusted on a data set of 150 experimentally alanine-mutated residues, of which 58 are hot spots and 92 are non-hot spots. Interface residues whose mutation changes the binding free energy by at least 2.0 kcal/mol are considered experimental hot spots; if the mutation results in a change of less than 0.4 kcal/mol, the residue is labeled an experimental non-hot spot.
Step-by-step protocol
Input: The input is the protein structure as a PDB-formatted coordinate file, so type "3PP0" for the HER2 protein.
Note: The server does not work for PDB files containing only one chain and returns an error. HotPoint is specific to protein-protein interfaces; chains corresponding to DNA structures return a warning in the web server.
The two chain identifiers forming the interface must be defined, so type "A" in chain 1 and "B" in chain 2. The user can either run the server with the default distance threshold to extract interface residues or change the interface definition by submitting a different distance threshold. Here we use the default values.
Extraction of computational hot spots: The HotPoint server performs three consecutive steps.
Extraction of interface residues: If the distance between any two atoms belonging to two residues, one from each chain, is less than the sum of their van der Waals radii plus a 0.5 Å tolerance, the two residues are defined as interacting.
Calculation of the features: Residue solvent accessibilities are calculated using Naccess. The contact potential matrix holds 210 distinct contact potentials between all possible pairs of the 20 amino acids, in RT units (R, universal gas constant; T, absolute temperature). The total contact potential of a residue is defined as the absolute value of the sum of the contact potentials with its neighbors.

Prediction based on the empirical model: Finally, the empirical model is applied to each residue to determine whether it is a computational hot spot. If the relative accessibility of an individual interface residue is at most 20% and its total contact potential is at least 18.0, it is labeled a hot spot.
Results: The output of the server is a table (Fig. 12.2) listing the interface residues with their features; the predicted hot spots are highlighted with a red background. The prediction results as a text file and the interface residue coordinates in PDB format can also be downloaded by the user (Fig. 12.3).
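The two rules of the protocol are simple enough to sketch directly. The van der Waals radii are standard literature values, and the ≤20% accessibility / ≥18.0 contact potential directions of the thresholds follow our reading of the protocol text; treat this as an illustration of the rules, not the server's exact implementation.

```python
import math

# Van der Waals radii (Angstrom) for common protein heavy atoms.
VDW = {"C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}

def interacting(atom1, atom2, tol=0.5):
    """Interface criterion: two atoms from different chains interact if
    their distance is below the sum of their vdW radii plus a 0.5 A
    tolerance; two residues with any such atom pair are interface residues."""
    (elem1, xyz1), (elem2, xyz2) = atom1, atom2
    return math.dist(xyz1, xyz2) < VDW[elem1] + VDW[elem2] + tol

def is_hotspot(relative_accessibility, total_contact_potential):
    """Empirical rule: a buried interface residue (relative accessibility
    <= 20%) with a high total contact potential (>= 18.0) is labeled a
    computational hot spot (threshold directions assumed from the text)."""
    return relative_accessibility <= 20.0 and total_contact_potential >= 18.0

# A carbon and a nitrogen 3.0 A apart: within 1.7 + 1.55 + 0.5 = 3.75 A.
pair = (("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)))
```

In the real server the accessibility comes from Naccess and the contact potential from the 210-entry knowledge-based matrix; here both are simply passed in as numbers.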

Figure 12.2 Tabular representation of identified hotspot residues with residue number.


Figure 12.3 3-D color-coded representation of identified hotspot residues in 3PP0.

Unsolved exercises
Exercise 2: Identify the hot spots at the protein-protein interface of the aldehyde reductase (ALR1) and aldose reductase (ALR2) proteins using the PRISM web server.
Note: For help please refer
Exercise 3: Perform fragment hotspot mapping on protein kinase B and pantothenate synthetase.
Note: For help please refer

References [1] R.P. Bahadur, P. Chakrabarti, F. Rodier, J. Janin, A dissection of specific and non-specific protein protein interfaces, J. Mol. Biol. 336 (2004) 943 955. [2] H. Zhu, F.S. Domingues, I. Sommer, T. Lengauer, NOXclass: prediction of protein-protein interaction types, BMC Bioinf. 7 (2006) 27. [3] K. Henrick, J.M. Thornton, PQS: a protein quaternary structure file server, Trends Biochem. Sci. 23 (1998) 358 361. [4] Q. Xu, A. Canutescu, Z. Obradovic, R.L. Dunbrack Jr, ProtBuD: a database of biological unit structures of protein families and superfamilies, Bioinformatics 22 (2006) 2876 2882. [5] B. Huang, M. Schroeder, Using protein binding site prediction to improve protein docking, Gene 422 (2008) 14 21. [6] D.W. Ritchie, Recent progress and future directions in protein-protein docking, Curr. Protein Peptide Sci. 9 (2008) 1 15. [7] S. Jones, J.M. Thornton, Analysis of protein-protein interaction sites using surface patches, J. Mol. Biol. 272 (1997) 121 132. [8] F. Glaser, D.M. Steinberg, I.A. Vakser, N. Ben-Tal, Residue frequencies and pairing preferences at protein protein interfaces, Proteins: Struct., Funct., Bioinf. 43 (2001) 89 102. [9] L.L. Conte, C. Chothia, J. Janin, The atomic structure of protein-protein recognition sites, J. Mol. Biol. 285 (1999) 2177 2198. [10] F.B. Sheinerman, B. Honig, On the role of electrostatic interactions in the design of protein protein interfaces, J. Mol. Biol. 318 (2002) 161 177. [11] E.D. Levy, A simple definition of structural regions in proteins and its use in analyzing interface evolution, J. Mol. Biol. 403 (2010) 660 670. [12] D. Rajamani, S. Thiel, S. Vajda, C.J. Camacho, Anchor residues in protein protein interactions, Proc. Natl Acad. Sci. 101 (2004) 11287 11292.

[13] O.N. Yogurtcu, S.B. Erdemli, R. Nussinov, M. Turkay, O. Keskin, Restricted mobility of conserved residues in protein-protein interfaces in molecular simulations, Biophys. J. 94 (2008) 3475 3485. [14] A. Zen, C. Micheletti, O. Keskin, R. Nussinov, Comparing interfacial dynamics in protein-protein complexes: an elastic network approach, BMC Struct. Biol. 10 (2010) 26. [15] A. del Sol, C.-J. Tsai, B. Ma, R. Nussinov, The origin of allosteric functional modulation: multiple pre-existing pathways, Structure 17 (2009) 1042 1050. [16] J.R. Perkins, I. Diboun, B.H. Dessailly, J.G. Lees, C. Orengo, Transient protein-protein interactions: structural, functional, and network properties, Structure 18 (2010) 1233 1243. [17] S.A. Ozbabacan, A. Gursoy, O. Keskin, R. Nussinov, Conformational ensembles, signal transduction and residue hot spots: application to drug discovery, Curr. Opin. Drug. Discov. Dev. 13 (2010) 527 537. [18] S. Jones, J.M. Thornton, Principles of protein-protein interactions, Proc. Natl Acad. Sci. 93 (1996) 13 20. [19] Y. Ofran, B. Rost, Analysing six types of protein protein interfaces, J. Mol. Biol. 325 (2003) 377 387. [20] I.M. Nooren, J.M. Thornton, Diversity of protein protein interactions, EMBO J. 22 (2003) 3486 3492. [21] W.K. Kim, A. Henschel, C. Winter, M. Schroeder, The many faces of protein protein interactions: a compendium of interface geometry, PLoS Comput. Biol. 2 (2006) e124. [22] S. Leis, S. Schneider, M. Zacharias, In silico prediction of binding sites on proteins, Curr. Med. Chem. 17 (2010) 1550 1562. [23] N. Tuncbag, G. Kar, O. Keskin, A. Gursoy, R. Nussinov, A survey of available tools and web servers for analysis of protein protein interactions and interfaces, Brief. Bioinf. 10 (2009) 217 232. [24] F. Pazos, M. Helmer-Citterich, G. Ausiello, A. Valencia, Correlated mutations contain information about protein-protein interaction, J. Mol. Biol. 271 (1997) 511 523. [25] X. Gallet, B. Charloteaux, A. Thomas, R.
Brasseur, A fast method to predict protein interaction sites from sequences, J. Mol. Biol. 302 (2000) 917 926. [26] I. Reˇs, I. Mihalek, O. Lichtarge, An evolution based classifier for prediction of protein interfaces without using protein structures, Bioinformatics 21 (2005) 2496 2501. [27] S. Jones, J.M. Thornton, Prediction of protein-protein interaction sites using patch analysis, J. Mol. Biol. 272 (1997) 133 143. [28] S.S. Negi, C.H. Schein, N. Oezguen, T.D. Power, W. Braun, InterProSurf: a web server for predicting interacting sites on protein surfaces, Bioinformatics 23 (2007) 3397 3399. [29] A. Shulman-Peleg, R. Nussinov, H.J. Wolfson, Recognition of functional sites in protein structures, J. Mol. Biol. 339 (2004) 607 633. [30] I. Kufareva, L. Budagyan, E. Raush, M. Totrov, R. Abagyan, PIER: protein interface recognition for structural proteomics, Proteins: Struct., Funct., Bioinf. 67 (2007) 400 417. [31] J. Fernandez-Recio, M. Totrov, C. Skorodumov, R. Abagyan, Optimal docking area: a new method for predicting protein protein interaction sites, Proteins: Struct., Funct., Bioinf. 58 (2005) 134 143. [32] R. Landgraf, I. Xenarios, D. Eisenberg, Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins, J. Mol. Biol. 307 (2001) 1487 1502. [33] D.H. Morgan, D.M. Kristensen, D. Mittelman, O. Lichtarge, ET viewer: an application for predicting and visualizing functional sites in protein structures, Bioinformatics 22 (2006) 2049 2050. [34] S. Engelen, L.A. Trojan, S. Sacquin-Mora, R. Lavery, A. Carbone, Joint evolutionary trees: a large-scale method to predict protein interfaces based on sequence sampling, PLoS Comput. Biol. 5 (2009) e1000267. [35] A. Carro, M. Tress, D. De Juan, F. Pazos, P. Lopez-Romero, A. Del Sol, et al., TreeDet: a web server to explore sequence space, Nucleic Acids Res. 34 (2006) W110 W115. [36] A. Rausell, D. Juan, F. Pazos, A. 
Valencia, Protein interactions and ligand binding: from protein subfamilies to functional specificity, Proc. Natl Acad. Sci. 107 (2010) 1995 2000. [37] H. Neuvirth, R. Raz, G. Schreiber, ProMate: a structure based prediction program to identify the location of protein protein binding sites, J. Mol. Biol. 338 (2004) 181 199. [38] V. Chelliah, T.L. Blundell, J. Fernandez-Recio, Efficient restraints for protein protein docking by comparison of observed amino acid substitution patterns with those predicted from local environment, J. Mol. Biol. 357 (2006) 1669 1682.

Hotspot and binding site prediction: Strategy to target protein protein interactions 283 [39] O. Keskin, R. Nussinov, A. Gursoy, PRISM: protein-protein interaction prediction by structural matching, Funct. Proteom., Springer, 2008, pp. 505 521. [40] S.J. de Vries, A.D. van Dijk, A.M. Bonvin, WHISCY: what information does surface conservation yield? Application to data-driven docking, Proteins: Struct., Funct., Bioinf. 63 (2006) 479 489. [41] H. Chen, H.X. Zhou, Prediction of interface residues in protein protein complexes by a consensus neural network method: test against NMR data, Proteins: Struct., Funct., Bioinf. 61 (2005) 21 35. [42] E.W. Stawiski, L.M. Gregoret, Y. Mandel-Gutfreund, Annotating nucleic acid-binding function based on protein structure, J. Mol. Biol. 326 (2003) 1065 1079. [43] P. Fariselli, F. Pazos, A. Valencia, R. Casadio, Prediction of protein protein interaction sites in heterocomplexes with neural networks, Eur. J. Biochem. 269 (2002) 1356 1361. [44] J.R. Bradford, D.R. Westhead, Improved prediction of protein protein binding sites using a support vector machines approach, Bioinformatics 21 (2004) 1487 1494. [45] A. Porollo, J. Meller, Prediction-based fingerprints of protein protein interactions, Proteins: Struct., Funct., Bioinf. 66 (2007) 630 645. [46] A.J. Bordner, R. Abagyan, Statistical analysis and prediction of protein protein interfaces, Proteins: Struct., Funct., Bioinf. 60 (2005) 353 366. [47] H. Chen, H.-X. Zhou, Prediction of solvent accessibility and sites of deleterious mutations from protein sequence, Nucleic Acids Res. 33 (2005) 3193 3199. [48] D. Ferna´ndez, J. Vendrell, F.X. Avile´s, J. Ferna´ndez-Recio, Structural and functional characterization of binding sites in metallocarboxypeptidases based on optimal docking area analysis, Proteins: Struct., Funct., Bioinf. 68 (2007) 131 144. [49] N.J. Burgoyne, R.M. 
Jackson, Predicting protein interaction sites: binding hot-spots in protein protein and protein ligand interfaces, Bioinformatics 22 (2006) 1335 1342. [50] I.A. Vakser, O.G. Matar, C.F. Lam, A systematic study of low-resolution recognition in protein protein complexes, Proc. Natl Acad. Sci. 96 (1999) 8477 8482. [51] R.R. Gabdoulline, R.C. Wade, Protein-protein association: investigation of factors influencing association rates by Brownian dynamics simulations, J. Mol. Biol. 306 (2001) 1139 1155. [52] W.R. Tulip, V.R. Harley, R.G. Webster, J. Novotny, N9 neuraminidase complexes with antibodies NC41 and NC10: empirical free energy calculations capture specificity trends observed with mutant binding data, Biochemistry 33 (1994) 7986 7997. [53] T. Clackson, J.A. Wells, A hot spot of binding energy in a hormone-receptor interface, Science 267 (1995) 383 386. [54] O. Keskin, B. Ma, R. Nussinov, Hot regions in protein protein interactions: the organization and contribution of structurally conserved hot spot residues, J. Mol. Biol. 345 (2005) 1281 1294. [55] A.A. Bogan, K.S. Thorn, Anatomy of hot spots in protein interfaces, J. Mol. Biol. 280 (1998) 1 9. [56] B. Ma, T. Elkayam, H. Wolfson, R. Nussinov, Protein protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces, Proc. Natl Acad. Sci. 100 (2003) 5772 5777. [57] N. Tuncbag, F.S. Salman, O. Keskin, A. Gursoy, Analysis and network representation of hotspots in protein interfaces using minimum cut trees, Proteins: Struct., Funct., Bioinf. 78 (2010) 2283 2294. [58] E. Cukuroglu, A. Gursoy, O. Keskin, Analysis of hot region organization in hub proteins, Ann. Biomed. Eng. 38 (2010) 2068 2078. [59] I.S. Moreira, P.A. Fernandes, M.J. Ramos, Hot spots—a review of the protein protein interface determinant amino-acid residues, Proteins: Struct., Funct., Bioinf. 68 (2007) 803 812. [60] A. Shulman-Peleg, M. Shatsky, R. Nussinov, H.J. 
Wolfson, Spatial chemical conservation of hot spot interactions in protein-protein complexes, BMC Biol. 5 (2007) 43. [61] S. Grosdidier, J. Ferna´ndez-Recio, Docking and scoring: applications to drug discovery in the interactomics era, Expert. Opin. Drug. Discov. 4 (2009) 673 686. [62] N. Tuncbag, O. Keskin, A. Gursoy, HotPoint: hot spot prediction server for protein interfaces, Nucleic Acids Res. 38 (2010) W402 W406.

284 Chapter 12 [63] R. Guerois, J.E. Nielsen, L. Serrano, Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations, J. Mol. Biol. 320 (2002) 369 387. [64] S. Grosdidier, J. Ferna´ndez-Recio, Identification of hot-spot residues in protein-protein interactions by computational docking, BMC Bioinf. 9 (2008) 447. [65] S.J. Darnell, D. Page, J.C. Mitchell, An automated decision-tree approach to predicting protein interaction hot spots, Proteins: Struct., Funct., Bioinf. 68 (2007) 813 823. [66] J. Seco, F.J. Luque, X. Barril, Binding site detection and druggability index from first principles, J. Med. Chem. 52 (2009) 2363 2371.


In-silico SNP analysis: An aid to identify novel potential deleterious SNPs in drug targets

13.1 Introduction

Bioinformatics tools can help pharmaceutical companies gain a better understanding of the effects of genomic variants on drug efficacy and drug toxicity. This can be achieved by using in-silico techniques to screen coding and highly conserved putative regulatory regions of genes to discover novel SNPs and deduce their functions. A SNP is a single DNA base substitution observed with a minor allele frequency (MAF) of at least 1% in a given population [1]. A single base change in the coding region of a gene is referred to as a nonsynonymous SNP (nsSNP) when it results in an amino acid substitution (AAS) in the encoded protein. Given that nsSNPs change the primary amino acid sequence, the function of the protein product might be altered, which in turn can result in significant alterations in drug target phenotypes and, therefore, in resistance against existing drugs [2]. SNPs affecting drug targets are more likely to occur at highly conserved loci, as these positions in genes are conserved throughout evolution and are crucial for protein folding and/or function. Currently available SNP prediction tools can be categorized as sequence-based [3] or structure-based; because most disease-causing SNPs affect protein stability, structure-based rules have been established to distinguish functionally significant SNPs from those that are functionally neutral [4]. SNPs are significant targets in clinical research and in the development of new drugs. It is therefore pertinent that researchers possess the ability to accurately distinguish protein function-altering nsSNPs from those that are functionally neutral, and to prioritize nsSNPs for future research in elucidating new drug targets.
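The distinction between synonymous and nonsynonymous coding changes can be made concrete in a few lines of code. The sketch below uses a deliberately tiny subset of the genetic code (an illustrative assumption; a real pipeline would use the full codon table):

```python
# Sketch: deciding whether a coding single-base change is synonymous or
# nonsynonymous. The codon table below is a small illustrative subset,
# not the full genetic code.
CODON_TABLE = {
    "GAA": "E", "GAG": "E",   # glutamate
    "GAT": "D", "GAC": "D",   # aspartate
    "GTA": "V", "GTG": "V",   # valine
}

def classify_coding_snp(ref_codon, alt_codon):
    """Return 'synonymous' if the amino acid is unchanged, else 'nonsynonymous'."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "nonsynonymous"

# GAA -> GAG keeps glutamate; GAA -> GTA swaps Glu for Val, an AAS.
print(classify_coding_snp("GAA", "GAG"))  # synonymous
print(classify_coding_snp("GAA", "GTA"))  # nonsynonymous
```

Only the second change is an nsSNP; it is these substitutions that the tools in this chapter attempt to score.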
As there is a vast number of SNPs, it might not be feasible for researchers to carry out wet-laboratory experiments on every SNP to determine its biological significance. Thus, researchers use bioinformatics tools to first screen for potentially deleterious SNPs affecting drug responses before further investigations are carried out using biological assays and techniques. This translates into



reduced costs and saved time. If prediction scores are available, functionally significant SNPs can be ranked quantitatively to further prioritize them for analysis. Many publicly available bioinformatics tools provide systematic means of predicting the functional significance of SNPs, complete with scores and annotations. However, users must spend much time and effort selecting the most appropriate one, and are often confused by the differences between tools. Here, we discuss various in-silico tools to provide researchers with a guide to selecting the most appropriate tool for their drug discovery exercise.

13.2 Sequence-based approaches to SNP analysis

Sequence- and structure-based methodologies are the most common approaches used in SNP prediction tools. The advantage of using the sequence-based approach alone is that results can be generated for a large number of substitutions [5], as structural information tends to be less available. Sequence-based predictions can be more encompassing than structure-based ones, as they can incorporate all types of effect at the protein level and may be applicable to any human protein with known relatives [6]. Overall, this approach has broader applicability because it does not require knowledge of three-dimensional (3D) structures to predict the impact on the function of the resulting protein. However, sequence-based predictions (based on homology and evolutionary conservation, Fig. 13.1) cannot shed light on the underlying mechanisms by which SNPs change protein phenotypes, which might have consequences for drug targets [7].
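The conservation idea underlying sequence-based prediction can be made concrete with Shannon entropy over MSA columns. This toy sketch (a simplification that ignores gaps and sequence weighting) flags fully conserved positions, where substitutions are most likely to be deleterious:

```python
import math
from collections import Counter

# Sketch: score conservation of each MSA column via Shannon entropy.
# Low entropy = high conservation; entropy 0 = fully conserved column.
def column_entropy(column):
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

msa = ["MKVL", "MKIL", "MRVL", "MKVL"]   # four toy homologous sequences
columns = ["".join(seq[i] for seq in msa) for i in range(len(msa[0]))]
scores = [column_entropy(col) for col in columns]
conserved = [i for i, s in enumerate(scores) if s == 0.0]
print(conserved)   # positions where every sequence agrees
```

Real tools add evolutionary weighting and substitution matrices on top of this basic signal, but the low-entropy columns are the ones where an AAS would be flagged.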

13.3 Structure-based approaches to SNP analysis

Structure-based approaches are useful as they shed light on how a given amino acid substitution can result in an altered protein phenotype by predicting its effect on the 3D

Figure 13.1 Sequence-based approach.


Figure 13.2 Structure-based approach.

structure (Fig. 13.2). The main disadvantage of a structure-based approach is that the 3D structures of most proteins are unknown; thus, this approach has limited applicability. Tools that integrate both approaches have the additional advantage of being able to assess the reliability of the predicted results by cross-referencing the results obtained from each. Tools that combine these approaches also employ different algorithms and methodologies for making predictions, thereby covering a wider range of aspects of SNP analysis.

13.4 Sequence-based prediction tools

13.4.1 SIFT

“Sorting intolerant from tolerant” (SIFT) is a sequence-based SNP prediction tool that focuses on human nsSNPs in dbSNP [8]. The National Center for Biotechnology Information (NCBI) allows SIFT to use relevant multiple sequence alignments (MSAs) from pre-computed BLAST searches [9]. SIFT predictions of the severity of AASs on protein function are based on sequence homology and the physiochemical properties of amino acids [10]. The tool assumes that amino acid residues important for function are conserved. From the MSA, the loci and corresponding amino acid residues deemed to be highly conserved are determined; SNPs that cause changes, especially physiochemical ones, at these conserved positions are deemed detrimental. SIFT calculates the probability that an amino acid change at a particular position is tolerated. Probability values are calculated using position-specific scoring matrices, and the scores generated are normalized probabilities that the AAS is tolerated [11]. If a score falls below SIFT's cut-off of 0.05, the respective AAS is predicted to be deleterious [12]. Output scores range from 0 to 1, with 0 being damaging and 1 being neutral; a lower score indicates that the AAS is more damaging to protein function. Such scores enable the quantitative comparison and ranking of SNPs

in the order of their biological significance, and are useful for researchers deciding which SNPs of a gene to examine first. When a protein sequence is entered as a search input, SIFT generates predictions for all AASs that would alter protein function. Published results suggest that the multivariate analysis of protein polymorphism (MAPP) and SNPs3D tools perform better than SIFT.
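The way SIFT-style scores are consumed downstream can be sketched as follows. The variant names are real TP53 substitutions, but the scores are invented for illustration, not actual SIFT output:

```python
# Sketch: apply SIFT's convention (0 = damaging, 1 = tolerated, cut-off 0.05)
# to a set of substitutions and rank them for prioritization.
def sift_call(score, cutoff=0.05):
    return "deleterious" if score < cutoff else "tolerated"

# Invented scores for illustration only.
scores = {"P53_R175H": 0.00, "P53_P72R": 0.31, "P53_V143A": 0.02}

calls = {aas: sift_call(s) for aas, s in scores.items()}
ranked = sorted(scores, key=scores.get)   # lower score = more damaging, rank first
print(calls)
print(ranked)
```

Ranking by ascending score gives the prioritized list the text describes, with the most damaging substitutions first.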

13.4.2 MAPP

MAPP is a sequence-based SNP prediction tool that takes into consideration the physiochemical variation present in each column of an MSA of homologous proteins. On the basis of this variation, it predicts the impact of all possible AASs on protein function. MAPP generates predictions by quantitatively comparing the deviation introduced by an AAS with the variation present in the corresponding column of the MSA; a greater deviation indicates a higher probability that the AAS damages protein function [13]. MAPP requires as search input either a protein sequence alignment in FASTA format or a tree in parenthesis representation with branch lengths. Gene or SNP IDs cannot be used as direct search inputs, and the tool requires prior knowledge of the protein alignment, which is tedious and limiting as such information might not be readily available. In addition, each search can only be done at the protein sequence level, not at the SNP or DNA level. MAPP uses a Java program with a command line rather than a web-based HTML search, a method that requires a certain amount of learning and familiarization. MAPP is also a ‘rubbish in, rubbish out’ prediction tool: if the wrong protein sequences are used in the MSA, all generated results are essentially useless and inaccurate. The protein sequences initially included in the alignment are therefore of utmost importance to the accuracy of the predictions. Although the results generated by MAPP are extensive, as a prediction is made for substitution of every possible amino acid residue at every position, significant effort is still required to analyse and interpret them.
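MAPP's core comparison can be sketched as a z-score of a physicochemical property against a column's own variation. The sketch below uses a single property (Kyte-Doolittle hydropathy) and toy data as a simplifying assumption; the real tool combines several properties:

```python
import statistics

# Sketch of MAPP's core idea: measure how far a substitution's physicochemical
# value falls from the variation already present in the MSA column.
# Kyte-Doolittle hydropathy values for the residues used below.
HYDROPATHY = {"I": 4.5, "V": 4.2, "L": 3.8, "A": 1.8, "D": -3.5, "K": -3.9}

def column_z_score(column_residues, substituted_residue):
    """Deviation of the substitution from the column mean, in column SDs."""
    values = [HYDROPATHY[r] for r in column_residues]
    mean = statistics.mean(values)
    sd = statistics.pstdev(values) or 1e-9   # guard a fully conserved column
    return abs(HYDROPATHY[substituted_residue] - mean) / sd

column = "IVLV"   # a tightly hydrophobic column from the alignment
# D deviates far more from this column than A does, so it would be
# flagged as the more probably damaging substitution.
print(column_z_score(column, "A"))
print(column_z_score(column, "D"))
```

The greater the deviation relative to the column's spread, the higher the predicted probability of damage, mirroring MAPP's interpretation.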

13.4.3 PANTHER

Protein analysis through evolutionary relationships (PANTHER) is a sequence-based SNP analysis tool that predicts the effect of SNPs based on evolutionary information encoded in protein sequence profiles. It performs evolutionary analysis of coding SNPs using HMM-based statistical modeling methods and MSAs. By calculating the substitution position-specific evolutionary conservation (subPSEC) score from an

alignment of evolutionarily related proteins, PANTHER estimates the likelihood of a particular nonsynonymous coding SNP causing a functional impact on the protein [14]. PANTHER can also classify proteins by function, adding another layer of refinement to SNP prediction. The tool generates a variety of outputs, the most useful being the probability that a particular variant is deleterious.
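PANTHER's subPSEC score is 0 for neutral substitutions and grows more negative as substitutions become more likely to be deleterious, with subPSEC below about -3 commonly cited as the deleterious cutoff. A logistic mapping of the kind below can turn such a score into a probability; treat the exact form and midpoint as assumptions for illustration, not PANTHER's published equation:

```python
import math

# Hedged sketch: map a subPSEC-style score (0 = neutral, more negative =
# more deleterious) onto a probability via a logistic curve centered at an
# assumed midpoint of -3 (so subPSEC = -3 gives P = 0.5).
def p_deleterious(subpsec, midpoint=-3.0):
    return 1.0 / (1.0 + math.exp(subpsec - midpoint))

print(p_deleterious(-3.0))                       # 0.5 at the assumed midpoint
print(p_deleterious(-6.0) > p_deleterious(-1.0)) # more negative -> higher P
```

The monotonic relationship is the important part: ranking variants by subPSEC and ranking them by this probability give the same order.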

13.4.4 Parepro

Prediction of amino acid replacement probability (Parepro) is an identification method based on a support vector machine (SVM) that determines whether a nonsynonymous single-base change has a deleterious or neutral effect on protein function. It uses evolutionary information surrounding a particular SNP, comprising three components: the residue difference (RD), the status of the mutation position (SM) and the mutation sequence environment (ME). Although Parepro's prediction is based on sequence information, it does not depend on the existence of many homologous sequences [15]. However, one drawback of this tool is that its predictions of a SNP being either neutral or disease-causing carry no weighted scores; thus, quantitative ranking of the deleterious effects of various SNPs cannot be carried out.

13.4.5 PhD-SNP

Predictor of human deleterious SNPs (PhD-SNP) is a tool based on SVM classification models. Three slightly different algorithms are available: ‘sequence-based’, ‘sequence- and profile-based’ and a ‘hybrid method’. The sequence-based algorithm carries out its analysis based on the local sequence environment of the mutation at hand. The hybrid method uses either the SVM sequence or the SVM profile, depending on the availability of a sequence profile for the sequence of interest. The tool aims to predict whether an nsSNP causing a single-point protein mutation is a neutral polymorphism or a deleterious one [16]. As search input, PhD-SNP requires the protein sequence and the position of the SNP; however, the tool cannot handle batch inputs of SNPs.

13.4.6 SNPs&GO

SNPs&GO is an SVM-based method that predicts disease-related mutations starting from a protein sequence, while simultaneously incorporating the functional annotation of the protein. The tool collects a unique framework of information derived from the protein sequence, the protein sequence profile, the local sequence environment of the SNP, features derived from sequence alignment, and protein function, adding a degree of complexity to the functional analysis of SNPs. The functionality of the protein

queried is mainly considered by incorporating information from the Gene Ontology (GO) database in terms of a GO-based score, which describes gene products in terms of their associated cellular components, biological processes and molecular functions [17]. The use of functional GO terms is the main novelty of this tool over other existing bioinformatics tools; it enables the prioritization of SNPs based on their associated functions. Apart from incorporating GO classification, SNPs&GO also includes prediction data provided by the PANTHER classification system, making it one of the more comprehensive SNP analysis tools.

13.5 Structure-based prediction tools

13.5.1 PolyPhen

Polymorphism phenotyping (PolyPhen) carries out automated functional annotation of coding nsSNPs [18]. Unlike SIFT, it does not depend on sequence homology alone to make SNP functional predictions, as its modeling of the AAS also draws on structural information. SWISS-PROT annotations are also used in the prediction process. The search input required for PolyPhen is either a protein identifier or an amino acid sequence; searches are limited to the protein sequence level rather than the SNP or DNA level. As gene IDs or dbSNP IDs cannot be entered directly as valid input, the scope of possible searches is narrowed. The PolyPhen output comprises a score that ranges from 0 upwards, where zero indicates a neutral effect of the AAS on protein function and a large positive number indicates that the substitution is detrimental. Such scoring is again useful, as SNPs can be ranked by significance to enable quantitative assessment of the severity of the effect on protein function.

13.5.2 SNPs3D

SNPs3D provides SNP analysis predictions, gene-gene interaction networks and candidate genes for diseases. SNPs3D generates two SNP prediction scores, based on SVM Profile and SVM Structure analyses, respectively, thus providing a two-pronged approach to SNP prediction; negative scores suggest deleterious effects of SNPs on protein phenotype. The dual use of sequence- and structure-based methods serves as a check on the reliability of the prediction results [18]. SNPs3D uses an SVM [19], a machine-learning technique that performs classification by partitioning an n-dimensional space representing n factors into two volumes, one of which comprises disease-causing mutations. This SVM model is

trained on AASs that cause single-gene diseases, with a control set of non-disease-causing mutations. The specific protein mutation caused by the SNP is assigned a position in this partitioned space, which predicts whether the mutation causes disease; the distance of this position from the partitioning surface serves as a rough measure of confidence in the prediction. The profile model used for sequence-based prediction builds an MSA of relevant sequences homologous to the query protein. The level of conservation at the SNP position, and the probability of observing such a variant in the alignment, is used to differentiate disease-causing mutations from neutral ones; the SVM profile model uses five factors that capture sequence conservation to make functional predictions. For structure-based prediction, the stability model is based on the hypothesis that disease-causing SNPs affect protein function mainly by decreasing protein stability; the SVM stability model uses 15 structural factors to determine the separation pattern between disease-causing and neutral SNPs. SNPs3D accepts as search input a dbSNP RefSNP ID, gene ID/symbol, sequence accession number, literature search by keyword, or GO term. Queries of multiple IDs cannot be carried out simultaneously; instead, IDs must be keyed in and searched individually, which is time-consuming and laborious. When a dbSNP ID is used as search input, SNPs3D also provides prediction results for all other SNPs present in the same protein sequence. Such prediction scores provide a quantitative means of comparing and ranking the functional significance of SNPs, especially in candidate gene studies.
Furthermore, SNPs3D has the added advantage that users can visualize the mutation effects on the 3D protein structure, aiding understanding of the underlying mechanisms of the mutation.
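The space-partitioning idea behind SNPs3D's SVM models, and the use of distance from the decision boundary as a rough confidence measure, can be illustrated with a much simpler stand-in. The sketch below uses invented two-feature data and a nearest-centroid split rather than a trained SVM:

```python
# Toy stand-in for the SVM idea: partition a feature space into a
# disease-causing and a neutral region, and use the margin between the
# two distances as a rough confidence measure.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Invented training data: e.g. (destabilization, conservation) features.
disease = [(0.9, 0.8), (0.8, 0.9), (0.95, 0.7)]
neutral = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3)]
c_dis, c_neu = centroid(disease), centroid(neutral)

def classify(x):
    d_dis, d_neu = dist(x, c_dis), dist(x, c_neu)
    label = "disease" if d_dis < d_neu else "neutral"
    confidence = abs(d_dis - d_neu)   # far from the boundary = more confident
    return label, confidence

print(classify((0.85, 0.75)))   # deep in the disease region
print(classify((0.5, 0.5)))     # near the boundary: low confidence
```

A real SVM learns a maximum-margin boundary from many factors, but the interpretation is analogous: points near the partitioning surface get low-confidence calls.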

13.5.3 LS-SNP

The large-scale human SNP annotated database (LS-SNP) uses both sequence- and structure-based approaches to generate SNP predictions. It is an annotated database of all human coding nsSNPs in dbSNP. LS-SNP first maps SNPs onto human protein sequences (in SwissProt/TrEMBL) by retrieving the genomic locations of the SNPs from dbSNP; nsSNPs are then mapped onto MSAs and functional pathways [20]. For each target protein sequence, related known protein structures are identified as templates to which the target can be aligned, so that comparative structure models can be built and analyzed. By integrating information on sequence, evolution and structure together with rule-based annotations to identify structure-destabilizing changes, and with SVMs, predictions can be made about positions where AASs would have deleterious effects. Specifically, nsSNPs that destabilize the protein structure, disrupt the formation of domain-domain interfaces or change protein-ligand binding can be elucidated [20].

LS-SNP enables queries based on either IDs (SwissProt ID, dbSNP ID, KEGG pathway ID, or HUGO gene ID) or genomic range. One advantage is that multiple IDs can be queried simultaneously in a single search; the search can also be restricted to validated SNPs from dbSNP to refine the prediction results. Given that LS-SNP incorporates information from multiple sources, additional information on genomic sequence, protein sequence and structure, together with functional predictions, is readily available. However, no prediction scores are provided in the results, so there is no quantitative means of ranking the functional significance of various SNPs. In addition, many dbSNP IDs could not be found when a trial search was carried out.

13.5.4 SNPeffect

SNPeffect uses sequence- and structure-based approaches to predict the effect of nsSNPs on the molecular phenotype (structure and function) of human proteins [21]. It does not depend solely on scores derived from evolutionary conservation, as these do not elucidate the nature of the properties affected [22]. SNPeffect assesses the effects of nsSNPs by assigning them to three categories of functional attributes: (i) thermodynamic and structural properties affecting protein dynamics and stability; (ii) integrity of functional and binding sites; and (iii) changes in post-translational processing and cellular localization of proteins. For proteins with known structures, SNPeffect models the structure of the mutant protein for a better analysis of the effects on protein binding and stability. Given that known protein structures are limited, SNPeffect also extends its analysis to include sequence-based predictions for wider applicability. A variety of search inputs, such as dbSNP ID, NCBI RefSeq ID, ENSEMBL SNP ID, PDB ID and Online Mendelian Inheritance in Man (OMIM) ID, are accepted. No batch submission of queries is available; queries must be searched individually, and many dbSNP IDs could not be found in the database when a trial search was carried out. Prediction results can be ranked based on the extent of the selected phenotypic change in the mutant protein, enabling easy prioritization of the functional significance of SNPs for experimental studies. Although SNPeffect usefully provides large volumes of prediction results, such detailed information for each SNP might prove tedious to analyze and interpret.

13.5.5 SNAP

Screening for non-acceptable polymorphisms (SNAP) is based on neural networks and improved machine-learning methodologies to predict the functional effects of nsSNPs in proteins. It utilizes sequence, functional and structural (secondary structure, solvent accessibility) annotations, as well as biophysical and evolutionary (residue conservation within

sequence families) characteristics to predict any gain or loss of protein function [23]. Although SNAP requires only protein sequence information as search input, the analysis benefits from functional and structural annotations when available, adding a further degree of sophistication to this prediction tool. SNAP outputs include a reliability index, whereby higher reliability indices correlate strongly with increased prediction accuracy, allowing less accurate predictions to be filtered out. Besides measuring accuracy, the reliability index also reflects the strength of the functional effect of a SNP. The main advantages of SNAP are thus enhanced performance across the entire range of accuracy and/or coverage thresholds and a reliability index that lets users focus on the most accurate predictions and/or the most severe effects. SNAP also demonstrates a particular advantage in correct predictions for the least obvious cases (i.e. those on which other existing methods disagree).
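Filtering by a SNAP-style reliability index is straightforward in practice. The predictions and the threshold in this sketch are invented for illustration:

```python
# Sketch: keep only high-reliability predictions, as the text describes.
# The records and the min_ri threshold are illustrative assumptions.
predictions = [
    {"aas": "G12V", "effect": "non-neutral", "ri": 8},
    {"aas": "A30T", "effect": "neutral",     "ri": 2},
    {"aas": "R50Q", "effect": "non-neutral", "ri": 5},
]

def high_confidence(preds, min_ri=4):
    """Return the substitutions whose reliability index meets the threshold."""
    return [p["aas"] for p in preds if p["ri"] >= min_ri]

print(high_confidence(predictions))   # low-RI calls are filtered out
```

Raising `min_ri` trades coverage for accuracy, which is exactly the accuracy/coverage threshold trade-off SNAP exposes to its users.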

13.5.6 PMUT

PMUT combines sequence alignment with structural factors and uses a feed-forward neural network to characterize missense substitutions. The neural network has been trained with a large database of disease-associated and neutral mutations. PMUT works by retrieving information from a local database of mutational hotspots, followed by analysis of a given SNP in a specific protein [24]. As input, PMUT needs a protein sequence or its SwissProt/TrEMBL code. The user then chooses between analyzing a single mutation or carrying out a complete mutation scan at that position. The PMUT-generated output consists of a confidence index (ranging from 0 to 9) and a binary prediction of ‘neutral’ versus ‘pathological’, represented by the pathogenicity index (ranging from 0 to 1; a value > 0.5 signals a pathological mutation). The user can also retrieve all the intermediate information (alignments, BLAST and PhD outputs) used by PMUT in generating a prediction. Furthermore, if the protein structure is available, the PMUT server can display the mutation site on the structure using a color code to trace the pathogenicity associated with the mutation; this 3D visualization is provided as a Rasmol script, and the user requires either a Rasmol or Chime plug-in to see it. In addition, PMUT enables the fast scanning and detection of mutational hotspots, helping to detect regions where mutations are expected to have a large pathological impact. To achieve this, it carries out alanine scanning, massive mutation and genetically accessible mutation scanning.
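A complete mutation scan of the kind PMUT offers can be sketched as a loop over all 20 residues at one position. The scoring function below is a hypothetical placeholder standing in for PMUT's trained network; only the convention of a pathogenicity index in [0, 1] with > 0.5 called pathological comes from the tool's description:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical placeholder scorer: conservative swaps score low, drastic
# swaps score high. PMUT's real score comes from a trained neural network.
def toy_pathogenicity(wild, mutant):
    conservative = {("I", "V"), ("V", "I"), ("D", "E"), ("E", "D")}
    if wild == mutant:
        return 0.0
    return 0.2 if (wild, mutant) in conservative else 0.8

def scan_position(wild):
    """Score every possible substitution at a position (PMUT-style scan)."""
    return {m: "pathological" if toy_pathogenicity(wild, m) > 0.5 else "neutral"
            for m in AMINO_ACIDS}

calls = scan_position("I")
print(calls["V"], calls["W"])   # conservative vs drastic substitution
```

Aggregating such scans across positions is how hotspot regions, where most substitutions come out pathological, would be detected.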

13.5.7 SAPRED

SAPRED uses a data set of single amino acid polymorphisms (SAPs) compiled from Swiss-Prot variant pages. Several biologically informative attributes are considered: structural neighbor profiles, which describe the SAP microenvironment; nearby functional

sites, from which structure- and sequence-based distances between the SAP site and its nearby functional sites can be measured; aggregation properties, which help measure the likelihood of protein aggregation; and disordered regions, indicating whether the SAP is located in a structurally disordered region. These informative attributes are extracted for analysis and prediction by SAPRED [25]. The tool is based on an SVM classification model that integrates the above attributes, as well as previously published ones, in its algorithm. SAPRED requires as search input a protein sequence in FASTA format, a mutation in the form A$B (where A and B are single-letter amino acid codes and $ is the position of the substituted amino acid), and two PDB-format files describing the structures of the wild-type and variant proteins. The output comprises the prediction result, the prediction confidence, and the values of all the attributes used to elucidate the putative biological predictions. One advantage of SAPRED is that the algorithm takes into consideration residues both at and near the functional sites, in terms of both sequence and structure, enabling greater coverage and more comprehensive predictions. For proteins without available structural information, SAPRED offers an alternative input route via SAPRED_SEQ, which requires only a FASTA sequence and a mutation as search input; in this case, prediction is based solely on sequence-derived properties.

13.5.8 MutPred

MutPred is a computational tool that uses protein sequences to model changes in structural features and functional sites between wild-type and mutant sequences. Such changes are represented as probabilities of gain or loss of structure and function. MutPred builds upon the established SIFT method by additionally modelling the gain or loss of 14 different structural and functional properties, with improved classification accuracy with respect to human disease mutations. In all, the advantages of MutPred include the use of a comprehensive data set of disease-associated mutations and the incorporation of new attributes that directly model the gain and/or loss of structural and functional properties, leading to improved classification performance over SIFT [26].

13.5.9 MuD

Mutation Detector (MuD) uses a Random Forest machine-learning algorithm and a set of structure- and sequence-based features to assess the impact of a given substitution on protein function. The structure-based features include solvent accessibility, location at an oligomerization interface, distance to bound ligands, binding-site conservation and the 3D residue environment of the fold; the sequence-based features include the sequence identity to the closest homolog bearing the substitution. MuD aims to differentiate between functionally neutral and non-neutral AASs in the protein sequence. When used in its automatic mode, MuD is comparable in performance to alternative tools. The uniqueness of the MuD web server is that it offers a semi-automatic scheme whereby users can supply additional protein-specific structural and functional information at run-time, improving the prediction accuracy of the tool [27]; this procedure aims to reduce the erroneous features that might be extracted from the crystal structure. MuD assigns a reliability score to every prediction, which lets users assess the accuracy, sensitivity and precision of a prediction qualitatively and makes the tool useful for prioritizing substitutions in proteins with an available 3D structure. However, MuD might be less suitable for predicting substitutions in non-globular and small proteins.
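MuD's actual Random Forest is far richer, but the way an ensemble vote can yield both a label and a per-prediction reliability score is easy to illustrate. A toy sketch; the function name and the agreement-fraction definition of reliability are illustrative assumptions, not MuD's published scoring:

```python
def ensemble_predict(tree_votes):
    """Majority vote over individual tree predictions (True = non-neutral),
    with reliability = fraction of trees agreeing with the majority."""
    if not tree_votes:
        raise ValueError("need at least one vote")
    positives = sum(bool(v) for v in tree_votes)
    majority = positives * 2 >= len(tree_votes)  # ties resolved to non-neutral
    agree = positives if majority else len(tree_votes) - positives
    return majority, agree / len(tree_votes)

label, reliability = ensemble_predict([True, True, False, True, True])
print(label, reliability)  # True 0.8
```

A high reliability score (near 1.0) corresponds to near-unanimous trees; predictions near 0.5 are the ones a user would prioritize for manual inspection.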

13.6 Conclusion

In-silico prediction gives an idea of which SNPs are important in disease causation and likely to influence therapeutic outcome. These SNP candidates can thus be ranked higher in pharmacological importance in the drug discovery process. Most in-silico tools predict the effects of SNP mutations on proteins from the probable effects of primary nucleotide sequence changes on secondary and tertiary protein structure. Many drug targets and drug-metabolizing enzymes (DMEs) are proteins; therefore, functional prediction using these tools can aid the identification of novel potentially deleterious nsSNPs in drug targets and DMEs. The discovery of novel deleterious SNPs in drug targets that can alter the pharmacogenomics of responses to drugs specific to those targets will aid drug discovery. In addition, because pharmacokinetic effects are mostly influenced by the interaction of multiple gene products and SNPs, unraveling the polygenic components that bring about these phenotypic effects is complex; these tools can help to unravel this complexity. Given that in-silico tools can suggest possible protein structural and conformational changes caused by deleterious nsSNPs, they are important in drug target validation, aiding the study of the molecular dynamics of drug-target binding and the discovery of new drug targets. Many predictors of drug efficacy and drug toxicity are SNPs that occur in drug targets or in DMEs; if inactivating in nature, these SNPs could change the pharmacokinetics of the corresponding therapeutic compounds. Early in-silico identification of such changes, for example in genes encoding drug transporter proteins, can yield valuable insights into drug response and efficacy before wet-laboratory experimentation.

Solved problem

Exercise 1: To identify the potential SNPs in EGFR using PMut, a web-based tool [28].
Note: The PMut2017 predictor is trained using SwissVar (Humsavar) as a training set. It is built with the PyMut library, using a Random Forest classifier based on 12 features. In the

experiment below we evaluate a known point mutation in EGFR (T→M at the gatekeeper residue), which results in the development of resistance in EGFR against FDA-approved drugs.

Step-by-step protocol
1. Click "analyse mutations" on the main webpage of the tool at http://mmb.irbbarcelona.org/PMut/.
2. Input: enter the UniProt ID of EGFR in the "Protein sequence" menu; the ID used here is "P00533". Click "search sequence".
3. In the "Variants" section, under "mutations", specify the mutation you want to analyse. We enter "T766" (the gatekeeper threonine residue reported to mutate and cause resistant cancer). Under known variants, select "Thr→Met (missense mutation)", visible in the selected-variants category.
4. Click "Start analysis".

Results
The output 3D interactome shows all the possible deleterious and non-deleterious SNPs present in the submitted sequence (Fig. 13.3). For the specific mutation submitted as a query, the prediction correctly returns 'disease' with a score of 80%. The prediction results are also provided as an Excel file, giving a tabular representation of the result (Table 13.1).

Figure 13.3: 3D interactome showing the results of SNP analysis of the T→M mutation in EGFR.

Table 13.1: Tabular representation of the results of SNP analysis of the T→M mutation in EGFR.
Unsolved problems

Exercise 2: To analyze the effect of SNPs in the human CALR gene, as either disease-causing or neutral, using PMut. (Hint: PMut can predict mutations as deleterious or neutral; load the UniProt ID to retrieve the protein sequence of human CALR, then load the variants resulting from the SNPs below. SNPs can be obtained from the NCBI database of genetic variation, dbSNP.)

SNP ID        AA change
rs11547563    W261G
rs11547569    V176M
rs138503495   Y57C

Exercise 3: To analyze the effect of SNPs on the Toll-like receptor gene and classify amino acid substitutions as pathogenic or benign using MutPred. (Hint: the Toll-like receptor gene potentially causes mastitis in dairy cattle. MutPred requires a protein sequence, a list of amino acid substitutions as given below, and an email address; the protein sequence can be obtained from UniProt.) Variants: T563H, T605M and R563H.

Exercise 4: To determine the ddG corresponding to the Q221R and L197P mutations in Caspase-9 (UniProt ID: P55211) using SNPeffect. (Hint: ddG refers to the free-energy change in kcal/mol.)

References
[1] J.T. Mah, K. Chia, A gentle introduction to SNP analysis: resources and tools, J. Bioinf. Comput. Biol. 5 (2007) 1123-1138.
[2] F.S. Collins, M.S. Guyer, A. Chakravarti, Variations on a theme: cataloging human DNA sequence variation, Science 278 (1997) 1580-1581.
[3] M.P. Miller, S. Kumar, Understanding human disease mutations through the use of interspecific genetic variation, Hum. Mol. Genet. 10 (2001) 2319-2328.
[4] Z. Wang, J. Moult, SNPs, protein structure, and disease, Hum. Mutat. 17 (2001) 263-270.
[5] P.C. Ng, S. Henikoff, SIFT: predicting amino acid changes that affect protein function, Nucleic Acids Res. 31 (2003) 3812-3814.
[6] P. Yue, Z. Li, J. Moult, Loss of protein structure stability as a major causative factor in monogenic disease, J. Mol. Biol. 353 (2005) 459-473.
[7] P. Yue, J. Moult, Identification and analysis of deleterious human SNPs, J. Mol. Biol. 356 (2006) 1263-1274.
[8] S.T. Sherry, M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E.M. Smigielski, et al., dbSNP: the NCBI database of genetic variation, Nucleic Acids Res. 29 (2001) 308-311.
[9] D.L. Wheeler, D.M. Church, A.E. Lash, D.D. Leipe, T.L. Madden, J.U. Pontius, et al., Database resources of the National Center for Biotechnology Information: 2002 update, Nucleic Acids Res. 30 (2002) 13-16.
[10] P.C. Ng, S. Henikoff, Accounting for human polymorphisms predicted to affect protein function, Genome Res. 12 (2002) 436-446.
[11] P.C. Ng, S. Henikoff, Predicting the effects of amino acid substitutions on protein function, Annu. Rev. Genomics Hum. Genet. 7 (2006) 61-80.
[12] P.C. Ng, S. Henikoff, Predicting deleterious amino acid substitutions, Genome Res. 11 (2001) 863-874.
[13] E.A. Stone, A. Sidow, Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity, Genome Res. 15 (2005) 978-986.
[14] P.D. Thomas, A. Kejariwal, M.J. Campbell, H. Mi, K. Diemer, N. Guo, et al., PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification, Nucleic Acids Res. 31 (2003) 334-341.
[15] J. Tian, N. Wu, X. Guo, J. Guo, J. Zhang, Y. Fan, Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines, BMC Bioinf. 8 (2007) 450.
[16] E. Capriotti, R. Calabrese, R. Casadio, Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information, Bioinformatics 22 (2006) 2729-2734.
[17] R. Calabrese, E. Capriotti, P. Fariselli, P.L. Martelli, R. Casadio, Functional annotations improve the predictive score of human disease-related mutations in proteins, Hum. Mutat. 30 (2009) 1237-1244.
[18] V. Ramensky, P. Bork, S. Sunyaev, Human non-synonymous SNPs: server and survey, Nucleic Acids Res. 30 (2002) 3894-3900.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013.
[20] K. Karplus, C. Barrett, R. Hughey, Hidden Markov models for detecting remote protein homologies, Bioinformatics 14 (1998) 846-856.
[21] J. Reumers, J. Schymkowitz, J. Ferkinghoff-Borg, F. Stricher, L. Serrano, F. Rousseau, SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs, Nucleic Acids Res. 33 (2005) D527-D532.
[22] J. Reumers, S. Maurer-Stroh, J. Schymkowitz, F. Rousseau, SNPeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous SNPs, Bioinformatics 22 (2006) 2183-2185.
[23] Y. Bromberg, B. Rost, SNAP: predict effect of non-synonymous polymorphisms on function, Nucleic Acids Res. 35 (2007) 3823-3835.
[24] C. Ferrer-Costa, J.L. Gelpí, L. Zamakola, I. Parraga, X. De La Cruz, M. Orozco, PMUT: a web-based tool for the annotation of pathological mutations on proteins, Bioinformatics 21 (2005) 3176-3178.
[25] Z.-Q. Ye, S.-Q. Zhao, G. Gao, X.-Q. Liu, R.E. Langlois, H. Lu, et al., Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP), Bioinformatics 23 (2007) 1444-1450.
[26] B. Li, V.G. Krishnan, M.E. Mort, F. Xin, K.K. Kamati, D.N. Cooper, et al., Automated inference of molecular mechanisms of disease from amino acid substitutions, Bioinformatics 25 (2009) 2744-2750.
[27] G. Wainreb, H. Ashkenazy, Y. Bromberg, A. Starovolsky-Shitrit, T. Haliloglu, E. Ruppin, et al., MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data, Nucleic Acids Res. 38 (2010) W523-W528.
[28] V. López-Ferrando, A. Gazzo, X. De La Cruz, M. Orozco, J.L. Gelpí, PMut: a web-based tool for the annotation of pathological variants on proteins, 2017 update, Nucleic Acids Res. 45 (2017) W222-W228.


ADMET tools: Prediction and assessment of chemical ADMET properties of NCEs

14.1 Introduction

Studies reveal that the average time to develop a new drug has steadily increased. In the drug discovery process, scientists primarily focus on 'hunting' for drug candidates, whereas in the development phase they try to 'promote' NCEs into novel and safer commercial medicines. Despite the drastic increase in R&D expenditure, the output of pharmaceuticals (the number of new drugs launched per year) remains virtually flat. Concurrently, the productivity of the pharmaceutical industry, measured as NCE output per dollar, has continuously declined over the past decades [1]. Despite recent strong performance in earnings [2], major pharma firms are still trying to improve the efficiency and effectiveness of their drug discovery and development strategies, to secure continuing growth and to appeal to investors. In addition to increasingly tight regulatory hurdles, big pharmaceutical firms also suffer from failures in their discovery of NCEs. Drug candidates may fail for numerous reasons during development; a major one is the high attrition rate during the development phases owing to poor pharmacokinetic, toxicological and safety-related pharmacological properties. Kennedy investigated the reasons why 198 NCEs failed in clinical development [3] and found that the most prominent cause of failure was poor pharmacokinetic (PK) and ADME properties. Although lack of efficacy was still one of the main reasons for termination, unsatisfactory PK/ADME, toxicology and adverse effects accounted for up to two-thirds of the total failures. A separate analysis led to a similar conclusion, particularly for the early 1990s [4].
However, to be fair, Kola and Landis's report disclosed that even with the latest improvements in the PK/bioavailability aspects of drugs, the total loss of drug candidates in development remains near 50% owing to ADME (PK/bioavailability, formulation), toxicology and pharmacology (safety) [4]. General efforts to address these challenges embrace the introduction of combinatorial chemistry techniques that accelerate the production of NCEs entering the pipeline. However, one drawback of combinatorial chemistry is that it shifts discovery compound libraries toward large, 'greasy' and biologically inactive molecules that can rarely survive the development phase. Indeed, NCE collections from Pfizer and Merck examined in Lipinski's


recent drug-ability analysis mostly exhibited higher molecular weight and higher LogP and, in turn, poorer solubility and permeability [5]. Therefore, eliminating compounds with the worst ADMET properties as early as possible is important, and this has become an attractive approach for many researchers [6,7]. Alternatively, expert advice on ADME properties can guide chemists to make structure-activity relationship (SAR)-based modifications and optimize compounds toward 'drug-like properties', i.e. good absorption and high bioavailability with metabolic stability and the required distribution. In early discovery, most ADMET-related issues can be foreseen by using tools for ADME, toxicological and pharmacological profiling [8]. Traditionally, however, drug optimization to improve efficacy has been strongly associated with discovery, with drug development applied sequentially in its own field. Specifically, most research has focused on the improvement of in vitro efficacy via lead selection and optimization [9]. In the end, the NCEs with the most potent inhibition of, or binding to, the in vitro target fail in the development phase on pharmacokinetic grounds. Identification of candidates active against the therapeutic target is the goal from the 'discovery' point of view; from the 'development' point of view, however, excellent in vitro efficacy does not always translate into in vivo potency [10], and the late discovery of poor drug-like properties and adverse side-effects wastes both money and time. In the modern drug discovery and development process, evaluating both the therapeutic and the drug-like features of NCEs together, as early as possible, has become the new trend. This requires a new strategy for parallel profiling of the efficacy and drug-ability of NCEs in early discovery.
This, however, raises an important question, because most of the assays used to appraise drug-ability, reduce side-effects and improve absorption, metabolism, distribution, excretion, toxicity and pharmacology are costly, laborious and low-throughput and require a great deal of material, and are thereby not feasible for the early drug discovery phases. Nowadays, major efforts are being made in early drug discovery to develop and implement high-throughput (HT), miniaturized, fast profiling assays with good predictivity for in-vivo drug-ability. These assays can assist directly in selecting early hits or leads and can be implemented within the optimization cycle of chemical synthesis. In-silico prediction is an alternative approach for assessing various pharmacokinetic and ADMET parameters [11]. It provides an opportunity to make drug-ability predictions in a cost-effective and fast fashion. However, the major challenge for in-silico predictions lies in their predictivity and reliability. This is because properties such as drug solvation and solubilization depend not only on chemical structure but also on a complex interplay between hydrogen-bond donor and acceptor properties, conformational effects and crystal-packing energy. For these reasons, the standard deviations of predicted water solubility with respect to experimentally determined values (thermodynamic solubility) are still relatively

substantial. Initially, many commercial products that claimed satisfactory agreement between computational and experimental data could not deliver adequate predictions for drug-like NCEs, because the in-silico tools were mostly trained on non-drug-like chemicals lying within a non-relevant solubility range. Advanced software packages have largely solved this problem through collaborations between the commercial vendors that develop the new in-silico tools and pharmaceutical labs possessing large collections of wet-lab data derived from drug-like molecules. Nevertheless, in-silico prediction deteriorates for real NCEs in the early discovery phase, establishing the necessity for experimental determination using in vitro assays. This is not surprising, as most prediction tools are trained on sets of commercial drugs. Such tools remain useful, however, for screening virtual molecules or in cases where no experimental alternatives are available.

14.2 Prediction of physicochemical properties

The pharmacokinetic and metabolic fate of a drug in the body is mostly influenced by its physicochemical properties, so a good understanding of these properties, coupled with their measurement and prediction, is crucial to a successful drug discovery program.

14.2.1 Solubility and solubilization

Solubility is a critical factor, as drug substances have to dissolve before they can be absorbed. Solubility and rate of dissolution play a crucial role in the Biopharmaceutical Classification System (BCS) [12]: considering Fick's first law, the absorption of passively transported drugs across the gastrointestinal (GI) tract is the combined product of permeability and solubility [13]. Without considering solubility, data from in vitro assays such as HTS activity or membrane-permeability assays can be misinterpreted. Dosing poorly soluble compounds in pharmacological animal testing is highly risky, as it commonly fails to yield a correlation between dose and in vivo efficacy; under such circumstances it is difficult to separate solubility issues from efficacy issues. Additionally, solubility data may help one to understand other PK or PD mechanisms. Reliably estimating the solubility of drug substances is therefore of vital relevance. However, it is very difficult to precisely quantify or predict aqueous solubility, owing to the complicated solubilization process and solid-phase chemistry of drug candidates. A variety of approaches have been established that can be applied in tiers to deal with the solubility and dissolution issues occurring at the different drug discovery and development phases.

In-silico prediction can be used to estimate the water solubility of a drug quickly and cost-effectively. Predictive solubility methods, for example neural networks, might assist in this effort. However, there are currently no approaches robust enough to accurately predict low solubility, and many current predictive solubility programs use training data from different laboratories with varying quality and experimental conditions [14]. Hopefully, by measuring many compounds under standardized conditions, current predictive models can be improved [15].
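One widely cited regression model of this kind is Delaney's ESOL equation, which estimates aqueous solubility (log S, in mol/L) from four computed descriptors. A minimal sketch using the coefficients from the published ESOL paper; the descriptor values themselves are assumed to come from elsewhere (e.g. a cheminformatics toolkit), and the example inputs below are purely illustrative:

```python
def esol_log_s(clogp, mol_wt, rot_bonds, aromatic_prop):
    """Delaney's ESOL estimate of aqueous solubility (log10 mol/L).
    aromatic_prop = aromatic heavy atoms / total heavy atoms."""
    return (0.16 - 0.63 * clogp - 0.0062 * mol_wt
            + 0.066 * rot_bonds - 0.74 * aromatic_prop)

# Hypothetical small aromatic molecule:
# clogP 2.0, MW 200 Da, 3 rotatable bonds, 50% aromatic atoms
print(round(esol_log_s(2.0, 200.0, 3, 0.5), 3))  # -2.512
```

Note how the signs match the physical intuition in the text: higher lipophilicity, size and aromaticity all depress predicted solubility, while flexibility raises it slightly.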

14.2.2 Permeability and active transporters

Permeability, the capability of NCEs to penetrate across the human GI tract, is another key factor governing human oral absorption [16]. Ideally, oral absorption is measured by quantifying the fraction of the designated drug absorbed through the human GI tract. Although the data so derived serve as 'gold standards' and reliably assess the oral absorption of drug substances, the approach is impractical in early discovery owing to the intricate and costly experimental procedures. Alternatively, high-throughput in vitro permeability assays employing either artificial membranes or cell-based models have become the methods of choice in early drug discovery [17]. Efforts have been undertaken to predict the permeability of compounds through Caco-2 cells, which serve as a model for human intestinal absorption, in an approach called membrane-interaction quantitative structure-activity relationships (MI-QSAR) [18].

14.2.3 Hydrogen bonding

The hydrogen-bonding capacity of a drug solute is now recognized as an important determinant of permeability. To cross a membrane, a drug molecule needs to break its hydrogen bonds with the aqueous environment; the more potential hydrogen bonds a molecule can make, the more energy this bond-breaking costs, so high hydrogen-bonding potential is an unfavorable property often related to low permeability and absorption. Initially, ΔlogP, the difference between octanol/water and alkane/water partitioning, was used as a measure of solute hydrogen bonding, but this technique is limited by the poor solubility of many compounds in an alkane phase. A variety of computational approaches have addressed the problem of estimating hydrogen-bonding capacity, ranging from simple heteroatom (O and N) counts and counts of hydrogen-bond acceptors and donors, to more sophisticated measures that take into account parameters such as free-energy factors [19] and (dynamic) polar surface area (PSA) [20]. The latter are easily calculated, and it is now believed that a single minimum-energy conformation is sufficient to compute the PSA, instead of the more computationally demanding and time-consuming dynamic polar surface-area calculation. A fast fragment-based algorithm for PSA has been reported, which allows PSA calculations to be implemented in virtual-screening approaches [21].
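The fragment-based idea is simple additivity: each polar fragment type contributes a fixed area, and the totals are summed. The sketch below uses a small illustrative subset of the published fragment contributions; a real implementation distinguishes many more N/O (and optionally S/P) environments and classifies them automatically from the molecular graph:

```python
# Illustrative subset of fragment contributions (in Å^2); values are from
# the published fragment-based PSA scheme, but only a handful are shown.
CONTRIB = {
    "N":   3.24,   # tertiary amine nitrogen
    "NH":  12.03,  # secondary amine
    "NH2": 26.02,  # primary amine
    "O":   9.23,   # ether oxygen
    "O=":  17.07,  # carbonyl oxygen
    "OH":  20.23,  # hydroxyl
}

def fragment_psa(fragment_counts):
    """Additive polar surface area from per-fragment counts,
    e.g. {'O=': 2, 'OH': 1}."""
    return sum(CONTRIB[f] * n for f, n in fragment_counts.items())

# A carboxylic acid group contributes one carbonyl O and one hydroxyl:
print(round(fragment_psa({"O=": 1, "OH": 1}), 2))  # 37.3
```

Because no conformational sampling is involved, such additive PSA values can be computed for millions of structures, which is what makes the approach suitable for virtual screening.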

14.2.4 Ionization constant (or dissociation constant)

The ADME properties of NCEs are governed by the ionization of drug candidates and the distribution of the neutral and ionized species at physiological pH. The ionization constant, pKa, is defined as the pH at which the neutral and ionized forms are equally distributed; it is a useful thermodynamic parameter from which the charge state of drug candidates can easily be monitored. First, pKa data can help to predict the ADME properties of NCEs, owing to the pH gradient of 1.7 to 8.0 present in the human GI tract [13]. For instance, the pKa value(s) of a drug candidate strongly modulate its solubility and permeability: solubility is favored by the ionized form, whereas permeability is proportional to the concentration of the neutral species in solution. Additionally, pKa data can be used to better understand the binding mechanisms of therapeutic events and to optimize chemical reactions. A number of ADME properties, including lipophilicity, solubility and pH profiles, are derived in combination with aqueous pKa data. As ionization can influence the solubility, lipophilicity (log D), permeability and absorption of a compound, approaches have been developed to rapidly measure the pKa values of sparingly soluble drug compounds. Using experimental data reported in the literature, several approaches have been developed for pKa calculation. Programs including ACD/pKa (Advanced Chemistry Development, ACD/Labs), Pallas/pKa (CompuDrug) and SPARC [22] yield promising pKa data for commercial drugs, comparable to experimentally collected data. The predictions deteriorate for real NCEs in the early discovery phase (LP and LO stages), necessitating determination using in vitro assays.
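The equal-distribution definition of pKa above is just the Henderson-Hasselbalch relation, so the ionized fraction at any pH can be computed directly. A minimal sketch (the function name is illustrative):

```python
def fraction_ionized(pka, ph, acid=True):
    """Henderson-Hasselbalch: fraction of molecules in the ionized form.
    For an acid HA <-> A-, ionized = deprotonated; for a base, protonated."""
    exponent = (pka - ph) if acid else (ph - pka)
    return 1.0 / (1.0 + 10.0 ** exponent)

# At pH == pKa the neutral and ionized species are equally populated:
print(fraction_ionized(4.5, 4.5))  # 0.5
# A weak acid (pKa 4.5) is almost fully ionized at blood pH 7.4:
print(round(fraction_ionized(4.5, 7.4), 4))
```

This is exactly why, as the text notes, solubility (favored by the ionized form) and permeability (proportional to the neutral species) pull in opposite directions across the GI pH gradient.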

14.2.5 Lipophilicity

Lipophilicity is an important physicochemical parameter linking membrane permeability, drug absorption and distribution with the route of clearance (metabolic or renal), and its measurement is readily amenable to automation. The gold standard for expressing lipophilicity is the partition coefficient P (or log P, for a more convenient scale) in an octanol/water system; alternatives include immobilized liposome chromatography (ILC), immobilized artificial membranes (IAM) and liposome/water partitioning. LogP and LogD are the logarithms of the partition coefficient and the apparent (pH-dependent) partition coefficient, respectively, of drug candidates between a lipophilic phase such as octanol and a hydrophilic phase such as water. These data are useful for predicting ADME properties ranging from solubility to permeability, and for understanding transport mechanisms. The conventional approach for LogP and LogD determination, the saturation shake-flask method, measures the equilibrium distribution of NCEs between octanol and water and is not feasible for early discovery. There is continuing interest in improving log P calculation programs, and many such programs are already available. Most calculation approaches are based on fragment values, although simple methods based on molecular size and hydrogen-bonding indicators for functional groups have also been shown to be extremely versatile [23]. However, log P values can only be a first estimate for labelling a compound as lipophilic in a biological environment. For partition processes in the body, the distribution coefficient D (log D), determined experimentally in an aqueous buffer at pH 7.4 (blood pH) or 6.5 (intestinal pH), gives a more meaningful description of lipophilicity, especially for ionizable compounds. However, in our experience, programs that can reliably predict log D are scarce at present.
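For a monoprotic compound, log D follows from log P and pKa under the common simplifying assumption that only the neutral species partitions into octanol. A small sketch (function name illustrative, and the example values hypothetical):

```python
import math

def log_d(log_p, pka, ph, acid=True):
    """log D of a monoprotic compound from its log P and pKa, assuming
    only the neutral species partitions into the octanol phase."""
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# An acid with pKa 4.5 at blood pH 7.4 is >99% ionized, so log D << log P:
print(round(log_d(3.0, 4.5, 7.4), 2))  # 0.1
```

This makes explicit why log D, not log P, is the meaningful lipophilicity measure for ionizable compounds: the same molecule can behave as lipophilic at intestinal pH and far less so at blood pH.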

14.3 Prediction of ADME and related properties

Several targets generally impact drug pharmacokinetics (PK) by affecting drug absorption, metabolism, distribution and excretion, such as P-glycoprotein (P-gp), breast cancer resistance protein (BCRP), organic cation transporters (OCT) and organic-anion-transporting polypeptides (OATP), as shown in Fig. 14.1. These

Figure 14.1: Individual PK parameters determined for the analysis of the ADMET profile of a drug.

targets affect some of the PK descriptors, such as logP, logS, logBB and logPS, and may be responsible for the poor PK profile of a drug. Various in-silico models have been built to predict the ADMET of new NCEs so as to prevent their failure at the later stages of clinical trials; these techniques include QSPR, QSAR, pharmacophore and machine-learning models. The individual PK parameters are described in the sections below.

14.3.1 Absorption

For a compound crossing a membrane by purely passive diffusion, a reasonable permeability estimate can be made from single molecular properties, such as log D or hydrogen-bonding capacity. However, besides the purely physicochemical component of membrane transport, many compounds are also affected by biological events, including the influence of transporters and metabolism. Drugs may be substrates for transporter proteins, which can promote or hinder permeability; the roles of cytochrome P450 3A4 (CYP3A4) and P-glycoprotein (P-gp) in the gut as barriers to drug absorption have been well studied [24]. Theoretical QSAR models can account for these effects. Oral absorption estimates are widely made by in vitro methods, such as Caco-2 or Madin-Darby canine kidney (MDCK) monolayers; these cells also express transporter proteins, but only very low levels of metabolizing enzymes. Similarly, there has been continued interest in finding a relevant in-vitro screen to estimate the permeability of drugs for central nervous system (CNS) diseases; the bovine microvessel endothelial cell (BMEC) model has been explored as a possible in vitro model of the blood-brain barrier [25]. Considerable effort has also gone into developing in-silico models for oral absorption prediction [26]. The simplest models are based on a single descriptor, such as log P, log D or polar surface area (a descriptor of hydrogen-bonding potential). Different multivariate approaches, such as multiple linear regression, partial least squares and artificial neural networks, have been employed to develop quantitative structure/human-intestinal-absorption relationships [27]. In all approaches, hydrogen bonding is considered a property with an important effect on oral absorption. Absorption-simulation programs, such as GastroPlus [28] and Idea [29], are valuable tools in lead optimization and compound selection. These programs are computer simulation models developed and validated to predict ADME outcomes, such as the rate and extent of absorption, using a limited number of in vitro data inputs [30]. These advanced compartmental absorption and transit (ACAT)-based models incorporate physicochemical concepts, such as solubility and lipophilicity, better than physiological aspects involving transporters and metabolism. In more recent versions, attempts are being made to model the influence of these transporters, in addition to gut-wall metabolism, on gastrointestinal uptake.
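Simple descriptor-based absorption screens of the kind mentioned above are easy to code once the descriptors are available. The sketch below implements Lipinski's well-known rule of five over precomputed values (the function names and the "more than one violation" flagging convention are illustrative):

```python
def lipinski_violations(mol_wt, log_p, h_donors, h_acceptors):
    """Count violations of Lipinski's rule of five:
    MW <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10."""
    rules = [mol_wt > 500, log_p > 5, h_donors > 5, h_acceptors > 10]
    return sum(rules)

def likely_absorbed(mol_wt, log_p, h_donors, h_acceptors):
    """Flag a compound as likely orally absorbed if it breaks at most one rule."""
    return lipinski_violations(mol_wt, log_p, h_donors, h_acceptors) <= 1

print(likely_absorbed(350.0, 2.1, 2, 5))   # True
print(likely_absorbed(650.0, 6.3, 4, 12))  # False
```

Such a filter is deliberately crude: it captures the passive-diffusion component only, and says nothing about the transporter- and metabolism-mediated effects discussed above.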

306 Chapter 14

14.3.2 Bioavailability

Bioavailability depends on the superposition of two processes: absorption and liver first-pass metabolism. Absorption in turn depends on the solubility and permeability of the compound, as well as on interactions with transporters and metabolizing enzymes in the gut wall. Important properties determining permeability appear to be the size of the molecule, its capacity to make hydrogen bonds, its overall lipophilicity and possibly its shape and flexibility. Molecular flexibility, evaluated for example by counting the number of rotatable bonds, has been identified as a factor influencing bioavailability in rats [31]. Yoshida and Topliss trained a QSAR model with log D at pH 7.4 and 6.5 as the physicochemical inputs and the presence/absence of typical functional groups most likely to be involved in metabolic reactions as the structural input. This approach used 'fuzzy adaptive least squares', and drugs could be classified into one of four predefined bioavailability ranges; using this approach, a new drug can be assigned to the correct class with an accuracy of 60% [32]. In other approaches, regression and recursive partitioning can also be used; however, such models should not be expected to generate predictions more accurate than the variability inherent in the biological measurements [33]. Genetic programming, a specific form of evolutionary programming, has also been applied to predicting bioavailability [34]. The results show a slight improvement over the Yoshida-Topliss approach, although a direct comparison is difficult owing to a different selection of the bioavailability ranges of the four classes.
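The rotatable-bond criterion from the rat-bioavailability study cited above [31] is commonly applied together with a polar-surface-area cutoff (rotatable bonds ≤ 10 and PSA ≤ 140 Å²) as a quick oral-bioavailability screen. A minimal sketch over precomputed descriptors (function name illustrative):

```python
def flexible_polarity_screen(rot_bonds, psa):
    """Flexibility/polarity criteria associated with good oral
    bioavailability in rats: rotatable bonds <= 10 and PSA <= 140 Å^2."""
    return rot_bonds <= 10 and psa <= 140.0

print(flexible_polarity_screen(6, 85.0))    # True
print(flexible_polarity_screen(14, 150.0))  # False
```

As with the Yoshida-Topliss classifier, this is a coarse triage tool: it operates on two descriptors only and cannot account for first-pass metabolism.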

14.3.3 Blood brain barrier penetration

Drugs that act in the CNS need to cross the blood brain barrier (BBB) to reach their molecular target. By contrast, for drugs with a peripheral target, little or no BBB penetration might be required in order to avoid CNS side effects. A key issue in the development of models to predict BBB penetration is the use of appropriate data to describe brain uptake of compounds. There is an ongoing discussion about the use of total-brain data versus extracellular fluid (ECF) or cerebrospinal fluid (CSF) data, or data generated by microdialysis [35]. Another point of debate relates to the time point of measurement, which is clearly crucial. Overall, data in the literature are rather limited in number, and are also generated from different experimental protocols. All of these factors limit the development of highly predictive models of BBB penetration. Nevertheless, a variety of models for the prediction of uptake into the brain have been developed [36]. 'Rule-of-five'-like recommendations regarding the molecular parameters that contribute to the ability of molecules to cross the BBB have been made to aid BBB-penetration predictions [37]; for example, molecules with a molecular mass of <450 Da or with PSA <100 Å² are more likely to penetrate the BBB. Most of the early predictive models are based on a multiple linear regression approach and many use physicochemical properties [38]. Other multivariate techniques have also been tried using ADME-tailored properties, such as the Volsurf approach, in which a variety of 3D molecular field descriptors are transformed into a new set of descriptors, which are inputs for the construction of a model using a discriminant partial least squares procedure. As this method is based on computed properties only, it can be used as a tool in virtual screening [39].
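The MW/PSA guideline quoted above is simple enough to express directly. A minimal sketch, using approximate property values for illustration (the donepezil numbers are approximate, not taken from this chapter):

```python
# 'Rule-of-five'-like BBB guideline: molecular mass < 450 Da and
# PSA < 100 square angstroms suggest a molecule is more likely to
# penetrate the BBB.

def likely_bbb_penetrant(mol_weight_da, psa_a2):
    """Apply the simple MW/PSA heuristic for BBB penetration."""
    return mol_weight_da < 450.0 and psa_a2 < 100.0

# Example: approximate donepezil properties (a CNS drug; MW ~379.5, PSA ~38.8)
print(likely_bbb_penetrant(379.5, 38.8))   # True
print(likely_bbb_penetrant(450.0, 120.0))  # False
```

Such a filter is only a coarse pre-screen; the regression and PLS models discussed in the text refine it with additional descriptors.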

14.3.4 Transporters

Transport proteins are distributed in most of the organs that are involved in the uptake and elimination of endogenous compounds and xenobiotics, including drugs [40]. As mentioned above, a better understanding of the transporters' role in oral absorption and in uptake into the brain and liver is of particular interest [41]. Consequently, several in vitro systems, some with doubly transfected transporters, are under development and might become valuable screening tools for achieving optimal pharmacokinetic properties. One of the best-studied transporters is P-gp, a member of the ATP-binding cassette (ABC) transporter family that was first identified as the transporter responsible for the multiple-drug resistance (MDR) observed with anti-tumor agents. A better understanding of the relationships between the structures of P-gp binders (substrates or inhibitors) and their activity has been obtained using QSAR, as well as from pharmacophore and protein modeling. A set of well-defined structural elements required for interaction with P-gp has also been established from the analysis of a set of known P-gp substrates [42]. The key recognition elements in this model are two or three electron-donor groups and hydrogen-bond acceptors with a fixed spatial separation, as shown in Fig. 14.2. Lists of relevant databases and tools are given in Tables 14.1 and 14.2.

Figure 14.2 Pharmacophoric features required for interaction with P-gp.

Table 14.1: List of databases used for determining drug transportation.

Provides information on the substrates and inhibitors of 31 transporters. Website: http://transportal.

VARIDT 1.0: focuses on the variability of drug transporters (DTs), such as epigenetic regulation and genetic polymorphism of DTs; species-, tissue- and disease-specific protein abundance of DTs; and exogenous factors modulating DT activity. Website: http://varidt.idrblab.net/ttd/

METscout: a pathfinder exploring the landscape of metabolites, enzymes and transporters. Website: http://metscout.

TransportDB: a relational database of cellular membrane transport systems. Website: http://www.membranetransport.org/

HMPAS: human membrane protein analysis. Website: http://fcode.kaist.

Table 14.2: List of tools used for determining drug transportation.

ViennaLiverTox Workspace: used to predict substrates interacting with liver transporters (BCRP, P-gp, BSEP, MRP2, MRP3, OATP1B1 and OATP1B3) and their inhibition causing liver toxicity.

STS-NLSP: a multi-label classification model using the network-based label space partition method for predicting the specificity of membrane transporter substrates. A drawback of this model is that, owing to the imbalanced nature of the dataset, it performs unsatisfactorily on the proteins MRP2, MRP3, MRP4, SO1A2 and SOB1B1 in terms of F1 score.

TranCEP: predicts the substrate class of transmembrane transport proteins. Website: https://omictools.com/trancep-tool

MDML_P-gp rules: a prediction server providing two endpoints related to P-glycoprotein (P-gp): substrate property and inhibitor property. Website: https://pgprules.

CaverDock: a molecular docking-based tool to analyse ligand transport through protein tunnels and channels. Website: https://loschmidt. caverweb/

Models are now sufficiently sophisticated to rationalize earlier observations for existing P-gp substrates in terms of molecular weight, lipophilicity, hydrogen bonding, the presence of a basic nitrogen and so on, and can efficiently predict whether an NCE is a P-gp substrate. The program MolSurf has been used to calculate descriptors and build a PLS model to predict P-gp-associated ATPase activity. This model identified the main contributing descriptors for predicting ATPase activity as the size of the molecular surface, polarizability and hydrogen-bonding potential [43].

ADMET tools: Prediction and assessment of chemical ADMET properties of NCEs 309

14.3.5 Dermal and ocular penetration

Although much attention has been given to oral absorption models, some drugs are administered through alternative routes, such as the skin or the eye. For many years, QSAR models have been developed to predict percutaneous penetration [44]. These models resemble oral absorption and BBB models, and often employ very similar properties and descriptors. The existing transdermal models are typically a function of the octanol/water partition coefficient and terms that have been associated with aqueous solubility, including hydrogen-bonding parameters, molecular weight and molecular flexibility. Commercial models for the prediction of solute-permeation rates through the skin are available, for example, in the QikProp and DermWin programs. However, there seems to be little difference between the commercially available models and models published in the literature. Most, if not all, of the published skin-permeation models have been constructed from various compilations of published skin-permeation data.
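A well-known published model of exactly this form (a function of log P and molecular weight) is the Potts-Guy equation; it is shown here as a representative example of the transdermal QSARs described above, not as the model used by any particular commercial program. The input values are illustrative.

```python
# Potts-Guy skin permeability QSAR:
#   log Kp (cm/h) = -2.72 + 0.71 * logP - 0.0061 * MW
# Kp is the permeability coefficient through human skin.

def log_kp_potts_guy(log_p, mol_weight):
    """Estimate the log skin permeability coefficient (cm/h)."""
    return -2.72 + 0.71 * log_p - 0.0061 * mol_weight

# Example: a small lipophilic permeant (logP = 2.0, MW = 200 Da)
log_kp = log_kp_potts_guy(2.0, 200.0)   # -2.72 + 1.42 - 1.22 = -2.52
```

The negative MW term captures the size penalty, while the positive log P term captures the lipophilicity advantage noted in the text.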

14.3.6 Plasma-protein binding

It is generally assumed that only free drug can cross membranes and bind to the intended molecular target, and it is therefore important to estimate the fraction of drug bound to plasma proteins. Drugs can bind to a variety of particles in the blood, including red blood cells, leukocytes and platelets, in addition to proteins such as albumin (particularly acidic drugs), α1-acid glycoprotein (basic drugs), lipoproteins (neutral and basic drugs), erythrocytes and α-, β- and γ-globulins. As unbound drug primarily contributes to pharmacological efficacy, the effect of plasma-protein binding is an important consideration when evaluating the effective (unbound) drug plasma concentration. Studies have been carried out to predict plasma-protein binding (PPB); for instance, Zhu et al. compiled experimental data for 1200 compounds, considering two endpoints, the fraction bound (%PPB) and the logarithm of a pseudo binding constant (lnKa) derived from %PPB, to develop accurate QSAR models of PPB [45]. Several other computational models have also been developed to determine %PPB [46,47]. These models can efficiently give an idea of the affinity of plasma proteins for a drug. Using the multiple computer-automated structure evaluation (M-CASE) program and protein-affinity data for 154 drugs, models were generated that correctly predicted the percentage of drug bound in plasma for ~80% of the test compounds with an average error of ~14% [48].
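The %PPB-to-lnKa transformation can be sketched as follows. The exact definition used by Zhu et al. may differ in detail; here lnKa is taken as the logarithm of the bound/unbound ratio, a common choice for such pseudo-equilibrium constants, and the example %PPB value is illustrative.

```python
import math

# Hedged sketch of the %PPB endpoints used in QSAR modelling of
# plasma-protein binding (not the exact published transformation).

def fraction_unbound(ppb_percent):
    """Fraction of drug not bound to plasma proteins."""
    return 1.0 - ppb_percent / 100.0

def ln_ka_pseudo(ppb_percent):
    """Pseudo binding constant: log of the bound/unbound ratio."""
    bound = ppb_percent / 100.0
    return math.log(bound / (1.0 - bound))

# Example: a drug that is 94.8% bound in plasma
fu = fraction_unbound(94.8)   # 0.052 unbound
ln_ka = ln_ka_pseudo(94.8)    # about 2.9
```

The logit-like lnKa endpoint spreads out the highly bound region (90-99% bound), where small changes in %PPB correspond to large changes in free drug.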

14.3.7 Volume of distribution

The volume of distribution, together with the clearance rate, determines the half-life (t1/2) of a drug and therefore its dose regimen, and so the early prediction of both properties would be beneficial. When the logarithm of the volume of distribution is plotted against log D, a scatter plot is obtained with no correlation. However, when these data are corrected for plasma-protein binding, the resulting plot of the logarithm of the unbound volume of distribution (log Vdu) against log D reveals a clear linear trend, with log Vdu increasing at higher lipophilicities [49]. This can be utilized as a simple guide in modifying and optimizing Vdu. Approaches for predicting volume of distribution values have been studied extensively; they typically consider experimental octanol/water distribution coefficients at pH 7.4 and the ionization constant (pKa) of the compounds, together with measured plasma-protein-binding data. In principle, this approach could be fully computational, as predictive models are available for log P and pKa, and models for plasma-protein binding are under development, as described above. Several groups have explored such fully computational models in recent years.
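The protein-binding correction described above amounts to dividing the volume of distribution by the fraction unbound. A minimal sketch, with illustrative values:

```python
import math

# Unbound volume of distribution: Vdu = Vd / fu, whose logarithm shows a
# linear trend against log D once protein binding is accounted for.

def unbound_vd(vd_l_per_kg, fu):
    """Correct the volume of distribution for plasma-protein binding."""
    return vd_l_per_kg / fu

# Example: Vd = 0.8 L/kg with 10% of the drug unbound (fu = 0.10)
log_vdu = math.log10(unbound_vd(0.8, 0.10))   # log10(8.0), about 0.90
```

Because fu appears in the denominator, two drugs with the same measured Vd but different protein binding can have very different unbound volumes, which is why the uncorrected log Vd versus log D plot shows no correlation.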

14.3.8 Clearance

Clearance is an important pharmacokinetic parameter that, together with the volume of distribution, defines the half-life, and thus the frequency of dosing, of a drug. For a series of adenosine A1 receptor agonists, not only their clearance, but also their volume of distribution and protein binding, could be predicted using the multivariate PLS technique [50]. Further improvements have been obtained using nonlinear models, such as neural networks; using neural networks in addition to multivariate techniques improves the prediction of human hepatic drug clearance. Generally, computational tools for clearance prediction can be successful only if the models can use experimental data as part of their input. Obviously, however, such models can then only be used at stages in the drug discovery process where these experimental data are being generated.
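The relationship stated in both this section and the previous one, that clearance and volume of distribution together determine the half-life, is the standard first-order elimination formula t1/2 = ln 2 · Vd / CL. A minimal sketch with illustrative values:

```python
import math

# Elimination half-life from volume of distribution and clearance,
# assuming first-order (one-compartment) kinetics; units must match.

def half_life_h(vd_l, cl_l_per_h):
    """Half-life in hours from Vd (L) and CL (L/h)."""
    return math.log(2.0) * vd_l / cl_l_per_h

# Example: Vd = 42 L, CL = 4.2 L/h  ->  t1/2 is about 6.9 h
t_half = half_life_h(42.0, 4.2)
```

This makes the dependency explicit: halving clearance or doubling the volume of distribution doubles the half-life, and hence lengthens the dosing interval.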

14.3.9 Metabolism

Several aspects of metabolism, including the rate and extent of metabolism (turnover), the enzymes involved and the products formed, give rise to different concerns. These aspects are relevant to drug discovery and should not be overlooked. The extent and rate of metabolism affect clearance, whereas the involvement of particular enzymes might lead to issues associated with the polymorphic nature of some of these enzymes and to drug-drug interactions. QSAR and molecular modeling approaches for predicting metabolism have an increasingly important role as a possible alternative to in vitro metabolism studies. In silico approaches to predict metabolism can be divided into QSAR and 3D-QSAR studies, pharmacophore models and predictive databases. Some of the first-generation predictive-metabolism tools currently require considerable input from a computational chemist, whereas others can be used as rapid filters for the screening of virtual libraries, for instance, testing CYP3A4 liability [51]. Perhaps the most intellectually satisfying molecular modeling studies are those based on the crystal structure of the

metabolizing enzymes. Historically, these structure-based models have relied on crystal structure information from bacterial homologues [52]. However, the crystal structures of the more relevant mammalian cytochrome P450s (CYP3A4 and CYP2C9) and the structure of CYP2C5 are publicly available. Early predictions of the vulnerability to metabolism of certain positions in the molecule might help to eliminate metabolic liabilities [53]. Another program, called MetaSite [54], is based on a pharmacophore representation obtained from interaction fields for the protein structure and a pharmacophoric fingerprint for the potential substrate. Several approaches that use databases to predict metabolism are available or under development [55], including expert systems, such as MetabolExpert (Compudrug), META (MultiCASE) or Meteor (Lhasa), and the databases Metabolite (MDL) and Metabolism (Accelrys) [56]. Ultimately, such programs might be linked to computer-aided toxicity prediction on the basis of quantitative structure-toxicity relationships and expert systems for toxicity evaluation, such as DEREK (Lhasa) and MultiCase. Lists of relevant databases and tools are given in Tables 14.3 and 14.4.

Table 14.3: List of databases used to predict drug metabolism.




Metrabase: offers structured and freely accessible, manually extracted data on interactions between transport- and metabolism-related proteins and chemical compounds. It provides measured activities, chemical structural information, tissue expression data and negative action types.

XMetDB (xmetdb/xmetdbserver): an open-access database for xenobiotic metabolism, implemented as an online system for deposition and sharing of experimental data. The database contains chemical structures of xenobiotic biotransformations with substrate atoms annotated as reaction centers, the resulting product formed, the catalyzing enzyme, the type of experiment and literature references.

The Human Metabolome Database (HMDB): currently the most complete and comprehensive curated collection of human metabolite and human metabolism data in the world; 2180 endogenous metabolites are reported. HMDB also contains an extensive collection of experimental metabolite concentration data compiled from hundreds of mass spectrometry (MS) and nuclear magnetic resonance (NMR) metabolomic analyses performed on urine, blood and cerebrospinal fluid samples.

Transformer Database: contains integrated information on the three phases of biotransformation (modification, conjugation and excretion) of 3000 drugs and >350 relevant food ingredients and herbs, which are catalyzed by 400 proteins. Website: http://bioinformatics.

SuperCYP: comprehensive data on the 57 human CYPs. Around 1000 SNPs and more than 1200 protein mutations are listed and ordered by their effect on expression and/or activity; 1170 drugs and 3800 drug interactions are listed. Website: http://bioinformatics.

Table 14.4: List of tools used to predict drug metabolism.

MetaSite: uses a combination of the protein structure, molecular interaction fields (MIFs) of protein and ligand, and molecular orbital calculations to predict the site of metabolism (SOM) of CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 substrates. Website: https://www. software/metasite/

CypScore: consists of multiple models for the most important P450 oxidation reactions, such as aliphatic hydroxylation, N-dealkylation, O-dealkylation, aromatic hydroxylation, double-bond oxidation, N-oxidation and S-oxidation. It uses semi-empirical molecular orbital theory and atomic reactivity descriptors for SOM prediction. Website: https://omictools.com/cypscore-tool

FAME 3: FAst MEtabolizer (FAME 3), a collection of extra-trees classifiers for the prediction of sites of metabolism (SoMs) in small molecules for phase 1 and phase 2 metabolic enzymes. FAME 3 was derived from the MetaQSAR database. Circular atom descriptors combined with atom-type fingerprints were used in model development. Website: https://omictools.com/fame-3-tool

Somugt: predicts the site of glucuronidation for four major site-of-metabolism functional groups, i.e. aliphatic hydroxyl, aromatic hydroxyl, carboxylic acid and amino nitrogen. Quantum chemical descriptors are used for SOM prediction. Website: adme/jlpeng/

GLORY: combines SoM prediction with FAME 2 (FastMetabolizer) and a new collection of reaction rules for metabolic reactions mediated by the cytochrome P450 enzyme family. It also predicts the structures of likely cytochrome P450 metabolites based on the predicted site of metabolism.
14.3.10 Toxicity

'Drug toxicity' can be defined as a diverse array of adverse effects brought about through drug use at either therapeutic or non-therapeutic doses. Toxic effects observed above therapeutic doses are often predictable from the pharmacological properties of the drug and are likely to affect all patients. Numerous mechanisms have been proposed through which drug toxicity occurs, and some of them are still not completely understood. Drug toxicity can arise through off-target interactions or through the production of reactive, electrophilic metabolites. Another cause is dose-dependent toxicity, as the same compounds administered at different doses can have different toxicities [57]. Drug-metabolizing enzymes have a key role in drug-mediated organ damage, recognized through "structural alerts" or "toxicophores" such as thiophenes and other sulfur-containing heterocycles (via S-oxidation), furans (via epoxidation), anilines (via N- or C-oxidation), nitrobenzenes (via nitro reduction), hydrazines (via oxidation to free-radical species), and some carboxylic acid derivatives (via acyl glucuronide or acyl-coenzyme A thioester formation) [58]. Additionally, compounds with a variety of heteroatom-substituted benzene rings can undergo metabolic activation to generate electrophilic "quinoid" products, such as imines, quinones and methides, that bind covalently to cellular macromolecules and, in some cases, cause oxidative stress via reactive oxygen species; both of these mechanisms can lead to liver toxicity [59]. Various web tools and software packages have been developed to predict drug toxicity. One such tool, MetaTox, can predict metabolites formed by nine classes of reactions, including aliphatic and aromatic hydroxylation, N- and O-glucuronidation, N-, S- and C-oxidation, and N- and O-dealkylation [60]. Another tool, ProTox, has a prediction scheme classified into different levels of toxicity, such as organ toxicity (hepatotoxicity), oral toxicity, toxicological endpoints (such as mutagenicity, carcinogenicity, cytotoxicity and immunotoxicity), toxicological pathways and toxicity targets; this tool provides insights into the possible molecular mechanism behind a toxic response [61]. Lists of relevant databases and tools are given in Tables 14.5 and 14.6.

Table 14.5: List of databases for predicting drug toxicity.



US-EPA ToxRef Database: provides fast automated tests for screening and assessing chemical exposure, hazard and risk.

REPDOSE: a database on repeated-dose toxicity studies of commercial chemicals. Website: about_repdose.html

METLIN: a metabolite mass spectral database that facilitates metabolite identification through mass analysis.

The STEP (safety and toxicity of excipients for pediatrics) database: compiles the safety and toxicity data of excipients.

DSSTOX: provides a high-quality public chemistry resource for supporting improved predictive toxicology.

Table 14.6: List of tools for predicting drug toxicity.

A virtual lab for the prediction of toxicities of small molecules. It incorporates molecular similarity, fragment propensities, most frequent features and machine learning, based on a total of 33 models for the prediction of various toxicity endpoints.

MetaTox: predicts the structure and toxicity of xenobiotic metabolites, which are formed by nine classes of reactions.

DeepTox: a pipeline for predicting toxic effects of chemical compounds.

ECOSAR: makes predictions based on interactions that have been reported for compounds with similar chemical groups. Website: ecological-structure-activity-relationshipsecosar-predictive-model

TOPKAT: provides models for both qualitative and quantitative toxicity predictions. Website: collaborative-science/biovia-discovery-studio/qsar-admet-and-predictive-toxicology.html


14.4 Computational tools for ADMET prediction

The prediction of ADMET properties plays an important role in the drug design process because these properties account for the failure of about 60% of all drugs in the clinical phases. Computational tools for ADMET prediction can be applied at an early phase of the drug development process in order to remove molecules with poor ADME properties from the drug development pipeline, leading to significant savings in research and development costs.

Freely available software:
DSSTox: Distributed Structure-Searchable Toxicity (DSSTox) Public Database.
The Carcinogenic Potency Database (CPDB): a unique and widely used international resource of the results of 6540 chronic, long-term animal cancer tests on 1547 chemicals.
PK Tutor: free Excel tools for PK and ADME research and education.

Web servers:
PreADMET (ADME prediction): predicts permeability for Caco-2 cells, MDCK cells and the BBB (blood brain barrier), HIA (human intestinal absorption), skin permeability and plasma-protein binding.
PreADMET (toxicity prediction): predicts toxicological properties, such as mutagenicity and carcinogenicity, from chemical structures.
Molinspiration: calculation of molecular properties and drug-likeness.

Commercial software:
Schrödinger (QikProp): ADME property prediction.
Spotfire: calculates properties with ADMET WorkBench.
GastroPlus: a simulation software package that simulates biopharmaceutics, pharmacokinetics and pharmacodynamics in humans and animals.
ChemTree: predicts ADME/Tox properties (free trial available).
MDL Metabolite Database: a complete metabolism information system.
MDL Toxicity Database: a structure-searchable bioactivity database of toxic chemical substances.
VolSurf: a computational procedure to produce 2D molecular descriptors from 3D molecular interaction energy grid maps.
MetaSite: a computational procedure specially designed to predict the site of metabolism of xenobiotics starting from the 3D structure of a compound.

MoKa: in silico computation of pKa values.
Tsar 3.2: structure-activity software that assists in identifying new drug leads, with automatic ADME calculation, FIRM analysis, virtual library enumeration and database connectivity.
TOPKAT: predictive toxicology.
Metabolism: a database of metabolic pathways in numerous species.
ADMET: allows compounds with unfavorable ADMET characteristics to be eliminated early on, avoiding expensive reformulation later, and proposed structural refinements designed to improve ADMET properties to be evaluated prior to resource expenditure on synthesis.

Web interfaces on libraries: A number of libraries (e.g. R, Bioconductor, BioJava) provide many tools. Although these libraries are powerful, they require computing expertise; web interfaces over these libraries are being developed in order to serve users who have little or no knowledge of computing.

Drug-likeness rules

Certain rules have been proposed by different scientists to evaluate the oral bioavailability or drug-likeness of an NCE. As per these rules, for a drug to be orally bioavailable or "drug-like", it should have certain physicochemical properties within defined ranges. These rules are summarized in Table 14.7.

14.5 Conclusion

In the last decade, pharmaceutical profiling has made a great leap forward with the development of comprehensive ADMET diagnostic tools. The construction of a full in silico PK profiling architecture in support of drug discovery and development is still in progress. Technology innovations are focused on improving quality, such as increasing the predictivity of in silico suites with respect to in vivo PK results. Exploring user-friendly databases via comprehensible data-mining tools to effectively present comprehensive ADMET profiling data is one commonly employed strategy. Another is developing or evaluating evocative in silico tools for the projection of in vivo PK/PD properties using the in vitro ADMET data collectively. This leads to adequate use of in silico PK profiling to prioritize drug candidates, improving drug discovery and development.

Table 14.7: List of various rules reported to estimate the "drug-likeness" of molecules.

Lipinski's rule of five: HBD ≤ 5, HBA ≤ 10, MW ≤ 500, logP ≤ 5
MDDR-like rule: RA ≥ 3, RB ≥ 18, ROTBs ≥ 6
Veber rule: ROTBs ≤ 10, PSA ≤ 140
Ghose filter: logP (−0.4 to 5.6), MR (40-130), MW (160-480), NA (20-70)
BBB rule: HB (8-10), MW (400-500), number of acid groups
CMC-50-like rule: logP (1.3-4.1), MR (70-110), MW (230-390), NA (30-55)
Quantitative Estimate of Drug-likeness (QED): ranges from 0 (all properties unfavorable) to 1 (all properties favorable); the properties required to compute QED are logP, HBD, HBA, PSA, ROTBs, RA and ALERTS

Abbreviations: HBD, number of hydrogen bond donors; HBA, number of hydrogen bond acceptors; MW, molecular weight; logP, octanol-water partition coefficient; RA, number of aromatic rings; RB, number of rigid bonds; ROTBs, number of rotatable bonds; PSA, molecular polar surface area; MR, molar refractivity; NA, number of atoms; HB, hydrogen bonds; ALERTS, number of structural alerts.
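Lipinski's rule of five from Table 14.7 is simple enough to implement directly once the four properties have been computed (e.g. with PaDEL or another descriptor tool). A minimal sketch; the chalcone property values below are approximate illustrations, and the "at most one violation" tolerance is a common convention rather than part of the table itself.

```python
# Lipinski's rule of five: HBD <= 5, HBA <= 10, MW <= 500, logP <= 5.

def lipinski_violations(hbd, hba, mw, logp):
    """Count rule-of-five violations for a set of computed properties."""
    violations = 0
    if hbd > 5:
        violations += 1
    if hba > 10:
        violations += 1
    if mw > 500:
        violations += 1
    if logp > 5:
        violations += 1
    return violations

def is_drug_like(hbd, hba, mw, logp, max_violations=1):
    """Commonly, at most one violation is tolerated."""
    return lipinski_violations(hbd, hba, mw, logp) <= max_violations

# Example: approximate chalcone properties (HBD = 0, HBA = 1, MW ~208, logP ~3.7)
print(is_drug_like(0, 1, 208.3, 3.7))   # True
```

The same pattern extends directly to the Veber rule or the Ghose filter by swapping in the corresponding property thresholds from Table 14.7.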

Exercise 1: To determine the ADME properties of a common medicinal scaffold, chalcone, via the PreADMET webserver [62].

Note: PreADMET is a web-based application for predicting ADME data and building drug-like libraries using in silico methods. In this experiment we evaluate the well-known medicinal scaffold chalcone.

Step-by-step protocol
Click "ADME prediction" on the main webpage of the tool at https://preadmet.bmdrc.kr/.
Input: Draw the structure of 1,3-diarylpropenone (chalcone) in the molecule sketcher. Click "Submit".
Results: The prediction results are provided as an Excel file, showing a tabular representation of the predicted values of the different ADME parameters for the submitted structure, in this case chalcone (Table 14.8).

Table 14.8: Various predicted values for ADME parameters, for chalcone.

BBB: 1.51574
Buffer_solubility_mg_L: 75.55
Caco2: 54.5929
CYP_2C19_inhibition: Inhibitor
CYP_2C9_inhibition: Inhibitor
CYP_2D6_inhibition: Non
CYP_2D6_substrate: Non
CYP_3A4_inhibition: Non
CYP_3A4_substrate: Non
HIA: 100
MDCK: 115.204
Pgp_inhibition: Inhibitor
Plasma_Protein_Binding: 94.83183
Pure_water_solubility_mg_L: 9.81034
Skin_Permeability: −1.57522
SKlogD_value: 3.74265
SKlogP_value: 3.74265
SKlogS_buffer: −3.44037
SKlogS_pure: −4.32692

Exercise 2: To predict the toxicity of the anthracycline class of anti-cancer drugs using the ProTox-II tool.
Exercise 3: To predict the metabolites formed by nine classes of reactions using MetaTox for the COX-II inhibitor class of NSAIDs.
Exercise 4: To study the cardiotoxicity of the tricyclic class of selective COX-II inhibitors (coxibs) using various toxicity prediction tools (Hint: refer to article [63] to draw the structures of all selective coxibs).
Exercise 5: To predict the P-gp-mediated transportation of various classes of anti-cancer drugs (Hint: to retrieve the structures of anti-cancer drugs, refer to Foye's Principles of Medicinal Chemistry, pp. 1147-1192).
Exercise 6: To draw the structures of some orally bioavailable FDA-approved antibiotics and calculate the physicochemical properties used in Lipinski's rule of five (Hint: see pages 24-25 of this chapter under the section "Drug-likeness rules", Lipinski's rule of five; collect the structures of antibiotics from the medicinal chemistry book Foye's Principles of Medicinal Chemistry, pp. 1028-1083, and calculate the required properties using PaDEL, Dragon, ChemDes, etc.).

Exercise 7: To draw the structures of clinically used anti-hypertensive drugs and determine their oral bioavailability by applying Lipinski's rule of five.
Exercise 8: To draw the structures of some orally active sulfonamides and determine their drug-likeness under the Veber rule using DruLiTo.

References
[1] B. Booth, R. Zemmel, Opinion: prospects for productivity, Nat. Rev. Drug Discov. 3 (2004) 451.
[2] V. Marx, Drug earnings rise, albeit unevenly, Chem. Eng. News 82 (2004) 15-16.
[3] T. Kennedy, Managing the drug discovery/development interface, Drug Discov. Today 2 (1997) 436-444.
[4] I. Kola, J. Landis, Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov. 3 (2004) 711.
[5] C.A. Lipinski, Drug-like properties and the causes of poor solubility and poor permeability, J. Pharmacol. Toxicol. Methods 44 (2000) 235-249.
[6] E.H. Kerns, L. Di, Pharmaceutical profiling in drug discovery, Drug Discov. Today 8 (2003) 316-323.
[7] J. Penzotti, G. Landrum, S. Putta, Building predictive ADMET models for early decisions in drug discovery, Curr. Opin. Drug Discov. Dev. 7 (2004) 49-61.
[8] A. Sharma, S. Sharma, M. Gupta, S. Fatima, R. Saini, S.M. Agarwal, Pharmacokinetic profiling of anticancer phytocompounds using computational approach, Phytochem. Anal. 29 (2018) 559-568.
[9] H. Lu, P.J. Tonge, Drug-target residence time: critical information for lead optimization, Curr. Opin. Chem. Biol. 14 (2010) 467-474.
[10] M.P. Gleeson, A. Hersey, D. Montanari, J. Overington, Probing the links between in vitro potency, ADMET and physicochemical parameters, Nat. Rev. Drug Discov. 10 (2011) 197.
[11] K.M. Honorio, T.L. Moda, A.D. Andricopulo, Pharmacokinetic properties and in silico ADME modeling in drug discovery, Med. Chem. 9 (2013) 163-176.
[12] G.L. Amidon, H. Lennernäs, V.P. Shah, J.R. Crison, A theoretical basis for a biopharmaceutic drug classification: the correlation of in vitro drug product dissolution and in vivo bioavailability, Pharm. Res. 12 (1995) 413-420.
[13] A. Avdeef, Absorption and Drug Development: Solubility, Permeability, and Charge State, John Wiley & Sons, 2012.
[14] W.L. Jorgensen, E.M. Duffy, Prediction of drug solubility from structure, Adv. Drug Deliv. Rev. 54 (2002) 355-366.
[15] C.A. Bergström, U. Norinder, K. Luthman, P. Artursson, Experimental and computational screening models for prediction of aqueous drug solubility, Pharm. Res. 19 (2002) 182-188.
[16] D. Sun, L. Yu, M. Hussain, D. Wall, R. Smith, G. Amidon, In vitro testing of drug absorption for drug 'developability' assessment: forming an interface between in vitro preclinical data and clinical outcome, Curr. Opin. Drug Discov. Dev. 7 (2004) 75-85.
[17] P.V. Balimane, S. Chong, R.A. Morrison, Current methodologies used for evaluation of intestinal permeability and absorption, J. Pharmacol. Toxicol. Methods 44 (2000) 301-312.
[18] A. Kulkarni, Y. Han, A.J. Hopfinger, Predicting Caco-2 cell permeation coefficients of organic molecules using membrane-interaction QSAR analysis, J. Chem. Inf. Comput. Sci. 42 (2002) 331-342.
[19] O.A. Raevsky, V.I. Fetisov, E.P. Trepalina, J.W. McFarland, K.J. Schaper, Quantitative estimation of drug absorption in humans for passively transported compounds on the basis of their physico-chemical parameters, Quant. Struct.-Act. Relat. 19 (2000) 366-374.
[20] P. Stenberg, U. Norinder, K. Luthman, P. Artursson, Experimental and computational screening models for the prediction of intestinal drug absorption, J. Med. Chem. 44 (2001) 1927-1937.
[21] P. Ertl, B. Rohde, P. Selzer, Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties, J. Med. Chem. 43 (2000) 3714-3717.
[22] S. Hilal, S. Karickhoff, L. Carreira, A rigorous test for SPARC's chemical reactivity models: estimation of more than 4300 ionization pKas, Quant. Struct.-Act. Relat. 14 (1995) 348-355.
[23] P. Buchwald, N. Bodor, Computer-aided drug design: the role of quantitative structure-property, structure-activity and structure-metabolism relationships (QSPR, QSAR, QSMR), Drugs Future 27 (2002) 577-588.
[24] C.L. Cummins, W. Jacobsen, L.Z. Benet, Unmasking the dynamic interplay between intestinal P-glycoprotein and CYP3A4, J. Pharmacol. Exp. Ther. 300 (2002) 1036-1045.
[25] M. Gumbleton, K.L. Audus, Progress and limitations in the use of in vitro cell cultures to serve as a permeability screen for the blood-brain barrier, J. Pharm. Sci. 90 (2001) 1681-1698.
[26] X. Fu, W. Liang, Q. Yu, Correlation of drug absorption with molecular charge distribution, Die Pharm. 56 (2001) 267-268.
[27] S. Agatonovic-Kustrin, R. Beresford, A.P.M. Yusof, Theoretically-derived molecular descriptors important in human intestinal absorption, J. Pharm. Biomed. Anal. 25 (2001) 227-237.
[28] B. Agoram, W.S. Woltosz, M.B. Bolger, Predicting the impact of physiological and biochemical processes on oral drug bioavailability, Adv. Drug Deliv. Rev. 50 (2001) S41-S67.
[29] D. Norris, G. Leesman, P. Sinko, G. Grass, Development of predictive pharmacokinetic simulation models for drug discovery, J. Control. Release 65 (2000) 55-62.
[30] N. Parrott, T. Lavé, Prediction of intestinal absorption: comparative assessment of GastroPlus and IDEA, Eur. J. Pharm. Sci. 17 (2002) 51-61.
[31] D.F. Veber, S.R. Johnson, H.-Y. Cheng, B.R. Smith, K.W. Ward, K.D. Kopple, Molecular properties that influence the oral bioavailability of drug candidates, J. Med. Chem. 45 (2002) 2615-2623.
[32] F. Yoshida, J.G. Topliss, QSAR model for drug human oral bioavailability, J. Med. Chem. 43 (2000) 2575-2585.
[33] C.W. Andrews, L. Bennett, X.Y. Lawrence, Predicting human oral bioavailability of a compound: development of a novel quantitative structure-bioavailability relationship, Pharm. Res. 17 (2000) 639-644.
[34] W. Bains, R. Gilbert, L. Sviridenko, J.-M. Gascon, R. Scoffin, K. Birchall, et al., Evolutionary computational methods to predict oral bioavailability QSPRs, Curr. Opin. Drug Discov. Dev. 5 (2002) 44-51.
[35] E.C. de Lange, M. Danhof, Considerations in the use of cerebrospinal fluid pharmacokinetics to predict brain target concentrations in the clinical setting, Clin. Pharmacokinet. 41 (2002) 691-703.
[36] K. Rose, L.H. Hall, L.B. Kier, Modeling blood-brain barrier partitioning using the electrotopological state, J. Chem. Inf. Comput. Sci. 42 (2002) 651-666.
[37] H. van de Waterbeemd, G. Camenisch, G. Folkers, J.R. Chretien, O.A. Raevsky, Estimation of blood-brain barrier crossing of drugs using molecular size and shape, and H-bonding descriptors, J. Drug Target. 6 (1998) 151-165.
[38] M.H. Abraham, J.A. Platts, Physicochemical factors that influence brain uptake, in: The Blood-Brain Barrier and Drug Delivery to the CNS, 2000.
[39] P. Crivori, G. Cruciani, P.-A. Carrupt, B. Testa, Predicting blood-brain barrier permeation from three-dimensional molecular structure, J. Med. Chem. 43 (2000) 2204-2216.
[40] A. Ayrton, P. Morgan, Role of transport proteins in drug absorption, distribution and excretion, Xenobiotica 31 (2001) 469-497.
[41] J. Van Asperen, U. Mayer, O. Van Tellingen, J.H. Beijnen, The functional role of P-glycoprotein in the blood-brain barrier, J. Pharm. Sci. 86 (1997) 881-884.
[42] A. Seelig, X. Li Blatter, F. Wohnsland, Substrate recognition by P-glycoprotein and the multidrug resistance-associated protein MRP1: a comparison, Int. J. Clin. Pharmacol. Ther. 38 (2000) 111-121.
[43] T. Österberg, U. Norinder, Theoretical calculation and prediction of P-glycoprotein-interacting drugs using MolSurf parametrization and PLS statistics, Eur. J. Pharm. Sci. 10 (2000) 295-303.
[44] W.J. Pugh, I.T. Degim, J. Hadgraft, Epidermal permeability-penetrant structure relationships: 4, QSAR of permeant diffusion across human stratum corneum in terms of molecular weight, H-bonding and electronic charge, Int. J. Pharm. 197 (2000) 203-211.
[45] X.-W. Zhu, A. Sedykh, H. Zhu, S.-S. Liu, A. Tropsha, The use of pseudo-equilibrium constant affords improved QSAR models of human plasma protein binding, Pharm. Res. 30 (2013) 1790-1798.

320 Chapter 14 [46] K. Yamazaki, M. Kanaoka, Computational prediction of the plasma protein-binding percent of diverse pharmaceutical compounds, J. Pharm. Sci. 93 (2004) 1480 1494. [47] Z.D. Zhivkova, Quantitative structure pharmacokinetics relationships for plasma protein binding of basic drugs, J. Pharm. Pharm. Sci. 20 (2017) 349 359. [48] R.D. Saiakhov, L.R. Stefan, G. Klopman, Multiple computer-automated structure evaluation model of the plasma protein binding affinity of diverse drugs, Perspect. Drug Discov. Des. 19 (2000) 133 155. [49] K. Valko, E. Chiarparin, S. Nunhuck, D. Montanari, In vitro measurement of drug efficiency index to aid early lead optimization, J. Pharm. Sci. 101 (2012) 4155 4169. [50] P.H. van der Graaf, J. Nilsson, E.A. van Schaick, M. Danhof, Multivariate quantitative structure pharmacokinetic relationships (QSPKR) analysis of adenosine A1 receptor agonists in rat, J. Pharm. Sci. 88 (1999) 306 312. [51] J. Zuegge, U. Fechner, O. Roche, N.J. Parrott, O. Engkvist, G. Schneider, A fast virtual screening filter for cytochrome P450 3A4 inhibition liability of compound libraries, Quant. Structure-Activity Relatsh. 21 (2002) 249 256. [52] S. Ekins, M.J. de Groot, J.P. Jones, Pharmacophore and three-dimensional quantitative structure activity relationship methods for modeling cytochrome p450 active sites, Drug. Metab. Dispos. 29 (2001) 936 944. [53] L. Higgins, K.R. Korzekwa, S. Rao, M. Shou, J.P. Jones, An assessment of the reaction energetics for cytochrome P450-mediated reactions, Arch. Biochem. Biophys. 385 (2001) 220 230. [54] G. Cruciani, E. Carosati, B. De Boeck, K. Ethirajulu, C. Mackie, T. Howe, et al., MetaSite: understanding metabolism in human cytochromes from the perspective of the chemist, J. Med. Chem. 48 (2005) 6970 6979. [55] J. Langowski, A. Long, Computer systems for the prediction of xenobiotic metabolism, Adv. Drug Deliv. Rev. 54 (2002) 407 415. [56] R.P. 
Remmel, Drug metabolism databases and high-throughput testing during drug design and development, J. Med. Chem. 45 (2002). 1958-1958. [57] C.M. Ellison, S.J. Enoch, M.T. Cronin, A review of the use of in silico methods to predict the chemistry of molecular initiating events related to drug toxicity, Expert. Opin. Drug Metab. Toxicol. 7 (2011) 1481 1495. [58] P.K. Singh, A. Negi, P.K. Gupta, M. Chauhan, R. Kumar, Toxicophore exploration as a screening technology for drug design and discovery: techniques, scope and limitations, Arch. Toxicol. 90 (2016) 1785 1802. [59] T.A. Baillie, A.E. Rettie, Role of biotransformation in drug-induced toxicity: influence of intra-and interspecies differences in drug metabolism, Drug. Metab. Pharmacokinet. (2010). 1010210091-1010210091. [60] M.N. Drwal, P. Banerjee, M. Dunkel, M.R. Wettig, R. Preissner, ProTox: a web server for the in silico prediction of rodent oral toxicity, Nucleic Acids Res. 42 (2014) W53 W58. [61] A.V. Rudik, V.M. Bezhentsev, A.V. Dmitriev, D.S. Druzhilovskiy, A.A. Lagunin, D.A. Filimonov, et al., MetaTox: web application for predicting structure and toxicity of xenobiotics’ metabolites, J. Chem. Inf. Model. 57 (2017) 638 642. [62] S. Lee, G. Chang, I. Lee, J. Chung, K. Sung, K. No, The PreADME: Pc-based program for batch prediction of adme properties, EuroQSAR 9 (2004) 5 10. [63] A. Zarghi, S. Arfaei, Selective COX-2 inhibitors: a review of their structure-activity relationships, Iran. J. Pharm. Res.: IJPR 10 (2011) 655.


Cheminformatic tools: Identify suitable synthesis procedures to realize designed molecules

15.1 Introduction
In order to synthesize a target chemical compound, it is necessary to search for a series of suitable reaction steps beginning from available starting materials. This analysis, which starts from the target compound and works backward, dates as far back as Robert Robinson's seminal 1917 synthesis of tropinone [1]. In 1990, E.J. Corey was awarded the Nobel Prize for formalizing this approach as retrosynthesis [2]. The formalization prompted the development of computer assistance, allowing chemists to focus on what to make rather than how to make it; much of the field's subsequent development was led by J. Gasteiger [3]. From the very beginning, the vast majority of automated retrosynthesis programs have been built on encoded reaction templates, i.e., generalized subgraph matching rules [4]. These template-based approaches require a decision about the extent of generalization and abstraction, whether the templates are extracted algorithmically from reaction databases [5] or encoded by hand [6]. Techniques have been developed to extract the likely meaningful context around the reaction center, for example by considering non-structural reactivity descriptors, but a trade-off between specificity and coverage is unavoidable. Moreover, applying templates is computationally expensive owing to the cost of solving the subgraph isomorphism problem, so these approaches do not scale well to large template sets [7]. Similar considerations apply to the task of forward prediction [8], which has been the subject of several recent studies [9]. Identifying the reaction site is necessary in order to propose synthons [10], that is, nonphysical fragments of precursors.
Given one or more synthons resulting from a proposed retrosynthetic step, it is still necessary to propose specific functionalities to create their synthetic equivalents. Nevertheless, despite their limitations, reaction templates provide a very useful way to encode transformations, particularly because they can fully specify the chemical precursors.

Concepts and Experimental Protocols of Modelling and Informatics in Drug Design. DOI: © 2021 Elsevier Inc. All rights reserved.



15.2 Strategies for computer-assisted prediction of synthetic schemes
Contemporary strategies to predict synthetic schemes for desired molecules can be grouped into three categories: template library-based, template-free, and focused template application.

15.2.1 Template library-based
Template library-based retrosynthesis involves matching generalized reaction rules to target molecules to yield one or more candidate precursors. Early programs required chemists to manually codify these rules in user-unfriendly syntax [11], resulting in often-incomplete template databases able to predict only a limited set of chemistries. With sufficient time investment, this manual approach has been extended to cover much of known chemistry, in the form of the commercial program Chematica [6]. Contemporary approaches use algorithmic template extraction from atom-mapped reaction examples [12]. An extracted rule must contain the atoms that change connectivity, but the degree to which auxiliary atoms are included is flexible. There is an inevitable trade-off between specificity and speed: including too many neighboring atoms leads to large, poorly generalizing template libraries that are computationally expensive to apply, while including too few neglects necessary context and leads to infeasible disconnections. Simple heuristics can provide an appropriate balance: atoms adjacent to the reaction center are included if they are terminal, are required for unambiguous specification of chirality, or belong to substructures known to influence reactivity [13]. Application of the full template library to a target produces candidate precursors, often numbering in the hundreds or thousands. If no precursors are commercially available, they must themselves be expanded; this continues recursively until a fully buyable path is found or some maximum depth is reached. Careful handling of stereochemistry is required for faithful preservation, inversion, or destruction of tetrahedral centers and cis/trans bond geometry, for example using the open-source RDChiral wrapper [14] for RDKit [15]. To avoid combinatorial explosion, recursive expansion must focus on only the most promising disconnections, those that yield easily synthesizable compounds.
Numerous metrics attempt to quantify the complexity of molecular structures [16]. One very crude metric is the length of a molecule's SMILES representation [17] raised to the three-halves power; because the exponent is superlinear, this favors dividing molecules into the smallest possible components, as in a convergent synthesis.
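This heuristic is easy to sketch. The following minimal Python snippet (function names are illustrative, not from any published implementation) shows why the superlinear exponent rewards balanced, convergent disconnections:

```python
def smiles_complexity(smiles):
    """Crude complexity metric: SMILES string length to the 3/2 power."""
    return len(smiles) ** 1.5

def disconnection_cost(precursor_smiles):
    """Total cost of a proposed disconnection: sum of precursor
    complexities. Because x**1.5 grows superlinearly, splitting a
    target into two similarly sized fragments costs less than
    leaving one large fragment, which favors convergent syntheses."""
    return sum(smiles_complexity(s) for s in precursor_smiles)

# Two hypothetical disconnections covering the same 20 atoms:
balanced = disconnection_cost(["C" * 10, "C" * 10])
lopsided = disconnection_cost(["C" * 18, "C" * 2])
assert balanced < lopsided  # the balanced split scores better
```

In a route search, this cost can serve as a cheap priority for choosing which precursor sets to expand first.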

15.2.2 Template-free
While the dominant paradigm of retrosynthetic enumeration is based on templates, template-free alternatives are attractive for several reasons. First, calculating subgraph isomorphism is computationally expensive, especially for large libraries. Second, the chosen degree of template generality/specificity can lead to either low-quality or incomplete recommendations. Third, template-based methods cannot propose fundamentally novel disconnections. Retrosynthesis requires the prediction of reactant molecules. Such structured predictions, where the output is not natively numerical, can require specialized neural network architectures, but molecule prediction can be recast as sequence prediction via SMILES representations, and generating sequences with recurrent networks is commonplace. Liu et al. describe a sequence-to-sequence (seq2seq; Fig. 15.1) model that converts a product SMILES into reactant SMILES [7]. Although no significant accuracy benefit over template library application was reported (57.0% versus 59.1% top-5 accuracy, measured as the ability to rank the true, recorded reactants among the top suggestions), neural translation has been successfully applied to the inverse problem of reaction prediction [18], suggesting that straightforward model improvements might improve its performance.

Figure 15.1 Seq2Seq model.
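Before SMILES strings can be fed to such translation models, they must be tokenized so that multi-character units (bracket atoms, Cl, Br, stereo marks, two-digit ring closures) become single tokens. A minimal regex tokenizer in this spirit (the pattern below is a simplified sketch, not the exact one used in the cited work):

```python
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# stereo marks, %nn ring closures, digits, single letters, and
# bond/branch symbols each become one token.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|@@?|%\d{2}|\d|[A-Za-z]|[#=+\-\\/:~().]"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # A usable tokenizer must cover the whole string with no gaps.
    assert "".join(tokens) == smiles, "untokenizable character in input"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1Br"))
```

The resulting token sequences play the same role as words in machine translation; the seq2seq model then learns a mapping between product-side and reactant-side token sequences.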

15.2.3 Focused template application
Rather than forgo templates entirely, focused template methods select only the relevant templates to apply, mitigating the computational expense of full library application. Segler and Waller employed a neural network to score template relevance based on molecular fingerprints [5]. This focused expansion policy was later used in a Monte Carlo tree search (MCTS) framework for full pathway design, which enables recommendations to be generated with impressive speed [19]. However, this overcomes neither the question of generalization during template extraction nor the need to exclude rare templates. To address these challenges, a retrosynthetic expansion strategy based on molecular similarity was developed [13]. Similar to

a manual search on Reaxys or SciFinder for syntheses of similar compounds, this approach calculates the structural similarity (using ECFP fingerprints and the Tanimoto coefficient) between the target and all known products to recall relevant precedents from a reaction corpus. A highly generalized template is extracted from each precedent, with no attempt to incorporate surrounding context, and applied to the target. For any resulting precursor set, its structural similarity to the precedent reaction's reactants is calculated and multiplied by the product similarity to give an overall score. This overall similarity score quantifies how strongly the precedent supports the proposed reaction and is used to rank suggestions. Similarity-based scoring implicitly accounts for potential functional group conflicts or missing activating groups. Further, recalling individual precedents makes use of all known reactions, not just those corresponding to popular templates. This approach outperforms the template-free seq2seq model (81.2% versus 57.0% top-5 accuracy) and extends to full route planning when applied recursively [7].
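The scoring scheme can be illustrated with plain Python sets standing in for ECFP bit vectors. In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; the bit indices below are invented, and the single-reactant case is shown for simplicity:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def precedent_score(target_fp, precedent_product_fp,
                    precursor_fp, precedent_reactant_fp):
    """Overall score for a proposed disconnection: similarity on the
    product side times similarity on the reactant side, so a precedent
    only supports a proposal when both ends of the recorded reaction
    resemble the proposed one."""
    return (tanimoto(target_fp, precedent_product_fp)
            * tanimoto(precursor_fp, precedent_reactant_fp))

# Toy fingerprints as sets of "on" bit indices (hypothetical):
score = precedent_score({1, 2, 3, 4}, {1, 2, 3, 5},
                        {7, 8, 9}, {7, 8, 10})
print(round(score, 3))  # 0.6 * 0.5 = 0.3
```

Multiplying the two similarities (rather than, say, averaging) ensures that a precedent dissimilar on either side contributes little to the ranking.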

15.3 Approaches to validate selected synthetic route
Advances in data-driven retrosynthetic planning have helped avoid manual curation of template libraries and have improved computational speed. An additional benefit would be increased confidence in template applicability without explicitly encoding reactivity conflicts. Thus, an important goal of such tools is the reduction of false positives, i.e., proposed reactions incorrectly thought to be chemically feasible. To validate retrosynthetic suggestions, one can solve the inverse problem of forward synthesis: given specific experimental parameters (reactants, reagents, catalysts, solvent, concentrations, temperature, time, etc.), predict the reaction outcome. The absence of detailed concentration information in reaction databases necessitates simplifying the inputs to reactants and reagents only, perhaps also including catalysts, solvents, and temperature. Note that such simplified problems are underdetermined, although one can assume that reactions are run under implicitly defined "standard conditions". Another expected limitation is the absence of side-product information: prediction of the full product must be treated as prediction of the major (>50% yield) product.

15.3.1 Classifying reaction feasibility
One approach is to estimate reaction feasibility without explicit enumeration or consideration of side reactions. Segler et al. describe such a neural network model, which classifies reactions as feasible or infeasible based on their fingerprint representations and is trained on true experimental data augmented with synthetic negative data [19].
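One simple way to obtain the synthetic negative data such a classifier needs is to mismatch recorded reactant/product pairs. The sketch below is illustrative only (not the cited authors' actual procedure) and rotates products among reactions to create plausible-looking but false examples:

```python
def synthetic_negatives(reactions):
    """Given (reactants, product) pairs from real data, create false
    examples by pairing each reactant set with the product of the
    next reaction. A feasibility classifier is then trained to label
    the original pairs True and these shuffled pairs False."""
    products = [product for _, product in reactions]
    rotated = products[1:] + products[:1]
    return [(reactants, product)
            for (reactants, _), product in zip(reactions, rotated)]

# Toy positive examples (SMILES strings, illustrative):
positives = [("CCO.CC(=O)O", "CC(=O)OCC"),            # esterification
             ("c1ccccc1", "c1ccccc1Br"),              # bromination
             ("CCBr.[N-]=[N+]=[N-]", "CCN=[N+]=[N-]")]  # azide substitution
negatives = synthetic_negatives(positives)
assert all(neg not in positives for neg in negatives)
```

Because the negatives reuse real products, the classifier must learn reactant-product compatibility rather than merely recognizing unrealistic molecules.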


15.3.2 Predicting mechanistic steps
The Baldi group has approached reaction prediction from a mechanistic perspective [20]. Their Reaction Predictor identifies electron sources and sinks, enumerates possible interactions, and ranks those interactions using graph-based representations of molecules with pseudo-molecular orbital considerations. An openly acknowledged limitation is the need to manually encode mechanistic rules for artificial data generation.

15.3.3 Ranking templates
Applying a library of synthetic templates generates many candidate products; provided the major product is among them, only one template is needed to propose it. Wei et al. established a proof of concept for using machine learning to predict template applicability [21]. Their model, given reactants and reagents, predicts which of 16 rules is most relevant, using simulated data generated by those rules. Segler and Waller extended this approach to experimental data from Reaxys; each reactant fingerprint yields a probability distribution over a library of approximately 8720 algorithmically extracted templates [5].

15.3.4 Ranking products
This model framework combines template-based forward enumeration with machine learning-based candidate ranking. Importantly, the two-step approach overcomes the literature bias toward positive reaction examples. Reactions are represented by atom- and bond-level features of the atoms gaining or losing hydrogen atoms and the pairs of atoms gaining or losing bonds. Features include structural information (e.g., atomic number, aromaticity, degree) as well as rapidly calculable geometric and electronic features (e.g., estimated partial charge, surface area contribution). These feature vectors constitute the inputs to a feedforward neural network, which embeds each edit separately before sum-pooling their latent representations and transforming the combined representation into one aggregated score. Scores of all candidate outcomes are converted to probabilities via a softmax activation.
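The final softmax step maps the aggregated candidate scores to a probability distribution over outcomes; a minimal, numerically stable version (the scores below are made up):

```python
import math

def softmax(scores):
    """Convert raw candidate scores into probabilities. Subtracting
    the maximum score before exponentiating avoids overflow without
    changing the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical aggregated scores for three candidate outcomes:
probs = softmax([3.2, 1.1, -0.4])
assert abs(sum(probs) - 1.0) < 1e-12
assert probs.index(max(probs)) == 0  # top-scoring candidate ranks first
```

Training with a cross-entropy loss on these probabilities pushes the recorded major product toward rank one among the enumerated candidates.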

15.3.5 Generating products
Molecules are represented as attributed graphs with atom- and bond-level features. Atom features are iteratively embedded using a Weisfeiler-Lehman network (WLN), a graph convolutional neural network that incorporates information from neighboring atoms into each atom's representation [22]. A global attention mechanism, a concept originally developed for natural language processing, accounts for the effects of distant atoms. Pairwise reactivity scores are calculated for each atom pair based on

their feature vectors, quantifying the propensity of that atom-atom interaction to change. The model is trained to predict which pairs of atoms belong to the reaction center. The top pairs are used to combinatorially enumerate possible bond changes; structural and valence requirements restrict candidates to chemically valid molecules, each of which is scored by a Weisfeiler-Lehman difference network (WLDN) using differences between the embedded atom representations of products and reactants as the basis for scoring. This fully learned approach achieves significantly higher accuracy than the template-based approach and operates orders of magnitude faster, enabling its application to a larger data set of approximately 400,000 reactions. A human benchmarking study showed that the model performs on par with graduate and postdoctoral synthetic chemists. As described for retrosynthetic prediction, generating molecules can be recast as generating SMILES sequences. Applying translation models to reaction prediction was first demonstrated by Nam and Kim, who trained a sequence-to-sequence model to predict product SMILES strings from reactant SMILES using a combination of real (USPTO) and synthetic (rule-generated) data, tested only on textbook problems [23]. That paradigm was later applied to the larger and more complex USPTO data set, with the important addition of an attention mechanism to account for long-range dependencies [18]. Quantitative performance was nearly identical to that of Jin et al., although 15 times as many model parameters were required, demonstrating that one can ignore prior chemical knowledge and apply a purely empirical language model to forward prediction [24].
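The neighborhood-embedding idea behind the WLN can be illustrated with one unweighted aggregation step. A real WLN interposes learned weight matrices and nonlinearities and repeats for several iterations; the toy graph and feature vectors below are invented for illustration:

```python
def wl_step(features, adjacency):
    """One Weisfeiler-Lehman-style update: each atom's new feature
    vector is its own features plus the sum of its neighbors', so
    after k iterations an atom's representation reflects its
    k-bond neighborhood."""
    updated = {}
    for atom, feat in features.items():
        agg = list(feat)
        for neighbor in adjacency[atom]:
            for i, value in enumerate(features[neighbor]):
                agg[i] += value
        updated[atom] = agg
    return updated

# Toy 3-atom chain 0-1-2 with 2-dimensional atom features:
feats = {0: [1, 0], 1: [0, 1], 2: [1, 1]}
bonds = {0: [1], 1: [0, 2], 2: [1]}
print(wl_step(feats, bonds))  # atom 1 now "sees" both neighbors
```

Pairwise reactivity scores are then computed from these enriched per-atom vectors, which is why even one aggregation step helps distinguish otherwise identical atoms in different environments.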

15.4 Tools developed so far
Chematica, the best-known tool for predicting organic syntheses, supports various types of synthesis searches [25]. The simplest, called Network Travel, displays reactions in the NOC (network of chemistry) which, depending on user preference, lead either to or from the molecule of interest. The small, diamond-shaped reaction nodes carry basic information about each reaction; green and blue nodes denote known molecules, red nodes denote commercially available chemicals, and yellow halos denote regulated and/or toxic substances, as in Fig. 15.2. Each molecule node can be visualized as a 2D structure or as a 3D model on which various analyses (geometry optimizations, Connolly surfaces, etc.) can be performed. Algorithmically, Chematica first calculates optimal, relatively short, M-step pathways to all compounds in the NOC. When it is then queried for a longer, N-step (N > M) pathway to a given target, it only has to search to a depth of N - M, since the M-step endings are already available, which accelerates the searches. While the algorithmic details are quite complicated, a judicious choice of M accelerates the overall search speed by several orders of magnitude.
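The acceleration can be sketched as a depth-limited search from the target over a retrosynthetic graph in which some nodes already carry precomputed route lengths. The graph, compound names, and numbers below are hypothetical:

```python
from collections import deque

def shortest_route(retro_graph, target, endings, depth_limit):
    """Breadth-first search backward from the target, at most
    depth_limit (= N - M) levels deep. `endings` maps compounds to
    the length of a precomputed route (<= M steps) down to
    purchasable materials, so reaching any such node completes a
    full pathway immediately."""
    best = None
    queue = deque([(target, 0)])
    seen = {target}
    while queue:
        node, depth = queue.popleft()
        if node in endings:
            total = depth + endings[node]
            best = total if best is None else min(best, total)
        if depth < depth_limit:
            for precursor in retro_graph.get(node, []):
                if precursor not in seen:
                    seen.add(precursor)
                    queue.append((precursor, depth + 1))
    return best  # length of the shortest full route found, or None

retro = {"target": ["intA", "intB"], "intA": ["intC"]}
endings = {"intB": 1, "intC": 2}
print(shortest_route(retro, "target", endings, depth_limit=2))  # 2
```

The key design choice is amortization: the expensive all-compounds computation of M-step endings is done once, and every subsequent query pays only for the shallow N - M search.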


Figure 15.2 Basic information about the particular reaction in Chematica.

In 1969, Corey and Wipke presented the first computer-aided synthesis design software, OCSS (Organic Chemical Simulation of Synthesis) [26]. It was short-lived, and the project split into two directions: LHASA [11] under Corey's supervision and SECS developed by Wipke [27]. LHASA (Logic and Heuristics Applied to Synthetic Analysis) has been significant, among other reasons, as one of the first retrosynthetic programs with a graphical interface for inputting and displaying chemical structures. Technically, it can be classified as semi-empirical retrosynthetic planning software relying on various types of heuristic transforms. Its major drawbacks have been its limited ability to deal with stereochemistry and its interactive (step-by-step) rather than automated approach to finding full pathways [28]. After LHASA, there were numerous efforts to create other synthesis planning programs. The aforementioned Simulation and Evaluation of Chemical Synthesis (SECS) [27] was developed by Todd Wipke, largely building on the LHASA approach but extending its knowledge base. Although it received substantial backing from a consortium of Swiss and German pharmaceutical companies, it was eventually discontinued for reasons that are not entirely clear. SYNLMA was an effort by P.Y. Johnson's group at the Illinois Institute of Technology [29]. It was significant because it separated the knowledge base from its "reasoning component", based on logical operations to be applied during retrosynthesis. Unfortunately, the program ran into the combinatorial explosion problem, generating excessively large retrosynthetic trees that it could not meaningfully prune. It disappeared from the scene

already in 1989. SYNCHEM [30] and its successors were under development at Stanford/Stony Brook already at the time of LHASA's initial publication, but the program came to light only in 1977. The truly innovative aspect of this approach, especially when modern computing was in its infancy, was that it attempted to construct and explore (with BFS, breadth-first search, like searches) full retrosynthetic trees leading to a few thousand memory-stored commercial products, using on the order of a few hundred expert-coded (but general-level) transforms. Unfortunately, the transforms were often coded only after a human chemist had inspected the target molecule, and there were additional problems with the transforms' applicability to specific molecules. The generated strategies were found to be too "short-term", without sufficient accounting for the paths' histories [28]. As in all previous programs, stereochemistry was a major problem, and regiochemistry was not considered. The last publication on SYNCHEM described efforts to parallelize the code; afterwards, SYNCHEM seemed to join other retrosynthetic programs in the Valhalla of computational chemistry. SYNGEN, developed in the 1970s and 1980s by Jim Hendrickson and his team at Brandeis [31], placed emphasis on identifying, supported by various heuristics, reasonably sized retrosynthetic trees containing the best, highly convergent routes. The synthesis problem was simplified to focus on skeletal construction and to ignore refunctionalization, as that would produce the shortest synthetic routes. Starting materials were identified using the empirical observation that three out of every four bonds in the target come from the starting materials.
This enabled the computer to generate possible bond sets for the synthetic plan, which could be refined by analyzing the synthetic space for paths leading to synthons containing the unchanged bonds. The program ran into issues when functionalization was lost before the full pathway was found, and an additional program, FORWARD, was developed to reintroduce functionalization in the synthetic direction; this effort, however, was never completed. Turning to conceptually different types of approaches [32], no story of computer-aided synthesis would be complete without mentioning the contributions of Ivar Ugi, who introduced logic-oriented synthetic planning relying on fundamental chemical knowledge to evaluate the feasibility not only of known reactions but also of potentially novel ones. In the 1980s and 1990s, Ugi and co-workers developed programs such as IGOR and IGOR2 [33], in which molecules are represented as bond-electron (BE) matrices and reactions as R-matrices obtained by subtracting the substrate matrix from the product matrix. The interactive (i.e., one reaction at a time) analysis of potential reactions relied on rearrangements of the valence electrons stored in the matrices, and the selection of feasible versus nonsensical reactions was further guided by calculated quantities such as reaction enthalpies. IGOR has been reported to identify several novel pericyclic reactions

and a novel rearrangement of aminoalkylboranes to the corresponding dialkylamino-monoalkylboranes, which were later experimentally verified [34]. On the other hand, the program was never really used in multistep synthetic planning, perhaps because operations on BE- and R-matrices are computationally costly and only limited numbers of reactions could be explored. Nowadays the acronym IGOR2 is used in algorithm design for Cabell-based software for "inductive synthesis of functional programs," not molecules. Another notable effort from the 1990s is the WODCA program developed by Johann Gasteiger's group [35]. Like IGOR, this approach breaks away from the dogma of synthon-based retrosynthetic planning and functional-group approaches. Instead, it focuses on the fundamental properties of bonds (e.g., polarity, inductive effects, resonance, polarizability) to suggest which bonds are suitable for retrosynthetic disconnection. Another difference is that it allows bi-directional analysis, in which common substrates stored in the computer's memory can be matched onto the target to suggest routes leading to it. Since the program relies on matrix notation, its analyses of molecules are necessarily slower than with alphanumeric representations such as SMILES. This, however, does not defeat WODCA's objective, as it is not per se a tool to automatically design syntheses but rather one to assist the chemist in synthetic planning. The CHIRON program [36], developed by Stephen Hanessian at the University of Montreal, also uses the idea of mapping available substrates onto user-specified targets, directing the user toward syntheses that maximize the overlap. The distinctive feature of this approach is that it handles stereochemistry during mapping.
On the other hand, the program does not search for complete retrosynthetic trees and can therefore be classified as an interactive tool whose purpose, like WODCA's, is to aid a human chemist in synthetic planning [37]. Finally, the idea of using similarity underlies the ARChem Route Designer developed by SimBioSys [38]. This approach departs drastically from the concept of expert-coded reactions and instead relies on reaction transformations or "rules" (close to 100,000) machine-extracted from literature examples (though it also supplies a set of around 50 hand-generated rules). The program explores relatively short reaction trees exhaustively but does not account for stereochemistry or regiochemistry. In a similar genre, InfoChem's ICSYNTH relies on reaction cores extracted from various databases, which are then used to construct synthetic suggestion trees under user control [39]. While the suggestions are based on "analogous" reactions performed on different compounds, they can complement the intuition of a practicing chemist, serving as a synthetic "idea generator" [40]. Unlike many of the programs described earlier in this section, both ARChem and ICSYNTH are commercially available.
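Ugi's bond-electron formalism described above is concrete enough to sketch directly. Here a toy heterolytic exchange (bond 0-1 breaks, bond 1-2 forms) is encoded as BE-matrices, with the R-matrix as their difference; the matrices are invented for illustration:

```python
def r_matrix(be_substrate, be_product):
    """Ugi R-matrix: product BE-matrix minus substrate BE-matrix.
    Off-diagonal entries are bond orders, diagonal entries count
    free (unshared) valence electrons; a chemically sensible
    R-matrix only redistributes electrons, so its entries sum to zero."""
    n = len(be_substrate)
    return [[be_product[i][j] - be_substrate[i][j] for j in range(n)]
            for i in range(n)]

# Toy system A-B + :C  ->  :A + B-C over three centers 0, 1, 2:
B_sub = [[0, 1, 0],   # bond 0-1 present
         [1, 0, 0],
         [0, 0, 2]]   # atom 2 carries a free electron pair
B_prod = [[2, 0, 0],  # atom 0 now carries the pair
          [0, 0, 1],  # bond 1-2 formed
          [0, 1, 0]]

R = r_matrix(B_sub, B_prod)
assert sum(sum(row) for row in R) == 0  # electrons conserved
print(R)
```

Checking such invariants is exactly how the formalism filters nonsensical electron rearrangements before any energetic evaluation.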


15.5 Conclusion
There are clear opportunities for the application of machine learning and artificial intelligence techniques to organic synthesis and synthesis planning, including prediction of regio- and enantioselectivity, rational catalyst design, and experimental prioritization through active learning. While few research groups currently work in this area, rapid growth is expected in the coming years, particularly as the practical challenges of data availability and standardization are addressed. The potential payoff of computer-aided synthesis planning is higher than ever. Combined with advances in automated experimentation, sophisticated planning software could one day enable fully autonomous synthesis: a true realization of the "robo-chemist". Further integration with de novo molecular design and online biological assays would revolutionize small-molecule drug discovery.

Exercise: There are currently no freeware tools or webservers available for predicting synthetic schemes for target compounds. Researchers can use Reaxys or SciFinder to search manually for synthetic schemes. A key tool that should be mentioned in this chapter is a robotic platform, informed by AI planning, being developed at MIT for flow synthesis of organic compounds [41].

References
[1] R. Robinson, LXIII.—a synthesis of tropinone, J. Chem. Soc., Trans. 111 (1917) 762-768.
[2] E.J. Corey, The logic of chemical synthesis: multistep synthesis of complex carbogenic molecules (Nobel Lecture), Angew. Chem. Int. Ed. Engl. 30 (1991) 455-465.
[3] J. Gasteiger, W.D. Ihlenfeldt, R. Fick, J.R. Rose, Similarity concepts for the planning of organic reactions and syntheses, J. Chem. Inf. Comput. Sci. 32 (1992) 700-712.
[4] E.J. Corey, A.K. Long, S.D. Rubenstein, Computer-assisted analysis in organic synthesis, Science 228 (1985) 408-418.
[5] M.H. Segler, M.P. Waller, Neural-symbolic machine learning for retrosynthesis and reaction prediction, Chem. Eur. J. 23 (2017) 5966-5971.
[6] S. Szymkuć, E.P. Gajewska, T. Klucznik, K. Molga, P. Dittwald, M. Startek, et al., Computer-assisted synthetic planning: the end of the beginning, Angew. Chem. Int. Ed. 55 (2016) 5904-5937.
[7] B. Liu, B. Ramsundar, P. Kawthekar, J. Shi, J. Gomes, Q. Luu Nguyen, et al., Retrosynthetic reaction prediction using neural sequence-to-sequence models, ACS Cent. Sci. 3 (2017) 1103-1113.
[8] W.L. Jorgensen, E.R. Laird, A.J. Gushurst, J.M. Fleischer, S.A. Gothe, H.E. Helson, et al., CAMEO: a program for the logical prediction of the products of organic reactions, Pure Appl. Chem. 62 (1990) 1921-1932.
[9] M.H. Segler, M.P. Waller, Modelling chemical reasoning to predict and invent reactions, Chem. Eur. J. 23 (2017) 6118-6128.
[10] E.J. Corey, General methods for the construction of complex molecules, Pure Appl. Chem. 14 (1967) 19-38.
[11] E. Corey, W.T. Wipke, R.D. Cramer III, W.J. Howe, Computer-assisted synthetic analysis. Facile man-machine communication of chemical structure by interactive computer graphics, J. Am. Chem. Soc. 94 (1972) 421-430.

[12] C.D. Christ, M. Zentgraf, J.M. Kriegl, Mining electronic laboratory notebooks: analysis, retrosynthesis, and reaction based enumeration, J. Chem. Inf. Model. 52 (2012) 1745–1756.
[13] C.W. Coley, L. Rogers, W.H. Green, K.F. Jensen, Computer-assisted retrosynthesis based on molecular similarity, ACS Cent. Sci. 3 (2017) 1237–1245.
[14] C.W. Coley, W.H. Green, K.F. Jensen, RDChiral: an RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application, J. Chem. Inf. Model. (2019).
[15] N.M. O'Boyle, G.R. Hutchison, Cinfony: combining open source cheminformatics toolkits behind a common interface, Chem. Cent. J. 2 (2008) 24.
[16] R.P. Sheridan, N. Zorn, E.C. Sherer, L.-C. Campeau, C. Chang, J. Cumming, et al., Modeling a crowdsourced definition of molecular complexity, J. Chem. Inf. Model. 54 (2014) 1604–1616.
[17] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988) 31–36.
[18] P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, T. Laino, "Found in Translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models, Chem. Sci. 9 (2018) 6091–6098.
[19] M.H. Segler, M. Preuss, M.P. Waller, Planning chemical syntheses with deep neural networks and symbolic AI, Nature 555 (2018) 604.
[20] D. Fooshee, A. Mood, E. Gutman, M. Tavakoli, G. Urban, F. Liu, et al., Deep learning for chemical reaction prediction, Mol. Syst. Des. Eng. 3 (2018) 442–452.
[21] J.N. Wei, D. Duvenaud, A. Aspuru-Guzik, Neural networks for the prediction of organic chemistry reactions, ACS Cent. Sci. 2 (2016) 725–732.
[22] T. Lei, W. Jin, R. Barzilay, T. Jaakkola, Deriving neural architectures from sequence and graph kernels, in: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 2024–2033.
[23] J. Nam, J. Kim, Linking the neural machine translation and the prediction of organic chemistry reactions, arXiv preprint (2016) arXiv:1612.09529.
[24] W. Jin, C. Coley, R. Barzilay, T. Jaakkola, Predicting organic reaction outcomes with Weisfeiler-Lehman network, Adv. Neural Inf. Process. Syst. (2017) 2607–2616.
[25] M. Peplow, Organic synthesis: the robo-chemist, Nature 512 (2014) 20.
[26] E. Corey, W.T. Wipke, Computer-assisted design of complex organic syntheses, Science 166 (1969) 178–192.
[27] W.T. Wipke, G.I. Ouchi, S. Krishnan, Simulation and evaluation of chemical synthesis—SECS: an application of artificial intelligence techniques, Artif. Intell. 11 (1978) 173–193.
[28] M.H. Todd, Computer-aided organic synthesis, Chem. Soc. Rev. 34 (2005) 247–266.
[29] P. Johnson, I. Bernstein, J. Crary, M. Evans, T. Wang, H.P. BA Holme, Designing an expert system for organic synthesis, in: Expert Systems Application in Chemistry, ACS Symposium Series, American Chemical Society, Washington, 1989.
[30] D. Krebsbach, H. Gelernter, S.M. Sieburth, Distributed heuristic synthesis search, J. Chem. Inf. Comput. Sci. 38 (1998) 595–604.
[31] J.B. Hendrickson, A.G. Toczko, SYNGEN program for synthesis design: basic computing techniques, J. Chem. Inf. Comput. Sci. 29 (1989) 137–145.
[32] J.H. Chen, P. Baldi, No electron left behind: a rule-based expert system to predict chemical reactions and reaction mechanisms, J. Chem. Inf. Model. 49 (2009) 2034–2043.
[33] J. Bauer, R. Herges, E. Fontain, I. Ugi, IGOR and computer assisted innovation in chemistry, Chimia 39 (1985) 43–53.
[34] S. Warren, Organic Synthesis: The Disconnection Approach, John Wiley & Sons, 2007.
[35] R. Höllering, J. Gasteiger, L. Steinhauer, K.-P. Schulz, A. Herwig, Simulation of organic reactions: from the degradation of chemicals to combinatorial synthesis, J. Chem. Inf. Comput. Sci. 40 (2000) 482–494.
[36] S. Hanessian, J. Franco, B. Larouche, The psychobiological basis of heuristic synthesis planning: man, machine and the chiron approach, Pure Appl. Chem. 62 (1990) 1887–1910.

[37] S. Hanessian, Man, machine and visual imagery in strategic synthesis planning: computer-perceived precursors for drug candidates, Curr. Opin. Drug Discov. Dev. 8 (2005) 798–819.
[38] O. Ravitz, Data-driven computer aided synthesis design, Drug Discov. Today: Technol. 10 (2013) e443–e449.
[39] H. Kraut, J. Eiblmaier, G. Grethe, P. Löw, H. Matuszczyk, H. Saller, Algorithm for reaction classification, J. Chem. Inf. Model. 53 (2013) 2884–2895.
[40] A. Bøgevig, H.-J. Federsel, F. Huerta, M.G. Hutchings, H. Kraut, T. Langer, et al., Route design in the 21st century: the ICSYNTH software tool as an idea generator for synthesis prediction, Org. Process Res. Dev. 19 (2015) 357–368.
[41] C.W. Coley, D.A. Thomas, J.A. Lummiss, J.N. Jaworski, C.P. Breen, V. Schultz, et al., A robotic platform for flow synthesis of organic compounds informed by AI planning, Science 365 (2019) eaax1566.


Statistical methods and parameters: Tools to generate and evaluate theoretical in silico models

16.1 Introduction

Statistical analyses are the core component of analysing the in vivo, in vitro, or in silico data obtained from the biological responses involved in molecular recognition between biomolecules and small molecules [1]. Fundamentally, these are mathematical procedures that help to build theoretical in silico models based on the correlation between activity and structural properties. Two main kinds of statistical methods are explored in in silico modeling: data analysis methods and regression methods. Data analysis methods are used to recombine data into different forms and to group observations into hierarchies [2]. Regression methods are used to correlate biological activities, as dependent variables, with physicochemical properties or structural features, as independent variables, in the form of an equation or model [3]. The model can then be used to predict activities for new molecules, perhaps prioritizing or screening a large group of molecules whose activities are not known [4]. A model's ability to provide insight into the system is as important as its predictive ability: possibly more valuable than being able to predict an activity or property is knowing that it increases when a particular descriptor increases. Since in silico modeling is a theoretical technique, generated models require rigorous statistical treatment before their practical application in the 'wet lab'. Finally, validation methods are needed to establish the predictiveness of a model on unseen data and to help determine the complexity of equation that the amount of data justifies [5].
This chapter provides information about various statistical analysis methods, such as principal components analysis (PCA), cluster analysis, simple linear regression, multiple linear regression, stepwise multiple linear regression, principal components regression (PCR), partial least squares (PLS), genetic function approximation (GFA), and genetic partial least squares (G/PLS), together with associated statistical measures, such as



the various cross-validation correlation coefficients and tests of significance that are used in the validation and evaluation of the theoretical models employed in in silico strategies.

16.2 Data analysis methods

16.2.1 Principal components analysis (PCA)

Principal components analysis (PCA) is one of the most popular data-reduction techniques. It aims at representing large amounts of multidimensional data by transforming them into a more intuitive low-dimensional representation. This transformation suppresses the dimensions deemed to contribute an insignificant percentage of the total variance present in the data [6]. Suppose the data are represented by a set of p variables X1, . . ., Xp. Principal component analysis transforms this set of variables into a (preferably much smaller) set Z1, . . ., Zk of linear combinations of the original variables Xi which accounts for most of the variance of the original set. The new variables Zj are referred to as principal components and are usually presented in order of decreasing contribution to the total variance. Typically, the variables are known only by sampling their values through measurements performed on a collection of n objects. Let us denote the result of measuring the value of the jth variable on the ith object by Xij. All measurements for all variables can thus be written in the form of an n × p matrix, which is referred to as the property matrix. A variant of PCA in frequent use calculates the principal components from the variables Xi after they have been normalized to unit variance [7].
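For the special case of two descriptors, the transformation can be sketched in plain Python via a closed-form eigendecomposition of the 2 × 2 covariance matrix (the data values here are hypothetical; for real descriptor matrices a linear-algebra library would normally be used):

```python
import math

def pca_2d(data):
    """PCA for two correlated descriptors via closed-form
    eigendecomposition of the 2x2 covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample covariance matrix entries (n - 1 denominator)
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are the PC variances
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lam1, lam2 = tr / 2 + root, tr / 2 - root   # lam1 >= lam2
    return lam1, lam2, tr

# Two strongly correlated hypothetical descriptors
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
lam1, lam2, total = pca_2d(data)
print(lam1 / total)  # fraction of the total variance captured by PC1
```

Because the two columns are nearly collinear, almost all of the variance loads onto the first principal component, which is exactly the dimension-suppression idea described above.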

16.2.2 Cluster analysis

The goal of cluster analysis is to partition a dataset (typically representing a set of molecules in a molecular descriptor property space) into classes or categories consisting of elements of comparable similarity. While "similarity" is usually defined precisely, the notion of "comparable" cannot be defined completely, since determining what constitutes "comparable" is in fact part of the answer cluster analysis seeks to provide. It is thus advisable to have several clustering algorithms available, examining the data from different points of view [8].
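One of the simplest such partitioning algorithms, k-means, can be sketched as follows (the two-descriptor points are hypothetical, and k-means is used purely as an illustration; the text does not prescribe a particular clustering algorithm):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on points in descriptor space (lists of floats).
    Returns the final centroids and the cluster label of each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids, labels

# Two well-separated groups of hypothetical descriptor vectors
pts = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
       [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
cents, labs = kmeans(pts, 2)
```

Running several such algorithms (hierarchical, k-means, density-based) on the same descriptor space gives the different "points of view" the text recommends.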

16.3 Regression methods

Regression analysis is an important tool in theoretical model building for correlating activity with the property descriptors or variables that are statistically responsible for the variation in activity. There are two main goals of regression methods: prediction and experimental design. It is useful to have a model that is predictive (even if imperfect) because it can be



used for screening a large set of existing or proposed molecules for promising candidates. A regression model might be even more useful if it suggests a previously unrecognized correlation between some property (or combination of properties) and activity. This is especially true if we know how to adjust that property by changing some substituent, which can lead to new experiments designed to increase understanding of the system under study [9,10]. There is no single method that works best for all problems and offers the perfect balance amongst predictive ability, interpretability, and computational efficiency. Some examples of these trade-offs are:

• Simple and multiple linear regression are very quick and easy to interpret, but do not work well when the number of independent variables is larger than (or even comparable to) the number of molecules.
• Stepwise multiple linear regression and GFA work with any number of variables but do not perform well if important information is spread over more of them than can be included in the model. This often occurs in 3D QSAR.
• Partial least squares can handle any number of independent variables, but creates only linear relationships.
• Genetic partial least squares offers automatic creation of nonlinear terms.

16.3.1 Simple regression analysis

Simple linear regression analysis defines the functional relationship between two variables X and Y by the best-fitting straight line, as displayed in Fig. 16.1A. In this method a linear one-term equation (Eq. 16.1) is produced separately for each independent variable [11]. This is


Figure 16.1 Graph of regression analysis: simple (A), plotting the dependent variable Y (biological response) against an independent variable X (a physicochemical property); multiple linear (B), plotting predicted activity (Y) against observed activity (X).

useful for discovering some of the most important descriptors. The line of regression of Y on X is:

Y = aX + b (Eq. 16.1)

where a is the regression (slope) coefficient and b is the intercept.
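The least-squares fit of Eq. 16.1 can be sketched in a few lines of Python (the logP/activity values below are hypothetical illustration data, not from the text):

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = a*x + b for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx            # slope (regression coefficient)
    b = my - a * mx          # intercept
    return a, b

# Hypothetical activity data: log(1/C) against logP for five analogues
logP = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [3.1, 4.0, 5.2, 5.9, 7.0]
a, b = simple_linear_regression(logP, activity)
```

Fitting one such equation per descriptor, and ranking them by correlation, is the "discovering the most important descriptors" step described above.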

16.3.2 Multiple linear regression

This method generates a multi-parameter equation (Eq. 16.2) by performing standard multivariable regression calculations. It requires at least as many molecules as independent variables [12]; to produce reliable results, we typically need five times as many molecules as independent variables. The quality of the regression can be initially assessed by correlating the observed activity (from which the equation has been derived) with the predicted activity (computed from the generated equation) using the best-fitting straight line, as displayed in Fig. 16.1B.

Y = aX1 + bX2 + cX3 + . . . + d (Eq. 16.2)
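Solving Eq. 16.2 amounts to solving the least-squares normal equations. A self-contained sketch with a hypothetical two-descriptor data set follows (no matrix library is assumed; a small Gaussian-elimination solver stands in for one):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small systems."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

def mlr(X, y):
    """Least-squares coefficients for y = a*x1 + b*x2 + ... + intercept,
    from the normal equations (X^T X) c = X^T y."""
    Xa = [row + [1.0] for row in X]          # append intercept column
    p = len(Xa[0])
    XtX = [[sum(r[i] * r[j] for r in Xa) for j in range(p)]
           for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(p)]
    return solve(XtX, Xty)

# Hypothetical two-descriptor data set (e.g., logP and molar refractivity)
X = [[1.0, 0.5], [2.0, 1.0], [3.0, 0.0],
     [4.0, 1.5], [5.0, 2.0], [6.0, 0.5]]
y = [2.0 * x1 + 1.0 * x2 + 0.5 for x1, x2 in X]  # exact relation, for the demo
coeffs = mlr(X, y)
```

Because the demo responses are built from an exact linear relation, the recovered coefficients match it; with real noisy activities the fit would only approximate them.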


16.3.3 Stepwise multiple linear regression

In this method, a multiple-term linear equation is produced in stepwise fashion. An initial best possible equation is formed first; a new regression is then performed by deletion or addition of a variable (depending upon the mode of derivation), on the basis of a test of significance that the new equation must pass [13]. A stepwise multiple regression model can be derived by two types of methods:

Backward elimination: starting with all possible variables and, in each further step, eliminating the least significant variable.

Forward selection: starting with the best single variable and adding further significant variables according to their contribution to the model.

16.3.4 Principal components regression (PCR)

A multiple-term linear equation is created based on a principal-components transformation of the independent variables. The components are chosen so that they retain the largest amount of variance of the independent variables even if some of the components are discarded [14]. The first component is the direction of greatest variance in the independent variables. The next component is the direction of greatest variance that is orthogonal to all preceding components. Some of the last components are discarded to reduce the



size of the model and avoid over-fitting. Normally the number of components kept is determined by cross-validation: components are added, in order, as long as they improve the cross-validated r2. Variables that are co-linear are effectively replaced by a single component. In effect, this method titrates the size of the model to the amount of data available. However, it does not work well if some of the variables contain a lot of variance but do not correlate with activity (e.g., fingerprint-like descriptors). Such variables are given a high loading in the components, pushing out others that are more relevant to activity. This means that, unless the independent variables are pre-screened for relevance, PLS should probably be considered instead.

16.3.5 Partial least squares regression analysis

Partial least squares (PLS) regression analysis is the most promising multivariate regression method. Many, even hundreds or thousands of, independent variables (the X-block) can be correlated with one or several dependent variables (the Y-block) without the problems of over-fitting and noise. The PLS method is therefore the most suitable for 3D QSAR modeling, where the number of independent variables (X-variables, the field values at grid points) is greater than the number of compounds in the data set [15]. The method can simultaneously correlate a large number of independent variables with multiple dependent variables; in drug design approaches, usually one dependent variable, the biological response, is to be correlated with one or more physicochemical parameters. PLS is a principal-component-like method and, as described in Fig. 16.2, finds new latent variables, also called X-scores, denoted ta (a = 1, 2, 3, . . ., A). These scores are linear combinations of the original variables Xik with weight coefficients wka:

ta = Σk wka Xik

where a = index of components, i = index of objects, and k = index of independent variables.


As in regression analysis, the correlation coefficient r increases with the number of extracted vectors in a PLS analysis. Depending on the number of components, perfect correlations are often obtained in PLS analyses, owing to the large number of X variables. Correspondingly, the goodness of fit (high values of r2 and s) is no criterion for the validity of a PLS model. The significance of additional PLS vectors is determined by cross-validation [16]. Most commonly, the leave-one-out cross-validation displayed in Fig. 16.3 is followed: one object (i.e., one biological activity value) is eliminated from the training set and a PLS model is derived from the remaining compounds. This model is then used to predict the biological activity value of the compound that was not included in the model. The same procedure is repeated, after elimination of another object, until all objects have been eliminated once.


Figure 16.2 Partial least squares regression analysis.



Figure 16.3 Leave-one-out cross-validation in PLS analysis.

The sum of the squared differences between these 'outside predictions' and the observed y values, PRESS, is a measure of the internal predictivity of the PLS model:

PRESS = Σ (Yobs − Ypred)²

For larger data sets, an alternative to the leave-one-out technique, leave-group-out, is recommended to yield more stable PLS models; in it, several objects are

randomly eliminated from the data set at a time, or eliminated in a systematic manner, and the excluded objects are predicted by the corresponding model, similar to the leave-one-out protocol.

Advantages of PLS analysis

• It can be performed, like multiple linear regression, without the problem of mutual correlation amongst the independent parameters; therefore the correlation matrix, which is used to show the mutual correlation among the variables present in the model, need not be prepared.
• The number of independent variables can exceed the number of observations (rows) without problems of over-fitting.
• It gives a more robust QSAR equation than simple multiple linear regression.

16.3.6 Genetic function approximation (GFA)

In this method, models are collected that each contain a randomly chosen subset of the independent variables, and then the collection of models is "evolved". A generation is the set of models resulting from performing multiple linear regression on each model; a selection of the best ones becomes the next generation. Crossover operations are performed on these, taking some variables from each of two models to produce an offspring. In addition, the best model from the previous generation is retained. Besides linear terms there can also be spline, quadratic, and quadratic-spline terms; these are added or deleted by mutation operations [17]. A major advantage of this approach is that a collection of diverse small models is generated that all have roughly the same high predictability, each of which might provide a different insight into the system. Loading spread does not occur because at most one of a set of co-linear variables is retained in each model. This can make interpretation much easier than with PLS. A disadvantage is that it takes too long to perform cross-validation on each generation and, thus, one needs a reasonable idea of how many terms to keep before starting. Another disadvantage, compared to PLS, is that if the information in the system is highly diffuse, more terms may need to be retained in each model than the number of molecules can support. This happens sometimes with 3D QSAR data.
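The evolutionary loop described above can be sketched, in deliberately simplified form, in plain Python: each chromosome is a fixed-size subset of descriptor indices, fitness is the MLR r2 of that subset, and crossover, mutation, and elitism proceed as in the text. All names and the synthetic data set are hypothetical, and a real GFA implementation also scores models by a lack-of-fit measure and evolves spline terms, which are omitted here:

```python
import random

def fit_r2(X, y, cols):
    """r2 of an MLR model restricted to the descriptor subset `cols`
    (normal equations solved by Gaussian elimination)."""
    Xa = [[row[c] for c in cols] + [1.0] for row in X]
    p = len(Xa[0])
    M = [[sum(r[i] * r[j] for r in Xa) for j in range(p)] +
         [sum(r[i] * yi for r, yi in zip(Xa, y))] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        if abs(M[col][col]) < 1e-12:
            return 0.0                         # degenerate subset
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    c = [0.0] * p
    for i in range(p - 1, -1, -1):
        c[i] = (M[i][p] - sum(M[i][j] * c[j]
                              for j in range(i + 1, p))) / M[i][i]
    pred = [sum(ci * xi for ci, xi in zip(c, row)) for row in Xa]
    ym = sum(y) / len(y)
    syy = sum((yi - ym) ** 2 for yi in y)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    return 1.0 - sse / syy

def gfa(X, y, n_terms=2, pop=20, gens=30, seed=1):
    """Toy GFA: evolve fixed-size descriptor subsets by selection,
    crossover, and mutation, scoring each candidate by its MLR r2."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    popn = [tuple(sorted(rng.sample(range(n_desc), n_terms)))
            for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(popn, key=lambda s: fit_r2(X, y, s), reverse=True)
        parents = scored[: pop // 2]
        children = [scored[0]]                 # elitism: keep the best model
        while len(children) < pop:
            a, b = rng.sample(parents, 2)
            child = set(rng.sample(sorted(set(a) | set(b)), n_terms))
            if rng.random() < 0.2:             # mutation: swap one descriptor
                child.pop()
                child.add(rng.randrange(n_desc))
            if len(child) == n_terms:
                children.append(tuple(sorted(child)))
        popn = children
    return max(popn, key=lambda s: fit_r2(X, y, s))

# Synthetic data: only descriptors 0 and 3 (of five) carry signal
rng = random.Random(7)
X = [[rng.uniform(0, 1) for _ in range(5)] for _ in range(30)]
y = [3.0 * row[0] - 2.0 * row[3] + 1.0 for row in X]
best = gfa(X, y, n_terms=2)
```

With the informative pair recovered, the surviving population contains a small family of compact models, which is the main interpretive payoff of GFA noted above.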

16.3.7 Genetic partial least squares (G/PLS)

This method combines the best features of GFA and PLS. PLS, instead of multiple linear regression, is applied to each generation, so each model can have more terms in it without danger of over-fitting. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables [18].



16.4 Evaluation of in silico models

Theoretical models generated by in silico modeling are mainly applied for two purposes: first, for screening molecules from large databases to identify active compounds; second, to predict the activity of the identified compounds. A generated model should therefore be rigorously evaluated on these two fronts before being applied to these tasks. A number of statistical measures have been reported to determine the quality of regression models and their predictive ability. Additionally, some special methods have been developed that evaluate the generated models in a variety of contexts, such as external predictive ability, chance correlation, and efficiency in screening large databases to identify active hits. Details of these parameters and evaluation tests are described in the next sections.

16.4.1 Internal validation

Least squares fit: The most common internal method of validating the model is least-squares fitting [19]. This method of validation is similar to linear regression and uses the r2 (squared correlation coefficient) for the comparison between the predicted and experimental activities. An improved method of determining r2 is the robust straight-line fit, in which data points distant from the central data points (essentially, data points that are a specified standard deviation away from the model) are given less weight when calculating r2. The standard deviation s is an absolute measure of the quality of the fit. Its value considers the number of objects n and the number of variables k; thus s depends not only on the goodness of fit but also on the number of degrees of freedom, DF = n − k − 1. The larger the number of objects and the smaller the number of variables, the smaller the standard deviation s will be for a given value of ΣΔ²:

s² = ΣΔ²/(n − k − 1) = (1 − r²)·SYY/(n − k − 1)

An alternative to this method is the removal of outliers (compounds from the training set) from the dataset in an attempt to optimize the QSAR model; this is only valid if strict statistical rules are followed. A difference between the r2 and r2adj values of less than 0.3 indicates that the number of descriptors involved in the QSAR model is acceptable; the number of descriptors is not acceptable if the difference is more than 0.3.

Fit of the model: The fit of QSAR models can be determined by the chi-squared (χ2) and root-mean-squared error (RMSE) methods. These methods are used to decide whether the model possesses the predictive quality reflected in the r2. The RMSE shows the error between the mean of the experimental values and the predicted activities [20]. The chi-squared value exhibits the difference between the experimental and predicted bioactivities.

Large chi-squared or RMSE values (≥0.5 and ≥1.0, respectively) reflect the model's poor ability to accurately predict the bioactivities, even if the model has a large r2 value (≥0.7). For a good predictive model the chi-squared and RMSE values should be low (<0.5 and <0.3, respectively). These methods of error checking can also be used to aid in creating models, and are especially useful in creating and validating models for nonlinear data sets, such as those created with an artificial neural network (ANN) [21]. However, excellent values of r2, χ2, and RMSE are not sufficient indicators of model validity; alternative parameters must be provided to indicate the predictive ability of models. In principle, two reasonable approaches to validation can be envisaged: one based on prediction, and the other based on the fit of the predictor variables to rearranged response variables.

Cross-validation: A common method for internally validating a QSAR model is cross-validation (CV, q2, or jack-knifing) [19]. The cross-validated q2 is a squared correlation coefficient generated during a cross-validation procedure, using the equation

q² = (SD − PRESS)/SD

where SD is the sum of squared deviations of the observed activities from their mean. A cross-validated r2 or q2 is usually smaller than the overall r2 for a QSAR equation, and is used as a diagnostic tool to evaluate the predictive power of an equation. The CV process repeats the regression many times on subsets of the data. Usually each molecule is left out once (and only once), in turn, and r2 is computed using the predicted values of the missing molecules; sometimes more than one molecule is left out at a time (leave-many-out, LMO). CV is often used to determine how large a model can be used for a given data set, to measure a model's predictive ability, and to draw attention to the possibility that a model has been over-fitted.
Over-fitting refers to the phenomenon in which a predictive model may describe the relationship between predictors and response well, but subsequently fails to provide valid predictions for new compounds. Over-fitting of the model is usually suspected when the r2 value from the original model is significantly larger (by 25%) than the q2 value (the difference between r2 and q2 should not be more than 0.3) [22]. CV values are considered more characteristic of the predictive ability of the model; thus, CV is considered a measure of goodness of prediction, not of fit as in the case of r2. The process of CV begins with the removal of one compound, or a group of compounds, which becomes a temporary test set, from the training set. A CV model is created from the remaining data points using the descriptors from the original model, and tested on the removed molecules for its ability to correctly predict the bioactivities.



In the leave-one-out (LOO) method of CV, the process of removing a molecule and creating and validating the model against the individual molecules is performed for the entire training set. Once complete, the mean of all the q2 values is taken and reported. The data utilized in obtaining q2 is an augmented training set of the compounds (data points) used to determine r2. The method of removing one molecule from the training set is considered to be an inconsistent method [5]. A more correct method is leave-many-out (LMO), where a group of compounds is selected for validation of the CV model. This method of cross-validation is especially useful if the training set used to create the model is small (≤20 compounds) or if there is no test set. For good predictability, the r2 − q2 value should not exceed 0.3. To validate a QSAR model, most researchers apply the LOO or LMO CV procedures. The outcome of this procedure is a cross-validated correlation coefficient q2. Frequently, q2 is used as a criterion of both the robustness and the predictive ability of the model. Many authors consider a high q2 (for instance, q2 > 0.5) as an indicator, or even as ultimate proof, of the high predictive power of a QSAR model, and do not test their models for the ability to predict the activity of compounds of an external test set (i.e., compounds that have not been used in the QSAR model development). There are several examples of recent publications in which the authors claim that their models have high predictive ability without validating them by use of an external test set [23]. Some authors validate their models by the use of only one or two compounds that were not used in QSAR model development [24] and still claim that their models are highly predictive. However, it has been found that even when a test set with known values of biological activities is available for prediction, no correlation may exist between the predicted and observed activities for the test set [25].
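The LOO procedure and the q2 = (SD − PRESS)/SD statistic can be sketched for a one-descriptor linear model (the data values are hypothetical; the same loop applies to MLR or PLS models):

```python
def loo_q2(x, y):
    """Leave-one-out cross-validation for a one-descriptor linear model:
    q2 = (SD - PRESS)/SD, SD = sum of squared deviations from the mean."""
    n = len(x)
    press = 0.0
    for i in range(n):
        # Training set with object i removed
        xt = x[:i] + x[i + 1:]
        yt = y[:i] + y[i + 1:]
        mx, my = sum(xt) / (n - 1), sum(yt) / (n - 1)
        a = sum((xi - mx) * (yi - my) for xi, yi in zip(xt, yt)) / \
            sum((xi - mx) ** 2 for xi in xt)
        b = my - a * mx
        press += (y[i] - (a * x[i] + b)) ** 2  # 'outside' prediction error
    ym = sum(y) / n
    sd = sum((yi - ym) ** 2 for yi in y)
    return (sd - press) / sd

# Hypothetical logP/activity training set
logP = [0.5, 1.1, 1.9, 2.4, 3.2, 3.9, 4.6, 5.3]
activity = [2.9, 3.6, 4.3, 4.8, 5.9, 6.4, 7.2, 7.8]
q2 = loo_q2(logP, activity)
```

For this well-behaved data set q2 stays close to (but below) the overall r2, illustrating why a large r2 − q2 gap is the over-fitting warning sign mentioned above.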
Bootstrapping: Bootstrapping is another method used to determine the robustness of the model, by selecting N random subsets from an original dataset containing N compounds. In bootstrapping, for each run some objects are not considered in the PLS analysis, while some others are repeated more than once. Confidence intervals for each term can be estimated from such a procedure, giving an independent measure of the stability of the PLS model. In bootstrapping, r2bs is the average of the r2 values of the N random subsets, and can be calculated by the following equation, in which samples are selected randomly from the data set:

r²bs = (r²r1 + r²r2 + r²r3 + . . . + r²rN)/N

If r2bs ≈ r2, the model is robust.
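A minimal sketch of the r2bs calculation, assuming a one-descriptor model and hypothetical data; each run refits the model on N objects drawn with replacement from the N originals:

```python
import random

def r2_of_fit(x, y):
    """r2 of a one-descriptor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def bootstrap_r2(x, y, runs=200, seed=3):
    """Average r2 over bootstrap resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    r2s = []
    for _ in range(runs):
        idx = [rng.randrange(n) for _ in range(n)]   # with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) > 1:          # guard against degenerate resamples
            r2s.append(r2_of_fit(xs, ys))
    return sum(r2s) / len(r2s)

# Hypothetical logP/activity training set
logP = [0.5, 1.1, 1.9, 2.4, 3.2, 3.9, 4.6, 5.3]
activity = [2.9, 3.6, 4.3, 4.8, 5.9, 6.4, 7.2, 7.8]
r2_bs = bootstrap_r2(logP, activity)
```

For a robust model, r2_bs stays close to the full-data r2; collecting the spread of the individual bootstrap r2 values gives the confidence-interval estimate mentioned above.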

In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, subsamples of the data are repeatedly analyzed. Each subsample is a random sample, with replacement, from the full sample. In a typical bootstrap validation, K groups of size n are generated by repeated random selection of n objects from the original data set. Some of these objects can be included in the same random sample several times, whereas other objects may never be selected. The model obtained from the n randomly selected objects is used to predict the target properties for the excluded samples. A high average q2 in the bootstrap validation is a demonstration of the model's robustness.

Randomization test (scrambled model): The predictive power of an equation is poor when the observations are not sufficiently independent of each other. One way to test for this is by randomization of the dependent variables; this procedure ensures that the model is not due to chance. The set of activity values is reassigned randomly to different molecules and the entire modeling procedure is repeated, many times over. If the activity predictions of the random models are comparable to those of the original equation, the set of observations is not sufficient to support the model. Generally, the software scrambles the data set of biological activities against the corresponding structures several times, and new spreadsheets are generated to confirm that the chosen pharmacophore model was not generated by chance: for 95% statistical significance 19 spreadsheets are created, for 98% significance 49 spreadsheets, and for 99% significance 99 spreadsheets. The creation of a scrambled model [26] is a unique method of checking the descriptors used in the model, because the bioactivities are randomized, ensuring that each new model is created from a bogus data set.
The basis for this method is to test the validity of the original QSAR model and to ensure that the selected descriptors are appropriate. These new models (scram-models) are created using the same descriptors as the original model, but with changed bioactivities. After each scram-model is created, validation is performed using the methods mentioned earlier. To ensure that the scram-models are truly random, the process of changing the bioactivities can be repeated, and as each new scram-model is created its r2 and q2 values are also generated. Each time the r2 and q2 values of the scram-models come out substantially lower, it further confirms that the true QSAR model is sound. The model would be in question if there were a strong correlation (r2 > 0.50) between the randomized bioactivities and the predicted bioactivities, indicating that the model is not responsive to the bioactivities [27].

r2pred: The predictive ability of the selected model can also be confirmed by the external r2pred. Equations are generated based on the training set compounds, and the predictive capacity of the models is judged on the basis of the predictive r2 (r2pred) values. A value of r2pred greater than 0.6 may be taken as an indicator of good external predictability.
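The scrambling idea can be sketched as follows: the activity column is shuffled many times, the model is refit each run, and the scrambled r2 values should collapse relative to the true one (a one-descriptor sketch on hypothetical data, not a full pharmacophore-randomization protocol):

```python
import random

def r2_of_fit(x, y):
    """r2 of a one-descriptor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sum((a - mx) ** 2 for a in x) *
                        sum((b - my) ** 2 for b in y))

def y_randomization(x, y, runs=99, seed=5):
    """Refit the model against `runs` scrambled activity vectors;
    returns the true r2 and the mean scrambled r2."""
    rng = random.Random(seed)
    true_r2 = r2_of_fit(x, y)
    scram_r2 = []
    for _ in range(runs):
        ys = y[:]
        rng.shuffle(ys)              # bogus activity assignment
        scram_r2.append(r2_of_fit(x, ys))
    return true_r2, sum(scram_r2) / runs

# Hypothetical logP/activity training set
logP = [0.5, 1.1, 1.9, 2.4, 3.2, 3.9, 4.6, 5.3]
activity = [2.9, 3.6, 4.3, 4.8, 5.9, 6.4, 7.2, 7.8]
true_r2, scram_mean = y_randomization(logP, activity)
```

The 99 runs mirror the 99-spreadsheet (99% significance) scheme described above; a scram_mean anywhere near the true r2 would flag a chance correlation.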



Correlation coefficient (r2): The correlation coefficient r is a relative measure of the quality of the model. Its value depends on the overall variance of the data and can be expressed by the following equation:

r² = 1 − ΣΔ²/SYY

where SYY = Σ(Yobs − Ymean)² and ΣΔ² = Σ(Yobs − Ycalc)².

F-value: The F-value is a measure of the level of statistical significance of the regression model. Only an F-value larger than the 95% significance limit proves the overall significance of the regression equation:

F = r²(n − k − 1)/[k(1 − r²)]

Two different regression models, containing different numbers of variables, k1 (the smaller number) and k2 (the larger number), can be compared by the partial F-value. The use of the model containing the larger number of variables is justified if the resulting partial F-value indicates 95% significance for the introduction of the new variable(s):

Fpartial = (r2² − r1²)·(n − k2 − 1)/[(k2 − k1)·(1 − r2²)]

Regression coefficients: A regression coefficient represents the increment in the value of the dependent variable Y corresponding to a unit change in the value of the independent variable X, and therefore determines the individual weight of each independent variable in the variation of Y:

regression coefficient of Y on X = r·σy/σx

where σ = standard deviation and r = correlation coefficient.

Standard error (SE): The magnitude of the standard error gives an index of the precision of the estimate of a parameter. It enables us to determine the probable limits within which the population parameter may be expected to lie. For a proportion p (with q = 1 − p),

SE = √(pq/n), and the expected limits are p ± 3·√(pq/n)

while for a correlation coefficient, SE = (1 − r²)/√n.
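The r2, s, and F formulas above can be collected into one small helper (the observed and calculated activities below are hypothetical, and k is the number of variables in the model):

```python
import math

def regression_stats(y_obs, y_calc, k):
    """r2, standard deviation s, and F-value for a model with k variables,
    following the formulas in the text."""
    n = len(y_obs)
    ym = sum(y_obs) / n
    syy = sum((yo - ym) ** 2 for yo in y_obs)              # SYY
    delta2 = sum((yo - yc) ** 2
                 for yo, yc in zip(y_obs, y_calc))         # sum of Δ²
    r2 = 1.0 - delta2 / syy
    s = math.sqrt(delta2 / (n - k - 1))
    f = r2 * (n - k - 1) / (k * (1.0 - r2))
    return r2, s, f

y_obs = [3.1, 4.0, 5.2, 5.9, 7.0, 7.7]
y_calc = [3.0, 4.2, 5.1, 6.0, 6.9, 7.7]   # hypothetical model predictions
r2, s, f = regression_stats(y_obs, y_calc, k=1)
```

The computed F-value would then be compared against the tabulated 95% critical value for (k, n − k − 1) degrees of freedom, as described above.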

16.4.2 External validation
The true predictive power of a QSAR model is estimated by comparing the predicted and observed activities of a (sufficiently large) external test set of compounds that were not used in the model development [28]. To estimate the predictive power of a QSAR model, Golbraikh and Tropsha recommended use of the following statistical characteristics of the test set [29]: (i) the correlation coefficient R between the predicted and observed activities; (ii) the coefficients of determination for the regressions through the origin (predicted vs. observed activities, r02, and observed vs. predicted activities, r'02); and (iii) the slopes k and k' of the regression lines through the origin. They consider a QSAR model predictive if the following conditions are satisfied:

R2pred > 0.6; (r2 − r02)/r2 < 0.1; (r2 − r'02)/r2 < 0.1; and 0.85 < k < 1.15 or 0.85 < k' < 1.15.
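These checks can be sketched in code, using one common formulation of the through-origin statistics; the observed and predicted activities below are illustrative made-up values:

```python
import numpy as np

def golbraikh_tropsha(y, yp):
    """Return r2, the through-origin slopes k and k', and a pass/fail flag."""
    r2   = np.corrcoef(y, yp)[0, 1] ** 2
    k    = np.sum(y * yp) / np.sum(yp ** 2)       # slope, predicted -> observed
    kp   = np.sum(y * yp) / np.sum(y ** 2)        # slope, observed -> predicted
    r02  = 1 - np.sum((y - k * yp) ** 2) / np.sum((y - y.mean()) ** 2)
    r02p = 1 - np.sum((yp - kp * y) ** 2) / np.sum((yp - yp.mean()) ** 2)
    ok = (r2 > 0.6
          and abs(r2 - r02) / r2 < 0.1
          and abs(r2 - r02p) / r2 < 0.1
          and (0.85 < k < 1.15 or 0.85 < kp < 1.15))
    return r2, k, kp, ok

y  = np.array([4.2, 5.0, 5.9, 6.7, 7.6, 8.1])    # observed (illustrative)
yp = np.array([4.4, 4.9, 6.1, 6.5, 7.4, 8.3])    # predicted (illustrative)
r2, k, kp, ok = golbraikh_tropsha(y, yp)
print(r2, k, kp, ok)
```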

Optimum number of components (ONC): This is the number of components (descriptors or independent variables) corresponding to the highest q2 and the lowest standard error of prediction in the leave-one-out cross-validation method. The value of ONC is used to compute the conventional statistics of the final model. To avoid overfitting, its value is usually kept at about one-fifth of the total number of compounds present in the data set. The lack of correlation between q2 and r2 was noted by Kubinyi et al. [25]. Norinder suggested that the external test set must contain at least five compounds, representing the whole range of both the descriptors and the activities of the compounds included in the training set [30]. A more recent metric for checking the external predictability of the selected model is r2m, proposed by Roy and Paul [31]. A value of r2m greater than 0.5 may be taken as an indicator of good external predictability. Unlike external validation parameters such as R2pred, the r2m (overall) statistic is not based on only a limited number of test set compounds: it includes predictions for both test set and training set (using LOO predictions) compounds, and is thus based on the prediction of a comparably large number of compounds. The r2m (overall) statistic may be advantageous when the test set is considerably small, in which case regression-based external validation parameters may be less reliable and highly dependent on individual test set observations. The r2m (overall) statistic may also be used to select the best predictive model when comparable models are obtained, where some models show comparatively better internal validation parameters and others show comparatively superior external validation parameters. Another validation parameter, R2p, for checking the acceptability of the selected model has been reported by Roy [32].
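The r2m metric, as commonly stated in the QSAR literature (r2m = r2(1 − √(r2 − r02)), with r02 the through-origin coefficient of determination), can be sketched as follows; the activity values are illustrative:

```python
import numpy as np

def r2m(y_obs, y_pred):
    """r2m as commonly given in the QSAR literature (a sketch)."""
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    k  = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)   # through-origin slope
    r02 = 1 - (np.sum((y_obs - k * y_pred) ** 2)
               / np.sum((y_obs - y_obs.mean()) ** 2))
    return r2 * (1 - np.sqrt(abs(r2 - r02)))

y_obs  = np.array([4.2, 5.0, 5.9, 6.7, 7.6, 8.1])   # illustrative values
y_pred = np.array([4.4, 4.9, 6.1, 6.5, 7.4, 8.3])
print(f"r2m = {r2m(y_obs, y_pred):.3f}")             # > 0.5 suggests good predictivity
```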
The parameter R2p penalizes the model R2 for the difference between the squared mean correlation coefficient (R2r) of the randomized models and the squared correlation coefficient (R2) of the non-randomized model. A value of R2p greater than 0.5 may be taken as an indicator of model acceptability. The value of r2m (overall) determines whether the predicted activity values for the whole dataset of molecules are really close to the observed activities, i.e., whether the model is the best predictive one. The value of R2p, in contrast, determines whether the model is really robust or was obtained as a result of chance only. Hence it can be inferred that a QSAR model can be considered acceptable if the values of r2m (overall) and R2p are equal to or above 0.5 (or at least near 0.5). In general, a developed theoretical model can be accepted when it satisfies the following criteria (these are the minimum recommended values for a significant



QSAR model; note, however, that these evaluation measures depend on the response measurement scale or unit):

• Correlation coefficient R ≥ 0.8 (for in vivo data).
• Coefficient of determination R2 ≥ 0.6.
• The standard deviation s is not much larger than the standard deviation of the biological data.
• The F-value indicates that the overall significance level is better than 95%.
• The confidence intervals of all individual regression coefficients prove that they are justified at the 95% significance level.
• Cross-validated R2 (Q2) > 0.5.
• R2 for the external test set, R2pred > 0.6.
• The randomized R2 value should be low compared to R2.
• The randomized Q2 value should be low compared to Q2.
• (r2 − r02)/r2 < 0.1 and 0.85 ≤ k ≤ 1.15, or (r2 − r'02)/r2 < 0.1 and 0.85 ≤ k' ≤ 1.15 (for the test set).
• r2m (overall) and R2p ≥ 0.5 (or at least near 0.5).
In addition, the biological data should cover a range of at least one, two or even more logarithmic units and should be well distributed over the whole range. Likewise, the physicochemical parameters should be spread over a certain range and should be more or less evenly distributed.

An equation has to be rejected:
• If the above-mentioned statistical measures are not satisfied.
• If the number of variables in the regression equation is unreasonably large.
• If the standard deviation is smaller than the error in the biological data.

Applicability domain (APD): The activity of the entire universe of chemicals cannot be predicted even by a robust and validated QSAR model. The prediction of a modeled response using QSAR is valid only if the compound being predicted is within the applicability domain of the model. The applicability domain is a theoretical region of chemical space, defined by the model descriptors and the modeled response and, thus, by the nature of the training set molecules [33]. Whether a new chemical lies within the applicability domain can be checked using the leverage approach: a compound is considered outside the applicability domain when its leverage value is higher than the critical value of 3p/n, where p is the number of model variables plus 1 and n is the number of objects used to develop the model. Recently, Jaworska et al. reviewed the methods and criteria for estimating the applicability domain through training set interpolation, based on range-, distance-, geometry- and probability-density-distribution-based approaches [34]. A cluster-based approach to modeling the domain of applicability has been proposed by Stanforth et al. [35], applying an

intelligent version of the K-means clustering algorithm: the training set is modeled as a collection of clusters in descriptor space, a test compound is assigned fuzzy membership of each individual cluster, and from these memberships an overall distance is calculated. Guha and Jurs have used a classification method that divides the regression residuals from a previously generated model into a good class and a bad class and builds a classifier based on this division; the trained classifier is then used to determine the class of the residual of a new compound [27]. In general, to calculate the applicability domain (APD), the mean of the Euclidean distances between the molecules in the training set is first determined, and then:

APD = ⟨d⟩ + Zσ

where ⟨d⟩ is the average of the Euclidean distances that are less than the previously calculated mean, σ is the standard deviation of those Euclidean distances, and Z is an empirical cutoff with 0.5 as the default value.
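The leverage approach can be sketched as follows; the descriptor matrix is illustrative random data, and the `leverage` helper is a hypothetical name, not from any particular package:

```python
import numpy as np

# A compound is flagged as outside the applicability domain when its
# leverage exceeds the critical value 3p/n (p = model variables + 1).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(25, 4))               # 25 compounds, 4 descriptors
Xb = np.column_stack([np.ones(25), X_train])     # add intercept column

n, p = X_train.shape[0], X_train.shape[1] + 1
hat_core = np.linalg.inv(Xb.T @ Xb)              # core of the hat matrix
h_crit = 3 * p / n                               # critical leverage 3p/n

def leverage(x_query):
    """Leverage h = x (X'X)^-1 x' for a query compound's descriptor vector."""
    xb = np.concatenate([[1.0], x_query])
    return xb @ hat_core @ xb

inside  = leverage(np.zeros(4))                  # near the training centroid
outside = leverage(np.full(4, 6.0))              # far outside the training space
print(inside < h_crit, outside > h_crit)
```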

16.4.3 Virtual screening validation
The virtual screening results of pharmacophore models are further validated using techniques such as calculation of the EF, GH and ROC. The EF score is calculated to check the ability of the models to pick the active molecules over the inactives (decoys) using the formula EF = (Ha/Ht)/(A/D). The GH score is assessed by screening the model against a database of known actives and inactives and evaluating the results with the Güner-Henry scoring method, based on the following equation: GH = {[Ha × (3A + Ht)]/4HtA} × {1 − (Ht − Ha)/(D − A)}, where D is the total number of compounds in the database, A is the number of actives, Ha is the total number of actives on the hit list, and Ht is the total number of compounds on the hit list. The calculated GH scores range from 0 (null model) to 1 (ideal model). The best generated hypotheses are also validated by calculating sensitivity, specificity and ROC. The pharmacophore hypotheses are subjected to ROC analysis to assess their ability to correctly classify a list of compounds as actives or inactives. In this case, the validity of a particular pharmacophore is indicated by the area under the curve (AUC) of the corresponding ROC curve.
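Using the formulas above, the EF and GH scores for a made-up screening outcome can be computed directly:

```python
# Illustrative counts for a hypothetical screening run (not real data)
D  = 1000   # total compounds in the database
A  = 50     # known actives in the database
Ht = 100    # compounds retrieved on the hit list
Ha = 40     # actives retrieved on the hit list

EF = (Ha / Ht) / (A / D)
GH = ((Ha * (3 * A + Ht)) / (4 * Ht * A)) * (1 - (Ht - Ha) / (D - A))

print(f"EF = {EF:.1f}")     # enrichment over random picking
print(f"GH = {GH:.3f}")     # 0 = null model, 1 = ideal model
```

Here the hit list is eight-fold enriched in actives relative to random selection.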

16.5 Conclusion
From generation, to validation, to application, each theoretical model is based on different statistical methodologies. Accordingly, several statistical measures are available to validate and evaluate the reliability and significance of a theoretical model. These measures can also be used to check whether the size of the model is appropriate for the quantity of data available, and they provide some estimate of how well the model can predict the activity of new molecules. Researchers involved in theoretical modeling should therefore take the utmost care, while selecting an in-silico tool, to consider its statistical basis and the information available from the parameters involved.



Exercise: No exercise is incorporated in this chapter, as all of these statistical parameters are components of the algorithms in almost every in-silico tool and are not used directly, except in QSAR (Chapter 2), where an exercise is provided.

References
[1] M. Rausand, A. Høyland, System Reliability Theory: Models, Statistical Methods, and Applications, John Wiley & Sons, 2003.
[2] S.W. Raudenbush, A.S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods, Sage, 2002.
[3] V. Nguyen-Cong, G. Van Dang, B. Rode, Using multivariate adaptive regression splines to QSAR studies of dihydroartemisinin derivatives, Eur. J. Med. Chem. 31 (1996) 797-803.
[4] R. Kiralj, M. Ferreira, Basic validation procedures for regression models in QSAR and QSPR studies: theory and application, J. Braz. Chem. Soc. 20 (2009) 770-787.
[5] R. Veerasamy, H. Rajak, A. Jain, S. Sivadasan, C.P. Varghese, R.K. Agrawal, Validation of QSAR models - strategies and importance, Int. J. Drug. Des. Discov. 3 (2011) 511-519.
[6] L. Xue, J. Bajorath, Molecular descriptors for effective classification of biologically active compounds based on principal component analysis identified by a genetic algorithm, J. Chem. Inf. Comput. Sci. 40 (2000) 801-809.
[7] S. Raychaudhuri, J.M. Stuart, R.B. Altman, Principal components analysis to summarize microarray experiments: application to sporulation time series, Biocomputing 1999, World Scientific, 2000, pp. 455-466.
[8] A.H. Fielding, Cluster and Classification Techniques for the Biosciences, Cambridge University Press, 2006.
[9] N. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1966.
[10] S. Bergman, J. Gittins, R&D project selection methods, in: Statistical Methods for Pharmaceutical Research Planning, Marcel Dekker, Inc, 1985.
[11] X. Yan, X. Su, Linear Regression Analysis: Theory and Computing, World Scientific, 2009.
[12] A.J. Stromberg, Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression, SIAM J. Sci. Comput. 14 (1993) 1289-1299.
[13] M. Tranmer, M. Elliot, Multiple linear regression, Cathie Marsh Cent. Census Surv. Res. (CCSR) 5 (2008) 30-35.
[14] W.F. Massy, Principal components regression in exploratory statistical research, J. Am. Stat. Assoc. 60 (1965) 234-256.
[15] H. Abdi, L.J. Williams, Partial least squares methods: partial least squares correlation and partial least square regression, in: Comput. Toxicol., Springer, 2013, pp. 549-579.
[16] L. Ståhle, S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, J. Chemometrics 1 (1987) 185-196.
[17] D. Rogers, A.J. Hopfinger, Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships, J. Chem. Inf. Comput. Sci. 34 (1994) 854-866.
[18] W. Dunn, D. Rogers, Genetic partial least squares in QSAR, in: Genetic Algorithms in Molecular Modeling, Elsevier, 1996, pp. 109-130.
[19] P. Gramatica, Principles of QSAR models validation: internal and external, QSAR Combinatorial Sci. 26 (2007) 694-701.
[20] E.W. Steyerberg, F.E. Harrell Jr, G.J. Borsboom, M. Eijkemans, Y. Vergouwe, J.D.F. Habbema, Internal validation of predictive models: efficiency of some procedures for logistic regression analysis, J. Clin. Epidemiol. 54 (2001) 774-781.
[21] G. Schneider, P. Wrede, Artificial neural networks for computer-based molecular design, Prog. Biophys. Mol. Biol. 70 (1998) 175-222.
[22] A.R. Leach, Molecular Modelling: Principles and Applications, Pearson Education, 2001.
[23] T. Suzuki, K. Ide, M. Ishida, S. Shapiro, Classification of environmental estrogens by physicochemical properties using principal component analysis and hierarchical cluster analysis, J. Chem. Inf. Comput. Sci. 41 (2001) 718-726.
[24] J.A. Morón, M. Campillo, V. Perez, M. Unzeta, L. Pardo, Molecular determinants of MAO selectivity in a series of indolylmethylamine derivatives: biological activities, 3D-QSAR/CoMFA analysis, and computational simulation of ligand recognition, J. Med. Chem. 43 (2000) 1684-1691.
[25] H. Kubinyi, F.A. Hamprecht, T. Mietzner, Three-dimensional quantitative similarity-activity relationships (3D QSiAR) from SEAL similarity matrices, J. Med. Chem. 41 (1998) 2553-2564.
[26] A. Yasri, D. Hartsough, Toward an optimal procedure for variable selection and QSAR model building, J. Chem. Inf. Comput. Sci. 41 (2001) 1218-1227.
[27] R. Guha, P.C. Jurs, Determining the validity of a QSAR model - a classification approach, J. Chem. Inf. Model. 45 (2005) 65-73.
[28] N.S. Zefirov, V.A. Palyulin, QSAR for boiling points of "small" sulfides. Are the "high-quality structure-property-activity regressions" the real high quality QSAR models? J. Chem. Inf. Comput. Sci. 41 (2001) 1022-1027.
[29] A. Golbraikh, A. Tropsha, Beware of q2!, J. Mol. Graph. Model. 20 (2002) 269-276.
[30] U. Norinder, Single and domain mode variable selection in 3D QSAR applications, J. Chemometrics 10 (1996) 95-105.
[31] K. Roy, S. Paul, Exploring 2D and 3D QSARs of 2,4-diphenyl-1,3-oxazolines for ovicidal activity against Tetranychus urticae, QSAR Combinatorial Sci. 28 (2009) 406-425.
[32] K. Roy, On some aspects of validation of predictive quantitative structure-activity relationship models, Expert Opin. Drug Discov. 2 (2007) 1567-1577.
[33] A. Atkinson, Plots, Transformations and Regression, Clarendon Press, Oxford, 1985.
[34] J. Jaworska, N. Nikolova-Jeliazkova, T. Aldenberg, QSAR applicability domain estimation by projection of the training set in descriptor space: a review, Altern. Lab. Anim. 33 (2005) 445-459.
[35] R.W. Stanforth, E. Kolossov, B. Mirkin, A measure of domain of applicability for QSAR modelling based on intelligent K-means clustering, QSAR Combinatorial Sci. 26 (2007) 837-844.

Glossary A Ab initio Ab initio means "from first principles" or "from the beginning", implying that the only inputs into an ab initio calculation are physical constants. Ab initio quantum chemistry methods attempt to solve the electronic Schrödinger equation given the positions of the nuclei and the number of electrons in order to yield useful information such as electron densities, energies and other properties of the system. Absorption A physical or chemical phenomenon or a process in which atoms, molecules or ions enter some bulk phase liquid or solid material. Accession number An identifier supplied by the curators of the major biological databases upon submission of a novel entry that uniquely identifies that sequence (or other) entry. Active site The region of a functional protein, such as an enzyme or receptor, where substrate molecules bind and undergo a biochemical reaction. The active site consists of amino acid residues that form temporary bonds with the substrate (binding site) and residues that catalyse a reaction of that substrate (catalytic site). Active transport The movement of molecules across a membrane from a region of lower concentration to a region of higher concentration—against the concentration gradient. Active transport requires cellular energy to achieve this movement. ADME ADME stands for Absorption, Distribution, Metabolism and Excretion. These four aspects of a drug's action are collectively known as pharmacokinetics. Adverse effect An undesired harmful effect resulting from a medication or other intervention such as surgery. An adverse effect may be termed a "side effect" when judged to be secondary to a main or therapeutic effect. Agonist An agonist is a chemical that binds to a receptor and activates the receptor to produce a biological response. Algorithm A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
Allosteric effect The binding of a ligand to one site on a protein molecule in such a way that the properties of another site on the same protein are affected. Some enzymes are allosteric proteins, and their activity is regulated through the binding of an effector to an allosteric site. Allosteric Of or relating to the binding of a molecule to an enzyme at a site other than the active site, resulting in modulation of the enzyme's activity as a result of a change in its shape. Alternative splicing One of the alternate combinations of a folded protein that are possible due to recombination of multiple gene segments during mRNA splicing, which occurs in higher organisms. AM1 Austin Model 1. A semiempirical molecular orbital method. AMBER Assisted Model Building with Energy Refinement: a molecular mechanics force field designed for the simulation of peptides and nucleic acids. Amino acid Organic compound containing both amino and acid functionalities. α-Amino acids are the building blocks of proteins; they are joined by amide (peptide) linkages to form the polypeptide chain of a protein. Antagonist A substance which inhibits the physiological action of a functional protein. Antisense DNA or RNA composed of the complementary sequence to the target DNA/RNA. Also used to describe a therapeutic strategy that uses antisense DNA or RNA sequences to target specific gene DNA sequences or mRNAs implicated in disease, in order to bind them and inhibit their expression by physically blocking them.


Applicability domain (AD) Of a QSAR model, the response and chemical structure spaces in which the model makes predictions with a given reliability. Area under the curve or AUC (a.k.a. concordance index) In logistic regression, the area under the ROC curve; it represents the likelihood that a case will have a higher predicted probability of the event than a control across the range of criterion probabilities. Artificial intelligence The simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. Artificial neural networks (ANNs) Computing systems vaguely inspired by the biological neural networks that constitute animal brains.

B Bagging and boosting approaches Bagging is a way to decrease the variance in the prediction by generating additional data for training from the dataset, using combinations with repetitions to produce multi-sets of the original data. Boosting is an iterative technique which adjusts the weight of an observation based on the last classification. Beta sheet One of the secondary structural elements characteristic of proteins, a three-dimensional arrangement taken up by polypeptide chains consisting of alternate strands linked by hydrogen bonds. Binding energy Amount of energy required to separate a particle from a system of particles or to disperse all the particles of the system. Binding site Region in a functional biomacromolecule where small effector molecules bind to elicit its biological activity. Bioactive (of a substance) having a biological effect. BioAssay A biological testing procedure for estimating the concentration of a pharmaceutically active substance in a formulated product or bulk material. In contrast to common physical or chemical methods, a bioassay results in detailed information on the biological activity of a substance. Bioavailability The proportion of a drug or other substance which enters the circulation when introduced into the body and so is able to have an active effect. Biocurators A professional scientist who curates, collects, annotates, and validates information that is disseminated by biological and model organism databases. Bioinformatics An interdisciplinary field of science that combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Biomolecules A molecule that is produced by a living organism. Biopolymer Polymers produced by living organisms in which monomeric units are covalently bonded to form larger structures, e.g. proteins, nucleic acids, polysaccharides, etc.
BLAST Basic Local Alignment Search Tool: a set of similarity search programs for DNA and protein sequences, originally published in the Journal of Molecular Biology (S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, 215 (3) (1990) 403-410). Several web resources are available, including the NCBI-NIH and Washington University servers. Blood-brain barrier (BBB) A highly selective semipermeable border of endothelial cells that prevents solutes in the circulating blood from non-selectively crossing into the extracellular fluid of the central nervous system where neurons reside. Bond angle The angle between two bonds or two bonded electron pairs in a compound. For example, in CH4 the bond angle is 109°28′. Bond length In molecular geometry, bond length or bond distance is defined as the average distance between the nuclei of two bonded atoms in a molecule. Bootstrapping Any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates.

C Caco-2 cells A continuous line of heterogeneous human epithelial colorectal adenocarcinoma cells, developed by the Sloan-Kettering Institute for Cancer Research through research conducted by Dr. Jorgen Fogh. Carcinogenicity The ability or tendency of a chemical to induce tumors (benign or malignant). Catalyst A substance that can be added to a reaction to increase the reaction rate without getting consumed in the process. Catalytic domain The region of an enzyme that interacts with its substrate to cause the enzymatic reaction. Cathode ray tubes (CRTs) A vacuum tube that contains one or more electron guns and a phosphorescent screen and is used to display images. It modulates, accelerates, and deflects electron beam(s) onto the screen to create the images. Cerebro-spinal fluid (CSF) A clear, colorless body fluid found in the brain and spinal cord. It is produced by specialized ependymal cells in the choroid plexuses of the ventricles of the brain, and absorbed in the arachnoid granulations. Chemical shift In nuclear magnetic resonance spectroscopy, the chemical shift is the resonant frequency of a nucleus relative to a standard in a magnetic field. Often the position and number of chemical shifts are diagnostic of the structure of a molecule. Cheminformatics (also known as chemoinformatics) refers to the use of physical chemistry theory with computer and information science techniques—so called "in silico" techniques—in application to a range of descriptive and prescriptive problems in the field of chemistry, including in its applications to biology and related molecular fields. Chirality The property of a figure that is not identical to its mirror image. A molecule is said to be chiral if all the valences of one of its atoms are occupied by different atoms or groups of atoms. Clearance The action or process of clearing or of being dispersed. Client A computer, or the software running on a computer, that interacts with another computer at a remote site (server).
Cluster The grouping of similar objects in a multidimensional space. Clustering is used for constructing new features which are abstractions of the existing features of those objects. Coding regions The portion of a genomic sequence bounded by start and stop codons that identifies the sequence of the protein being coded for by a particular gene. Codon A sequence of three adjacent nucleotides that designates a specific amino acid or start/stop site for transcription. Combinatorial chemistry The use of chemical methods to generate all possible combinations of chemicals starting with a subset of compounds. The building blocks may be peptides, nucleic acids or small molecules. The libraries of compounds formed by this methodology are used to probe for new pharmaceutical reagents (see high-throughput screening). Configuration (in software) The complete ordering and description of all parts of a software or database system. Configuration management is the use of software to identify, inventory and maintain the component modules that together comprise one or more systems or products. Conformation The precise three-dimensional arrangement of atoms and bonds in a molecule describing its geometry and hence its molecular function. Consensus In sequencing, the predicted sequence of the original DNA used to create a shotgun library. In alignments, the base or amino acid most likely to occur at any given position; consensus sequences can be used to characterize protein families. Consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. Contour Plot A color coded graphical representation of 3-dimensional surface by plotting constant z slices, called contours, on a 2-dimensional format. That is, given a value for z, lines are drawn for connecting the (x,y) coordinates where that z value occurs.

Convergence The end-point of any algorithm that uses iteration or recursion to guide a series of data processing steps. An algorithm is usually said to have reached convergence when the difference between the computed and observed steps falls below a pre-defined threshold. Coordinates A system of numbers used to locate a point or object in a drawing. In the Cartesian coordinate system two numbers, x and y, are used to describe the location of a point in the horizontal and vertical dimensions respectively. 3D CAD programs add the z coordinate, which describes distance in the third dimension. In the polar coordinate system a point is described by a distance and an angle, where 0° extends horizontally to the right. Correlation coefficient A measure ranging from −1 to +1 that indicates the strength and direction of linear association between two quantitative variables. Coulomb potential The amount of work needed to move a unit of charge from a reference point to a specific point inside the field without producing acceleration. Cross docking The process of taking a series of complexes of ligand-receptor pairs, and docking every ligand to every receptor. Cross validation A technique in which we train our model using a subset of the data-set and then evaluate it using the complementary subset of the data-set. Crystal structure Term used to describe the high resolution molecular structure derived by X-ray crystallographic analysis of protein or other biomolecular crystals. Cytotoxicity The quality of being toxic to cells.

D Data Numbers, letters, or special characters representing measurements of the properties of one’s analytic units, or cases, in a study; data are the raw material of statistics. Data mining Extracting the information from a huge set of data i.e. mining the knowledge from data. Database A structured set of data held in a computer, especially one that is accessible in various ways. De novo design The design of bioactive compounds by incremental construction of a ligand model within a model of the receptor or enzyme active site, the structure of which is known from X-ray or NMR data. Decoys A set of molecules that (probably) won’t bind to target protein (false positive). Degrees of freedom A technical term reflecting the number of independent elements comprising a statistical measure. Certain distributions require a degrees of freedom value to fully characterize them (e.g., the t, χ2, and F distributions). Deletions A genetic rearrangement through loss of segments of DNA or RNA, bringing sequences which are normally separated into close proximity. This deletion may be detected using cytogenetic techniques and can also be inferred from the phenotype, indicating a deletion at one specific locus. Density functional theory (DFT) Computational quantum mechanical modeling method used in physics, chemistry and materials science to investigate the electronic structure (or nuclear structure) (principally the ground state) of many-body systems, in particular atoms, molecules, and the condensed phases. Descriptors A piece of stored data that indicates how other data is stored. Desolvation The process where in an aqueous solution containing an enzyme and a substrate, water that is surrounding the substrate is replaced by the enzyme. Dihedral angle The angle between two intersecting planes. In chemistry, it is the angle between planes through two sets of three atoms, having two atoms in common. 
In solid geometry, it is defined as the union of a line and two half-planes that have this line as a common edge. Dipole moment The mathematical product of the separation of the ends of a dipole and the magnitude of the charges. DNA fingerprinting A technique for identifying human individuals based on a restriction enzyme digest of tandemly repeated DNA sequences that are scattered throughout the human genome, but are unique to each individual. Domains Independent structural units in proteins which can be found alone or in conjunction with other domains and are responsible for specific functions. Domains are evolutionarily related.

Dotplot A visual technique for comparing two sequences with one another, allowing for the identification of regions of local alignment, direct or inverted repeats, insertions, deletions, or low-complexity regions. Download The act of transferring a file from a remote host to a local machine via FTP. Drug A chemical substance used in the treatment, cure, prevention, or diagnosis of disease or used to otherwise enhance physical or mental health. DrugBank A unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information. Drug discovery The process by which new candidate medications are discovered. Historically, drugs were discovered by identifying the active ingredient from traditional remedies or by serendipitous discovery, as with penicillin. Drug disposition A general term that encompasses the four processes that determine drug and metabolite concentrations in plasma, in tissue, and within cells: absorption, distribution, metabolism, and excretion (usually biliary or renal). Drug distribution Transfer of a drug from one location to another within the body. Druglikeness A qualitative concept used in drug design for how "druglike" a substance is with respect to factors like bioavailability. It is estimated from the molecular structure before the substance is even synthesized and tested.

E Electron tunneling The passage of electrons through a potential barrier which they would not be able to cross according to classical mechanics, such as a thin insulating barrier between two superconductors. Electrostatic field When two objects in each other’s vicinity have different electrical charges, an electrostatic field exists between them. Electrostatic fields arise from a potential difference or voltage gradient, and can exist when charge carriers, such as electrons, are stationary (hence the “static” in “electrostatic”). Electrostatic force Attractive or repulsive forces between particles that are caused by their electric charges. Empirical function A scoring function which is based on counting the number of various types of interactions between the two binding partners. Enantioselective synthesis A chemical reaction in which one or more new elements of chirality are formed in a substrate molecule and which produces the stereoisomeric products in unequal amounts. Energy minimization A computational method for reducing the calculated covalent and noncovalent energy of a molecule with a given geometry usually to determine its lowest stable conformation. Ensemble A group producing a single effect. Enthalpy A thermodynamic quantity equivalent to the total heat content of a system. It is equal to the internal energy of the system plus the product of pressure and volume. Entropy A thermodynamic quantity representing the unavailability of a system’s thermal energy for conversion into mechanical work, often interpreted as the degree of disorder or randomness in the system. Enzyme A substance that acts as a catalyst in living organisms, regulating the rate at which chemical reactions proceed without itself being altered in the process. The biological processes that occur within all living organisms are chemical reactions, and most are regulated by enzymes. Epigenetic Descriptive term for processes that change the phenotype without altering the genotype. 
Euclidean distance The "ordinary" straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is called the Euclidean norm.
Euclidean distance threshold The distance between a gold standard node and a test node in Euclidean space that is allowable for the nodes to match.
Excretion A process by which metabolic waste is eliminated from an organism.
Exon The region of DNA within a gene that codes for a polypeptide chain or domain. Typically, a mature protein is composed of several domains coded by different exons within a single gene.
External validity The extent to which a study's results can be generalized to a larger, known population.
Extracellular fluid (ECF) All body fluid outside the cells of any multicellular organism.
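The Euclidean distance defined above can be sketched in a few lines of Python; the function and point values are illustrative only:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
print(euclidean((0, 0), (3, 4)))  # -> 5.0
```

The same function works unchanged for 3D atomic coordinates, where it gives interatomic distances.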

F
F test A statistical test for which the null hypothesis is that all group means are the same (ANOVA) or that all regression coefficients equal zero in the population (linear regression).
FASTA format A text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
Fingerprints Groups of motifs excised from conserved regions of sequence alignments and used for iterative database scanning.
Fold A protein fold is defined by the arrangement of the secondary structure elements of the structure relative to each other in space.
Force field The set of functional forms and parameters used to define the potential energy function of a molecular system.
Formulation In pharmaceutics, the process in which different chemical substances, including the active drug, are combined to produce a final medicinal product.
Fragments Small organic molecules of low molecular weight.
Free energy In physics and physical chemistry, free energy refers to the amount of internal energy of a thermodynamic system that is available to perform work.
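The FASTA format described above is simple enough to parse by hand; a minimal sketch, with a made-up record for illustration (in a FASTA file each record begins with a `>` header line, followed by one or more sequence lines):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header, seq = {}, None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:          # flush the previous record
                records[header] = "".join(seq)
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())        # sequence may span several lines
    if header is not None:
        records[header] = "".join(seq)
    return records

example = """>seq1 demo protein
MKTAYIAKQR
QISFVKSHFS"""
print(parse_fasta(example))  # -> {'seq1 demo protein': 'MKTAYIAKQRQISFVKSHFS'}
```

Production code would normally use an established parser (e.g., from Biopython) rather than this sketch.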

G
Gaussian An ab initio molecular orbital theory package.
Gene The basic physical and functional unit of heredity.
Genetic function approximation (GFA) A technique for generating statistical models of data using the process of evolution. Unlike most other analysis algorithms, GFA provides the user with multiple models; the populations of models are created by evolving random initial models using a genetic algorithm.
Genetics A branch of biology concerned with the study of genes, genetic variation, and heredity in organisms.
Genome The complete set of genes or genetic material present in a cell or organism.
Geometric complementarity (also known as VDW complementarity) The property by which the final adhesion of two surfaces depends on the way they can fit together at the atomic scale, either naturally or after some local rearrangement of the atomic positions.
Gibbs free energy The energy that may be converted into work in a system at constant temperature and pressure; a thermodynamic quantity equal to the enthalpy (of a system or process) minus the product of the entropy and the absolute temperature.
Global electrophilicity index A quantitative and base-independent metric of Lewis acidity. Parr et al. defined global electrophilicity as a quantitative intrinsic numerical value and suggested the term electrophilicity index, ω, a new global reactivity descriptor of atoms and molecules.
Glycosylation The addition of carbohydrate groups (sugars).
Graph A set of vertices (also called nodes) and a set of edges connecting those vertices.
Grid A drawing tool, usually a pattern of regularly spaced dots or lines, which makes the alignment and drawing of objects easier.
Grid map A grid map consists of a three-dimensional lattice of regularly spaced points, surrounding (either entirely or partly) and centered on some region of interest of the macromolecule under study.
GROMACS A molecular dynamics package, primarily designed for biochemical molecules like proteins and lipids.
GUI Graphical user interface. Refers to software that relies on pictures and icons to direct the interaction of users with the application.
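The Gibbs free energy definition above (enthalpy minus the product of entropy and absolute temperature) is a one-line calculation; the binding-event values below are hypothetical, chosen only to illustrate the sign convention:

```python
def gibbs_free_energy(delta_h, temperature, delta_s):
    """Delta G = Delta H - T * Delta S (units must be consistent,
    e.g. kJ/mol for Delta H and kJ/(mol*K) for Delta S)."""
    return delta_h - temperature * delta_s

# Hypothetical binding event: Delta H = -40 kJ/mol, Delta S = -0.05 kJ/(mol*K), T = 298 K
dg = gibbs_free_energy(-40.0, 298.0, -0.05)
print(dg)  # approximately -25.1 kJ/mol; negative, so the process is favorable
```

A negative ΔG at constant temperature and pressure indicates a spontaneous (favorable) process, which is why it is the standard quantity reported for ligand binding.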

H
Hetero-oligomer A complex made of several different protein subunits is called a hetero-oligomer or heteromer.
High-throughput screening (HTS) An automated process that can rapidly identify active compounds, antibodies, or genes; the results provide starting points for drug design and an understanding of the interaction or role of identified biochemical processes in biology.

HOMO (highest occupied molecular orbital) The energy, in eV, of the highest-energy molecular orbital that contains electrons. It is used to characterize the nucleophilicity of substrate molecules.
Homologous If two proteins are evolutionarily related and stem from common ancestors, they are called homologous.
Homology modeling Computational modeling of a protein structure from its sequence based on homology to known structures.
Homology Two or more biological species, systems, or molecules that share a common evolutionary ancestor.
Homo-oligomer When only one type of protein subunit is used in the complex, it is called a homo-oligomer or homomer.
Hormone A chemical that is made by specialist cells, usually within an endocrine gland, and released into the bloodstream to send a message to another part of the body.
Hot spots Most biological processes involve multiple proteins interacting with each other. Certain residues in these protein-protein interactions, called hot spots, contribute more significantly to binding affinity than others.
HTML Hypertext markup language. The standard, text-based language used to specify the format of Internet documents. HTML files are translated and rendered through the use of web browsers.
http Hypertext Transfer Protocol.
Hydrogen bond A weak electrostatic attraction between two relatively electronegative atoms in which a hydrogen atom acts as a bridge: the hydrogen is connected to one electronegative atom by a covalent bond and to the other by a weak noncovalent attraction.
Hydrophobicity (lit. water-hating) The degree to which a molecule is insoluble in water, and hence soluble in lipids. If a molecule lacking polar groups is placed in water, it will be entropically driven to find a hydrophobic environment (such as the interior of a protein or a membrane).
Hydrophilicity (lit. water-loving) The degree to which a molecule is soluble in water. Hydrophilicity depends to a large degree on the charge and polarizability of the molecule and its ability to form transient hydrogen bonds with (polar) water molecules.
Hyperlink A graphic or text within an Internet document that can be selected using a mouse. Clicking on a hyperlink transports the user to another part of the same web page or to another web page, regardless of location.
Hyperplane A subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines.
Hypertext Within a web page, text which functions as a hyperlink and is differentiated either by color or by underlining.
Hypothesis A tentative statement about the value of one or more population parameters.

I
Immunoglobulin A member of the globulin protein family consisting of two light and two heavy chains linked by disulfide bonds. All antibodies are immunoglobulins.
Immunotoxicity Adverse effects on the functioning of both local and systemic immune systems that result from exposure to toxic substances, including chemical warfare agents.
In silico Performed on a computer or via computer simulation.
In vitro Performed or taking place in a test tube, culture dish, or elsewhere outside a living organism.
In vivo Performed or taking place in a living organism.
Inductive effect An effect regarding the transmission of unequal sharing of the bonding electrons through a chain of atoms in a molecule, leading to a permanent dipole in a bond. It is present in a σ bond, as opposed to the electromeric effect, which is present in a π bond.
Interface A device or program enabling a user to communicate with a computer.
Internal validity The extent to which treatment-group differences on a study endpoint represent the causal effect of the treatment on the study endpoint.

Internet A system of linked computer networks used for the transmission of files and messages between hosts.
Introns Nucleotide sequences found in the structural genes of eukaryotes that are non-coding and interrupt the sequences containing information that codes for polypeptide chains. Intron sequences are spliced out of their RNA transcripts before maturation and protein synthesis.
Ionization constant An acid dissociation constant, Ka (also known as acidity constant, or acid-ionization constant), is a quantitative measure of the strength of an acid in solution. It is the equilibrium constant for a chemical reaction.
Ionization The process by which an atom or a molecule acquires a negative or positive charge by gaining or losing electrons, often in conjunction with other chemical changes.
Isothermal titration calorimetry (ITC) A technique used in quantitative studies of a wide variety of biomolecular interactions.
Iteration A series of steps in an algorithm whereby the processing of data is performed repetitively until the result exceeds a particular threshold. Iteration is often used in multiple sequence alignment, whereby each set of pairwise alignments is compared with every other, starting with the most similar pairs and progressing to the least similar, until there are no longer any sequence pairs remaining to be aligned.
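The ionization constant defined above is most often applied through the Henderson-Hasselbalch relation (not stated in the entry itself), which gives the ionized fraction of an acid at a given pH. A small illustrative sketch; the pKa and pH values are hypothetical:

```python
def fraction_ionized_acid(pka, ph):
    """Fraction of a monoprotic acid present in its ionized (A-) form at a
    given pH, from the Henderson-Hasselbalch relation: pH = pKa + log([A-]/[HA])."""
    ratio = 10 ** (ph - pka)          # [A-]/[HA]
    return ratio / (1 + ratio)

# At pH == pKa the acid is exactly half ionized.
print(fraction_ionized_acid(3.5, 3.5))  # -> 0.5

# A weak acid with pKa 3.5 at physiological pH 7.4 is almost fully ionized.
print(fraction_ionized_acid(3.5, 7.4))  # close to 1.0
```

This kind of calculation is routine in predicting drug absorption, since the neutral form generally crosses membranes more readily than the ionized form.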

J
Java A programming language developed by Sun Microsystems that allows small programs (applets) to be run on any computer. Java applets are typically invoked when a user clicks on a hyperlink on a web page.

K
Karyotype The constitution (typically number and size) of chromosomes in a cell or individual.
Knockout mice Mice that have been engineered to lack a chosen gene. The gene is inactivated in so-called embryonic stem cells using the technique of homologous recombination. These cells are then introduced into an early-stage embryo (blastocyst), which is then transplanted into a recipient mouse. The subsequent progeny lack the targeted gene in some cells. This technique is used to determine the function of the chosen gene.

L
LAN Local Area Network. A network that connects computers in a small, defined area, such as the offices in a single wing or a group of buildings.
Lead compound A candidate compound identified as the best "hit" (tight binder) after screening of a combinatorial (or other) compound library, which is then taken into further rounds of screening to determine its suitability as a drug.
Lead optimization The process by which a drug candidate is designed after an initial lead compound is identified.
LGO Leave-Group-Out, also known as Monte Carlo CV, a cross-validation scheme which holds out the samples according to a third-party-provided array of integer groups.
Library A library might be either a genomic library or a cDNA library. In either case, the library is just a tube carrying a mixture of thousands of different clones - bacteria or λ phages. Each clone carries an "insert," the cloned DNA.
Ligand A substance that forms a complex with a biomolecule to serve a biological purpose. In protein-ligand binding, the ligand is usually a molecule which produces a signal by binding to a site on a target protein.
Linear discriminant analysis (LDA) A method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
Lipophilicity The affinity of a drug for a lipid environment. Lipophilicity can be measured by the distribution of a drug between the organic phase, which is generally n-octanol pre-saturated with water, and the aqueous phase, which is generally water pre-saturated with n-octanol.

LOO Leave-one-out, a special case of cross-validation where the number of folds equals the number of instances in the data set.
LUMO (lowest unoccupied molecular orbital) The energy, in eV, of the lowest-energy molecular orbital that does not contain electrons. It is used to characterize the electrophilicity of substrate molecules.
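The leave-one-out scheme defined above is easy to state as code: with n samples there are n folds, each holding out exactly one sample. A minimal sketch that only generates the index splits (a real QSAR workflow would fit a model on each training split):

```python
def loo_splits(n):
    """Leave-one-out cross-validation: yield (train_indices, test_indices)
    pairs, where each of the n samples is the test set exactly once."""
    for i in range(n):
        train = [j for j in range(n) if j != i]
        yield train, [i]

splits = list(loo_splits(4))
print(len(splits))   # -> 4 folds for 4 samples
print(splits[0])     # -> ([1, 2, 3], [0])
```

Libraries such as scikit-learn provide the same splitting logic ready-made, but the idea is no more than this loop.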

M
Machine learning A method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.
Macromolecule A molecule containing a very large number of atoms, such as a protein, nucleic acid, or synthetic polymer.
Madin-Darby Canine Kidney (MDCK) cells A model mammalian cell line used in biomedical research. It is one of the few cell culture models suited for 3D cell culture and the multicellular rearrangements known as branching morphogenesis.
Membrane perturbation A physical disruption of some kind.
Metabolism A set of chemical reactions that occur in the cells of living organisms to sustain life.
Microdialysis A minimally invasive sampling technique that is used for continuous measurement of free, unbound analyte concentrations in the extracellular fluid of virtually any tissue.
Microsatellites Consist of tandem repeats, which contain repetitive runs of the same short base sequence (e.g., GTA, GTA, GTA...). Among individuals, these sections of DNA may vary in the number of repeats they contain and can serve as markers and signs of genetic variation.
Missense mutation A point mutation in which one codon (triplet of bases) is changed into another designating a different amino acid.
Model validation The set of processes and activities intended to verify that models are performing as expected.
Molar refractivity A measure of the total polarizability of a mole of a substance; it is dependent on the temperature, the index of refraction, and the pressure.
Molecular docking A computational modeling technique that simulates the interaction of two or more molecules to give a stable adduct. Depending upon the binding properties of ligand and target, it predicts the three-dimensional structure of the complex.
Molecular dynamics A computer simulation method for analyzing the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic "evolution" of the system.
Molecular holograms Extended, fragment-based fingerprints in which a molecule's substructural fragments are hashed into a fixed-length array, as used in hologram QSAR (HQSAR).
Molecular mechanics A computational method that computes the potential energy surface for a particular arrangement of atoms using potential functions derived from classical physics. These equations are known as a force field.
Molecular modeling A collection of (computer-based) techniques for deriving, representing, and manipulating the structures and reactions of molecules, and those properties that are dependent on these three-dimensional structures.
Molecular recognition The specific interaction between two or more molecules, which exhibit molecular complementarity, through noncovalent bonding such as hydrogen bonding, metal coordination, hydrophobic forces, van der Waals forces, π-π interactions, and/or electrostatic effects.
Molecular surfaces An important tool for representing molecules and for computing intermolecular interactions.
Monte Carlo Tree Search (MCTS) A search technique in the field of Artificial Intelligence (AI). It is a probabilistic and heuristic-driven search algorithm that combines classic tree search implementations with the machine learning principles of reinforcement learning.
Motif A set of contiguous secondary structure elements that either have a particular functional significance or define a portion of an independently folded domain.

Multiple linear regression (MLR) A statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of linear (OLS) regression, which uses just one explanatory variable.
Multiplicity A measure of the number of unpaired electrons in a molecule. Singlet multiplicity means that all the electron spins are paired; a doublet must have one unpaired spin.
Mutagenicity Refers to a chemical or physical agent's capacity to cause mutations (genetic alterations). Agents that damage DNA, causing lesions that result in cell death or mutations, are genotoxins.
Mutation A modification to a chromosome. Mutations can involve single bases or entire regions of a chromosome. Mutations can be neutral (i.e., have no effect), harmful, or beneficial. As such, mutations drive evolutionary change.
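The multiple linear regression entry above can be illustrated with a least-squares fit; the descriptor matrix and activity values below are toy, noise-free data invented for the example:

```python
import numpy as np

# Toy QSAR-style data: two hypothetical descriptors per molecule.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + 1.0   # activity generated without noise

# Prepend an intercept column and solve the ordinary least-squares problem.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.0, 0.5, 2.0]: intercept and two coefficients
```

Because the data are noise-free, the fit recovers the generating coefficients exactly; with real descriptors the coefficients would only approximate the underlying relationship.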

N
Naïve Bayesian approach A classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Neural network A series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.
Neurotransmitter Endogenous chemicals that enable neurotransmission. A neurotransmitter is a type of chemical messenger which transmits signals across a chemical synapse, such as a neuromuscular junction, from one neuron (nerve cell) to another "target" neuron, muscle cell, or gland cell.
New chemical entity (NCE) According to the U.S. Food and Drug Administration, an NCE is a drug that contains no active moiety that has been approved by the FDA in any other application submitted under section 505 of the Federal Food, Drug, and Cosmetic Act.
Non-synonymous SNPs (nsSNPs) Substitutions in coding regions that result in a different amino acid, i.e., the altered code does not correspond to the same amino acid as the "wild-type" sequence.
Non-covalent interactions Interactions between molecules that do not involve sharing a pair of electrons. Noncovalent bonds typically hold together large molecules such as proteins and nucleic acids.
Nonsense mutation A point mutation in which a codon specific for an amino acid is converted into a nonsense codon.
Nuclear magnetic resonance (NMR) A method of physical observation in which nuclei in a strong constant magnetic field are perturbed by a weak oscillating magnetic field (in the near field, and therefore not involving electromagnetic waves) and respond by producing an electromagnetic signal with a frequency characteristic of the magnetic field at the nucleus.
Nucleic acid Biopolymers, or small biomolecules, essential to all known forms of life. The term nucleic acid is the overall name for DNA and RNA. They are composed of nucleotides, which are monomers made of three components: a 5-carbon sugar, a phosphate group, and a nitrogenous base.
Nutraceuticals A broad umbrella term used to describe any product derived from food sources with extra health benefits in addition to the basic nutritional value found in foods.

O
Operating system An interface between a computer user and computer hardware. An operating system is software that performs all the basic tasks like file management, memory management, process management, handling input and output, and controlling peripheral devices such as disk drives and printers.
Oracle A major supplier of software for information management. The Oracle database, which uses SQL, is being made increasingly internet-aware.
Oxidative stress An imbalance between free radicals and antioxidants in the body.

P
Partial least squares (PLS) A statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space.
Partition coefficient The ratio of the concentrations of a solute in two immiscible or slightly miscible liquids, or in two solids, when it is in equilibrium across the interface between them.
Peptide bond A covalent bond formed between two amino acids when the amino group of one is linked to the carboxyl group of another (resulting in the elimination of one water molecule).
Pharmacodynamics (sometimes described as what a drug does to the body) The study of the biochemical, physiologic, and molecular effects of drugs on the body; it involves receptor binding (including receptor sensitivity), postreceptor effects, and chemical interactions.
Pharmacoeconomics The scientific discipline that compares the value of one pharmaceutical drug or drug therapy to another. It is a sub-discipline of health economics.
Pharmacogenomics The study of how genes affect a person's response to drugs. This relatively new field combines pharmacology (the science of drugs) and genomics (the study of genes and their functions) to develop effective, safe medications and doses tailored to a person's genetic makeup.
Pharmacokinetics Sometimes described as what the body does to a drug; refers to the movement of drug into, through, and out of the body - the time course of its absorption, bioavailability, distribution, metabolism, and excretion.
Pharmacometabolomics A field which stems from metabolomics, the quantification and analysis of metabolites produced by the body.
Pharmacophore The part of a molecular structure that is responsible for a particular biological or pharmacological interaction that it undergoes.
Pharmacophore mapping The definition and placement of pharmacophoric features and the alignment techniques used to overlay them in 3D.
Plasma protein binding (PPB) Refers to the degree to which medications attach to proteins within the blood. A drug's efficiency may be affected by the degree to which it binds: the less bound a drug is, the more efficiently it can traverse cell membranes or diffuse.
Polar surface area (PSA) A very useful parameter for prediction of drug transport properties. Polar surface area is defined as the sum of the surfaces of polar atoms (usually oxygens, nitrogens, and attached hydrogens) in a molecule.
Polarity The distribution of electrical charge over the atoms joined by a bond.
Polarizability A measure of how easily an electron cloud is distorted by an electric field. Typically the electron cloud belongs to an atom, molecule, or ion. The electric field could be caused, for example, by an electrode or a nearby cation or anion.
Polymorphism Common differences in DNA sequence among individuals that can be used as markers for linkage analysis.
Potential energy The energy held by an object because of its position relative to other objects, stresses within itself, its electric charge, or other factors.
Precursor A substance from which another is formed, especially by metabolic reaction.
Principal component regression (PCR) A regression analysis technique based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.
Probe Any biochemical that is labeled or tagged in some way so that it can be used to identify or isolate a gene, RNA, or protein.
Protein A macromolecule composed of one or more chains of amino acids. As a macronutrient, protein is essential to building muscle mass and is commonly found in animal products, though it is also present in other sources, such as nuts and legumes.
Protein Data Bank (PDB) A repository of experimentally determined 3D structures of proteins, each retrieved using an entry code (e.g., 3EDZ).

Protein folding The physical process by which a protein chain acquires its native 3-dimensional structure, a conformation that is usually biologically functional, in an expeditious and reproducible manner.
Proteomes The entire complement of proteins that is or can be expressed by a cell, tissue, or organism.
Proteomics Large-scale study of proteins.

Q
Quantum mechanics Science dealing with the behavior of matter and light on the atomic and subatomic scale. It attempts to describe and account for the properties of molecules and atoms and their constituents - electrons, protons, neutrons, and other more esoteric particles such as quarks and gluons.
Query sequence The amino acid sequence for which a 3D model is wanted. More commonly called the target sequence.

R
R factor Residual disagreement. Used in X-ray crystallography as a measure of agreement between the experimentally measured diffraction amplitudes and those calculated using the protein coordinates. Perfect agreement corresponds to an R factor of 0.0; total disagreement corresponds to an R factor of 0.59. Most good quality protein structures have R factors between 0.15 and 0.20.
Racemic mixture A mixture that has equal amounts of left- and right-handed enantiomers of a chiral molecule. The first known racemic mixture was racemic acid, which Louis Pasteur found to be a mixture of the two enantiomeric isomers of tartaric acid.
Ramachandran plot A scatterplot showing the disposition of backbone phi (φ) and psi (ψ) torsion angles for each residue in a protein or set of proteins. Certain combinations of φ and ψ angles are preferred strongly or are repeated over a series of residues, and these patterns can be easily detected in a Ramachandran plot.
Random forest method An ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
Raster images (bitmaps) A dot matrix data structure that represents a generally rectangular grid of pixels (points of color), viewable via a monitor, paper, or other display medium. Raster images are stored in image files with varying formats.
Receptor A region of tissue, or a molecule in a cell membrane, which responds specifically to a particular neurotransmitter, hormone, antigen, or other substance.
Recursive partitioning (RP) A statistical method for multivariable analysis. Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous independent variables.
Re-docking The process in which a ligand is taken from the structure of its complex with a receptor and docked back into the "induced-fit" form of the receptor, typically to validate a docking protocol.
Refractive index The ratio of the velocity of light in a vacuum to its velocity in a specified medium.
Regioselectivity The preference of one direction of chemical bond making or breaking over all other possible directions.
Resonance A way of describing bonding in certain molecules or ions by the combination of several contributing structures into a resonance hybrid in valence bond theory.
Restraint A penalty term that encourages, but does not force, bonds or angles to adopt particular values; restrained bonds or angles are still able to deviate from the desired values.
Retrosynthesis The process of "deconstructing" a target molecule into readily available starting materials by means of imaginary breaking of bonds (disconnections) and by the conversion of one functional group into another (functional group interconversions).
RMSD In bioinformatics, the root-mean-square deviation of atomic positions (or simply root-mean-square deviation, RMSD) is the measure of the average distance between the atoms (usually the backbone atoms) of superimposed proteins.
RNA Ribonucleic acid, a polymeric molecule essential in various biological roles in the coding, decoding, regulation, and expression of genes.
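The RMSD defined above is a direct calculation once the two coordinate sets are matched atom-for-atom; a minimal sketch that assumes the structures are already superimposed (no rotational/translational fitting is performed), with made-up coordinates:

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of 3D
    coordinates. Assumes the structures are already superimposed."""
    n = len(coords_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / n)

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]  # second atom shifted by 1 unit in y
print(rmsd(a, b))  # -> sqrt(0.5), about 0.707
```

In docking validation, an RMSD below roughly 2 Å between the re-docked pose and the crystallographic pose is a common (though heuristic) success criterion.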

Robo-Chemist A robot that uses artificial intelligence (AI) to discover new molecules. The machine learns to create chemical reactions that could lead to new medicines and materials. It works initially with a human chemist to find promising new types of molecules.
Robustness The quality or condition of being strong and in good condition.

S
Selectivity The selectivity of bioinformatics similarity search algorithms is defined as the significance threshold for reporting database sequence matches. As an example, for BLAST searches, the parameter E is interpreted as the upper bound on the expected frequency of chance occurrence of a match within the context of the entire database search. E may be thought of as the number of matches one expects to observe by chance alone during the database search.
Semiempirical Partly empirical; especially, involving assumptions, approximations, or generalizations designed to simplify calculation or to yield a result in accord with observation.
Sequence alignment A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Server A computer or computer program which manages access to a centralized resource or service in a network.
Side effect An effect, whether therapeutic or adverse, that is secondary to the one intended; although the term is predominantly employed to describe adverse effects, it can also apply to beneficial, but unintended, consequences of the use of a drug.
Similarity search The most general term used for a range of mechanisms which share the principle of searching (typically, very large) spaces of objects where the only available comparator is the similarity between any pair of objects.
Single nucleotide polymorphism (SNP) Alleles that are represented by single-base changes in a DNA sequence.
SMILES The simplified molecular-input line-entry system, a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.
Solvent-accessible surface (SAS) The surface area of a biomolecule that is accessible to a solvent.
Solvent-excluded surface (SES) A popular molecular representation that gives the boundary of the molecular volume with respect to a specific solvent. SESs depict which areas of a molecule are accessible by a specific solvent, which is represented as a spherical probe.
Steric effect Nonbonding interactions that influence the shape (conformation) and reactivity of ions and molecules. Steric effects complement electronic effects, which usually dictate shape and reactivity. Steric effects result from repulsive forces between overlapping electron clouds.
Structure-activity relationship (SAR) The relationship between the chemical structure of a molecule and its biological activity.
Structure-based drug design The design and optimization of a chemical structure with the goal of identifying a compound suitable for clinical testing - a drug candidate. It is based on knowledge of the drug's three-dimensional structure and how its shape and charge cause it to interact with its biological target, ultimately eliciting a medical effect.
Support vector machines (SVMs) Supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
Support vector regression (SVR) A regression method which gives the flexibility to define how much error is acceptable in the model and will find an appropriate line (or hyperplane in higher dimensions) to fit the data.
SWISS-PROT A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

Synonymous SNP Substitutions in coding regions that result in the same amino acid.
Synthon A constituent part of a molecule to be synthesized which is regarded as the basis of a synthetic procedure.

T

Tanimoto distance  Often used, erroneously, as a synonym for Jaccard distance. Unlike Jaccard distance, the Tanimoto distance in its general form is not a proper distance metric; it is often stated to be one, probably because of its confusion with Jaccard distance.

Template protein  An empirically determined 3D protein structure with significant sequence similarity to the query.

Tertiary structure  The arrangement or positioning of secondary structure elements into compact, nonoverlapping globules or domains. Tertiary structures are the three-dimensional structures of proteins and RNAs.

Thermodynamics  The branch of physical science that deals with the relations between heat and other forms of energy (such as mechanical, electrical, or chemical energy) and, by extension, with the relationships between all forms of energy.

Time-of-flight mass spectrometer (TOF-MS)  A mass spectrometer that measures mass-to-charge ratios by the time required to traverse a set distance.

Topology  The map or plan of a physical system or set of connected objects. The topology of proteins is generally described by their backbone tertiary (three-dimensional) structure.

Torsion angle  Also known as a dihedral angle; formed by three consecutive bonds in a molecule and defined by the angle created between the two outer bonds. The backbone of a protein has three different torsion angles.

Torsional energy  The energy it takes to overcome torsional strain, or the difference in energy between eclipsed and staggered conformations.

Toxicity  The degree to which a chemical substance or a particular mixture of substances can damage an organism.

Transient interactions  Protein interactions that are formed and broken easily; they are important in many aspects of cellular function and include interactions between globular domains as well as between globular domains and short peptides or disordered regions.
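For binary fingerprints, the Tanimoto coefficient coincides with the Jaccard index, so one minus the coefficient is exactly the Jaccard distance (which is a metric). A minimal sketch, representing each fingerprint as the set of its "on" bit positions:

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two binary fingerprints,
    each given as a set of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9}
sim = tanimoto_similarity(fp1, fp2)  # 3 shared bits / 5 distinct bits = 0.6
dist = 1.0 - sim                     # Jaccard distance: 0.4
```

The confusion described in the entry arises when extended or logarithmic forms of the Tanimoto "distance" are assumed to inherit this metric property, which they do not.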

U

Unitary matrix  Also known as an identity matrix; a scoring system in which only identical characters receive a positive score (NCBI).

Up-and-down  The simplest topology for a helical bundle or folded leaf, in which consecutive helices are adjacent and antiparallel; it is approximately equivalent to the meander topology of a beta-sheet (SCOP).

URL  Uniform resource locator. Used within web browsers, URLs specify both the type of site being accessed (FTP, Gopher, or Web) and the address of the website.

User  The person using client-server or other types of software.
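As a concrete illustration of unitary-matrix scoring, a sketch that scores an ungapped alignment by awarding +1 only to identical characters (the sequences are arbitrary examples):

```python
def unitary_score(seq_a, seq_b):
    """Score an ungapped alignment with a unitary (identity) matrix:
    identical characters score 1, all other pairs score 0."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)

score = unitary_score("HEAGAWGHEE", "PEAGAWGQEE")  # 8 identical positions
identity = score / 10                              # 80% sequence identity
```

Substitution matrices such as BLOSUM generalize this by also giving nonzero scores to chemically similar, non-identical residues.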

V

Validation  The action of checking or proving the validity or accuracy of something.

Van der Waals surface (VWS)  An abstract representation or model of a molecule, illustrating where, in very rough terms, a surface might reside based on the hard cutoffs of the van der Waals radii of the individual atoms; it represents a surface through which the molecule might be conceived as interacting with other molecules.

Vector graphics  Computer graphics images that are defined in terms of 2D points, which are connected by lines and curves to form polygons and other shapes.

Virtual libraries  The creation and storage of vast collections of molecular structures in an electronic database. These databases may be queried for subsets that exhibit specific physicochemical features, or may be "virtually screened" for their ability to bind a drug target. This process may be performed prior to the synthesis and testing of the molecules themselves.

Virtual screening  A computational technique used in drug discovery to search libraries of small molecules in order to identify the structures most likely to bind to a drug target, typically a protein receptor or enzyme.

Volume of distribution (VD)  The theoretical volume that would be necessary to contain the total amount of an administered drug at the same concentration at which it is observed in the blood plasma.
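The volume of distribution follows directly from the definition above: the amount of drug in the body divided by the observed plasma concentration. A small sketch with hypothetical numbers:

```python
def volume_of_distribution(dose_mg, plasma_conc_mg_per_l):
    """VD = total amount of drug in the body / plasma concentration.
    For an intravenous bolus, the dose approximates the amount in
    the body at time zero."""
    return dose_mg / plasma_conc_mg_per_l

# A hypothetical 500 mg IV dose producing a 10 mg/L plasma
# concentration implies a VD of 50 L, i.e., substantial distribution
# into tissues beyond the ~3 L of plasma:
vd_litres = volume_of_distribution(500.0, 10.0)  # 50.0
```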

W

Water mapping  Mapping the locations and thermodynamic properties of the water molecules that solvate protein binding sites; this offers rich physical insight into the properties of the pocket and quantitatively describes the hydrophobic forces driving the binding of small molecules.

Wave function  A function that satisfies a wave equation and describes the properties of a wave.

Website  A collection of web pages and related content that is identified by a common domain name and published on at least one web server.

World Wide Web (WWW)  A document delivery system capable of handling various types of nontext-based media.

X

X-ray crystallography  The experimental science of determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions.

Y

Y-randomization  A tool used in the validation of QSPR/QSAR models, whereby the performance of the original model in data description (r2) is compared to that of models built for a permuted (randomly shuffled) response, based on the original descriptor pool and the original model-building procedure.
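The procedure can be sketched in a few lines: shuffle the response, rebuild the model, repeat, and check that every scrambled model describes the data far worse than the original. A minimal stdlib-only sketch for a one-descriptor model scored by r2 (the data here are synthetic):

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation: the r2 of a one-descriptor linear model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

random.seed(0)
x = [float(i) for i in range(20)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]  # genuine relationship

r2_original = r_squared(x, y)

# Y-randomization: permute the response many times and refit.
r2_permuted = []
for _ in range(100):
    y_perm = y[:]
    random.shuffle(y_perm)
    r2_permuted.append(r_squared(x, y_perm))

# The original model should clearly outperform every scrambled model;
# if it did not, the original r2 would likely be a chance correlation.
assert r2_original > max(r2_permuted)
```

In a real QSAR setting the same logic is applied with the full descriptor pool and the full model-building procedure, not a single descriptor.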

Z

Zinc-finger protein  A DNA-binding protein containing a secondary structural feature stabilized by a zinc atom.

Z-matrix  A simple geometrical representation that identifies each atom in a molecule by a bond distance, bond angle, and dihedral angle (the so-called internal coordinates) in relation to other atoms in the molecule.

Z-score, Z-value  A measure of the distance of a value from the mean of a normal (Gaussian) distribution, in standard-deviation units. A Z-score of one means the value is one standard deviation away from the mean; a Z-score of four indicates the value is four standard deviations away from the mean and has less than a 0.01% chance of occurring randomly.

α-crystallin  A water-soluble structural protein found in the lens and the cornea of the eye, accounting for the transparency of the structure.
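The Z-score is simply (value − mean)/standard deviation, and the corresponding normal tail probability can be obtained from the complementary error function, which makes the "less than 0.01%" figure for a Z-score of four easy to verify:

```python
import math

def z_score(value, mean, std):
    """Distance of a value from the mean, in standard-deviation units."""
    return (value - mean) / std

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable, via erfc."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

assert z_score(120.0, 100.0, 10.0) == 2.0
# A Z-score of 4 has well under a 0.01% chance of arising randomly:
assert normal_upper_tail(4.0) < 1e-4
```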

Index

Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

A AAS. See Amino acid substitution (AAS) Ab initio method, 22 23 Abalone, 16 ABC. See ATP-binding cassette (ABC) Absorption, 305 ACAT. See Advanced compartmental absorption and transit (ACAT) Accelerated molecular dynamics (aMD), 173 AcquaAlta method, 185 Active Analog Approach, 207 Active transporters, 302 ADF, 16 ADME-tox prediction, 213 ADMET tools ADMET-related issues, 300 computational tools for, 314 315 drug metabolism, 311t, 312t drug transportation, 308t PK parameters, 304f PK/ADME, 299 prediction of ADME, 304 313 of physicochemical properties, 301 304 ADs. See Applicability domains (ADs) ADUN, 16 Advanced compartmental absorption and transit (ACAT), 305

Advanced free-energy calculations using MD simulations, 170 171 Alanine scanning, 267 268 Algorithms, 163 164 Beeman’s, 164 integration, 163 leap-frog, 164 Velocity Verlet, 164 Verlet, 163 Alignment algorithms, 214 215 correction in homology modeling, 113 Allosteric binding sites, 169 Allosteric effects, 268 269 ALogP, 43 AMBER. See Assisted model building refinement (AMBER) aMD. See Accelerated molecular dynamics (aMD) Amino acid modifications, 93 replacement probability, 289 residues, 287 Amino acid substitution (AAS), 285 AMOEBA polarizable atomic multipole force field, 16 Anchor and-grow method, 133 134 ANNs. See Artificial Neural Networks (ANNs) Antibody antigen complexes, 271 272 APD. See Applicability domains (ADs)


Apol. See Atomic polarizabilities (Apol) Applicability domains (ADs), 37, 40, 347 348 AQUARIUS, 184 ARChemRoute Designer, 329 Arginine, 275 Artificial Neural Networks (ANNs), 39, 342 Ascalaph designer, 16 ASPARQL endpoint, 89 90 Assisted model building refinement (AMBER), 12, 15 ATB. See Automated Topology Builder (ATB) Atom records (ATOM), 98 Atomic polarizabilities (Apol), 42 ATP-binding cassette (ABC), 307 ATTRACT, 269 271 AUTODOCK docking method, 187 AutoDock software, 143 144 AutoDockVina, 144 Automated pharmacophore generation methods, 216 230. See also Automated pharmacophore generation methods ChemX/Chem Diverse, 226 227 feature trees, 229 230 field-based methods, 225 226 geometry-and feature-based methods, 216 224 OSPPREYS, 226 227 pharmacophore fingerprints, 226 PharmPrint, 226 227

Automated pharmacophore generation methods (Continued) SCAMPI, 227 228 THINK, 228 229 3D keys, 226 227 tuplets, 226 227 Automated Topology Builder (ATB), 16 Automatic annotation approach, 94

B B-factor, 95 96 BA. See Biological activity (BA) Backbone building, 113 114 Backward elimination, 336 BAliBASE method, 112 113 Basic local alignment search tool (BLAST), 97 search, 85 BBB penetration. See Blood brain barrier penetration (BBB penetration) BCRP. See Breast cancer resistance protein (BCRP) BCS. See Biopharmaceutical Classification System (BCS) BE matrices. See Bond-electron matrices (BE matrices) Beeman’s algorithm, 164 Benchmarking, 256 β1 adrenergic receptors (β1AR), 169 β2 adrenergic receptors (β2AR), 169 BFS. See Breadth-first search (BFS) BID. See Binding interface database (BID) Binary kernel discrimination, 67 Binding interface database (BID), 275 276 BindingDB, 65 68, 68f, 100 availability, 68 binary kernel discrimination, 67 description, 66 maximum similarity, 67 SVM, 67 68 Bioactive compounds, 256

BioAssay information system, 80 Bioavailability, 306 Biochemical compounds, 69 Bioinformatics tools, 285 Bioinformatics cheminformatics database, 83 Biological data, 35 Biological Process, 95 96 Biological activity (BA), 31 32 BioMagResBank, 95 Biopharmaceutical Classification System (BCS), 301 Biotherapeutics, 73 Bitmaps. See Raster images Black box, 74 75 BLAST. See Basic local alignment search tool (BLAST) Blind docking, 131 Blood brain barrier penetration (BBB penetration), 306 307 BMEC. See Bovine microvessel endothelial cell (BMEC) BoBER, 262 Bond angles, 3, 95 96 Bond lengths, 3, 95 96 Bond-electron matrices (BE matrices), 328 329 Bootstrapping, 343 344 Bovine microvessel endothelial cell (BMEC), 305 Bovine pancreatic trypsin inhibitor (BPTI), 157 Breadth-first search (BFS), 327 328 Breast cancer resistance protein (BCRP), 304 305 BREED method, 258, 259f ‘Browse Database’ option, 95 96

C C3-C4 cycloalkanes, 5 Cambridge Structural Database (CSD), 184 185 Candidate genes, 290 Canonical Ensemble (NVT), 159 Carbon-carbon double bond, 3 Carcinogenic Potency Database (CPDB), 314 Cartesian coordinate, 2, 2f

CAS. See Chemical abstracts service (CAS) Catalyst models, 219 222 CATH. See Class. Architecture Topology and Homologous (CATH) Cathode ray tubes (CRTs), 7 CATS correlation vectors, 261 CAVEAT method, 259 Cavity “waters”, 187 192 free energy calculations, 190 192 molecular docking, 187 189 molecular dynamics, 189 190 CDK2. See Cyclin-dependent kinase 2 (CDK2) Cellular Component, 95 96 Cellular processes, 19 Central nervous system (CNS), 305 Cerebro-spinal fluid (CSF), 306 CFF92. See Consistent force field 92 (CFF92) cFGM. See Computational functional group mapping (cFGM) Charged partial surface area (CPSA), 10 11 CHARMm. See Chemistry at Harvard Macromolecular Mechanics (CHARMm) ChEBI. See Chemical Entities of Biological Interest (ChEBI) Chematica, 326 ChEMBL, 73 76 Chemical abstracts service (CAS), 71 73 Chemical Entities of Biological Interest (ChEBI), 68 72 ID, 70 Chemical markup language (CML), 69 Chemical ontology, 70 Cheminformatic tools computer assisted prediction of synthetic schemes, 322 324 template-based approaches, 321 tools for, 326 329

validating selected synthetic route, 324 326 Chemistry at Harvard Macromolecular Mechanics (CHARMm), 15 ChemQuery tool, 85 ChemSpider, 72 73 chemTree, 314 ChemX/Chem Diverse, 226 227 CHIRON program, 329 Class, Architecture, Topology and Homologous (CATH), 95 96, 109 Classical mechanics, 160 162 Clearance, 310 CliBE. See Computed Ligand Binding Energy database (CliBE) Clostridium pasteurianum, 172 ClustalW program, 112 Cluster analysis, 334 CML. See Chemical markup language (CML) CNS. See Central nervous system (CNS) Color coded plastic tubes, 4 5 Color-blind friendly color schemes, 99 Comparative molecular field analysis (CoMFA), 49 52, 53f Comparative molecular similarity indices analysis (CoMSIA), 49, 52 54 Comparative protein modeling, 107 Components regression (CR), 333 334 Compound database, 79 80 Computational functional group mapping (cFGM), 243 244 Computational prediction of protein binding sites, 269 275, 270f binding site prediction based on protein sequence, 271 based on protein structure, 271 274 energy-based methods, 274 275

protein protein docking, 269 271 Computed Ligand Binding Energy database (CliBE), 103 Computer graphics, 1, 3 4, 6 Computer models, 6 7, 6f Computer-aided drug-design, 65 CoMSIA. See Comparative molecular similarity indices analysis (CoMSIA) Concave patches, 9 Convex spherical patches, 9 Conformational entropy, 169 170 Conformational expansion analysis, 206 207 of molecules, 225 226 Conformational sampling, 171 Conformational search, 206 207 Conjugate gradient methods, 14 15 Consensus scoring, 137 Consistent force field 92 (CFF92), 15 Consistent valence force field (CVFF), 15 Consolv, 185 Constrained docking, 137 Constraints, 108 Construction methods, 115, 243 ConSurf (interface prediction method), 272 Coordinate systems, 1 2 CorelDraw, 3 4 Correlation coefficient, 345 COX-2 inhibitors. See Cyclooxygenase II inhibitors (COX-2 inhibitors) CP2K, 17 CPDB. See Carcinogenic Potency Database (CPDB) CPK models, 4 5, 5f CPSA. See Charged partial surface area (CPSA) CR. See Components regression (CR) Cross-link and glycosylation, 93


Cross-validation (CV), 342 343 CRTs. See Cathode ray tubes (CRTs) Cryptic binding sites, 169 CSD. See Cambridge Structural Database (CSD) CSF. See Cerebro-spinal fluid (CSF) CV. See Cross-validation (CV) CVFF. See Consistent valence force field (CVFF) Cyclin-dependent kinase 2 (CDK2), 258 259 Cyclohexane, 2 Cyclooxygenase II inhibitors (COX-2 inhibitors), 256 257 Cytochrome P450 3A4 (CYP3A4), 305 Cytochrome P450cam enzyme (CYP450cam enzyme), 190 191

D Data analysis, 37 39 methods, 334 Data set preparation, 205 206 Data-rich molecular biology, 83 Database Browser displays molecules, 79 Database methods, 115 Database mining, 1 dbSNPID, 292 DDBJ/EMBL/GenBank, 2D PAGE, 90 De novo constants. See Indicator variables De novo designing, 211 Deconstruction approach, 243 Dermal penetration, 309 Desolvation energy descriptor, 274 Desulfovibrio desulfuricans, 172 Digital object identifier (DOI), 72 73, 75 Dihedral angle, 95 96 Dipole moment, 43 DISCO method, 216 217 Dissociation constant. See Ionization constant Distance threshold (DT), 40

Distributed Structure-Searchable Toxicity (DSSTox), 314 Disulfide bond, 93 Dixon, 260 261 DNA, 7 8 DNA-binding regions, 273 DOCK software, 143 DOCKGROUND, 267 Docking program, 169 Docking score, 131 DOI. See Digital object identifier (DOI) Domain domain interactions, 269 Domain domain interface formation, 291 Domains, 109 Double-decoupling method, 186, 191 DOWSER, 186 187 Dreiding models, 5, 6f Drill-down distributions, 95 96 Driver torsion, 139 Drug designing process, 7 8 discovery process, 255 256 target phenotypes, 285 toxicity, 312 313, 313t Drug and Drug-Target Mapping, 95 96 Drug-like bioactive compounds, 73 DrugBank, 82 86 Drug drug metabolic enzymes, 82 83 Druglikeliness Rule's, 315 DSSTox. See Distributed Structure-Searchable Toxicity (DSSTox) DT. See Distance threshold (DT) Dual processor Xeon 2.4 GHz server, 78 Dummy variables. See Indicator variables

E E-value. See Expected value (E-value) EBI. See European Bioinformatics Institute (EBI) ECF. See Extracellular fluid (ECF)

Electron microscopy technology, 89 Electronic parameter, 42 43 Electronic polarization, 172 Electrostatic and desolvation, 277 278 Electrostatic attraction/repulsion, 12 14 ‘Email us’ link, 68 Empirical function, 276 277 Empirical scoring, 271 272 functions, 136 Empirical valence bond (EVB), 16 ENCAD. See Energy calculation and dynamics (ENCAD) Energy-based docking simulations, 274 275 Energy-based methods, for predicting protein binding sites, 274 275 Energy-based scoring function, 277 Energy calculation and dynamics (ENCAD), 22 Energy minimization, 14 15, 19 Energy scoring schemes, 269 271 Ensemble, 159 Ensemble average, 159 Enzyme Commission number, 95 96 Enzyme-catalysed reactions, 69 Enzyme-specific information, 90 Equilibrium geometries, 14 ESFF. See Extensible systematic force field (ESFF) ET method. See Evolutionary trace method (ET method) European Bioinformatics Institute (EBI), 69 EVB. See Empirical valence bond (EVB) Evolutionary trace method (ET method), 272 Expected value (E-value), 111 Expert literature-based curation, 92 Extended Electron Distribution (XED), 16, 225 226 Extensible systematic force field (ESFF), 15

External validation, 40 41, 345 348 Extracellular fluid (ECF), 306

F F-value, 345 FABP. See Fatty acid-binding protein (FABP) Fast Fourier transform (FFT), 269 271 Fatty acid-binding protein (FABP), 190 FBDD. See Fragment based drug design (FBDD) FDA. See Food and Drug Administration (FDA) Feature extraction, 207 Feature space, 38 39 Feature trees, 229 230 Feed-forward neural network, 293 FEP method. See Free energy perturbation method (FEP method) FFT. See Fast Fourier transform (FFT) Field-based algorithms. See Property-based algorithms Field-based methods, 225 226 FITTED docking method, 188 Flat-file table dumps, 71 Flexible docking, 138 139 FlexX, 144 samples, 188 FLOG software, 133, 137, 143 Focused template methods, 323 324 FOLD-X Energy Function (FOLDEF), 277 Food and Drug Administration (FDA), 74 75, 82 83, 256 257 drug, 100 Force fields, 12 16 classes of force field methods, 15 force field-based scoring functions, 135 136 Forward selection, 336

Fragment fragment-based molecular evolutionary approach, 242, 242f fragment-like compounds, 79 growing, 240 linking, 239 240 merging, 239 240 optimization, 239 240 Fragment based drug design (FBDD), 235 236, 237f advancements in, 241 245 fragment screening protocol, 238f fragments converting, 239 241 hit identification and validation, 241 limitations, 245 strategy for, 236 241 techniques for finding fragments, 236 239 Free energy, 170 calculations, 190 192 Free energy of binding (ΔG), 181 Free energy perturbation method (FEP method), 190 191 Free Wilson analysis, 29, 47 48 FTDock, 269 271, 277 278 FTMAP, 169 Functional groups, 203 204

G G/PLS. See Genetic partial least squares (G/PLS) GALAHAD program, 218 219 GASP. See Genetic Algorithm Superposition Program (GASP) Gastrointestinal tract (GI tract), 301 GastroPlus, 314 Gaussian distribution, 162 GEM. See Global Energy Minimum (GEM) Gene bank, 109 Gene ontology (GO), 289 290 Gene gene interaction networks, 290

Genetic Algorithm Superposition Program (GASP), 217 218 Genetic algorithms, 134 135 Genetic function approximation (GFA), 333 334, 340 Genetic Optimization for Ligand Docking (GOLD), 144 docking method, 189 Genetic partial least squares (G/PLS), 333 334, 340 Genetic programming, 306 Genomic variants, 285 Geometric hashing algorithm, 269 271 Geometry-and feature-based methods, 216 224 GFA. See Genetic function approximation (GFA) GI tract. See Gastrointestinal tract (GI tract) ‘Ginzu’ method, 118 119 GLIDE. See Grid-based ligand docking with energetics (GLIDE) Glide designs, 144 Global Energy Minimum (GEM), 206 Global RAnge Molecular Matching (GRAMM-X), 269 271 GNUStep-Cocoa, 16 GO. See Gene ontology (GO) GOLD. See Genetic Optimization for Ligand Docking (GOLD) GPU. See Graphics processing unit (GPU) GRAMM-X. See Global RAnge Molecular Matching (GRAMM-X) Grand canonical Ensemble (mVT), 159 Graphics processing unit (GPU), 11 12, 173 hardware, 11 12 GRID method, 184, 214 215 Grid-based ligand docking with energetics (GLIDE), 188 Grid-based Monte Carlo method, 184 Gromacs, 12


H H-bond acceptor (HBA), 45 H-bond donor (HBD), 45 Haddock, 269 271 Hansch analysis, 29 Hansch models, 45 47 HBA. See H-bond acceptor (HBA) HBD. See H-bond donor (HBD) Hetero-obligomers, 269 Hetero-oligomeric complex, 269 Heterocomplexes, 269, 271 272 Hidden Markov Models (HMM), 113 HMM-based statistical modeling methods, 288 289 High throughput screening (HTS), 235 High-throughput (HT), 300 Highest occupied molecular orbital (HOMO), 36, 43, 46 HINT analysis, 185 HipHop algorithm, 220 HMM. See Hidden Markov Models (HMM) HOMO. See Highest occupied molecular orbital (HOMO) Homo-obligomers, 269 Homo-oligomeric complex, 269 Homocomplexes, 269 Homodimers, 271 272 Homologous protein, 107 108 Homology modeling, 110f, 168 alignment correction, 113 backbone building, 113 114 comparative protein modeling, 107 homologous sequence, 107 108 ligand modeling, 116 loop modeling, 114 115 methodology of, 110 118 model optimization, 116 117 model validation, 117 118 side-chain modeling, 116 software for, 118 121 template recognition and initial alignment, 111 113 Hot-spots, 275 prediction

Hot-spots (Continued) in protein protein interactions, 276 278 on sequence, 276 on structure, 276 277 types of protein protein interaction regions, 269 on unbound protein structure, 277 278 residues at protein interfaces, 275 276 HT. See High-throughput (HT) HTS. See High throughput screening (HTS) Hooke's law of mechanics, 12 14 HUGO Gene ID, 292 Human protein sequences, 291 Hybrid QM-MM COSMOS-NMR force field, 16 Hydrogen, 3 bonding, 277 278, 302 303 bonds, 255 256 Hydrophobic parameters, 43 44 Hydrophobicity, 8 9 HypoGen algorithm, 220 221

I IAM. See Immobilized artificial membranes (IAM) ICM. See Internal Coordinates Mechanics (ICM) Identity, 108 IFREDA, 144 145 IFST. See Inhomogeneous fluid solvation theory (IFST) ILC. See Immobilized liposome chromatography (ILC) Illustrator (program), 3 4 Immobilized artificial membranes (IAM), 303 304 Immobilized liposome chromatography (ILC), 303 304 In silico modeling, 333 evaluation, 341 348 external validation, 345 348 internal validation, 341 345 virtual screening validation, 348 In situ fragment assembly, 239

In-silico prediction, 300 301 techniques, 285 tools, 286 In-silico SNP analysis sequence-based approaches to SNP analysis, 286 to SNP analysis, 286 287 sequence-based prediction tools, 287 290 structure-based prediction tools, 290 295 InChI. See International chemical identifier (InChI) Incremental construction, 133 134 Indicator variables, 36 Indicator variables, 45 “Induced-fit” theory, 132 Industrial assay vendors, 79 80 Inhomogeneous fluid solvation theory (IFST), 184 InkScape (program), 3 4 INSDC. See International Nucleotide Sequence Database Consortium (INSDC) Integration algorithms, 163 IntEnz-the integrated relational enzyme database, 69 Inter-ProSurf, 272 Interaction Sites Identified from Sequence (ISIS), 271 Internal Coordinates Mechanics (ICM), 138 ICM-DISCO, 269 271 pseudo-Brownian rigid-body docking search, 277 278 Internal validation, 341 345 International chemical identifier (InChI), 69, 75 International Nucleotide Sequence Database Consortium (INSDC), 91 Intradomain, 269 Ionization constant, 303 ISIS. See Interaction Sites Identified from Sequence (ISIS) Isobaric-Isothermal Ensemble (NPT), 159

Isothermal titration calorimetry (ITC), 100 IUPAC name, 70

J Java molecular editor (JME), 78 79 JChem chemical fingerprints, 67 Joint Evolutionary Trees method (JET method), 272

K K-nearest neighbors genetic algorithm, 185 Kegg Pathway ID, 292 Kernel trick, 38 39 Knowledge-based functions, 271 272 scoring functions, 136 Kynostatin 272 (KNI-272), 181 Kyoto encyclopedia of genes and genomes (KEGG), 70, 72

L LAMA program, 112 Large-scale human SNP (LS-SNP), 291 292 LDA. See Linear discriminant analysis (LDA) Lead-like compounds, 79 Lead-like molecules, 78 “Leapfrog” method, 19 21, 164 Least squares fit, 341 Leave many out method (LMO method), 342 Leave-one-out method (LOO method), 343 LEM. See Local Energy Minima (LEM) Leucine, 275 LHASA. See Logic and Heuristics Applied to Synthetic Analysis (LHASA) Ligand, 17, 169, 255 256 chemical component, 99 modeling, 116 recombination of ligand fragments, 258 260

Index reporting and visualization, 98 Ligand preparation (LigPrep), 222 ‘Ligand Summary’ pages, 99 Ligand-based pharmacophore modeling, 204. See also Pharmacophore modeling LigPrep. See Ligand preparation (LigPrep) Linear discriminant analysis (LDA), 37 38 Lipophilicity, 303 304 LMO method. See Leave many out method (LMO method) Local Energy Minima (LEM), 206 Local Move Monte Carlo loop sampling (LMMC loop sampling), 139 “Lock and key” mechanism, 132 Logic and Heuristics Applied to Synthetic Analysis (LHASA), 327 logP. See Partition coefficient (logP) LOO method. See Leave-one-out method (LOO method) Loop modeling, 114 115 Loop motions, 157 158 Loop prediction methods, 115 LS-SNP. See Large-scale human SNP (LS-SNP) LUDI, 134 LUMO, 36, 43, 46

M M-CASE program. See Multiple computer-automated structure evaluation program (M-CASE program) Machine learning techniques, 273, 290 291 Macromolecular receptor, 255 256 Macromolecule, 7 8 Madin-Darby canine kidney (MDCK), 305 MAF. See Minor allele frequency (MAF) Manual curation progress, 92 94

MAPP. See Multivariate analysis of protein polymorphism (MAPP) MAPPIS. See Multiple Alignment of Protein-Protein InterfaceS (MAPPIS) Mass spectrometry (MS), 236 238 Maximum similarity, 67 Maxwell distribution, 168 Maxwell-Boltzmann, 162 MC method. See Monte Carlo method (MC method) MCSS. See Multiple copy simultaneous search (MCSS) MCTS framework. See Monte Carlo Tree Search framework (MCTS framework) MD. See Molecular dynamics (MD) MDCK. See Madin-Darbycanine kidney (MDCK) MDL®Metabolite Database, 314 MDL®Toxicity Database, 314 MDR. See Multiple-drug resistance (MDR) MDS. See Multidimensional scaling (MDS) ME. See Mutation sequence environment (ME) Mechanical state of system, 158 159 Membrane-interaction quantitative structure activity relationships (MI-QSAR), 302 Memory-band width, 11 12 Meta-PPISP, 273 274 Meta-servers, 273 274 Metabolism, 310 311 MetaSite, 314 MI-QSAR. See Membraneinteraction quantitative structure activity relationships (MI-QSAR) Microcanonical ensemble (NVE), 159 Microscopic state of system, 158 159


Minor allele frequency (MAF), 285 Mixed approach, 48 49 Mixed quantum mechanicalclassical simulations, 157 MLR. See Multiple linear regression (MLR) MLSMR. See NIH Molecular Libraries Small Molecule Repository (MLSMR) MM. See Molecular mechanics (MM) ModBase, 109 MODELLER program, 119 120 Modified residue, 93 MOE. See Molecular Operating Environment (MOE) MoKa, 315 Molar refractivity (MR), 44 Molconn-Z (eduSoftLC), 36, 67 68 Molecular alignments, 213 214 Molecular biology, 1 Molecular docking, 187 189 analysis, 131 outline for docking analysis, 140f software available for, 143 145 standard methodology for, 139 143 theory of docking, 132 137 types of, 137 139 Molecular dynamics (MD), 1, 12, 18 22, 20f, 115, 189 190, 269 271 applications of, 21 computer programs for MD calculations, 22 Molecular dynamics simulations (MD simulations), 157 158, 193, 268 269 applications in drug discovery, 168 171 advanced free-energy calculations, 170 171 identifying cryptic and allosteric binding sites, 169

374 Index Molecular dynamics simulations (MD simulations) (Continued) improving computational identification of smallmolecule binders, 169 170 limitations, 172 176 interaction diagram after, 175f principles, 158 164 algorithms, 163 164 calculating averages, 159 160 classical mechanics, 160 162 definitions, 158 159 RMSD plot after Simulation Event Analysis, 175f steps, 165 168, 166f energy minimization, 167 168 equilibration at constant temperature, 168 heating simulation system, 168 initialization, 165 167 production stage of MD trajectory, 168 Molecular-dynamics-based drugdiscovery techniques, 171 Molecular-dynamics-based freeenergy calculations, 171 Molecular fingerprints, 36 Molecular Function, 95 96 Molecular mechanics (MM), 1, 8 9, 12 18, 115 classes of force field methods, 15 computer programs that predominantly for molecular mechanics calculations, 16 18 energy minimization, 14 15 force fields, 15 16 limitations of, 18 Molecular modeling, 1 computer graphics, 3 4 molecular models, 4 7 computer models, 6 7 CPK models, 4 5 dreiding models, 5 molecular surfaces, 7 11

molecular representation, 1 3 principles, 12 24 MD, 18 22 MM, 12 18 QM, 22 24 workstations, 11 12 Molecular Operating Environment (MOE), 223 224 Molecular recognition process, 31 Molecular representations, 1 3 Cartesian coordinate, 2 internal coordinate, 3 polar coordinate, 3 Molecular similarity method, 260 261 bit string representation for molecule, 260f Molecular surface area, 44 Molecular surfaces, 1, 6 11, 8f CPSA, 10 11 SAS, 8 9 SES, 9 VWS, 8 Molecular volume (Vm), 45 46 Molecular weight (MW), 45 46, 78 79 MolFit, 269 271 Molinspiration, 314 Monte Carlo method (MC method), 134, 183 184, 269 271 Monte Carlo Tree Search framework (MCTS framework), 323 Motif, 109 MR. See Molar refractivity (MR) MS. See Mass spectrometry (MS) MSA. See Multiple sequence alignment (MSA) MTDs. See Multi target directed drugs (MTDs) mtk. See Multitasking (mtk) MuD. See Mutation detector (MuD) Multi target directed drugs (MTDs), 212 Multi-targeting by pharmacophore, 212 Multidimensional scaling (MDS), 227

Multiple Alignment of Protein-Protein InterfaceS (MAPPIS), 276 277 Multiple computer-automated structure evaluation program (M-CASE program), 309 Multiple copy simultaneous search (MCSS), 134 Multiple linear regression (MLR), 37 38, 336 Multiple sequence alignment (MSA), 112 113, 118 119, 287 Multiple-drug resistance (MDR), 307 Multitasking (mtk), 244 computational model approach, 244 245 Multivariate analysis of protein polymorphism (MAPP), 288 Mutation, 290 291 Mutation detector (MuD), 294 295 Mutation sequence environment (ME), 289 MutPred (computation tool), 294 mVT. See Grand canonical Ensemble (mVT) MW. See Molecular weight (MW)

N Naı¨ve Bayesian approach, 272 273 NAMD software, 12, 194 National Center for Biotechnology Information (NCBI), 287 Network of chemistry (NOC), 326 Network Travel, 326 Neural network method, 276 NEWLEAD program, 211 Newton-Raphson procedure, 15 Newton’s equation of motion, 161 application, 161 Newton’s second law of motion, 19 Newtonian’s mechanics, 12 14 Next generation sequencing, 89

Index NIH Molecular Libraries Small Molecule Repository (MLSMR), 81 82 NIH Molecular Library Program, 81 82 NIP. See Normalized interface propensity (NIP) NMR. See Nuclear magnetic resonance (NMR) NOC. See Network of chemistry (NOC) Non-bonded atoms, 12 14 Non-covalent interactions, 10 11 Non-linear methods, 38 39 Non-obligate complex, 269 Non-obligate protein protein hetero-complexes, 274 Non-polar molecules, 8 9 Non-polar phase, 8 9 Nonsynonymous SNP (nsSNP), 285 Normalized interface propensity (NIP), 274 275 NPT. See Isobaric-Isothermal Ensemble (NPT) nsSNP. See Nonsynonymous SNP (nsSNP) Nuclear magnetic resonance (NMR), 168 169, 236 238, 267 Nucleic acid, 8 Nucleotide sequence databases, 90 NVE. See Microcanonical ensemble (NVE) NVT. See Canonical Ensemble (NVT)

O OAT. See Organic cation transporters (OAT) OATP. See Organic-aniontransporting polypeptide (OATP) Obligate complex, 269 OBO. See Open biomedical ontologies (OBO) OCSS. See Organic Chemical Simulation of Synthesis (OCSS) Ocular penetration, 309

ODA method. See Optimal docking area method (ODA method) OMIM. See Online Mendelian Inheritance in Man (OMIM) On-line Medical Dictionary database (OMD database), 103 ONC. See Optimum number of component (ONC) Online Mendelian Inheritance in Man (OMIM), 292 Open biomedical ontologies (OBO), 69 ontology format, 72 OpenEye’s Omega program, 77 78 OPLS 3. See Optimized Potential for Liquid Simulations 3 (OPLS 3) Optimal docking area method (ODA method), 272 Optimal docking desolvation energy, 274 Optimized Potential for Liquid Simulations 3 (OPLS 3), 16 Optimum number of component (ONC), 346 Oracle binary table dumps, 72 Organic cation transporters (OAT), 304 305 Organic Chemical Simulation of Synthesis (OCSS), 327 Organic-anion-transporting polypeptide (OATP), 304 305 Oriented Substituent Pharmacophore PRopErtY Space (OSPPREYS), 226 227 Over-fitting, 342

P
P-glycoprotein (P-gp), 304–305
PANTHER. See Protein analysis through evolutionary relationships (PANTHER)
Parepro, 289
Partial least squares (PLS), 37–38, 50–51, 223, 333–334
  regression analysis, 337–340, 338f
Partition coefficient (logP), 8–9
Patch Finder Plus, 273
Pattern identification, 208–209
PBVS. See Pharmacophore based virtual screening (PBVS)
PCA. See Principal components analysis (PCA)
PCH. See Polarity-Charged-Hydrophobicity (PCH)
PCR. See Principal component regression (PCR)
PDB. See Protein Data Bank (PDB)
PDB in Europe (PDBe), 95
PDB Japan (PDBj), 95
PDE5. See Phosphodiesterase enzyme type 5 (PDE5)
PDP. See Protein domain parser (PDP)
Peptide, 17
Permanent complex, 269
Permeability, 302
Pharmacokinetics (PK), 299, 304–305
  properties, 255–256
Pharmacophore based virtual screening (PBVS), 211
Pharmacophore modeling. See also Automated pharmacophore generation methods
  applications of, 211–213
  conformational search, 206–207
  data set preparation, 205–206
  feature extraction, 207
  methodology of, 205–213
  pattern identification, 208–209
  pharmacophore based hierarchical virtual screening, 212f
  pharmacophoric features, 208f, 208t
  process determinants for quality, 213–216
  scoring of model, 209–210
  validation of pharmacophore, 210–211
Pharmacophores, 203–204
  fingerprints, 226
  Pre-Processor, 224
PharmPrint, 226–227
PhD-SNP. See Predictor of human deleterious SNP (PhD-SNP)
PhDD program, 211
Phosphodiesterase enzyme type 5 (PDE5), 256–257
PiBase, 267
PIER method. See Protein intErface Recognition method (PIER method)
PINUP. See Protein INterface residUe Prediction (PINUP)
PIR protein database, 89–90
PK. See Pharmacokinetics (PK)
Plasma-protein binding (PPB), 309
PLS. See Partial least squares (PLS)
Polar and apolar desolvation energy, 77–78
Polar coordinate, 3
Polar surface area (PSA), 302–303
Polarity-Charged-Hydrophobicity (PCH), 224
Polymorphism phenotyping (PolyPhen), 290
Position-Specific Iterated BLAST (PSI-BLAST), 111
Post translational modifications (PTM), 90, 92–93
Potential energy, 163
PPB. See Plasma-protein binding (PPB)
PPI PRED. See Protein-Protein Interface PREDiction (PPI PRED)
PPISP. See Protein-Protein Interaction Site Prediction (PPISP)
PQS. See Protein quaternary structure (PQS)
PreADMET, 314
Predictor of human deleterious SNP (PhD-SNP), 289
Primary accession-SID, 79–80
Principal component regression (PCR), 37–38, 336–337
Principal components analysis (PCA), 333–334
PRISM. See Protein-protein Interaction prediction by Structural Matching (PRISM)
PROBCONS program, 112–113
PROCHECK validation program, 118
Prodrug, 74–75
Progressive alignment, 112
Property-based algorithms, 214–215
PROSAII validation program, 118
Protein analysis through evolutionary relationships (PANTHER), 288–289
Protein binding sites, computational prediction of, 269–275
Protein Data Bank (PDB), 65–66, 103, 119, 141, 184–185
Protein domain parser (PDP), 97–98
Protein intErface Recognition method (PIER method), 272
Protein INterface residUe Prediction (PINUP), 272–273
Protein quaternary structure (PQS), 267
Protein Structure Database, 258–259
Protein-ligand complexes, 179–180, 188, 192–193
Protein-protein Interaction prediction by Structural Matching (PRISM), 272–273
Protein-Protein Interaction Site Prediction (PPISP), 273
Protein-Protein Interface PREDiction (PPI PRED), 273
Protein(s), 7–8, 17, 66, 292
  aggregation, 293–294
  binding site prediction based on protein sequence, 271
  binding site prediction based on protein structure, 271–274
    empirical scoring, 271–272
    machine learning techniques, 273
    meta-servers, 273–274
    sequence conservation, 272–273
  comparison tool, 97–98
  databases, 258–259
    binding database, 100
    RCSB PDB, 95–100
    therapeutic target database, 101–104
    UniProt, 89–95
  fold, 109
  function, 267
  interaction networks, 268–269
  phenotype, 290
  sequence
    alignments, 288
    binding site prediction based on, 271
    knowledgebase, 89–90
Protein3D, 267
Protein-protein
  association, 274–275
  binding sites, 268–269
  docking, 269–271
    calculations, 277–278
  interactions, 99, 267, 275–278
  region types, 269
Proteome-wide analysis, 91
Proteomics standards initiative protein modification (PSI-MOD), 93
Pruning of tree, 39
PSA. See Polar surface area (PSA)
PSI-BLAST. See Position-Specific Iterated BLAST (PSI-BLAST)
PSI-MOD. See Proteomics standards initiative protein modification (PSI-MOD)
Psygene-G/myPresto, 12
PTM. See Post translational modifications (PTM)
PubChem, 79–82
PubChem BioAssay database, 79–80
PubChem BioAssay Summary service, 81

Q
QCPE. See Quantum chemical program exchange (QCPE)
QM. See Quantum mechanics (QM)
Quantitative structure-activity relationships (QSAR), 29, 65
  calculations for models, 41–45
  Free Wilson approach, 47–48
  Hansch models, 45–47
  in vitro systems, 30
  methodology, 32–41, 33f
    data analysis, 37–39
    data preparation, 34–37
    validation, 39–41
  mixed approach, 48–49
  principle, 31–32, 32f
  QSAR vs. 3D-QSAR, 54
  steps of QSAR modeling, 38f
  types of descriptors, 41–45
  3D QSAR analyses, 49–54
Quantum chemical program exchange (QCPE), 23
Quantum mechanics (QM), 1, 12, 22–24
  quantum mechanical energies, 19
  quantum-mechanical effects, 173

R
Radius of gyration (ROG), 44
Random Forest machine-learning algorithm, 294–295
Randomization test, 344–345
Rank algorithm, 185–186
Raster graphics, 3–4, 7
Raster images, 3–4
RCS. See Relaxed complex scheme (RCS)
RCSB PDB. See Research collaboratory for structural bioinformatics protein data bank (RCSB PDB)
RD. See Residue difference (RD)
Reactive forcefield (ReaxFF), 16
Real-time graphics, 7
ReaxFF. See Reactive forcefield (ReaxFF)
RECAP. See Retrosynthetic combinatorial analysis procedure (RECAP)
Receptor-ligand system, 170
Recombination of ligand fragments, 258–260
Recursive partitioning (RP), 39
Reference protein. See Homologous protein
Reference proteome cluster, 92
Regression coefficients, 345
Regression methods, 334–340
Relaxed complex scheme (RCS), 169
Reliability index, 292–293
Representative proteome group, 92
Research collaboratory for structural bioinformatics protein data bank (RCSB PDB), 95–100
  web services, 100
  web site features, 97–99
Residue difference (RD), 289
Residues listed in sequence (SEQRES), 98, 98f
Restraint, 109
Retrosynthesis, 321–323
Retrosynthetic combinatorial analysis procedure (RECAP), 235–236
Rigid docking, 137
RMS. See Root-mean-square (RMS)
RMSD. See Root-mean-square deviation (RMSD)
RMSE. See Root-mean squared error (RMSE)
RNA, 7–8
  read-out grows, 89
Robetta, 118–119
Rofecoxib, 256–257
ROG. See Radius of gyration (ROG)
Root-mean squared error (RMSE), 341
Root-mean-square (RMS), 215
Root-mean-square deviation (RMSD), 107–108, 116, 119
RosettaDock, 269–271
Rotamers, 116
Rotlbond (ROTBOND), 45
Royal society of chemistry (RSC), 72
RP. See Recursive partitioning (RP)
‘Rubbish in rubbish out’ prediction tool, 288
‘Rule-of-five’, 306–307

S
SAAS. See Statistical Automatic Annotation System (SAAS)
Saddle patches, 9
Sampling algorithms, 132–135
SAPRED, 293–294
SAPs. See Single amino acid polymorphisms (SAPs)
SAR. See Structure-activity relationship (SAR)
SAS. See Solvent-accessible surface (SAS)
SATCHMO algorithm, 112
SBDD. See Structure based drug designing (SBDD)
Scaffold hopping, 255–256, 257f
  computational approaches of, 256–261
  molecular similarity method, 260–261
  pharmacophore searching, 257–258
  recombination of ligand fragments, 258–260
Scaling-relaxation method, 115
SCAMPI. See Statistical Classification of Activities of Molecules for Pharmacophore Identification (SCAMPI)
Schrödinger (QikProp), 314
Schrödinger's equation, 22
Scientific Vector Language (SVL), 223
SCOP. See Structural classification of proteins (SCOP)
SCOPPI. See Structural classification of protein-protein interfaces (SCOPPI)
Scoring
  functions, 135–137, 135f
  of model, 209–210
Scrambling model. See Randomization test
Screening, 223
Screening for Ligands by Induced-Fit Docking (SLIDE), 189
Screening for non acceptable polymorphisms (SNAP), 292–293
SE. See Standard error (SE)
SECS. See Simulation and Evaluation of Chemical Synthesis (SECS)
Semi-automatic scheme, 294–295
Semiempirical methods, 23
seq2seq model. See Sequence-to-sequence model (seq2seq model)
SEQRES. See Residues listed in sequence (SEQRES)
Sequence and structure alignments, 97–98
Sequence- and structure-based methodologies, 286
Sequence-based approaches to SNP analysis, 286, 286f, 287f
Sequence-based prediction tools
  MAPP, 288
  PANTHER, 288–289
  Parepro