Computational Drug Discovery and Design (Methods in Molecular Biology, 2714) [2nd ed. 2024] 1071634402, 9781071634400

This second edition provides new and updated methods and techniques for identification of drug target, binding sites pre


English Pages 367 [357] Year 2023


Table of contents :
Preface
Contents
Contributors
Chapter 1: Computer-Aided Drug Discovery and Design: Recent Advances and Future Prospects
1 Introduction
2 Where to Start (1): Choosing a Pharmacological Target to Pursue
3 Where to Start (2): Identifying Active Scaffolds
4 The Actual Design: Hit to Lead and Beyond
5 In Silico ADMET Filters and Antitargets
6 Final Remarks
References
Chapter 2: Virtual Screening Process: A Guide in Modern Drug Designing
1 Introduction
2 Material
3 Methods
3.1 Selection of Target Protein, Preparation, and Grid Generation
3.1.1 Preliminary Details
3.1.2 Preprocessing of Targeted Protein
3.1.3 Grid Generation
3.2 Selection of Chemical Database and Their Preprocessing for Screening
3.3 Molecular Docking of Ligand Within Binding Pocket of Protein
3.4 Prediction of ADME and PAINS Filtering
3.5 Dynamics Studies to Get the Stability of Selected Hits
4 Notes
References
Chapter 3: Molecular Dynamics as a Tool for Virtual Ligand Screening
1 Introduction
2 Materials
3 Methods
3.1 Prepare Receptor for MD Simulation
3.2 Preparation of MD Input Files
3.3 Running the MD Simulation
3.4 Analysis of the MD Trajectory
3.5 Virtual Screening of Ligands and Multiple Receptor Structures
3.6 Validation of Complexes by MD Simulations
3.7 Estimation of Ligand Affinity
3.8 Concluding Remarks
4 Notes
References
Chapter 4: Antiviral Drug Target Identification and Ligand Discovery
1 Introduction
2 Receptor Protein Target
2.1 Obtaining the Structure of the Receptor Protein Target
2.2 Structure Suitability and Validation Check
3 Analyzing Viral Genome and Protein Conservation for Drug Target Selection
4 Databases for Selecting Compound Libraries and Ligands
4.1 ADME Screening
4.2 Developing An Analogue Library
5 Notes
6 Summary
References
Chapter 5: GRAMM Web Server for Protein Docking
1 Introduction
2 Materials
2.1 User Input
2.2 Implementation
3 Methods
3.1 Free Docking
3.2 Template-Based Docking
3.3 Output of Docking Results
3.4 Case Studies: Free Docking
3.5 Case Studies: Template-Based Docking
4 Notes
5 Concluding Remarks
References
Chapter 6: Protein-Ligand Blind Docking Using CB-Dock2
1 Introduction
2 Web Server
3 Usage of CB-Dock2 Web Server
3.1 Upload of Query Protein and Query Ligand
3.2 User-Customized Settings
3.3 Visualization and Analysis of Results
4 Case Study
5 Notes
References
Chapter 7: Applications of Molecular Dynamics Simulations in Drug Discovery
1 Introduction
2 Theoretical Basis of MD Simulations
3 Molecular Dynamics in Lead Identification Studies
3.1 Identifying Druggable Binding Sites Using Molecular Dynamics Simulations
3.2 Evaluating Drug-Target Interactions Using Molecular Dynamics Simulations
3.3 MD Simulations in Lead Optimization
4 Molecular Dynamics Simulations as a Tool to Study Protein Conformational Sampling
5 Tools to Enhance Sampling Efficiency of MD Simulations
5.1 Accelerated Molecular Dynamics
5.2 Well-Tempered Meta-Dynamics
5.3 Coarse Graining
6 Conclusion
7 Notes
References
Chapter 8: Molecular Dynamics Simulation-Based Prediction of Glycosaminoglycan Interactions with Drug Molecules
1 Introduction
2 Methods
3 Notes
References
Chapter 9: Mining Chemogenomic Spaces for Prediction of Drug-Target Interactions
1 Introduction
2 Dataset Development
3 Feature Representation
3.1 Representation of Protein Sequences
3.2 Representation of Molecules
3.3 Machine Learning Platforms
4 Performance Evaluation and Performance Evaluation Metrics
5 Conclusion and Future Perspective
6 Notes
References
Chapter 10: Expanding the Landscape of Amyloid Sequences with CARs-DB: A Database of Polar Amyloidogenic Peptides from Disorde...
1 Introduction
2 CARs-DB
2.1 Algorithm
2.2 Database Content
2.3 Link with Other Databases
2.4 Work with CARs-DB Data
2.5 Understanding CARs: Function and Disease Association
3 Practical Use of CARs-DB to Search for Both Functional and Disease-Associated CARs
4 Concluding Remark
5 Notes
References
Chapter 11: Accelerating Molecular Dynamics Simulations for Drug Discovery
1 Introduction
2 Materials and Methods
2.1 Ligand Gaussian Accelerated Molecular Dynamics (LiGaMD)
2.2 Ligand Binding Kinetics Calculated from Reweighting of LiGaMD Simulations
2.3 Running Simulations Using LiGaMD
2.3.1 System Preparation
2.3.2 System Equilibration with cMD Simulations
2.3.3 LiGaMD Equilibration and Production
2.3.4 LiGaMD Analysis
3 Results
4 Notes
References
Chapter 12: Exploring the Role of Chemoinformatics in Accelerating Drug Discovery: A Computational Approach
1 Introduction
1.1 Components Involved in Cheminformatics
1.2 Web-Based Tools for Cheminformatics and Molecular Property Prediction
1.3 Involvement of Machine Learning in Cheminformatics
2 Material
3 Methods
3.1 Development of Model
3.2 Calculation of Properties
3.3 Case Study Involving Cheminformatics Approach
4 Notes
References
Chapter 13: Recent Deep Learning Applications to Structure-Based Drug Design
1 Introduction
1.1 Computational Representations of Molecules
2 Pose Rescoring
3 Deep Learning Pose Generation
4 Deep Learning Molecule Optimization
4.1 Pharmacokinetic Optimization
4.2 Receptor-Based Optimization
5 Ab Initio Molecule Generation
6 Summary
7 Notes
References
Chapter 14: Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Pr...
1 Introduction
1.1 Characteristics of Protein:Protein Interfaces
1.2 Approaches for Predicting Protein:Protein Interaction Hot Spots
1.3 Dataset Size and Redundancy in Training and Test Sets for Machine Learning
1.4 Databases of Experimental Data for Hot Spot Training and Testing
2 Materials and Methods
2.1 Constructing Training and Testing Datasets and Removing Redundant Sites
2.2 Characterizing the Features of Each Site as the Basis for Prediction
2.3 Evaluating a Panel of Machine Learning Methods for Hot Spot Prediction
2.4 Testing Different Subsets of the Data to Evaluate Training Robustness
2.5 Evaluating if the Sample Size of Training Data Is Sufficient
2.6 Applying Hot Spot Prediction with Hotspotter on New Data
2.7 Choosing Appropriate Performance Metrics to Evaluate Prediction Quality
2.8 Assessing Correlation Between Features and Their Relative Importance for Prediction
2.9 Applying Hotspotter to New Proteins: Predicting Critical Residues on Human ACE2 for Binding the SARS CoV-2 Spike Protein
3 Results
3.1 Assessing Features for Hot Spot Prediction
3.2 Evaluating Classifier Feature Correlation in the Curated Mutational Database
3.3 Hot Spot Prediction from Training and Testing Five Different Classifiers on 1046 Sites
3.4 Diagnosing if Hyperparameters Have Been Overtuned in Training
3.5 Selecting the Final Hotspotter Classifier
3.6 Defining Site Features Most Important for Hot Spot Identification
3.7 Hotspotter Prediction of Hot Spots on Human ACE2 Receptor for SARS CoV-2 Spike Binding
3.8 Focusing Hot Spot Databases on Surface and Interfacial Residues to Improve Their Usefulness
3.9 Lessons Learned for Training and Test Set Design for Improving Hot Spot Prediction
4 Conclusion
5 Notes
References
Chapter 15: AI-Driven Enhancements in Drug Screening and Optimization
1 Introduction
2 Materials
3 Ligand-Based Screening
3.1 Tools Available
3.2 Method
4 Role of Mutations in Drug Efficacy
4.1 Screening Mutations
4.2 Methods
5 Prediction and Optimization of Ligand Pharmacokinetic and Toxicity Properties
5.1 In Silico ADMET Characterization
5.2 Methods
6 Summary
7 Notes
References
Chapter 16: Applications of Big Data and AI-Driven Technologies in CADD (Computer-Aided Drug Design)
1 Introduction
2 Materials
3 Methods
4 Notes
References
Chapter 17: Artificial Intelligence in ADME Property Prediction
1 Introduction
2 Materials and Methods
2.1 QSAR Modeling
2.2 Experimental Assay Data
2.3 Chemical Structure Representation
2.4 Chemical Structure Normalization
2.5 Dataset Split
2.6 Data Balancing Techniques
2.7 Modeling Methods
2.7.1 Random Forest
2.7.2 XGBoost
2.7.3 Graph Convolutional Neural Network
2.7.4 Recurrent Neural Networks
2.8 Cross Validation and External Validation
2.9 Model Validation
2.10 Applicability Domain
3 Evolution of AI in ADME Modeling
4 ADME@NCATS
5 Open-Source ADME Tools
6 Summary
7 Notes
References
Chapter 18: Accelerating the Discovery and Design of Antimicrobial Peptides with Artificial Intelligence
1 Introduction
2 Predicting Peptide Antimicrobial Activity
2.1 Evolution of AMP Predictors
2.2 Limitations
2.3 Perspectives
3 Generating Novel AMPs
3.1 Evolutionary-Based Generators
3.2 Deep Learning Models
3.3 Perspectives
4 Interplay
5 Conclusions
References
Index


Methods in Molecular Biology 2714

Mohini Gore Umesh B. Jagtap  Editors

Computational Drug Discovery and Design Second Edition

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible, step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Computational Drug Discovery and Design Second Edition

Edited by

Mohini Gore Department of Basic and Applied Sciences, Dayananda Sagar University, Bangalore, Karnataka, India

Umesh B. Jagtap Department of Botany, Rajaram College Kolhapur, Kolhapur, Maharashtra, India


ISSN 1064-3745 ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-3440-0 ISBN 978-1-0716-3441-7 (eBook)
https://doi.org/10.1007/978-1-0716-3441-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Paper in this product is recyclable.

Preface

In recent years, computational drug discovery and design has emerged as a powerful approach for accelerating the drug discovery process. Using computational techniques to predict the properties and interactions of potential drug candidates can greatly reduce the time and cost that traditional experimental techniques require. This book, Computational Drug Discovery and Design, provides a comprehensive overview of the current state of the art in this rapidly growing field.

The present book is divided into 18 chapters, covering a wide range of topics from computer-aided drug discovery and design to artificial intelligence in ADME property prediction. The first chapter provides an overview of in silico approaches for identifying active scaffolds and guiding the subsequent optimization process. The latest groundbreaking advances in the field are also discussed, setting the stage for the subsequent chapters. Chapter 2 focuses on virtual screening, providing a stepwise guide to modern drug design. The next chapter covers the use of molecular dynamics as a tool for virtual ligand screening and presents, as a stepwise protocol, a general virtual ligand screening workflow that couples docking with MD simulations. The following chapter explores antiviral drug target identification and ligand discovery, outlining the free, online, open-source tools and resources that can be applied in antiviral drug discovery studies.

Chapter 5 discusses the use of the GRAMM web server for protein docking, which offers free or template-based docking as well as advanced features such as clustering of the docking poses and interactive visualization of the docked models. The subsequent chapter explores protein-ligand blind docking using CB-Dock2, an automatic docking server, with a detailed description of how to use the server. Chapter 7 covers the application of molecular dynamics simulations in drug discovery and emphasizes various strategies to improve conformational sampling efficiency, while the next chapter focuses on molecular dynamics simulation-based prediction of glycosaminoglycan (GAG) interactions with drug molecules and explains the molecular dynamics-based protocols developed specifically to characterize GAG-small drug molecule complexes.

Chapter 9 discusses the mining of chemogenomic spaces for the prediction of drug-target interactions. This chapter focuses on the drug-target interaction prediction process from the perspective of machine learning algorithms and the stages involved in developing an accurate predictor. The subsequent chapter explores the expansion of the landscape of amyloid sequences with CARs-DB, a database of polar amyloidogenic peptides from disordered proteins, and gives a stepwise protocol describing how to use CARs-DB to search for sequences of interest that might be connected to disease or functional protein-protein interactions. Chapter 11 discusses accelerated molecular dynamics simulations for drug discovery, with a brief review of the status and usage of LiGaMD in drug discovery, while the ensuing chapter explores the role of chemoinformatics in accelerating drug discovery and provides a case study to illustrate chemoinformatics concepts.


Chapter 13 presents an overview of recent deep learning-based developments with applications in drug discovery. A general framework of the approaches is described, and the individual methods are discussed. The next chapter focuses on techniques for developing reliable machine learning classifiers applied to understanding and predicting protein-protein interaction hot spots and presents the use of a well-trained classifier, Hotspotter. Chapter 15 explores AI-driven enhancements in drug screening and optimization, with a description of several machine learning models and databases for structure-guided pharmacodynamics (PD) screening, ligand-based PD screening, PK/ADME prediction, toxicity prediction, and the effect of genetic mutations on drug efficacy. The following chapter describes the applications of big data and AI-driven technologies in CADD (computer-aided drug design), discussing various approaches to data preprocessing, modeling, and applications in drug design and discovery. Chapter 17 presents artificial intelligence in ADME property prediction, reviewing different modeling methods in light of the most recent methodological advancements in AI and their applications in ADME modeling. The final chapter describes the machine learning-guided design of peptide antibiotics, presenting the evolution and applications of predictive and generative modeling to discover and design safe and effective antimicrobial peptides.

The contributors to this book are world-renowned experts in their fields, and their collective expertise provides readers with a comprehensive overview of the latest developments in computational drug discovery and design. This book is an essential resource for students, researchers, and professionals working in the fields of drug discovery, computational chemistry, bioinformatics, and related disciplines.

Our sincere appreciation goes out to John Walker, the series editor, for his invaluable guidance and support throughout the entire process of developing this book. We also extend our gratitude to all the authors who promptly contributed and shared their practical knowledge by providing a stepwise methodology for using bioinformatics tools in drug discovery and design. We trust that this volume will prove useful to both beginners in the field of bioinformatics and seasoned scientists involved in drug discovery research.

Mohini Gore, Bangalore, Karnataka, India
Umesh B. Jagtap, Kolhapur, Maharashtra, India

Contents

Preface
Contributors
1 Computer-Aided Drug Discovery and Design: Recent Advances and Future Prospects
  Alan Talevi
2 Virtual Screening Process: A Guide in Modern Drug Designing
  Umesh Panwar, Aarthy Murali, Mohammad Aqueel Khan, Chandrabose Selvaraj, and Sanjeev Kumar Singh
3 Molecular Dynamics as a Tool for Virtual Ligand Screening
  Grégory Menchon, Laurent Maveyraud, and Georges Czaplicki
4 Antiviral Drug Target Identification and Ligand Discovery
  Hershna Patel and Dipankar Sengupta
5 GRAMM Web Server for Protein Docking
  Amar Singh, Matthew M. Copeland, Petras J. Kundrotas, and Ilya A. Vakser
6 Protein–Ligand Blind Docking Using CB-Dock2
  Yang Liu and Yang Cao
7 Applications of Molecular Dynamics Simulations in Drug Discovery
  Sara AlRawashdeh and Khaled H. Barakat
8 Molecular Dynamics Simulation-Based Prediction of Glycosaminoglycan Interactions with Drug Molecules
  Martyna Maszota-Zieleniak and Sergey A. Samsonov
9 Mining Chemogenomic Spaces for Prediction of Drug–Target Interactions
  Abhigyan Nath and Radha Chaube
10 Expanding the Landscape of Amyloid Sequences with CARs-DB: A Database of Polar Amyloidogenic Peptides from Disordered Proteins
  Carlos Pintado-Grima, Oriol Bárcenas, and Salvador Ventura
11 Accelerating Molecular Dynamics Simulations for Drug Discovery
  Kushal Koirala, Keya Joshi, Victor Adediwura, Jinan Wang, Hung Do, and Yinglong Miao
12 Exploring the Role of Chemoinformatics in Accelerating Drug Discovery: A Computational Approach
  Aarthy Murali, Umesh Panwar, and Sanjeev Kumar Singh
13 Recent Deep Learning Applications to Structure-Based Drug Design
  Jacob Verburgt, Anika Jain, and Daisuke Kihara
14 Techniques for Developing Reliable Machine Learning Classifiers Applied to Understanding and Predicting Protein:Protein Interaction Hot Spots
  Jiaxing Chen, Leslie A. Kuhn, and Sebastian Raschka
15 AI-Driven Enhancements in Drug Screening and Optimization
  Adam Serghini, Stephanie Portelli, and David B. Ascher
16 Applications of Big Data and AI-Driven Technologies in CADD (Computer-Aided Drug Design)
  Seongmin Seo and Jai Woo Lee
17 Artificial Intelligence in ADME Property Prediction
  Vishal B. Siramshetty, Xin Xu, and Pranav Shah
18 Accelerating the Discovery and Design of Antimicrobial Peptides with Artificial Intelligence
  Mariana d. C. Aguilera-Puga, Natalia L. Cancelarich, Mariela M. Marani, Cesar de la Fuente-Nunez, and Fabien Plisson
Index

Contributors

VICTOR ADEDIWURA • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
MARIANA D. C. AGUILERA-PUGA • Centro de Investigacion y de Estudios Avanzados del IPN (CINVESTAV-IPN), Unidad de Genomica Avanzada, Laboratorio Nacional de Genomica para la Biodiversidad (Langebio), Irapuato, Guanajuato, Mexico; CINVESTAV-IPN, Unidad Irapuato, Departamento de Biotecnología y Bioquímica, Irapuato, Guanajuato, Mexico
SARA ALRAWASHDEH • Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB, Canada
DAVID B. ASCHER • School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, QLD, Australia; Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
KHALED H. BARAKAT • Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB, Canada
ORIOL BÁRCENAS • Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, Barcelona, Spain
NATALIA L. CANCELARICH • Instituto Patagonico para el Estudio de los Ecosistemas Continentales (IPEEC), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Puerto Madryn, Argentina
YANG CAO • Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
RADHA CHAUBE • Department of Zoology, Institute of Science, Banaras Hindu University, Varanasi, India
JIAXING CHEN • Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, PA, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
MATTHEW M. COPELAND • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
GEORGES CZAPLICKI • Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III – Paul Sabatier (UT3), Toulouse, France
CESAR DE LA FUENTE-NUNEZ • Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA; Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, PA, USA
HUNG DO • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
ANIKA JAIN • Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
KEYA JOSHI • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
MOHAMMAD AQUEEL KHAN • Computer Aided Drug Design and Molecular Modelling Lab, Department of Bioinformatics, Science Block, Alagappa University, Karaikudi, Tamil Nadu, India
DAISUKE KIHARA • Department of Biological Sciences, Purdue University, West Lafayette, IN, USA; Department of Computer Science, Purdue University, West Lafayette, IN, USA
KUSHAL KOIRALA • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
LESLIE A. KUHN • Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
PETRAS J. KUNDROTAS • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
JAI WOO LEE • Department of Big Data Science, College of Public Policy, Korea University, Sejong, Republic of Korea
YANG LIU • Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
MARIELA M. MARANI • Instituto Patagonico para el Estudio de los Ecosistemas Continentales (IPEEC), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Puerto Madryn, Argentina
MARTYNA MASZOTA-ZIELENIAK • Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland
LAURENT MAVEYRAUD • Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III – Paul Sabatier (UT3), Toulouse, France
GRÉGORY MENCHON • Inserm U1242, Oncogenesis, Stress and Signaling (OSS), Université de Rennes 1, Rennes, France
YINGLONG MIAO • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
AARTHY MURALI • Computer Aided Drug Design and Molecular Modelling Lab, Department of Bioinformatics, Science Block, Alagappa University, Karaikudi, Tamil Nadu, India
ABHIGYAN NATH • Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, India
UMESH PANWAR • Computer Aided Drug Design and Molecular Modelling Lab, Department of Bioinformatics, Science Block, Alagappa University, Karaikudi, Tamil Nadu, India
HERSHNA PATEL • School of Life and Medical Sciences, University of Hertfordshire, Hatfield, UK
CARLOS PINTADO-GRIMA • Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, Barcelona, Spain
FABIEN PLISSON • Centro de Investigacion y de Estudios Avanzados del IPN (CINVESTAV-IPN), Unidad de Genomica Avanzada, Laboratorio Nacional de Genomica para la Biodiversidad (Langebio), Irapuato, Guanajuato, Mexico; CINVESTAV-IPN, Unidad Irapuato, Departamento de Biotecnología y Bioquímica, Irapuato, Guanajuato, Mexico
STEPHANIE PORTELLI • School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, QLD, Australia
SEBASTIAN RASCHKA • Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA; Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
SERGEY A. SAMSONOV • Faculty of Chemistry, University of Gdańsk, Gdańsk, Poland
CHANDRABOSE SELVARAJ • Center for Transdisciplinary Research, Department of Pharmacology, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India
DIPANKAR SENGUPTA • Health Data Sciences Research Group, Centre for Optimal Health, School of Life Sciences, College of Liberal Arts and Science, University of Westminster, London, UK
SEONGMIN SEO • Department of Mechanical Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
ADAM SERGHINI • School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, QLD, Australia; Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
PRANAV SHAH • National Center for Advancing Translational Sciences, Rockville, MD, USA
AMAR SINGH • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
SANJEEV KUMAR SINGH • Computer Aided Drug Design and Molecular Modelling Lab, Department of Bioinformatics, Science Block, Alagappa University, Karaikudi, Tamil Nadu, India; Department of Data Sciences, Centre of Biomedical Research, SGPGIMS Campus, Lucknow, Uttar Pradesh, India
VISHAL B. SIRAMSHETTY • National Center for Advancing Translational Sciences, Rockville, MD, USA; Department of Safety Assessment, Genentech, Inc., South San Francisco, CA, USA
ALAN TALEVI • Laboratory of Bioactive Compound Research and Development (LIDeB), Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata, Argentina; Argentinean National Council of Scientific and Technical Research (CONICET), La Plata, Argentina
ILYA A. VAKSER • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
SALVADOR VENTURA • Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, Barcelona, Spain
JACOB VERBURGT • Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
JINAN WANG • Computational Biology Program and Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, USA
XIN XU • National Center for Advancing Translational Sciences, Rockville, MD, USA

Chapter 1 Computer-Aided Drug Discovery and Design: Recent Advances and Future Prospects Alan Talevi Abstract Computer-aided drug discovery and design involve the use of information technologies to identify and develop, on a rational ground, chemical compounds that align a set of desired physicochemical and biological properties. In its most common form, it involves the identification and/or modification of an active scaffold (or the combination of known active scaffolds), although de novo drug design from scratch is also possible. Traditionally, the drug discovery and design processes have focused on the molecular determinants of the interactions between drug candidates and their known or intended pharmacological target(s). Nevertheless, in modern times, drug discovery and design are conceived as a particularly complex multiparameter optimization task, due to the complicated, often conflicting, property requirements. This chapter provides an updated overview of in silico approaches for identifying active scaffolds and guiding the subsequent optimization process. Recent groundbreaking advances in the field have also analyzed the integration of state-of-the-art machine learning approaches in every step of the drug discovery process (from prediction of target structure to customized molecular docking scoring functions), integration of multilevel omics data, and the use of a diversity of computational approaches to assist target validation and assess plausible binding pockets. 
Key words ADME, ADMET, Antitarget, Drug design, Computer-aided drug design, Computer-assisted drug design, Computer-guided drug design, In silico screening, Ligand-based approaches, Molecular optimization, Pharmacological target, Molecular target, Target validation, Pharmacophore, QSAR, Structure-based approaches, Target-based approaches, Virtual screening, Machine learning, Deep learning, Ensemble learning, Omics, Data integration, Pocket prediction, Druggability, Druggability prediction, Molecular dynamics, Open source, Cooperative knowledge, Collective knowledge, Collaborative knowledge, De novo drug design, Fragment-based drug design

1 Introduction

Computer-aided drug discovery and design (CADDD) involve the use of information technologies to assist in the identification and/or development of novel chemical scaffolds with the desired alignment of relevant physicochemical and biological properties. Under the target-focused paradigm, computer-aided drug

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


discovery and design rely heavily on the structural information of the intended pharmacological target(s) (direct drug design) and/or known binders of such target(s) (indirect drug design). In recent decades, though, the use of direct or indirect structural information on relevant antitargets (e.g., biotransformation enzymes, drug transporters, the hERG potassium channel, and nontargeted subtypes of the intended target) has gained increasing attention to improve ligand selectivity and reduce off-target interactions, leading to enhanced safety and even improved pharmacokinetic profile [1–6]. In other words, modern drug design not only relies on available molecular information on the proposed molecular targets but also on information on antitargets. In its most common form, drug discovery and design involve the identification and modification (molecular optimization) of an active scaffold, or linking known active scaffolds. Active scaffolds may be found serendipitously or, more commonly, systematically, through ethnopharmacology [7] or by massive wet or in silico screening of chemical libraries [8, 9]. However, drug design starting with smaller building blocks, as in de novo and fragment-based drug design, is also possible [10, 11]. In any case, a starting point or seed is required to build or optimize the active compound. Computer-assisted methods have gained a prominent role in both stages of modern drug discovery: searching for starting points and making rational decisions regarding which chemical modifications are more convenient for optimizing their pharmacological and biopharmaceutical profiles. In silico or virtual screening (VS) (i.e., using computational methods to explore vast collections of chemicals and identify novel active scaffolds) represents a rational way of finding starting points to implement a drug design campaign; it should be conceptually separated from drug design. Drug design is intrinsically related to chemical novelty. 
In contrast, in silico screening, which can be coupled with drug design, typically explores the known chemical universe in the search for new active motifs. The novelty of virtual screening lies not in the chemistry of the emerging hits but in uncovering an unknown, hidden association between a known molecular entity and a given biological activity. In addition to its rationality, accessibility is another attractive aspect of CADDD. The technological gap between high- and low-to-middle-income countries is smaller for computer-aided drug discovery than for any other process or approach in the drug discovery cycle. This is partly because many computational resources have been made publicly available and many computational tools used in the field operate smoothly on modern personal computers. Open-source and collaborative knowledge philosophies are deeply rooted in the informatics community, and the fields of bioinformatics and cheminformatics, on which CADDD feeds, are no exception [12–15]. The potential of CADDD has been


significantly boosted by data sharing and data assembly initiatives that release large volumes of valuable data into the public domain. However, it should be emphasized that several constraints operate in the process of drug design. First, the synthetic feasibility of the designed compounds should not be neglected [16]. A proposed compound might not be synthetically attainable for universal technical reasons (lack of a given synthetic route) or, more frequently, because of local limitations (e.g., lack of access to the required technology and/or reactants). Even if synthetic feasibility is not an issue per se, synthetic scalability might be problematic, as large quantities of the compound(s) under development will be required to support preclinical and clinical development and, eventually, commercialization. For instance, scalability is an important issue in the case of medications based on complex natural products [17]. Equally important is the fact that, as previously implied, drug discovery and development constitute a challenging multiobjective problem, where numerous pharmaceutically relevant objectives should be simultaneously addressed [18]. This is further complicated by the fact that, occasionally, some of those objectives might be conflicting, resulting in very complex solution spaces. The challenge can be metaphorically compared to solving a Rubik's Cube: sometimes, moving toward a global solution (solving all the cube faces) requires local sacrifices (taking apart an already solved face). Furthermore, the optimal solution, in the case of hit and lead optimization, may be unachievable. For example, it is generally accepted that higher selectivity leads to safer medications; however, efficacious treatments for complex disorders may require multitarget therapeutic agents that, by definition, are not exquisitely selective [19]. 
On the other hand, as implied by Lipinski's well-known rule of five and similar rules [20], a certain degree of aqueous solubility is often pursued to ensure dissolution at the site of absorption; however, excessive solubility could be detrimental to absorption and biodistribution. The introduction of lipophilic substituents into adequate positions of a ligand often translates into a gain in potency [20], and a certain degree of lipophilicity is desirable in central nervous system (CNS) medications to achieve brain bioavailability [21]. However, high lipophilicity conspires against both drug dissolution [22] and metabolic stability [23]. The keyword in drug design seems to be balance, which explains why multiobjective optimization methods and composite metrics have gained such popularity in the field in recent years [18, 24–28]. A scheme illustrating the complex interplay between pharmaceutically relevant properties is shown in Fig. 1. This scheme is admittedly an oversimplification. The nature of the relationship between two given properties may be nonlinear, and many counterexamples of the relationships stated in the scheme can be found. For example, although it is accepted that lipophilicity has a positive impact on cell permeability, excessively lipophilic drugs


Fig. 1 A complex, conflicting interplay is observed between the pharmaceutically relevant properties that are taken into consideration when facing a drug discovery and development project. An inverse, possibly conflicting relationship between two properties is indicated by a red line. Conversely, a direct, nonconflicting relationship is shown with a green line

might become sequestered inside the cell membrane, with little improvement in permeability across biological barriers [29], thus determining a parabolic relationship between lipophilicity and permeability. Whether pursuing a given property is fundamental also depends on the therapeutic goal. In general, hit identification is potency-driven, preferring ligands with affinities for their pharmacological target(s) in the nanomolar or even sub-nanomolar range. Whereas potent ligands are undoubtedly pursued in some cases (e.g., to treat infectious diseases), strong modulation of the target may not be the best choice when trying to restore the normal functioning of sensitive physiologic systems (e.g., the brain), as highly potent ligands will tend to impair normal functioning and produce intolerable side effects. From a network pharmacology perspective, partial modulation of pharmacological targets (by use of low-affinity ligands and partial agonists) could be a more adequate approach to safely restore physiological systems to normal function [30–32]. This chapter provides an overview of the computational approaches that can be used to assist in the selection of one or more pharmacological targets, identify novel active chemotypes, and guide their subsequent molecular optimization. It has not


been conceived to describe the overviewed techniques in detail, as most of them are covered separately in other chapters of the book, where detailed protocols are expected to be provided. It has therefore been conceived as a general introductory chapter. The general principles of rational drug design are also discussed tangentially.

2 Where to Start (1): Choosing a Pharmacological Target to Pursue

Under the target-focused drug discovery paradigm, the selection, prioritization, and validation of drug targets are key problems. Currently, computational methods are strongly integrated into this initial step of a rational drug discovery project. Any target-driven drug discovery project begins by choosing one (when single-target agents are pursued) or more (when multitarget therapeutics are pursued) drug targets. What makes a good drug target? First, it must be disease-modifying. Second, it must be druggable; that is, it should be possible to modulate it through the binding of a small molecule (although some authors also admit modulation by biologics as proof of druggability) [33, 34]. If no ligand is known to bind to the potential target, druggability prediction can be performed, which generally involves examining the target surface for binding sites or checking the existence of similar proteins that have already been proven to be druggable [35–37]. Other desirable features to consider include assayability, differential expression throughout the body, liability to drug resistance, target vulnerability, and a favorable intellectual property situation (no competitors working on the same target) [33, 38, 39]. At present, the integration of multiple levels of data and network analysis are probably the most promising approaches for identifying potential new pharmacological targets. Networks describe the connectivity between elements of the same (e.g., protein–protein networks) or different nature (e.g., drug–protein, protein–disease, or gene–disease networks). By analyzing the topology of the network, that is, the connectivity between its constituent elements, one may infer which proteins are more favorable for therapeutic interventions [40, 41], reveal unknown associations (e.g., between a protein and a disease) [42], and explore synergistic drug combinations in a systematic and rational manner [43, 44]. 
The connections between the elements of a network may be established experimentally or may be predicted (for instance, a connection between two drugs may be inferred based on molecular similarity, and the structural connection between two proteins may be inferred based on sequence or structure similarities); hybrid data are usually used to feed the network, as computational predictions can complement experimental data and thus minimize the resources required for experimental assays.
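As a minimal illustration of the topological analysis described above, the following Python sketch ranks the nodes of a toy, undirected protein–protein network by normalized degree centrality. The protein names and edges are hypothetical placeholders, and degree centrality is only one of many topological descriptors that could be used; it is shown here because it is the simplest.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected, unweighted network."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Hypothetical toy network: names and interactions are placeholders, not real data.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]

centrality = degree_centrality(toy_edges)
# Nodes sorted from most to least connected.
ranked = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy case `ranked` starts with `P1`, the most connected node. Real networks would typically be analyzed with a dedicated graph library and with additional measures, since degree alone does not identify choke points.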


Importantly, the choice of a suitable pharmacological target based on its topological features in a protein–protein network is greatly influenced by the nature of the target disease [30]. The central hit strategy, that is, targeting either central nodes of the network or choke points, is useful for identifying drug targets for anti-infective or anticancer therapies. In contrast, the network influence strategy is preferred for complex diseases; it targets nodes that occupy strategically important, disease-specific network positions from which central nodes can be influenced. Of note, the functional connections of a protein vary dynamically with physiological and disease states. Interestingly, recent studies have used machine learning to combine different types of protein features (biological functions, network properties derived from protein–protein interaction networks, tissue specificity, localization, and solvent accessibility, among others) to distinguish between pharmacological targets and nontargets [45, 46]. Network analysis is also of interest in the framework of phenotypic-based drug discovery (a target-agnostic drug discovery strategy based on the screening of chemical libraries against complex systems, such as cell or animal models), as it might be useful in the target deconvolution stage [47, 48]. Besides (or complementary to) the application of network analysis coupled with machine learning approaches to identify potentially interesting drug targets from a functional perspective, in silico tools to determine particular aspects of relevance for a pharmacological target also abound. The druggability and essentiality of putative targets are the two properties that have been most extensively approached using computational predictive methods. Druggable binding pockets for small molecules can be identified in proteins based on structural information (from primary to tertiary or quaternary structures). 
Sequence-based approaches (at times also called “evolutionary algorithms”) rely on the analysis of residue conservation, under the assumption that binding residues are key to functionality and thus likely to be conserved through evolution [49, 50]. Methods of this type are advantageous because they can make a prediction from an input sequence alone, but their accuracy tends to be low because non-binding residues can also be highly conserved owing to other roles, such as fold stabilization. Moreover, allosteric binding sites, which have recently attracted considerable interest in the drug discovery community [51], are less likely to be conserved across species. It is possible that purely sequence-based approximations will give way to structure-based approximations now that, as discussed in other sections, structure prediction tools have been considerably perfected. Structure-based druggability prediction tools comprise a combination of automated binding pocket detection and an algorithm (usually an empirical function obtained via machine learning) to


quantify the degree of druggability of identified binding sites. Methods based on protein 3D structure can be roughly classified into those that identify the pockets by characterizing the surface cavities on the 3D structural model of the target protein (without any template, also known as geometry-based methods) and those that infer the binding pockets from known template proteins with established binding sites and global or local structural similarity to the query [50]. The latter represents an accurate option when templates with close homology are found in structural databases. Some recent template-based tools use a matching algorithm that aligns similar microenvironments or physicochemical properties between pairs of proteins, allowing the detection of similar sites between evolutionarily unrelated proteins [52]. Other categories of binding pocket predictive tools include energy-based methods, which use probes to locate regions of the protein where intermolecular interactions such as hydrogen bonding or π-stacking are likely to be formed, and machine learning methods, which feed physicochemical descriptors into models of variable complexity, from random forests to deep learning, to identify binding sites [53]. Some approximations rely on the consensus of previously developed tools. Similarly, the performance of druggability scoring functions has recently been boosted using state-of-the-art machine learning approaches such as deep learning and ensemble learning. A non-exhaustive list of some binding pocket prediction tools can be found in Table 1. While most pocket detection and druggability prediction methods start from static protein structures, proteins have inherent flexibility, and the shape and properties of binding pockets may vary over time and upon binding events. Cryptic binding sites are not apparent in a protein unless ligands that induce a remarkable protein rearrangement upon binding are present [53–55]. 
Thus, they are fundamentally related to the notions of induced-fit binding and conformational selection binding [56] and have attracted much interest recently. Binding sites of this type may be key to modulating apparently undruggable targets, but they are clearly more difficult to identify by pocket detection algorithms, especially when apo structures are used. Molecular dynamics simulations represent the best approach to identify and characterize cryptic binding sites (and to eventually sample conformational states to feed druggability prediction tools) [55, 57–59]. With regard to essentiality predictions (i.e., identifying gene products that are crucial for the growth and/or survival of a cell or organism), the available in silico methods can be categorized into homology mapping, constraint approaches, and machine learning approaches [60]. Homology mapping assumes that if the gene sequence from a target organism is highly similar to the sequence of an essential gene from a model organism, that gene is likely to be essential, which is not necessarily true. Constraint approaches


Table 1 An arbitrary, non-exhaustive selection of binding pocket detection tools with different underlying principles

Columns: Class / Software package or tool / Website

Template based
  3DLigandSite      https://www.wass-michaelislab.org/3dlig/
  FINDSITE          https://mybiosoftware.com/findsite-1-0-ligandbinding-site-prediction-functional-annotation.html
  I-TASSER Suite    https://zhanggroup.org/I-TASSER/
  ProBis            http://probis.cmm.ki.si/

Geometry based
  CASTp             http://sts.bioe.uic.edu/castp/index.html?2was
  Fpocket           https://github.com/Discngine/fpocket
  Ghecom            https://pdbj.org/ghecom/
  SURFNET           https://www.ebi.ac.uk/thornton-srv/software/SURFNET/

Energy based
  FTSite            https://ftsite.bu.edu/
  PocketFinder      https://www.molsoft.com/icmpocketfinder.html

Others (machine learning based, consensus approaches)
  COACH-D           https://yanglab.nankai.edu.cn/COACH-D/
  DeepCSeqSite      https://github.com/yfCuiFaith/DeepCSeqSite
  Kalasanty         https://gitlab.com/cheminfIBB/kalasanty
  LigandRFs         https://mybiosoftware.com/ligandrfs-predictprotein-ligand-binding-sites.html
  PUResNet          https://github.com/jivankandel/PUResNet

analyze metabolic networks using, for instance, flux balance analysis to define which gene product is critical for producing a given metabolite. Once a model of the metabolic network has been proposed, gene knockout can be simulated to study its impact on the network [61]. Machine learning approaches are empirical approximations that involve training one or more classifiers with training data from model organisms to identify features associated with known essential and nonessential genes and to weigh their contribution to essentiality. The trained classifier can later be applied to predict the essentiality of genes in a target organism.
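The idea of simulating gene knockouts on a metabolic network can be sketched with a deliberately simplified model. The Python example below does not perform flux balance analysis (which would require solving a linear program over reaction fluxes); instead, it uses a plain reachability check to ask whether a target metabolite can still be produced after each single-gene deletion. The network, metabolite names, and gene names are all hypothetical.

```python
def producible(reactions, seeds):
    """Return every metabolite reachable from `seeds`.

    `reactions` maps a reaction id to (substrates, products, gene).
    A reaction can fire once all of its substrates are available.
    """
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products, _gene in reactions.values():
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

def essential_genes(reactions, seeds, target):
    """Genes whose single knockout prevents production of `target`."""
    genes = sorted({gene for _subs, _prods, gene in reactions.values()})
    essential = []
    for gene in genes:
        # Simulate the knockout by dropping every reaction encoded by this gene.
        remaining = {rid: r for rid, r in reactions.items() if r[2] != gene}
        if target not in producible(remaining, seeds):
            essential.append(gene)
    return essential

# Hypothetical toy pathway; R2 and R3 are redundant routes (isozymes).
toy_reactions = {
    "R1": (["glc"], ["g6p"], "geneA"),
    "R2": (["g6p"], ["pyr"], "geneB"),
    "R3": (["g6p"], ["pyr"], "geneC"),
    "R4": (["pyr"], ["biomass"], "geneD"),
}
result = essential_genes(toy_reactions, seeds=["glc"], target="biomass")
```

Here `geneB` and `geneC` are not flagged because each can compensate for the loss of the other, which is the kind of redundancy that homology mapping alone cannot detect.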

3 Where to Start (2): Identifying Active Scaffolds

If entirely de novo approximations are excluded, any other approach requires starting with an active (and hopefully novel) scaffold into which chemical modifications are introduced. As previously mentioned, and leaving aside serendipitous discoveries (which are of course useful but intrinsically unsystematic), hints on potential active scaffolds of natural origin can be found in


traditional medicine. Alternatively, one might resort to information on the natural ligand(s) of a pharmacological target to start a drug design project. At this point, it is worth underlining that chemical novelty is a crucial factor in the pharmaceutical sector. Novelty is a fundamental requisite for obtaining intellectual property rights for an invention (and thus, commercial exclusivity). Although in the last 15 years drug repurposing (finding new medical uses for already known drugs) has raised substantial interest in the pharmaceutical industry, it also faces nontrivial intellectual property, regulatory, and commercial challenges [62, 63]. Accordingly, the search for novel active chemotypes remains a priority within the pharmaceutical industry owing to their intellectual property potential. High-throughput screening (HTS) methods are among the most frequently used approaches to explore the vast universe of known chemicals in search of novel active scaffolds. HTS is a modern version of traditional trial-and-error, “exhaustive” screening. The rationality of HTS lies in the integration of automation and miniaturization in the screening process, which results in efficient exploration of the chemical space [64]. Moreover, this approach has been greatly improved by the design of target-focused libraries [65] and the recognition of privileged scaffolds [66] (molecular frameworks/building blocks that are present in many biologically active ligands against a diverse array of targets). However, it should be mentioned that HTS requires expensive technological platforms which are not frequently found in the academic sector or in low-to-middle-income countries. In contrast, VS involves considerably more accessible technology, with many resources being publicly available, from specialized software to online chemical repositories. 
The term VS refers to the application of a diversity of computational approaches to rank digital chemical collections or libraries to establish which compounds are more likely to yield favorable results when submitted to experimental assays using relevant in vitro and/or animal models. VS approaches have been conceived to minimize the volume of experimental testing and optimize the results, thus being advantageous in terms of cost efficiency, bioethics, and environmental impact. They can be essentially classified into two categories: structure-based (or direct or target-based) and ligand-based (or indirect) approximations. Molecular docking is prominently used for structure-based VS. Starting from an experimental structure or from a structural hypothesis of the target, the binding event is simulated, and a scoring function assigns a higher score to the ligand poses that are predicted to be more energetically favorable. While both rigid (computationally undemanding) and flexible (more accurate but computationally demanding) approximations are possible, docking can be considered a computationally demanding VS approach in comparison with ligand-based methods. A search/sampling algorithm


is used to generate diverse ligand-binding orientations (rigid-body approximations) or ligand-binding orientations and conformations (flexible approximations). Previously, a major obstacle in the implementation of structure-based VS approaches came from the fact that the structures of many validated or potential pharmacological targets had not yet been solved experimentally. This situation has dramatically changed with the introduction of AlphaFold2 [67] and other similar or derived structure prediction approaches, such as RoseTTAFold, ESMFold, or ColabFold [68–70], which have made previously used approaches (e.g., comparative modeling) close to obsolete. AlphaFold is an artificial intelligence system based on deep learning (an attention network) that predicts a protein’s 3D structure from its amino acid sequence, usually achieving an accuracy competitive with experimental structures for globular and transmembrane proteins. The AlphaFold DB makes these predictions freely available to the scientific community, with its latest release containing over 200 million entries and providing broad coverage of UniProt [71]. Another caveat of molecular docking is related to the empirical nature of scoring functions, which generally include a variable degree of parameterization. This limits the reliability of the method owing to a high incidence of false positives [72]. Because the scoring functions are parameterized/trained against a limited set of experimentally determined binding affinities or experimental structures, the performance of the docking approach tends to be highly system-dependent, and the scores are, at best, weakly predictive of affinities; the reliability of the predictions is sometimes improved when different scoring functions are combined into a consensus score [73]. A persistent problem with scoring functions is the elusive entropic contribution to free energy [72, 74], which is ignored in many cases or very approximately estimated in others. 
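One simple way to build the kind of consensus score mentioned above is to average the ranks that each scoring function assigns to every compound. The sketch below illustrates this rank-averaging idea only; it is not the specific scheme of any docking package. The compound names and score values are hypothetical, and the code assumes that lower (more negative) scores are better, which holds for many but not all scoring functions.

```python
def rank_by_score(scores, lower_is_better=True):
    """Map each compound to its rank (1 = best) under one scoring function.

    Ties are broken by the insertion order of `scores`.
    """
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {compound: position for position, compound in enumerate(ordered, start=1)}

def consensus_ranking(score_tables):
    """Average the per-function ranks; a smaller mean rank is better."""
    # Only compounds scored by every function can be compared fairly.
    common = set(score_tables[0])
    for table in score_tables[1:]:
        common &= set(table)
    per_function = [
        rank_by_score({c: table[c] for c in table if c in common})
        for table in score_tables
    ]
    mean_rank = {
        c: sum(ranks[c] for ranks in per_function) / len(per_function)
        for c in common
    }
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))

# Hypothetical scores from three scoring functions (arbitrary units).
func_a = {"cpd1": -9.1, "cpd2": -7.4, "cpd3": -8.0}
func_b = {"cpd1": -6.5, "cpd2": -7.0, "cpd3": -6.8}
func_c = {"cpd1": -10.2, "cpd2": -8.8, "cpd3": -9.9}

consensus = consensus_ranking([func_a, func_b, func_c])
```

Working on ranks rather than raw scores sidesteps the fact that different scoring functions use different scales, at the cost of discarding the magnitude of the score differences.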
The reader should remember that, upon the binding event, the ligand will lose translational, rotational, and conformational freedom, whereas the target will mostly lose conformational freedom. The contributions of desolvation and of water molecules mediating ligand–protein interactions (which also affect the initial and final entropy of the system) should not be neglected [75, 76]. Recent retrospective studies suggest that machine learning scoring functions are substantially more accurate than classical scoring functions, as reflected in higher hit rates [77], boosting the possibilities to generate customized/tailored scoring functions [78, 79]. Free-energy simulations using molecular dynamics provide a much more rigorous solution for binding free-energy estimation [72, 80, 81]. The emergence of low-cost parallel computing is starting to relegate docking to the role of a prescreening tool in favor of molecular dynamics-based VS [72, 80]. What is more, docking models can be refined by running molecular dynamics simulations and extracting representative snapshots from the trajectory [82].


Ligand-based approximations may be applied whenever a model of the target structure is not available (which is seldom the case at present, as previously discussed), when the pharmacological target of a set of ligands is yet to be identified (e.g., for hits emerging from a phenotypic screening), or to complement structure-based approximations. Concisely, ligand-based screening methods can be classified into similarity searches, machine learning approaches (supervised machine learning used in the context of the quantitative structure–activity relationship (QSAR) theory), and superposition approximations [83–85]. These techniques differ in a number of factors, from their data requirements to their active enrichment and scaffold-hopping performance. Similarity search employs molecular fingerprints obtained from 2D or 3D molecular representations, and compounds from a chemical library are compared with one or more reference molecules in a pairwise manner. Remarkably, only one reference molecule (e.g., the physiological ligand of a target protein) is required to implement a similarity-based VS campaign. Similarity searches are frequently the only option to explore the chemical universe for active compounds when experimental knowledge on the target of interest is lacking or when the number of known ligands is too small to allow the use of supervised machine learning approaches. In practice, every important chemical database today has integrated similarity searches. Supervised machine learning approaches operate by building models from example inputs to make data-driven predictions on the compounds of a chemical library. Machine learning approximations require several learning or calibration examples. 
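The pairwise comparison at the core of a similarity search can be sketched in a few lines of Python. The example below uses the Tanimoto coefficient, a widely used similarity measure for binary fingerprints, although the text above does not prescribe a particular metric. Fingerprints are represented as sets of "on" bit positions; the bit values, compound names, and the 0.5 threshold are hypothetical, and in practice the fingerprints would be computed by a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def similarity_search(reference, library, threshold=0.5):
    """Return (name, similarity) pairs at or above `threshold`, best first."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical fingerprints (sets of on-bit indices), not derived from real molecules.
reference_fp = {1, 4, 7, 9, 12}
toy_library = {
    "cpdA": {1, 4, 7, 9, 13},   # shares 4 of 6 distinct bits with the reference
    "cpdB": {2, 3, 5},          # shares no bits with the reference
    "cpdC": {1, 4, 7, 9, 12},   # identical fingerprint
}
hits = similarity_search(reference_fp, toy_library, threshold=0.5)
```

As noted above, a single reference fingerprint is enough to run such a search; with several references, one common option is to keep the maximum similarity of each library compound to any reference.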
The general model development protocol involves dataset compilation and curation, splitting the dataset into representative training (calibration) and test/validation sets (whenever the size of the database allows it), choosing which molecular descriptors should be included in the model, weighting the contribution of such descriptors to the modeled response, validating the model internally and externally, and checking the applicability domain of the model whenever a prediction is made [86]. The molecular diversity of the training samples is critical for VS applications of supervised machine learning: it directly correlates with the breadth of the applicability domain of the resulting model. Finally, superposition techniques are conformation-dependent methods that analyze how well a compound superposes onto a reference compound or, more frequently, how well it fits a fuzzy model (pharmacophore) in which functional groups are stripped of their exact chemical nature to become generic chemical properties relevant for the ligand–target interaction (e.g., hydrophobic points, H-bond donors, H-bond acceptors, charged groups, aromatic points). The pharmacophore is thus a geometric, 3D arrangement of generic, abstract features that are essential for drug–target


recognition. Some approaches used for pharmacophore generation can also include negative features (features that conspire against biological activity) in the model. In contrast to docking, which considers the key features required for drug–target interaction in a direct manner, superposition techniques capture them in an indirect way, by inferring such features from known ligands. Superposition methods are by far the most visual, easy to interpret, and physicochemically intuitive ligand-based approaches. Pharmacophore generation is facilitated if the modeler has access to an active rigid analog with limited conformational freedom. Usually, however, one may resort to flexible alignment (superposition) of a set of flexible ligands, either generating a set of low-energy conformations and considering each conformer of each ligand in turn or exploring conformational space on the fly, that is, exploring the conformational space simultaneously with the pattern identification stage (alignment stage) [87, 88]. It should be noted that, when applying pharmacophore-based VS, orientation sampling is probably as important as conformational sampling, since chemical diversity is expected in the screened chemical library and defining an orientation criterion is thus nontrivial. It should also be mentioned that structure-based pharmacophores are possible [89]. Which in silico screening method should be chosen to initiate a rational drug discovery project? As indicated in the preceding paragraphs, the selection is restricted by the available data (structure-based approaches require experimentally solved or theoretical 3D structures of the target; supervised machine learning requires a minimum of calibration samples, etc.). However, even if the technical requirements to implement any approach were met, is there a single approach that universally and consistently outperforms the remaining ones? Is there a first-choice method? 
As a rule, the more complex approximations (structure-based approaches and pharmacophore superposition) are the most advantageous in terms of scaffold hopping (they retrieve more molecularly diverse hits), whereas simpler approaches are computationally more efficient while still achieving good active enrichment metrics [90]. Furthermore, structure-based approaches and pharmacophores explain, in an explicit or implicit way, the molecular basis of ligand–target interactions. They are visual and easily interpretable; these two points are not covered by other approximations and should not be underestimated. Not only are they important from an epistemological perspective (they provide results and explanations), but they also provide visual support for their predictions, and visual support is extremely important for communicating results to nonspecialized audiences (e.g., scientific collaborators from other fields and investors). Moreover, we live in an increasingly visual society. That said, one should bear in mind that the efficacy of a given technique is highly dependent on the chosen molecular target. Regarding VS approaches, a gold

Computer-Aided Drug Discovery

13

standard has not yet been found, which explains the need for rigorous in silico validation before moving to VS and subsequent wet experiments. Frequently, different techniques are complementary in nature [91], and the simplest methods have surprisingly good outcomes in some cases. This allows the definition of hybrid protocols that combine simple and complex approximations either serially or in parallel [92]; serially combined approaches tend to provide robust solutions. A final and important step to prune the hits emerging from systematic screening involves filtering out promiscuous compounds, unspecific inhibitory and reactive compounds, such as PAINS and REOS filters [93, 94].
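The difference between serial and parallel hybrid protocols can be illustrated with a minimal sketch. It is not a protocol taken from the cited works: the function names, the 50% retention fraction, the rank-sum fusion rule, and the toy score tables are all assumptions chosen for illustration. In the serial variant a fast method prunes the library before a slower method re-ranks the survivors; in the parallel variant every method ranks the whole library and the ranks are fused.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]  # compound id -> score (higher = more promising)


def rank(compounds: Sequence[str], scorer: Scorer) -> List[str]:
    """Sort compound ids from best to worst score (ties broken by id)."""
    return sorted(compounds, key=lambda cid: (-scorer(cid), cid))


def serial_screen(library: Sequence[str], cheap: Scorer, costly: Scorer,
                  keep_fraction: float = 0.5) -> List[str]:
    """Serial protocol: the cheap method prunes, the costly one re-ranks."""
    n_keep = max(1, int(len(library) * keep_fraction))
    survivors = rank(library, cheap)[:n_keep]
    return rank(survivors, costly)


def parallel_screen(library: Sequence[str], scorers: Sequence[Scorer]) -> List[str]:
    """Parallel protocol: every method ranks the full library; ranks are summed."""
    rank_sum = {cid: 0 for cid in library}
    for scorer in scorers:
        for position, cid in enumerate(rank(library, scorer), start=1):
            rank_sum[cid] += position
    return sorted(library, key=lambda cid: (rank_sum[cid], cid))


# Toy, made-up scores standing in for a similarity search and a docking run.
similarity = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}
docking = {"a": 0.3, "b": 0.9, "c": 0.7, "d": 0.1}
library = list(similarity)

serial_hits = serial_screen(library, similarity.__getitem__, docking.__getitem__)
parallel_hits = parallel_screen(library, [similarity.__getitem__, docking.__getitem__])
print(serial_hits, parallel_hits)
```

In this toy case the serial protocol never docks compounds "c" and "d", which is where the computational saving comes from, at the price of losing any active that the cheap method ranks poorly.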

4 The Actual Design: Hit to Lead and Beyond

Let us assume that one or more hits have emerged from systematic (wet or in silico) screening (or, perhaps, that a starting active scaffold has been obtained from natural ligands of the intended target or from ethnopharmacological research or from a serendipitous observation). The actual drug design process starts here and involves introducing changes to the active scaffold to optimize the interactions with the target, thus gaining potency, and/or to provide selectivity in relation to nontargeted similar proteins (e.g., nontargeted isoforms). Currently, the optimization of other pharmaceutically relevant properties (e.g., chemical and biological stability, oral bioavailability, and clearance) is also considered. It should be considered that hits emerging from VS are usually active in the μM range (or, at best, in the high nM range) [95, 96]. A similar scenario has been observed in HTS campaigns [97]. Molecular optimization usually improves the dissociation constant by approximately two orders of magnitude. From the 1990s onward, however, the pharmaceutical sector has understood that potency is not the only property to take into consideration, a realization that was expressed in the adoption of the “fail early, fail cheap” philosophy with the inclusion of in silico and in vitro absorption, distribution, metabolism, excretion, and toxicity (ADMET) filters in the early stages of drug discovery [98, 99] and the emergent interest in low-affinity ligands within certain therapeutic categories [100]. Classical optimization strategies include extension, ring variations, ring expansion or contraction, bioisosteric replacement, and rigidification, among others. In the case of (complex) active compounds of natural origin, simplification is also explored [101].
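To put the two-orders-of-magnitude figure into energetic terms, the short calculation below applies the standard relation between binding free energy and the dissociation constant, ΔG° = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The specific values (a 1 μM hit optimized to a 10 nM lead, 298.15 K) are illustrative assumptions, not data from the chapter.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol from Kd (1 M standard state)."""
    return R_KCAL * temperature_k * math.log(kd_molar)


# Hypothetical hit and optimized lead, two orders of magnitude apart in Kd.
dg_hit = binding_free_energy(1e-6)   # 1 uM hit
dg_lead = binding_free_energy(1e-8)  # 10 nM lead
gain = dg_lead - dg_hit              # negative: the lead binds more tightly

print(f"hit: {dg_hit:.2f}  lead: {dg_lead:.2f}  gain: {gain:.2f} kcal/mol")
```

Under these assumptions a hundredfold improvement in Kd corresponds to roughly 2.7 kcal/mol of additional binding free energy, which is on the order of a few well-placed hydrogen bonds or hydrophobic contacts.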
Except for similarity methods, which are not generally used for optimization purposes, all other approaches described in Subheading 3 of the chapter can be used to guide optimization. Structure-based approaches are currently the first choice for guiding optimization. They are the only methods that allow theoretically exploring interactions with regions of the target that have not been exploited with previously known ligands in a rational manner and without the need for trial-and-error learning. Among ligand-based approximations, pharmacophore superposition is the friendliest approach for molecular optimization. However, the QSAR approach is also suitable for design purposes, guiding the substitutions made onto the active scaffold; moreover, the inverse QSAR approach (in which, from molecular descriptors, new molecules having the desired activity could be “recovered”) is also suitable for the de novo design of molecules [102–104]. It should be noted that, while classification models are useful for VS campaigns, as they can mitigate the noise related to data compiled from different laboratories, outlier compounds, and mislabeled data points, when the QSAR model is meant for optimization purposes, regression modeling can be particularly useful, because the training dataset is usually synthesized in-house and experimentally tested in the same laboratory, providing high-quality quantitative data obtained in a uniform manner (same assay, same equipment, same experimentalists, etc.). Furthermore, whereas VS applications require chemically diverse datasets, the QSAR models used in optimization campaigns would typically display a narrower applicability domain, as they are obtained from a set of compounds with a common scaffold that has been modified to explore the surrounding chemical space.
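The notion of a narrow applicability domain can be made concrete with a small sketch. It uses the simplest possible definition, a descriptor-range ("bounding box") check, which is only one of several definitions in use and is assumed here purely for illustration; the descriptor names and values are invented.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]


def descriptor_ranges(training: List[Descriptors]) -> Dict[str, Tuple[float, float]]:
    """Minimum and maximum of each descriptor over the training set."""
    names = training[0].keys()
    return {n: (min(c[n] for c in training), max(c[n] for c in training))
            for n in names}


def outside_domain(query: Descriptors,
                   ranges: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of descriptors for which the query falls outside the training range."""
    return [n for n, (low, high) in ranges.items() if not low <= query[n] <= high]


# Hypothetical congeneric series (one scaffold, small substituent changes).
training_set = [
    {"logp": 2.1, "mw": 310.0},
    {"logp": 2.8, "mw": 338.0},
    {"logp": 3.4, "mw": 352.0},
]
ranges = descriptor_ranges(training_set)

close_analog = {"logp": 3.0, "mw": 345.0}
diverse_compound = {"logp": 5.6, "mw": 340.0}

print(outside_domain(close_analog, ranges))      # inside the domain
print(outside_domain(diverse_compound, ranges))  # extrapolation in logp
```

A model trained on such a series can reasonably be trusted for the close analog, whereas a prediction for the second compound would be an extrapolation and should be treated with caution.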

5 In Silico ADMET Filters and Antitargets

Since the 1990s, the search for more potent derivatives of an active scaffold has been balanced with the early detection of potential bioavailability and toxicity issues. Consequently, in silico and in vitro ADME filters are now fully integrated in the early stages of drug discovery and development. This strategy has resulted in an impressive reduction in project termination rates related to ADME issues, although pharmacokinetics and bioavailability still represent a significant cause for attrition at the preclinical development stage and in early clinical trials [105–107]. Toxicological issues (both at the preclinical and clinical stages) also represent one of the key challenges faced by the pharmaceutical industry. The earliest ADME filters involved simple rules of thumb derived from the analysis of the physicochemical properties of drugs with or without a desired behavior. Lipinski’s rule of five at Pfizer pioneered this type of analysis [20], which was later followed by other similar rules related to the prediction of drug bioavailability, such as Veber’s [108]. This trend was also explored in relation to toxicity, e.g., the “3/75” rule [109]. Lately, however, arguments were raised against the rigid implementation of this kind of rule [110], and the possible advantages of moving beyond the “rule of five” chemical space for difficult targets have been emphasized [111, 112], as well as notable systematic exceptions to this rule (e.g., natural products) [112, 113]. Lipinski himself, when first reporting his famous rule, recognized that acceptable drug absorption depended on the triad potency–permeability–solubility, and that his computational alert did not factor in drug potency (a point of his analysis that is often overlooked) [20]; the contribution of drug formulation to oral bioavailability (with the potential aid of in silico tools) has also been emphasized [114]. It has been suggested that the control of physicochemical properties is unlikely to have a significant effect on attrition rates; moreover, if a safety issue results from the primary drug target mechanism or from specific off-target interactions (e.g., hERG channel blockade), it is unlikely that physicochemical properties would be predictive of toxicity [105]. A similar point can be made regarding the prediction of bioavailability issues linked to specific interactions with enzymes (e.g., CYP450 enzymes) or transporters (e.g., ABC efflux transporters). In these cases, using the previously discussed computational tools (docking, pharmacophores, QSAR models) in connection with the antitarget concept could be more advantageous. The use of more elaborate (yet still simple) multiparameter algorithms that address the interplay of physicochemical properties could also prove rewarding [25, 26].
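As an illustration of how such rules of thumb are typically encoded, the sketch below reports violations of the commonly quoted rule-of-five and Veber thresholds instead of issuing a hard pass/fail verdict, in line with the caution against rigid implementation expressed above. The thresholds are the usual textbook values (molecular weight 500, log P 5, 5 hydrogen-bond donors, 10 acceptors; 10 rotatable bonds, polar surface area 140 Å²), and the alert is raised at two or more rule-of-five violations, following a common reading of the original computational alert. The property values of the two example compounds are invented, and in practice they would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Properties:
    mol_weight: float  # molecular weight, Da
    logp: float        # calculated octanol/water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors
    rot_bonds: int     # rotatable bonds
    tpsa: float        # topological polar surface area, square angstroms


def lipinski_violations(p: Properties) -> List[str]:
    """Rule-of-five criteria that the compound violates."""
    checks = [("MW > 500", p.mol_weight > 500), ("logP > 5", p.logp > 5),
              ("HBD > 5", p.h_donors > 5), ("HBA > 10", p.h_acceptors > 10)]
    return [label for label, violated in checks if violated]


def veber_violations(p: Properties) -> List[str]:
    """Veber criteria that the compound violates."""
    checks = [("rotatable bonds > 10", p.rot_bonds > 10),
              ("TPSA > 140", p.tpsa > 140)]
    return [label for label, violated in checks if violated]


def report(p: Properties) -> Dict[str, object]:
    """Collect violations; the alert is advisory, not a rejection."""
    ro5 = lipinski_violations(p)
    return {"ro5_violations": ro5, "ro5_alert": len(ro5) >= 2,
            "veber_violations": veber_violations(p)}


# Hypothetical compounds with made-up property values.
small_molecule = Properties(350.4, 2.9, 2, 5, 4, 78.0)
macrocycle_like = Properties(1202.6, 3.6, 5, 12, 15, 279.0)

print(report(small_molecule))
print(report(macrocycle_like))
```

Returning the list of violated criteria, rather than a single Boolean, leaves room for the case-by-case judgment discussed in this section, for instance when a beyond-rule-of-five compound is being considered for a difficult target.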

6 Final Remarks

We have presented an overview of the most relevant methods and trends in CADDD, with a substantial contribution of state-of-the-art structure prediction tools and target analysis tools. In recent years, advanced machine learning has been integrated into every possible aspect of computer-aided drug discovery, from druggability to essentiality analysis, from structure prediction to customized docking scoring functions. While human beings (and scientists in particular) are naturally inclined to a way of thinking based on pattern recognition and identification of generalities, successful drug design comprises such a complex interplay between a number of objectives (efficacy, safety, pharmacokinetics, biopharmaceutically relevant properties) that the drug designer should beware of oversimplification and dogmatic principles, which may lead not only to bad decisions but also to loss of opportunities and novelty. As the name itself suggests, drug design per se resembles an attentive artisan craftwork. The screening stages and the application of ADMET-related computational alerts, in contrast, sometimes involve automated decisions, compatible with the idea of efficient exploration and fast pruning of a vast chemical universe. Fast pruning usually leads, however, to an over-reduced chemical space. Flexible decision rules should be preferred over rigid ones, since they expand the borders of the more frequently explored regions of the chemical universe. The decision to stop a drug candidate for toxicological or pharmacokinetic reasons involves complex and subtle judgments that should take into consideration cost–benefit analysis and available options to compensate for the predicted difficulties (e.g., formulation alternatives, targeted drug carriers, etc.). It is advised to be careful with excessive automation, to favor critical case-by-case decision-making as much as possible, and to consider difficulties in a multidisciplinary way, including contributions of different professionals involved in the drug discovery cycle at each stage of the drug project.

Acknowledgments

The author thanks CONICET and University of La Plata, where he holds permanent positions.

References

1. Klabunde T, Everts A (2005) GPCR antitarget modeling: pharmacophore models for biogenic amine binding GPCRs to avoid GPCR-mediated side effects. Chembiochem 6:876–889
2. Raschi E, Vasina V, Poluzzi E et al (2008) The hERG K+ channel: target and antitarget strategies in drug development. Pharmacol Res 57:181–195
3. Crivori P (2008) Computational models for P-glycoprotein substrates and inhibitors. In: Vaz RJ, Klabunde T (eds) Anti-targets: prediction and prevention of drug side effects. Wiley-VCH, Weinheim
4. Zamora I (2008) Site of metabolism predictions: facts and experiences. In: Vaz RJ, Klabunde T (eds) Anti-targets: prediction and prevention of drug side effects. Wiley-VCH, Weinheim
5. Fallico M, Alberca LN, Prada Gori DN et al (2022) Machine learning search of novel selective NaV1.2 and NaV1.6 inhibitors as potential treatment against Dravet syndrome. In: Ribeiro PRDA, Cota VR, Barone DAC, de Oliveira ACM (eds) Computational neuroscience. LAWCN 2021. Communications in computer and information science, vol 1519. Springer, Cham
6. Fatoba AJ, Okpeku M, Adeleke MA (2021) Subtractive genomics approach for identification of novel therapeutic drug targets in Mycoplasma genitalium. Pathogens 10:921
7. Süntar I (2020) Importance of ethnopharmacological studies in drug discovery: role of medicinal plants. Phytochem Rev 19:1199–1209
8. Entzeroth M, Flotow H, Condron P (2009) Overview of high-throughput screening. Curr Protoc Pharmacol Chapter 9:Unit 9.4
9. Maia EHB, Assis LC, de Oliveira TA et al (2020) Structure-based virtual screening: from classical to artificial intelligence. Front Chem 8:343
10. Mouchlis VD, Afantitis A, Serra A et al (2021) Advances in de novo drug design: from conventional to machine learning methods. Int J Mol Sci 22:1676
11. Kirsch P, Hartman AM, Hirsch AKH et al (2019) Concepts and core principles of fragment-based drug design. Molecules 24:4309
12. Romano P, Giugno R, Pulvirenti A (2011) Tools and collaborative environments for bioinformatics research. Brief Bioinform 12:549–561
13. Gorgulla C, Boeszoermenyi A, Wang ZF et al (2020) An open-source drug discovery platform enables ultra-large virtual screens. Nature 580:663–668
14. Cox PB, Gupta R (2022) Contemporary computational applications and tools in drug discovery. ACS Med Chem Lett 13:1016–1029
15. Prada Gori DN, Alberca LN, Rodriguez S et al (2022) LIDeB Tools: a Latin American resource of freely available, open-source cheminformatics apps. Artif Intell Life Sci 2:10049
16. Hartenfeller M, Schneider G (2011) De novo drug design. Methods Mol Biol 672:299–323
17. Kuttruff CA, Eastgate MD, Baran PS (2014) Natural product synthesis in the age of scalability. Nat Prod Rep 31:419–432
18. Nicolaou CA, Brown N (2013) Multiobjective optimization methods in drug design. Drug Discov Today Technol 10:e427–e435
19. Talevi A (2016) Tailored multi-target agents. Applications and design considerations. Curr Pharm Des 22:3164–3170
20. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 23:3–25
21. Pajouhesh H, Lenz GR (2005) Medicinal chemical properties of successful central nervous system drugs. NeuroRx 2:542–553
22. Gupta S, Kesarla R, Omri A (2013) Formulation strategies to improve the bioavailability of poorly absorbed drugs with special emphasis on self-emulsifying systems. ISRN Pharm 2013:848043
23. Miller DC, Klute W, Calabrese A et al (2009) Optimising metabolic stability in lipophilic chemical space: the identification of a metabolic stable pyrazolopyrimidine CRF-1 receptor antagonist. Bioorg Med Chem Lett 19:6144–6147
24. Wager TT, Hou X, Verhoest PR et al (2016) Central nervous system multiparameter optimization desirability: application in drug discovery. ACS Chem Neurosci 7:767–775
25. Glen RC, Galloway WR, Spring DR et al (2016) Multiple-parameter optimization in drug discovery: example of the 5-HT1B GPCR. Mol Inform 35:599–605
26. Ghose AK, Ott GR, Hudkins RL (2017) Technically Extended MultiParameter Optimization (TEMPO): an advanced robust scoring scheme to calculate central nervous system druggability and monitor lead optimization. ACS Chem Neurosci 8:147–154
27. Winter R, Montanari F, Steffen A et al (2019) Efficient multi-objective molecular optimization in a continuous latent space. Chem Sci 10:8016–8024
28. Pennington LD, Muegge I (2021) Holistic drug design for multiparameter optimization in modern small molecule drug discovery. Bioorg Med Chem Lett 41:128003
29. He X (2009) Integration of physical, chemical, mechanical and biopharmaceutical properties in solid dosage oral form development. In: Solid dosage oral forms: pharmaceutical theory and practice. Academic Press, Burlington
30. Csermely P, Korcsmáros T, Kiss HJ et al (2013) Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol Ther 138:333–408
31. Wang J, Guo Z, Fu Y et al (2017) Weak-binding molecules are not drugs?-toward a systematic strategy for finding effective weak-binding drugs. Brief Bioinform 18:321–332
32. Talevi A (2022) Antiseizure medication discovery: recent and future paradigm shifts. Epilepsia Open 7(Suppl 1):S133–S141
33. Gashaw I, Ellinghaus P, Sommer A et al (2011) What makes a good drug target. Drug Discov Today 16:1037–1043
34. Knowles J, Gromo G (2003) Target selection in drug discovery. Nat Rev Drug Discov 2:63–69
35. Schmidtke P, Barril X (2010) Understanding and predicting druggability. A high-throughput method for detection of drug binding sites. J Med Chem 53:5858–5867
36. Yuan Y, Pei J, Lai L (2013) Binding site detection and druggability prediction of protein targets for structure-based drug design. Curr Pharm Des 19:2326–2333
37. Barril X (2013) Druggability predictions: methods, limitations and applications. Wires Comput Mol Sci 3:327–338
38. Talevi A, Carrillo C, Comini M (2019) The thiol-polyamine metabolism of Trypanosoma cruzi: molecular targets and drug repurposing strategies. Curr Med Chem 26:6614–6635
39. Tonge PJ (2018) Drug-target kinetics in drug discovery. ACS Chem Neurosci 9:29–39
40. Feng Y, Wang Q, Wang T (2017) Drug target protein-protein interaction networks: a systematic perspective. Biomed Res Int 2017:1289259
41. Viacava Follis A (2021) Centrality of drug targets in protein networks. BMC Bioinf 22:527
42. Sabetian S, Shamsir MS (2019) Computer aided analysis of disease linked protein networks. Bioinformation 15:513–522
43. Casas AI, Hassan AA, Larsen SJ et al (2019) From single drug targets to synergistic network pharmacology in ischemic stroke. Proc Natl Acad Sci U S A 116:7129–7136
44. Schidlitzki A, Bascuñana P, Srivastava PK et al (2020) Proof-of-concept that network pharmacology is effective to modify development of acquired temporal lobe epilepsy. Neurobiol Dis 134:104664
45. Kim B, Jo J, Han J et al (2017) In silico re-identification of properties of drug target proteins. BMC Bioinf 18:248
46. Dezső Z, Ceccarelli M (2020) Machine learning prediction of oncology drug targets based on protein and network properties. BMC Bioinf 21:104
47. Chen S, Jiang H, Cao Y et al (2016) Drug target identification using network analysis: taking active components in Sini decoction as an example. Sci Rep 6:24245
48. Ji X, Freudenberg JM, Agarwal P (2019) Integrating biological networks for drug target prediction and prioritization. Methods Mol Biol 1903:203–218
49. Capra JA, Singh M (2007) Predicting functionally important residues from sequence conservation. Bioinformatics 23:1875–1882
50. Yang J, Roy A, Zhang Y (2013) Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29:2588–2595
51. Han B, Salituro FG, Blanco MJ (2020) Impact of allosteric modulation in drug discovery: innovation in emerging chemical modalities. ACS Med Chem Lett 11:1810–1819
52. Liu T, Ish-Shalom S, Torng W et al (2018) Biological and functional relevance of CASP predictions. Proteins 86(Suppl 1):374–386
53. Clark JJ, Orban ZJ, Carlson HA (2020) Predicting binding sites from unbound versus bound protein structures. Sci Rep 10:15856
54. Kuzmanic A, Bowman GR, Juarez-Jimenez J et al (2020) Investigating cryptic binding sites by molecular dynamics simulations. Acc Chem Res 53:654–661
55. Smith RD, Carlson HA (2021) Identification of cryptic binding sites using MixMD with standard and accelerated molecular dynamics. J Chem Inf Model 61:1287–1299
56. Paul F, Weikl TR (2016) How to distinguish conformational selection and induced fit based on chemical relaxation rates. PLoS Comput Biol 12:e1005067
57. Vajda S, Beglov D, Wakefield AE et al (2018) Cryptic binding sites on proteins: definition, detection, and druggability. Curr Opin Chem Biol 44:1–8
58. Martinez-Rosell G, Lovera S, Sands ZA et al (2020) PlayMolecule CrypticScout: predicting protein cryptic sites using mixed-solvent molecular simulations. J Chem Inf Model 60:2314–2324
59. Zheng W (2021) Predicting cryptic ligand binding sites based on normal modes guided conformational sampling. Proteins 89:416–426
60. Aromolaran O, Aromolaran D, Isewon I et al (2021) Machine learning approach to gene essentiality prediction: a review. Brief Bioinform 22(5):bbab128
61. Basler G (2015) Computational prediction of essential metabolic genes using constraint-based approaches. Gene Essentiality 1279:183–204
62. Pushpakom S, Iorio F, Eyers PA et al (2019) Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov 18:41–58
63. Talevi A, Bellera CL (2020) Challenges and opportunities with drug repurposing: finding strategies to find alternative uses of therapeutics. Expert Opin Drug Discov 15:397–401
64. Szymanski P, Markowicz M, Mikiciuk-Olasik E (2012) Adaptation of high-throughput screening in drug discovery – toxicological screening. Int J Mol Sci 13:427–452
65. Harris CJ, Hill RD, Sheppard DW, Slater MJ, Stouten PF (2011) The design and application of target-focused compound libraries. Comb Chem High Throughput Screen 14(6):521–531
66. Welsch ME, Snyder SA, Stockwell BR (2010) Privileged scaffolds for library design and drug discovery. Curr Opin Chem Biol 14:347–361
67. Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589
68. Baek M, DiMaio F, Anishchenko I et al (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373:871–876
69. Lin Z, Akin H, Rao R et al (2022) Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 2022.07.20.500902
70. Mirdita M, Schütze K, Moriwaki Y et al (2022) ColabFold: making protein folding accessible to all. Nat Methods 19:679–682
71. Varadi M, Anyango S, Deshpande M et al (2022) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444
72. Procacci P (2016) Reformulating the entropic contribution of molecular docking scoring functions. J Comput Chem 37(19):1819–1827
73. Gilson MK, Zhou HX (2007) Calculation of protein-ligand binding affinities. Annu Rev Biophys Biomol Struct 36:21–42
74. Bello M, Martínez-Archundia M, Correa-Basurto J (2013) Automated docking for novel drug discovery. Expert Opin Drug Discov 8:821–834
75. Bodnarchuck MS (2016) Water, water, everywhere... It’s time to stop and think. Drug Discov Today 21:1139–1146
76. Mysinger MM, Shoichet BK (2010) Rapid context-dependent ligand desolvation in molecular docking. J Chem Inf Model 50:1561–1573
77. Li H, Sze KH, Lu G et al (2020) Machine-learning scoring functions for structure-based virtual screening. Wires Comput Mol Sci 11:e1478
78. Zhang X, Shen C, Guo X et al (2021) ASFP (Artificial Intelligence based Scoring Function Platform): a web server for the development of customized scoring functions. J Cheminform 13:6
79. Yang C, Chen EA, Zhang Y (2022) Protein-ligand docking in the machine-learning era. Molecules 27:4568
80. Ge H, Wang Y, Li C et al (2013) Molecular dynamics-based virtual screening: accelerating the drug discovery process by high-performance computing. J Chem Inf Model 53:2757–2764
81. Wang L, Wu Y, Deng Y et al (2015) Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. J Am Chem Soc 137:2695–2703
82. Llanos MA, Alberca LN, Larrea SCV et al (2022) Homology modeling and molecular dynamics simulations of Trypanosoma cruzi phosphodiesterase b1. Chem Biodivers 19:e202100712
83. Lavecchia A (2015) Machine-learning approaches in drug discovery: methods and applications. Drug Discov Today 20:318–331
84. Lemmen C, Zimmermann M, Lengauer T (2002) Multiple molecular superpositioning as an effective tool for virtual database screening. In: Virtual screening: an alternative or complement to high-throughput screening? 1st edn. Kluwer Academic Publishers, Marburg
85. Kristensen TG, Nielsen J, Pedersen CNS (2013) Methods for similarity-based virtual screening. Comput Struct Biotechnol J 5:e201302009
86. Talevi A, Bruno-Blanch LE (2016) Virtual screening applications in the search of novel antiepileptic drug candidates. In: Antiepileptic drug discovery. Novel approaches. Humana Press, New York
87. Schneidman-Duhovny D, Dror O, Inbar Y et al (2008) Deterministic pharmacophore detection via multiple flexible alignment of drug-like molecules. J Comput Biol 15:737–754
88. Cottrell SJ, Gillet VJ, Taylor R et al (2004) Generation of multiple pharmacophore hypothesis using multiobjective optimization techniques. J Comput Aided Mol Des 18:665–682
89. Pirhadi S, Shiri F, Ghasemi JB (2013) Methods and applications of structure based pharmacophores in drug discovery. Curr Top Med Chem 13:1036–1047
90. Zhang Q, Muegge I (2006) Scaffold hopping through virtual screening using 2D and 3D similarity descriptors: ranking, voting, and consensus scoring. J Med Chem 49:1536–1548
91. Krüger DM, Evers A (2010) Comparison of structure- and ligand-based virtual screening protocols considering hit list complementarity and enrichment factors. ChemMedChem 5:148–158
92. Talevi A, Gavernet L, Bruno-Blanch LE (2009) Combined virtual screening strategies. Curr Comput Aided Drug Des 5:23–37
93. Pouliot M, Jeanmart S (2016) Pan Assay Interference Compounds (PAINS) and other promiscuous compounds in antifungal research. J Med Chem 59:497–503
94. Walters WP, Stahl MT, Murcko MA (1998) Virtual screening – an overview. Drug Discov Today 3:160–178
95. Zhu T, Cao S, Su PC et al (2013) Hit identification and optimization in virtual screening: practical recommendations based upon a critical literature analysis. J Med Chem 56:6560–6572
96. Ripphausen P, Nisius B, Peltason L et al (2010) Quo vadis, virtual screening? A comprehensive survey of prospective applications. J Med Chem 53:8461–8467
97. Neetoo-Isseliee Z, MacKenzie AE, Southern C et al (2013) High-throughput identification and characterization of novel, species-selective GPR35 agonists. J Pharmacol Exp Ther 344:568–578
98. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3:711–716
99. Schuster D, Laggner C, Langer T (2005) Why drugs fail – a study on side effects in new chemical entities. Curr Pharm Des 11:3545–3559
100. Talevi A (2016) Computational approaches for innovative antiepileptic drug discovery. Expert Opin Drug Discov 11:1001–1016
101. Wang S, Dong G, Sheng C (2019) Structural simplification of natural products. Chem Rev 119:4180–4220
102. Brown N, Lewis RA (2006) Exploiting QSAR methods in lead optimization. Curr Opin Drug Discov Devel 9:419–424
103. Wong WWL, Burkowski FJ (2009) A constructive approach for discovering new drug leads: using a kernel methodology for the inverse-QSAR problem. J Cheminform 1:4
104. Miyako T, Kaneko H, Funatsu K (2016) Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J Chem Inf Model 56:286–299
105. Waring MJ, Arrowsmith J, Leach AR et al (2015) An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat Rev Drug Discov 14:475–486
106. Cook D, Brown D, Alexander R et al (2014) Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat Rev Drug Discov 13:419–431
107. Roberts RA, Kavanagh SL, Mellor HR et al (2014) Reducing attrition in drug development: smart loading preclinical safety assessment. Drug Discov Today 19:341–347
108. Veber DF, Johnson SR, Cheng HY et al (2002) Molecular properties that influence the oral bioavailability of drug candidates. J Med Chem 45:2615–2623
109. Price DA, Blagg J, Jones L et al (2009) Physicochemical drug properties associated with in vivo toxicological outcomes: a review. Expert Opin Drug Metab Toxicol 5:921–931
110. Sutherland JJ, Raymond JW, Stevens JL et al (2012) Relating molecular properties and in vitro assay results to in vivo drug disposition and toxicity outcomes. J Med Chem 55:6455–6466
111. Doak BC, Zheng J, Dobritzsch D et al (2016) How beyond rule of 5 drugs and clinical candidates bind to their targets. J Med Chem 59:2312–2327
112. Doak BC, Over B, Giordanetto F et al (2014) Oral druggable space beyond the rule of 5: insights from drugs and clinical candidates. Chem Biol 21:1115–1142
113. Lipinski CA (2016) Rule of five in 2015 and beyond: target and ligand structural limitations, ligand chemistry structure and drug discovery project decisions. Adv Drug Deliv Rev 101:34–41
114. Bergström CAS, Charman WN, Porter CJH (2016) Computational prediction of formulation strategies for beyond-rule-of-5 compounds. Adv Drug Deliv Rev 101:6–21

Chapter 2

Virtual Screening Process: A Guide in Modern Drug Designing

Umesh Panwar, Aarthy Murali, Mohammad Aqueel Khan, Chandrabose Selvaraj, and Sanjeev Kumar Singh

Abstract

Due to its capacity to drastically cut the cost and time necessary for experimental screening of compounds, virtual screening (VS) has grown to be a crucial component of drug discovery and development. VS is a computational method used in drug design to identify potential drugs from enormous libraries of chemicals. This approach makes use of molecular modeling and docking simulations to assess the small molecule’s ability to bind to the desired protein. Virtual screening has a bright future, as high computational power and modern techniques are likely to further enhance the accuracy and speed of the process.

Key words Drug discovery, Virtual screening, Chemical libraries, Docking, Simulation, Binding affinity

1 Introduction

The history of virtual screening dates to the early 1970s, when initial attempts were made to computationally predict the binding of small molecules with proteins. However, it was not until the early 1990s that the first successful virtual screening experiments were reported. Since then, virtual screening has become an integral part of the drug discovery process, with a growing number of successful applications. Over the last two decades, it has developed into a crucial step in both commercial and academic drug development. Virtual screening (VS) is a computational method used in drug discovery to identify potential drug candidates from large compound libraries. It involves the use of various molecular modeling and docking simulations to evaluate the binding affinity of small molecules with target proteins. Virtual screening has revolutionized drug discovery and has become an essential tool in modern drug development [1–5].

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


Traditionally, drug discovery and development have relied on a trial-and-error approach, where compounds are synthesized and tested for activity against the target protein. However, this process can be time-consuming and expensive, with low success rates. Virtual screening provides a more efficient and cost-effective alternative: compounds are prioritized by their predicted binding affinity with the target protein, allowing researchers to focus on the most promising ones rather than experimentally testing every compound in the library [6, 7].

The virtual screening process involves three main steps: target selection, library preparation, and screening. Target selection involves the identification of the protein target, which can be a receptor, enzyme, or other protein involved in the disease pathway. The target protein must be well characterized, with a known three-dimensional structure or a reliable homology model. Library preparation involves the generation of a virtual compound library, which can be sourced from various databases or created through de novo methods. The library must be diverse and representative of the chemical space, with compounds that have the potential to interact with the target protein. Screening involves the use of various algorithms and software tools to evaluate the binding affinity of compounds with the target protein. This can be done through molecular docking simulations, molecular dynamics simulations, or other methods. The output of the screening process is a ranked list of compounds, with the most promising candidates selected for further experimental validation [8–14].

There are several types of virtual screening methods, including ligand-based virtual screening and structure-based virtual screening. Ligand-based virtual screening uses the known active ligands or lead compounds to search for structurally similar molecules in a compound library. Structure-based virtual screening, on the other hand, uses the known three-dimensional structure of the target protein to evaluate the binding affinity of compounds. Both methods have their advantages, with ligand-based virtual screening being useful when the target protein is not well characterized and structure-based virtual screening being useful when the protein structure is known. There are several software packages and servers available for virtual screening, including commercial software such as Schrödinger, MOE, and Cresset, as well as open-source software such as AutoDock and OpenBabel. Each has its advantages and limitations, and the choice depends on the specific research question and available resources [12, 15–29]. The general scheme of the virtual screening process in drug discovery is shown schematically in Fig. 1.
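The ligand-based search for structurally similar molecules, and the ranked list it produces, can be sketched in a few lines. The example below assumes that each molecule has already been encoded as a binary fingerprint, represented here as the set of its "on" bit positions, and it uses the Tanimoto coefficient, a widely used similarity measure. The fingerprints and compound identifiers are hand-written placeholders; in a real campaign they would be generated by a cheminformatics toolkit from the library structures.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # positions of the bits set to 1


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either molecule."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_screen(query: Fingerprint, library: Dict[str, Fingerprint],
                      top_n: int = 2) -> List[Tuple[str, float]]:
    """Rank library compounds by similarity to a known active; keep the top hits."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in library.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


# Placeholder fingerprints for a known active and a three-compound library.
known_active = frozenset({1, 2, 3, 4})
toy_library = {
    "cpd_A": frozenset({1, 2, 3, 5}),
    "cpd_B": frozenset({1, 2, 3, 4, 6}),
    "cpd_C": frozenset({7, 8, 9}),
}

hits = similarity_screen(known_active, toy_library)
print(hits)
```

The same ranking-and-cutoff pattern applies to structure-based screening, with the similarity score replaced by a docking score computed against the target structure.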


Fig. 1 Schematic representation of virtual screening process in drug discovery. Abbreviations: SBVS structure-based virtual screening, LBVS ligand-based virtual screening, ADME/T absorption, distribution, metabolism, excretion, and toxicity, MD molecular dynamics

The uses of virtual screening are vast and diverse. It has been used in various drug discovery projects, including the identification of potential inhibitors for cancer, infectious diseases, and neurological disorders. Virtual screening has also been used to identify potential drug candidates for protein–protein interactions, which are notoriously difficult to target with traditional small molecule drugs. Furthermore, virtual screening has been used in the optimization of drug design, where the binding mechanisms of compounds with the target protein are analyzed to improve their efficacy and specificity [30–33].


In conclusion, virtual screening is a powerful tool in drug discovery and development. It has a rich history and has become an essential tool in modern drug discovery. The different types of virtual screening methods have their advantages, and the choice of method depends on the research question and available resources. The availability of software and servers has made virtual screening accessible to researchers worldwide, and its uses are diverse and expanding. Virtual screening will continue to play a critical role in drug discovery and has the potential to revolutionize the field [34–38]. This chapter provides an overview of the general methods and procedures used in the virtual screening process for finding potential drugs that are effective against biological targets.

2 Material

1. Selection of target protein based on literature review or biological information from the RCSB Protein Data Bank (PDB).
2. Preparation of target protein.
3. Generation of receptor grid.
4. Selection of chemical database and their preprocessing for screening.
5. Molecular docking of ligand within binding pocket of protein.
6. ADME and PAINS filtering.
7. Molecular dynamics simulations.

All the steps are processed by Schrödinger Drug Discovery Studio except PAINS filtering.

3 Methods

This section describes the fundamentals of the general virtual screening process, in which computational methods are used to test a library of chemicals against the target protein. Over the past three decades, numerous innovative software programs and servers have been developed for drug discovery and for screening massive chemical libraries, as shown in Table 1. Additional considerations for improving the virtual screening procedure are given in the Notes (see Notes 1–8).

3.1 Selection of Target Protein, Preparation, and Grid Generation

3.1.1 Preliminary Details

To explain the protocol, a case study of virtual screening against the biological targets HIV-1 integrase (IN) and human lens epithelium-derived growth factor (LEDGF/p75) is presented. It uses structure-based virtual screening to identify potent inhibitors that block the protein–protein interaction between them. The X-ray crystallographic structure of the dimeric catalytic core domain (CCD) of HIV-1 IN in complex with quinoline derivative 3 (a LEDGIN; PDB code 3LPU) was retrieved from the Protein Data Bank (PDB, www.rcsb.org) [39–41].

Table 1 List of available chemical libraries, software, or servers for virtual screening in drug discovery [12, 35–44]

| Computational process | Software or servers | Link |
| --- | --- | --- |
| Chemical libraries | Asinex | http://www.asinex.com/ |
| | ChEMBL | https://www.ebi.ac.uk/chembl/ |
| | Chembridge | https://www.chembridge.com/screening_libraries/ |
| | Drug Bank | https://www.drugbank.ca/ |
| | Enamine REAL | https://enamine.net/compound-collections/realcompounds/real-database/ |
| | Lifechemicals | https://lifechemicals.com/ |
| | Maybridge | https://www.maybridge.com/ |
| | PubChem | https://pubchem.ncbi.nlm.nih.gov |
| | Specs | https://www.specs.net/ |
| | Zinc Database | https://zinc.docking.org/ |
| Protein preparation and docking | Modeller | https://salilab.org/modeller |
| | Autodock | https://autodock.scripps.edu |
| | SwissDock | http://www.swissdock.ch |
| | CB-Dock | http://clab.labshare.cn/cb-dock/php/index.php |
| Virtual screening | Glide, Schrödinger | https://www.schrodinger.com/products/glide/ |
| | PyRx | https://pyrx.sourceforge.io/ |
| ADME/T prediction | Qikprop, Schrödinger | https://www.schrodinger.com/qikprop/ |
| | Swiss ADME | http://www.swissadme.ch/ |
| | Biovia Discovery Studio | https://discover.3ds.com/ |
| | admetSAR | http://lmmd.ecust.edu.cn/admetsar1/ |
| | PreADMET | https://preadmet.bmdrc.kr/ |
| Dynamics simulation | Amber | http://ambermd.org/ |
| | Desmond | https://www.schrodinger.com/desmond/ |
| | Gromacs | http://www.gromacs.org/ |

3.1.2 Preprocessing of Targeted Protein

Preprocessing was used to assign bond orders; add missing hydrogens, side chains, and loops; cap the termini; and remove unnecessary water molecules from the target protein. To ensure the stability and quality of the protein structure, energy minimization was performed after charge assignment and determination of protonation states.
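One of the preprocessing steps above, the removal of water molecules, can be sketched in a few lines. This is not how the Schrödinger tools perform the step; it is a minimal illustration that assumes a fixed-column PDB text file in which waters carry the residue name HOH or WAT. The sample records, including the ligand name LIG, are invented. A real preparation would usually keep structurally important waters, which this sketch does not attempt.

```python
def strip_waters(pdb_text: str) -> str:
    """Drop water coordinate records (residue name HOH or WAT) from PDB-format text."""
    kept = []
    for line in pdb_text.splitlines():
        is_coordinate = line.startswith(("ATOM", "HETATM"))
        # The residue name occupies columns 18-20 of a PDB coordinate record.
        if is_coordinate and line[17:20] in ("HOH", "WAT"):
            continue
        kept.append(line)
    return "\n".join(kept)


# Hypothetical four-record file: one protein atom, one ligand atom, one water.
sample_pdb = "\n".join([
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504",
    "HETATM    2  C1  LIG A 301       0.000   0.000   0.000",
    "HETATM    3  O   HOH A 401      10.000  10.000  10.000",
    "END",
])
cleaned = strip_waters(sample_pdb)
```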

3.1.3 Grid Generation

The grid is used to predict the binding poses and binding energies of the ligand molecules. Receptor grid generation can be automated in Glide, which makes it a user-friendly tool for virtual screening and lead optimization studies. Here, the receptor grid was generated in Glide and centered on the centroid of the co-crystallized ligand in the protein [42].
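Centering the grid on the co-crystallized ligand amounts to computing the mean position of the ligand atoms. The sketch below shows that calculation for PDB-format text. It does not reproduce Glide's grid generation, and it assumes that the ligand atoms are stored as HETATM records under a single residue name. The name LIG and the coordinates are invented for illustration.

```python
def ligand_centroid(pdb_text: str, ligand_resname: str) -> tuple:
    """Return the (x, y, z) centroid of all HETATM atoms with the given residue name."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("HETATM") and line[17:20].strip() == ligand_resname:
            # x, y, and z occupy columns 31-38, 39-46, and 47-54 of the record.
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    if not coords:
        raise ValueError(f"no HETATM records found for residue {ligand_resname!r}")
    n_atoms = len(coords)
    return tuple(sum(axis) / n_atoms for axis in zip(*coords))


# Hypothetical records: a two-atom ligand and one water that must be ignored.
sample_pdb = "\n".join([
    "HETATM    1  C1  LIG A 301       0.000   0.000   0.000",
    "HETATM    2  C2  LIG A 301       2.000   4.000   6.000",
    "HETATM    3  O   HOH A 401      10.000  10.000  10.000",
])
grid_center = ligand_centroid(sample_pdb, "LIG")
```

The resulting coordinates could then be supplied as the box center to whichever docking program is used.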

3.2 Selection of Chemical Database and Their Preprocessing for Screening

Chemical compounds are developed with either a general purpose or a specific goal in mind, and libraries of ligands are available in numerous databases. These libraries provide details about the ligand molecules, including their molecular composition, three-dimensional structure, and physical and chemical characteristics. Before screening, the libraries should be preprocessed to optimize geometries, generate ionization states, tautomeric states, and ring conformers, and convert structures from 2D to 3D. In this case study, five distinct libraries were screened for potent compounds using the Virtual Screening Workflow (VSW), a three-stage protocol of high-throughput virtual screening (HTVS), standard precision (SP), and extra precision (XP) docking in the Glide module (Schrödinger, Inc., LLC, New York, USA) [43].
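The three-stage HTVS, SP, and XP protocol works as a funnel: each stage scores the survivors of the previous one and passes only the best-ranked fraction to the next, more expensive stage. The sketch below illustrates that logic only. The scores are invented, the 50% retention fractions are arbitrary placeholders rather than Glide settings, and the function name is hypothetical.

```python
def staged_screen(compounds: list, stages: list) -> list:
    """Pass compounds through successive (name, score_fn, keep_fraction) stages.

    Lower scores are treated as better, as with docking scores in kcal/mol.
    """
    survivors = list(compounds)
    for _name, score_fn, keep_fraction in stages:
        ranked = sorted(survivors, key=score_fn)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
    return survivors


# Hypothetical precomputed scores for eight compounds at each precision level.
library = [
    {"id": "c1", "htvs": -7.0, "sp": -7.5, "xp": -8.0},
    {"id": "c2", "htvs": -6.5, "sp": -8.0, "xp": -9.1},
    {"id": "c3", "htvs": -6.0, "sp": -6.2, "xp": -6.0},
    {"id": "c4", "htvs": -5.5, "sp": -7.0, "xp": -7.2},
    {"id": "c5", "htvs": -4.0, "sp": -9.0, "xp": -9.5},
    {"id": "c6", "htvs": -3.5, "sp": -3.0, "xp": -3.1},
    {"id": "c7", "htvs": -3.0, "sp": -2.0, "xp": -2.5},
    {"id": "c8", "htvs": -2.0, "sp": -1.0, "xp": -1.5},
]
stages = [
    ("HTVS", lambda c: c["htvs"], 0.5),
    ("SP", lambda c: c["sp"], 0.5),
    ("XP", lambda c: c["xp"], 0.5),
]
hits = staged_screen(library, stages)
```

In this toy data, compound c5 has the best XP score but is discarded at the HTVS stage, which illustrates the trade-off a funnel makes between speed and the risk of losing true binders early.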

3.3 Molecular Docking of Ligand Within Binding Pocket of Protein

The screened compounds were then docked into the prepared grid, centered on the centroid of the ligand in the binding pocket of the protein, using the induced-fit docking (IFD) module (Schrödinger, Inc., LLC, New York, USA). This algorithm accounts for conformational changes of the protein upon ligand binding: it allows the protein to adapt its shape to accommodate the ligand, resulting in a more accurate prediction of the binding energy and binding pose [44, 45].

3.4 Prediction of ADME and PAINS Filtering

Earlier studies indicate that oral administration is an important route of drug delivery. Many drug candidates have failed in the late stages of discovery and development because of poor pharmacokinetics and toxicity, which has a significant financial impact on the pharmaceutical industry. To reduce the risk of such late-stage failures, the QikProp module (Schrödinger, Inc., LLC, New York, USA) was used to predict the ADME properties of the ligands; it offers a thorough analysis of key pharmaceutical and drug-like characteristics. A PAINS (pan-assay interference compounds) filter was then applied to the most prominent compounds to check for false positives [46, 47] (see Note 5).
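A common first-pass check for oral drug-likeness is Lipinski's rule of five, which can be applied once the basic descriptors of each ligand are known. The sketch below is not the QikProp calculation. It assumes that molecular weight, logP, and hydrogen-bond donor and acceptor counts have already been computed elsewhere, and the compound names and descriptor values are invented.

```python
def lipinski_violations(mol_weight: float, logp: float,
                        h_donors: int, h_acceptors: int) -> int:
    """Count violations of Lipinski's rule of five for one compound."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # octanol-water partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical descriptors: (molecular weight, logP, H-bond donors, H-bond acceptors).
candidates = {
    "cmpd_A": (350.4, 2.1, 2, 5),
    "cmpd_B": (612.7, 6.3, 4, 9),
    "cmpd_C": (480.0, 5.4, 1, 8),
}
# Keep compounds with at most one violation, a commonly used tolerance.
drug_like = [name for name, desc in candidates.items()
             if lipinski_violations(*desc) <= 1]
```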

3.5 Dynamics Studies to Get the Stability of Selected Hits

Once all the above steps are completed, the top ligand hits are ranked by their binding affinity for the target protein. Finally, each of these top compounds in complex with the target protein was simulated for 30 ns using Desmond (Schrödinger, Inc., LLC, New York, USA) to investigate the dynamic behavior of the complex and the interatomic interactions that contribute to its stability [48, 49]. In conclusion, the top three compounds were considered potential inhibitors of the protein–protein interaction between HIV-1 IN and LEDGF/p75 for use against viral infection (see Note 6).

4 Notes

1. One of the biggest challenges in virtual screening is the accurate prediction of binding affinity. More accurate scoring functions that better account for protein–ligand interactions could therefore significantly improve the virtual screening process. Additionally, using multiple docking programs may increase the accuracy and reliability of the results [50].
2. Combining different virtual screening methods, such as ligand-based and structure-based methods, may increase the accuracy of predictions and improve hit identification relative to the existing framework [51].
3. The integration of machine learning and artificial intelligence could help to identify patterns and trends in large datasets and may aid in the selection of potential drug candidates for experimental testing [52, 53].
4. Increasing the size and diversity of compound libraries would help identify a broader range of potential ligands that could bind to the target protein [54].
5. Other software can also predict the physicochemical and pharmaceutical properties of screened ligands, such as SwissADME (http://www.swissadme.ch), Biovia Discovery Studio (https://discover.3ds.com/), and admetSAR (http://lmmd.ecust.edu.cn/admetsar2).
6. Binding free energy calculations, such as MM-GBSA/PBSA, are widely used to predict protein–ligand binding affinities and may, in theory, be more effective than docking scores [55–59].
7. Quantum mechanics/molecular mechanics (QM/MM) simulations, which can accurately describe the electronic and structural characteristics of small molecule–protein interactions, have grown in significance in the drug discovery process. QM/MM simulations make it feasible to study the reaction mechanisms of drug candidates and their binding free energies. In the future, this information can be used to further validate screened compounds and so increase the accuracy of the virtual screening process [60–67].
8. Drug development relies heavily on combinatorial library design, which makes it possible to explore a huge chemical space. It increases the likelihood of discovering lead candidates with the desired properties by methodically producing a variety of molecules. This method facilitates the optimization of pharmacological activity, specificity, and safety profiles and speeds up the development of new drugs [68].
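Note 1 suggests that using multiple docking programs may make results more reliable. One simple way to combine them is rank averaging, sketched below. This is an illustrative consensus scheme and not a method prescribed by the chapter. It assumes that every compound was scored by every program and that lower scores are better. The program labels and scores are invented.

```python
def consensus_rank(score_tables: list) -> list:
    """Average each compound's rank across several score tables (lower score = better).

    Returns (compound, mean_rank) pairs sorted from best to worst consensus rank.
    """
    rank_sums = {}
    for scores in score_tables:
        ordered = sorted(scores, key=scores.get)
        for rank, compound in enumerate(ordered, start=1):
            rank_sums[compound] = rank_sums.get(compound, 0) + rank
    n_tables = len(score_tables)
    mean_ranks = {c: total / n_tables for c, total in rank_sums.items()}
    return sorted(mean_ranks.items(), key=lambda item: item[1])


# Hypothetical docking scores (kcal/mol) from three different programs.
program_a = {"c1": -9.0, "c2": -8.0, "c3": -7.0}
program_b = {"c1": -6.5, "c2": -7.5, "c3": -5.0}
program_c = {"c1": -8.2, "c2": -8.9, "c3": -8.5}
consensus = consensus_rank([program_a, program_b, program_c])
```

Averaging ranks rather than raw scores avoids having to put scoring functions with different scales on a common footing.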

Acknowledgments

UP, AM, and SKS thankfully acknowledge the DST-PURSE 2nd Phase Programme grant [No. SR/PURSE Phase 2/38 (G)]; DST-FIST Grant [(SR/FST/LSI—667/2016)]; MHRD RUSA Phase 2.0 grant sanctioned vide Letter no. [F.24‐51/2014‐U, Policy (TN Multi‐Gen), Department of Education, Govt of India]; Tamil Nadu State Council for Higher Education (TANSCHE) under [No. AU: S.O. (P&D): TANSCHE Projects: 117/ 202, File No. RGP/2019‐20/ALU/ HECP‐0048]; DBT-BIC, New Delhi, under Grant/Award [No. BT/PR40154/ BTIS/137/ 34/2021, dated 31.12.2021]; and DBT-NNP Project, New Delhi, under Grant/Award [No. BT/PR40156/BTIS/ 54/2023 dated 06.02.2023] for providing the research grant and infrastructure facilities in the lab. CS thankfully acknowledges Saveetha University for providing the infrastructure facilities to perform this work. MAK thankfully acknowledges Alagappa University for providing the RUSA 2.0 Senior Research Fellowship [Alu/RUSA/SRF-Bioinformatics/4156/2022 dated 30.11.2022].

References

1. Walters WP, Wang R (2020) New trends in virtual screening. J Chem Inf Model 60:4109–4111
2. Gorgulla C, Boeszoermenyi A, Wang ZF, Fischer PD, Coote PW, Padmanabha Das KM, Malets YS, Radchenko DS, Moroz YS, Scott DA, Fackeldey K (2020) An open-source drug discovery platform enables ultra-large virtual screens. Nature 580:663–668
3. Lavecchia A, Di Giovanni C (2013) Virtual screening strategies in drug discovery: a critical review. Curr Med Chem 20:2839–2860
4. Grebner C, Malmerberg E, Shewmaker A, Batista J, Nicholls A, Sadowski J (2019) Virtual screening in the cloud: how big is big enough? J Chem Inf Model 60:4274–4282
5. Macarron R, Banks MN, Bojanic D, Burns DJ, Cirovic DA, Garyantes T, Green DV, Hertzberg RP, Janzen WP, Paslay JW, Schopfer U (2011) Impact of high-throughput screening in biomedical research. Nat Rev Drug Discov 10:188–195
6. Doytchinova I (2022) Drug design – past, present, future. Molecules 27:1496
7. Kar S, Roy K (2013) How far can virtual screening take us in drug discovery? Expert Opin Drug Discov 8:245–261
8. Gimeno A, Ojeda-Montes MJ, Tomás-Hernández S, Cereto-Massagué A, Beltrán-Debón R, Mulero M, Pujadas G, Garcia-Vallvé S (2019) The light and dark sides of virtual screening: what is there to know? Int J Mol Sci 20:1375
9. Berry M, Fielding B, Gamieldien J (2015) Practical considerations in virtual screening and molecular docking. In: Emerging trends in computational biology, bioinformatics, and systems biology. Elsevier, p 487
10. De Vita S, Lauro G, Ruggiero D, Terracciano S, Riccio R, Bifulco G (2019) Protein preparation automatic protocol for high-throughput inverse virtual screening: accelerating the target identification by computational methods. J Chem Inf Model 59:4678–4690
11. Chiba S, Ishida T, Ikeda K, Mochizuki M, Teramoto R, Taguchi YH, Iwadate M, Umeyama H, Ramakrishnan C, Thangakani AM, Velmurugan D (2017) An iterative compound screening contest method for identifying target protein inhibitors using the tyrosine-protein kinase Yes. Sci Rep 7:12038
12. Panwar U, Chandra I, Selvaraj C, Singh SK (2019) Current computational approaches for the development of anti-HIV inhibitors: an overview. Curr Pharm Des 25:3390–3405
13. Panwar U, Singh SK (2018) An overview on Zika Virus and the importance of computational drug discovery. J Explor Res Pharmacol 3:43–51
14. Rampogu S, Lemuel MR, Lee KW (2022) Virtual screening, molecular docking, molecular dynamics simulations and free energy calculations to discover potential DDX3 inhibitors. Adv Cancer Res 4:100022
15. Kontoyianni M (2017) Docking and virtual screening in drug discovery. In: Proteomics for drug discovery: methods and protocols. Springer, pp 255–266
16. Aarthy M, Panwar U, Singh SK (2021) Magnitude and advancements of CADD in identifying therapeutic intervention against Flaviviruses. In: Innovations and implementations of computer aided drug discovery strategies in rational drug design. Springer, Singapore, pp 179–203
17. Varela-Rial A, Majewski M, De Fabritiis G (2022) Structure based virtual screening: fast and slow. Wiley Interdiscip Rev Comput Mol Sci 12:e1544
18. Bhrdwaj A, Abdalla M, Pande A, Madhavi M, Chopra I, Soni L, Vijayakumar N, Panwar U, Khan M, Prajapati L, Gujrati D (2023) Structure-based virtual screening, molecular docking, molecular dynamics simulation of EGFR for the clinical treatment of glioblastoma. Appl Biochem Biotechnol 28:1–26
19. Chopra I, Panwar U, Bhrdwaj A, Madhavi M, Soni L, Sharma K, Parihar AS, Mohan VP, Prajapati L, Joshi I, Sharma R (2023) Structural insights into conformational stability of ESR1 and structure base screening of new potent inhibitor for the treatment of Breast Cancer. https://doi.org/10.21203/rs.3.rs1413803/v1
20. Ferraz WR, Gomes RA, S Novaes AL, Goulart Trossini GH (2020) Ligand and structure-based virtual screening applied to the SARS-CoV-2 main protease: an in silico repurposing study. Future Med Chem 12:1815–1828
21. Drwal MN, Griffith R (2013) Combination of ligand- and structure-based methods in virtual screening. Drug Discov Today Technol 10:e395–e401
22. Sharda S, Sarmandal P, Cherukommu S, Dindhoria K, Yadav M, Bandaru S, Sharma A, Sakhi A, Vyas T, Hussain T, Nayarisseri A (2017) A virtual screening approach for the identification of high affinity small molecules targeting BCR-ABL1 inhibitors for the treatment of chronic myeloid leukemia. Curr Top Med Chem 17:2989–2996
23. Reddy KK, Singh SK, Tripathi SK, Selvaraj C, Suryanarayanan V (2013) Shape and pharmacophore-based virtual screening to identify potential cytochrome P450 sterol 14α-demethylase inhibitors. J Recept Signal Transduct Res 33:234–243
24. Ranganathan S, Ilavarasi AV, Palaka BK, Kuppusamy D, Ampasala DR (2022) Cloning, functional characterization and screening of potential inhibitors for Chilo partellus chitin synthase A using in silico, in vitro and in vivo approaches. J Biomol Struct Dyn 40:1416–1429
25. Patidar K, Deshmukh A, Bandaru S, Lakkaraju C, Girdhar A, Gutlapalli VR, Banerjee T, Nayarisseri A, Singh SK (2016) Virtual screening approaches in identification of bioactive compounds akin to delphinidin as potential HER2 inhibitors for the treatment of breast cancer. Asian Pac J Cancer Prev 17:2291–2295
26. Ranganathan S, Ampasala DR, Palaka BK, Ilavarasi AV, Patidar I, Poovadan LP, Sapam TD (2021) In silico binding profile analysis and in vitro investigation on chitin synthase substrate and inhibitors from maize stem borer, Chilo partellus. Curr Comput Aided Drug Des 17:881–895
27. Selvaraj C, Singh SK, Tripathi SK, Reddy KK, Rama M (2012) In silico screening of indinavir-based compounds targeting proteolytic activity in HIV PR: binding pocket fit approach. Med Chem Res 21:4060–4068
28. Doucet D, Retnakaran A, Krell PJ, Feng Q, Ampasala DR (2016) Molecular cloning and structural characterization of ecdysis triggering hormone from Choristoneura fumiferana. Int J Biol Macromol 88:213–221
29. Selvaraj C, Panwar U, Dinesh DC, Boura E, Singh P, Dubey VK, Singh SK (2021) Microsecond MD simulation and multiple-conformation virtual screening to identify potential anti-COVID-19 inhibitors against SARS-CoV-2 main protease. Front Chem 8:595273
30. Hamza A, Wei NN, Zhan CG (2012) Ligand-based virtual screening approach using a new scoring function. J Chem Inf Model 52:963–974
31. Wu KJ, Lei PM, Liu H, Wu C, Leung CH, Ma DL (2019) Mimicking strategy for protein–protein interaction inhibitor discovery by virtual screening. Molecules 24:4428
32. Asadzadeh A, Samad-Soltani T, Rezaei-Hachesu P (2021) Applications of virtual and augmented reality in infectious disease epidemics with a focus on the COVID-19 outbreak. Inform Med Unlocked 24:100579
33. Selvaraj C, Panwar U, Ramalingam KR, Vijayakumar R, Singh SK (2022) Exploring the macromolecules for secretory pathway in cancer disease. Adv Protein Chem Struct Biol 133:55–83
34. Schottlender G, Prieto JM, Palumbo MC, Castello FA, Serral F, Sosa EJ, Turjanski AG, Martí MA, Fernández Do Porto D (2022) From drugs to targets: reverse engineering the virtual screening process on a proteomic scale. Front Drug Discov 2
35. Murugan NA, Podobas A, Gadioli D, Vitali E, Palermo G, Markidis S (2022) A review on parallel virtual screening softwares for high-performance computers. Pharmaceuticals 15:63
36. da Silva Rocha SFL, Olanda CG, Fokoue HH, Sant’Anna CMR (2019) Virtual screening techniques in drug discovery: review and recent applications. Curr Top Med Chem 19:1751–1767
37. Suryanarayanan V, Panwar U, Chandra I, Singh SK (2018) De novo design of ligands using computational methods. In: Gore M, Jagtap U (eds) Computational drug discovery and design. Methods in molecular biology, vol 1762. Humana Press, New York
38. Zhang B, Li H, Yu K, Jin Z (2022) Molecular docking-based computational platform for high-throughput virtual screening. CCF Trans High Perform Comput 13:1–2
39. Panwar U, Singh SK (2018) Structure-based virtual screening toward the discovery of novel inhibitors for impeding the protein-protein interaction between HIV-1 integrase and human lens epithelium-derived growth factor (LEDGF/p75). J Biomol Struct Dyn 36:3199–3217
40. Panwar U, Singh SK (2021) In silico virtual screening of potent inhibitor to hamper the interaction between HIV-1 integrase and LEDGF/p75 interaction using E-pharmacophore modeling, molecular docking, and dynamics simulations. Comput Biol Chem 93:107509
41. Reddy KK, Singh P, Singh SK (2014) Blocking the interaction between HIV-1 integrase and human LEDGF/p75: mutational studies, virtual screening, and molecular dynamics simulations. Mol Biosyst 10:526–536
42. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK, Shaw DE (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749
43. Panwar U, Singh SK (2019) Identification of novel pancreatic lipase inhibitors using in silico studies. Endocr Metab Immune Disord Drug Targets 19:449–457
44. Clark AJ, Tiwary P, Borrelli K, Feng S, Miller EB, Abel R, Friesner RA, Berne BJ (2016) Prediction of protein–ligand binding poses via a combination of induced fit docking and metadynamics simulations. J Chem Theory Comput 12:2990–2998
45. Panwar U, Singh SK (2021) Atom-based 3D-QSAR, molecular docking, DFT, and simulation studies of acylhydrazone, hydrazine, and diazene derivatives as IN-LEDGF/p75 inhibitors. Struct Chem 32:337–352
46. QikProp, Schrödinger, LLC, New York, NY (2021)
47. Tripathi SK, Selvaraj C, Singh SK, Reddy KK (2012) Molecular docking, QPLD, and ADME prediction studies on HIV-1 integrase leads. Med Chem Res 21:4239–4251
48. Desmond molecular dynamics system, D. E. Shaw Research, New York, NY (2021)
49. Reddy KK, Singh SK, Dessalew N, Tripathi SK, Selvaraj C (2012) Pharmacophore modelling and atom-based 3D-QSAR studies on N-methyl pyrimidones as HIV-1 integrase inhibitors. J Enzyme Inhib Med Chem 27:339–347
50. Jones D, Kim H, Zhang X, Zemla A, Stevenson G, Bennett WD, Kirshner D, Wong SE, Lightstone FC, Allen JE (2021) Improved protein–ligand binding affinity prediction with structure-based deep fusion inference. J Chem Inf Model 61:1583–1592
51. Vázquez J, López M, Gibert E, Herrero E, Luque FJ (2020) Merging ligand-based and structure-based methods in drug discovery: an overview of combined virtual screening approaches. Molecules 25:4723
52. Luukkonen S, van den Maagdenberg HW, Emmerich MT, van Westen GJ (2023) Artificial intelligence in multi-objective drug design. Curr Opin Struct Biol 79:102537
53. Murugan NA, Priya GR, Sastry GN, Markidis S (2022) Artificial intelligence in virtual screening: models versus experiments. Drug Discov Today 18:1913
54. Lyu J, Irwin JJ, Shoichet BK (2023) Modeling the expansion of virtual screening libraries. Nat Chem Biol 16:1–7
55. Wang E, Sun H, Wang J, Wang Z, Liu H, Zhang JZ, Hou T (2019) End-point binding free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug design. Chem Rev 119:9478–9508
56. Genheden S, Ryde U (2015) The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert Opin Drug Discov 10:449–461
57. Aarthy M, Panwar U, Singh SK (2020) Structural dynamic studies on identification of EGCG analogues for the inhibition of Human Papillomavirus E7. Sci Rep 10(1):8661
58. Majhi M, Ali MA, Limaye A, Sinha K, Bairagi P, Chouksey M, Shukla R, Kanwar N, Hussain T, Nayarisseri A, Singh SK (2019) An in silico investigation of potential EGFR inhibitors for the clinical treatment of colorectal cancer. Curr Top Med Chem 18(27):2355–2366
59. Ranganathan S, Ilavarasi AV, Palaka BK, Kuppusamy D, Ampasala DR (2022) Cloning, functional characterization and screening of potential inhibitors for Chilo partellus chitin synthase A using in silico, in vitro and in vivo approaches. J Biomol Struct Dyn 40(3):1416–1429
60. Reddy KK, Singh SK, Tripathi SK, Selvaraj C (2013) Identification of potential HIV-1 integrase strand transfer inhibitors: In silico virtual screening and QM/MM docking studies. SAR QSAR Environ Res 24(7):581–595
61. Cavasotto CN, Adler NS, Aucar MG (2018) Quantum chemical approaches in structure-based virtual screening and lead optimization. Front Chem 6:188
62. Vijayalakshmi P, Selvaraj C, Singh SK, Nisha J, Saipriya K, Daisy P (2013) Exploration of the binding of DNA binding ligands to Staphylococcal DNA through QM/MM docking and molecular dynamics simulation. J Biomol Struct Dyn 31(6):561–571
63. Gleeson MP, Gleeson D (2009) QM/MM calculations in drug discovery: a useful method for studying binding phenomena? J Chem Inf Model 49:670–677
64. Aarthy M, Panwar U, Selvaraj C, Singh SK (2017) Advantages of structure-based drug design approaches in neurological disorders. Curr Neuropharmacol 15(8):1136–1155
65. Reddy KK, Singh SK (2014) Combined ligand and structure-based approaches on HIV-1 integrase strand transfer inhibitors. Chem Biol Interact 218:71–81
66. Ranganathan S, Ampasala DR, Palaka BK, Ilavarasi AV, Patidar I, Poovadan LP, Sapam TD (2021) In silico binding profile analysis and in vitro investigation on chitin synthase substrate and inhibitors from maize stem borer, Chilo partellus. Curr Comput Aided Drug Des 17:881–895
67. Selvaraj C, Krishnasamy G, Jagtap SS, Patel SK, Dhiman SS, Kim TS, Singh SK, Lee JK (2016) Structural insights into the binding mode of D-sorbitol with sorbitol dehydrogenase using QM-polarized ligand docking and molecular dynamics simulations. Biochem Eng J 114:244–256
68. Selvaraj C, Omer A, Singh P, Singh SK (2015) Molecular insights of protein contour recognition with ligand pharmacophoric sites through combinatorial library design and MD simulation in validating HTLV-1 PR inhibitors. Mol Biosyst 11(1):178–189

Chapter 3

Molecular Dynamics as a Tool for Virtual Ligand Screening

Grégory Menchon, Laurent Maveyraud, and Georges Czaplicki

Abstract

Rational drug design is essential for new drugs to emerge, especially when the structure of a target protein or nucleic acid is known. To that purpose, high-throughput virtual ligand screening campaigns aim at computationally discovering new binding molecules or fragments that modulate particular biomolecular interactions or biological activities related to a disease process. The structure-based virtual ligand screening process relies primarily on docking methods, which predict the binding of a molecule to a biological target structure with a correct conformation and the best possible affinity. Docking alone is not sufficient, as it suffers from several crucial limitations (lack of full protein flexibility information, no solvation and ion effects, poor scoring functions, and unreliable molecular affinity estimation). At the interface of computer techniques and drug discovery, molecular dynamics (MD) allows introducing protein flexibility before or after a docking protocol; refining the structure of protein–drug complexes in the presence of water, ions, and even membrane-like environments; describing more precisely the temporal evolution of the biological complex; and ranking these complexes with more accurate binding energy calculations. In this chapter, we describe up-to-date MD methods that serve as supporting tools in the virtual ligand screening (VS) process. Without a doubt, using docking in combination with MD is an attractive approach in structure-based drug discovery protocols nowadays. It has proved its efficiency through many examples in the literature and is a powerful method to significantly reduce the amount of wet-laboratory experimentation required (Tarcsay et al, J Chem Inf Model 53:2990–2999, 2013; Barakat et al, PLoS One 7:e51329, 2012; De Vivo et al, J Med Chem 59:4035–4061, 2016; Durrant, McCammon, BMC Biol 9:71–79, 2011; Galeazzi, Curr Comput Aided Drug Des 5:225–240, 2009; Hospital et al, Adv Appl Bioinforma Chem 8:37–47, 2015; Jiang et al, Molecules 20:12769–12786, 2015; Kundu et al, J Mol Graph Model 61:160–174, 2015; Mirza et al, J Mol Graph Model 66:99–107, 2016; Moroy et al, Future Med Chem 7:2317–2331, 2015; Naresh et al, J Mol Graph Model 61:272–280, 2015; Nichols et al, J Chem Inf Model 51:1439–1446, 2011; Nichols et al, Methods Mol Biol 819:93–103, 2012; Okimoto et al, PLoS Comput Biol 5:e1000528, 2009; Rodriguez-Bussey et al, Biopolymers 105:35–42, 2016; Sliwoski et al, Pharmacol Rev 66:334–395, 2014).

Key words Molecular dynamics, Drug design, Virtual screening, Clustering, Protein–ligand complex, Docking, Interaction energy, Affinity

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

Virtual ligand screening has become an important tool in rational drug design and early drug development over the last decades, but it remains a very challenging task for academia and the pharmaceutical industry in the hunt for new safe and effective drug candidates. Such research is complex and computationally expensive, and it requires increasingly accurate and sophisticated programs and equipment to achieve a successful enrichment of active and biologically relevant compounds within a reasonable timescale [1–16]. Virtual screening can be coarsely divided into two complementary approaches, ligand-based and structure-based methods [17]. In this chapter, we focus on the latter, applied to protein targets.

In contrast with an experimental medium- to high-throughput assay, a rational approach in the drug discovery pipeline requires precise knowledge of the 3D structure of the target protein. This structural information is obtained experimentally through NMR and X-ray crystallography and, more recently, through cryo-electron microscopy, which has now broken the atomic resolution barrier [18–21]. While homology modeling has been able to deliver convincing structures, recent progress in machine learning has greatly increased the accuracy of predicted models, to the point that they are now a credible alternative to experimentally determined structures [22, 23].

There is no unique protocol or solution when starting a new drug discovery project, but the general workflow can be compared to a “funnel strategy.” It usually starts with a high-throughput docking experiment, using filtered and enriched ligand libraries and a set of representative 3D protein structures determined with the previously mentioned techniques. See Fig. 1 for a general schematic overview of this procedure.

Fig. 1 A general schematic illustration of the funnel-like approach to drug discovery. This chapter deals with virtual screening, representing the first step of this method. Figure created with free images from https://www.slideteam.net and https://www.freepik.com

Apart from the availability of accurate protein structures, the main challenges in these in silico approaches are the time cost of sampling the huge available chemical space and the prediction of accurate binding energies and ADME properties of compounds. Nowadays, the emergence of new approaches (e.g., chemical space docking, synthon-based virtual screening), together with high-performance computers and software and the increasing availability of challenging protein 3D structures (such as GPCRs or protein–protein interaction interfaces, PPIs) and of ultra-large molecule libraries (with hundreds of millions of commercially available compounds), paves the way for more successful lead compound discoveries in significantly less time [24–26].

Computational docking finds the best position, orientation, and theoretical affinity of a ligand in the binding pocket of a target receptor (Refs. [27–30] and references therein). The different docking algorithms used for conformational sampling and scoring have been extensively described in the literature [31]. However, docking methods also face major issues, especially considering that protein–ligand recognition and binding are dynamic processes, whereas docking is only a static or semiflexible method [32]. Although flexible docking protocols easily predict different ligand conformations, the docking protocols and their output cannot account for full protein flexibility, induced-fit and solvent effects, protein–ligand complex stability over time, or accurate binding free energy estimation. Flexible docking techniques can be used to allow for protein flexibility during the ligand binding event, but only for a small number of residues, usually in the proximity of the binding pocket. They therefore cannot take into account important conformational changes occurring during drug binding (induced-fit effect) [33–37]. Furthermore, docking scoring functions, which discriminate and rank compounds according to their predicted binding affinity, are limited in accuracy and can often be misleading [38–42]. MD is a powerful tool that helps to overcome these critical limitations and is generally implemented before and after docking.


Grégory Menchon et al.

The number of publications containing MD studies, related to either protein studies or drug discovery, has increased exponentially within the last 20 years [43]. MD is a computational simulation of a complex biological system which describes motions, interactions, and dynamics at the atomic level by choosing a “force field” describing all the interatomic interactions and by integrating the Newtonian equations of motion, which give the positions and velocities of atoms over time [44–47]. As an exact description of protein and protein–ligand motions would involve highly complex and computationally expensive quantum mechanics (QM) calculations, MD approximates these QM terms, and movements governed by probability functions, through Newtonian physics and force fields that are parameterized to fit physicochemical knowledge obtained from experimental data [48].

Before docking, MD allows conformational sampling and clustering of a protein or enzyme to account for protein dynamics and the conformational selection by a ligand [49, 50]. This addresses the limitations of a “static” experimental structure by docking molecules to a representative set of the protein’s conformations. In cases where a proper binding site is not obvious, it allows the detection of cryptic or allosteric cavities that were not present in the initial structure [51]. Computational simulations can also be used to study the conformational transitions caused by mutations, protonation or phosphorylation states, and the behavior of allosteric modulators or to help the scientist interpret unclear electron densities in X-ray-derived structures [52, 53].

MD is also used as a second-step filtering process to further validate a protein–ligand complex obtained from docking by determining the stability of the complex from a trajectory. This can be done through deviation and fluctuation analysis, such as RMSD (root mean square deviation), RMSF (root mean square fluctuation), or ROG (radius of gyration), by identifying persistent H-bonds and other interatomic interactions, and by estimating the binding free energy [54] (see Fig. 2). MD yields more accurate binding free energies and kinetic calculations by including solvent and ion effects, both enthalpic and entropic contributions, and the full dynamics of the biological system [32]. This more accurate evaluation and validation of the protein–ligand complex allows better ranking of the hit molecules, facilitates the hit selection, and thus further reduces the time and cost of experimental testing as well as the number of false positives and negatives in wet laboratory experiments [55]. Finally, by including the induced-fit and solvent effects, we can work with a more realistic model, which may apply even to a membrane-like environment in the case of membrane proteins [56, 57]. These MD procedures are described in detail in Subheading 3.


Fig. 2 Examples of stability analysis (RMSD vs. time) for several typical cases (data not published): (a) ligand is anchored in the binding site of the receptor and shows a long-term stability; (b) ligand is stable in the binding site but shows internal mobility. Specifically, a phenyl ring undergoes a flip-flop motion. The overall energy of the complex does not change, as the atoms on both sides of the rotation axis are equivalent. However, the RMSD calculation does not take this into account, as it tracks atoms with their unique identifiers. As a result, periodic jumps of RMSD values are observed; (c) after some time spent in the originally docked position, ligand moves to a different site in the proximity of the previous one; (d) ligand is unstable and quickly leaves the original position and then moves around the receptor unable to find a specific binding site
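The periodic RMSD jumps described for panel (b) can be reproduced with a few lines of Python. The sketch below is illustrative only: the idealized ring geometry, the function name `rmsd`, and the index mapping `swap` are assumptions introduced here and are not part of the protocol of this chapter; no superposition (fitting) step is performed.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two coordinate lists compared atom by atom (same order, no fitting)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must have the same number of atoms")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Idealized planar phenyl ring (radius 1.4 A); atoms 0 and 3 lie on the x axis,
# which is taken here as the rotation (flip) axis.
ring = [
    (1.4 * math.cos(math.radians(60 * k)), 1.4 * math.sin(math.radians(60 * k)), 0.0)
    for k in range(6)
]

# A 180-degree flip about the x axis maps (x, y, z) to (x, -y, -z):
# atom 1 lands on the site of atom 5, atom 2 on the site of atom 4, and so on.
flipped = [(x, -y, -z) for (x, y, z) in ring]

# RMSD that tracks atoms by their identifiers: the flip looks like a large move.
rmsd_by_id = rmsd(ring, flipped)

# RMSD after relabelling the chemically equivalent atoms: the structures coincide.
swap = [0, 5, 4, 3, 2, 1]
rmsd_symmetry_aware = rmsd(ring, [flipped[j] for j in swap])

print(f"identifier-based RMSD: {rmsd_by_id:.2f} A")
print(f"symmetry-aware RMSD:   {rmsd_symmetry_aware:.2f} A")
```

The identifier-based value is close to 2 Å although the two ring states are physically indistinguishable, which is why such jumps should not be read as a loss of binding.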

MD has reached such a high performance and versatility in the modern drug development process [58, 59] that it has even been used to help refine antibody design and stability [57, 60–62], to study drug delivery carriers for therapeutic substances [63], and to successfully study and/or target proteins that were formerly considered “undruggable” from the small-molecule perspective (e.g., intrinsically disordered proteins, GPCRs, and highly dynamic proteins lacking a defined binding pocket, like the well-known oncogenic RAS protein family) [57, 64–68]. More recently, the SARS-CoV-2 pandemic prompted the scientific community to rush for the discovery of new antiviral drugs. MD was successfully applied to help understand the action of viral protein inhibitors or to find new ones [69–72], to evaluate the putative efficacy of known molecules in a drug repurposing strategy [73, 74], or to understand the link between the protein dynamics of viral variants and infectivity [75].


Docking and MD approaches are highly complementary computational methods for drug screening. Like docking, however, MD faces its own limitations. The first is the high computational cost (computer power and required timescales), even though considerable improvements have been made in computer power and algorithmic efficiency [76–80]. We are today able to simulate the dynamics and temporal evolution of complex biological systems and binding events, but only on a limited timescale (up to 10 μs), whereas longer timescales would sometimes be required for biological relevance. During the last decades, however, technical advances such as replica exchange MD (REMD), accelerated MD (aMD), and dissipation-corrected targeted MD have made it possible to get closer to sampling the whole conformational space of a system and to dissect its behavior across high energy barriers [81–84]. A second limitation lies in the approximations of the force fields used, which were originally developed for the simulation of globular ordered proteins. Parametrization of force fields is essential for a correct representation of the structural details that are important for protein–ligand binding [85–89].

In this chapter, we describe the general virtual ligand screening workflow that couples docking with MD simulations. The Methods section presents protocols to prepare a protein and the MD files, to launch an MD simulation, to analyze the MD trajectories, to extract representative receptor structures and use them in docking of a library of filtered molecules, and finally to validate the putative complexes by calculating their stability, by detecting interatomic interactions, and by computing ligand affinity parameters. Each step, from the preparation of the receptor and ligand input files to the validation of the complexes, is described in detail and gives the reader an overview of a complete and general strategy which is broadly used by computational biologists to find new binding molecules of pharmacological interest. We especially emphasize the need to use the information on molecular flexibility during a virtual screening procedure, which is mandatory to fully explain and predict a binding event between a receptor and a putative binding molecule.

2 Materials

The existing MD and VS software runs on a variety of platforms, but in this chapter, we will focus on the programs we have used in our laboratory. Some of the other existing options will be given in references, but this is not meant to be an exhaustive review. The calculations have been performed on a Supermicro Server Tower 7049GP-TRT operating under the Linux OS (CentOS 7.9), equipped with two Intel Xeon Silver 4208 CPUs offering in total 32 threads in the HT mode as well as four NVidia RTX 2080 Ti GPU cards. The MD procedures outlined below have been developed with the Amber 20 and AmberTools 20 suite of programs [90, 91] but may work with other versions of the software. Some of the data conversion was done with the OpenBabel program (http://openbabel.org/wiki/Main_Page). The docking has been performed with the AutoDock Vina program [92, 93] using the AutoDock Tools [94] as well as with the OpenEye software suite (OpenEye Scientific Software, Santa Fe, NM, http://www.eyesopen.com). Chemical structures were drawn with MarvinSketch from ChemAxon (Marvin 20.19, 2020, http://www.chemaxon.com). Visualization has been done with Pymol v.2.6 OpenSource (https://www.pymol.org) and Vida v.4.3 (OpenEye). Part of the work was done using the resources of the Calmip HPC supercomputer (Sequana, Atos-Bull, https://www.calmip.univ-toulouse.fr).

3 Methods

The objective of this chapter is to describe in detail how to perform virtual screening while preserving the flexibility of both the receptor and the ligand. The receptor’s flexibility is mimicked by docking to a representative set of receptor structures, which can be obtained from the clustering analysis of an MD trajectory. The ligand’s flexibility depends on the number of rotatable bonds and is routinely handled by current docking software. The MD runs are executed on GPU cards rather than on CPUs in order to profit from the acceleration offered by the much larger number of processing units present in GPUs. Details of the procedures which refer specifically to GPU computing will be mentioned explicitly in the text below. The sections below describe the chain of commands to obtain a family of representative receptor structures, the preparation of molecules for docking, the docking procedure, and finally the analysis of results.

3.1 Prepare Receptor for MD Simulation

1. Obtain the PDB file of the protein you are interested in. Typically, it will come from the PDB database (http://www.rcsb.org, http://www.wwpdb.org/) if the structure was determined experimentally by X-ray crystallography, NMR, or cryo-EM, or it can be obtained by homology modeling [95–97] or machine learning [22]. NMR files include protons, whereas the others usually contain only heavy atoms (see Note 1).

2. Add missing residues. Unstructured loops and other flexible regions of the receptor may not appear in crystallographic PDB files. The MD software, however, needs a complete molecule. We currently use MODELLER v. 10 [95, 98, 99] to fill in the structural gaps. The Python script “loop.py,” supplied with the examples of MODELLER’s use, can be used for this purpose. Alter the line:

    self.residue_range('X:A', 'Y:A')

to define your own region to model, replacing X, Y, and A by the relevant residue range and chain symbol, respectively. The input PDB file must contain the residues to be modeled, but their atomic coordinates may be arbitrary. The easiest way to achieve this is to create a linear chain with the missing residues in the extended conformation as a separate molecule and then paste it into the original PDB file. Finally, launch the following script (here called “Run_Modeller_Loop”) to generate N models, numbered from MOD_FROM to MOD_TO (this allows adding new models to the set of the existing ones). The script uses the ((i++)) arithmetic syntax and therefore needs bash:

    #!/bin/bash
    #
    # Usage: Run_Modeller_Loop input_file MOD_FROM MOD_TO
    fname=$1
    from=$2
    to=$3
    cp $fname input.pdb
    i=$from
    while [ $i -le $to ]
    do
        echo "Model #"$i
        python loop.py $i > $i.log
        ((i++))
    done

At the end, the best model can be extracted according to the optimized energy score, e.g., with the command:

    grep -H OBJECTIVE loop* | sort -n -k6

Generating hundreds of models may be time-consuming. Since we have a multicore server on hand, we can accelerate this process by parallelizing the procedure. A very simple way of achieving it is to divide the number of models we wish to create by the number of cores assigned to the task, so that each of them works on its own share of work independently. The script below (called Mod_Loop_Par) shows how this can be done without the explicit parallelization using either OpenMP or MPI:


    #!/bin/bash
    # Run Modeller's loop prediction software (N models on M cores).
    # Usage: Mod_Loop_Par input_file model_from model_to num_cpu

    # Subroutine usage: "do_job from to"
    do_job() {
        i=$1
        to=$2
        while [ $i -le $to ]
        do
            # echo "Model #"$i
            modpy.sh python loop.py $i > $i.log
            ((i++))
        done
    }

    fname=$1
    from=$2
    to=$3
    cpu=$4

    # Count the log files already present in tmp.1 (models from a previous run)
    if [ -d "tmp.1" ]
    then
        base=`ls -l tmp.1/*.log | wc -l`
    else
        base=0
    fi

    # Number of models per core, rounded up
    ((nmod=$to-$from+1))
    ((mpc=$nmod/$cpu))
    ((rem=$nmod-$cpu*$mpc))
    if [ "$rem" != "0" ]
    then
        ((mpc++))
    fi
    ((nmod=$mpc*$cpu))
    echo "Calculating $nmod models on $cpu processors"
    echo "($mpc models per processor)"

    cp $fname input.pdb
    ic=1
    while [ $ic -le $cpu ]
    do
        mkdir -p tmp.$ic
        cp input.pdb loop.py tmp.$ic
        cd tmp.$ic
        ((fr=($ic-1)*$mpc+1+$base))
        # $fr already contains $base, so it must not be added a second time
        ((to=$fr+$mpc-1))
        ( do_job $fr $to ) &
        cd ..
        ((ic++))
    done
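The work split performed by Mod_Loop_Par (models per core rounded up, one contiguous range of model numbers per core) can be illustrated with a short Python sketch. The function name `split_models` is hypothetical, and, unlike the shell script, the sketch starts numbering at the requested first model and ignores models left over from earlier runs.

```python
def split_models(first, last, n_workers):
    """Return one (start, end) range of model numbers per worker.

    The number of models per worker is rounded up, so the total may slightly
    exceed the requested number, as in the shell script.
    """
    n_models = last - first + 1
    if n_models <= 0 or n_workers <= 0:
        raise ValueError("need at least one model and one worker")
    per_worker = -(-n_models // n_workers)  # ceiling division
    ranges = []
    for worker in range(n_workers):
        start = first + worker * per_worker
        end = start + per_worker - 1
        ranges.append((start, end))
    return ranges

# Ten models requested on four cores: three models per core, twelve in total.
for core, (start, end) in enumerate(split_models(1, 10, 4), start=1):
    print(f"core {core}: models {start}-{end}")
```

Because the ranges do not overlap, the per-core jobs can write their log files independently and the results can be merged afterwards.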

The Mod_Loop_Par script creates temporary directories, one per core, in which partial results are obtained. At the end, the analysis can be done with the command:

    grep -H OBJECTIVE tmp.*/loop.BL* | sort -g -k6 | column -t > out.txt

This produces a file (out.txt in this example) with sorted entries, with the best models at the top of the list.

There are several alternative approaches, based on Web servers offering loop modeling services. A non-exhaustive list can be found at the address https://www.vls3d.com/index.php/links/bioinformatics/3dstructure-prediction/modeling-loops. In general, the user has to select the residues representing the beginning and the end of a gap as well as the sequence to be added. The results are sorted according to the degree of satisfaction of the spatial constraints of the loop within the sequence (see Note 2).

3. Clean the PDB file. Remove the CONECT records and all protons if they exist (as is the case with NMR structures). The reason to remove protons is that their identifiers are not always the same as those in the templates of the internal databases of MD software, which may cause problems. The software then adds the missing atoms according to the templates, with correct identifiers. Further, add TER records after the last residue of each molecular entity (polypeptide chain, cofactor, ligand) you wish to include in the MD. Look for structure-embedded disordered water molecules and remove all but one population of them. Similarly, look for residues which exist in alternate conformations (e.g., ALYS and BLYS). Select one of them (preferably the one with the highest occupancy) for further processing and remove the rest. Also, correct partially built residues, such as surface side chains with atoms missing as a result of local disorder.

4. Decide on the protonation state of residues such as HIS and CYS (in some cases, ASP and GLU side chains may also be protonated). In the Amber software, HIS becomes HID, HIE, or HIP, if it is protonated in position delta, epsilon, or both, respectively. Deprotonated cysteines and those bound to metal atoms become CYM, while those involved in disulfide bridges become CYX (see Note 3).


5. To create a disulfide bridge, use a tleap command such as “bond mol.X.SG mol.Y.SG,” where mol is the object holding the protein structure, while X and Y stand for the residue numbers of the cysteines involved. The residue numbers typically start from the first number in the PDB file and continue sequentially through chains and all extras, such as ions and ligands. Type “desc mol” in tleap to see all residue numberings. We use tleap rather than xleap to be independent of the graphical platform used and to be able to include this command in batch scripts. We launch tleap on a predefined set of scripts (e.g., “tleap -f script.txt”). An example of a tleap script follows (the comments in parentheses are added for clarity and should not appear in the script itself):

    source leaprc.protein.ff14SB            (make sure this is in the path of the program)
    set default PBradii mbondi2             (for affinity calculations; see Subheading 3.7)
    mol=loadpdb prot.pdb                    (loads the cleaned and completed PDB file)
    list                                    (checks the contents)
    check mol                               (there should be no missing parameters)
    bond mol.10.SG mol.20.SG                (example of a bridge between Cys10 and Cys20)
    saveamberparm mol rec.prmtop rec.inpcrd (create topology files)
    savepdb mol rec.pdb                     (optionally save the new PDB file)
    quit

The saveamberparm command keeps the topology of the receptor, which will be needed at the analysis stage. The script checks whether the structure and topology are correct (see Note 4). Should this not be the case, error messages will appear, helping to identify the source of the problem, which should be corrected before continuing.

6. If there are ions present in the structure (Mg2+, Zn2+, ...), treat each as a separate residue (i.e., insert TER before and after each line on which an ion is placed). Change the atom and residue names to those recognized by the force field used, such as ff14SB in the example above (e.g., for chloride: use “Cl-” in the fields of both atom and residue names).
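Part of the clean-up described in step 3 of this subsection (removing CONECT records and protons and keeping a single alternate conformation) can be sketched in Python. This is a minimal illustration that relies only on the fixed-column PDB format; the function name `clean_pdb_lines` is hypothetical, the highest-occupancy conformation is chosen per residue, and TER insertion, disordered waters, and incomplete side chains are deliberately left out.

```python
def clean_pdb_lines(lines):
    """Drop CONECT records and hydrogens; keep one alternate conformation per residue."""
    atom_records = ("ATOM", "HETATM")

    # First pass: for each residue with alternate locations, remember the
    # altLoc label that carries the highest occupancy.
    best_alt = {}
    for line in lines:
        padded = line.rstrip("\n").ljust(80)
        if padded.startswith(atom_records) and padded[16] != " ":
            key = (padded[21], padded[22:27])      # chain ID, residue number + insertion code
            occupancy = float(padded[54:60])
            if key not in best_alt or occupancy > best_alt[key][1]:
                best_alt[key] = (padded[16], occupancy)

    # Second pass: filter the records.
    cleaned = []
    for line in lines:
        padded = line.rstrip("\n").ljust(80)
        if padded.startswith("CONECT"):
            continue
        if padded.startswith(atom_records):
            element = padded[76:78].strip().upper()
            if not element:
                # Fall back on the atom name when the element column is empty.
                element = padded[12:16].strip().lstrip("0123456789")[:1].upper()
            if element == "H":
                continue
            alt = padded[16]
            if alt != " ":
                key = (padded[21], padded[22:27])
                if alt != best_alt[key][0]:
                    continue
                # Blank the altLoc column of the conformation that is kept.
                line = padded[:16] + " " + padded[17:]
        cleaned.append(line)
    return cleaned
```

The result should still be inspected by eye, since the choice between alternate conformations and disordered waters often depends on the local environment.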


3.2 Preparation of MD Input Files

1. Prepare the whole system for the MD simulations in tleap by creating a solvated periodic box and adding charge-neutralizing ions. The script may look like this (comments in parentheses are added for clarity and should not appear in the script itself):

    source leaprc.protein.ff14SB
    set default PBradii mbondi2
    mol=loadpdb prot.pdb
    check mol
    charge mol                              (determine the net charge of the system)
    bond mol.10.SG mol.20.SG                (example of a bridge between Cys10 and Cys20)
    solvatebox mol TIP3PBOX 10              (create a cubic periodic box with 10 Å between the protein and the nearest box edge)
    addions mol Na+ 0                       (neutralize charges; use Na+ or Cl- depending on the result of the charge command above)
    saveamberparm mol sys.prmtop sys.inpcrd (create topology files)
    savepdb mol sys.pdb                     (optionally save the new PDB file)
    quit

2. The MD protocol we follow starts with a rapid minimization of the solvent by the steepest descent algorithm. The solute is restrained by a harmonic potential with a force constant of 10 kcal/mol/Å². Typical values of the different parameters are given in the script below:

    Initial energy minimization - step 1 (solvent only)
     &cntrl
      imin=1, ncyc=500, maxcyc=500,
      ntb=1, cut=10, ntr=1
     /
    Hold the solute fixed
    10.0
    RES 1 291
    END
    END

In the above, 291 is the number of residues of the receptor including any cofactor or ligand residues and should be adjusted as needed. As for the choice of the cutoff, the general rule is that the box size should be at least twice the cutoff plus a margin of 1–2 Å. Since the distance between the protein and the box edge is 10 Å on each side, the box size is twice this value plus the diameter of the protein, in total ca. 40 Å in this example. Since the cutoff was defined as 10 Å, the above rule is satisfied. For the description of other parameters, see the relevant manual on the Web page http://ambermd.org/doc12/Amber20.pdf.

The second stage is a more thorough minimization, where the initial 500 iterations of the steepest descent are followed by 1500 iterations (maxcyc-ncyc) of the conjugate gradient algorithm. This time, the minimization includes the solute under weak restraints (force constant equal to 2 kcal/mol/Å²):

    Initial energy minimization - step 2 (solvent and restrained solute)
     &cntrl
      imin=1, ncyc=500, maxcyc=2000,
      ntb=1, cut=10, ntr=1
     /
    Weak restraints
    2.0
    RES 1 291
    END
    END
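The cutoff rule quoted above (box size at least twice the cutoff plus a margin of 1–2 Å) is simple arithmetic and can be checked before launching a run. The helper below is a hypothetical sketch; the 20 Å solute diameter is inferred from the ca. 40 Å box of the example and should be replaced by the dimensions of your own system.

```python
def box_satisfies_cutoff(solute_diameter, buffer, cutoff, margin=2.0):
    """Return (box_edge, ok) for a solute padded by `buffer` on each side.

    All lengths are in angstroms; `ok` is True when
    box_edge >= 2 * cutoff + margin.
    """
    box_edge = solute_diameter + 2 * buffer
    return box_edge, box_edge >= 2 * cutoff + margin

# Values of the example: 10 A buffer (solvatebox ... 10) and cut=10.
edge, ok = box_satisfies_cutoff(solute_diameter=20.0, buffer=10.0, cutoff=10.0)
print(f"box edge = {edge:.0f} A, rule satisfied: {ok}")
```

For small solutes with a thin solvent layer the rule can fail, in which case either the buffer should be increased or the cutoff reduced.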

In the third stage, we perform a short MD run in the NVT ensemble (ntb=1), i.e., keeping the number of molecules (N), volume (V), and temperature (T) constant. The integration step is equal to 1 fs, and during the run, we increase the temperature from 0 to 300 K:

    MD 20 ps with weak restraints and step of 1 fs
     &cntrl
      imin=0, irest=0, ntx=1,
      ntb=1, cut=10, ntr=1, ntc=2, ntf=2,
      tempi=0.1, temp0=300.0, ntt=3, gamma_ln=1.0,
      nstlim=20000, dt=0.001, ioutfm=1, ntxo=2,
      ntpr=1000, ntwx=1000, ntwr=1000
     /
    Weak restraints
    1.0
    RES 1 291
    END
    END
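The length of a run and the spacing of the saved frames follow directly from nstlim, dt, and ntwx. The small Python helper below (the name `run_summary` is hypothetical) makes this arithmetic explicit; it is a convenience sketch and does not read or write Amber files.

```python
def run_summary(nstlim, dt_ps, ntwx):
    """Return (total_time_ps, frame_interval_ps, n_frames) for an MD stage."""
    total_time_ps = nstlim * dt_ps        # number of steps times the time step
    frame_interval_ps = ntwx * dt_ps      # time between two saved frames
    n_frames = nstlim // ntwx             # frames written to the trajectory
    return total_time_ps, frame_interval_ps, n_frames

# Heating stage above: nstlim=20000, dt=0.001 ps, ntwx=1000
total, interval, frames = run_summary(20000, 0.001, 1000)
print(f"{total:.0f} ps in total, one frame every {interval:.0f} ps, {frames} frames")
```

The same helper applied to the production settings used later in this protocol (nstlim=50000000, dt=0.002, ntwx=5000) gives 100,000 ps, i.e., 100 ns, with one frame every 10 ps and 10,000 frames.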


The ioutfm=1 keyword requests a binary output trajectory in the NetCDF format. Similarly, ntxo=2 creates the restart file in the NetCDF format. The keywords ntpr, ntwx, and ntwr define the frequency of writing to the output file, to the trajectory file, and to the restart file, respectively. In this case, we see new output every 1000 steps of 1 fs each, i.e., at intervals of 1 ps.

In the next stage of the protocol, we perform a somewhat longer MD run in the NPT ensemble (ntb=2), i.e., keeping the number of molecules (N), pressure (P), and temperature (T) constant. We use an integration step of 2 fs. It is during this simulation that the system starts shrinking if the initial density was not optimal:

    MD 100 ps NPT at 300K; no restraints; step 2 fs
     &cntrl
      imin=0, irest=1, ntx=5,
      ntb=2, pres0=1.0, ntp=1, taup=2.0,
      cut=10, ntr=0, ntc=2, ntf=2,
      tempi=300.0, temp0=300.0, ntt=3, ig=-1, gamma_ln=1.0,
      nstlim=50000, dt=0.002, ioutfm=1, ntxo=2,
      ntpr=1000, ntwx=1000, ntwr=1000
     /
     &ewald
      skinnb=4.0d0
     /

In the above, we have added the &ewald section with one keyword (skinnb) whose value we wish to modify. Its default value is 2 Å, and it corresponds to an extension of the cutoff within which the non-bonded neighbor list is created. When the system is being equilibrated, it may shrink by more than the size of this parameter. This causes no problems if calculations are done on CPUs, but the GPU code is not the same and may cause an execution error. To avoid it, we altered the skinnb parameter. This value is not necessarily optimal, so its fine-tuning for each specific case is encouraged.

In the final stage of the protocol, we initiate a production run of 100 ns:

    Production run: MD 100 ns NPT at 300K; output once every 10 ps
     &cntrl
      imin=0, irest=1, ntx=5,
      ntb=2, pres0=1.0, ntp=1, taup=2.0,
      cut=10, ntr=0, ntc=2, ntf=2,
      tempi=300.0, temp0=300.0, ntt=3, ig=-1, gamma_ln=1.0,
      nstlim=50000000, dt=0.002, ioutfm=1, ntxo=2,
      ntpr=5000, ntwx=5000, ntwr=5000
     /
     &ewald
      skinnb=4.0d0
     /

In this case, the output will be written to files every 5000 steps of 2 fs each, i.e., once every 10 ps. This gives 100 frames of the whole system per nanosecond. If the system is large, it will produce huge output trajectory files. It is for this reason that we use the NetCDF format, which roughly halves the original volume of the ASCII files (see Note 5).

3.3 Running the MD Simulation

1. Create a script that will launch all MD stages automatically. Here is an example of such a script, named “Run_pmemd_cuda” (it uses bash-specific commands such as let):

    #!/bin/bash
    #
    # Setup of AMBER CUDA jobs.
    # Each job needs:
    # - mdinX (X is the current job number)
    # - $BASE.prmtop (topology file)
    # - ${BASE}Y.rst (Y=X-1, coordinate file)
    # NOTE: the 1st coordinate file may be the original inpcrd file.
    #
    # Each job creates:
    # - ${BASE}X.out (output file)
    # - ${BASE}X.rst (last coordinates)
    # - ${BASE}X.mdcrd (trajectory)
    #
    # The CUDA version launches the tasks on a GPU. To select a given
    # device, use deviceQuery, then set CUDA_VISIBLE_DEVICES.
    #
    echo -n "Total number of jobs: "
    read Njobs
    echo -n "ID of the 1st job [1]: "
    read MDfirst
    echo -n "Basename for I/O files: "
    read BASE
    echo "List of available GPUs:"
    deviceQuery -noprompt | egrep '^Device' |
    while read dev; do
        echo " "$dev
    done
    echo -n "Enter device #: "
    read GPU
    export CUDA_VISIBLE_DEVICES=$GPU
    #
    echo "Launching jobs on GPU #"$GPU"..."
    if [ 0$MDfirst -eq "0" ]
    then
        MDfirst=1
    fi
    MDlast=$Njobs
    MDcurr=$MDfirst
    MDinp=0
    while [ $MDcurr -le $MDlast ]
    do
        echo -n "Job $MDcurr started on "
        date
        let "MDinp = ${MDcurr} - 1"
        $AMBERHOME/bin/pmemd.cuda -O -i mdin$MDcurr \
            -o $BASE$MDcurr.out \
            -p $BASE.prmtop \
            -c $BASE$MDinp.rst \
            -ref $BASE$MDinp.rst \
            -r $BASE$MDcurr.rst \
            -inf mdinfo$MDcurr \
            -x $BASE$MDcurr.mdcrd
        let "MDcurr = ${MDcurr} + 1"
    done
    echo -n "Jobs finished on "
    date

The deviceQuery command comes with the CUDA Toolkit (https://developer.nvidia.com/cuda-toolkit), which has to be installed before using any GPU-related software. The CUDA version should correspond to the one with which the software was developed. If multiple GPU cards are to be used in parallel, the list of selected devices (variable “GPU” in the script above) contains comma-separated identifiers (e.g., “0,1”), and the first line of the launch command should be

    mpirun -n NGPUS $AMBERHOME/bin/pmemd.cuda.MPI -O -i mdin$MDcurr \

where NGPUS is the number of GPU cards supposed to work in parallel on a given task. In order to adjust the input MD files to the above script, rename the file sys.inpcrd (see Subheading 3.2, step 1) to sys0.rst before the start of the simulations (see Note 6).

2. Log files from each stage of the simulation should be inspected for possible problems. If the first MD stage fails, it may be due to the failure of the first two minimization steps to remove the hot spots in the initial structure. Consequently, the input structure has to be examined and corrected. Ideally, the simulation would terminate without problems. However, large systems require long computation times, often weeks to months. If a power failure occurs, the MD run is terminated prematurely. In order to restart the calculation, we can use the rst files saved with predefined frequencies. Suppose the production run is interrupted. First, check the length of the trajectory already calculated by consulting the simulation time at the end of the sys5.out file. Then, copy the mdin5 file to a new file called mdin6 and adjust the parameter nstlim so as to obtain the initially desired length of the simulation after completing this new stage. Then, simply launch the script Run_pmemd_cuda again, specifying the total number of stages to be 6 and beginning the simulations from stage #6. The program will use the existing sys5.rst as input and continue the calculation of the trajectory. If desired, the two trajectories, sys5.mdcrd and sys6.mdcrd, can be combined by the ptraj or cpptraj utility program.

3.4 Analysis of the MD Trajectory

1. Monitor the time dependence of such variables as potential energy, density, and volume. The analysis will be performed on the equilibrated part of the trajectory, where these parameters are stable. To do it, use a Perl script from the Amber website, available at the address http://archive.ambermd.org/200507/att-0228/process_mdout.perl. It produces text files whose contents can be visualized by any plotting software. By looking at the graphs, one can see from which frame the trajectory can be considered stable. This and all subsequent frames to the end of the trajectory will be used in clustering to find representative and distinct structures of the receptor.

2. The clustering procedure involves several steps. We will use the kclust routine from the MMTSB software package, available from http://blue11.bch.msu.edu/mmtsb/Main_Page and from https://github.com/mmtsb/toolset. In the first step, we extract individual frames from the trajectory and put them as PDB files in a subdirectory. Run the command “ptraj sys.prmtop ptraj_PDB.in,” where the last argument is the name of the file containing the following lines:

    trajin sys5.mdcrd 3001 10000 1
    # remove solvent and ions
    strip :291-999999
    # remove trans & rot
    center :1-290 mass origin
    image origin center familiar
    # best fit to the first frame
    rms first mass :1-290@CA,C,N
    # put all the pdb frames in a subdirectory
    trajout PDB/frame.pdb pdb multi

The keyword multi makes sure that each frame will be stored as a separate PDB file. The example above assumes that the production run sys5.mdcrd is in the current directory and that the equilibrated trajectory spans frame numbers from 3001 to 10,000. The increment is equal to 1; hence, all 7000 frames will be extracted. Also, the receptor in the example above has 290 residues. You should adjust these numbers to your specific case.

The second step adjusts the file numbering format to the input requirements of the kclust program. The numbers should have leading zeros and be written with five characters. This is done by the csh script below:

    #!/bin/csh
    set DIR='PDB'
    set ff = `ls -1 ${DIR}/*.pdb.* | head -1`
    set fnam = $ff:r
    # four-digit frame numbers: add one leading zero
    set numfil = `ls -1 ${DIR}/*.pdb.???? | wc -l`
    if ( $numfil != 0 ) then
        foreach fnam (${DIR}/*.pdb.????)
            set fr=$fnam:r
            set fnum=$fnam:e
            mv $fnam $fr.0$fnum
            echo $fnam $fr.0$fnum
        end
    endif
    # three-digit frame numbers: add two leading zeros
    set numfil = `ls -1 ${DIR}/*.pdb.??? | wc -l`
    if ( $numfil != 0 ) then
        foreach fnam (${DIR}/*.pdb.???)
            set fr=$fnam:r
            set fnum=$fnam:e
            mv $fnam $fr.00$fnum
            echo $fnam $fr.00$fnum
        end
    endif
    # two-digit frame numbers: add three leading zeros
    set numfil = `ls -1 ${DIR}/*.pdb.?? | wc -l`
    if ( $numfil != 0 ) then
        foreach fnam (${DIR}/*.pdb.??)
            set fr=$fnam:r
            set fnum=$fnam:e
            mv $fnam $fr.000$fnum
            echo $fnam $fr.000$fnum
        end
    endif
    # one-digit frame numbers: add four leading zeros
    set numfil = `ls -1 ${DIR}/*.pdb.? | wc -l`
    if ( $numfil != 0 ) then
        foreach fnam (${DIR}/*.pdb.?)
            set fr=$fnam:r
            set fnum=$fnam:e
            mv $fnam $fr.0000$fnum
            echo $fnam $fr.0000$fnum
        end
    endif
    ls -1 ${DIR}/*.pdb* > framelist
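For readers who prefer Python to csh, the same five-character, zero-padded numbering can be obtained with a few lines. This is an alternative sketch rather than the procedure used by the authors; the function names `padded_name` and `pad_frame_numbers` are hypothetical, and the frame list file still has to be written afterwards.

```python
import os

def padded_name(filename, width=5):
    """Turn 'frame.pdb.42' into 'frame.pdb.00042'; leave other names unchanged."""
    base, _, number = filename.rpartition(".")
    if not base or not number.isdigit():
        return filename
    return base + "." + number.zfill(width)

def pad_frame_numbers(directory, width=5):
    """Rename every numbered frame in `directory`; return the (old, new) name pairs."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        new_name = padded_name(name, width)
        if new_name != name:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append((name, new_name))
    return renamed

print(padded_name("frame.pdb.42"))
```

Calling pad_frame_numbers("PDB") would rename the extracted frames in place, so it is advisable to test it on a copy of the directory first.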

The second line of the csh renaming script defines the PDB subdirectory and should point at the one used in the previous step. The file created at the end of that script (“framelist”) contains the list of all frames to be used in subsequent clustering.

In the third step, we launch kclust and create an output file (“kclust.out”) with the results:

    #!/bin/sh
    rad=2.0
    list='framelist'
    kclust -mode rmsd -centroid -cdist -heavy -lsqfit -radius $rad \
        -maxerr 1 -iterate ${list} > kclust.out
    ncl=`grep Cluster kclust.out | wc -l`
    echo Found $ncl clusters.

In the above, the name of the file with the list of frames (“framelist”) should match the one used in the previous step. The rad parameter is the radius in Å, defining the size of a cluster. If a structure has RMSD distance from the centroid that is larger than rad Å, it will not be included in the current cluster. By controlling this parameter, we control the number of generated clusters.


In the fourth step, we extract the centroids from the list of clusters:

    #!/bin/sh
    awk -f extract_centroids.awk kclust.out | tee centroids.stat

Here, we use an awk script (“extract_centroids.awk”), available from http://ambermd.org/tutorials/basic/tutorial3/files/extract_centroids.awk:

    BEGIN{b0=2;}
    {centind=index($1,"#Centroid");
     c=$2;
     getline;centind=index($0,"#Centroid");
     FIL0 = sprintf("centroid%2.2d.member.dat",c)
     while(centind != 1){
       print $1,$3 > FIL0 ;
       getline;centind=index($0,"#Centroid");
     }
     numcent=0;
     print $2,$1,NR-b0;
     c=$2;
     getline;
     endrec = index("End",$0);
     while( endrec != 1 ){
       FIL = sprintf("centroid%2.2d.pdb",c);
       print > FIL;
       getline;
       endrec = index($0,"#End");
     }
     b0=NR+2;
    }

However, since the centroid structures have no physical meaning, we shall search for the best cluster members, i.e., those of the clustered structures that are the closest to the centroids in the RMSD sense. Here is the fifth step of the procedure:

    #!/bin/csh
    set list='framelist'
    rm -f best_members.out
    foreach file (`ls *.member.dat`)
        set i=$file:r
        set j=$i:r
        set num=`sort -nk2 $file | head -1 | cut -f1 -d' '`
        set rms=`sort -nk2 $file | head -1 | cut -f2 -d' '`
        set i=1
        foreach name (`cat $list`)
            if ($i == $num) then
                set m=$name:e
                echo 'Best member in' $j':' ID = $m '(rmsd = '$rms')'
                echo 'Best member in' $j':' ID = $m '(rmsd = '$rms')' >> best_members.out
                cp $name ${j}_best_member.pdb
            endif
            set i=`expr $i + 1`
        end
    end
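The selection rule of this fifth step (take the cluster member with the smallest RMSD to the centroid and look up the corresponding frame) is compact enough to be sketched in Python. The sketch assumes member files with two whitespace-separated columns, an index into the frame list and an RMSD, as written by the awk script; the function names are hypothetical.

```python
def best_member(member_lines):
    """Return (index, rmsd) of the member closest to the centroid, or None if empty."""
    best = None
    for line in member_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        index, rmsd = int(fields[0]), float(fields[1])
        if best is None or rmsd < best[1]:
            best = (index, rmsd)
    return best

def frame_for_index(frame_list, index):
    """Map a 1-based member index onto the corresponding entry of the frame list."""
    return frame_list[index - 1]

members = ["1 1.25", "2 0.40", "3 0.90"]
frames = ["PDB/frame.pdb.00001", "PDB/frame.pdb.00002", "PDB/frame.pdb.00003"]
index, rmsd = best_member(members)
print(frame_for_index(frames, index), rmsd)
```

In practice the member lines would be read from each centroidNN.member.dat file and the frame list from the file produced at the renaming step.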

Please note that at the beginning of the csh script of the fifth step we specify explicitly the name of the file with the frame list, defined at the clustering step. The output is saved in the text file best_members.out. It lists the identifiers of the best structures in each cluster along with their corresponding RMSD values (see Note 7). As a result, we now have a relatively short list, containing representative structures of the receptor (one per cluster found), which will be used in subsequent docking.

3.5 Virtual Screening of Ligands and Multiple Receptor Structures

Although there is no perfect approach to docking [100, 101], in this chapter, we will deal with two possible virtual screening strategies. The first uses Fred [102] and the associated OpenEye software (see Subheading 2) and considers both the ligand and the receptor to be rigid. This significantly increases the speed of the calculation, at the cost of precision. It is therefore usual to use several starting conformations for each ligand to be evaluated. The second strategy considers flexible ligand but rigid receptor and relies on AutoDock Vina [92, 93], in which case a single conformation of each ligand is included in the library. It should be noted that Vina can, in principle, define some receptor’s side chains as being flexible, but this significantly increases the computation time and is not sufficient to fully mimic the conformational adaptation of the receptor to the ligand binding. 1. Prepare files for docking. Download the chemical compounds library you are interested in or use in-house preassembled libraries. Most of the online chemical structure libraries are freely accessible for downloading (e.g., ZINC database [103]), and most of the compounds should be commercially available. These files often have an .sdf or .mol2 extension (see Note 8). First of all, generate 3D coordinates for the molecules. If you are planning to use AutoDock Vina, split them into .pdbqt files via OpenBabel with added hydrogens, removed salts, and charges corresponding to the protonation state at physiological pH. The following commands can be used:

    obabel -isdf drugs.sdf -omol2 -O drugs.mol2 -r -h -p 7.2 --gen3d
    obabel -imol2 drugs.mol2 -opdbqt -O drugs.pdbqt -m

Filter your molecules according to selected physicochemical properties, e.g., Lipinski's rule of five [104, 105]. The Screening Assistant v.2 (http://sa2.sourceforge.net) can be used to remove known reactive compounds (covalent binders) and warheads (non-covalent binders) and to eliminate PAINS (pan-assay interference) compounds [106]. Alternatively, you can use the programs from the OpenEye software suite to enumerate tautomeric states for each molecule, to filter them, to calculate partial atomic charges, and to generate low-energy conformers:

    tautomers -in drugs.sdf -out taut_drugs.sdf
    filter -i taut_drugs.sdf -o filt_drugs.sdf -typecheck
    fixpka -i filt_drugs.sdf -o fixpka_drugs.sdf
    molcharge -method am1bccsym fixpka_drugs.sdf drugs.mol2
    omega2 drugs.mol2 drugs.oeb.gz
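The rule-of-five filter mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptors are taken as already computed by some other tool, the Props container and the function names are hypothetical, and tolerating one violation is a common convention rather than a setting taken from this protocol.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    name: str
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol/water partition coefficient
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def lipinski_violations(p: Props) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        p.mol_weight > 500.0,
        p.logp > 5.0,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


def passes_lipinski(p: Props, max_violations: int = 1) -> bool:
    """Accept a molecule with at most max_violations violations (assumed default)."""
    return lipinski_violations(p) <= max_violations


if __name__ == "__main__":
    library = [
        Props("small_polar", 320.4, 2.1, 2, 5),
        Props("large_greasy", 712.9, 6.3, 1, 8),
    ]
    print([p.name for p in library if passes_lipinski(p)])
```

In practice the descriptor values would come from the tools named in this step; the sketch only shows the decision logic applied to them.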

In general, one can filter molecules with respect to such properties as molecular weight, logP, PSA, the number of hydrogen bond donors and/or acceptors, the number of rotatable bonds, the number of rings and/or ring systems, the number of N and O atoms, and charges. To this end, ChemAxon's cxcalc program can be used (e.g., cxcalc -g logP --pH 7.4 input.sdf) or obabel (e.g., obabel -isdf input.sdf -o sdf -O filtered.sdf -r --filter "MW [...]

    [...] >> $LOGFILE
    BEGIN=`date +%s`
    for (( j=0; j[...] out_$i.log 2>&1 &
    done
    wait
    END=`date +%s`
    ELAPSED_TIME=$(( END - BEGIN ))
    DAYS=$(( ELAPSED_TIME/86400 ))
    TMR=$(( ELAPSED_TIME-DAYS*86400 ))
    HOURS=$(( TMR/3600 ))
    TMP=$(( TMR-3600*HOURS ))
    MINUTES=$(( TMP/60 ))
    SECONDS=$(( TMP-60*MINUTES ))
    echo "TIME SPENT on $NB_CORES cores = $ELAPSED_TIME seconds, or ${DAYS}d ${HOURS}h ${MINUTES}m ${SECONDS}s." >> $LOGFILE
    exit

The script above uses the Linux command mpstat, which may not be available by default on your system. In such a case, install the package sysstat. Here is the script Launch_Vina to which the above code refers:

    #!/bin/sh
    WORK_PATH=$1
    LIGLIST=$2
    cat $LIGLIST | while read name; do
      b=`basename $name .pdbqt`;
      outfile=$WORK_PATH/${b}_docked.pdbqt;
      logfile=$WORK_PATH/$b.log;
      vina --config vina.conf --ligand $name --out $outfile --log $logfile --cpu 1
    done

An example of the input file vina.conf is given below:

    # specify cpu, ligand, out & log on the command line
    receptor = receptor.pdbqt
    center_x = 4.531
    center_y = -23.703
    center_z = 66.949
    size_x = 42.0
    size_y = 42.0
    size_z = 45.0
    num_modes = 20
    exhaustiveness = 32
    energy_range = 5
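One way to obtain the center and size values for vina.conf is to take the bounding box of the atoms that line the binding site and add a margin. The Python sketch below illustrates this under stated assumptions: the coordinates are supplied by the user (for instance, from selected residues of the receptor), the 5 Å padding is an arbitrary illustrative value, and the function names are hypothetical.

```python
def docking_box(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around coords.

    coords  : iterable of (x, y, z) tuples in angstroms
    padding : margin added on every side, in angstroms (assumed value)
    """
    xs, ys, zs = zip(*coords)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    size = tuple((b - a) + 2.0 * padding for a, b in zip(lo, hi))
    return center, size


def vina_box_lines(center, size):
    """Format the box as center_*/size_* lines for a Vina configuration file."""
    lines = [f"center_{axis} = {c:.3f}" for axis, c in zip("xyz", center)]
    lines += [f"size_{axis} = {s:.1f}" for axis, s in zip("xyz", size)]
    return lines


if __name__ == "__main__":
    site_atoms = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0)]
    for line in vina_box_lines(*docking_box(site_atoms)):
        print(line)
```

The printed lines can be pasted into the configuration file in place of the example values; the box should still be inspected visually against the receptor.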

Adapt the file names and the grid size/center coordinates in vina.conf to your case. If you decide to include some flexible side chains of the receptor in the docking, there are a few additional steps to follow. Please refer to this manual: https://autodock-vina.readthedocs.io/en/latest/docking_flexible.html.

3. As stated above, Vina has been parallelized for shared memory architectures (multicore machines), but it can also be launched on distributed memory architectures, such as HPC clusters. This is accomplished by launching an array job, available in many queuing systems, such as SGE (https://docs.oracle.com/cd/E19279-01/820-3257-12/n1ge.html), PBS Professional (https://altairengineering.fr/pbs-professional), and SLURM (https://slurm.schedmd.com). Ask your system manager for a generic script for your machine. We have used the resources of the Calmip supercomputer (Sequana, Atos-Bull) with the SLURM job manager. The launch script uses the command salloc to allocate the desired number of nodes for the job:

    #!/usr/bin/bash
    #set -x
    PREFIX=`hostname`
    if [ ! $1 ]
    then
      echo "Error! You need to indicate the number of nodes for computations as an argument."
      echo "Ex: ./Run_Vina_job.sh 64 > out.log 2>&1 &"
      echo "(use 64 nodes with redirection and in background)"
      exit
    else
      NB_NODES=$1
    fi
    WORK_PATH=`pwd`
    now=`date +%Y-%m-%d_%H-%M-%S`
    RESULTS_DIR=./results_$now
    rm -rf $RESULTS_DIR
    mkdir -p $RESULTS_DIR
    salloc -N $NB_NODES -J Vina --exclusive ./Run_nodes.sh $WORK_PATH $RESULTS_DIR
    exit

Each reserved node runs the script “Run_nodes.sh”:

    #!/usr/bin/bash
    WORK_PATH=$1
    RESULTS_DIR=$2
    PREFIX=`hostname`
    LIGANDS_LIST=./ligands_list
    ls ./ligands/* > $LIGANDS_LIST
    NB_LIGANDS=`cat $LIGANDS_LIST | wc -l`
    NB_LIGANDS_PER_NODE=$(( NB_LIGANDS / SLURM_NNODES ))
    NB_EXTRA_LIGANDS=$(( NB_LIGANDS % SLURM_NNODES ))
    NODELIST=(`nodeset -e $SLURM_NODELIST`)
    echo "$PREFIX Split $NB_LIGANDS ligands into $SLURM_NNODES files with $NB_LIGANDS_PER_NODE ligands entries."
    echo "$PREFIX There are $NB_EXTRA_LIGANDS ligands entries left to distribute among the $SLURM_NNODES nodes."
    for (( i=0; i[...] ${LIGANDS_LIST}_${NODELIST[$i]}
    done
    for (( j=0; j[...]> ${LIGANDS_LIST}_${NODELIST[$j]}
    done
    echo "$PREFIX LAUNCH Run_cores.sh"
    clush -w $SLURM_NODELIST -f $SLURM_NNODES $WORK_PATH/Run_cores.sh $WORK_PATH $RESULTS_DIR
    exit
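The distribution step in Run_nodes.sh rests on an integer division with remainder: every node receives the same base number of ligands, and the leftover entries are then handed out. The Python sketch below shows one possible allocation of this kind. It is an assumption about the scheme, not a transcription of the shell loops, and the function name is hypothetical.

```python
def split_among_workers(items, n_workers):
    """Split items into n_workers consecutive lists.

    Each worker gets len(items) // n_workers items; the remainder is
    distributed one extra item each to the first workers.
    """
    per_worker, extra = divmod(len(items), n_workers)
    chunks = []
    start = 0
    for w in range(n_workers):
        count = per_worker + (1 if w < extra else 0)
        chunks.append(items[start:start + count])
        start += count
    return chunks


if __name__ == "__main__":
    ligands = [f"lig_{k}.pdbqt" for k in range(10)]
    for node, chunk in enumerate(split_among_workers(ligands, 4)):
        print(node, len(chunk), chunk)
```

The same arithmetic applies one level down, when the ligands assigned to a node are spread over its cores.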

The command clush (cluster shell) runs the script Run_cores.sh on all reserved nodes in parallel:

    #!/usr/bin/bash
    WORK_PATH=$1
    RESULTS_DIR=$2
    PREFIX=`hostname`
    cd $WORK_PATH
    FILENAME=./ligands_list_`hostname`
    NB_CORES=`cat /proc/cpuinfo | grep processor | wc -l`
    FILENAM=`basename -s .bullx $FILENAME`
    NB_LIGANDS=`cat $FILENAM | wc -l`
    NITER=$(( $NB_LIGANDS / $NB_CORES ))
    REST=$(( $NB_LIGANDS - $(( $NITER * $NB_CORES )) ))
    echo "$PREFIX $NB_LIGANDS ligands are divided into $NITER groups of $NB_CORES ligands."
    echo "$PREFIX There are $REST ligands left to distribute among the $NB_CORES cores."
    export KMP_AFFINITY=disabled
    export OMP_NUM_THREADS=1
    exec < $FILENAM
    BEGIN=`date +%s`
    echo "Run started on $BEGIN"
    for (( i=1; i[...]

    [...]> x_tmp;
      i=`expr $i + 1`;
    done
    sort -n -k1 x_tmp > sorted_NRG
    rm x_tmp

The output file sorted_NRG contains the sorted list of the ligands, best on top. Inspect the intermolecular contacts and select a number of candidates for the following steps. Then, run MD simulations to validate the selected protein–ligand complexes. Visualization of the OpenEye docking results can be done with Vida, e.g., with the command “vida docked.oeb.gz”. One can also save the structures in the mol2 format (as one single file) to view them with Pymol (as individual models). The best poses are ranked according to the Vina scoring function, but this ranking is by no means guaranteed to reflect the true ligand affinities. For this reason, it is recommended to rescore the results with different scoring functions. Some of them have been developed specifically for Vina [107]. One of the recent ones, which we use, is RF-Score-VS v2 [108]. It can be downloaded from https://github.com/oddt/rfscorevs and can be applied to the output ligand set with the command:

    rf-score-vs --receptor rec.pdb ligs.pdbqt -o csv

The program supports multiple input/output file formats. The affinity estimates are given in units of -pKi (-pKi ≈ 0.7335·E [kcal/mol]). One can obtain a consensus score by averaging the results from different scoring functions.

3.6 Validation of Complexes by MD Simulations

1. Putative complex structures resulting from docking studies should be validated with respect to their stability. An MD simulation, even over a relatively short time (20–50 ns), can give a wealth of information about the behavior of the ligand within the binding site as well as about the nature of the intermolecular contacts. The preparation of input data for MD simulation of complex structures is similar to the procedure described above for the receptor. The difference is that a new molecule (the ligand) has to be added to the system, and we need its topology. In general, the ligand will require a different force field. In what follows, we assume that ligands are small organic molecules which can be treated by the general Amber force field (GAFF) [109]. We start with a structure file of the ligand molecule (the example below uses the mol2 format; see Note 10). Make sure the ligand is protonated. Use the antechamber utility to generate the first of the tleap input files:

    antechamber -nc 0 -rn LIG -i lig.mol2 -fi mol2 -o lig.prep \
      -fo prepc -c bcc -s 2

The meaning of the options is as follows: nc = net charge, rn = residue name, i = input file, fi = format of the input file, o = output file, fo = format of the output file (prepc or prepi), c = charge model, and s = output verbosity. When the lig.prep file is ready, we check whether all force field parameters are available:

    parmchk -i lig.prep -f prepc -o lig.frcmod

The contents of the output frcmod file should be examined, and if problems occur, they should be fixed before continuing (see Note 10). In the next step, use tleap to create topology and coordinate files for the ligand. The following script shows an example of tleap's input (annotations in parentheses are not part of the input):

    source leaprc.ff14SB                      (for basic definitions)
    source leaprc.gaff                        (for the ligand)
    set default PBradii mbondi2
    loadamberprep lig.prep                    (antechamber-generated file)
    loadamberparams lig.frcmod                (parmchk-generated file)
    check LIG                                 (there should be no error messages)
    saveamberparm LIG lig.prmtop lig.inpcrd   (save the ligand topology)
    savepdb LIG lignew.pdb                    (use this file to add ligand to the receptor)
    quit

Note that we are using the same residue name (LIG) as the one we defined for antechamber (see Note 11).

2. Prepare a receptor–ligand complex by combining protein and ligand in a single PDB file (e.g., “complex.pdb”). Copy the ligand's PDB file to the end of the receptor's PDB file (use the new PDB file provided by antechamber, called NEWPDB.PDB). Assign a unique chain identifier to the ligand's residues. Insert TER records after each molecule. Then, run tleap on the following script:

    source leaprc.ff14SB
    source leaprc.gaff
    set default PBradii mbondi2
    loadamberprep lig.prep
    loadamberparams lig.frcmod
    mol=loadpdb complex.pdb
    # add modifications, such as disulfide bonds, if any
    check mol
    # the unit should be OK, except warnings about close contacts.
    # They will be removed by subsequent energy minimizations.
    saveamberparm mol cpl.prmtop cpl.inpcrd   (save topology)
    savepdb mol cpl.pdb
    # saves newly assigned residue numbers, protons included
    quit
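The manual file manipulation described in step 2 (appending the ligand to the receptor, assigning a chain identifier, and inserting TER records) can also be scripted. The Python sketch below is a minimal illustration under stated assumptions: both inputs are standard fixed-column PDB files, the chain identifier “L” is an arbitrary choice, and the function name is hypothetical.

```python
def merge_receptor_ligand(receptor_lines, ligand_lines, ligand_chain="L"):
    """Combine receptor and ligand PDB records into one list of lines.

    Receptor ATOM/HETATM/TER records are kept as they are. Every ligand
    atom gets ligand_chain written in column 22, and a TER record closes
    each molecule.
    """
    out = [ln.rstrip("\n") for ln in receptor_lines
           if ln.startswith(("ATOM", "HETATM", "TER"))]
    if out and not out[-1].startswith("TER"):
        out.append("TER")
    for ln in ligand_lines:
        if ln.startswith(("ATOM", "HETATM")):
            ln = ln.rstrip("\n").ljust(22)
            out.append(ln[:21] + ligand_chain + ln[22:])
    out.append("TER")
    out.append("END")
    return out


if __name__ == "__main__":
    with open("receptor.pdb") as rec, open("NEWPDB.PDB") as lig:
        merged = merge_receptor_ligand(rec.readlines(), lig.readlines())
    with open("complex.pdb", "w") as fh:
        fh.write("\n".join(merged) + "\n")
```

The file names in the usage part follow the text above; check the merged file visually before passing it to tleap.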

If tleap terminates without errors, the complex structure is correct. See the contents of the output cpl.pdb file for the newly assigned residue numbers (protons included). Solvate the system and add charge-neutralizing ions if necessary, by completing the above script with the solvatebox and addions commands, as discussed in Subheading 3.2, step 1 (see Note 12). This will produce topology files for the whole system, with names such as sys.prmtop, sys.inpcrd, and sys.pdb.

3. Launch the MD simulation following the recipes in Subheadings 3.2 and 3.3. The procedures for the complex are identical to those for the receptor alone. In the two initial energy minimization steps of the procedure, the ligand should be considered as part of the protein, i.e., it should be restrained. Given that ligands are very small compared with the receptor, there will be no noticeable differences in MD execution times. The above steps should be repeated for all ligands obtained from the docking procedure.

4. The final validation of structures comes from the analysis of the MD trajectories of the studied complexes. A stable complex should be characterized by a low root mean square deviation (RMSD) value for the ligand throughout the entire trajectory. However, the RMSD criterion alone is not sufficient to draw conclusions about the stability of the complex. In spite of thermal fluctuations, the ligand should remain in the same position within the binding site, showing persistent intermolecular contacts. Information about the amplitude of atomic thermal motions can be obtained from the root mean square fluctuations (RMSF). The intermolecular contacts can be determined from the analysis of atomic proximities, which also makes it possible to observe the formation of intermolecular hydrogen bonds. The analysis makes use of the cpptraj utility, which can be run with the command “cpptraj complex.prmtop ptraj.in”, where the contents of the input script ptraj.in depend on the profile of the analysis. To obtain the values of RMSD and RMSF, the following example can be used:

    trajin complex5.mdcrd 1 999999
    strip :292-999999
    center :1-290 mass origin
    image origin center familiar
    rms first mass out RMSD-rec.txt time 10 :1-290@C,N,CA
    rms first mass out RMSD-lig.txt time 10 :291 nofit
    atomicfluct out RMSF-lig.txt :291

In the RMSD/RMSF script above, we read all frames from the input trajectory and then discard all components except the complex (in this example, the receptor has 290 residues and the ligand has only one residue, #291). We superpose each frame on the first one using the mass-weighted receptor backbone and calculate the RMSD for the receptor. Then, using the previous superposition, we compute the ligand's RMSD from the displacement of all of its atoms. The time step (10 ps) sets the units on the X-axis of the generated plot. This graph tells us whether the center of mass of the ligand remains in the same position, but not whether the pose is stable. If the ligand moves within the binding site, its atoms will show high RMSF values. The last line of the above script produces a file containing the RMSF of atomic fluctuations. Finally, there remains the question of intermolecular contacts. These can be obtained from the following script:

    trajin complex5.mdcrd 3001 10000 10
    #-- Donors from standard amino acids
    donor mask :GLN@OE1
    donor mask :GLN@NE2
    donor mask :ASN@OD1
    donor mask :ASN@ND2
    donor mask :TYR@OH
    donor mask :ASP@OD1
    donor mask :ASP@OD2
    donor mask :GLU@OE1
    donor mask :GLU@OE2
    donor mask :SER@OG
    donor mask :THR@OG1
    donor mask :HIS@ND1
    donor mask :HIE@ND1
    donor mask :HID@NE2
    #-- Acceptors from standard amino acids
    acceptor mask :ASN@ND2 :ASN@HD21
    acceptor mask :ASN@ND2 :ASN@HD22
    acceptor mask :TYR@OH :TYR@HH
    acceptor mask :GLN@NE2 :GLN@HE21
    acceptor mask :GLN@NE2 :GLN@HE22
    acceptor mask :TRP@NE1 :TRP@HE1
    acceptor mask :LYS@NZ :LYS@HZ1
    acceptor mask :LYS@NZ :LYS@HZ2
    acceptor mask :LYS@NZ :LYS@HZ3
    acceptor mask :SER@OG :SER@HG
    acceptor mask :THR@OG1 :THR@HG1
    acceptor mask :ARG@NH2 :ARG@HH21
    acceptor mask :ARG@NH2 :ARG@HH22
    acceptor mask :ARG@NH1 :ARG@HH11
    acceptor mask :ARG@NH1 :ARG@HH12
    acceptor mask :ARG@NE :ARG@HE
    acceptor mask :HIS@NE2 :HIS@HE2
    acceptor mask :HIE@NE2 :HIE@HE2
    acceptor mask :HID@ND1 :HID@HD1
    acceptor mask :HIP@ND1,NE2 :HIP@HE2,HD1
    #-- Backbone donors and acceptors for this particular molecule
    # N-H for prolines do not exist so are not in the mask
    # in this example res181 is supposed to be PRO and excluded
    donor mask @O
    acceptor mask :2-180,182-290@N :2-290@H
    # Terminal residues have different atom names
    donor mask @OXT
    acceptor mask :1@N :1@H1
    acceptor mask :1@N :1@H2
    acceptor mask :1@N :1@H3
    #
    hbond print .05 series hbt time 10 distance 3.5 angle 120.0 \
      out hbond.dat solventdonor WAT O solventacceptor WAT O H1 \
      solventacceptor WAT O H2
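The two-column files written by the rms commands earlier in this step (RMSD-rec.txt and RMSD-lig.txt) can be condensed into a few numbers before plotting. The Python sketch below is an illustration only: it assumes a time column followed by an RMSD column with comment lines starting with “#”, the 2 Å threshold is an arbitrary example value rather than a criterion from this protocol, and the function names are hypothetical.

```python
import statistics


def read_xy(text):
    """Parse two-column data; blank lines and lines starting with '#' are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        x, y = line.split()[:2]
        rows.append((float(x), float(y)))
    return rows


def rmsd_summary(rows, threshold=2.0):
    """Summarize an RMSD time series (values in angstroms).

    drift is the mean of the second half minus the mean of the first
    half; a value near zero suggests the series has levelled off.
    """
    values = [y for _, y in rows]
    half = len(values) // 2
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
        "drift": statistics.fmean(values[half:]) - statistics.fmean(values[:half]),
        "fraction_below": sum(v <= threshold for v in values) / len(values),
    }


if __name__ == "__main__":
    with open("RMSD-lig.txt") as fh:
        print(rmsd_summary(read_xy(fh.read())))
```

As stated in the text, a low RMSD alone does not establish stability; these numbers are meant to complement the RMSF and contact analyses.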

A generic version of the hydrogen-bond analysis script above is downloadable from the address http://ambermd.org/tutorials/basic/tutorial3/files/analyse_hbond.ptraj. Adjust it according to your needs. The output file (“hbond.dat”) contains information about the formation and breaking of hydrogen bonds throughout the trajectory. The total combined information should be used to assess the stability of the studied complex.

3.7 Estimation of Ligand Affinity

One of the most attractive features offered by the analysis of an MD trajectory is the estimation of the protein–ligand interaction energy, which in turn should be linked to ligand affinity. Unfortunately, this is a very complex issue that is not yet sufficiently resolved in practice. Theoretically, the intermolecular interaction energy can be calculated as the difference between the energy of the complex and the sum of the energies of its individual components. The Gibbs free energy of the system is the sum of an enthalpic and an entropic term. The force field-based energy is an approximation to the enthalpic term of the free energy expression [110–112]. The entropic term is often as important as the enthalpic one, but there are enormous difficulties in computing it [113], and so in practice it is usually neglected. This simplification nevertheless allows comparison of similar ligands, for which the entropic terms may not vary too much. In spite of these difficulties, several methods are frequently used. The most popular approach to evaluating the intermolecular interaction energy is molecular mechanics (MM) combined with the Poisson–Boltzmann (PB) or generalized Born (GB) and surface area (SA) continuum solvation methods (MM-PBSA and MM-GBSA) [114]. Other more sophisticated methods have also been developed, such as the interactive Linear Interaction Energy (iLIE) approach [115–118], but they require calibration on a known set of ligands with experimentally determined affinities, which constitutes a major limitation. Moreover, although they may run faster, they lack the accuracy of the PBSA/GBSA methods [119]. In what follows, we focus on the most widely used MM-PBSA and MM-GBSA methods, as available in the Amber software suite. These methods are based on the implicit solvent approach and need as input the separate trajectories of the unsolvated complex, of the protein, and of the ligand. In practice, this information is extracted from the MD trajectory of the solvated protein–ligand complex. Consequently, the program requires the corresponding topologies, which have already been generated with tleap in intermediate steps of the procedure (files named sys.prmtop, cpl.prmtop, rec.prmtop, and lig.prmtop, corresponding to the solvated system, the unsolvated complex, the receptor, and the ligand, respectively).

1. Use the MMPBSA.py.MPI [120] program to profit from multicore systems. The script below demonstrates its use in calculating the protein–ligand intermolecular energy with both the MM-PBSA and the MM-GBSA methods:

    #!/bin/sh
    echo -n "Input script name: [mmpbsa_py.in] "
    read INSCRIPT
    if [ -z "$INSCRIPT" ]
    then
      INSCRIPT="mmpbsa_py.in"
    fi
    echo -n "Solvated prmtop file: [sys.prmtop] "
    read SYSFILE
    if [ -z "$SYSFILE" ]
    then
      SYSFILE="sys.prmtop"
    fi
    echo -n "Complex prmtop file: [cpl.prmtop] "
    read CPLFILE
    if [ -z "$CPLFILE" ]
    then
      CPLFILE="cpl.prmtop"
    fi
    echo -n "Receptor prmtop file: [rec.prmtop] "
    read RECFILE
    if [ -z "$RECFILE" ]
    then
      RECFILE="rec.prmtop"
    fi
    echo -n "Ligand prmtop file: [lig.prmtop] "
    read LIGFILE
    if [ -z "$LIGFILE" ]
    then
      LIGFILE="lig.prmtop"
    fi
    echo -n "Trajectory file (name with ext.): "
    read MDCRD
    if [ -z "$MDCRD" ]
    then
      echo "*** Error: Trajectory files must be specified explicitly."
      exit
    fi
    echo -n "Number of CPUs to use: [4] "
    read NPROC
    if [ -z "$NPROC" ]
    then
      NPROC=4
    fi
    echo "Launching MMPBSA.py.MPI on $NPROC CPUs."
    echo -n "JOB STARTED on "
    date
    mpiexec -np $NPROC $AMBERHOME/bin/MMPBSA.py.MPI -O -i $INSCRIPT \
      -o out_binding.dat \
      -do out_decomp.dat \
      -sp $SYSFILE \
      -cp $CPLFILE \
      -rp $RECFILE \
      -lp $LIGFILE \
      -y $MDCRD > out.log
    echo -n "JOB FINISHED on "
    date
    exit 0

Note that there are two output files with predefined names: out_binding.dat and out_decomp.dat. The first one contains information on binding energies and the second one lists energy decomposition details. An example of the input script for per-residue energy decomposition (mmpbsa_py.in) is given below:

    MMPBSA input file for running per-residue decomposition
    &general
      startframe=3001, endframe=5000, interval=2,
    /
    &gb
      # for igb=5, use "set default PBradii mbondi2"
      # in scripts producing prmtop files
      # for igb=7, use "set default PBradii bondi" (not for nucleic acids)
      igb=5, saltcon=0.010
    /
    &pb
      istrng=0.010, indi=4,
    /
    &decomp
      idecomp=1, print_res='1-303', csv_format=0,
      dec_verbose=2,
    /

The &pb and &gb sections refer to the MM-PBSA and the MM-GBSA methods, respectively (see Note 13). As usual, adjust the relevant variables as needed.

2. In order to perform pairwise energy decomposition, use the above script with the following minor modifications: set idecomp to 3 and dec_verbose to 0 (otherwise, the output will be too voluminous). In both cases, the output can be sorted to focus on residues important for the intermolecular interactions.

3. Sometimes, the tool described above reports errors which prevent successful energy computations. Very often, the input prmtop files cause the problems, and the error sources may be traced to inconsistent structures. However, if the input data are correct and the errors persist, the generation of input prmtop files may be facilitated by the tool ante-MMPBSA.py. If the errors still persist, one can use pymdpbsa, which in our experience works best under most circumstances. However, it is a serial program, which executes calculations sequentially. In order to improve its performance on multicore machines, we wrote several scripts which accelerate this task by moving the program's internal loop over trajectory frames to an external loop, executed simultaneously by the available cores. Here is a simple example of its use to obtain protein–ligand interaction energies. We start with the preparation of the input data by feeding cpptraj with the following script:

    trajin ../sys5.mdcrd 1001 2000 10
    # remove solvent and ions
    strip :171-999999
    # remove trans & rot
    center :1-167 mass origin
    image origin center familiar
    # superpose on 1st frame's 1st mol
    rms first mass :1-167@C,N,CA
    trajout PDB/frame.pdb pdb multi

In the above, we extract every 10th frame from the second half of the target MD trajectory, strip it of ions and water, and write out individual frames in the folder PDB (which must be created before running this script). The result is a series of 100 frames in the PDB format. Next, we call the script which executes energy calculations on all frames distributed among CPU cores (e.g., to launch the script on 10 cores, Energy_calc-Par PDB 10):

    #!/usr/bin/bash
    FPATH=$1
    NCORES=$2
    FrameList="FrameList"
    #--- Define as appropriate -----
    Ctop="cpl.prmtop"
    Rtop="rec.prmtop"
    Ltop="lig.prmtop"
    Lres=170
    Solve=1
    #-------------------------------
    ls -1 $FPATH/*pdb* > $FrameList
    # Total number of frames
    Nframes=`cat $FrameList | wc -l`
    # Number of frames per core
    Nfrpc=$(( Nframes / NCORES ))
    Nextra=$(( Nframes % NCORES ))
    for (( i=0; i[...] ${FrameList}_$i
    done
    for (( j=0; j[...]> ${FrameList}_$j
    done
    for (( j=0; j < $NCORES; j++ ))
    do
      i=$((j+1))
      flist=`echo ${FrameList}_$j`
      taskset -c $j Calc_energy $flist $Ctop $Rtop $Ltop $Lres $Solve $i &
    done
    wait
    rm -fr *.tmpdir *.nrg ${FrameList}*
    exit

In the simple script above, no checks are made to ensure the availability of the cores. If a more sophisticated version is desired, consult the other scripts given in the preceding sections. For more details, see Note 13. The above script launches calculations via another script, called Calc_energy:

    #!/bin/sh
    # Calculates intermolecular energy with pymdpbsa.
    # We launch it on single frames and do statistics by ourselves.
    Flist=$1
    Ctop=$2
    Rtop=$3
    Ltop=$4
    Lres=$5
    Solve=$6
    Ind=$7
    j=1
    cat $Flist | while read frame; do
      pymdpbsa --proj=NRG${Ind}_$j --traj=$frame --cprm=$Ctop --lprm=$Ltop \
        --rprm=$Rtop --lig=$Lres --start=1 --stop=1 --step=1 --solv=$Solve;
      j=$((j+1));
    done

This produces a set of NRG*.sum files, which contain the results of energy calculations for individual frames. In order to extract the information on the average interaction energy and its fluctuations, we wrote a short in-house program which reads these files and does the computation. It is available in the script archive (see the concluding remarks in Subheading 3.8). Here is an example of the output produced by this program, listing different types of energies, their averages, and standard deviations:

    ================================================
    Results of calculations for 100 PDB files.
    ================================================
    -----Ligand Energies----------------------------
    Etot  = -36.297 ( 5.384)
    Ebat  = 67.023 ( 5.153)
    Evdw  = 11.650 ( 2.287)
    Ecoul = -100.39 ( 3.690)
    EGB   = -19.468 ( 2.707)
    Esasa = 4.8832 ( 0.088)
    -----Receptor Energies--------------------------
    Etot  = -6049.3 ( 46.542)
    Ebat  = 4068.1 ( 38.536)
    Evdw  = -698.67 ( 24.390)
    Ecoul = -6865.0 ( 111.633)
    EGB   = -2615.4 ( 95.408)
    Esasa = 61.654 ( 0.972)
    -----Complex Energies---------------------------
    Etot  = -6104.7 ( 48.651)
    Ebat  = 4135.1 ( 38.599)
    Evdw  = -709.65 ( 26.824)
    Ecoul = -6974.0 ( 113.547)
    EGB   = -2619.1 ( 98.532)
    Esasa = 62.871 ( 1.534)
    -----Interaction Energy Components--------------
    Etot  = -19.075 ( 6.703)
    Ebat  = 0.10000E-03 ( 0.005)
    Evdw  = -22.634 ( 8.086)
    Ecoul = -8.5607 ( 6.260)
    EGB   = 15.785 ( 6.477)
    Esasa = -3.6661 ( 1.055)
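The interaction energy components in this listing follow the definition given at the start of this subheading: the energy of the complex minus the sum of the receptor and ligand energies. The Python sketch below checks this relation against the printed averages and shows how a mean and a standard deviation could be obtained from per-frame values. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the statistic used by the in-house program is not specified here.

```python
import statistics


def interaction_energy(complex_e, receptor_e, ligand_e):
    """E_int = E_complex - (E_receptor + E_ligand), all in kcal/mol."""
    return complex_e - (receptor_e + ligand_e)


def mean_and_std(values):
    """Average and sample standard deviation of per-frame energies (assumed statistic)."""
    return statistics.fmean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Averages copied from the Etot, Evdw and Ecoul lines of the listing
    # (complex, receptor, ligand).
    printed = {
        "Etot": (-6104.7, -6049.3, -36.297),
        "Evdw": (-709.65, -698.67, 11.650),
        "Ecoul": (-6974.0, -6865.0, -100.39),
    }
    for name, (cpl, rec, lig) in printed.items():
        print(name, round(interaction_energy(cpl, rec, lig), 3))
```

The differences obtained this way (about -19.10, -22.63, and -8.61 kcal/mol) agree with the printed interaction components to within the rounding of the listed averages.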

The in-house program also creates raw data (X,Y points) which can be used to create Etot energy graphs. The data are available in the output NRG*.dat files.

4. There are other methods to calculate interaction energies, which we do not explore in this chapter. You can consult the relevant tutorials at the address https://ambermd.org/tutorials/FreeEnergy.php or reference [121], and also see the review of the importance of different parameters in free energy calculations [122].

3.8 Concluding Remarks

In conclusion, docking and MD are fully complementary. Docking is fast and inexpensive, whereas MD takes full system flexibility into account and gives more reliable and “realistic” interaction and affinity information. MD is not an easy task; its protocols are constantly improving thanks to better algorithms and force fields and, in parallel, to better computer performance, which today makes it possible to simulate a protein on the microsecond timescale and in complex membrane-like environments. Docking with simultaneous MD calculation would be an appropriate solution, with all steps included in one pass, but it would currently take too long to execute and would be difficult to interpret if the system got trapped in local minima. Implementing MD protocols within a virtual ligand screening process is necessary to increase the success rate of hit compound discovery and to enter a well-known “hit-to-lead” strategy for obtaining molecules with higher affinity and specificity against medically relevant biological targets. The choice of methods presented in this chapter has been dictated by our own experience, but there are numerous alternative approaches, which the readers are encouraged to explore [108, 121–133]. The archive containing the scripts presented in this chapter (compressed in several different formats) is available for download from the following URL: https://mycore.core-cloud.net/index.php/s/ZtXZ4uVU86u1IM6.

4 Notes

1. PDB files from NMR studies usually contain 20 superposed structures. Pick the first one, as it should have the fewest constraint violations. The choice is not critical because the MD will relax the molecule and drive it to equilibrium.

2. The loop modeling as described above works up to a loop length of ~30 residues. For longer loops, a different strategy has to be adopted. One can model loops incrementally, a fragment at a time. Another possibility is to run an MD simulation with the loop as a separate chain, with distance constraints on the loop termini to obtain spatial proximity to the receptor, and then to run the MD again to equilibrate the loop incorporated in the structure.

3. If you want to simulate deprotonated HIS, see Ref. [131].

4. tleap renumbers residues beginning from 1. Subsequent residue referencing (e.g., for the “bond” command) must take this into account.

5. If multiple GPU cards are supposed to execute the MD protocol, the first two stages (i.e., the two consecutive minimizations) have to be performed on a single card. At the time of writing, this part of the code is not yet parallelized for GPUs. Multiple GPU cards can work on the MD protocol beginning with the third stage, i.e., from the short MD run with sample heating.

6. In order to use GPUs for MD, install the appropriate version of NVidia's CUDA library on your system, and then compile the CUDA version of the software (e.g., Amber's pmemd.cuda). This will create a binary compatible with your hardware. There is a problem in NVidia's software with the detection of GPU cards and the assignment of unique identifiers. The order in which GPU cards are detected differs depending on whether we launch a command from a terminal (nvidia-smi), from a GUI interface (nvidia-settings), or via the CUDA library (deviceQuery). In each case, the identifiers of the GPU cards may be different, which is confusing. In practice, however, only the output from deviceQuery is valid for the launch of the GPU software. The number of GPU cards that can be used simultaneously depends on the computer's architecture. The workstations we use have two CPUs and four GPUs each, with each CPU controlling two of the GPU cards. In order to launch a computation on all four GPUs, NVidia's P2P (Peer-to-Peer) protocol would have to pass between the two CPUs. Since the latter use Intel's QPI (QuickPath Interconnect) to communicate, P2P does not work. Hence, one can only launch a calculation on either one or two GPU cards. To use the machine fully, one can launch two independent simulations, each on two GPU cards controlled by the same CPU. This restriction is not necessarily present on computers with different motherboard architectures.

7. The kclust program handles up to 50,000 files. In practice, this is not a significant limitation. However, its output contains an error: the PDB files written by kclust show residue numbers in Columns 24–27 instead of Columns 23–26. Pymol displays them correctly, but VMD does not. This can be corrected by shifting the residue numbers by one column to the left.

8. The SDF files and libraries available for download are predominantly in the V2000 format. Some are in the newer V3000 format, which offers greater flexibility, but software such as cxcalc or obabel may crash when it encounters what it considers an invalid SDF entry. Check the format before using the files and adapt it to your software, if necessary. The V2000 format description can be found on this page, https://www.nonlinear.com/progenesis/sdf-studio/v0.9/faq/sdf-file-format-guidance.aspx, and the differences between the two formats can be found here, https://depth-first.com/articles/2021/11/17/ten-reasons-to-adopt-the-v3000-molfile-format/.

9. Currently, there are over 30 scoring functions, which represent diverse approaches to the calculation of ligand affinity [42]. It is a good idea to rescore your results with several different functions and accept ligands as hits only if there is a consensus, i.e., when ligands score well with most of the methods. A review of fast scoring methods can be found in [133].

10. When preparing ligands with the antechamber utility, the program sometimes fails on PDB files. In this case, try the mol2 format. However, it is likely that these problems stem from an erroneous or ambiguous definition of the ligand's structure. If the initial structure is not optimized, missing CONECT records may be the reason for the failure.

11. When preparing a PDB file of a complex with several ligands, make sure that each ligand has a different chain identifier and also that you use different residue numbers for different ligands. Otherwise, Amber complains about split residues.

12. Adding charge-neutralizing ions to the system can be done using the addions command. Another version exists: addions2. It takes longer to execute, but it uses a more sophisticated algorithm for the placement of ions within the periodic box.


Fig. 3 Comparison of the Etot, protein–ligand interaction energy component, as a function of the frame number, as calculated by the GB method (blue curves) and the PB method (red curves). The calculations were performed for several complexes (a–f) of the NRas protein with different ligands (data not published). In most cases, the PB method gives either comparable or slightly higher energy values than the GB approach, although in one case (c) we observed the opposite effect

13. The line “Solve=1” defines the Generalized Born (GB) calculations. To switch to the Poisson–Boltzmann (PB) method, change this line to “Solve=3”. We compared the results of energy calculations with these two methods on some of the complexes studied in our laboratory. In general, the total energy components Etot calculated by these two approaches were close to each other, with the PB method showing a tendency to have slightly higher values than the GB approach. Figure 3 shows several examples from our studies.
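The column shift described in Note 7 could be scripted along the following lines. This is only a minimal sketch, not part of kclust or of the authors' protocol: the function names are hypothetical, and it assumes that the misplaced residue numbers are right-justified in columns 24–27, that column 23 is blank, and that the files contain no insertion codes. Under these assumptions, a digit in column 27 (normally the insertion-code column) marks a line that needs shifting, so correctly formatted lines are left untouched.

```python
# Hypothetical helper for Note 7: move residue numbers from columns 24-27
# back to the standard PDB columns 23-26 (1-based column numbering).
COORD_RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def shift_resseq_left(line):
    """Return the line with its residue number shifted one column left.

    Assumption: a digit in column 27 means the residue number was written
    one column too far to the right. Other lines are returned unchanged.
    """
    if not line.startswith(COORD_RECORDS):
        return line
    body = line.rstrip("\r\n")
    ending = line[len(body):]          # preserve the original line ending
    padded = body.ljust(27)
    if not padded[26].isdigit() or padded[22] != " ":
        return line                    # already aligned, or not safe to shift
    # Drop the blank in column 23, keep columns 24-27, and re-insert a blank
    # in column 27 so that every later column keeps its position.
    fixed = padded[:22] + padded[23:27] + " " + padded[27:]
    return fixed + ending


def fix_pdb_text(text):
    """Apply the shift to every line of a PDB file held in memory."""
    return "".join(shift_resseq_left(l) for l in text.splitlines(keepends=True))
```

A file could then be repaired by reading it into a string, passing it through `fix_pdb_text`, and writing the result to a new file, so that the original kclust output is kept for comparison.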


Grégory Menchon et al.

Acknowledgments

We acknowledge financial support from PICT, Genotoul platform of Toulouse, CNRS; Université de Toulouse, UPS; European structural funds; and the Midi-Pyrénées region, CNRS. Part of this work was performed using HPC resources from CALMIP (Grant 2022-P22015).

References

1. Tarcsay A, Paragi G, Vass M, Jojart B, Bogar F, Keseru GM (2013) The impact of molecular dynamics sampling on the performance of virtual screening against GPCRs. J Chem Inf Model 53:2990–2999
2. Barakat KH, Jordheim LP, Perez-Pineiro R, Wishart D, Dumontet C, Tuszynski JA (2012) Virtual screening and biological evaluation of inhibitors targeting the XPA-ERCC1 interaction. PLoS One 7:e51329
3. De Vivo M, Masetti M, Bottegoni G, Cavalli A (2016) Role of molecular dynamics and related methods in drug discovery. J Med Chem 59:4035–4061
4. Durrant JD, McCammon JA (2011) Molecular dynamics simulations and drug discovery. BMC Biol 9:71–79
5. Galeazzi R (2009) Molecular dynamics as a tool in rational drug design: current status and some major applications. Curr Comput Aided Drug Des 5:225–240
6. Hospital A, Goni JR, Orozco M, Gelpi JL (2015) Molecular dynamics simulations: advances and applications. Adv Appl Bioinforma Chem 8:37–47
7. Jiang L, Zhang X, Chen X, He Y, Qiao L, Zhang Y, Li G, Xiang Y (2015) Virtual screening and molecular dynamics study of potential negative allosteric modulators of mGluR1 from Chinese herbs. Molecules 20:12769–12786
8. Kundu A, Dutta A, Biswas P, Das AK, Ghosh AK (2015) Functional insights from molecular modeling, docking, and dynamics study of a cypoviral RNA dependent RNA polymerase. J Mol Graph Model 61:160–174
9. Mirza SB, Salmas RE, Fatmi MQ, Durdagi S (2016) Virtual screening of eighteen million compounds against dengue virus: combined molecular docking and molecular dynamics simulations study. J Mol Graph Model 66:99–107

10. Moroy G, Sperandio O, Rielland S, Khemka S, Druart K, Goyal D, Perahia D, Miteva MA (2015) Sampling of conformational ensemble for virtual screening using molecular dynamics simulations and normal mode analysis. Future Med Chem 7:2317–2331
11. Naresh KN, Sreekumar A, Rajan SS (2015) Structural insights into the interaction between molluscan hemocyanins and phenolic substrates: an in silico study using docking and molecular dynamics. J Mol Graph Model 61:272–280
12. Nichols SE, Baron R, Ivetac A, McCammon JA (2011) Predictive power of molecular dynamics receptor structures in virtual screening. J Chem Inf Model 51:1439–1446
13. Nichols SE, Baron R, McCammon JA (2012) On the use of molecular dynamics receptor conformations for virtual screening. Methods Mol Biol 819:93–103
14. Okimoto N, Futatsugi N, Fuji H, Suenaga A, Morimoto G, Yanai R, Ohno Y, Narumi T, Taiji M (2009) High-performance drug discovery: computational screening by combining docking and molecular dynamics simulations. PLoS Comput Biol 5:e1000528
15. Rodriguez-Bussey IG, Doshi U, Hamelberg D (2016) Enhanced molecular dynamics sampling of drug target conformations. Biopolymers 105:35–42
16. Sliwoski G, Kothiwale S, Meiler J, Lowe EW Jr (2014) Computational methods in drug discovery. Pharmacol Rev 66:334–395
17. Vazquez J, Lopez M, Gibert E, Herrero E, Luque FJ (2020) Merging ligand-based and structure-based methods in drug discovery: an overview of combined virtual screening approaches. Molecules 25:4723
18. Bartesaghi A, Merk A, Banerjee S, Matthies D, Wu X, Milne JLS, Subramaniam S (2015) 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Science 348:1147–1151

19. Nakane T, Kotecha A, Sente A et al (2020) Single-particle cryo-EM at atomic resolution. Nature 587:152
20. Peplow M (2020) Cryo-electron microscopy reaches resolution milestone. ACS Cent Sci 6:1274–1277
21. Yip KM, Fischer N, Paknia E, Chari A, Stark H (2020) Atomic-resolution protein structure determination by cryo-EM. Nature 587:157
22. Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589
23. Lin Z, Akin H, Rao R et al (2022) Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv. https://doi.org/10.1101/2022.07.20.500902
24. Gorgulla C (2022) Recent developments in structure-based virtual screening approaches. ArXiv221103208v1 Q-BioBM. https://doi.org/10.48550/arXiv.2211.03208
25. Beroza P, Crawford JJ, Ganichkin O, Gendelev L, Harris SF, Klein R, Miu A, Steinbacher S, Klingler F-M, Lemmen C (2022) Chemical space docking enables large-scale structure-based virtual screening to discover ROCK1 kinase inhibitors. Nat Commun 13:6447
26. Sadybekov AA, Sadybekov AV, Liu Y et al (2022) Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601:452
27. Hughes JP, Rees S, Kalindjian SB, Philpott KL (2011) Principles of early drug discovery. Br J Pharmacol 162:1239–1249
28. Kuenemann MA, Sperandio O, Labbe CM, Lagorce D, Miteva MA, Villoutreix BO (2015) In silico design of low molecular weight protein-protein interaction inhibitors: overall concept and recent advances. Prog Biophys Mol Biol 119:20–32
29. Ramirez D (2016) Computational methods applied to rational drug design. Open Med Chem J 10:7–20
30. Rognan D (2015) Rational design of protein-protein interaction inhibitors. Med Chem Commun 6:51–60
31. Murugan NA, Podobas A, Gadioli D, Vitali E, Palermo G, Markidis S (2022) A review on parallel virtual screening softwares for high-performance computers. Pharmaceuticals 15:63
32. Gioia D, Bertazzo M, Recanatini M, Masetti M, Cavalli A (2017) Dynamic docking: a paradigm shift in computational drug discovery. Molecules 22:2029
33. B-Rao C, Subramanian J, Sharma SD (2009) Managing protein flexibility in docking and its applications. Drug Discov Today 14:394–400
34. Cavasotto CN, Abagyan RA (2004) Protein flexibility in ligand docking and virtual screening to protein kinases. J Mol Biol 337:209–225
35. Durrant JD, McCammon JA (2010) Computer-aided drug-discovery techniques that account for receptor flexibility. Curr Opin Pharmacol 10:770–774
36. Totrov M, Abagyan R (2008) Flexible ligand docking to multiple receptor conformations: a practical alternative. Curr Opin Struct Biol 18:178–184
37. Armen RS, Chen J, Brooks CL III (2009) An evaluation of explicit receptor flexibility in molecular docking using molecular dynamics and torsion angle molecular dynamics. J Chem Theory Comput 5:2909–2923
38. Feher M (2006) Consensus scoring for protein–ligand interactions. Drug Discov Today 11:421–428
39. Politi R, Convertino M, Popov K, Dokholyan NV, Tropsha A (2016) Docking and scoring with target-specific pose classifier succeeds in native-like pose identification but not binding affinity prediction in the CSAR 2014 benchmark exercise. J Chem Inf Model 56:1032–1041
40. Quiroga R, Villarreal MA (2016) Vinardo: a scoring function based on AutoDock Vina improves scoring, docking, and virtual screening. PLoS One 11:e0155183
41. Wang Z, Sun H, Yao X, Li D, Xu L, Li Y, Tian S, Hou T (2016) Comprehensive evaluation of ten docking programs on a diverse set of protein–ligand complexes: the prediction accuracy of sampling power and scoring power. Phys Chem Chem Phys 18:12964–12975
42. Seifert MHJ (2009) Targeted scoring functions for virtual screening. Drug Discov Today 14:562–569
43. Sinha S, Tam B, Wang SM (2022) Applications of molecular dynamics simulation in protein study. Membranes 12:844
44. Leach AR (2001) Molecular modelling: principles and applications, 2nd edn. Pearson, Dorchester, Dorset
45. Ryckaert J-P, Ciccotti G, Berendsen HJC (1977) Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J Comput Phys 23:327–341

46. Schlick T (2002) Molecular modeling and simulation: an interdisciplinary guide. Springer, New York
47. Stanley N, De Fabritiis G (2015) High throughput molecular dynamics for drug discovery. In Silico Pharmacol 3:3–6
48. Shakib SM, Naz A, Hussien MA (2021) Significance of MD simulation in pharmaceutical sciences: a review. Am J Biomed Sci Res 13:449–455
49. Lin JH, Perryman AL, Schames JR, McCammon JA (2003) The relaxed complex method: accommodating receptor flexibility for drug design with an improved scoring scheme. Biopolymers 68:47–62
50. Sinko W, Lindert S, McCammon JA (2013) Accounting for receptor flexibility and enhanced sampling methods in computer-aided drug design. Chem Biol Drug Des 81:41–49
51. Cala O, Remy M-H, Guillet V, Merdes A, Mourey L, Milon A, Czaplicki G (2013) Virtual and biophysical screening targeting the gamma-tubulin complex – a new target for the inhibition of microtubule nucleation. PLoS One 8:e63908
52. Hollingsworth SA, Dror RO (2018) Molecular dynamics simulation for all. Neuron 99:1129–1143
53. Wych DC, Aoto PC, Vu L, Wolff AM, Mobley DL, Fraser JS, Taylor SS, Wall ME (2022) Molecular-dynamics simulation methods for macromolecular crystallography. bioRxiv. https://doi.org/10.1101/2022.04.04.486986
54. Adelusi TI, Oyedele A-QK, Boyenle ID et al (2022) Molecular modeling in drug discovery. Inform Med Unlocked 29:100880
55. Alonso H, Bliznyuk AA, Gready JE (2006) Combining docking and molecular dynamic simulations in drug design. Med Res Rev 26:531–568
56. Mori T, Miyashita N, Im W, Feig M, Sugita Y (2016) Molecular dynamics simulations of biological membranes and membrane proteins using enhanced conformational sampling algorithms. Biochim Biophys Acta 1858:1635–1651
57. Salo-Ahen OMH, Alanko I, Bhadane R et al (2021) Molecular dynamics simulations in drug discovery and pharmaceutical development. Processes 9:71
58. Götz AW, Williamson MJ, Xu D, Poole D, Le Grand S, Walker RC (2012) Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. Generalized Born. J Chem Theory Comput 8:1542–1555

59. Salomon-Ferrer R, Götz AW, Poole D, Le Grand S, Walker RC (2013) Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J Chem Theory Comput 9:3878–3888
60. Yamashita T (2018) Toward rational antibody design: recent advancements in molecular dynamics simulations. Int Immunol 30:133–140
61. Huang Y, Li Z, Hong Q et al (2022) A stepwise docking molecular dynamics approach for simulating antibody recognition with substantial conformational changes. Comput Struct Biotechnol J 20:710–720
62. Bekker G-J, Fukuda I, Higo J, Kamiya N (2020) Mutual population-shift driven antibody-peptide binding elucidated by molecular dynamics simulations. Sci Rep 10:1406
63. Albano JMR, de Paula E, Pickholz M (2018) Molecular dynamics simulations to study drug delivery systems. In: Molecular dynamics. IntechOpen, London, pp 73–90. https://doi.org/10.5772/intechopen.75748
64. Spencer-Smith R, O’Bryan JP (2019) Direct inhibition of RAS: quest for the holy grail? Semin Cancer Biol 54:138–148
65. Zou Y, Ewalt J, Ng H-L (2019) Recent insights from molecular dynamics simulations for G protein-coupled receptor drug discovery. Int J Mol Sci 20:4237
66. Ruan H, Yu C, Niu X et al (2021) Computational strategy for intrinsically disordered protein ligand design leads to the discovery of p53 transactivation domain I binding compounds that activate the p53 pathway. Chem Sci 12:3004
67. Sullivan SS, Weinzierl ROJ (2020) Optimization of molecular dynamics simulations of c-MYC1-88 – an intrinsically disordered system. Life 10:109
68. Rieloff E, Skepö M (2021) Molecular dynamics simulations of phosphorylated intrinsically disordered proteins: a force field comparison. Int J Mol Sci 22:10174
69. Liang S, Liu X, Zhang S, Li M, Zhang Q, Chen J (2022) Binding mechanism of inhibitors to SARS-CoV-2 main protease deciphered by multiple replica molecular dynamics simulations. Phys Chem Chem Phys 24:1743
70. Elkaeed EB, Youssef FS, Eissa IH, Elkady H, Alsfouk AA, Ashour ML, El Hassab MA, Abou-Seri SM, Metwaly AM (2022) Multistep in silico discovery of natural drugs against COVID-19 targeting main protease. Int J Mol Sci 23:6912
71. Sanachai K, Mahalapbutr P, Lee VS, Rungrotmongkol T, Hannongbua S (2021) In silico elucidation of potent inhibitors and rational drug design against SARS-CoV-2 papain-like protease. J Phys Chem B 125:13644–13656
72. Alzain AA, Elbadwi FA, Alsamani FO (2022) Discovery of novel TMPRSS2 inhibitors for COVID-19 using in silico fragment-based drug design, molecular docking, molecular dynamics, and quantum mechanics studies. Inform Med Unlocked 29:100870
73. Al-Karmalawy AA, Dahab MA, Metwaly AM, Elhady SS, Elkaeed EB, Eissa IH, Darwish KM (2021) Molecular docking and dynamics simulation revealed the potential inhibitory activity of ACEIs against SARS-CoV-2 targeting the hACE2 receptor. Front Chem 9:661230
74. Lazniewski M, Dermawan D, Hidayat S, Muchtaridi M, Dawson WK, Plewczynski D (2022) Drug repurposing for identification of potential spike inhibitors for SARS-CoV-2 using molecular docking and molecular dynamics simulations. Methods 203:498–510
75. de Souza AS, de Freitas Amorim VM, de Souza RF, Guzzo CR (2022) Molecular dynamics simulations of the Spike trimeric ectodomain of the SARS-CoV-2 Omicron variant: structural relationships with infectivity, evasion to immune system and transmissibility. J Biomol Struct Dyn:1–18
76. Arthur EJ, Brooks CL III (2016) Efficient implementation of constant pH molecular dynamics on modern graphics processors. J Comput Chem 37:2171–2180
77. Ge H, Wang Y, Li C et al (2013) Molecular dynamics-based virtual screening: accelerating the drug discovery process by high-performance computing. J Chem Inf Model 53:2757–2764
78. Iakovou G, Hayward S, Laycock SD (2015) Adaptive GPU-accelerated force calculation for interactive rigid molecular docking using haptics. J Mol Graph Model 61:1–12
79. Kazachenko S, Giovinazzo M, Hall KW, Cann NM (2015) Algorithms for GPU-based molecular dynamics simulations of complex fluids: applications to water, mixtures, and liquid crystals. J Comput Chem 36:1787–1804
80. Kutzner C, Pall S, Fechner M, Esztermann A, de Groot BL, Grubmüller H (2015) Best bang for your buck: GPU nodes for GROMACS biomolecular simulations. J Comput Chem 36:1990–2008
81. Qi R, Wei G, Ma B, Nussinov R (2018) Replica exchange molecular dynamics: a practical application protocol with solutions to common problems and a peptide aggregation and self-assembly example. Methods Mol Biol 1777:101–119
82. Pawnikar S, Bhattarai A, Wang J, Miao Y (2022) Binding analysis using accelerated molecular dynamics simulations and future perspectives. Adv Appl Bioinform Chem 15:1–19
83. Wolf S, Lickert B, Bray S, Stock G (2020) Multisecond ligand dissociation dynamics from atomistic simulations. Nat Commun 11:2918
84. Araki M, Matsumoto S, Bekker G-J, Isaka Y, Sagae Y, Kamiya N, Okuno Y (2021) Exploring ligand binding pathways on proteins using hypersound-accelerated molecular dynamics. Nat Commun 12:2793
85. Robustelli P, Piana S, Shaw DE (2018) Developing a molecular dynamics force field for both folded and disordered protein states. PNAS 115:4758–4766
86. Lin F-Y, MacKerell AD Jr (2019) Force fields for small molecules. Methods Mol Biol 2022:21–54
87. Chmiela S, Sauceda HE, Müller K-R, Tkatchenko A (2018) Towards exact molecular dynamics simulations with machine-learned force fields. Nat Commun 9:3887
88. Nerenberg PS, Head-Gordon T (2018) New developments in force fields for biomolecular simulations. Curr Opin Struct Biol 49:129–138
89. Fröhlking T, Bernetti M, Calonaci N, Bussi G (2020) Toward empirical force fields that match experimental observables. J Chem Phys 152:230902
90. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. WIREs Comput Mol Sci 3:198–210
91. Case DA, Cheatham TE, Darden T, Gohlke H, Luo R, Merz KM, Onufriev A, Simmerling C, Wang B, Woods R (2005) The Amber biomolecular simulation programs. J Comput Chem 26:1668–1688
92. Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31:455–461
93. Eberhardt J, Santos-Martins D, Tillack AF, Forli S (2021) AutoDock Vina 1.2.0: new docking methods, expanded force field, and Python bindings. J Chem Inf Model 61:3891–3898
94. Morris GM, Huey R, Lindstrom W, Sanner MF, Belew RK, Goodsell DS, Olson AJ (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J Comput Chem 30:2785–2791
95. Eswar N, Eramian D, Webb B, Shen M-Y, Sali A (2008) Protein structure modeling with MODELLER. In: Kobe B, Guss M, Huber T (eds) Structural proteomics high-throughput methods. Humana Press, Totowa, pp 145–159
96. Song Y, DiMaio F, Wang RY-R, Kim D, Miles C, Brunette T, Thompson J, Baker D (2013) High resolution comparative modeling with RosettaCM. Structure 21:1735–1742
97. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10:845–858
98. Fiser A, Kinh Gian Do R, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773
99. Jamroz M, Kolinski A (2010) Modeling of loops in proteins: a multi-method approach. BMC Struct Biol 10:5–13
100. Scior T, Bender A, Tresadern G, Medina-Franco JL, Martinez-Mayorga K, Langer T, Cuanalo-Contreras K, Agrafiotis DK (2012) Recognizing pitfalls in virtual screening: a critical review. J Chem Inf Model 52:867–881
101. Wang G, Zhu W (2016) Molecular docking for drug discovery and development: a widely used approach but far from perfect. Future Med Chem 8:1707–1710
102. McGann M (2011) FRED pose prediction and virtual screening accuracy. J Chem Inf Model 51:578–596
103. Irwin JJ, Shoichet BK (2005) ZINC – a free database of commercially available compounds for virtual screening. J Chem Inf Model 45:177–182
104. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 46:3–26
105. Lipinski CA (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov Today Technol 4:337–341
106. Baell J, Walters MA (2014) Chemical con artists foil drug discovery. Nature 513:481–483

107. Macari G, Toti D, Pasquadibisceglie A, Polticelli F (2020) DockingApp RF: a state-of-the-art novel scoring function for molecular docking in a user-friendly interface to AutoDock Vina. Int J Mol Sci 21:9548
108. Wojcikowski M, Ballester P, Siedlecki P (2017) Performance of machine-learning scoring functions in structure-based virtual screening. Sci Rep 7:46710–46719
109. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA (2004) Development and testing of a general Amber force field. J Comput Chem 25:1157–1174
110. Chong S-H, Ham S (2015) Structural versus energetic approaches for protein conformational entropy. Chem Phys Lett 627:90–95
111. Kassem S, Ahmed M, El-Sheikh S, Barakat KH (2015) Entropy in bimolecular simulations: a comprehensive review of atomic fluctuations-based methods. J Mol Graph Model 62:105–117
112. Procacci P (2016) Reformulating the entropic contribution in molecular docking scoring functions. J Comput Chem 37:1819–1827
113. Meirovitch H (2010) Methods for calculating the absolute entropy and free energy of biological systems based on ideas from polymer physics. J Mol Recognit 23:153–172
114. Genheden S, Ryde U (2015) The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert Opin Drug Discov 10:449–461
115. Vosmeer CR, Pool R, van Stee MF, Peric-Hassler L, Vermeulen NPE, Geerke DP (2014) Towards automated binding affinity prediction using an iterative linear interaction energy approach. Int J Mol Sci 15:798–816
116. Rosendahl Kjellgren E, Skytte Glue OE, Reinholdt P, Egeskov Meyer J, Kongsted J, Poongavanam V (2015) A comparative study of binding affinities for 6,7-dimethoxy-4-pyrrolidylquinazolines as phosphodiesterase 10A inhibitors using the linear interaction energy method. J Mol Graph Model 61:44–52
117. Stjernschantz E, Oostenbrink C (2010) Improved ligand-protein binding affinity predictions using multiple binding modes. Biophys J 98:2682–2691
118. Aqvist J, Luzhkov VB, Brandsdal BO (2002) Ligand binding affinities from MD simulations. Acc Chem Res 35:358–365
119. King E, Aitchison E, Li H, Luo R (2021) Recent developments in free energy calculations for drug discovery. Front Mol Biosci 8:712085
120. Miller BR, McGee TD, Swails JM, Homeyer N, Gohlke H, Roitberg AE (2012) MMPBSA.py: an efficient program for end-state free energy calculations. J Chem Theory Comput 8:3314–3321
121. Song LF, Lee T-S, Zhu C, York DM, Merz KM Jr (2019) Using AMBER18 for relative free energy calculations. J Chem Inf Model 59:3128–3135
122. Huggins DJ (2022) Comparing the performance of different AMBER protein forcefields, partial charge assignments, and water models for absolute binding free energy calculations. J Chem Theory Comput 18:2616–2630
123. Borhani DW, Shaw DE (2012) The future of molecular dynamics simulations in drug discovery. J Comput Aided Mol Des 26:15–26
124. Decherchi S, Masetti M, Vyalov I, Rocchia W (2015) Implicit solvent methods for free energy estimation. Eur J Med Chem 91:27–42
125. Le L (2012) Incorporating molecular dynamics simulations into rational drug design: a case study on influenza a neuraminidases. In: Bioinformatics. InTech, Horacio Pérez-Sánchez, pp 159–184
126. Mortier J, Rakers C, Bermudez M, Murgueitio MS, Riniker S, Wolber G (2015) The impact of molecular dynamics on drug design: applications for the characterization of ligand–macromolecule complexes. Drug Discov Today 20:686–702

127. Tautermann CS, Seeliger D, Kriegl JM (2015) What can we learn from molecular dynamics simulations for GPCR drug design? Comput Struct Biotechnol J 13:111–121
128. Zhao H, Caflisch A (2015) Molecular dynamics in drug design. Eur J Med Chem 91:4–14
129. Okimoto N, Suenaga A, Taiji M (2016) Evaluation of protein–ligand affinity prediction using steered molecular dynamics simulations. J Biomol Struct Dyn 35(15):3221–3231
130. Li MS, Mai BK (2012) Steered molecular dynamics – a promising tool for drug design. Curr Bioinforma 7:342–351
131. Pang Y-P, Xu K, El Yazal J, Prendergast FG (2000) Successful molecular dynamics simulation of the zinc-bound farnesyltransferase using the cationic dummy atom approach. Protein Sci 9:1857–1865
132. Menchon G, Bombarde O, Trivedi M et al (2016) Structure-based virtual ligand screening on the XRCC4/DNA ligase IV interface. Sci Rep 6:22878–22890
133. Tran-Nguyen V, Bret G, Rognan D (2021) True accuracy of fast scoring functions to predict high-throughput screening data from docking poses: the simpler the better. J Chem Inf Model 61:2788–2797

Chapter 4

Antiviral Drug Target Identification and Ligand Discovery

Hershna Patel and Dipankar Sengupta

Abstract

This chapter provides a general overview of web-based resources available for antiviral drug discovery studies. First, we explain how the structure of a potential viral protein target can be obtained and then highlight some of the main considerations in preparing to apply receptor-based molecular docking techniques. Thereafter, we discuss the resources for finding potential drug candidates (ligands) against this target protein receptor, how to screen them, and how to prepare a library of their analogues. We make specific reference to free, online, open-source tools and resources that can be applied in antiviral drug discovery studies.

Key words Antiviral, Binding site, Compound library, Databases, Drug target, Protein structure

1 Introduction

Viral infections in humans continue to cause significant mortality, serious health complications, and, in the worst cases, devastating pandemics that have resulted in large-scale economic and social disruption. Some of the most common viruses known to cause acute or chronic human infection include rhinovirus, influenza A virus, coronavirus, human immunodeficiency virus (HIV), respiratory syncytial virus (RSV), and norovirus, as well as the viruses responsible for diseases such as hepatitis, measles, mumps, rubella, and herpes. More concerning still, the diseases on the current World Health Organization (WHO) list of priority diseases, which pose the greatest risk to public health, are all caused by viruses [1]; hence, there is an ongoing need to develop robust prevention strategies and to increase treatment options globally. Since the recent COVID-19 pandemic caused by transmission of the SARS-CoV-2 virus, more scientists and research groups have turned their attention toward computational antiviral drug discovery, and a number of notable approaches for COVID-19 drug discovery have been reviewed [2]. Additionally, computational drug discovery and bioinformatics are increasingly being implemented into teaching

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


curricula across various educational levels; therefore, this chapter is primarily directed at undergraduate or postgraduate students and those with no (or limited) prior experience of computational drug discovery. Several new and improved computational tools and databases for carrying out studies in this area have become available over the last decade, and undoubtedly, many more are under development.

In the context of in silico antiviral drug discovery and target identification, viral proteins of major functional importance have already been well characterized for many of the most significant viruses that infect humans. Sequence and structural data for these proteins, as well as for their substrates and inhibitors, are readily available and provide a useful starting resource for further research. These proteins include host cell receptor attachment proteins such as the spike protein of influenza virus and coronavirus, genome replication proteins such as the polymerase enzymes, and other structural or nonstructural proteins, all of which are attractive antiviral drug targets [3]. However, depending on the virus, some proteins may be more suitable targets than others based on their function, conservation, structural importance, abundance in the cell, and accessibility for drug binding. Ultimately, ligand binding to the selected protein target must result in functional inhibition and reduced viral replication in order to alleviate the disease burden. The lead compound(s) selected should also display the desired drug-like properties and specificity for the target, which is best demonstrated through in vitro experiments.

Furthermore, antiviral drug resistance is a major reason for treatment failure. Therefore, the potential for antiviral drug resistance to emerge and the rate of viral protein evolution are issues that need to be considered in drug discovery. Fortunately, bioinformatics methods can assist with this to an extent, as addressed in this chapter.

2 Receptor Protein Target

The following protocol provides a starting point for a receptor-based molecular docking or virtual screening antiviral drug discovery study.

2.1 Obtaining the Structure of the Receptor Protein Target

Searching the UniProt database or the RCSB Protein Data Bank (PDB) is generally the first step toward finding an experimentally resolved structure file or model for a receptor protein to target. Both of these primary databases are regularly updated and maintained; dedicated virus databases have also been developed and can be searched to download viral protein sequences and structures of interest. Virus-specific structural databases include SARS-CoV-2 3D (https://sars3d.com/) for SARS-CoV-2 proteins as well as the


VirusMED (https://virusmed.biocloud.top/) database, which contains binding site information about all viral proteins that have a determined structure and provides direct links to proteins from the PDB [4]. Alternatively, it is also possible to perform a Protein BLAST search against the PDB, using the amino acid sequence of the target protein as input, to identify any homologous structures in the PDB that can be used as a reference. We illustrate the procedure using the SARS-CoV-2 main protease as an example of a potential receptor protein target:

1. Go to the UniProt homepage (https://www.uniprot.org/) to retrieve the amino acid sequence for the SARS-CoV-2 main protease (Mpro). This protease is a cleavage product of the replicase polyprotein 1a and is also referred to as the 3C-like protease. Enter the search term, SARS-CoV-2 main proteinase, in the UniProt homepage search bar. Select the first entry, P0DTC1, and scroll down to the PTM/Processing section. Expand the 3C-like proteinase link and download the protein sequence in FASTA format.

2. Go to the NCBI BLAST database and select the protein BLAST server (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins). Copy and paste the 3C-like protease sequence in FASTA format into the query sequence box.

3. Within the Choose Search Set section, be sure to change the Database to be searched to Protein Data Bank proteins (pdb) using the drop-down arrow. Leave the remaining parameters at their defaults and click on BLAST to run the BLASTP search.

4. A number of search results should be returned, showing a description of the sequences producing significant alignments to the query protease (Fig. 1) as well as the accession code for the corresponding protein structure in the PDB. Inspect the top “hits,” and select one of the protein entries displaying high percentage identity and coverage of the query, such as the entry 6XA4. Alternatively, you may select several structures to create a library of potential drug targets (see Note 1).

2.2 Structure Suitability and Validation Check

Protein structures deposited in the PDB are resolved by different experimental procedures, such as X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy, and vary in resolution. Generally, a lower resolution value, given in ångströms (Å), indicates a higher-quality structure. The publication associated with the structure should be read to gain insight into the structural domains, the binding and interaction sites, and the procedures employed to determine the final experimental structure (see Note 2).
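When several candidate structures have been downloaded, the resolution comparison described above can be automated. The following is a minimal sketch under stated assumptions: it relies on the `REMARK   2 RESOLUTION.` line that legacy PDB-format files carry, treats entries without a numeric value (for example NMR entries, which report the resolution as not applicable) as having no resolution, and uses hypothetical function names. It only sorts entries; it does not decide whether a given resolution is good enough for docking.

```python
import re

# Legacy PDB files report the resolution on a line such as
# "REMARK   2 RESOLUTION.    1.65 ANGSTROMS."
_RESOLUTION = re.compile(r"^REMARK   2 RESOLUTION\.\s+(\d+(?:\.\d+)?)\s+ANGSTROMS")


def read_resolution(pdb_text):
    """Return the resolution in angstroms, or None if no numeric value is given."""
    for line in pdb_text.splitlines():
        match = _RESOLUTION.match(line)
        if match:
            return float(match.group(1))
    return None


def rank_by_resolution(structures):
    """Sort {name: pdb_text} by resolution, best (lowest value) first.

    Entries without a numeric resolution are placed at the end.
    """
    scored = [(name, read_resolution(text)) for name, text in structures.items()]
    return sorted(scored, key=lambda item: (item[1] is None, item[1] or 0.0))
```

The ranking is only a first filter; the associated publication and the validation report remain the main sources for judging whether a structure is suitable.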


Fig. 1 Screenshot of the output from the BLASTP search using the SARS-CoV-2 main protease sequence as input
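If the BLASTP results are exported as a tab-separated hit table instead of being inspected on screen, the selection of hits with high identity and query coverage can be sketched as below. Several points are assumptions rather than part of the protocol: the table is taken to use the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); coverage is computed per alignment from the query start and end positions, which can differ from the merged "Query Cover" shown on the web page; the 95% thresholds are purely illustrative; and the function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str      # subject identifier as given in column 2
    identity: float   # percent identity (column 3)
    coverage: float   # percent of the query spanned by this alignment
    evalue: float     # E-value (column 11)


def parse_hits(table_text, query_length):
    """Parse a 12-column tab-separated BLAST hit table into Hit objects."""
    hits = []
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                     # skip blank and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue                     # not a standard hit row
        q_start, q_end = int(fields[6]), int(fields[7])
        coverage = 100.0 * (q_end - q_start + 1) / query_length
        hits.append(Hit(fields[1], float(fields[2]), coverage, float(fields[10])))
    return hits


def select_hits(hits, min_identity=95.0, min_coverage=95.0):
    """Keep hits above both thresholds, best identity and coverage first."""
    kept = [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]
    return sorted(kept, key=lambda h: (-h.identity, -h.coverage, h.evalue))
```

The query length passed to `parse_hits` is the number of residues in the FASTA sequence submitted to BLAST. Keeping more than one selected hit corresponds to building a small library of candidate target structures.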

1. Go to the RCSB Protein Data Bank (https://www.rcsb.org/) and enter the accession code 6XA4 provided by the BLASTP search. Download the main protease protein structure file: select Download Files, then PDB Format. The file should end with the extension .pdb and be readable by molecular visualization programs such as PyMOL (https://pymol.org/2/), RasMol (http://www.openrasmol.org/), or Chimera (https://www.cgl.ucsf.edu/chimera/).

2. It is good practice to work with a complete structure before proceeding with binding site predictions and docking or virtual screening studies, so that critical interactions between molecules can be identified. Hence, it is important to check the query coverage in the BLASTP result. Short missing fragments at the very ends of the N- and C-termini can be ignored. To check whether any regions are missing atomic coordinates, click on the “Sequence” tab at the top of the PDB entry page for a graphical overview, or alternatively, manually inspect the REMARK records of the PDB file. You can do this by opening the file in a text editor or by displaying the file in PDB format in the browser. Some protein entries may also contain multiple identical chains or ANISOU records; these can be removed by deleting the corresponding records from the PDB file. Similarly, unnecessary ligands, ions, and solvents can be removed by deleting the atom records for these molecules. For example, the Unix “grep” command can be used to delete the water molecules and ligands from the 6XA4.pdb structure file:

   grep -v '^HETATM' 6XA4.pdb > 6XA4_NoHETATM.pdb

3. A number of molecular modelling programs or tools, such as MODELLER [5], can be used to automatically "fill in" these missing regions. If the modelling program does not include structural refinement, energy minimization should then be applied to mitigate the effects of any clashing atoms.

Discovering or designing drugs for novel viruses may be more challenging if no (suitable) structure for the target protein is available. However, structure prediction techniques can be employed to predict a protein model of the target, and only the complete protein sequence for the target is required as input.

The target site for docking, such as the substrate binding site, should ideally be known in advance, either from the literature, through database searching, or from the location of a co-crystallized bound ligand or inhibitor within an experimental structure complex. In the case of structure 6XA4, the inhibitor molecule UAW241 is bound to the Mpro active site. If the target site is unknown or a novel site is to be investigated, a protein–ligand binding site prediction server can be used to identify potential "druggable hot spots" that may interact favorably and specifically with drug-like compounds. Several online tools for ligand binding site prediction are available [6], and most require only a protein sequence or structure file as input. These predictions may also facilitate drug design based on the properties of the receptor binding site. Molecular dynamics simulations of the protein can also be performed to uncover cryptic binding pockets.
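The file clean-up and completeness check from step 2 of this subsection can also be scripted. The sketch below is one possible approach, not the authors' procedure: it drops HETATM records by record name (the grep command matches the string anywhere in a line) and lists residues reported in REMARK 465 as missing. The function names and the example records are hypothetical, and entries with insertion codes are skipped for simplicity.

```python
def strip_hetero_records(pdb_lines):
    """Remove HETATM records (waters, ligands, ions) by record name."""
    return [line for line in pdb_lines if not line.startswith("HETATM")]

def missing_residues(pdb_lines):
    """Collect (residue name, chain, number) tuples from REMARK 465 lines.

    Assumes whitespace-separated fields with an optional leading model
    number; header and free-text lines are ignored because their last
    field is not an integer.
    """
    found = []
    for line in pdb_lines:
        tokens = line.split()
        if tokens[:2] != ["REMARK", "465"]:
            continue
        fields = tokens[2:]
        if len(fields) == 4:  # optional model number in front
            fields = fields[1:]
        if len(fields) == 3 and fields[2].lstrip("-").isdigit():
            found.append((fields[0], fields[1], int(fields[2])))
    return found

# Hypothetical records used only to exercise the two functions.
example_lines = [
    "REMARK 465 MISSING RESIDUES",
    "REMARK 465   M RES C SSSEQI",
    "REMARK 465     SER A     1",
    "REMARK 465     GLY A     2",
    "ATOM      1  N   PHE A   3      10.000  10.000  10.000  1.00 20.00           N",
    "HETATM    2  O   HOH A 401      12.000  12.000  12.000  1.00 30.00           O",
]
print(missing_residues(example_lines))
print(len(strip_hetero_records(example_lines)))
```

Note that removing every HETATM record also removes modified residues stored as heteroatoms, so the output should be inspected before modelling.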

3 Analyzing Viral Genome and Protein Conservation for Drug Target Selection

Viruses, particularly those with a segmented RNA genome (such as influenza), are known to mutate rapidly. For antiviral drug discovery, it is important for drugs entering clinical trials to be long-lasting and to withstand the effect of sequence mutations and protein evolution; however, it is difficult to accurately predict where mutations will occur in the future and what their impact will be.

Multiple sequence alignment (MSA) is a very useful technique for analyzing the degree of conservation or variability among nucleic acid or protein sequences, as specific changes in amino acid residues at certain positions can be observed. As multiple alignments of protein sequences are generally more informative than nucleic acid sequence alignments, we recommend using a dataset which includes a large sample of nonredundant sequences of similar length, including those from the most recently sequenced variants, in order to provide an up-to-date evolutionary profile (see Note 3). An example of an MSA is shown in Fig. 2, where six Mpro amino acid sequences have been aligned. Overall, there is extremely high conservation among the six sequences, with a difference at only one amino acid position, 132. Widely applied multiple sequence alignment software that can handle large datasets includes Clustal Omega [7] and MAFFT [8].

Fig. 2 Multiple sequence alignment of Mpro sequences from the USA, the UK, and China from 2020 and 2022 using Clustal Omega. The amino acid substitution from Proline to Histidine at position 132 is highlighted

Furthermore, coevolution methods based on evolutionary information derived from a multiple sequence alignment can identify amino acid residues which coevolve over time [9–11]. Coevolution analysis can therefore be employed to evaluate the suitability of the protein target. For example, if an interaction between closely located coevolving residue pairs, either within the same viral protein (intra-protein contacts) or between two viral proteins or a viral and a host protein (inter-protein contacts), can be disrupted through ligand binding, this may be a more appropriate target site to explore. In contrast, positions that coevolve with amino acid positions where known drug-resistance mutations occur may display lower conservation and be less suitable drug targets.

A structural model of the viral protein–drug complex can also be useful for predicting whether specific mutations that arise under drug selection pressure would result in lower drug binding affinities, based on in silico mutagenesis experiments followed by molecular docking of the inhibitor drug to the mutant protein.
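Once an alignment has been produced, the variable positions can be listed programmatically rather than read off by eye. The following is a minimal sketch, assuming the aligned sequences are already in memory as equal-length strings; the function name `variable_columns` and the three toy sequences are hypothetical and are not the Mpro sequences of Fig. 2.

```python
from collections import Counter

def variable_columns(aligned):
    """Map each non-conserved alignment column (1-based) to residue counts.

    `aligned` is a dict of sequence ID -> aligned sequence; all
    sequences must have the same length (gaps written as '-').
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    variable = {}
    for index, column in enumerate(zip(*aligned.values()), start=1):
        counts = Counter(column)
        if len(counts) > 1:  # more than one symbol observed in this column
            variable[index] = counts
    return variable

# Toy alignment (hypothetical sequences) with a single P/H difference.
toy_alignment = {
    "strain_1": "MKPLH",
    "strain_2": "MKPLH",
    "strain_3": "MKHLH",
}
for position, counts in variable_columns(toy_alignment).items():
    print(position, dict(counts))
```

Column indices refer to the alignment; they match residue numbering only when the reference sequence has no gaps before that column.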


To perform inter-protein coevolution analysis, sequences for the individual proteins must first be obtained and aligned. Then, based on the sequence identifier for each strain type, the sequences of the first protein need to be appended to the sequences of the other, resulting in a concatenated sequence dataset which can be used as input to the coevolution analysis software if the method does not precompute this. The “concat” command from the SeqKit software [12] can be used to concatenate sequences with the same ID from multiple files.
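The concatenation step described above can be sketched in a few lines for cases where a dedicated tool is not at hand. This is an illustrative sketch, assuming both inputs are already aligned and share sequence identifiers; the helper names `read_fasta` and `concatenate_by_id`, the strain identifiers, and the sequences are all hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of ID -> sequence (ID = first word of header)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def concatenate_by_id(first, second):
    """Append the second protein to the first for every ID present in both."""
    return {name: first[name] + second[name] for name in first if name in second}

# Hypothetical aligned sequences for two proteins from three strains.
protein_a = read_fasta(">strain_1\nMKP-LH\n>strain_2\nMKPALH\n>strain_3\nMKPALH\n")
protein_b = read_fasta(">strain_2\nGGS\n>strain_1\nGAS\n")

for name, sequence in concatenate_by_id(protein_a, protein_b).items():
    print(name, sequence)
```

Strains missing from either file are dropped, which keeps every row of the concatenated alignment the same length.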

4 Databases for Selecting Compound Libraries and Ligands

There are several online repositories and databases that contain chemical structures of drug-like compounds for virtual screening, some of which are interlinked. Three popular databases which are often used in research studies include the following:

1. The DrugBank database, https://go.drugbank.com/ [13].

2. The ZINC database, https://zinc.docking.org [14], which contains links to several catalog libraries of commercially available compounds from various international vendors.

3. The bioactive compound database ChEMBL, https://www.ebi.ac.uk/chembl/ [15].

Vendors offering specific antiviral compound libraries are also available which can be used in molecular docking studies to investigate their target protein, binding mechanism, for ligand-based drug design, or pharmacophore modelling:

• The MedChemExpress Antiviral Compound Library, https://www.medchemexpress.com/screening/Anti-virus_Compound_Library.html. Contains 1132 compounds that target several viruses and can be downloaded in SDF format.

• The Enamine Ltd Antiviral Library, https://enamine.net/compound-libraries/targeted-libraries/antiviral-library. Consists of 3200 compounds designed for discovery of new nucleoside-like antivirals.

• The Life Chemical Antiviral Screening compound library, https://lifechemicals.com/screening-libraries/targeted-and-focused-screening-libraries/antiviral-libraries. Contains over 3500 compounds in SDF format.

Chemical compounds will commonly be encountered in the Structured Data File (.sdf), Simplified Molecular Input Line Entry System (.smi), and Mol2 (.mol2) formats and can be easily converted into other formats compatible with docking software (see Note 4).
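A downloaded library in SDF format is a plain-text file in which each compound record ends with a line containing `$$$$`, so a quick sanity check on the number of compounds needs no special software. The sketch below assumes that convention; the function names and the two truncated placeholder records are hypothetical and are not valid, complete molecule blocks.

```python
def split_sdf(text):
    """Split SDF text into records, using the '$$$$' terminator line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records  # an unterminated trailing fragment is ignored

def record_titles(records):
    """Return the first line of each record, which holds the molecule title."""
    return [record.split("\n", 1)[0].strip() for record in records]

# Placeholder text: titles followed by elided content, for illustration only.
toy_sdf = "compound_A\n  (atom and bond block elided)\n$$$$\ncompound_B\n  (atom and bond block elided)\n$$$$\n"
library = split_sdf(toy_sdf)
print(len(library), record_titles(library))
```

For real screening work, a cheminformatics toolkit should be used to parse and validate the structures; this count only confirms that the download is complete.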

4.1 ADME Screening

Ideally, the compound(s) selected should display drug-like properties with regard to absorption, distribution, metabolism, and excretion (ADME), and Lipinski's rule of five can be applied to filter the compound library [16]. Investigating currently approved and commercially available drug compounds, nutraceuticals, and natural products may accelerate the drug discovery process. Natural compounds are available as a subset of the ZINC database and in the Collection of Open Natural Products (COCONUT) database [17] (see Note 5). The SwissADME website (http://www.swissadme.ch/) provides a tool to predict ADME parameters as well as other parameters to support drug discovery projects [18]. You may either draw the chemical structure of the compounds you are interested in screening or provide their SMILES (Simplified Molecular Input Line Entry System) notations (for example, the SMILES notation of benzene can be "c1ccccc1" or "C1=CC=CC=C1") [19, 20] (see Note 6).

We illustrate the use of this tool with the example of α-ketoamides, which are potential candidate inhibitors targeting the SARS-CoV-2 main protease [21, 22]. Telaprevir [PubChem ID: 3010818] and boceprevir [PubChem ID: 10324367] are among the best peptidomimetic inhibitors. Go to the SwissADME website and input their SMILES notations:

Telaprevir: CCCC(C(=O)C(=O)NC1CC1)NC(=O)C2C3CCCC3CN2C(=O)C(C(C)(C)C)NC(=O)C(C4CCCCC4)NC(=O)C5=NC=CN=C5

Boceprevir: CC1(C2C1C(N(C2)C(=O)C(C(C)(C)C)NC(=O)NC(C)(C)C)C(=O)NC(CC3CCC3)C(=O)C(=O)N)C

Next, click on the "Run" option. It may take a few minutes to analyze the molecules and display the results corresponding to each of the inputs. In this example, the results for Molecule 1 (Fig. 3a) are for telaprevir and those for Molecule 2 (Fig. 3b) are for boceprevir. Comparing the results, we observe that telaprevir has more drug-likeness violations than boceprevir, which also has a comparatively higher bioavailability and better water solubility. Alternatively, you can download the results as a CSV file (click on the Retrieve Data: CSV option) and compare your drug candidate molecules.

Fig. 3 (a) Screenshot of the output [Molecule 1, telaprevir] from the ADME screening search using SMILES notations of telaprevir and boceprevir as input. (b) Screenshot of the output [Molecule 2, boceprevir] from the ADME screening search using SMILES notations of telaprevir and boceprevir as input

4.2 Developing an Analogue Library

Using prior knowledge of the binding site, such as the volume, geometry, presence of specific functional groups, and charge distribution, ligands can also be manually selected or designed based on their complementary properties. Once the 3D coordinates of a ligand (a potential drug candidate) have been obtained (e.g., from the ZINC database or PubChem), its derivatives can be designed to prepare an analogue library. Using the ligand as a template, identify the functional groups or side chains that can be replaced by sterically and conformationally allowed substituents. For example, from ADME screening, we found boceprevir to be a potential candidate. Ketone is a prominent functional group in this molecule (Fig. 4a), whose oxygen can, functionally, be replaced by other members of the chalcogen family (Fig. 4b). Thereafter, we need to ensure that each derivative analogue is assigned appropriate bond orders, and the final step is to optimize the analogues with a force field [23]. For small or organic molecules, MMFF94 or MMFF94s is the recommended force field for optimizing their energy conformation [24]. An easy-to-use application for this process is Avogadro (Fig. 5), open-source software (https://two.avogadro.cc/) for manipulating molecular structures that can interact with programs like OpenBabel [25]. This step aids in preparing the set of ligands and their analogues for virtual screening against the identified protein target receptor. Docking studies may thereafter be employed to analyze the interaction of the molecules from the analogue library with the binding site of the protein target.

Fig. 4 (a) 2D representation of boceprevir. (b) Potential sites in boceprevir that can be substituted

Fig. 5 Screenshot of the Avogadro software used for optimizing boceprevir [Ball and Stick representation]
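To make the analogue-library idea concrete, candidate analogues can be enumerated at the SMILES level before any 3D building. The sketch below is a deliberately naive, text-based illustration and not a chemistry-aware method: it treats every carbonyl oxygen written as `(=O)` in the boceprevir SMILES given above as a substitution site (amide and ketone carbonyls alike) and swaps it, one site at a time, for sulfur or selenium. The function name and the substituent table are assumptions made for this example.

```python
# Boceprevir SMILES as given in the ADME screening example above.
BOCEPREVIR = (
    "CC1(C2C1C(N(C2)C(=O)C(C(C)(C)C)NC(=O)NC(C)(C)C)"
    "C(=O)NC(CC3CCC3)C(=O)C(=O)N)C"
)

# Hypothetical substituent table: heavier chalcogens replacing oxygen.
CHALCOGEN_SWAPS = {"S": "(=S)", "Se": "(=[Se])"}

def single_site_analogues(smiles, pattern="(=O)", swaps=CHALCOGEN_SWAPS):
    """Return (string offset, element, SMILES) for each single-site swap."""
    analogues = []
    start = smiles.find(pattern)
    while start != -1:
        for element, replacement in swaps.items():
            edited = smiles[:start] + replacement + smiles[start + len(pattern):]
            analogues.append((start, element, edited))
        start = smiles.find(pattern, start + 1)
    return analogues

library = single_site_analogues(BOCEPREVIR)
for offset, element, smiles in library:
    print(offset, element, smiles)
```

Each generated string should then be validated in a cheminformatics toolkit, converted to 3D, and minimized as described above; whether the chosen force field has parameters for the heavier chalcogens needs to be checked separately.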

5 Notes

1. To find identical protein structures, click on the PDB accession code from the BLASTP results page. This will take you to the protein entry in NCBI. At the top of the page, below the header, click on the "Identical Proteins" hyperlink. A table of identical protein structures will be returned.

2. Many conformations of the same target protein may be available, in which case the user will need to inspect the PDB entry and accompanying publication to select the structure(s) most suitable for the study objective. An ensemble of different conformations of the protein receptor could be used to give an indication of target specificity based on the binding affinity score. Binding affinities may be stronger with one specific conformation. The drug target should not closely resemble a human protein homologue, to avoid cross-reactivity.

3. Viral protein sequences can be obtained from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). This database allows searching by virus and provides a link for automatic alignment of up to 500 selected sequences; however, the dataset may require some filtering and manual editing to remove redundant sequences or sequences containing large gaps and nonstandard amino acids.

4. The file format may need to be converted to a format which is compatible with the molecular docking software. The OpenBabel tool [26] is a useful resource for this and can also filter molecules based on various criteria. Additional compound preparation may be required, such as correcting protonation states.

5. Some further considerations before proceeding with the docking process include whether the receptor binding site residues are to be treated as flexible or rigid (the incorporation of flexible residues is not provided by some docking software) and the use of constraints for molecular docking to guide the prediction.

6. SMILES is a line notation method used to represent a chemical molecule (and chemical reactions), making it easy for computer programs to read. PubChem (https://pubchem.ncbi.nlm.nih.gov/) is a major online resource storing information on chemical molecules, which can also be used to obtain SMILES notations. Alternatively, the OEChem TK 2.3.0 toolkit (https://docs.eyesopen.com/toolkits/python/oechemtk/releasenotes/version2_3_0.html) can be used in Python programming for this purpose.
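The filtering mentioned in Note 3 can be sketched as a small function applied to the sequences before or after alignment. This is an illustrative sketch only: the function name, the 10% gap threshold, and the toy records are assumptions chosen for the example, and the appropriate cut-offs depend on the dataset.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def filter_sequences(records, max_gap_fraction=0.1):
    """Keep sequences that are non-redundant, gap-poor, and standard-only.

    `records` maps ID -> sequence (gaps as '-'). The first occurrence of
    each distinct ungapped sequence is kept; later duplicates are dropped.
    """
    kept, seen = {}, set()
    for name, sequence in records.items():
        sequence = sequence.upper()
        residues = sequence.replace("-", "")
        if not residues:
            continue
        if sequence.count("-") / len(sequence) > max_gap_fraction:
            continue  # too many gap characters
        if set(residues) - STANDARD_AMINO_ACIDS:
            continue  # contains nonstandard or ambiguous symbols such as X
        if residues in seen:
            continue  # redundant with an earlier record
        seen.add(residues)
        kept[name] = sequence
    return kept

# Hypothetical records used only to exercise each filter.
toy_records = {
    "seq_a": "MKPLH",
    "seq_b": "MKPLH",   # duplicate of seq_a
    "seq_c": "MKXLH",   # nonstandard symbol
    "seq_d": "M---H",   # mostly gaps
    "seq_e": "MKHLH",
}
print(list(filter_sequences(toy_records)))
```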

6 Summary

In this chapter, we have covered how to identify target protein receptors, why conserved regions matter, and how to select potential drug candidates based on the target site and develop their analogues for antiviral drug discovery studies, using the SARS-CoV-2 main protease as an example. A quick summary of the resources and databases discussed is given in Table 1.

Table 1 A summary of the databases and resources referred to in this chapter

Resource/database | Available information or computing utility | Web address (URL)
UniProt | Protein sequence and function information | https://www.uniprot.org/
RCSB Protein Data Bank (PDB) | Protein 3D structures (experimentally determined) | https://www.rcsb.org/
SARS-Cov-2-3D | SARS-Cov-2-3D proteome database | https://sars3d.com/
VirusMED | Individual hot spots for viral proteins | https://virusmed.biocloud.top/
NCBI BLAST | Find regions of similarity between the sequences | https://blast.ncbi.nlm.nih.gov/Blast.cgi
PyMol | Molecular visualization program | https://pymol.org/2/
RasMol | Molecular visualization program | http://www.openrasmol.org
Chimera | Molecular visualization program | https://www.cgl.ucsf.edu/chimera/
Clustal Omega | Multiple sequence alignment | https://www.ebi.ac.uk/Tools/msa/clustalo/
MAFFT | Multiple sequence alignment | https://www.ebi.ac.uk/Tools/msa/mafft/
SeqKit | Toolkit for FASTA/Q file manipulation | https://bioinf.shenwei.me/seqkit/
Drugbank database | Knowledge base for drug candidates | https://go.drugbank.com/
ZINC database | Database of commercially available compounds for virtual screening | https://zinc.docking.org
ChEMBL | Database of manually curated bioactive molecules with drug-like properties | https://www.ebi.ac.uk/chembl/
MedChemExpress | Collection of 1144 antivirus compounds | https://www.medchemexpress.com/screening/Anti-virus_Compound_Library.html
Enamine Ltd Antiviral Library | Collection of 3200 compounds designed for discovery of new nucleoside-like antivirals | https://enamine.net/compound-libraries/targeted-libraries/antiviral-library
The Life Chemical Antiviral Screening compound library | Antiviral libraries of over 13,700 drug-like screening compounds | https://lifechemicals.com/screening-libraries/targeted-and-focused-screening-libraries/antiviral-libraries
SwissADME | ADME screening for potential drug candidates | http://www.swissadme.ch
PubChem | Chemical information | https://pubchem.ncbi.nlm.nih.gov/
Avogadro | Semantic chemical builder and platform for visualization and analysis | https://two.avogadro.cc/


References

1. WHO (2022) Prioritizing diseases for research and development in emergency contexts. https://www.who.int/activities/prioritizing-diseases-for-research-and-development-in-emergency-contexts. Accessed 26 Aug 2022
2. Muratov EN, Amaro R, Andrade CH et al (2021) A critical overview of computational approaches employed for COVID-19 drug discovery. Chem Soc Rev 50:9121–9151. https://doi.org/10.1039/D0CS01065K
3. Li G, De Clercq E (2021) Chapter 1: overview of antiviral drug discovery and development: viral versus host targets. In: Antiviral discovery for highly pathogenic emerging viruses, pp 1–27
4. Zhang H, Chen P, Ma H et al (2021) VirusMED: an atlas of hotspots of viral proteins. IUCrJ 8:931–942. https://doi.org/10.1107/S2052252521009076
5. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinforma 54:5.6.1–5.6.37. https://doi.org/10.1002/CPBI.3
6. Zhao J, Cao Y, Zhang L (2020) Exploring the computational methods for protein-ligand binding site prediction. Comput Struct Biotechnol J 18:417–426. https://doi.org/10.1016/J.CSBJ.2020.02.008
7. Sievers F, Wilm A, Dineen D et al (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539. https://doi.org/10.1038/MSB.2011.75
8. Katoh K, Misawa K, Kuma K-I, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30(14):3059–3066
9. Champeimont R, Laine E, Hu SW et al (2016) Coevolution analysis of Hepatitis C virus genome to identify the structural and functional dependency network of viral proteins. Sci Reports 6:26401. https://doi.org/10.1038/srep26401
10. Mintaev RR, Alexeevski AV, Kordyukova LV (2014) Co-evolution analysis to predict protein–protein interactions within influenza virus envelope. J Bioinforma Comput Biol 12(2):1441008. https://doi.org/10.1142/S021972001441008X
11. Priya P, Shanker A (2021) Coevolutionary forces shaping the fitness of SARS-CoV-2 spike glycoprotein against human receptor ACE2. Infect Genet Evol 87:104646. https://doi.org/10.1016/J.MEEGID.2020.104646
12. Shen W, Le S, Li Y, Hu F (2016) SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11:e0163962. https://doi.org/10.1371/JOURNAL.PONE.0163962
13. Wishart DS, Feunang YD, Guo AC et al (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 46:D1074–D1082. https://doi.org/10.1093/NAR/GKX1037
14. Irwin JJ, Tang KG, Young J et al (2020) ZINC20 - a free ultralarge-scale chemical database for ligand discovery. J Chem Inf Model 60:6065–6073. https://doi.org/10.1021/ACS.JCIM.0C00675
15. Davies M, Nowotka M, Papadatos G et al (2015) ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res 43:W612. https://doi.org/10.1093/NAR/GKV352
16. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 46:3–26. https://doi.org/10.1016/S0169-409X(00)00129-0
17. Sorokina M, Merseburger P, Rajan K et al (2021) COCONUT online: collection of open natural products database. J Cheminform 13:1–13. https://doi.org/10.1186/S13321-020-00478-9
18. Daina A, Michielin O, Zoete V (2017) SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep 7:42717. https://doi.org/10.1038/srep42717
19. Weininger D (1988) SMILES, a chemical language and information system: 1: introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36. https://doi.org/10.1021/CI00057A005
20. O'Boyle NM (2012) Towards a Universal SMILES representation - a standard method to generate canonical SMILES based on the InChI. J Cheminform 4:1–14. https://doi.org/10.1186/1758-2946-4-22
21. Sharma M, Prasher P, Mehta M et al (2020) Probing 3CL protease: rationally designed chemical moieties for COVID-19. Drug Dev Res 81:911–918. https://doi.org/10.1002/DDR.21724
22. Zhang L, Lin D, Kusov Y et al (2020) α-Ketoamides as broad-spectrum inhibitors of coronavirus and enterovirus replication: structure-based design, synthesis, and activity assessment. J Med Chem 63:4562–4578. https://doi.org/10.1021/ACS.JMEDCHEM.9B01828
23. Lin FY, MacKerell AD (2019) Force fields for small molecules. In: Methods in molecular biology
24. Halgren TA (1999) MMFF VII. Characterization of MMFF94, MMFF94s, and other widely available force fields for conformational energies and for intermolecular-interaction energies and geometries. J Comput Chem 20(7):730–748. https://doi.org/10.1002/(SICI)1096-987X(199905)20:7<730::AID-JCC8>3.0.CO;2-T
25. Hanwell MD, Curtis DE, Lonie DC et al (2012) Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J Cheminform 4(1):17
26. O'Boyle NM, Banck M, James CA et al (2011) Open Babel: an open chemical toolbox. J Cheminform 3:1–14. https://doi.org/10.1186/1758-2946-3-33

Chapter 5

GRAMM Web Server for Protein Docking

Amar Singh, Matthew M. Copeland, Petras J. Kundrotas, and Ilya A. Vakser

Abstract

Prediction of the structure of protein complexes by docking methods is a well-established research field. The intermolecular energy landscapes in protein–protein interactions can be used to refine docking predictions and to detect macro-characteristics, such as the binding funnel. A new GRAMM web server for protein docking predicts a spectrum of docking poses that characterize the intermolecular energy landscape in protein interaction. A user-friendly interface provides options to choose free or template-based docking, as well as other advanced features, such as clustering of the docking poses, and interactive visualization of the docked models.

Key words Protein–protein interactions, Energy landscape, Free docking, Template-based docking, Web-based resource

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024

1 Introduction

Many biological systems are based on protein–protein interactions. Thus, modeling these interactions is important for understanding fundamental principles of biomolecular mechanisms and developing our ability to manipulate them. Computational techniques for prediction of the structure of protein–protein complexes (protein docking) can be broadly divided into (a) template-based (or homology) docking, which relies on the availability of experimentally determined structures of complexes of proteins (templates) that are similar to the target ones, and (b) free docking, which is based on physical principles, such as shape and physicochemical complementarity of the interacting proteins [1]. Most docking algorithms consist of two basic steps: global search for the tentative binding mode (binding funnel) and local scoring/refinement within the candidate funnel. The global search stage is typically based on simple, computationally inexpensive scoring functions and fast search algorithms. At the scoring/refinement stage, the tentative predictions are rescored/re-ranked by more accurate scoring functions [2].

The free and the template-based docking paradigms each have their strengths and weaknesses. Template-based docking is generally more accurate than free docking and has high tolerance to structural inaccuracy of the target proteins [3]. However, it critically depends on the availability of adequate docking templates [4]. While free docking does not depend on template availability, its performance drops significantly for protein structures of lower accuracy [3]. Recently, docking methods based on deep learning have gained popularity [5–7]. Such methods can accurately predict structures of protein assemblies starting from the sequence of the target proteins. However, they face problems in modeling antigen–antibody complexes, multimeric assemblies, and transient interactions.

A number of docking approaches have been implemented in web-based servers available to the research community, such as ClusPro [8], HADDOCK [9], ZDOCK [10], HEX [11], ATTRACT [12], PRISM [13], SwarmDock [14], pyDockWeb [15], PatchDock [16], RosettaDock [17], FRODOCK [18], LZerD [19], HDOCK [20], and InterEvDock3 [21]. Here, we present the GRAMM docking server (http://gramm.compbio.ku.edu), which provides a simple and easy-to-use web interface for protein–protein docking. This server is a successor of our previous GRAMM-X web server [22], which has been popular in the research community since its public release in 2006. The GRAMM server offers both template-based and free docking capability. In addition to the rescored and refined predictions, users can download docked structures generated at the scan stage with the shape complementarity score only, which can be used to map the intermolecular energy landscape for further sampling [23]. The server also has advanced docking features, such as clustering of docking poses in free docking and template selection in template-based docking.

2 Materials

2.1 User Input

The user web interface contains input boxes to upload the protein structures for docking (see Note 1). The larger and the smaller proteins in a complex are defined as receptor and ligand, respectively (see Note 2). Users are asked to provide a valid email address to which the docking results are returned. A job can be submitted either for free or for template-based docking. A number of advanced options are offered after the docking method is selected (see Subheading 3). The basic workflow and the advanced options are shown in Fig. 1. Each submitted job is assigned a unique ID and queued for processing.

GRAMM Web Server

Fig. 1 Workflow of the GRAMM docking web server. The two different docking methods (free and template based) and their advanced options are shown by different colors

2.2 Implementation

The web server is implemented primarily in Python 3. The server-side web application uses the Flask framework with Jinja templates to create the client-side interface. The client-side user interface uses JavaScript for the dynamic interface control and visualization features. When a job is submitted, its metadata is stored in JSON format, and the information necessary for scheduling is placed into an SQLite3 database. The database is monitored by the job execution system. Jobs requiring longer execution time (those performing clustering and template-based docking) are placed in a separate queue to minimize the backlog of short-term jobs. After a job is finished, an email is sent to the user with a link to the results. Jobs are run on a dedicated PowerEdge R730 server hosted at The University of Kansas. The production server is kept in the university's monitored enterprise data center. In keeping with best practices for server operations, a separate identical server is used for development, quality assurance, and diagnosing problems, in order to minimize unplanned downtime of the production server.

3 Methods

3.1 Free Docking

Free docking is performed by our standard FFT (fast Fourier transform) algorithm, which has been extensively described in the literature over the years [24, 25]. It performs a systematic 6D rigid-body search using FFT for the convolution of the translational coordinates, a method extensively utilized in the protein docking community (see Note 3). The procedure performs medium-resolution docking with a 3.5 Å grid step and a 10° angular interval, which are optimal for unbound docking (docking of unbound high-resolution experimentally determined or modeled structures; see Note 4). These default parameter values can be changed in the advanced options. Users can change the number of output docking poses according to their needs (the default number is 30,000). Users can also choose the number of models to output in the PDB format, with different chain identifiers (chain A is used for the receptor and chain B for the ligand; see Note 5). The Euler rotation angles and the translation vectors for generating the docking poses from the initial coordinates of the proteins are stored for further analysis and processing, such as clustering of the docking poses and their re-ranking by sophisticated scoring functions (see Note 6).

Protein docking can be significantly improved when the search space is constrained by a priori information on interacting residues. The GRAMM server provides an option to filter/rescore the docking poses based on a user-supplied list of interacting residues of one or both proteins. Users are asked to provide confidence scores for such residues in the 0–10 range (a higher score indicates higher reliability of the constraint; see Note 7).

To detect the near-native docking poses, one can cluster the docking poses generated at the scan stage based on some criterion of similarity (e.g., RMSD or MM-score between the docked models [26]). The basic assumption underlying the clustering approach is that the native structure of the complex corresponds to the binding funnel on the intermolecular energy landscape where the low-energy docking poses are clustered. The GRAMM server implements sequential clustering (see Note 8), in which the lowest-energy pose is designated as the representative structure of the first cluster, and higher-energy docking poses within the clustering threshold (based on Cα RMSD) are assigned to that cluster. After generation of the first cluster, the lowest-energy unassigned pose is selected as the representative of the second cluster, and so on. The procedure is iterated until all docking poses are assigned to clusters. An advanced option is to select the total number of docking poses for clustering and the clustering threshold (the default values are 30,000 and 10 Å, respectively; see Note 9). The server also evaluates the quality of the predicted docking models within each cluster with respect to the native structure (if provided by the user, see Note 10) according to the CAPRI quality criteria [27].
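The sequential clustering procedure described above can be illustrated with a short sketch. This is not the server's code: it assumes each pose is given as an energy plus an ordered list of ligand Cα coordinates in the fixed receptor frame (so RMSD is computed without superposition), and it treats "within the threshold" as less than or equal to the threshold. The function names and the toy poses are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists (no superposition)."""
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

def sequential_clustering(poses, threshold=10.0):
    """Cluster poses given as (energy, coordinates) tuples.

    Returns a list of clusters; each cluster is a list of pose indices
    with the lowest-energy representative first.
    """
    unassigned = sorted(range(len(poses)), key=lambda i: poses[i][0])
    clusters = []
    while unassigned:
        representative = unassigned[0]  # lowest-energy unassigned pose
        members, remaining = [representative], []
        for index in unassigned[1:]:
            distance = rmsd(poses[representative][1], poses[index][1])
            (members if distance <= threshold else remaining).append(index)
        clusters.append(members)
        unassigned = remaining
    return clusters

# Toy poses (hypothetical energies and single-atom "ligands").
toy_poses = [
    (-5.0, [(0.0, 0.0, 0.0)]),
    (-9.0, [(1.0, 0.0, 0.0)]),
    (-7.0, [(30.0, 0.0, 0.0)]),
    (-3.0, [(33.0, 0.0, 0.0)]),
]
print(sequential_clustering(toy_poses))
```

Because each representative is fixed once chosen, the result depends on the energy ranking; a pose is never moved to a later cluster whose representative happens to be closer.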


3.2 Template-Based Docking

In the past decade, template-based docking methods have gained popularity as more protein–protein complexes have been experimentally determined. In the GRAMM web server, we implemented the protocols for structure-based homology docking previously developed in our group [28]. Template-based docking is performed by the alignment (using TM-align [29]) of the full structure of the target proteins to the full or to the interface-only structure of the templates (see Note 11). Scoring of the resulting docking models is performed by the combined scoring function [30]. In the advanced options section, users can choose the full-structure or the interface-only algorithm (the default is the full structure). Users can also define their own set of templates (see Note 12). By default, template-based docking uses a library of templates consisting of 12,470 full structures and 12,430 interface structures of binary protein–protein complexes from our DOCKGROUND resource [31]. These templates are nonredundant, with redundancy removed at MM-score >0.9. The interface templates were extracted from the full-structure templates using a 12 Å distance cutoff from the other chain. We also provide an option for selecting templates from binary protein–protein interactions extracted from the PDB [32] and stored on the server. Upon entering the PDB ID, the interacting chains appear in the user interface for selection as templates (see Note 13). In future server releases, we will provide the option for users to upload their own docking templates.
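The interface extraction with a distance cutoff can be sketched as follows. This is an illustrative reading of the 12 Å criterion, not the DOCKGROUND procedure: it assumes one representative atom per residue and keeps a residue when that atom lies within the cutoff of any representative atom of the other chain. The function name, the input layout, and the toy coordinates are hypothetical.

```python
def interface_residues(chain_a, chain_b, cutoff=12.0):
    """Return the residue IDs of each chain that lie near the other chain.

    Each chain is a list of (residue_id, (x, y, z)) entries. A residue is
    kept when its point is within `cutoff` Angstrom of any point of the
    partner chain (brute-force all-pairs comparison).
    """
    cutoff_squared = cutoff * cutoff
    near_a, near_b = set(), set()
    for residue_a, point_a in chain_a:
        for residue_b, point_b in chain_b:
            squared = sum((u - v) ** 2 for u, v in zip(point_a, point_b))
            if squared <= cutoff_squared:
                near_a.add(residue_a)
                near_b.add(residue_b)
    return near_a, near_b

# Toy chains with hypothetical residue numbers and coordinates.
toy_chain_a = [(1, (0.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
toy_chain_b = [(101, (5.0, 0.0, 0.0)), (102, (50.0, 0.0, 0.0))]
print(interface_residues(toy_chain_a, toy_chain_b))
```

The all-pairs loop is adequate for a single complex; for library-scale extraction, a spatial grid or k-d tree would reduce the number of distance evaluations.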

3.3 Output of Docking Results

Upon successful completion of a docking job, users receive an email with the web link to download and view the docking results. On the results page, the top six docked models are displayed using JSmol [33]. In case of the free docking, a table of results for the largest ten clusters (see above) is displayed. For each cluster, a representative structure is linked to the download, and the other docking predictions are shown by the geometric centers for visualization.

3.4 Case Studies: Free Docking

An example of the free docking results with the default parameters is shown in Fig. 2. The heterodimeric complex 1gpw consists of 253 residues for the receptor and 201 residues for the ligand. At the global scan stage, 30,000 docking poses were generated by the shape complementarity. The top six predictions are visualized using JSmol. Clustering of the docking poses was performed based on Cα RMSD with the default threshold 10 Å. The top cluster is shown in Fig. 3. The free docking successfully predicted the near-native structures (defined as acceptable or better quality, according to the CAPRI criteria) in the top-scored docking poses. The near-native docking poses may not always be among the top predictions of the global scan. Their ranking can be significantly improved by a priori identification of the interface residues or


Amar Singh et al.

Fig. 2 Example of free docking predictions. The structure of the complex is 1gpw. Receptor is in green, and the docked ligand is in blue. Each model can be individually downloaded

Fig. 3 Clustering of the docked ligands. The clustering is based on Cα RMSD. The receptor is in green, and the representative (the lowest energy) structure of the ligand for each cluster is in blue. The other ligand poses in the clusters are shown by their geometric center in red. Each cluster can be individually downloaded

atoms. This is shown by application of constraints on the interface residues to the docking of the heterodimeric complex (1usu) of heat shock protein Hsp90 and co-chaperone Aha1. Hsp90 is a molecular chaperone essential to the activation and assembly of many key eukaryotic signaling and regulatory proteins. This protein–protein complex has typical chain lengths (260 and 170 residues for receptor and ligand, respectively). The proteins were docked using default parameters. Figure 4 shows the top six


Fig. 4 Docking with constraints. (a) The co-crystallized structure of a protein–protein complex 1usu. The receptor is in green, and the ligand is in cyan. The top six predicted docking poses are shown in different colors: (b) predicted by shape complementarity only and (c) refined by the interface residue constraints. After accounting for the constraints, five of the top six predictions become near-native ones of acceptable or higher quality per CAPRI criteria

docking predictions obtained with and without constraints. Shape complementarity alone failed to place a near-native match in the top six predictions. However, the constraints (interface residues from the native structure) filtered out high-ranking non-native poses, so that five of the top six predictions were of acceptable or higher quality according to the CAPRI criteria.

3.5 Case Studies: Template-Based Docking

The transforming protein RhoA (UniProt ID P61586) regulates the signal transduction pathway and interacts with many different proteins. Structures of these complexes are well studied, making them available as docking templates. To illustrate the template-based docking, we used input structures of the RhoA-myosin protein complex 5hpy. These proteins of typical lengths (232 and 185 residues) were docked with the default parameters. The top six predicted models are shown in Fig. 5. The output also includes a plain-text list of the templates used in the docking along with the alignment scores for each monomer. The results file also contains the translation and rotation transformation for each of the utilized docking templates. The top docking poses are stored in PDB format and linked for download. The docking results based on the full-structure and the interface-only alignment methodologies can differ (see Subheading 3). A typical example is shown in Fig. 6 for 3sic, in which a Streptomyces subtilisin inhibitor protein (chain I) was docked to subtilisin BPN’ (chain E). The highest-ranking models based on the full- and interface-only structure alignment are shown in Fig. 6a, b, respectively (in both cases, a complex of subtilisin Carlsberg and Eglin C, an elastase inhibitor from the leech Hirudo medicinalis, 1cse, was used as the template). The target receptor


Fig. 5 Example of template-based docking predictions. The protein–protein complex is 5hpy. The receptor is in green, and the docked ligand is in blue. Each model can be individually downloaded

has a structure similar to that of the template monomer, with a sequence identity of 71%. However, the overall structure of the target ligand is drastically different from that of the corresponding template monomer (sequence identity 0.1%). Thus, while the full-structure alignment minimized the distance between all Cα target/template atoms (Fig. 6c), the structure similarity between the binding loops (interface alignment) correctly aligned the interface parts of the target and the template, yielding a high-quality model with a ligand RMSD of 2.3 Å and an interface RMSD of 0.89 Å (Fig. 6d).
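For readers who want to classify their own models, the CAPRI quality categories referred to throughout this section can be expressed as a small decision function. This is a sketch that uses the thresholds as they are commonly quoted for CAPRI assessments (fraction of native contacts, ligand RMSD, interface RMSD); consult [27] for the authoritative definitions. The function name is hypothetical, and the fraction of native contacts in the example is an assumed value, since the text reports only the two RMSDs.

```python
def capri_quality(fnat: float, ligand_rmsd: float, interface_rmsd: float) -> str:
    """Classify a docking model by commonly quoted CAPRI thresholds.

    fnat           -- fraction of native residue-residue contacts reproduced (0..1)
    ligand_rmsd    -- ligand RMSD after superposing the receptors, in angstroms
    interface_rmsd -- RMSD of interface backbone atoms, in angstroms
    """
    if fnat >= 0.5 and (ligand_rmsd <= 1.0 or interface_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (ligand_rmsd <= 5.0 or interface_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (ligand_rmsd <= 10.0 or interface_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


if __name__ == "__main__":
    # RMSD values from the 3sic interface-alignment model; fnat = 0.6 is an
    # assumed, illustrative value that is not reported in the text.
    print(capri_quality(fnat=0.6, ligand_rmsd=2.3, interface_rmsd=0.89))
```

Because each category requires both a contact criterion and an RMSD criterion, a model with low RMSD but few reproduced native contacts is still downgraded.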

4 Notes

1. The input structure files should be in standard PDB format, containing ATOM records.

2. In protein–protein docking, the terms “receptor” and “ligand” are used to describe the two proteins involved in the interaction. The convention of designating the larger protein as the receptor and the smaller protein as the ligand is common, although there are exceptions.


Fig. 6 Comparison of the template-based docking by full- versus interface-only structural alignment. The co-crystallized structure of a protein–protein complex 3sic with chains E (green) and I (cyan) is aligned to (a) full-structure template and (b) interface-only structure template (orange). (c) Template-based docking pose of the ligand (orange) by full and (d) interface-only structure alignment. The ligand RMSD values for the predicted models are 12.9 Å and 2.3 Å for the full and the interface alignments, respectively

3. In the GRAMM algorithm, the receptor protein is held in a fixed position, while the ligand protein is rotated and translated to identify the optimal binding pose. The algorithm calculates the energy of the two proteins’ interaction in each position, based on the geometric fit and energy-based scoring, and selects the ligand pose with the lowest energy.

4. GRAMM uses a rigid-body docking approach, which means that it assumes that the receptor and ligand proteins are rigid and do not undergo significant conformational changes upon binding. This assumption limits the accuracy of the predictions, particularly for protein–protein complexes that involve significant conformational changes upon binding.


5. If there is more than one chain in each molecule (receptor or ligand), the unique chain IDs are labeled alphabetically starting with A in receptor proteins. The docked models are stored as a protein–protein complex in PDB format named model_#.pdb (# = rank of the predicted model).

6. The docking scores and the transformation matrix (rotation angles and translation vectors for generation of the docking poses from the initial coordinates) of each completed job are stored in the “receptor-ligand.res” file. The first column is the rank of the docked model, the second is the docking score (energy), and the next three columns are rotation angles, followed by three columns of translation coordinates.

7. By using docking constraints, GRAMM can improve the accuracy of the predicted interaction, particularly for protein–protein complexes with known binding sites or where the general location of the binding interface can be assumed. However, since the docking constraints may not always be available or reliable, and the accuracy of the predictions depends on their quality, the users are asked to provide confidence scores in the 0–10 range.

8. In the current version of the web server, the clustering of docking poses is limited to the free docking only.

9. For the clustering, the selected number of docking poses should not be greater than the total number of the docking poses in the advanced option.

10. The quality of the predicted docking models is evaluated based on the user-submitted structures and the existence of any interfaces.

11. The template-based docking procedure has been benchmarked on binary protein–protein complexes. The user can upload multichain structures; however, docking scores are evaluated based on the structure alignment with templates of binary protein–protein complexes, and thus the prediction accuracy may vary.

12. The success of the template-based approach depends on the availability of suitable templates that can be used to generate a model of the protein–protein complex. If no suitable templates are available, the procedure may not be able to generate an accurate model.

13. In case of a long list of custom templates, the user can enter comma-separated template IDs in the following format: xxxx_#A_#B, where the first four characters are the PDB ID, followed by two chain IDs as a binary complex (e.g., chains A and B of model 1 from PDB 12as are listed as “12as_1A_1B”).
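Notes 6 and 13 describe two plain-text formats that are easy to handle in a script. The sketch below shows one way to read them. It assumes that the columns of the “.res” file are whitespace-separated and that any line without exactly eight numeric fields (e.g., a header) can be skipped; neither detail is stated in the notes. The class and function names, the sample line, and the second template ID in the example are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DockingPose:
    rank: int
    score: float                              # docking score (energy)
    rotation: Tuple[float, float, float]      # three rotation angles
    translation: Tuple[float, float, float]   # three translation coordinates


def parse_res_lines(text: str) -> List[DockingPose]:
    """Read rank, score, 3 rotation angles and 3 translations per line (Note 6)."""
    poses: List[DockingPose] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # assumed: headers or blank lines are not pose records
        try:
            rank = int(fields[0])
            values = [float(f) for f in fields[1:]]
        except ValueError:
            continue
        poses.append(DockingPose(rank, values[0], tuple(values[1:4]), tuple(values[4:7])))
    return poses


# xxxx_#A_#B: four-character PDB ID, then model number + chain ID, twice (Note 13).
TEMPLATE_ID = re.compile(r"^([0-9A-Za-z]{4})_(\d+)([0-9A-Za-z])_(\d+)([0-9A-Za-z])$")


def parse_template_ids(text: str) -> List[Tuple[str, Tuple[int, str], Tuple[int, str]]]:
    """Split a comma-separated list of template IDs and validate each entry."""
    templates = []
    for item in text.split(","):
        match = TEMPLATE_ID.match(item.strip())
        if match is None:
            raise ValueError(f"not a valid template ID: {item.strip()!r}")
        pdb_id, model1, chain1, model2, chain2 = match.groups()
        templates.append((pdb_id, (int(model1), chain1), (int(model2), chain2)))
    return templates


if __name__ == "__main__":
    sample = "rank score rx ry rz tx ty tz\n1 -512.3 30.0 60.0 90.0 12.5 -3.0 7.25\n"  # made-up values
    print(parse_res_lines(sample))
    print(parse_template_ids("12as_1A_1B, 1cse_1E_1I"))
```

Validating template IDs before submission catches formatting mistakes early; the parser deliberately says nothing about the rotation-angle convention, which the notes do not specify.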

5 Concluding Remarks

The GRAMM docking server is a web application for predicting the three-dimensional structures of protein–protein complexes by free and template-based docking methodologies. An easy-to-use web interface allows users to submit jobs without having to register. The server provides docked structures of protein–protein complexes that can be further post-processed (e.g., for rescoring, refinement, clustering). The docking procedures have been previously extensively benchmarked and are widely used in the docking practice. The docked structures can be used to map the intermolecular energy landscape for further sampling. The GRAMM server is an ongoing project, and its further developments will focus on implementing (a) an automated selection of the docking methods, e.g., performing free docking if templates are not available; (b) prediction of the dimerization probability; and (c) accepting protein sequences as input, retrieving structures from available databases of protein structures.

Acknowledgments

This study was supported by NIH grant R01GM074255 and NSF grant DBI1917263. The authors wish to acknowledge the contribution of Andrey Tovchigrechko who wrote the previous version of the server (GRAMM-X).

References

1. Vakser IA (2014) Protein-protein docking: from interaction to interactome. Biophys J 107:1785–1793
2. Moal IH, Moretti R, Baker D, Fernandez-Recio J (2013) Scoring functions for protein–protein interactions. Curr Opin Struct Biol 23:862–867
3. Singh A, Dauzhenka T, Kundrotas PJ, Sternberg MJE, Vakser IA (2020) Application of docking methodologies to modeled proteins. Proteins 88:1180–1188
4. Kundrotas PJ, Zhu Z, Janin J, Vakser IA (2012) Templates are available to model nearly all complexes of structurally characterized proteins. Proc Natl Acad Sci U S A 109:9438–9441
5. Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T et al (2022) Protein complex prediction with AlphaFold-Multimer. bioRxiv. https://doi.org/10.1101/2021.10.04.463034
6. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR et al (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373:871–876
7. Gao M, Nakajima An D, Parks JM, Skolnick J (2022) AF2Complex predicts direct physical interactions in multimeric proteins with deep learning. Nat Commun 13:1744
8. Kozakov D, Hall DR, Xia B, Porter KA, Padhorny D, Yueh C et al (2017) The ClusPro web server for protein-protein docking. Nat Protoc 12:255–278
9. de Vries SJ, van Dijk M, Bonvin AM (2010) The HADDOCK web server for data-driven biomolecular docking. Nat Protoc 5:883–897
10. Pierce BG, Wiehe K, Hwang H, Kim BH, Vreven T, Weng Z (2014) ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers. Bioinformatics 30:1771–1773
11. Macindoe G, Mavridis L, Venkatraman V, Devignes MD, Ritchie DW (2010) HexServer: an FFT-based protein docking server powered by graphics processors. Nucleic Acids Res 38:W445–W449
12. de Vries SJ, Schindler CE, Chauvot de Beauchene I, Zacharias M (2015) A web interface for easy flexible protein-protein docking with ATTRACT. Biophys J 108:462–465
13. Ogmen U, Keskin O, Aytuna AS, Nussinov R, Gursoy A (2005) PRISM: protein interactions by structural matching. Nucleic Acids Res 33:W331–W336
14. Torchala M, Moal IH, Chaleil RA, Fernandez-Recio J, Bates PA (2013) SwarmDock: a server for flexible protein-protein docking. Bioinformatics 29:807–809
15. Jimenez-Garcia B, Pons C, Fernandez-Recio J (2013) pyDockWEB: a web server for rigid-body protein-protein docking using electrostatics and desolvation scoring. Bioinformatics 29:1698–1699
16. Schneidman-Duhovny D, Inbar Y, Nussinov R, Wolfson HJ (2005) PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res 33:W363–W367
17. Lyskov S, Gray JJ (2008) The RosettaDock server for local protein-protein docking. Nucleic Acids Res 36:W233–W238
18. Ramirez-Aportela E, Lopez-Blanco JR, Chacon P (2016) FRODOCK 2.0: fast protein-protein docking server. Bioinformatics 32:2386–2388
19. Christoffer C, Bharadwaj V, Luu R, Kihara D (2021) LZerD protein-protein docking webserver enhanced with de novo structure prediction. Front Mol Biosci 8:724947
20. Yan Y, Zhang D, Zhou P, Li B, Huang SY (2017) HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res 45:W365–W373
21. Quignot C, Postic G, Bret H, Rey J, Granger P, Murail S et al (2021) InterEvDock3: a combined template-based and free docking server with increased performance through explicit modeling of complex homologs and integration of covariation-based contact maps. Nucleic Acids Res 49:W277–W284
22. Tovchigrechko A, Vakser IA (2006) GRAMM-X public web server for protein-protein docking. Nucleic Acids Res 34:W310–W314
23. Vakser IA, Grudinin S, Jenkins NW, Kundrotas PJ, Deeds EJ (2022) Docking-based long timescale simulation of cell-size protein systems at atomic resolution. Proc Natl Acad Sci U S A 119:e2210249119
24. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci U S A 89:2195–2199
25. Vakser IA (1995) Protein docking for low-resolution structures. Protein Eng 8:371–377
26. Lorenzen S, Zhang Y (2007) Identification of near-native structures by clustering protein docking conformations. Proteins 68:187–194
27. Lensink MF, Wodak SJ (2013) Docking, scoring, and affinity prediction in CAPRI. Proteins 81:2082–2095
28. Sinha R, Kundrotas PJ, Vakser IA (2010) Docking by structural similarity at protein-protein interfaces. Proteins 78:3235–3241
29. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33:2302–2309
30. Kundrotas PJ, Anishchenko I, Badal VD, Das M, Dauzhenka T, Vakser IA (2018) Modeling CAPRI targets 110-120 by template-based and free docking using contact potential and combined scoring function. Proteins 86(Suppl 1):302–310
31. Collins KW, Copeland MM, Kotthoff I, Singh A, Kundrotas PJ, Vakser IA (2022) DOCKGROUND resource for protein recognition studies. Protein Sci 31:e4481
32. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
33. Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org

Chapter 6

Protein–Ligand Blind Docking Using CB-Dock2

Yang Liu and Yang Cao

Abstract

Protein–ligand blind docking is a widely used method for studying the binding sites and poses of ligands and receptors in pharmaceutical and biological research. Recently, our new blind docking server named CB-Dock2 has been released and is currently being utilized by researchers worldwide. CB-Dock2 outperforms state-of-the-art methods due to its accuracy in binding site identification and binding pose prediction, which are enabled by its knowledge-based docking engine. This highly automated server offers interactive and intuitive input and output web interfaces, making it an efficient and user-friendly tool for the bioinformatics and cheminformatics communities. This chapter provides a brief overview of the methods, followed by a detailed guide on using the CB-Dock2 server. Additionally, we present a case study that evaluates the performance of protein–ligand blind docking using this tool.

Key words CB-Dock2, Web server, Binding site prediction, Protein–ligand docking, Blind docking

1 Introduction

Predicting interactions between proteins and small molecules is a fundamental task in biochemistry and molecular biology [1, 2]. The ability to understand how proteins interact with small molecules, including drugs, is crucial for developing new therapeutic agents that can target specific proteins involved in various diseases. One of the most powerful approaches for predicting protein–small molecule interactions is protein–ligand blind docking. Blind docking is a computational method that identifies the binding regions of a protein and predicts the binding pose of a molecule, even when the binding site is not known beforehand [3–7]. This makes it particularly useful for discovering new ligands that can bind to a protein of interest. With the advent of breakthrough protein structure determination techniques such as AlphaFold2 [8] and RoseTTAFold [9], the field of protein–ligand docking has seen a significant boost in recent years. These techniques have enabled the determination of protein structures with unprecedented accuracy and speed, paving the way for the exploration of

Mohini Gore and Umesh B. Jagtap (eds.), Computational Drug Discovery and Design, Methods in Molecular Biology, vol. 2714, https://doi.org/10.1007/978-1-0716-3441-7_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2024


new target therapies [10, 11]. Blind docking methods, such as SwissDock [12, 13], COACH-D [13], EDock [14], and MTiAutoDock [15], have been developed and extensively used for exploring potential binding sites or ligand-binding poses. Our team has recently published an efficient and user-friendly protein–ligand blind docking server called CB-Dock2 [16]. CB-Dock2 builds upon the structure-based cavity detection and docking module of its predecessor, CB-Dock [17], and includes a novel template-based molecular docking module to improve accuracy. Our benchmark tests (shown in Fig. 1) demonstrate that CB-Dock2 achieves a success rate of approximately 85% for binding pose prediction (RMSD 2 Å), and since the binding pose in F1 refers to a known template, it can be assumed that this prediction is more reliable. In addition, C2 and F2 also belong to the same pocket region, and the binding pose in C2 is comparable to that in F2 (RMSD